Your comprehensive guide to understanding the terminology of Large Language Models and Artificial Intelligence.
A set of protocols and tools that allows different software applications to communicate with each other. In the context of LLMs, an API lets developers send text prompts to a language model and receive generated responses without needing to host the model themselves.
Example: The Groq API allows you to send a request with a prompt and receive AI-generated text in response via HTTP.
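As a sketch, here is how such a request might look in Python against an OpenAI-compatible chat completions endpoint; the URL, model id, and environment variable name are illustrative assumptions, so check your provider's documentation:

```python
import os
import requests

# Minimal sketch of an HTTP call to an OpenAI-compatible chat completions API.
# The URL and model id below are illustrative; consult your provider's docs.
API_URL = "https://api.groq.com/openai/v1/chat/completions"
API_KEY = os.environ["GROQ_API_KEY"]  # assumes the key is set in the environment

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.3-70b-versatile",  # example model id; may differ
        "messages": [{"role": "user", "content": "Explain what an API is in one sentence."}],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```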
A neural network component that allows the model to focus on relevant parts of the input when generating output. The "attention" mechanism helps the model understand which words or tokens are most important for a given context.
Why it matters: Attention is the core innovation behind Transformer models (like GPT and BERT), enabling them to process longer sequences and understand context better than previous architectures.
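At its core, attention computes softmax(QKᵀ / √d) V: each query is compared to every key, and the resulting weights form a weighted average of the values. A minimal NumPy sketch (no batching, masking, or multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

# Toy example: 3 tokens with 4-dimensional representations
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4)); K = rng.normal(size=(3, 4)); V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```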
A model that generates text one token at a time, using previously generated tokens as input for the next prediction. Most modern LLMs (GPT, Llama, Claude) are autoregressive.
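Conceptually, generation is a loop that feeds each newly produced token back in as input. The toy sketch below captures only the loop; `next_token_probs` is a hypothetical stand-in for a real model:

```python
def next_token_probs(tokens):
    """Hypothetical stand-in for a real model: probabilities over a tiny vocabulary."""
    vocab = ["the", "cat", "sat", "on", "<eos>"]
    # A real LLM would condition these probabilities on `tokens`.
    return {tok: 1.0 / len(vocab) for tok in vocab}

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)       # condition on everything generated so far
        next_tok = max(probs, key=probs.get)   # greedy decoding: pick the most likely token
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)                # the new token becomes part of the input
    return tokens

print(generate(["the"]))
```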
A transformer-based model developed by Google that processes text bidirectionally (understanding context from both left and right). Unlike GPT, BERT is primarily used for understanding tasks (classification, Q&A) rather than generation.
Use Cases: Search engines (Google uses BERT), sentiment analysis, named entity recognition
Systematic errors or unfair tendencies in AI model outputs, often reflecting biases present in training data. Can manifest as gender, racial, or cultural stereotypes.
A tokenization algorithm that breaks text into subword units. Commonly used in modern LLMs to handle rare words and multiple languages efficiently.
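For a rough illustration, assuming the `tiktoken` library is installed (it implements the byte pair encoding schemes used by several OpenAI models), you can inspect how text splits into subword pieces:

```python
import tiktoken

# cl100k_base is one publicly available BPE encoding; other models use different vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("unbelievable tokenization")
pieces = [enc.decode([i]) for i in ids]
print(ids)     # list of integer token ids
print(pieces)  # subword pieces; rare words split into multiple units
```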
A prompting method that encourages the model to "think step-by-step" by explicitly showing its reasoning process before arriving at an answer. Dramatically improves performance on complex reasoning tasks.
Prompt: "Calculate 47 × 23 step by step." Response: Step 1: 47 × 20 = 940 Step 2: 47 × 3 = 141 Step 3: 940 + 141 = 1,081 Final Answer: 1,081
A conversational AI agent that uses an LLM to interact with users in natural language. Can be rule-based (simple) or AI-powered (advanced).
The maximum number of tokens a model can process in a single request, including both input and output. Common sizes include 4K, 32K, 128K, and even 1M tokens.
A type of transformer that only uses the decoder component (no encoder). GPT, Llama, and most modern LLMs use this architecture, optimized for text generation.
A numerical representation of text (words, sentences, or documents) as vectors in high-dimensional space. Words with similar meanings have similar embeddings.
Example: "king" and "queen" would have embeddings that are close together in vector space.
A transformer architecture with separate encoder (processes input) and decoder (generates output) components. Used in translation models like T5 and BART.
The ability of a model to learn from a small number of examples (typically 2-10) provided in the prompt, without additional training.
Example: Providing 3 examples of sentiment classification before asking the model to classify a new sentence.
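In practice this just means placing the examples in the prompt ahead of the new input. A sketch (the sentences and labels are invented for illustration):

```python
few_shot_prompt = """Classify the sentiment of each sentence as Positive or Negative.

Sentence: "I loved this movie, the acting was superb."
Sentiment: Positive

Sentence: "The food was cold and the service was slow."
Sentiment: Negative

Sentence: "What a waste of two hours."
Sentiment: Negative

Sentence: "The new update fixed every bug I ran into."
Sentiment:"""

# Send `few_shot_prompt` to any LLM; the in-context examples steer it to answer "Positive".
print(few_shot_prompt)
```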
The process of taking a pre-trained model and further training it on a specific dataset to specialize it for a particular task or domain. More efficient than training from scratch.
A large-scale pre-trained model that serves as the base for various downstream tasks. Examples: GPT-4, Llama 3, Claude.
A file format for storing LLM model weights, optimized for efficient loading and inference with llama.cpp. Successor to GGML format.
Safety mechanisms built into LLMs to prevent harmful, biased, or inappropriate outputs. Can include content filters, alignment techniques, and output validation.
When an LLM generates plausible-sounding but factually incorrect or nonsensical information. A major limitation of current language models.
Example: Claiming a fake historical event happened or citing non-existent research papers.
The process of using a trained model to generate predictions or outputs. In LLMs, this means generating text based on a prompt.
Fine-tuning a language model on instructions and desired responses to make it better at following user commands. Models like "Llama 3.3 Instruct" are instruction-tuned versions.
A neural network trained on massive amounts of text data (billions of words) to understand and generate human-like text. "Large" refers to the number of parameters (billions to trillions).
An efficient fine-tuning method that only trains a small number of additional parameters instead of the entire model. Allows customization with minimal computational resources.
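The core idea: freeze the pre-trained weight matrix W and learn a low-rank update, so the effective weight becomes W + BA. A NumPy sketch of the shapes involved (dimensions and rank are illustrative):

```python
import numpy as np

d_in, d_out, rank = 1024, 1024, 8        # illustrative sizes; rank << d_in, d_out

W = np.random.randn(d_out, d_in)         # frozen pre-trained weight (never updated)
A = np.random.randn(rank, d_in) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, rank))              # initialised to zero so W is unchanged at the start

def lora_forward(x):
    """Forward pass with a LoRA adapter: W x + B (A x)."""
    return W @ x + B @ (A @ x)

# Trainable parameters: rank * (d_in + d_out) instead of d_in * d_out
print("full:", W.size, "lora:", A.size + B.size)
print(lora_forward(np.random.randn(d_in)).shape)
```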
An architecture where the model consists of multiple "expert" sub-networks, and a gating mechanism decides which experts to activate for each input. Enables larger models with efficient inference.
Example: Mixtral 8x7B has 8 expert networks but only activates 2 per token.
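A minimal NumPy sketch of top-2 routing over a handful of toy experts, purely to illustrate the gating idea (real MoE layers are trained jointly and add load-balancing tricks):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

# Each "expert" is just a small linear layer here.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_W = rng.normal(size=(n_experts, d))   # router that scores experts for each token

def moe_forward(x):
    """Route a single token through the top-k experts, weighted by gate scores."""
    logits = gate_W @ x
    top = np.argsort(logits)[-top_k:]                          # indices of the best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

print(moe_forward(rng.normal(size=d)).shape)   # only 2 of the 8 experts ran for this token
```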
A model that can process and generate multiple types of data (text, images, audio, video). Examples: GPT-4 Vision, Gemini Pro Vision.
A learnable value in a neural network that gets adjusted during training. The number of parameters is often used as a proxy for model size and capability.
Scale: Llama 3.3 has 70 billion parameters • GPT-4 has ~1.7 trillion parameters (estimated)
The practice of carefully crafting input prompts to get the best possible output from an LLM. Includes techniques like few-shot learning, chain-of-thought, and role-playing.
A technique to reduce model size and memory usage by representing weights with lower precision (e.g., 4-bit or 8-bit instead of 16-bit). Enables running larger models on consumer hardware.
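A toy sketch of symmetric 8-bit quantization of a small weight matrix, showing the precision-for-memory trade-off (production schemes such as 4-bit GGUF quantization or GPTQ are considerably more sophisticated):

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)   # pretend these are model weights

# Symmetric int8 quantization: map the float range onto [-127, 127] with one scale factor.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)        # stored as 1 byte per weight instead of 4
dequantized = q.astype(np.float32) * scale           # approximate reconstruction at inference time

print("max reconstruction error:", np.abs(weights - dequantized).max())
```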
A method that combines LLMs with external knowledge retrieval. The system first searches a database for relevant information, then uses that context to generate more accurate responses.
Why it matters: Reduces hallucinations • Provides source citations • Updates model knowledge without retraining
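A minimal sketch of the retrieve-then-generate flow, using a toy keyword retriever (a real system would use vector embeddings and an actual LLM call):

```python
documents = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "Python 3.12 introduced improved error messages.",
    "The Great Wall of China is over 21,000 km long.",
]

def retrieve(query, docs, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    return scored[:k]

question = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(question, documents))

# The retrieved context is prepended to the prompt before calling the LLM.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)
```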
A training technique where human evaluators rank model outputs, and the model learns to produce responses that humans prefer. Used to align models with human values.
The mechanism in transformers that allows each token to "attend" to all other tokens in the sequence, learning contextual relationships.
A response mode where the LLM sends generated tokens as soon as they're produced (word-by-word) instead of waiting for the complete response. Provides a better user experience for long outputs.
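A sketch of consuming a streamed response, assuming the `openai` Python SDK (v1+) pointed at an OpenAI-compatible endpoint; the base URL and model id are illustrative:

```python
from openai import OpenAI

# Works with any OpenAI-compatible endpoint; base_url and model are illustrative.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,  # ask the server to send tokens as they are generated
)

for chunk in stream:
    # Each chunk carries a small delta of text; print it immediately for a live effect.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```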
An initial instruction given to the model that sets its behavior, personality, or role for the entire conversation. Typically used in chatbot applications.
System: "You are a helpful Python programming assistant. Provide clear, concise code examples with explanations."
A parameter (0.0 to 2.0) that controls randomness in generation. Lower = more focused/deterministic, Higher = more creative/random.
Low temperature: Factual tasks, code, math
Medium temperature: Chatbots, Q&A
High temperature: Creative writing, brainstorming
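Under the hood, the temperature divides the model's logits before the softmax, which sharpens (low temperature) or flattens (high temperature) the probability distribution over the next token. A small NumPy illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = np.array(logits) / max(temperature, 1e-6)   # guard against division by zero
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]   # illustrative scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    print(t, softmax_with_temperature(logits, t).round(3))
```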
The basic unit of text that an LLM processes. Not exactly words—can be parts of words, whole words, or punctuation. "Tokenization" splits text into these units.
"Hello world!" → ["Hello", " world", "!"] = 3 tokens
The foundational neural network architecture (introduced in 2017) that powers modern LLMs. Uses self-attention mechanisms instead of recurrence. Introduced in the paper "Attention Is All You Need."
The ability of a model to perform a task without any examples in the prompt. Large models excel at zero-shot learning due to their extensive pre-training.
Example: Asking "Translate to French: Hello" without providing any translation examples first.
Explore our comprehensive guides and start building with free LLM APIs today.