Your comprehensive guide to understanding the terminology of Large Language Models and Artificial Intelligence.
A set of protocols and tools that allows different software applications to communicate with each other. In the context of LLMs, an API lets developers send text prompts to a language model and receive generated responses without needing to host the model themselves.
Example: The Groq API allows you to send a request with a prompt and receive AI-generated text in response via HTTP.
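As a sketch, here is how such a request might look in Python against an OpenAI-compatible chat completions endpoint; the URL, model id, and environment variable name are illustrative assumptions, so check your provider's documentation:

```python
import os
import requests

# Minimal sketch of an HTTP call to an OpenAI-compatible chat completions API.
# The URL and model id below are illustrative; consult your provider's docs.
API_URL = "https://api.groq.com/openai/v1/chat/completions"
API_KEY = os.environ["GROQ_API_KEY"]  # assumes the key is set in the environment

response = requests.post(
    API_URL,
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama-3.3-70b-versatile",  # example model id; may differ
        "messages": [{"role": "user", "content": "Explain what an API is in one sentence."}],
    },
    timeout=30,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```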
A neural network component that allows the model to focus on relevant parts of the input when generating output. The "attention" mechanism helps the model understand which words or tokens are most important for a given context.
Why it matters: Attention is the core innovation behind Transformer models (like GPT and BERT), enabling them to process longer sequences and understand context better than previous architectures.
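At its core, attention computes softmax(QKᵀ / √d) V: each query is compared to every key, and the resulting weights form a weighted average of the values. A minimal NumPy sketch (no batching, masking, or multiple heads):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention: softmax(Q K^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # weighted sum of the values

# Toy example: 3 tokens with 4-dimensional representations
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4)); K = rng.normal(size=(3, 4)); V = rng.normal(size=(3, 4))
print(scaled_dot_product_attention(Q, K, V).shape)   # (3, 4)
```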
A model that generates text one token at a time, using previously generated tokens as input for the next prediction. Most modern LLMs (GPT, Llama, Claude) are autoregressive.
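Conceptually, generation is a loop that feeds each newly produced token back in as input. The toy sketch below captures only the loop; `next_token_probs` is a hypothetical stand-in for a real model:

```python
def next_token_probs(tokens):
    """Hypothetical stand-in for a real model: probabilities over a tiny vocabulary."""
    vocab = ["the", "cat", "sat", "on", "<eos>"]
    # A real LLM would condition these probabilities on `tokens`.
    return {tok: 1.0 / len(vocab) for tok in vocab}

def generate(prompt_tokens, max_new_tokens=5):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)       # condition on everything generated so far
        next_tok = max(probs, key=probs.get)   # greedy decoding: pick the most likely token
        if next_tok == "<eos>":
            break
        tokens.append(next_tok)                # the new token becomes part of the input
    return tokens

print(generate(["the"]))
```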
A transformer-based model developed by Google that processes text bidirectionally (understanding context from both left and right). Unlike GPT, BERT is primarily used for understanding tasks (classification, Q&A) rather than generation.
Use Cases: Search engines (Google uses BERT), sentiment analysis, named entity recognition
Systematic errors or unfair tendencies in AI model outputs, often reflecting biases present in training data. Can manifest as gender, racial, or cultural stereotypes.
A tokenization algorithm that breaks text into subword units. Commonly used in modern LLMs to handle rare words and multiple languages efficiently.
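For a rough illustration, assuming the `tiktoken` library is installed (it implements the byte pair encoding schemes used by several OpenAI models), you can inspect how text splits into subword pieces:

```python
import tiktoken

# cl100k_base is one publicly available BPE encoding; other models use different vocabularies.
enc = tiktoken.get_encoding("cl100k_base")

ids = enc.encode("unbelievable tokenization")
pieces = [enc.decode([i]) for i in ids]
print(ids)     # list of integer token ids
print(pieces)  # subword pieces; rare words split into multiple units
```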
A prompting method that encourages the model to "think step-by-step" by explicitly showing its reasoning process before arriving at an answer. Dramatically improves performance on complex reasoning tasks.
Prompt: "Calculate 47 × 23 step by step." Response: Step 1: 47 × 20 = 940 Step 2: 47 × 3 = 141 Step 3: 940 + 141 = 1,081 Final Answer: 1,081
A conversational AI agent that uses an LLM to interact with users in natural language. Can be rule-based (simple) or AI-powered (advanced).
The maximum number of tokens a model can process in a single request, including both input and output. Common sizes include 4K, 32K, 128K, and even 1M tokens.
A type of transformer that only uses the decoder component (no encoder). GPT, Llama, and most modern LLMs use this architecture, optimized for text generation.
A numerical representation of text (words, sentences, or documents) as vectors in high-dimensional space. Words with similar meanings have similar embeddings.
Example: "king" and "queen" would have embeddings that are close together in vector space.
A transformer architecture with separate encoder (processes input) and decoder (generates output) components. Used in translation models like T5 and BART.
The ability of a model to learn from a small number of examples (typically 2-10) provided in the prompt, without additional training.
Example: Providing 3 examples of sentiment classification before asking the model to classify a new sentence.
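In practice this just means placing the examples in the prompt ahead of the new input. A sketch (the sentences and labels are invented for illustration):

```python
few_shot_prompt = """Classify the sentiment of each sentence as Positive or Negative.

Sentence: "I loved this movie, the acting was superb."
Sentiment: Positive

Sentence: "The food was cold and the service was slow."
Sentiment: Negative

Sentence: "What a waste of two hours."
Sentiment: Negative

Sentence: "The new update fixed every bug I ran into."
Sentiment:"""

# Send `few_shot_prompt` to any LLM; the in-context examples steer it to answer "Positive".
print(few_shot_prompt)
```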
The process of taking a pre-trained model and further training it on a specific dataset to specialize it for a particular task or domain. More efficient than training from scratch.
A large-scale pre-trained model that serves as the base for various downstream tasks. Examples: GPT-4, Llama 3, Claude.
A file format for storing LLM model weights, optimized for efficient loading and inference with llama.cpp. Successor to GGML format.
Safety mechanisms built into LLMs to prevent harmful, biased, or inappropriate outputs. Can include content filters, alignment techniques, and output validation.
When an LLM generates plausible-sounding but factually incorrect or nonsensical information. A major limitation of current language models.
Example: Claiming a fake historical event happened or citing non-existent research papers.
The process of using a trained model to generate predictions or outputs. In LLMs, this means generating text based on a prompt.
Fine-tuning a language model on instructions and desired responses to make it better at following user commands. Models like "Llama 3.3 Instruct" are instruction-tuned versions.
A neural network trained on massive amounts of text data (billions of words) to understand and generate human-like text. "Large" refers to the number of parameters (billions to trillions).
An efficient fine-tuning method that only trains a small number of additional parameters instead of the entire model. Allows customization with minimal computational resources.
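The core idea: freeze the pre-trained weight matrix W and learn a low-rank update, so the effective weight becomes W + BA. A NumPy sketch of the shapes involved (dimensions and rank are illustrative):

```python
import numpy as np

d_in, d_out, rank = 1024, 1024, 8        # illustrative sizes; rank << d_in, d_out

W = np.random.randn(d_out, d_in)         # frozen pre-trained weight (never updated)
A = np.random.randn(rank, d_in) * 0.01   # trainable low-rank factor
B = np.zeros((d_out, rank))              # initialised to zero so W is unchanged at the start

def lora_forward(x):
    """Forward pass with a LoRA adapter: W x + B (A x)."""
    return W @ x + B @ (A @ x)

# Trainable parameters: rank * (d_in + d_out) instead of d_in * d_out
print("full:", W.size, "lora:", A.size + B.size)
print(lora_forward(np.random.randn(d_in)).shape)
```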
An architecture where the model consists of multiple "expert" sub-networks, and a gating mechanism decides which experts to activate for each input. Enables larger models with efficient inference.
Example: Mixtral 8x7B has 8 expert networks but only activates 2 per token.
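A minimal NumPy sketch of top-2 routing over a handful of toy experts, purely to illustrate the gating idea (real MoE layers are trained jointly and add load-balancing tricks):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 16, 8, 2

# Each "expert" is just a small linear layer here.
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]
gate_W = rng.normal(size=(n_experts, d))   # router that scores experts for each token

def moe_forward(x):
    """Route a single token through the top-k experts, weighted by gate scores."""
    logits = gate_W @ x
    top = np.argsort(logits)[-top_k:]                          # indices of the best experts
    weights = np.exp(logits[top]) / np.exp(logits[top]).sum()  # softmax over the chosen experts
    return sum(w * (experts[i] @ x) for w, i in zip(weights, top))

print(moe_forward(rng.normal(size=d)).shape)   # only 2 of the 8 experts ran for this token
```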
A model that can process and generate multiple types of data (text, images, audio, video). Examples: GPT-4 Vision, Gemini Pro Vision.
A learnable value in a neural network that gets adjusted during training. The number of parameters is often used as a proxy for model size and capability.
Scale: Llama 3.3 has 70 billion parameters • GPT-4 has ~1.7 trillion parameters (estimated)
The practice of carefully crafting input prompts to get the best possible output from an LLM. Includes techniques like few-shot learning, chain-of-thought, and role-playing.
A technique to reduce model size and memory usage by representing weights with lower precision (e.g., 4-bit or 8-bit instead of 16-bit). Enables running larger models on consumer hardware.
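A toy sketch of symmetric 8-bit quantization of a small weight matrix, showing the precision-for-memory trade-off (production schemes such as 4-bit GGUF quantization or GPTQ are considerably more sophisticated):

```python
import numpy as np

weights = np.random.randn(4, 4).astype(np.float32)   # pretend these are model weights

# Symmetric int8 quantization: map the float range onto [-127, 127] with one scale factor.
scale = np.abs(weights).max() / 127.0
q = np.round(weights / scale).astype(np.int8)        # stored as 1 byte per weight instead of 4
dequantized = q.astype(np.float32) * scale           # approximate reconstruction at inference time

print("max reconstruction error:", np.abs(weights - dequantized).max())
```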
A method that combines LLMs with external knowledge retrieval. The system first searches a database for relevant information, then uses that context to generate more accurate responses.
Why it matters: Reduces hallucinations • Provides source citations • Updates model knowledge without retraining
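A minimal sketch of the retrieve-then-generate flow, using a toy keyword retriever (a real system would use vector embeddings and an actual LLM call):

```python
documents = [
    "The Eiffel Tower is 330 metres tall and located in Paris.",
    "Python 3.12 introduced improved error messages.",
    "The Great Wall of China is over 21,000 km long.",
]

def retrieve(query, docs, k=1):
    """Toy retriever: rank documents by word overlap with the query."""
    query_words = set(query.lower().split())
    scored = sorted(docs, key=lambda d: len(query_words & set(d.lower().split())), reverse=True)
    return scored[:k]

question = "How tall is the Eiffel Tower?"
context = "\n".join(retrieve(question, documents))

# The retrieved context is prepended to the prompt before calling the LLM.
prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
print(prompt)
```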
A training technique where human evaluators rank model outputs, and the model learns to produce responses that humans prefer. Used to align models with human values.
The mechanism in transformers that allows each token to "attend" to all other tokens in the sequence, learning contextual relationships.
A response mode where the LLM sends generated tokens as soon as they're produced (word-by-word) instead of waiting for the complete response. Provides a better user experience for long outputs.
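A sketch of consuming a streamed response, assuming the `openai` Python SDK (v1+) pointed at an OpenAI-compatible endpoint; the base URL and model id are illustrative:

```python
from openai import OpenAI

# Works with any OpenAI-compatible endpoint; base_url and model are illustrative.
client = OpenAI(base_url="https://api.groq.com/openai/v1", api_key="YOUR_KEY")

stream = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "Write a haiku about the sea."}],
    stream=True,  # ask the server to send tokens as they are generated
)

for chunk in stream:
    # Each chunk carries a small delta of text; print it immediately for a live effect.
    print(chunk.choices[0].delta.content or "", end="", flush=True)
print()
```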
An initial instruction given to the model that sets its behavior, personality, or role for the entire conversation. Typically used in chatbot applications.
System: "You are a helpful Python programming assistant. Provide clear, concise code examples with explanations."
A parameter (0.0 to 2.0) that controls randomness in generation. Lower = more focused/deterministic, Higher = more creative/random.
Low temperature: Factual tasks, code, math
Medium temperature: Chatbots, Q&A
High temperature: Creative writing, brainstorming
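Under the hood, the temperature divides the model's logits before the softmax, which sharpens (low temperature) or flattens (high temperature) the probability distribution over the next token. A small NumPy illustration:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    """Lower temperature sharpens the distribution; higher temperature flattens it."""
    scaled = np.array(logits) / max(temperature, 1e-6)   # guard against division by zero
    exp = np.exp(scaled - scaled.max())
    return exp / exp.sum()

logits = [2.0, 1.0, 0.5]   # illustrative scores for three candidate tokens
for t in (0.2, 1.0, 1.5):
    print(t, softmax_with_temperature(logits, t).round(3))
```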
The basic unit of text that an LLM processes. Not exactly words—can be parts of words, whole words, or punctuation. "Tokenization" splits text into these units.
"Hello world!" → ["Hello", " world", "!"] = 3 tokens
The foundational neural network architecture (introduced in 2017) that powers modern LLMs. Uses self-attention mechanisms instead of recurrence. Introduced in the paper "Attention Is All You Need."
The ability of a model to perform a task without any examples in the prompt. Large models excel at zero-shot learning due to their extensive pre-training.
Example: Asking "Translate to French: Hello" without providing any translation examples first.
Explore our comprehensive guides and start building with free LLM APIs today.