4. Logits: Making Predictions
Concept: Logits are raw numerical scores the model assigns to each possible next token before making its final selection.
First phase: The model splits the prompt into tokens, converts the tokens to embeddings, and processes the sequence of embeddings through its layers (e.g., transformer blocks), which use attention mechanisms to understand relationships and context. The model then produces logits—raw scores for every possible next token. These logits are converted to probabilities using the softmax function (see Wikipedia). This calculation phase is deterministic—identical inputs always produce the same probability distribution.
Second phase: The model selects a token from this distribution, either deterministically (always choosing the highest-probability token, if configured to do so) or with controlled randomness that balances accuracy with creativity, depending on the sampling parameters.
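To make the two phases concrete, here is a minimal Python sketch with made-up logit values for a handful of candidate tokens (illustrative only; a real model scores every token in its vocabulary):

```python
import math
import random

# Hypothetical logits for a few candidate next tokens (illustrative values only;
# a real model produces one logit per token in a vocabulary of tens of thousands).
logits = {"Paris": 2.0, "Lyon": 0.5, "Nice": 0.1, "banana": -4.0}

# Phase 1 (deterministic): softmax turns logits into a probability distribution.
max_logit = max(logits.values())  # subtract the max for numerical stability
exps = {tok: math.exp(score - max_logit) for tok, score in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}

# Phase 2 (selection): greedy decoding vs. sampling from the distribution.
greedy = max(probs, key=probs.get)                                      # always "Paris"
sampled = random.choices(list(probs), weights=list(probs.values()))[0]  # usually "Paris"

print(probs)
print("greedy:", greedy, "| sampled:", sampled)
```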
Everyday Example: When completing "The capital of France is ____," a model assigns high scores to relevant answers like "Paris" and low scores to irrelevant options like "banana."
Input: "The capital of France is"
How Token Selection Works
| Rank (k) | Token | Raw Logit | Base Probability |
|---|---|---|---|
| 1 | "Paris" | 8.2 | 80% |
| 2 | "Lyon" | 4.6 | 10% |
| 3 | "Nice" | 3.9 | 5% |
| 4 | "Marseille" | 3.2 | 3% |
| 5 | "banana" | -5.0 | 0.1% |
| 6+ | Other tokens | varies | 1.9% |
Temperature
Modifies the probability distribution itself. Lower temperatures make the model more deterministic, leading to predictable outputs, while higher temperatures introduce more randomness and creativity. The allowed range depends on the provider and model—check your API documentation.
- Low (0.2): Makes likely tokens even more likely
- High (1.0): Makes distribution more uniform
With temperature 0.2, "Paris" might be 95% likely
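As a rough sketch of the usual mechanism, temperature divides the logits before softmax is applied: values below 1 sharpen the distribution, values above 1 flatten it (the logits below are hypothetical):

```python
import math

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Hypothetical logits, not taken from a real model.
logits = {"Paris": 2.0, "Lyon": 0.5, "Nice": 0.1, "banana": -4.0}

for temperature in (0.2, 1.0, 2.0):
    # Dividing logits by a temperature < 1 exaggerates the gaps between scores,
    # making the top token even more dominant; a temperature > 1 shrinks the gaps.
    scaled = {t: v / temperature for t, v in logits.items()}
    print(temperature, {t: round(p, 3) for t, p in softmax(scaled).items()})
```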
topP (also called Nucleus Sampling)
Uses cumulative probability. The model sorts all candidate next tokens by probability (highest to lowest), then keeps the smallest set whose cumulative probability reaches at least the topP value (e.g., 0.9 keeps the top tokens that together cover at least 90% of the probability mass).
- topP = 0.9: Only "Paris" and "Lyon" considered (90% cumulative already reaches the threshold)
- topP = 0.8: Only "Paris" considered (80% cumulative)
It's more flexible than topK because it dynamically adjusts the number of candidate tokens based on their probabilities
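A minimal sketch of topP filtering, using the rounded probabilities from the table above and assuming the common convention of keeping tokens until the cumulative probability reaches the threshold:

```python
# Rounded base probabilities from the table above (the "6+" bucket is
# treated as a single catch-all entry here for simplicity).
probs = [("Paris", 0.80), ("Lyon", 0.10), ("Nice", 0.05),
         ("Marseille", 0.03), ("other", 0.019), ("banana", 0.001)]

def top_p_filter(probs, top_p):
    """Keep the smallest set of highest-probability tokens whose
    cumulative probability reaches at least top_p."""
    ranked = sorted(probs, key=lambda x: x[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, p in ranked:
        kept.append((token, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1 before sampling.
    total = sum(p for _, p in kept)
    return [(token, p / total) for token, p in kept]

print(top_p_filter(probs, 0.9))  # keeps "Paris" and "Lyon"
print(top_p_filter(probs, 0.8))  # keeps only "Paris"
```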
topK
Considers only the K most likely tokens
- topK = 3: Only "Paris", "Lyon", "Nice" considered
- topK = 1: Only "Paris" considered
Uses a fixed number of candidates regardless of their probabilities
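A corresponding sketch for topK, again starting from the table's rounded probabilities:

```python
probs = [("Paris", 0.80), ("Lyon", 0.10), ("Nice", 0.05),
         ("Marseille", 0.03), ("other", 0.019), ("banana", 0.001)]

def top_k_filter(probs, k):
    """Keep only the k highest-probability tokens and renormalize."""
    ranked = sorted(probs, key=lambda x: x[1], reverse=True)[:k]
    total = sum(p for _, p in ranked)
    return [(token, p / total) for token, p in ranked]

print(top_k_filter(probs, 3))  # "Paris", "Lyon", "Nice"
print(top_k_filter(probs, 1))  # "Paris" only
```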
💡Tip: For most use cases, set either temperature or topP—not both. Controlling both can lead to unpredictable or unstable results, as both parameters affect randomness in different ways.
Combined Effect: These parameters work together to control selection. Temperature modifies the distribution, then topP and topK filter which tokens can be selected from the modified distribution.
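Putting it together, here is a rough end-to-end sketch of one plausible sampling pipeline; providers differ in the exact order and details of these steps, and the logits are hypothetical:

```python
import math
import random

def softmax(logits):
    m = max(logits.values())
    exps = {t: math.exp(v - m) for t, v in logits.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    # 1. Temperature reshapes the distribution.
    scaled = {t: v / temperature for t, v in logits.items()}
    ranked = sorted(softmax(scaled).items(), key=lambda x: x[1], reverse=True)

    # 2. topK keeps only the K highest-probability tokens.
    if top_k is not None:
        ranked = ranked[:top_k]

    # 3. topP keeps the smallest prefix whose cumulative probability
    #    reaches top_p (applied here to the already-filtered list).
    if top_p is not None:
        kept, cumulative = [], 0.0
        for token, p in ranked:
            kept.append((token, p))
            cumulative += p
            if cumulative >= top_p:
                break
        ranked = kept

    # 4. Renormalize what remains and sample one token.
    total = sum(p for _, p in ranked)
    tokens, weights = zip(*[(t, p / total) for t, p in ranked])
    return random.choices(tokens, weights=weights)[0]

# Hypothetical logits for the France example.
logits = {"Paris": 4.0, "Lyon": 2.0, "Nice": 1.5, "Marseille": 1.0, "banana": -6.0}
print(sample_next_token(logits, temperature=0.7, top_k=4, top_p=0.9))
```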
Practical Application: The temperature, topP, and topK parameters control the trade-off between creativity and predictability, letting you balance deterministic, factual outputs against more varied, creative responses.