Module 4: Embeddings & Retrieval-Augmented Generation (RAG)

In Module 1 we saw that LLMs have real limitations: a knowledge cutoff date, no access to your private data, and a tendency to hallucinate when asked about facts they don't know. Retrieval-Augmented Generation (RAG) is the most widely used technique to address all three — and embeddings are what make it work.

This module builds on the embeddings introduction in Module 1 and picks up the RAG coverage that Module 2 deferred to here.

What You'll Learn

  • Embeddings (deep dive): How semantic similarity works, why it beats keyword search, and what dimensions actually mean
  • Embedding Models: Amazon Titan, Cohere — how to choose the right one
  • Vector Stores: Where embeddings live — OpenSearch, Aurora pgvector, FAISS, Pinecone, and when to use each
  • RAG Pipeline: The full architecture — chunking, retrieval, reranking, and prompt augmentation
  • AWS Bedrock Knowledge Bases: Fully managed RAG that collapses most of the above into configuration

Embeddings: Meaning as Mathematics

In Module 1 we introduced embeddings as "words represented as points in space where similar words are closer together." Here we go deeper — because understanding how similarity search works is what makes RAG intuitive.

From Words to Vectors

An embedding model takes text (a word, sentence, or entire document) and outputs a fixed-length list of numbers — a vector. These numbers aren't random; they encode semantic meaning. Words or passages with similar meaning end up with similar vectors.

Classic intuition: In a well-trained embedding space, king − man + woman ≈ queen. The vector arithmetic works because "gender" and "royalty" are encoded as directions in the vector space. You don't program these relationships — they emerge from training on large text corpora.

Cosine Similarity

To find how similar two pieces of text are, you compare their vectors using cosine similarity — the cosine of the angle between them. A score of 1.0 means identical meaning; 0.0 means unrelated; -1.0 means opposite meaning. This is how a search system finds the most relevant documents for a query — it embeds the query, then finds the stored documents with the highest cosine similarity.
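The math is a one-liner: the dot product of the two vectors divided by the product of their magnitudes. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, but the computation is identical):

Python — Cosine Similarity by Hand
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" for illustration only
password_reset   = np.array([0.9, 0.1, 0.0])
account_recovery = np.array([0.8, 0.3, 0.1])
pizza_recipe     = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(password_reset, account_recovery))  # high: similar meaning
print(cosine_similarity(password_reset, pizza_recipe))      # low: unrelated
```

A search system runs exactly this comparison between the query vector and every stored vector (or an approximation of it, at scale), then returns the highest-scoring documents.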

Why Embeddings Beat Keyword Search

Keyword search requires the exact same words to appear. A query for "how do I reset my password" won't find a document titled "account recovery instructions" — different words, same meaning. Embeddings search by meaning, not by word match. This is the core superpower that makes RAG possible.

Not Just Text

Embedding models can encode more than text:

  • Images: multimodal models such as Titan Multimodal Embeddings map images and text into one vector space, so a text query can retrieve matching images (and vice versa)
  • Other modalities: audio, code, and structured product data all have embedding models built on the same principle

What Do Dimensions Mean?

An embedding vector might have 256, 512, 1024, or 3072 dimensions. More dimensions = more capacity to capture nuance, but also higher storage and compute cost. For most RAG use cases, 1024 dimensions is the sweet spot. The actual values of each dimension aren't interpretable — what matters is the relative distance between vectors.
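The storage side of that tradeoff is easy to estimate: raw vector storage is dimensions × 4 bytes (float32) per chunk, before index overhead. A quick back-of-envelope calculation:

Python — Estimating Vector Storage
```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage (float32 = 4 bytes per dimension), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1024**3

# 1M chunks: 1024 dimensions vs 3072 dimensions
print(round(index_size_gb(1_000_000, 1024), 2))  # ~3.81 GB
print(round(index_size_gb(1_000_000, 3072), 2))  # ~11.44 GB
```

Tripling the dimensions triples storage (and similarity-computation cost) for every query, which is why 1024 is a reasonable default unless benchmarks on your own data justify more.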

Embedding Models: Which One to Use?

Choosing the right embedding model matters — the same text will produce different vectors depending on which model you use, and you must use the same model consistently for both indexing and querying.

Key Options

| Model | Provider | Dimensions | Best For | Notes |
|---|---|---|---|---|
| Titan Text Embeddings V2 | Amazon (via Bedrock) | 256 / 512 / 1024 | General RAG, AWS-native workloads | Adjustable dimensions — lower = cheaper storage; supports 8K token input |
| Titan Multimodal Embeddings | Amazon (via Bedrock) | 384 / 1024 | Image + text search in one index | Single embedding space for both modalities |
| Cohere Embed v3 | Cohere (via Bedrock) | 1024 | High-quality retrieval, multilingual | English and multilingual variants; strong benchmark performance |
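Calling Titan Text Embeddings V2 through the Bedrock runtime is a single invoke_model call. A sketch (the request fields inputText, dimensions, and normalize follow the Titan V2 API; the invoke_model call itself needs AWS credentials and Bedrock model access):

Python — Embedding Text with Titan V2
```python
import json

def titan_request_body(text: str, dimensions: int = 1024) -> str:
    """Build the Titan Text Embeddings V2 request body."""
    assert dimensions in (256, 512, 1024)
    return json.dumps({
        "inputText": text,
        "dimensions": dimensions,  # lower = cheaper storage, some nuance lost
        "normalize": True,         # unit-length vectors: dot product equals cosine similarity
    })

def embed(text: str, dimensions: int = 1024) -> list[float]:
    import boto3  # imported lazily so the body builder works without AWS installed
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=titan_request_body(text, dimensions),
    )
    return json.loads(response["body"].read())["embedding"]
```

Setting normalize to true means every returned vector has length 1, so a plain dot product gives you cosine similarity directly.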

How to Choose

Consistency is non-negotiable. You must use the same embedding model (and same dimension setting) for both indexing your documents and embedding queries at retrieval time. Mixing models — even two versions of the same model — produces incompatible vector spaces and silently breaks retrieval.

Vector Stores: Where Embeddings Live

Once you've embedded your documents, you need somewhere to store the vectors and — critically — query them efficiently. Finding the top-K nearest vectors in a 1024-dimensional space across millions of documents isn't something a traditional SQL database does well. Vector stores are built (or extended) for this.

AWS Options

| Store | What it is | When to use it |
|---|---|---|
| Amazon OpenSearch (k-NN) | Built for full-text search (like Elasticsearch); the k-NN plugin adds approximate nearest-neighbor vector search. You supply your own vectors — embed documents first, then store them. You manage the index, shards, and cluster. | Most flexible AWS option. Good if you need both text search and vector search on the same data, or need fine-grained control over indexing and retrieval. |
| Aurora pgvector | A PostgreSQL extension that adds a vector column type and similarity search operators. Runs on Amazon Aurora (PostgreSQL-compatible). | Good if your application already uses Aurora and you want to avoid another service. Simpler operationally, but slower than purpose-built vector DBs at large scale. |
What about Amazon Kendra? Kendra is often mentioned alongside these, but it's a different category entirely. It's an enterprise search appliance — you point it at your document sources (S3, SharePoint, Confluence), and AWS handles indexing, embedding, and retrieval internally. You never manage vectors. Good for "build a company-wide search engine" use cases, but too opaque and expensive for building custom RAG pipelines. Bedrock Knowledge Bases is the better managed option for RAG.

Third-Party Options

| Store | What it is | When to use it |
|---|---|---|
| FAISS | Facebook AI Similarity Search — a library (not a server) that runs in-process. You load vectors into RAM and query them locally. | Local development, prototyping, offline use cases. No server to manage, but no persistence across restarts and doesn't scale beyond one machine. |
| Pinecone | Fully managed, purpose-built vector database. Simple API, serverless option available. | Quickest path to production if you don't want to manage infrastructure. Supported as a Bedrock Knowledge Bases backend. |
| Chroma | Open-source vector DB (Apache 2.0) with a developer-friendly API. Runs embedded in-process or as a standalone server. Persists to disk by default. | Best starting point for RAG prototyping — clean API, works well with LangChain/LlamaIndex, handles persistence without extra setup. |
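To demystify what all of these stores do, here is the core idea in miniature: brute-force top-K search over normalized vectors, the same computation an exact index (like FAISS's IndexFlatIP) performs. Approximate indexes such as HNSW trade exactness for speed at scale. Toy vectors, illustration only:

Python — A Vector Store in Miniature
```python
import numpy as np

class TinyVectorStore:
    """Brute-force top-K cosine search: the essence of a vector store, minus the speed."""
    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids: list[str] = []

    def add(self, id_: str, vector: np.ndarray) -> None:
        v = vector / np.linalg.norm(vector)   # normalize so dot product == cosine similarity
        self.vectors = np.vstack([self.vectors, v])
        self.ids.append(id_)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q             # cosine similarity to every stored vector
        top = np.argsort(scores)[::-1][:k]    # indices of the k highest scores
        return [(self.ids[i], float(scores[i])) for i in top]

store = TinyVectorStore(dim=3)
store.add("reset-password", np.array([0.9, 0.1, 0.0]))
store.add("pizza-recipe", np.array([0.0, 0.2, 0.9]))
print(store.search(np.array([0.8, 0.2, 0.1]), k=1))  # "reset-password" ranks first
```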

Quick Decision Guide

  • Local prototyping, throwaway index → FAISS
  • Prototyping with persistence → Chroma
  • Application already on Aurora PostgreSQL → pgvector
  • Full-text search and vector search over the same data → OpenSearch k-NN
  • Managed, purpose-built vector DB with minimal ops → Pinecone
  • Building on Bedrock and want AWS to run the whole pipeline → Bedrock Knowledge Bases

The RAG Pipeline: Architecture & Key Decisions

RAG works by retrieving relevant context from your knowledge base and injecting it into the LLM's prompt before generating a response. The LLM never "learns" your data — it reads it at inference time. This means the retrieval quality directly determines the answer quality.

The Full Architecture

RAG Architecture: Indexing Pipeline (Documents → Chunk → Embed → Store) and Retrieval Pipeline (Query → Retrieve via vector similarity search → Augment → Generate)

The indexing pipeline (bottom lane) runs offline in batch — you chunk documents, attach metadata, embed the chunks, and store them in the vector store. The retrieval pipeline (top lane) runs on every user request — the query is embedded, metadata filters narrow the search space, vector similarity search finds the most relevant chunks, those chunks augment the prompt, and the LLM generates the answer.

1. Chunking — Why It Matters

You can't embed an entire 50-page document as a single vector — you'd lose the ability to pinpoint which part of the document is relevant. Chunking splits documents into retrievable units. The strategy affects retrieval quality significantly:

| Strategy | How it works | Best for |
|---|---|---|
| Fixed-size | Split every N tokens (e.g. 512), with optional overlap between chunks | Simple starting point; works reasonably for most content |
| Sentence-boundary | Split at sentence ends to avoid cutting mid-thought | Prose documents, articles, documentation |
| Semantic | Group sentences with similar meaning into chunks; split when the topic changes | Long documents with distinct sections |
| Hierarchical | Index at multiple granularities (e.g. paragraph + document summary) | When you need both precise retrieval and broader context in the response |

Chunk size involves a tradeoff: smaller chunks → more precise retrieval; larger chunks → more context per retrieved piece. A common starting point is 512 tokens with 50-token overlap. Test on your actual queries.
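The fixed-size strategy is only a few lines. A sketch operating on a pre-tokenized list; plain strings stand in for tokens here, where a real pipeline would tokenize with the embedding model's tokenizer:

Python — Fixed-Size Chunking with Overlap
```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping size - overlap each time.
    Overlap keeps sentences that straddle a boundary retrievable from both chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# 10 "tokens", chunks of 4 with overlap of 1 -> windows start at positions 0, 3, 6, 9
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_fixed(tokens, size=4, overlap=1)
print([c[0] for c in chunks])  # ['t0', 't3', 't6', 't9']
```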

2. Retrieval — Finding What's Relevant

Three retrieval approaches:

  • Dense (semantic): embed the query and return the nearest vectors. Matches by meaning; the default for RAG.
  • Sparse (keyword): classic term matching such as BM25. Wins on exact identifiers, product codes, and rare names that embeddings blur.
  • Hybrid: run both and merge the scores. This is what the HYBRID search type in Bedrock Knowledge Bases does.

3. Metadata Filtering — Narrowing the Search Space

Without metadata filtering, a query for "what is our vacation policy?" searches every chunk in your index — and vector similarity might surface chunks from the engineering wiki or sales playbook that are topically close but completely wrong for the user asking. Metadata filtering constrains retrieval to the relevant slice of your knowledge base before (or alongside) the similarity search.

Two Levels of Metadata

Metadata can live at two granularities, and both get attached to each chunk at index time:

  • Document-level: shared by every chunk from the same source (department, doc_type, author, status)
  • Chunk-level: specific to the individual chunk (section heading, page number, position within the document)

How It Works

At index time, attach metadata when you store each chunk. At retrieval time, pass a filter alongside the query — the vector store applies it before computing similarities:

Python — Metadata Filtering with Chroma
# Index time: attach metadata to each chunk
collection.add(
    ids=["chunk-001", "chunk-002"],
    embeddings=[embed("...vacation policy text..."), embed("...sales playbook text...")],
    documents=["...vacation policy text...", "...sales playbook text..."],
    metadatas=[
        {"department": "hr", "doc_type": "policy", "status": "published"},
        {"department": "sales", "doc_type": "playbook", "status": "published"}
    ]
)

# Retrieval time: filter to HR docs only before similarity search
results = collection.query(
    query_embeddings=[embed(user_query)],
    n_results=5,
    where={"department": "hr"}   # only search HR chunks
)

OpenSearch, pgvector, and Bedrock Knowledge Bases all support similar pre-filtering. In Bedrock Knowledge Bases, you pass a filter object in the retrieval configuration alongside your query.

When vector store metadata isn't enough. Vector store metadata is denormalized — it's a copy attached to each chunk. If that metadata changes (a document moves departments, an author leaves), you need to re-index. For metadata that is relational, changes frequently, or requires complex filtering logic (JOINs, multi-table lookups), consider a separate structured store (RDS/DynamoDB) that maps chunk IDs to metadata. At retrieval time, query the structured store first to get the relevant chunk IDs, then run vector search limited to those IDs. More infrastructure to manage, but full relational power without re-embedding.
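A minimal sketch of that two-step pattern. The dict here stands in for the structured store (in production, an RDS or DynamoDB table), and the vector-search step is shown as a Chroma-style filter, assuming chunk IDs were also stored as a metadata field:

Python — Structured Lookup, Then Vector Search
```python
# Hypothetical structured store: a dict standing in for an RDS/DynamoDB lookup table
chunk_owner = {
    "chunk-001": {"department": "hr", "author": "alice"},
    "chunk-002": {"department": "sales", "author": "bob"},
    "chunk-003": {"department": "hr", "author": "carol"},
}

def chunk_ids_for(department: str) -> list[str]:
    """Step 1: query the structured store for chunk IDs matching relational criteria."""
    return [cid for cid, meta in chunk_owner.items() if meta["department"] == department]

ids = chunk_ids_for("hr")
print(ids)  # ['chunk-001', 'chunk-003']

# Step 2: restrict the vector search to those IDs. With Chroma, one way is an
# $in filter on a chunk_id metadata field (collection and embed as defined earlier):
# results = collection.query(
#     query_embeddings=[embed(user_query)],
#     n_results=5,
#     where={"chunk_id": {"$in": ids}},
# )
```

When the metadata changes, only the lookup table is updated; no chunks need re-embedding.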

4. Reranking — Refining the Top-K

Vector search retrieves the approximate top-K most relevant chunks (e.g. top 20). A reranker takes those 20 and re-scores them using a slower but more accurate cross-encoder model, returning the top 3-5 for the prompt. Cohere Rerank is the most commonly used option and is available via Bedrock.

Reranking adds latency and cost, but significantly improves precision — worth it when your answer quality is sensitive to context relevance.
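The orchestration itself is simple: score each (query, chunk) pair, sort, truncate. In this sketch the scorer is stubbed with word overlap purely so the flow is runnable; in production that function would be a cross-encoder call such as Cohere Rerank:

Python — Reranking the Top-K
```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-score retrieved chunks with a more accurate (slower) model; keep the best."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

# Stub scorer for illustration: counts shared words. A real cross-encoder reads the
# query and chunk together and outputs a learned relevance score.
def word_overlap(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["refund policy for digital products", "office parking rules", "refund request form"]
print(rerank("digital refund policy", chunks, word_overlap, top_n=2))
```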

5. Prompt Augmentation

The final step: inject the retrieved chunks into the prompt before sending to the LLM. A typical pattern:

RAG Prompt Template
You are a helpful assistant. Answer the user's question using ONLY the provided context.
If the context doesn't contain enough information to answer, say so.

<context>
{retrieved_chunk_1}

{retrieved_chunk_2}

{retrieved_chunk_3}
</context>

Question: {user_query}
Answer:

Key practices: use delimiters to clearly separate context from the question, instruct the model to stay within the provided context, and include a fallback for when the context is insufficient. This ties back to the prompt engineering techniques in Module 2.
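Filling that template is a short helper. A sketch; the separator between chunks and the instruction wording are both worth tuning for your model:

Python — Assembling the Augmented Prompt
```python
def build_rag_prompt(chunks: list[str], question: str) -> str:
    """Assemble the augmented prompt: instructions, delimited context, then the question."""
    context = "\n\n".join(chunks)  # blank line between chunks keeps them visually distinct
    return (
        "You are a helpful assistant. Answer the user's question using ONLY the provided context.\n"
        "If the context doesn't contain enough information to answer, say so.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    ["Refunds for digital products are available within 14 days.",
     "Physical goods may be returned within 30 days."],
    "What is the refund window for digital products?",
)
print(prompt)
```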

AWS Bedrock Knowledge Bases: Managed RAG

Bedrock Knowledge Bases is AWS's fully managed RAG service. Rather than building the indexing and retrieval pipeline yourself (chunks → embed → store → retrieve), you configure it and AWS runs it. For most production use cases on AWS, this should be your starting point.

What It Handles For You

  • Ingestion: connectors pull documents from sources such as S3 and re-sync as they change
  • Chunking: built-in strategies applied at ingestion time
  • Embedding: each chunk is embedded with the model you choose (e.g. Titan Text Embeddings V2)
  • Vector storage: provisions and manages the backing vector store
  • Retrieval: semantic or hybrid search, with optional metadata filtering
  • Citations: responses include references back to the source documents

Querying a Knowledge Base

Python — Retrieve & Generate with Bedrock KB
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for digital products?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID"   # dense + sparse combined
                }
            }
        }
    }
)

print(response["output"]["text"])

# Citations are returned alongside the answer
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        print(f"  Source: {ref['location']['s3Location']['uri']}")

When to Use Bedrock KB vs. Build Your Own

| | Bedrock Knowledge Bases | Custom RAG Pipeline |
|---|---|---|
| Setup time | Minutes (console or CDK) | Days to weeks |
| Chunking control | Good (4 built-in strategies) | Full control |
| Custom retrieval logic | Limited | Full control (reranking, filtering, multi-hop) |
| Data sources | Connectors for major sources | Any source you can code against |
| AWS integration | Native (IAM, CloudTrail, VPC) | Must wire up yourself |
| Cost model | Per query + per token | Vector store + embedding costs |
| Best for | Most production use cases; internal knowledge bases; prototyping | Complex retrieval requirements; non-standard data; full control needed |

Rule of thumb: start with Bedrock Knowledge Bases. Build a custom pipeline only if KB's retrieval quality or configuration options aren't sufficient for your use case after testing.
