Module 4: Embeddings & Retrieval-Augmented Generation (RAG)

In Module 1 we saw that LLMs have real limitations: a knowledge cutoff date, no access to your private data, and a tendency to hallucinate when asked about facts they don't know. Retrieval-Augmented Generation (RAG) is the most widely used technique to address all three — and embeddings are what make it work.

This module builds on the embeddings introduction in Module 1 and picks up the RAG coverage that Module 2 deferred to here.

What You'll Learn

  • Embeddings (deep dive): How semantic similarity works, why it beats keyword search, and what dimensions actually mean
  • Embedding Models: Amazon Titan, Cohere — how to choose the right one
  • Vector Stores: Where embeddings live — OpenSearch, Aurora pgvector, FAISS, Pinecone, and when to use each
  • RAG Pipeline: The full architecture — chunking, retrieval, reranking, and prompt augmentation
  • AWS Bedrock Knowledge Bases: Fully managed RAG that collapses most of the above into configuration

Embeddings: Meaning as Mathematics

In Module 1 we introduced embeddings as "words represented as points in space where similar words are closer together." Here we go deeper — because understanding how similarity search works is what makes RAG intuitive.

From Words to Vectors

An embedding model takes text (a word, sentence, or entire document) and outputs a fixed-length list of numbers — a vector. These numbers aren't random; they encode semantic meaning. Words or passages with similar meaning end up with similar vectors.

Classic intuition: In a well-trained embedding space, king − man + woman ≈ queen. The vector arithmetic works because "gender" and "royalty" are encoded as directions in the vector space. You don't program these relationships — they emerge from training on large text corpora.

Cosine Similarity

To find how similar two pieces of text are, you compare their vectors using cosine similarity — the cosine of the angle between them. A score of 1.0 means identical meaning; 0.0 means unrelated; -1.0 means opposite meaning. This is how a search system finds the most relevant documents for a query — it embeds the query, then finds the stored documents with the highest cosine similarity.
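The math is a one-liner: the dot product of the two vectors divided by the product of their magnitudes. A toy sketch with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions, but the computation is identical):

Python — Cosine Similarity by Hand
```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: dot product over the product of magnitudes."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" for illustration only
password_reset   = np.array([0.9, 0.1, 0.0])
account_recovery = np.array([0.8, 0.3, 0.1])
pizza_recipe     = np.array([0.0, 0.2, 0.9])

print(cosine_similarity(password_reset, account_recovery))  # high: similar meaning
print(cosine_similarity(password_reset, pizza_recipe))      # low: unrelated
```

A search system runs exactly this comparison between the query vector and every stored vector (or an approximation of it, at scale), then returns the highest-scoring documents.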

Why Embeddings Beat Keyword Search

Keyword search requires the exact same words to appear. A query for "how do I reset my password" won't find a document titled "account recovery instructions" — different words, same meaning. Embeddings search by meaning, not by word match. This is the core superpower that makes RAG possible.

Not Just Text

Embedding models can encode more than text:

  • Images: multimodal models such as Titan Multimodal Embeddings map images and text into one vector space, so a text query can retrieve matching images (and vice versa)
  • Other modalities: audio, code, and structured product data all have embedding models built on the same principle

What Do Dimensions Mean?

An embedding vector might have 256, 512, 1024, or 3072 dimensions. More dimensions = more capacity to capture nuance, but also higher storage and compute cost. For most RAG use cases, 1024 dimensions is the sweet spot. The actual values of each dimension aren't interpretable — what matters is the relative distance between vectors.
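The storage side of that tradeoff is easy to estimate: raw vector storage is dimensions × 4 bytes (float32) per chunk, before index overhead. A quick back-of-envelope calculation:

Python — Estimating Vector Storage
```python
def index_size_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage (float32 = 4 bytes per dimension), excluding index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1024**3

# 1M chunks: 1024 dimensions vs 3072 dimensions
print(round(index_size_gb(1_000_000, 1024), 2))  # ~3.81 GB
print(round(index_size_gb(1_000_000, 3072), 2))  # ~11.44 GB
```

Tripling the dimensions triples storage (and similarity-computation cost) for every query, which is why 1024 is a reasonable default unless benchmarks on your own data justify more.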

Embedding Models: Which One to Use?

Choosing the right embedding model matters — the same text will produce different vectors depending on which model you use, and you must use the same model consistently for both indexing and querying.

Key Options

| Model | Provider | Dimensions | Best For | Notes |
|---|---|---|---|---|
| Titan Text Embeddings V2 | Amazon (via Bedrock) | 256 / 512 / 1024 | General RAG, AWS-native workloads | Adjustable dimensions — lower = cheaper storage; supports 8K token input |
| Titan Multimodal Embeddings | Amazon (via Bedrock) | 384 / 1024 | Image + text search in one index | Single embedding space for both modalities |
| Cohere Embed v3 | Cohere (via Bedrock) | 1024 | High-quality retrieval, multilingual | English and multilingual variants; strong benchmark performance |
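Calling Titan Text Embeddings V2 through the Bedrock runtime is a single invoke_model call. A sketch (the request fields inputText, dimensions, and normalize follow the Titan V2 API; the invoke_model call itself needs AWS credentials and Bedrock model access):

Python — Embedding Text with Titan V2
```python
import json

def titan_request_body(text: str, dimensions: int = 1024) -> str:
    """Build the Titan Text Embeddings V2 request body."""
    assert dimensions in (256, 512, 1024)
    return json.dumps({
        "inputText": text,
        "dimensions": dimensions,  # lower = cheaper storage, some nuance lost
        "normalize": True,         # unit-length vectors: dot product equals cosine similarity
    })

def embed(text: str, dimensions: int = 1024) -> list[float]:
    import boto3  # imported lazily so the body builder works without AWS installed
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    response = bedrock.invoke_model(
        modelId="amazon.titan-embed-text-v2:0",
        body=titan_request_body(text, dimensions),
    )
    return json.loads(response["body"].read())["embedding"]
```

Setting normalize to true means every returned vector has length 1, so a plain dot product gives you cosine similarity directly.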

How to Choose

Consistency is non-negotiable. You must use the same embedding model (and same dimension setting) for both indexing your documents and embedding queries at retrieval time. Mixing models — even two versions of the same model — produces incompatible vector spaces and silently breaks retrieval.

Vector Stores: Where Embeddings Live

Once you've embedded your documents, you need somewhere to store the vectors and — critically — query them efficiently. Finding the top-K nearest vectors in a 1024-dimensional space across millions of documents isn't something a traditional SQL database does well. Vector stores are built (or extended) for this.

AWS Options

| Store | What it is | When to use it |
|---|---|---|
| Amazon OpenSearch (k-NN) | Built for full-text search (like Elasticsearch); the k-NN plugin adds approximate nearest-neighbor vector search. You supply your own vectors — embed documents first, then store them. You manage the index, shards, and cluster. | Most flexible AWS option. Good if you need both text search and vector search on the same data, or need fine-grained control over indexing and retrieval. |
| Aurora pgvector | A PostgreSQL extension that adds a vector column type and similarity search operators. Runs on Amazon Aurora (PostgreSQL-compatible). | Good if your application already uses Aurora and you want to avoid another service. Simpler operationally, but slower than purpose-built vector DBs at large scale. |
What about Amazon Kendra? Kendra is often mentioned alongside these, but it's a different category entirely. It's an enterprise search appliance — you point it at your document sources (S3, SharePoint, Confluence), and AWS handles indexing, embedding, and retrieval internally. You never manage vectors. Good for "build a company-wide search engine" use cases, but too opaque and expensive for building custom RAG pipelines. Bedrock Knowledge Bases is the better managed option for RAG.

Third-Party Options

| Store | What it is | When to use it |
|---|---|---|
| FAISS | Facebook AI Similarity Search — a library (not a server) that runs in-process. You load vectors into RAM and query them locally. | Local development, prototyping, offline use cases. No server to manage, but no persistence across restarts and doesn't scale beyond one machine. |
| Pinecone | Fully managed, purpose-built vector database. Simple API, serverless option available. | Quickest path to production if you don't want to manage infrastructure. Supported as a Bedrock Knowledge Bases backend. |
| Chroma | Open-source vector DB (Apache 2.0) with a developer-friendly API. Runs embedded in-process or as a standalone server. Persists to disk by default. | Best starting point for RAG prototyping — clean API, works well with LangChain/LlamaIndex, handles persistence without extra setup. |
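To demystify what all of these stores do, here is the core idea in miniature: brute-force top-K search over normalized vectors, the same computation an exact index (like FAISS's IndexFlatIP) performs. Approximate indexes such as HNSW trade exactness for speed at scale. Toy vectors, illustration only:

Python — A Vector Store in Miniature
```python
import numpy as np

class TinyVectorStore:
    """Brute-force top-K cosine search: the essence of a vector store, minus the speed."""
    def __init__(self, dim: int):
        self.vectors = np.empty((0, dim), dtype=np.float32)
        self.ids: list[str] = []

    def add(self, id_: str, vector: np.ndarray) -> None:
        v = vector / np.linalg.norm(vector)   # normalize so dot product == cosine similarity
        self.vectors = np.vstack([self.vectors, v])
        self.ids.append(id_)

    def search(self, query: np.ndarray, k: int = 3) -> list[tuple[str, float]]:
        q = query / np.linalg.norm(query)
        scores = self.vectors @ q             # cosine similarity to every stored vector
        top = np.argsort(scores)[::-1][:k]    # indices of the k highest scores
        return [(self.ids[i], float(scores[i])) for i in top]

store = TinyVectorStore(dim=3)
store.add("reset-password", np.array([0.9, 0.1, 0.0]))
store.add("pizza-recipe", np.array([0.0, 0.2, 0.9]))
print(store.search(np.array([0.8, 0.2, 0.1]), k=1))  # "reset-password" ranks first
```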

Quick Decision Guide

  • Local prototyping, throwaway index → FAISS
  • Prototyping with persistence → Chroma
  • Application already on Aurora PostgreSQL → pgvector
  • Full-text search and vector search over the same data → OpenSearch k-NN
  • Managed, purpose-built vector DB with minimal ops → Pinecone
  • Building on Bedrock and want AWS to run the whole pipeline → Bedrock Knowledge Bases

The RAG Pipeline: Architecture & Key Decisions

RAG works by retrieving relevant context from your knowledge base and injecting it into the LLM's prompt before generating a response. The LLM never "learns" your data — it reads it at inference time. This means the retrieval quality directly determines the answer quality.

The Full Architecture

RAG Architecture: Indexing Pipeline (Documents → Chunk → Embed → Store) and Retrieval Pipeline (Query → Retrieve via vector similarity search → Augment → Generate)

The indexing pipeline (bottom lane) runs offline in batch — you chunk documents, attach metadata, embed the chunks, and store them in the vector store. The retrieval pipeline (top lane) runs on every user request — the query is embedded, metadata filters narrow the search space, vector similarity search finds the most relevant chunks, those chunks augment the prompt, and the LLM generates the answer.

1. Chunking — Why It Matters

You can't embed an entire 50-page document as a single vector — you'd lose the ability to pinpoint which part of the document is relevant. Chunking splits documents into retrievable units. The strategy affects retrieval quality significantly:

| Strategy | How it works | Best for |
|---|---|---|
| Fixed-size | Split every N tokens (e.g. 512), with optional overlap between chunks | Simple starting point; works reasonably for most content |
| Sentence-boundary | Split at sentence ends to avoid cutting mid-thought | Prose documents, articles, documentation |
| Semantic | Group sentences with similar meaning into chunks; split when the topic changes | Long documents with distinct sections |
| Hierarchical | Index at multiple granularities (e.g. paragraph + document summary) | When you need both precise retrieval and broader context in the response |

Chunk size involves a tradeoff: smaller chunks → more precise retrieval; larger chunks → more context per retrieved piece. A common starting point is 512 tokens with 50-token overlap. Test on your actual queries.
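The fixed-size strategy is only a few lines. A sketch operating on a pre-tokenized list; plain strings stand in for tokens here, where a real pipeline would tokenize with the embedding model's tokenizer:

Python — Fixed-Size Chunking with Overlap
```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 50) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping size - overlap each time.
    Overlap keeps sentences that straddle a boundary retrievable from both chunks."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

# 10 "tokens", chunks of 4 with overlap of 1 -> windows start at positions 0, 3, 6, 9
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_fixed(tokens, size=4, overlap=1)
print([c[0] for c in chunks])  # ['t0', 't3', 't6', 't9']
```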

2. Retrieval — Finding What's Relevant

Three retrieval approaches:

  • Dense (semantic): embed the query and return the nearest vectors. Matches by meaning; the default for RAG.
  • Sparse (keyword): classic term matching such as BM25. Wins on exact identifiers, product codes, and rare names that embeddings blur.
  • Hybrid: run both and merge the scores. This is what the HYBRID search type in Bedrock Knowledge Bases does.

3. Metadata Filtering — Narrowing the Search Space

Without metadata filtering, a query for "what is our vacation policy?" searches every chunk in your index — and vector similarity might surface chunks from the engineering wiki or sales playbook that are topically close but completely wrong for the user asking. Metadata filtering constrains retrieval to the relevant slice of your knowledge base before (or alongside) the similarity search.

Two Levels of Metadata

Metadata can live at two granularities, and both get attached to each chunk at index time:

  • Document-level: shared by every chunk from the same source (department, doc_type, author, status)
  • Chunk-level: specific to the individual chunk (section heading, page number, position within the document)

How It Works

At index time, attach metadata when you store each chunk. At retrieval time, pass a filter alongside the query — the vector store applies it before computing similarities:

Python — Metadata Filtering with Chroma
# Index time: attach metadata to each chunk
collection.add(
    ids=["chunk-001", "chunk-002"],
    embeddings=[embed("...vacation policy text..."), embed("...sales playbook text...")],
    documents=["...vacation policy text...", "...sales playbook text..."],
    metadatas=[
        {"department": "hr", "doc_type": "policy", "status": "published"},
        {"department": "sales", "doc_type": "playbook", "status": "published"}
    ]
)

# Retrieval time: filter to HR docs only before similarity search
results = collection.query(
    query_embeddings=[embed(user_query)],
    n_results=5,
    where={"department": "hr"}   # only search HR chunks
)

OpenSearch, pgvector, and Bedrock Knowledge Bases all support similar pre-filtering. In Bedrock Knowledge Bases, you pass a filter object in the retrieval configuration alongside your query.

When vector store metadata isn't enough. Vector store metadata is denormalized — it's a copy attached to each chunk. If that metadata changes (a document moves departments, an author leaves), you need to re-index. For metadata that is relational, changes frequently, or requires complex filtering logic (JOINs, multi-table lookups), consider a separate structured store (RDS/DynamoDB) that maps chunk IDs to metadata. At retrieval time, query the structured store first to get the relevant chunk IDs, then run vector search limited to those IDs. More infrastructure to manage, but full relational power without re-embedding.
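A minimal sketch of that two-step pattern. The dict here stands in for the structured store (in production, an RDS or DynamoDB table), and the vector-search step is shown as a Chroma-style filter, assuming chunk IDs were also stored as a metadata field:

Python — Structured Lookup, Then Vector Search
```python
# Hypothetical structured store: a dict standing in for an RDS/DynamoDB lookup table
chunk_owner = {
    "chunk-001": {"department": "hr", "author": "alice"},
    "chunk-002": {"department": "sales", "author": "bob"},
    "chunk-003": {"department": "hr", "author": "carol"},
}

def chunk_ids_for(department: str) -> list[str]:
    """Step 1: query the structured store for chunk IDs matching relational criteria."""
    return [cid for cid, meta in chunk_owner.items() if meta["department"] == department]

ids = chunk_ids_for("hr")
print(ids)  # ['chunk-001', 'chunk-003']

# Step 2: restrict the vector search to those IDs. With Chroma, one way is an
# $in filter on a chunk_id metadata field (collection and embed as defined earlier):
# results = collection.query(
#     query_embeddings=[embed(user_query)],
#     n_results=5,
#     where={"chunk_id": {"$in": ids}},
# )
```

When the metadata changes, only the lookup table is updated; no chunks need re-embedding.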

4. Reranking — Refining the Top-K

Vector search retrieves the approximate top-K most relevant chunks (e.g. top 20). A reranker takes those 20 and re-scores them using a slower but more accurate cross-encoder model, returning the top 3-5 for the prompt. Cohere Rerank is the most commonly used option and is available via Bedrock.

Reranking adds latency and cost, but significantly improves precision — worth it when your answer quality is sensitive to context relevance.
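The orchestration itself is simple: score each (query, chunk) pair, sort, truncate. In this sketch the scorer is stubbed with word overlap purely so the flow is runnable; in production that function would be a cross-encoder call such as Cohere Rerank:

Python — Reranking the Top-K
```python
from typing import Callable

def rerank(query: str, chunks: list[str],
           score: Callable[[str, str], float], top_n: int = 5) -> list[str]:
    """Re-score retrieved chunks with a more accurate (slower) model; keep the best."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_n]

# Stub scorer for illustration: counts shared words. A real cross-encoder reads the
# query and chunk together and outputs a learned relevance score.
def word_overlap(query: str, chunk: str) -> float:
    return len(set(query.lower().split()) & set(chunk.lower().split()))

chunks = ["refund policy for digital products", "office parking rules", "refund request form"]
print(rerank("digital refund policy", chunks, word_overlap, top_n=2))
```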

5. Prompt Augmentation

The final step: inject the retrieved chunks into the prompt before sending to the LLM. A typical pattern:

RAG Prompt Template
You are a helpful assistant. Answer the user's question using ONLY the provided context.
If the context doesn't contain enough information to answer, say so.

<context>
{retrieved_chunk_1}

{retrieved_chunk_2}

{retrieved_chunk_3}
</context>

Question: {user_query}
Answer:

Key practices: use delimiters to clearly separate context from the question, instruct the model to stay within the provided context, and include a fallback for when the context is insufficient. This ties back to the prompt engineering techniques in Module 2.
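Filling that template is a short helper. A sketch; the separator between chunks and the instruction wording are both worth tuning for your model:

Python — Assembling the Augmented Prompt
```python
def build_rag_prompt(chunks: list[str], question: str) -> str:
    """Assemble the augmented prompt: instructions, delimited context, then the question."""
    context = "\n\n".join(chunks)  # blank line between chunks keeps them visually distinct
    return (
        "You are a helpful assistant. Answer the user's question using ONLY the provided context.\n"
        "If the context doesn't contain enough information to answer, say so.\n\n"
        f"<context>\n{context}\n</context>\n\n"
        f"Question: {question}\nAnswer:"
    )

prompt = build_rag_prompt(
    ["Refunds for digital products are available within 14 days.",
     "Physical goods may be returned within 30 days."],
    "What is the refund window for digital products?",
)
print(prompt)
```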

AWS Bedrock Knowledge Bases: Managed RAG

Bedrock Knowledge Bases is AWS's fully managed RAG service. Rather than building the indexing and retrieval pipeline yourself (chunks → embed → store → retrieve), you configure it and AWS runs it. For most production use cases on AWS, this should be your starting point.

What It Handles For You

  • Ingestion: connectors pull documents from sources such as S3 and re-sync as they change
  • Chunking: built-in strategies applied at ingestion time
  • Embedding: each chunk is embedded with the model you choose (e.g. Titan Text Embeddings V2)
  • Vector storage: provisions and manages the backing vector store
  • Retrieval: semantic or hybrid search, with optional metadata filtering
  • Citations: responses include references back to the source documents

Querying a Knowledge Base

Python — Retrieve & Generate with Bedrock KB
import boto3

bedrock_agent_runtime = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = bedrock_agent_runtime.retrieve_and_generate(
    input={"text": "What is our refund policy for digital products?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "YOUR_KB_ID",
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/amazon.nova-pro-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 5,
                    "overrideSearchType": "HYBRID"   # dense + sparse combined
                }
            }
        }
    }
)

print(response["output"]["text"])

# Citations are returned alongside the answer
for citation in response.get("citations", []):
    for ref in citation.get("retrievedReferences", []):
        print(f"  Source: {ref['location']['s3Location']['uri']}")

When to Use Bedrock KB vs. Build Your Own

| | Bedrock Knowledge Bases | Custom RAG Pipeline |
|---|---|---|
| Setup time | Minutes (console or CDK) | Days to weeks |
| Chunking control | Good (4 built-in strategies) | Full control |
| Custom retrieval logic | Limited | Full control (reranking, filtering, multi-hop) |
| Data sources | Connectors for major sources | Any source you can code against |
| AWS integration | Native (IAM, CloudTrail, VPC) | Must wire up yourself |
| Cost model | Per query + per token | Vector store + embedding costs |
| Best for | Most production use cases; internal knowledge bases; prototyping | Complex retrieval requirements; non-standard data; full control needed |

Rule of thumb: start with Bedrock Knowledge Bases. Build a custom pipeline only if KB's retrieval quality or configuration options aren't sufficient for your use case after testing.
