RAG
Retrieval-Augmented Generation
RAG is an AI architecture that combines retrieval (searching external data) with generation (language model output) to produce more accurate and informed responses.
Core Components
1. Retriever
- Finds relevant documents from a knowledge base (vector store, database, etc.)
- Uses vector embeddings to perform semantic search (a minimal retriever sketch follows this list)
2. Generator
- A large language model (LLM) that takes the retrieved documents and query
- Produces a final, informed answer
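As a minimal sketch of the retriever, the snippet below runs a semantic search over a toy set of precomputed embedding vectors using plain NumPy; `embed` is a hypothetical stand-in for whatever embedding model is used (OpenAI, Sentence Transformers, etc.).

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: in practice, call an embedding model here
    # (e.g. a Sentence Transformers model or an embeddings API).
    raise NotImplementedError

def cosine_search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # similarity of every document to the query
    top = np.argsort(-scores)[:k]      # indices of the k best matches
    return top, scores[top]
```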
How It Works
- User query → converted to an embedding vector
- Retriever searches for relevant documents using similarity search
- Top-k documents + original query → passed to the Generator
- LLM uses both to generate the final response (see the end-to-end sketch below)
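Putting the steps together, here is a hedged end-to-end sketch that reuses `embed` and `cosine_search` from the retriever example above; `llm` is a hypothetical stand-in for any chat-completion call.

```python
def answer(query: str, docs: list[str], doc_vecs, k: int = 3) -> str:
    # 1. User query -> embedding vector.
    q_vec = embed(query)                               # hypothetical embedding call
    # 2. Similarity search for the top-k relevant documents.
    top_idx, _ = cosine_search(q_vec, doc_vecs, k=k)
    context = "\n\n".join(docs[i] for i in top_idx)
    # 3. Top-k documents + original query -> generator prompt.
    prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
    # 4. The LLM uses both to generate the grounded response.
    return llm(prompt)                                 # hypothetical completion call
```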
Benefits
- Grounds the LLM's answers in factual, up-to-date external data
- Works with custom knowledge bases
- Avoids retraining the LLM when new information is added
- Improves accuracy and reduces hallucination
Use Cases
- Chatbots with proprietary knowledge
- AI assistants with document search
- Customer service agents
- Legal, research, and data analysis tools
Related Concepts
- Vector databases (e.g. Pinecone, FAISS)
- Vector embeddings (e.g. OpenAI, Cohere, Sentence Transformers)
- Prompt engineering for context injection (an example template follows this list)
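To make the last point concrete, one common pattern for context injection is a template that fences the retrieved passages and instructs the model to answer only from them; the template below is an illustrative sketch, not a canonical format.

```python
retrieved = [
    "RAG combines retrieval with generation.",
    "It grounds LLM output in external data.",
]

PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

prompt = PROMPT_TEMPLATE.format(
    context="\n\n".join(retrieved),
    question="What is RAG?",
)
```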
Approximate Nearest Neighbors (ANN)
Approximate nearest neighbor (ANN) search is a class of algorithms designed to efficiently retrieve vectors that are close to a query vector in high-dimensional space, trading a small amount of accuracy for large gains in speed.
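As an illustration (assuming the faiss library is installed), an IVF index partitions the vectors into clusters and probes only a few of them at query time, trading a little recall for speed:

```python
import numpy as np
import faiss

d = 128                                               # embedding dimension
xb = np.random.random((10_000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")       # query vectors

nlist = 100                                # number of clusters to partition into
quantizer = faiss.IndexFlatL2(d)           # exact index used for cluster assignment
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # learn the cluster centroids
index.add(xb)

index.nprobe = 10                          # clusters probed per query (speed/recall knob)
distances, ids = index.search(xq, 5)       # approximate top-5 neighbors per query
```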
Top-K Retrieval
Top-K retrieval refers to the process of returning the K most relevant or closest results from a dataset in response to a query. In vector search, this typically means the K nearest neighbors to the query embedding under a similarity metric such as cosine similarity.
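A worked sketch of top-K selection over similarity scores; np.argpartition finds the K largest in linear time, and only those K are then sorted for a ranked result.

```python
import numpy as np

scores = np.array([0.12, 0.87, 0.45, 0.91, 0.33, 0.78])  # similarity per document
k = 3

top_k = np.argpartition(-scores, k)[:k]     # k best indices, in arbitrary order
top_k = top_k[np.argsort(-scores[top_k])]   # rank just those k by score
print(top_k, scores[top_k])                 # [3 1 5] [0.91 0.87 0.78]
```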
Vector Databases
Vector databases are specialized data stores designed to handle high-dimensional vector representations (typically generated from embeddings) and to support fast similarity search over them.
AI Agents
Agents are autonomous or semi-autonomous systems powered by LLMs that can take actions, make decisions, and operate over time to accomplish a goal.
Context Window
The context window refers to the maximum number of tokens a language model can process in a single request. It includes both the input prompt and the generated output.
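To make the token budget concrete, here is a sketch using the tiktoken library (an assumption; any tokenizer with a compatible vocabulary works) to check that a prompt plus a reserved output budget fits inside the window:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

CONTEXT_WINDOW = 8192   # example window size; varies by model
MAX_OUTPUT = 512        # tokens reserved for the model's reply

prompt = "Context:\n...retrieved documents...\n\nQuestion: What is RAG?"
n_prompt = len(enc.encode(prompt))

# Input and output share one window, so both must fit.
assert n_prompt + MAX_OUTPUT <= CONTEXT_WINDOW, "prompt too long for this window"
print(f"{n_prompt} prompt tokens, {CONTEXT_WINDOW - n_prompt} tokens left for output")
```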
Hallucinations
Hallucinations occur when a language model generates content that is confidently wrong, fabricated, or misleading, despite sounding plausible. These errors are a key motivation for grounding techniques such as RAG.
Prompt Engineering
Prompt engineering is the practice of crafting effective input prompts to guide large language models (LLMs) toward desired outputs. It is a core skill for building reliable LLM applications.