RAG
Retrieval-Augmented Generation
RAG is an AI architecture that combines retrieval (searching external data) with generation (language model output) to produce more accurate and informed responses.
Core Components
1. Retriever
- Finds relevant documents from a knowledge base (vector store, database, etc.)
- Uses vector embeddings to perform semantic search (a minimal retriever sketch follows this list)
2. Generator
- A large language model (LLM) that takes the retrieved documents and query
- Produces a final, informed answer
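As a minimal sketch of the retriever, the snippet below runs a semantic search over a toy set of precomputed embedding vectors using plain NumPy; `embed` is a hypothetical stand-in for whatever embedding model is used (OpenAI, Sentence Transformers, etc.).

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in: in practice, call an embedding model here
    # (e.g. a Sentence Transformers model or an embeddings API).
    raise NotImplementedError

def cosine_search(query_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 3):
    # Normalize so that a dot product equals cosine similarity.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q                     # similarity of every document to the query
    top = np.argsort(-scores)[:k]      # indices of the k best matches
    return top, scores[top]
```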
How It Works
- User query → converted to an embedding vector
- Retriever searches for relevant documents using similarity search
- Top-k documents + original query → passed to the Generator
- LLM uses both to generate the final response (see the end-to-end sketch below)
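Putting the steps together, here is a hedged end-to-end sketch that reuses `embed` and `cosine_search` from the retriever example above; `llm` is a hypothetical stand-in for any chat-completion call.

```python
def answer(query: str, docs: list[str], doc_vecs, k: int = 3) -> str:
    # 1. User query -> embedding vector.
    q_vec = embed(query)                               # hypothetical embedding call
    # 2. Similarity search for the top-k relevant documents.
    top_idx, _ = cosine_search(q_vec, doc_vecs, k=k)
    context = "\n\n".join(docs[i] for i in top_idx)
    # 3. Top-k documents + original query -> generator prompt.
    prompt = f"Use the context to answer.\n\nContext:\n{context}\n\nQuestion: {query}"
    # 4. The LLM uses both to generate the grounded response.
    return llm(prompt)                                 # hypothetical completion call
```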
Benefits
- Grounds the LLM's answers in factual, up-to-date external data
- Works with custom knowledge bases
- Avoids retraining the LLM when new information is added
- Improves accuracy and reduces hallucination
Use Cases
- Chatbots with proprietary knowledge
- AI assistants with document search
- Customer service agents
- Legal, research, and data analysis tools
Related Concepts
- Vector databases (e.g. Pinecone, FAISS)
- Vector embeddings (e.g. OpenAI, Cohere, Sentence Transformers)
- Prompt engineering for context injection (an example template follows this list)
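To make the last point concrete, one common pattern for context injection is a template that fences the retrieved passages and instructs the model to answer only from them; the template below is an illustrative sketch, not a canonical format.

```python
retrieved = [
    "RAG combines retrieval with generation.",
    "It grounds LLM output in external data.",
]

PROMPT_TEMPLATE = """Answer the question using ONLY the context below.
If the answer is not in the context, say you don't know.

Context:
{context}

Question: {question}
Answer:"""

prompt = PROMPT_TEMPLATE.format(
    context="\n\n".join(retrieved),
    question="What is RAG?",
)
```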
Approximate Nearest Neighbors (ANN)
Approximate nearest neighbor (ANN) search is a class of algorithms designed to efficiently retrieve vectors that are close to a query vector in high-dimensional space, trading a small amount of accuracy for large gains in speed.
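As an illustration (assuming the faiss library is installed), an IVF index partitions the vectors into clusters and probes only a few of them at query time, trading a little recall for speed:

```python
import numpy as np
import faiss

d = 128                                               # embedding dimension
xb = np.random.random((10_000, d)).astype("float32")  # database vectors
xq = np.random.random((5, d)).astype("float32")       # query vectors

nlist = 100                                # number of clusters to partition into
quantizer = faiss.IndexFlatL2(d)           # exact index used for cluster assignment
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)                            # learn the cluster centroids
index.add(xb)

index.nprobe = 10                          # clusters probed per query (speed/recall knob)
distances, ids = index.search(xq, 5)       # approximate top-5 neighbors per query
```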
Top-K Retrieval
Top-K retrieval refers to the process of returning the K most relevant or closest results from a dataset in response to a query. In vector search, this typically means the K nearest neighbors to the query embedding under a similarity metric such as cosine similarity.
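A worked sketch of top-K selection over similarity scores; np.argpartition finds the K largest in linear time, and only those K are then sorted for a ranked result.

```python
import numpy as np

scores = np.array([0.12, 0.87, 0.45, 0.91, 0.33, 0.78])  # similarity per document
k = 3

top_k = np.argpartition(-scores, k)[:k]     # k best indices, in arbitrary order
top_k = top_k[np.argsort(-scores[top_k])]   # rank just those k by score
print(top_k, scores[top_k])                 # [3 1 5] [0.91 0.87 0.78]
```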
Vector Databases
Vector databases are specialized data stores designed to handle high-dimensional vector representations (typically generated from embeddings) and to support fast similarity search over them.
AI Agents
Agents are autonomous or semi-autonomous systems powered by LLMs that can take actions, make decisions, and operate over time to accomplish a goal.
Context Window
The context window refers to the maximum number of tokens a language model can process in a single request. It includes both the input prompt and the generated output.
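To make the token budget concrete, here is a sketch using the tiktoken library (an assumption; any tokenizer with a compatible vocabulary works) to check that a prompt plus a reserved output budget fits inside the window:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by several OpenAI models

CONTEXT_WINDOW = 8192   # example window size; varies by model
MAX_OUTPUT = 512        # tokens reserved for the model's reply

prompt = "Context:\n...retrieved documents...\n\nQuestion: What is RAG?"
n_prompt = len(enc.encode(prompt))

# Input and output share one window, so both must fit.
assert n_prompt + MAX_OUTPUT <= CONTEXT_WINDOW, "prompt too long for this window"
print(f"{n_prompt} prompt tokens, {CONTEXT_WINDOW - n_prompt} tokens left for output")
```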
Hallucinations
Hallucinations occur when a language model generates content that is confidently wrong, fabricated, or misleading, despite sounding plausible. These errors are a key motivation for grounding techniques such as RAG.
Prompt Engineering
Prompt engineering is the practice of crafting effective input prompts to guide large language models (LLMs) toward desired outputs. It is a core skill for building reliable LLM applications.