Retrieval-Augmented Generation

Agent Patterns

Summary

Retrieval-Augmented Generation (RAG) combines information retrieval with text generation. For each user query, relevant documents are retrieved from a knowledge base and passed to the LLM as context for generating the response. Grounding the response in retrieved text reduces hallucinations and lets answers reflect information that is as current and authoritative as the knowledge base itself.

How it works

  1. Query Processing: Transform the user query into a form suited to retrieval (for example, an embedding or a keyword query)
  2. Document Retrieval: Search the knowledge base for relevant passages
  3. Context Assembly: Combine the retrieved passages with the user query into a single prompt
  4. Generation: The LLM produces a grounded response from the augmented prompt
  5. Verification (optional): Attach citations or source attribution so claims can be checked
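
The five steps above can be sketched as a small in-memory pipeline. This is a minimal illustration, not a production design: the `Doc` shape, the word-overlap scoring in `retrieve`, the prompt wording, and the `generate` callback (standing in for whatever LLM client is used) are all assumptions made for the example.

```typescript
// Toy retrieve-then-read pipeline. All names here are illustrative.
interface Doc {
  id: string;
  text: string;
}

// Hypothetical stand-in for an LLM call: takes a prompt, returns text.
type Generator = (prompt: string) => Promise<string>;

// Step 1: query processing (here: lowercase and split into word tokens).
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Step 2: document retrieval (here: rank by number of tokens shared with the query).
function retrieve(query: string, docs: Doc[], k: number): Doc[] {
  const terms = new Set(tokenize(query));
  return docs
    .map((doc) => ({
      doc,
      score: tokenize(doc.text).filter((t) => terms.has(t)).length,
    }))
    .filter((scored) => scored.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((scored) => scored.doc);
}

// Step 3: context assembly, tagging each passage with its id for citation.
function buildPrompt(query: string, passages: Doc[]): string {
  const context = passages.map((p) => `[${p.id}] ${p.text}`).join("\n");
  return `Answer using only the sources below and cite their ids.\n\nSources:\n${context}\n\nQuestion: ${query}`;
}

// Steps 4 and 5: generation, plus the list of source ids that were supplied.
async function answer(query: string, docs: Doc[], generate: Generator, k = 3) {
  const passages = retrieve(query, docs, k);
  if (passages.length === 0) {
    // Nothing retrieved: say so instead of asking the model to guess.
    return { text: "No relevant sources found.", sources: [] as string[] };
  }
  const text = await generate(buildPrompt(query, passages));
  return { text, sources: passages.map((p) => p.id) };
}
```

Any async function that calls a model can be passed as `generate`. A real system would swap the word-overlap scorer for BM25 or embedding search while keeping the same overall shape.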

RAG architectures

  • Naive RAG: Retrieve-then-read pipeline
  • Agentic RAG: Iterative retrieval with planning and tool use
  • Hybrid RAG: Combine dense and sparse retrieval
  • Graph RAG: Leverage knowledge graph relationships
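
As an illustration of the Hybrid RAG entry, one common way to merge a dense (embedding) ranking with a sparse (keyword) ranking is reciprocal rank fusion (RRF). The sketch below assumes each retriever has already returned an ordered list of document ids. The function name and the default constant `k = 60` are choices made for this example, not something the pattern prescribes.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) per document,
// where rank is 1-based. Documents ranked highly by several lists rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const contribution = 1 / (k + index + 1);
      scores.set(id, (scores.get(id) ?? 0) + contribution);
    });
  }
  // Sort ids by fused score, highest first.
  return Array.from(scores.entries())
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// Hypothetical results from two retrievers over the same corpus.
const denseRanking = ["a", "b", "c"];
const sparseRanking = ["b", "d", "a"];
const fused = reciprocalRankFusion([denseRanking, sparseRanking]);
```

RRF uses only rank positions, so it avoids having to normalize BM25 scores and cosine similarities onto a common scale.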

Component considerations

  • Retriever: BM25, dense embeddings, hybrid approaches
  • Index: FAISS, Pinecone, Weaviate, Elasticsearch
  • Chunking: Fixed-size, semantic, recursive strategies
  • Retrieval: Top-k, maximal marginal relevance (MMR), similarity thresholds
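
The selection options in the last bullet can be combined in one function. The sketch below assumes candidates already carry embedding vectors. It drops candidates below a similarity threshold, then picks up to `k` results with MMR, which trades relevance to the query against similarity to results already chosen. The `Candidate` shape, parameter names, and defaults are illustrative assumptions.

```typescript
type Vec = number[];

interface Candidate {
  id: string;
  embedding: Vec;
}

// Cosine similarity; returns 0 when either vector has zero length.
function cosine(a: Vec, b: Vec): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// lambda = 1 is plain top-k by relevance; lower values favor diversity.
function mmrSelect(
  query: Vec,
  candidates: Candidate[],
  k: number,
  lambda = 0.5,
  minSimilarity = 0,
): Candidate[] {
  // Similarity threshold: keep only candidates relevant enough to the query.
  const pool = candidates
    .map((c) => ({ c, relevance: cosine(query, c.embedding) }))
    .filter((entry) => entry.relevance >= minSimilarity);

  const selected: Candidate[] = [];
  while (selected.length < k && pool.length > 0) {
    let bestIndex = 0;
    let bestScore = -Infinity;
    pool.forEach((entry, i) => {
      // Redundancy: highest similarity to anything already selected.
      const redundancy =
        selected.length === 0
          ? 0
          : Math.max(...selected.map((s) => cosine(entry.c.embedding, s.embedding)));
      const score = lambda * entry.relevance - (1 - lambda) * redundancy;
      if (score > bestScore) {
        bestScore = score;
        bestIndex = i;
      }
    });
    selected.push(pool.splice(bestIndex, 1)[0].c);
  }
  return selected;
}
```

With a low `lambda`, a near-duplicate of an already selected passage loses out to a less similar but still relevant one, which helps when the knowledge base holds many overlapping chunks.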

Build This Pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me a RAG (Retrieval-Augmented Generation) system.

Architecture: implement a pipeline with separate modules for ingestion, indexing, retrieval, and generation. For ingestion: accept document uploads (PDF, DOCX, TXT, MD) and use recursive chunking (1000-character chunks, 200-character overlap) with document-level metadata tracking. Generate embeddings using a configurable embedding model and store them in PostgreSQL with the pgvector extension. For retrieval: implement hybrid search combining semantic similarity and keyword matching. For generation: pass retrieved chunks with source metadata to the LLM for answer generation with inline citations.

Error handling: handle unsupported file formats with clear error messages. If embedding generation fails, queue documents for retry. Handle empty search results by returning a "no relevant sources found" response instead of hallucinating.

Edge cases: handle very large documents with progress tracking across chunks. Deduplicate overlapping chunks from multiple documents at retrieval time. Handle multilingual queries by detecting language and adjusting accordingly.

Best practices: use a chat interface showing which sources contributed to each answer. Include an admin panel for document management (upload, delete, reindex). Log query latency broken down by retrieval versus generation.

Testing: unit test chunking with various input formats. Test retrieval precision and recall with known document sets. Integration test the full RAG pipeline from upload to answer.

Stack: TypeScript with Next.js.
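
For reference, the chunking settings named in the prompt (1000-character chunks, 200-character overlap) could look roughly like the sketch below. It is a simplified take on recursive chunking, not the behavior of any particular library. The separator order, the function names, and the way overlap is added (prefixing each chunk with the tail of the previous piece) are all assumptions made for the example.

```typescript
// Separators tried in order: paragraphs, lines, words, then a hard character cut.
const SEPARATORS = ["\n\n", "\n", " ", ""];

// Split text into pieces of at most `size` characters, preferring coarse boundaries.
function splitRecursive(text: string, size: number, separators: string[] = SEPARATORS): string[] {
  if (text.length <= size) return text.length > 0 ? [text] : [];
  const [sep, ...rest] = separators;
  if (sep === undefined || sep === "") {
    // No boundary left to try: cut every `size` characters.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += size) cuts.push(text.slice(i, i + size));
    return cuts;
  }
  const pieces: string[] = [];
  let current = "";
  for (const part of text.split(sep)) {
    const candidate = current === "" ? part : current + sep + part;
    if (candidate.length <= size) {
      current = candidate; // keep merging neighbors while they fit
    } else {
      if (current !== "") pieces.push(current);
      if (part.length > size) {
        // This part alone is too big: retry it with the next, finer separator.
        pieces.push(...splitRecursive(part, size, rest));
        current = "";
      } else {
        current = part;
      }
    }
  }
  if (current !== "") pieces.push(current);
  return pieces;
}

// Each chunk after the first starts with the last `overlap` characters of the
// previous piece, so no chunk exceeds `chunkSize`.
function chunkWithOverlap(text: string, chunkSize = 1000, overlap = 200): string[] {
  if (overlap < 0 || overlap >= chunkSize) {
    throw new Error("overlap must be non-negative and smaller than chunkSize");
  }
  const base = splitRecursive(text, chunkSize - overlap);
  return base.map((piece, i) =>
    i === 0 || overlap === 0 ? piece : base[i - 1].slice(-overlap) + piece,
  );
}
```

The separator between two pieces is dropped at chunk boundaries here. An implementation that needs exact character offsets for citations would have to track positions explicitly.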