Retrieval-Augmented Generation: AI Pattern

Summary

Retrieval-Augmented Generation (RAG) combines information retrieval with text generation. For each user query, relevant documents are retrieved from a knowledge base and supplied to the LLM as context for its response. This grounding reduces hallucinations and helps responses reflect current, authoritative information, as long as the knowledge base itself is kept up to date.

How it works

Query Processing: Transform the user query into a retrieval-ready form, such as an embedding or a keyword query
Document Retrieval: Search knowledge base for relevant passages
Context Assembly: Combine retrieved documents with user query
Generation: LLM produces grounded response from augmented prompt
Verification: Optional citation or source attribution
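
The five steps above can be sketched end to end in a few functions. This is a minimal illustration under stated assumptions, not a reference implementation: the keyword-overlap scoring stands in for a real retriever, the prompt layout is arbitrary, and `generate` is a hypothetical placeholder for whatever LLM client is used (kept synchronous here for brevity; a real call would be asynchronous). All type and function names are invented for the example.

```typescript
// Toy RAG pipeline. Scoring, prompt layout, and the `generate` stub are
// illustrative assumptions, not part of any specific library.

interface Passage {
  id: string;
  text: string;
}

interface Retrieved extends Passage {
  score: number;
}

interface Answer {
  text: string;
  sources: string[];
}

// Placeholder for an LLM call, injected so the pipeline stays provider-agnostic.
type Generate = (prompt: string) => string;

// Step 1, query processing: lowercase and split into alphanumeric tokens.
function tokenize(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
}

// Step 2, document retrieval: score each passage by the number of distinct
// query terms it contains, drop non-matches, and keep the top k.
function retrieve(query: string, kb: Passage[], k: number): Retrieved[] {
  const all = tokenize(query);
  const queryTerms = all.filter((t, i) => all.indexOf(t) === i);
  return kb
    .map((p) => {
      const passageTerms = tokenize(p.text);
      const score = queryTerms.filter((t) => passageTerms.indexOf(t) !== -1).length;
      return { id: p.id, text: p.text, score };
    })
    .filter((p) => p.score > 0)
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Step 3, context assembly: number each passage so the model can cite it.
function buildPrompt(query: string, passages: Retrieved[]): string {
  const context = passages.map((p, i) => `[${i + 1}] (${p.id}) ${p.text}`).join("\n");
  return [
    "Answer the question using only the sources below.",
    "Cite sources by number, e.g. [1].",
    "",
    "Sources:",
    context,
    "",
    `Question: ${query}`,
  ].join("\n");
}

// Steps 4 and 5, generation and attribution: return the model output along
// with the ids of the passages that were placed in the prompt.
function answer(query: string, kb: Passage[], generate: Generate, k = 3): Answer {
  const passages = retrieve(query, kb, k);
  if (passages.length === 0) {
    // Nothing to ground on, so skip the model call entirely.
    return { text: "No relevant sources found.", sources: [] };
  }
  return { text: generate(buildPrompt(query, passages)), sources: passages.map((p) => p.id) };
}
```

Calling `answer(query, passages, generate)` returns the model output together with the ids of the passages it was shown. A query with no matching passages short-circuits before the model is called, which is one simple way to avoid ungrounded answers.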

RAG architectures

Naive RAG: Retrieve-then-read pipeline
Agentic RAG: Iterative retrieval with planning and tool use
Hybrid RAG: Combine dense and sparse retrieval
Graph RAG: Leverage knowledge graph relationships
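
Hybrid RAG needs some way to merge a dense result list with a sparse one. The text above does not say how, so the sketch below assumes reciprocal rank fusion (RRF), one common choice that works on ranks alone and so avoids comparing incompatible score scales. The function name and the default constant `k = 60` (a commonly used value) are assumptions for illustration.

```typescript
// Reciprocal rank fusion: each list contributes 1 / (k + rank) to a document's
// fused score, where rank is 1-based. Documents found by several retrievers
// accumulate contributions and rise to the top.
function reciprocalRankFusion(rankings: string[][], k = 60): string[] {
  const scores: Record<string, number> = {};
  rankings.forEach((ranking) => {
    ranking.forEach((docId, index) => {
      const rank = index + 1;
      scores[docId] = (scores[docId] ?? 0) + 1 / (k + rank);
    });
  });
  // Highest fused score first.
  return Object.keys(scores).sort((a, b) => (scores[b] ?? 0) - (scores[a] ?? 0));
}
```

Given a dense ranking `["a", "b", "c"]` and a sparse ranking `["c", "a", "d"]`, documents `a` and `c` appear in both lists and therefore outrank `b` and `d`, which each appear in only one.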

Component considerations

Retriever: BM25, dense embeddings, hybrid approaches
Index: FAISS, Pinecone, Weaviate, Elasticsearch
Chunking: Fixed-size, semantic, recursive strategies
Retrieval: Top-k, maximal marginal relevance (MMR), similarity thresholds
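
The three retrieval controls listed above can be combined in one selection step. The following is a minimal sketch, assuming cosine similarity over in-memory embeddings: a similarity threshold filters weak candidates, then MMR greedily picks up to k results, trading relevance to the query against redundancy with results already chosen. The names `selectWithMmr`, `lambda`, and `minSimilarity` are hypothetical; a production system would typically delegate this to the vector store.

```typescript
interface Candidate {
  id: string;
  embedding: number[];
}

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, x, i) => sum + x * (b[i] ?? 0), 0);
}

// Cosine similarity; returns 0 if either vector has zero length.
function cosine(a: number[], b: number[]): number {
  const denom = Math.sqrt(dot(a, a) * dot(b, b));
  return denom === 0 ? 0 : dot(a, b) / denom;
}

// lambda = 1 reduces to plain top-k by relevance; lower values favor diversity.
// minSimilarity = -1 keeps every candidate, since cosine is never below -1.
function selectWithMmr(
  query: number[],
  candidates: Candidate[],
  k: number,
  lambda = 0.5,
  minSimilarity = -1,
): Candidate[] {
  // Similarity threshold: drop candidates too far from the query.
  let remaining = candidates.filter((c) => cosine(query, c.embedding) >= minSimilarity);
  const selected: Candidate[] = [];

  while (selected.length < k && remaining.length > 0) {
    const scored = remaining.map((c) => {
      const relevance = cosine(query, c.embedding);
      // Redundancy is the highest similarity to anything already selected,
      // treated as 0 while nothing has been selected yet.
      const redundancy =
        selected.length === 0
          ? 0
          : Math.max(...selected.map((s) => cosine(c.embedding, s.embedding)));
      return { candidate: c, score: lambda * relevance - (1 - lambda) * redundancy };
    });
    // Pick the candidate with the best MMR score (first one wins ties).
    const best = scored.reduce((a, b) => (b.score > a.score ? b : a));
    selected.push(best.candidate);
    remaining = remaining.filter((c) => c !== best.candidate);
  }
  return selected;
}
```

With two near-duplicate candidates and one dissimilar but still relevant candidate, `lambda = 1` returns both near-duplicates, while `lambda = 0.5` replaces the second near-duplicate with the dissimilar one.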

Build this pattern

Copy this prompt and paste it into Claude Code, OpenCode, Codex, or Cursor to implement this pattern.

Build me a RAG (Retrieval-Augmented Generation) system.
Architecture: implement a pipeline with separate modules for ingestion, indexing, retrieval, and generation.
Ingestion: accept document uploads (PDF, DOCX, TXT, MD) and split them with recursive chunking (1000-character chunks, 200-character overlap), tracking document-level metadata for each chunk. Generate embeddings with a configurable embedding model and store them in PostgreSQL using the pgvector extension.
Retrieval: implement hybrid search combining semantic similarity and keyword matching.
Generation: pass retrieved chunks with their source metadata to the LLM and generate answers with inline citations.
Error handling: reject unsupported file formats with clear error messages. If embedding generation fails, queue the documents for retry. When a search returns no results, respond with "no relevant sources found" instead of hallucinating.
Edge cases: handle very large documents with progress tracking across chunks. Deduplicate overlapping chunks from multiple documents at retrieval time. Handle multilingual queries by detecting the query language and adjusting accordingly.
Best practices: use a chat interface that shows which sources contributed to each answer. Include an admin panel for document management (upload, delete, reindex). Log query latency broken down by retrieval versus generation.
Testing: unit test chunking with various input formats. Test retrieval precision and recall against known document sets. Integration test the full RAG pipeline from upload to answer.
Use TypeScript with Next.js.
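
The following note is not part of the prompt. It illustrates how the chunk size and overlap named in the prompt interact, using a plain fixed-size sliding window rather than the recursive strategy the prompt asks for, which would additionally try to split on paragraph and sentence boundaries. The function and type names are hypothetical.

```typescript
interface Chunk {
  index: number; // position of the chunk within the document
  start: number; // character offset of the chunk in the source text
  text: string;
}

// Fixed-size chunking with overlap: each chunk starts (chunkSize - overlap)
// characters after the previous one, so neighbors share `overlap` characters.
function chunkText(text: string, chunkSize = 1000, overlap = 200): Chunk[] {
  if (chunkSize <= 0 || overlap < 0 || overlap >= chunkSize) {
    throw new RangeError("require chunkSize > 0 and 0 <= overlap < chunkSize");
  }
  const step = chunkSize - overlap;
  const chunks: Chunk[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push({ index: chunks.length, start, text: text.slice(start, start + chunkSize) });
    // Stop once a chunk has reached the end of the text, so no trailing
    // chunk consisting only of already-covered overlap is emitted.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

With the defaults, a 2500-character document yields three chunks starting at offsets 0, 800, and 1600, and the last 200 characters of each chunk repeat as the first 200 of the next.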