
RAG (Retrieval-Augmented Generation)

rag retrieval ai vector-search embeddings
Plain English

An AI model can only answer from what it was trained on. RAG fixes this by giving the model a way to look things up. When you ask a question, the system first searches your documents, knowledge base, or database for relevant information, then passes that information to the AI along with your question. The AI generates its answer based on your actual data instead of guessing. Think of it as giving the AI an open-book exam instead of asking it to answer from memory.

Technical Definition

Retrieval-Augmented Generation (RAG) is a pattern that combines information retrieval with language model generation to produce responses grounded in specific source documents. Introduced by Lewis et al. (2020), RAG addresses two key LLM limitations: hallucination and knowledge cutoff.

RAG pipeline:

  1. Ingestion: documents are split into chunks (paragraphs, sections), converted to vector embeddings (dense numerical representations capturing semantic meaning), and stored in a vector database.
  2. Retrieval: when a user query arrives, it is embedded using the same model, and the vector database returns the top-k most semantically similar chunks via approximate nearest neighbor (ANN) search.
  3. Augmentation: retrieved chunks are injected into the prompt as context alongside the user query and system instructions.
  4. Generation: the LLM generates a response grounded in the retrieved context, ideally with citations linking claims to source documents.
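
At its core, the retrieval step (2) is nearest-neighbor search over embedding vectors. A minimal sketch with toy 3-dimensional vectors (real embedding models output hundreds to thousands of dimensions, and vector databases use ANN indexes rather than a brute-force scan):

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by their norms
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy "embeddings" for three stored chunks
chunk_vectors = {
    "doc1": np.array([0.9, 0.1, 0.0]),
    "doc2": np.array([0.1, 0.8, 0.1]),
    "doc3": np.array([0.0, 0.2, 0.9]),
}

# The query is embedded with the same model as the chunks
query_vector = np.array([0.85, 0.15, 0.05])

# Rank chunks by similarity to the query and keep the top-k
k = 2
ranked = sorted(
    chunk_vectors,
    key=lambda d: cosine_similarity(query_vector, chunk_vectors[d]),
    reverse=True,
)
print(ranked[:k])  # ['doc1', 'doc2']
```

The top-k chunk texts would then be injected into the prompt in step 3.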

Key components:

  • Embedding model: converts text to dense vectors (OpenAI text-embedding-3, Cohere embed-v3, BGE, E5). Typical dimensions range from 256 to 3072.
  • Vector database: stores and searches embeddings (Pinecone, Weaviate, Qdrant, pgvector, ChromaDB).
  • Chunking strategy: fixed-size (e.g. 512 tokens), semantic (by paragraph/section), or recursive character splitting. Chunk size trades retrieval precision (smaller chunks) against context completeness (larger chunks).
  • Reranking: optional second-stage scoring (Cohere Rerank, cross-encoder models) to improve relevance of retrieved chunks before injection.
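
Fixed-size chunking with overlap can be sketched in a few lines. This toy version splits on words for simplicity; production pipelines count real tokenizer tokens, and the chunk_size/overlap values here are illustrative:

```python
def chunk_text(text, chunk_size=50, overlap=10):
    """Fixed-size chunking over words, with overlap so content that
    spans a boundary appears whole in at least one chunk."""
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

doc = ("word " * 120).strip()  # a 120-word document
chunks = chunk_text(doc, chunk_size=50, overlap=10)
print(len(chunks))  # 3 chunks: words 0-49, 40-89, 80-119
```

Each resulting chunk would then be embedded and stored; overlap slightly inflates index size in exchange for not splitting facts across chunk boundaries.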

Advanced patterns:

  • Hybrid search: combine vector similarity with keyword search (BM25) for better recall
  • Multi-query RAG: generate multiple reformulations of the user query to retrieve more diverse context
  • Self-RAG: model decides when to retrieve and evaluates the relevance of retrieved content
  • Graph RAG: combine vector retrieval with knowledge graph traversal for multi-hop reasoning
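
For hybrid search, the ranked lists from the vector and keyword retrievers are commonly merged with reciprocal rank fusion (RRF). A minimal sketch (the k=60 constant is the conventional default from the RRF literature; the doc IDs are made up):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge several ranked lists of doc IDs: each doc's score is the
    sum of 1/(k + rank) over every list it appears in."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_hits = ["doc3", "doc1", "doc7"]  # semantic similarity order
bm25_hits = ["doc1", "doc9", "doc3"]    # keyword (BM25) order
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused[0])  # doc1: ranked highly by both retrievers
```

RRF needs only ranks, not scores, so it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.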

[Diagram: a user query is embedded and matched against a vector store (your docs, wikis, code) via similarity search; the top-k results are combined with the system prompt and user query so the LLM returns a grounded answer with citations.]

RAG = LLM answers grounded in your actual data, not just training data. Reduces hallucination by providing verifiable source material.

Basic RAG implementation

from anthropic import Anthropic
import chromadb

# 1. Set up vector store with your documents
# (ChromaDB embeds documents with its default embedding model on add)
client_db = chromadb.Client()
collection = client_db.create_collection("docs")
collection.add(
    documents=[
        "VLAN 10 is assigned to the engineering team on ports 1-8.",
        "The firewall blocks all inbound traffic except ports 80 and 443.",
        "DNS resolvers are set to 8.8.8.8 and 1.1.1.1.",
    ],
    ids=["doc1", "doc2", "doc3"],
)

# 2. Retrieve relevant context for a query
query = "What ports are open on the firewall?"
results = collection.query(query_texts=[query], n_results=2)
context = "\n".join(results["documents"][0])

# 3. Generate grounded response
anthropic = Anthropic()
response = anthropic.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system="Answer based ONLY on the provided context. Cite sources.",
    messages=[
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}
    ],
)
print(response.content[0].text)

In the Wild

RAG is the most practical way to make AI useful for organization-specific knowledge. Instead of fine-tuning a model (expensive, slow, requires ML expertise), you build a retrieval pipeline over your existing documents. Common production RAG systems index internal wikis, Confluence pages, Slack history, support tickets, and codebases. The pattern powers enterprise AI assistants (Glean, Guru), customer support bots that answer from your help docs, and developer tools that search codebases. The primary engineering challenge is retrieval quality: if the retriever returns irrelevant chunks, the LLM produces poor answers regardless of its capability. Evaluation frameworks (RAGAS, LangSmith) measure retrieval precision, answer faithfulness, and hallucination rates. MCP (Model Context Protocol) servers can act as RAG endpoints, letting AI assistants retrieve context from external sources in a standardized way.