Building RAG Pipelines: Vector Databases, Chunking Strategies, and Retrieval Optimization

Why RAG Outperforms Fine-Tuning for Knowledge-Intensive Tasks

Retrieval-Augmented Generation (RAG) has become the dominant pattern for grounding large language models in specific, up-to-date knowledge. Rather than baking facts into model weights via fine-tuning — an expensive, static process — RAG retrieves relevant documents at inference time and provides them as context for the model’s response. The result is a system that can answer questions about your private knowledge base, stay current without retraining, and cite its sources.

The challenge is that a naive RAG implementation — chunk documents, embed them, retrieve the top-k chunks, stuff into a prompt — produces mediocre results. This guide covers the engineering decisions that separate production-quality RAG pipelines from prototypes: embedding model selection, chunking strategy, hybrid search, and re-ranking.

Vector Databases: Choosing the Right Foundation

A vector database stores high-dimensional embedding vectors and supports approximate nearest neighbor (ANN) search — finding the vectors most similar to a query vector by cosine similarity or dot product.

For production RAG, evaluate vector databases on four axes:

  • Query latency at scale: Can it return top-k results in under 50ms at your anticipated query volume and corpus size? (A quick measurement harness follows this list.)
  • Filtering support: Can you filter by metadata (document type, date, access control label) alongside the vector search?
  • Hybrid search: Does it support combined keyword (BM25) + vector search natively, or do you need to implement that externally?
  • Operational complexity: What does deployment, backup, and scaling look like?
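
For the latency axis in particular, measure rather than trust published benchmarks: time a sample of real queries against each candidate. A minimal sketch, where search_fn stands in for whichever client call you are evaluating:

# Measure p95 query latency for a candidate vector database
import statistics
import time

def p95_latency_ms(search_fn, queries: list[str], top_k: int = 10) -> float:
    """Run each query once and return the 95th-percentile latency in ms."""
    timings = []
    for q in queries:
        start = time.perf_counter()
        search_fn(q, top_k)
        timings.append((time.perf_counter() - start) * 1000)
    return statistics.quantiles(timings, n=20)[18]  # 19 cut points; index 18 = p95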

Common choices and their trade-offs:

  • pgvector (PostgreSQL extension): Easiest to operate if you already run PostgreSQL. HNSW index added in pgvector 0.5 provides competitive ANN performance. Best for corpora under ~5M vectors where SQL filtering is valuable.
  • Qdrant: Purpose-built vector database written in Rust. Excellent performance, rich filtering, built-in sparse vector support for hybrid search. Good choice for medium to large corpora.
  • Weaviate: Natively supports hybrid BM25 + vector search. GraphQL API. Higher operational overhead but strong built-in hybrid search story.
  • Milvus: Highly scalable, designed for billion-vector corpora. Kubernetes-native. Overkill for most deployments, but the right choice at extreme scale.

Embedding Models: The Foundation of Retrieval Quality

The quality of your retrieval is determined primarily by your embedding model. An embedding model maps text to a dense vector where semantically similar texts land near each other in vector space.

Selecting an Embedding Model

Key evaluation criteria:

  • MTEB benchmark performance: The Massive Text Embedding Benchmark provides standardized scores across retrieval, clustering, and classification tasks. Sort by retrieval scores for RAG use cases.
  • Context window: Models with 512-token limits will truncate long documents at indexing time, losing information. Models like nomic-embed-text and bge-m3 support 8K+ token contexts.
  • Embedding dimensions: Higher-dimensional embeddings (1536d, 3072d) generally carry more information but cost more to store and search. Many modern models offer Matryoshka Representation Learning (MRL) — embeddings that remain useful when truncated to smaller dimensions.
  • Inference cost and latency: For high-volume production use, self-hosted open models (via Ollama, vLLM, or dedicated embedding API) are often more cost-effective than API providers at scale.

# Example: batch embedding with sentence-transformers
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("BAAI/bge-large-en-v1.5")

def embed_chunks(chunks: list[str], batch_size: int = 64) -> np.ndarray:
    """Embed a list of text chunks with batching for throughput."""
    embeddings = model.encode(
        chunks,
        batch_size=batch_size,
        normalize_embeddings=True,   # normalize for cosine similarity
        show_progress_bar=True
    )
    return embeddings
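
On the dimensions point above: if your model was trained with Matryoshka Representation Learning, shrinking embeddings is mechanical: slice and renormalize. A sketch (illustrative only; check the model card first, since non-MRL models such as bge-large-en-v1.5 degrade badly when truncated):

# Truncate MRL embeddings to a smaller dimension, then renormalize
def truncate_embeddings(embeddings: np.ndarray, dim: int = 256) -> np.ndarray:
    """Keep the first `dim` dimensions and restore unit length."""
    truncated = embeddings[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / norms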

Always normalize embeddings to unit length before storing if your vector database computes similarity via dot product — normalized vectors make dot product equivalent to cosine similarity, which is what most embedding models are trained to optimize.
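
A quick numeric check of that equivalence:

# Dot product of unit-normalized vectors equals cosine similarity of the originals
a, b = np.random.rand(1024), np.random.rand(1024)
cosine = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert np.isclose(np.dot(a_unit, b_unit), cosine)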

Chunking Strategies: Where Most RAG Pipelines Fail

Chunking — splitting source documents into segments for embedding and storage — is the most underappreciated decision in RAG pipeline design. The wrong chunking strategy is the most common cause of poor retrieval quality, even with an excellent embedding model.

Fixed-Size Chunking

The simplest approach: split every document into chunks of N tokens with an overlap of M tokens.

# Fixed-size chunking with overlap
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # size chunks in tokens rather than characters
    chunk_size=512,               # tokens (roughly 350-400 words)
    chunk_overlap=64,             # overlap to avoid cutting mid-sentence context
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(document_text)

Using RecursiveCharacterTextSplitter rather than a naive character split respects natural boundaries (paragraphs, sentences) before falling back to arbitrary splits. The separators list is tried in order — split on double newlines first, then single newlines, then sentence endings. Note that the from_tiktoken_encoder constructor measures chunk_size in tokens; the plain constructor counts characters, a common source of chunks far larger than intended.

Semantic Chunking

A more sophisticated approach embeds sentences and splits when the semantic similarity between adjacent sentences drops below a threshold, indicating a topic boundary. This produces chunks that are semantically coherent rather than arbitrarily sized.

# Semantic chunking (simplified)
def semantic_chunk(sentences: list[str], threshold: float = 0.75) -> list[str]:
    if not sentences:
        return []
    embeddings = embed_chunks(sentences)  # unit-normalized, so dot = cosine
    chunks, current_chunk = [], [sentences[0]]

    for i in range(1, len(sentences)):
        similarity = np.dot(embeddings[i-1], embeddings[i])
        if similarity < threshold:
            # Semantic shift detected — start new chunk
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentences[i]]
        else:
            current_chunk.append(sentences[i])

    if current_chunk:
        chunks.append(" ".join(current_chunk))
    return chunks
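
The function above assumes you already have sentences. A naive splitter is enough to experiment with; swap in a proper sentence tokenizer (nltk, spaCy) for production text:

# Naive sentence splitting to feed semantic_chunk
import re

def split_sentences(text: str) -> list[str]:
    """Split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

chunks = semantic_chunk(split_sentences(document_text))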

Hierarchical (Parent-Child) Chunking

Store documents at two granularities: small child chunks for high-precision retrieval, and larger parent chunks for rich context delivery to the LLM. Retrieve by child chunk similarity, then return the parent chunk to the model.

# Parent-child chunk indexing
def index_with_parent_child(document: str):
    # Small chunks for retrieval (128 tokens)
    child_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base", chunk_size=128, chunk_overlap=16
    )
    # Large chunks for context (512 tokens)
    parent_splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
        encoding_name="cl100k_base", chunk_size=512, chunk_overlap=64
    )

    parent_chunks = parent_splitter.split_text(document)
    for parent_id, parent in enumerate(parent_chunks):
        child_chunks = child_splitter.split_text(parent)
        # Batch-embed all children of one parent for throughput
        child_embeddings = embed_chunks(child_chunks)
        for child, child_embedding in zip(child_chunks, child_embeddings):
            vector_db.upsert(  # vector_db: stand-in for your vector store client
                vector=child_embedding,
                payload={"text": child, "parent_id": parent_id,
                         "parent_text": parent}
            )
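
The query side is then a little bookkeeping: search against the child vectors, deduplicate parents, and hand the parent texts to the LLM. A sketch against the same stand-in vector_db client:

# Query time: retrieve by child similarity, return deduplicated parent chunks
def retrieve_parents(query: str, top_k: int = 4) -> list[str]:
    query_vector = embed_chunks([query])[0]
    hits = vector_db.search(vector=query_vector, limit=top_k * 4)  # over-fetch children
    parents, seen = [], set()
    for hit in hits:
        parent_id = hit.payload["parent_id"]
        if parent_id not in seen:
            seen.add(parent_id)
            parents.append(hit.payload["parent_text"])
        if len(parents) == top_k:
            break
    return parents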

Hybrid Search: Combining Semantic and Keyword Retrieval

Pure vector search excels at semantic similarity but can miss documents that share exact keywords — product names, version numbers, error codes, proper nouns — that embedding models may not represent precisely. Hybrid search combines dense vector retrieval with sparse keyword search (BM25) to capture both dimensions.

# Hybrid search with Qdrant (sparse + dense)
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # point at your deployment

def hybrid_search(query: str, top_k: int = 10):
    """
    Run dense and sparse (BM25) retrieval in one request, then merge the
    two ranked lists with Reciprocal Rank Fusion. RRF operates on ranks,
    so no score weighting or normalization is needed.
    """
    dense_vector = embed_chunks([query])[0]
    sparse_indices, sparse_values = bm25_encode(query)  # BM25 sparse encoding (sketch below)

    results = client.query_points(
        collection_name="documents",
        prefetch=[
            # Dense retrieval
            models.Prefetch(
                query=dense_vector.tolist(),
                using="dense",
                limit=top_k * 2,
            ),
            # Sparse BM25 retrieval
            models.Prefetch(
                query=models.SparseVector(
                    indices=sparse_indices,
                    values=sparse_values
                ),
                using="sparse",
                limit=top_k * 2,
            ),
        ],
        # Reciprocal Rank Fusion to merge results
        query=models.FusionQuery(fusion=models.Fusion.RRF),
        limit=top_k,
    )
    return results
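
The bm25_encode helper above is not spelled out; one way to implement it is fastembed's BM25 sparse model (an assumption about your stack, though it pairs naturally with Qdrant):

# One possible bm25_encode: fastembed's BM25 sparse embedding model
from fastembed import SparseTextEmbedding

bm25_model = SparseTextEmbedding(model_name="Qdrant/bm25")

def bm25_encode(text: str) -> tuple[list[int], list[float]]:
    """Return the (indices, values) sparse representation of `text`."""
    embedding = next(iter(bm25_model.query_embed(text)))
    return embedding.indices.tolist(), embedding.values.tolist()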

Reciprocal Rank Fusion (RRF) is the standard algorithm for merging ranked lists from multiple retrievers. It rewards documents that rank highly in multiple lists without requiring score normalization across the two retrieval methods.
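
For intuition, RRF gives each document a score of Σ 1/(k + rank) summed over the lists it appears in, with k conventionally set to 60. A standalone sketch over lists of document IDs:

# Reciprocal Rank Fusion over ranked lists of document IDs
def rrf_merge(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)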

Re-Ranking: The Final Quality Gate

Embedding similarity is a proxy for relevance, not a guarantee of it: vector search returns the nearest neighbors in embedding space, which are not necessarily the best evidence for the query. A cross-encoder re-ranker applies a more expensive but more accurate relevance model to the retrieved candidates, reordering them before passing to the LLM.

# Re-ranking with a cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    """Re-rank retrieved chunks using a cross-encoder."""
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(scores, candidates),
        key=lambda x: x[0],
        reverse=True
    )
    return [doc for _, doc in ranked[:top_n]]

The cross-encoder scores each query-document pair jointly, capturing interaction signals that bi-encoder embeddings miss. The performance cost is justified because re-ranking runs only on the small candidate set (typically 20-50 chunks), not the full corpus.
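
In practice the two stages compose directly: retrieve a generous candidate set, then re-rank down to the context set (the query string here is just an example):

# End-to-end: hybrid retrieval feeding the cross-encoder re-ranker
query = "how do I rotate service API keys?"
points = hybrid_search(query, top_k=30).points
candidates = [{"text": p.payload["text"]} for p in points]
context_chunks = rerank(query, candidates, top_n=5)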

Metadata Filtering and Access Control

Production RAG pipelines must often restrict retrieval by document metadata — user permissions, document classification, date ranges, source system. Build metadata filtering into your indexing schema from the start.

# Qdrant payload schema for access-controlled RAG
{
    "text": "chunk content...",
    "doc_id": "doc-2024-03-15-001",
    "source": "confluence",
    "classification": "internal",
    "allowed_groups": ["engineering", "product"],
    "created_at": "2024-03-15T10:30:00Z",
    "section": "architecture"
}

# At query time, filter by the user's groups
from qdrant_client.models import FieldCondition, Filter, MatchAny

search_filter = Filter(
    must=[
        FieldCondition(
            key="allowed_groups",
            match=MatchAny(any=current_user.groups)  # current_user: your auth context
        )
    ]
)
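
The filter plugs directly into retrieval: Qdrant's query_points accepts a query_filter argument, so unauthorized chunks are never returned. Shown here with a plain dense query for brevity:

# Apply the access-control filter inside the vector query
results = client.query_points(
    collection_name="documents",
    query=embed_chunks(["deployment architecture"])[0].tolist(),
    using="dense",
    query_filter=search_filter,
    limit=10,
)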

Evaluating RAG Pipeline Quality

Measure your pipeline against three dimensions using a golden question-answer dataset drawn from your knowledge base:

  • Retrieval recall: For each test question, does the correct source chunk appear in the top-k retrieved results?
  • Answer faithfulness: Is the LLM’s answer grounded in the retrieved context, or is it hallucinating facts not present in the retrieved chunks?
  • Answer relevance: Does the answer actually address what the question asked?

Frameworks like RAGAS automate this evaluation against a test set, giving you quantitative metrics to compare chunking strategies, embedding models, and re-ranking configurations before committing to production.
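
Even without a framework, retrieval recall is straightforward to measure yourself. A sketch against the hybrid_search function above, assuming each golden item records the doc_id of its gold chunk:

# Retrieval recall@k over a golden question set
def recall_at_k(golden: list[dict], k: int = 10) -> float:
    """golden items look like {"question": ..., "gold_doc_id": ...}."""
    hits = 0
    for item in golden:
        points = hybrid_search(item["question"], top_k=k).points
        retrieved_ids = {p.payload["doc_id"] for p in points}
        if item["gold_doc_id"] in retrieved_ids:
            hits += 1
    return hits / len(golden)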

Conclusion

A production-quality RAG pipeline is a system, not a script. The difference between a demo and a reliable production system lies in deliberate choices at every layer: an embedding model matched to your domain and context requirements, a chunking strategy that preserves semantic coherence, hybrid search to catch keyword-critical queries, and cross-encoder re-ranking to surface the most relevant candidates for the model. Build each layer with evaluation in mind, and treat retrieval quality as a first-class engineering concern alongside latency and cost.
