textrawl
by Jeff Green

How Hybrid Search Works

Understanding textrawl's hybrid search architecture

textrawl combines semantic similarity with full-text keyword matching using Reciprocal Rank Fusion (RRF) to deliver highly relevant search results.

The Problem

Traditional search approaches have trade-offs:

Approach          Strength                       Weakness
Keyword search    Exact matches, names, codes    Misses synonyms, paraphrases
Semantic search   Meaning, concepts, questions   May miss exact terms

Users often need both: "Find documents about quarterly planning" (semantic) AND "Find PRJ-12345" (keyword).

textrawl runs both searches in parallel and combines results:

Query: "Q4 planning meeting notes"
           │
           ├───────────────────┬─────────────────────┐
           ▼                   ▼                     │
    ┌─────────────┐     ┌─────────────┐              │
    │  Full-Text  │     │  Semantic   │              │
    │   Search    │     │   Search    │              │
    │  (tsvector) │     │  (pgvector) │              │
    └──────┬──────┘     └──────┬──────┘              │
           │                   │                     │
           ▼                   ▼                     │
    Rank by             Rank by Cosine               │
    ts_rank_cd          Similarity                   │
           │                   │                     │
           └─────────┬─────────┘                     │
                     ▼                               │
              ┌─────────────┐                        │
              │    RRF      │◄───────────────────────┘
              │   Fusion    │      Weights
              └──────┬──────┘
                     ▼
              Combined Results

Reciprocal Rank Fusion (RRF)

RRF is a simple but effective algorithm for combining ranked lists.

The Formula

For each result appearing in either list:

RRF_score = Σ (weight / (k + rank))

Where:

  • weight = fullTextWeight or semanticWeight (user-configurable, 0-2)
  • k = 60 (standard RRF constant, smooths rank differences)
  • rank = position in that result list (1-based)
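
Written out with explicit indices, the formula reads as follows. The symbols L_fts, L_sem, and w_i are notation introduced here for illustration and are not taken from textrawl's source:

```latex
\[
\mathrm{RRF}(d) \;=\; \sum_{\substack{i \,\in\, \{\mathrm{fts},\, \mathrm{sem}\} \\ d \,\in\, L_i}} \frac{w_i}{k + \mathrm{rank}_i(d)}, \qquad k = 60
\]
```

Here L_i is the ranked list returned by search i, w_i is its weight, and rank_i(d) is the 1-based position of document d in L_i. A document missing from a list contributes no term for that list, so a result found by both searches accumulates two terms.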

Example

Query: "quarterly budget report"

Full-text results:

  1. "Q4 Budget Report 2024" (rank 1)
  2. "Quarterly Financial Summary" (rank 2)
  3. "Budget Planning Guide" (rank 3)

Semantic results:

  1. "Financial Overview Q4" (rank 1)
  2. "Q4 Budget Report 2024" (rank 2)
  3. "Expense Tracking Document" (rank 3)

RRF calculation (with both weights = 1.0):

Document                      FTS Rank   Semantic Rank   RRF Score
Q4 Budget Report 2024         1          2               1/(60+1) + 1/(60+2) = 0.0325
Financial Overview Q4         -          1               1/(60+1) = 0.0164
Quarterly Financial Summary   2          -               1/(60+2) = 0.0161
Budget Planning Guide         3          -               1/(60+3) = 0.0159
Expense Tracking Document     -          3               1/(60+3) = 0.0159

Final ranking:

  1. Q4 Budget Report 2024 (appears in both → highest score)
  2. Financial Overview Q4
  3. Quarterly Financial Summary
  4. Budget Planning Guide
  5. Expense Tracking Document (tied with Budget Planning Guide; their relative order is arbitrary)
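
The example can be reproduced with a few lines of Python. This is a minimal sketch of the formula, not textrawl's actual implementation; the function name `rrf_scores` and the list keys `"fts"` and `"semantic"` are chosen here for illustration.

```python
from collections import defaultdict

K = 60  # RRF constant from the formula


def rrf_scores(ranked_lists, weights, k=K):
    """Sum weight / (k + rank) over every list a document appears in (1-based ranks)."""
    scores = defaultdict(float)
    for name, docs in ranked_lists.items():
        for rank, doc in enumerate(docs, start=1):
            scores[doc] += weights[name] / (k + rank)
    return dict(scores)


ranked_lists = {
    "fts": ["Q4 Budget Report 2024", "Quarterly Financial Summary", "Budget Planning Guide"],
    "semantic": ["Financial Overview Q4", "Q4 Budget Report 2024", "Expense Tracking Document"],
}
scores = rrf_scores(ranked_lists, {"fts": 1.0, "semantic": 1.0})

# Sort by descending score; tied documents keep first-seen order because sorted() is stable.
ranking = sorted(scores, key=scores.get, reverse=True)
for doc in ranking:
    print(f"{scores[doc]:.4f}  {doc}")
```

The last two documents receive identical scores, so their order here is decided only by the stable sort. A production system would typically want an explicit tie-break.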

Weight Tuning

Default (Balanced)

{ "fullTextWeight": 1.0, "semanticWeight": 1.0 }

Both strategies contribute equally.

Keyword-Heavy

For specific terms, codes, names:

{ "fullTextWeight": 1.5, "semanticWeight": 0.5 }

Semantic-Heavy

For natural language questions:

{ "fullTextWeight": 0.5, "semanticWeight": 1.5 }

Pure Keyword

When you know the exact term:

{ "fullTextWeight": 2.0, "semanticWeight": 0 }

Pure Semantic

For conceptual exploration:

{ "fullTextWeight": 0, "semanticWeight": 2.0 }
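
To see how these presets shift the ordering, the sketch below applies each one to the same pair of result lists. The `fuse` helper and the document IDs A to E are hypothetical and exist only for this illustration; the weight values are the presets listed in this section.

```python
def fuse(fts_docs, semantic_docs, full_text_weight, semantic_weight, k=60):
    """Weighted RRF over two ranked lists; returns documents, best first."""
    scores = {}
    for weight, docs in ((full_text_weight, fts_docs), (semantic_weight, semantic_docs)):
        for rank, doc in enumerate(docs, start=1):
            scores[doc] = scores.get(doc, 0.0) + weight / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical result lists (document IDs), used only for illustration.
fts_docs = ["A", "B", "C"]
semantic_docs = ["D", "A", "E"]

presets = {
    "balanced": (1.0, 1.0),
    "keyword-heavy": (1.5, 0.5),
    "semantic-heavy": (0.5, 1.5),
    "pure keyword": (2.0, 0.0),
    "pure semantic": (0.0, 2.0),
}
results = {name: fuse(fts_docs, semantic_docs, ft, sem) for name, (ft, sem) in presets.items()}
for name, order in results.items():
    print(f"{name:15s} {order}")
```

Document A, which appears in both lists, stays on top in every preset except pure semantic, where the top semantic hit D overtakes it. Note that in this sketch a zero weight does not remove the other list's documents: they score 0 and sort last. Only the ratio between the two weights affects the ordering, so with one weight at 0 the magnitude of the other makes no difference.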

Database Implementation

hybrid_search() RPC

PostgreSQL function that runs both searches:

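CREATE FUNCTION hybrid_search(
  query_text TEXT,
  query_embedding VECTOR(1536),
  match_limit INT,
  fts_weight FLOAT,
  semantic_weight FLOAT
) RETURNS TABLE (...) AS $$
  -- Full-text search, scored with ts_rank_cd (cover density ranking, not BM25)
  WITH fts_results AS (
    SELECT id,
      ts_rank_cd(search_vector, plainto_tsquery(query_text)) AS rank,
      -- 1-based position in the full-text list, used by the RRF step
      ROW_NUMBER() OVER (
        ORDER BY ts_rank_cd(search_vector, plainto_tsquery(query_text)) DESC
      ) AS fts_rank
    FROM chunks
    WHERE search_vector @@ plainto_tsquery(query_text)
    ORDER BY rank DESC
    LIMIT match_limit * 2
  ),
  -- Semantic search: <=> is pgvector's cosine distance operator
  semantic_results AS (
    SELECT id,
      1 - (embedding <=> query_embedding) AS similarity,
      -- 1-based position in the semantic list, used by the RRF step
      ROW_NUMBER() OVER (ORDER BY embedding <=> query_embedding) AS semantic_rank
    FROM chunks
    ORDER BY embedding <=> query_embedding
    LIMIT match_limit * 2
  ),
  -- RRF combination: a row missing from one list has a NULL rank there and contributes 0
  combined AS (
    SELECT id,
      COALESCE(fts_weight / (60 + fts_rank), 0) +
      COALESCE(semantic_weight / (60 + semantic_rank), 0) AS score
    FROM fts_results
    FULL OUTER JOIN semantic_results USING (id)
  )
  SELECT * FROM combined ORDER BY score DESC LIMIT match_limit;
$$ LANGUAGE sql;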
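
The semantic branch orders rows by pgvector's `<=>` operator, which returns cosine distance, that is, 1 minus cosine similarity. The pure-Python sketch below illustrates that relationship on toy two-dimensional vectors; the vectors and helper names are made up for the example, and a real query would compare 1536-dimensional embeddings inside the database.

```python
import math


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """The quantity pgvector's <=> operator returns: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy vectors, for illustration only.
query = [1.0, 0.0]
chunks = {"same direction": [2.0, 0.0], "45 degrees": [1.0, 1.0], "orthogonal": [0.0, 3.0]}

# Ascending distance gives the same order as descending similarity.
by_distance = sorted(chunks, key=lambda name: cosine_distance(query, chunks[name]))
for name in by_distance:
    print(f"{cosine_distance(query, chunks[name]):.3f}  {name}")
```

Because cosine distance ignores vector length, the chunk pointing in the same direction as the query ranks first even though its magnitude differs.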

Indexes

Index               Table    Type   Purpose
search_vector_idx   chunks   GIN    Full-text search
embedding_idx       chunks   HNSW   Vector similarity

HNSW (Hierarchical Navigable Small World) provides fast approximate nearest neighbor search.

Performance Characteristics

Query Time

Knowledge Base Size       Typical Latency
< 10,000 chunks           50-100 ms
10,000-100,000 chunks     100-300 ms
100,000+ chunks           300-500 ms

Scaling Considerations

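  • HNSW index: roughly O(log n) query time for approximate nearest-neighbor search, where n is the number of chunks
  • Full-text index: roughly O(log n) GIN lookup per query term, plus ranking work that grows with the number of matching chunks
  • RRF fusion: O(m) scoring followed by a sort, where m is the number of candidates, at most match_limit * 4 (two lists of match_limit * 2 each)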

Why This Approach?

  1. Best of both worlds: Captures exact matches AND semantic meaning
  2. Simple fusion: RRF is easy to understand and tune
  3. No training required: Works out of the box, with no ranking model to train
  4. Configurable: Users can adjust weights for their use case
  5. Efficient: Single database query, parallel execution
