How Hybrid Search Works
Understanding textrawl's hybrid search architecture
textrawl combines semantic similarity search with full-text keyword matching and merges the two result lists using Reciprocal Rank Fusion (RRF), so a single query can match on both meaning and exact terms.
The Problem
Traditional search approaches have trade-offs:
| Approach | Strength | Weakness |
|---|---|---|
| Keyword search | Exact matches, names, codes | Misses synonyms, paraphrases |
| Semantic search | Meaning, concepts, questions | May miss exact terms |
Users often need both: "Find documents about quarterly planning" (semantic) AND "Find PRJ-12345" (keyword).
The Solution: Hybrid Search
textrawl runs both searches in parallel, then merges the two ranked lists into a single result set.
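The retrieval step can be sketched as follows. This is a minimal illustration, not textrawl's actual code: `full_text_search` and `semantic_search` are hypothetical stand-ins that return canned ranked lists, and the thread pool only stands in for whatever parallelism the real system uses.

```python
from concurrent.futures import ThreadPoolExecutor


def full_text_search(query: str) -> list[str]:
    # Hypothetical stand-in for the keyword search; document IDs, best first.
    return ["doc-a", "doc-b", "doc-c"]


def semantic_search(query: str) -> list[str]:
    # Hypothetical stand-in for the vector search; document IDs, best first.
    return ["doc-d", "doc-a", "doc-e"]


def retrieve_both(query: str) -> tuple[list[str], list[str]]:
    """Run both searches concurrently and return their ranked lists."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        fts_future = pool.submit(full_text_search, query)
        sem_future = pool.submit(semantic_search, query)
        # result() blocks until each search has finished.
        return fts_future.result(), sem_future.result()


fts_ranked, sem_ranked = retrieve_both("quarterly budget report")
# Documents returned by both searches are the ones fusion will boost.
in_both = [doc for doc in fts_ranked if doc in sem_ranked]
```

Each search keeps its own ranking at this stage; the two lists are only combined in the fusion step.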
Reciprocal Rank Fusion (RRF)
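RRF is a simple but effective algorithm for combining ranked lists. It uses only each result's rank position, so scores from different search methods never need to be normalized against each other.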
The Formula
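For each result appearing in either list, sum one term per list that contains it:
$$\text{score} = \sum_{\text{lists}} \frac{\text{weight}}{k + \text{rank}}$$
Where:
- weight = fullTextWeight or semanticWeight (user-configurable, 0-2)
- k = 60 (standard RRF constant, smooths rank differences)
- rank = position in that result list (1-based)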
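A minimal Python sketch of weighted RRF, using the weight, k, and rank defined above. The function name `rrf_fuse` and its input shape are assumptions made for illustration; textrawl performs this step inside the database.

```python
from collections import defaultdict

RRF_K = 60  # standard RRF constant (k)


def rrf_fuse(ranked_lists, k=RRF_K):
    """Fuse ranked lists with weighted Reciprocal Rank Fusion.

    ranked_lists: iterable of (weight, [doc_id, ...]) pairs, best match first.
    Returns [(doc_id, score), ...] sorted by descending score.
    """
    scores = defaultdict(float)
    for weight, docs in ranked_lists:
        for rank, doc_id in enumerate(docs, start=1):  # ranks are 1-based
            scores[doc_id] += weight / (k + rank)
    # sorted() is stable, so tied documents keep first-seen order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


fused = rrf_fuse([
    (1.0, ["a", "b"]),  # full-text results, fullTextWeight = 1.0
    (1.0, ["b", "c"]),  # semantic results, semanticWeight = 1.0
])
```

Here `b` ranks first because it collects a term from both lists, while `a` and `c` each collect only one.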
Example
Query: "quarterly budget report"
Full-text results:
- "Q4 Budget Report 2024" (rank 1)
- "Quarterly Financial Summary" (rank 2)
- "Budget Planning Guide" (rank 3)
Semantic results:
- "Financial Overview Q4" (rank 1)
- "Q4 Budget Report 2024" (rank 2)
- "Expense Tracking Document" (rank 3)
RRF calculation (with both weights set to 1.0 and k=60):
| Document | FTS Rank | Semantic Rank | RRF Score |
|---|---|---|---|
| Q4 Budget Report 2024 | 1 | 2 | 1/(60+1) + 1/(60+2) = 0.0325 |
| Financial Overview Q4 | - | 1 | 1/(60+1) = 0.0164 |
| Quarterly Financial Summary | 2 | - | 1/(60+2) = 0.0161 |
| Budget Planning Guide | 3 | - | 1/(60+3) = 0.0159 |
| Expense Tracking Document | - | 3 | 1/(60+3) = 0.0159 |
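As a check on the arithmetic, the score for the one document returned by both searches (full-text rank 1, semantic rank 2, both weights 1.0) works out as follows, rounded to four decimal places:

```latex
\[
\frac{1.0}{60 + 1} + \frac{1.0}{60 + 2}
  = \frac{1}{61} + \frac{1}{62}
  \approx 0.01639 + 0.01613
  = 0.03252 \approx 0.0325
\]
```

With equal weights, a document found by only one search can score at most 1/61 ≈ 0.0164, so appearing in both lists roughly doubles the score.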
Final ranking:
- Q4 Budget Report 2024 (appears in both → highest score)
- Financial Overview Q4
- Quarterly Financial Summary
- Budget Planning Guide (tied with the next entry at 0.0159; RRF alone does not determine the order between them)
- Expense Tracking Document
Weight Tuning
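Default (Balanced)
Both strategies contribute equally: fullTextWeight and semanticWeight are set to the same value.
Keyword-Heavy
For specific terms, codes, and names, raise fullTextWeight relative to semanticWeight.
Semantic-Heavy
For natural language questions, raise semanticWeight relative to fullTextWeight.
Pure Keyword
When you know the exact term, set semanticWeight to 0 so that only full-text matches contribute to the score.
Pure Semantic
For conceptual exploration, set fullTextWeight to 0 so that only semantic matches contribute to the score.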
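To see how the weights shift a ranking, the sketch below scores the two result lists from the example under three presets. The preset names and the 1.5/0.5 split are illustrative assumptions, not values taken from textrawl; only the 0-2 range and the names fullTextWeight and semanticWeight come from this page. Skipping a zero-weight list entirely, so its documents do not appear with a score of 0, is also an assumption about the desired behavior.

```python
K = 60  # RRF constant

FTS = ["Q4 Budget Report 2024", "Quarterly Financial Summary", "Budget Planning Guide"]
SEMANTIC = ["Financial Overview Q4", "Q4 Budget Report 2024", "Expense Tracking Document"]

# Hypothetical presets: (fullTextWeight, semanticWeight), each within 0-2.
PRESETS = {
    "balanced": (1.0, 1.0),
    "keyword_heavy": (1.5, 0.5),
    "pure_keyword": (1.0, 0.0),
}


def rank_titles(full_text_weight, semantic_weight):
    """Return titles ordered by weighted RRF score, best first."""
    scores = {}
    for weight, results in ((full_text_weight, FTS), (semantic_weight, SEMANTIC)):
        if weight == 0:
            continue  # a zero-weight list contributes nothing
        for rank, title in enumerate(results, start=1):
            scores[title] = scores.get(title, 0.0) + weight / (K + rank)
    return sorted(scores, key=scores.get, reverse=True)


rankings = {name: rank_titles(*weights) for name, weights in PRESETS.items()}
```

Under the keyword-heavy preset, the two documents found only by full-text search move ahead of "Financial Overview Q4", while "Q4 Budget Report 2024" stays first because it still collects a term from both lists.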
Database Implementation
hybrid_search() RPC
hybrid_search() is a PostgreSQL function, exposed as an RPC, that runs both searches and returns the fused ranking in a single database call.
Indexes
| Index | Table | Type | Purpose |
|---|---|---|---|
| search_vector_idx | chunks | GIN | Full-text search |
| embedding_idx | chunks | HNSW | Vector similarity |
HNSW (Hierarchical Navigable Small World) provides fast approximate nearest neighbor search.
Performance Characteristics
Query Time
| Knowledge Base Size | Typical Latency |
|---|---|
| < 10,000 chunks | 50-100ms |
| 10,000-100,000 chunks | 100-300ms |
| 100,000+ chunks | 300-500ms |
Scaling Considerations
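- HNSW index: approximately O(log n) query time, where n is the number of chunks
- Full-text (GIN) index: roughly O(log n) to locate each query term, plus time proportional to the number of matching rows
- RRF fusion: O(m) where m = match_limit * 2 (at most match_limit candidates from each search)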
Why This Approach?
- Best of both worlds: Captures exact matches AND semantic meaning
- Simple fusion: RRF is easy to understand and tune
- No training required: Works out of the box, with no ranking model to train
- Configurable: Users can adjust weights for their use case
- Efficient: Single database query, parallel execution
References
- Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods - Original RRF paper
- pgvector Documentation - Vector similarity in PostgreSQL
- PostgreSQL Full Text Search - Built-in FTS