# Chunking Strategy

*How textrawl splits documents for optimal search*
Large documents must be split into smaller chunks for effective embedding and retrieval. textrawl uses a paragraph-aware chunking strategy.
## Default Parameters
| Parameter | Value | Description |
|---|---|---|
| Chunk size | 512 tokens | ~2048 characters |
| Overlap | 50 tokens | Context preservation |
| Min chunk | 100 tokens | Avoid tiny fragments |
## Chunking Algorithm
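The algorithm's details aren't shown here, but a minimal sketch of a paragraph-aware chunker using the default parameters might look like the following. This is illustrative only — textrawl's actual implementation may differ, and token counts are approximated as whitespace-delimited words for simplicity:

```python
# Hypothetical sketch of a paragraph-aware chunker (not textrawl's
# actual code). Tokens are approximated as whitespace-split words.

CHUNK_SIZE = 512   # target tokens per chunk
OVERLAP = 50       # tokens carried between consecutive chunks
MIN_CHUNK = 100    # a smaller final chunk merges into the previous one

def n_tokens(text: str) -> int:
    return len(text.split())

def chunk_document(text: str) -> list[str]:
    # Split on blank lines so chunk boundaries fall between paragraphs,
    # never mid-sentence.
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[list[str]] = []
    current: list[str] = []
    current_len = 0
    for para in paragraphs:
        plen = n_tokens(para)
        if current and current_len + plen > CHUNK_SIZE:
            chunks.append(current)
            # Carry the tail of the finished chunk forward as overlap,
            # so queries can still match across the boundary.
            tail = " ".join(current).split()[-OVERLAP:]
            current = [" ".join(tail)]
            current_len = len(tail)
        # Note: a single paragraph longer than CHUNK_SIZE simply becomes
        # an oversized chunk in this sketch.
        current.append(para)
        current_len += plen
    if current:
        if chunks and current_len < MIN_CHUNK:
            # Avoid emitting a tiny fragment as its own chunk.
            chunks[-1].extend(current)
        else:
            chunks.append(current)
    return ["\n\n".join(c) for c in chunks]
```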
## Why This Approach?
### Paragraph-Aware Splitting
- Preserves semantic coherence
- Doesn't cut sentences mid-thought
- Respects document structure
### 512-Token Chunks
- Optimal for most embedding models
- Balances context vs. precision
- Fits within Claude's context window
### 50-Token Overlap
- Prevents losing context at boundaries
- Ensures queries match across chunk edges
- Small overhead (~10% redundancy)
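The ~10% redundancy figure follows directly from the default parameters:

```python
# Each chunk repeats up to 50 tokens from its predecessor, out of a
# 512-token budget.
overlap, chunk_size = 50, 512
redundancy = overlap / chunk_size
print(f"{redundancy:.1%}")  # → 9.8%
```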
## Example
An original document of 1500 tokens becomes 3 chunks, each near the 512-token target, with 50 tokens of overlap between consecutive chunks.
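The 3-chunk count is consistent with the defaults, assuming paragraph-aware chunks may run slightly over the 512-token target (exact boundaries depend on where the paragraph breaks fall):

```python
# For n chunks of equal size with o tokens of overlap covering t tokens:
#   n * size - (n - 1) * o = t
# Solving for size with t=1500, n=3, o=50:
total, n_chunks, overlap = 1500, 3, 50
size = (total + (n_chunks - 1) * overlap) / n_chunks
print(round(size))  # → 533, slightly above the 512-token target
```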
## Configuration
Chunking parameters are not currently user-configurable; the defaults are tuned for general-purpose search.
## Trade-offs
| Smaller Chunks | Larger Chunks |
|---|---|
| More precise retrieval | More context per result |
| More chunks to search | Fewer embedding calls |
| Risk of missing context | Risk of diluted relevance |
The 512-token default balances these trade-offs for most use cases.