textrawl
by Jeff Green
Architecture

Chunking Strategy

How textrawl splits documents for optimal search

Large documents must be split into smaller chunks for effective embedding and retrieval. textrawl uses a paragraph-aware chunking strategy.

Default Parameters

Parameter     Value        Description
Chunk size    512 tokens   ~2048 characters
Overlap       50 tokens    Context preservation
Min chunk     100 tokens   Avoid tiny fragments
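The ~2048-character figure follows from the common rule of thumb of roughly 4 characters per English token (an approximation, not a property of any particular tokenizer):

```python
# ~4 characters per token is a common heuristic for English text.
CHARS_PER_TOKEN = 4
chunk_size_tokens = 512

approx_chars = chunk_size_tokens * CHARS_PER_TOKEN
print(approx_chars)  # 2048
```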

Chunking Algorithm

Document
    │
    ▼
┌─────────────────────────────────────────┐
│ 1. Split on paragraph boundaries (\n\n) │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│ 2. Combine paragraphs up to 512 tokens  │
└─────────────────────────────────────────┘
    │
    ▼
┌─────────────────────────────────────────┐
│ 3. Add 50-token overlap between chunks  │
└─────────────────────────────────────────┘
    │
    ▼
Chunks with metadata
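The three steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not textrawl's actual implementation: tokens are approximated as whitespace-separated words, and all function and variable names are hypothetical.

```python
# Default chunking parameters (from the table above).
CHUNK_SIZE = 512
OVERLAP = 50
MIN_CHUNK = 100

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer.
    return len(text.split())

def chunk_document(doc: str) -> list[str]:
    # Step 1: split on paragraph boundaries.
    paragraphs = [p for p in doc.split("\n\n") if p.strip()]

    # Step 2: greedily combine paragraphs up to CHUNK_SIZE tokens.
    groups: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for para in paragraphs:
        tokens = count_tokens(para)
        if current and current_tokens + tokens > CHUNK_SIZE:
            groups.append(current)
            current, current_tokens = [], 0
        current.append(para)
        current_tokens += tokens
    if current:
        if groups and current_tokens < MIN_CHUNK:
            groups[-1].extend(current)  # fold a tiny tail into the last chunk
        else:
            groups.append(current)

    # Step 3: prefix each chunk (after the first) with the last
    # OVERLAP tokens of the previous chunk to preserve context.
    chunks: list[str] = []
    prev_tail: list[str] = []
    for group in groups:
        body = "\n\n".join(group)
        text = " ".join(prev_tail) + "\n\n" + body if prev_tail else body
        chunks.append(text)
        prev_tail = body.split()[-OVERLAP:]
    return chunks
```

Note that paragraphs are never split internally here; a paragraph longer than the chunk size would need an extra sentence-level fallback, which is omitted for brevity.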

Why This Approach?

Paragraph-Aware Splitting

  • Preserves semantic coherence
  • Doesn't cut sentences mid-thought
  • Respects document structure

512-Token Chunks

  • Optimal for most embedding models
  • Balances context vs. precision
  • Fits within Claude's context window

50-Token Overlap

  • Prevents losing context at boundaries
  • Ensures queries match across chunk edges
  • Small overhead (~10% redundancy)
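The ~10% figure is just the ratio of the overlap to the chunk size:

```python
# Overlap overhead: fraction of each full-size chunk that repeats
# tokens from the previous chunk.
overlap, chunk_size = 50, 512
overhead = overlap / chunk_size
print(f"{overhead:.1%}")  # 9.8%, i.e. roughly 10% redundancy
```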

Example

Original document (1500 tokens):

## Introduction
First paragraph about topic...

## Details
Second paragraph with specifics...
Third paragraph continuing...

## Conclusion
Final thoughts...

Becomes 3 chunks:

Chunk 1 (512 tokens):
"## Introduction\nFirst paragraph about topic..."

Chunk 2 (512 tokens, 50-token overlap):
"...end of first paragraph\n\n## Details\nSecond paragraph..."

Chunk 3 (476 tokens + overlap):
"...continuing...\n\n## Conclusion\nFinal thoughts..."

Configuration

Chunking parameters are currently not user-configurable. They're optimized for general-purpose search.

Trade-offs

Smaller Chunks              Larger Chunks
More precise retrieval      More context per result
More chunks to search       Fewer embedding calls
Risk of missing context     Risk of diluted relevance

The 512-token default balances these trade-offs for most use cases.
