textrawl
by Jeff Green

Document Crawling

How textrawl discovers and processes your documents

textrawl's converter tools walk through your directories, extract text content, and prepare it for embedding.

Supported Formats

Format      Extension      Notes
Markdown    .md, .mdx      Full support including frontmatter
Plain text  .txt           Direct text extraction
PDF         .pdf           Text extraction, no OCR
HTML        .html          Strips tags, keeps structure
Email       .mbox, .eml    Full email parsing
DOCX        .docx          Microsoft Word documents
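
As a rough illustration of how a crawler might select files by the extensions listed above, here is a minimal sketch. The names FORMATS, formatFor, and findSupportedFiles are hypothetical and are not textrawl's actual API; case-insensitive extension matching is also an assumption.

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Extension-to-format map mirroring the Supported Formats table.
const FORMATS: Record<string, string> = {
  ".md": "Markdown",
  ".mdx": "Markdown",
  ".txt": "Plain text",
  ".pdf": "PDF",
  ".html": "HTML",
  ".mbox": "Email",
  ".eml": "Email",
  ".docx": "DOCX",
};

// Returns the format name for a path, or undefined if the extension is not listed.
// Assumption: extensions are matched case-insensitively.
function formatFor(filePath: string): string | undefined {
  return FORMATS[path.extname(filePath).toLowerCase()];
}

// Hypothetical walker: collects supported files under `dir`,
// descending into subdirectories only when `recursive` is true.
function findSupportedFiles(dir: string, recursive: boolean): string[] {
  const found: string[] = [];
  for (const entry of fs.readdirSync(dir, { withFileTypes: true })) {
    const fullPath = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      if (recursive) found.push(...findSupportedFiles(fullPath, recursive));
    } else if (entry.isFile() && formatFor(fullPath) !== undefined) {
      found.push(fullPath);
    }
  }
  return found;
}
```

Keeping the extension lookup separate from the directory walk means the format table can change without touching the traversal logic.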

Conversion Pipeline

Source Files
       │
       ▼
┌─────────────┐
│  Converter  │  npm run convert
└─────────────┘
       │
       ▼
Markdown + YAML frontmatter
       │
       ▼
┌─────────────┐
│   Upload    │  npm run upload
└─────────────┘
       │
       ▼
Documents + Chunks + Embeddings

Deduplication

Each converted file records a source_hash in its frontmatter. The hash identifies the source content, which prevents duplicate uploads when you re-run the upload command.
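
The sketch below shows one way hash-based deduplication of this kind could work. It is not textrawl's implementation: the choice of SHA-256, the in-memory set of uploaded hashes, and the function names sourceHash, withFrontmatter, and shouldUpload are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Assumption: the hash is a SHA-256 hex digest of the source content.
// The documentation does not say which algorithm textrawl uses.
function sourceHash(content: string): string {
  return createHash("sha256").update(content).digest("hex");
}

// Hypothetical renderer: prepends YAML frontmatter carrying the hash.
function withFrontmatter(body: string, hash: string): string {
  return `---\nsource_hash: ${hash}\n---\n\n${body}`;
}

// Hypothetical upload guard: returns true the first time a hash is seen
// and false for every later occurrence of the same hash.
function shouldUpload(hash: string, uploaded: Set<string>): boolean {
  if (uploaded.has(hash)) return false;
  uploaded.add(hash);
  return true;
}

// Usage: the same content yields the same hash, so the second attempt is skipped.
const uploaded = new Set<string>();
const hash = sourceHash("Hello, textrawl");
const firstAttempt = shouldUpload(hash, uploaded);  // true
const secondAttempt = shouldUpload(hash, uploaded); // false
```

Because the hash depends only on the content, an unchanged file produces the same source_hash on every conversion, while an edited file produces a new one.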

Configuration

Conversion currently has no configuration beyond its CLI options. Common options:

-o, --output <dir>   Output directory
-r, --recursive      Process subdirectories
-v, --verbose        Enable verbose logging
--dry-run            Preview without writing
-t, --tags <tags>    Add custom tags
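
To make the option shapes concrete, here is a minimal sketch of how flags like these could be parsed with Node's built-in util.parseArgs. The function parseConvertArgs is hypothetical, and two details are assumptions rather than documented behavior: that input paths are positional arguments and that tags are comma-separated.

```typescript
import { parseArgs } from "node:util";

// Hypothetical parser for the options listed above; not textrawl's actual code.
function parseConvertArgs(argv: string[]) {
  const { values, positionals } = parseArgs({
    args: argv,
    allowPositionals: true, // assumption: input paths are positional
    options: {
      output: { type: "string", short: "o" },
      recursive: { type: "boolean", short: "r" },
      verbose: { type: "boolean", short: "v" },
      "dry-run": { type: "boolean" },
      tags: { type: "string", short: "t" },
    },
  });
  return {
    inputs: positionals,
    output: values.output,
    recursive: values.recursive ?? false,
    verbose: values.verbose ?? false,
    dryRun: values["dry-run"] ?? false,
    // Assumption: tags arrive as one comma-separated string.
    tags: (values.tags ?? "")
      .split(",")
      .map((tag) => tag.trim())
      .filter((tag) => tag.length > 0),
  };
}

// Usage: recursive dry run over ./docs with two tags.
const parsed = parseConvertArgs(["-r", "--dry-run", "-o", "out", "-t", "work, notes", "docs"]);
```

When a script is run through npm run, npm expects a -- separator before any options meant for the script itself.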
