Document Crawling

textrawl's converter tools walk through your directories, extract text content, and prepare it for embedding.

Supported Formats

Format	Extension	Notes
Markdown	`.md`, `.mdx`	Full support including frontmatter
Plain text	`.txt`	Direct text extraction
PDF	`.pdf`	Text extraction, no OCR
HTML	`.html`	Strips tags, keeps structure
Email	`.mbox`, `.eml`	Full email parsing
DOCX	`.docx`	Microsoft Word documents
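
As a rough illustration of how a directory walker might decide which files to pick up, the sketch below maps the extensions in the table to a format name. The `Format` names, the `detectFormat` helper, and the case-insensitive matching are assumptions made for this example, not textrawl's actual internals.

```typescript
import { extname } from "node:path";

// Hypothetical format names; only the extensions come from the table above.
type Format = "markdown" | "text" | "pdf" | "html" | "email" | "docx";

const FORMAT_BY_EXTENSION: Record<string, Format> = {
  ".md": "markdown",
  ".mdx": "markdown",
  ".txt": "text",
  ".pdf": "pdf",
  ".html": "html",
  ".mbox": "email",
  ".eml": "email",
  ".docx": "docx",
};

// Returns the format for a path, or undefined when the extension is not listed.
// Lowercasing the extension is an assumption of this sketch.
function detectFormat(filePath: string): Format | undefined {
  return FORMAT_BY_EXTENSION[extname(filePath).toLowerCase()];
}
```

A walker built this way would skip any file for which `detectFormat` returns `undefined`.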

Conversion Pipeline

Source Files
     │
     ▼
┌─────────────┐
│  Converter  │  pnpm convert
└─────────────┘
     │
     ▼
Markdown + YAML frontmatter
     │
     ▼
┌─────────────┐
│   Upload    │  pnpm upload
└─────────────┘
     │
     ▼
Documents + Chunks + Embeddings
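
The intermediate "Markdown + YAML frontmatter" stage can be pictured with a minimal sketch. Only `source_hash` is a field named on this page (see Deduplication); `title`, `tags`, and the `renderDocument` helper are hypothetical and stand in for whatever the converter actually writes.

```typescript
// Hypothetical frontmatter shape; only source_hash is documented on this page.
interface Frontmatter {
  title: string;
  source_hash: string;
  tags: string[];
}

// Builds a Markdown document with a YAML frontmatter block on top.
// JSON string literals are valid YAML double-quoted scalars, so
// JSON.stringify is used here to quote values safely.
function renderDocument(meta: Frontmatter, body: string): string {
  const lines = [
    "---",
    `title: ${JSON.stringify(meta.title)}`,
    `source_hash: ${JSON.stringify(meta.source_hash)}`,
    `tags: [${meta.tags.map((t) => JSON.stringify(t)).join(", ")}]`,
    "---",
    "",
    body.trim(),
    "",
  ];
  return lines.join("\n");
}
```

The upload step would then read files of this shape, split the body into chunks, and embed them.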

Deduplication

Each converted file gets a `source_hash` field in its frontmatter, which prevents duplicate uploads when you re-run the upload command.
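
Hash-based deduplication of this kind can be sketched as follows. This page does not say how `source_hash` is computed or how the uploader checks it, so the SHA-256 digest, the `sourceHash` and `filterNew` helpers, and the in-memory `Set` of known hashes are all assumptions made for illustration.

```typescript
import { createHash } from "node:crypto";

// Assumption: the hash is a SHA-256 hex digest of the source file's content.
function sourceHash(content: string | Uint8Array): string {
  return createHash("sha256").update(content).digest("hex");
}

// Returns only the documents whose hash is not already in `seen`,
// and records each returned document's hash so repeats are skipped.
function filterNew<T extends { source_hash: string }>(
  docs: T[],
  seen: Set<string>,
): T[] {
  const fresh: T[] = [];
  for (const doc of docs) {
    if (seen.has(doc.source_hash)) continue; // already uploaded
    seen.add(doc.source_hash);
    fresh.push(doc);
  }
  return fresh;
}
```

In a real uploader the set of known hashes would presumably come from the document store rather than from memory, but the skip logic would look much the same.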

Configuration

Conversion currently has no configuration beyond its CLI options. Common options:

-o, --output <dir>   Output directory
-r, --recursive      Process subdirectories
-v, --verbose        Enable verbose logging
--dry-run            Preview without writing
-t, --tags <tags>    Add custom tags
