Document Crawling
How textrawl discovers and processes your documents
textrawl's converter tools walk through your directories, extract text content, and prepare it for embedding.
Supported Formats
| Format | Extension | Notes |
|---|---|---|
| Markdown | .md, .mdx | Full support including frontmatter |
| Plain text | .txt | Direct text extraction |
| PDF | .pdf | Text extraction, no OCR |
| HTML | .html | Strips tags, keeps structure |
| Email | .mbox, .eml | Full email parsing |
| DOCX | .docx | Microsoft Word documents |
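To make the discovery step concrete, here is a minimal Python sketch of walking a directory and keeping only files whose extension appears in the table. It is an illustration, not textrawl's actual implementation: the `discover_documents` helper, the `SUPPORTED_FORMATS` mapping, and the case-insensitive extension matching are all assumptions made for this example.

```python
from pathlib import Path

# Extension -> format label, following the table above.
# The mapping itself is a hypothetical structure, not part of textrawl.
SUPPORTED_FORMATS = {
    ".md": "Markdown",
    ".mdx": "Markdown",
    ".txt": "Plain text",
    ".pdf": "PDF",
    ".html": "HTML",
    ".mbox": "Email",
    ".eml": "Email",
    ".docx": "DOCX",
}


def discover_documents(root):
    """Yield (path, format_label) for each supported file under root.

    Hypothetical helper for illustration only.
    """
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        # Assumption: extensions are matched case-insensitively,
        # so "NOTES.MD" is treated like "notes.md".
        format_label = SUPPORTED_FORMATS.get(path.suffix.lower())
        if format_label is not None:
            yield path, format_label
```

Calling `list(discover_documents("./docs"))` would return the supported files under `./docs` in sorted path order, each paired with its format label; files with any other extension are skipped silently.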
Conversion Pipeline
Deduplication
Each converted file gets a source_hash in its frontmatter. This prevents duplicate uploads when re-running the upload command.
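The following Python sketch shows one way hash-based deduplication of this kind could work. It is not textrawl's code: the use of SHA-256, the simple `key: value` frontmatter parsing, and the helper names `compute_source_hash`, `read_source_hash`, and `should_upload` are assumptions chosen for illustration.

```python
import hashlib


def compute_source_hash(content: bytes) -> str:
    """Return a hex digest of the source content.

    Assumption: SHA-256; the actual hash algorithm is not stated here.
    """
    return hashlib.sha256(content).hexdigest()


def read_source_hash(converted_text: str):
    """Return the source_hash value from simple frontmatter, or None."""
    lines = converted_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None  # no frontmatter block at the top of the file
    for line in lines[1:]:
        if line.strip() == "---":
            break  # end of frontmatter
        key, sep, value = line.partition(":")
        if sep and key.strip() == "source_hash":
            return value.strip()
    return None


def should_upload(converted_text: str, uploaded_hashes: set) -> bool:
    """Return False if this file's source_hash was already uploaded.

    Records the hash in uploaded_hashes when the file is new.
    """
    source_hash = read_source_hash(converted_text)
    if source_hash is None:
        return True  # nothing to compare against; treat as new
    if source_hash in uploaded_hashes:
        return False
    uploaded_hashes.add(source_hash)
    return True
```

With an empty `uploaded_hashes` set, the first call to `should_upload` for a given converted file returns `True` and records its hash; a second call with the same file returns `False`, which mirrors the effect of re-running the upload command on unchanged sources.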
Configuration
Conversion is currently not configurable beyond CLI options. Common options: