# Embeddings

How textrawl creates vector representations of your documents
Embeddings are numerical representations of text that capture semantic meaning. textrawl uses them to enable semantic search.
## What Are Embeddings?
An embedding converts text into a high-dimensional vector (array of numbers). Similar texts have similar vectors, enabling "search by meaning."
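"Search by meaning" boils down to comparing vectors, usually with cosine similarity. A minimal sketch with toy three-dimensional vectors (real models emit hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (illustrative values, not model output):
dog   = [0.9, 0.1, 0.0]
puppy = [0.8, 0.2, 0.1]
car   = [0.0, 0.1, 0.9]

print(cosine_similarity(dog, puppy))  # high: similar meaning
print(cosine_similarity(dog, car))    # low: unrelated
```

Texts with related meanings land close together in the vector space, so a query vector can retrieve relevant chunks even when no keywords overlap.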
## Embedding Models
textrawl supports multiple embedding providers:
| Provider | Model | Dimensions | Notes |
|---|---|---|---|
| OpenAI | text-embedding-3-small | 1536 | Cloud, fast |
| Ollama | nomic-embed-text | 1024 | Local, private |
**Important:** Embedding providers cannot be mixed. OpenAI and Ollama produce vectors of different dimensions (1536 vs. 1024), making them incompatible. Switching providers requires re-embedding all documents in your knowledge base.
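The incompatibility is purely dimensional: a vector index built for 1536-dimensional vectors cannot answer queries embedded into 1024 dimensions. A hypothetical guard (names are illustrative, not textrawl's actual API) makes the rule concrete:

```python
# Dimensions per provider/model, as listed in the table above.
PROVIDER_DIMS = {
    "openai/text-embedding-3-small": 1536,
    "ollama/nomic-embed-text": 1024,
}

def check_compatible(stored_model: str, query_model: str) -> None:
    """Raise if the query model's dimension differs from the index's."""
    stored = PROVIDER_DIMS[stored_model]
    query = PROVIDER_DIMS[query_model]
    if stored != query:
        raise ValueError(
            f"Dimension mismatch: index holds {stored}-dim vectors but the "
            f"query was embedded into {query} dims. Re-embed the knowledge base."
        )
```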
## How It Works
1. **Chunk**: the document is split into ~512-token pieces
2. **Embed**: each chunk is sent to the embedding API
3. **Store**: vectors are stored in a pgvector column
4. **Index**: an HNSW index enables fast similarity search
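The chunk-and-embed steps can be sketched as follows. All function names here are hypothetical (not textrawl's actual API), and tokens are approximated by whitespace-split words; a real chunker counts model tokens.

```python
def chunk(text: str, max_tokens: int = 512) -> list[str]:
    """Split text into pieces of at most max_tokens words (token proxy)."""
    words = text.split()
    return [
        " ".join(words[i : i + max_tokens])
        for i in range(0, len(words), max_tokens)
    ]

def ingest(document: str, embed, store) -> None:
    """embed: text -> vector; store: (chunk_text, vector) -> None.

    Embeds every chunk and hands the (text, vector) pair to storage,
    which in textrawl's case is a pgvector column.
    """
    for piece in chunk(document):
        store(piece, embed(piece))
```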
## Query Time
1. The query text is embedded using the same model
2. pgvector finds similar chunks by cosine distance
3. Results are combined with full-text search via Reciprocal Rank Fusion (RRF)
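The fusion step can be sketched with standard Reciprocal Rank Fusion: each result list contributes a score of 1/(k + rank) per document, and scores are summed across lists. This is a generic RRF sketch, not textrawl's implementation; `k = 60` is the constant commonly used in the literature.

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Chunk IDs ranked by vector similarity vs. full-text relevance:
vector_hits = ["c3", "c1", "c7"]
fulltext_hits = ["c1", "c9", "c3"]
print(rrf([vector_hits, fulltext_hits]))  # → ['c1', 'c3', 'c9', 'c7']
```

Because `c1` ranks well in both lists, it wins overall even though neither search method ranked it first.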