Core Concepts

Understanding these core concepts will help you get the most out of textrawl's knowledge base capabilities.

Concepts Overview

Search

textrawl combines two search strategies for best results:

Keyword search finds exact matches using PostgreSQL's full-text search
Semantic search finds conceptually related content using vector embeddings
Hybrid search merges the results of both using Reciprocal Rank Fusion (RRF)
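
To make the fusion step concrete, here is a minimal sketch of Reciprocal Rank Fusion in Python. It is an illustration, not textrawl's actual implementation: the function name, the document IDs, and the constant k=60 (a commonly used default for RRF) are all assumptions.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores 1 / (k + rank) for every list it appears in,
    where rank starts at 1. k=60 is an assumed default, not a value
    taken from textrawl.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the two search strategies.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
semantic_hits = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, while a document found by only one strategy still appears further down. Because RRF uses ranks rather than raw scores, the keyword and semantic scores never need to be normalized against each other.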

Learn about search →

Embeddings

Vector embeddings are numerical representations of text that capture semantic meaning. textrawl works with several embedding models:

OpenAI's text-embedding-3-small (1536 dimensions)
Google AI's gemini-embedding-2-preview (3072 dimensions)
Ollama's nomic-embed-text or nomic-embed-text-v2-moe (768 dimensions each)
Embeddings are stored in PostgreSQL using the pgvector extension
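
As a rough illustration of how embeddings capture meaning, the sketch below compares toy vectors with cosine similarity. The three-dimensional vectors are invented for readability (real embeddings have hundreds or thousands of dimensions), and whether textrawl ranks by cosine similarity is an assumption here. pgvector exposes cosine distance through its `<=>` operator.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors, made up for illustration only.
query = [1.0, 0.0, 1.0]
related = [0.9, 0.1, 0.8]    # points in nearly the same direction as the query
unrelated = [0.0, 1.0, 0.0]  # orthogonal to the query

print(cosine_similarity(query, related))
print(cosine_similarity(query, unrelated))
```

Texts with similar meaning produce vectors that point in similar directions, so their cosine similarity approaches 1, while unrelated texts score near 0.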

Learn about embeddings →

Document Processing

How textrawl processes your documents:

Crawling discovers and extracts content from various formats
Chunking splits documents into searchable pieces
Indexing creates embeddings and full-text search vectors
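
The chunking step can be sketched as a fixed-size window over a token list. This is a simplified illustration under stated assumptions: the 512-token size is read from the "(512t)" label in the diagram, whitespace splitting stands in for a real tokenizer, and the `overlap` parameter is hypothetical rather than a documented textrawl setting.

```python
def chunk_tokens(tokens, size=512, overlap=0):
    """Split a token list into windows of at most `size` tokens.

    Consecutive windows share `overlap` tokens. Both parameters are
    illustrative assumptions, not textrawl configuration.
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("size must be positive and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reached the end of the document
        start += step
    return chunks


# Whitespace splitting is a stand-in for a real tokenizer.
document = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_tokens(document.split(), size=512, overlap=50)
print([len(chunk) for chunk in chunks])
```

Each chunk would then be indexed twice: once as an embedding for semantic search and once as a full-text search vector for keyword search.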

Learn about crawling →

How It All Fits Together

Documents (PDF, MBOX, HTML, MD)
         │
         ▼
    ┌─────────┐
    │ Crawl & │
    │ Extract │
    └────┬────┘
         │
         ▼
    ┌─────────┐
    │  Chunk  │
    │  (512t) │
    └────┬────┘
         │
         ├──────────────────┐
         ▼                  ▼
    ┌─────────┐      ┌────────────┐
    │ Embed   │      │ Full-Text  │
    │ (vector)│      │ (tsvector) │
    └────┬────┘      └──────┬─────┘
         │                  │
         └────────┬─────────┘
                  ▼
           ┌────────────┐
           │  Postgres  │
           │ (pgvector) │
           └────────────┘

Next Steps

Quick Start - Get textrawl running
Hybrid Search Architecture - Deep dive into search
Tools Reference - Explore the API