Architecture

Core Components

  • Stealth (pkg/stealth): High-speed retrieval that mimics browser TLS fingerprints to bypass bot detection.
  • Readability (pkg/readability): Local DOM pruning and deterministic Markdown normalization.
  • Engine (pkg/engine): Pipeline orchestrator and vLLM interface.
  • Storage (pkg/storage): SQLite state engine for content hashing, caching, and sync.
  • Vector (pkg/vector): Abstracted interface for local (JSONL) and cloud (Pinecone) indexing.
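
The content hashing that Storage uses for caching can be sketched as follows. This is a minimal illustration, not the actual pkg/storage code: SHA-256 is an assumed hash function (the hash is not named here), and ContentHash, HashCache, NewHashCache, and Update are hypothetical names. An in-memory map stands in for the SQLite table.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
)

// ContentHash returns the hex-encoded SHA-256 digest of a normalized document.
// SHA-256 is an assumption; the project may use a different hash.
func ContentHash(doc string) string {
	sum := sha256.Sum256([]byte(doc))
	return hex.EncodeToString(sum[:])
}

// HashCache is a hypothetical in-memory stand-in for the SQLite state table.
// It maps a URL to the hash of the last content stored for it.
type HashCache struct {
	hashes map[string]string
}

// NewHashCache returns an empty cache.
func NewHashCache() *HashCache {
	return &HashCache{hashes: make(map[string]string)}
}

// Update records the content hash for url and reports whether the content
// changed since the previous call. A first sighting counts as changed.
func (c *HashCache) Update(url, doc string) bool {
	h := ContentHash(doc)
	if prev, ok := c.hashes[url]; ok && prev == h {
		return false // identical content: downstream work can be skipped
	}
	c.hashes[url] = h
	return true
}
```

Hashing the normalized Markdown rather than the raw HTML means that markup-only changes on a page would not invalidate the cache, which is one reason deterministic normalization matters.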

Data Flow

  1. Fetch: Stealth retrieval of raw HTML.
  2. Prune: Heuristic boilerplate removal.
  3. Refine: Semantic cleaning by a small language model (SLM).
  4. Normalize: Deterministic structural formatting.
  5. Sync: SQLite state update and caching.
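
The five steps above can be read as an ordered chain of stages, each consuming the previous stage's output. The sketch below shows that shape only. Stage, Run, and ErrNilStage are hypothetical names, not the pkg/engine API, and the real orchestrator may pass richer types than strings.

```go
package main

import (
	"errors"
	"fmt"
)

// Stage is a hypothetical named transformation over a document body.
type Stage struct {
	Name string
	Fn   func(doc string) (string, error)
}

// ErrNilStage is returned when a stage has no function attached.
var ErrNilStage = errors.New("stage has no function")

// Run feeds input through the stages in order and stops at the first error,
// prefixing the error with the name of the stage that failed.
func Run(input string, stages ...Stage) (string, error) {
	doc := input
	for _, s := range stages {
		if s.Fn == nil {
			return "", fmt.Errorf("%s: %w", s.Name, ErrNilStage)
		}
		out, err := s.Fn(doc)
		if err != nil {
			return "", fmt.Errorf("%s: %w", s.Name, err)
		}
		doc = out
	}
	return doc, nil
}
```

Under these assumptions, a caller would pass five stages named fetch, prune, refine, normalize, and sync, in that order. The fetch stage maps a URL to raw HTML, and the sync stage persists the document and returns it unchanged.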