Architecture
Core Components
- Stealth (pkg/stealth): High-speed retrieval using TLS fingerprinting to bypass bot detection.
- Readability (pkg/readability): Local DOM pruning and deterministic Markdown normalization.
- Engine (pkg/engine): Pipeline orchestrator and vLLM interface.
- Storage (pkg/storage): SQLite state engine for content hashing, caching, and sync.
- Vector (pkg/vector): Abstracted interface for local (JSONL) and cloud (Pinecone) indexing.
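
To illustrate the content-hashing role described for pkg/storage, here is a minimal sketch of change detection by hash. It is not the project's actual code: the names ContentHash, State, and Update are hypothetical, SHA-256 is an assumed choice of hash, and an in-memory map stands in for the SQLite table.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// ContentHash returns the hex-encoded SHA-256 digest of the content.
// (Assumption: the real storage layer may use a different hash.)
func ContentHash(content string) string {
	sum := sha256.Sum256([]byte(content))
	return hex.EncodeToString(sum[:])
}

// State is a hypothetical stand-in for the SQLite state table:
// it maps a source URL to the hash of the last content stored for it.
type State struct {
	hashes map[string]string
}

// NewState returns an empty State.
func NewState() *State {
	return &State{hashes: make(map[string]string)}
}

// Update records the hash of content under url and reports whether
// the content differs from what was previously recorded. A false
// result means the cached copy is still current and downstream work
// can be skipped.
func (s *State) Update(url, content string) bool {
	h := ContentHash(content)
	if s.hashes[url] == h {
		return false
	}
	s.hashes[url] = h
	return true
}

func main() {
	st := NewState()
	fmt.Println(st.Update("https://example.com/a", "# Title"))
	fmt.Println(st.Update("https://example.com/a", "# Title"))
	fmt.Println(st.Update("https://example.com/a", "# New title"))
}
```

Hashing the normalized Markdown rather than the raw HTML would make the check insensitive to markup-only changes, though which one the project hashes is not stated here.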
Data Flow
- Fetch: Stealth retrieval of raw HTML.
- Prune: Heuristic boilerplate removal.
- Refine: Semantic cleaning with a small language model (SLM).
- Normalize: Deterministic structural formatting.
- Sync: SQLite state update and caching.
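
The five steps above can be sketched as an ordered list of stages, each taking the previous stage's output. This is only an illustration under stated assumptions: Stage and Run are hypothetical names, not the pkg/engine API, and the stage bodies in main are trivial stubs rather than real fetching, pruning, or model calls.

```go
package main

import (
	"fmt"
	"strings"
)

// Stage is a hypothetical pipeline step: a name plus a function that
// transforms the document produced by the previous step.
type Stage struct {
	Name string
	Fn   func(doc string) (string, error)
}

// Run applies the stages in order and stops at the first error,
// wrapping it with the name of the failing stage.
func Run(input string, stages []Stage) (string, error) {
	doc := input
	for _, st := range stages {
		out, err := st.Fn(doc)
		if err != nil {
			return "", fmt.Errorf("%s: %w", st.Name, err)
		}
		doc = out
	}
	return doc, nil
}

func main() {
	// Stub stages standing in for the real components.
	stages := []Stage{
		{"fetch", func(url string) (string, error) {
			return "<nav>menu</nav><p>  Hello   world </p>", nil
		}},
		{"prune", func(html string) (string, error) {
			return strings.ReplaceAll(html, "<nav>menu</nav>", ""), nil
		}},
		{"refine", func(doc string) (string, error) {
			// A real implementation would call the language model here.
			r := strings.NewReplacer("<p>", "", "</p>", "")
			return r.Replace(doc), nil
		}},
		{"normalize", func(doc string) (string, error) {
			// Collapse runs of whitespace into single spaces.
			return strings.Join(strings.Fields(doc), " "), nil
		}},
		{"sync", func(doc string) (string, error) {
			// A real implementation would update the state store here.
			return doc, nil
		}},
	}

	out, err := Run("https://example.com", stages)
	if err != nil {
		fmt.Println("pipeline failed:", err)
		return
	}
	fmt.Println(out)
}
```

Keeping every stage behind the same function signature makes it straightforward to skip or swap a step (for example, bypassing Refine when no model is available) without touching the others.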