PII Janitor: The Hybrid Redaction Engine

The PII Janitor is the heart of Bauxite’s privacy protection. It uses a Hybrid Detection Path that pairs fast pattern matching with slower, context-aware analysis, so sensitive data is scrubbed quickly without sacrificing accuracy.


The Hybrid Detection Path

Bauxite is the first intercept to combine deterministic speed with semantic depth:

1. Fast Path (Deterministic Regex)

  • Use Case: Credit Cards, API Keys, Social Security Numbers, Email addresses.
  • Overhead: <1ms.
  • Logic: Optimized regex patterns scan the stream for data with a fixed, recognizable format, including high-entropy secrets such as keys and tokens.
  • New Defaults: Out-of-the-box support for GitHub PATs, AWS Access Keys, Stripe Secret Keys, Google API Keys, Slack Tokens, HuggingFace Tokens, and NPM Tokens.
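
As a rough illustration of the Fast Path, the sketch below scans text with a few precompiled patterns and replaces each hit with a static marker. The pattern set, the labels, and the function names (fast_path_scan, redact_static) are assumptions made for this example, not Bauxite's actual rules; real token formats vary and change over time.

```python
import re

# Illustrative patterns only (assumed, not Bauxite's shipped defaults).
PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "AWS_ACCESS_KEY": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "GITHUB_PAT": re.compile(r"\bghp_[A-Za-z0-9]{36}\b"),
}


def fast_path_scan(text):
    """Return non-overlapping (label, start, end) hits, ordered by position."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            hits.append((label, m.start(), m.end()))
    # Sort by start offset; on ties, prefer the longer match.
    hits.sort(key=lambda h: (h[1], -h[2]))
    kept = []
    for hit in hits:
        # Drop any hit that overlaps the previously kept one.
        if not kept or hit[1] >= kept[-1][2]:
            kept.append(hit)
    return kept


def redact_static(text):
    """Replace every hit with [REDACTED], right to left so offsets stay valid."""
    for _label, start, end in reversed(fast_path_scan(text)):
        text = text[:start] + "[REDACTED]" + text[end:]
    return text
```

Compiling the patterns once at import time keeps the per-request cost to the scan itself, which is the property the Fast Path relies on.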

2. Deep Path (Semantic SLM)

  • Use Case: Names in conversation, residential addresses, medical identifiers.
  • Overhead: ~30ms.
  • Logic: Integrates with local Small Language Models (SLMs) via Ollama to understand the context of the sentence.
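
The Deep Path can be pictured as a single request to a local Ollama server. The sketch below uses Ollama's /api/generate endpoint with streaming disabled; the prompt wording, the JSON shape asked of the model, and the helper names (build_payload, parse_entities, deep_path_scan) are assumptions for illustration, not Bauxite's actual integration.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # same value as in the Configuration section
MODEL = "llama3"

# Hypothetical prompt; a production prompt would need careful tuning.
PROMPT_TEMPLATE = (
    "Find every person name, residential address, and medical identifier in "
    "the text below. Reply with a JSON object that has one key, 'entities', "
    "holding a list of objects with the keys 'label' and 'text'.\n\n"
    "Text: {text}"
)


def build_payload(text):
    """Build the request body for Ollama's /api/generate endpoint."""
    return {
        "model": MODEL,
        "prompt": PROMPT_TEMPLATE.format(text=text),
        "stream": False,   # ask for one complete JSON reply
        "format": "json",  # ask the model to emit valid JSON
    }


def parse_entities(response_body, source_text):
    """Extract (label, text) pairs, keeping only spans present in the source."""
    try:
        data = json.loads(response_body.get("response", ""))
    except (json.JSONDecodeError, TypeError):
        return []
    entities = data.get("entities", []) if isinstance(data, dict) else []
    if not isinstance(entities, list):
        return []
    result = []
    for ent in entities:
        if not isinstance(ent, dict):
            continue
        value = ent.get("text")
        # Discard anything the model invented that is not in the input.
        if isinstance(value, str) and value and value in source_text:
            result.append((str(ent.get("label", "UNKNOWN")), value))
    return result


def deep_path_scan(text, timeout=5.0):
    """Send the text to the local model and return the detected entities."""
    request = urllib.request.Request(
        OLLAMA_URL + "/api/generate",
        data=json.dumps(build_payload(text)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return parse_entities(body, text)
```

Filtering the model's answer against the source text is a cheap guard: a span the model hallucinated cannot be redacted anyway, so it is dropped rather than trusted.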

3. Custom Path (WASM Plugins)

  • Use Case: Proprietary enterprise data, internal project IDs.
  • Logic: Executes custom, sandboxed WebAssembly modules to detect data formats unique to your organization.

Tiered Redaction Modes

The Janitor’s redaction behavior depends on your commercial tier:

Open Core (Free)

  • Matches are replaced with the static string [REDACTED].
  • Limitation: The LLM loses context (e.g., it can’t tell if “[REDACTED] talked to [REDACTED]” refers to one person or two).

Shield / Fortress (Pro/Ent)

  • Matches are replaced with unique placeholders like [PERSON_1], [PERSON_2], or [EMAIL_1].
  • The Token Vault: Maps these placeholders back to their original values on the response path, so the end user sees the real data while the cloud provider only ever sees placeholders.
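
The difference between the two modes comes down to whether each distinct value gets its own numbered placeholder and whether the mapping is kept. Below is a minimal sketch of that idea; the TokenVault class, its methods, and the redact helper are hypothetical names chosen for this example and do not describe Bauxite's internals.

```python
import re


class TokenVault:
    """In-memory, per-request mapping between placeholders and original values."""

    def __init__(self):
        self._by_value = {}   # (label, value) -> placeholder
        self._by_token = {}   # placeholder -> original value
        self._counters = {}   # label -> last index handed out

    def tokenize(self, label, value):
        """Return a stable placeholder; the same value always maps to the same one."""
        key = (label, value)
        if key not in self._by_value:
            index = self._counters.get(label, 0) + 1
            self._counters[label] = index
            token = f"[{label}_{index}]"
            self._by_value[key] = token
            self._by_token[token] = value
        return self._by_value[key]

    def restore(self, text):
        """Swap known placeholders back to their originals; leave unknown ones alone."""
        return re.sub(
            r"\[[A-Z_]+_\d+\]",
            lambda m: self._by_token.get(m.group(0), m.group(0)),
            text,
        )


def redact(text, entities, vault):
    """Replace each detected (label, value) pair with its placeholder."""
    # Assign placeholders in detection order so numbering follows the input.
    tokens = [(value, vault.tokenize(label, value)) for label, value in entities]
    # Replace longer values first so a short value cannot split a longer one.
    for value, token in sorted(tokens, key=lambda t: len(t[0]), reverse=True):
        text = text.replace(value, token)
    return text
```

With this scheme, "Alice emailed Bob" becomes "[PERSON_1] emailed [PERSON_2]", so the model can still tell the two people apart, and calling restore on the model's reply brings the names back.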

Memory Safety & “The Straitjacket”

The Janitor operates within a strict 20MB memory limit.

  • No Disk Spillage: If the PII vault grows too large, the request fails outright; sensitive data is never written to disk.
  • Explicit Purge: When a request completes, Bauxite zero-fills the memory buffers containing sensitive data before returning them to the pool.
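
The two rules above can be sketched as a size-capped store that refuses new entries once the cap is hit and overwrites its buffers on purge. This is an illustration only: the BoundedVault and VaultFullError names are hypothetical, applying the 20MB figure to the vault alone is an assumption, and Python cannot guarantee that no other copies of a value exist in memory, so a real implementation would live in a language with direct memory control.

```python
MAX_VAULT_BYTES = 20 * 1024 * 1024  # the 20MB limit stated above


class VaultFullError(Exception):
    """Raised instead of spilling to disk when the memory cap would be exceeded."""


class BoundedVault:
    def __init__(self, limit=MAX_VAULT_BYTES):
        self._limit = limit
        self._used = 0
        self._buffers = []  # mutable buffers so they can be wiped in place

    def store(self, value: bytes) -> int:
        """Keep a value in memory and return its handle, or fail the request."""
        if self._used + len(value) > self._limit:
            raise VaultFullError("PII vault exceeded its memory limit")
        buf = bytearray(value)
        self._buffers.append(buf)
        self._used += len(buf)
        return len(self._buffers) - 1

    def get(self, handle: int) -> bytes:
        return bytes(self._buffers[handle])

    def purge(self):
        """Zero-fill every buffer in place, then release it."""
        for buf in self._buffers:
            buf[:] = bytes(len(buf))  # same-length slice assignment overwrites in place
        self._buffers.clear()
        self._used = 0
```

A caller would wrap request handling in try/finally so that purge runs whether the request succeeds or fails.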

Configuration

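janitor:
  patterns:
    # Custom regex pattern: matches IDs such as ID-12345
    - label: "INTERNAL_ID"
      regex: "ID-[0-9]{5}"

# Deep Path: local SLM served by Ollama (11434 is Ollama's default port)
inference:
  enabled: true
  url: "http://localhost:11434"
  model: "llama3"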