Streaming Redaction

Bauxite is built for the era of real-time AI. Unlike traditional proxies that wait for the full response to arrive before processing it, Bauxite’s Streaming Redaction engine scrubs data on the fly with only a few milliseconds of overhead.

The Challenge: Partial Matches

When an LLM streams a response via Server-Sent Events (SSE), it often sends data in tiny chunks:

Chunk 1: {"text": "My email is john."}

Chunk 2: {"text": "doe@exam"}

Chunk 3: {"text": "ple.com"}

A naive regex would miss the email address because it never sees the full string at once.
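The failure mode is easy to reproduce. The sketch below (Python, with a deliberately simplified email pattern and the chunk payloads from the example above) shows that scanning each chunk in isolation finds nothing, while the assembled stream clearly contains an address:

```python
import re

# A deliberately simple email pattern, for illustration only.
EMAIL = re.compile(r"[\w.]+@[\w.]+\.\w+")

# The text payloads of the three SSE chunks above.
chunks = ["My email is john.", "doe@exam", "ple.com"]

# Scanning each chunk in isolation finds no match...
assert all(EMAIL.search(c) is None for c in chunks)

# ...even though the reassembled stream contains the full address.
assert EMAIL.search("".join(chunks)).group() == "john.doe@example.com"
```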


The Bauxite Solution: Sliding Window Buffer

To solve this within our 20MB Straitjacket, Bauxite uses a fixed-size, circular sliding window buffer.

  1. Windowing: Bauxite maintains a small 32KB buffer for each active stream.
  2. Lookahead: As chunks arrive, they are appended to the window.
  3. Redaction: The regex engine scans the window. If a partial match is detected at the end of a chunk (e.g., john.doe@), Bauxite holds that small fragment back for a few milliseconds until the next chunk confirms or breaks the pattern.
  4. Flush: Once a pattern is confirmed and redacted—or the window moves past a non-match—the data is immediately flushed to the client.
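The steps above can be sketched in a few lines of Python. This is an illustrative model, not Bauxite’s implementation: the pattern, the `[EMAIL]` placeholder, and the crude "could this tail still grow into a match?" heuristic are all assumptions, and the `max_hold_ms` timer is omitted for brevity:

```python
import re

# Illustrative pattern and placeholder; the real engine runs many
# PII patterns and uses typed placeholders.
PII = re.compile(r"[\w.]+@[\w.]+\.\w+")
# Crude partial-match heuristic: any trailing run of characters drawn
# from the pattern's alphabet might still grow into a match.
TAIL = re.compile(r"[\w.@]+\Z")

class StreamRedactor:
    def __init__(self):
        self.buf = ""

    def feed(self, chunk: str) -> str:
        """Append a chunk, redact completed matches, flush safe bytes."""
        self.buf += chunk
        self.buf = PII.sub("[EMAIL]", self.buf)
        # Tail-holding: keep back any fragment that might complete
        # into a match once the next chunk arrives.
        m = TAIL.search(self.buf)
        held = m.group() if m else ""
        out, self.buf = self.buf[: len(self.buf) - len(held)], held
        return out

    def flush(self) -> str:
        """End of stream: emit whatever is still held."""
        out, self.buf = self.buf, ""
        return out
```

Fed the three chunks from the example above, this emits "My email is ", holds the growing "john.doe@exam" fragment, and flushes "[EMAIL]" once the final chunk completes the pattern.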

Performance Impact

  • Latency: Adds ~0.5ms to 2ms of jitter (negligible for human reading speeds).
  • Memory: Fixed 32KB per concurrent stream. 100 concurrent users only consume ~3.2MB of the 20MB limit.

Security: The “No-Leak” Guarantee

Because Bauxite operates at the byte level, we ensure that no PII “leaks” through the stream even if the LLM attempts to obfuscate it with unusual chunking.

  • SSE Aware: Parses data: frames natively to avoid redacting metadata.
  • Tail-Holding: Prevents partial PII leaks by holding the end of the buffer until the next chunk.
  • Backpressure: If the client is slow, Bauxite holds data in the memory-only window, never swapping to disk.
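SSE awareness means the redactor only ever touches the payload of data: lines, never the protocol fields themselves. A minimal sketch of that idea (the redact callback and the frame contents are hypothetical):

```python
def redact_sse_frame(frame: str, redact) -> str:
    """Run `redact` over the payload of data: lines only, leaving
    other SSE fields (event:, id:, retry:) untouched."""
    out = []
    for line in frame.splitlines():
        if line.startswith("data:"):
            out.append("data:" + redact(line[len("data:"):]))
        else:
            out.append(line)
    return "\n".join(out)

# Hypothetical redactor: mask one known secret token.
mask = lambda s: s.replace("sk-12345", "[SECRET_1]")
frame = "event: message\ndata: my secret is sk-12345\nid: 7"
print(redact_sse_frame(frame, mask))
```

This keeps the event name and id intact even when the secret happens to resemble an SSE field, which is what "parses data: frames natively" buys you.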

Configuration

Streaming redaction is enabled by default when the PII Janitor is active. You can tune the “Lookahead” buffer size in your config.yaml:

streaming:
  enabled: true
  lookahead_buffer_kb: 32 # Default 32KB
  max_hold_ms: 50        # Max time to hold a partial match before flushing
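If you generate or template this config, it is worth sanity-checking the two tunables before deploying. A minimal sketch of such a check (the bounds here are illustrative assumptions, not limits Bauxite documents):

```python
def validate_streaming_config(cfg: dict) -> dict:
    """Fill in the documented defaults and sanity-check the values.
    The 1-1024 KB bound is an illustrative assumption."""
    out = {
        "enabled": bool(cfg.get("enabled", True)),
        "lookahead_buffer_kb": int(cfg.get("lookahead_buffer_kb", 32)),
        "max_hold_ms": int(cfg.get("max_hold_ms", 50)),
    }
    if not 1 <= out["lookahead_buffer_kb"] <= 1024:
        raise ValueError("lookahead_buffer_kb out of range")
    if out["max_hold_ms"] < 0:
        raise ValueError("max_hold_ms must be non-negative")
    return out
```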

Testing the Stream

You can test the real-time redaction using curl:

curl -X POST http://localhost:9090/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4",
    "stream": true,
    "messages": [{"role": "user", "content": "Repeat this back: my secret is sk-12345"}]
  }'

Expected Output: data: {"choices": [{"delta": {"content": "Your secret is [SECRET_1]"}}]}