[BUG] Memory leak in toxicity model inference server
Symptoms
Production alert: Moderation API memory usage growing by 200 MB/hour
Impact
- Frequency: OOM restart every 12 hours
- User impact: 30-60s downtime during restart
- Workaround: Auto-restart in Kubernetes
Root Cause Investigation
Hypothesis 1: PyTorch model not releasing GPU memory
Proposed fix: explicitly move results to CPU, delete GPU tensor references, and release the CUDA cache after each request (sketch below)
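A minimal sketch of what that could look like in the inference path, assuming a Hugging Face-style sequence-classification model; `run_inference` and its arguments are illustrative names, not the service's actual code:

```python
import torch

def run_inference(model, tokenizer, texts, device="cuda"):
    """Score a batch and make sure no GPU references outlive the request."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

    with torch.inference_mode():  # no autograd graph -> no hidden references kept alive
        logits = model(**inputs).logits

    # Copy results to CPU so the response holds no CUDA tensors.
    scores = torch.softmax(logits, dim=-1).cpu().tolist()

    # Drop GPU references and return cached blocks to the allocator.
    del inputs, logits
    torch.cuda.empty_cache()

    return scores
```

`torch.inference_mode()` also prevents autograd from retaining activation graphs, which is a common source of "leaks" that are really just held references.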
Hypothesis 2: Request context not cleaned up
- FastAPI may be holding references to request objects
- Need to profile with memory_profiler or tracemalloc to confirm (see the debug-route sketch below)
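One way to test this without attaching a profiler to the pod is a temporary debug route that counts live `Request` objects and reports tracemalloc's top allocation sites. This is a sketch only; `/debug/leaks` is a hypothetical route name and should not ship enabled in production:

```python
import gc
import tracemalloc

from fastapi import FastAPI
from starlette.requests import Request

app = FastAPI()
tracemalloc.start()

@app.get("/debug/leaks")
def leak_report():
    """Rough check for Hypothesis 2: are Request objects being retained?"""
    gc.collect()
    live_requests = sum(isinstance(obj, Request) for obj in gc.get_objects())

    current, peak = tracemalloc.get_traced_memory()
    top = tracemalloc.take_snapshot().statistics("lineno")[:5]

    return {
        "live_request_objects": live_requests,
        "traced_current_bytes": current,
        "traced_peak_bytes": peak,
        "top_allocations": [str(stat) for stat in top],
    }
```

If `live_request_objects` stays near zero between requests, Hypothesis 2 is likely wrong and the allocation sites reported by tracemalloc should point elsewhere.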
Hypothesis 3: Tokenizer cache growing unbounded
- Transformers tokenizer caches vocab lookups
- May need to limit cache size
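If the growth turns out to be an application-level cache of encodings (rather than something inside Transformers itself, which still needs to be verified), a bounded `lru_cache` keeps it from growing without limit. The checkpoint name and `encode_cached` helper below are illustrative, not the service's real code:

```python
from functools import lru_cache

from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the toxicity model's actual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

@lru_cache(maxsize=10_000)  # hard size limit, unlike an ever-growing dict keyed by input text
def encode_cached(text: str) -> tuple[int, ...]:
    """Cache token IDs per unique text, bounded by maxsize."""
    return tuple(tokenizer.encode(text, truncation=True))
```

`encode_cached.cache_clear()` could also be wired to a periodic task as a stopgap while the real source of growth is confirmed.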
Debugging Steps
- Add memory profiling to production (e.g. the temporary tracemalloc debug route sketched under Hypothesis 2)
- Run a load test with memory tracking (see the sketch after this list)
- Review PyTorch tensor lifecycle
- Check FastAPI request cleanup
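For the load test, something along these lines would confirm the growth rate under controlled traffic. The URL, server PID, payload shape, and request counts are placeholders for the real moderation endpoint:

```python
import time

import httpx
import psutil

# Placeholder values; point these at a staging instance of the moderation API.
API_URL = "http://localhost:8000/moderate"
SERVER_PID = 12345
REQUESTS = 5_000
SAMPLE_EVERY = 500

def run_load_test():
    proc = psutil.Process(SERVER_PID)
    samples = []

    with httpx.Client(timeout=10.0) as client:
        for i in range(REQUESTS):
            client.post(API_URL, json={"text": f"sample comment {i}"})
            if i % SAMPLE_EVERY == 0:
                rss_mb = proc.memory_info().rss / 1e6
                samples.append((i, rss_mb))
                print(f"{i:>6} requests -> RSS {rss_mb:.1f} MB")

    # RSS that keeps climbing and never flattens points to a real leak,
    # not just allocator warm-up.
    growth = samples[-1][1] - samples[0][1]
    print(f"RSS growth over the run: {growth:.1f} MB")

if __name__ == "__main__":
    run_load_test()
```

Comparing RSS growth per 1,000 requests against the ~200 MB/hour seen in production should tell us whether the leak scales with request volume or with wall-clock time.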
Priority
Medium-High: Not blocking users (auto-restart works) but wastes resources.
cc @bob_wilson @sabrina_farmer