[BUG] Memory leak in toxicity model inference server
Symptoms
Production alert: Moderation API memory usage growing by 200 MB/hour
Impact
- Frequency: OOM restart every 12 hours
- User impact: 30-60s downtime during restart
- Workaround: Auto-restart in Kubernetes
Root Cause Investigation
Hypothesis 1: PyTorch model not releasing GPU memory
Proposed fix: explicitly move results to CPU, delete GPU tensor references, and release the CUDA cache after each request (sketch below)
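A minimal sketch of what that could look like in the inference path, assuming a Hugging Face-style sequence-classification model; `run_inference` and its arguments are illustrative names, not the service's actual code:

```python
import torch

def run_inference(model, tokenizer, texts, device="cuda"):
    """Score a batch and make sure no GPU references outlive the request."""
    inputs = tokenizer(texts, return_tensors="pt", padding=True, truncation=True).to(device)

    with torch.inference_mode():  # no autograd graph -> no hidden references kept alive
        logits = model(**inputs).logits

    # Copy results to CPU so the response holds no CUDA tensors.
    scores = torch.softmax(logits, dim=-1).cpu().tolist()

    # Drop GPU references and return cached blocks to the allocator.
    del inputs, logits
    torch.cuda.empty_cache()

    return scores
```

`torch.inference_mode()` also prevents autograd from retaining activation graphs, which is a common source of "leaks" that are really just held references.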
Hypothesis 2: Request context not cleaned up
- FastAPI may be holding references to request objects
- Need to profile with memory_profiler or tracemalloc to confirm (see the debug-route sketch below)
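One way to test this without attaching a profiler to the pod is a temporary debug route that counts live `Request` objects and reports tracemalloc's top allocation sites. This is a sketch only; `/debug/leaks` is a hypothetical route name and should not ship enabled in production:

```python
import gc
import tracemalloc

from fastapi import FastAPI
from starlette.requests import Request

app = FastAPI()
tracemalloc.start()

@app.get("/debug/leaks")
def leak_report():
    """Rough check for Hypothesis 2: are Request objects being retained?"""
    gc.collect()
    live_requests = sum(isinstance(obj, Request) for obj in gc.get_objects())

    current, peak = tracemalloc.get_traced_memory()
    top = tracemalloc.take_snapshot().statistics("lineno")[:5]

    return {
        "live_request_objects": live_requests,
        "traced_current_bytes": current,
        "traced_peak_bytes": peak,
        "top_allocations": [str(stat) for stat in top],
    }
```

If `live_request_objects` stays near zero between requests, Hypothesis 2 is likely wrong and the allocation sites reported by tracemalloc should point elsewhere.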
Hypothesis 3: Tokenizer cache growing unbounded
- Transformers tokenizer caches vocab lookups
- May need to limit cache size
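If the growth turns out to be an application-level cache of encodings (rather than something inside Transformers itself, which still needs to be verified), a bounded `lru_cache` keeps it from growing without limit. The checkpoint name and `encode_cached` helper below are illustrative, not the service's real code:

```python
from functools import lru_cache

from transformers import AutoTokenizer

# Placeholder checkpoint; substitute the toxicity model's actual tokenizer.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

@lru_cache(maxsize=10_000)  # hard size limit, unlike an ever-growing dict keyed by input text
def encode_cached(text: str) -> tuple[int, ...]:
    """Cache token IDs per unique text, bounded by maxsize."""
    return tuple(tokenizer.encode(text, truncation=True))
```

`encode_cached.cache_clear()` could also be wired to a periodic task as a stopgap while the real source of growth is confirmed.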
Debugging Steps
- Add memory profiling to production (e.g. the temporary tracemalloc debug route sketched under Hypothesis 2)
- Run a load test with memory tracking (see the sketch after this list)
- Review PyTorch tensor lifecycle
- Check FastAPI request cleanup
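For the load test, something along these lines would confirm the growth rate under controlled traffic. The URL, server PID, payload shape, and request counts are placeholders for the real moderation endpoint:

```python
import time

import httpx
import psutil

# Placeholder values; point these at a staging instance of the moderation API.
API_URL = "http://localhost:8000/moderate"
SERVER_PID = 12345
REQUESTS = 5_000
SAMPLE_EVERY = 500

def run_load_test():
    proc = psutil.Process(SERVER_PID)
    samples = []

    with httpx.Client(timeout=10.0) as client:
        for i in range(REQUESTS):
            client.post(API_URL, json={"text": f"sample comment {i}"})
            if i % SAMPLE_EVERY == 0:
                rss_mb = proc.memory_info().rss / 1e6
                samples.append((i, rss_mb))
                print(f"{i:>6} requests -> RSS {rss_mb:.1f} MB")

    # RSS that keeps climbing and never flattens points to a real leak,
    # not just allocator warm-up.
    growth = samples[-1][1] - samples[0][1]
    print(f"RSS growth over the run: {growth:.1f} MB")

if __name__ == "__main__":
    run_load_test()
```

Comparing RSS growth per 1,000 requests against the ~200 MB/hour seen in production should tell us whether the leak scales with request volume or with wall-clock time.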
Priority
Medium-High: Not blocking users (auto-restart works) but wastes resources.
cc @bob_wilson @sabrina_farmer