
[BUG] Memory leak in toxicity model inference server

Symptoms

Production alert: Moderation API memory usage is growing by roughly 200 MB/hour.

Impact

  • Frequency: OOM restart every 12 hours
  • User impact: 30-60s downtime during restart
  • Workaround: Auto-restart in Kubernetes

Root Cause Investigation

Hypothesis 1: PyTorch model not releasing GPU memory

Proposed fix: explicitly move result tensors to the CPU and delete GPU references after each request.
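
A minimal sketch of what that fix could look like, assuming the server wraps inference in a helper like the hypothetical `classify` below (the model and tokenizer objects come from the existing service; names are illustrative, not the actual handler):

```python
import torch

@torch.inference_mode()  # no autograd graph, so intermediate tensors are not kept alive
def classify(model, tokenizer, text: str, device: str = "cuda") -> list[float]:
    """Run one toxicity inference and return plain Python floats.

    Returning detached CPU values (not tensors) means the response object
    does not pin any GPU memory after the request finishes.
    """
    inputs = tokenizer(text, return_tensors="pt", truncation=True).to(device)
    logits = model(**inputs).logits
    scores = torch.softmax(logits, dim=-1).squeeze(0).cpu().tolist()

    # Drop references to GPU tensors explicitly before returning.
    del inputs, logits
    torch.cuda.empty_cache()  # hand cached blocks back to the driver (mainly for observability)
    return scores
```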

Hypothesis 2: Request context not cleaned up

  • FastAPI may be holding references to request objects after the response is sent
  • Needs profiling with memory_profiler; a quick gc-based check is sketched below
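
One cheap way to test this hypothesis before a full profiler run is to count live Request objects after forcing a garbage-collection pass. A rough sketch, assuming a debug-only route (`/debug/leaks` is hypothetical and should never be exposed publicly):

```python
import gc

from fastapi import FastAPI
from starlette.requests import Request

app = FastAPI()

@app.get("/debug/leaks")
def leak_report() -> dict[str, int]:
    """Count live Request objects after a forced GC pass.

    If this number keeps climbing between calls while traffic is steady,
    something is holding references past the end of the request.
    """
    gc.collect()  # drop objects only kept alive by reference cycles
    live_requests = sum(1 for obj in gc.get_objects() if isinstance(obj, Request))
    return {"live_request_objects": live_requests}
```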

Hypothesis 3: Tokenizer cache growing unbounded

  • The Transformers tokenizer caches vocab lookups
  • May need to bound or periodically clear the cache (sketch below)
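
If the service uses a slow (pure-Python) tokenizer, some implementations memoize per-word results in a plain dict attribute that grows with vocabulary diversity. A hedged sketch for monitoring and capping such a cache; the attribute name `cache` is an assumption and may differ per tokenizer class:

```python
import sys

def tokenizer_cache_stats(tokenizer) -> dict[str, int]:
    """Report the size of the tokenizer's internal word cache, if it has one.

    Logging this periodically shows whether the cache grows without bound
    under production traffic.
    """
    cache = getattr(tokenizer, "cache", None)  # attribute name is an assumption
    if not isinstance(cache, dict):
        return {"entries": 0, "approx_bytes": 0}
    return {"entries": len(cache), "approx_bytes": sys.getsizeof(cache)}

def trim_tokenizer_cache(tokenizer, max_entries: int = 50_000) -> None:
    """Crude mitigation: clear the cache once it passes a size threshold."""
    cache = getattr(tokenizer, "cache", None)
    if isinstance(cache, dict) and len(cache) > max_entries:
        cache.clear()
```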

Debugging Steps

  • Add memory profiling to production (tracemalloc sketch below)
  • Run a load test with memory tracking
  • Review PyTorch tensor lifecycle
  • Check FastAPI request cleanup
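
For the first two steps, Python's built-in tracemalloc can attribute growth to specific allocation sites without extra dependencies. A minimal sketch of a snapshot-diff helper that could be wired into a debug endpoint or a load-test harness (names are illustrative):

```python
import tracemalloc

tracemalloc.start(25)  # keep 25 frames per allocation for useful tracebacks

_baseline = tracemalloc.take_snapshot()

def report_growth(top_n: int = 10) -> list[str]:
    """Compare current allocations against the baseline snapshot.

    Call this periodically during a load test; allocation sites that keep
    gaining memory across reports are leak candidates.
    """
    current = tracemalloc.take_snapshot()
    stats = current.compare_to(_baseline, "lineno")
    return [str(stat) for stat in stats[:top_n]]
```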

Priority

Medium-High: not blocking users (the auto-restart workaround holds), but it wastes resources and causes brief, recurring downtime.

cc @bob_wilson @sabrina_farmer