Optimize recommendation model inference latency - p99 exceeds SLA
Performance Issue
Current State
Recommendation API latency is exceeding our SLA:
- p50: 45ms (target: <50ms) ✅
- p95: 178ms (target: <150ms) ❌
- p99: 412ms (target: <200ms) ❌ CRITICAL
Impact
- 1% of requests time out (>500ms)
- Poor user experience for mobile users
- Revenue impact: ~$8K/month from abandoned sessions
- SLA breach: 99th percentile consistently above 200ms
Root Cause Analysis
Profiling Results (from CI performance tests; a timing-harness sketch follows this list):
- Model inference: 89ms (hot path)
  - SVD matrix multiplication: 67ms
  - Post-processing: 22ms
- Feature retrieval: 78ms
  - Redis cache miss rate: 23%
  - Cache misses fall through to PostgreSQL: 156ms
- Result ranking: 45ms
  - Sorting 500 candidates
  - Diversity filtering
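For reproducing this per-stage breakdown outside CI, a minimal timing-harness sketch; `fetch_features`, `run_inference`, and `rank_results` are hypothetical stand-ins for the real handlers:

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, list[float]] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings.setdefault(stage, []).append(
            (time.perf_counter() - start) * 1000.0
        )

def handle_request(user_id: str):
    # Hypothetical stage functions; swap in the real handlers.
    with timed("feature_retrieval"):
        features = fetch_features(user_id)
    with timed("model_inference"):
        scores = run_inference(features)
    with timed("result_ranking"):
        return rank_results(scores)
```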
Optimization Strategies
Short-term (Week 1-2)
- Increase Redis cache TTL (config sketch after this list)
  - Current: 1 hour
  - Proposed: 6 hours
  - Expected: -30ms p99
- Model quantization (sketch after this list)
  - FP32 → FP16
  - Expected: -20ms inference time
  - Accuracy impact: <0.1%
- Batch inference (sketch after this list)
  - Group requests by 10ms window
  - Expected: -40ms p99
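A minimal sketch of the TTL bump, assuming redis-py and a `features:{user_id}` key layout (both assumptions, not necessarily our actual key scheme); in practice the value would come from config:

```python
import json
import redis

FEATURE_TTL_SECONDS = 6 * 60 * 60  # proposed: 6h (was 1h)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_features(user_id: str, features: dict) -> None:
    # SETEX writes the value and its expiry atomically.
    r.setex(f"features:{user_id}", FEATURE_TTL_SECONDS, json.dumps(features))
```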
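Sketch of the FP32 → FP16 cast, assuming the model is a PyTorch module and the serving runtime has decent FP16 kernels (FP16 on plain CPU can actually be slower); the accuracy check against Precision@10 would run before rollout:

```python
import torch

def to_fp16(model: torch.nn.Module) -> torch.nn.Module:
    """Cast an already-loaded FP32 model to FP16 for inference."""
    return model.half().eval()

@torch.inference_mode()
def score(model_fp16: torch.nn.Module, features: torch.Tensor) -> torch.Tensor:
    # Inputs must be cast to the same dtype as the weights.
    return model_fp16(features.half())
```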
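Sketch of request batching with a 10ms collection window, assuming an asyncio-based serving layer; `run_inference_batch` is a hypothetical batched model call and the batch-size cap is a placeholder:

```python
import asyncio

BATCH_WINDOW_SECONDS = 0.010  # 10ms collection window
MAX_BATCH_SIZE = 64

_queue: asyncio.Queue = asyncio.Queue()

async def submit(features):
    """Called per request; returns once the whole batch has been scored."""
    fut = asyncio.get_running_loop().create_future()
    await _queue.put((features, fut))
    return await fut

async def batcher() -> None:
    # Start once at app startup, e.g. asyncio.create_task(batcher()).
    loop = asyncio.get_running_loop()
    while True:
        first = await _queue.get()
        batch = [first]
        deadline = loop.time() + BATCH_WINDOW_SECONDS
        # Keep collecting until the window closes or the batch is full.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        scores = run_inference_batch([f for f, _ in batch])  # hypothetical batched call
        for (_, fut), score in zip(batch, scores):
            fut.set_result(score)
```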
Long-term (Week 3-4)
- GPU inference (sketch after this list)
  - Move to GPU-accelerated serving
  - Expected: -60ms inference time
  - Cost: +$500/month
- Model distillation (sketch after this list)
  - Train smaller student model
  - Expected: -50ms inference time
  - Accuracy impact: -2%
- Feature store (sketch after this list)
  - Precompute all features
  - Expected: -70ms feature retrieval
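Sketch of GPU serving, assuming PyTorch on a CUDA node; in practice this would sit behind the batching layer sketched above so the GPU sees full batches:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_gpu(model: torch.nn.Module) -> torch.nn.Module:
    """Move an already-loaded model onto the GPU for serving."""
    return model.to(device).eval()

@torch.inference_mode()
def score_batch(model: torch.nn.Module, features: torch.Tensor) -> torch.Tensor:
    # Copy the batch to the device, score, and return results to the host.
    return model(features.to(device, non_blocking=True)).cpu()
```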
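Sketch of the distillation training step, assuming PyTorch; the student architecture, temperature, and data pipeline are placeholders showing the shape of the approach, not a tuned recipe:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, features, temperature=2.0):
    """One knowledge-distillation step: student matches softened teacher scores."""
    with torch.no_grad():
        teacher_logits = teacher(features)
    student_logits = student(features)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```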
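Sketch of an offline precompute job that materializes features into Redis ahead of request time; `all_user_ids` and `compute_features` are hypothetical, and a managed feature store could replace the hand-rolled version:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def precompute_features(batch_size: int = 1000) -> None:
    """Offline job: write features for every user so requests never hit PostgreSQL."""
    pipe = r.pipeline()
    for i, user_id in enumerate(all_user_ids(), start=1):   # hypothetical user iterator
        features = compute_features(user_id)                 # hypothetical feature builder
        pipe.set(f"features:{user_id}", json.dumps(features))
        if i % batch_size == 0:
            pipe.execute()
            pipe = r.pipeline()
    pipe.execute()
```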
Testing Plan
- Load test with k6 (5000 req/s)
- A/B test accuracy vs latency tradeoff
- Monitor cache hit rates (sketch after this list)
- Measure GPU cost/benefit
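Sketch of checking cache hit rate from Redis's own counters, assuming redis-py; note `keyspace_hits`/`keyspace_misses` are server-wide, so per-key-prefix rates would need app-level counters:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_hit_rate() -> float:
    """Server-wide hit rate derived from the INFO stats section."""
    stats = r.info("stats")
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```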
Success Metrics
- p99 latency < 200ms
- Cache hit rate > 90%
- Maintain Precision@10 > 0.80
Priority: HIGH - SLA breach
cc: @dmitry @bill_staples