Optimize recommendation model inference latency - p99 exceeds SLA
Performance Issue
Current State
Recommendation API latency is exceeding our SLA:
- p50: 45ms (target: <50ms) ✅
- p95: 178ms (target: <150ms) ❌
- p99: 412ms (target: <200ms) ❌ CRITICAL
Impact
- 1% of requests time out (>500ms)
- Poor user experience for mobile users
- Revenue impact: ~$8K/month from abandoned sessions
- SLA breach: 99th percentile consistently above 200ms
Root Cause Analysis
Profiling Results (from CI performance tests; a timing-harness sketch follows this list):
- Model inference: 89ms (hot path)
  - SVD matrix multiplication: 67ms
  - Post-processing: 22ms
- Feature retrieval: 78ms
  - Redis cache miss rate: 23%
  - Cache misses fall through to PostgreSQL: 156ms
- Result ranking: 45ms
  - Sorting 500 candidates
  - Diversity filtering
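For reproducing this per-stage breakdown outside CI, a minimal timing-harness sketch; `fetch_features`, `run_inference`, and `rank_results` are hypothetical stand-ins for the real handlers:

```python
import time
from contextlib import contextmanager

stage_timings: dict[str, list[float]] = {}

@contextmanager
def timed(stage: str):
    """Record wall-clock time for one pipeline stage, in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings.setdefault(stage, []).append(
            (time.perf_counter() - start) * 1000.0
        )

def handle_request(user_id: str):
    # Hypothetical stage functions; swap in the real handlers.
    with timed("feature_retrieval"):
        features = fetch_features(user_id)
    with timed("model_inference"):
        scores = run_inference(features)
    with timed("result_ranking"):
        return rank_results(scores)
```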
Optimization Strategies
Short-term (Week 1-2)
- Increase Redis cache TTL (config sketch after this list)
  - Current: 1 hour
  - Proposed: 6 hours
  - Expected: -30ms p99
- Model quantization (sketch after this list)
  - FP32 → FP16
  - Expected: -20ms inference time
  - Accuracy impact: <0.1%
- Batch inference (sketch after this list)
  - Group requests by 10ms window
  - Expected: -40ms p99
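A minimal sketch of the TTL bump, assuming redis-py and a `features:{user_id}` key layout (both assumptions, not necessarily our actual key scheme); in practice the value would come from config:

```python
import json
import redis

FEATURE_TTL_SECONDS = 6 * 60 * 60  # proposed: 6h (was 1h)

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_features(user_id: str, features: dict) -> None:
    # SETEX writes the value and its expiry atomically.
    r.setex(f"features:{user_id}", FEATURE_TTL_SECONDS, json.dumps(features))
```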
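Sketch of the FP32 → FP16 cast, assuming the model is a PyTorch module and the serving runtime has decent FP16 kernels (FP16 on plain CPU can actually be slower); the accuracy check against Precision@10 would run before rollout:

```python
import torch

def to_fp16(model: torch.nn.Module) -> torch.nn.Module:
    """Cast an already-loaded FP32 model to FP16 for inference."""
    return model.half().eval()

@torch.inference_mode()
def score(model_fp16: torch.nn.Module, features: torch.Tensor) -> torch.Tensor:
    # Inputs must be cast to the same dtype as the weights.
    return model_fp16(features.half())
```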
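Sketch of request batching with a 10ms collection window, assuming an asyncio-based serving layer; `run_inference_batch` is a hypothetical batched model call and the batch-size cap is a placeholder:

```python
import asyncio

BATCH_WINDOW_SECONDS = 0.010  # 10ms collection window
MAX_BATCH_SIZE = 64

_queue: asyncio.Queue = asyncio.Queue()

async def submit(features):
    """Called per request; returns once the whole batch has been scored."""
    fut = asyncio.get_running_loop().create_future()
    await _queue.put((features, fut))
    return await fut

async def batcher() -> None:
    # Start once at app startup, e.g. asyncio.create_task(batcher()).
    loop = asyncio.get_running_loop()
    while True:
        first = await _queue.get()
        batch = [first]
        deadline = loop.time() + BATCH_WINDOW_SECONDS
        # Keep collecting until the window closes or the batch is full.
        while len(batch) < MAX_BATCH_SIZE:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(_queue.get(), remaining))
            except asyncio.TimeoutError:
                break
        scores = run_inference_batch([f for f, _ in batch])  # hypothetical batched call
        for (_, fut), score in zip(batch, scores):
            fut.set_result(score)
```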
Long-term (Week 3-4)
- GPU inference (sketch after this list)
  - Move to GPU-accelerated serving
  - Expected: -60ms inference time
  - Cost: +$500/month
- Model distillation (sketch after this list)
  - Train smaller student model
  - Expected: -50ms inference time
  - Accuracy impact: -2%
- Feature store (sketch after this list)
  - Precompute all features
  - Expected: -70ms feature retrieval
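Sketch of GPU serving, assuming PyTorch on a CUDA node; in practice this would sit behind the batching layer sketched above so the GPU sees full batches:

```python
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

def to_gpu(model: torch.nn.Module) -> torch.nn.Module:
    """Move an already-loaded model onto the GPU for serving."""
    return model.to(device).eval()

@torch.inference_mode()
def score_batch(model: torch.nn.Module, features: torch.Tensor) -> torch.Tensor:
    # Copy the batch to the device, score, and return results to the host.
    return model(features.to(device, non_blocking=True)).cpu()
```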
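Sketch of the distillation training step, assuming PyTorch; the student architecture, temperature, and data pipeline are placeholders showing the shape of the approach, not a tuned recipe:

```python
import torch
import torch.nn.functional as F

def distill_step(teacher, student, optimizer, features, temperature=2.0):
    """One knowledge-distillation step: student matches softened teacher scores."""
    with torch.no_grad():
        teacher_logits = teacher(features)
    student_logits = student(features)
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```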
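Sketch of an offline precompute job that materializes features into Redis ahead of request time; `all_user_ids` and `compute_features` are hypothetical, and a managed feature store could replace the hand-rolled version:

```python
import json
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def precompute_features(batch_size: int = 1000) -> None:
    """Offline job: write features for every user so requests never hit PostgreSQL."""
    pipe = r.pipeline()
    for i, user_id in enumerate(all_user_ids(), start=1):   # hypothetical user iterator
        features = compute_features(user_id)                 # hypothetical feature builder
        pipe.set(f"features:{user_id}", json.dumps(features))
        if i % batch_size == 0:
            pipe.execute()
            pipe = r.pipeline()
    pipe.execute()
```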
Testing Plan
- Load test with k6 (5000 req/s)
- A/B test accuracy vs latency tradeoff
- Monitor cache hit rates (sketch after this list)
- Measure GPU cost/benefit
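Sketch of checking cache hit rate from Redis's own counters, assuming redis-py; note `keyspace_hits`/`keyspace_misses` are server-wide, so per-key-prefix rates would need app-level counters:

```python
import redis

r = redis.Redis(host="localhost", port=6379)

def cache_hit_rate() -> float:
    """Server-wide hit rate derived from the INFO stats section."""
    stats = r.info("stats")
    hits = stats.get("keyspace_hits", 0)
    misses = stats.get("keyspace_misses", 0)
    total = hits + misses
    return hits / total if total else 0.0
```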
Success Metrics
- p99 latency < 200ms
- Cache hit rate > 90%
- Maintain Precision@10 > 0.80
Priority: HIGH - SLA breach
cc: @dmitry @bill_staples