Optimize recommendation model inference latency - p99 exceeds SLA

Performance Issue

Current State

Recommendation API latency is exceeding our SLA:

  • p50: 45ms (target: <50ms)
  • p95: 178ms (target: <150ms)
  • p99: 412ms (target: <200ms) CRITICAL

Impact

  • 1% of requests timeout (>500ms)
  • Poor user experience for mobile users
  • Revenue impact: ~$8K/month from abandoned sessions
  • SLA breach: 99th percentile consistently above 200ms

Root Cause Analysis

Profiling Results (from CI performance tests; a timing sketch follows this list):

  1. Model inference: 89ms (hot path)

    • SVD matrix multiplication: 67ms
    • Post-processing: 22ms
  2. Feature retrieval: 78ms

    • Redis cache miss rate: 23%
    • Cold-cache requests fall back to PostgreSQL: 156ms
  3. Result ranking: 45ms

    • Sorting 500 candidates
    • Diversity filtering
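
For reference, a minimal sketch of how the per-stage timings above might be captured in the serving path. The function and stage names here are hypothetical, not the actual service code:

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(timings_ms: dict, stage: str):
    # Record the wall-clock duration of one pipeline stage in milliseconds.
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0

def serve_recommendations(user_id, retrieve_features, model, rank_and_diversify):
    timings = {}
    with timed(timings, "feature_retrieval"):   # Redis, PostgreSQL on cold miss
        features = retrieve_features(user_id)
    with timed(timings, "model_inference"):     # SVD matmul + post-processing
        scores = model(features)
    with timed(timings, "result_ranking"):      # sort ~500 candidates, diversity filter
        results = rank_and_diversify(scores)
    return results, timings
```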

Optimization Strategies

Short-term (Weeks 1-2)

  1. Increase Redis cache TTL (sketch after this list)

    • Current: 1 hour
    • Proposed: 6 hours
    • Expected: -30ms p99
  2. Model quantization (sketch after this list)

    • FP32 → FP16
    • Expected: -20ms inference time
    • Accuracy impact: <0.1%
  3. Batch inference (sketch after this list)

    • Group requests arriving within a 10ms window
    • Expected: -40ms p99
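
A minimal sketch of item 1, assuming redis-py and a features:<user_id> key scheme (both assumptions; the TTL jitter is also an addition to avoid synchronized expiry, not something stated above):

```python
import json
import random
import redis

r = redis.Redis()                      # connection details omitted
FEATURE_TTL_S = 6 * 60 * 60            # proposed 6h TTL (was 1h)

def cache_features(user_id: str, features: dict) -> None:
    # Jitter the TTL by +/-10% so keys written in the same burst do not
    # all expire (and miss) at the same instant.
    ttl = int(FEATURE_TTL_S * random.uniform(0.9, 1.1))
    r.set(f"features:{user_id}", json.dumps(features), ex=ttl)
```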
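
For item 2, a sketch of the FP32 → FP16 cast with a spot-check of the accuracy impact, assuming the SVD model reduces to two NumPy factor matrices. File names and shapes are illustrative, and the realized speedup depends on hardware half-precision support:

```python
import numpy as np

user_factors = np.load("user_factors.npy")    # (n_users, k) float32, hypothetical path
item_factors = np.load("item_factors.npy")    # (n_items, k) float32, hypothetical path

# FP32 -> FP16 halves the memory traffic of the hot matmul.
user_f16 = user_factors.astype(np.float16)
item_f16 = item_factors.astype(np.float16)

# Spot-check the scoring error introduced by the cast on a user sample.
sample = np.random.default_rng(0).integers(0, len(user_factors), size=1_000)
ref = user_factors[sample] @ item_factors.T
quant = (user_f16[sample] @ item_f16.T).astype(np.float32)
rel_err = np.abs(ref - quant).max() / np.abs(ref).max()
print(f"max relative scoring error after FP16 cast: {rel_err:.4%}")
```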
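
For item 3, a sketch of the 10ms micro-batching window using asyncio. The MAX_BATCH cap and the batch_predict signature are assumptions, not part of the proposal:

```python
import asyncio

BATCH_WINDOW_S = 0.010    # collect concurrent requests for up to 10ms
MAX_BATCH = 64            # cap batch size (assumption)

class MicroBatcher:
    """Queue concurrent requests and serve them with one batched inference call."""

    def __init__(self, batch_predict):
        self._batch_predict = batch_predict            # list[features] -> list[scores]
        self._queue: asyncio.Queue = asyncio.Queue()

    async def predict(self, features):
        fut = asyncio.get_running_loop().create_future()
        await self._queue.put((features, fut))
        return await fut                                # resolved by run()

    async def run(self):
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self._queue.get()]           # block until first request
            deadline = loop.time() + BATCH_WINDOW_S
            while len(batch) < MAX_BATCH:
                timeout = deadline - loop.time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self._queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            feats, futs = zip(*batch)
            for fut, scores in zip(futs, self._batch_predict(list(feats))):
                fut.set_result(scores)
```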

Long-term (Weeks 3-4)

  1. GPU inference (sketch after this list)

    • Move to GPU-accelerated serving
    • Expected: -60ms inference time
    • Cost: +$500/month
  2. Model distillation (sketch after this list)

    • Train a smaller student model
    • Expected: -50ms inference time
    • Accuracy impact: -2%
  3. Feature store (sketch after this list)

    • Precompute all features
    • Expected: -70ms feature retrieval
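
For item 1 of the long-term list, a sketch of GPU-hosted scoring with PyTorch, assuming the SVD item factors fit in GPU memory. The file name and FP16 storage are assumptions:

```python
import torch

# Keep the item-factor matrix resident on the GPU; score a batch with one matmul.
item_factors = torch.load("item_factors.pt").to("cuda", dtype=torch.float16)

@torch.inference_mode()
def score_batch(user_vecs: torch.Tensor) -> torch.Tensor:
    # user_vecs: (batch, k) user factors on CPU; one transfer, one matmul.
    return user_vecs.to("cuda", dtype=torch.float16) @ item_factors.T
```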
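
For item 2, a sketch of response-based distillation: a smaller student embedding pair is trained to regress the teacher's scores. The factor dimension, batch size, and step count are illustrative:

```python
import torch
import torch.nn as nn

teacher_user = torch.load("user_factors.pt")   # (n_users, k_teacher), hypothetical
teacher_item = torch.load("item_factors.pt")   # (n_items, k_teacher), hypothetical

n_users, n_items, k_student = teacher_user.shape[0], teacher_item.shape[0], 64
student_user = nn.Embedding(n_users, k_student)
student_item = nn.Embedding(n_items, k_student)
opt = torch.optim.Adam(
    [*student_user.parameters(), *student_item.parameters()], lr=1e-3
)

for step in range(10_000):
    u = torch.randint(0, n_users, (512,))
    i = torch.randint(0, n_items, (512,))
    with torch.no_grad():
        target = (teacher_user[u] * teacher_item[i]).sum(-1)   # teacher scores
    pred = (student_user(u) * student_item(i)).sum(-1)         # student scores
    loss = nn.functional.mse_loss(pred, target)
    opt.zero_grad()
    loss.backward()
    opt.step()
```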
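
For item 3, a sketch of the offline precompute job, assuming the store is backed by the same Redis instance and a hypothetical build_features helper:

```python
import json
import redis

r = redis.Redis()

def precompute_features(user_ids, build_features):
    # Offline batch job: materialize every user's features ahead of time so
    # the serving path never falls back to PostgreSQL.
    pipe = r.pipeline(transaction=False)
    for user_id in user_ids:
        pipe.set(f"features:{user_id}", json.dumps(build_features(user_id)))
    pipe.execute()
```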

Testing Plan

  • Load test with k6 (5000 req/s)
  • A/B test accuracy vs latency tradeoff
  • Monitor cache hit rates
  • Measure GPU cost/benefit

Success Metrics

  • p99 latency < 200ms
  • Cache hit rate > 90%
  • Maintain Precision@10 > 0.80 (definition sketched below)
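
For the accuracy guardrail, a sketch of the Precision@10 computation behind the last metric (the helper name is ours, not from the existing eval code):

```python
def precision_at_10(recommended, relevant) -> float:
    # Fraction of the top-10 recommended items that the user found relevant.
    top10 = list(recommended)[:10]
    return sum(1 for item in top10 if item in set(relevant)) / 10

# 8 relevant items in the top 10 -> 0.8, exactly at the success threshold.
assert precision_at_10(range(10), range(8)) == 0.8
```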

Priority: HIGH - SLA breach

cc: @dmitry @bill_staples