PRODUCTION: Recommendation latency spike in us-east-1

Incident Report

Severity: P1 - Production Impact
Started: 2025-01-15 14:32 UTC
Status: Investigating

Symptoms

  • Recommendation API p95 latency: 450ms (normal: 50ms)
  • Error rate increased to 2.3%
  • Affecting ~15% of users in us-east-1 region

Timeline

  • 14:32 - Alerts triggered for high latency
  • 14:35 - On-call engineer paged (@bill_staples)
  • 14:40 - Identified Redis cache hit rate dropped to 23% (normal: 85%)
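To put the hit-rate drop in perspective, here is a rough back-of-the-envelope sketch of how much extra traffic falls through the cache. It assumes request volume stayed constant and that every miss goes to the backing store; neither has been confirmed for this incident. The function name `miss_traffic_multiplier` is illustrative only.

```python
def miss_traffic_multiplier(normal_hit_rate: float, current_hit_rate: float) -> float:
    """Ratio of cache-miss traffic now versus normal, at constant request volume."""
    if not (0.0 <= normal_hit_rate < 1.0):
        raise ValueError("normal_hit_rate must be in [0, 1)")
    if not (0.0 <= current_hit_rate <= 1.0):
        raise ValueError("current_hit_rate must be in [0, 1]")
    normal_miss_rate = 1.0 - normal_hit_rate
    current_miss_rate = 1.0 - current_hit_rate
    return current_miss_rate / normal_miss_rate


# Hit rates from the timeline: 85% normal, 23% during the incident.
multiplier = miss_traffic_multiplier(0.85, 0.23)
print(f"Miss traffic is roughly {multiplier:.1f}x normal")
```

If those assumptions hold, misses went from 15% to 77% of lookups, so whatever serves cache misses is seeing roughly five times its usual load, which would be consistent with the latency spike.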

Hypothesis

  • Redis cluster in us-east-1 may be experiencing issues
  • Possible memory pressure or network partition
  • Need to check Redis metrics and logs
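As a starting point for the metrics check, the sketch below parses the text returned by the Redis `INFO` command and pulls out the fields most relevant to the memory-pressure hypothesis: `keyspace_hits`, `keyspace_misses`, `evicted_keys`, `used_memory`, and `maxmemory`. The helper names and the sample values are made up for illustration and are not taken from the affected cluster.

```python
def parse_redis_info(text: str) -> dict:
    """Parse `INFO` output (key:value lines, '#' section headers) into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and section headers
        key, sep, value = line.partition(":")
        if sep:
            fields[key] = value
    return fields


def summarize_cache_health(fields: dict) -> dict:
    """Derive hit rate and memory usage from parsed INFO fields."""
    hits = int(fields.get("keyspace_hits", 0))
    misses = int(fields.get("keyspace_misses", 0))
    lookups = hits + misses
    used = int(fields.get("used_memory", 0))
    limit = int(fields.get("maxmemory", 0))  # 0 means no memory limit is set
    return {
        "hit_rate": hits / lookups if lookups else None,
        "memory_used_fraction": used / limit if limit else None,
        "evicted_keys": int(fields.get("evicted_keys", 0)),
    }


# Illustrative sample only, NOT real output from the us-east-1 cluster.
sample_info = """
# Stats
keyspace_hits:230
keyspace_misses:770
evicted_keys:1200

# Memory
used_memory:950
maxmemory:1000
"""

summary = summarize_cache_health(parse_redis_info(sample_info))
print(summary)
```

Note that the hit and miss counters are cumulative since the server started (or since the last `CONFIG RESETSTAT`), so to see the hit rate during the incident window, take two samples and compute the rate from the difference. A rising `evicted_keys` count alongside memory usage near the limit would support the memory-pressure hypothesis; a network partition would not show up in these fields.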

Action Items

  • Check Redis cluster health
  • Review recent deployments (was the CF algorithm deployed today?)
  • Consider failover to us-west-2
  • Root cause analysis

cc: @dmitry @bill_staples