PRODUCTION: Recommendation latency spike in us-east-1
Incident Report
Severity: P1 - Production Impact Started: 2025-01-15 14:32 UTC Status: Investigating
Symptoms
- Recommendation API p95 latency: 450ms (normal: 50ms)
- Error rate increased to 2.3%
- Affecting ~15% of users in us-east-1 region
Timeline
- 14:32 - Alerts triggered for high latency
- 14:35 - On-call engineer paged (@bill_staples)
- 14:40 - Identified Redis cache hit rate dropped to 23% (normal: 85%)
Hypothesis
- Redis cluster in us-east-1 may be experiencing issues
- Possible memory pressure or network partition
- Need to check Redis metrics and logs
Action Items
-
Check Redis cluster health -
Review recent deployments (was CF algorithm deployed today?) -
Consider failover to us-west-2 -
Root cause analysis
cc: @dmitry @bill_staples