Implement circuit breaker pattern for downstream service calls
Architecture Improvement
Problem
The streaming service has no circuit breaker protection on its calls to downstream services.
Current behavior when user-service is down:
- Every request waits for 30s timeout
- Thread pool exhaustion
- Cascading failure across all services
- Mean time to recovery: 15+ minutes
Recent Incident
2025-10-08 03:45 UTC: user-service developed a memory leak
- user-service response time: 30s+ (timeout)
- streaming-service: All threads blocked waiting
- Complete service outage: 18 minutes
- Revenue impact: $4,500
Proposed Solution
Implement the circuit breaker pattern using Netflix Hystrix:
@HystrixCommand(
    fallbackMethod = "getUserFallback",
    commandProperties = {
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
        @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000")
    }
)
public User getUserById(String userId) {
    return userServiceClient.getUser(userId);
}

// Invoked when the command fails, times out, or the circuit is open
public User getUserFallback(String userId, Throwable t) {
    // Return cached user or an anonymous default
    return userCache.get(userId).orElse(User.anonymous());
}
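With Hystrix's default thread isolation, each command executes on its own bounded thread pool, so a slow user-service can only exhaust that pool; caller threads fail fast into the fallback instead of blocking. This directly addresses the thread pool exhaustion from the incident above.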
Circuit States
- CLOSED (normal): All requests pass through
- OPEN (failure): Fail fast, return fallback
- HALF-OPEN (recovery): Test if service recovered
State transitions:
- CLOSED → OPEN: When error rate exceeds 50% across at least 10 requests in the rolling window
- OPEN → HALF-OPEN: After 5 second sleep window
- HALF-OPEN → CLOSED: When test request succeeds
- HALF-OPEN → OPEN: When test request fails
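For intuition, these transitions can be sketched as a minimal state machine. This is illustrative only: Hystrix implements the same logic internally over rolling statistical windows, and SimpleCircuitBreaker is not a real Hystrix class.

// Minimal sketch of the three states and transitions described above.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int VOLUME_THRESHOLD = 10;     // min requests before evaluating
    private static final double ERROR_THRESHOLD = 0.50; // 50% error rate
    private static final long SLEEP_WINDOW_MS = 5_000;  // wait before HALF_OPEN probe

    private State state = State.CLOSED;
    private int requests;
    private int failures;
    private long openedAt;

    public synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= SLEEP_WINDOW_MS) {
            state = State.HALF_OPEN;  // OPEN -> HALF-OPEN: allow a single probe through
        }
        return state != State.OPEN;   // OPEN: fail fast, caller uses fallback
    }

    public synchronized void onSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;     // HALF-OPEN -> CLOSED: probe succeeded
            requests = failures = 0;
        }
        requests++;
    }

    public synchronized void onFailure() {
        if (state == State.HALF_OPEN) {
            trip();                   // HALF-OPEN -> OPEN: probe failed
            return;
        }
        requests++;
        failures++;
        if (requests >= VOLUME_THRESHOLD
                && (double) failures / requests > ERROR_THRESHOLD) {
            trip();                   // CLOSED -> OPEN: error rate over threshold
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = System.currentTimeMillis();
        requests = failures = 0;
    }
}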
Services to Protect
- user-service calls:
  - GET /users/{id} - authentication
  - POST /users/validate - token validation
  - Fallback: Return anonymous user
- payment-service calls:
  - POST /payments/process - payment processing
  - GET /payments/{id}/status - status check
  - Fallback: Queue for retry (see sketch below)
- recommendation-service calls:
  - GET /recommendations/{userId} - personalized recs
  - Fallback: Return popular items
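For the payment path, the "queue for retry" fallback could look roughly like this. A sketch only: retryQueue, PaymentRequest, and PaymentResult are illustrative names (backed by whatever durable queue we pick), not existing classes.

@HystrixCommand(fallbackMethod = "processPaymentFallback")
public PaymentResult processPayment(PaymentRequest request) {
    return paymentServiceClient.process(request);
}

// Fallback: never drop a payment. Enqueue it on a durable retry queue
// and report it as pending to the caller. retryQueue is illustrative.
public PaymentResult processPaymentFallback(PaymentRequest request, Throwable t) {
    retryQueue.enqueue(request);
    return PaymentResult.pending(request.getId());
}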
Monitoring
Add circuit breaker metrics:
// Gauge for current state, counter for fallback invocations, timer for call latency
metrics.gauge("circuit.breaker.state", () -> circuitBreaker.getState());
metrics.counter("circuit.breaker.fallback.calls");
metrics.timer("circuit.breaker.call.duration");
Dashboards:
- Circuit breaker states over time
- Fallback invocation rate
- Success/failure rates
- Response time percentiles
Testing Strategy
- Chaos engineering:
  - Kill downstream service randomly
  - Verify circuit opens
  - Verify fallback works
- Load testing:
  - Simulate downstream latency
  - Measure thread pool usage
  - Verify no cascading failures
- Integration tests:

@Test
void circuitOpensAfterFailureThreshold() {
    // Fail 6 out of 10 requests (error rate above the 50% threshold)
    for (int i = 0; i < 10; i++) {
        if (i < 6) mockService.fail(); else mockService.succeed();
        service.call();
    }
    assertThat(circuitBreaker.getState()).isEqualTo(OPEN);
}
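Here mockService, service, and circuitBreaker come from the test harness; with Hystrix itself, a command's breaker can be fetched for assertions via HystrixCircuitBreaker.Factory.getInstance(commandKey) and checked with isOpen().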
Implementation Plan
Week 1: Add Hystrix dependency
- Update pom.xml
- Configure Hystrix properties
- Add basic circuit breaker
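For the "Configure Hystrix properties" step, global defaults can be set through Archaius before the first command executes; the per-command @HystrixProperty values in the Proposed Solution override them. A sketch (HystrixDefaults is an illustrative name; the keys are the standard hystrix.command.default.* properties):

import com.netflix.config.ConfigurationManager;

public class HystrixDefaults {
    // Apply global defaults at startup, before any command runs.
    public static void apply() {
        ConfigurationManager.getConfigInstance()
                .setProperty("hystrix.command.default.circuitBreaker.requestVolumeThreshold", 10);
        ConfigurationManager.getConfigInstance()
                .setProperty("hystrix.command.default.circuitBreaker.errorThresholdPercentage", 50);
        ConfigurationManager.getConfigInstance()
                .setProperty("hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds", 5000);
    }
}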
Week 2: Implement fallback logic
- Cached responses
- Default values
- Queue for retry
Week 3: Monitoring & alerting
- Prometheus metrics
- Grafana dashboards
- PagerDuty alerts
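For the Prometheus step, Hystrix exposes per-command metrics through its pluggable metrics publisher (HystrixPlugins); a third-party publisher such as SoundCloud's prometheus-hystrix can export them alongside the custom metrics above. Library fit to be verified during this week.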
Week 4: Chaos testing
- Simulate failures
- Tune thresholds
- Document runbooks
Success Metrics
- Mean time to recovery < 2 minutes
- Zero cascading failures
- Fallback success rate > 95%
- Thread pool utilization < 70%
Related Work
- Depends on: #7 (Log4Shell fix - need stable baseline)
- Relates to: user-service#8 (connection pool issue)
- Relates to: orders-service#4 (similar circuit breaker need)
Priority: HIGH - Prevent cascading failures
Epic: Resilience Engineering
cc: @stan @bill_staples @jean_gabriel