Implement circuit breaker pattern for downstream service calls
Architecture Improvement
Problem
The streaming service has no circuit breaker protection on its calls to downstream services.
Current behavior when user-service is down:
- Every request waits for 30s timeout
- Thread pool exhaustion
- Cascading failure across all services
- Mean time to recovery: 15+ minutes
Recent Incident
2025-10-08 03:45 UTC: user-service developed a memory leak
- user-service response time: 30s+ (timeout)
- streaming-service: All threads blocked waiting
- Complete service outage: 18 minutes
- Revenue impact: $4,500
Proposed Solution
Implement the circuit breaker pattern using Netflix Hystrix:
@HystrixCommand(
    fallbackMethod = "getUserFallback",
    commandProperties = {
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
        @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000")
    }
)
public User getUserById(String userId) {
    return userServiceClient.getUser(userId);
}

// Invoked when the command fails, times out, or the circuit is open
public User getUserFallback(String userId, Throwable t) {
    // Return cached user or an anonymous default
    return userCache.get(userId).orElse(User.anonymous());
}
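With Hystrix's default thread isolation, each command executes on its own bounded thread pool, so a slow user-service can only exhaust that pool; caller threads fail fast into the fallback instead of blocking. This directly addresses the thread pool exhaustion from the incident above.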
Circuit States
- CLOSED (normal): All requests pass through
- OPEN (failure): Fail fast, return fallback
- HALF-OPEN (recovery): Test if service recovered
State transitions:
- CLOSED → OPEN: When error rate exceeds 50% across at least 10 requests in the rolling window
- OPEN → HALF-OPEN: After 5 second sleep window
- HALF-OPEN → CLOSED: When test request succeeds
- HALF-OPEN → OPEN: When test request fails
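For intuition, these transitions can be sketched as a minimal state machine. This is illustrative only: Hystrix implements the same logic internally over rolling statistical windows, and SimpleCircuitBreaker is not a real Hystrix class.

// Minimal sketch of the three states and transitions described above.
public class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private static final int VOLUME_THRESHOLD = 10;     // min requests before evaluating
    private static final double ERROR_THRESHOLD = 0.50; // 50% error rate
    private static final long SLEEP_WINDOW_MS = 5_000;  // wait before HALF_OPEN probe

    private State state = State.CLOSED;
    private int requests;
    private int failures;
    private long openedAt;

    public synchronized boolean allowRequest() {
        if (state == State.OPEN && System.currentTimeMillis() - openedAt >= SLEEP_WINDOW_MS) {
            state = State.HALF_OPEN;  // OPEN -> HALF-OPEN: allow a single probe through
        }
        return state != State.OPEN;   // OPEN: fail fast, caller uses fallback
    }

    public synchronized void onSuccess() {
        if (state == State.HALF_OPEN) {
            state = State.CLOSED;     // HALF-OPEN -> CLOSED: probe succeeded
            requests = failures = 0;
        }
        requests++;
    }

    public synchronized void onFailure() {
        if (state == State.HALF_OPEN) {
            trip();                   // HALF-OPEN -> OPEN: probe failed
            return;
        }
        requests++;
        failures++;
        if (requests >= VOLUME_THRESHOLD
                && (double) failures / requests > ERROR_THRESHOLD) {
            trip();                   // CLOSED -> OPEN: error rate over threshold
        }
    }

    private void trip() {
        state = State.OPEN;
        openedAt = System.currentTimeMillis();
        requests = failures = 0;
    }
}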
Services to Protect
- user-service calls:
  - GET /users/{id} - authentication
  - POST /users/validate - token validation
  - Fallback: Return anonymous user
- payment-service calls:
  - POST /payments/process - payment processing
  - GET /payments/{id}/status - status check
  - Fallback: Queue for retry (see sketch below)
- recommendation-service calls:
  - GET /recommendations/{userId} - personalized recs
  - Fallback: Return popular items
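For the payment path, the "queue for retry" fallback could look roughly like this. A sketch only: retryQueue, PaymentRequest, and PaymentResult are illustrative names (backed by whatever durable queue we pick), not existing classes.

@HystrixCommand(fallbackMethod = "processPaymentFallback")
public PaymentResult processPayment(PaymentRequest request) {
    return paymentServiceClient.process(request);
}

// Fallback: never drop a payment. Enqueue it on a durable retry queue
// and report it as pending to the caller. retryQueue is illustrative.
public PaymentResult processPaymentFallback(PaymentRequest request, Throwable t) {
    retryQueue.enqueue(request);
    return PaymentResult.pending(request.getId());
}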
Monitoring
Add circuit breaker metrics:
// Gauge for current state, counter for fallback invocations, timer for call latency
metrics.gauge("circuit.breaker.state", () -> circuitBreaker.getState());
metrics.counter("circuit.breaker.fallback.calls");
metrics.timer("circuit.breaker.call.duration");
Dashboards:
- Circuit breaker states over time
- Fallback invocation rate
- Success/failure rates
- Response time percentiles
Testing Strategy
- Chaos engineering:
  - Kill downstream service randomly
  - Verify circuit opens
  - Verify fallback works
- Load testing:
  - Simulate downstream latency
  - Measure thread pool usage
  - Verify no cascading failures
- Integration tests:

@Test
void circuitOpensAfterFailureThreshold() {
    // Fail 6 out of 10 requests (error rate above the 50% threshold)
    for (int i = 0; i < 10; i++) {
        if (i < 6) mockService.fail(); else mockService.succeed();
        service.call();
    }
    assertThat(circuitBreaker.getState()).isEqualTo(OPEN);
}
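Here mockService, service, and circuitBreaker come from the test harness; with Hystrix itself, a command's breaker can be fetched for assertions via HystrixCircuitBreaker.Factory.getInstance(commandKey) and checked with isOpen().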
Implementation Plan
Week 1: Add Hystrix dependency
- Update pom.xml
- Configure Hystrix properties
- Add basic circuit breaker
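For the "Configure Hystrix properties" step, global defaults can be set through Archaius before the first command executes; the per-command @HystrixProperty values in the Proposed Solution override them. A sketch (HystrixDefaults is an illustrative name; the keys are the standard hystrix.command.default.* properties):

import com.netflix.config.ConfigurationManager;

public class HystrixDefaults {
    // Apply global defaults at startup, before any command runs.
    public static void apply() {
        ConfigurationManager.getConfigInstance()
                .setProperty("hystrix.command.default.circuitBreaker.requestVolumeThreshold", 10);
        ConfigurationManager.getConfigInstance()
                .setProperty("hystrix.command.default.circuitBreaker.errorThresholdPercentage", 50);
        ConfigurationManager.getConfigInstance()
                .setProperty("hystrix.command.default.circuitBreaker.sleepWindowInMilliseconds", 5000);
    }
}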
Week 2: Implement fallback logic
- Cached responses
- Default values
- Queue for retry
Week 3: Monitoring & alerting
- Prometheus metrics
- Grafana dashboards
- PagerDuty alerts
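For the Prometheus step, Hystrix exposes per-command metrics through its pluggable metrics publisher (HystrixPlugins); a third-party publisher such as SoundCloud's prometheus-hystrix can export them alongside the custom metrics above. Library fit to be verified during this week.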
Week 4: Chaos testing
- Simulate failures
- Tune thresholds
- Document runbooks
Success Metrics
- Mean time to recovery < 2 minutes
- Zero cascading failures
- Fallback success rate > 95%
- Thread pool utilization < 70%
Related Work
- Depends on: #7 (Log4Shell fix - need stable baseline)
- Relates to: user-service#8 (connection pool issue)
- Relates to: orders-service#4 (similar circuit breaker need)
Priority: HIGH - Prevent cascading failures
Epic: Resilience Engineering
cc: @stan @bill_staples @jean_gabriel