Implement circuit breaker pattern for downstream service calls

Architecture Improvement

Problem

The streaming service has no circuit breaker protection when calling downstream services.

Current behavior when user-service is down:

  1. Every request waits for 30s timeout
  2. Thread pool exhaustion
  3. Cascading failure across all services
  4. Mean time to recovery: 15+ minutes

Recent Incident

2025-10-08 03:45 UTC: user-service had a memory leak:

  • user-service response time: 30s+ (timeout)
  • streaming-service: All threads blocked waiting
  • Complete service outage: 18 minutes
  • Revenue impact: $4,500

Proposed Solution

Implement the circuit breaker pattern using Netflix Hystrix:

import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import com.netflix.hystrix.contrib.javanica.annotation.HystrixProperty;

@HystrixCommand(
    fallbackMethod = "getUserFallback",
    commandProperties = {
        // Minimum number of requests in the rolling window before the circuit can trip
        @HystrixProperty(name = "circuitBreaker.requestVolumeThreshold", value = "10"),
        // Trip the circuit once 50% or more of those requests fail
        @HystrixProperty(name = "circuitBreaker.errorThresholdPercentage", value = "50"),
        // Stay open for 5 seconds before letting a test request through (half-open)
        @HystrixProperty(name = "circuitBreaker.sleepWindowInMilliseconds", value = "5000")
    }
)
public User getUserById(String userId) {
    return userServiceClient.getUser(userId);
}

// Fallback signature matches the command, plus an optional Throwable carrying the failure cause
public User getUserFallback(String userId, Throwable t) {
    // Return a cached user or an anonymous default so requests degrade instead of failing
    return userCache.get(userId).orElse(User.anonymous());
}

Circuit States

  1. CLOSED (normal): All requests pass through
  2. OPEN (failure): Fail fast, return fallback
  3. HALF-OPEN (recovery): Test if service recovered

State transitions (see the sketch after this list):

  • CLOSED → OPEN: When the error rate exceeds 50% across at least 10 requests
  • OPEN → HALF-OPEN: After the 5-second sleep window elapses
  • HALF-OPEN → CLOSED: When the test request succeeds
  • HALF-OPEN → OPEN: When the test request fails
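
To make these transitions concrete, here is a minimal, illustrative state machine in plain Java. It is not the Hystrix implementation: it uses simple counters instead of Hystrix's rolling statistics window, and the class name SimpleCircuitBreaker is hypothetical.

import java.time.Duration;
import java.time.Instant;

public class SimpleCircuitBreaker {

    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int requestVolumeThreshold = 10;              // minimum calls before the error rate is evaluated
    private final double errorThresholdPct = 50.0;               // CLOSED -> OPEN once the error rate reaches this
    private final Duration sleepWindow = Duration.ofSeconds(5);  // OPEN -> HALF_OPEN after this delay

    private State state = State.CLOSED;
    private int requests = 0;
    private int failures = 0;
    private Instant openedAt;

    public synchronized boolean allowRequest() {
        if (state == State.OPEN) {
            if (Duration.between(openedAt, Instant.now()).compareTo(sleepWindow) >= 0) {
                state = State.HALF_OPEN;  // sleep window elapsed: let a test request through
                return true;
            }
            return false;                 // fail fast while the circuit is open
        }
        return true;                      // CLOSED and HALF_OPEN allow the call
    }

    public synchronized void recordSuccess() {
        if (state == State.HALF_OPEN) {
            reset();                      // test request succeeded: HALF_OPEN -> CLOSED
        } else {
            requests++;
        }
    }

    public synchronized void recordFailure() {
        if (state == State.HALF_OPEN) {
            trip();                       // test request failed: HALF_OPEN -> OPEN
            return;
        }
        requests++;
        failures++;
        if (requests >= requestVolumeThreshold
                && (100.0 * failures / requests) >= errorThresholdPct) {
            trip();                       // error rate over threshold: CLOSED -> OPEN
        }
    }

    private void trip()  { state = State.OPEN;   openedAt = Instant.now(); requests = 0; failures = 0; }
    private void reset() { state = State.CLOSED; requests = 0; failures = 0; }

    public synchronized State getState() { return state; }
}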

Services to Protect

  1. user-service calls:

    • GET /users/{id} - authentication
    • POST /users/validate - token validation
    • Fallback: Return anonymous user
  2. payment-service calls:

    • POST /payments/process - payment processing
    • GET /payments/{id}/status - status check
    • Fallback: Queue for retry (see the sketch after this list)
  3. recommendation-service calls:

    • GET /recommendations/{userId} - personalized recs
    • Fallback: Return popular items
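
As one illustration of the "queue for retry" fallback, a minimal sketch in the same Hystrix style as the earlier example; PaymentRetryQueue-style names, PaymentResult.pending and the other payment types are hypothetical, not existing APIs:

@HystrixCommand(fallbackMethod = "processPaymentFallback")
public PaymentResult processPayment(PaymentRequest request) {
    return paymentServiceClient.process(request);
}

public PaymentResult processPaymentFallback(PaymentRequest request, Throwable t) {
    // payment-service is unavailable or the circuit is open: persist the request
    // so a background worker can retry it once the circuit closes again
    retryQueue.enqueue(request);
    return PaymentResult.pending(request.getId());
}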

Monitoring

Add circuit breaker metrics:

metrics.gauge("circuit.breaker.state", () -> circuitBreaker.getState());
metrics.counter("circuit.breaker.fallback.calls");
metrics.timer("circuit.breaker.call.duration");
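
A concrete version of these metrics, assuming Micrometer backed by the Prometheus registry planned for Week 3; the registry parameter, the "dependency" tag values, and the reuse of the SimpleCircuitBreaker sketch above are assumptions:

import io.micrometer.core.instrument.Gauge;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;

import java.util.function.Supplier;

public class CircuitBreakerMetrics {

    // Expose the current state as a numeric gauge (CLOSED=0, OPEN=1, HALF_OPEN=2)
    public static void bindStateGauge(MeterRegistry registry, SimpleCircuitBreaker breaker) {
        Gauge.builder("circuit.breaker.state", breaker, b -> b.getState().ordinal())
             .tag("dependency", "user-service")
             .register(registry);
    }

    // Count every fallback invocation
    public static void recordFallback(MeterRegistry registry) {
        registry.counter("circuit.breaker.fallback.calls", "dependency", "user-service").increment();
    }

    // Time each protected call, whether it succeeds or falls back
    public static <T> T timeCall(MeterRegistry registry, Supplier<T> call) {
        Timer.Sample sample = Timer.start(registry);
        try {
            return call.get();
        } finally {
            sample.stop(registry.timer("circuit.breaker.call.duration", "dependency", "user-service"));
        }
    }
}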

Dashboards:

  • Circuit breaker states over time
  • Fallback invocation rate
  • Success/failure rates
  • Response time percentiles

Testing Strategy

  1. Chaos engineering:

    • Kill downstream service randomly
    • Verify circuit opens
    • Verify fallback works
  2. Load testing:

    • Simulate downstream latency
    • Measure thread pool usage
    • Verify no cascading failures
  3. Integration tests:

    @Test
    void circuitOpensAfterFailureThreshold() {
        // Fail 6 out of 10 requests: a 60% error rate exceeds the 50% error
        // threshold once the 10-request volume threshold is met, so the
        // circuit breaker should trip to OPEN
        for (int i = 0; i < 10; i++) {
            if (i < 6) mockService.fail();
            else mockService.succeed();
            service.call();
        }

        assertThat(circuitBreaker.getState()).isEqualTo(OPEN);
    }

Implementation Plan

Week 1: Add Hystrix dependency

  • Update pom.xml (see the dependency sketch below)
  • Configure Hystrix properties
  • Add basic circuit breaker
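
For the pom.xml update, a likely dependency declaration; the version (1.5.18, the final Hystrix release) is an assumption, and a Spring Cloud Netflix starter could be used instead:

<dependency>
    <groupId>com.netflix.hystrix</groupId>
    <artifactId>hystrix-javanica</artifactId>
    <version>1.5.18</version>
</dependency>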

Week 2: Implement fallback logic

  • Cached responses
  • Default values
  • Queue for retry

Week 3: Monitoring & alerting

  • Prometheus metrics
  • Grafana dashboards
  • PagerDuty alerts

Week 4: Chaos testing

  • Simulate failures
  • Tune thresholds
  • Document runbooks

Success Metrics

  • Mean time to recovery < 2 minutes
  • Zero cascading failures
  • Fallback success rate > 95%
  • Thread pool utilization < 70%

Related Work

  • Depends on: #7 (Log4Shell fix - need stable baseline)
  • Relates to: user-service#8 (connection pool issue)
  • Relates to: orders-service#4 (similar circuit breaker need)

Priority: HIGH - Prevent cascading failures

Epic: Resilience Engineering

cc: @stan @bill_staples @jean_gabriel