# Database connection pool exhaustion causing 503 errors

**Production Incident**

## Symptoms
Production user-service experiencing intermittent 503 errors:
- Frequency: 50-100 errors/hour during peak traffic
- Error: "HikariCP - Connection is not available, request timed out after 30000ms"
- Impact: 0.3% error rate (up from baseline 0.01%)
## Timeline

- 2025-10-09 14:23 UTC: First 503 errors detected
- 2025-10-09 14:45 UTC: Error rate spike to 150/hour
- 2025-10-09 15:12 UTC: Auto-scaling triggered (4→8 instances)
- 2025-10-09 15:30 UTC: Error rate decreased but still elevated
- 2025-10-09 16:00 UTC: Incident declared
## Root Cause
Connection pool configuration is insufficient for current load:
**Current Config (`application.yml`):**

```yaml
spring.datasource.hikari:
  maximum-pool-size: 10
  minimum-idle: 5
  connection-timeout: 30000
  idle-timeout: 600000
```
**Analysis:**

- Average request time: 120ms
- Peak RPS per instance: 200 req/s
- Concurrent DB connections needed (Little's Law: 200 req/s × 0.12 s): ~24
- Pool size (10) < needed connections (24) ❌
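For reference, the sizing arithmetic as a runnable sketch (the constants mirror the measured numbers above; nothing here is service code):

```java
// Little's Law: concurrency ≈ arrival rate × time each request holds a connection.
public class PoolSizing {
    public static void main(String[] args) {
        double peakRps = 200.0;     // peak requests/sec per instance
        double avgHoldSec = 0.120;  // average request time (120 ms)
        System.out.printf("Connections needed per instance: ~%.0f%n",
                peakRps * avgHoldSec); // prints ~24
    }
}
```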
## Load Testing Results

From the k6 load test (see CI pipeline):

```
scenarios: (100.00%) 1 scenario, 100 max VUs

✓ status was 200
✗ response time < 500ms
  ↳ 97% — ✓ 119834 / ✗ 3622

errors...................: 3.01% ✗ 3622 errors
http_req_duration........: avg=247ms p95=489ms p99=1.2s
```
## Proposed Fix

### Immediate (Today)

- Increase pool size:

```yaml
spring.datasource.hikari:
  maximum-pool-size: 30      # was 10
  minimum-idle: 15           # was 5
  connection-timeout: 20000  # fail faster (was 30000)
```
- Add connection pool monitoring:

```java
// Micrometer MeterRegistry ("metrics") and com.zaxxer.hikari.HikariPoolMXBean
// ("poolStats") assumed in scope.
metrics.gauge("hikari.connections.active", poolStats, HikariPoolMXBean::getActiveConnections);
metrics.gauge("hikari.connections.idle", poolStats, HikariPoolMXBean::getIdleConnections);
metrics.gauge("hikari.connections.pending", poolStats, HikariPoolMXBean::getThreadsAwaitingConnection);
```
### Short-term (Week 1)

- Database read replicas:
  - Route read queries to replicas (see the routing sketch after this list)
  - Reduce primary DB load by an estimated ~70%
- Query optimization:
  - Add indexes on frequently queried columns
  - Reduce N+1 query patterns
- Connection pooling per instance:
  - Scale pool size with instance CPU (toy heuristic below)
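A minimal sketch of read/write splitting with Spring's `AbstractRoutingDataSource`, assuming reads run in `@Transactional(readOnly = true)` transactions (the class name and the `"primary"`/`"replica"` lookup keys are illustrative):

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

// Picks the target DataSource per connection request; the "primary" and
// "replica" DataSources must be registered via setTargetDataSources(...).
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                ? "replica"
                : "primary";
    }
}
```

For the lookup to see the transaction's read-only flag, the routing source is typically wrapped in Spring's `LazyConnectionDataSourceProxy` so the physical connection is fetched only after the transaction has started.

For CPU-scaled pool sizing, a toy heuristic based on HikariCP's published guidance (`core_count * 2 + effective_spindle_count`; the spindle count here is an assumption):

```java
// Toy sizing heuristic; validate against the Little's Law numbers above before adopting.
int cores = Runtime.getRuntime().availableProcessors();
int effectiveSpindles = 1; // illustrative for SSD-backed storage
int poolSize = cores * 2 + effectiveSpindles;
```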
## Testing

- Load test with k6 shows improvement
- Deploy to staging
- Soak test for 24h
- Deploy to production
## Monitoring

New alerts:
- Pool utilization > 80%
- Connection wait time > 1s
- Failed connection attempts
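As a sketch of the first alert, assuming the pool metrics are scraped by Prometheus under Micrometer's HikariCP metric names (rule and label names are illustrative):

```yaml
# Illustrative Prometheus alerting rule for pool utilization > 80%.
groups:
  - name: hikari-pool
    rules:
      - alert: HikariPoolNearExhaustion
        expr: hikaricp_connections_active / hikaricp_connections_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HikariCP pool utilization above 80% ({{ $labels.pool }})"
```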
## Related Issues
- Relates to #2 (closed) (performance optimization)
- Blocked by: #7 (Log4Shell - need to deploy fix first)
**Priority:** CRITICAL - Production incident
**Severity:** HIGH - 0.3% error rate
cc: @stan @jean_gabriel @bill_staples