# Database connection pool exhaustion causing 503 errors

**Production Incident**

## Symptoms
Production user-service experiencing intermittent 503 errors:
- Frequency: 50-100 errors/hour during peak traffic
- Error: "HikariCP - Connection is not available, request timed out after 30000ms"
- Impact: 0.3% error rate (up from baseline 0.01%)
## Timeline

- 2025-10-09 14:23 UTC: First 503 errors detected
- 2025-10-09 14:45 UTC: Error rate spike to 150/hour
- 2025-10-09 15:12 UTC: Auto-scaling triggered (4→8 instances)
- 2025-10-09 15:30 UTC: Error rate decreased but still elevated
- 2025-10-09 16:00 UTC: Incident declared
## Root Cause
Connection pool configuration is insufficient for current load:
**Current Config (`application.yml`):**

```yaml
spring.datasource.hikari:
  maximum-pool-size: 10
  minimum-idle: 5
  connection-timeout: 30000
  idle-timeout: 600000
```
**Analysis:**

- Average request time: 120ms
- Peak RPS per instance: 200 req/s
- Concurrent DB connections needed (Little's Law: 200 req/s × 0.12 s): ~24
- Pool size (10) < needed connections (24) ❌
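For reference, the sizing arithmetic as a runnable sketch (the constants mirror the measured numbers above; nothing here is service code):

```java
// Little's Law: concurrency ≈ arrival rate × time each request holds a connection.
public class PoolSizing {
    public static void main(String[] args) {
        double peakRps = 200.0;     // peak requests/sec per instance
        double avgHoldSec = 0.120;  // average request time (120 ms)
        System.out.printf("Connections needed per instance: ~%.0f%n",
                peakRps * avgHoldSec); // prints ~24
    }
}
```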
## Load Testing Results

From the k6 load test (see CI pipeline):

```
scenarios: (100.00%) 1 scenario, 100 max VUs

✓ status was 200
✗ response time < 500ms
  ↳ 97% — ✓ 119834 / ✗ 3622

errors...................: 3.01% ✗ 3622 errors
http_req_duration........: avg=247ms p95=489ms p99=1.2s
```
## Proposed Fix

### Immediate (Today)

- Increase pool size:

```yaml
spring.datasource.hikari:
  maximum-pool-size: 30      # was 10
  minimum-idle: 15           # was 5
  connection-timeout: 20000  # fail faster (was 30000)
```
- Add connection pool monitoring:

```java
// Micrometer MeterRegistry ("metrics") and com.zaxxer.hikari.HikariPoolMXBean
// ("poolStats") assumed in scope.
metrics.gauge("hikari.connections.active", poolStats, HikariPoolMXBean::getActiveConnections);
metrics.gauge("hikari.connections.idle", poolStats, HikariPoolMXBean::getIdleConnections);
metrics.gauge("hikari.connections.pending", poolStats, HikariPoolMXBean::getThreadsAwaitingConnection);
```
### Short-term (Week 1)

- Database read replicas:
  - Route read queries to replicas (see the routing sketch after this list)
  - Reduce primary DB load by an estimated ~70%
- Query optimization:
  - Add indexes on frequently queried columns
  - Reduce N+1 query patterns
- Connection pooling per instance:
  - Scale pool size with instance CPU (toy heuristic below)
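A minimal sketch of read/write splitting with Spring's `AbstractRoutingDataSource`, assuming reads run in `@Transactional(readOnly = true)` transactions (the class name and the `"primary"`/`"replica"` lookup keys are illustrative):

```java
import org.springframework.jdbc.datasource.lookup.AbstractRoutingDataSource;
import org.springframework.transaction.support.TransactionSynchronizationManager;

// Picks the target DataSource per connection request; the "primary" and
// "replica" DataSources must be registered via setTargetDataSources(...).
public class ReadWriteRoutingDataSource extends AbstractRoutingDataSource {
    @Override
    protected Object determineCurrentLookupKey() {
        return TransactionSynchronizationManager.isCurrentTransactionReadOnly()
                ? "replica"
                : "primary";
    }
}
```

For the lookup to see the transaction's read-only flag, the routing source is typically wrapped in Spring's `LazyConnectionDataSourceProxy` so the physical connection is fetched only after the transaction has started.

For CPU-scaled pool sizing, a toy heuristic based on HikariCP's published guidance (`core_count * 2 + effective_spindle_count`; the spindle count here is an assumption):

```java
// Toy sizing heuristic; validate against the Little's Law numbers above before adopting.
int cores = Runtime.getRuntime().availableProcessors();
int effectiveSpindles = 1; // illustrative for SSD-backed storage
int poolSize = cores * 2 + effectiveSpindles;
```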
## Testing

- Load test with k6 shows improvement
- Deploy to staging
- Soak test for 24h
- Deploy to production
## Monitoring

New alerts:
- Pool utilization > 80%
- Connection wait time > 1s
- Failed connection attempts
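As a sketch of the first alert, assuming the pool metrics are scraped by Prometheus under Micrometer's HikariCP metric names (rule and label names are illustrative):

```yaml
# Illustrative Prometheus alerting rule for pool utilization > 80%.
groups:
  - name: hikari-pool
    rules:
      - alert: HikariPoolNearExhaustion
        expr: hikaricp_connections_active / hikaricp_connections_max > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "HikariCP pool utilization above 80% ({{ $labels.pool }})"
```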
## Related Issues
- Relates to #2 (closed) (performance optimization)
- Blocked by: #7 (Log4Shell - need to deploy fix first)
**Priority:** CRITICAL - Production incident
**Severity:** HIGH - 0.3% error rate
cc: @stan @jean_gabriel @bill_staples