Database connection pool exhaustion causing 503 errors

Production Incident

Symptoms

Production user-service experiencing intermittent 503 errors:

  • Frequency: 50-100 errors/hour during peak traffic
  • Error: "HikariCP - Connection is not available, request timed out after 30000ms"
  • Impact: 0.3% error rate (up from baseline 0.01%)

Timeline

2025-10-09 14:23 UTC: First 503 errors detected
2025-10-09 14:45 UTC: Error rate spike to 150/hour
2025-10-09 15:12 UTC: Auto-scaling triggered (4→8 instances)
2025-10-09 15:30 UTC: Error rate decreased but still elevated
2025-10-09 16:00 UTC: Incident declared

Root Cause

Connection pool configuration is insufficient for current load:

Current Config (application.yml):

spring.datasource.hikari:
  maximum-pool-size: 10
  minimum-idle: 5
  connection-timeout: 30000
  idle-timeout: 600000

Analysis:

  • Average request time: 120ms
  • Peak RPS per instance: 200 req/s
  • Concurrent DB connections needed: ~24 (200 req/s × 0.12 s per request, by Little's Law)
  • Pool size (10) < needed connections (24)
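
The estimate above can be reproduced with Little's Law: average concurrency equals arrival rate times the time each request holds a connection. This is a minimal sketch of that arithmetic, assuming each request holds one connection for the full 120 ms (the issue does not say how much of the request time is spent in the database). The class and method names are illustrative and not part of user-service.

```java
// Illustrative helper for the sizing arithmetic; not part of user-service.
final class PoolSizing {
    private PoolSizing() {}

    /**
     * Little's Law: concurrent connections = requests/second * seconds held.
     * Works in integer milliseconds and rounds up to a whole connection.
     */
    static long requiredConnections(long requestsPerSecond, long holdMillis) {
        if (requestsPerSecond < 0 || holdMillis < 0) {
            throw new IllegalArgumentException("inputs must be non-negative");
        }
        return (requestsPerSecond * holdMillis + 999) / 1000;
    }

    public static void main(String[] args) {
        // Figures from the analysis: 200 req/s per instance, 120 ms per request.
        long needed = requiredConnections(200, 120);
        long currentPool = 10;
        System.out.println("needed=" + needed
                + ", pool=" + currentPool
                + ", shortfall=" + (needed - currentPool));
    }
}
```

If only part of the 120 ms is spent holding a connection, the real requirement is lower, so the pool metrics are a better guide than this upper estimate once they are in place.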

Load Testing Results

From k6 load test (see CI pipeline):

scenario: (100.00%) 1 scenario, 100 max VUs

✓ status was 200
✗ response time < 500ms
  ↳  97% — ✓ 119834 / ✗ 3622

errors...................: 3.01% ✗ 3622 errors
http_req_duration........: avg=247ms p95=489ms p99=1.2s

Proposed Fix

Immediate (Today)

  1. Increase pool size:

spring.datasource.hikari:
  maximum-pool-size: 30      # was 10
  minimum-idle: 15           # was 5
  connection-timeout: 20000  # was 30000 (ms); fail faster

  2. Add connection pool monitoring:

metrics.gauge("hikari.connections.active", poolStats::getActiveConnections);
metrics.gauge("hikari.connections.idle", poolStats::getIdleConnections);
metrics.gauge("hikari.connections.pending", poolStats::getThreadsAwaitingConnection);
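
One thing that may be worth checking before rolling out the larger pool: the pool size applies per instance, so the database can see up to maximum-pool-size times the instance count. This is a rough sketch of that check, using the 8 instances from the timeline. The database connection limit and the reserved headroom are not stated in this issue, so `dbMaxConnections` and `reserved` below are placeholders to replace with real values.

```java
// Illustrative check of the fleet-wide connection budget; names are hypothetical.
final class ConnectionBudget {
    private ConnectionBudget() {}

    /** Worst-case connections opened against the database by all instances. */
    static int totalConnections(int instances, int maxPoolSize) {
        return Math.multiplyExact(instances, maxPoolSize);
    }

    /** True if the worst case fits under the DB limit minus reserved headroom. */
    static boolean fitsWithin(int instances, int maxPoolSize,
                              int dbMaxConnections, int reserved) {
        return totalConnections(instances, maxPoolSize) <= dbMaxConnections - reserved;
    }

    public static void main(String[] args) {
        int dbMaxConnections = 200; // ASSUMPTION: placeholder, not from the issue
        int reserved = 10;          // ASSUMPTION: headroom for admin/migration sessions

        // 8 instances after auto-scaling, proposed maximum-pool-size of 30.
        System.out.println("worst case: " + totalConnections(8, 30));
        System.out.println("fits: " + fitsWithin(8, 30, dbMaxConnections, reserved));
    }
}
```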

Short-term (Week 1)

  1. Database read replicas:

    • Route read queries to replicas
    • Reduce primary DB load by 70%
  2. Query optimization:

    • Add indexes on frequently queried columns
    • Reduce N+1 query patterns
  3. Connection pooling per instance:

    • Scale pool size with instance CPU
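
The issue does not say which rule should tie pool size to CPU. One possible shape is a per-core multiplier clamped between a floor and a ceiling, sketched here. `connectionsPerCore` and the floor and ceiling values are hypothetical tuning knobs, not figures from this incident.

```java
// Illustrative sizing rule; the multiplier and bounds are assumptions.
final class CpuPoolSizing {
    private CpuPoolSizing() {}

    /** Pool size = cores * connectionsPerCore, clamped to [minSize, maxSize]. */
    static int poolSizeFor(int cpuCores, int connectionsPerCore,
                           int minSize, int maxSize) {
        if (cpuCores < 1 || connectionsPerCore < 1 || minSize < 1 || maxSize < minSize) {
            throw new IllegalArgumentException("invalid sizing parameters");
        }
        int scaled = Math.multiplyExact(cpuCores, connectionsPerCore);
        return Math.max(minSize, Math.min(maxSize, scaled));
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        // ASSUMPTION: 4 connections per core, floor 10 (old pool), ceiling 30 (proposed pool).
        System.out.println("cores=" + cores + ", pool=" + poolSizeFor(cores, 4, 10, 30));
    }
}
```

The ceiling keeps a large instance from claiming more than its share of database connections. The floor keeps small instances from dropping below the pool size that was already in production.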

Testing

  • Re-run the k6 load test and confirm the error rate and response times improve
  • Deploy to staging
  • Soak test for 24h
  • Deploy to production

Monitoring

New alerts:

  • Pool utilization > 80%
  • Connection wait time > 1s
  • Failed connection attempts
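
The alert conditions above can be expressed as simple predicates over the pool gauges. In practice they would probably be defined in the monitoring system rather than in application code, so treat this as a sketch of the logic only. The "any failure fires" rule for failed connection attempts is an assumption, since the issue gives no threshold for it, and all names are illustrative.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative evaluation of the three proposed alerts; names are hypothetical.
final class PoolAlerts {
    static final int UTILIZATION_THRESHOLD_PERCENT = 80;
    static final long WAIT_THRESHOLD_MILLIS = 1000;

    private PoolAlerts() {}

    static List<String> evaluate(int activeConnections, int maxPoolSize,
                                 long waitMillis, long failedAttempts) {
        if (maxPoolSize <= 0) {
            throw new IllegalArgumentException("maxPoolSize must be positive");
        }
        List<String> alerts = new ArrayList<>();
        // Integer comparison of active/max > 80% without floating point.
        if (activeConnections * 100L > (long) maxPoolSize * UTILIZATION_THRESHOLD_PERCENT) {
            alerts.add("pool utilization > 80%");
        }
        if (waitMillis > WAIT_THRESHOLD_MILLIS) {
            alerts.add("connection wait time > 1s");
        }
        // ASSUMPTION: any failed attempt fires; the issue states no threshold.
        if (failedAttempts > 0) {
            alerts.add("failed connection attempts");
        }
        return alerts;
    }

    public static void main(String[] args) {
        // 25 of 30 connections active, 1.5 s wait, no failures.
        System.out.println(evaluate(25, 30, 1500, 0));
    }
}
```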

Related Issues

  • Relates to #2 (closed) (performance optimization)
  • Blocked by: #7 (Log4Shell - need to deploy fix first)

Priority: CRITICAL - Production incident
Severity: HIGH - 0.3% error rate

cc: @stan @jean_gabriel @bill_staples