
[TECH DEBT] Refactor collaborative filtering to support real-time updates

Problem

The current collaborative filtering model requires a full retrain (8 hours) to incorporate new user interactions. Between retrains, recommendations go stale:

New users: cold start problem for 24h
Trending items: not reflected until next retrain
User preference changes: 24h lag

Current Architecture

User interactions → Batch job (daily) → Full model retrain (8h) → Deploy

Issues:

Cannot adapt to viral content
Poor experience for new users
Wastes compute (retraining entire model daily)

Proposed Architecture

User interactions → Stream processing → Incremental model update → Live serving

Benefits:

Real-time personalization
Trending content surfaces faster
Reduced compute cost (incremental updates)

Technical Approach

Option 1: Online Matrix Factorization

Incremental SGD updates
Update user/item factors in real-time
Requires streaming infrastructure (Kafka/Flink)
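
For discussion, a minimal sketch of what an incremental SGD update for Option 1 could look like. It is not the existing implementation: the class name `OnlineMF`, the hyperparameters (`n_factors`, `lr`, `reg`), and the use of explicit ratings are all assumptions, and the Kafka/Flink consumer that would feed `update()` is omitted.

```python
import random
from collections import defaultdict


class OnlineMF:
    """Toy online matrix factorization, updated one interaction at a time.

    Hypothetical sketch: names and hyperparameters are illustrative only.
    """

    def __init__(self, n_factors=8, lr=0.05, reg=0.02, seed=0):
        self.n_factors = n_factors
        self.lr = lr    # SGD learning rate (assumed value)
        self.reg = reg  # L2 regularization strength (assumed value)
        self._rng = random.Random(seed)
        # Unseen users/items get small random factors on first access,
        # so a brand-new user can be scored and updated immediately.
        self.user_factors = defaultdict(self._new_vector)
        self.item_factors = defaultdict(self._new_vector)

    def _new_vector(self):
        return [self._rng.gauss(0.0, 0.1) for _ in range(self.n_factors)]

    def predict(self, user, item):
        """Dot product of the user and item factor vectors."""
        p, q = self.user_factors[user], self.item_factors[item]
        return sum(pf * qf for pf, qf in zip(p, q))

    def update(self, user, item, rating):
        """Apply one SGD step for a single (user, item, rating) event.

        Returns the prediction error measured before the step.
        """
        p, q = self.user_factors[user], self.item_factors[item]
        err = rating - self.predict(user, item)
        for f in range(self.n_factors):
            pf, qf = p[f], q[f]  # use pre-update values for both gradients
            p[f] += self.lr * (err * qf - self.reg * pf)
            q[f] += self.lr * (err * pf - self.reg * qf)
        return err
```

Each event touches only one user vector and one item vector, which is why the cost per interaction stays constant instead of requiring a full retrain.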

Option 2: Hybrid Model

Keep batch CF for long-term patterns
Add real-time popularity boost
Blend: 70% CF + 30% trending
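
A rough sketch of how the Option 2 blend might work, to make the proposal concrete. Only the 70/30 split comes from this issue; the min-max normalization, the exponentially decayed popularity counter, its one-hour half-life, and the names `DecayedPopularity` and `blend_scores` are assumptions for illustration.

```python
import math


class DecayedPopularity:
    """Per-item interaction counter with exponential time decay (hypothetical)."""

    def __init__(self, half_life_s=3600.0):
        # Assumed half-life: an interaction counts half as much after 1 hour.
        self.decay = math.log(2) / half_life_s
        self.scores = {}  # item -> (score, timestamp of last update)

    def record(self, item, ts):
        """Decay the stored score to time `ts`, then add one interaction."""
        score, last = self.scores.get(item, (0.0, ts))
        decayed = score * math.exp(-self.decay * (ts - last))
        self.scores[item] = (decayed + 1.0, ts)

    def snapshot(self, now):
        """Return every item's score decayed to time `now`."""
        return {
            item: score * math.exp(-self.decay * (now - ts))
            for item, (score, ts) in self.scores.items()
        }


def _normalize(scores):
    """Min-max scale scores to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {item: 0.0 for item in scores}
    return {item: (s - lo) / (hi - lo) for item, s in scores.items()}


def blend_scores(cf_scores, trending_scores, cf_weight=0.7):
    """Blend batch CF scores with trending scores (70% CF + 30% trending)."""
    cf = _normalize(cf_scores)
    trend = _normalize(trending_scores)
    return {
        item: cf_weight * cf.get(item, 0.0)
        + (1.0 - cf_weight) * trend.get(item, 0.0)
        for item in set(cf) | set(trend)
    }
```

Items missing from the CF output (for example, content published after the last retrain) still receive a score from the trending term, which is the cold-start benefit this option is after.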

Option 3: Neural Collaborative Filtering

Replace SVD with deep learning model
Train on mini-batches (hourly)
More flexible but higher complexity

Effort Estimate

Option 1: 6 weeks (requires streaming infra)
Option 2: 2 weeks (easier, good enough)
Option 3: 8 weeks (highest quality, most complex)

Recommendation

Start with Option 2 (hybrid model) as a quick win, then evaluate Option 3 for Q2.

Related Work

Blocked by: #12 (need A/B testing to validate)
Relates to: #1 (closed) (original CF implementation)
Enables: Better cold-start handling

cc @dmitry @bill_staples @jean_gabriel