Fix: Add sentiment pre-filter to reduce false positives on idioms
Problem
Issue #9 (closed): Customer escalation about false positives on positive idioms.
Example false positive:
"You killed it with that presentation!"
Flagged as: threat (0.72), toxic (0.65)
Should be: allow
Solution
Add a sentiment-analysis pre-filter, using DistilBERT fine-tuned on SST-2, that dampens toxicity scores for text with clearly positive sentiment.
How it works
- Idiom whitelist (fast path)
  - "killed it", "dying laughing", "sick moves", etc.
  - Instant dampening if matched
- Sentiment analysis (slower path)
  - DistilBERT predicts positive/negative
  - If positive > 0.8: dampen by 70%
  - Else if positive > 0.7: dampen by 40%
- Apply to toxicity scores
  - Multiply raw scores by damping factor
  - Recompute action (allow/review/block)
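The first two steps can be sketched roughly as follows. This is a minimal illustration, not the code in this change: the names `IDIOM_WHITELIST`, `damping_factor`, and `dampen` are hypothetical, "dampen by N%" is assumed to mean multiplying scores by (1 - N/100), and the dampening applied on a whitelist match is assumed to be 70% because the description does not state it. The positive-class probability is passed in as a plain float so the sketch runs without loading the model.

```python
from typing import Dict, Optional

# Hypothetical whitelist built from the idioms named in this description.
IDIOM_WHITELIST = ("killed it", "dying laughing", "sick moves")

# Assumption: "dampen by N%" means scores are multiplied by (1 - N/100).
STRONG_POSITIVE_FACTOR = 0.3  # positive > 0.8 -> dampen by 70%
MILD_POSITIVE_FACTOR = 0.6    # positive > 0.7 -> dampen by 40%
IDIOM_FACTOR = 0.3            # assumption: amount for a whitelist match is not stated


def damping_factor(text: str, positive_prob: Optional[float] = None) -> float:
    """Return the multiplier to apply to raw toxicity scores."""
    lowered = text.lower()
    # Fast path: a whitelist hit skips the sentiment model entirely.
    if any(idiom in lowered for idiom in IDIOM_WHITELIST):
        return IDIOM_FACTOR
    # Slower path: use the positive-class probability from the sentiment model.
    if positive_prob is None:
        return 1.0  # no sentiment signal, leave scores untouched
    if positive_prob > 0.8:
        return STRONG_POSITIVE_FACTOR
    if positive_prob > 0.7:
        return MILD_POSITIVE_FACTOR
    return 1.0


def dampen(scores: Dict[str, float], factor: float) -> Dict[str, float]:
    """Multiply every raw toxicity score by the damping factor."""
    return {label: score * factor for label, score in scores.items()}


if __name__ == "__main__":
    raw = {"threat": 0.72, "toxic": 0.65}
    factor = damping_factor("You killed it with that presentation!")
    print(factor)  # 0.3 (whitelist match)
    print(dampen(raw, factor))
```

Checking the whitelist first keeps the common idioms off the slower model path, which is consistent with the latency split reported under Results.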
Results
Test set (500 examples with idioms):
- False positive rate: 18.4% → 7.2% ✅
- Reduction: 60.9%
- False negative rate: 2.1% → 2.3% (acceptable)
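For reviewers who want to check the arithmetic, the reduction figure is the relative change in false positive rate. The helper name `relative_reduction` below is illustrative only.

```python
def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent; a negative value means an increase."""
    return (before - after) / before * 100


if __name__ == "__main__":
    # False positive rate: 18.4% -> 7.2%
    print(round(relative_reduction(18.4, 7.2), 1))  # 60.9
    # False negative rate: 2.1% -> 2.3%, an increase of 0.2 percentage points
    print(round(relative_reduction(2.1, 2.3), 1))   # -9.5
```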
Latency impact:
- Idiom whitelist: +2ms
- Sentiment model: +40ms
- Total: +42ms (acceptable given the quality improvement)
Customer examples fixed:
- ✅ "You killed it!"
- ✅ "I'm dying laughing"
- ✅ "This is sick!" (meaning cool)
- ✅ "Insanely good work"
Architecture
Input text
↓
[Sentiment Pre-Filter]
├─ Check idiom whitelist
├─ Analyze sentiment (DistilBERT)
└─ Compute damping factor
↓
[Toxicity Detector]
├─ Get raw scores
└─ Apply sentiment dampening
↓
Final moderation decision
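The flow above could be wired together roughly like this. It is a sketch under stated assumptions: `moderate` and `choose_action` are hypothetical names, the review/block thresholds are invented for illustration (the description does not give the real ones), and both models are injected as plain callables so the example runs with stubs instead of DistilBERT or the real detector.

```python
from typing import Callable, Dict, Tuple

# Hypothetical action thresholds; the real values are not given in this description.
BLOCK_THRESHOLD = 0.8
REVIEW_THRESHOLD = 0.5


def choose_action(scores: Dict[str, float]) -> str:
    """Map the highest score to allow/review/block."""
    top = max(scores.values(), default=0.0)
    if top >= BLOCK_THRESHOLD:
        return "block"
    if top >= REVIEW_THRESHOLD:
        return "review"
    return "allow"


def moderate(
    text: str,
    prefilter: Callable[[str], float],
    detector: Callable[[str], Dict[str, float]],
) -> Tuple[str, Dict[str, float]]:
    """Run the pre-filter, then the detector, then recompute the action."""
    factor = prefilter(text)  # damping factor in (0, 1]
    raw = detector(text)      # raw toxicity scores per label
    final = {label: score * factor for label, score in raw.items()}
    return choose_action(final), final


if __name__ == "__main__":
    # Stub models standing in for the real pre-filter and toxicity detector.
    stub_detector = lambda text: {"threat": 0.72, "toxic": 0.65}
    no_filter = lambda text: 1.0
    idiom_filter = lambda text: 0.3

    text = "You killed it with that presentation!"
    print(moderate(text, no_filter, stub_detector)[0])     # review
    print(moderate(text, idiom_filter, stub_detector)[0])  # allow
```

Because the detector itself is untouched and dampening happens on its output, the pre-filter can be disabled by returning a factor of 1.0, which is convenient for an A/B comparison.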
Testing
- Unit tests for sentiment analysis
- Integration tests with toxicity detector
- Validation on 500 idiom examples
- Latency profiling
- Staging deployment
- Production A/B test
Deployment Plan
- Merge to main
- Deploy to staging (3 days)
- A/B test 20% production traffic (1 week)
- Full rollout if the A/B test confirms an FPR reduction greater than 50%
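One way the 20% split and the rollout gate could be expressed is sketched below. Everything here is an assumption for illustration: the function names, the use of a hashed request ID for bucketing, and the reading of "FPR improvement" as a relative reduction against the control group. The actual experiment tooling may work differently.

```python
import hashlib

AB_TRAFFIC_PERCENT = 20     # share of production traffic in the A/B test
MIN_FPR_IMPROVEMENT = 50.0  # rollout gate, in percent (assumed to be relative)


def in_treatment(request_id: str, percent: int = AB_TRAFFIC_PERCENT) -> bool:
    """Deterministically assign a request to the treatment group."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # stable bucket in 0..99
    return bucket < percent


def should_roll_out(control_fpr: float, treatment_fpr: float) -> bool:
    """Return True when the relative FPR reduction exceeds the gate."""
    if control_fpr <= 0:
        return False  # nothing to improve on, so the gate cannot be evaluated
    improvement = (control_fpr - treatment_fpr) / control_fpr * 100
    return improvement > MIN_FPR_IMPROVEMENT


if __name__ == "__main__":
    # Using the test-set numbers from Results as a stand-in for A/B measurements.
    print(should_roll_out(18.4, 7.2))  # True
```

Hashing the ID rather than sampling randomly keeps each request (or user, if a user ID is hashed instead) in the same group for the whole test week.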
Related
- Fixes #9 (closed) (false positive escalation)
- Works with #4 (closed) (review queue - fewer items to review)
- Complements #5 (closed) (fairness - reduces bias from context misunderstanding)
/cc @bob_wilson @bill_staples @michael_usanchenko