Fix: Add sentiment pre-filter to reduce false positives on idioms (!3) · Merge requests · acme-corp / ai-features / ai-content-moderation

Problem

Issue #9 (closed): Customer escalation about false positives on positive idioms.

Example false positive:

"You killed it with that presentation!"

Flagged as: threat (0.72), toxic (0.65) ❌

Should be: allow ✅

Solution

Implement a sentiment analysis pre-filter using DistilBERT (SST-2).

How it works

1. Idiom whitelist (fast path)
   - "killed it", "dying laughing", "sick moves", etc.
   - Instant dampening if matched
2. Sentiment analysis (slower path)
   - DistilBERT predicts positive/negative
   - If positive > 0.8: dampen by 70%
   - Otherwise, if positive > 0.7: dampen by 40%
3. Apply to toxicity scores
   - Multiply raw scores by damping factor
   - Recompute action (allow/review/block)
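
The three steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR. It assumes that "dampen by 70%" means multiplying raw scores by 0.3 (and 40% means 0.6), and that a whitelist hit uses the strongest damping, which the description does not specify. `sentiment_fn` is a hypothetical stand-in for the DistilBERT call and returns the positive-class probability.

```python
from typing import Callable, Dict

# Idioms quoted in this MR; the real whitelist is longer ("etc." above).
IDIOM_WHITELIST = ("killed it", "dying laughing", "sick moves")

# Assumption: "dampen by 70%" means multiplying raw scores by 0.3.
STRONG_DAMPING = 0.3  # positive > 0.8
MILD_DAMPING = 0.6    # positive > 0.7
# Assumption: a whitelist hit uses the strongest damping; the MR does not say.
WHITELIST_DAMPING = STRONG_DAMPING


def damping_factor(text: str, sentiment_fn: Callable[[str], float]) -> float:
    """Return the multiplier for raw toxicity scores (1.0 means no change)."""
    lowered = text.lower()
    # Fast path: substring match, so the sentiment model is never called.
    if any(idiom in lowered for idiom in IDIOM_WHITELIST):
        return WHITELIST_DAMPING
    # Slower path: thresholds on the positive-class probability.
    positive = sentiment_fn(text)
    if positive > 0.8:
        return STRONG_DAMPING
    if positive > 0.7:
        return MILD_DAMPING
    return 1.0


def dampen_scores(raw_scores: Dict[str, float], factor: float) -> Dict[str, float]:
    """Multiply every raw toxicity score by the damping factor."""
    return {label: score * factor for label, score in raw_scores.items()}


# Usage with the example from the Problem section and a stubbed sentiment model.
factor = damping_factor("You killed it with that presentation!", lambda _text: 0.0)
print(dampen_scores({"threat": 0.72, "toxic": 0.65}, factor))
```

Passing the sentiment model as a callable keeps the fast path cheap: the model only runs when no whitelisted idiom matches.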

Results

Test set (500 examples with idioms):

False positive rate: 18.4% → 7.2% ✅
Reduction: 60.9%
False negative rate: 2.1% → 2.3% (acceptable)
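
For reference, the 60.9% figure is a relative reduction, not a drop in percentage points. A quick check using only the numbers reported above:

```python
fpr_before = 18.4  # percent, from the results above
fpr_after = 7.2    # percent

absolute_drop = fpr_before - fpr_after           # percentage points
relative_reduction = absolute_drop / fpr_before  # fraction of the original FPR

print(f"{absolute_drop:.1f} points, {relative_reduction:.1%} relative")
# prints "11.2 points, 60.9% relative"
```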

Latency impact:

Idiom whitelist: +2ms
Sentiment model: +40ms
Total: +42ms (acceptable for the quality gain)

Customer examples fixed:

✅ "You killed it!"
✅ "I'm dying laughing"
✅ "This is sick!" (meaning cool)
✅ "Insanely good work"

Architecture

Input text
    ↓
[Sentiment Pre-Filter]
    ├─ Check idiom whitelist
    ├─ Analyze sentiment (DistilBERT)
    └─ Compute damping factor
    ↓
[Toxicity Detector]
    ├─ Get raw scores
    └─ Apply sentiment dampening
    ↓
Final moderation decision
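
One way to wire the stages in the diagram together is sketched below. The `moderate` and `decide` names and the allow/review/block cutoffs are hypothetical, since this description does not state the actual thresholds. The pre-filter and detector are passed in as callables so the sketch runs without either model.

```python
from typing import Callable, Dict

# Hypothetical cutoffs: the MR does not state the allow/review/block thresholds.
REVIEW_THRESHOLD = 0.5
BLOCK_THRESHOLD = 0.8


def decide(scores: Dict[str, float]) -> str:
    """Map the highest category score to an action."""
    top = max(scores.values(), default=0.0)
    if top >= BLOCK_THRESHOLD:
        return "block"
    if top >= REVIEW_THRESHOLD:
        return "review"
    return "allow"


def moderate(
    text: str,
    prefilter: Callable[[str], float],
    toxicity: Callable[[str], Dict[str, float]],
) -> str:
    """Run the pre-filter, then the detector, then recompute the action."""
    factor = prefilter(text)  # [Sentiment Pre-Filter] -> damping factor
    raw = toxicity(text)      # [Toxicity Detector] -> raw scores
    damped = {label: score * factor for label, score in raw.items()}
    return decide(damped)     # final moderation decision


# Stubs standing in for the real models, using the scores from the Problem section.
def stub_toxicity(_text: str) -> Dict[str, float]:
    return {"threat": 0.72, "toxic": 0.65}


print(moderate("You killed it!", lambda _t: 1.0, stub_toxicity))  # prints "review"
print(moderate("You killed it!", lambda _t: 0.3, stub_toxicity))  # prints "allow"
```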

Testing

Deployment Plan

1. Merge to main
2. Deploy to staging (3 days)
3. A/B test 20% production traffic (1 week)
4. Full rollout if FPR improvement >50%
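
The 20% split in the A/B step could be done with deterministic hash bucketing, sketched below. This is an assumption for illustration only: the plan does not say how traffic is assigned, and the function name, salt, and choice of key are hypothetical.

```python
import hashlib

TREATMENT_PERCENT = 20  # share of production traffic, from the plan above


def in_treatment(request_key: str, salt: str = "sentiment-prefilter") -> bool:
    """Deterministically assign a key (e.g. a user or tenant ID) to the treatment arm."""
    digest = hashlib.sha256(f"{salt}:{request_key}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return bucket < TREATMENT_PERCENT


# The same key always lands in the same arm, so a customer sees consistent behavior.
share = sum(in_treatment(f"user-{i}") for i in range(10_000)) / 10_000
print(f"treatment share: {share:.1%}")
```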

Fixes #9 (false positive escalation)
Works with #4 (review queue - fewer items to review)
Complements #5 (fairness - reduces bias from context misunderstanding)

/cc @bob_wilson @bill_staples @michael_usanchenko
