Implement BERT-based toxicity detection model (!1) · Merge requests · acme-corp / ai-features / ai-content-moderation

Summary

Implements production-ready toxicity detection using BERT transformer model for multi-label text classification.

Add ToxicityClassifier neural network with BERT architecture
Implement multi-label classification for 6 toxicity categories:
- toxic, severe_toxic, obscene, threat, insult, identity_hate
Add ToxicityDetector high-level API with preprocessing
Include confidence-based action recommendations:
- allow: Non-toxic content
- review: Low-confidence predictions (human review needed)
- block: High-confidence toxic content
Implement fairness evaluation framework
Add comprehensive type hints and docstrings

BERT (bert-base-uncased)
  ↓
Pooled Output ([CLS] token)
  ↓
Dropout (0.3)
  ↓  
Linear Classification Head
  ↓
Sigmoid (multi-label)

⚠️ Important: Preliminary fairness testing shows bias:

Mitigation strategy in progress (see #5 (closed)).