Implement text toxicity detection model
Overview
Implement an ML model to detect toxic, abusive, and harmful text content.
Model Architecture
- BERT-based transformer model
- Fine-tuned on toxicity datasets (Civil Comments, Jigsaw)
- Multi-label classification: toxic, severe_toxic, obscene, threat, insult, identity_hate (see the sketch below)
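A minimal sketch of the multi-label head described above, using HuggingFace Transformers. The `bert-base-uncased` checkpoint, the 512-token truncation, and the 0.5 decision threshold are assumptions for illustration; in practice the fine-tuned checkpoint would be loaded in its place.

```python
# Sketch of the multi-label toxicity classifier; checkpoint and threshold are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased",                        # placeholder; load the fine-tuned checkpoint here
    num_labels=len(LABELS),
    problem_type="multi_label_classification",  # BCE-with-logits loss during fine-tuning
)

def predict(text: str, threshold: float = 0.5) -> dict[str, float]:
    """Return per-label probabilities; labels above `threshold` count as positive."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    probs = torch.sigmoid(logits).squeeze(0)    # independent sigmoid per label (multi-label)
    return {label: float(p) for label, p in zip(LABELS, probs)}
```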
Performance Requirements
- Accuracy: >92% on the test set
- Inference latency: <100ms (p95)
- False positive rate: <5% (an offline evaluation sketch follows this list)
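A hedged sketch of how these targets could be checked offline. The `texts`, `y_true`, and `y_pred` arguments are placeholders for the real held-out set, `predict` refers to the sketch above, and flattening all label decisions before computing accuracy and FPR is an assumption about how the metrics are aggregated.

```python
# Sketch of an offline check against the targets above; data arrays are placeholders.
import time
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

def check_targets(texts: list[str], y_true: np.ndarray, y_pred: np.ndarray) -> None:
    # Accuracy target: >92%, computed here over all flattened label decisions.
    acc = accuracy_score(y_true.ravel(), y_pred.ravel())

    # False positive rate target: <5%, with FPR = FP / (FP + TN) over the same decisions.
    tn, fp, fn, tp = confusion_matrix(y_true.ravel(), y_pred.ravel()).ravel()
    fpr = fp / (fp + tn)

    # Latency target: <100ms, reported as the p95 over single-text calls to `predict`.
    latencies_ms = []
    for text in texts:
        start = time.perf_counter()
        predict(text)                            # inference entry point from the sketch above
        latencies_ms.append((time.perf_counter() - start) * 1000)
    p95 = np.percentile(latencies_ms, 95)

    print(f"accuracy={acc:.3f} (>0.92), fpr={fpr:.3f} (<0.05), p95={p95:.1f} ms (<100)")
```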
Tech Stack
- Python 3.11+
- PyTorch / HuggingFace Transformers
- ONNX for optimized inference
- TensorRT for GPU acceleration (see the export sketch below)
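A sketch of the ONNX path using HuggingFace Optimum. The checkpoint path is a placeholder, and the TensorRT provider assumes a GPU build of ONNX Runtime with TensorRT support; without one, `"CUDAExecutionProvider"` or `"CPUExecutionProvider"` would be used instead.

```python
# Sketch: export the fine-tuned model to ONNX and run it with an accelerated provider.
# Checkpoint path is a placeholder; provider choice depends on the installed onnxruntime build.
import numpy as np
from transformers import AutoTokenizer
from optimum.onnxruntime import ORTModelForSequenceClassification

checkpoint = "path/to/finetuned-toxicity-bert"   # hypothetical fine-tuned checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    checkpoint,
    export=True,                                 # convert the PyTorch weights to ONNX on load
    provider="TensorrtExecutionProvider",        # GPU path; use CPUExecutionProvider without a GPU
)

def predict_onnx(text: str) -> np.ndarray:
    """Per-label toxicity probabilities from the ONNX-exported model."""
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="np")
    logits = ort_model(**inputs).logits
    return 1.0 / (1.0 + np.exp(-logits[0]))      # independent sigmoid per label
```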
Training Data
- 500K+ labeled examples
- Balanced across toxicity categories (see the data-preparation sketch below)
- Regular retraining with new data
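A sketch of assembling a balanced training split from the public Civil Comments release on the HuggingFace Hub. Column names follow that release (note `identity_attack` vs. the `identity_hate` label above), and the 0.5 binarization threshold and 2:1 downsampling ratio are assumptions; the Jigsaw data would be merged in a similar way.

```python
# Sketch: binarize Civil Comments annotator scores and downsample the non-toxic majority.
# Threshold and downsampling ratio are assumptions, not part of the spec.
from datasets import load_dataset, concatenate_datasets

SCORE_COLUMNS = ["toxicity", "severe_toxicity", "obscene", "threat", "insult", "identity_attack"]

raw = load_dataset("civil_comments", split="train")

def binarize(example):
    # Civil Comments stores fractional annotator agreement; threshold it into 0/1 labels.
    example["labels"] = [float(example[col] >= 0.5) for col in SCORE_COLUMNS]
    return example

labeled = raw.map(binarize)

# Keep every example with at least one positive label; downsample the clean majority 2:1.
toxic = labeled.filter(lambda ex: sum(ex["labels"]) > 0)
clean = labeled.filter(lambda ex: sum(ex["labels"]) == 0)
clean = clean.shuffle(seed=42).select(range(min(len(clean), 2 * len(toxic))))

balanced = concatenate_datasets([toxic, clean]).shuffle(seed=42)
```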
Acceptance Criteria
- Model accuracy >92%
- API response time <100ms p95
- Support for 10+ languages
- Bias audit completed