Interpretable Hate Speech Detection with Natural Language Instructions

Supervised by: Professor Majid Komeili

Hate speech and toxic content online are a rapidly growing problem. Current approaches to automatic hate speech detection lack interpretability and can be vulnerable to racial bias. We propose a method that integrates human-written natural language instructions into the detection model. These instructions provide insight into the detection process and guide learning toward the correct signal, beyond keywords alone.

This work is ongoing as part of my senior thesis at Carleton University and is supported by the I-CUREUS grant.