FAQ / Text Moderation

How do text classification models work?

Because machine learning models take context into account, they can detect problematic content that rule-based models would miss or flag incorrectly.

When you submit a text item to the API, you instantly receive a score for each available class. Scores range from 0 to 1 and reflect how likely it is that someone would find the text problematic: higher scores usually indicate more problematic content. Note that the API may return several high scores for a single text if it matches multiple classes.

Class availability depends on the language of the submitted text. The available classes for Text Classification are the following:

- sexual: detects references to sexual acts, sexual organs, or any other content typically associated with sexual activity
- discriminatory: detects hate speech directed at individuals or groups because of specific characteristics of their identity (origin, religion, sexual orientation, gender, etc.)
- insulting: detects insults that undermine the dignity or honor of an individual, or signs of disrespect towards someone
- violent: detects threatening content, i.e. content with an intention to harm or hurt, or content expressing violence and brutality
- toxic: detects whether a text is unacceptable, harmful, offensive, disrespectful or unpleasant
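The per-class scoring described above can be sketched as follows. This is a minimal illustration, not the API's actual response format: the response shape, the example scores, and the threshold value are all assumptions made for the example; only the class names come from the list above.

```python
# Hypothetical moderation response: one score between 0 and 1 per class.
# The dict shape and the values below are illustrative assumptions.
response = {
    "sexual": 0.02,
    "discriminatory": 0.01,
    "insulting": 0.91,
    "violent": 0.05,
    "toxic": 0.88,  # one text can score high on several classes at once
}

THRESHOLD = 0.8  # assumed cutoff; tune this for your own use case

# Keep only the classes whose score exceeds the threshold.
flagged = {cls: score for cls, score in response.items() if score >= THRESHOLD}
print(flagged)  # both "insulting" and "toxic" exceed the threshold here
```

Note how a single text can trigger more than one class: an insult is often also scored as toxic, so downstream logic should handle multiple flagged classes rather than assuming exactly one.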

See the Text Classification documentation to learn more.