Which AI speech generators does the model detect?

The model is trained to detect speech produced by a wide range of the voice generators currently in use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid Labs, Microsoft (Azure Neural TTS, VALL-E), Google (WaveNet, Chirp), Amazon Polly, Descript Overdub, LMNT, Cartesia, Hume AI, Coqui, Tortoise and Bark, as well as additional emerging providers. New generators are added as they appear.

Does the model rely on metadata or watermarks?

No. The model only analyzes the acoustic content of the audio waveform. Stripping or altering metadata has no effect on the score, so evasion techniques that rely on editing metadata do not work.

What threshold should I use on the ai_speech score?

Scores above 0.5 generally indicate AI-generated speech. For high-precision workflows such as identity verification, KYC or fraud detection, use a higher threshold (e.g. 0.8). For broader content moderation, a threshold around 0.5 balances precision and recall.
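
The guidance above can be sketched as a small helper. The threshold values come from the text; the `THRESHOLDS` mapping, the workflow names and the `is_ai_speech` function are hypothetical names chosen for this illustration and are not part of the API.

```python
# Thresholds taken from the guidance above. The workflow names are
# hypothetical labels for this sketch; tune the values on your own data.
THRESHOLDS = {
    "high_precision": 0.8,  # identity verification, KYC, fraud detection
    "moderation": 0.5,      # broader content moderation
}


def is_ai_speech(score: float, workflow: str = "moderation") -> bool:
    """Return True when the ai_speech score is above the workflow's threshold."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be between 0 and 1, got {score}")
    return score > THRESHOLDS[workflow]


print(is_ai_speech(0.98, "high_precision"))  # True
print(is_ai_speech(0.65, "high_precision"))  # False
print(is_ai_speech(0.65))                    # True
```

Keeping the thresholds in one mapping makes it easy to adjust them per workflow once you have measured precision and recall on your own audio.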

How is AI-generated speech detection different from AI-generated music detection?

AI-generated speech detection targets synthetic voices and text-to-speech output (e.g. ElevenLabs, OpenAI). AI-generated music detection targets fully generated music tracks (e.g. Suno, Udio). The two models are complementary and can be combined in a single API call.

AI-Generated Speech Detection

Will real recordings with noise, compression or post-processing be flagged as AI-generated?

No. The model is designed to distinguish synthetic speech from real recordings that have been compressed, denoised, equalized or transmitted over a phone line. Standard post-processing does not push the score above the AI-generated threshold.

BETA ai_speech

Detect if a voice or speech recording was generated with an AI model such as ElevenLabs, OpenAI, PlayHT, Resemble and more.

This model is currently gated. Access is only available to enterprise users and partners. Please reach out for details.

Overview

The AI-Generated Speech Detection Model can help you determine if a voice or speech recording was generated by an AI model, or if it is a genuine human recording. This model was trained on artificially created and human-recorded speech spanning a wide variety of voices, languages, accents, speaking styles and recording conditions.

The Model works by analyzing the acoustic content of the audio waveform. No metadata is used in the analysis, so tampering with metadata has no effect on the scoring.

The Model was trained to detect speech generated by the main voice generators currently in use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid, Microsoft Neural TTS, Google WaveNet and others. Additional generators will be added over time as they become available.

Use cases

Deepfake voice detection
Fraud prevention in call centers and customer support
Voice-based KYC and identity verification
Detection of voice-clone scams and impersonation
Journalism, authenticity verification and fact-checking
Content moderation for podcasts, voice messages and user-generated audio
Limit the spread of audio misinformation and synthetic political speech

Related models

The following models can provide a useful complement to the AI-generated speech model:

AI Video Detection: Detect AI-generated videos.
AI Music Detection: Detect AI-generated music tracks.

Generator-specific information

Sightengine's AI Speech detection model computes per-generator confidence scores alongside a global AI probability score. When the per-generator breakdown is enabled, the API response includes an individual score for each supported voice generator, giving you a fingerprint of the content. This breakdown is not part of the default response; please contact sales for details.

The list of in-scope speech generators covers commercial APIs, open-source models, and emerging providers:

Speech generators

Generator	Creator	Example versions detected
Amazon Polly	Amazon	Neural, Generative voices...
Azure Neural TTS	Microsoft	Neural TTS, VALL-E, VALL-E 2...
Bark	Suno	Bark, Bark Small...
Chatterbox / Resemble	Resemble AI	Chatterbox, Resemble v2...
Coqui	Coqui	XTTS, XTTS v2...
CosyVoice	Alibaba	CosyVoice, CosyVoice 2...
ElevenLabs	ElevenLabs	Multilingual v2, Turbo v2, Flash...
F5-TTS	SWivid	F5-TTS, F5-TTS v1...
Fish Audio	Fish Audio	Fish Pro S1, Fish Pro S2...
Google TTS	Google	WaveNet, Chirp, Chirp 3...
GPT-SoVITS	RVC-Boss	GPT-SoVITS v1, v2, v3...
Grok Voice	xAI	Grok Voice...
HeyGen	HeyGen	HeyGen voices...
MiniMax	MiniMax	Speech-01, Speech-02...
OpenAI	OpenAI	TTS-1, TTS-1-HD, Voice Engine, GPT-4o audio...
Qwen	Alibaba	Qwen3, Qwen3-TTS...
SeamlessM4T	Meta	SeamlessM4T, SeamlessM4T v2...
Synthesia	Synthesia	Synthesia voices...
Tortoise	Neonbjb	Tortoise TTS...
Voicemod	Voicemod	Voicemod AI voices...
Other generators	Various	Camb AI, Cartesia, Descript, Hume AI, Inworld, Kits AI, Kokoro, LMNT, Murf, Parler TTS, PlayHT, Replica Studios, Speechify, Tacotron 2, WellSaid Labs...

And more: new generators are added continuously as they appear in the wild.

Use the model

If you haven't already, create an account to get your own API keys.

Detect if a speech recording was AI-generated

To analyze a speech recording, simply send a POST request with the audio file. Supported audio formats: OGG, OPUS, FLAC, WAV, MP3, M4A, WEBM.


curl -X POST 'https://api.sightengine.com/1.0/audio/check.json' \
    -F 'audio=@/path/to/audio.mp3' \
    -F 'models=ai_speech' \
    -F 'api_user={api_user}' \
    -F 'api_secret={api_secret}'


# this example uses requests
import requests

params = {
  'models': 'ai_speech',
  'api_user': '{api_user}',
  'api_secret': '{api_secret}'
}

# open the file in binary mode; the with-block closes it once the upload is done
with open('/path/to/audio.mp3', 'rb') as audio:
  r = requests.post('https://api.sightengine.com/1.0/audio/check.json', files={'audio': audio}, data=params)

# parse the JSON body of the response
output = r.json()


$params = array(
  'audio' => new CurlFile('/path/to/audio.mp3'),
  'models' => 'ai_speech',
  'api_user' => '{api_user}',
  'api_secret' => '{api_secret}',
);

// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/audio/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);

$output = json_decode($response, true);


// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');

const data = new FormData();
data.append('audio', fs.createReadStream('/path/to/audio.mp3'));
data.append('models', 'ai_speech');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');

axios({
  method: 'post',
  url:'https://api.sightengine.com/1.0/audio/check.json',
  data: data,
  headers: data.getHeaders()
})
.then(function (response) {
  // on success: handle response
  console.log(response.data);
})
.catch(function (error) {
  // handle error
  if (error.response) console.log(error.response.data);
  else console.log(error.message);
});

See request parameter description

Parameter	Type	Description
audio	file	audio file to analyze
models	string	comma-separated list of models to apply
api_user	string	your API user id
api_secret	string	your API secret
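
Because only certain audio formats are supported, a client-side check before uploading can save a round trip. The sketch below is a minimal example that assumes the file extension reflects the actual format, which the API may or may not rely on. `SUPPORTED_EXTENSIONS` and `has_supported_extension` are hypothetical names used for illustration.

```python
from pathlib import Path

# Extensions derived from the supported formats listed above
# (OGG, OPUS, FLAC, WAV, MP3, M4A, WEBM). Mapping format names to
# file extensions is an assumption of this sketch.
SUPPORTED_EXTENSIONS = {".ogg", ".opus", ".flac", ".wav", ".mp3", ".m4a", ".webm"}


def has_supported_extension(path: str) -> bool:
    """Return True when the file extension matches a supported audio format."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(has_supported_extension("/path/to/audio.mp3"))  # True
print(has_supported_extension("/path/to/VOICE.WAV"))  # True
print(has_supported_extension("/path/to/notes.txt"))  # False
```

An extension check is only a cheap first filter: a mislabeled file can still be rejected by the API, so the response status should be checked as well.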

API response

The API will then return a JSON response with the following structure:

{
  "status": "success",
  "request": {
    "id": "req_0zrbHDeitGYY7wEGncAne",
    "timestamp": 1491402308.4762,
    "operations": 15
  },
  "type": {
    "ai_speech": 0.98
  },
  "media": {
    "id": "med_0zrbk8nlp4vwI5WxIqQ4u",
    "uri": "speech.mp3"
  }
}

The JSON response contains the ai_speech score. This score is a float between 0 and 1. The higher the value, the higher the confidence that the audio is AI-generated.
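
As an illustration, the score can be read from the response as sketched below. The field names come from the sample response above; the `extract_ai_speech_score` helper is a hypothetical name, and since the shape of an unsuccessful response is not shown here, the sketch simply treats any status other than "success" as an error.

```python
import json


def extract_ai_speech_score(response_body: str) -> float:
    """Parse a response body and return the ai_speech score as a float."""
    payload = json.loads(response_body)
    # Only the "success" status appears in the sample response; anything
    # else is treated as an error in this sketch.
    if payload.get("status") != "success":
        raise RuntimeError(f"API call did not succeed: {payload}")
    return float(payload["type"]["ai_speech"])


# The sample response shown above, trimmed to the fields used here.
sample = """
{
  "status": "success",
  "request": {"id": "req_0zrbHDeitGYY7wEGncAne", "operations": 15},
  "type": {"ai_speech": 0.98},
  "media": {"id": "med_0zrbk8nlp4vwI5WxIqQ4u", "uri": "speech.mp3"}
}
"""

print(extract_ai_speech_score(sample))  # 0.98
```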

Additional information can be provided, such as a breakdown of the score by time segments and per-generator confidence scores. Please contact sales for more details.

Frequently asked questions

Which AI speech generators are supported?

The model targets the main voice generators currently in use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid, Microsoft Neural TTS, Google WaveNet, Amazon Polly, Descript Overdub, LMNT, Cartesia, Hume AI, Coqui, Tortoise and Bark, along with smaller and emerging generators. The model is updated on an ongoing basis as new generators become available. See Supported AI generators.

How does detection work without metadata or a watermark?

Detection is purely waveform-based. Metadata and inaudible watermarks are ignored, so stripping them has no effect on the result.

What does the ai_speech score mean?

It is the model's confidence, from 0 to 1, that the analyzed audio was produced by a generative AI voice model. Higher means more likely AI-generated. Scores above 0.5 typically indicate AI-generated speech; tune the threshold to your precision/recall preference.

Will real recordings with noise, compression or post-processing be flagged?

No. Standard post-production such as denoising, equalization, compression, phone-line transmission or social-platform re-encoding is treated as original audio. The model targets fully synthetic generated speech, not edited real recordings.

Does it work on phone calls, voice messages and socially-shared audio?

Yes. Detection is robust to re-encoding, downsampling, narrowband phone audio and standard social-platform recompression. Confidence may drop somewhat on heavily degraded clips, but the model is specifically developed to handle real-world redistribution artifacts.

Does it work across languages and accents?

Yes. The model was trained on speech spanning many languages, accents and speaking styles, and is designed to generalize beyond a single language. Performance is best on the most widely used languages, with ongoing improvements for additional locales.

Is music detection also available?

Yes. A dedicated model targets AI-generated music tracks. See AI-Generated Music Detection.

Can I call this model together with other Sightengine models?

Yes. Pass a comma-separated list in the models parameter, for example models=ai_speech,ai_music, and the API will return all results in a single response. This is the recommended pattern for production pipelines.
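
A combined call might look like the sketch below. The endpoint and parameter names are the ones used in the examples on this page; `build_params` and `check_audio` are hypothetical helper names, and the keys under which each model's result appears in the combined response are not documented here, so the sketch returns the parsed JSON as-is.

```python
API_URL = "https://api.sightengine.com/1.0/audio/check.json"


def build_params(models: list[str], api_user: str, api_secret: str) -> dict[str, str]:
    """Build the form fields, joining model names into a comma-separated list."""
    return {
        "models": ",".join(models),
        "api_user": api_user,
        "api_secret": api_secret,
    }


def check_audio(path: str, models: list[str], api_user: str, api_secret: str) -> dict:
    """Upload one audio file and return the parsed JSON response."""
    # Third-party dependency, imported here so build_params stays usable
    # without it.
    import requests

    with open(path, "rb") as audio:
        response = requests.post(
            API_URL,
            files={"audio": audio},
            data=build_params(models, api_user, api_secret),
            timeout=60,
        )
    response.raise_for_status()
    return response.json()


print(build_params(["ai_speech", "ai_music"], "{api_user}", "{api_secret}")["models"])
# ai_speech,ai_music
```

Calling `check_audio('/path/to/audio.mp3', ['ai_speech', 'ai_music'], api_user, api_secret)` would then send a single request covering both models.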
