AI-Generated Speech Detection
BETA ai_speech
Detect if a voice or speech recording was generated with an AI model such as ElevenLabs, OpenAI, PlayHT, Resemble and more.
This model is currently gated. Access is only available to enterprise users and partners. Please reach out for details.
Overview
The AI-Generated Speech Detection Model can help you determine whether a voice or speech recording was generated by an AI model or is a genuine human recording. The model was trained on artificially created and human-recorded speech spanning a wide variety of voices, languages, accents, speaking styles and recording conditions.
The model works by analyzing the acoustic content of the audio waveform. No metadata is used in the analysis, so tampering with metadata has no effect on the scoring.
The model was trained to detect speech generated by the main voice generators currently in use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid, Microsoft Neural TTS and Google WaveNet, among others. Additional generators will be added over time as they become available.
Use cases
- Deepfake voice detection
- Fraud prevention in call centers and customer support
- Voice-based KYC and identity verification
- Detection of voice-clone scams and impersonation
- Journalism, authenticity verification and fact-checking
- Content moderation for podcasts, voice messages and user-generated audio
- Limiting the spread of audio misinformation and synthetic political speech
Related model
The following model can provide a useful complement to the AI-generated speech model:
- AI Music Detection: Detect AI-generated music tracks.
Generator-specific information
Sightengine's AI Speech detection model computes per-generator confidence scores alongside a global AI probability score. For every audio file analyzed, the API response includes individual scores for each supported voice generator, giving you a complete fingerprint of the content.
The list of in-scope speech generators covers commercial APIs, open-source models, and emerging providers:
Speech generators
| Generator | Creator | Example versions detected |
| Amazon Polly | Amazon | Neural, Generative voices... |
| Azure Neural TTS | Microsoft | Neural TTS, VALL-E, VALL-E 2... |
| Bark | Suno | Bark, Bark Small... |
| Chatterbox / Resemble | Resemble AI | Chatterbox, Resemble v2... |
| Coqui | Coqui | XTTS, XTTS v2... |
| ElevenLabs | ElevenLabs | Multilingual v2, Turbo v2, Flash... |
| Google TTS | Google | WaveNet, Chirp, Chirp 3... |
| Grok Voice | xAI | Grok Voice... |
| HeyGen | HeyGen | HeyGen voices... |
| MiniMax | MiniMax | Speech-01, Speech-02... |
| OpenAI | OpenAI | TTS-1, TTS-1-HD, Voice Engine, GPT-4o audio... |
| Qwen | Alibaba | Qwen3, Qwen3-TTS... |
| SeamlessM4T | Meta | SeamlessM4T, SeamlessM4T v2... |
| Synthesia | Synthesia | Synthesia voices... |
| Tortoise | Neonbjb | Tortoise TTS... |
| Voicemod | Voicemod | Voicemod AI voices... |
| Other generators | Various | Camb AI, Cartesia, Descript, Hume AI, Inworld, Kits AI, Kokoro, LMNT, Murf, Parler TTS, PlayHT, Replica Studios, Speechify, Tacotron 2, WellSaid Labs... |
New generators are added continuously as they appear in the wild.
Use the model
If you haven't already, create an account to get your own API keys.
Detect if a speech recording was AI-generated
To analyze a speech recording, simply send a POST request with the audio file. Supported audio formats: OGG, OPUS, FLAC, WAV, MP3, M4A, WEBM.
curl -X POST 'https://api.sightengine.com/1.0/audio/check.json' \
-F 'audio=@/path/to/audio.mp3' \
-F 'models=ai_speech' \
-F 'api_user={api_user}' \
-F 'api_secret={api_secret}'
# this example uses requests
import requests
import json
params = {
    'models': 'ai_speech',
    'api_user': '{api_user}',
    'api_secret': '{api_secret}'
}
# open the file in a context manager so the handle is closed after the upload
with open('/path/to/audio.mp3', 'rb') as audio_file:
    r = requests.post('https://api.sightengine.com/1.0/audio/check.json', files={'audio': audio_file}, data=params)
output = json.loads(r.text)
$params = array(
    'audio' => new CurlFile('/path/to/audio.mp3'),
    'models' => 'ai_speech',
    'api_user' => '{api_user}',
    'api_secret' => '{api_secret}',
);
// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/audio/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);
$output = json_decode($response, true);
// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');
const data = new FormData();
data.append('audio', fs.createReadStream('/path/to/audio.mp3'));
data.append('models', 'ai_speech');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');
axios({
  method: 'post',
  url: 'https://api.sightengine.com/1.0/audio/check.json',
  data: data,
  headers: data.getHeaders()
})
  .then(function (response) {
    // on success: handle response
    console.log(response.data);
  })
  .catch(function (error) {
    // handle error
    if (error.response) console.log(error.response.data);
    else console.log(error.message);
  });
Request parameters
| Parameter | Type | Description |
| audio | file | audio file to analyze |
| models | string | comma-separated list of models to apply |
| api_user | string | your API user id |
| api_secret | string | your API secret |
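Before uploading, a client may want to reject files whose extension is not in the supported format list (OGG, OPUS, FLAC, WAV, MP3, M4A, WEBM). The sketch below is a client-side convenience only: the helper name `is_supported_audio` is hypothetical, and it assumes the file extension reflects the actual container, which the API may check differently on its side.

```python
from pathlib import Path

# Extensions derived from the supported formats listed in this page.
SUPPORTED_EXTENSIONS = {".ogg", ".opus", ".flac", ".wav", ".mp3", ".m4a", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Return True if the file extension matches a supported audio format.

    Hypothetical helper: it only inspects the file name, not the content.
    """
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


if __name__ == "__main__":
    for candidate in ["/path/to/audio.mp3", "voice_note.OPUS", "notes.txt"]:
        print(candidate, is_supported_audio(candidate))
```

Checking the extension first avoids spending an API call on a file that is clearly not audio.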
API response
The API will then return a JSON response with the following structure:
{
"status": "success",
"request": {
"id": "req_0zrbHDeitGYY7wEGncAne",
"timestamp": 1491402308.4762,
"operations": 15
},
"type": {
"ai_speech": 0.98
},
"media": {
"id": "med_0zrbk8nlp4vwI5WxIqQ4u",
"uri": "speech.mp3"
}
}
The JSON response contains the ai_speech score. This score is a float between 0 and 1. The higher the value, the higher the confidence that the audio is AI-generated.
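As a minimal sketch, the score can be read from the parsed response as follows. The function name `get_ai_speech_score` is hypothetical, and the handling of non-success responses is an assumption: this page only documents the success structure, so any other `status` value is simply treated as an error here.

```python
import json

# Sample response copied from the structure documented above.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "request": {
    "id": "req_0zrbHDeitGYY7wEGncAne",
    "timestamp": 1491402308.4762,
    "operations": 15
  },
  "type": {
    "ai_speech": 0.98
  },
  "media": {
    "id": "med_0zrbk8nlp4vwI5WxIqQ4u",
    "uri": "speech.mp3"
  }
}
"""


def get_ai_speech_score(response: dict) -> float:
    """Return the ai_speech score from a parsed API response.

    Assumption: anything other than status == "success" is an error;
    the error payload format is not described in this page.
    """
    if response.get("status") != "success":
        raise ValueError(f"request did not succeed: {response!r}")
    return float(response["type"]["ai_speech"])


score = get_ai_speech_score(json.loads(SAMPLE_RESPONSE))
print(score)
```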
Additional information can be provided, such as a breakdown of the score by time segments and per-generator confidence scores. Please contact sales for more details.
Frequently asked questions
Which AI speech generators are supported?
The model targets the main voice generators in current use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid, Microsoft Neural TTS, Google WaveNet, Amazon Polly, Descript Overdub, LMNT, Cartesia, Hume AI, Coqui, Tortoise and Bark, along with smaller and emerging generators. The model is updated on an ongoing basis as new generators become available. See Supported AI generators.
How does detection work without metadata or a watermark?
Detection is purely waveform-based. Metadata and inaudible watermarks are ignored, so stripping them has no effect on the result.
What does the ai_speech score mean?
It is the model's confidence, from 0 to 1, that the analyzed audio was produced by a generative AI voice model. Higher means more likely AI-generated. Scores above 0.5 typically indicate AI-generated speech; tune the threshold to your precision/recall preference.
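One way to pick a threshold is to measure precision and recall on a labeled sample of your own audio. The sketch below assumes you have collected `ai_speech` scores together with ground-truth labels; the scores and labels shown are made-up illustrative values, not real model output, and the function names are hypothetical.

```python
def is_ai_generated(score: float, threshold: float = 0.5) -> bool:
    """Flag audio as AI-generated when the score is above the threshold."""
    return score > threshold


def precision_recall(scores, labels, threshold):
    """Compute (precision, recall) for the positive class "AI-generated".

    labels: True when the clip is known to be AI-generated.
    """
    predicted = [is_ai_generated(s, threshold) for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))
    fp = sum(p and not l for p, l in zip(predicted, labels))
    fn = sum((not p) and l for p, l in zip(predicted, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Illustrative values only (assumption, not measured data).
scores = [0.98, 0.91, 0.62, 0.40, 0.55, 0.08]
labels = [True, True, True, True, False, False]

for threshold in (0.5, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f} recall={r:.2f}")
```

On this toy sample, raising the threshold from 0.5 to 0.9 removes the false positive but misses more AI-generated clips, which is the trade-off to weigh for your use case.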
Will real recordings with noise, compression or post-processing be flagged?
No. Standard post-production such as denoising, equalization, compression, phone-line transmission or social-platform re-encoding is treated as original audio. The model targets fully synthetic generated speech, not edited real recordings.
Does it work on phone calls, voice messages and socially-shared audio?
Yes. Detection is robust to re-encoding, downsampling, narrowband phone audio and standard social-platform recompression. Confidence may drop somewhat on heavily degraded clips, but the model is specifically developed to handle real-world redistribution artifacts.
Does it work across languages and accents?
Yes. The model was trained on speech spanning many languages, accents and speaking styles, and is designed to generalize beyond a single language. Performance is best on the most widely-used languages, with ongoing improvements for additional locales.
Is music detection also available?
Yes. A dedicated model targets AI-generated music tracks. See AI-Generated Music Detection.
Can I call this model together with other Sightengine models?
Yes. Pass a comma-separated list in the models parameter, for example models=ai_speech,ai_music, and the API will return all results in a single response. This is the recommended pattern for production pipelines.
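A small sketch of how the form fields for a combined call could be assembled in Python. The helper names `build_models_param` and `build_request_fields` are hypothetical; the field names (`models`, `api_user`, `api_secret`) and the model ids come from this page. The structure of the combined response is not described here, so the sketch stops at building the request.

```python
def build_models_param(models) -> str:
    """Join model ids into the comma-separated value expected by `models`.

    Blank entries are dropped and duplicates removed, keeping first-seen order.
    """
    cleaned = [m.strip() for m in models if m.strip()]
    if not cleaned:
        raise ValueError("at least one model id is required")
    return ",".join(dict.fromkeys(cleaned))


def build_request_fields(api_user: str, api_secret: str, models) -> dict:
    """Return the non-file form fields for a check request."""
    return {
        "models": build_models_param(models),
        "api_user": api_user,
        "api_secret": api_secret,
    }


fields = build_request_fields("{api_user}", "{api_secret}", ["ai_speech", "ai_music"])
print(fields["models"])
```

The resulting dictionary can be passed as the `data` argument of `requests.post`, alongside the audio file in `files`, as in the Python example earlier in this page.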