AI-Generated Speech Detection

BETA ai_speech

Detect if a voice or speech recording was generated with an AI model such as ElevenLabs, OpenAI, PlayHT, Resemble and more.

This model is currently gated. Access is only available to enterprise users and partners. Please reach out for details.

Overview

The AI-Generated Speech Detection Model can help you determine whether a voice or speech recording was generated by an AI model or is a genuine human recording. The model was trained on artificially created and human-recorded speech spanning a wide variety of voices, languages, accents, speaking styles and recording conditions.

The model works by analyzing the acoustic content of the audio waveform. No metadata is used in the analysis, so tampering with metadata has no effect on the score.

The model was trained to detect speech generated by the main voice generators currently in use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid, Microsoft Neural TTS and Google WaveNet. Additional generators will be added over time as they become available.

Use cases

  • Deepfake voice detection
  • Fraud prevention in call centers and customer support
  • Voice-based KYC and identity verification
  • Detection of voice-clone scams and impersonation
  • Journalism, authenticity verification and fact-checking
  • Content moderation for podcasts, voice messages and user-generated audio
  • Limit the spread of audio misinformation and synthetic political speech

Generator-specific information

Sightengine's AI Speech detection model computes per-generator confidence scores alongside a global AI probability score. For every audio file analyzed, the API response includes individual scores for each supported voice generator, giving you a complete fingerprint of the content.

The list of in-scope speech generators covers commercial APIs, open-source models, and emerging providers:

Speech generators

Generator | Creator | Example versions detected
Amazon Polly | Amazon | Neural, Generative voices...
Azure Neural TTS | Microsoft | Neural TTS, VALL-E, VALL-E 2...
Bark | Suno | Bark, Bark Small...
Chatterbox / Resemble | Resemble AI | Chatterbox, Resemble v2...
Coqui | Coqui | XTTS, XTTS v2...
ElevenLabs | ElevenLabs | Multilingual v2, Turbo v2, Flash...
Google TTS | Google | WaveNet, Chirp, Chirp 3...
Grok Voice | xAI | Grok Voice...
HeyGen | HeyGen | HeyGen voices...
MiniMax | MiniMax | Speech-01, Speech-02...
OpenAI | OpenAI | TTS-1, TTS-1-HD, Voice Engine, GPT-4o audio...
Qwen | Alibaba | Qwen3, Qwen3-TTS...
SeamlessM4T | Meta | SeamlessM4T, SeamlessM4T v2...
Synthesia | Synthesia | Synthesia voices...
Tortoise | Neonbjb | Tortoise TTS...
Voicemod | Voicemod | Voicemod AI voices...
Other generators | Various | Camb AI, Cartesia, Descript, Hume AI, Inworld, Kits AI, Kokoro, LMNT, Murf, Parler TTS, PlayHT, Replica Studios, Speechify, Tacotron 2, WellSaid Labs...

This list is not exhaustive: new generators are added continuously as they appear in the wild.

Use the model

If you haven't already, create an account to get your own API keys.

Detect if a speech recording was AI-generated

To analyze a speech recording, simply send a POST request with the audio file. Supported audio formats: OGG, OPUS, FLAC, WAV, MP3, M4A, WEBM.


curl -X POST 'https://api.sightengine.com/1.0/audio/check.json' \
    -F 'audio=@/path/to/audio.mp3' \
    -F 'models=ai_speech' \
    -F 'api_user={api_user}' \
    -F 'api_secret={api_secret}'


# this example uses requests
import requests

params = {
  'models': 'ai_speech',
  'api_user': '{api_user}',
  'api_secret': '{api_secret}'
}

# open the file in a with-block so the handle is closed after the upload
with open('/path/to/audio.mp3', 'rb') as audio:
    r = requests.post('https://api.sightengine.com/1.0/audio/check.json', files={'audio': audio}, data=params)

# decode the JSON body of the response
output = r.json()


$params = array(
  'audio' => new CurlFile('/path/to/audio.mp3'),
  'models' => 'ai_speech',
  'api_user' => '{api_user}',
  'api_secret' => '{api_secret}',
);

// this example uses cURL
$ch = curl_init('https://api.sightengine.com/1.0/audio/check.json');
curl_setopt($ch, CURLOPT_POST, true);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
curl_setopt($ch, CURLOPT_POSTFIELDS, $params);
$response = curl_exec($ch);
curl_close($ch);

$output = json_decode($response, true);


// this example uses axios and form-data
const axios = require('axios');
const FormData = require('form-data');
const fs = require('fs');

const data = new FormData();
data.append('audio', fs.createReadStream('/path/to/audio.mp3'));
data.append('models', 'ai_speech');
data.append('api_user', '{api_user}');
data.append('api_secret', '{api_secret}');

axios({
  method: 'post',
  url:'https://api.sightengine.com/1.0/audio/check.json',
  data: data,
  headers: data.getHeaders()
})
.then(function (response) {
  // on success: handle response
  console.log(response.data);
})
.catch(function (error) {
  // handle error
  if (error.response) console.log(error.response.data);
  else console.log(error.message);
});

Request parameter description

Parameter | Type | Description
audio | file | audio file to analyze
models | string | comma-separated list of models to apply
api_user | string | your API user id
api_secret | string | your API secret
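
Because the audio parameter is a file upload, it can be convenient to check a file locally before sending it. The sketch below is only an illustration: `is_supported_audio` and `SUPPORTED_EXTENSIONS` are hypothetical names, not part of the API, and the check assumes the file extension reflects the actual format, which the API itself may verify differently.

```python
from pathlib import Path

# Extensions matching the formats this page lists as supported.
# Assumption: the extension reflects the real container/codec of the file.
SUPPORTED_EXTENSIONS = {".ogg", ".opus", ".flac", ".wav", ".mp3", ".m4a", ".webm"}


def is_supported_audio(path: str) -> bool:
    """Hypothetical pre-flight check: True if the extension is in the supported list."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("/path/to/audio.mp3"))
print(is_supported_audio("/path/to/clip.aiff"))
```

A check like this only avoids obviously unsuitable uploads; the API response remains the authority on whether a file could be processed.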

API response

The API will then return a JSON response with the following structure:

{
  "status": "success",
  "request": {
    "id": "req_0zrbHDeitGYY7wEGncAne",
    "timestamp": 1491402308.4762,
    "operations": 15
  },
  "type": {
    "ai_speech": 0.98
  },
  "media": {
    "id": "med_0zrbk8nlp4vwI5WxIqQ4u",
    "uri": "speech.mp3"
  }
}

The JSON response contains the ai_speech score. This score is a float between 0 and 1. The higher the value, the higher the confidence that the audio is AI-generated.
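
As a minimal sketch of reading this response in Python, the helper below extracts the score from the structure shown above. The function name `extract_ai_speech_score` is hypothetical, and raising `ValueError` on a non-success status is an assumption made for illustration; the shape of error responses is not described on this page.

```python
import json


def extract_ai_speech_score(response_text: str) -> float:
    """Hypothetical helper: return the ai_speech score from a JSON response body."""
    payload = json.loads(response_text)
    # Assumption: any status other than "success" means no score is available.
    if payload.get("status") != "success":
        raise ValueError(f"request did not succeed: {payload.get('status')!r}")
    score = float(payload["type"]["ai_speech"])
    # The documentation states the score is a float between 0 and 1.
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"unexpected score: {score}")
    return score


# Sample body copied from the response structure documented above.
sample = """
{
  "status": "success",
  "request": {"id": "req_0zrbHDeitGYY7wEGncAne", "timestamp": 1491402308.4762, "operations": 15},
  "type": {"ai_speech": 0.98},
  "media": {"id": "med_0zrbk8nlp4vwI5WxIqQ4u", "uri": "speech.mp3"}
}
"""

print(extract_ai_speech_score(sample))  # prints 0.98
```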

Additional information can be provided, such as a breakdown of the score by time segments and per-generator confidence scores. Please contact sales for more details.

Frequently asked questions

Which AI speech generators are supported?

The model targets the main voice generators currently in use, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid, Microsoft Neural TTS, Google WaveNet, Amazon Polly, Descript Overdub, LMNT, Cartesia, Hume AI, Coqui, Tortoise and Bark, along with smaller and emerging generators. The model is updated on an ongoing basis as new generators become available. See Supported AI generators.

How does detection work without metadata or a watermark?

Detection is purely waveform-based. Metadata and inaudible watermarks are ignored, so stripping them has no effect on the result.

What does the ai_speech score mean?

It is the model's confidence, from 0 to 1, that the analyzed audio was produced by a generative AI voice model. Higher means more likely AI-generated. Scores above 0.5 typically indicate AI-generated speech; tune the threshold to your precision/recall preference.
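
One way to tune the threshold is to measure precision and recall on a small set of recordings whose origin you already know. The sketch below illustrates the idea; the function `precision_recall`, the scores and the labels are all made up for the example and do not come from the API or from real evaluation data.

```python
from typing import List, Tuple


def precision_recall(scores: List[float], labels: List[bool], threshold: float) -> Tuple[float, float]:
    """Precision and recall when scores above `threshold` are flagged as AI-generated.

    labels[i] is True when recording i is known to be AI-generated.
    """
    predicted = [s > threshold for s in scores]
    tp = sum(p and l for p, l in zip(predicted, labels))        # flagged, truly AI
    fp = sum(p and not l for p, l in zip(predicted, labels))    # flagged, actually human
    fn = sum((not p) and l for p, l in zip(predicted, labels))  # missed AI recordings
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Illustrative values only (not real model output).
scores = [0.98, 0.91, 0.62, 0.55, 0.40, 0.08]
labels = [True, True, False, True, False, False]

for threshold in (0.5, 0.9):
    p, r = precision_recall(scores, labels, threshold)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")
```

On this toy data, raising the threshold from 0.5 to 0.9 removes the false positive but misses one AI-generated recording, which is the trade-off to weigh for your use case.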

Will real recordings with noise, compression or post-processing be flagged?

No. Standard post-production such as denoising, equalization, compression, phone-line transmission or social-platform re-encoding is treated as original audio. The model targets fully synthetic generated speech, not edited real recordings.

Does it work on phone calls, voice messages and socially-shared audio?

Yes. Detection is robust to re-encoding, downsampling, narrowband phone audio and standard social-platform recompression. Confidence may drop somewhat on heavily degraded clips, but the model is specifically developed to handle real-world redistribution artifacts.

Does it work across languages and accents?

Yes. The model was trained on speech spanning many languages, accents and speaking styles, and is designed to generalize beyond a single language. Performance is best on the most widely-used languages, with ongoing improvements for additional locales.

Is music detection also available?

Yes. A dedicated model targets AI-generated music tracks. See AI-Generated Music Detection.

Can I call this model together with other Sightengine models?

Yes. Pass a comma-separated list in the models parameter, for example models=ai_speech,ai_music, and the API will return all results in a single response. This is the recommended pattern for production pipelines.
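
A minimal sketch of building such a request payload in Python follows. The helper `build_params` is a hypothetical name introduced for illustration; the resulting dictionary is meant to be passed as the form data of the POST request, in the same way as the single-model Python example. The response fields returned for models other than ai_speech are not covered here.

```python
from typing import Dict, List


def build_params(models: List[str], api_user: str, api_secret: str) -> Dict[str, str]:
    """Hypothetical helper: build form fields with a comma-separated models list."""
    if not models:
        raise ValueError("at least one model name is required")
    return {
        "models": ",".join(models),  # e.g. "ai_speech,ai_music"
        "api_user": api_user,
        "api_secret": api_secret,
    }


params = build_params(["ai_speech", "ai_music"], "{api_user}", "{api_secret}")
print(params["models"])  # prints ai_speech,ai_music
```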
