Get a clear verdict on any voice recording, not just a score. Detects 55 voice generators including ElevenLabs, OpenAI, PlayHT and Resemble, with no watermark or metadata needed.
High confidence this recording was produced by a voice model.
Per-generator confidence
Block voice-clone scams and impersonation in call centers, customer support and over the phone.
Strengthen voice-based KYC and identity verification against synthetic speech and audio deepfakes.
Flag AI-generated voice messages, podcasts and synthetic political speech to limit the spread of audio misinformation.
Drop in any recording. The model analyzes the raw waveform and returns a calibrated score plus a per-generator breakdown — no metadata, no watermark, no setup.
Upload a file or POST a URL. OGG, OPUS, FLAC, WAV, MP3, M4A and WEBM are all supported.
A model trained on millions of real and synthetic samples inspects acoustic artifacts the human ear cannot perceive.
Get a 0–1 confidence score, a clear verdict and per-generator attribution — returned in a single JSON response.
Single-vendor classifiers only recognize their own voices. Sightengine returns a global AI probability plus a confidence score for each generator — a complete fingerprint of any recording, whoever made it.
Detection is purely waveform-based, so it holds up even when metadata has been stripped, no watermark exists, or the clip has been re-encoded by a phone line or social platform.
A real person on a call with noise suppression. A podcast cleaned by a studio tool. A low-bandwidth stream rebuilt by a neural codec. Every other detector sees the AI fingerprint and cries "fake" — blocking real users. Sightengine scores generation and processing as two independent signals, so AI cleanup is never mistaken for a synthetic voice.
Zoom, Teams, Meet and Krisp-style tools re-render a real voice in real time to kill background noise.
Low-bandwidth calls and streaming re-synthesize the waveform from a compressed representation — modern compression, not a fake.
Podcast and voice-cleanup tools that turn a phone recording into "studio" audio.
Restoring or upsampling old or low-quality recordings of a real person.
A real human spoke every one of these. A single-score detector flags them all.
| AI-generated (ai_speech) | AI-processed (ai_processed) | Honest verdict |
|---|---|---|
| Low | Low | Genuine — clean recording |
| Low | High | Genuine voice, AI-processed: a real person (other detectors call this FAKE) |
| High | — | Synthetic — AI-generated voice |
In our benchmarks, even the strongest detectors on the market flag the middle row as AI-generated, with no way to tell a denoised real caller from a synthetic one. Sightengine separates the two signals, so AI cleanup reads as processed, while genuine fakes still fire on ai_speech.
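The verdict table above can be read as a small decision rule. The following is a minimal Python sketch of that rule, not Sightengine's own logic: the field names ai_speech and ai_processed come from the table, while the 0.5 cut-offs, the function name `verdict` and the returned labels are assumptions chosen for illustration.

```python
from typing import Optional

# Assumed cut-offs separating "Low" from "High"; tune them for your own traffic.
AI_SPEECH_THRESHOLD = 0.5
AI_PROCESSED_THRESHOLD = 0.5


def verdict(ai_speech: float, ai_processed: Optional[float] = None) -> str:
    """Map the two scores to one of the three verdicts in the table."""
    # A high ai_speech score means synthetic, whatever ai_processed says.
    if ai_speech > AI_SPEECH_THRESHOLD:
        return "synthetic"
    # Low ai_speech with high ai_processed: a real voice cleaned up by AI tooling.
    if ai_processed is not None and ai_processed > AI_PROCESSED_THRESHOLD:
        return "genuine_ai_processed"
    # Both scores low: a clean, genuine recording.
    return "genuine"
```

Under these assumptions, a denoised real caller (low ai_speech, high ai_processed) lands in the middle category instead of being rejected as synthetic.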
Denoised, compressed or enhanced real audio no longer trips a false alarm — the #1 reason detection APIs get pulled from production.
When ai_speech fires it means something, because benign processing isn't polluting the score.
Two independent scores. Treat processing as informational or gate on it — and label AI-touched vs AI-generated for C2PA and the EU AI Act.
Most detectors only catch their own model, break on compressed audio, or never scale past a demo. Here is how the approaches compare.
| Capability | Single-vendor classifier | Generic AI detector | Sightengine |
|---|---|---|---|
| Detects all major generators | ✗ Own model only | ~ Partial | ✓ 55 |
| Per-generator attribution | ✗ | ✗ | ✓ |
| Tells AI-processed from AI-generated | ✗ | ✗ Flags both as AI | ✓ Two scores |
| Works without watermark / metadata | ~ | ✓ | ✓ |
| Robust to phone / compressed audio | ✗ | ~ | ✓ |
| Multilingual coverage | ~ | ~ | ✓ |
| Production REST API at scale | ~ | ✗ | ✓ Billions/mo |
| Generator | Creator | Example versions detected |
|---|---|---|
| Amazon Polly | Amazon | Neural, Generative voices... |
| Azure Neural TTS | Microsoft | Neural TTS, VALL-E, VALL-E 2... |
| Bark | Suno | Bark, Bark Small... |
| Chatterbox / Resemble | Resemble AI | Chatterbox, Resemble v2... |
| Coqui | Coqui | XTTS, XTTS v2... |
| CosyVoice | Alibaba | CosyVoice, CosyVoice 2... |
| ElevenLabs | ElevenLabs | Multilingual v2, Turbo v2, Flash... |
| F5-TTS | SWivid | F5-TTS, F5-TTS v1... |
| Fish Audio | Fish Audio | Fish Pro S1, Fish Pro S2... |
| Google TTS | Google | WaveNet, Chirp, Chirp 3... |
| GPT-SoVITS | RVC-Boss | GPT-SoVITS v1, v2, v3... |
| MiniMax | MiniMax | Speech-01, Speech-02... |
| OpenAI | OpenAI | TTS-1, TTS-1-HD, Voice Engine, GPT-4o audio... |
| Qwen | Alibaba | Qwen3, Qwen3-TTS... |
| SeamlessM4T | Meta | SeamlessM4T, SeamlessM4T v2... |
| Tortoise | Neonbjb | Tortoise TTS... |
| Other generators | Various | Camb AI, Cartesia, Descript, Hume AI, Inworld, Kits AI, Kokoro, LMNT, Murf, Parler TTS, PlayHT, Replica Studios, Speechify, WellSaid Labs... |
And more — new generators are added continuously as they appear in the wild.
The ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf and other trademarks and logos are the property of their respective trademark holders. They are not affiliated with Sightengine.
A single REST call returns the AI-speech score for any recording. Designed by developers, for developers — quick to integrate and built to scale.
curl -X POST 'https://api.sightengine.com/1.0/audio/check.json' \
  -F 'media=@/path/to/speech.mp3' \
  -F 'models=ai_speech' \
  -F 'api_user={API_USER}' \
  -F 'api_secret={API_SECRET}'
{
  "status": "success",
  "request": {
    "id": "req_0zrbHDeitGYY7wEG",
    "operations": 15
  },
  "type": {
    "ai_speech": 0.98 // 98% AI
  },
  "media": { "uri": "speech.mp3" }
}
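Reading the score out of that response takes only a few lines. The sketch below uses Python's standard json module on a copy of the sample body shown above. The helper name `ai_speech_score` is hypothetical, and the handling of a non-success status is an assumption, since the page only shows a successful response.

```python
import json

# Sample body copied from the response above, minus the inline "// 98% AI"
# annotation, which is not valid JSON.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "request": {"id": "req_0zrbHDeitGYY7wEG", "operations": 15},
  "type": {"ai_speech": 0.98},
  "media": {"uri": "speech.mp3"}
}
"""


def ai_speech_score(body: str) -> float:
    """Return the ai_speech score from a response body."""
    data = json.loads(body)
    # Assumption: any status other than "success" should be treated as an error.
    if data.get("status") != "success":
        raise ValueError(f"request did not succeed: {data}")
    return float(data["type"]["ai_speech"])


print(ai_speech_score(SAMPLE_RESPONSE))  # prints 0.98
```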
From a handful to billions of items per month, with no change to your integration.
No human reviewers in the loop. Your audio stays private — just as your users expect.
Combine with ai_music and other models in a single request: models=ai_speech,ai_music.
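As a rough illustration of combining models, the Python sketch below builds the text form fields for such a request. The field names models, api_user and api_secret mirror the curl sample on this page. The helper `build_form_fields` is hypothetical, and the audio file itself would still be sent separately as the multipart media field.

```python
from typing import Dict, Iterable


def build_form_fields(models: Iterable[str], api_user: str, api_secret: str) -> Dict[str, str]:
    """Build the text form fields for an audio check request."""
    model_list = list(models)
    if not model_list:
        raise ValueError("at least one model is required")
    return {
        # Several models are passed as one comma-separated value.
        "models": ",".join(model_list),
        "api_user": api_user,
        "api_secret": api_secret,
    }


fields = build_form_fields(["ai_speech", "ai_music"], "{API_USER}", "{API_SECRET}")
print(fields["models"])  # prints ai_speech,ai_music
```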
Modern text-to-speech and voice-cloning models now produce voices that are virtually indistinguishable to the human ear. In blind listening tests, people score little better than a coin flip.
Our model inspects subtle, sub-perceptual acoustic artifacts left behind by generative pipelines — and stays reliable on exactly the kind of compressed, real-world audio that trips up the human ear.
Take the "AI or not" test
Two clips that sound identical to most listeners. The model is confident on both.
Detect synthetic media across every modality and moderate user-generated content from a single API.
Which AI speech generators can it detect?
The model targets 55 voice generators, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid Labs, Microsoft Azure Neural TTS and VALL-E, Google WaveNet and Chirp, Amazon Polly, Descript, LMNT, Cartesia, Hume AI, MiniMax, Fish Audio, Coqui, Tortoise and Bark, along with smaller and emerging generators. The model is updated on an ongoing basis as new generators become available.
How does detection work without metadata or a watermark?
Detection is purely waveform-based. The model only analyzes the acoustic content of the audio. Metadata and inaudible watermarks are ignored, so stripping or altering them has no effect on the result, which makes the model robust to common evasion techniques.
Will real recordings with noise, compression or post-processing be flagged as AI-generated?
No. The model is designed to distinguish synthetic speech from real recordings that have been compressed, denoised, equalized or transmitted over a phone line. Standard post-processing does not push the score above the AI-generated threshold.
Does it work on phone calls, voice messages and socially-shared audio?
Yes. Detection is robust to re-encoding, downsampling, narrowband phone audio and standard social-platform recompression. Confidence may drop somewhat on heavily degraded clips, but the model is specifically developed to handle real-world redistribution artifacts.
Does it work across languages and accents?
Yes. The model was trained on speech spanning many languages, accents and speaking styles, and is designed to generalize beyond a single language. Performance is best on the most widely-used languages, with ongoing improvements for additional locales.
What does the score mean?
It is the model's confidence, from 0 to 1, that the analyzed audio was produced by a generative AI voice model. Higher means more likely AI-generated. Scores above 0.5 typically indicate AI-generated speech. For high-precision workflows such as identity verification, KYC or fraud detection, use a higher threshold such as 0.8.
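The thresholds mentioned above can be applied with a small helper. This is a minimal Python sketch: the 0.5 and 0.8 values come from the answer above, while the function name `is_ai_generated` and the `high_precision` flag are assumptions for illustration.

```python
DEFAULT_THRESHOLD = 0.5         # "scores above 0.5 typically indicate AI-generated speech"
HIGH_PRECISION_THRESHOLD = 0.8  # suggested for identity verification, KYC or fraud detection


def is_ai_generated(score: float, high_precision: bool = False) -> bool:
    """Return True when the score exceeds the threshold for the chosen workflow."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    threshold = HIGH_PRECISION_THRESHOLD if high_precision else DEFAULT_THRESHOLD
    return score > threshold
```

With these values, a score of 0.65 counts as AI-generated in a general moderation flow but not in a high-precision one, where it could instead be routed to a secondary check.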
How is AI-generated speech detection different from AI-generated music detection?
AI-generated speech detection targets synthetic voices and text-to-speech output such as ElevenLabs and OpenAI. AI-generated music detection targets fully generated music tracks such as Suno and Udio. The two models are complementary and can be combined in a single API call.
Is there an API for businesses?
Yes. The same detection technology is available via a REST API. Pass models=ai_speech to the audio check endpoint. Documentation, code examples in major languages, and enterprise SLAs are available.