Audio authenticity API

AI Voice Detector — detect AI-generated speech across every generator

Get a clear verdict on any voice recording, not just a score. Detects 55 voice generators including ElevenLabs, OpenAI, PlayHT and Resemble, with no watermark or metadata needed.

No watermark or metadata needed · 55 voice generators & 48 languages · Robust to phone & compressed audio
Sample analysis
voice-clone.mp3 (0:14)
Likely AI-generated (96%)

Synthetic voice detected

High confidence this recording was produced by a voice model.

Per-generator confidence

ElevenLabs: 96%
PlayHT: 6%
OpenAI: 3%
Example response — ai_speech score with per-generator breakdown
55 voice generators detected
97.51% global accuracy
48 languages covered
7 audio formats supported

Built for fraud prevention, identity verification, content moderation...

Voice clone scams

Voice cloning and impersonation

Block voice-clone scams and impersonation in call centers, customer support and over the phone.

Voice KYC

Voice-based identity verification

Strengthen voice-based KYC and identity verification against synthetic speech and audio deepfakes.

Audio misinformation

Synthetic political speech and misinformation

Flag AI-generated voice messages, podcasts and synthetic political speech to limit the spread of audio misinformation.

How it works

From clip to verdict in three steps

Drop in any recording. The model analyzes the raw waveform and returns a calibrated score plus a per-generator breakdown — no metadata, no watermark, no setup.

01 — SUBMIT

Drop in audio

Upload a file or POST a URL. OGG, OPUS, FLAC, WAV, MP3, M4A and WEBM are all supported.

7 audio formats
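
A client-side pre-check can reject unsupported files before they are uploaded. The sketch below is a minimal illustration, not part of the Sightengine API: the helper name `is_supported_audio` and the extension spellings are assumptions derived from the seven formats listed above, and the service itself may identify formats by content rather than by file name.

```python
from pathlib import Path

# The seven formats named in the text; the extension spellings are assumptions.
SUPPORTED_EXTENSIONS = {".ogg", ".opus", ".flac", ".wav", ".mp3", ".m4a", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("voice-clone.mp3"))
print(is_supported_audio("notes.txt"))
```

The check is case-insensitive, so a file named CALL.WAV passes as well.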
02 — ANALYZE

Waveform analysis

A model trained on millions of real and synthetic samples inspects acoustic artifacts the human ear cannot perceive.

Pure acoustic signal
03 — VERDICT

Citable result

Get a 0–1 confidence score, a clear verdict and per-generator attribution — returned in a single JSON response.

Score + attribution
voice-clone.mp3 (0:14)
Likely AI-generated (96%)

Synthetic voice detected

High confidence the recording was produced by a generative voice model.

Per-generator confidence

ElevenLabs: 96%
PlayHT: 6%
OpenAI: 3%
Human: 2%
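
Once a per-generator breakdown is available, picking the most likely source is a one-line lookup. The sketch below assumes the breakdown has already been turned into a plain mapping from generator name to a 0 to 1 confidence. That mapping layout and the helper name `top_generator` are hypothetical, since the page does not show how the breakdown appears in the JSON response. The values are copied from the sample analysis above.

```python
def top_generator(per_generator: dict[str, float]) -> tuple[str, float]:
    """Return the (name, confidence) pair with the highest confidence."""
    if not per_generator:
        raise ValueError("per_generator is empty")
    name = max(per_generator, key=per_generator.get)
    return name, per_generator[name]


# Values copied from the sample analysis; the dict layout is hypothetical.
sample = {"ElevenLabs": 0.96, "PlayHT": 0.06, "OpenAI": 0.03, "Human": 0.02}
print(top_generator(sample))
```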
Per-generator detection

One verdict. Every generator.

Single-vendor classifiers only recognize their own voices. Sightengine returns a global AI probability plus a confidence score for each generator — a complete fingerprint of any recording, whoever made it.

Voice cloning

ElevenLabs, Resemble AI, PlayHT, Descript, Fish Audio

Text-to-speech

OpenAI, Azure Neural TTS, Google WaveNet, Amazon Polly, Murf, WellSaid

Open-source

Coqui XTTS, Bark, Tortoise, F5-TTS, GPT-SoVITS

Emerging

VALL-E, CosyVoice, MiniMax, Qwen-TTS, SeamlessM4T and many more

Detection is purely waveform-based, so it holds up even when metadata has been stripped, no watermark exists, or the clip has been re-encoded by a phone line or social platform.

Fewer false positives

AI-processed isn't AI-generated

A real person on a call with noise suppression. A podcast cleaned by a studio tool. A low-bandwidth stream rebuilt by a neural codec. Single-score detectors see the AI fingerprint and cry "fake", blocking real users. Sightengine scores generation and processing as two independent signals, so AI cleanup is not mistaken for a synthetic voice.

support-call.wav 0:21
AI-generated (ai_speech): 0.04
AI-processed (ai_processed): 0.88
Genuine human voice — passed through an AI pipeline (denoised), not synthetic. Don't block this user.

Real audio is AI-touched everywhere

Videoconference noise suppression

Zoom, Teams, Meet and Krisp-style tools re-render a real voice in real time to kill background noise.

Neural codecs

Low-bandwidth calls and streaming re-synthesize the waveform from a compressed representation — modern compression, not a fake.

AI audio enhancement

Podcast and voice-cleanup tools that turn a phone recording into "studio" audio.

Super-resolution & upscaling

Restoring or upsampling old or low-quality recordings of a real person.

A real human spoke every one of these. A single-score detector flags them all.

AI-generated (ai_speech) | AI-processed (ai_processed) | Honest verdict
Low | Low | Genuine: clean recording
Low | High | Genuine voice, AI-processed: a real person (other detectors call this FAKE)
High | - | Synthetic: AI-generated voice

In our benchmarks, even the strongest detectors on the market flag the middle row as AI-generated, with no way to tell a denoised real caller from a synthetic one. Sightengine separates the two signals, so AI cleanup reads as processed, while genuine fakes still fire on ai_speech.
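
The verdict table above can be written as a small decision function. This is a sketch under stated assumptions, not Sightengine's own logic: the function name `honest_verdict` is hypothetical, a single 0.5 cutoff is applied to both scores for simplicity, and a high `ai_speech` score is treated as synthetic regardless of `ai_processed`.

```python
def honest_verdict(ai_speech: float, ai_processed: float, threshold: float = 0.5) -> str:
    """Map the two scores to the three rows of the verdict table."""
    if ai_speech >= threshold:
        # Assumption: a high ai_speech score means synthetic, whatever ai_processed says.
        return "synthetic"
    if ai_processed >= threshold:
        # Real voice that went through denoising, a neural codec or enhancement.
        return "genuine, AI-processed"
    return "genuine, clean"


# Scores from the support-call.wav example.
print(honest_verdict(0.04, 0.88))
```

Checking `ai_speech` first keeps the "fake" verdict independent of how heavily the clip was processed.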

Stop blocking real users

Denoised, compressed or enhanced real audio no longer trips a false alarm — the #1 reason detection APIs get pulled from production.

Trust the "fake" verdict

When ai_speech fires it means something, because benign processing isn't polluting the score.

Your policy, your thresholds

Two independent scores. Treat processing as informational or gate on it — and label AI-touched vs AI-generated for C2PA and the EU AI Act.

Why Sightengine

Built for breadth, scale and the real world

Most detectors only catch their own model, break on compressed audio, or never scale past a demo. Here is how the approaches compare.

Capability | Single-vendor classifier | Generic AI detector | Sightengine
Detects all major generators | Own model only | ~ Partial | 55
Per-generator attribution
Tells AI-processed from AI-generated Flags both as AI Two scores
Works without watermark / metadata~
Robust to phone / compressed audio~
Multilingual coverage~~
Production REST API at scale~ Billions/mo
See the full list of supported generators
Generator | Creator | Example versions detected
Amazon Polly | Amazon | Neural, Generative voices...
Azure Neural TTS | Microsoft | Neural TTS, VALL-E, VALL-E 2...
Bark | Suno | Bark, Bark Small...
Chatterbox / Resemble | Resemble AI | Chatterbox, Resemble v2...
Coqui | Coqui | XTTS, XTTS v2...
CosyVoice | Alibaba | CosyVoice, CosyVoice 2...
ElevenLabs | ElevenLabs | Multilingual v2, Turbo v2, Flash...
F5-TTS | SWivid | F5-TTS, F5-TTS v1...
Fish Audio | Fish Audio | Fish Pro S1, Fish Pro S2...
Google TTS | Google | WaveNet, Chirp, Chirp 3...
GPT-SoVITS | RVC-Boss | GPT-SoVITS v1, v2, v3...
MiniMax | MiniMax | Speech-01, Speech-02...
OpenAI | OpenAI | TTS-1, TTS-1-HD, Voice Engine, GPT-4o audio...
Qwen | Alibaba | Qwen3, Qwen3-TTS...
SeamlessM4T | Meta | SeamlessM4T, SeamlessM4T v2...
Tortoise | Neonbjb | Tortoise TTS...
Other generators | Various | Camb AI, Cartesia, Descript, Hume AI, Inworld, Kits AI, Kokoro, LMNT, Murf, Parler TTS, PlayHT, Replica Studios, Speechify, WellSaid Labs...

And more — new generators are added continuously as they appear in the wild.

The ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf and other trademarks and logos are the property of their respective trademark holders. They are not affiliated with Sightengine.

Developer API

One request. A structured verdict.

A single REST call returns the AI-speech score for any recording. Designed by developers, for developers — quick to integrate and built to scale.

request — cURL
curl -X POST 'https://api.sightengine.com/1.0/audio/check.json' \
     -F 'media=@/path/to/speech.mp3' \
     -F 'models=ai_speech' \
     -F 'api_user={API_USER}' \
     -F 'api_secret={API_SECRET}'
response — 200 OK
{
  "status": "success",
  "request": {
    "id": "req_0zrbHDeitGYY7wEG",
    "operations": 15
  },
  "type": {
    "ai_speech": 0.98
  },
  "media": { "uri": "speech.mp3" }
}
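
Reading the score out of this response takes only a few lines. The sketch below parses the example body above (with the inline annotation removed so it is valid JSON) and returns the `ai_speech` value. The helper name `extract_ai_speech` is hypothetical, and because the page does not show what an error response looks like, the failure branch simply raises on any status other than "success".

```python
import json

# The documented example response, with the inline comment removed so it parses.
SAMPLE_RESPONSE = """
{
  "status": "success",
  "request": {"id": "req_0zrbHDeitGYY7wEG", "operations": 15},
  "type": {"ai_speech": 0.98},
  "media": {"uri": "speech.mp3"}
}
"""


def extract_ai_speech(body: str) -> float:
    """Parse a response body and return the ai_speech score."""
    data = json.loads(body)
    if data.get("status") != "success":
        # Error response shape is not shown in the text; report the raw status.
        raise RuntimeError(f"request did not succeed: status={data.get('status')!r}")
    return float(data["type"]["ai_speech"])


print(extract_ai_speech(SAMPLE_RESPONSE))
```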

Scale to the sky

From a handful to billions of items per month, with no change to your integration.

Absolute privacy

No human reviewers in the loop. Your audio stays private — just as your users expect.

One call, many models

Combine with ai_music and other models in a single request: models=ai_speech,ai_music.
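
As a rough sketch of how a client might assemble such a combined request, the helper below builds the non-file form fields, joining model names with commas as in `models=ai_speech,ai_music`. The field names come from the cURL example; the helper name `build_form_fields` is hypothetical, and the audio file itself would still be attached as the `media` part by whichever HTTP client is used.

```python
def build_form_fields(models: list[str], api_user: str, api_secret: str) -> dict[str, str]:
    """Build the non-file form fields for an audio check request."""
    if not models:
        raise ValueError("at least one model is required")
    return {
        "models": ",".join(models),  # e.g. "ai_speech,ai_music"
        "api_user": api_user,
        "api_secret": api_secret,
    }


# Placeholders mirror the cURL example; substitute real credentials.
fields = build_form_fields(["ai_speech", "ai_music"], "{API_USER}", "{API_SECRET}")
print(fields["models"])
```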

Super-human accuracy

Your ear can't tell. The model can.

Modern text-to-speech and voice-cloning models now produce voices that are virtually indistinguishable to the human ear. In blind listening tests, people score little better than a coin flip.

Our model inspects subtle, sub-perceptual acoustic artifacts left behind by generative pipelines — and stays reliable on exactly the kind of compressed, real-world audio that trips up the human ear.

Take the "AI or not" test

Sample A

Real ✓
Human listeners: 51% · Model: Authentic

Sample B

AI ✓
Human listeners: 48% · Model: Synthetic

Two clips that sound identical to most listeners. The model is confident on both.

FAQ

Frequently asked questions

Which AI speech generators can it detect?

The model targets 55 voice generators, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid Labs, Microsoft Azure Neural TTS and VALL-E, Google WaveNet and Chirp, Amazon Polly, Descript, LMNT, Cartesia, Hume AI, MiniMax, Fish Audio, Coqui, Tortoise and Bark, along with smaller and emerging generators. The model is updated on an ongoing basis as new generators become available.

How does detection work without metadata or a watermark?

Detection is purely waveform-based. The model only analyzes the acoustic content of the audio. Metadata and inaudible watermarks are ignored, so stripping or altering them has no effect on the result, which makes the model robust to common evasion techniques.

Will real recordings with noise, compression or post-processing be flagged as AI-generated?

No. The model is designed to distinguish synthetic speech from real recordings that have been compressed, denoised, equalized or transmitted over a phone line. Standard post-processing does not push the score above the AI-generated threshold.

Does it work on phone calls, voice messages and socially shared audio?

Yes. Detection is robust to re-encoding, downsampling, narrowband phone audio and standard social-platform recompression. Confidence may drop somewhat on heavily degraded clips, but the model is specifically developed to handle real-world redistribution artifacts.

Does it work across languages and accents?

Yes. The model was trained on speech spanning many languages, accents and speaking styles, and is designed to generalize beyond a single language. Performance is best on the most widely used languages, with ongoing improvements for additional locales.

What does the score mean?

It is the model's confidence, from 0 to 1, that the analyzed audio was produced by a generative AI voice model. Higher means more likely AI-generated. Scores above 0.5 typically indicate AI-generated speech. For high-precision workflows such as identity verification, KYC or fraud detection, use a higher threshold such as 0.8.
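
The two cutoffs mentioned in this answer can be kept in one place and selected per workflow. The sketch below is illustrative only: the workflow labels "general" and "high_precision" and the helper name `is_flagged` are assumptions, while 0.5 and 0.8 are the values suggested above.

```python
# Cutoffs taken from the answer above; the workflow names are illustrative.
THRESHOLDS = {"general": 0.5, "high_precision": 0.8}


def is_flagged(score: float, workflow: str = "general") -> bool:
    """Return True when the score exceeds the cutoff for the workflow."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    return score > THRESHOLDS[workflow]


print(is_flagged(0.65))
print(is_flagged(0.65, "high_precision"))
```

The same 0.65 score is flagged under the general cutoff but not under the stricter one, which is the trade-off a KYC or fraud workflow accepts to reduce false positives.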

How is AI-generated speech detection different from AI-generated music detection?

AI-generated speech detection targets synthetic voices and text-to-speech output such as ElevenLabs and OpenAI. AI-generated music detection targets fully generated music tracks such as Suno and Udio. The two models are complementary and can be combined in a single API call.

Is there an API for businesses?

Yes. The same detection technology is available via a REST API. Pass models=ai_speech to the audio check endpoint. Documentation, code examples in major languages and enterprise SLAs are available.