Audio authenticity API

AI Voice Detector — detect AI-generated speech across every generator

Get a clear verdict on any voice recording, not just a score. Detects 55 voice generators including ElevenLabs, OpenAI, PlayHT and Resemble, with no watermark or metadata needed.

No watermark or metadata needed

55 voice generators & 48 languages

Robust to phone & compressed audio

Sample analysis

voice-clone.mp3 (0:14): 96%, likely AI-generated

Synthetic voice detected. High confidence this recording was produced by a voice model.

Per-generator confidence: ElevenLabs 96%; PlayHT and OpenAI are also scored.

Example response — ai_speech score with per-generator breakdown

55 voice generators detected

97.51% global accuracy

48 languages covered

7 audio formats supported

Built for fraud prevention, identity verification, content moderation...

Voice clone scams

Block voice-clone scams and impersonation in call centers, customer support and over the phone.

Voice KYC

Strengthen voice-based KYC and identity verification against synthetic speech and audio deepfakes.

Audio misinformation

Flag AI-generated voice messages, podcasts and synthetic political speech to limit the spread of audio misinformation.

How it works

From clip to verdict in three steps

Drop in any recording. The model analyzes the raw waveform and returns a calibrated score plus a per-generator breakdown — no metadata, no watermark, no setup.

01 — SUBMIT

Drop in audio

Upload a file or POST a URL. OGG, OPUS, FLAC, WAV, MP3, M4A and WEBM are all supported.

7 audio formats
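
A client can reject unsupported files before uploading them. The following is a minimal sketch, assuming the format can be judged from the file extension alone; the helper name `is_supported_audio` is hypothetical and not part of the Sightengine API.

```python
from pathlib import Path

# The seven formats listed above, written as lowercase file extensions.
SUPPORTED_EXTENSIONS = {".ogg", ".opus", ".flac", ".wav", ".mp3", ".m4a", ".webm"}


def is_supported_audio(filename: str) -> bool:
    """Return True if the file extension matches a supported audio format."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported_audio("voice-clone.mp3"))  # True
print(is_supported_audio("notes.txt"))        # False
```

An extension check only filters obvious mistakes; the API remains the authority on whether a file can be decoded.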

02 — ANALYZE

Waveform analysis

A model trained on millions of real and synthetic samples inspects acoustic artifacts the human ear cannot perceive.

Pure acoustic signal

03 — VERDICT

Citable result

Get a 0–1 confidence score, a clear verdict and per-generator attribution — returned in a single JSON response.

Score + attribution

Per-generator detection

One verdict. Every generator.

Single-vendor classifiers only recognize their own voices. Sightengine returns a global AI probability plus a confidence score for each generator — a complete fingerprint of any recording, whoever made it.

Voice cloning

ElevenLabs, Resemble AI, PlayHT, Descript, Fish Audio

Text-to-speech

OpenAI, Azure Neural TTS, Google WaveNet, Amazon Polly, Murf, WellSaid

Open-source

Coqui XTTS, Bark, Tortoise, F5-TTS, GPT-SoVITS

Emerging

VALL-E, CosyVoice, MiniMax, Qwen-TTS, SeamlessM4T and many more

Detection is purely waveform-based, so it holds up even when metadata has been stripped, no watermark exists, or the clip has been re-encoded by a phone line or social platform.

Fewer false positives

AI-processed isn't AI-generated

A real person on a call with noise suppression. A podcast cleaned by a studio tool. A low-bandwidth stream rebuilt by a neural codec. Single-score detectors see the AI fingerprint and cry "fake", blocking real users. Sightengine scores generation and processing as two independent signals, so AI cleanup is not mistaken for a synthetic voice.

support-call.wav (0:21)

AI-generated (ai_speech): 0.04

AI-processed (ai_processed): 0.88

Genuine human voice, passed through an AI pipeline (denoised), not synthetic. Don't block this user.

Real audio is AI-touched everywhere

Videoconference noise suppression

Zoom, Teams, Meet and Krisp-style tools re-render a real voice in real time to kill background noise.

Neural codecs

Low-bandwidth calls and streaming re-synthesize the waveform from a compressed representation — modern compression, not a fake.

AI audio enhancement

Podcast and voice-cleanup tools that turn a phone recording into "studio" audio.

Super-resolution & upscaling

Restoring or upsampling old or low-quality recordings of a real person.

A real human spoke every one of these. A single-score detector flags them all.

AI-generated (ai_speech)	AI-processed (ai_processed)	Honest verdict
Low	Low	Genuine — clean recording
Low	High	Genuine voice, AI-processed: a real person (other detectors call this fake)
High	—	Synthetic — AI-generated voice

In our benchmarks, even the strongest detectors on the market flag the middle row as AI-generated, with no way to tell a denoised real caller from a synthetic one. Sightengine separates the two signals, so AI cleanup reads as processed, while genuine fakes still fire on ai_speech.
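
The three rows of the table can be expressed as a small decision function. This is a sketch, not Sightengine's own logic: the table only says "Low" and "High", so the 0.5 cut-off and the function name `honest_verdict` are assumptions to tune for your own policy.

```python
def honest_verdict(ai_speech: float, ai_processed: float, threshold: float = 0.5) -> str:
    """Map the two independent scores to the verdicts in the table above.

    The threshold is an assumed cut-off between "Low" and "High".
    """
    if ai_speech >= threshold:
        # High ai_speech: synthetic, whatever the processing score says.
        return "synthetic"
    if ai_processed >= threshold:
        # Low ai_speech, high ai_processed: a real voice run through AI cleanup.
        return "genuine_ai_processed"
    return "genuine"


# Scores from the support-call.wav example.
print(honest_verdict(0.04, 0.88))  # genuine_ai_processed
```

Checking `ai_speech` first mirrors the last table row, where the processing score is ignored once generation is detected.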

Stop blocking real users

Denoised, compressed or enhanced real audio no longer trips a false alarm — the #1 reason detection APIs get pulled from production.

Trust the "fake" verdict

When ai_speech fires it means something, because benign processing isn't polluting the score.

Your policy, your thresholds

Two independent scores. Treat processing as informational or gate on it — and label AI-touched vs AI-generated for C2PA and the EU AI Act.

Why Sightengine

Built for breadth, scale and the real world

Most detectors only catch their own model, break on compressed audio, or never scale past a demo. Here is how the approaches compare.

Capability	Single-vendor classifier	Generic AI detector	Sightengine
Detects all major generators	✗ Own model only	~ Partial	✓ 55
Per-generator attribution	✗	✗	✓
Tells AI-processed from AI-generated	✗	✗ Flags both as AI	✓ Two scores
Works without watermark / metadata	~	✓	✓
Robust to phone / compressed audio	✗	~	✓
Multilingual coverage	~	~	✓
Production REST API at scale	~	✗	✓ Billions/mo

See the full list of supported generators

Generator	Creator	Example versions detected
Amazon Polly	Amazon	Neural, Generative voices...
Azure Neural TTS	Microsoft	Neural TTS, VALL-E, VALL-E 2...
Bark	Suno	Bark, Bark Small...
Chatterbox / Resemble	Resemble AI	Chatterbox, Resemble v2...
Coqui	Coqui	XTTS, XTTS v2...
CosyVoice	Alibaba	CosyVoice, CosyVoice 2...
ElevenLabs	ElevenLabs	Multilingual v2, Turbo v2, Flash...
F5-TTS	SWivid	F5-TTS, F5-TTS v1...
Fish Audio	Fish Audio	Fish Pro S1, Fish Pro S2...
Google TTS	Google	WaveNet, Chirp, Chirp 3...
GPT-SoVITS	RVC-Boss	GPT-SoVITS v1, v2, v3...
MiniMax	MiniMax	Speech-01, Speech-02...
OpenAI	OpenAI	TTS-1, TTS-1-HD, Voice Engine, GPT-4o audio...
Qwen	Alibaba	Qwen3, Qwen3-TTS...
SeamlessM4T	Meta	SeamlessM4T, SeamlessM4T v2...
Tortoise	Neonbjb	Tortoise TTS...
Other generators	Various	Camb AI, Cartesia, Descript, Hume AI, Inworld, Kits AI, Kokoro, LMNT, Murf, Parler TTS, PlayHT, Replica Studios, Speechify, WellSaid Labs...

And more — new generators are added continuously as they appear in the wild.

The ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf and other trademarks and logos are the property of their respective trademark holders. They are not affiliated with Sightengine.

Developer API

One request. A structured verdict.

A single REST call returns the AI-speech score for any recording. Designed by developers, for developers — quick to integrate and built to scale.

request — cURL

curl -X POST 'https://api.sightengine.com/1.0/audio/check.json' \
     -F 'media=@/path/to/speech.mp3' \
     -F 'models=ai_speech' \
     -F 'api_user={API_USER}' \
     -F 'api_secret={API_SECRET}'

response — 200 OK

{
  "status": "success",
  "request": {
    "id": "req_0zrbHDeitGYY7wEG",
    "operations": 15
  },
  "type": {
    "ai_speech": 0.98
  },
  "media": { "uri": "speech.mp3" }
}
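
The same call can be sketched in Python. The endpoint and form fields are copied from the cURL request above; the function names are hypothetical, the third-party `requests` package is assumed to be installed for the upload step, and the `status` check assumes that failed calls report a value other than "success".

```python
import json

API_URL = "https://api.sightengine.com/1.0/audio/check.json"


def build_fields(api_user: str, api_secret: str, models=("ai_speech",)) -> dict:
    """Build the form fields used in the cURL request above."""
    return {"models": ",".join(models), "api_user": api_user, "api_secret": api_secret}


def check_audio(path: str, api_user: str, api_secret: str) -> dict:
    """Upload a local file and return the decoded JSON response."""
    import requests  # third-party; imported here so the other helpers run without it

    with open(path, "rb") as media:
        response = requests.post(
            API_URL,
            files={"media": media},
            data=build_fields(api_user, api_secret),
            timeout=30,
        )
    response.raise_for_status()
    return response.json()


def ai_speech_score(payload: dict) -> float:
    """Extract the ai_speech score from a successful response."""
    if payload.get("status") != "success":
        raise ValueError(f"request failed: {payload}")
    return payload["type"]["ai_speech"]


# The sample response shown above, written as strict JSON.
sample = json.loads("""
{
  "status": "success",
  "request": {"id": "req_0zrbHDeitGYY7wEG", "operations": 15},
  "type": {"ai_speech": 0.98},
  "media": {"uri": "speech.mp3"}
}
""")
print(ai_speech_score(sample))  # 0.98
```

Passing `models=("ai_speech", "ai_music")` to `build_fields` produces the combined `models=ai_speech,ai_music` value.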

Scale to the sky

From a handful to billions of items per month, with no change to your integration.

Absolute privacy

No human reviewers in the loop. Your audio stays private — just as your users expect.

One call, many models

Combine with ai_music and other models in a single request: models=ai_speech,ai_music.

Super-human accuracy

Your ear can't tell. The model can.

Modern text-to-speech and voice-cloning models now produce voices that are virtually indistinguishable to the human ear. In blind listening tests, people score little better than a coin flip.

Our model inspects subtle, sub-perceptual acoustic artifacts left behind by generative pipelines — and stays reliable on exactly the kind of compressed, real-world audio that trips up the human ear.

Take the "AI or not" test

Sample A

Real ✓

Human listeners: 51% | Model: Authentic

Sample B

AI ✓

Human listeners: 48% | Model: Synthetic

Two clips that sound identical to most listeners. The model is confident on both.

Integrated suite

Part of a complete content-analysis platform

Detect synthetic media across every modality and moderate user-generated content from a single API.

FAQ

Frequently asked questions

Which AI speech generators can it detect?

The model targets 55 voice generators, including ElevenLabs, OpenAI, PlayHT, Resemble AI, Murf, WellSaid Labs, Microsoft Azure Neural TTS and VALL-E, Google WaveNet and Chirp, Amazon Polly, Descript, LMNT, Cartesia, Hume AI, MiniMax, Fish Audio, Coqui, Tortoise and Bark, along with smaller and emerging generators. The model is updated on an ongoing basis as new generators become available.

How does detection work without metadata or a watermark?

Detection is purely waveform-based. The model only analyzes the acoustic content of the audio. Metadata and inaudible watermarks are ignored, so stripping or altering them has no effect on the result, which makes the model robust to common evasion techniques.

Will real recordings with noise, compression or post-processing be flagged as AI-generated?

No. The model is designed to distinguish synthetic speech from real recordings that have been compressed, denoised, equalized or transmitted over a phone line. Standard post-processing does not push the score above the AI-generated threshold.

Does it work on phone calls, voice messages and socially shared audio?

Yes. Detection is robust to re-encoding, downsampling, narrowband phone audio and standard social-platform recompression. Confidence may drop somewhat on heavily degraded clips, but the model is specifically developed to handle real-world redistribution artifacts.

Does it work across languages and accents?

Yes. The model was trained on speech spanning many languages, accents and speaking styles, and is designed to generalize beyond a single language. Performance is best on the most widely used languages, with ongoing improvements for additional locales.

What does the score mean?

It is the model's confidence, from 0 to 1, that the analyzed audio was produced by a generative AI voice model. Higher means more likely AI-generated. Scores above 0.5 typically indicate AI-generated speech. For high-precision workflows such as identity verification, KYC or fraud detection, use a higher threshold such as 0.8.
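
The two thresholds mentioned above can be applied per workflow. This is a minimal sketch: the 0.5 and 0.8 values come from the answer above, while the workflow labels and the function name `is_flagged` are illustrative assumptions.

```python
# Thresholds taken from the answer above; the workflow names are illustrative.
DEFAULT_THRESHOLD = 0.5
HIGH_PRECISION_THRESHOLD = 0.8
HIGH_PRECISION_WORKFLOWS = {"identity_verification", "kyc", "fraud_detection"}


def is_flagged(score: float, workflow: str = "general") -> bool:
    """Return True when the score exceeds the threshold for the workflow."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must be between 0 and 1")
    if workflow in HIGH_PRECISION_WORKFLOWS:
        threshold = HIGH_PRECISION_THRESHOLD
    else:
        threshold = DEFAULT_THRESHOLD
    return score > threshold


print(is_flagged(0.65))         # True
print(is_flagged(0.65, "kyc"))  # False
```

A score of 0.65 is flagged in a general moderation flow but not in a KYC flow, where the higher threshold trades recall for precision.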

How is AI-generated speech detection different from AI-generated music detection?

AI-generated speech detection targets synthetic voices and text-to-speech output such as ElevenLabs and OpenAI. AI-generated music detection targets fully generated music tracks such as Suno and Udio. The two models are complementary and can be combined in a single API call.

Is there an API for businesses?

Yes. The same detection technology is available via a REST API. Pass models=ai_speech to the audio check endpoint. Documentation, code examples in major languages and enterprise SLAs are available.

Part of a complete content-analysis platform

AI Music Detection

AI Image & Deepfake Detection

AI Video Detection
