
ElevenLabs offers Scribe, an AI speech-to-text model that converts audio and video recordings into structured text transcripts in 99 languages. Its features include speaker labeling, word-level timestamps, and audio-event tagging. Scribe launched in December 2024, with Scribe v2 following in January 2026.

Details

Scribe takes audio or video files as input and produces structured JSON transcripts as output. Each transcript includes speaker diarization (identifying who said what), word- and character-level timestamps, and tags for non-speech audio events such as laughter or applause. A real-time version, Scribe v2 Realtime, processes live speech with approximately 150 milliseconds of latency and is designed for conversational AI agents and meeting assistants. Scribe is available through the ElevenLabs web dashboard and the API.
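
To make the structured output more concrete, here is a minimal Python sketch that groups a transcript into speaker turns. It does not call the ElevenLabs API. The sample data and its field names (`words`, `type`, `speaker_id`, `start`, `end`) are assumptions chosen to mirror the features described above, and the helper `speaker_turns` is hypothetical, so check the actual response schema before relying on it.

```python
from typing import Dict, List

# Hypothetical transcript shaped like the structured JSON described above.
# All field names here are assumptions, not a confirmed Scribe schema.
SAMPLE_TRANSCRIPT = {
    "language_code": "en",
    "words": [
        {"text": "Hello", "start": 0.00, "end": 0.40, "type": "word", "speaker_id": "speaker_0"},
        {"text": " ", "start": 0.40, "end": 0.45, "type": "spacing", "speaker_id": "speaker_0"},
        {"text": "there", "start": 0.45, "end": 0.80, "type": "word", "speaker_id": "speaker_0"},
        {"text": "(laughter)", "start": 0.90, "end": 1.50, "type": "audio_event", "speaker_id": "speaker_1"},
        {"text": "Hi", "start": 1.60, "end": 1.85, "type": "word", "speaker_id": "speaker_1"},
    ],
}


def speaker_turns(transcript: Dict) -> List[Dict]:
    """Merge consecutive tokens from the same speaker into turns."""
    turns: List[Dict] = []
    for token in transcript.get("words", []):
        speaker = token.get("speaker_id", "unknown")
        # Start a new turn whenever the speaker label changes.
        if not turns or turns[-1]["speaker"] != speaker:
            turns.append({
                "speaker": speaker,
                "start": token["start"],
                "end": token["end"],
                "text": "",
                "events": [],
            })
        turn = turns[-1]
        turn["end"] = token["end"]  # extend the turn to cover this token
        if token.get("type") == "audio_event":
            # Keep non-speech events separate from the spoken text.
            turn["events"].append(token["text"])
        else:
            turn["text"] += token["text"]
    return turns


if __name__ == "__main__":
    for turn in speaker_turns(SAMPLE_TRANSCRIPT):
        span = f'{turn["start"]:.2f}-{turn["end"]:.2f}'
        print(f'[{span}] {turn["speaker"]}: {turn["text"].strip()} {turn["events"]}')
```

Keeping audio events in their own list, rather than inlining them in the text, makes it easy to render or filter them separately, for example when producing subtitles that omit laughter or applause.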