Microsoft's VibeVoice ASR runs locally on Mac, transcribes an hour in under 9 minutes
Microsoft quietly released VibeVoice in January 2026, an MIT-licensed speech-to-text model in the Whisper lineage with speaker diarization baked into the model itself rather than bolted on as a post-processing step. Simon Willison ran the 4-bit MLX conversion (5.71GB, down from 17.3GB) against a one-hour podcast on a 128GB M5 Max MacBook using mlx-audio, completing transcription in 8 minutes 45 seconds with peak observed RAM of around 61.5GB during prefill.
Output is structured JSON with per-utterance timestamps and speaker IDs, which makes downstream processing straightforward: Willison loaded the result directly into Datasette Lite. Diarization is sensitive enough that it tagged the host’s intro/sponsor-read voice as a distinct speaker from his conversational voice. The default 8192-token output cap covers only about 25 minutes of audio, so longer runs need the max-tokens limit raised explicitly.
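As a rough illustration of both points, the Python sketch below scales the token cap from the reported figures (8192 tokens for about 25 minutes) and totals speaking time per speaker from a transcript. Linear scaling of tokens with duration is an assumption, not something the source states, and the field names `start`, `end`, `speaker`, and `text` are hypothetical; check them against the JSON the model actually emits.

```python
import json
import math
from collections import defaultdict

# Figures reported above: the default 8192-token cap covers roughly 25 minutes.
DEFAULT_CAP_TOKENS = 8192
DEFAULT_CAP_MINUTES = 25


def estimate_max_tokens(audio_minutes, headroom=1.2):
    """Estimate an output-token cap, ASSUMING token use grows linearly with duration."""
    tokens_per_minute = DEFAULT_CAP_TOKENS / DEFAULT_CAP_MINUTES
    return math.ceil(audio_minutes * tokens_per_minute * headroom)


def talk_time_by_speaker(utterances):
    """Sum utterance durations in seconds per speaker ID."""
    totals = defaultdict(float)
    for u in utterances:
        totals[u["speaker"]] += u["end"] - u["start"]
    return dict(totals)


# Hypothetical sample in an assumed schema; real field names may differ.
sample = json.loads("""
[
  {"start": 0.0, "end": 12.5, "speaker": 0, "text": "Welcome to the show."},
  {"start": 12.5, "end": 20.0, "speaker": 1, "text": "Thanks for having me."},
  {"start": 20.0, "end": 31.0, "speaker": 0, "text": "Let's get started."}
]
""")

if __name__ == "__main__":
    print(estimate_max_tokens(60))       # suggested cap for a one-hour file
    print(talk_time_by_speaker(sample))  # seconds of speech per speaker ID
```

The 20% headroom is an arbitrary safety margin: dense, fast speech will likely consume tokens faster than the 25-minute average suggests.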
The practical ceiling is one hour of audio per invocation. Anything longer requires manual chunking with overlap and speaker-ID reconciliation across segments. That is a meaningful gap for podcast-scale workloads, but the model runs locally and is permissively licensed, which makes it a strong open alternative to hosted ASR APIs.
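One way the chunking workaround might look, sketched in Python under stated assumptions: audio is split into windows of at most one hour that overlap by a fixed margin, each window is transcribed separately, timestamps are shifted to absolute time, and speaker IDs in each new chunk are matched to the previous chunk by how much speaking time they share inside the overlap. The function names, the 60-second overlap, and the `start`/`end`/`speaker` fields are all hypothetical; this is not Willison's method or part of mlx-audio.

```python
from collections import defaultdict


def chunk_windows(total_s, chunk_s=3600.0, overlap_s=60.0):
    """Return (start, end) windows covering [0, total_s], neighbours overlapping by overlap_s."""
    if overlap_s >= chunk_s:
        raise ValueError("overlap must be shorter than the chunk length")
    total_s = float(total_s)
    windows, start = [], 0.0
    while True:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:
            return windows
        start = end - overlap_s


def map_speakers(prev_utts, next_utts):
    """Map speaker IDs in next_utts to IDs in prev_utts by shared speaking time.

    Both lists must use absolute timestamps, so utterances from the overlap
    region coincide in time. Each next-chunk speaker maps to the previous-chunk
    speaker it overlaps most; speakers with no overlap are left unmapped.
    The matching is greedy, so two new IDs can map to the same old ID.
    """
    shared = defaultdict(float)
    for p in prev_utts:
        for n in next_utts:
            overlap = min(p["end"], n["end"]) - max(p["start"], n["start"])
            if overlap > 0:
                shared[(n["speaker"], p["speaker"])] += overlap
    mapping, best = {}, {}
    for (n_id, p_id), secs in shared.items():
        if secs > best.get(n_id, 0.0):
            best[n_id] = secs
            mapping[n_id] = p_id
    return mapping


if __name__ == "__main__":
    # A 2.5-hour recording split into one-hour windows with 60 s of overlap.
    print(chunk_windows(9000))

    # Hypothetical utterances around the first chunk boundary (absolute seconds).
    prev_tail = [
        {"start": 3540.0, "end": 3570.0, "speaker": 0},
        {"start": 3570.0, "end": 3600.0, "speaker": 1},
    ]
    next_head = [
        {"start": 3541.0, "end": 3569.0, "speaker": 1},
        {"start": 3571.0, "end": 3599.0, "speaker": 0},
        {"start": 3700.0, "end": 3710.0, "speaker": 2},
    ]
    print(map_speakers(prev_tail, next_head))
```

A speaker who first appears after the overlap region (ID 2 in the sample) gets no mapping and would need a fresh global ID. Matching on shared time only works if both chunks transcribe the overlap similarly, so a longer overlap trades extra compute for more reliable matches.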