microsoft/VibeVoice
Microsoft released VibeVoice, an MIT-licensed speech-to-text model with built-in speaker diarization. In a test on a MacBook Pro, it transcribed one hour of audio in about 9 minutes (roughly 6.7x faster than real time), with memory use peaking at 61.5 GB of RAM. The model outputs JSON containing text, timestamps, and speaker IDs, but each run is limited to one hour of audio.
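To make the output format and the one-hour limit concrete, here is a minimal Python sketch of how such a transcript might be post-processed. The key names (`start`, `end`, `speaker`, `text`), the sample data, and the helpers `merge_turns` and `chunk_bounds` are all assumptions made for illustration. They are not taken from the VibeVoice documentation, so check the model's actual schema before relying on them.

```python
import json

# Hypothetical sample output. The key names below are assumptions for
# illustration only, not the model's documented schema.
SAMPLE = """
[
  {"start": 0.0, "end": 4.2, "speaker": 0, "text": "Welcome back."},
  {"start": 4.2, "end": 9.8, "speaker": 0, "text": "Today we talk about speech models."},
  {"start": 9.8, "end": 12.5, "speaker": 1, "text": "Thanks for having me."}
]
"""


def merge_turns(segments):
    """Join consecutive segments from the same speaker into one turn."""
    turns = []
    for seg in segments:
        if turns and turns[-1]["speaker"] == seg["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = seg["end"]
            turns[-1]["text"] += " " + seg["text"]
        else:
            # New speaker: start a new turn (copy so the input is not mutated).
            turns.append(dict(seg))
    return turns


def chunk_bounds(total_seconds, max_seconds=3600):
    """Split a recording into (start, end) windows of at most max_seconds."""
    return [
        (start, min(start + max_seconds, total_seconds))
        for start in range(0, total_seconds, max_seconds)
    ]


turns = merge_turns(json.loads(SAMPLE))
for t in turns:
    print(f"[{t['start']:.1f}-{t['end']:.1f}] speaker {t['speaker']}: {t['text']}")
```

For a recording longer than an hour, `chunk_bounds(9000)` yields three windows that could each be transcribed in a separate run. Speaker IDs from separate runs would likely need to be reconciled afterwards, since nothing in the summary suggests they carry over between runs.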