Local LLM for audio transcription on a Mac Mini M2 16GB host. Include Whisper vs. more capable general-purpose LLM options
Answer
For local audio transcription on a Mac Mini M2 with 16GB of unified memory, Whisper-based solutions (whisper.cpp, WhisperKit) are the best fit: they are purpose-built for automatic speech recognition (ASR), accelerated through Core ML and the Neural Engine, and memory-efficient. General-purpose LLMs (Llama 3B via MLX, models served by Ollama) can supplement transcription with post-processing cleanup but are not competitive as primary transcription engines. The recommended stack is whisper.cpp with the large-v3-turbo model quantized to Q5_0 (on the order of 0.5–1GB) for transcription, optionally paired with a small LLM (1–3B) via MLX or Ollama for cleanup tasks.
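As a rough sketch of the transcription stage in this stack, the Python snippet below builds the two shell commands involved: an ffmpeg conversion to 16 kHz mono 16-bit WAV, followed by a whisper.cpp run that writes a plain-text transcript. It assumes `ffmpeg` and the whisper.cpp `whisper-cli` binary are on the PATH; the model path `models/ggml-large-v3-turbo-q5_0.bin` and the file names are hypothetical and should be adjusted to the local install.

```python
import shlex
import subprocess
from pathlib import Path


def ffmpeg_command(src: str, wav: str) -> list[str]:
    """Build an ffmpeg command that converts audio to 16 kHz mono 16-bit WAV."""
    return ["ffmpeg", "-y", "-i", src, "-ar", "16000", "-ac", "1",
            "-c:a", "pcm_s16le", wav]


def whisper_command(wav: str, model: str, out_base: str,
                    language: str = "auto") -> list[str]:
    """Build a whisper.cpp command that writes <out_base>.txt.

    -otxt requests plain-text output; -of sets the output path without
    its extension; -l auto asks whisper.cpp to detect the language.
    """
    return ["whisper-cli", "-m", model, "-f", wav,
            "-l", language, "-otxt", "-of", out_base]


def transcribe(src: str, model: str) -> str:
    """Run both steps and return the transcript (needs both tools installed)."""
    base = str(Path(src).with_suffix(""))
    wav = base + ".16k.wav"
    subprocess.run(ffmpeg_command(src, wav), check=True)
    subprocess.run(whisper_command(wav, model, base), check=True)
    return Path(base + ".txt").read_text(encoding="utf-8")


if __name__ == "__main__":
    # Hypothetical paths, for illustration only: print the commands
    # instead of running them so the sketch works without the tools.
    model = "models/ggml-large-v3-turbo-q5_0.bin"
    print(shlex.join(ffmpeg_command("meeting.m4a", "meeting.16k.wav")))
    print(shlex.join(whisper_command("meeting.16k.wav", model, "meeting")))
```

Calling `transcribe("meeting.m4a", model)` would run both steps for real. Keeping command construction separate from execution makes it easy to inspect or log the exact invocation first.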
Key Findings
- whisper.cpp (github.com/ggml-org/whisper.cpp): free, open-source, with Core ML and Neural Engine acceleration on Apple Silicon. Recommended model: large-v3-turbo at ~954MB (fast and accurate, 100+ languages) or large-v3 at ~3GB (maximum accuracy). Q5_0 quantization reduces model size by ~65% relative to fp16 weights with minimal accuracy loss. Transcription on M2 hardware runs at near-real-time speed.
- Ready-made Mac apps built on Whisper or Parakeet: (1) Local Whisper (github.com/y-dai20/local-whisper): free, open-source, captures mic and system audio for meetings; (2) Speak2 (github.com/zachswift615/speak2): free, open-source, push-to-talk dictation using WhisperKit or Parakeet v3 (~600MB, 25 languages), supports Ollama for LLM cleanup; (3) Transcribe Master (App Store, free, by Dawei Bi): polished GUI app, Whisper-powered, supports Mandarin, Cantonese, Japanese, and English; (4) getonit.ai Dictate: free app using Parakeet 0.6B plus Llama 3B via MLX, with <500ms latency without LLM cleanup and ~800ms with it.
- The role of general-purpose LLMs is supplementary, not primary: small LLMs (Llama 1B or 3B via MLX or Ollama) handle post-transcription cleanup, such as removing filler words and formatting numbers, emails, and currency, rather than core ASR. On a 16GB M2, a 3B model via MLX runs comfortably alongside a Whisper model. Larger models (7B–13B) would compete for unified memory and slow the pipeline. Parakeet v3 (NVIDIA/FluidAudio) is a competitive Whisper alternative for multilingual use at ~600MB.
- Memory fit on a 16GB M2: large-v3-turbo Q5_0 (~500MB active) plus Llama 3B via MLX (~2GB) totals about 2.5GB and leaves ample headroom. Full large-v3 in fp16 (estimated here at ~6GB in use, against ~3GB of weights) is feasible but leaves less room. Avoid running a 7B+ LLM simultaneously with large Whisper models on 16GB.
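The memory arithmetic behind these findings can be checked with a small back-of-the-envelope script. The model footprints are the approximate figures quoted above. The 4GB reserve for macOS and other apps is an assumption for illustration, not a measurement. The bits-per-weight values are the nominal ggml block sizes (5.5 for Q5_0, 4.5 for Q4_0). The estimate covers weights only and ignores KV cache and runtime buffers.

```python
TOTAL_GB = 16.0
SYSTEM_RESERVE_GB = 4.0  # assumption: macOS plus everyday apps

# Approximate footprints quoted in the findings (GB).
WHISPER_TURBO_Q5 = 0.5
WHISPER_LARGE_V3_FP16 = 6.0
LLAMA_3B_MLX = 2.0


def weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Rough weight size in decimal GB: parameters x bits per weight / 8."""
    return params_billions * bits_per_weight / 8


def headroom_gb(*model_gb: float) -> float:
    """Memory left after the assumed system reserve and the loaded models."""
    return TOTAL_GB - SYSTEM_RESERVE_GB - sum(model_gb)


def size_reduction(bits_from: float, bits_to: float) -> float:
    """Fractional size saved when moving between bit widths."""
    return 1 - bits_to / bits_from


if __name__ == "__main__":
    llm_7b_q4 = weights_gb(7, 4.5)  # rough size of a 7B model at Q4_0
    scenarios = {
        "turbo Q5_0 + Llama 3B": (WHISPER_TURBO_Q5, LLAMA_3B_MLX),
        "large-v3 fp16 + Llama 3B": (WHISPER_LARGE_V3_FP16, LLAMA_3B_MLX),
        "large-v3 fp16 + 7B Q4": (WHISPER_LARGE_V3_FP16, llm_7b_q4),
    }
    for name, models in scenarios.items():
        print(f"{name}: {headroom_gb(*models):.1f} GB headroom")
    print(f"Q5_0 vs fp16: {size_reduction(16, 5.5):.0%} smaller")
```

Under these assumptions the first scenario keeps most of the machine free, while the last leaves only a couple of gigabytes before context and buffers are counted, which is consistent with the advice to avoid that combination.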
Open Questions
- How does Parakeet v3 (FluidAudio) accuracy compare to Whisper large-v3-turbo on real-world audio with accents or background noise, specifically on M2 hardware?
- For batch/offline file transcription workflows (vs. real-time dictation), are there productivity gains from using a pipeline tool like the whisper.cpp CLI plus a local summarization LLM via Ollama versus an integrated app like Transcribe Master?
- Does running Ollama with a 7B model (e.g., Mistral 7B Q4) for higher-quality post-processing alongside Whisper large-v3-turbo cause memory pressure or swapping issues on a 16GB M2 in practice?
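The LLM post-processing step that these questions refer to can be sketched as a single request to a local Ollama server. This assumes Ollama is running on its default port (11434) and that a small model has already been pulled. The model tag `llama3.2:3b` and the prompt wording are illustrative choices, not something the sources specify.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint

# Hypothetical prompt; tune the instructions to the transcripts at hand.
CLEANUP_PROMPT = (
    "Clean up this speech transcript. Remove filler words, fix punctuation, "
    "and format numbers, emails, and currency. Do not add or drop content.\n\n"
    "Transcript:\n{transcript}"
)


def build_payload(transcript: str, model: str = "llama3.2:3b") -> dict:
    """Build a non-streaming /api/generate request body."""
    return {
        "model": model,
        "prompt": CLEANUP_PROMPT.format(transcript=transcript),
        "stream": False,
    }


def clean_transcript(transcript: str, model: str = "llama3.2:3b") -> str:
    """Send the transcript to Ollama and return the cleaned text."""
    body = json.dumps(build_payload(transcript, model)).encode("utf-8")
    req = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.loads(resp.read())["response"]


if __name__ == "__main__":
    # Print the request body only; no server is contacted here.
    sample = "um so the total is uh twenty dollars"
    print(json.dumps(build_payload(sample), indent=2))
```

Swapping the model tag for a 7B model while watching memory use in Activity Monitor is one way to probe the memory-pressure question empirically.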
Entities
whisper-cpp openai-whisper local-whisper whisperkit speak2 transcribe-master dawei-bi parakeet mlx-framework llama-3b ollama apple google-speech-to-text amazon-transcribe deepgram gladia getonit-ai
Concepts
local-offline-transcription apple-silicon-acceleration whisper-model-variants model-quantization llm-post-processing system-audio-capture cloud-vs-local-asr-trade-offs
Sources
- https://www.reddit.com/r/LocalLLaMA/comments/1kppr0t/whats_the_best_local_model_for_m2_32gb_macbook/
- https://apps.apple.com/us/app/transcribe-master-local-ai/id6754982174?mt=12
- https://www.gladia.io/blog/openai-whisper-vs-google-speech-to-text-vs-amazon-transcribe
- https://slator.com/resources/should-i-use-whisper-or-amazon-transcribe/
- https://www.getvoibe.com/resources/best-local-whisper-model-superwhisper/
- https://www.facebook.com/groups/1577315533418837/posts/1649651932851863/
- https://www.reddit.com/r/ollama/comments/1fd3bg6/best_model_for_transcription_with_ollama/
- https://www.facebook.com/groups/seocalifornia/posts/901757971790867/
- https://www.reddit.com/r/LLMDevs/comments/1f7h0g3/sep_2024_speechtotext_api_with_highest_accuracy/
- https://www.reddit.com/r/LocalLLaMA/comments/1gnce9t/voice_transcription_tools_preferably_with/
- https://www.reddit.com/r/LocalLLaMA/comments/1g2vhy3/creating_very_highquality_transcripts_with/
- https://www.reddit.com/r/LocalLLaMA/comments/1c8oj8h/what_about_real_time_voice_conversations_with/