Local LLM for audio transcription on a Mac Mini M2 16GB host. Include Whisper vs. more capable general-purpose LLM options

Answer

For local audio transcription on a Mac Mini M2 16GB, Whisper-based solutions (whisper.cpp, WhisperKit) are the clear best fit: they are purpose-built for ASR, accelerated via CoreML and the Neural Engine, and highly memory-efficient. General-purpose LLMs (Llama 3B via MLX, Ollama models) can supplement transcription with post-processing cleanup but are not competitive as primary transcription engines. The recommended stack is whisper.cpp with the large-v3-turbo model (~954MB, or roughly 500MB with Q5_0 quantization) for transcription, optionally paired with a small LLM (1–3B) via MLX or Ollama for cleanup tasks.
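
The two-stage stack described above (whisper.cpp for transcription, a small LLM for cleanup) could be wired together roughly as follows. This is a minimal sketch, not a reference implementation. It assumes a `whisper-cli` binary from a whisper.cpp build is on the PATH, that a Q5_0 model file sits at the hypothetical path `models/ggml-large-v3-turbo-q5_0.bin`, that the input is a 16 kHz WAV file, and that Ollama is serving its default local endpoint with a model tagged `llama3.2:3b`. The function names are illustrative.

```python
import json
import subprocess
import urllib.request

# Assumed names and locations; adjust to your own build and model files.
WHISPER_BIN = "whisper-cli"                             # whisper.cpp CLI (assumed on PATH)
WHISPER_MODEL = "models/ggml-large-v3-turbo-q5_0.bin"   # hypothetical model path
OLLAMA_URL = "http://localhost:11434/api/generate"      # Ollama's default local endpoint
CLEANUP_MODEL = "llama3.2:3b"                           # assumed Ollama model tag


def build_whisper_command(wav_path, out_prefix, language="auto"):
    """Return the argv list that transcribes wav_path into <out_prefix>.txt."""
    return [
        WHISPER_BIN,
        "-m", WHISPER_MODEL,     # model file
        "-f", str(wav_path),     # input audio
        "-l", language,          # spoken language, or "auto" to detect
        "-otxt",                 # write a plain-text transcript
        "-of", str(out_prefix),  # output path without extension
    ]


def build_cleanup_request(transcript, model=CLEANUP_MODEL):
    """Return the JSON payload asking a small LLM to tidy a raw transcript."""
    prompt = (
        "Clean up this transcript. Remove filler words and format numbers, "
        "emails, and currency. Do not add or remove content.\n\n" + transcript
    )
    return {"model": model, "prompt": prompt, "stream": False}


def transcribe(wav_path, out_prefix):
    """Run whisper.cpp and return the transcript text it wrote to disk."""
    subprocess.run(build_whisper_command(wav_path, out_prefix), check=True)
    with open(str(out_prefix) + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()


def cleanup(transcript):
    """Send the transcript to the local Ollama server and return the cleaned text."""
    body = json.dumps(build_cleanup_request(transcript)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"].strip()
```

With both tools installed, `cleanup(transcribe("meeting.wav", "meeting"))` would return the tidied transcript. Keeping the command and payload builders separate from the functions that execute them makes the pipeline easy to inspect or swap out, for example replacing Ollama with an MLX-based runner.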

Key Findings

  • whisper.cpp (github.com/ggml-org/whisper.cpp): Free, open-source, CoreML + Neural Engine accelerated. Recommended model: large-v3-turbo at ~954MB (fast + accurate, 100+ languages) or large-v3 at ~3GB (max accuracy). Q5_0 quantization reduces model size ~65% with minimal accuracy loss. Achieves near-real-time transcription on M2 hardware.

  • Ready-made Mac apps (Whisper- or Parakeet-based):
      (1) Local Whisper (github.com/y-dai20/local-whisper): free, open-source, captures mic + system audio for meetings.
      (2) Speak2 (github.com/zachswift615/speak2): free, open-source, push-to-talk dictation using WhisperKit or Parakeet v3 (~600MB, 25 languages); supports Ollama for LLM cleanup.
      (3) Transcribe Master (App Store, free, by Dawei Bi): polished GUI app, Whisper-powered, supports Mandarin/Cantonese/Japanese/English.
      (4) getonit.ai Dictate: free app using Parakeet 0.6B + Llama 3B via MLX; <500ms latency without LLM cleanup, ~800ms with it.

  • General-purpose LLM role is supplementary, not primary: Small LLMs (Llama 1B or 3B via MLX or Ollama) are used for post-transcription cleanup, such as removing filler words and formatting numbers, emails, and currency, not for core ASR. On a 16GB M2, a 3B model via MLX runs comfortably alongside a Whisper model. Larger models (7B–13B) would compete for unified memory and slow the pipeline. Separately, Parakeet v3 (NVIDIA/FluidAudio) is a competitive Whisper alternative for multilingual ASR at ~600MB.

  • Memory fit on 16GB M2: large-v3-turbo Q5_0 (~500MB active) + Llama 3B MLX (~2GB) totals roughly 2.5GB and leaves ample headroom. Full large-v3 fp16 (~6GB) is feasible but brings the combined footprint with the 3B LLM to about 8GB, half of the machine's unified memory. Avoid running a 7B+ LLM simultaneously with large Whisper models on 16GB.
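
The memory-fit reasoning above is simple addition, which the following sketch makes explicit. The model footprints are the approximate figures quoted in the findings. The 4GB reserve for macOS and other apps is an assumption introduced here for illustration, not a measured value, and the dictionary keys are made-up labels.

```python
TOTAL_GB = 16.0
OS_RESERVE_GB = 4.0  # ASSUMPTION: rough allowance for macOS and other apps

# Approximate footprints quoted in the findings (GB).
FOOTPRINT_GB = {
    "whisper-large-v3-turbo-q5_0": 0.5,
    "whisper-large-v3-fp16": 6.0,
    "llama-3b-mlx": 2.0,
}


def headroom_gb(components, total_gb=TOTAL_GB, reserve_gb=OS_RESERVE_GB):
    """Return unified memory left after the reserve and the listed models."""
    used = sum(FOOTPRINT_GB[name] for name in components)
    return total_gb - reserve_gb - used


STACKS = {
    "turbo Q5_0 + 3B LLM": ["whisper-large-v3-turbo-q5_0", "llama-3b-mlx"],
    "large-v3 fp16 + 3B LLM": ["whisper-large-v3-fp16", "llama-3b-mlx"],
}

for label, stack in STACKS.items():
    print(f"{label}: {headroom_gb(stack):.1f} GB headroom")
# turbo Q5_0 + 3B LLM: 9.5 GB headroom
# large-v3 fp16 + 3B LLM: 4.0 GB headroom
```

Under these assumptions the recommended stack leaves more than twice the headroom of the full fp16 model, which is why adding a 7B+ LLM on top of a large Whisper model is where the budget gets tight.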

Open Questions

  • How does Parakeet v3 (FluidAudio) accuracy compare to Whisper large-v3-turbo on real-world audio with accents or background noise on M2 hardware specifically?

  • For batch/offline file transcription workflows (vs. real-time dictation), are there productivity gains from using a pipeline tool like whisper.cpp CLI + a local summarization LLM via Ollama versus an integrated app like Transcribe Master?

  • Does running Ollama with a 7B model (e.g., Mistral 7B Q4) for higher-quality post-processing alongside Whisper large-v3-turbo cause memory pressure or swapping issues on 16GB M2 in practice?
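
One way to investigate the memory-pressure question empirically is to sample macOS swap usage before and after a transcription-plus-cleanup run. The sketch below is a rough starting point. It assumes the `sysctl vm.swapusage` output has the form `total = …M  used = …M  free = …M`, and the helper names are hypothetical. Swap growth is only a coarse proxy for memory pressure.

```python
import re
import subprocess

# Matches the "used = 1024.25M" field of `sysctl vm.swapusage` (assumed format).
_SWAP_USED = re.compile(r"used = ([\d.]+)([KMG])")
_TO_MB = {"K": 1 / 1024, "M": 1.0, "G": 1024.0}


def parse_swap_used_mb(sysctl_line):
    """Extract the used-swap figure, in MB, from a vm.swapusage line."""
    match = _SWAP_USED.search(sysctl_line)
    if match is None:
        raise ValueError(f"unrecognized vm.swapusage output: {sysctl_line!r}")
    value, unit = match.groups()
    return float(value) * _TO_MB[unit]


def read_swap_used_mb():
    """Query the current swap usage on macOS via sysctl."""
    result = subprocess.run(
        ["sysctl", "vm.swapusage"], capture_output=True, text=True, check=True
    )
    return parse_swap_used_mb(result.stdout)


def swap_growth_mb(before_mb, after_mb):
    """Return how much swap grew during a run, never negative."""
    return max(0.0, after_mb - before_mb)
```

A typical use would be to call `read_swap_used_mb()` once before loading the models and once after a long transcription job, then compare the two with `swap_growth_mb`. Sustained growth would suggest the 7B model and Whisper are not fitting in unified memory together.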

Entities

whisper-cpp openai-whisper local-whisper whisperkit speak2 transcribe-master dawei-bi parakeet mlx-framework llama-3b ollama apple google-speech-to-text amazon-transcribe deepgram gladia getonit-ai

Concepts

local-offline-transcription apple-silicon-acceleration whisper-model-variants model-quantization llm-post-processing system-audio-capture cloud-vs-local-asr-trade-offs

Sources