Successes and risks of running LLMs locally on a Mac Mini M2 16GB

Answer

The Mac Mini M2 with 16GB RAM is a viable but constrained platform for local LLMs. It runs 4-bit quantized GGUF models up to 7B parameters (using ~4-7GB RAM) with Metal GPU acceleration via llama.cpp or Ollama, at inference speeds reported as acceptable, although no M2 16GB-specific benchmark is available yet (see Open Questions). The primary risk is the tight memory envelope: running the OS, model, embeddings, and vector DB simultaneously can push against the 16GB ceiling, leaving little headroom for larger or multi-model setups.
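As a rough check on the sizes quoted above, the weights of a quantized model take about parameters × bits per weight ÷ 8 bytes. The sketch below is a back-of-the-envelope estimate only. The function name is hypothetical, and it covers the weights alone. The RAM figures in this note are higher, presumably because they also include the context cache and runtime buffers.

```python
def quantized_weights_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in decimal GB.

    params_billions * 1e9 weights, each bits_per_weight / 8 bytes,
    divided by 1e9 bytes per GB; the 1e9 factors cancel.
    """
    return params_billions * bits_per_weight / 8


# Weights-only lower bounds at exactly 4 bits per weight (an idealised
# assumption; real 4-bit GGUF variants use slightly more bits per weight).
mistral_7b_gb = quantized_weights_gb(7.0, 4.0)
phi3_mini_gb = quantized_weights_gb(3.8, 4.0)
```

These numbers are lower bounds, so actual memory use on the machine should be measured rather than inferred from them.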

Key Findings

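SUCCESSES: Recommended models that fit comfortably are Phi-3-mini-instruct 3.8B (~4GB GGUF 4-bit, MIT license, fastest), Gemma 7B Instruct (~6GB) and Mistral-7B-Instruct (~6GB, Apache 2.0). The source lists the Gemma model as Gemma-2-7B-Instruct under Apache 2.0, but Gemma 2 ships in 2B, 9B and 27B sizes, and Gemma weights are released under Google's Gemma Terms of Use. All three run with Metal acceleration via llama.cpp or Ollama. A full RAG stack (OpenWebUI + llama.cpp + embeddings + Qdrant/Chroma vector DB) is deployable via Docker Compose on arm64, though containers on macOS generally cannot reach the Metal GPU, so the inference engine is best run natively on the host.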
RISKS: 16GB is the hard ceiling. The OS typically consumes 3-4GB at baseline, leaving ~12GB for models, embeddings, vector DB and inference buffers. A 7B model (~6GB) plus an embedding model (1-2GB) already takes 7-8GB of that, and Docker overhead, the vector DB and ordinary desktop applications can push the total into swap, which degrades performance significantly. No RAM upgrade path exists post-purchase. Models above 13B parameters are not feasible.
TOOLCHAIN: The recommended stack is Ollama (easiest, native Metal support) or llama.cpp (more control), OpenWebUI as frontend (localhost:8080 via Docker), MiniLM-v2 or Phi-3-mini-embedding for RAG embeddings, and Qdrant or Chroma for the vector DB (both have native arm64 images). For context, an M3 with 18GB achieves ~40 tokens/sec on a 13GB llamafile. Used M2 Pro 32GB units start at around $800 and offer significantly more headroom.
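The memory arithmetic in the RISKS item can be sketched as a simple budget. The OS, 7B model and embedding figures are the upper ends of the ranges quoted above. The Docker and vector DB figure is an assumption chosen for illustration, not a measurement, and the function name is hypothetical.

```python
TOTAL_RAM_GB = 16.0  # fixed on the Mac Mini M2; no upgrade path


def memory_headroom_gb(components: dict[str, float],
                       total_gb: float = TOTAL_RAM_GB) -> float:
    """Return the RAM left over after all listed components are resident."""
    return total_gb - sum(components.values())


budget = {
    "os_baseline": 4.0,           # upper end of the 3-4GB quoted in this note
    "llm_7b_4bit": 6.0,           # 7B model figure quoted in this note
    "embedding_model": 2.0,       # upper end of the 1-2GB quoted in this note
    "docker_and_vector_db": 2.0,  # ASSUMPTION: illustrative, not measured
}

headroom = memory_headroom_gb(budget)
will_likely_swap = headroom < 2.0  # ASSUMPTION: 2GB reserve for other apps
```

With these numbers the stack leaves about 2GB free, so a browser or a longer context window could be enough to push the machine into swap.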

Open Questions

What are real-world tokens-per-second benchmarks specifically on M2 16GB (not M3/M4) for Mistral-7B and Phi-3-mini at 4-bit quantization, and at what model size does swap usage begin degrading performance noticeably?
Is it practical to run a persistent background service (e.g., always-on Ollama + embedding model) on M2 16GB alongside normal desktop workloads, or does memory pressure make this unreliable for daily use?
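One way to gather the missing tokens-per-second numbers is to query a locally running Ollama server and read the timing fields it returns. The sketch below assumes Ollama is listening on its default port 11434 and that the named model has already been pulled. It relies on the `eval_count` and `eval_duration` (nanoseconds) fields of a non-streaming `/api/generate` response. The helper names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default endpoint


def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama response: tokens / seconds."""
    seconds = response["eval_duration"] / 1e9  # eval_duration is in nanoseconds
    return response["eval_count"] / seconds


def run_once(model: str, prompt: str, url: str = OLLAMA_URL) -> dict:
    """Send one non-streaming generate request and return the parsed JSON."""
    payload = json.dumps(
        {"model": model, "prompt": prompt, "stream": False}
    ).encode("utf-8")
    request = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

For example, `tokens_per_second(run_once("mistral", "Explain unified memory in two sentences."))` gives one data point. Repeating the call while watching memory pressure in Activity Monitor would show whether swap is slowing generation.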

Entities

mac-mini-m2 mac-mini-m4-pro mac-studio apple llama-cpp ollama openwebui qdrant chroma langchain llamaindex docker phi-3-mini-instruct gemma-2-7b-instruct mistral-7b-instruct llama-2-7b-chat meta openclaw bill-wang simon-willison hacker-news deepseek-coder-6-7b

Concepts

gguf-4-bit-quantisation local-llm-inference metal-gpu-acceleration unified-memory-architecture ram-capacity-constraints retrieval-augmented-generation containerised-ai-stack-deployment
