Harris Notes

Local LLM Inference

Apr 14, 2026 · 1 min read

  • concept
  • general
  • process

Local LLM inference means running large language models entirely on local hardware, without sending data to external APIs. This keeps data private and eliminates per-token costs.
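As a rough illustration of what "no external APIs" looks like in practice, the sketch below sends a prompt to a model server on the same machine. It assumes an Ollama server is running on its default port (11434) and that a model has already been pulled; the model name `mistral` and the helper names `build_request` and `generate` are placeholders chosen for this example, not something this note prescribes.

```python
import json
import urllib.request

# Assumption: an Ollama server is listening on its default local port.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str, url: str = OLLAMA_URL) -> urllib.request.Request:
    """Build a non-streaming generate request aimed at the local server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str) -> str:
    """Send the prompt to the local server and return the generated text."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.loads(response.read().decode("utf-8"))
    # With "stream": False the server replies with a single JSON object
    # whose "response" field holds the generated text.
    return body["response"]
```

Calling `generate("mistral", "Explain unified memory in one sentence.")` would return the model's reply as a string. The request targets `localhost` only, so the prompt and the reply stay on the machine and nothing is billed per token.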

Related

  • gguf-4-bit-quantisation
  • metal-gpu-acceleration
  • ram-capacity-constraints
  • ollama
  • llama-cpp
  • openwebui
  • mac-mini-m2


Backlinks

  • Containerised AI Stack Deployment
  • GGUF 4-bit Quantisation
  • Metal GPU Acceleration
  • RAM Capacity Constraints
  • Retrieval-Augmented Generation
  • Unified Memory Architecture
  • Bill WANG
  • DeepSeek-Coder 6.7B
  • Gemma-2-7B-Instruct
  • Hacker News
  • LangChain
  • Llama 2 7B-Chat
  • llama.cpp
  • LlamaIndex
  • Mac Mini M2
  • Mac Mini M4 Pro
  • Mac Studio
  • Meta
  • Mistral-7B-Instruct
  • Ollama
  • OpenClaw
  • OpenWebUI
  • Phi-3-mini-instruct
  • Simon Willison
  • 20260414-1234-successes-and-risks-running-llm--locally-on-mac-mi
  • index
