Summary

A reference guide for matching open-source LLMs to Apple Silicon Mac RAM tiers, using Q4 quantization as the default optimization. The core claim: Apple’s unified memory architecture (CPU and GPU sharing one memory pool) makes Macs uniquely efficient for local LLM inference compared to traditional PC setups with separate VRAM. Ollama is the recommended runtime.

Key Points

  • Unified memory is the unlock: Apple Silicon’s shared CPU/GPU memory pool eliminates the PCIe bottleneck — model weights don’t need to be copied between system RAM and VRAM, giving Macs a structural inference advantage over comparably-priced PCs.
  • Q4 quantization is the practical standard: A Q4-quantized 7B model uses ~3.5GB RAM vs. ~14GB for FP16. This makes 7–14B models viable on 8–16GB machines.
  • RAM tier recommendations (Q4 quantized):
    • 8GB: Phi-3 3.8B, Qwen 4B, Gemma 2B (~18–37 tok/s)
    • 16–18GB: Llama 3 8B, DeepSeek Coder v2 16B, Qwen 14B (~26–65 tok/s depending on chip)
    • 24–36GB: Mixtral 8x7B (MoE), Gemma2 27B, Qwen 32B (~34–72 tok/s)
    • 48–64GB: Llama 3 70B, Mixtral 8x22B, DeepSeek Coder 67B (~16–40 tok/s)
    • 96–512GB: Command R+ 104B, Llama 3.1 405B (Q2–Q4), DeepSeek V2 236B
  • Ollama handles model download, conversion, and Apple Silicon optimization automatically — the lowest-friction entry point.
  • MoE architecture advantage: Mixtral 8x7B delivers “knowledge of a 47B model with the speed of a 13B model” — efficient for RAM-constrained users.

Newsletter Angles

  • The democratization angle: frontier-adjacent models now run on consumer hardware with no cloud dependency. This is AI sovereignty at the individual/corporate level, not just the nation-state level.
  • Privacy as a first-class argument: local inference means no data leaving your machine — a meaningful differentiator for legal, medical, financial, or journalistic use cases.
  • Apple Silicon as accidental AI infrastructure play: Apple built unified memory for media workflows; it turned out to be the optimal architecture for LLM inference. Classic emergent advantage from architectural decision made for unrelated reasons.

Entities Mentioned

  • Apple — Apple Silicon unified memory architecture as the enabling hardware
  • Meta — Llama 3 8B and 70B as reference models throughout
  • Ollama — primary recommended runtime

Concepts Mentioned

  • On-Device AI — the central practice this guide enables
  • AI Sovereignty — local inference as individual/corporate data sovereignty

Quotes

“Running large language models (LLMs) on your local machine is a game-changer. It offers unparalleled privacy, eliminates API costs, and gives you the freedom to experiment without limits.”

“A Q4 quantized 7B model only needs about 3.5GB of RAM, making it perfect for Macs with less memory.”

Notes

Published July 2025. Superseded by the 2026 edition for model recommendations but remains useful for foundational concepts (unified memory, quantization math). Model-specific benchmark numbers (tokens/sec) are hardware-dependent and should be treated as ballpark figures. The 2026 edition adds Phi-4 Mini, Qwen3, and DeepSeek R1 671B.