In March 2026, Ollama quietly shipped a change that matters more than any model release of the past year: starting with version 0.19, the most widely used local LLM tool switched from its llama.cpp Metal backend to MLX — Apple's own machine learning framework — as the default runtime on Apple Silicon Macs.
The practical result: the same Mac Studio or Mac Mini that was running local models acceptably now runs them fast enough to compete with cloud API response times. For SMBs that have already invested in Apple Silicon hardware, this is a free performance upgrade requiring no new hardware.
What Is MLX and Why Does It Change Things?
MLX was built by Apple's machine learning research team specifically to exploit the unified memory architecture of Apple Silicon. On conventional hardware with a discrete GPU, model weights must be copied from CPU RAM to GPU VRAM before inference can begin — and back again on each pass. That transfer is a bottleneck.
Apple Silicon doesn't have separate GPU memory. CPU, GPU, and Neural Engine all share the same memory pool. MLX was designed around this: it addresses all compute units directly, without data moving across buses. The result is lower latency, higher throughput, and better utilisation of GPU cores — especially noticeable at larger model sizes and longer context windows.
The llama.cpp backend that Ollama previously used is a solid, cross-platform runtime, but it wasn't built for Apple's specific architecture. That compromise is now gone.
Performance Numbers Reported by the Community
Practitioners and developers have published benchmarks showing consistent gains:
M4 Max (36–128 GB Unified Memory)
- Qwen 3.5 9B, 4-bit quantisation: ~45–60 tok/s with MLX vs ~35–50 tok/s with llama.cpp Metal
- Qwen 3.5 35B-A3B (MoE): ~70–80 tok/s (MLX) vs ~45 tok/s (llama.cpp Metal)
M5 Max
- Qwen 3.5 35B-A3B: prefill from ~1,150 to ~1,810 tok/s (+57%), decode from ~58 to ~112 tok/s (+93%)
These figures come from community benchmarks and vary depending on system load, quantisation level, and model size. Reports on M3 Ultra describe decode improvements in the 40–60% range as well.
Ollama's official X account describes the update as bringing "much faster performance to accelerate demanding work on macOS" — listing personal assistants, coding agents, and other local workflows as direct beneficiaries.
Which Hardware Benefits Most?
The MLX backend is active by default on all Apple Silicon Macs running Ollama 0.19 or later. No flags, no configuration changes required. Benefits scale with available unified memory:
| Hardware | Unified Memory | Practical Model Range |
|---|---|---|
| Mac Studio M3 Ultra | up to 192 GB | 70B models, comfortable continuous use |
| Mac Studio M4 Max | up to 128 GB | 70B models, high tok/s |
| Mac Mini M4 Pro | 24–48 GB | Up to 14B models very fast |
| MacBook Pro M4 Max | 36–128 GB | 14B–32B depending on config |
For 70B models, 64 GB unified memory is the practical floor. At 32 GB, 4-bit quantised 32B models run at production-worthy speeds.
Best Open-Source Models on Apple Silicon (May 2026)
Community benchmarks point to three recommendations for 2026:
Llama 4 Scout 17B
Meta's latest open-source release uses a Mixture-of-Experts architecture: rather than activating all 17 billion parameters simultaneously, it selects a relevant subset per task. This significantly reduces memory requirements while maintaining strong quality. Practitioners currently recommend it as the best overall choice for Apple Silicon.
ollama pull llama4-scout
Qwen 3 (7B to 32B)
Alibaba's Qwen 3 family is reported by practitioners as the strongest open-source option for code tasks. Qwen 3 14B runs at ~40–55 tok/s on a Mac Mini M4 Pro with 24 GB — production-ready for internal tooling.
ollama pull qwen3:14b
Gemma 3 12B
Google's Gemma 3 12B is the recommended choice at 16 GB RAM. It shows particularly strong results on structured extraction tasks and European languages, including German and Spanish — relevant for EU-based SMBs.
ollama pull gemma3:12b
Practical Use Cases for SMBs
Faster inference translates directly into better user experiences for business workloads.
Private Coding Assistant
Developer Anders Brownworth noted on X that Xcode's Apple Intelligence now supports local LLMs via Ollama: "in Xcode's Apple Intelligence you can add a local LLM using ollama and have private AI coding assistance without an internet connection." The same applies to Claude Code and other coding agents: point them at a local Ollama endpoint, and no source code or prompts leave the machine.
Document Q&A and RAG
A local RAG system querying contracts, SOPs, or email archives responds noticeably faster at 60 tok/s than at 40 tok/s — especially on documents that require large context windows. Latency drops from several seconds to under one second for most queries.
Internal Chat Interface
Open WebUI deployed as a company-internal ChatGPT alternative serves multiple simultaneous users much more smoothly at higher inference rates. Waiting times that previously made the tool feel sluggish now disappear.
Agentic Workflows
Ollama 0.21 added support for Hermes Agent from NousResearch, a self-improving agent accessible via ollama launch hermes. Local agent frameworks like LangGraph benefit directly from the higher throughput when orchestrating multi-step tasks.
GDPR Compliance: Fully Intact
Speed improvements change nothing about the fundamental data architecture: all processing stays on-device. Prompts, intermediate outputs, and model responses never leave your network. For EU SMBs this means:
- No data processing agreement (DPA) required with an AI vendor
- No third-country data transfer, no US jurisdiction over company data
- No monthly API costs
Many EU member states offer SMB digitalisation grants. In Germany, BAFA programmes for business digitalisation and KfW investment loans may cover part of a Mac Studio infrastructure investment — based on our reading of current programme conditions, though eligibility should be confirmed directly with the relevant authority. In the UK, the Made Smarter programme covers similar ground. Across the EU, regional ERDF co-funded grants are often available for AI infrastructure investment.
The kAIra Toolkit runs fully on this MLX-backed Ollama stack. Existing Freshlab pilots receive the performance gain with a simple ollama update.
Rapid-MLX: Even Faster for Advanced Users
A newer open-source alternative, Rapid-MLX, goes further still. Developer Raullen reports on X: "Rapid-MLX is built specifically for Apple Silicon. Tested across 18 models vs Ollama, mlx-lm, llama.cpp — fastest on 16 of them." It introduces DeltaNet state snapshots for faster multi-turn caching. For production SMB use, Ollama with MLX remains the more mature and stable choice; Rapid-MLX is suited for developers who want to push the performance ceiling further.
Setup: Three Steps
MLX is the default since Ollama 0.19. No configuration needed:
- Update Ollama:
curl -fsSL https://ollama.com/install.sh | shon macOS, or re-download from ollama.com - Pull a model:
ollama pull qwen3:14b— a strong all-round choice at 14 GB - Test:
ollama run qwen3:14b "Explain unified memory and MLX in three sentences"
Optional: deploy Open WebUI for a multi-user chat interface. Teams already running a Freshlab stack need no further changes.
What This Means in Practice
The MLX switch is not experimental — it is the new production standard on Apple Silicon. An M3 Ultra or M4 Max Mac Studio running Ollama 0.19+ is now a credible enterprise-grade AI inference node: no cloud dependency, no ongoing token costs, fully GDPR-compliant, and faster than it was six months ago without any hardware change.
Ready to assess which models and use cases make sense for your business? Start with a Freshlab pilot — free initial consultation.