Ollama + MLX on Apple Silicon: Local LLMs Up to 2× Faster

30. May 2026 English 6 min read

ollama mlx apple-silicon

In March 2026, Ollama quietly shipped a change that matters more than any model release of the past year: starting with version 0.19, the most widely used local LLM tool switched from its llama.cpp Metal backend to MLX — Apple's own machine learning framework — as the default runtime on Apple Silicon Macs.

The practical result: the same Mac Studio or Mac Mini that was running local models acceptably now runs them fast enough to compete with cloud API response times. For SMBs that have already invested in Apple Silicon hardware, this is a free performance upgrade requiring no new hardware.

What Is MLX and Why Does It Change Things?

MLX was built by Apple's machine learning research team specifically to exploit the unified memory architecture of Apple Silicon. On conventional hardware with a discrete GPU, model weights must be copied from CPU RAM to GPU VRAM before inference can begin — and back again on each pass. That transfer is a bottleneck.

Apple Silicon doesn't have separate GPU memory. CPU, GPU, and Neural Engine all share the same memory pool. MLX was designed around this: it addresses all compute units directly, without data moving across buses. The result is lower latency, higher throughput, and better utilisation of GPU cores — especially noticeable at larger model sizes and longer context windows.

The llama.cpp backend that Ollama previously used is a solid, cross-platform runtime, but it wasn't built for Apple's specific architecture. That compromise is now gone.

Performance Numbers Reported by the Community

Practitioners and developers have published benchmarks showing consistent gains:

M4 Max (36–128 GB Unified Memory)

Qwen 3.5 9B, 4-bit quantisation: ~45–60 tok/s with MLX vs ~35–50 tok/s with llama.cpp Metal
Qwen 3.5 35B-A3B (MoE): ~70–80 tok/s (MLX) vs ~45 tok/s (llama.cpp Metal)

M5 Max

Qwen 3.5 35B-A3B: prefill from ~1,150 to ~1,810 tok/s (+57%), decode from ~58 to ~112 tok/s (+93%)

These figures come from community benchmarks and vary depending on system load, quantisation level, and model size. Reports on M3 Ultra describe decode improvements in the 40–60% range as well.

Ollama's official X account describes the update as bringing "much faster performance to accelerate demanding work on macOS" — listing personal assistants, coding agents, and other local workflows as direct beneficiaries.

Which Hardware Benefits Most?

The MLX backend is active by default on all Apple Silicon Macs running Ollama 0.19 or later. No flags, no configuration changes required. Benefits scale with available unified memory:

Hardware	Unified Memory	Practical Model Range
Mac Studio M3 Ultra	up to 192 GB	70B models, comfortable continuous use
Mac Studio M4 Max	up to 128 GB	70B models, high tok/s
Mac Mini M4 Pro	24–48 GB	Up to 14B models very fast
MacBook Pro M4 Max	36–128 GB	14B–32B depending on config

For 70B models, 64 GB unified memory is the practical floor. At 32 GB, 4-bit quantised 32B models run at production-worthy speeds.

Best Open-Source Models on Apple Silicon (May 2026)

Community benchmarks point to three recommendations for 2026:

Llama 4 Scout 17B

Meta's latest open-source release uses a Mixture-of-Experts architecture: rather than activating all 17 billion parameters simultaneously, it selects a relevant subset per task. This significantly reduces memory requirements while maintaining strong quality. Practitioners currently recommend it as the best overall choice for Apple Silicon.

ollama pull llama4-scout

Qwen 3 (7B to 32B)

Alibaba's Qwen 3 family is reported by practitioners as the strongest open-source option for code tasks. Qwen 3 14B runs at ~40–55 tok/s on a Mac Mini M4 Pro with 24 GB — production-ready for internal tooling.

ollama pull qwen3:14b

Gemma 3 12B

Google's Gemma 3 12B is the recommended choice at 16 GB RAM. It shows particularly strong results on structured extraction tasks and European languages, including German and Spanish — relevant for EU-based SMBs.

ollama pull gemma3:12b

Practical Use Cases for SMBs

Faster inference translates directly into better user experiences for business workloads.

Private Coding Assistant

Developer Anders Brownworth noted on X that Xcode's Apple Intelligence now supports local LLMs via Ollama: "in Xcode's Apple Intelligence you can add a local LLM using ollama and have private AI coding assistance without an internet connection." The same applies to Claude Code and other coding agents: point them at a local Ollama endpoint, and no source code or prompts leave the machine.

Document Q&A and RAG

A local RAG system querying contracts, SOPs, or email archives responds noticeably faster at 60 tok/s than at 40 tok/s — especially on documents that require large context windows. Latency drops from several seconds to under one second for most queries.

Internal Chat Interface

Open WebUI deployed as a company-internal ChatGPT alternative serves multiple simultaneous users much more smoothly at higher inference rates. Waiting times that previously made the tool feel sluggish now disappear.

Agentic Workflows

Ollama 0.21 added support for Hermes Agent from NousResearch, a self-improving agent accessible via ollama launch hermes. Local agent frameworks like LangGraph benefit directly from the higher throughput when orchestrating multi-step tasks.

GDPR Compliance: Fully Intact

Speed improvements change nothing about the fundamental data architecture: all processing stays on-device. Prompts, intermediate outputs, and model responses never leave your network. For EU SMBs this means:

No data processing agreement (DPA) required with an AI vendor
No third-country data transfer, no US jurisdiction over company data
No monthly API costs

Many EU member states offer SMB digitalisation grants. In Germany, BAFA programmes for business digitalisation and KfW investment loans may cover part of a Mac Studio infrastructure investment — based on our reading of current programme conditions, though eligibility should be confirmed directly with the relevant authority. In the UK, the Made Smarter programme covers similar ground. Across the EU, regional ERDF co-funded grants are often available for AI infrastructure investment.

The kAIra Toolkit runs fully on this MLX-backed Ollama stack. Existing Freshlab pilots receive the performance gain with a simple ollama update.

Rapid-MLX: Even Faster for Advanced Users

A newer open-source alternative, Rapid-MLX, goes further still. Developer Raullen reports on X: "Rapid-MLX is built specifically for Apple Silicon. Tested across 18 models vs Ollama, mlx-lm, llama.cpp — fastest on 16 of them." It introduces DeltaNet state snapshots for faster multi-turn caching. For production SMB use, Ollama with MLX remains the more mature and stable choice; Rapid-MLX is suited for developers who want to push the performance ceiling further.

Setup: Three Steps

MLX is the default since Ollama 0.19. No configuration needed:

Update Ollama: curl -fsSL https://ollama.com/install.sh | sh on macOS, or re-download from ollama.com
Pull a model: ollama pull qwen3:14b — a strong all-round choice at 14 GB
Test: ollama run qwen3:14b "Explain unified memory and MLX in three sentences"

Optional: deploy Open WebUI for a multi-user chat interface. Teams already running a Freshlab stack need no further changes.

What This Means in Practice

The MLX switch is not experimental — it is the new production standard on Apple Silicon. An M3 Ultra or M4 Max Mac Studio running Ollama 0.19+ is now a credible enterprise-grade AI inference node: no cloud dependency, no ongoing token costs, fully GDPR-compliant, and faster than it was six months ago without any hardware change.

Ready to assess which models and use cases make sense for your business? Start with a Freshlab pilot — free initial consultation.