Rapid-MLX: Fastest Open-Source Local LLM on Apple Silicon

26. Apr 2026 English 5 min read

apple-silicon local-llm mlx

Running a large language model on your own hardware has always forced a trade-off: privacy or speed. Cloud APIs are fast, but no European business wants to route customer documents, contracts, or internal analysis through an external endpoint it doesn't control. Self-hosted stacks like Ollama or llama.cpp solve the data-sovereignty problem but have historically been slower than their cloud counterparts. A new open-source project aims to break that trade-off: Rapid-MLX is built from the ground up for Apple Silicon and, according to its documentation and community benchmarks, is currently the fastest local LLM inference engine available for Mac hardware.

Apple Silicon as an Inference Platform

Apple's M-series chips have a fundamentally different architecture from x86 processors or NVIDIA GPUs. Instead of separate CPU and GPU memory pools, they use Unified Memory: both compute units access the same physical RAM directly, with no PCIe copy overhead. For large language models, that means model weights sit in the same memory pool that the CPU, GPU, and Neural Engine all access simultaneously.

Apple's own MLX framework was designed specifically for this architecture. It uses native Metal compute kernels optimized for the unified memory bus. Traditional engines like llama.cpp or Ollama's default backend were conceived for CUDA GPUs or generic ARM CPUs. They run on Apple Silicon, but they don't address the hardware in its native language.

Rapid-MLX starts from MLX natively — no porting layer, no architectural compromises.

What Is Rapid-MLX?

Rapid-MLX is an open-source inference server built entirely on the MLX framework and designed as a drop-in replacement for the OpenAI API. On X, @Raullen described it as "the fastest local LLM inference engine on Mac" — built specifically for Apple Silicon and tested head-to-head against Ollama, mlx-lm, and llama.cpp across 18 models.

According to the project's documentation, Rapid-MLX delivers:

4.2× faster throughput than Ollama (llama.cpp backend) on an M3 Ultra, measured across

multiple models

0.08s cached TTFT — time to first token on a cached prompt, critical for interactive

applications

17 tool parsers for structured real-time tool calling
Prompt cache and reasoning separation for chain-of-thought architectures
Cloud routing as an optional fallback when the local model hits capacity limits
Full OpenAI API compatibility: /v1/chat/completions endpoint, bearer token auth,

streaming responses

That API compatibility matters in practice. Coding agents like Claude Code, Cursor, or Aider, internal RAG pipelines, or any LLM client already wired to OpenAI-style endpoints can be reconfigured to point at a local Rapid-MLX server without changing a single line of application code.

Benchmark Numbers and Technical Details

According to community benchmarks run on a Mac Studio M3 Ultra with 256 GB Unified Memory, 22 models were tested across 6 different inference engines. Rapid-MLX ranked first in 16 of 18 benchmarked comparisons.

Practitioners report throughput figures in the range of 60–120 tok/s for 7B-class models and 15–35 tok/s for 70B-class models under Rapid-MLX, compared to typically 20–40 tok/s and 5–12 tok/s respectively under Ollama on the same hardware. These numbers come from community measurements and vary depending on model architecture and context length.

The key technical differentiator is the DeltaNet State Snapshot technique. Hybrid RNN architectures like Qwen3.5 DeltaNet don't use a classical attention mechanism; they maintain a rolling state vector instead. Rapid-MLX can persist that state between requests — rather than recomputing the full context on every multi-turn exchange, the engine reloads a cached snapshot. This reduces both latency and power consumption significantly.

Model coverage

Rapid-MLX supports models commonly used in European SMB deployments: Llama 3.3, Qwen2.5, DeepSeek-V3, and their quantized variants. On a Mac Studio M3 Ultra with 256 GB Unified Memory, the project reports running models with up to 397 billion parameters fully locally — fully offline, no outbound connections.

A GDPR-Safe Local AI Stack for SMBs

For European businesses, the defining advantage is data sovereignty by architecture. Running Rapid-MLX on a Mac Studio in your own office or server room means no requests leave your network, no data reaches a third-party provider, and no Article 28 GDPR data-processing agreement is needed for the AI inference layer itself.

This matters concretely for:

Law firms and tax advisors who cannot feed client documents into external systems
Healthcare providers where patient data falls under heightened protection under Art. 9 GDPR
Industrial businesses that must keep production data, engineering files, or supply-chain

information on-premises

Financial services firms operating under MiFID II, DORA, or equivalent regulation

Unlike privacy guarantees that rest on vendor contracts, a local stack's data boundary is physical. There is no API call that can inadvertently route data to a third-party server if no API call is ever made. For more on this architectural approach, see our pages on local AI infrastructure and data sovereignty.

Investment and Funding Options

A Mac Studio M3 Ultra with 192 GB Unified Memory lists in the €6,000–€8,000 range; the 256 GB configuration proportionally higher. Set against recurring cloud API costs — which can exceed €500–€2,000 per month per team under intensive production use — a one-time hardware investment typically breaks even within 12 to 18 months.

For European SMBs looking at how to structure that investment, several funding paths may be relevant. Based on our reading of current programs, the EU's Horizon Europe and various national digital-transformation co-financing schemes may cover AI infrastructure costs under the right project framing. Specific eligibility varies by country and scope; we recommend consulting a funding advisor before committing to a hardware budget.

Getting Started

Rapid-MLX requires macOS on Apple Silicon (M2 Pro or newer recommended for production use), Python 3.11+, and the MLX and Rapid-MLX packages installed via pip. After setup, the OpenAI- compatible endpoint is typically live within an hour.

For SMBs without in-house ML engineering capacity, Freshlab structures the full journey: hardware selection, model configuration, integration with existing tools, and team onboarding. Our training programme prepares internal teams for day-to-day operation.

If you want a structured proof-of-concept before committing to a full deployment, our pilot project format is designed exactly for that — a defined scope, fixed timeline, and clear success criteria.

Ready to run local AI at cloud speeds — without the cloud?<br> Talk to us about your use case and we'll show you how Rapid-MLX fits into your stack: Start a conversation