Running a 671-billion-parameter language model on hardware that costs less than a mid-range server and lives in your own server room was theoretical 18 months ago. As of June 2026, it is a documented production workflow — and the enabling technology arrived quietly inside macOS Tahoe 26.2.
The Infrastructure Shift: JACCL and RDMA over Thunderbolt 5
macOS Tahoe 26.2 ships JACCL (Joint Apple Compute Cluster Library), Apple's distributed inference backend for the MLX machine-learning framework. JACCL runs MLX collective operations over RDMA (Remote Direct Memory Access) on Thunderbolt 5 connections, reaching 50–60 Gbps bandwidth at sub-50 µs inter-node latency, according to community-reported measurements.
The practical consequence: two or more Apple Silicon Macs connected via Thunderbolt 5 can pool their unified memory as a single address space. A model too large to fit in one machine's RAM is sharded automatically across nodes, with weight transfers happening at near-memory speeds rather than network speeds.
EXO: Open-Source Orchestration for Mac Clusters
The open-source project EXO (45.2k GitHub stars, version 1.0.71 as of April 2026) puts a usable interface on top of JACCL. EXO auto-discovers Apple Silicon Macs on the local network, distributes model weights across them, and exposes a single OpenAI-compatible Chat API endpoint — which means any existing LLM integration using the OpenAI SDK drops into an EXO cluster without modification.
EXO-documented benchmarks report the following throughput gains over a single-device baseline:
- 1.8× throughput on 2 nodes
- 3.2× throughput on 4 nodes
- 99% reduction in inter-device latency versus TCP-based networking (RDMA vs. standard Ethernet)
Models confirmed running on EXO clusters: DeepSeek v3.1 671B (8-bit), Qwen3-235B (8-bit), Kimi K2, Llama 3.2 (all sizes).
EXO requires macOS Tahoe 26.2 or later on every node. Supported hardware: M4 Pro Mac Mini, M4 Max Mac Studio, M4 Max MacBook Pro, M3 Ultra Mac Studio.
Hardware Configurations and Reported Costs
Entry Cluster: 4 × Mac Mini M4 Pro (36 GB) — approximately €6,000–8,000
Four M4 Pro Mac Minis linked via a Thunderbolt 5 hub aggregate 144 GB of unified memory and 128 GPU cores. At this configuration, practitioners in the community report:
- Qwen3-235B (8-bit): approximately 20–30 tokens/second in generation
- Llama 3.2 70B: approximately 60–80 tokens/second — suitable for real-time conversational applications
Each M4 Pro Mac Mini (36 GB) costs approximately €1,500–2,000 depending on configuration and market, putting a four-node cluster in the €6,000–8,000 range. For businesses currently spending on cloud API access, the break-even horizon depends on volume; at moderate usage for a mid-sized team, practitioners report payback periods of 12–24 months.
Scale Cluster: 4 × Mac Studio M3 Ultra — approximately €40,000–50,000
Four M3 Ultra Mac Studios aggregate approximately 1.5 TB of unified memory. Community practitioners report this configuration runs DeepSeek v3.1 671B at 8-bit quantisation at approximately 25 tokens/second — slower than an NVIDIA H100 cluster, but fully on-premise at roughly 5% of equivalent GPU cluster cost.
When engineer Ronald Mannak described the concept on X as "your own home cluster of Mac Mini or Mac Studio for distributed local LLM inference", the thread generated significant discussion among practitioners — a signal of how broadly the idea had taken hold even before macOS Tahoe made it a turnkey capability.
Why This Matters Beyond the Benchmarks
For European businesses, the argument for on-premise clusters is not primarily about cost. It is about data sovereignty.
When inference runs inside your cluster:
- No data leaves your network. Client contracts, HR records, financial models, source code — none of it reaches a cloud provider's infrastructure.
- GDPR compliance is structural. You are not relying on a vendor's data processing agreement; you control the physical hardware and the network boundary.
- EU AI Act documentation (Articles 13, 26, and 50) becomes considerably simpler when you can demonstrate full control over the AI system's deployment environment.
This matters especially for businesses in regulated sectors — legal, financial services, healthcare, and manufacturing — where a single accidental data egress event can carry disproportionate regulatory consequences.
For a practical review of what EU AI Act deployer obligations look like for on-premise systems, see our local AI overview.
Setup Considerations Before You Buy
Network topology. JACCL RDMA requires Thunderbolt 5 direct connections or a certified Thunderbolt 5 hub. Standard Gigabit Ethernet falls back to TCP and loses the latency advantage. Budget for a quality hub if you are connecting more than two nodes.
Quantisation trade-offs. DeepSeek 671B and Qwen3-235B run at 8-bit quantisation on these clusters, which reduces memory footprint while introducing minor precision loss. For the business applications that motivate most SMB deployments — document analysis, classification, summarisation, internal search, code generation — the quality difference versus full precision is typically imperceptible in practitioner reports.
Model storage. A 671B 8-bit model occupies approximately 350–400 GB on disk. EXO supports NFS mounts via EXOMODELSREADONLYDIRS, so a single NAS unit can serve model weights to all nodes, avoiding redundant storage.
Power and cooling. Four M4 Pro Mac Minis under sustained inference load draw approximately 400–600 W combined, according to practitioner estimates. Standard office infrastructure handles this without modification.
Who Should Evaluate This Now
The Mac cluster approach fits organisations that:
- Handle data subject to GDPR, NDA, or sector-specific confidentiality requirements
- Need frontier-model reasoning capability beyond what 7B–14B local models deliver
- Have a multi-year hardware horizon that can absorb €6K–50K in upfront cost
For pilot projects involving sensitive or regulated data, the architectural simplicity of an EXO cluster — one API endpoint, zero cloud credentials, no data egress path — often makes it the cleaner option compared to private cloud deployments or hybrid architectures.
If you want to benchmark your specific workload against reference cluster hardware before committing, reach out to Freshlab. We can run your use case and give you a realistic throughput and quality estimate for the models that fit your data.