llama.cpp recently crossed 100,000 GitHub stars β a milestone its creator Georgi Gerganov marked with a half-joking prediction on X: "90% of all AI agents will be running locally with llama.cpp." Jokes aside, the trajectory is real. Local AI has moved from hobbyist experiment to team-grade infrastructure, and the tooling is finally catching up.
The most telling sign of that maturity is a feature that has been a long-standing request: router mode in llama-server. A single server process can now manage and switch between multiple models on demand β no restarts, no wrapper tools, no cloud dependency.
What Is Router Mode?
Until recently, llama-server was a one-model-per-process tool. Running a coding assistant alongside a general chat model on the same Mac Studio required either two separate processes or an orchestration layer like Ollama in front of them.
Router mode changes that entirely. As reported by Victor Mustar on X, the new capability set includes:
"Auto-discover GGUFs from cache β’ Load on first request β’ Each model runs in its own process β’ Route by
model(OpenAI-compatible API) β’ LRU unload at--models-max"
In practice this means llama-server becomes a lightweight model-aware reverse proxy. Each model runs in a dedicated child process β so a crash in one model does not take down the others. Models are loaded on first request and evicted using least-recently-used (LRU) logic when the --models-max cap is reached.
Setting Up Multi-Model Serving
Activating router mode requires one change from a normal llama-server launch: start the server without a --model flag. The server detects the omission and enters router mode automatically, discovering available GGUF files from its cache directory.
# Router mode β no --model flag, set a model cache limit
llama-server --port 8080 --models-max 3
From there, any OpenAI-compatible client routes to a specific model by setting the model field in the request body:
{
"model": "llama-3.3-70b-q4_k_m",
"messages": [{"role": "user", "content": "Summarise this contract:"}]
}
If the requested model is not currently loaded, the server loads it from cache before responding. If loading it would exceed --models-max, the least-recently-used model is evicted first. Community practitioners report switching times of 3β10 seconds depending on disk read speed and available VRAM bandwidth β fast enough for most asynchronous workflows.
Practical Team Use Cases
Use Case 1: Developer Team on a Shared Server
A five-person engineering team runs three models on a single Mac Studio M3 Ultra (192 GB unified memory):
- DeepSeek-R1:32B (Q4) for complex reasoning and code review
- Qwen2.5-Coder:7B (Q8) for fast IDE completions via an OpenAI-compatible plugin
- Llama 3.3:70B (Q4) for documentation, commit messages, and general queries
Each tool β IDE plugin, Slack bot, CI pipeline β sends the appropriate model name in its API request. No server restarts. No manual model swaps. The router handles everything.
Use Case 2: Cross-Department Business Stack
Non-technical teams benefit equally from multi-model routing:
- HR: a small Phi-4 model (3.8B) for drafting job descriptions and screening criteria summaries
- Customer support: Llama 3.3 for templated email replies and FAQ generation
- Finance: a dedicated model for invoice categorisation and report summarisation
All departments share a single internal API endpoint. No data reaches an external server β full GDPR compliance without any cloud dependency.
Use Case 3: Scheduled Batch Processing
Overnight, a large 70B model processes batch workloads β contract analysis, document classification, report generation. During business hours, faster smaller models handle interactive queries. LRU eviction manages the transition automatically, with no cron jobs or server restarts needed.
Router Mode vs. Ollama: When to Use Each
Ollama remains an excellent starting point, particularly for teams that want a straightforward GUI-driven experience. Router mode in llama.cpp occupies a different position on the complexity-control spectrum.
| Feature | llama.cpp Router | Ollama |
|---|---|---|
| Setup effort | Medium (CLI) | Low (GUI + CLI) |
| Raw inference speed | Higher (no abstraction) | Good |
| Per-model process isolation | Yes | No |
| Multi-model management | Yes (native) | Yes (native) |
| OpenAI API compatibility | Yes | Yes |
| Configuration depth | Very high | Moderate |
The router mode wins on raw performance and isolation guarantees. Ollama wins on ease of setup and user-friendly management. For teams comfortable with the command line and wanting maximum inference throughput with process-level isolation, llama.cpp's router mode is the stronger production choice.
Hardware Considerations
Router mode does not change the fundamental hardware constraint: unified memory determines how many models stay simultaneously resident. The more models you want active at once, the more RAM you need.
| Hardware | RAM | Practical configuration |
|---|---|---|
| Mac Mini M4 Pro | 64 GB | 1β2 small models (3Bβ7B Q4) |
| Mac Studio M3 Max | 64β96 GB | 2β3 mid-size models (7Bβ14B Q4/Q8) |
| Mac Studio M3 Ultra | 192 GB | 3β5 models including 70B Q4 |
Community measurements report 30β80 tokens per second for 7B models in Q4 quantisation on Apple Silicon β fast enough for interactive use with multiple concurrent users. Larger models (70B Q4) typically deliver 8β20 tok/s on the M3 Ultra, which suits document analysis and batch workloads where throughput matters more than latency.
On the cost side, a Mac Studio M3 Max at roughly β¬2,000β3,500 replaces API spend that commonly runs β¬300β600 per month for active team usage. Break-even sits around four to eight months β after which inference is effectively free at the marginal cost of electricity.
Data Sovereignty by Architecture
The deeper advantage of llama.cpp and local inference generally is architectural: nothing leaves the network perimeter. No API key to rotate, no third-country data transfer to justify under GDPR, no model provider reading your prompt context.
For teams handling sensitive data β legal documents, patient records, financial reports, HR files β this is not a privacy preference but a compliance requirement. Local inference removes the question entirely. Learn more about how we approach local AI and data sovereignty and our broader local AI platform.
Next Step
Router mode marks a quiet but significant maturity point for the local AI stack. You no longer need a separate tool to manage multiple models β llama.cpp does it natively, at full performance, with process-level isolation. For teams already running local inference, this is a straightforward upgrade that removes an entire category of operational friction.
If you want to deploy a multi-model local AI server for your team without the DIY overhead, start a pilot project with us β we typically have a working stack running within two weeks.