Local LLM Router Mode: Run Multiple Models with llama.cpp


llama.cpp recently crossed 100,000 GitHub stars, a milestone its creator Georgi Gerganov marked with a half-joking prediction on X: "90% of all AI agents will be running locally with llama.cpp." Jokes aside, the trajectory is real. Local AI has moved from hobbyist experiment to team-grade infrastructure, and the tooling is finally catching up.

The most telling sign of that maturity is a long-requested feature: router mode in llama-server. A single server process can now manage and switch between multiple models on demand, with no restarts, no wrapper tools, and no cloud dependency.

What Is Router Mode?

Until recently, llama-server was a one-model-per-process tool. Running a coding assistant alongside a general chat model on the same Mac Studio required either two separate server processes or a separate orchestration tool such as Ollama.

Router mode changes that entirely. As reported by Victor Mustar on X, the new capability set includes:

"Auto-discover GGUFs from cache • Load on first request • Each model runs in its own process • Route by model (OpenAI-compatible API) • LRU unload at --models-max"

In practice this means llama-server becomes a lightweight model-aware reverse proxy. Each model runs in a dedicated child process, so a crash in one model does not take down the others. Models are loaded on first request and evicted using least-recently-used (LRU) logic when the --models-max cap is reached.
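The eviction rule is easier to see in a few lines of code. The following is a toy sketch of LRU eviction under a fixed cap, not llama.cpp's actual implementation; the class name `LruModelPool` and its methods are hypothetical and exist only for illustration.

```python
from collections import OrderedDict


class LruModelPool:
    """Toy model of LRU eviction with a fixed cap on loaded models."""

    def __init__(self, models_max):
        if models_max < 1:
            raise ValueError("models_max must be at least 1")
        self.models_max = models_max
        # Keys are model names, ordered from least to most recently used.
        self._loaded = OrderedDict()

    def request(self, model):
        """Mark `model` as used and return the names evicted to make room."""
        evicted = []
        if model in self._loaded:
            self._loaded.move_to_end(model)  # now the most recently used
            return evicted
        while len(self._loaded) >= self.models_max:
            oldest, _ = self._loaded.popitem(last=False)  # drop the LRU entry
            evicted.append(oldest)
        self._loaded[model] = True  # stand-in for spawning a child process
        return evicted

    def loaded(self):
        """Return loaded model names, least recently used first."""
        return list(self._loaded)


pool = LruModelPool(models_max=2)
pool.request("coder")
pool.request("chat")
pool.request("coder")               # refreshes "coder"
evicted = pool.request("reasoner")  # cap reached, so "chat" is evicted
```

With a cap of two, the third distinct model pushes out whichever of the first two was used least recently, which is why "chat" goes rather than "coder".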

Setting Up Multi-Model Serving

Activating router mode requires one change from a normal llama-server launch: start the server without a --model flag. The server detects the omission and enters router mode automatically, discovering available GGUF files from its cache directory.

# Router mode: no --model flag; cap the number of simultaneously loaded models at 3
llama-server --port 8080 --models-max 3
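Once the server is up, a quick sanity check is to ask it which models it knows about. The sketch below assumes the router lists discovered models through the OpenAI-style `/v1/models` endpoint and that the server listens on `localhost:8080` as in the command above; the helper names `parse_model_ids` and `list_models` are hypothetical.

```python
import json
import urllib.request


def parse_model_ids(payload):
    """Extract model ids from an OpenAI-style model list response."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_models(base_url="http://localhost:8080"):
    """Fetch model ids from a running server (assumed endpoint: /v1/models)."""
    with urllib.request.urlopen(f"{base_url}/v1/models", timeout=10) as resp:
        return parse_model_ids(json.load(resp))
```

Calling `list_models()` requires a running server. The ids it returns are the values to put in the `model` field of later requests.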

From there, any OpenAI-compatible client routes to a specific model by setting the model field in the request body:

{
  "model": "llama-3.3-70b-q4_k_m",
  "messages": [{"role": "user", "content": "Summarise this contract:"}]
}

If the requested model is not currently loaded, the server loads it from cache before responding. If loading it would exceed --models-max, the least-recently-used model is evicted first. Community practitioners report switching times of 3–10 seconds depending on disk read speed and available VRAM bandwidth, which is fast enough for most asynchronous workflows.
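The request body shown earlier can be sent from Python with the standard library alone. This is a minimal sketch, assuming the server from the launch command is reachable at `http://localhost:8080` and serves the OpenAI-compatible `/v1/chat/completions` path. The function names and the 120-second timeout are illustrative choices; the generous timeout leaves room for a model that has to be loaded first.

```python
import json
import urllib.request


def build_chat_request(model, prompt, base_url="http://localhost:8080"):
    """Build a POST request whose `model` field selects the target model."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(model, prompt, timeout=120.0):
    """Send the request and return the assistant's reply text."""
    req = build_chat_request(model, prompt)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        reply = json.load(resp)
    return reply["choices"][0]["message"]["content"]
```

A call such as `chat("llama-3.3-70b-q4_k_m", "Summarise this contract:")` needs a running server. The first call to a model that is not yet loaded will take noticeably longer than later ones.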

Practical Team Use Cases

Use Case 1: Developer Team on a Shared Server

A five-person engineering team runs three models on a single Mac Studio M3 Ultra (192 GB unified memory):

  • DeepSeek-R1:32B (Q4) for complex reasoning and code review
  • Qwen2.5-Coder:7B (Q8) for fast IDE completions via an OpenAI-compatible plugin
  • Llama 3.3:70B (Q4) for documentation, commit messages, and general queries

Each tool (IDE plugin, Slack bot, CI pipeline) sends the appropriate model name in its API request. No server restarts. No manual model swaps. The router handles loading and eviction.
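On the client side, this setup comes down to a small lookup from tool to model name. The sketch below is one way to keep that mapping in a single place; the tool keys and model ids are hypothetical and should be replaced with the names your server actually reports.

```python
# Hypothetical model ids for illustration only.
MODEL_FOR_TOOL = {
    "code-review": "deepseek-r1-32b-q4",
    "ide-completion": "qwen2.5-coder-7b-q8",
    "docs": "llama-3.3-70b-q4",
}
DEFAULT_MODEL = "llama-3.3-70b-q4"


def model_for(tool):
    """Return the model id a tool should request, with a general fallback."""
    return MODEL_FOR_TOOL.get(tool, DEFAULT_MODEL)
```

Keeping the mapping in shared configuration means a model can be swapped for the whole team by changing one entry.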

Use Case 2: Cross-Department Business Stack

Non-technical teams benefit equally from multi-model routing:

  • HR: a small Phi-4-mini model (3.8B) for drafting job descriptions and screening criteria summaries
  • Customer support: Llama 3.3 for templated email replies and FAQ generation
  • Finance: a dedicated model for invoice categorisation and report summarisation

All departments share a single internal API endpoint. No data reaches an external server, which keeps prompts and documents inside the company network and simplifies GDPR compliance without any cloud dependency.

Use Case 3: Scheduled Batch Processing

Overnight, a large 70B model processes batch workloads such as contract analysis, document classification, and report generation. During business hours, faster smaller models handle interactive queries. LRU eviction manages the transition automatically, with no model-swapping cron jobs or server restarts needed.

Router Mode vs. Ollama: When to Use Each

Ollama remains an excellent starting point, particularly for teams that want a straightforward GUI-driven experience. Router mode in llama.cpp occupies a different position on the complexity-control spectrum.

| Feature | llama.cpp Router | Ollama |
| --- | --- | --- |
| Setup effort | Medium (CLI) | Low (GUI + CLI) |
| Raw inference speed | Higher (no abstraction) | Good |
| Per-model process isolation | Yes | No |
| Multi-model management | Yes (native) | Yes (native) |
| OpenAI API compatibility | Yes | Yes |
| Configuration depth | Very high | Moderate |

Router mode wins on raw performance and isolation guarantees. Ollama wins on ease of setup and user-friendly management. For teams that are comfortable with the command line and want maximum inference throughput with process-level isolation, llama.cpp's router mode is the stronger production choice.

Hardware Considerations

Router mode does not change the fundamental hardware constraint: unified memory determines how many models stay simultaneously resident. The more models you want active at once, the more RAM you need.

| Hardware | RAM | Practical configuration |
| --- | --- | --- |
| Mac Mini M4 Pro | 64 GB | 1–2 small models (3B–7B Q4) |
| Mac Studio M3 Max | 64–96 GB | 2–3 mid-size models (7B–14B Q4/Q8) |
| Mac Studio M3 Ultra | 192 GB | 3–5 models including 70B Q4 |
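A back-of-the-envelope check of whether a set of models fits in memory multiplies parameter count by bits per weight. The sketch below is a rough estimate only. The bits-per-weight values are assumed approximations rather than exact GGUF figures, and the 25% headroom for KV cache, the operating system, and other processes is an arbitrary illustrative choice.

```python
# Assumed approximate bits per weight; real GGUF files vary by quant variant.
BITS_PER_WEIGHT = {"Q4": 4.5, "Q8": 8.5}


def weights_gb(params_billion, quant):
    """Estimate weight size in GB: billions of params times bits, divided by 8."""
    return params_billion * BITS_PER_WEIGHT[quant] / 8


def fits(models, memory_gb, headroom=0.25):
    """True if summed weights fit while keeping `headroom` of memory free."""
    needed = sum(weights_gb(params, quant) for params, quant in models)
    return needed <= memory_gb * (1 - headroom)


# The three-model developer stack from Use Case 1: (billions of params, quant).
team_stack = [(32, "Q4"), (7, "Q8"), (70, "Q4")]
total_gb = sum(weights_gb(p, q) for p, q in team_stack)
```

Under these assumptions the three-model stack needs roughly 65 GB for weights alone. That fits comfortably in 192 GB, but a 64 GB machine cannot hold all three at once.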

Community measurements report 30–80 tokens per second for 7B models in Q4 quantisation on Apple Silicon, fast enough for interactive use with multiple concurrent users. Larger models (70B Q4) typically deliver 8–20 tok/s on the M3 Ultra, which suits document analysis and batch workloads where throughput matters more than latency.

On the cost side, a Mac Studio M3 Max at roughly €2,000–3,500 replaces API spend that commonly runs €300–600 per month for active team usage. Break-even typically sits around four to eight months, after which inference costs little more than the electricity to run the machine.
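The break-even figure is simply hardware cost divided by the monthly API spend it replaces. The sketch below runs that arithmetic on the price and spend ranges quoted above. It ignores electricity and maintenance, and the function name is hypothetical.

```python
def break_even_months(hardware_eur, monthly_api_eur):
    """Months until the hardware cost equals the cumulative API spend avoided."""
    if monthly_api_eur <= 0:
        raise ValueError("monthly_api_eur must be positive")
    return hardware_eur / monthly_api_eur


best = break_even_months(2000, 600)      # cheapest hardware, highest API spend
worst = break_even_months(3500, 300)     # priciest hardware, lowest API spend
midpoint = break_even_months(2750, 450)  # midpoints of both ranges
```

The midpoint works out to about six months, inside the four-to-eight-month range. The extremes span roughly three to twelve months, so a team with light API usage should expect a longer payback.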

Data Sovereignty by Architecture

The deeper advantage of llama.cpp and local inference generally is architectural: nothing leaves the network perimeter. No API key to rotate, no third-country data transfer to justify under GDPR, no model provider reading your prompt context.

For teams handling sensitive data (legal documents, patient records, financial reports, HR files), this is not a privacy preference but a compliance requirement. Local inference removes the question entirely. Learn more about how we approach local AI and data sovereignty and our broader local AI platform.

Next Step

Router mode marks a quiet but significant maturity point for the local AI stack. You no longer need a separate tool to manage multiple models: llama.cpp does it natively, at full performance, with process-level isolation. For teams already running local inference, this is a straightforward upgrade that removes an entire category of operational friction.

If you want to deploy a multi-model local AI server for your team without the DIY overhead, start a pilot project with us. We typically have a working stack running within two weeks.