Llama 4 Scout: Run Meta's Multimodal Local LLM with Ollama

llama4 multimodal lokale-ki

Meta released the Llama 4 model family in April 2026, marking a significant step for organisations running local AI. For the first time in the Llama lineage, the models are natively multimodal β€” meaning text and images travel together in a single request, with no separate vision endpoint, no cloud pipeline, and no additional model required.

Llama 4 Scout, the variant designed for deployment on a single high-memory machine, is already available in the Ollama library. This article covers what the architecture means in practice, what hardware you need, how to set it up, and how Scout compares to the alternatives already running in European SMB environments.

What Makes Llama 4 Different: Mixture-of-Experts

Llama 4 uses a Mixture-of-Experts (MoE) architecture. According to Meta's official documentation, Scout has 109 billion total parameters β€” but only approximately 17 billion are active on any given inference. The model activates only the relevant expert subnetworks for each token, rather than running all parameters at full compute cost every time.

For local deployment, this matters directly. A conventional dense model with 70B parameters requires the full 70B active on every request. Scout's MoE design means you get competitive quality at a fraction of the memory and compute footprint. The practical result: Scout runs on hardware that would struggle with a dense 70B model, while delivering outputs that practitioners report as competitive with significantly heavier models.

The context window is a separate headline figure: up to 10 million tokens, according to Meta. To calibrate that β€” a 300-page contract document typically runs under 150,000 tokens. Llama 4 Scout can theoretically process dozens of such documents in a single request, which changes how you think about document retrieval architectures.

Scout vs. Maverick: Choosing the Right Variant

Two Llama 4 variants are currently available for local deployment via Ollama.

Llama 4 Scout is the practical choice for most SMB setups:

  • 17B active parameters, 109B total (MoE)
  • Context window: up to 10 million tokens (per Meta's documentation)
  • Memory requirement: approximately 20 GB of VRAM or unified memory for quantised variants; 32 GB+ recommended for stable multi-user operation
  • Suitable hardware: Mac Mini M4 Pro (64 GB), MacBook Pro M4 Max (128 GB), Mac Studio M3 Ultra (192 GB)
  • Multimodal: text and images, natively, in the same request

Llama 4 Maverick has 400 billion total parameters and is designed for multi-GPU setups or dedicated AI servers. For organisations running local AI without a data centre, Scout is the starting point.

Running Llama 4 with Ollama

If Ollama is already installed, getting Llama 4 Scout running takes two commands:

ollama pull llama4:scout
ollama run llama4:scout

Both Scout and Maverick are available in the Ollama model library at ollama.com/library/llama4. Once running, the model is accessible via a local REST API that is compatible with Open WebUI, Continue.dev, and any application that expects an OpenAI-compatible endpoint β€” which covers the majority of current local AI tooling. A fresh Ollama installation typically takes around five minutes.

For Maverick on qualifying hardware:

ollama run llama4:maverick

No configuration file changes are required beyond the standard Ollama setup. The multimodal capability is built into the model and available immediately.

Multimodal in Practice: What You Can Actually Do Today

Native multimodality means you pass an image and a text prompt together. There is no separate OCR step, no additional model call, no pipeline to wire up. For European businesses, the practical use cases are immediate:

  • Invoice processing: Scan a receipt and ask "Extract the total, invoice date, and IBAN from this image" β€” Scout handles it in a single call
  • Product catalogue work: Feed product photos alongside a spec sheet and generate consistent descriptions, flag discrepancies
  • Contract review: Pass scanned document pages directly; ask for clause summaries or flag specific legal language
  • Report analysis: Hand in a chart image from a PDF and ask for a plain-language interpretation

All of this runs on your own hardware. No image, no document fragment, no prompt text leaves your network.

Comparison: Scout, Gemma 4, and Qwen 2.5VL

Llama 4 Scout is not the only multimodal open-weight model available for local deployment. Two alternatives are already well-established in European SMB setups.

Gemma 4 27B (Google, April 2026) performs particularly well on coding tasks and agentic workflows with native function calling. Practitioners consistently rate it highly for structured tool use. If your primary use case is a local coding assistant or an agent that calls APIs and databases, Gemma 4 remains competitive. Its context window is shorter than Scout's, but for most single-document tasks that is rarely a constraint.

Qwen 2.5VL (Alibaba) is the vision-language variant of the Qwen 2.5 line. Community measurements consistently report strong quality in German and Spanish formal text β€” which matters for European businesses. For multilingual document work in European languages, Qwen 2.5VL remains a strong contender.

Llama 4 Scout is the clear choice when the 10-million-token context window is relevant, when you need native multimodality without additional setup, or when you want the breadth of Meta's language coverage across a wide range of tasks.

If you are evaluating which model fits your workflow, our local AI guidance page provides a practical starting framework.

GDPR and the Deployer Advantage

The EU AI Act's deployer obligations, which take full effect in August 2026, apply to any organisation using an AI system in Europe β€” including local LLMs. Running Llama 4 Scout via Ollama on your own hardware does not reduce compliance obligations, but it does simplify a significant subset of them.

With a fully local stack, there is no third-party processor with access to your prompts and outputs. GDPR Article 32 requires appropriate technical measures to protect personal data. A local inference setup where data physically cannot leave your premises is one of the strongest technical controls available β€” stronger than relying on a cloud provider's contractual commitments, which can change with policy updates, jurisdictional shifts, or corporate restructuring.

For organisations handling legal documents, HR records, financial data, or any category of personal data, this is a material difference. More detail on the technical and regulatory dimensions of local AI is available on our data sovereignty page.

Funding: Making the Hardware Investment Viable

Apple Silicon hardware is a real upfront cost. European funding programmes can absorb a portion of it.

Germany: The BAFA "Digital Jetzt" programme covers investments in software and hardware for digitalisation. Based on our reading of current scheme guidelines, a local AI stack on Mac Studio hardware can be positioned as a digitalisation investment under several programme categories. Individual eligibility depends on company context β€” consult your tax adviser for confirmation and application support.

Spain: The Kit Digital subsidy covers AI and digital tool adoption for SMBs with 3–49 employees. A local AI infrastructure setup may qualify under the "Inteligencia Artificial y AnalΓ­tica" category, based on our reading of current scheme guidelines.

EU-wide: The European Investment Fund's SME programmes and national co-financing mechanisms can supplement local hardware investment in most member states. Your regional development agency is the right starting point.

Getting Started

Llama 4 Scout is available today via ollama pull llama4:scout on Apple Silicon Macs β€” natively multimodal, no cloud account required, no API billing to manage. If you want to assess whether multimodal local AI fits a specific workflow in your organisation, we are ready to help β€” from hardware sizing to production integration.

β†’ Start a pilot project