Long Context Local LLMs: Qwen3, Llama 4 and Gemma 4

12 Jun 2026

The context window determines how much text a language model can hold in working memory at once — a short query or an entire 400-page contract. For years this was a cloud-versus-local fault line: locally runnable models capped at 4k–8k tokens while cloud APIs advertised 128k. In 2026, that gap has largely closed. Qwen3.6 delivers 256k tokens natively on hardware you own. Llama 4 Scout reaches a theoretical 10 million tokens, with practitioners in the developer community reporting 256k–1M as the practical range on consumer hardware. Gemma 4 sits at a solid 128k. For SMBs that routinely process contracts, codebases, and long email threads, the implications are significant — and none of it requires a single byte leaving your own network.

Why Context Window Size Matters in Practice

One token is roughly three quarters of an English word; 256,000 tokens therefore correspond to approximately 192,000 words, or about 384 standard A4 pages at 500 words per page. That is enough for:

a complete annual report with appendices (typically 80–120 pages)
a medium-sized Python or Node.js project with all modules in scope
15 hours of meeting transcripts produced by a local Whisper installation
a full tender package including technical specifications
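The arithmetic behind these figures can be sketched in a few lines. Both ratios are assumptions implied by the numbers above (about 0.75 English words per token and about 500 words per A4 page); real tokenisers vary by model and language.

```python
# Rough token/word/page conversions. Both ratios are assumptions, not
# tokeniser output: they are the values implied by the figures in this section.
WORDS_PER_TOKEN = 0.75   # assumed average for English text
WORDS_PER_PAGE = 500     # assumed words on a standard A4 page


def tokens_to_words(tokens: int) -> int:
    """Estimate how many English words fit into `tokens` tokens."""
    return round(tokens * WORDS_PER_TOKEN)


def tokens_to_pages(tokens: int) -> int:
    """Estimate how many A4 pages fit into `tokens` tokens."""
    return round(tokens_to_words(tokens) / WORDS_PER_PAGE)


print(tokens_to_words(256_000))  # 192000
print(tokens_to_pages(256_000))  # 384
```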

The default behaviour of most local deployment tools creates a hidden pitfall: Ollama caps context at 2,048 tokens by default, regardless of what the underlying model supports. Without an explicit override, earlier conversation turns are silently truncated — no warning, no indication that information has been lost.
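Because the truncation is silent, a cheap pre-flight check can help. The sketch below estimates a prompt's token count from its word count and compares it with the configured num_ctx. The helper names, the 0.75 words-per-token ratio, and the reply reserve are illustrative assumptions; an exact count needs the model's own tokeniser.

```python
def estimate_tokens(text: str, words_per_token: float = 0.75) -> int:
    """Very rough token estimate from whitespace-separated words."""
    return int(len(text.split()) / words_per_token) + 1


def fits_in_context(text: str, num_ctx: int = 2048, reply_reserve: int = 512) -> bool:
    """True if the prompt plus a reserve for the reply fits in num_ctx tokens."""
    return estimate_tokens(text) + reply_reserve <= num_ctx


short_prompt = "Summarise the attached clause. " * 10   # about 40 words
long_prompt = "word " * 5000                             # about 5,000 words

print(fits_in_context(short_prompt))         # True
print(fits_in_context(long_prompt))          # False
print(fits_in_context(long_prompt, 65536))   # True
```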

Models With Long Context Support (as of June 2026)

Llama 4 Scout (Meta, Llama Community Licence)

Theoretical maximum: 10 million tokens. As reported by practitioners, the practical range on consumer hardware sits between 256k and 1M tokens, depending on available unified memory. Minimum recommendation: 80–128 GB RAM for meaningful inference at extended context lengths.

Qwen3.6 (Alibaba, Apache 2.0)

256k tokens natively, extendable to 1M via YaRN extrapolation. Two variants: Qwen3.6-27B (dense) and Qwen3.6-35B-A3B (Mixture-of-Experts). Community measurements report that the 27B model requires approximately 22 GB of RAM at 128k context. Strong multilingual quality, including German, French, Spanish, and other European languages.

Qwen3.5 (Alibaba, Apache 2.0)

The 9B model supports up to 262k tokens natively according to the developer documentation — well suited for 16–24 GB setups that still need to handle long documents.

Qwen3-Coder (Alibaba, Apache 2.0)

Specialised for code and technical writing; 256k tokens natively, up to 1M via extrapolation. A strong choice for automated codebase review and documentation generation workflows.

Gemma 4 (Google, Gemma Terms of Use)

128k context. A practical option for 16 GB setups with the 12B variant. Broad language coverage and reliable structured output. Its context window is narrower than that of the Qwen3 family, but the model is more resource-efficient and well suited to laptops or compact workstations.

RAM Requirements: What Runs on What Hardware

The KV cache, the memory store that holds the attention keys and values for every token in context, grows linearly with context length. Community benchmarks using a 7B model at Q4_K_M quantisation report:

Context length	Approximate RAM
4k tokens	~6 GB
32k tokens	~8–9 GB
128k tokens	~12–16 GB
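Why the cache grows this way can be sketched with the usual sizing formula for a transformer KV cache: two tensors (keys and values) per layer, each holding one vector per KV head per token. The architecture numbers below are hypothetical values typical of a 7B-class model with grouped-query attention, not measurements of any model named here, so the output illustrates the linear growth rather than reproducing the table.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: float = 2.0) -> float:
    """Size of the KV cache: keys + values for every layer, head, and token."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_tokens


# Hypothetical 7B-class architecture (assumed, for illustration only).
ARCH = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for ctx in (4_096, 32_768, 131_072):
    gib = kv_cache_bytes(ctx, **ARCH) / 2**30
    print(f"{ctx:>7} tokens -> {gib:.1f} GiB of f16 KV cache")
```

Model weights come on top of the cache, which is why total RAM is already several gigabytes at 4k tokens.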

For larger models, as reported by developers:

Qwen3.5-9B at 128k context: ~14–18 GB — fits a Mac Mini M4 Pro (24 or 48 GB)
Qwen3.6-27B at 128k context: ~22 GB — comfortable on a Mac Studio M4 Max (128 GB) or Mac Studio M3 Ultra (96 GB or more)
Llama 4 Scout at 256k context: ~80–96 GB — designed for a Mac Studio M3 Ultra (256 GB) or equivalent server hardware
Qwen3.6-27B at 1M context: ~65 GB — within range of Mac Studio M3 Ultra or a dedicated inference server

Our local AI infrastructure overview covers which hardware configurations make sense for different workload profiles. As a practical rule: budget for more RAM than you think you need today — context requirements tend to grow once teams start using extended context in production.
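One way to apply that rule is to check which of the configurations listed above fit a given machine once a safety margin is reserved. In this sketch the memory figures are the upper bounds of the community-reported ranges above, and the 25% headroom is an arbitrary assumption to adjust for your own workload.

```python
# Upper bounds (GB) of the community-reported figures listed above.
REQUIREMENTS_GB = {
    "Qwen3.5-9B @ 128k": 18,
    "Qwen3.6-27B @ 128k": 22,
    "Qwen3.6-27B @ 1M": 65,
    "Llama 4 Scout @ 256k": 96,
}


def configs_that_fit(total_ram_gb: float, headroom: float = 0.25) -> list[str]:
    """Return configurations that fit after reserving `headroom` of total RAM."""
    usable = total_ram_gb * (1 - headroom)
    return [name for name, need in REQUIREMENTS_GB.items() if need <= usable]


print(configs_that_fit(24))    # ['Qwen3.5-9B @ 128k']
print(configs_that_fit(64))    # the two 128k configurations
```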

Configuring Ollama for Long Context

Ollama sets num_ctx to 2,048 tokens by default. Three ways to override this:

Option 1 — Directly in the API request:

{
  "model": "qwen3.6:27b",
  "prompt": "...",
  "options": { "num_ctx": 65536 }
}
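From a script, the same request can be sent to Ollama's /api/generate endpoint. This is a minimal sketch using only the standard library; it assumes a server on the default local port and that the model tag has already been pulled. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local address


def build_payload(model: str, prompt: str, num_ctx: int) -> dict:
    """Request body with an explicit context window override."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,                  # one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def generate(model: str, prompt: str, num_ctx: int = 65536) -> str:
    """Send the request to a running Ollama server and return the generated text."""
    body = json.dumps(build_payload(model, prompt, num_ctx)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]
```

Calling generate("qwen3.6:27b", "...") then returns the model's reply as a string. The override applies per request, so every caller has to remember to send it.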

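Option 2 — Inside an interactive session (ollama run itself has no context-length flag):

ollama run qwen3.6:27b
>>> /set parameter num_ctx 65536

To set a server-wide default instead, start the server with OLLAMA_CONTEXT_LENGTH=65536 ollama serve.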

Option 3 — Via Modelfile (recommended for persistent deployment):

FROM qwen3.6:27b
PARAMETER num_ctx 65536

Run ollama create my-qwen3 -f Modelfile to register it. This approach persists across restarts and works cleanly with Open WebUI or any other frontend.
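When several context sizes need to be registered, the Modelfile can be generated rather than written by hand. The sketch below renders the two-line file shown above and shells out to ollama create; the function names are illustrative, and the registration step assumes the ollama binary is on the PATH.

```python
import subprocess
from pathlib import Path


def modelfile_text(base_model: str, num_ctx: int) -> str:
    """Render a minimal Modelfile that pins the context window."""
    return f"FROM {base_model}\nPARAMETER num_ctx {num_ctx}\n"


def register_model(name: str, base_model: str, num_ctx: int, workdir: Path) -> None:
    """Write the Modelfile and register it with `ollama create` (needs Ollama installed)."""
    path = workdir / "Modelfile"
    path.write_text(modelfile_text(base_model, num_ctx), encoding="utf-8")
    subprocess.run(["ollama", "create", name, "-f", str(path)], check=True)


print(modelfile_text("qwen3.6:27b", 65536))
```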

For context lengths above 64k, the Ollama context length documentation recommends enabling Flash Attention to reduce KV cache memory pressure. KV cache quantisation is a further option: according to community reports, q8_0 roughly halves cache memory, while q4_0 reduces it to approximately one third, with some quality trade-off at very long lengths.
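In Ollama, both settings are controlled through environment variables on the server process: OLLAMA_FLASH_ATTENTION and OLLAMA_KV_CACHE_TYPE, the latter taking f16, q8_0, or q4_0. A minimal sketch of launching the server with them set follows; the function names are illustrative and the launch step assumes the ollama binary is on the PATH.

```python
import os
import subprocess

VALID_CACHE_TYPES = {"f16", "q8_0", "q4_0"}


def serve_environment(kv_cache_type: str = "q8_0") -> dict[str, str]:
    """Copy the current environment and add Ollama's attention/cache settings."""
    if kv_cache_type not in VALID_CACHE_TYPES:
        raise ValueError(f"unsupported KV cache type: {kv_cache_type}")
    env = dict(os.environ)
    env["OLLAMA_FLASH_ATTENTION"] = "1"          # cache quantisation relies on it
    env["OLLAMA_KV_CACHE_TYPE"] = kv_cache_type
    return env


def start_server(kv_cache_type: str = "q8_0") -> subprocess.Popen:
    """Launch `ollama serve` with the settings above (needs Ollama installed)."""
    return subprocess.Popen(["ollama", "serve"], env=serve_environment(kv_cache_type))
```

Starting with q8_0 and dropping to q4_0 only if memory is still short keeps the quality trade-off as small as possible.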

Real-World Use Cases for SMBs

Contract analysis without chunking complexity

Load a complete 80-page supplier contract plus three amendment letters into a single prompt, identify conflicting clauses, and extract a structured summary. No splitting, no information loss across context boundaries — the model sees the entire document as a coherent whole.
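A minimal sketch of assembling such a prompt: each document is wrapped in labelled delimiters so the model can cite which file a clause came from. The delimiter format, function name, and sample texts are illustrative assumptions, not a required convention.

```python
def build_contract_prompt(documents: dict[str, str], task: str) -> str:
    """Join named documents into one prompt, each wrapped in labelled delimiters."""
    parts = [task.strip(), ""]
    for name, text in documents.items():
        parts.append(f"=== BEGIN {name} ===")
        parts.append(text.strip())
        parts.append(f"=== END {name} ===")
        parts.append("")
    return "\n".join(parts)


docs = {
    "Supplier contract": "Clause 7: payment is due within 30 days.",
    "Amendment 1": "Clause 7 is replaced: payment is due within 60 days.",
}
prompt = build_contract_prompt(
    docs, "List clauses that conflict across these documents and cite each source."
)
print(prompt)
```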

Codebase review

Qwen3-Coder can analyse an entire backend repository in one context, understand cross-file dependencies, and suggest targeted refactoring. No RAG pipeline required, no chunking decisions to tune.
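Getting a repository into one prompt can be as simple as concatenating its source files with path headers. The sketch below does that under a token budget. The suffix list and header format are assumptions, and the word-based estimate tends to undercount tokens for code, so leave generous headroom.

```python
from pathlib import Path

SOURCE_SUFFIXES = {".py", ".js", ".ts"}   # assumed set; adjust to the repository


def pack_repository(root: Path, max_tokens: int, words_per_token: float = 0.75) -> str:
    """Concatenate source files under `root`, each prefixed with its relative path.

    Stops adding files once the rough token estimate would exceed `max_tokens`.
    """
    chunks, used = [], 0
    for path in sorted(root.rglob("*")):
        if not path.is_file() or path.suffix not in SOURCE_SUFFIXES:
            continue
        text = path.read_text(encoding="utf-8", errors="replace")
        cost = int(len(text.split()) / words_per_token) + 1
        if used + cost > max_tokens:
            break   # budget exhausted; remaining files are left out
        chunks.append(f"# FILE: {path.relative_to(root)}\n{text}")
        used += cost
    return "\n\n".join(chunks)
```

For example, pack_repository(Path("backend"), max_tokens=200_000) returns a single string that can be placed in front of the review instruction.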

Email thread analysis

Months of email exchanges exported from Outlook (PST → EML) can be arranged chronologically in a single prompt to surface critical decision points, identify open commitments, and generate a handover brief.
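Once the messages exist as EML text, Python's standard email package can put them in order. This is a sketch under simple assumptions: every message has a Date header and a plain-text body, and the separator format and function name are illustrative.

```python
from email import message_from_string, policy
from email.utils import parsedate_to_datetime


def thread_to_prompt(raw_messages: list[str]) -> str:
    """Parse raw EML texts, sort them by Date header, and render one transcript."""
    parsed = [message_from_string(raw, policy=policy.default) for raw in raw_messages]
    parsed.sort(key=lambda m: parsedate_to_datetime(m["Date"]))
    blocks = []
    for msg in parsed:
        body = msg.get_body(preferencelist=("plain",))
        text = body.get_content().strip() if body else ""
        blocks.append(
            f"From: {msg['From']}\nDate: {msg['Date']}\n"
            f"Subject: {msg['Subject']}\n\n{text}"
        )
    return "\n\n---\n\n".join(blocks)


later = ("From: b@example.com\nDate: Tue, 02 Jun 2026 09:00:00 +0000\n"
         "Subject: Re: Offer\n\nAgreed.")
earlier = ("From: a@example.com\nDate: Mon, 01 Jun 2026 09:00:00 +0000\n"
           "Subject: Offer\n\nPlease confirm.")
print(thread_to_prompt([later, earlier]))
```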

Meeting notes from transcripts

Paired with a local Whisper installation (Faster-Whisper), multi-hour meeting transcripts can be processed in one pass — converted to structured minutes and queried with specific follow-up questions. The kAIra toolkit provides pre-built automation workflows that connect transcription and summarisation in a single pipeline.
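The hand-off from transcription to summarisation can be sketched as follows. The Segment type is a stand-in that assumes each transcription segment exposes start, end, and text, as Faster-Whisper's segments do; the function names and timestamp format are illustrative and not taken from any toolkit mentioned here.

```python
from typing import Iterable, NamedTuple


class Segment(NamedTuple):
    """Stand-in for a transcription segment: start/end in seconds plus text."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render seconds as HH:MM:SS."""
    total = int(seconds)
    return f"{total // 3600:02d}:{total % 3600 // 60:02d}:{total % 60:02d}"


def transcript_prompt(segments: Iterable[Segment], task: str) -> str:
    """Turn timestamped segments into a single prompt for summarisation."""
    lines = [f"[{format_timestamp(s.start)}] {s.text.strip()}" for s in segments]
    return task.strip() + "\n\n" + "\n".join(lines)


demo = [
    Segment(0.0, 4.2, " Welcome everyone."),
    Segment(3725.0, 3730.0, " Action: send the offer."),
]
print(transcript_prompt(demo, "Write structured minutes with decisions and action items."))
```

Keeping the timestamps in the prompt lets follow-up questions point back to the moment in the recording where a decision was made.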

See our training resources for guided walkthroughs on deploying these workflows in production.

When RAG Is Still the Right Architecture

Long context windows do not replace RAG pipelines in every scenario:

Knowledge base larger than 1M tokens: Thousands of documents — internal wikis, full document archives — cannot fit into any practical prompt, even with generous context limits. RAG remains the right architecture here.
Frequently updated content: RAG keeps a knowledge base current without rebuilding context on every query.
Latency on simple lookups: The pre-fill phase (processing the full context before generating a response) adds latency at very long lengths. For simple questions across large datasets, RAG is faster.

For document sets up to roughly 300 pages, or medium-sized codebases, direct long context is often the more elegant solution today — less infrastructure, no chunking decisions to debate, and complete information access in a single inference pass.
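The criteria above can be written down as a small decision helper. This is only a sketch: the 1M-token ceiling comes from the first criterion, while the function name, the boolean inputs, and the order of the checks are illustrative assumptions.

```python
def choose_architecture(corpus_tokens: int, updated_frequently: bool,
                        simple_lookups: bool,
                        max_context_tokens: int = 1_000_000) -> str:
    """Apply the three criteria above and return 'rag' or 'long-context'."""
    if corpus_tokens > max_context_tokens:
        return "rag"            # corpus cannot fit into any practical prompt
    if updated_frequently:
        return "rag"            # avoid rebuilding the context on every query
    if simple_lookups:
        return "rag"            # pre-fill latency outweighs the benefit
    return "long-context"


print(choose_architecture(200_000, False, False))     # long-context
print(choose_architecture(5_000_000, False, False))   # rag
```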

Architecture Decisions Before the Pilot Starts

Choosing context length and model is an architecture decision that belongs at the start of a project, not after the first iteration. If you want to understand which model, which hardware, and which context configuration fits your documents, language requirements, and budget, get in touch. We will walk you through what is realistically achievable on your own infrastructure — no vendor lock-in, no obligatory cloud subscription.