Qwen3.6 and DeepSeek V4: April 2026 Local LLM Wave


Mid-to-late April 2026 produced two significant open-weight model releases roughly a week apart: Qwen3.6-35B-A3B from Alibaba's Qwen team (released April 16) and DeepSeek V4-Flash from DeepSeek AI (released April 24). Both are MIT-licensed, both are self-hostable, and together they mark a meaningful shift in what European SMBs can run locally, without routing data through external cloud APIs.

Qwen3.6-35B-A3B: Frontier Performance on a 24 GB Mac

The most practically significant release of the week is Qwen3.6-35B-A3B. The "A3B" suffix stands for "3 billion active parameters", the defining characteristic of its Mixture-of-Experts (MoE) architecture. The model has 35 billion total parameters, but only approximately 3 billion are activated during each forward pass. Per-token compute is therefore close to that of a 3B dense model, while the knowledge stored across all 35 billion parameters remains accessible. All 35 billion parameters must still be held in memory, so the saving is in compute, not in storage.
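To make the "active parameters" idea concrete, here is a minimal sketch of top-k expert routing, the general mechanism behind MoE layers. It is a generic illustration, not Qwen's actual router: the expert count, the value of k, and the gate scores are all hypothetical.

```python
import math

def route_token(gate_scores, k):
    """Pick the k highest-scoring experts and return {expert_index: weight}."""
    top = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)[:k]
    # Softmax over the selected experts only, so their weights sum to 1.
    peak = max(gate_scores[i] for i in top)
    exps = {i: math.exp(gate_scores[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

# Hypothetical gate scores for 8 experts; only 2 experts run for this token.
weights = route_token([0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.3, 0.2], k=2)

# Share of parameters touched per forward pass, using the figures quoted above.
active_fraction = 3 / 35
```

Only the selected experts are evaluated for a given token, which is why compute scales with the active parameter count while the full set of experts still has to be resident in memory.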

The practical consequence: Qwen3.6-35B-A3B runs comfortably on a Mac with 24 GB unified memory. That covers the MacBook Pro M4, the Mac Mini M4 Pro, and upwards. You do not need a dedicated server room or specialised GPU infrastructure to run a model that, according to published benchmarks, scores 73.4 % on SWE-bench Verified.
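A rough back-of-the-envelope check shows why quantisation matters for the 24 GB figure. The sketch below assumes a uniform bit width per weight and ignores KV cache, activations, and runtime overhead; the article does not state which quantisation the 24 GB claim assumes, so the bit widths are illustrative.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes), excluding KV cache and overhead."""
    return total_params * bits_per_weight / 8 / 1e9

TOTAL_PARAMS = 35e9  # total parameter count quoted above

for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
```

Under these assumptions the weights alone come to about 70 GB at 16-bit, 35 GB at 8-bit, and 17.5 GB at 4-bit, so a 24 GB machine is plausible only at roughly 4-bit precision.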

What a SWE-bench Verified score of 73 % means in practice

SWE-bench Verified presents real GitHub issues to a model and asks it to produce a working patch. A score of 73.4 % means nearly three out of four such engineering tasks are solved autonomously. For everyday business use, this translates to production-viable code review, structured document analysis, and complex reasoning: tasks that until recently required proprietary cloud APIs or expensive dedicated GPU hardware.

On a Mac Studio M3 Ultra with 192 GB unified memory, community benchmarks report 50–60 tok/s for Qwen3.6-35B-A3B, fast enough for interactive use and automated document workflows alike. On the M5 Max, practitioners report around 55 tok/s on the same model at comparable quantisation settings.
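To translate tokens per second into waiting time, a simple division is enough. The sketch below covers generation only and ignores prompt processing time; the 1,000-token output length is a hypothetical example.

```python
def generation_seconds(output_tokens, tokens_per_second):
    """Time to generate the output at a steady decode rate."""
    return output_tokens / tokens_per_second

for tps in (50, 60):
    print(f"{tps} tok/s: {generation_seconds(1000, tps):.1f} s for 1,000 tokens")
```

At the reported range, a 1,000-token summary would take roughly 17 to 20 seconds to generate.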

Running it locally

If you already have Ollama installed, getting started is straightforward:

ollama pull qwen3.6:35b-a3b
ollama run qwen3.6:35b-a3b
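Once the model is pulled, scripts can talk to the local Ollama server over HTTP instead of the interactive prompt. The following is a minimal sketch using only the Python standard library, assuming Ollama is listening on its default port 11434 and using the model tag from the commands above; the helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint

def build_request(prompt, model="qwen3.6:35b-a3b"):
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def generate(prompt):
    """Send the prompt and return the generated text (needs a running Ollama server)."""
    with urllib.request.urlopen(build_request(prompt)) as resp:
        return json.loads(resp.read())["response"]

# Example call, left commented out because it requires the server to be running:
# print(generate("Summarise the following clause in two sentences: ..."))
```

Because the request goes to localhost, the prompt and the response stay on the machine.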

For Apple Silicon users who want maximum throughput, mlx-lm (v0.24.0+) provides native MLX backend support. Models are available on Hugging Face with the "-MLX" suffix and run without additional format conversion.

The model supports a 256,000-token context window, sufficient for extended document analysis, long code repositories, or multi-step reasoning chains. For most SMB document workflows, this is effectively unlimited.
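A quick pre-check can tell whether a document is likely to fit in the window before it is sent to the model. This sketch assumes about 4 characters per token, a rough heuristic for English text and not the model's real tokenizer; the output reservation is also a hypothetical default.

```python
CONTEXT_WINDOW = 256_000  # tokens, as quoted above
CHARS_PER_TOKEN = 4       # assumption: rough average for English text

def estimated_tokens(text):
    """Crude token estimate from character count."""
    return len(text) // CHARS_PER_TOKEN + 1

def fits_in_context(text, reserved_for_output=4_000):
    """True if the estimated prompt size leaves room for the reserved output tokens."""
    return estimated_tokens(text) + reserved_for_output <= CONTEXT_WINDOW
```

For precise budgeting, count tokens with the model's own tokenizer; German and Spanish text will typically deviate from the 4-character heuristic.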

Why this matters beyond the benchmark numbers

Performance benchmarks are a snapshot. What matters operationally is that Qwen3.6-35B-A3B represents the first generation where the default local model recommendation for a standard Mac setup competes on quality with models that, 18 months ago, required API contracts with major cloud providers. The hardware bar has not dropped; the model quality has risen to meet the hardware that many businesses already own.


DeepSeek V4-Flash: Self-Hosting for Server Infrastructure

While Qwen3.6 targets Apple Silicon, DeepSeek V4-Flash is designed for organisations that operate GPU infrastructure, either on-premises or via EU-based hosting providers.

On April 24, 2026, DeepSeek released two models simultaneously:

  • V4-Pro: 1.6 trillion total parameters, 49 billion active per forward pass
  • V4-Flash: 284 billion total parameters, 13 billion active per forward pass

Both are released as open-weight models under the MIT license, permitting commercial use without per-deployment licensing. Tech blogger Simon Willison summarised the significance on simonwillison.net: "almost on the frontier, a fraction of the price".

V4-Flash: Key specifications

  • Approximately 160 GB on Hugging Face (FP4+FP8 mixed precision)
  • 1 million token context window, with up to 384,000 tokens of output
  • Recommended inference framework: vLLM with MoE expert parallelism
  • Hardware minimum: 1× NVIDIA H200 (141 GB HBM3e) or 2× A100 80 GB

According to DeepSeek's release documentation, the V4 architecture achieves a 73 % reduction in per-token inference FLOPs and a 90 % reduction in KV cache memory burden compared with DeepSeek-V3.2. For organisations processing high query volumes, these numbers translate directly into operating cost at scale.
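The percentages are easier to reason about as "what remains" relative to the V3.2 baseline. The sketch below applies the two quoted reductions; the 40 GB KV cache baseline is a hypothetical workload figure, not a number from DeepSeek.

```python
FLOPS_REDUCTION = 0.73     # per-token inference FLOPs, as quoted above
KV_CACHE_REDUCTION = 0.90  # KV cache memory, as quoted above

def remaining(baseline, reduction):
    """Value left after applying a fractional reduction to a baseline."""
    return baseline * (1 - reduction)

# Hypothetical V3.2 baseline: 40 GB of KV cache for some workload.
print(f"KV cache: {remaining(40, KV_CACHE_REDUCTION):.1f} GB")
print(f"FLOPs per token: {remaining(1.0, FLOPS_REDUCTION):.2f}x baseline")
```

In other words, the claimed figures leave about 27 % of the per-token compute and 10 % of the KV cache memory of the previous generation.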

API pricing as a benchmark reference

If you want to test V4-Flash before committing to infrastructure, the hosted DeepSeek API charges $0.14 per million input tokens for Flash. That is substantially below comparable cloud frontier models and supports a low-cost proof-of-concept before any hardware decision. For European businesses with data residency requirements, an EU-hosted self-managed deployment is the production path, but the API enables rapid iteration on prompts and workflows first.
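For budgeting a proof-of-concept, the input side of the bill is a one-line calculation. The sketch covers input tokens only, because no output price is quoted here; the pilot volume is a hypothetical example.

```python
INPUT_PRICE_PER_MILLION = 0.14  # USD per million input tokens, as quoted above

def input_cost_usd(input_tokens):
    """Input-token cost in USD; output tokens are billed separately and not included."""
    return input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION

# Hypothetical pilot: 2,000 documents of about 5,000 tokens each.
print(f"${input_cost_usd(2_000 * 5_000):.2f}")
```

Ten million input tokens come to $1.40 at the quoted rate, before output tokens are added.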


Why this matters for European SMBs

Two significant model releases in little more than a week shift the conversation about local AI from "is open-weight good enough?" to "which hardware tier do I actually need?"

The Mac-accessible quality tier just expanded. Qwen3.6-35B-A3B on 24 GB unified memory is not a minimum viable compromise; it is the current community default recommendation for good reason. On mid-range Apple Silicon, it delivers quality that 18 months ago required a dedicated cloud API contract with usage-based pricing.

Data sovereignty by architecture. Running inference locally means queries and responses never leave your network. This is structurally different from a vendor's privacy policy: you can confirm with a packet capture, in real time, that no inference traffic leaves your network. For European companies with GDPR obligations, especially those processing customer contracts, employee records, or legally privileged documents, the architectural guarantee matters as much as the performance numbers. See our detailed breakdown on local AI and data sovereignty.
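A lightweight configuration guard can complement the packet capture by refusing to start if an inference endpoint points outside the local network. This is a sketch under simple assumptions: it accepts "localhost" and loopback or private IP literals, does not resolve hostnames, and is not a substitute for network-level verification. The function name is illustrative.

```python
import ipaddress
from urllib.parse import urlparse

def is_local_endpoint(url):
    """True if the URL's host is 'localhost' or a loopback/private IP literal."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        return False  # hostnames are not resolved here; treat them as non-local
    return addr.is_loopback or addr.is_private
```

A service could call this on its configured model URL at startup and exit if the check fails, which catches the common mistake of leaving a cloud endpoint in a config file.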

2026 hardware tiers for local AI deployments:

Device              Memory / GPU   Model                         Suitable for
Mac Mini M4 Pro     48 GB          Qwen3.6-35B-A3B               Single user, office
Mac Studio M4 Max   128 GB         Multiple models in parallel   Small team (3–5 users)
EU GPU server       H200           DeepSeek V4-Flash             High-throughput workloads

Practical use cases at this quality level

The April 2026 model generation is production-ready for:

  • Contract review and summarisation: documents stay on your own hardware throughout
  • Code assistance and review: a 73 % SWE-bench score is production-viable
  • Multilingual output (DE/EN/ES) without quality degradation
  • Internal document Q&A: answers grounded in your own knowledge base, not internet training data
  • Compliance checks: reasoning-capable models run locally with a full audit trail
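The audit trail mentioned in the last item can be as simple as one JSON line per request. The sketch below is one possible shape, not a prescribed format: the field names are hypothetical, and it stores SHA-256 hashes instead of the text itself so that the log does not become a second copy of sensitive documents.

```python
import hashlib
import json
from datetime import datetime, timezone

def audit_record(user, model, prompt, response):
    """Build one audit-log entry; hashes avoid storing document text in the log."""
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "user": user,
        "model": model,
        "prompt_sha256": hashlib.sha256(prompt.encode("utf-8")).hexdigest(),
        "response_sha256": hashlib.sha256(response.encode("utf-8")).hexdigest(),
    }

def append_audit(path, record):
    """Append the record to a JSON Lines file, one entry per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Whether hashes are sufficient or full prompts must be retained depends on your own retention and access-control requirements.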

Compliance framing

Choosing local inference does not automatically resolve every GDPR question: you still need to address data retention schedules, access controls, and your Records of Processing Activities. But it does close the most common gap: personal data being processed on infrastructure you do not control, in a jurisdiction you cannot verify. Based on our reading of the current regulatory framework, on-premises inference is one of the most robust foundations for GDPR-compliant AI deployments in European SMBs.


Getting started

A Mac Mini M4 Pro and a single, well-defined use case are enough for a meaningful pilot. The most common starting points: automate the summarisation of a document that recurs on a regular basis, or build a draft-generation tool over an internal knowledge base.

If you want a structured evaluation framework before committing to hardware, Freshlab offers pilot projects where we work through model selection, hardware configuration, and use case validation together with your team.

More on the local AI stack for European businesses: Local AI for SMBs