Practitioners on X are describing a production-ready AI architecture that costs nothing to run once deployed: Ollama as a local LLM runtime, open-weight models like Llama 3.3 or Gemma 3, an orchestration layer using LangGraph or CrewAI, and a local vector store such as ChromaDB or Qdrant. No API subscriptions. No data leaving the building.
What was a niche setup for AI researchers a year ago has become a credible production architecture for European businesses in 2026 β particularly for organisations where GDPR compliance and data sovereignty are not optional.
Why 2026 Is the Turning Point
Three converging trends make this stack viable now. Open-weight models have crossed a quality threshold: Llama 3.3 70B, Gemma 3 27B, and Qwen 2.5 72B perform well on standard enterprise tasks β document analysis, classification, structured extraction β tasks previously delegated to cloud APIs. Ollama has become the de facto standard for running these models locally, with a single command replacing complex model-serving infrastructure. And the orchestration layer β LangGraph and CrewAI β has matured from research tooling into production-ready frameworks with active maintenance and enterprise adoption.
For businesses operating under GDPR, this convergence isn't just convenient. It's architecturally significant.
Layer 1: Ollama as the LLM Runtime
Ollama wraps open-weight models in a Docker-like interface. One command pulls a model and starts a REST server that mirrors the OpenAI API format:
ollama pull llama3.3:70b
# starts at http://localhost:11434 β OpenAI-compatible
This compatibility matters: every library targeting the OpenAI API β LangChain, LangGraph, LlamaIndex, CrewAI β works with Ollama without modification. The migration cost from a cloud-API prototype to a fully local system is minimal.
On Apple Silicon, Ollama uses MLX for hardware-accelerated inference. On Linux, it supports CUDA and ROCm. Reported community benchmarks place Gemma 3 12B at comfortable interactive speeds on 24 GB MacBook Pros and Llama 3.3 70B at usable throughput on Mac Studio M3 Max hardware (64β96 GB unified memory).
Model selection by task
| Model | Size | Strength |
|---|---|---|
| Llama 3.3 70B | 70B | General-purpose reasoning |
| Gemma 3 27B | 27B | Fast on consumer hardware |
| Qwen 2.5 72B | 72B | Multilingual, strong in European languages |
| Mistral Small | 22B | Low-latency classification tasks |
The Ollama model library at ollama.com lists current availability and quantisation options.
Layer 2: Orchestration with LangGraph or CrewAI
A language model alone is a component, not a system. Production workflows require multi-step coordination: conditional branching, tool calls, state persistence, retry logic, and β for regulated use cases β human approval gates.
LangGraph models workflows as directed graphs. Each node is a function; edges carry state between steps. This makes complex pipelines auditable: you can inspect exactly which path a document took, which tool was called, and what the model decided at each step. For EU AI Act Article 28 deployer obligations, which require organisations to maintain records of AI system behaviour, an auditable graph-based architecture is a practical advantage over opaque chain-of-thought prompting.
CrewAI takes a role-based approach: you define agents (Researcher, Analyst, Writer) and assign them tasks and tools. The framework handles inter-agent communication. Setup is more declarative and accessible for teams without deep Python experience.
Both connect to Ollama via the OpenAI-compatible interface:
from langchain_ollama import ChatOllama
llm = ChatOllama(model="llama3.3:70b", base_url="http://localhost:11434")
No API key. No outbound request. No data shared with a third party.
Layer 3: Local Vector Store for RAG
Retrieval-Augmented Generation grounds model responses in your organisation's actual documents rather than training data. It requires two local components: an embedding model and a vector database.
Local embeddings run through Ollama as well β nomic-embed-text and mxbai-embed-large are both available via ollama pull and produce high-quality vectors for semantic search without any cloud call.
ChromaDB in embedded mode runs in-process with no separate server β ideal for prototypes and single-user deployments. Qdrant as a Docker container offers higher throughput, persistent storage, and horizontal scaling for multi-user production environments.
The complete RAG loop β embed document, store vector, retrieve on query, generate response β operates entirely on-premises. This eliminates the need for a Data Processing Agreement (DPA) with a third-party vector-DB provider and keeps sensitive document content fully under your control. See our guide on local AI architecture for implementation details.
Typical SME Use Cases
With this stack, the following applications run without any cloud dependency:
- Internal document search: Contracts, SOPs, email archives queried in plain language
- Internal FAQ bot: Employee questions answered from internal knowledge bases
- PDF data extraction: Invoices, forms, and applications parsed into structured data
- Meeting summaries: Paired with locally running Whisper for automatic transcription and summarisation
- Private coding assistant: Code completion and review without sharing proprietary source code
The kAIra Toolkit by Freshlab packages these use cases into a managed platform for SMEs.
GDPR and EU AI Act Compliance Built In
A local stack resolves several compliance challenges simultaneously.
Under GDPR, there is no third-party data processor, no Schrems II transfer risk, no reliance on a cloud provider's sub-processors. Standard contractual clauses and transfer impact assessments become irrelevant because data never crosses a network boundary outside your own infrastructure.
Under the EU AI Act, Article 28 requires deployers of AI systems to implement appropriate technical and organisational measures, maintain logs of system operation, and be able to explain AI-assisted decisions. A transparent, locally hosted model β where you control the version, the configuration, and the inference environment β is significantly easier to document and audit than a black-box API endpoint whose underlying model may change without notice.
This compliance advantage is, based on our reading of the regulatory texts, a structural benefit that cloud-API deployments cannot easily replicate. For a deeper look at data sovereignty considerations: Data Sovereignty and Local AI.
Cost and ROI
The software is entirely free and open source. Costs are:
- Hardware: Mac Studio M3 Max (96 GB) from approximately β¬4,500; handles 70B models with several concurrent users. For heavier workloads, a Mac Studio M3 Ultra (192 GB) or a dedicated Linux server with a 64 GB+ GPU.
- Integration: 2β8 days of setup and integration work depending on existing system complexity.
- Power: ~20β30 W idle β negligible against cloud API costs at regular usage volumes.
Based on typical SME usage profiles as we model them, hardware payback periods relative to cloud API subscriptions are frequently under 18 months. For teams that want to build internal capability alongside the deployment: AI Training for Teams.
The lowest-risk entry point is a focused pilot: one concrete use case, measurable outcome, fixed scope. Freshlab helps European businesses design and deploy local AI systems that stay compliant from day one.