RAG (Retrieval-Augmented Generation) is the practical standard for connecting language models to a company's own knowledge base without retraining the model. The technology is mature. The problem is that most production guides still rely on cloud services (OpenAI or Cohere for embeddings, Pinecone for vector storage), which transmit every document and every query to external servers.
For organizations handling sensitive data, such as law firms, accountancy practices, engineering companies, and healthcare providers, that data flow is a non-starter. This guide covers building a fully local RAG stack: embeddings, vector storage, retrieval, re-ranking, and evaluation, all running on hardware you control.
Which Local Embedding Models Work in Production
An embedding model converts text into a numerical vector that captures semantic meaning. A well-designed system finds the right contract clause or process document even when the user's question uses different words than the source text.
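To make "captures semantic meaning" concrete, here is a minimal sketch of how two embedding vectors are typically compared, using cosine similarity. The three-dimensional vectors and their labels are invented for illustration; real models emit hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

# Toy vectors, invented for this example (not produced by a real model).
query = [0.9, 0.1, 0.0]
documents = {
    "termination_clause": [0.8, 0.2, 0.1],
    "holiday_policy": [0.0, 0.2, 0.9],
}

# Rank documents by similarity to the query, most similar first.
ranked = sorted(
    documents.items(),
    key=lambda item: cosine_similarity(query, item[1]),
    reverse=True,
)
```

The document whose vector points in nearly the same direction as the query ranks first, regardless of whether the two texts share any literal words.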
Two models have established themselves as reliable for production use:
- nomic-embed-text: A 768-dimension embedding model, available directly via Ollama. Community practitioners report quality competitive with earlier OpenAI embedding models, with significantly lower latency for local queries. It is often recommended for multilingual European document corpora, but language coverage differs between model versions, so run a small retrieval test in your own languages before committing.
- mxbai-embed-large: Released by Mixedbread.ai, also available via Ollama. Practitioners report strong results on the MTEB benchmark leaderboard and solid retrieval quality for English-language documents.
For primarily English-language document corpora, both models perform well. For multilingual environments, common for European SMBs serving multiple markets, nomic-embed-text has the broader language coverage.
Setting Up Ollama as the Embedding Backend
Ollama's embedding API has been available since version 0.1.x. A single command fetches the model:
ollama pull nomic-embed-text
Embeddings are then generated via a local REST endpoint:
curl http://localhost:11434/api/embeddings \
-d '{"model": "nomic-embed-text", "prompt": "Your document text here"}'
No API key. No request leaves the machine. No per-call cost. The embedding service runs on the same hardware as the language model, with no second machine and no separate infrastructure cost. For background on running a full local AI stack, see local-ai.html.
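The same call can be made from Python with only the standard library. This is a sketch under the assumption that Ollama is listening on its default port, as in the curl example, and that the response carries the vector in an "embedding" field; the function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embeddings"  # endpoint from the curl example

def build_embedding_request(text, model="nomic-embed-text", url=OLLAMA_URL):
    """Build the POST request without sending it."""
    payload = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def embed(text, model="nomic-embed-text", timeout=30):
    """Send the request to the local Ollama server and return the vector."""
    request = build_embedding_request(text, model)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumption: the endpoint answers with {"embedding": [...]}.
    return body["embedding"]
```

Calling embed("Your document text here") returns a list of floats when the server is running and the model has been pulled. Splitting request construction from sending keeps the first half testable without a live server.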
ChromaDB as the Local Vector Store
ChromaDB is an open-source vector database that runs locally either as a Python library or as a standalone Docker container. For SMB-scale deployments, typically under a million document chunks, the embedded mode is sufficient and requires no separate server process:
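import chromadb
# PersistentClient writes the index to disk; chromadb.Client() keeps it in memory only.
client = chromadb.PersistentClient(path="./chroma_db")  # example path
# get_or_create_collection avoids an error when the collection already exists.
collection = client.get_or_create_collection("company_documents")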
Multiple active community projects on GitHub demonstrate this stack in production environments: Ollama for both embeddings and LLM inference, paired with ChromaDB for vector storage. The combination is well-tested and operationally simple.
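A rough sketch of how indexing and querying might look on top of such a collection. The helper names, the ID scheme, and the metadata keys are hypothetical; embed_fn stands for whatever function turns text into a vector (for example, a call to the local Ollama endpoint). The only ChromaDB calls used are collection.add and collection.query.

```python
def index_chunks(collection, chunks, embed_fn, source):
    """Add text chunks with precomputed embeddings to a ChromaDB collection."""
    if not chunks:
        return []
    ids = [f"{source}-{i}" for i in range(len(chunks))]  # hypothetical ID scheme
    collection.add(
        ids=ids,
        documents=chunks,
        embeddings=[embed_fn(chunk) for chunk in chunks],
        metadatas=[{"source": source, "chunk_index": i} for i in range(len(chunks))],
    )
    return ids

def search(collection, question, embed_fn, n_results=20):
    """Return (chunk_text, metadata) pairs for the nearest stored chunks."""
    result = collection.query(
        query_embeddings=[embed_fn(question)],
        n_results=n_results,
    )
    # ChromaDB returns one inner list per query; a single query was sent.
    return list(zip(result["documents"][0], result["metadatas"][0]))
```

Passing embeddings explicitly keeps the embedding step under your control rather than relying on the collection's default embedding function.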
Chunking: The Most Consequential Decision
RAG quality depends more on how documents are split than on which language model processes them. Chunks that are too large dilute relevance; chunks too small lose context.
For typical SMB documents (contracts, SOPs, technical specifications, email archives), these values are a reliable starting point:
- Chunk size: 400 to 800 tokens, approximately 300 to 600 words
- Overlap: 50 to 100 tokens, so that a sentence cut at a chunk boundary still appears whole in the neighboring chunk
- Split strategy: Paragraph-first rather than fixed-size, with paragraph boundaries as the primary split points
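The values above can be sketched as a small paragraph-first chunker. This is an illustration, not a reference implementation: it counts whitespace-separated words instead of model tokens (the defaults of 450 and 60 words sit inside the ranges above under the same words-to-tokens ratio), and it assumes paragraphs are separated by blank lines.

```python
def chunk_paragraphs(text, max_words=450, overlap_words=60):
    """Pack whole paragraphs into chunks; carry trailing words forward as overlap."""
    if not 0 <= overlap_words < max_words:
        raise ValueError("overlap_words must be >= 0 and smaller than max_words")
    step = max_words - overlap_words  # room left for new text after the overlap
    chunks, current, fresh = [], [], 0  # fresh = words in current that are not overlap
    for paragraph in text.split("\n\n"):
        words = paragraph.split()
        # Paragraphs that fit stay whole; longer ones fall back to fixed-size slices.
        for start in range(0, len(words), step):
            piece = words[start:start + step]
            if fresh and len(current) + len(piece) > max_words:
                chunks.append(" ".join(current))
                current = current[-overlap_words:] if overlap_words else []
                fresh = 0
            current.extend(piece)
            fresh += len(piece)
    if fresh:
        chunks.append(" ".join(current))
    return chunks
```

No chunk exceeds max_words, and each chunk after the first begins with the last overlap_words words of its predecessor. Swapping the word count for a real tokenizer is the obvious next step once the model is fixed.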
For legal documents, practitioners recommend a hierarchical approach: sections as the top-level unit, paragraphs as the retrievable chunk, with section identifiers stored as metadata. This enables accurate citations in generated answers.
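One way to picture the hierarchical approach: walk the document paragraph by paragraph and tag each one with the most recent section identifier. The heading convention below (a line starting with "Section 4" or "§ 4.2") is an assumption for this sketch; real contracts will need their own pattern.

```python
import re

# Assumed heading convention: a block that starts with "Section 4" or "§ 4.2".
HEADING = re.compile(r"(?:Section\s+|§\s*)(\d+(?:\.\d+)*)")

def split_with_sections(text):
    """Return one record per paragraph, tagged with the section it belongs to."""
    records = []
    section = "preamble"  # label for text before the first heading
    for block in text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        match = HEADING.match(block)
        if match:
            section = match.group(1)
            # Drop the heading line itself; keep any text that follows it.
            block = block.partition("\n")[2].strip()
            if not block:
                continue
        records.append({"text": block, "metadata": {"section": section}})
    return records
```

Each record's metadata can be stored alongside the chunk in the vector store, so a generated answer can cite "Section 2.1" rather than an anonymous chunk.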
Re-Ranking: Improving Precision After Retrieval
Vector similarity search retrieves candidates quickly, but not always the most relevant ones first. A cross-encoder re-ranker evaluates the query and each candidate document together, producing a more accurate relevance ranking.
A practical local re-ranking pipeline:
- Retrieval: Top 20 chunks via ChromaDB vector search
- Re-ranking: Top 5 via cross-encoder (e.g., a locally run ms-marco-MiniLM model)
- Generation: Only the top 5 chunks passed to the LLM as context
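The re-ranking step in this pipeline can be sketched with a pluggable scorer. The function below is hypothetical glue code; toy_scores is a stand-in that counts shared words so the example runs without model downloads, and it is not a cross-encoder.

```python
def rerank(query, candidates, score_pairs, top_k=5):
    """Keep the top_k candidates by relevance score, highest first."""
    if not candidates:
        return []
    scores = score_pairs([(query, text) for text in candidates])
    scored = sorted(zip(scores, candidates), key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:top_k]]

def toy_scores(pairs):
    """Stand-in scorer for demonstration only: number of shared lowercase words."""
    return [
        len(set(query.lower().split()) & set(text.lower().split()))
        for query, text in pairs
    ]

# Assumption: with sentence-transformers installed, a real cross-encoder can be
# plugged in, because CrossEncoder.predict accepts the same list of pairs:
#   from sentence_transformers import CrossEncoder
#   model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
#   top5 = rerank(question, top20_chunks, model.predict, top_k=5)
```

Scoring all pairs in one call matters in practice: a cross-encoder runs a full model pass per pair, so batching the 20 candidates is considerably cheaper than 20 separate calls.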
Hybrid search, which combines BM25 keyword matching with dense embeddings, improves candidate quality further. This is especially useful for document corpora containing product codes, article numbers, or domain-specific terminology that semantic search alone handles poorly.
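The text does not say how the two result lists are combined. One common option, shown here as an assumption rather than the method this guide prescribes, is reciprocal rank fusion (RRF): each document earns 1 / (k + rank) from every list it appears in, and the sums decide the final order. The constant k=60 is a conventional default, and the document IDs are invented.

```python
def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Merge ranked ID lists; documents ranked well in several lists rise to the top."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)
    return ordered[:top_n]

# Invented example: a product code that BM25 ranks first and vector search ranks third.
bm25_hits = ["sku-4711", "doc-a", "doc-b"]
dense_hits = ["doc-a", "doc-c", "sku-4711"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

RRF works on ranks only, so BM25 scores and cosine similarities never need to be put on a common scale.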
Evaluating RAG Quality with RAGAS
Without evaluation, RAG quality is guesswork. RAGAS is an open-source framework that measures RAG systems across four dimensions:
- Faithfulness: Does the answer align with what the retrieved documents actually say?
- Answer Relevance: Does the answer address the question asked?
- Context Precision: Are the retrieved chunks actually relevant to the question?
- Context Recall: Were all relevant passages retrieved?
RAGAS can run entirely locally using an open-source judge model (Llama 3.3, Qwen2.5, or Gemma 3 via Ollama), so even the evaluation step stays within your infrastructure. No separate cloud API call is required.
A test set of 50 to 100 representative questions is sufficient to identify weaknesses in chunking or retrieval before the system goes live.
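Before wiring up a judge model, a hand-labelled test set already supports a cheap retrieval check. The sketch below is not the RAGAS implementation: it computes plain set-based precision and recall over chunk IDs, assuming each test case lists the IDs a human marked as relevant. The function names and the test-case shape are hypothetical.

```python
def chunk_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are labelled relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(1 for cid in retrieved_ids if cid in relevant) / len(retrieved_ids)

def chunk_recall(retrieved_ids, relevant_ids):
    """Fraction of labelled-relevant chunks that were retrieved."""
    relevant = set(relevant_ids)
    if not relevant:
        return 1.0  # nothing to find, so nothing was missed
    return len(relevant & set(retrieved_ids)) / len(relevant)

def evaluate(test_set, retrieve, k=5):
    """Average both scores over cases shaped like {"question": ..., "relevant_ids": [...]}."""
    if not test_set:
        return {"precision": 0.0, "recall": 0.0}
    precisions, recalls = [], []
    for case in test_set:
        retrieved = retrieve(case["question"])[:k]
        precisions.append(chunk_precision(retrieved, case["relevant_ids"]))
        recalls.append(chunk_recall(retrieved, case["relevant_ids"]))
    return {
        "precision": sum(precisions) / len(precisions),
        "recall": sum(recalls) / len(recalls),
    }
```

Low recall usually points at chunking or embedding problems; low precision with good recall suggests the re-ranking step has room to help.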
GDPR Compliance as a Structural Advantage
A fully local RAG stack means no document, no query, and no generated answer leaves your infrastructure. For organizations processing client data under GDPR, this eliminates three specific compliance burdens:
- No data processing agreement required with a US or third-country cloud provider
- No international data transfer under GDPR Chapter V
- No exposure to conflicting data access obligations under foreign law (US CLOUD Act, Chinese cybersecurity law)
For a deeper look at data sovereignty as an architectural principle, see data-sovereignty.html.
On hardware, production-ready local RAG is practical on current Apple Silicon machines. Practitioners report that a Mac Studio with an Ultra-class chip and 192 GB of unified memory handles simultaneous LLM inference and embedding generation without throughput bottlenecks; note that 192 GB is the top M2 Ultra configuration, while the M3 Ultra ships with 96, 256, or 512 GB. A Mac Mini M4 Pro (48 GB) is sufficient for document volumes under 500,000 chunks.
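A quick back-of-the-envelope check shows why the vector index itself is rarely the memory bottleneck. The sketch assumes vectors are stored as 32-bit floats and ignores index overhead, metadata, and the document text, so treat the result as a lower bound.

```python
def vector_index_bytes(n_chunks, dimensions, bytes_per_value=4):
    """Raw size of the stored vectors, assuming float32 and no index overhead."""
    return n_chunks * dimensions * bytes_per_value

# 500,000 chunks at 768 dimensions, the figures used in this guide.
size = vector_index_bytes(500_000, 768)
print(f"{size / 1e9:.2f} GB")  # prints "1.54 GB"
```

At roughly 1.5 GB for half a million 768-dimension vectors, most of the unified memory remains available for the language model's weights and context.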
kAIra: A Pre-Configured Local RAG Stack
The kAIra platform by Freshlab integrates nomic-embed-text embeddings, ChromaDB, and Ollama inference as a pre-configured, production-ready system. Document indexing is automatic on upload; changes to existing documents are detected and the vector index updated accordingly. No programming required to operate the system.
The WikiHub module makes internal documents, SOPs, and contract archives searchable in natural language. The MailForge module applies the same RAG retrieval to email drafting, pulling from the same local knowledge base to generate contextually accurate responses.
If you're evaluating whether a local RAG stack fits your documents and workflows, a pilot project is the fastest route to a concrete answer: Request a pilot.