Local Models
QUI supports running LLMs entirely on your own hardware through Qllama — an isolated Ollama wrapper. Local models require no internet connection and incur no billing costs. Your data never leaves your machine.
What Is Qllama
Qllama runs the official Ollama Docker image with QUI-specific configuration. It provides:
- Port isolation — Qllama uses port 11435, separate from any native Ollama installation (11434). Both can run simultaneously.
- Data isolation — models are stored in a dedicated Docker volume, independent of any local Ollama installation
- Network integration — other QUI services can reach Qllama through the internal Docker network
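The port split above can be expressed as two base URLs. A minimal sketch, assuming both services speak plain HTTP on localhost (the port numbers come from the text; the helper name is illustrative):

```python
# Port numbers are from the text above; the helper is illustrative.
NATIVE_OLLAMA_PORT = 11434   # a native Ollama installation
QLLAMA_PORT = 11435          # QUI's isolated Qllama wrapper

def base_url(port: int, host: str = "localhost") -> str:
    """Build the HTTP base URL for an Ollama-compatible endpoint."""
    return f"http://{host}:{port}"

# Both can run simultaneously because they bind different ports.
print(base_url(QLLAMA_PORT))         # http://localhost:11435
print(base_url(NATIVE_OLLAMA_PORT))  # http://localhost:11434
```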
Setting Up Local Models
1. Start Qllama
Qllama starts automatically with the QUI Core installer. You can also manage it manually from the Local Models tab in the QUI Core dashboard.
2. Pull a Model
From the dashboard's Local Models tab, search the Ollama model library and pull models. Popular choices:
| Model | Size | Best For |
|---|---|---|
| phi3:mini | ~2 GB | Fast, lightweight tasks |
| mistral | ~4 GB | General purpose |
| llama3 | ~4-8 GB | Strong general reasoning |
| codellama | ~4-8 GB | Code generation and analysis |
3. Select the Model on a Character
In the Visual Builder:
- Click the core Anima node
- Select a local model from the model dropdown
- Local models appear with a local- prefix
How Local Models Route
When a character uses a local model, the call chain is shorter:
Your character (Anima) → QUI Core Bridge → detects local-* prefix → routes to Qllama (localhost:11435)
The QUI Core Bridge detects the local-* prefix on the model name and routes the request directly to Qllama, bypassing the central billing hub entirely — no internet connection needed, no billing charges.
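The routing rule reduces to a prefix check. A sketch of that decision (only the local- prefix check comes from the text; the hub address is a placeholder, and stripping the prefix before handing the name to Qllama is an assumption about how the Bridge normalizes model names):

```python
QLLAMA_URL = "http://localhost:11435"            # Qllama endpoint, from the text
BILLING_HUB_URL = "https://hub.example.invalid"  # placeholder; real hub address not given

def route(model: str) -> tuple[str, str]:
    """Return (endpoint, model name to send) for a character's model call.

    Models with a local-* prefix route directly to Qllama; everything
    else goes through the central billing hub.
    """
    if model.startswith("local-"):
        # Assumed normalization: Qllama knows models by their plain
        # Ollama names, so the local- prefix is stripped.
        return QLLAMA_URL, model[len("local-"):]
    return BILLING_HUB_URL, model

print(route("local-mistral"))  # ('http://localhost:11435', 'mistral')
```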
GPU Requirements
Local LLMs perform best with GPU acceleration:
- NVIDIA GPU with CUDA support — recommended
- NVIDIA Container Toolkit must be installed for Docker GPU passthrough
- Without GPU, models fall back to CPU inference (significantly slower)
The GPU Monitor panel in the QUI Core dashboard shows VRAM usage and loaded models.
Managing Models
The Local Models tab in the QUI Core dashboard lets you:
- View installed models — see what's downloaded and how much space each uses
- Pull new models — search and download from the Ollama library
- View running models — see what's currently loaded in VRAM
- Monitor GPU usage — VRAM consumption and utilization
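The dashboard's installed-model view maps naturally onto Ollama's standard listing endpoint, which Qllama presumably exposes. A hedged sketch that parses the documented /api/tags response shape (GET http://localhost:11435/api/tags) into name and size pairs; the helper name is illustrative:

```python
import json

def summarize_models(tags_response: str) -> list[tuple[str, float]]:
    """Parse Ollama's /api/tags response into (name, size in GB) pairs."""
    models = json.loads(tags_response).get("models", [])
    return [(m["name"], round(m["size"] / 1e9, 1)) for m in models]

# Sample response in the documented shape (sizes are in bytes).
sample = '{"models": [{"name": "phi3:mini", "size": 2200000000}]}'
print(summarize_models(sample))  # [('phi3:mini', 2.2)]
```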
When to Use Local vs Cloud
| Scenario | Recommendation |
|---|---|
| Privacy-sensitive data | Local — data never leaves your machine |
| No internet connection | Local — works fully offline |
| Cost-sensitive experimentation | Local — no per-token charges |
| Maximum quality responses | Cloud — larger cloud models generally outperform local ones |
| Fast response times | Cloud — cloud inference is typically faster than local (unless you have a powerful GPU) |
| Long context windows | Cloud — cloud models support larger context windows |
You can mix local and cloud models across characters — one character uses a local model for privacy-sensitive tasks while another uses a cloud model for complex reasoning.
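Mixing local and cloud models amounts to a per-character mapping. A minimal sketch — the character names and model choices here are made up for illustration:

```python
# Illustrative per-character model assignment; names are hypothetical.
CHARACTER_MODELS = {
    "privacy_assistant": "local-mistral",  # routed to Qllama, data stays on-machine
    "research_helper": "cloud-model-x",    # placeholder cloud model via the billing hub
}

def model_for(character: str) -> str:
    """Look up which model a character is configured to use."""
    return CHARACTER_MODELS[character]

print(model_for("privacy_assistant"))  # local-mistral
```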
Tip: Start with a small local model like phi3:mini to verify your GPU setup works, then pull larger models as needed.