Local Models

QUI supports running LLMs entirely on your own hardware through Qllama — an isolated Ollama wrapper. Local models require no internet connection and incur no billing costs. Your data never leaves your machine.


What Is Qllama

Qllama runs the official Ollama Docker image with QUI-specific configuration. It provides:

  • Port isolation — Qllama uses port 11435, separate from any native Ollama installation (11434). Both can run simultaneously.
  • Data isolation — models are stored in a dedicated Docker volume, independent of any local Ollama installation
  • Network integration — other QUI services can reach Qllama through the internal Docker network
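The isolation described above is what you would get from running the official Ollama image with a remapped port, a dedicated volume, and a shared internal network. A minimal sketch — the container, volume, and network names here are illustrative assumptions, not QUI's actual configuration:

```shell
# Illustrative sketch only: the names "qllama", "qllama-models", and
# "qui-internal" are assumptions, not QUI's actual values.
docker run -d \
  --name qllama \
  -p 11435:11434 \
  -v qllama-models:/root/.ollama \
  --network qui-internal \
  ollama/ollama
# -p 11435:11434         host port 11435 maps to Ollama's default 11434 inside
#                        the container, so a native Ollama on 11434 is untouched
# -v qllama-models:...   dedicated volume keeps models separate from any
#                        native Ollama installation
# --network qui-internal other QUI services on this network can reach the
#                        container by name
```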

Setting Up Local Models

1. Start Qllama

Qllama starts automatically with the QUI Core installer. You can also manage it manually from the Local Models tab in the QUI Core dashboard.

2. Pull a Model

Download a model from the Ollama model library:

From the dashboard's Local Models tab, search for and pull models. Popular choices:

Model       Size      Best For
phi3:mini   ~2 GB     Fast, lightweight tasks
mistral     ~4 GB     General purpose
llama3      ~4-8 GB   Strong general reasoning
codellama   ~4-8 GB   Code generation and analysis
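If you prefer the command line to the dashboard, models can also be pulled through the standard Ollama REST API, assuming Qllama exposes it on its default port (11435):

```shell
# Assumes Qllama is running and serving the standard Ollama API on port 11435.
# This is equivalent to pulling phi3:mini from the dashboard's Local Models tab.
curl http://localhost:11435/api/pull -d '{"name": "phi3:mini"}'
```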

3. Select the Model on a Character

In the Visual Builder:

  1. Click the core Anima node
  2. Select a local model from the model dropdown
  3. Local models appear with a local- prefix

How Local Models Route

When a character uses a local model, the call chain is shorter:

Your character (Anima) → QUI Core Bridge → detects local-* prefix → routes to Qllama (localhost:11435)

The request bypasses the central billing hub entirely: the QUI Core Bridge detects the local- prefix on the model name and routes straight to Qllama, so no internet connection is needed and no billing charges accrue.
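The prefix check can be sketched as a simple dispatch. This is a hypothetical illustration in shell — the endpoint values are placeholders, and the real Bridge logic is internal to QUI Core:

```shell
# Hypothetical sketch of prefix-based routing; endpoint strings are
# placeholders, not QUI's actual addresses.
route_model() {
  case "$1" in
    local-*) echo "http://localhost:11435" ;;  # local-* -> Qllama, no billing hub
    *)       echo "billing-hub"            ;;  # everything else -> cloud path
  esac
}
```

For example, `route_model "local-phi3:mini"` resolves to the Qllama endpoint, while any other model name takes the cloud path.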


GPU Requirements

Local LLMs perform best with GPU acceleration:

  • NVIDIA GPU with CUDA support — recommended
  • NVIDIA Container Toolkit must be installed for Docker GPU passthrough
  • Without GPU, models fall back to CPU inference (significantly slower)
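Before pulling large models, you can confirm that Docker can see the GPU with the NVIDIA Container Toolkit's standard smoke test (assumes the toolkit and an NVIDIA driver are installed):

```shell
# With the NVIDIA Container Toolkit installed, --gpus all mounts the driver
# into the container; this should print the same table as nvidia-smi on the host.
docker run --rm --gpus all ubuntu nvidia-smi
```

If this command fails, CPU fallback is what Qllama will use.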

The GPU Monitor panel in the QUI Core dashboard shows VRAM usage and loaded models.


Managing Models

The Local Models tab in the QUI Core dashboard lets you:

  • View installed models — see what's downloaded and how much space each uses
  • Pull new models — search and download from the Ollama library
  • View running models — see what's currently loaded in VRAM
  • Monitor GPU usage — VRAM consumption and utilization
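The same information is available from the standard Ollama API, assuming Qllama exposes it on port 11435:

```shell
# Installed models and their sizes (the dashboard's "installed" view):
curl http://localhost:11435/api/tags
# Models currently loaded into VRAM (the "running" view):
curl http://localhost:11435/api/ps
```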

When to Use Local vs Cloud

Scenario                         Recommendation
Privacy-sensitive data           Local — data never leaves your machine
No internet connection           Local — works fully offline
Cost-sensitive experimentation   Local — no per-token charges
Maximum quality responses        Cloud — larger cloud models generally outperform local ones
Fast response times              Cloud — cloud inference is typically faster than local (unless you have a powerful GPU)
Long context windows             Cloud — cloud models support larger context windows

You can mix local and cloud models across characters — for example, one character can use a local model for privacy-sensitive tasks while another uses a cloud model for complex reasoning.

Tip: Start with a small local model like phi3:mini to verify your GPU setup works, then pull larger models as needed.

Updated on Mar 21, 2026