Local Models

QUI supports running LLMs entirely on your own hardware through Qllama — an isolated Ollama wrapper. Local models require no internet connection and incur no billing costs. Your data never leaves your machine.


What Is Qllama

Qllama runs the official Ollama Docker image with QUI-specific configuration. It provides:

  • Port isolation — Qllama uses port 11435, separate from any native Ollama installation (11434). Both can run simultaneously.
  • Data isolation — models are stored in a dedicated Docker volume, independent of any local Ollama installation
  • Network integration — other QUI services can reach Qllama through the internal Docker network
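The isolation described above is what you would get from running the official Ollama image with a remapped port, a dedicated volume, and a shared internal network. A minimal sketch — the container, volume, and network names here are illustrative assumptions, not QUI's actual configuration:

```shell
# Illustrative sketch only: the names "qllama", "qllama-models", and
# "qui-internal" are assumptions, not QUI's actual values.
docker run -d \
  --name qllama \
  -p 11435:11434 \
  -v qllama-models:/root/.ollama \
  --network qui-internal \
  ollama/ollama
# -p 11435:11434         host port 11435 maps to Ollama's default 11434 inside
#                        the container, so a native Ollama on 11434 is untouched
# -v qllama-models:...   dedicated volume keeps models separate from any
#                        native Ollama installation
# --network qui-internal other QUI services on this network can reach the
#                        container by name
```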

Setting Up Local Models

1. Start Qllama

Qllama starts automatically with the QUI Core installer. You can also manage it manually from the Local Models tab in the QUI Core dashboard.

2. Pull a Model

Download a model from the Ollama model library:

From the dashboard's Local Models tab, search for and pull models. Popular choices:

Model       Size      Best For
phi3:mini   ~2 GB     Fast, lightweight tasks
mistral     ~4 GB     General purpose
llama3      ~4-8 GB   Strong general reasoning
codellama   ~4-8 GB   Code generation and analysis
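If you prefer the command line to the dashboard, models can also be pulled through the standard Ollama REST API, assuming Qllama exposes it on its default port (11435):

```shell
# Assumes Qllama is running and serving the standard Ollama API on port 11435.
# This is equivalent to pulling phi3:mini from the dashboard's Local Models tab.
curl http://localhost:11435/api/pull -d '{"name": "phi3:mini"}'
```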

3. Select the Model on a Character

In the Visual Builder:

  1. Click the core Anima node
  2. Select a local model from the model dropdown
  3. Local models appear with a local- prefix

How Local Models Route

When a character uses a local model, the call chain is shorter:

Your character (Anima) → QUI Core Bridge → detects local-* prefix → routes to Qllama (localhost:11435)

The request bypasses the central billing hub entirely: the QUI Core Bridge detects the local- prefix on the model name and routes straight to Qllama, so no internet connection is needed and no billing charges accrue.
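The prefix check can be sketched as a simple dispatch. This is a hypothetical illustration in shell — the endpoint values are placeholders, and the real Bridge logic is internal to QUI Core:

```shell
# Hypothetical sketch of prefix-based routing; endpoint strings are
# placeholders, not QUI's actual addresses.
route_model() {
  case "$1" in
    local-*) echo "http://localhost:11435" ;;  # local-* -> Qllama, no billing hub
    *)       echo "billing-hub"            ;;  # everything else -> cloud path
  esac
}
```

For example, `route_model "local-phi3:mini"` resolves to the Qllama endpoint, while any other model name takes the cloud path.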


GPU Requirements

Local LLMs perform best with GPU acceleration:

  • NVIDIA GPU with CUDA support — recommended
  • NVIDIA Container Toolkit must be installed for Docker GPU passthrough
  • Without GPU, models fall back to CPU inference (significantly slower)
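Before pulling large models, you can confirm that Docker can see the GPU with the NVIDIA Container Toolkit's standard smoke test (assumes the toolkit and an NVIDIA driver are installed):

```shell
# With the NVIDIA Container Toolkit installed, --gpus all mounts the driver
# into the container; this should print the same table as nvidia-smi on the host.
docker run --rm --gpus all ubuntu nvidia-smi
```

If this command fails, CPU fallback is what Qllama will use.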

The GPU Monitor panel in the QUI Core dashboard shows VRAM usage and loaded models.


Managing Models

The Local Models tab in the QUI Core dashboard lets you:

  • View installed models — see what's downloaded and how much space each uses
  • Pull new models — search and download from the Ollama library
  • View running models — see what's currently loaded in VRAM
  • Monitor GPU usage — VRAM consumption and utilization
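The same information is available from the standard Ollama API, assuming Qllama exposes it on port 11435:

```shell
# Installed models and their sizes (the dashboard's "installed" view):
curl http://localhost:11435/api/tags
# Models currently loaded into VRAM (the "running" view):
curl http://localhost:11435/api/ps
```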

When to Use Local vs Cloud

Scenario                         Recommendation
Privacy-sensitive data           Local — data never leaves your machine
No internet connection           Local — works fully offline
Cost-sensitive experimentation   Local — no per-token charges
Maximum quality responses        Cloud — larger cloud models generally outperform local ones
Fast response times              Cloud — cloud inference is typically faster than local (unless you have a powerful GPU)
Long context windows             Cloud — cloud models support larger context windows

You can mix local and cloud models across characters — for example, one character can use a local model for privacy-sensitive tasks while another uses a cloud model for complex reasoning.

Tip: Start with a small local model like phi3:mini to verify your GPU setup works, then pull larger models as needed.

Updated on Mar 21, 2026