Every message hits OpenAI, Anthropic, or some third-party API. Your conversation is now their training data — or their incident log.
Heavy users hit rate limits, hit billing walls, or get throttled when usage spikes. The bot's silent when you need it most.
The same GPT call handles chat, code, image prompts, vision — even though no single model is best at all four.
Llama-3.1, Qwen-VL, SDXL all run on your card. Zero outbound API calls. Chat, image, archive, search history — all on the box that owns the silicon.
5 chat roles — default, quality, reasoning, fallback, lightweight. The router picks the right model for the task. Translation goes to Gemma-3n. Code goes to Phi-4.
Task-LRU swaps models in and out of VRAM as needed. Thrashing warnings before OOM. /gpu watch shows it live.
API calls. Ever.
| ROLE | MODEL | QUANT | VRAM |
|---|---|---|---|
| default_chat | Llama-3.1-8B-Instruct | 4bit · NF4 | 5.1 GB |
| quality_text | Mistral-Nemo-Instruct-2407 | 4bit | 7.2 GB |
| reasoning_code | microsoft/phi-4 | 4bit | 9.1 GB |
| lightweight_chat | gemma-3n-E4B-it | 4bit | 3.1 GB |
| image | Dreamshaper-XL | fp16 | 6.8 GB |
| vision | Qwen2.5-VL-3B-Instruct | fp16 | 3.4 GB |
▸ task-LRU keeps just one chat model resident · pin via LOCAL_*_KEEP_RESIDENT=on
▸ GPL-2.0 licensed · Windows-first · ~6-minute install