
Self-Hosting AI with Ollama and Open WebUI: Run LLMs Locally

Infrastructure 2026-02-09 ollama open-webui llm ai machine-learning

Running large language models locally used to require deep knowledge of Python environments, model quantization, and GPU drivers. Today, two tools make it remarkably simple: Ollama handles downloading, running, and serving LLMs through a clean CLI, and Open WebUI provides a polished chat interface on top of it. Together, they give you a private, self-hosted alternative to ChatGPT that runs entirely on your own hardware.

No API keys, no usage fees, no data leaving your network.

What Ollama Does

Ollama is a lightweight runtime for large language models. Think of it as Docker for LLMs -- you pull a model by name, and Ollama handles downloading the weights, loading them into memory, and serving an OpenAI-compatible API.

# Install and run a model in two commands
ollama pull llama3.1:8b
ollama run llama3.1:8b

Under the hood, Ollama uses llama.cpp for inference, which means it supports GGUF-quantized models and can run on both CPU and GPU. It exposes a REST API on port 11434, making it easy for other tools to interact with it.
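
Once a model is pulled, any HTTP client can talk to that API. A minimal sketch, assuming the server is reachable on localhost and llama3.1:8b has already been pulled:

# Native endpoint: one-shot generation
curl http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

# OpenAI-compatible endpoint, usable by existing OpenAI client libraries
curl http://localhost:11434/v1/chat/completions -H "Content-Type: application/json" -d '{
  "model": "llama3.1:8b",
  "messages": [{"role": "user", "content": "Why is the sky blue?"}]
}'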

What Open WebUI Provides

Open WebUI (formerly Ollama WebUI) is a self-hosted web interface that connects to Ollama's API. It gives you a ChatGPT-like experience in your browser, with features that go well beyond a basic chat box: persistent conversation history, multi-user accounts with an admin role, model management from the browser, and document uploads for retrieval-augmented chat.

[Architecture diagram: browser → Open WebUI (chat interface, port 3000) → Ollama API (REST endpoint, port 11434) → GPU/CPU inference engine → LLM model weights. All processing happens locally; no data leaves your network.]

Docker Compose Setup

The simplest way to run both together is a single Docker Compose file.

CPU-Only Setup

# docker-compose.yml
services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    restart: unless-stopped

  open-webui:
    image: ghcr.io/open-webui/open-webui:main
    container_name: open-webui
    ports:
      - "3000:8080"
    volumes:
      - open_webui_data:/app/backend/data
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
    depends_on:
      - ollama
    restart: unless-stopped

volumes:
  ollama_data:
  open_webui_data:
Then start both services:

docker compose up -d

Open http://your-server:3000, create an account (the first account becomes admin), and you're ready to go. You still need to pull a model -- either from the Open WebUI interface or via the CLI:

docker exec -it ollama ollama pull llama3.1:8b

NVIDIA GPU Setup

For GPU acceleration with an NVIDIA card, install the NVIDIA Container Toolkit, then modify the Ollama service:

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    volumes:
      - ollama_data:/root/.ollama
    ports:
      - "11434:11434"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    restart: unless-stopped

Verify GPU access after starting:

docker exec -it ollama ollama run llama3.1:8b "Hello, are you using GPU?"
# Check nvidia-smi to confirm GPU utilization
nvidia-smi
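
If nvidia-smi shows no activity while the model is answering, the container most likely does not have GPU access. On Debian/Ubuntu the toolkit setup is roughly the following sketch (it assumes NVIDIA's apt repository has already been added per their install docs):

sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker compose up -d --force-recreate ollama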

GPU vs CPU: What to Expect

The performance difference between GPU and CPU inference is dramatic. Here are rough expectations for generating tokens with Llama 3.1 8B:

Hardware Tokens/sec Experience
Modern CPU (16 cores, AVX2) 5-15 tok/s Usable but noticeably slow
NVIDIA RTX 3060 (12 GB VRAM) 40-60 tok/s Smooth, real-time feel
NVIDIA RTX 3090 (24 GB VRAM) 60-90 tok/s Fast, comfortable
NVIDIA RTX 4090 (24 GB VRAM) 90-130 tok/s Near-instant responses
Apple M2 Pro (16 GB unified) 30-50 tok/s Good experience

CPU inference is viable for small models and occasional use. If you plan to use LLMs regularly, a GPU with at least 8 GB of VRAM makes a significant quality-of-life difference.
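
To see where your own hardware lands, ollama run accepts a --verbose flag that prints timing statistics after each response, including the generation rate:

docker exec -it ollama ollama run llama3.1:8b --verbose "Write a haiku about self-hosting."
# The stats printed after the reply include an "eval rate" line in tokens/s;
# your numbers will vary with model, quantization, and hardware.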

Recommended Models

Not all models are equal, and bigger is not always better. Here's a practical guide to which models work well for self-hosting:

Best Starting Points

Model Size VRAM Needed Good For
Llama 3.1 8B ~4.7 GB 6 GB General purpose, coding, reasoning
Mistral 7B ~4.1 GB 6 GB Fast general use, good instruction following
Gemma 2 9B ~5.4 GB 8 GB Strong reasoning, Google's quality
Phi-3 Mini 3.8B ~2.2 GB 4 GB Surprisingly capable for its size
CodeLlama 7B ~3.8 GB 6 GB Code generation and explanation

Larger Models (If You Have the Hardware)

Model Size VRAM Needed Good For
Llama 3.1 70B ~40 GB 48 GB (or CPU offload) Near-GPT-4 quality for many tasks
Mixtral 8x7B ~26 GB 32 GB Excellent quality-to-speed ratio
Qwen 2.5 32B ~18 GB 24 GB Strong multilingual and coding

Pull any of these with:

docker exec -it ollama ollama pull llama3.1:8b
docker exec -it ollama ollama pull mistral:7b
docker exec -it ollama ollama pull gemma2:9b

Hardware Requirements

Minimum Viable Setup

Comfortable Setup

Memory Rule of Thumb

For GGUF quantized models (which Ollama uses by default), expect roughly 0.5-0.7 GB of memory per billion parameters at the default 4-bit quantization -- the sizes in the tables above follow this pattern -- plus extra room for the context window.

If a model doesn't fit entirely in VRAM, Ollama will split it between GPU and CPU memory, which works but reduces speed.
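
You can see exactly how a loaded model was placed with ollama ps, which reports its memory footprint and the CPU/GPU split. The output below is illustrative; your numbers will differ:

docker exec -it ollama ollama ps
# NAME           ID     SIZE      PROCESSOR    UNTIL
# llama3.1:8b    ...    6.2 GB    100% GPU     4 minutes from now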

Practical Configuration Tips

Customizing Model Behavior

Create a custom model with a system prompt:

# Write a Modelfile inside the container, then build the custom model from it
docker exec -it ollama bash -c 'cat > /tmp/Modelfile <<EOF
FROM llama3.1:8b
SYSTEM "You are a helpful technical assistant. Be concise and accurate."
PARAMETER temperature 0.7
PARAMETER num_ctx 4096
EOF
ollama create my-assistant -f /tmp/Modelfile'
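
The new model then behaves like any other -- it shows up in Open WebUI's model selector and can be run from the CLI:

docker exec -it ollama ollama run my-assistant "Explain what a reverse proxy does."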

Increasing Context Length

By default, most models run with a 2048-token context window. For longer conversations, raise num_ctx for the session from inside the interactive prompt (or bake it into a custom model with PARAMETER num_ctx, as above):

docker exec -it ollama ollama run llama3.1:8b
>>> /set parameter num_ctx 8192

More context uses more memory. A 7B model with 8192 context needs roughly 2 GB more RAM than the default.
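
The context size can also be set per request through the REST API's options field. A sketch against the local server:

curl http://localhost:11434/api/generate -H "Content-Type: application/json" -d '{
  "model": "llama3.1:8b",
  "prompt": "Summarize this meeting transcript: ...",
  "stream": false,
  "options": { "num_ctx": 8192 }
}'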

Exposing Ollama to Your Network

By default, a native Ollama install only listens on localhost (the Docker image already binds all interfaces, which is why the port mapping above works). To allow other devices, or Open WebUI on a different machine, to connect to a bare-metal install, set OLLAMA_HOST in Ollama's environment:

environment:
  - OLLAMA_HOST=0.0.0.0

If you do this, make sure Ollama is behind a firewall or VPN. There is no built-in authentication.
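
One way to limit exposure is a firewall rule scoped to your LAN. A sketch with ufw, assuming a 192.168.1.0/24 subnet:

# Allow the Ollama API only from the local subnet, deny it for everything else
sudo ufw allow from 192.168.1.0/24 to any port 11434 proto tcp
sudo ufw deny 11434/tcp
# Note: ports published by Docker bypass ufw by default; with the compose file above,
# consider publishing as "127.0.0.1:11434:11434" or a specific LAN IP instead.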

Practical Use Cases

Self-hosted LLMs shine in specific scenarios: privacy-sensitive work where data must not leave your network, offline or air-gapped environments, unrestricted experimentation with system prompts and custom models, and high-volume workloads where per-token API costs would add up.

Self-Hosted LLMs vs Cloud APIs

Feature Self-Hosted (Ollama) Cloud (ChatGPT/Claude)
Privacy Complete -- nothing leaves your machine Data sent to provider
Cost Hardware only (one-time) Per-token or subscription
Model quality Good (7B-70B class) State-of-the-art (GPT-4o, Claude)
Speed Depends on hardware Consistently fast
Internet required No (after model download) Yes
Customization Full control (system prompts, fine-tuning) Limited
Maintenance You manage updates and hardware Zero maintenance

The honest truth: cloud models like GPT-4o and Claude are still significantly more capable than anything you can run locally on consumer hardware. Self-hosted LLMs are best as a complement to cloud services, not a replacement. Use them for privacy-sensitive tasks and experimentation, and cloud APIs when you need maximum quality.

Keeping Things Updated

Ollama and Open WebUI both move quickly. Update regularly:

docker compose pull
docker compose up -d

To update a model to the latest version:

docker exec -it ollama ollama pull llama3.1:8b

Verdict

Ollama and Open WebUI are the easiest way to run LLMs on your own hardware. The setup takes five minutes with Docker Compose, and the experience is genuinely good -- especially with a decent GPU. You won't match GPT-4 quality with a 7B model, but for many everyday tasks, local models are more than capable. The privacy and zero ongoing cost make it worth running alongside whatever cloud services you already use.