HxHippy

Running LLMs Locally

Set up and run open-source LLMs on your own hardware with Ollama, llama.cpp, and more.

Last updated: 2024-12-18

Running LLMs locally gives you privacy, no API costs, and offline capability. Here's how to set up your own AI infrastructure.

Why Run Locally?

Benefit        Description
Privacy        Data never leaves your machine
Cost           No per-token charges
Offline        Works without internet
Customization  Fine-tune for your domain
Speed          No network latency

Hardware Requirements

Minimum for Small Models (7B)

  • 16GB RAM
  • 8GB+ VRAM (GPU) or fast CPU
  • 20GB storage

For Medium Models

  • 32GB RAM
  • 16GB+ VRAM (RTX 3090, RTX 4080+)
  • 50GB storage

For Large Models (70B+)

  • 64GB+ RAM
  • 48GB+ VRAM (A100, multiple GPUs)
  • 150GB+ storage
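The tiers above follow from simple arithmetic: the weights alone take roughly (parameter count × bits per weight ÷ 8) bytes. The sketch below is a rough estimate only. The function name is hypothetical, the 4.5 bits-per-weight figure for a 4-bit quantized file is an assumption, and the context cache and runtime overhead are not counted.

```python
def estimate_weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the model weights alone, in GB (10^9 bytes)."""
    total_bytes = n_params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


# Compare unquantized 16-bit weights with an assumed ~4.5 bits/weight quantization
for size in (7, 13, 70):
    fp16 = estimate_weight_memory_gb(size, 16)
    quant = estimate_weight_memory_gb(size, 4.5)  # assumption, not a measured value
    print(f"{size}B: ~{fp16:.1f} GB at 16-bit, ~{quant:.1f} GB at ~4.5 bits/weight")
```

Leave headroom on top of these numbers for the context cache and the runtime itself.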

Option 1: Ollama (Easiest)

Installation

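# Linux (on macOS and Windows, download the installer from https://ollama.com/download)
curl -fsSL https://ollama.com/install.sh | sh

# Start the service (skip this if the installer already runs it in the background)
ollama serve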

Download and Run Models

# Pull a model
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull mistral:7b

# Run interactively
ollama run llama3.1:8b

# List installed models
ollama list

API Usage

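import requests

# Send a single non-streaming generation request to the local Ollama server
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1:8b',
        'prompt': 'Explain Docker containers',
        'stream': False,  # return one JSON object instead of a stream of chunks
    },
    timeout=300,  # local generation can be slow, especially on CPU
)
response.raise_for_status()  # fail loudly on HTTP errors such as an unknown model
print(response.json()['response'])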
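With 'stream': True, the same endpoint returns newline-delimited JSON chunks, each carrying a piece of the text in its response field and a done flag on the last one. The following is a minimal sketch of reading that stream. The helper names join_stream_chunks and stream_generate are hypothetical, and the live call assumes an Ollama server on the default port.

```python
import json
from typing import Iterable


def join_stream_chunks(lines: Iterable[bytes]) -> str:
    """Concatenate the 'response' fields of newline-delimited JSON chunks."""
    parts = []
    for raw in lines:
        if not raw:  # skip blank keep-alive lines
            continue
        chunk = json.loads(raw)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final chunk is marked done
            break
    return "".join(parts)


def stream_generate(prompt: str, model: str = "llama3.1:8b") -> str:
    """Call the local server with streaming enabled (needs a running Ollama)."""
    import requests  # third-party; imported here so the parser has no dependencies

    with requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True,
        timeout=300,
    ) as resp:
        resp.raise_for_status()
        return join_stream_chunks(resp.iter_lines())
```

In an interactive tool you would print each chunk as it arrives instead of joining them at the end.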

OpenAI-Compatible API

from openai import OpenAI

# Point the official OpenAI client at the local Ollama server
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but unused
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response.choices[0].message.content)

Option 2: llama.cpp (Performance)

Build from Source

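git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# Configure: pick one (recent versions build with CMake; the Makefile build is deprecated)
cmake -B build                  # CPU only; on macOS, Metal is enabled by default
cmake -B build -DGGML_CUDA=ON   # NVIDIA GPUs (CUDA)

# Build; binaries land in build/bin/
cmake --build build --config Release -j

Run Models

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Run interactive chat (the CLI binary is llama-cli in recent builds; older builds call it ./main)
./build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf \
    -n 256 \
    --repeat-penalty 1.1 \
    -i -r "User:" \
    -p "User: Hello!\nAssistant:"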

Quantization Levels

Format   Size    Quality   Speed
Q2_K     Tiny    Low       Fastest
Q4_K_M   Small   Good      Fast
Q5_K_M   Medium  Better    Medium
Q8_0     Large   Best      Slower
F16      Full    Original  Slowest
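To see why fewer bits means lower quality, the toy sketch below quantizes one block of weights with a single shared scale and measures the round-trip error. It illustrates the general idea only: this is plain symmetric absmax quantization, not the actual K-quant schemes used in GGUF files, and the sample weights are made up.

```python
def quantize_block(values, bits=4):
    """Symmetric absmax quantization of one block of floats to signed integers."""
    qmax = 2 ** (bits - 1) - 1               # e.g. 7 for 4-bit
    largest = max(abs(v) for v in values)
    scale = largest / qmax if largest > 0 else 1.0
    return [round(v / scale) for v in values], scale


def dequantize_block(quantized, scale):
    """Map the integers back to approximate float values."""
    return [q * scale for q in quantized]


def max_abs_error(values, bits):
    """Largest round-trip error for a block at the given bit width."""
    quantized, scale = quantize_block(values, bits)
    restored = dequantize_block(quantized, scale)
    return max(abs(a - b) for a, b in zip(values, restored))


weights = [0.12, -0.5, 0.33, 0.9, -0.77, 0.05, -0.01, 0.6]  # made-up sample block
for bits in (2, 4, 8):
    print(f"{bits}-bit: max error {max_abs_error(weights, bits):.4f}")
```

The error shrinks as the bit width grows, which mirrors the quality column in the table above.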

Option 3: vLLM (High Throughput)

Best for serving multiple requests efficiently.

pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

Option 4: Text Generation WebUI

Full-featured web interface:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh  # or start_windows.bat

Model Selection Guide

For Code

codellama:13b - Good balance of speed/quality
deepseek-coder:6.7b - Fast, specialized for code
starcoder2:15b - Excellent completion

For General Chat

llama3.1:8b - Best quality at this size
mistral:7b - Fast, good quality
gemma:7b - Google's efficient model

For RAG/Embeddings

nomic-embed-text - Fast embeddings
mxbai-embed-large - Higher quality

Performance Optimization

GPU Memory Management

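# Ollama: set the number of layers offloaded to the GPU with the num_gpu parameter
ollama run llama3.1:8b
>>> /set parameter num_gpu 35

# llama.cpp: Specify GPU layers (the binary is ./main in older builds)
./build/bin/llama-cli -m model.gguf -ngl 35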
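When calling Ollama over HTTP, per-request settings travel in the options object of the request body; num_gpu (layers offloaded to the GPU) and num_ctx (context window) are two of them. A minimal sketch of building such a body follows; the helper name build_generate_payload is hypothetical.

```python
from typing import Optional


def build_generate_payload(
    model: str,
    prompt: str,
    num_gpu: Optional[int] = None,
    num_ctx: Optional[int] = None,
) -> dict:
    """Build a JSON body for POST /api/generate with optional runtime options."""
    options = {}
    if num_gpu is not None:
        options["num_gpu"] = num_gpu   # number of layers to offload to the GPU
    if num_ctx is not None:
        options["num_ctx"] = num_ctx   # context window size in tokens
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    return payload


print(build_generate_payload("llama3.1:8b", "Hello", num_gpu=35))
```

Pass the resulting dict as the json argument of requests.post, as in the API Usage example.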

Context Length

# Extend context (uses more memory): set num_ctx inside the interactive session
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
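The extra memory comes mostly from the key/value cache, which grows linearly with context length. The sketch below estimates it, assuming 16-bit cache entries and the published Llama 3.1 8B shape (32 layers, 8 key/value heads, head dimension 128). Treat those figures and the function name as assumptions; actual usage depends on the runtime and any cache quantization.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Bytes needed to cache keys and values for n_ctx tokens."""
    # Factor 2: one tensor for keys and one for values in every layer
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_value


# Assumed model shape for Llama 3.1 8B
LLAMA31_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(n_ctx, **LLAMA31_8B) / 2**30
    print(f"{n_ctx} tokens: ~{gib:.2f} GiB of cache")
```

Under these assumptions each token costs 128 KiB, so quadrupling the context quadruples the cache.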

Batching

# vLLM batching example
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)

prompts = ["Hello", "How are you", "What is AI"]
outputs = llm.generate(prompts, sampling_params)

Docker Setup

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-data:

Comparison Summary

Tool       Ease    Speed      Features
Ollama     Easy    Good       Simple API
llama.cpp  Medium  Best       Low level
vLLM       Hard    Excellent  Production
WebUI      Easy    Good       Full GUI

Privacy Considerations

When running locally:

  1. No data leaves your network
  2. No usage logging (unless you add it)
  3. Full control over the model
  4. Compliance friendly for sensitive data