Running LLMs Locally | AI Tools

Running LLMs locally gives you privacy, no API costs, and offline capability. Here's how to set up your own AI infrastructure.

Why Run Locally?

Benefit	Description
Privacy	Data never leaves your machine
Cost	No per-token charges
Offline	Works without internet
Customization	Fine-tune for your domain
Speed	No network latency

Hardware Requirements

Minimum for Small Models (7B)

16GB RAM
8GB+ VRAM (GPU) or fast CPU
20GB storage

Recommended for Medium Models (13-30B)

32GB RAM
16GB+ VRAM (RTX 3090, RTX 4080+)
50GB storage

For Large Models (70B+)

64GB+ RAM
48GB+ VRAM (A100, multiple GPUs)
150GB+ storage

Option 1: Ollama (Easiest)

Installation

# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh

# Start the service
ollama serve

Download and Run Models

# Pull a model
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull mistral:7b

# Run interactively
ollama run llama3.1:8b

# List installed models
ollama list

API Usage

import requests

response = requests.post('http://localhost:11434/api/generate', json={
    'model': 'llama3.1:8b',
    'prompt': 'Explain Docker containers',
    'stream': False
})
print(response.json()['response'])

OpenAI-Compatible API

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # required but unused
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Hello!'}]
)

Option 2: llama.cpp (Performance)

Build from Source

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j

# For CUDA support
make LLAMA_CUDA=1 -j

# For Metal (macOS)
make LLAMA_METAL=1 -j

Run Models

# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf

# Run interactive chat
./main -m llama-2-7b.Q4_K_M.gguf \
    -n 256 \
    --repeat_penalty 1.1 \
    -i -r "User:" \
    -p "User: Hello!\nAssistant:"

Quantization Levels

Format	Size	Quality	Speed
Q2_K	Tiny	Low	Fastest
Q4_K_M	Small	Good	Fast
Q5_K_M	Medium	Better	Medium
Q8_0	Large	Best	Slower
F16	Full	Original	Slowest

Option 3: vLLM (High Throughput)

Best for serving multiple requests efficiently.

pip install vllm

# Start server
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Llama-2-7b-chat-hf \
    --port 8000

Option 4: Text Generation WebUI

Full-featured web interface:

git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh  # or start_windows.bat

Model Selection Guide

For Code

codellama:13b - Good balance of speed/quality
deepseek-coder:6.7b - Fast, specialized for code
starcoder2:15b - Excellent completion

For General Chat

llama3.1:8b - Best quality at this size
mistral:7b - Fast, good quality
gemma:7b - Google's efficient model

For RAG/Embeddings

nomic-embed-text - Fast embeddings
mxbai-embed-large - Higher quality

Performance Optimization

GPU Memory Management

# Ollama: Set GPU layers
OLLAMA_NUM_GPU=35 ollama run llama3.1:8b

# llama.cpp: Specify GPU layers
./main -m model.gguf -ngl 35

Context Length

# Extend context (uses more memory)
ollama run llama3.1:8b --num-ctx 8192

Batching

# vLLM batching example
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)

prompts = ["Hello", "How are you", "What is AI"]
outputs = llm.generate(prompts, sampling_params)

Docker Setup

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-data:

Comparison Summary

Tool	Ease	Speed	Features
Ollama	Easy	Good	Simple API
llama.cpp	Medium	Best	Low level
vLLM	Hard	Excellent	Production
WebUI	Easy	Good	Full GUI

Privacy Considerations

When running locally:

No data leaves your network
No usage logging (unless you add it)
Full control over the model
Compliance friendly for sensitive data

advanced LLM Comparison Updated 2024-12-18

local llm
ollama
llama.cpp
self-hosted ai
private llm