Running LLMs locally gives you privacy, no API costs, and offline capability. Here's how to set up your own AI infrastructure.
Why Run Locally?
| Benefit | Description |
|---|---|
| Privacy | Data never leaves your machine |
| Cost | No per-token charges |
| Offline | Works without internet |
| Customization | Fine-tune for your domain |
| Latency | No network round trip; generation speed depends on your hardware |
Hardware Requirements
Minimum for Small Models (7B)
- 16GB RAM
- 8GB+ VRAM (GPU) or fast CPU
- 20GB storage
Recommended for Medium Models (13-30B)
- 32GB RAM
- 16GB+ VRAM (RTX 3090, RTX 4080+)
- 50GB storage
For Large Models (70B+)
- 64GB+ RAM
- 48GB+ VRAM (A100 80GB or multiple GPUs)
- 150GB+ storage
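As a quick sanity check, the tiers above can be turned into a small lookup. This is only a sketch: the thresholds are copied from the lists above, the names `Tier`, `TIERS`, and `largest_tier` are made up for illustration, and it ignores the "fast CPU" alternative for 7B models.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    ram_gb: int
    vram_gb: int
    storage_gb: int


# Thresholds copied from the hardware lists; smallest tier first.
TIERS = [
    Tier("small (7B)", ram_gb=16, vram_gb=8, storage_gb=20),
    Tier("medium (13-30B)", ram_gb=32, vram_gb=16, storage_gb=50),
    Tier("large (70B+)", ram_gb=64, vram_gb=48, storage_gb=150),
]


def largest_tier(ram_gb: float, vram_gb: float, storage_gb: float) -> Optional[Tier]:
    """Return the largest tier whose three minimums are all met, or None."""
    best = None
    for tier in TIERS:
        if (ram_gb >= tier.ram_gb
                and vram_gb >= tier.vram_gb
                and storage_gb >= tier.storage_gb):
            best = tier
    return best


# Example: a workstation with 32GB RAM, a 24GB GPU, and 500GB of free disk.
tier = largest_tier(ram_gb=32, vram_gb=24, storage_gb=500)
print(tier.name if tier else "below the 7B minimum")
```

Treat the result as a starting point; quantization and context length can move a model up or down a tier.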
Option 1: Ollama (Easiest)
Installation
# Linux/macOS
curl -fsSL https://ollama.ai/install.sh | sh
# Start the service
ollama serve
Download and Run Models
# Pull a model
ollama pull llama3.1:8b
ollama pull codellama:13b
ollama pull mistral:7b
# Run interactively
ollama run llama3.1:8b
# List installed models
ollama list
API Usage
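import requests
# Ask the local Ollama server for a single, non-streamed completion
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1:8b',
        'prompt': 'Explain Docker containers',
        'stream': False,  # return one JSON object instead of a stream
    },
    timeout=300,  # local generation can be slow on modest hardware
)
response.raise_for_status()
print(response.json()['response'])
OpenAI-Compatible API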
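from openai import OpenAI
# Point the OpenAI client at Ollama's OpenAI-compatible endpoint
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama',  # required by the client but unused by Ollama
)
response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[{'role': 'user', 'content': 'Hello!'}],
)
print(response.choices[0].message.content)
Option 2: llama.cpp (Performance)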
Build from Source
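git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
# Current versions build with CMake (older checkouts used `make -j`)
cmake -B build
cmake --build build --config Release -j
# For CUDA support (older Makefile builds: make LLAMA_CUDA=1 -j)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j
# For Metal (macOS): enabled by default in current builds
Run Models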
# Download a GGUF model
wget https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_K_M.gguf
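# Run interactive chat (the binary is llama-cli in current builds, under
# build/bin/ with CMake; older builds named it ./main)
./build/bin/llama-cli -m llama-2-7b.Q4_K_M.gguf \
  -n 256 \
  --repeat-penalty 1.1 \
  -i -r "User:" \
  -p "User: Hello!\nAssistant:"
Quantization Levels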
| Format | Size | Quality | Speed |
|---|---|---|---|
| Q2_K | Tiny | Low | Fastest |
| Q4_K_M | Small | Good | Fast |
| Q5_K_M | Medium | Better | Medium |
| Q8_0 | Large | Best | Slower |
| F16 | Full | Original | Slowest |
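To put rough numbers on the Size column: file size is approximately parameter count times bits per weight. The sketch below assumes ballpark bits-per-weight figures. Q8_0 (8.5) and F16 (16) follow from the format definitions, while the K-quant values are approximations that vary by model and llama.cpp version. The names `BITS_PER_WEIGHT` and `estimate_size_gb` are illustrative only.

```python
# Approximate bits per weight for each format. The K-quant figures are
# assumptions (mixed-precision formats differ from model to model).
BITS_PER_WEIGHT = {
    "Q2_K": 3.3,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
    "F16": 16.0,
}


def estimate_size_gb(params_billion: float, fmt: str) -> float:
    """Estimate model file size in GB (10^9 bytes) for a quantization format."""
    bits = params_billion * 1e9 * BITS_PER_WEIGHT[fmt]
    return bits / 8 / 1e9


# Compare formats for a 7B-parameter model.
for fmt in BITS_PER_WEIGHT:
    print(f"{fmt:>7}: ~{estimate_size_gb(7, fmt):.1f} GB")
```

The same estimate is a reasonable lower bound for the memory needed to load the weights; the KV cache and runtime buffers come on top.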
Option 3: vLLM (High Throughput)
Best for serving multiple requests efficiently.
pip install vllm
# Start server
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-2-7b-chat-hf \
  --port 8000
Option 4: Text Generation WebUI
Full-featured web interface:
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
./start_linux.sh # or start_windows.bat
Model Selection Guide
For Code
codellama:13b - Good balance of speed/quality
deepseek-coder:6.7b - Fast, specialized for code
starcoder2:15b - Excellent completion
For General Chat
llama3.1:8b - Best quality at this size
mistral:7b - Fast, good quality
gemma:7b - Google's efficient model
For RAG/Embeddings
nomic-embed-text - Fast embeddings
mxbai-embed-large - Higher quality
Performance Optimization
GPU Memory Management
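How many layers to offload depends on how much VRAM is free. A rough way to pick a starting value is to assume every layer is the same size and keep some headroom for the KV cache and scratch buffers. Both are simplifying assumptions, and `layers_that_fit` and its `reserve_gb` default are hypothetical names and values, not part of Ollama or llama.cpp.

```python
def layers_that_fit(model_size_gb: float, n_layers: int,
                    free_vram_gb: float, reserve_gb: float = 1.5) -> int:
    """Estimate how many model layers fit in VRAM.

    Assumes equal-sized layers and reserves `reserve_gb` for the KV cache
    and runtime buffers (an assumed figure; tune it for your setup).
    """
    per_layer_gb = model_size_gb / n_layers
    usable_gb = max(free_vram_gb - reserve_gb, 0.0)
    return min(n_layers, int(usable_gb // per_layer_gb))


# Example: an 8GB model with 32 layers on a GPU with 6GB free.
print(layers_that_fit(8.0, 32, free_vram_gb=6.0))
```

If the model fails to load or runs out of memory partway through, lower the value and try again.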
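# Ollama: set the number of GPU layers from inside an interactive session
ollama run llama3.1:8b
>>> /set parameter num_gpu 35
# llama.cpp: specify GPU layers (llama-cli in current builds, ./main in older ones)
./build/bin/llama-cli -m model.gguf -ngl 35
Context Length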
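Most of the extra memory for a longer context goes to the KV cache, which grows linearly with the number of tokens. A back-of-the-envelope sketch, assuming a 16-bit cache and the published Llama 3.1 8B shape (32 layers, 8 KV heads, head dimension 128); check your model's config before relying on these figures. `kv_cache_bytes` is an illustrative helper, not a library function.

```python
def kv_cache_bytes(n_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    """Size of the KV cache for one sequence of n_ctx tokens."""
    # Factor of 2: every layer stores one key tensor and one value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * n_ctx * bytes_per_elem


# Assumed model shape for Llama 3.1 8B.
LLAMA_3_1_8B = dict(n_layers=32, n_kv_heads=8, head_dim=128)

for n_ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(n_ctx, **LLAMA_3_1_8B) / 2**30
    print(f"{n_ctx:>6} tokens -> {gib:.2f} GiB")
```

Under these assumptions an 8192-token context costs about 1 GiB on top of the weights, and quadrupling the context quadruples that.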
# Extend context (uses more memory): set it inside an interactive session
ollama run llama3.1:8b
>>> /set parameter num_ctx 8192
Batching
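# vLLM batching example
from vllm import LLM, SamplingParams
llm = LLM(model="meta-llama/Llama-2-7b-chat-hf")
sampling_params = SamplingParams(temperature=0.8, max_tokens=100)
prompts = ["Hello", "How are you", "What is AI"]
# generate() processes all prompts together and returns one result per prompt
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
Docker Setup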
# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  ollama-data:
Comparison Summary
| Tool | Ease | Speed | Features |
|---|---|---|---|
| Ollama | Easy | Good | Simple API |
| llama.cpp | Medium | Best | Low level |
| vLLM | Hard | Excellent | Production |
| WebUI | Easy | Good | Full GUI |
Privacy Considerations
When running locally:
- No data leaves your network
- No usage logging (unless you add it)
- Full control over the model
- Compliance-friendly for sensitive data
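These guarantees only hold if your client really is talking to a local server. One way to guard against a misconfigured base URL is to check the host before sending any prompt. This is a minimal sketch: `is_local_endpoint` is a hypothetical helper, and it deliberately treats any DNS name other than `localhost` as non-local because it does not resolve names.

```python
import ipaddress
from urllib.parse import urlparse


def is_local_endpoint(url: str) -> bool:
    """Return True if the URL points at localhost or a loopback/private IP."""
    host = urlparse(url).hostname
    if host is None:
        return False
    if host == "localhost":
        return True
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # A DNS name: not resolved here, so conservatively treated as remote.
        return False
    return addr.is_loopback or addr.is_private


# The Ollama and vLLM ports used in this guide, plus a remote host.
for url in ("http://localhost:11434/v1",
            "http://192.168.1.20:8000/v1",
            "https://api.example.com/v1"):
    print(url, "->", "local" if is_local_endpoint(url) else "NOT local")
```

Calling this once at startup and refusing to run on a non-local URL keeps a stray environment variable from quietly routing prompts to a hosted API.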