Ollama makes running local LLMs as easy as using Docker. Here's everything you need to know.
Quick Start
Installation
# Linux
curl -fsSL https://ollama.ai/install.sh | sh
# macOS
brew install ollama
# Windows
# Download from https://ollama.ai/download
First Run
# Start the service
ollama serve
# In another terminal, pull and run a model
ollama run llama3.1:8b
# Chat directly
>>> Hello! What can you help me with?
Model Management
Available Models
# Search models at ollama.ai/library
# Common models:
ollama pull llama3.1:8b # Meta's Llama 3.1 (8B)
ollama pull codellama:13b # Code-focused
ollama pull mistral:7b # Fast, quality
ollama pull mixtral:8x7b # MoE architecture
ollama pull phi3:mini # Microsoft's small model
ollama pull gemma:7b # Google's model
Model Information
# List installed models
ollama list
# Show model details
ollama show llama3.1:8b
# Remove a model
ollama rm codellama:13b
API Reference
Generate Completion
curl http://localhost:11434/api/generate -d '{
"model": "llama3.1:8b",
"prompt": "Why is the sky blue?",
"stream": false
}'
Chat Completion
curl http://localhost:11434/api/chat -d '{
"model": "llama3.1:8b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
"stream": false
}'
Streaming Response
import requests
import json
response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1:8b',
        'prompt': 'Write a haiku about programming',
        'stream': True
    },
    stream=True
)
for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get('response', ''), end='', flush=True)
Python Integration
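When consuming the streaming endpoint from Python, it often helps to collect the fragments into one string and stop at the final chunk. The sketch below assumes each streamed line is a JSON object carrying a `response` fragment and a `done` flag, as in the streaming loop above. `collect_stream` is a hypothetical helper, and the sample lines are hand-written stand-ins for a live response so the block runs without a server.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the 'response' fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        if not line:  # iter_lines() can yield empty keep-alive lines
            continue
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):  # the final chunk is marked done
            break
    return "".join(parts)


# Hand-written sample chunks standing in for response.iter_lines()
sample = [
    b'{"response": "Hello", "done": false}',
    b"",
    b'{"response": ", world", "done": false}',
    b'{"response": "", "done": true}',
]
full_text = collect_stream(sample)
print(full_text)
```

With a real request, passing `response.iter_lines()` in place of `sample` should behave the same way, provided the server is running and the model is installed.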
Using requests
import requests
def chat(prompt, model='llama3.1:8b'):
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': prompt,
            'stream': False
        }
    )
    # Fail early on HTTP errors instead of hitting a KeyError below
    response.raise_for_status()
    return response.json()['response']
print(chat("Explain Docker in one sentence."))
Using Official Library
import ollama
# Simple generation
response = ollama.generate(
model='llama3.1:8b',
prompt='What is machine learning?'
)
print(response['response'])
# Chat with history
messages = [
    {'role': 'user', 'content': 'Why is Python popular?'}
]
response = ollama.chat(model='llama3.1:8b', messages=messages)
print(response['message']['content'])
OpenAI-Compatible Client
from openai import OpenAI
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but not used
)
response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a coding assistant.'},
        {'role': 'user', 'content': 'Write a Python hello world.'}
    ]
)
print(response.choices[0].message.content)
Custom Models (Modelfiles)
Create Custom Model
# Modelfile
FROM llama3.1:8b
# Set the system prompt
SYSTEM """You are a senior software engineer. You write clean,
efficient code with helpful comments. Always consider edge cases
and security implications."""
# Adjust parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
# Build the custom model
ollama create code-assistant -f Modelfile
# Use it
ollama run code-assistant
Parameters Reference
| Parameter | Default | Description |
|---|---|---|
| temperature | 0.8 | Creativity (0=focused, 1=creative) |
| top_p | 0.9 | Nucleus sampling threshold |
| top_k | 40 | Top-k sampling |
| num_ctx | 2048 | Context window size |
| repeat_penalty | 1.1 | Penalize repetition |
| seed | random | For reproducible output |
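The parameters in the table can also be set per request: the generate and chat endpoints accept an `options` object alongside `model` and `prompt`. Here is a minimal sketch of building such a request body. `build_generate_payload` is a hypothetical helper, not part of any Ollama library, and it only knows about the parameters listed in the table.

```python
import json

# Option names taken from the table above
KNOWN_OPTIONS = {
    "temperature", "top_p", "top_k", "num_ctx", "repeat_penalty", "seed",
}


def build_generate_payload(model, prompt, **overrides):
    """Return a request body for /api/generate with per-request options."""
    unknown = set(overrides) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"Unknown option(s): {sorted(unknown)}")
    payload = {"model": model, "prompt": prompt, "stream": False}
    if overrides:
        # Anything not listed here falls back to the model's own defaults
        payload["options"] = dict(overrides)
    return payload


payload = build_generate_payload(
    "llama3.1:8b", "Why is the sky blue?", temperature=0.2, seed=42
)
print(json.dumps(payload, indent=2))
```

The resulting dict can be passed as the `json=` argument to `requests.post`, in the same way as the request examples earlier in this guide.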
Embeddings
import ollama
# Generate embeddings
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is a subset of AI.'
)
embedding = response['embedding']
print(f"Embedding dimension: {len(embedding)}")
Docker Deployment
# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    # For GPU support:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
volumes:
  ollama-data:
Performance Tuning
Environment Variables
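# GPU layers, thread count and context window are model parameters
# (num_gpu, num_thread, num_ctx). Set them per model in a Modelfile,
# or per request through the API "options" field, rather than relying
# on environment variables:
#   PARAMETER num_gpu 35
#   PARAMETER num_thread 8
#   PARAMETER num_ctx 4096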
# Enable debug logging
export OLLAMA_DEBUG=1
Memory Optimization
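A back-of-the-envelope estimate shows why quantization is the main memory lever. The sketch below assumes the weights dominate memory use and ignores the context cache and runtime overhead, so real usage will be higher. `weight_memory_gb` is a hypothetical helper, and the bit widths are nominal values rather than the exact sizes of any particular quantization format.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in gigabytes (10^9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


# An 8-billion-parameter model at common nominal precisions
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(8e9, bits):.0f} GB")
```

By this estimate, an 8B model needs roughly 4 GB for its weights at 4 bits per weight, compared with about 16 GB at 16 bits.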
# Set a smaller context size from inside an interactive session
ollama run llama3.1:8b
>>> /set parameter num_ctx 2048
# Use smaller quantized model
ollama run llama3.1:8b-instruct-q4_0 # 4-bit quantization
Common Use Cases
Code Review Bot
import ollama
def review_code(code):
    # Ask a code-focused model for a review of the given source text
    response = ollama.generate(
        model='codellama:13b',
        prompt=f"""Review this code for bugs, security issues,
and style improvements:
{code}
Provide specific, actionable feedback."""
    )
    return response['response']
Local RAG
import ollama
import chromadb
# Store documents
client = chromadb.Client()
collection = client.create_collection("docs")
def add_document(text, doc_id):
    embedding = ollama.embeddings(
        model='nomic-embed-text',
        prompt=text
    )['embedding']
    collection.add(
        embeddings=[embedding],
        documents=[text],
        ids=[doc_id]
    )
def query(question):
    # Get embedding for question
    q_embedding = ollama.embeddings(
        model='nomic-embed-text',
        prompt=question
    )['embedding']
    # Find similar documents
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=3
    )
    # Generate answer with context
    context = "\n".join(results['documents'][0])
    response = ollama.generate(
        model='llama3.1:8b',
        prompt=f"""Based on this context:
{context}
Answer: {question}"""
    )
    return response['response']
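The "find similar documents" step in the RAG example can be illustrated without any dependencies. The sketch below ranks documents by cosine similarity, one common metric; which metric a given vector store actually uses depends on its configuration. The three-dimensional vectors are made-up stand-ins for real embeddings, and `top_k_similar` is a hypothetical helper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k_similar(query_vec, docs, k=3):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in docs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors standing in for real embedding output
docs = {
    "docker": [0.9, 0.1, 0.0],
    "python": [0.1, 0.9, 0.0],
    "cooking": [0.0, 0.1, 0.9],
}
results = top_k_similar([0.8, 0.2, 0.0], docs, k=2)
print(results)
```

Real embeddings have hundreds of dimensions, but the ranking logic is the same: the documents whose vectors point in the most similar direction to the question's vector are the ones passed to the model as context.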