Ollama: Local AI Made Simple | AI Tools

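Ollama makes running local LLMs as easy as using Docker. This guide covers installation, model management, the HTTP API, Python integration, custom Modelfiles, embeddings, Docker deployment, and performance tuning.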

Quick Start

Installation

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Windows
# Download from https://ollama.ai/download

First Run

# Start the service
ollama serve

# In another terminal, pull and run a model
ollama run llama3.1:8b

# Chat directly
>>> Hello! What can you help me with?

Model Management

Available Models

# Search models at ollama.ai/library
# Common models:

ollama pull llama3.1:8b        # Meta's Llama 3.1 (8B parameters)
ollama pull codellama:13b      # Code-focused
ollama pull mistral:7b         # Fast, good quality for its size
ollama pull mixtral:8x7b       # MoE architecture
ollama pull phi3:mini          # Microsoft's small model
ollama pull gemma:7b           # Google's model

Model Information

# List installed models
ollama list

# Show model details
ollama show llama3.1:8b

# Remove a model (use a tag shown by `ollama list`)
ollama rm codellama:13b

API Reference

Generate Completion

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
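
With "stream": false the server answers with a single JSON object. Besides the response text, the Ollama API documentation describes timing fields such as eval_count (tokens generated) and eval_duration (nanoseconds). The following is a minimal standard-library sketch of sending the same request from Python and deriving tokens per second; the helper names build_generate_request, tokens_per_second, and generate are hypothetical, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # same endpoint as the curl call


def build_generate_request(model, prompt, url=OLLAMA_URL):
    """Build (but do not send) a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def tokens_per_second(result):
    """Derive generation speed from the timing fields of a response dict.

    Returns None when the fields are missing or zero.
    """
    count = result.get("eval_count")
    duration_ns = result.get("eval_duration")
    if not count or not duration_ns:
        return None
    return count * 1e9 / duration_ns


def generate(model, prompt):
    """Send the request; this call needs a running Ollama server."""
    with urllib.request.urlopen(build_generate_request(model, prompt)) as resp:
        return json.loads(resp.read())
```

With a server running, `result = generate("llama3.1:8b", "Why is the sky blue?")` returns the decoded object, and `tokens_per_second(result)` gives a rough speed figure for comparing models or quantizations.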

Chat Completion

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}'

Streaming Response

import requests
import json

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1:8b',
        'prompt': 'Write a haiku about programming',
        'stream': True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get('response', ''), end='', flush=True)
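
The streaming endpoint sends one JSON object per line, and the last object has "done": true. When the full text is needed after streaming, the fragments can be accumulated instead of printed. A small sketch under those assumptions follows; collect_stream is a hypothetical helper, and the "error" check reflects how the API documentation describes mid-stream errors.

```python
import json


def collect_stream(lines):
    """Join the 'response' fragments from an iterable of JSON lines.

    Returns (full_text, final_chunk). final_chunk is the object with
    done=true, or None if the stream ended before one arrived.
    """
    parts = []
    for raw in lines:
        if not raw:  # skip empty keep-alive lines
            continue
        chunk = json.loads(raw)
        if "error" in chunk:
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            return "".join(parts), chunk
    return "".join(parts), None
```

Applied to the request above, this would be `text, final = collect_stream(response.iter_lines())`.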

Python Integration

Using requests

import requests

def chat(prompt, model='llama3.1:8b'):
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': prompt,
            'stream': False
        }
    )
    return response.json()['response']

print(chat("Explain Docker in one sentence."))

Using Official Library

import ollama

# Simple generation
response = ollama.generate(
    model='llama3.1:8b',
    prompt='What is machine learning?'
)
print(response['response'])

# Chat (include earlier turns in `messages` to keep history)
messages = [
    {'role': 'user', 'content': 'Why is Python popular?'}
]
response = ollama.chat(model='llama3.1:8b', messages=messages)
print(response['message']['content'])
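
The chat endpoint is stateless, so keeping a conversation going means sending the earlier turns back each time. One way to sketch that is a helper that appends both the user message and the model's reply to the same list. chat_turn is a hypothetical name, and chat_fn stands for any callable shaped like ollama.chat, which keeps the sketch testable without a server.

```python
def chat_turn(messages, user_text, chat_fn, model="llama3.1:8b"):
    """Append a user message, call the model, and record its reply.

    The list is modified in place so the next call sees the full history.
    """
    messages.append({"role": "user", "content": user_text})
    response = chat_fn(model=model, messages=messages)
    reply = response["message"]["content"]
    messages.append({"role": "assistant", "content": reply})
    return reply
```

With the official library this would be called as `chat_turn(history, "Why is Python popular?", ollama.chat)`, followed by further calls on the same `history` list.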

OpenAI-Compatible Client

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but not used
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a coding assistant.'},
        {'role': 'user', 'content': 'Write a Python hello world.'}
    ]
)
print(response.choices[0].message.content)

Custom Models (Modelfiles)

Create Custom Model

# Modelfile
FROM llama3.1:8b

# Set the system prompt
SYSTEM """You are a senior software engineer. You write clean,
efficient code with helpful comments. Always consider edge cases
and security implications."""

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Build the custom model
ollama create code-assistant -f Modelfile

# Use it
ollama run code-assistant
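
When several variants of a custom model are needed (for example, different system prompts over the same base), the Modelfile text can be generated rather than written by hand. The sketch below uses only the FROM, SYSTEM, and PARAMETER instructions shown above; render_modelfile is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base, system, parameters):
    """Return Modelfile text with FROM, SYSTEM, and PARAMETER lines."""
    if '"""' in system:
        # A triple quote would terminate the SYSTEM block early.
        raise ValueError('system prompt must not contain """')
    lines = [f"FROM {base}", "", f'SYSTEM """{system}"""', ""]
    lines += [f"PARAMETER {name} {value}" for name, value in parameters.items()]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    "llama3.1:8b",
    "You are a senior software engineer.",
    {"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096},
)
```

Writing `modelfile_text` to a file named Modelfile and running the `ollama create` command above would then build the variant.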

Parameters Reference

Parameter	Default	Description
temperature	0.8	Creativity (0=focused, 1=creative)
top_p	0.9	Nucleus sampling threshold
top_k	40	Top-k sampling
num_ctx	2048	Context window size
repeat_penalty	1.1	Penalize repetition
seed	random	For reproducible output
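
These parameters do not have to be baked into a Modelfile: the generate and chat endpoints also accept them per request in an "options" object. A minimal sketch of building such a payload follows. generate_payload and the ALLOWED set are hypothetical; the set is limited to the parameters in the table above and is not a complete list.

```python
ALLOWED = {"temperature", "top_p", "top_k", "num_ctx", "repeat_penalty", "seed"}


def generate_payload(model, prompt, **options):
    """Build a request body for /api/generate with per-request options."""
    unknown = set(options) - ALLOWED
    if unknown:
        # Catch typos locally instead of sending a misspelled option.
        raise ValueError(f"unknown options: {sorted(unknown)}")
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    return payload


# Fixed seed and zero temperature for more repeatable output.
payload = generate_payload("llama3.1:8b", "Why is the sky blue?", temperature=0, seed=42)
```

The resulting dictionary can be passed as the `json=` argument of `requests.post`, as in the Python examples elsewhere in this guide.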

Embeddings

import ollama

# Generate embeddings
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is a subset of AI.'
)
embedding = response['embedding']
print(f"Embedding dimension: {len(embedding)}")
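
Embeddings are mostly useful for comparing texts. A common measure is cosine similarity, sketched here in plain Python so it runs without extra dependencies; cosine_similarity is a hypothetical helper, and for large collections a vector database or NumPy would be the more practical choice.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors.

    Returns 0.0 when either vector has zero length.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

Given two embeddings produced by the same model, `cosine_similarity(embedding_a, embedding_b)` yields a value between -1 and 1, with values closer to 1 indicating more similar texts.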

Docker Deployment

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    # For GPU support:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-data:
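
Once the container is up, a quick check that the API is reachable is to ask it which models it has; the /api/tags endpoint returns them under a "models" key. The sketch below assumes the port mapping from the compose file above. parse_model_names and list_models are hypothetical helper names.

```python
import json
import urllib.request


def parse_model_names(tags_response):
    """Extract model names from a decoded /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_models(base_url="http://localhost:11434", timeout=5):
    """Query a running server; raises URLError if it is unreachable."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
        return parse_model_names(json.loads(resp.read()))
```

A fresh volume has no models, so `list_models()` returning an empty list is expected at first. With the service name used above, a model can be pulled inside the container with `docker compose exec ollama ollama pull llama3.1:8b`.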

Performance Tuning

Environment Variables

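# List the environment variables the server reads
ollama serve --help

# GPU layers, thread count, and context size are documented as model
# parameters (num_gpu, num_thread, num_ctx), not server environment
# variables. Set them in a Modelfile or in the API "options" field:
#   PARAMETER num_gpu 35
#   PARAMETER num_thread 8
#   PARAMETER num_ctx 4096

# Enable debug logging
export OLLAMA_DEBUG=1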

Memory Optimization

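# Run with a specific context size (set inside the interactive session)
ollama run llama3.1:8b
>>> /set parameter num_ctx 2048

# Use a smaller quantized model (check the available tags at ollama.ai/library)
ollama run llama3.1:8b-instruct-q4_0  # 4-bit quantization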
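
To see why quantization matters, the size of the weights can be estimated as parameter count times bits per weight. This is a back-of-the-envelope sketch only: estimate_weight_gb is a hypothetical helper, real quantized files are somewhat larger because the formats store extra scaling data, and the context cache, which grows with num_ctx, comes on top.

```python
def estimate_weight_gb(params_billion, bits_per_weight):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return params_billion * bits_per_weight / 8


for bits in (16, 8, 4):
    print(f"8B model at {bits}-bit: ~{estimate_weight_gb(8, bits):.0f} GB")
```

For an 8B model this gives roughly 16 GB at 16-bit, 8 GB at 8-bit, and 4 GB at 4-bit, which is why a 4-bit variant is usually the first thing to try on limited memory.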

Common Use Cases

Code Review Bot

import ollama

def review_code(code):
    """Ask a code-focused model to review a snippet and return its feedback."""
    response = ollama.generate(
        model='codellama:13b',
        prompt=f"""Review this code for bugs, security issues,
and style improvements:

{code}

Provide specific, actionable feedback."""
    )
    return response['response']

Local RAG

import ollama
import chromadb

# Store documents
client = chromadb.Client()
collection = client.create_collection("docs")

def add_document(text, doc_id):
    embedding = ollama.embeddings(
        model='nomic-embed-text',
        prompt=text
    )['embedding']
    collection.add(
        embeddings=[embedding],
        documents=[text],
        ids=[doc_id]
    )

def query(question):
    # Get embedding for question
    q_embedding = ollama.embeddings(
        model='nomic-embed-text',
        prompt=question
    )['embedding']

    # Find similar documents
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=3
    )

    # Generate answer with context
    context = "\n".join(results['documents'][0])
    response = ollama.generate(
        model='llama3.1:8b',
        prompt=f"""Based on this context:
{context}

Answer: {question}"""
    )
    return response['response']
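
The example above embeds each document as a single unit, which works for short texts. Longer documents are usually split into overlapping chunks first so that each embedding covers a focused passage. A simple character-based sketch follows; chunk_text and its default sizes are hypothetical choices, and splitting on sentences or tokens is a common refinement.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into chunks of chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence
    cut at a boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Combined with add_document from above, a long text could be stored with `for i, chunk in enumerate(chunk_text(long_text)): add_document(chunk, f"doc-{i}")`.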

beginner Tools & APIs Updated 2024-12-18
