HxHippy

Ollama: Local AI Made Simple

Run open-source LLMs locally with Ollama's simple setup and API.

Last updated: 2024-12-18

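Ollama makes running local LLMs as easy as using Docker. This guide covers installation, model management, the HTTP API, Python integration, custom Modelfiles, embeddings, Docker deployment, and performance tuning.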

Quick Start

Installation

# Linux
curl -fsSL https://ollama.ai/install.sh | sh

# macOS
brew install ollama

# Windows
# Download from https://ollama.ai/download

First Run

# Start the service (skip this if the installer or desktop app already runs it in the background)
ollama serve

# In another terminal, pull and run a model
ollama run llama3.1:8b

# Chat directly
>>> Hello! What can you help me with?

Model Management

Available Models

# Search models at ollama.ai/library
# Common models:

ollama pull llama3.1:8b        # Meta's Llama 3.1, 8B parameters
ollama pull codellama:13b      # Code-focused
ollama pull mistral:7b         # Fast general-purpose 7B model
ollama pull mixtral:8x7b       # MoE architecture
ollama pull phi3:mini          # Microsoft's small model
ollama pull gemma:7b           # Google's model

Model Information

# List installed models
ollama list

# Show model details
ollama show llama3.1:8b

# Remove a model
ollama rm codellama:13b

API Reference

Generate Completion

curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1:8b",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
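
With "stream": false the reply is a single JSON object. Besides the response text it carries counters such as eval_count and eval_duration (reported in nanoseconds), which give a rough generation speed. A minimal sketch using only the standard library; the helper names generate and tokens_per_second are illustrative, not part of Ollama:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint


def generate(prompt, model="llama3.1:8b", url=OLLAMA_URL, timeout=120):
    """Send a non-streaming generate request and return the parsed JSON reply."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.loads(reply.read())


def tokens_per_second(result):
    """Compute generation speed from eval_count and eval_duration (nanoseconds)."""
    duration_ns = result.get("eval_duration", 0)
    if duration_ns <= 0:
        return 0.0
    return result.get("eval_count", 0) / (duration_ns / 1e9)
```

With a server running, result = generate("Why is the sky blue?") followed by tokens_per_second(result) reports how fast the model produced its answer on your hardware.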

Chat Completion

curl http://localhost:11434/api/chat -d '{
  "model": "llama3.1:8b",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"}
  ],
  "stream": false
}'

Streaming Response

import requests
import json

response = requests.post(
    'http://localhost:11434/api/generate',
    json={
        'model': 'llama3.1:8b',
        'prompt': 'Write a haiku about programming',
        'stream': True
    },
    stream=True
)

for line in response.iter_lines():
    if line:
        data = json.loads(line)
        print(data.get('response', ''), end='', flush=True)
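
Each streamed line is a standalone JSON object, and the last one has "done": true. If you want the full text rather than a live printout, the chunks can be joined. A sketch kept separate from the network call so it works on any iterable of lines; collect_stream is a hypothetical helper name, and the check for an "error" key is an assumption about how failures show up mid-stream:

```python
import json


def collect_stream(lines):
    """Join the 'response' fragments from an iterable of JSON lines.

    Accepts bytes or str lines (for example response.iter_lines()),
    skips blank lines, and stops at the chunk marked done.
    """
    parts = []
    for line in lines:
        if not line:
            continue
        chunk = json.loads(line)
        if "error" in chunk:
            # Assumption: an error mid-stream arrives as {"error": "..."}
            raise RuntimeError(chunk["error"])
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)
```

With the request above, text = collect_stream(response.iter_lines()) returns the whole haiku as one string.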

Python Integration

Using requests

import requests

def chat(prompt, model='llama3.1:8b'):
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            'model': model,
            'prompt': prompt,
            'stream': False
        }
    )
    return response.json()['response']

print(chat("Explain Docker in one sentence."))

Using Official Library

import ollama

# Simple generation
response = ollama.generate(
    model='llama3.1:8b',
    prompt='What is machine learning?'
)
print(response['response'])

# Chat with history
messages = [
    {'role': 'user', 'content': 'Why is Python popular?'}
]
response = ollama.chat(model='llama3.1:8b', messages=messages)
print(response['message']['content'])
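
The chat endpoint is stateless, so "history" means resending every earlier message with each new turn. A small sketch of that bookkeeping; the Conversation class is hypothetical, and the send callable is injected so that ollama.chat (or a stub) can be passed in:

```python
class Conversation:
    """Keep the running message list that the stateless chat API expects."""

    def __init__(self, send, model="llama3.1:8b", system=None):
        # `send` is any callable taking model= and messages= and returning
        # a reply shaped like ollama.chat's: {'message': {'content': ...}}.
        self.send = send
        self.model = model
        self.messages = []
        if system:
            self.messages.append({"role": "system", "content": system})

    def ask(self, text):
        """Send one user turn plus all prior turns, then record the answer."""
        self.messages.append({"role": "user", "content": text})
        reply = self.send(model=self.model, messages=self.messages)
        answer = reply["message"]["content"]
        self.messages.append({"role": "assistant", "content": answer})
        return answer
```

Usage would look like chat = Conversation(ollama.chat), then chat.ask("Why is Python popular?") followed by chat.ask("Give one downside."); the second call sees the first exchange.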

OpenAI-Compatible Client

from openai import OpenAI

client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # Required but not used
)

response = client.chat.completions.create(
    model='llama3.1:8b',
    messages=[
        {'role': 'system', 'content': 'You are a coding assistant.'},
        {'role': 'user', 'content': 'Write a Python hello world.'}
    ]
)
print(response.choices[0].message.content)

Custom Models (Modelfiles)

Create Custom Model

# Modelfile
FROM llama3.1:8b

# Set the system prompt
SYSTEM """You are a senior software engineer. You write clean,
efficient code with helpful comments. Always consider edge cases
and security implications."""

# Adjust parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

# Build the custom model
ollama create code-assistant -f Modelfile

# Use it
ollama run code-assistant

Parameters Reference

Parameter        Default   Description
temperature      0.8       Creativity (0=focused, 1=creative)
top_p            0.9       Nucleus sampling threshold
top_k            40        Top-k sampling
num_ctx          2048      Context window size
repeat_penalty   1.1       Penalize repetition
seed             random    For reproducible output
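
These parameters do not have to be baked into a Modelfile. The API also accepts them per request in an "options" object. A sketch of building such a request body; build_generate_payload and KNOWN_OPTIONS are illustrative names, and the allow-list only mirrors the table above (Ollama accepts more options than these):

```python
import json

# Parameter names taken from the table above; Ollama accepts more than these.
KNOWN_OPTIONS = {"temperature", "top_p", "top_k", "num_ctx", "repeat_penalty", "seed"}


def build_generate_payload(prompt, model="llama3.1:8b", **options):
    """Build a /api/generate request body with per-request options."""
    unknown = set(options) - KNOWN_OPTIONS
    if unknown:
        raise ValueError(f"unrecognized option(s): {sorted(unknown)}")
    payload = {"model": model, "prompt": prompt, "stream": False}
    if options:
        payload["options"] = options
    return payload


# Low temperature plus a fixed seed for more repeatable output
print(json.dumps(build_generate_payload("Hi", temperature=0.2, seed=42), indent=2))
```

The resulting dictionary can be passed as json= to requests.post against /api/generate, as in the examples earlier.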

Embeddings

import ollama

# Generate embeddings
response = ollama.embeddings(
    model='nomic-embed-text',
    prompt='Machine learning is a subset of AI.'
)
embedding = response['embedding']
print(f"Embedding dimension: {len(embedding)}")
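
An embedding on its own is just a list of floats; it becomes useful when compared with another one. Cosine similarity is a common choice for that comparison. A pure-Python sketch (cosine_similarity is a helper defined here, not part of the ollama package):

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; treat as unrelated
    return dot / (norm_a * norm_b)
```

Embed two sentences with ollama.embeddings and pass both 'embedding' lists to cosine_similarity; values closer to 1 suggest closer meaning.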

Docker Deployment

# docker-compose.yml
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama-data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0
    # GPU support (requires the NVIDIA Container Toolkit; remove this block on CPU-only hosts):
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama-data:
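
A fresh container starts with no models, so it helps to confirm the server is reachable and see what is installed. The /api/tags endpoint lists local models. A sketch using only the standard library; model_names and list_models are hypothetical helper names, and the base URL assumes the port mapping above:

```python
import json
import urllib.error
import urllib.request


def model_names(tags_reply):
    """Extract model names from a parsed /api/tags reply."""
    return [entry["name"] for entry in tags_reply.get("models", [])]


def list_models(base_url="http://localhost:11434", timeout=5):
    """Return installed model names, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as reply:
            return model_names(json.loads(reply.read()))
    except (urllib.error.URLError, OSError):
        return None
```

If list_models() returns an empty list, pull a model inside the container, for example with docker compose exec ollama ollama pull llama3.1:8b.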

Performance Tuning

Environment Variables

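# num_gpu, num_thread and num_ctx are model parameters, not among
# Ollama's documented server environment variables. Set them in a
# Modelfile (or per request via the API "options" field):
#   PARAMETER num_gpu 35      # layers offloaded to the GPU
#   PARAMETER num_thread 8    # CPU threads
#   PARAMETER num_ctx 4096    # context window

# Server-level settings are environment variables, for example:
export OLLAMA_DEBUG=1            # enable debug logging
export OLLAMA_KEEP_ALIVE=10m     # how long a model stays loaded after a request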

Memory Optimization

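# Run with a specific context size (ollama run has no context flag;
# set the parameter inside the interactive session)
ollama run llama3.1:8b
>>> /set parameter num_ctx 2048

# Use an explicitly quantized tag (check the model's library page for available tags)
ollama run llama3.1:8b-instruct-q4_0  # 4-bit quantization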
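
Why quantization helps: the memory needed for weights is roughly parameters times bits per weight, divided by 8. A back-of-the-envelope sketch; estimate_weight_memory_gb is an illustrative helper, and the figure covers weights only. It leaves out the KV cache (which grows with num_ctx) and runtime overhead, and real quantized files use slightly more than their nominal bit width.

```python
def estimate_weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


# Compare an 8B-parameter model at 16-bit and 4-bit precision
for bits in (16, 4):
    print(f"{bits}-bit: ~{estimate_weight_memory_gb(8e9, bits):.1f} GB")
```

Treat the result as a lower bound when deciding whether a model fits in RAM or VRAM.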

Common Use Cases

Code Review Bot

import ollama

def review_code(code):
    """Ask a code-focused model to review the given source code."""
    response = ollama.generate(
        model='codellama:13b',
        prompt=f"""Review this code for bugs, security issues,
and style improvements:

{code}

Provide specific, actionable feedback."""
    )
    return response['response']

Local RAG

import ollama
import chromadb

# Store documents
client = chromadb.Client()
collection = client.create_collection("docs")

def add_document(text, doc_id):
    embedding = ollama.embeddings(
        model='nomic-embed-text',
        prompt=text
    )['embedding']
    collection.add(
        embeddings=[embedding],
        documents=[text],
        ids=[doc_id]
    )

def query(question):
    # Get embedding for question
    q_embedding = ollama.embeddings(
        model='nomic-embed-text',
        prompt=question
    )['embedding']

    # Find similar documents
    results = collection.query(
        query_embeddings=[q_embedding],
        n_results=3
    )

    # Generate answer with context
    context = "\n".join(results['documents'][0])
    response = ollama.generate(
        model='llama3.1:8b',
        prompt=f"""Based on this context:
{context}

Answer: {question}"""
    )
    return response['response']
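
The add_document function above embeds each text as a single unit. Long documents usually retrieve better when split into smaller overlapping pieces first. A minimal character-based sketch; chunk_text and its default sizes are assumptions for illustration, not tuned values:

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping character windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks
```

Each piece can then be stored under its own id, for example by looping over enumerate(chunk_text(doc)) and calling add_document(chunk, f"doc-{i}").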