HxHippy

SRE Essentials Cheat Sheet

Site reliability engineering: monitoring, incidents, and SLOs.

Last updated: 2025-01-15

Quick Reference

# Incident triage
kubectl get pods -A | grep -vE 'Running|Completed'
curl -s localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
journalctl -p err --since "15 minutes ago" | head -50

SLI/SLO Calculations

Availability

Availability = (Total Time - Downtime) / Total Time * 100

99.9% = 8.76 hours downtime/year
99.95% = 4.38 hours downtime/year
99.99% = 52.6 minutes downtime/year
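
The downtime figures above follow directly from the availability formula. A minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of any library:

```python
def allowed_downtime_minutes(slo_percent: float, period_days: float = 365.0) -> float:
    """Minutes of downtime an availability SLO permits over a period."""
    period_minutes = period_days * 24 * 60
    return period_minutes * (1 - slo_percent / 100)


# Print the yearly allowance for the three SLO targets listed above.
for slo in (99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} min/year ({minutes / 60:.2f} h)")
```

Passing `period_days=30` gives the monthly allowance instead of the yearly one.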

Error Budget

Error Budget = 1 - SLO
Monthly Budget (99.9%, 30 days) = 43.2 minutes

Burn Rate = Error Rate / Error Budget
Fast Burn Alert: Burn Rate > 14.4 (2% of a 30-day budget consumed in 1 hour)
Slow Burn Alert: Burn Rate > 1 (budget exhausted before the SLO window ends)
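
To make the burn rate definition concrete, here is a small sketch that computes it from an observed error rate and estimates how long the budget would last at that rate. It assumes a 30-day SLO window; both function names are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """Burn rate = observed error rate divided by the error budget (1 - SLO)."""
    return error_rate / (1 - slo)


def hours_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Hours until the whole budget is spent if the burn rate stays constant."""
    if rate <= 0:
        return float("inf")  # nothing is being burned
    return window_days * 24 / rate


# Hypothetical reading: 1.44% of requests failing against a 99.9% SLO.
rate = burn_rate(error_rate=0.0144, slo=0.999)
print(f"burn rate: {rate:.1f}")
print(f"budget gone in about {hours_to_exhaustion(rate):.0f} hours")
```

A burn rate of exactly 1 spends the budget at the end of the window, which is why anything above 1 is worth a slow-burn alert.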

Prometheus Queries

# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# Latency P99
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le))

# Availability
avg_over_time(up{job="myservice"}[1d])

# Error budget remaining
1 - (
  sum(increase(http_requests_total{status=~"5.."}[30d]))
  / sum(increase(http_requests_total[30d]))
) / (1 - 0.999)
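
The error budget query above can be checked by hand with plain request counts. A minimal sketch of the same arithmetic; the function name and the sample counts are hypothetical:

```python
def error_budget_remaining(errors: int, total: int, slo: float = 0.999) -> float:
    """Fraction of the error budget left; negative means the budget is overspent."""
    if total == 0:
        return 1.0  # no traffic, so nothing has been spent
    error_ratio = errors / total
    return 1 - error_ratio / (1 - slo)


# Hypothetical 30-day counts: 400 failed requests out of 1,000,000.
remaining = error_budget_remaining(errors=400, total=1_000_000)
print(f"budget remaining: {remaining:.0%}")
```

A result of 0.6 means 40% of the budget is spent; once errors exceed `total * (1 - slo)` the value goes negative.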

Incident Response

Severity Levels

Level  Impact             Response Time  Example
SEV1   Full outage        < 5 min        Site down
SEV2   Major degradation  < 15 min       50% errors
SEV3   Minor degradation  < 1 hour       Slow responses
SEV4   Low impact         < 4 hours      Single endpoint

Incident Workflow

# 1. Acknowledge and assess
kubectl get events --sort-by='.lastTimestamp' | tail -20
kubectl describe pod <failing-pod>

# 2. Mitigate
kubectl rollout undo deployment/app
kubectl scale deployment/app --replicas=10

# 3. Communicate
# Update status page, notify stakeholders

# 4. Document
# Create incident ticket with timeline

Kubernetes Reliability

# Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=memory

# Pod issues
kubectl get pods -A | grep -E 'Error|CrashLoop|Pending'
kubectl describe pod <pod> | grep -A5 "Events:"

# Node issues
kubectl get nodes -o wide
kubectl describe node <node> | grep -A10 "Conditions:"

# Resource quotas
kubectl describe resourcequota -A

Alerting Rules

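# Multi-window burn rate alert (1h long window confirmed by a 5m short window)
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > 14.4 * (1 - 0.999)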
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning fast"
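
The 14.4 factor in the rule above is not arbitrary: it is the burn rate at which 2% of a 30-day budget is consumed within a 1-hour window. A minimal sketch of that derivation, assuming a 30-day SLO window; the function names are illustrative.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: float = 30.0) -> float:
    """Burn rate that spends `budget_fraction` of the budget in the alert window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(burn: float, slo: float) -> float:
    """Error ratio to compare against in the alert expression."""
    return burn * (1 - slo)


# 2% of the budget in 1 hour is the fast-burn case used in the rule above.
fast = burn_rate_threshold(budget_fraction=0.02, alert_window_hours=1)
print(f"fast burn rate: {fast:.1f}")
print(f"error ratio threshold at 99.9% SLO: {error_ratio_threshold(fast, 0.999):.4f}")
```

The same function gives a threshold of 6 for 5% of the budget in 6 hours, another commonly used pairing for a slower, lower-severity alert.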

Capacity Planning

# Current utilization
kubectl top pods -A --containers
kubectl get hpa -A

# Historical data (Prometheus)
# CPU: avg(rate(container_cpu_usage_seconds_total[1h]))
# Memory: avg(container_memory_working_set_bytes)

# Growth projections
# Use linear regression (e.g. PromQL predict_linear) on metrics over 30/90 days
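
As a rough illustration of the growth projection step, the sketch below fits a least-squares line to daily utilization samples and estimates how many days remain until a capacity limit is reached. The sample data, the 80% limit, and the function names are all hypothetical; real inputs would come from exported metrics.

```python
def fit_line(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("need at least two distinct x values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_capacity(capacity: float, days: list[float], usage: list[float]):
    """Days after the last sample until the trend line reaches `capacity`."""
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None  # flat or shrinking usage never reaches the limit
    return (capacity - intercept) / slope - days[-1]


# Hypothetical 30 days of CPU utilization (%), growing 0.5 points per day.
days = [float(d) for d in range(30)]
usage = [40 + 0.5 * d for d in days]
print(days_until_capacity(80.0, days, usage))
```

A straight-line fit ignores seasonality and step changes such as launches, so treat the result as a first estimate rather than a forecast.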

Runbook Template

## Alert: ServiceHighErrorRate

### Impact
- User-facing errors on checkout flow

### Detection
- Prometheus alert firing
- Error rate > 1%

### Diagnosis
1. Check error logs: kubectl logs -l app=checkout --tail=100
2. Check dependencies: curl health endpoints
3. Check recent deployments: kubectl rollout history deployment/checkout

### Mitigation
1. Rollback: kubectl rollout undo deployment/checkout
2. Scale: kubectl scale deployment/checkout --replicas=10
3. Failover: Update DNS to backup region

### Escalation
- Slack: #incident-response
- PagerDuty: checkout-oncall

Post-Incident

## Blameless Post-Mortem Template

### Summary
Brief description of what happened

### Impact
- Duration: X minutes
- Users affected: N
- Revenue impact: $X

### Timeline
- HH:MM - First alert fired
- HH:MM - Incident declared
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full recovery

### Root Cause
Technical explanation

### Action Items
- [ ] Immediate fix (owner, due date)
- [ ] Detection improvement
- [ ] Prevention measure