SRE Essentials Cheat Sheet
Site reliability engineering: monitoring, incidents, and service level objectives.
Quick Reference
# Incident triage
kubectl get pods -A | grep -v Running
curl -s localhost:9090/api/v1/alerts | jq '.data.alerts[] | select(.state=="firing")'
journalctl -p err --since "15 minutes ago" | head -50
SLI/SLO Calculations
Availability
Availability = (Total Time - Downtime) / Total Time * 100
99.9% = 8.76 hours downtime/year
99.95% = 4.38 hours downtime/year
99.99% = 52.6 minutes downtime/year
Error Budget
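The downtime figures above are the error budget (1 - SLO) expressed as time. A minimal Python sketch of that arithmetic, assuming a 365-day year and a 30-day month; the function and constant names are illustrative, not part of any library:

```python
# Allowed downtime for an availability SLO: (1 - SLO) * period length.
MINUTES_PER_YEAR = 365 * 24 * 60     # assumes a non-leap year
MINUTES_PER_30_DAYS = 30 * 24 * 60   # assumes a 30-day month


def allowed_downtime_minutes(slo_percent: float, period_minutes: int) -> float:
    """Return the downtime budget in minutes for an SLO given in percent."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    return (100 - slo_percent) / 100 * period_minutes


for slo in (99.9, 99.95, 99.99):
    yearly = allowed_downtime_minutes(slo, MINUTES_PER_YEAR)
    monthly = allowed_downtime_minutes(slo, MINUTES_PER_30_DAYS)
    print(f"{slo}%: {yearly / 60:.2f} h/year, {monthly:.1f} min/30 days")
```

Changing the period constants is enough to get weekly or quarterly budgets.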
Error Budget = 1 - SLO
Monthly Budget (99.9%) = 43.2 minutes
Burn Rate = Error Rate / Error Budget
Fast Burn Alert: Burn Rate > 14.4 (2% of a 30-day budget consumed in 1 hour)
Slow Burn Alert: Burn Rate > 1 (budget exhausted in window)
Prometheus Queries
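The error-budget formulas above, which the error-rate and budget-remaining queries in this section feed, can be checked by hand. A minimal Python sketch, assuming a 30-day SLO window; all function names are hypothetical helpers:

```python
def error_budget(slo: float) -> float:
    """Allowed failure ratio, e.g. about 0.001 for a 99.9% SLO."""
    return 1 - slo


def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / error_budget(slo)


def budget_consumed(rate: float, alert_window_hours: float,
                    slo_window_hours: float = 30 * 24) -> float:
    """Fraction of the SLO window's budget spent by burning at `rate`
    for the whole alert window (30-day SLO window assumed by default)."""
    return rate * alert_window_hours / slo_window_hours


def budget_remaining(failed: int, total: int, slo: float) -> float:
    """Fraction of the error budget left; negative once the budget is overspent."""
    if total == 0:
        return 1.0  # no traffic, nothing spent
    return 1 - (failed / total) / error_budget(slo)


# A 1.44% error rate against a 99.9% SLO is a burn rate of about 14.4.
print(round(burn_rate(0.0144, 0.999), 2))
# Burning at 14.4 for one hour spends about 2% of a 30-day budget.
print(round(budget_consumed(14.4, 1), 4))
# 500 failures in 1,000,000 requests leaves about half the budget.
print(round(budget_remaining(500, 1_000_000, 0.999), 3))
```

At a burn rate of 14.4 the whole 30-day budget lasts 720 / 14.4 = 50 hours, which is why that threshold is treated as a paging condition.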
# Error rate
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
# Latency P99
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Availability
avg_over_time(up{job="myservice"}[1d])
# Error budget remaining
1 - (
sum(increase(http_requests_total{status=~"5.."}[30d]))
/ sum(increase(http_requests_total[30d]))
) / (1 - 0.999)
Incident Response
Severity Levels
| Level | Impact | Response Time | Example |
|---|---|---|---|
| SEV1 | Full outage | < 5 min | Site down |
| SEV2 | Major degradation | < 15 min | 50% errors |
| SEV3 | Minor degradation | < 1 hour | Slow responses |
| SEV4 | Low impact | < 4 hours | Single endpoint |
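The response-time targets in the table can be encoded as data so tooling can flag late acknowledgements. A minimal Python sketch; `RESPONSE_TARGETS`, `response_deadline`, and `responded_in_time` are hypothetical names, and the targets are copied from the table above:

```python
from datetime import datetime, timedelta

# Response-time targets per severity level, taken from the table above.
RESPONSE_TARGETS = {
    "SEV1": timedelta(minutes=5),
    "SEV2": timedelta(minutes=15),
    "SEV3": timedelta(hours=1),
    "SEV4": timedelta(hours=4),
}


def response_deadline(severity: str, declared_at: datetime) -> datetime:
    """Latest acceptable acknowledgement time for an incident."""
    return declared_at + RESPONSE_TARGETS[severity]


def responded_in_time(severity: str, declared_at: datetime,
                      acknowledged_at: datetime) -> bool:
    """True if the acknowledgement came strictly before the deadline ('<' in the table)."""
    return acknowledged_at < response_deadline(severity, declared_at)


declared = datetime(2024, 1, 1, 12, 0)
print(response_deadline("SEV2", declared))
print(responded_in_time("SEV1", declared, declared + timedelta(minutes=7)))
```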
Incident Workflow
# 1. Acknowledge and assess
kubectl get events --sort-by='.lastTimestamp' | tail -20
kubectl describe pod <failing-pod>
# 2. Mitigate
kubectl rollout undo deployment/app
kubectl scale deployment/app --replicas=10
# 3. Communicate
# Update status page, notify stakeholders
# 4. Document
# Create incident ticket with timeline
Kubernetes Reliability
# Resource pressure
kubectl top nodes
kubectl top pods -A --sort-by=memory
# Pod issues
kubectl get pods -A | grep -E 'Error|CrashLoop|Pending'
kubectl describe pod <pod> | grep -A5 "Events:"
# Node issues
kubectl get nodes -o wide
kubectl describe node <node> | grep -A10 "Conditions:"
# Resource quotas
kubectl describe resourcequota -A
Alerting Rules
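# Multi-window burn rate alert (1h long window, 5m short window)
- alert: ErrorBudgetBurn
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h]))
      / sum(rate(http_requests_total[1h]))
    ) > 14.4 * (1 - 0.999)
    and
    (
      sum(rate(http_requests_total{status=~"5.."}[5m]))
      / sum(rate(http_requests_total[5m]))
    ) > 14.4 * (1 - 0.999)
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "Error budget burning fast"
Capacity Planning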
# Current utilization
kubectl top pods -A --containers
kubectl get hpa -A
# Historical data (Prometheus)
# CPU: avg(rate(container_cpu_usage_seconds_total[1h]))
# Memory: avg(container_memory_working_set_bytes)
# Growth projections
# Use linear regression on metrics over 30/90 days
Runbook Template
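Returning to the growth-projection note that closes Capacity Planning: a least-squares line through usage samples gives a rough estimate of when capacity runs out. A minimal Python sketch using only the standard language; `fit_line` and `days_until_capacity` are hypothetical helpers, and the sample values are made up for illustration, not measurements:

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def days_until_capacity(days, usage, capacity):
    """Days after the last sample until the fitted line reaches `capacity`.

    Returns None when usage is flat or shrinking.
    """
    slope, intercept = fit_line(days, usage)
    if slope <= 0:
        return None
    day_at_capacity = (capacity - intercept) / slope
    return max(0.0, day_at_capacity - days[-1])


# Illustrative data only: average CPU cores in use, sampled every 30 days.
sample_days = [0, 30, 60, 90]
cpu_cores_used = [40, 46, 52, 58]
print(days_until_capacity(sample_days, cpu_cores_used, capacity=80))
```

With these made-up samples the fitted growth is 0.2 cores per day, so an 80-core ceiling is roughly 110 days past the last sample. A straight line ignores seasonality and step changes, so treat the result as a planning hint rather than a forecast.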
## Alert: ServiceHighErrorRate
### Impact
- User-facing errors on checkout flow
### Detection
- Prometheus alert firing
- Error rate > 1%
### Diagnosis
1. Check error logs: kubectl logs -l app=checkout
2. Check dependencies: curl health endpoints
3. Check recent deployments: kubectl rollout history deployment/checkout
### Mitigation
1. Rollback: kubectl rollout undo deployment/checkout
2. Scale: kubectl scale deployment/checkout --replicas=10
3. Failover: Update DNS to backup region
### Escalation
- Slack: #incident-response
- PagerDuty: checkout-oncall
Post-Incident
## Blameless Post-Mortem Template
### Summary
Brief description of what happened
### Impact
- Duration: X minutes
- Users affected: N
- Revenue impact: $X
### Timeline
- HH:MM - First alert fired
- HH:MM - Incident declared
- HH:MM - Root cause identified
- HH:MM - Mitigation applied
- HH:MM - Full recovery
### Root Cause
Technical explanation
### Action Items
- [ ] Immediate fix (owner, due date)
- [ ] Detection improvement
- [ ] Prevention measure