SLA Monitoring for Engineering Teams: Burn Rate, Error Budgets, and Alerting Done Right
Uptime percentage is a lagging indicator. By the time you calculate your 99.9% SLA compliance at the end of the month, the outcome is already locked in; there is nothing left to change. Error budget monitoring gives you a leading indicator: you see how fast you're consuming your downtime allowance before you run out.
This guide covers the practical implementation of error budgets and burn rate alerts for SaaS engineering teams.
SLA vs SLO vs SLI: The Definitions That Matter
Before the implementation details, the definitions:
SLA (Service Level Agreement): A contract with customers. "We guarantee 99.9% uptime. If we miss it, you get a credit." SLAs have financial consequences.
SLO (Service Level Objective): Your internal target. "We aim for 99.95% uptime." SLOs should be more ambitious than SLAs — you need headroom between your SLO and your SLA so that normal variance doesn't cause SLA breaches.
SLI (Service Level Indicator): The actual measurement. "Uptime over the last 28 days: 99.97%." SLIs are the raw data from your monitoring. SLOs are the targets SLIs are measured against.
The relationship: SLI is measured against SLO; SLO is more ambitious than SLA; SLA is what you owe customers.
Error Budget: What It Means in Practice
An error budget is the allowance you have for downtime, latency degradation, or error rate before you breach your SLO.
For a 99.9% monthly uptime SLO (measured over a 28-day rolling window):
Error budget = (1 − 0.999) × 28 days × 24 hours × 60 minutes = 40.3 minutes
You have 40.3 minutes of downtime per 28-day window before you breach your 99.9% SLO. This is your error budget.
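If you want to script this arithmetic, here's a minimal Python sketch (the function name is illustrative, not part of any PingSLA API):

```python
def error_budget_minutes(slo: float, window_days: float) -> float:
    """Minutes of downtime allowed in the window before the SLO is breached."""
    return (1 - slo) * window_days * 24 * 60

print(round(error_budget_minutes(0.999, 28), 2))      # 40.32, the figure above
print(round(error_budget_minutes(0.9999, 30.44), 2))  # 4.38, matching the table below
```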
Common SLO targets and their error budgets (the monthly column assumes an average calendar month of 30.44 days, which is why 99.9% reads 43.8 minutes here versus 40.3 minutes for the 28-day window above):
| SLO Target | Error budget (monthly) | Error budget (weekly) | Error budget (daily) |
|---|---|---|---|
| 99.5% | 3 hours 39 min | 50.4 min | 7.2 min |
| 99.9% | 43.8 min | 10.1 min | 1.44 min |
| 99.95% | 21.9 min | 5 min | 43 sec |
| 99.99% | 4.38 min | 1 min | 8.6 sec |
| 99.999% | 26.3 sec | 6 sec | <1 sec |
A 99.99% SLO means you have 4.38 minutes of downtime per month. One incident of 5 minutes breaches your SLO for the entire month.
Burn Rate: The Leading Indicator
Burn rate measures how fast you're consuming your error budget. A burn rate of 1 means you're consuming the budget at exactly the rate that would exhaust it over the full window. A burn rate of 10 means you're consuming it 10x faster: you'll exhaust your monthly budget in 2.8 days instead of 28.
Burn rate = error rate over measurement window / error budget rate
For a 99.9% SLO (error budget rate = 0.1%):
- Your service has been down for 30 minutes in the last 24 hours
- Error rate over last 24h = 30/1440 = 2.08%
- Burn rate = 2.08% / 0.1% = 20.8
A burn rate of 20.8 means you're consuming your monthly error budget at 20.8x the sustainable rate. If this continues, you'll exhaust your monthly budget in 28/20.8 = 1.35 days.
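The same calculation in a short Python sketch (names are illustrative):

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How many times faster than sustainable you are spending error budget."""
    return error_rate / (1 - slo)   # error budget rate = 1 - SLO

downtime_min, window_min = 30, 24 * 60             # 30 minutes down in the last 24h
rate = burn_rate(downtime_min / window_min, 0.999)
print(round(rate, 1))       # 20.8
print(round(28 / rate, 2))  # ~1.34 days until the 28-day budget is exhausted
```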
The Multi-Window Burn Rate Alert Strategy
Single-window burn rate alerts have a flaw: a 5-minute outage looks catastrophic in a 1-hour window (roughly an 83× burn rate against a 99.9% SLO) but trivial in a 7-day window (about 0.5×). Alerts based on a single window generate false urgency for brief spikes.
The Google SRE Workbook recommends a multi-window approach: alert on burn rate in both a fast (short) window and a slow (long) window simultaneously. An alert fires only when both windows indicate elevated burn rate.
Standard multi-window alert thresholds (adapted from the Google SRE Workbook):
| Severity | Short window | Long window | Burn rate threshold | Response time |
|---|---|---|---|---|
| Page (P1) | 5 min | 1 hour | 14.4× | < 5 minutes |
| Page (P2) | 30 min | 6 hours | 6× | < 30 minutes |
| Ticket (P3) | 2 hours | 24 hours | 3× | < 4 hours |
| Monitor (P4) | 6 hours | 3 days | 1× | Business hours |
How this works in practice:
A P1 alert fires when:
- 5-minute burn rate is ≥ 14.4× AND
- 1-hour burn rate is ≥ 14.4×
A 45-second blip pushes the 5-minute burn rate far above 14.4× (45/300 = 15% error rate, a 150× burn rate) but leaves the 1-hour burn rate at about 12.5×, under the threshold → no P1 alert. A sustained outage that holds both windows at or above 14.4× → P1 alert fires.
This dramatically reduces false-positive P1 pages while ensuring real sustained outages get immediate escalation.
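If you implement this yourself, the evaluation logic is small. A sketch, assuming per-window burn rates arrive from your metrics backend (the tier table mirrors the one above; everything else is hypothetical):

```python
# Each tier: (short window, long window, burn-rate threshold, severity)
TIERS = [
    ("5m",  "1h",  14.4, "P1"),
    ("30m", "6h",   6.0, "P2"),
    ("2h",  "24h",  3.0, "P3"),
    ("6h",  "3d",   1.0, "P4"),
]

def evaluate(burn: dict[str, float]) -> list[str]:
    """Return severities whose threshold is exceeded in BOTH windows."""
    return [sev for short_w, long_w, threshold, sev in TIERS
            if burn[short_w] >= threshold and burn[long_w] >= threshold]

# The 45-second blip: huge 5-minute burn rate, modest everywhere else.
blip = {"5m": 150.0, "30m": 2.0, "1h": 12.5, "2h": 1.0,
        "6h": 0.9, "24h": 0.8, "3d": 0.4}
print(evaluate(blip))  # prints []: nobody gets paged for a blip
```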
Implementing SLO Monitoring in PingSLA
Step 1: Define Your SLIs
Map your monitoring checks to your SLOs:
Availability SLI: Percentage of HTTP checks returning 2xx status codes.
- Monitor: HTTP check on checkout page, login API, health endpoint
- SLO: 99.9% over 28 days
Latency SLI: Percentage of requests with TTFB < 2 seconds.
- Monitor: Latency checks from all regions
- SLO: 99th percentile TTFB < 2 seconds
Checkout success SLI: Percentage of synthetic checkout flow checks passing.
- Monitor: Checkout Defender flow monitor
- SLO: 99.9% success over 28 days
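Under the hood, an availability SLI is just a ratio over check results. A sketch with an assumed record shape (this is not PingSLA's data model):

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    timestamp: float   # Unix time of the probe
    status_code: int   # HTTP status the check received

def availability_sli(results: list[CheckResult]) -> float:
    """Fraction of checks that returned a 2xx status."""
    if not results:
        return 1.0
    good = sum(1 for r in results if 200 <= r.status_code < 300)
    return good / len(results)
```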
Step 2: Configure Alert Thresholds
In PingSLA, for each critical monitor:
Alert on first failure (for checkout/login): Immediate alert — don't wait for 2 consecutive failures. Checkout downtime is too costly to add confirmation delay.
Alert on sustained degradation (for latency): Alert after 3 consecutive checks show TTFB > threshold. This reduces false positives from probe-level network blips.
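The confirmation logic amounts to a small debouncer. A sketch using the values above (the class itself is illustrative):

```python
from collections import deque

class LatencyDebouncer:
    """Fire only after N consecutive checks exceed the TTFB threshold."""

    def __init__(self, threshold_s: float = 2.0, required: int = 3):
        self.threshold_s = threshold_s
        self.recent = deque(maxlen=required)

    def record(self, ttfb_s: float) -> bool:
        """Feed one check result; return True when the alert should fire."""
        self.recent.append(ttfb_s > self.threshold_s)
        return len(self.recent) == self.recent.maxlen and all(self.recent)
```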
Alert escalation path:
- Checkout failure: WhatsApp → PagerDuty → Escalation manager (5-minute escalation)
- Login failure: WhatsApp → on-call (10-minute escalation)
- Latency degradation: Slack → on-call (30-minute window)
Step 3: Track Burn Rate
PingSLA's SLA dashboard shows rolling uptime percentage for any time window (1h, 24h, 7d, 28d). Use this to manually calculate burn rate:
- Current 1h uptime: 99.5%
- Error rate: 0.5%
- SLO target: 99.9%
- Error budget rate: 0.1%
- Burn rate: 0.5% / 0.1% = 5×
A 5× burn rate over the last hour with a 99.9% SLO means: if this continues, you breach your monthly SLO in 28/5 = 5.6 days. That sits between the 3× ticket threshold and the 6× P2 page threshold in the table above: not yet a page, but investigate within hours rather than days.
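The conversion from a rolling uptime percentage to a burn rate is one line, if you want to script it (a sketch, not a PingSLA feature):

```python
def burn_from_uptime(uptime_pct: float, slo_pct: float) -> float:
    """Burn rate implied by a rolling uptime percentage."""
    return (100.0 - uptime_pct) / (100.0 - slo_pct)

print(round(burn_from_uptime(99.5, 99.9), 1))  # 5.0, the example above
```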
Step 4: Communicate Error Budget Status to Leadership
Error budget burn rate is one of the most effective ways to communicate infrastructure health to non-engineering leadership:
- "We've consumed 60% of our monthly error budget in the first 10 days" is clearer than "we've had 26 minutes of downtime"
- "Our burn rate is 2× for the last 7 days, we need to reduce it before end of month" sets clear engineering priorities
Use PingSLA's SLA reports to generate monthly error budget utilization reports for leadership review.
When to Reduce Your SLO vs Fix Your Infrastructure
A common error: setting a 99.99% SLO before your infrastructure can reliably achieve it.
If your monitoring shows you consistently consuming 80%+ of your monthly error budget and shipping reliability improvements isn't keeping up, consider reducing your SLO temporarily. A 99.9% SLO you can reliably meet is better than a 99.99% SLO you consistently miss.
This is counterintuitive but important: an SLO you can meet builds trust with your engineering team (we know when we're meeting expectations) and your customers (you don't breach your SLA contracts). An aspirational SLO that's never met is noise.
Frequently Asked Questions
- What is the difference between MTTR and error budget?
- MTTR (Mean Time to Recovery) is the average time to resolve an incident. Error budget measures how much downtime you can afford in a given period. Both metrics are important: a low MTTR tells you your incident response is fast; a healthy error budget tells you incident frequency is sustainable. A 5-minute MTTR with 20 incidents per month still consumes 100 minutes of error budget — which breaches a 99.9% monthly SLO.
- How do I handle planned maintenance within my error budget?
- Planned maintenance windows that are publicly announced are typically excluded from SLA calculations by contract. In your SLO tracking, either exclude planned maintenance windows from your SLI calculation, or factor them into your error budget planning (e.g., reserve 15 minutes per month for planned maintenance). PingSLA supports maintenance window configuration that pauses alerts and excludes checks from SLA calculations during scheduled maintenance.
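One way to implement the exclusion in your own SLI pipeline (a sketch reusing the CheckResult shape from Step 1; maintenance windows are assumed to be (start, end) Unix-time pairs):

```python
def sli_excluding_maintenance(results, windows):
    """Availability SLI, ignoring checks inside maintenance windows."""
    def in_maintenance(ts: float) -> bool:
        return any(start <= ts < end for start, end in windows)

    counted = [r for r in results if not in_maintenance(r.timestamp)]
    if not counted:
        return 1.0
    good = sum(1 for r in counted if 200 <= r.status_code < 300)
    return good / len(counted)
```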
- Should my SLO target be the same as my SLA target?
- No — your SLO should always be more ambitious than your SLA to give yourself a buffer. A common approach: if your SLA is 99.9% (customer-facing), set your internal SLO at 99.95%. The gap between SLO and SLA is your buffer for handling SLO breaches before they become SLA breaches. Engineering teams that only target their SLA target leave no margin for recovery before customer commitments are violated.
Track your SLA compliance with 15-region monitoring