DNS Failure Killed 100% of Our Traffic for 4 Hours. No Alert Fired.
The incident started at 11:23 AM on a Tuesday. A junior engineer was cleaning up DNS records in the Cloudflare dashboard, removing entries that looked like legacy configurations. One of them was not legacy. It was the A record pointing app.yoursaas.com to the production server.
By 11:24 AM, users attempting to load the application received NXDOMAIN — domain not found. Every user. Every request. Instantly.
The uptime monitor was checking https://app.yoursaas.com. Without the DNS record, the HTTP request never reached the server. There was no 5xx to trigger an alert. The probe itself couldn't resolve the domain. Most monitoring tools handle a DNS resolution failure as a "network error" rather than triggering the standard alert flow.
No alert ever fired. An engineer noticed the outage on their dashboard at 12:07 PM, 44 minutes after the DNS record was deleted at 11:23 AM. The repair was deployed at 12:15 PM. DNS propagation completed at 3:30 PM.
Four hours and seven minutes of complete downtime. Zero automated alerts.
Why DNS Failures Are Different from Other Incidents
A server crash, an API failure, or a deployment issue often affects users gradually or partially. DNS failure is categorical: a record either resolves or it doesn't, and a broken record affects every user in every region as soon as their resolver's cached copy expires.
This creates two properties that make DNS uniquely dangerous:
The blast radius is always 100%. Once cached records expire, there is no partial DNS failure for your primary domain. Either the record exists and resolves correctly, or it doesn't, and nobody can reach you. There is no "degraded" state for a missing A record.
The damage is delayed by TTL, then instant. DNS changes (including accidental deletions) propagate over a window determined by your TTL. A TTL of 300 seconds means the deletion reaches cached resolvers over 5 minutes. During that window, some users are affected (those whose cache has expired) and others are not (those still serving cached records). After full propagation, 100% of users are affected. This makes the failure look gradual at first, then suddenly catastrophic.
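The "gradual, then total" shape can be estimated with simple arithmetic. The sketch below assumes resolver cache ages are spread evenly across the TTL window, which is a simplification; the function name `affected_percent` is illustrative, not part of any tool.

```bash
# Estimate the percentage of resolvers whose cached record has expired
# `elapsed` seconds after a deletion, for a record with the given TTL.
# Assumption: cache ages are spread evenly across the TTL window.
affected_percent() {
  elapsed="$1"
  ttl="$2"
  if [ "$elapsed" -ge "$ttl" ]; then
    echo 100
  else
    echo $(( elapsed * 100 / ttl ))
  fi
}

affected_percent 60 300    # prints 20
affected_percent 150 300   # prints 50
affected_percent 300 300   # prints 100
```

Under this assumption, a 300-second TTL leaves about one fifth of resolvers failing after the first minute and all of them after five.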
Most uptime monitoring tools are not designed for this failure mode. They check HTTP. If DNS is broken, the HTTP check fails with a network error rather than a non-200 response — and many tools don't alert on network errors the same way they alert on HTTP failures.
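One way to avoid this blind spot in a home-grown probe is to classify the failure explicitly. This is a minimal sketch that relies on curl's documented exit code 6 ("could not resolve host"); the function names and the category labels are assumptions made for illustration.

```bash
# Classify the outcome of an HTTP probe made with curl.
# Arguments: curl's exit code, then the HTTP status curl reported ("000" if none).
classify_probe() {
  curl_exit="$1"
  http_status="$2"
  if [ "$curl_exit" -eq 6 ]; then
    echo "dns_failure"        # curl exit code 6: could not resolve host
  elif [ "$curl_exit" -ne 0 ]; then
    echo "network_error"      # timeouts, refused connections, TLS errors
  elif [ "$http_status" -ge 500 ]; then
    echo "server_error"
  elif [ "$http_status" -ge 400 ]; then
    echo "client_error"
  else
    echo "ok"
  fi
}

# Probe a URL and print the classification.
probe_url() {
  status=$(curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1")
  classify_probe "$?" "$status"
}
```

Calling `probe_url https://app.yoursaas.com` would print `dns_failure` when the name does not resolve, which can then be routed to the same alert path as a 5xx.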
5 DNS Failure Modes That Kill SaaS Products
1. Accidental Record Deletion
The most common. Someone with DNS access removes a record they believe is unused. It is not unused. Without change control on DNS, this is a constant risk.
Indicator: Records disappear from your monitored zone. NXDOMAIN responses from resolvers.
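The NXDOMAIN indicator can be read straight from the header line that `dig` prints. The sketch below parses that line; the helper names and the wording of the state messages are illustrative assumptions.

```bash
# Read full dig output on stdin and print the response code from the
# header line, e.g. NOERROR, NXDOMAIN, or SERVFAIL.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p' | head -n 1
}

# Translate a response code into a short state message.
# Note: NOERROR only means the name exists; the answer can still be empty,
# so record values should be compared as well.
record_state() {
  case "$1" in
    NOERROR)  echo "name exists" ;;
    NXDOMAIN) echo "missing: name does not exist" ;;
    SERVFAIL) echo "broken: nameserver could not answer" ;;
    *)        echo "unknown: ${1:-no response}" ;;
  esac
}

# Query a name and print its state. +time and +tries bound how long dig waits.
check_name() {
  record_state "$(dig +time=3 +tries=2 A "$1" | dig_status)"
}
```

For example, `check_name app.yoursaas.com` would print the "missing" message after a deletion like the one in this incident.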
2. Domain Expiry
If your domain registrar auto-renewal fails — due to an expired credit card, a billing email going to a former employee, or a registrar payment issue — your domain registration lapses and DNS resolvers stop returning records for it.
Domain expiry is the most embarrassing and most preventable DNS failure. It happens to companies of all sizes. The monitoring solution is simple: alert on domain expiry date with 60, 30, and 7-day warnings.
Indicator: Registrar WHOIS shows past expiry date. All DNS queries return NXDOMAIN.
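The 60, 30, and 7-day warnings map naturally onto alert tiers. This sketch assumes the number of days until expiry has already been computed; the tier names are hypothetical and should be mapped to whatever severities your alerting system uses.

```bash
# Map days-until-expiry to an alert tier using 60/30/7-day thresholds.
# Tier names (notice, warning, critical) are illustrative.
expiry_alert_level() {
  days="$1"
  if [ "$days" -lt 0 ]; then
    echo "expired"
  elif [ "$days" -le 7 ]; then
    echo "critical"
  elif [ "$days" -le 30 ]; then
    echo "warning"
  elif [ "$days" -le 60 ]; then
    echo "notice"
  else
    echo "ok"
  fi
}

expiry_alert_level 45   # prints notice
expiry_alert_level 5    # prints critical
```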
3. Nameserver Misconfiguration
A migration between DNS providers (e.g., from Route 53 to Cloudflare) requires updating nameserver records at the registrar. If the nameserver update is applied incorrectly, or the old DNS provider's records are removed before the new provider's records are verified, queries are directed to a nameserver that doesn't serve your zone.
Indicator: DNS queries return SERVFAIL. Nameserver records at registrar don't match the live nameservers.
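The mismatch check comes down to comparing two sets of hostnames. The sketch below compares two newline-separated lists after normalizing case and trailing dots; the function names and the example hostnames are placeholders.

```bash
# Normalize a list of nameserver hostnames read from stdin:
# lowercase, strip the trailing dot, drop blank lines, sort, de-duplicate.
normalize_ns() {
  tr 'A-Z' 'a-z' | sed 's/\.$//' | grep -v '^$' | sort -u
}

# Succeed if two newline-separated nameserver lists contain the same hosts.
ns_sets_match() {
  [ "$(printf '%s\n' "$1" | normalize_ns)" = "$(printf '%s\n' "$2" | normalize_ns)" ]
}
```

One plausible way to feed it, assuming a .com domain: take the delegated set from the parent zone with `dig +noall +authority NS yoursaas.com @a.gtld-servers.net | awk '{print $5}'`, take the live set with `dig +short NS yoursaas.com`, and alert when `ns_sets_match` fails.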
4. CNAME Chain Failure
Modern SaaS architectures use CNAME chains for flexibility. app.yoursaas.com → CNAME to app.hosting-provider.com → A record at hosting provider. If any link in this chain breaks — if the hosting provider changes their CNAME target or removes a record — the entire chain fails.
Indicator: Direct CNAME target resolution fails. End-to-end DNS resolution fails but the individual records look correct in your zone.
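An end-to-end check can be built on the output of `dig +short A`, which lists each CNAME target in the chain (hostnames ending in a dot) followed by the final addresses. This is a sketch that only inspects that output; the function names are illustrative and only IPv4 addresses are considered.

```bash
# Read `dig +short A <name>` output on stdin. A healthy chain ends in at
# least one IPv4 address; succeed only if one is present.
chain_ends_in_address() {
  grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+$'
}

# Print the last CNAME target seen, which is the link to investigate
# when no address comes back.
last_cname_target() {
  grep '\.$' | tail -n 1
}

# Resolve a name end to end and report where the chain stops.
check_chain() {
  output=$(dig +short A "$1")
  if printf '%s\n' "$output" | chain_ends_in_address; then
    echo "ok"
  else
    echo "chain broken after: $(printf '%s\n' "$output" | last_cname_target)"
  fi
}
```

Running `check_chain app.yoursaas.com` exercises every link, including records in the hosting provider's zone that never appear in your own.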
5. DNS Propagation During Migration
During a planned migration, you update DNS records and wait for propagation. During the propagation window, users on resolvers with cached old records reach the old server, while users with expired caches reach the new server. If the old server is decommissioned before propagation completes, users with cached old records experience failures.
Indicator: Intermittent failures correlated with user geography or ISP. Some users can reach the product, others cannot.
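Disagreement between resolvers is measurable during a migration. The sketch below queries a few public resolvers and counts distinct answers; the choice of 1.1.1.1, 8.8.8.8, and 9.9.9.9 and the function names are assumptions, and only the first address per resolver is compared.

```bash
# Read lines of the form "<resolver> <answer>" on stdin and print the number
# of distinct answers. An empty answer counts as its own value.
distinct_answers() {
  awk '{ print ($2 == "" ? "NO_ANSWER" : $2) }' | sort -u | wc -l | tr -d ' '
}

# Ask several public resolvers for the same A record and count the
# distinct answers they return.
compare_resolvers() {
  for resolver in 1.1.1.1 8.8.8.8 9.9.9.9; do
    echo "$resolver $(dig +short A "$1" @"$resolver" | head -n 1)"
  done | distinct_answers
}
```

A result greater than 1 from `compare_resolvers app.yoursaas.com` suggests propagation is still in progress, so the old server should stay up.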
DNS Monitoring vs SSL Monitoring vs Uptime Monitoring
| What It Monitors | DNS Monitoring | SSL Monitoring | HTTP Uptime Monitoring |
|---|---|---|---|
| Record existence (A, CNAME) | Yes | No | No |
| Record value changes | Yes | No | No |
| Domain expiry | Yes | No | No |
| Certificate validity | No | Yes | Partial (an HTTPS probe fails on an invalid certificate) |
| Certificate expiry | No | Yes | No |
| Server HTTP response | No | No | Yes |
| Application functionality | No | No | Partial |
| DNS resolution time | Yes | No | Implicit |
| Nameserver health | Yes | No | No |
These three types of monitoring are complementary, not redundant. A complete monitoring setup needs all three.
What DNS Records to Monitor
Not all records are equally critical. Here is what to monitor and why:
A records (required): These map your domain to an IP address. Changes or deletions immediately break access. Monitor the current value and alert on any change or deletion.
CNAME records (required for CNAMEs): If you use CNAMEs for any user-facing domain, monitor them for existence and target value. CNAME target changes can be intentional (hosting migration) or accidental (provider removes the target).
MX records (required for email-dependent products): If your product relies on email for sign-up confirmation, password reset, or notifications, MX record failures break these flows. Monitor MX records separately.
TXT records (required for SPF/DKIM): Email deliverability depends on SPF and DKIM TXT records. Changes to these records cause your emails to land in spam or be rejected.
NS records (required): Your nameserver records at the registrar determine which DNS provider serves your zone. A change here affects all records. Monitor for unexpected changes.
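For the email-related records, a presence check is often enough to catch deletions. This sketch inspects `dig +short` output for TXT and MX queries; the function names are illustrative, and it does not validate the SPF policy itself.

```bash
# Read `dig +short TXT <domain>` output on stdin and succeed if an SPF
# record (a quoted string starting with v=spf1) is present.
has_spf() {
  grep -q '^"v=spf1[ "]'
}

# Read `dig +short MX <domain>` output ("<priority> <host>." per line) and
# print the mail hosts ordered by priority, lowest number first.
mx_hosts() {
  sort -n | awk '{ print $2 }'
}
```

For example, `dig +short TXT yoursaas.com | has_spf || echo "SPF record missing"` flags a deleted SPF record, and an empty result from `dig +short MX yoursaas.com | mx_hosts` flags missing MX records.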
How to Set Up DNS Record Monitoring
The core DNS monitoring workflow is simple: record the current value of each critical DNS record, then compare against it on every check. Alert on any change or on resolution failure.
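```bash
#!/bin/bash
# DNS record monitoring script: compare live records against expected values
DOMAIN="yoursaas.com"
APP_DOMAIN="app.yoursaas.com"
EXPECTED_A="203.0.113.42" # Your production IP
EXPECTED_CNAME_TARGET="yoursaas.hosting.com"

# Check A record (an empty result means the record is missing or unresolvable)
CURRENT_A=$(dig +short A "$DOMAIN" | head -n 1)
if [ "$CURRENT_A" != "$EXPECTED_A" ]; then
  echo "⚠️ A record changed or missing: expected $EXPECTED_A, got '$CURRENT_A'"
  # Trigger alert
fi

# Check CNAME target (dig returns the target with a trailing dot)
CURRENT_CNAME=$(dig +short CNAME "$APP_DOMAIN")
if [ "$CURRENT_CNAME" != "$EXPECTED_CNAME_TARGET." ]; then
  echo "⚠️ CNAME target changed or missing: expected $EXPECTED_CNAME_TARGET, got '$CURRENT_CNAME'"
  # Trigger alert
fi

# Check domain expiry. WHOIS field names vary by registry; this matches a
# "Registry Expiry Date: <timestamp>" line and takes its last field.
EXPIRY=$(whois "$DOMAIN" | grep -i -m 1 "expiry date" | awk '{print $NF}')
# date -d is GNU date syntax (Linux); BSD/macOS date needs different flags
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null)
if [ -z "$EXPIRY" ] || [ -z "$EXPIRY_EPOCH" ]; then
  echo "⚠️ Could not determine expiry date for $DOMAIN"
  exit 1
fi
NOW_EPOCH=$(date +%s)
DAYS_UNTIL_EXPIRY=$(( (EXPIRY_EPOCH - NOW_EPOCH) / 86400 ))
if [ "$DAYS_UNTIL_EXPIRY" -lt 30 ]; then
  echo "⚠️ Domain expires in $DAYS_UNTIL_EXPIRY days: $EXPIRY"
  # Trigger alert
fi
echo "DNS check complete. Domain expires in $DAYS_UNTIL_EXPIRY days."
```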
For production use, this runs every 5 minutes and sends results to your alerting system. PingSLA's SSL DNS Hunter implements this check (and the SSL equivalents) and alerts via WhatsApp, Slack, or email when records change or domain expiry approaches.
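If you wire the script into Slack yourself, the alert step can be a single webhook call. This is a minimal sketch that assumes a Slack incoming webhook whose URL is stored in a `SLACK_WEBHOOK_URL` environment variable; the variable and function names are hypothetical, and the escaping only covers single-line messages.

```bash
# Build a JSON payload for a Slack incoming webhook, escaping backslashes
# and double quotes so the message cannot break the JSON.
slack_payload() {
  escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
  printf '{"text": "%s"}' "$escaped"
}

# Post a one-line alert. SLACK_WEBHOOK_URL is assumed to be set in the
# environment of the monitoring job.
send_alert() {
  curl -s -X POST -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" "$SLACK_WEBHOOK_URL"
}
```

A crontab entry such as `*/5 * * * * /usr/local/bin/dns-check.sh` (the path is a placeholder) would run the check on the 5-minute schedule, with `send_alert` called wherever the script says "Trigger alert".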
DNS TTL and Incident Response Time
Your DNS TTL directly affects how quickly you can recover from a DNS incident and how quickly you can prevent accidental deletions from becoming full outages.
Low TTL (60–300 seconds): Changes propagate quickly. Beneficial for planned migrations. But also means an accidental deletion propagates to all resolvers within 5 minutes — leaving almost no time to catch and revert the change before it affects all users.
High TTL (3600+ seconds): Changes propagate slowly. An accidental deletion takes up to an hour to affect all resolvers — giving you a window to detect and correct it. But planned migrations require you to lower TTL 24–48 hours in advance.
The right strategy: run a permanently low TTL (300 seconds) combined with DNS change monitoring that alerts within 1–2 minutes of any record modification. The monitoring catch window is your safety net — not the TTL.
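It is worth verifying the TTL you are actually serving. The sketch below reads the TTL from the answer section that `dig +noall +answer` prints; the function names and the 300-second default ceiling are assumptions taken from the strategy above.

```bash
# Read `dig +noall +answer <type> <name>` output on stdin and print the TTL
# (second column) of the first answer line.
answer_ttl() {
  awk 'NF >= 5 { print $2; exit }'
}

# Succeed if a TTL is at or below a ceiling (default 300 seconds).
ttl_within_limit() {
  [ "$1" -le "${2:-300}" ]
}
```

A recursive resolver reports the remaining cache lifetime rather than the configured value, so query one of your authoritative nameservers directly, for example `dig +noall +answer A app.yoursaas.com @<your-nameserver> | answer_ttl`.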
- What is DNS monitoring?
- DNS monitoring is the practice of continuously checking that your domain's DNS records exist, have the correct values, and resolve successfully. It alerts when records change unexpectedly, when records are deleted, when domain expiry approaches, or when DNS resolution fails. It is distinct from uptime monitoring (which checks HTTP responses) and must be implemented separately.
- How quickly does a DNS failure affect users?
- DNS failures reach users as their resolver's cached record expires. With a TTL of 300 seconds, every cached copy expires within 5 minutes of the deletion, so close to 100% of users are affected within about 5 minutes. With a TTL of 3600 seconds, the failure spreads over up to an hour, but all users are eventually affected.
- What DNS records should I monitor for my SaaS product?
- At minimum: A records for all user-facing domains (main domain and app subdomain), CNAME records for any subdomain using CNAME routing, domain expiry date (alert at 60/30/7 days), and NS records to detect unexpected nameserver changes. For products with email-dependent features, also monitor MX and TXT (SPF/DKIM) records.
- How do I get alerted when a DNS record changes?
- Set up DNS monitoring that records the current expected value of each critical DNS record and compares on every check cycle. Alert immediately on any change — even intentional changes should be confirmed, not silently accepted, because the distinction between intentional and accidental is only known to the person who made the change. PingSLA's SSL DNS Hunter monitors record values and alerts on changes via WhatsApp, Slack, or email.
- What is the difference between DNS monitoring and uptime monitoring?
- Uptime monitoring checks whether your server returns a successful HTTP response. DNS monitoring checks whether your domain name resolves to the correct IP address. DNS failure prevents the HTTP request from reaching your server at all, so uptime monitoring cannot detect DNS failures — the probe gets a DNS resolution error before it even makes an HTTP connection. DNS monitoring and uptime monitoring are complementary and both are required.
How old is your DNS A record? When does your domain expire? PingSLA's SSL DNS Hunter scans your DNS records, checks record values, and calculates domain expiry — free, no account required. Know where you stand before a deletion catches you off guard.
For continuous DNS record monitoring with change detection and 30/7-day domain expiry alerts via WhatsApp and Slack, see PingSLA plans.