deployment · devops · sre · monitoring · checklist · production

Post-Deployment Monitoring: The 10-Minute Check That Prevents Production Incidents

PingSLA Team · 8 min read

The deploy succeeded. The pipeline is green. The build passed all tests. You merged to main and everything looks fine.

Twelve minutes later, 400 users are hitting a broken checkout. The deploy introduced a subtle environment variable mismatch that only surfaces in production with real payment credentials. The monitoring didn't notice because it's watching the status code, not the checkout flow. The engineering team finds out when support tickets spike 45 minutes after the deploy.

This is not a rare scenario. Internal data across incident post-mortems consistently shows 55–65% of production incidents occur within the first 20 minutes of a deployment. The reason is simple: deploys are the highest-risk state change your production system makes. Every environment assumption, third-party integration, and configuration dependency is at elevated risk during the 15-minute post-deploy window.

Post-deploy monitoring is not regular monitoring with a different name. It is an elevated-sensitivity verification pass specifically designed for the period immediately after a change.

Why Post-Deploy Monitoring Is Different from Regular Monitoring

Regular production monitoring is designed for stability — it fires when something transitions from normal to broken. Post-deploy monitoring is designed for verification — it confirms that the new deployment didn't introduce a new broken state.

The difference in practice:

Regular monitoring: Runs on the same 5-minute check interval 24/7. Alert threshold tuned to avoid false positives. Acceptable that a transient spike doesn't fire an alert.

Post-deploy monitoring: Runs every 1 minute for the first 15 minutes after deploy. Alert threshold lowered (more sensitive). Any deviation from expected behaviour warrants investigation — even transient.

The risk profile is fundamentally different. In regular operation, a single slow response is probably noise. In the 10 minutes after a deploy, a single slow response might be the first signal of a cascading failure.
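
In practice, the elevated window needs no special tooling. A minimal sketch, assuming a hypothetical post-deploy-checks.sh wrapper around the checklist below, run from the deploy job:

# Elevated-sensitivity window: re-run the smoke checks every minute for 15 minutes
# post-deploy-checks.sh is a placeholder for whatever wraps your post-deploy checks
for minute in $(seq 1 15); do
  if ! ./post-deploy-checks.sh; then
    echo "❌ Post-deploy check failed at minute $minute: investigate or roll back"
    exit 1
  fi
  sleep 60
done
echo "✅ 15-minute elevated window passed, reverting to standard monitoring"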

What Typically Breaks Within 10 Minutes of a Deploy

Based on common incident patterns, ranked by frequency:

  1. Environment variable not set or incorrectly migrated (broken third-party integrations, auth failures)
  2. Database migration ran but left schema in a transitional state
  3. New code path dependent on a cache that hasn't been warmed
  4. Static asset URL changed, breaking existing sessions that reference the old path
  5. Third-party API key rotated but new key not deployed everywhere
  6. Feature flag defaulting to wrong state in production
  7. New dependency loaded in production for the first time (network access, permissions)
  8. Memory or CPU spike from a previously undetected inefficiency in the new code

Most of these are not caught by unit tests, integration tests, or staging environments. They only manifest in production, with production load, production credentials, and production infrastructure.

The 10-Point Post-Deploy Monitoring Checklist

Run these checks in order, within the first 10 minutes of every production deployment.

1. HTTP Response Code Verification

Confirm your primary pages return the expected status codes.

# Quick multi-endpoint status check
for url in \
  "https://yoursaas.com" \
  "https://app.yoursaas.com/login" \
  "https://app.yoursaas.com/dashboard" \
  "https://api.yoursaas.com/health"
do
  status=$(curl -o /dev/null -s -w "%{http_code}" "$url")
  echo "$url$status"
done

Expected: 200 for all. Any 5xx is a critical failure. Any 4xx on a previously-working URL needs immediate investigation.

2. Response Time Baseline Comparison

Compare current response times against pre-deploy baseline. A 50% increase in p95 response time after a deploy is a warning sign, even if absolute times are still within thresholds.

# Compare response time to baseline
baseline_ttfb=280  # ms, from pre-deploy measurement
current_ttfb=$(curl -o /dev/null -s -w "%{time_starttransfer}" https://app.yoursaas.com/dashboard)
current_ms=$(echo "$current_ttfb * 1000" | bc | cut -d. -f1)

# Flag anything more than 50% above baseline, matching the p95 guideline above
if [ "$current_ms" -gt $(( baseline_ttfb * 3 / 2 )) ]; then
  echo "⚠️  TTFB degradation: ${current_ms}ms vs baseline ${baseline_ttfb}ms"
fi

3. SSL Certificate Still Valid

Deployments occasionally introduce SSL configuration changes or CDN changes that break SSL. Verify immediately.

# SSL validity check (pass -servername so SNI returns the certificate actually served)
ssl_expiry=$(echo | openssl s_client -servername yoursaas.com -connect yoursaas.com:443 2>/dev/null | \
  openssl x509 -noout -enddate | cut -d= -f2)
echo "SSL expires: $ssl_expiry"

4. Critical User Flow Synthetic Check

This is the most important check. Run a Playwright synthetic check of your most critical user flow — login, checkout, or primary product action — immediately after every deploy.

This is the check that would have caught the checkout failure in the scenario above.

// Post-deploy synthetic verification script
const { chromium } = require('playwright');

async function postDeployVerification() {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  const results = [];

  try {
    // Check 1: Login flow
    await page.goto('https://app.yoursaas.com/login');
    await page.fill('[data-testid="email"]', process.env.MONITOR_EMAIL);
    await page.fill('[data-testid="password"]', process.env.MONITOR_PASSWORD);
    await page.click('[data-testid="login-btn"]');
    await page.waitForSelector('[data-testid="dashboard"]', { timeout: 10000 });
    results.push({ check: 'login', status: 'pass' });

    // Check 2: Primary product page loads with data
    await page.goto('https://app.yoursaas.com/monitors');
    const monitorCount = await page.$$eval('[data-testid="monitor-row"]', els => els.length);
    if (monitorCount === 0) throw new Error('No monitors loaded — possible data fetch failure');
    results.push({ check: 'data-load', status: 'pass' });

    // Check 3: Critical API endpoint
    const apiResponse = await page.evaluate(async () => {
      const res = await fetch('/api/v1/monitors', { credentials: 'include' });
      const body = await res.json();
      return { status: res.status, count: body.monitors?.length ?? 0 };
    });
    if (apiResponse.status !== 200 || apiResponse.count === 0) {
      throw new Error(`API check failed: status ${apiResponse.status}, count ${apiResponse.count}`);
    }
    results.push({ check: 'api', status: 'pass' });

  } catch (error) {
    results.push({ check: 'failed', status: 'fail', error: error.message });
  } finally {
    await browser.close();
  }

  return results;
}

// Run the checks and exit non-zero on any failure so a CI step can gate on the result
postDeployVerification().then(results => {
  console.log(JSON.stringify(results, null, 2));
  if (results.some(r => r.status === 'fail')) process.exit(1);
});

5. API Endpoint Response Validation

Check your 3–5 most critical API endpoints. Verify status codes AND that response bodies contain expected data (not empty arrays or null fields).
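
A rough curl-based sketch; the endpoint paths and the $API_TOKEN variable are illustrative, and anything beyond an emptiness check should validate the body with jq against your real response schema:

# Verify critical API endpoints return 200 AND a non-empty payload
for endpoint in "/api/v1/monitors" "/api/v1/incidents" "/api/v1/billing/status"; do
  response=$(curl -s -w "\n%{http_code}" \
    -H "Authorization: Bearer $API_TOKEN" \
    "https://api.yoursaas.com$endpoint")
  status=$(echo "$response" | tail -n 1)
  body=$(echo "$response" | sed '$d')

  if [ "$status" != "200" ]; then
    echo "❌ $endpoint returned $status"
  elif [ -z "$body" ] || [ "$body" = "[]" ] || [ "$body" = "{}" ]; then
    echo "⚠️  $endpoint returned 200 but an empty body"
  else
    echo "✅ $endpoint OK"
  fi
done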

6. Error Rate Spike Detection

Pull your error rate for the 5 minutes since deploy and compare to the 30-minute pre-deploy baseline. A 3x increase in 5xx errors is a critical signal, even if absolute error count is low.
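
Most teams pull this from their APM or error tracker. If all you have is a web server log, a rough sketch that approximates the two time windows with request counts (assumes an nginx-style combined log at the path below, with the status code in field 9):

# Compare 5xx counts in the most recent traffic against the traffic just before it
log=/var/log/nginx/access.log

recent_5xx=$(tail -n 2000 "$log" | awk '$9 ~ /^5/' | wc -l)
baseline_5xx=$(tail -n 4000 "$log" | head -n 2000 | awk '$9 ~ /^5/' | wc -l)

echo "5xx errors: $baseline_5xx (pre-deploy window) vs $recent_5xx (post-deploy window)"
if [ "$recent_5xx" -gt $(( (baseline_5xx + 1) * 3 )) ]; then
  echo "⚠️  Error rate is more than 3x baseline: treat as a critical signal"
fi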

7. Database Query Latency

If your deploy included schema migrations or query changes, verify that database query times have not spiked. A migration that runs cleanly in staging can cause table lock contention in production under real load.
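
A minimal sketch for PostgreSQL, assuming the pg_stat_statements extension is enabled (column names are the PostgreSQL 13+ ones). Running SELECT pg_stat_statements_reset(); just before the deploy makes the post-deploy numbers directly attributable to the new code:

# Slowest queries since pg_stat_statements was last reset
psql "$DATABASE_URL" -c "
  SELECT left(query, 60)                   AS query,
         calls,
         round(mean_exec_time::numeric, 1) AS mean_ms,
         round(max_exec_time::numeric, 1)  AS max_ms
  FROM pg_stat_statements
  ORDER BY mean_exec_time DESC
  LIMIT 10;"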

8. Static Asset Delivery Verification

Verify that your CSS, JavaScript, and image assets are loading from the expected CDN URLs. A CDN configuration change or asset versioning issue can break the frontend while all API endpoints return 200 OK.
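
A quick sketch; the asset URLs are illustrative, so pull the real ones from your built index.html or asset manifest:

# Confirm core assets resolve from the expected CDN with the right content type
for asset in \
  "https://cdn.yoursaas.com/assets/app.css" \
  "https://cdn.yoursaas.com/assets/app.js" \
  "https://cdn.yoursaas.com/assets/logo.svg"
do
  read -r status type <<< "$(curl -o /dev/null -s -w "%{http_code} %{content_type}" "$asset")"
  echo "$asset → $status ($type)"
  if [ "$status" != "200" ]; then
    echo "⚠️  Asset failed to load: check CDN config and asset versioning"
  fi
done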

9. Third-Party Integration Health

Explicitly check each critical integration:

  • Payment gateway (can the payment widget load?)
  • Authentication provider (can the auth SDK initialise?)
  • Email service (can a test email be triggered?)
  • Analytics (are events being received?)
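
A minimal reachability sketch for the integrations above (every URL is a placeholder; substitute the script and status URLs your application actually loads). Reachability is only a proxy: the synthetic check in step 4 that exercises the widget itself is the stronger signal.

# Spot-check that critical third-party dependencies are reachable
# The name|url pairs are placeholders for your real payment, auth, and email providers
integrations=(
  "payments|https://js.example-payments.com/v3/checkout.js"
  "auth|https://auth.example-idp.com/.well-known/openid-configuration"
  "email|https://status.example-email.com/api/v2/status.json"
)
for entry in "${integrations[@]}"; do
  name=${entry%%|*}
  url=${entry#*|}
  status=$(curl -o /dev/null -s -w "%{http_code}" "$url")
  echo "$name → $status"
done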

10. Mobile Viewport Check

Run a quick mobile rendering check. An HTML/CSS change can break mobile viewports while desktop looks fine. Use Playwright's mobile emulation or check from a real device.
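
If Playwright is already installed for the synthetic check above, its CLI can take a mobile-emulated screenshot in one command. The URL is illustrative and "iPhone 13" is one of Playwright's built-in device profiles:

# Quick mobile-emulation screenshot via the Playwright CLI
npx playwright screenshot \
  --device="iPhone 13" \
  --full-page \
  "https://app.yoursaas.com/login" \
  post-deploy-mobile.png

This only produces the screenshot; a person or a visual-diff step still has to look at it, but it takes seconds and catches layout breakage that status codes never will.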

Post-Deploy vs Regular Monitoring Comparison

Dimension | Regular Monitoring | Post-Deploy Monitoring
Check frequency | Every 5–30 minutes | Every 1 minute × 15 minutes
Alert sensitivity | Tuned for low false positives | Elevated (investigate anything unusual)
Scope | Defined critical paths | All paths touched by the deploy
Duration | Continuous | 15-minute elevated window
Rollback trigger | Sustained failures | Any confirmed regression
Team awareness | On-call only | Deploying engineer + on-call

Automating Post-Deploy Checks

The most effective implementation: trigger your monitoring verification run automatically on deploy completion.

# In your CI/CD pipeline (post-deploy step)
# Trigger PingSLA to run synthetic checks immediately
curl -X POST https://api.pingsla.com/v1/checks/run-now \
  -H "Authorization: Bearer $PINGSLA_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"monitorIds": ["login-flow", "checkout-flow", "api-health"], "reason": "post-deploy"}'

# Wait for results
sleep 120  # 2 minutes for checks to complete

# Fetch results
RESULT=$(curl -s https://api.pingsla.com/v1/checks/recent \
  -H "Authorization: Bearer $PINGSLA_API_KEY")

# Fail the pipeline if checks failed
if echo "$RESULT" | grep -q '"status":"fail"'; then
  echo "❌ Post-deploy checks failed. Consider rollback."
  exit 1
fi

echo "✅ Post-deploy checks passed."

What is post-deployment monitoring?
Post-deployment monitoring is an elevated-sensitivity verification pass that runs immediately after a production deployment to confirm the new code did not introduce regressions. Unlike regular monitoring, it runs at higher frequency (every 1 minute), at lower alert thresholds, and specifically covers paths that may have been affected by the deploy. It differs from pre-deploy checks in that it validates real production behaviour under live traffic.

How long should I monitor a production deployment for issues?
The highest-risk window is the first 15–20 minutes after deploy, during which the majority of deploy-related incidents manifest. Elevated monitoring should run for at least 15 minutes at 1-minute intervals. After that, standard monitoring resumes. For major releases (architectural changes, database migrations), extend the elevated monitoring window to 60 minutes.

What is a deployment smoke test?
A deployment smoke test is a minimal set of automated checks run immediately after a deployment to verify that the most critical functions of the application still work. It typically covers: the application starts and serves requests, the login flow works, at least one primary feature completes successfully, and no critical API endpoints are returning errors. It is designed to be fast (under 2 minutes) and catch obvious regressions before users encounter them.

How do I trigger monitoring checks after a deployment in my CI/CD pipeline?
Use your monitoring tool's API to trigger a check run as a post-deploy step in your CI/CD pipeline (GitHub Actions, CircleCI, Buildkite, etc.). After triggering, wait 1–2 minutes for checks to complete, then fetch results via the API and fail the pipeline step if any checks returned failures. This creates an automated gate that prevents silent regressions from going undetected.

What is the difference between canary deployment and post-deploy monitoring?
Canary deployment is a deployment strategy that routes a small percentage of traffic to the new version before full rollout. Post-deploy monitoring is a verification practice that applies to any deployment strategy, including canary. They are complementary: canary reduces blast radius if something goes wrong; post-deploy monitoring detects problems quickly regardless of deployment strategy.

Know within 5 minutes whether your deploy broke production. PingSLA's Infrastructure Audit runs a complete health check across your endpoints, SSL, DNS, APIs, and synthetic flows — free, no account required.

For automated post-deploy monitoring triggered by your CI/CD pipeline, with Playwright synthetic checks and WhatsApp alerts, see PingSLA plans.

Related reading: Monitoring Checklist Before Deploy · Synthetic Monitoring Explained · Login Flow Monitoring
