Use traces, health checks, and operational runbooks to separate carrier outages from your own integration failures.
Telemetry Is the First Triage Tool
Carrier incidents resolve faster when you can answer three questions immediately: which operation failed, which correlation IDs are affected, and whether the failures are concentrated by carrier, account, region, or deployment version. Logs alone are rarely enough; you also want request metrics to show failure rates over time and traces to show where in the call path a request failed.
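As a rough illustration of the three triage questions, the sketch below groups failure events by operation and reports, for each dimension, which value accounts for the largest share of failures. The FailureEvent record, its field names, and both helper functions are hypothetical; a real system would run the equivalent query against its own metrics or tracing backend.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureEvent:
    # Hypothetical shape of one failed carrier request.
    operation: str
    correlation_id: str
    carrier: str
    account: str
    region: str
    deploy_version: str


# Dimensions along which failures may be concentrated.
DIMENSIONS = ("carrier", "account", "region", "deploy_version")


def affected_ids(events):
    """Map each failing operation to the correlation IDs it affected."""
    by_operation = defaultdict(list)
    for event in events:
        by_operation[event.operation].append(event.correlation_id)
    return dict(by_operation)


def concentration(events):
    """For each dimension, return (most common value, its share of failures)."""
    total = len(events)
    result = {}
    for dim in DIMENSIONS:
        counts = Counter(getattr(event, dim) for event in events)
        if not counts:
            continue  # no events, nothing to report
        value, count = counts.most_common(1)[0]
        result[dim] = (value, count / total)
    return result
```

A share near 1.0 on a single carrier with a flat spread across deployment versions points toward the carrier; the opposite pattern points toward your own release.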
Health Checks Need Intent
A green health endpoint only proves your app process is alive. For carrier integrations, the more useful checks validate dependencies, credential freshness, and synthetic carrier reachability without triggering expensive writes. Use lightweight probes that tell operators where to look next.
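One way to give health checks intent is to have each probe return a named result with a hint for the operator. The sketch below is a minimal version under stated assumptions: the probe names, the 24-hour freshness threshold, and the read_only_call parameter (any cheap, non-mutating carrier request you supply) are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class ProbeResult:
    name: str
    ok: bool
    hint: str  # tells the operator where to look next; empty when ok


def check_credential_freshness(expires_at, now, min_remaining=timedelta(hours=24)):
    """Fail when the carrier credential expires within min_remaining (assumed threshold)."""
    ok = (expires_at - now) > min_remaining
    hint = "" if ok else "Carrier credential is expired or close to expiry; check the rotation job."
    return ProbeResult("credential_freshness", ok, hint)


def check_carrier_reachability(read_only_call):
    """Run a caller-supplied read-only carrier request; never a write."""
    try:
        read_only_call()
    except Exception as exc:
        return ProbeResult(
            "carrier_reachability", False, f"Read-only carrier call failed: {exc!r}"
        )
    return ProbeResult("carrier_reachability", True, "")


def health_report(results):
    """Combine probe results into one payload a health endpoint could return."""
    status = "ok" if all(r.ok for r in results) else "degraded"
    probes = {r.name: {"ok": r.ok, "hint": r.hint} for r in results}
    return {"status": status, "probes": probes}
```

Returning per-probe detail instead of a single boolean keeps the endpoint cheap while still telling the responder whether to start with credentials or with the carrier.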
Carrier Reality
Sandbox credentials often stay healthy while production credentials drift, rotate, or lose permissions. If your observability stack does not label every signal by environment and let you compare environments side by side, you can waste hours debugging the wrong system.
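A small sketch of that comparison follows, assuming you already collect a per-environment credential check result for each carrier. The input shape, the carrier names, and the credential_drift function are hypothetical.

```python
def credential_drift(results):
    """Return carriers whose credential check disagrees across environments.

    results maps carrier -> {environment: credential_ok}. Each returned item
    is (carrier, [environments where the check failed]).
    """
    drifted = []
    for carrier, by_env in sorted(results.items()):
        # More than one distinct outcome means the environments disagree.
        if len(set(by_env.values())) > 1:
            failing = sorted(env for env, ok in by_env.items() if not ok)
            drifted.append((carrier, failing))
    return drifted


checks = {
    "carrier_a": {"sandbox": True, "production": False},
    "carrier_b": {"sandbox": True, "production": True},
}
print(credential_drift(checks))
```

A carrier that fails in every environment is deliberately not reported here, since that pattern suggests a carrier-side or shared problem, not drift between environments.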
Runbooks Should Reference Real Evidence
A runbook should tell the responder exactly which dashboards, carrier lookup paths, dead-letter queues, and compensation levers matter for this incident class. If the runbook only says 'check logs,' you do not actually have a runbook yet.
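To keep runbooks honest, one option is to store each entry as structured data and flag entries that lack concrete evidence pointers. This is a sketch only: the RunbookEntry fields mirror the four kinds of evidence named above, and every name and example value is hypothetical.

```python
from dataclasses import dataclass, field, fields


@dataclass
class RunbookEntry:
    incident_class: str
    dashboards: list = field(default_factory=list)
    carrier_lookup_paths: list = field(default_factory=list)
    dead_letter_queues: list = field(default_factory=list)
    compensation_levers: list = field(default_factory=list)


def missing_evidence(entry):
    """Return the names of evidence fields that are still empty."""
    return [
        f.name
        for f in fields(entry)
        if f.name != "incident_class" and not getattr(entry, f.name)
    ]


# Hypothetical entry that names dashboards and a queue but nothing else.
entry = RunbookEntry(
    incident_class="label_creation_failures",
    dashboards=["carrier-errors-by-operation"],
    dead_letter_queues=["labels-dlq"],
)
print(missing_evidence(entry))
```

Running a check like this in CI or during runbook review surfaces entries that would leave a responder with nothing more specific than 'check logs.'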