
We Knew How to Failover, But We Didn’t Know When to Failover

by Jonathan Scharf
December 9, 2025

Lessons From The AWS Outage About The Blind Spot In Resiliency Planning

During a retrospective on the recent AWS outage related to DynamoDB, one of my colleagues said something I have been thinking about ever since:

“We knew how to failover, but we didn’t know when to failover.”

This so perfectly captures a hidden weakness in many teams’ resiliency strategies. Scripts, runbooks, automation, and health checks all get tested fairly regularly and are a critical part of any resiliency strategy, but we rarely rehearse the decision to failover. When things break, that gap becomes painfully obvious.

The Hardest Part Of Failover Isn’t Technical

If you asked most teams, “Do you know how to failover?” they would almost always say yes immediately. They would point out that they have:

  • A secondary region ready
  • Database replicas ready
  • A runbook
  • Maybe even a recent failover exercise they can point to

But when the outage actually occurs, the questions sound more like:

  • Is this an AWS outage or something on our side?
  • Are things getting better or worse?
  • Are we failing over too early?
  • Will failing over make things even worse?
  • Who has the authority to pull the trigger?

None of these are technical questions. They are judgment calls, and when judgment is unclear, teams hesitate. This hesitation can turn a 10-minute disruption into an hour… or worse.

Why “When To Failover” Is So Hard

There are a few big reasons why teams tend to hesitate during decision time.

Outages are rarely black and white

Most incidents are brownouts, not full blackouts: partial region degradation, intermittent timeouts, or elevated error rates. None of these scream “Flip the switch!”

Failing over too early might risk data divergence, cost spikes, or new bugs in that secondary region. Failing over too late could mean customer impact, missed SLAs, or a longer overall incident time. You’re trying to make a potentially high-stakes call with incomplete information.

Teams assume the decision is obvious

Runbooks often say how to execute the failover but rarely specify when it is justified. There is no threshold, no escalation rule, no owner, and no timer. If no one knows who decides, no one decides.

Business tradeoffs aren’t pre-aligned

Failover is a business decision presented as a technical one.

  • Is 5 minutes of downtime worse than a 0.1% risk of data loss?
  • Is customer experience more important than operational cost?
  • Should we bias towards safety or uptime?

You can’t resolve these questions mid-incident.

Incident Readiness Is A Mix Of Technical And Decision Readiness

Reliability isn’t just about systems; it’s about people. You can have multi-region everything, but if your team is afraid to failover, or unsure whether they are allowed to, you effectively don’t have failover at all. High-maturity teams treat failover as both a technical process and a decision process. Many companies only prepare for the first.

A Framework To Fix This

I have been thinking a lot about how we can break this cycle and better define our “when.”

Define objective failover triggers

Set clear, observable conditions that justify failover. For example:

  • Error rate above 10% for 15 consecutive minutes
  • Latency above SLO by X% for Y minutes
  • AWS (or another vendor) confirms a regional or service degradation
  • Database replication lag above a specified threshold for N seconds
  • Queue processing falls below baseline with no recovery trend

Objective metrics help prevent emotional decisions… or emotional indecision.
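To make this concrete, here is a minimal sketch in Python of how triggers like these could be encoded and checked against live metrics. The metric names, threshold values, and the `snapshot`/`sustained` inputs are illustrative assumptions, not part of any particular monitoring product; wire them up to whatever your observability stack exposes.

```python
from dataclasses import dataclass

@dataclass
class FailoverTrigger:
    name: str               # metric key expected in the snapshot
    threshold: float        # value at which the condition is considered breached
    sustained_minutes: int  # how long the breach must hold before it counts

# Example thresholds loosely matching the list above; tune these to your own SLOs.
TRIGGERS = [
    FailoverTrigger("error_rate_pct", threshold=10.0, sustained_minutes=15),
    FailoverTrigger("p99_latency_ms", threshold=500.0, sustained_minutes=10),
    FailoverTrigger("replication_lag_s", threshold=60.0, sustained_minutes=5),
]

def fired_triggers(snapshot: dict[str, float], sustained: dict[str, int]) -> list[str]:
    """Return the names of triggers whose condition has held long enough to act on.

    `snapshot` holds the latest metric values; `sustained` tracks how many
    consecutive minutes each metric has been over its threshold (maintained by
    whatever loop polls your monitoring system).
    """
    return [
        t.name
        for t in TRIGGERS
        if snapshot.get(t.name, 0.0) > t.threshold
        and sustained.get(t.name, 0) >= t.sustained_minutes
    ]
```

The point is not the specific numbers; it is that the decision input becomes a named, reviewable list of thresholds rather than someone’s gut feeling at 3 a.m.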

Timebox investigations

Adopt a rule such as “If a root cause isn’t identified within 20 minutes, assume external failure and failover.” Replace guesswork with a clear, predictable clock.
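A small sketch of what that clock might look like in tooling, assuming a 20-minute window (the duration itself is whatever your team agrees on ahead of time):

```python
from datetime import datetime, timedelta, timezone

INVESTIGATION_WINDOW = timedelta(minutes=20)  # the agreed-on timebox

def failover_deadline(incident_start: datetime) -> datetime:
    """The moment the default action flips from 'keep investigating' to 'failover'."""
    return incident_start + INVESTIGATION_WINDOW

def timebox_expired(incident_start: datetime, now: datetime | None = None) -> bool:
    """True once the window has elapsed without an identified root cause."""
    now = now or datetime.now(timezone.utc)
    return now >= failover_deadline(incident_start)
```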

Predefine the decision owner

Someone must have the authority to call for a failover. Ensure that person is reachable, and designate a backup who has the same authority in case they are not. Ambiguity is costly. Define the owner before the outage, and follow that chain of command when the owner is unavailable.
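One lightweight option is to keep the chain of command as data next to the runbook, so there is never a debate about who is next. This is a hypothetical sketch; the roles, names, and contact fields are placeholders for whatever your org actually uses.

```python
# Hypothetical escalation chain for the failover decision, in priority order.
FAILOVER_DECISION_CHAIN = [
    {"role": "primary owner", "name": "on-call SRE lead", "contact": "pager:sre-lead"},
    {"role": "backup owner", "name": "platform engineering manager", "contact": "pager:platform-mgr"},
    {"role": "last resort", "name": "VP of engineering", "contact": "phone"},
]

def current_decision_owner(reachable: set[str]) -> dict | None:
    """Return the highest-priority entry whose name is currently reachable."""
    for entry in FAILOVER_DECISION_CHAIN:
        if entry["name"] in reachable:
            return entry
    return None
```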

Run “decision drills,” not just failover drills

Most planned chaos engineering events focus on systems breaking. But you can also simulate a gray-area outage: conflicting signals, an uncertain AWS status page, internal disagreement, a timebox about to expire. In other words, practice the judgment, not just the mechanics. This is where teams build confidence.

The Real Lesson From The Outage

Every outage is a reminder that systems fail in new and unpredictable ways. But during the AWS outage, many teams discovered something surprising. Their technology was ready, but their people weren’t empowered.

The next outage won't ask whether you know how to failover. It will ask whether you know when to failover. If you need help getting there, drop us a line at sales@ipponusa.com.

