
Making Sense of the October 20 AWS Outage


What happened, how to communicate impact, and how to brief business leaders—turning a technical disruption into something clear and constructive.

On the morning of October 20, AWS services in us-east-1 were degraded, most notably by failing DNS resolution for the DynamoDB endpoint. I didn't find out from an outage alert or a dashboard; I found out the same way many of us did: when the apps on my phone stopped working.

That's the reality of modern infrastructure incidents: they often surface as user-facing failures long before the official root cause analysis lands in your inbox. And when you're the one who has to walk into a meeting and explain what happened to executives who don't have AWS acumen, you need more than technical accuracy—you need clarity.

This article is for that moment. If you're now trying to explain this outage to leadership while also planning how to prevent the next one, this one's for you.

What Actually Happened

Early on October 20, Amazon Web Services experienced elevated error rates across multiple services in the us-east-1 region. AWS reported that the root cause appeared to be related to DNS resolution issues for the DynamoDB endpoint.

Here's what made this incident particularly insidious: it wasn't a full region going dark or a global outage—it was a DNS problem at a foundational layer. And when DNS breaks, everything that depends on it breaks too.

DNS (Domain Name System) is the internet’s address book—it translates service names like dynamodb.us-east-1.amazonaws.com into actual IP addresses. When DNS fails, services can’t find one another. It’s like someone removing all the street signs in a city overnight—your services are still there, but nothing can find its way.
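To make that concrete, here is a minimal sketch of what resolving the DynamoDB endpoint actually involves, using only the Python standard library. Nothing here is specific to our stack; it simply shows that when DNS fails, the lookup itself errors out before a single request is ever sent.

```python
import socket

# The regional endpoint whose resolution was affected on October 20.
ENDPOINT = "dynamodb.us-east-1.amazonaws.com"

try:
    # getaddrinfo performs the DNS lookup: service name -> IP addresses.
    for family, _, _, _, sockaddr in socket.getaddrinfo(
        ENDPOINT, 443, proto=socket.IPPROTO_TCP
    ):
        print(f"{ENDPOINT} -> {sockaddr[0]}")
except socket.gaierror as exc:
    # This is roughly what affected applications experienced: the service
    # itself is healthy, but its name cannot be resolved into an address.
    print(f"DNS resolution failed for {ENDPOINT}: {exc}")
```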

For us (like many of you), it meant increased error rates, intermittent failures, and retry storms in some of our serverless applications. We didn’t lose data. We didn’t experience catastrophic downtime. But we did see the kind of disruption that doesn’t take your business offline but does frustrate users and flood support tickets. 

As AWS began recovery, a second wave of disruption emerged from retries, service restarts, and delayed DNS propagation—showing that recovery can be just as stressful on systems as the outage itself if they aren’t built to fail gracefully.
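One of the simplest defenses against that second wave is making retries back off and spread out instead of hammering a dependency the moment it starts recovering. Here is a minimal sketch of exponential backoff with full jitter in plain Python; the wrapped call and the limits are illustrative placeholders, not values from the incident.

```python
import random
import time

def call_with_backoff(operation, max_attempts=5, base_delay=0.2, max_delay=10.0):
    """Retry a transient-failure-prone call with exponential backoff and jitter.

    Randomizing the delay is what keeps a whole fleet of clients from
    retrying in lockstep and turning a provider's recovery into a storm.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as exc:  # in real code, catch only known transient errors
            if attempt == max_attempts:
                raise
            delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.2f}s")
            time.sleep(delay)

# Usage sketch: wrap whatever call was failing intermittently.
# item = call_with_backoff(lambda: table.get_item(Key={"id": "123"}))
```

The AWS SDKs already apply this kind of backoff to their own calls; the discipline matters most in the retry logic you write yourself, between your own services.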

Having the Leadership Conversation

This is where a lot of engineers and architects struggle. You know the issue was upstream. You know it's AWS's infrastructure. But leadership doesn't care about the cloud provider—they care about impact and what we're doing about it.

After an incident like this, frame it as a dependency visibility problem, not a blame game:

“We were affected by a regional outage in AWS's us-east-1 region due to DNS issues that temporarily disrupted our ability to read and write from DynamoDB. This impacted certain user-facing services and background workflows. The outage has since been resolved, but it exposed areas where we're overly dependent on single-region infrastructure and regional service discovery.”

Then immediately shift to action:

“We're taking this opportunity to map our regional dependencies, prioritize applications by criticality, and identify where we need fallback logic, health checks, and multi-region routing in place.”

That's a far more useful message than "AWS went down." It shows ownership, demonstrates understanding, and presents a path forward.

The Hard Questions You Need to Answer

Today's incident reinforces a lesson that cloud-native teams are often slow to learn: you can't protect what you can't see.

If I asked you to tell me:

  • Which of your apps run in us-east-1?

  • Which rely on DynamoDB?

  • Which of those are Tier 1 or customer-facing?

  • Which of those have active-active failover across regions?


…could you answer confidently?

Most teams can't. And that's not a judgment—it's just the reality of growing cloud estates, microservices, and organic infrastructure decisions that pile up over time. Services get deployed. Teams change. Documentation drifts. Before you know it, you're running critical workloads on infrastructure patterns that no one fully understands anymore.
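Rebuilding that visibility does not have to start with a big program; even a small script helps. Here is a hedged sketch using boto3 that lists the DynamoDB tables in us-east-1 and flags which ones have cross-region replicas (global tables). Criticality and tiering still have to come from your own service catalog, and the response fields assume the current DynamoDB API.

```python
import boto3

# Assumes read-only DynamoDB credentials are already configured.
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

table_names = []
for page in dynamodb.get_paginator("list_tables").paginate():
    table_names.extend(page["TableNames"])

for name in table_names:
    table = dynamodb.describe_table(TableName=name)["Table"]
    # Global tables (2019.11.21 version) list their replicas here;
    # single-region tables simply have no "Replicas" entry.
    replicas = [r["RegionName"] for r in table.get("Replicas", [])]
    print(f"{name}: " + (", ".join(replicas) if replicas else "us-east-1 only"))
```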

This is why resilience can't be treated as a feature you build into individual applications. It needs to be a program—a cross-functional process that brings together cloud, security, product, and engineering leaders to regularly answer one question:

“What happens if [critical region or dependency] goes away?”

Making Resilience Visible

For smaller teams, resilience often starts with brute force: replicate data across regions, turn on retries, and hope for the best. But in large, complex organizations, that approach quickly falls apart. You can't brute force your way into confidence.

What's needed is a structured, risk-informed approach that starts by identifying the most critical user journeys and understanding the infrastructure, data layers, and service dependencies that support them. It's not enough to know what's running; you need to know what matters most, where the failure points are, and what the business impact would be if they went offline for an hour—or a day.

This is where tools like AWS Resilience Hub become valuable. By formally defining applications, assessing risk against defined RTO/RPO targets, and surfacing concrete remediation recommendations, you can turn resilience from a vague aspiration into a measurable practice.
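To give a sense of the shape of that practice: a resiliency policy in Resilience Hub is essentially a set of RTO and RPO targets per disruption type, and applications are then assessed against it. The boto3 sketch below shows roughly what defining one looks like; the field names and allowed values reflect my reading of the Resilience Hub API, and the numbers are purely illustrative, so treat it as a starting point rather than a recipe.

```python
import boto3

# Assumes an account with Resilience Hub available and suitable IAM permissions.
client = boto3.client("resiliencehub", region_name="us-east-1")

# Illustrative targets: one-hour RTO for a regional disruption, tighter
# targets for smaller failure domains. Your tiers will differ.
response = client.create_resiliency_policy(
    policyName="tier1-customer-facing",
    tier="Critical",
    policy={
        "Software": {"rtoInSecs": 900, "rpoInSecs": 300},
        "Hardware": {"rtoInSecs": 900, "rpoInSecs": 300},
        "AZ": {"rtoInSecs": 1800, "rpoInSecs": 900},
        "Region": {"rtoInSecs": 3600, "rpoInSecs": 900},
    },
)
print(response)  # The response includes the new policy's ARN for later assessments.
```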

Once you’ve defined that posture, you can validate it with AWS Fault Injection Service (FIS)—a managed chaos engineering tool that lets you safely simulate real-world failures. Whether it’s testing how your systems respond to DNS timeouts, API throttling, or instance termination, FIS helps you prove that your resilience isn’t just theoretical.
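For illustration, an FIS experiment is defined as a template of actions, targets, and stop conditions. The sketch below creates a template that stops one tagged EC2 instance and aborts automatically if an error-rate alarm fires; the action ID and field names follow the FIS API as I understand it, and the role ARN, alarm, and tags are placeholders you would replace with your own.

```python
import boto3

fis = boto3.client("fis", region_name="us-east-1")

template = fis.create_experiment_template(
    clientToken="post-outage-resilience-check-001",
    description="Stop one tagged instance and verify the service degrades gracefully",
    roleArn="arn:aws:iam::123456789012:role/fis-experiment-role",  # placeholder
    stopConditions=[
        {
            # Abort the experiment automatically if this alarm fires.
            "source": "aws:cloudwatch:alarm",
            "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:checkout-error-rate",
        }
    ],
    targets={
        "tagged-instances": {
            "resourceType": "aws:ec2:instance",
            "resourceTags": {"chaos-ready": "true"},
            "selectionMode": "COUNT(1)",  # act on a single matching instance
        }
    },
    actions={
        "stop-one-instance": {
            "actionId": "aws:ec2:stop-instances",
            "targets": {"Instances": "tagged-instances"},
        }
    },
)
print(template)  # The response includes the template id used to start experiments.
```

The point is not the specific action; it is that failure scenarios become versioned, reviewable definitions instead of one-off experiments someone runs by hand.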

But tooling alone isn't enough. The real work is cultural: getting teams to prioritize resilience alongside feature delivery, building observability around failure modes (not just uptime), and ensuring that architecture decisions are informed by real-world risk rather than best-case assumptions.

Building resilience takes more than tools—it takes collaboration across engineering, architecture, and risk to make it part of how an organization operates. Through structured reviews, clear dependency mapping, and transparent reporting, teams can turn resilience from a concept into a measurable, ongoing practice. The goal isn’t perfection—it’s visibility, intention, and readiness.

Where to Go From Here

Today’s outage wasn’t catastrophic—but it was loud and disruptive enough to get everyone’s attention. It showed how something as foundational as DNS, when it fails, can impact services up and down the stack—no matter how modern or distributed your architecture looks on paper.

This wasn’t just noise. It revealed real architectural risks that often go unnoticed until they become outages—and reminded us how easy it is to be dependent on a single region, endpoint, or layer of the stack without realizing it.

Outages like this are reminders, not just disruptions. Taking deliberate steps toward resiliency and dependency awareness is what lets us rest easy—knowing that when something fails, our business won’t.

Use this as a moment to begin a conversation—across architecture, risk, and engineering teams—about what resilience really means for your organization. Not just in terms of uptime, but in how you plan, measure, and communicate it. The goal isn’t to make everything indestructible; it’s to make risk visible, decisions intentional, and recovery predictable.

Let’s make resilience part of how we work—not just how we respond.

