AWS has experienced several significant outages and disruptions over the years, often affecting specific regions or Availability Zones (AZs). Below are key examples of AWS region-related incidents, focusing on their scope and impact. Note that AWS does not always publish detailed post-event summaries, and public records do not capture every minor event.
Notable AWS Region Disasters
- February 28, 2017 – US-EAST-1 (Northern Virginia) S3 Outage
- What Happened: An S3 engineer debugging the billing system mistyped a command intended to remove a small number of servers, inadvertently removing a much larger set that supported S3's index and placement subsystems in US-EAST-1. This cascaded to other AWS services dependent on S3, like EC2 and Lambda.
- Impact: Disrupted major websites and services (e.g., Netflix, Slack, and parts of Amazon’s own retail site). Estimated losses were in the hundreds of millions of dollars for affected businesses.
- Scope: Limited to US-EAST-1 but had widespread effects due to the region’s heavy usage.
- AWS Response: AWS published a detailed post-event summary, acknowledging the error and outlining improvements to prevent recurrence.
- November 25, 2020 – US-EAST-1 Kinesis Outage
- What Happened: Adding capacity to the Amazon Kinesis Data Streams front-end fleet pushed its servers past an operating-system thread limit, causing the fleet to fail and cascading to dependent AWS services (e.g., CloudWatch, Lambda) in US-EAST-1.
- Impact: Affected real-time data processing for companies like Autodesk and disrupted services relying on Kinesis.
- Scope: Confined to US-EAST-1 but highlighted inter-service dependencies.
- AWS Response: A post-event summary was released, detailing the root cause and mitigation steps.
- December 7, 2021 – US-EAST-1 Network Device Failure
- What Happened: An automated scaling activity triggered a surge of traffic on AWS's internal network, congesting the devices that connect it to the main AWS network in US-EAST-1 and disrupting services like EC2, RDS, and Lambda for over 8 hours.
- Impact: Major companies (e.g., Netflix, Disney+, Slack, Robinhood) experienced downtime, with Amazon’s own delivery operations affected due to app failures.
- Scope: Regional, centered in US-EAST-1.
- AWS Response: AWS published a summary attributing the issue to congestion on internal network devices.
- December 22, 2021 – US-EAST-1 Power Outage
- What Happened: A power loss in a single data center within the USE1-AZ4 Availability Zone of US-EAST-1 took down EC2 instances and EBS volumes hosted there.
- Impact: Affected services like Slack, Imgur, and Epic Games. Recovery took about 12 hours.
- Scope: Limited to one AZ within US-EAST-1, not a full region failure.
- AWS Response: No detailed public postmortem was widely noted, though status updates were provided.
- June 13, 2023 – US-EAST-1 Lambda Incident
- What Happened: A fault in a subsystem that manages capacity for AWS Lambda in US-EAST-1 degraded more than 100 AWS services, including Lambda, API Gateway, and the AWS Management Console.
- Impact: Noticeable disruptions across the internet, including Fortnite matchmaking and Slack outages.
- Scope: Regional, affecting US-EAST-1.
- AWS Response: A summary was published after a notable gap in public postmortems, possibly prompted by external pressure (e.g., commentary on X from users such as Gergely Orosz).
- July 30, 2024 – US-EAST-1 Kinesis Data Streams Event
- What Happened: Another Kinesis-related issue in US-EAST-1 caused service disruptions, though specifics are less detailed in public records.
- Impact: Affected real-time data workflows, though less widespread than prior incidents.
- Scope: Regional, US-EAST-1.
- AWS Response: A post-event summary was archived, per AWS’s public records.
Observations and Patterns
- US-EAST-1 Dominance: Many high-profile incidents occur in US-EAST-1 (Northern Virginia), one of AWS’s oldest and most utilized regions. Its size and complexity make it prone to cascading failures.
- Causes: Common triggers include human error (e.g., 2017 S3 outage), capacity scaling issues (e.g., 2020 Kinesis), network failures (e.g., 2021), and power disruptions (e.g., 2021 AZ outage).
- Scope: Most “disasters” are limited to a single region or AZ rather than multi-region failures. AWS designs regions to be independent, so a full multi-region outage is rare; customers typically exploit this by replicating across regions, as in the sketch after this list.
- Data Center Resilience: AWS claims it has never lost an entire data center (per a 2018 statement by an AWS VP), but individual AZs within regions have failed due to power or network issues.
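Because failures tend to stay within one region or AZ, a common customer-side mitigation is to keep a copy of critical data in a second region and fall back to it when the primary is impaired. The snippet below is a minimal sketch of that pattern using boto3; the bucket names, object key, and region pair are hypothetical, and it assumes the data is already replicated (e.g., via S3 Cross-Region Replication) and that credentials are configured.

```python
# Minimal sketch: read from a primary-region bucket, fall back to a
# replica in a second region if the primary call fails.
# Bucket names, key, and regions are hypothetical examples.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REPLICAS = [
    ("us-east-1", "example-data-primary"),   # hypothetical primary bucket
    ("us-west-2", "example-data-replica"),   # hypothetical replica bucket
]

def read_with_fallback(key: str) -> bytes:
    """Try each region in order; return the first successful read."""
    last_error = None
    for region, bucket in REPLICAS:
        try:
            s3 = boto3.client("s3", region_name=region)
            response = s3.get_object(Bucket=bucket, Key=key)
            return response["Body"].read()
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # region or bucket unreachable; try the next replica
    raise RuntimeError(f"all replicas failed for {key!r}") from last_error

if __name__ == "__main__":
    print(len(read_with_fallback("reports/latest.json")))
```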
Gaps in Records
- AWS only publishes Post-Event Summaries (PES) for incidents with “broad and significant customer impact” (e.g., major API failures or infrastructure loss). Smaller outages or AZ-specific issues often lack detailed public reports.
- Since late 2021, AWS has published public postmortems less frequently, drawing criticism (e.g., X posts in 2023 noted a gap of nearly two years until October 2023).
- Exact dates and details for some incidents (e.g., minor AZ failures) are not consistently documented publicly unless they escalate to region-wide impact.
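For incidents that never receive a public Post-Event Summary, account-specific history is still available through the AWS Health API (a Business or Enterprise support plan is required, and the API endpoint lives in us-east-1). The snippet below is a minimal sketch of listing recent operational issues for US-EAST-1 with boto3; the filter values are illustrative.

```python
# Minimal sketch: list recent AWS Health "issue" events affecting us-east-1
# for your own account. Requires a Business/Enterprise support plan and
# configured credentials; the Health API is served from us-east-1.
import boto3

health = boto3.client("health", region_name="us-east-1")

response = health.describe_events(
    filter={
        "regions": ["us-east-1"],
        "eventTypeCategories": ["issue"],  # operational issues only
    },
    maxResults=20,
)

for event in response["events"]:
    print(event["startTime"], event["service"], event["eventTypeCode"])
```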