AWS has experienced several significant outages and disruptions over the years, often affecting specific regions or Availability Zones (AZs). Below are key examples of AWS region-related incidents, focusing on their scope and impact. Note that AWS does not always publish detailed post-event summaries, and public records do not capture every minor event.
Notable AWS Region Disasters
- February 28, 2017 – US-EAST-1 (Northern Virginia) S3 Outage
- What Happened: An S3 engineer debugging the billing system mistyped a command intended to remove a small number of servers, inadvertently removing a much larger set that supported S3's index and placement subsystems in US-EAST-1. This cascaded to other AWS services dependent on S3, like EC2 and Lambda.
- Impact: Disrupted major websites and services (e.g., Netflix, Slack, and parts of Amazon’s own retail site). Estimated losses were in the hundreds of millions of dollars for affected businesses.
- Scope: Limited to US-EAST-1 but had widespread effects due to the region’s heavy usage.
- AWS Response: AWS published a detailed post-event summary, acknowledging the error and outlining improvements to prevent recurrence.
- November 25, 2020 – US-EAST-1 Kinesis Outage
- What Happened: Adding capacity to the Amazon Kinesis Data Streams front-end fleet pushed its servers past an operating-system thread limit, causing the fleet to fail and cascading to dependent AWS services (e.g., CloudWatch, Lambda) in US-EAST-1.
- Impact: Affected real-time data processing for companies like Autodesk and disrupted services relying on Kinesis.
- Scope: Confined to US-EAST-1 but highlighted inter-service dependencies.
- AWS Response: A post-event summary was released, detailing the root cause and mitigation steps.
- December 7, 2021 – US-EAST-1 Network Device Failure
- What Happened: An automated scaling activity triggered a surge of traffic on AWS's internal network, congesting the devices that connect it to the main AWS network in US-EAST-1 and disrupting services like EC2, RDS, and Lambda for over 8 hours.
- Impact: Major companies (e.g., Netflix, Disney+, Slack, Robinhood) experienced downtime, with Amazon’s own delivery operations affected due to app failures.
- Scope: Regional, centered in US-EAST-1.
- AWS Response: AWS published a summary attributing the issue to congestion on internal network devices.
- December 22, 2021 – US-EAST-1 Power Outage
- What Happened: A power loss in a single data center within the USE1-AZ4 Availability Zone of US-EAST-1 took down EC2 instances and EBS volumes hosted there.
- Impact: Affected services like Slack, Imgur, and Epic Games. Recovery took about 12 hours.
- Scope: Limited to one AZ within US-EAST-1, not a full region failure.
- AWS Response: No detailed public postmortem was widely noted, though status updates were provided.
- June 13, 2023 – US-EAST-1 Lambda Incident
- What Happened: A fault in a subsystem that manages capacity for AWS Lambda in US-EAST-1 degraded more than 100 AWS services, including Lambda, API Gateway, and the AWS Management Console.
- Impact: Noticeable disruptions across the internet, including Fortnite matchmaking and Slack outages.
- Scope: Regional, affecting US-EAST-1.
- AWS Response: A summary was published after a notable gap in public postmortems, possibly prompted by external pressure (e.g., commentary on X from users such as Gergely Orosz).
- July 30, 2024 – US-EAST-1 Kinesis Data Streams Event
- What Happened: Another Kinesis-related issue in US-EAST-1 caused service disruptions, though specifics are less detailed in public records.
- Impact: Affected real-time data workflows, though less widespread than prior incidents.
- Scope: Regional, US-EAST-1.
- AWS Response: A post-event summary was archived, per AWS’s public records.
Observations and Patterns
- US-EAST-1 Dominance: Many high-profile incidents occur in US-EAST-1 (Northern Virginia), one of AWS’s oldest and most utilized regions. Its size and complexity make it prone to cascading failures.
- Causes: Common triggers include human error (e.g., 2017 S3 outage), capacity scaling issues (e.g., 2020 Kinesis), network failures (e.g., 2021), and power disruptions (e.g., 2021 AZ outage).
- Scope: Most “disasters” are limited to a single region or AZ rather than multi-region failures. AWS designs regions to be independent, so a full multi-region outage is rare; customers typically exploit this by replicating across regions, as in the sketch after this list.
- Data Center Resilience: AWS claims it has never lost an entire data center (per a 2018 statement by an AWS VP), but individual AZs within regions have failed due to power or network issues.
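Because failures tend to stay within one region or AZ, a common customer-side mitigation is to keep a copy of critical data in a second region and fall back to it when the primary is impaired. The snippet below is a minimal sketch of that pattern using boto3; the bucket names, object key, and region pair are hypothetical, and it assumes the data is already replicated (e.g., via S3 Cross-Region Replication) and that credentials are configured.

```python
# Minimal sketch: read from a primary-region bucket, fall back to a
# replica in a second region if the primary call fails.
# Bucket names, key, and regions are hypothetical examples.
import boto3
from botocore.exceptions import BotoCoreError, ClientError

REPLICAS = [
    ("us-east-1", "example-data-primary"),   # hypothetical primary bucket
    ("us-west-2", "example-data-replica"),   # hypothetical replica bucket
]

def read_with_fallback(key: str) -> bytes:
    """Try each region in order; return the first successful read."""
    last_error = None
    for region, bucket in REPLICAS:
        try:
            s3 = boto3.client("s3", region_name=region)
            response = s3.get_object(Bucket=bucket, Key=key)
            return response["Body"].read()
        except (BotoCoreError, ClientError) as exc:
            last_error = exc  # region or bucket unreachable; try the next replica
    raise RuntimeError(f"all replicas failed for {key!r}") from last_error

if __name__ == "__main__":
    print(len(read_with_fallback("reports/latest.json")))
```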
Gaps in Records
- AWS only publishes Post-Event Summaries (PES) for incidents with “broad and significant customer impact” (e.g., major API failures or infrastructure loss). Smaller outages or AZ-specific issues often lack detailed public reports.
- Since late 2021, AWS has published public postmortems less frequently, drawing criticism (e.g., X posts in 2023 noted a gap of nearly two years until October 2023).
- Exact dates and details for some incidents (e.g., minor AZ failures) are not consistently documented publicly unless they escalate to region-wide impact.
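For incidents that never receive a public Post-Event Summary, account-specific history is still available through the AWS Health API (a Business or Enterprise support plan is required, and the API endpoint lives in us-east-1). The snippet below is a minimal sketch of listing recent operational issues for US-EAST-1 with boto3; the filter values are illustrative.

```python
# Minimal sketch: list recent AWS Health "issue" events affecting us-east-1
# for your own account. Requires a Business/Enterprise support plan and
# configured credentials; the Health API is served from us-east-1.
import boto3

health = boto3.client("health", region_name="us-east-1")

response = health.describe_events(
    filter={
        "regions": ["us-east-1"],
        "eventTypeCategories": ["issue"],  # operational issues only
    },
    maxResults=20,
)

for event in response["events"]:
    print(event["startTime"], event["service"], event["eventTypeCode"])
```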