
There was a very large outage back in ~2017 that was caused by DynamoDB going down. Because EC2 stored its list of servers in DynamoDB, EC2 went down too. Because DynamoDB ran its compute on EC2, it was suddenly no longer able to spin up new instances to recover.

It took several days to manually spin up DynamoDB/EC2 instances so that both services could recover slowly together. Since then, there was a big push to remove dependencies between the “tier one” systems (S3, DynamoDB, EC2, etc.) so that one system couldn’t bring down another one. Of course, it’s never foolproof.
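That kind of mutual dependency (EC2 needs DynamoDB, DynamoDB needs EC2) is exactly what a cycle check over a service dependency graph catches. A minimal sketch, with made-up service edges based only on the description above, not AWS's real internal dependency data:

```python
# Detect a cycle in a service dependency graph via depth-first search.
# Service names and edges are illustrative, not real AWS internals.
def find_cycle(deps):
    """deps maps a service to the services it depends on.
    Returns one dependency cycle as a list, or None if acyclic."""
    visiting, done = set(), set()

    def dfs(node, path):
        if node in visiting:                  # back-edge: we found a cycle
            return path[path.index(node):] + [node]
        if node in done:
            return None
        visiting.add(node)
        for dep in deps.get(node, []):
            cycle = dfs(dep, path + [node])
            if cycle:
                return cycle
        visiting.remove(node)
        done.add(node)
        return None

    for service in deps:
        cycle = dfs(service, [])
        if cycle:
            return cycle
    return None

deps = {
    "EC2": ["DynamoDB"],    # EC2 stores its server list in DynamoDB
    "DynamoDB": ["EC2"],    # DynamoDB runs its compute on EC2
    "S3": [],
}
print(find_cycle(deps))  # -> ['EC2', 'DynamoDB', 'EC2']
```

Removing the dependencies between tier-one services amounts to deleting edges until this graph is acyclic.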



I don't remember an event like that, but I'm rather certain the scenario you described couldn't have happened in 2017.

The very large 2017 AWS outage originated in S3. Maybe you're thinking of a different event?

https://share.google/HBaV4ZMpxPEpnDvU9


Sorry, the 2015 one. I misremembered the year:

https://aws.amazon.com/message/5467D2/

I imagine this was impossible in 2017 because of actions taken after the 2015 incident


Definitely impossible in 2015.

If you're talking about this part:

> Initially, we were unable to add capacity to the metadata service because it was under such high load, preventing us from successfully making the requisite administrative requests.

It isn't about spinning up EC2 instances or provisioning hardware; it's about logically adding the capacity to the system. The metadata service is a storage service, so adding capacity necessitates data movement. A lot has to happen to add capacity while maintaining data correctness and availability (mind you, at this point it was still trying to fulfill all requests).
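One common way storage systems bound that data movement is consistent hashing, where adding a node relocates only the keys that now hash to it. A toy sketch of that idea (the metadata service's actual design isn't public; node and key names are made up):

```python
# Toy consistent-hash ring: adding a node moves only a fraction of keys.
# Illustrative only -- not how AWS's metadata service actually works.
import hashlib
from bisect import bisect

def h(s):
    """Deterministic integer hash of a string."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        # Each node owns the arc of the ring ending at its hash point.
        self.points = sorted((h(n), n) for n in nodes)

    def owner(self, key):
        hashes = [p for p, _ in self.points]
        i = bisect(hashes, h(key)) % len(self.points)
        return self.points[i][1]

before = Ring(["node-a", "node-b", "node-c"])
after = Ring(["node-a", "node-b", "node-c", "node-d"])

keys = [f"partition-{i}" for i in range(1000)]
moved = sum(before.owner(k) != after.owner(k) for k in keys)
print(f"{moved} of {len(keys)} partitions move to the new node")
```

Even in this idealized model, every moved partition still has to be copied while the service keeps serving reads and writes, which is the hard part under load.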


I’m referring to impact on other services



