I don't get why Netflix doesn't put enough logic in the client to deal with this kind of thing in some kind of graceful way.
Netflix, to me, is a big collection of links to stuff in CDNs (which, to a first approximation, never go down; Akamai is essentially impossible to take down, highly resistant to all forms of outage, because it's a trivial replication problem).
I rarely if ever care about the recommendations engine, new content in the submission queue, etc.
Yes, there's an authentication problem, but this is also trivial to replicate, and it's fine if "is a valid subscriber" goes even a month out of date.
Essentially, even if AWS goes down completely, the Netflix client should be able to get a static list of movies and show them. Maybe that's my queue, maybe that's the top 10 per genre, whatever.
In times of degraded operation, show me something.
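To make that concrete, the degraded mode I'm picturing is just "serve the last catalog you successfully fetched." A minimal sketch of the idea (the endpoint URL and cache path are made up, not Netflix's actual API):

```python
import json
import os
import urllib.request

CATALOG_URL = "https://api.example-streaming.test/v1/catalog"   # hypothetical endpoint
CACHE_PATH = os.path.expanduser("~/.catalog_cache.json")

def get_catalog(timeout=3):
    """Fetch the live catalog, falling back to the last cached copy on failure."""
    try:
        with urllib.request.urlopen(CATALOG_URL, timeout=timeout) as resp:
            catalog = json.load(resp)
        with open(CACHE_PATH, "w") as f:
            json.dump(catalog, f)            # refresh the local fallback copy
        return catalog, "live"
    except OSError:                          # DNS, network, or backend failure
        # Degraded mode: show whatever we last saw instead of an error screen.
        with open(CACHE_PATH) as f:
            return json.load(f), "cached"
```

Even a stale queue or a static top-10-per-genre list beats a spinner.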
I know they have some degradation of service in their recommendations, but I can't speak for the rest of the service. A coworker was telling me that each recommendation row on your main Netflix screen is provided by a different service. So let's say it's supposed to be feeding me sci-fi recommendations: if the sci-fi service is down, it will just remove that row from the main screen and I'll get a different recommendation category until it's back up.
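That "drop the row, keep the screen" pattern is simple enough to sketch; here's a toy version (the per-category service call is simulated, not Netflix's actual API):

```python
import random

def fetch_row(category):
    """Stand-in for a per-category recommendation service call."""
    if random.random() < 0.2:            # simulate a row service being down
        raise ConnectionError(f"{category} service unavailable")
    return [f"{category} title {i}" for i in range(1, 6)]

def build_home_screen(categories=("sci-fi", "comedy", "drama", "documentary")):
    """Assemble the home screen, silently dropping any row whose service failed."""
    rows = []
    for category in categories:
        try:
            rows.append((category, fetch_row(category)))
        except ConnectionError:
            continue   # degraded: the row just disappears until its service recovers
    return rows

if __name__ == "__main__":
    for name, titles in build_home_screen():
        print(name, "->", titles)
```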
Our ELB endpoints have been down for over 9 hours. Luckily we were able to fail over to an alternative solution in under 5 minutes, but nine-plus hours is still a pretty brutal stretch of downtime.
At some point, companies that rely solely on AWS will be seen as carrying a kind of technical debt: not being able to engineer around such outages will be seen as a management failure.
I'm wondering if this outage will have Netflix looking at alternatives to running solely on AWS, or even at moving off it entirely. While AWS is a great piece of technology, stable it is not, and Amazon has proven that about as thoroughly as I can imagine.
Basically, after nearly 15 hours of downtime, I'd consider that beyond unacceptable. With physical hardware, even on Christmas Day, you could have replacements delivered and spun up well before that... just saying.
They do care, but they're also developing software pretty quickly and they break stuff just like everyone else. Only, when they break, people tend to notice, and those people tend to spread word around pretty quickly.
When I worked at a hosting company, we'd have at least 5 shared hosting servers (with ~75-200 websites on each) go down each day, and yet we had a reputation for good uptime, because comparatively, we DID have good uptime.
I think the problem is that most people think about uptime wrong. Uptime is a compromise, a trade-off. You can pay for more uptime, with diminishing returns after about 98%. Running one dedicated server, colocated with decent fault-recovery systems in a decent datacenter, will probably get you 100% uptime most of the time -- until it doesn't, which works out to roughly 98%.
If you're going to attempt to get beyond power failures, you'll need a second server (or "instance") somewhere. If you need the capacity anyway, this might not double your costs, but you may still have unused headroom. You could buy flexible computing power by using shared hosting ("the cloud") or whatever, but it's the same problem.
Once you get into a state where you have a global business, customer demands, supplier issues, vendor lock-in, etc, it becomes a numbers game. You can hire (more|better|famous) devs and possibly get more uptime. You can test more (and slow down feature releases) to get better uptime. You can break stuff (and pay for the recovery + lost face + downtime) to decrease downtime. Everything is a trade-off, and right now it makes sense to chase about "four nines" of uptime -- 99.99%.
Four nines is 4.32 minutes of downtime per month -- four minutes and nineteen seconds, total, in any given month. This is very possible for many large services, and while it does have cost overhead, it's manageable.
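If you want to sanity-check the arithmetic, here's a quick script (assuming a 30-day month and a 365-day year; real calendar months vary a bit):

```python
# Downtime budget implied by each uptime target.
MINUTES_PER_MONTH = 30 * 24 * 60     # 43,200
MINUTES_PER_YEAR = 365 * 24 * 60     # 525,600

targets = [("two nines", 0.99), ("three nines", 0.999),
           ("four nines", 0.9999), ("five nines", 0.99999)]

for name, target in targets:
    per_month = MINUTES_PER_MONTH * (1 - target)
    per_year = MINUTES_PER_YEAR * (1 - target)
    print(f"{name:12s} {target:.4%}: {per_month:7.2f} min/month, {per_year:9.2f} min/year")

# four nines -> 4.32 min/month, 52.56 min/year
```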
TL;DR: don't go chasing waterfalls (100% uptime). It's not practical. It may be achievable for a while, but it depends on the window -- 100% uptime for 10 years would be a pretty damn lofty goal. In my opinion, it's much more important to recover quickly and gracefully, with awesome communication to your customers, than it is to be up 100% of the time. 100% uptime goes unnoticed, but consistently great customer service does not.
This is only reasonable to a point. You're essentially doubling your costs, because things don't work exactly the same between zones. It's not like you're putting a bunch of servers on the same platform and just spreading them around the planet... you can't simply run four web servers instead of two and send traffic to all four (two in each AZ) without also considering, and paying for, a separate DB setup in that AZ, etc. Costs mount quickly (not to mention the platform suddenly becomes much more complex).
Can someone explain why Netflix's service is designed this way? By "this way" I mean, why are some devices served by some set of servers/services exclusively, and apparently web browsers and other devices served in some other way?
I'm not questioning whether there's a good reason it's done this way, but that reason just isn't obvious to me. I would have designed such a system with one endpoint that all clients hit, regardless of platform. But then, I have never designed the world's largest video streaming infrastructure.
Not all devices support all streaming protocols and formats. E.g. their desktop player uses the proprietary SMPTE 421M (VC-1) codec from Microsoft, which is not available (or not efficiently implemented) on platforms where Silverlight isn't available (read: most of them).
Since many of these other platforms also use proprietary formats, they end up having to maintain a range of different proprietary streaming servers, many of which presumably need to interface with their respective mothership platform services (Xbox Live, PSN, Roku).
Oh, and recently they have begun building their own CDN[1] which may add further to the diversity.
See also: http://gigaom.com/video/netflix-encoding/ and DRM schemes, esp. those detailed in content contracts (protip: big studios specify codecs, encryption specifics, etc.)
It's different for each device, based on how much power the device has. Most TVs have next to nothing in terms of CPU power.
I recently attended a talk by Netflix on building fault-tolerant systems. From what I could tell, there wasn't a rigid architecture: everything could talk to everything else, with a thin API layer on top. The way they achieve fault tolerance is by not having a traditional ops team; instead they have a resiliency team and monitor everything. This team uses various "monkeys" to try to break the application in different ways.
They then put all developers on call, and force them to write code that can recover from faults, by continually trying different ways to break their services.
tl;dr netflix systems are chaotic, because chaotic systems tend to tolerate failures.
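To be clear, the "monkeys" aren't magic -- the core idea is just "randomly kill things in production and see what breaks." A toy version of that idea (this is not Netflix's actual Simian Army code; the opt-in tag and kill policy are made up, and boto3 credentials are assumed to be configured):

```python
import random
import boto3  # assumes AWS credentials are configured in the environment

def release_the_monkey(tag_key="chaos-opt-in", tag_value="true", dry_run=True):
    """Toy chaos monkey: pick one random opted-in instance and terminate it."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    if not dry_run:
        # Only services that survive this (and get paged on it) stay healthy.
        ec2.terminate_instances(InstanceIds=[victim])
    return victim
```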
Is there anything at all "special" about ELB? i.e. is there any reason you couldn't implement a software load balancer yourself within AWS, using non-EBS-dependent EC2 instances?
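For context, the data-path part seems simple enough to build yourself -- something like a toy round-robin HTTP proxy (backend addresses here are hypothetical; no health checks, HTTPS, or streaming bodies), with ELB's real value being the managed health checks, scaling of the balancer tier, and AZ integration on top:

```python
import http.client
import itertools
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical backend instances; in practice you'd discover these from your
# own registry (or the EC2 API) instead of hard-coding them.
BACKENDS = itertools.cycle(["10.0.1.10:8080", "10.0.2.10:8080"])

class RoundRobinProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(BACKENDS)                      # naive round-robin pick
        conn = http.client.HTTPConnection(backend, timeout=5)
        try:
            conn.request("GET", self.path, headers=dict(self.headers))
            upstream = conn.getresponse()
            body = upstream.read()
            self.send_response(upstream.status)
            for key, value in upstream.getheaders():
                if key.lower() not in ("connection", "transfer-encoding"):
                    self.send_header(key, value)
            self.end_headers()
            self.wfile.write(body)
        finally:
            conn.close()

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), RoundRobinProxy).serve_forever()
```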
The redundancy across AZs is limited to removing a zone from the ELB config if all the instances in that AZ are marked as failed. If the API is down, tough cookies: all you get is 503 errors for each request hitting the failed AZ. I asked support to implement proper failover. They said they're working on it, but with no ETA, and I haven't seen an announcement about it in the meantime. Their resolution for closing the ticket was "script it yourself", with the core assumption that the API works.
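For reference, the "script it yourself" approach they're pointing at looks roughly like this -- a sketch using boto3 against the classic ELB API, with placeholder load balancer and zone names, and of course it only helps while that same API is actually up:

```python
import boto3  # assumes AWS credentials are configured in the environment

elb = boto3.client("elb")             # the classic ELB API

LB_NAME = "my-load-balancer"          # placeholder name
FAILED_ZONE = "us-east-1a"            # placeholder zone

# Check which registered instances the ELB currently considers healthy.
states = elb.describe_instance_health(LoadBalancerName=LB_NAME)["InstanceStates"]
healthy = [s["InstanceId"] for s in states if s["State"] == "InService"]
print("healthy instances:", healthy)

# If everything in a zone is gone, pull the zone out of rotation entirely
# instead of letting requests hit it and 503.
elb.disable_availability_zones_for_load_balancer(
    LoadBalancerName=LB_NAME,
    AvailabilityZones=[FAILED_ZONE],
)
```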
I say this because I don't want people to get a false sense of security when deploying multi-AZ behind ELB.
Amazon is currently reporting simultaneous issues in ELB, CloudSearch, and Elastic Beanstalk. I suspect they use some common underlying service which is the root of the failure in all three, but I don't know what service that might be. (The usual suspect, EBS, is not reporting issues right now.)
It was down for me on Xbox last night, which was super annoying because I'd loaded up some great movies from Rotten Tomatoes' Netflix list, then fired up my Xbox only to discover Netflix was down :(