I don't get why Netflix doesn't put enough logic in the client to deal with this kind of thing in some kind of graceful way.
Netflix, to me, is a big collection of links to stuff in CDNs (which, to a first approximation, never go down; Akamai is essentially impossible to take down, highly resistant to all forms of outage, because it's a trivial replication problem).
I rarely if ever care about the recommendations engine, new content in the submission queue, etc.
Yes, there's an authentication problem, but this is also trivial to replicate, and it's fine if "is a valid subscriber" goes even a month out of date.
Essentially, even if AWS goes down completely, the Netflix client should be able to get a static list of movies and show them. Maybe that's my queue, maybe that's the top 10 per genre, whatever.
In times of degraded operation, show me something.
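To make that concrete, the degraded mode I'm picturing is just "serve the last catalog you successfully fetched." A minimal sketch of the idea (the endpoint URL and cache path are made up, not Netflix's actual API):

```python
import json
import os
import urllib.request

CATALOG_URL = "https://api.example-streaming.test/v1/catalog"   # hypothetical endpoint
CACHE_PATH = os.path.expanduser("~/.catalog_cache.json")

def get_catalog(timeout=3):
    """Fetch the live catalog, falling back to the last cached copy on failure."""
    try:
        with urllib.request.urlopen(CATALOG_URL, timeout=timeout) as resp:
            catalog = json.load(resp)
        with open(CACHE_PATH, "w") as f:
            json.dump(catalog, f)            # refresh the local fallback copy
        return catalog, "live"
    except OSError:                          # DNS, network, or backend failure
        # Degraded mode: show whatever we last saw instead of an error screen.
        with open(CACHE_PATH) as f:
            return json.load(f), "cached"
```

Even a stale queue or a static top-10-per-genre list beats a spinner.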
I know they have some degradation of service in their recommendations, but I can't speak for the rest of the service. A coworker was telling me that each recommendation row on your main Netflix screen is provided by a different service. So let's say it's supposed to be feeding me sci-fi recommendations: if the sci-fi service is down, it will just remove that row from the main screen and I'll get a different recommendation category until it's back up.
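That "drop the row, keep the screen" pattern is simple enough to sketch; here's a toy version (the per-category service call is simulated, not Netflix's actual API):

```python
import random

def fetch_row(category):
    """Stand-in for a per-category recommendation service call."""
    if random.random() < 0.2:            # simulate a row service being down
        raise ConnectionError(f"{category} service unavailable")
    return [f"{category} title {i}" for i in range(1, 6)]

def build_home_screen(categories=("sci-fi", "comedy", "drama", "documentary")):
    """Assemble the home screen, silently dropping any row whose service failed."""
    rows = []
    for category in categories:
        try:
            rows.append((category, fetch_row(category)))
        except ConnectionError:
            continue   # degraded: the row just disappears until its service recovers
    return rows

if __name__ == "__main__":
    for name, titles in build_home_screen():
        print(name, "->", titles)
```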
Our ELB endpoints have been down for over 9 hours. Luckily we were able to fail over to an alternative solution in under 5 minutes, but nine-plus hours is still a pretty brutal stretch of downtime.
At some point, companies that rely solely on AWS will be seen as carrying a kind of technical debt: not being able to engineer around such outages will be seen as a management failure.
I'm wondering if this outage will have Netflix looking at alternatives to running solely on AWS, or even at moving off it entirely. While AWS is a great piece of technology, stable it is not, and Amazon has proven that about as thoroughly as I can imagine.
Basically, after nearly 15 hours of downtime, I'd consider that beyond unacceptable. With physical hardware, even on Christmas Day, you could have replacements delivered and spun up well before that... just saying.
They do care, but they're also developing software pretty quickly and they break stuff just like everyone else. Only, when they break, people tend to notice, and those people tend to spread word around pretty quickly.
When I worked at a hosting company, we'd have at least 5 shared hosting servers (with ~75-200 websites on each) go down each day, and yet we had a reputation for good uptime, because comparatively, we DID have good uptime.
I think the problem is that most people think about uptime wrong. Uptime is a compromise, a trade-off. You can pay for more uptime, with diminishing returns after about 98%. Running one dedicated server, colocated with decent fault-recovery systems in a decent datacenter, will probably get you 100% uptime most of the time -- until it doesn't, which works out to roughly 98%.
If you're going to attempt to get beyond power failures, you'll need a second server (or "instance") somewhere. If you need the capacity anyway, this might not double your costs, but you may still have unused headroom. You could buy flexible computing power by using shared hosting ("the cloud") or whatever, but it's the same problem.
Once you get into a state where you have a global business, customer demands, supplier issues, vendor lock-in, etc, it becomes a numbers game. You can hire (more|better|famous) devs and possibly get more uptime. You can test more (and slow down feature releases) to get better uptime. You can break stuff (and pay for the recovery + lost face + downtime) to decrease downtime. Everything is a trade-off, and right now it makes sense to chase about "four nines" of uptime -- 99.99%.
Four nines is 4.32 minutes of downtime per month -- four minutes and nineteen seconds, total, in any given month. This is very possible for many large services, and while it does have cost overhead, it's manageable.
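If you want to sanity-check the arithmetic, here's a quick script (assuming a 30-day month and a 365-day year; real calendar months vary a bit):

```python
# Downtime budget implied by each uptime target.
MINUTES_PER_MONTH = 30 * 24 * 60     # 43,200
MINUTES_PER_YEAR = 365 * 24 * 60     # 525,600

targets = [("two nines", 0.99), ("three nines", 0.999),
           ("four nines", 0.9999), ("five nines", 0.99999)]

for name, target in targets:
    per_month = MINUTES_PER_MONTH * (1 - target)
    per_year = MINUTES_PER_YEAR * (1 - target)
    print(f"{name:12s} {target:.4%}: {per_month:7.2f} min/month, {per_year:9.2f} min/year")

# four nines -> 4.32 min/month, 52.56 min/year
```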
TL;DR: don't go chasing waterfalls (100% uptime). It's not practical. It may be achievable for a while, but it depends on the window -- 100% uptime for 10 years would be a pretty damn lofty goal. In my opinion, it's much more important to recover quickly and gracefully, with awesome communication to your customers, than it is to be up 100% of the time. 100% uptime goes unnoticed, but consistently great customer service does not.
This is only reasonable to a point. You're essentially doubling your costs, because things don't work exactly the same between zones. It's not like you're putting a bunch of servers on the same platform and just spreading them around the planet... you can't simply run four web servers instead of two and send traffic to all four (two in each AZ) without also considering, and paying for, a separate DB setup in that AZ, etc. Costs mount quickly (not to mention the platform suddenly becomes much more complex).
Can someone explain why Netflix's service is designed this way? By "this way" I mean, why are some devices served by some set of servers/services exclusively, and apparently web browsers and other devices served in some other way?
I'm not questioning whether there's a good reason it's done this way, but that reason just isn't obvious to me. I would have designed such a system with one endpoint that all clients hit, regardless of platform. But then, I have never designed the world's largest video streaming infrastructure.
Not all devices support all streaming protocols and formats. E.g. their desktop player uses the proprietary SMPTE 421M (VC-1) codec from Microsoft, which is not available (or not efficiently implemented) on platforms where Silverlight isn't available (read: most of them).
Since many of these other platforms also use proprietary formats, they end up having to maintain a range of different proprietary streaming servers, many of which presumably need to interface with their respective mothership platform services (Xbox Live, PSN, Roku).
Oh, and recently they have begun building their own CDN[1] which may add further to the diversity.
See also: http://gigaom.com/video/netflix-encoding/ and DRM schemes, esp. those detailed in content contracts (protip: big studios specify codecs, encryption specifics, etc.)
It's different for each device, based on how much power the device has. Most TVs have next to nothing in terms of CPU power.
I recently attended a talk by Netflix on building fault-tolerant systems. From what I could tell, there wasn't a rigid architecture: everything could talk to everything else, with a thin API layer on top. The way they achieve fault tolerance is by not having a traditional ops team; instead they have a resiliency team and monitor everything. This team uses various "monkeys" to try to break the application in different ways.
They then put all developers on call, and force them to write code that can recover from faults, by continually trying different ways to break their services.
tl;dr netflix systems are chaotic, because chaotic systems tend to tolerate failures.
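To be clear, the "monkeys" aren't magic -- the core idea is just "randomly kill things in production and see what breaks." A toy version of that idea (this is not Netflix's actual Simian Army code; the opt-in tag and kill policy are made up, and boto3 credentials are assumed to be configured):

```python
import random
import boto3  # assumes AWS credentials are configured in the environment

def release_the_monkey(tag_key="chaos-opt-in", tag_value="true", dry_run=True):
    """Toy chaos monkey: pick one random opted-in instance and terminate it."""
    ec2 = boto3.client("ec2")
    reservations = ec2.describe_instances(
        Filters=[{"Name": f"tag:{tag_key}", "Values": [tag_value]},
                 {"Name": "instance-state-name", "Values": ["running"]}]
    )["Reservations"]
    instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
    if not instances:
        return None
    victim = random.choice(instances)
    if not dry_run:
        # Only services that survive this (and get paged on it) stay healthy.
        ec2.terminate_instances(InstanceIds=[victim])
    return victim
```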
Is there anything at all "special" about ELB? i.e. is there any reason you couldn't implement a software load balancer yourself within AWS, using non-EBS-dependent EC2 instances?
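For context, the data-path part seems simple enough to build yourself -- something like a toy round-robin HTTP proxy (backend addresses here are hypothetical; no health checks, HTTPS, or streaming bodies), with ELB's real value being the managed health checks, scaling of the balancer tier, and AZ integration on top:

```python
import http.client
import itertools
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer

# Hypothetical backend instances; in practice you'd discover these from your
# own registry (or the EC2 API) instead of hard-coding them.
BACKENDS = itertools.cycle(["10.0.1.10:8080", "10.0.2.10:8080"])

class RoundRobinProxy(BaseHTTPRequestHandler):
    def do_GET(self):
        backend = next(BACKENDS)                      # naive round-robin pick
        conn = http.client.HTTPConnection(backend, timeout=5)
        try:
            conn.request("GET", self.path, headers=dict(self.headers))
            upstream = conn.getresponse()
            body = upstream.read()
            self.send_response(upstream.status)
            for key, value in upstream.getheaders():
                if key.lower() not in ("connection", "transfer-encoding"):
                    self.send_header(key, value)
            self.end_headers()
            self.wfile.write(body)
        finally:
            conn.close()

if __name__ == "__main__":
    ThreadingHTTPServer(("0.0.0.0", 8000), RoundRobinProxy).serve_forever()
```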
The redundancy across AZs is limited to removing a zone from the ELB config if all the instances in that AZ are marked as failed. If the API is down, tough cookies: all you get is 503 errors for each request hitting the failed AZ. I asked support to implement proper failover. They said they're working on it, but with no ETA, and I haven't seen an announcement about it in the meantime. Their resolution for closing the ticket was "script it yourself", with the core assumption that the API works.
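For reference, the "script it yourself" approach they're pointing at looks roughly like this -- a sketch using boto3 against the classic ELB API, with placeholder load balancer and zone names, and of course it only helps while that same API is actually up:

```python
import boto3  # assumes AWS credentials are configured in the environment

elb = boto3.client("elb")             # the classic ELB API

LB_NAME = "my-load-balancer"          # placeholder name
FAILED_ZONE = "us-east-1a"            # placeholder zone

# Check which registered instances the ELB currently considers healthy.
states = elb.describe_instance_health(LoadBalancerName=LB_NAME)["InstanceStates"]
healthy = [s["InstanceId"] for s in states if s["State"] == "InService"]
print("healthy instances:", healthy)

# If everything in a zone is gone, pull the zone out of rotation entirely
# instead of letting requests hit it and 503.
elb.disable_availability_zones_for_load_balancer(
    LoadBalancerName=LB_NAME,
    AvailabilityZones=[FAILED_ZONE],
)
```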
I say this because I don't want people to get a false sense of security when deploying multi-AZ behind ELB.
Amazon is currently reporting simultaneous issues in ELB, CloudSearch, and Elastic Beanstalk. I suspect they use some common underlying service which is the root of the failure in all three, but I don't know what service that might be. (The usual suspect, EBS, is not reporting issues right now.)
It was down for me on Xbox last night, which was super annoying because I'd loaded up some great movies from Rotten Tomatoes' Netflix list, then fired up my Xbox only to discover Netflix was down :(