
Your single VM is a single point of failure. You probably want to run this in 3 VMs, one in each data center. ECS gives this to you out of the box, rolling deployments and health checks included.
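For concreteness, the ECS side of that is roughly the following, sketched with boto3 rather than the console. The cluster, task definition, and target group names are placeholders, and it assumes an EC2-backed cluster whose instances already span three availability zones:

    # Rough sketch, not a drop-in config: an ECS service that keeps 3 tasks
    # spread across availability zones, with rolling deployments and load
    # balancer health checks.
    import boto3

    ecs = boto3.client("ecs", region_name="us-east-1")

    TARGET_GROUP_ARN = "arn:aws:elasticloadbalancing:..."  # placeholder

    ecs.create_service(
        cluster="prod",                  # hypothetical cluster name
        serviceName="crud-app",          # hypothetical service name
        taskDefinition="crud-app:1",     # hypothetical task definition
        desiredCount=3,
        # Rolling deployment: never drop below the full healthy count,
        # allow up to double while new tasks come up.
        deploymentConfiguration={"minimumHealthyPercent": 100, "maximumPercent": 200},
        # Spread the 3 tasks across distinct availability zones.
        placementStrategy=[{"type": "spread", "field": "attribute:ecs.availability-zone"}],
        # Health checks come from the load balancer's target group.
        loadBalancers=[{
            "targetGroupArn": TARGET_GROUP_ARN,
            "containerName": "crud-app",
            "containerPort": 8080,
        }],
        healthCheckGracePeriodSeconds=60,
    )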


An awful lot of server systems can tolerate a hardware failure on their one server every couple years given 1) good backups, 2) "shit's broken" alerts, and 3) reliable push-button re-deploy-from-scratch capability, all of which you should have anyway. Lots of smaller shops trying to run k8s and The Cloud probably have at least that much downtime (maybe an hour or two a year, on average) due to configuration fuck-ups on their absurd Rube Goldberg deployment processes anyway.

[EDIT] oh and of course The Cloud itself dies from time to time, too. Usually due to configuration fuck-ups on their absurd Rube Goldberg deployment processes :-) I don't think one safely-managed (see above points) server is a ton worse than the kind of cloud use any mid-sized-or-smaller business can afford, outside certain special requirements. Your average CRUD app? Just rent a server from some place with a good reputation, once you have paying customers (just host on a VPS or two until then). All the stuff you need to do to run it safely you should be doing with your cloud shit anyway (testing your backups, testing your re-deploy-from-scratch capability, "shit's broken" alerts) so it's not like it takes more time or expertise. Less, really.
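[EDIT 2] To be concrete about the "shit's broken" alerts: they can be as dumb as a cron job that polls a health endpoint and emails you when it can't get a 200 back. A minimal sketch, with the URL and addresses made up and a local MTA assumed:

    #!/usr/bin/env python3
    # Minimal "shit's broken" check, meant to run from cron every few minutes.
    import smtplib
    import urllib.request
    from email.message import EmailMessage

    HEALTH_URL = "https://example.com/healthz"   # hypothetical health endpoint

    def alert(body: str) -> None:
        msg = EmailMessage()
        msg["Subject"] = "shit's broken: health check failed"
        msg["From"] = "alerts@example.com"       # placeholder addresses
        msg["To"] = "oncall@example.com"
        msg.set_content(body)
        with smtplib.SMTP("localhost") as smtp:  # assumes a local MTA is running
            smtp.send_message(msg)

    try:
        with urllib.request.urlopen(HEALTH_URL, timeout=10) as resp:
            if resp.status != 200:
                alert(f"health check returned HTTP {resp.status}")
    except Exception as exc:  # timeouts, DNS failures, refused connections, 5xx
        alert(f"health check failed outright: {exc}")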


Not to mention there are servers available for purchase today that have 128 x86 cores. And 2-4 TB of RAM.

That's a lot of "cloud" right there in a single server.


Business services generally have high availability goals, so often that doesn't cut it. And your single server doesn't autoscale with load.

AWS gives you availability zones, which are usually physically distinct data centers within a region, plus multiple regions. Well-designed cloud apps fail over between them. Very rarely, if ever, have we seen an outage span multiple regions in AWS.
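The usual building block for that cross-region failover is DNS: e.g. Route 53 failover records in front of a load balancer per region, so traffic shifts to the secondary when the primary's health check fails. A rough sketch (zone ID, health check ID, and hostnames are all placeholders):

    import boto3

    r53 = boto3.client("route53")

    def failover_record(set_id, role, target, health_check_id=None):
        record = {
            "Name": "app.example.com",
            "Type": "CNAME",
            "TTL": 60,
            "SetIdentifier": set_id,
            "Failover": role,  # "PRIMARY" or "SECONDARY"
            "ResourceRecords": [{"Value": target}],
        }
        if health_check_id:
            record["HealthCheckId"] = health_check_id
        return {"Action": "UPSERT", "ResourceRecordSet": record}

    r53.change_resource_record_sets(
        HostedZoneId="Z123EXAMPLE",  # placeholder hosted zone
        ChangeBatch={"Changes": [
            # Primary answers only while its health check passes...
            failover_record("use1", "PRIMARY", "lb.us-east-1.example.com",
                            health_check_id="hc-primary-id"),
            # ...otherwise Route 53 serves the secondary region.
            failover_record("usw2", "SECONDARY", "lb.us-west-2.example.com"),
        ]},
    )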


In practice I see a lot of breakage (= downtime), lost velocity, and a terrible "bus factor" from complex Cloud setups where they're really not needed; one beefy server and some basic safety steps (which you need with the Cloud anyway, so they're no extra work) would do. "Well designed" is not the norm, and lots of companies are heading to the cloud without an expert at the wheel, let alone more than one (see: terrible bus factor).


Businesses always ask for High Availability, but they never agree on what that actually means. For example, does HA mean "Disaster Recovery", in which case rebuilding the system after an incident could qualify? Does it require active-active runtimes? Multiple data centers? Geographic distribution?

And by the way, how much are they willing to spend on their desired level of availability?

I still need a better way to run these conversations, but I try to bring them back to cost. How much does an hour of downtime really cost you?
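A crude way to frame it, with every number below made up: estimate downtime hours per year for each option, multiply by what an hour costs you, and compare against what the HA setup costs to build and run.

    # Back-of-envelope downtime math. All figures are hypothetical; plug in your own.
    cost_per_downtime_hour = 2_000     # what an hour of downtime really costs you
    single_server_downtime_hrs = 8     # e.g. one bad hardware day every few years, amortized
    ha_setup_downtime_hrs = 1          # the HA setup still breaks occasionally
    ha_extra_cost_per_year = 60_000    # extra infra plus the engineer-time to run it

    single_server_loss = single_server_downtime_hrs * cost_per_downtime_hour
    ha_total_cost = ha_setup_downtime_hrs * cost_per_downtime_hour + ha_extra_cost_per_year

    print(f"single server, expected cost/yr: ${single_server_loss:,}")
    print(f"HA setup, expected cost/yr:      ${ha_total_cost:,}")
    # With these numbers HA only pays off once an hour of downtime costs
    # more than roughly $60,000 / 7 hours ~ $8,600.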


Agree - different business functions have different availability goals. A system that computes live risk for a trading desk has different availability needs than an HR services portal.


I once ran a Linux server on an old IBM PC, out of a run-down hotel's closet, on a tiny APC battery, for 10 years without a reboot. Just because I got away with it doesn't make it a great idea. (It eventually failed because the hard drive died, and nobody noticed for a year and a half.)

> An awful lot of server systems can tolerate a hardware failure on their one server every couple years given 1) good backups, 2) "shit's broken" alerts, and 3) reliable push-button re-deploy-from-scratch capability, all of which you should have anyway

Just.... just... no. First of all, nobody's got good backups. Nobody uses tape robots, and whatever alternative they have is poor in comparison, but even if they did have tape, they aren't testing their restores. Second, nobody has good alerts. Most people alert on either nothing or everything, so they end up ignoring all alerts, so they never realize things are failing until everything's dead, and then there goes your data, and also your backups don't work. Third, nobody needs push-button re-deploy-from-scratch unless they're doing that all the time. It's fine to have a runbook which documents individual pieces of automation with a few manual steps in between, and this is way easier, cheaper and faster to set up than complete automation.


> Just.... just... no. First of all, nobody's got good backups. Nobody uses tape robots, and whatever alternative they have is poor in comparison, but even if they did have tape, they aren't testing their restores. Second, nobody has good alerts. Most people alert on either nothing or everything, so they end up ignoring all alerts, so they never realize things are failing until everything's dead, and then there goes your data, and also your backups don't work.

But you should test your backups and set up useful alerts with the cloud, too.
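Concretely, "testing your backups" can just be a scheduled job that restores the newest dump into a scratch database and checks that it contains real data. A rough sketch, assuming Postgres dumps in S3; the bucket, key layout, and sanity-check table are placeholders, and the Postgres client tools are assumed to be on the box:

    # Nightly restore test: pull the newest dump from S3, restore it into a
    # throwaway database, and make sure real rows come back.
    import subprocess
    import boto3

    s3 = boto3.client("s3")
    BUCKET = "example-db-backups"   # hypothetical bucket
    PREFIX = "nightly/"             # hypothetical key prefix

    # Find the most recent dump.
    objs = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)["Contents"]
    latest = max(objs, key=lambda o: o["LastModified"])["Key"]
    s3.download_file(BUCKET, latest, "/tmp/latest.dump")

    # Restore into a scratch database and run a sanity query.
    subprocess.run(["createdb", "restore_test"], check=True)
    subprocess.run(
        ["pg_restore", "--no-owner", "--dbname=restore_test", "/tmp/latest.dump"],
        check=True,
    )
    out = subprocess.run(
        ["psql", "-d", "restore_test", "-tAc",
         "SELECT count(*) FROM orders"],   # hypothetical table
        check=True, capture_output=True, text=True,
    )
    assert int(out.stdout.strip()) > 0, "restore came back empty; the backups are broken"
    subprocess.run(["dropdb", "restore_test"], check=True)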

> Third, nobody needs push-button re-deploy-from-scratch unless they're doing that all the time. It's fine to have a runbook which documents individual pieces of automation with a few manual steps in between, and this is way easier, cheaper and faster to set up than complete automation.

Huh. I consider getting as close to that as possible, and ideally all the way there, vital to developer onboarding and productivity anyway. So to me it is something you're doing all the time.

[EDIT] more to the point, if you don't have rock-solid redeployment capability, I'm not sure how you have any kind of useful disaster recovery plan at all. Backups aren't very useful if there's nothing to restore to.

[EDIT EDIT] that goes just as much for the cloud—if you aren't confident you can re-deploy from nothing then you're just doing a much more complicated version of pets rather than cattle.


> more to the point, if you don't have rock-solid redeployment capability, I'm not sure how you have any kind of useful disaster recovery plan at all. Backups aren't very useful if there's nothing to restore to.

As Helmuth von Moltke Sr said, "No battle plan survives contact with the enemy." So, let's step through creating the first DR plan and see how it works out.

1) Log in to your DR AWS account (because you already created a DR account, right?) using your DR credentials.

2) Apply all the IAM roles and policies needed. Ideally this is in Terraform. But somebody has been modifying the prod account's policies by hand and not merging them into Terraform (because reasons), and even though the governance tooling running on your old accounts kept flagging it, you never made time to commit and test the discrepancy because "not critical, it's only DR". But luckily you had a recurring job dumping all active roles and policies to a versioned, write-only S3 bucket in the DR account (roughly the sketch at the end of this comment), so you whip up a script to edit and apply all of those to the DR account.

3) You begin building the infrastructure. You take your old Terraform and try to apply it, but first you need to bootstrap the state S3 bucket and DynamoDB lock table. Once that's done you try to apply again, but you realize you have multiple root modules that all refer to each other's state (because "super-duper-DRY IaC", etc.), so you have to apply them in the right sequence. You also have to modify certain values in between, like VPC IDs, subnets, regions and availability zones.

You find odd errors that you didn't expect, and re-learn the manual processes required for new AWS accounts, such as requesting AWS support to allow you to generate certs for your domains with ACM, manually approving the use of marketplace AMIs, and requesting service limit increases that prod depended on (to say nothing of weird things like DirectConnect to your enterprise routers).

Because you made literally everything into Terraform (CloudWatch alerts, Lambda recurring jobs, CloudTrail trails logging to S3 buckets, governance integrations, PrivateLink endpoints, even app deployments into ECS!), all the infrastructure now exists. But nothing is running. It turns out there were tons of whitelisted address ranges needed to connect with various services, both internal and external, so now you need to track down all those services whose public and private subnets have changed and modify them, and probably tell the enterprise network team to update some firewalls. You also find your credentials didn't make it over, so you have to track down each credential you used and re-generate it. Hope you kept a backed-up, encrypted key store, and backed up your KMS customer key.

All in all, your DR plan turns out to require lots of manual intervention. By re-doing DR over and over again with a fresh account, you finally learn how to automate 90% of it. It takes you several months of coordinating with various teams to do this all, which you pay for with the extra headcount of an experienced cloud admin and a sizeable budget accounting gave you to spend solely on engineering best practices and DR for an event which may never happen.

....Or you write down how it all works and keep backups, and DR will just be three days of everyone running around with their heads cut off. Which is what 99% of people do, because real disaster is pretty rare.
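For what it's worth, the recurring role/policy dump from step 2 doesn't need to be fancy; something like this on a schedule, writing into a versioned bucket, covers it. The bucket name and key are placeholders, and pagination of the per-role calls is omitted:

    # Dump every IAM role, its inline policies, and its attached managed
    # policies as one JSON snapshot into a versioned S3 bucket.
    import json
    import boto3

    iam = boto3.client("iam")
    s3 = boto3.client("s3")
    BUCKET = "example-dr-iam-dumps"   # hypothetical versioned bucket in the DR account

    snapshot = []
    for page in iam.get_paginator("list_roles").paginate():
        for role in page["Roles"]:
            name = role["RoleName"]
            inline = {
                pol: iam.get_role_policy(RoleName=name, PolicyName=pol)["PolicyDocument"]
                for pol in iam.list_role_policies(RoleName=name)["PolicyNames"]
            }
            attached = [
                p["PolicyArn"]
                for p in iam.list_attached_role_policies(RoleName=name)["AttachedPolicies"]
            ]
            snapshot.append({"role": role, "inline_policies": inline,
                             "attached_policy_arns": attached})

    s3.put_object(
        Bucket=BUCKET,
        Key="iam/roles-snapshot.json",
        Body=json.dumps(snapshot, default=str).encode(),  # default=str handles datetimes
    )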


This is kind of what I'm talking about WRT the cloud being more trouble than it's worth if your app sits somewhere in between "trivial enough you can copy-paste some cloud configs then never touch them" on the one end and "so incredibly well-resourced you can hire three or more actual honest-to-god cloud experts to run everything, full time" on the other. Unless you have requirements extreme/weird enough that you're both not well-resourced and also need the cloud just to practically get off the ground, in which case, god help you. I think the companies in that middle ground who are "doing cloud" are mostly misguided, burning cash and harming uptime while thinking they're saving the one and improving the other.



