I have always had this fantasy of wondering what would happen if, in one of these outages, a major service never came back online, i.e. in this outage Slack loses all of its data on accounts, users, messages, etc.
How would people react? What would engineers do to recover? I always found that idea fascinating.
Imagine Google saying tomorrow that they lost all accounts and emails. What kind of impact would that have on the world?
That scenario is what Disaster Recovery plans are for. Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
You not only have backups in place, you have documentation in place, including a back-up vendor who has copies of the documentation and can staff up workers to get it up and running again without any help from existing staff.
And we tested those scenarios. I'm not sure which dry runs were less fun - when you got paged at 3 AM to go to the DR site and restore the entire infrastructure from scratch... or when you got paged at 3 AM and were instructed to stay home and not communicate with anyone for 24 hours to prove it could be done without you. (OK, so staying home was definitely more fun, but disturbing.)
This scenario isn't as far-fetched as people think. I was running a global deployment in 2012 when Hurricane Sandy hit the east coast. The entire eastern seaboard went offline and stayed off for several days. Some data centers were down for weeks. Our plan had covered that contingency, and we failed all of our US traffic over to the two west coast regions of Amazon. Our downtime on the east coast was around two minutes. Yet a sister company had only one data center, in downtown New York, and they were offline for weeks, scrambling to get a backup loaded and online.
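For a rough illustration of what that kind of regional failover can look like today, here is a minimal sketch using Route 53 failover records via boto3. The zone ID, record names, targets, and health check ID are all placeholders, and this is just one common approach, not necessarily how that 2012 deployment worked:

```python
import boto3

# Sketch of DNS-level regional failover. All identifiers below are
# placeholders invented for illustration.
route53 = boto3.client("route53")

def upsert_failover_record(role, set_id, target, health_check_id=None):
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,                   # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",
        ChangeBatch={"Changes": [{"Action": "UPSERT",
                                  "ResourceRecordSet": record}]},
    )

# East coast serves traffic while its health check passes; Route 53
# answers with the west coast record automatically when it fails.
upsert_failover_record("PRIMARY", "us-east", "elb-east.example.com", "hc-east-id")
upsert_failover_record("SECONDARY", "us-west", "elb-west.example.com")
```

With a short TTL and automated health checks, a couple of minutes of client-visible downtime during a regional failover is plausible.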
I worked for a regional company in the oil and gas industry where the HQ and both data centers were in the same earthquake zone. A twice-per-century earthquake posed a real risk of taking down both DCs and the HQ. The plan would have been for every gas station in the vertical to switch to a contingency plan, distributing critical emergency supplies and selling non-essential supplies using off-grid procedures.
Those are some really good thoughts on DR planning. I had never thought of DR being taken to such an extent.
How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed? Do even companies like Google plan for this kind of event?
> Some of the temporary locations, such as the W Hotel, required significant upgrades to their network infrastructure, Klepper said. "We're running a Gigabit Ethernet now here in the W Hotel," Klepper said, with a network connected to four T1 (1.54M bit/sec) circuits. That network supports the code development for a Web-based interface to the company's systems, which Klepper called "critical" to Empire's efforts to serve its customers. Despite the lost time and the lost code in the collapse of the World Trade Center towers, Klepper said, "we're going to get this done by the end of the year."
> Shevin Conway, Empire's chief technology officer, said that while the company lost about "10 days' worth" of source code, the entire object-oriented executable code survived, as it had been electronically transferred to the Staten Island data center.
The two I've worked for that took it that far were a Federal bank, and an energy company. I have no idea how far Google or other large software companies take their plans.
But based on my experience, the initial recovery planning is the hard part. The documentation to tell a new team how to do it isn't so painful once the base plan exists, although you do need to think ahead to make sure somebody at your back-up vendor has an account with enough access to set up all the other accounts that will need to be created, including authorization to spend money to make it happen.
The last company I worked for where I was (de facto) in charge of IT (small company, so I wore lots of hats) could have recovered if both sites had burnt down and I had been hit by a bus, since I made sure that all code, data, and instructions to re-up everything existed off site, and that both of the most senior managers understood how to access everything and knew enough to hand it to a competent firm with a memory stick and a password.
In some ways, losing your ERP and its backups would be harder to recover from than both sites burning down; insurance would at least cover the latter.
Yes, Google plans extensively and runs regular drills.
It's hearsay, but I was once told that achieving "black start" capability was a program that took many years and about a billion dollars. But they (probably) have it now.
"black start" for GCP would be something to see. Since the global root keys for Cloud KMS are kept on physical encrypted keys locked safes, accessible to only a few core personnel, that would be interesting, akin to a missile silo launch.
"Black start" is a term that refers to bringing up services when literally everything is down.
It's most often referred to in the electricity sector, where bringing power back up after a major regional blackout (think the 2003 NE blackout) is extremely nontrivial, since the normal steps to turn on a power plant usually require power: for example, operating valves in a hydro plant or blowers in a coal/gas/oil plant, synchronizing your generation with grid frequency, and having something to consume the power; even operating the relays and circuit breakers to connect to the grid may require grid power.
The idea here is presumably that Google services have so many mutual dependencies that if everything were to go down, restarting would be nontrivial because every service would be blocked on starting up due to some other service not being available.
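To make that bootstrapping problem concrete, here is a minimal sketch, with an entirely made-up dependency graph, of checking whether a cold-start order even exists. A dependency cycle is exactly the case where every service is stuck waiting on another one:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical service dependency graph: each service maps to the services
# it needs running before it can start. All names are invented.
deps = {
    "hsm":     set(),                  # only the HSM can start from nothing
    "kms":     {"hsm"},
    "auth":    {"kms", "storage"},
    "storage": {"auth"},               # cycle: storage needs auth, auth needs storage
    "mail":    {"auth", "storage"},
}

try:
    order = list(TopologicalSorter(deps).static_order())
    print("cold-start order:", order)
except CycleError as exc:
    # A cycle means no cold-start order exists without a break-glass step,
    # e.g. bringing one service up in a degraded mode to cut the loop.
    print("black start blocked by dependency cycle:", exc.args[1])
```

Black-start planning is essentially about making sure that graph has no cycles, or that every cycle has a documented manual way to break it.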
I work for a bank. We have to do a full DR test for our regulator every six months. That means failing all real production systems and running customer workloads in DR, for realsies, twice a year. We also have to do periodic financial stress tests - things like "$OTHER_BANK collapsed. What do you do?" - and be able to demonstrate what we'll do if our vendors choose to sever links with us or go out of business.
It's pretty much part of the basic day-to-day life in some industries.
The company I work for plans for that and it's definitely not FAANG. In fact, DR planning and testing is far more important than stuff like continuous integration, build pipelines, etc.
> Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
I sat in on a DR test where, the moment one of the Auckland-based ops team tried asking the Wellington lead a question, the boss stepped in and said "Wellington has been levelled by an earthquake. Everyone is dead or trying to get back to their family. They will not be helping you during the exercise."
Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.
> Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.
I was there as a consultant and didn't know anyone there when I went.
I won't provide any details out of respect for those fine people, but the grief was so thick, you could have cut it with a knife. As I said, I didn't know anyone who was there (or wasn't there) but after a day, I wanted to cry.
My tangential thought in that regard is: what if this is a really bad outage that causes Slack to tank (i.e. a large number of companies switch to Microsoft, Zulip, etc.)? An equally interesting thought.
In 2011 a small fraction (0.02%) of Gmail users had all their emails deleted due to a bug: https://gmail.googleblog.com/2011/02/gmail-back-soon-for-eve... They ended up having to restore them from tape backup, which took several days. Affected users also had all their incoming mail bounce for 20 hours.
Losing Google would be catastrophic because so much is stored there.
Slack is mostly real time communication, at least for me. There are a few bits and bobs that really should be documented that are in the messages though.
Yeah, Google would easily top the list of companies whose loss would have a catastrophic impact. Microsoft, Apple, Salesforce, and Dropbox would be next on the list, I guess, if we leave out utility companies, internet providers, etc.
Just look at the impact a 40-minute outage of Google Auth had last month. I wouldn't be surprised if the global productivity hit during that outage was in the billions of dollars, and that was for a relatively short outage without any data loss.
AWS outages have basically crippled a few businesses. The longest I know of was 8-10 hours, the day before Thanksgiving. Some Bay Area food company got hit by it and couldn't deliver Thanksgiving dinners.
Being in DR, I live my life wondering about that too. I spend a lot of extra time checking accounts and making sure that I print out important data (yes, sneakernet) as well as keep manual copies of passwords. It's old school, but it removes the risk to my business in case of a total loss of a global service, and lowers the risk of a heart attack and related stress.
The rest of the world may not be so energetic about their accounts and data, so it would be painful for many; it depends on how much risk they are willing to accept.
Being in DR, I see how difficult it is for businesses to allocate the time and resources for good planning - for many, DR is an insurance policy. Engineering and development staff are focused on putting out fires. However, a real disaster is more than most companies can handle if they have not planned accordingly or practiced through testing failover/normalization processes as well as performing component-level testing.
This should actually be part of your Disaster Recovery plan. You should have at least some plan for the loss of all of your service providers. Even if that plan is to sit in the corner and cry (j/k).
We might start to see actual legislation around implied SLAs in the US, which would cause Google to rethink everyone's 20% project being rolled out for 2 years.
Services like Slack are replaceable for the most part. How does one even replace a service like Google easily? There are like-for-like services available for what Google offers, but the data is where it becomes tricky. Almost 1bn people losing their email addresses could cause massive issues.