I have always had this fantasy of wondering what would happen if, in one of these outages, a major service never came back online, i.e. in this outage Slack loses all of its data on accounts, users, messages, etc.
How would people react? What would engineers do to recover? I always found that idea fascinating.
Imagine Google saying tomorrow that they lost all accounts and emails. What kind of impact would that have on the world?
That scenario is what Disaster Recovery plans are for. Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
You not only have backups in place, you have documentation in place, including a back-up vendor who has copies of the documentation and can staff up workers to get it up and running again without any help from existing staff.
And we tested those scenarios. I'm not sure which dry runs were less fun - when you got paged at 3 AM to go to the DR site and restore the entire infrastructure from scratch... or when you got paged at 3 AM and were instructed to stay home and not communicate with anyone for 24 hours to prove it could be done without you. (OK, so staying home was definitely more fun, but disturbing.)
This scenario isn't as far-fetched as people think. I was running a global deployment in 2012 when Hurricane Sandy hit the east coast. The entire eastern seaboard went offline and stayed off for several days. Some data centers were down for weeks. Our plan had covered that contingency, and we failed all of our US traffic over to the two west coast regions of Amazon. Our downtime on the east coast was around two minutes. Yet a sister company had only one data center, in downtown New York, and they were offline for weeks, scrambling to get a backup loaded and online.
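For a rough illustration of what that kind of regional failover can look like today, here is a minimal sketch using Route 53 failover records via boto3. The zone ID, record names, targets, and health check ID are all placeholders, and this is just one common approach, not necessarily how that 2012 deployment worked:

```python
import boto3

# Sketch of DNS-level regional failover. All identifiers below are
# placeholders invented for illustration.
route53 = boto3.client("route53")

def upsert_failover_record(role, set_id, target, health_check_id=None):
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "SetIdentifier": set_id,
        "Failover": role,            # "PRIMARY" or "SECONDARY"
        "TTL": 60,                   # short TTL so clients pick up a failover quickly
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z_EXAMPLE",
        ChangeBatch={"Changes": [{"Action": "UPSERT",
                                  "ResourceRecordSet": record}]},
    )

# East coast serves traffic while its health check passes; Route 53
# answers with the west coast record automatically when it fails.
upsert_failover_record("PRIMARY", "us-east", "elb-east.example.com", "hc-east-id")
upsert_failover_record("SECONDARY", "us-west", "elb-west.example.com")
```

With a short TTL and automated health checks, a couple of minutes of client-visible downtime during a regional failover is plausible.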
I worked for a regional company in the oil and gas industry where the HQ and both data centers were in the same earthquake zone. A twice-per-century earthquake posed a real risk of taking down both DCs and the HQ. The plan would have been for every gas station in the vertical to switch to a contingency plan, distributing critical emergency supplies and selling non-essential supplies using off-grid procedures.
Those are some really good thoughts on DR planning. I had never thought of DR being taken to such an extent.
How many companies really plan for an event where their entire infrastructure goes offline and their entire team gets killed? Do even companies like Google plan for this kind of event?
> Some of the temporary locations, such as the W Hotel, required significant upgrades to their network infrastructure, Klepper said. "We're running a Gigabit Ethernet now here in the W Hotel," Klepper said, with a network connected to four T1 (1.54M bit/sec) circuits. That network supports the code development for a Web-based interface to the company's systems, which Klepper called "critical" to Empire's efforts to serve its customers. Despite the lost time and the lost code in the collapse of the World Trade Center towers, Klepper said, "we're going to get this done by the end of the year."
> Shevin Conway, Empire's chief technology officer, said that while the company lost about "10 days' worth" of source code, the entire object-oriented executable code survived, as it had been electronically transferred to the Staten Island data center.
The two I've worked for that took it that far were a Federal bank, and an energy company. I have no idea how far Google or other large software companies take their plans.
But based on my experience, the initial recovery planning is the hard part. The documentation to tell a new team how to do it isn't so painful once the base plan exists, although you do need to think ahead to make sure somebody at your back-up vendor has an account with enough access to set up all the other accounts that will need to be created, including authorization to spend money to make it happen.
The last company I worked for where I was (de facto) in charge of IT (small company, so I wore lots of hats) could have recovered if both sites had burnt down and I had been hit by a bus, since I made sure that all code, data, and instructions to re-up everything existed off site, and that both of the most senior managers understood how to access everything and knew enough to hand it to a competent firm with a memory stick and a password.
In some ways, losing your ERP and its backups would be harder to recover from than both sites burning down; insurance would at least cover the latter.
Yes, Google plans extensively and runs regular drills.
It's hearsay, but I was once told that achieving "black start" capability was a program that took many years and about a billion dollars. But they (probably) have it now.
"black start" for GCP would be something to see. Since the global root keys for Cloud KMS are kept on physical encrypted keys locked safes, accessible to only a few core personnel, that would be interesting, akin to a missile silo launch.
"Black start" is a term that refers to bringing up services when literally everything is down.
It's most often referred to in the electricity sector, where bringing power back up after a major regional blackout (think the 2003 NE blackout) is extremely nontrivial, since the normal steps to turn on a power plant usually require power: for example, operating valves in a hydro plant or blowers in a coal/gas/oil plant, synchronizing your generation with grid frequency, and having something to consume the power; even operating the relays and circuit breakers to connect to the grid may require grid power.
The idea here is presumably that Google services have so many mutual dependencies that if everything were to go down, restarting would be nontrivial because every service would be blocked on starting up due to some other service not being available.
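To make that bootstrapping problem concrete, here is a minimal sketch, with an entirely made-up dependency graph, of checking whether a cold-start order even exists. A dependency cycle is exactly the case where every service is stuck waiting on another one:

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical service dependency graph: each service maps to the services
# it needs running before it can start. All names are invented.
deps = {
    "hsm":     set(),                  # only the HSM can start from nothing
    "kms":     {"hsm"},
    "auth":    {"kms", "storage"},
    "storage": {"auth"},               # cycle: storage needs auth, auth needs storage
    "mail":    {"auth", "storage"},
}

try:
    order = list(TopologicalSorter(deps).static_order())
    print("cold-start order:", order)
except CycleError as exc:
    # A cycle means no cold-start order exists without a break-glass step,
    # e.g. bringing one service up in a degraded mode to cut the loop.
    print("black start blocked by dependency cycle:", exc.args[1])
```

Black-start planning is essentially about making sure that graph has no cycles, or that every cycle has a documented manual way to break it.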
I work for a bank. We have to do a full DR test for our regulator every six months. That means failing all real production systems and running customer workloads in DR, for realsies, twice a year. We also have to do periodic financial stress tests - things like "$OTHER_BANK collapsed. What do you do?" - and be able to demonstrate what we'll do if our vendors choose to sever links with us or go out of business.
It's pretty much part of the basic day-to-day life in some industries.
The company I work for plans for that and it's definitely not FAANG. In fact, DR planning and testing is far more important than stuff like continuous integration, build pipelines, etc.
> Every large company I've worked for has had recovery plans in place, including scenarios as disturbing as "All data centers and offices explode simultaneously, and all staff who know how it all works are killed in the blasts."
I sat in on a DR test where, the moment one of the Auckland-based ops team tried asking the Wellington lead a question, the boss stepped in and said "Wellington has been levelled by an earthquake. Everyone is dead or trying to get back to their family. They will not be helping you during the exercise."
Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.
> Thanks for sharing, for some reason I think about this story a lot. It must have been such an emotionally difficult time for everyone involved in piecing back together their processes.
I was there as a consultant and didn't know anyone there when I went.
I won't provide any details out of respect for those fine people, but the grief was so thick, you could have cut it with a knife. As I said, I didn't know anyone who was there (or wasn't there) but after a day, I wanted to cry.
My tangential thought in that regard is: what if this is a really bad outage that causes Slack to tank (i.e. a large number of companies switch to Microsoft, Zulip, etc.)? An equally interesting thought.
In 2011 a small fraction (0.02%) of Gmail users had all their emails deleted due to a bug: https://gmail.googleblog.com/2011/02/gmail-back-soon-for-eve... They ended up having to restore them from tape backup, which took several days. Affected users also had all their incoming mail bounce for 20 hours.
Losing Google would be catastrophic because so much is stored there.
Slack is mostly real time communication, at least for me. There are a few bits and bobs that really should be documented that are in the messages though.
Yeah, Google would easily top the list of companies whose loss would have a catastrophic impact. Microsoft, Apple, Salesforce, and Dropbox would be next on the list, I guess, if we leave out utility companies, internet providers, etc.
Just look at the impact a 40-minute outage of Google Auth had last month. I wouldn't be surprised if the global productivity hit during that outage was in the billions of dollars, and that was for a relatively short outage without any data loss.
AWS outages have basically crippled a few businesses. The longest I know of was 8-10 hours, the day before Thanksgiving. Some Bay Area food company got hit by it and couldn't deliver Thanksgiving dinners.
Being in DR, I live my life wondering about that too. I spend a lot of extra time checking accounts and making sure that I print out important data (yes, sneakernet) as well as keep manual copies of passwords. It's old school, but it removes the risk to my business in case of a total loss of a global service, and lowers the risk of a heart attack and related stress.
The rest of the world may not be so energetic about their accounts and data, so it would be painful for many; it depends on how much risk they are willing to accept.
Being in DR, I see how difficult it is for businesses to allocate the time and resources for good planning - for many, DR is an insurance policy. Engineering and development staff are focused on putting out fires. However, a real disaster is more than most companies can handle if they have not planned accordingly or practiced through testing failover/normalization processes as well as performing component-level testing.
This should actually be part of your Disaster Recovery plan. You should have at least some plan for the loss of all of your service providers. Even if that plan is to sit in the corner and cry (j/k).
We might start to see actual legislation around implied SLAs in the US, which would cause Google to rethink everyone's 20% project being rolled out for 2 years.
Services like Slack are replaceable for the most part. How does one even replace a service like Google easily? There are like-for-like services available for what Google offers, but the data is where it becomes tricky. Almost 1bn people losing their email addresses could cause massive issues.