Right now, a handful of people will be furiously working to fix the major outage that’s brought down Facebook, Instagram and WhatsApp for users around the world. The Spinoff CTO Ben Gracewood sheds some light on what’s likely happened, and what they can do about it.
“Have you tried turning it off and on again?” actually used to work pretty well. When systems went offline, you knew which physical box to reboot, and nine times out of 10 rebooting it would make the problem go away.
But then the cloud came along and made everything infinitely better and oh so much worse at the same time. It’s kinda like opening the bonnet of a 1972 Toyota Corolla versus a 2021 BMW: one you can make sense of and point at the different bits that make it work; the other is an enigma that requires specialist knowledge and specific tools to configure.
Facebook’s internal systems are multiple orders of magnitude larger than anything I’ve worked on, but I’d hazard a guess that they’re built on similar principles. Partly because computers work the same way everywhere, but mostly because, like all big tech companies, they literally write about how they do it (not at all ironically, that link does not work at the time of writing). So while I can’t fathom the specific level of angst that their engineers are dealing with, I do know the general shape of things.
In modern large-scale systems, everything is managed remotely. There will be thousands of identical, generic machines running in a bunch of remote locations with cheap electricity. The software that makes up Facebook, and probably even the admin systems that are needed to configure Facebook, will be spread all over those generic computers in a way that is meant to guarantee availability. The entire point is that if you rebooted one of those machines, absolutely no one would notice anything at all.
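To make that idea concrete, here’s a minimal sketch in Python, using entirely made-up machine names rather than anything Facebook actually runs: a request router only ever sends traffic to replicas that are currently passing health checks, so any one box can vanish for a reboot without anyone noticing.

```python
import random

# Hypothetical fleet of identical, generic machines all running the same service.
REPLICAS = ["web-0042", "web-0043", "web-0044", "web-0045"]

# In a real system this would be fed by constant automated health checks;
# here it's just a dict we flip by hand.
healthy = {name: True for name in REPLICAS}

def pick_replica():
    """Route a request to any replica that is currently passing health checks."""
    candidates = [name for name in REPLICAS if healthy[name]]
    if not candidates:
        raise RuntimeError("No healthy replicas left: that's the outage scenario")
    return random.choice(candidates)

# Reboot one machine: traffic quietly flows to the other three.
healthy["web-0043"] = False
print(pick_replica())  # never "web-0043" while it's marked down
```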
None of that protects you from humans doing human things, whether malicious or just accidental. Like (entirely hypothetically of course), someone at a global retail software company typing a database delete command into the wrong window and deleting a production database when they thought they were deleting a test database. This is why anyone working in cloud software knows about “incident response” and “blameless post-mortems”.
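As a toy illustration of the kind of guardrail that grows out of those post-mortems (not anything Facebook or that retailer actually uses), internal tooling will often refuse to run a destructive command against anything labelled production unless the operator explicitly confirms it, which makes the “wrong window” mistake much harder to pull off.

```python
# A toy guardrail, not any company's real tooling: demand a typed confirmation
# before running a destructive command against a production environment.
def run_destructive(command: str, environment: str) -> None:
    if environment == "production":
        typed = input(f"You are about to run '{command}' on PRODUCTION. "
                      "Type 'production' to continue: ")
        if typed.strip() != "production":
            print("Aborted.")
            return
    print(f"[{environment}] executing: {command}")

run_destructive("DROP DATABASE analytics;", "staging")     # runs immediately
run_destructive("DROP DATABASE analytics;", "production")  # demands confirmation
```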
If, as the rumours indicate, Facebook has accidentally ruined its ability to connect remotely to its servers, they’re in quite the pickle. It’s not just the Facebook application that’s down. How would you manage in a work crisis if you were unable to connect with your workmates over email or chat, couldn’t access the company directory with their phone numbers, and your building access didn’t work?
They might even need to physically plug in to the network to fix the problem, and the people who know how to fix the problem are almost certainly nowhere near the servers that need plugging in to. It’s not unthinkable to imagine a harried electrician in a noisy data centre, phone tucked under their ear, while a very senior Facebook engineer phones in from holiday in St Barts asking them to type in various commands and read back the results.
A bunch of other engineers will be working out how to handle the absolute torrent of traffic that will hit the servers when they finally get reconnected. Why? Because a large fraction of all the devices in the world are currently trying to connect and authenticate with Facebook, over and over and over again. The instant traffic starts flowing again, every one of those devices will try to connect. If not handled properly, that would just knock Facebook off the internet again.
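The standard way well-behaved clients avoid piling on like that is exponential backoff with jitter: each failed attempt waits roughly twice as long as the last, with a random offset so millions of devices don’t all retry in the same instant. A rough sketch of the pattern, and emphatically not Facebook’s actual client code, looks like this:

```python
import random
import time

def retry_with_backoff(connect, max_attempts=8, base_delay=1.0, max_delay=300.0):
    """Retry a flaky connection with exponential backoff plus random jitter,
    so a fleet of clients doesn't hammer the servers in lockstep."""
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            # Double the wait each time, cap it, then pick a random point within it.
            delay = min(max_delay, base_delay * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
    raise ConnectionError("Still unreachable after backing off")
```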
Somewhere inside Facebook there are probably five to 10 people who know how to fix these issues. They’ll be methodically working through the steps, trying not to make the problem worse, while thousands of staff and billions of users scream into the void.
So how does Facebook “turn it off and on again”? Very, very carefully.