Yesterday’s Facebook outage – which took down Facebook Messenger, Instagram, and WhatsApp as well as the main service – resulted from a mistake by the company’s own network engineers.
The mistake led to all of Facebook’s services being inaccessible, with one analogy likening it to a failure in the “air traffic control” services for network traffic …
We reported yesterday on the massive failure.
It’s not just you: Facebook, Instagram, and WhatsApp are all currently down for users around the world. We’re seeing error messages on all three services across iOS applications as well as on the web. Users are being greeted with error messages such as: “Sorry, something went wrong,” “5xx Server Error,” and more.
The outage is affecting every Facebook-owned platform, according to data on Downdetector and Twitter. This includes Instagram, Facebook, WhatsApp, and Facebook Messenger […] While some Facebook, Instagram, and WhatsApp outages only affect certain geographic regions, the services are down worldwide today.
It gradually appeared that the problem might relate to DNS – the Domain Name System servers that tell devices which IP addresses to use to access services – but it was unclear exactly what had happened, and whether this was an external hack, malicious action by an insider, or a catastrophic mistake.
Facebook has now admitted in a blog post that it was a mistake.
Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt.
It took a long time to resolve the problem because the inaccessible systems included the servers and tools engineers would normally use to solve the problem remotely. Reports suggest that lower-level employees had to gain physical access to the data centers, then rely on step-by-step instructions from more senior engineers to undo the mistake. Complicating matters, with the network unavailable, Facebook's door access systems were also offline, physically barring engineers from the buildings.
How to understand the Facebook outage
We’ll doubtless get the full story in time, but the emerging consensus is that the problem was some mix of Domain Name System (DNS) and Border Gateway Protocol (BGP) configuration.
The best analogy I’ve seen is to think of network traffic as being like planes. Your device wants to fly to facebook.com. Your plane first needs to know the GPS coordinates of the destination airport – that is, the IP address it should connect to. It gets that information by asking a DNS server, which tells it that facebook.com is located at (for example) 126.96.36.199.
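The lookup step above can be sketched as a toy resolver. This is a minimal model, not a real DNS client: the hostname-to-IP table is illustrative, using the example address from the text rather than real DNS records.

```python
# Toy model of the DNS lookup step: a resolver maps hostnames to IP
# addresses, much as a real DNS server answers queries. The entries
# below are illustrative, not actual DNS records.
TOY_DNS = {
    "facebook.com": "126.96.36.199",  # the example address used in the text
}

def resolve(hostname: str) -> str:
    """Return the IP address for a hostname, or raise if unknown."""
    try:
        return TOY_DNS[hostname]
    except KeyError:
        raise LookupError(f"DNS lookup failed for {hostname!r}")

print(resolve("facebook.com"))  # → 126.96.36.199
```

In the real outage, it was this answering step that vanished: Facebook's DNS servers became unreachable, so devices could not even learn the "coordinates" of their destination.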
But getting to the final destination – the actual server that can perform the task you want to do – relies on a kind of air traffic control system for network traffic, and that’s BGP. BGP tells your device which route to fly through the various networks en route to your final destination.
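BGP's role can be sketched the same way: routers hold a table of advertised routes, and if an operator withdraws its announcements, other routers are left with no path at all. The prefix and AS numbers below are hypothetical, chosen only to illustrate the mechanism.

```python
# Toy sketch of BGP route advertisement and withdrawal. A real BGP
# router holds paths (sequences of autonomous systems) for each
# advertised prefix; the prefix and AS numbers here are made up.
routes = {
    "126.96.36.0/24": ["AS64500", "AS64501"],  # hypothetical path to the example prefix
}

def find_path(prefix):
    """Return the advertised path to a prefix, or None if no route exists."""
    return routes.get(prefix)

print(find_path("126.96.36.0/24"))  # a path exists while the route is advertised

# The operator withdraws its announcement, as happened in the outage:
routes.pop("126.96.36.0/24")

print(find_path("126.96.36.0/24"))  # → None: no way to reach the destination
```

The key point is that withdrawal is not an error state a router can recover from on its own: with no announced route, traffic for that prefix is simply undeliverable.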
It appears that Facebook’s BGP route announcements were completely withdrawn – so there was no way to tell devices how to reach their destination. And that included Facebook’s own engineers reaching the systems they needed to undo the mistake.