When the platforms under the Meta banner go down, we don’t need journalists to tell us about it. We know.
If anything, the worldwide outage taught us just how far the social media giant’s influence reaches and how powerfully it affects our lives, whether we’re conscious of it or not. The question is: if a technical issue can keep the world’s leading social media platform offline for six hours, what does that imply for healthcare organisations and the systems they use?
If Meta can go down, anyone can go down
The company's 2020 financial report revealed that it supports 2.8 billion monthly users and earned US$85.9 billion in 2020 alone. If a company of this magnitude can be disrupted so significantly, the sobering truth is that it can happen to any one of us.
At this stage, Meta’s official stance is that the outage was caused by faulty configurations made by the company’s engineers:
“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.”
In other words, a human error brought the social media machine to its knees for six hours, costing the company millions in lost ad revenue alone and dropping its value by close to 5% in a single day.
You can have the best technology in the world, supported by talented and knowledgeable engineers, but those engineers are human, and humans make mistakes. With this in mind, it’s essential for healthcare organisations to have structures and processes in place that ensure these inevitable errors won’t be catastrophic.
Alphalake Ai's Director of Infrastructure DevOps and Security Services, Steven Elliot, explains that this forces you to be something of a pessimist, at least in theory.
“If it can go wrong, it will go wrong – that should be your motto when planning. Risks must be identified in the implementation of any solution so that you are clear on where your weaknesses are, where things can go wrong, and where things can be improved. You can never ask too many what-ifs, and you can never run too many tests.”
Stress testing and contingency plans are essential
From what we can gather, one reason the outage lasted so long is that it created a sort of catch-22. The update that caused the outage effectively cut off the digital routes that lead to Meta's properties, so remote workers were unable to access the system and revert the changes.
Those who could head down to Meta's physical premises were stuck too because the internal tools needed to fix the problem were all connected to the company’s inaccessible domains. On the day, Sheera Frenkel of the New York Times reported that even getting into the building was a problem.
“Was just on phone with someone who works for FB who described employees unable to enter buildings this morning to begin to evaluate extent of outage because their badges weren't working to access doors.” (Sheera Frenkel, @sheeraf, October 4, 2021)
At every turn, employees were faced with mutually conflicting conditions that cut off their attempts to resolve the situation – a classic catch-22. However, physical access was eventually arranged, and the engineers were able to restore the company's backbone network connectivity.
Here, the company says its “storm drills” prevented a second outage, which would have been a likely result of the traffic surge expected if all services had been brought back online at once.
“Individual data centres were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.”
While the company had never run a simulation involving their global backbone going offline for hours and then being booted back up again, the contingency plans and experience developed from other drills gave them the tools and knowledge needed to manage the resumption of service with no further disruptions.
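To make the idea of a controlled resumption concrete, here is a minimal Python sketch of a staged restore. It is not Meta's actual procedure. The function name `staged_restore`, the traffic shares, and the `is_healthy` check are all hypothetical, chosen only to illustrate bringing a service back in steps rather than all at once.

```python
from typing import Callable, Sequence


def staged_restore(stages: Sequence[float],
                   is_healthy: Callable[[float], bool]) -> float:
    """Admit traffic in increasing shares and stop at the last safe level.

    stages:     hypothetical traffic shares to try in order, e.g. 0.1 = 10%.
    is_healthy: hypothetical check that reports whether the system copes
                with the given share (power, caches, error rates and so on).
    Returns the highest share that passed its health check, or 0.0 if
    even the first stage failed.
    """
    admitted = 0.0
    for share in stages:
        if not is_healthy(share):
            # Hold at the previous safe level instead of pushing on.
            break
        admitted = share
    return admitted


if __name__ == "__main__":
    # Assumed example: the system copes with up to half of normal traffic.
    level = staged_restore([0.1, 0.25, 0.5, 1.0], lambda share: share <= 0.5)
    print(f"holding at {level:.0%} of normal traffic")
```

The design choice here is that a failed check stops the ramp-up at the last level known to be safe, which is the kind of decision a drill lets a team rehearse before a real incident.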
Meta's management of the outage highlights the importance of threat hunting, risk assessments, stress tests, drills, and the development of IT contingency plans for healthcare organisations. As Steven Elliot explains, this is an ideal opportunity to take advantage of hybrid automation.
“When you do regression testing, for example, you have to repeat precisely the same testing processes every time you make a change.
A human worker may mis-key something, click the wrong screen, or not follow a documented process when testing, quite simply, because it’s boring.
The beauty of automation is that robots don’t get bored and, therefore, aren’t prone to making mistakes. By using Hybrid Automation to replace User Acceptance Testing, you cut out the manual process and the errors that often go with it.”
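As a rough illustration of what repeatable regression checks can look like in practice, the Python sketch below runs the same fixed set of cases every time the code changes. The function under test, `slots_overlap`, and the appointment-slot cases are hypothetical examples and are not taken from any Alphalake Ai product.

```python
from dataclasses import dataclass
from typing import List, Sequence


def slots_overlap(start_a: int, end_a: int, start_b: int, end_b: int) -> bool:
    """Hypothetical function under test: do two appointment slots overlap?

    Times are minutes since midnight; back-to-back slots do not overlap.
    """
    return start_a < end_b and start_b < end_a


@dataclass(frozen=True)
class RegressionCase:
    name: str
    args: tuple
    expected: bool


# The same cases are replayed after every change, in the same order.
CASES: Sequence[RegressionCase] = (
    RegressionCase("partial overlap", (0, 30, 15, 45), True),
    RegressionCase("back-to-back slots", (0, 30, 30, 60), False),
    RegressionCase("one slot inside another", (0, 60, 10, 20), True),
    RegressionCase("separate slots", (0, 30, 40, 50), False),
)


def run_regression(cases: Sequence[RegressionCase] = CASES) -> List[str]:
    """Return a description of every failing case (empty list means all passed)."""
    failures = []
    for case in cases:
        actual = slots_overlap(*case.args)
        if actual != case.expected:
            failures.append(f"{case.name}: expected {case.expected}, got {actual}")
    return failures


if __name__ == "__main__":
    problems = run_regression()
    print("all regression checks passed" if not problems else "\n".join(problems))
```

Because the cases live in code, every run exercises exactly the same steps, which is the property that is hard to guarantee with manual testing.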
Automation also saves time and frees employees to give full focus to the more demanding aspects of their roles. Of course, even the most expertly designed contingency plan can still fail. Going back to the Meta example, the company’s out-of-band management (OOBM) system didn’t provide the uninterrupted access it was designed to deliver, which shows that it never hurts to have a backup plan for your backup plan.
The importance of uninterrupted communication
While it’s unlikely that Meta’s public messages tell the whole story of the outage experienced on October 4, 2021, the company was prompt and direct in its communication. Meta was also proactive in delivering the two most powerful words you can say when you’ve caused a problem for your clients: We’re sorry.
“To the huge community of people and businesses around the world who depend on us: we're sorry. We've been working hard to restore access to our apps and services and are happy to report they are coming back online now. Thank you for bearing with us.” (Meta, @Meta, October 4, 2021)
Perhaps the biggest communication lesson here, though, is that Meta’s system outage also took down its usual means of communication. So, the company turned to the best available platform to reach its customers: its competitor, Twitter.
To say that communication is vital for healthcare organisations is still somehow understating its importance. Whether the problem is caused by a technical issue, human error, or a ransomware attack, you need to have a way of communicating with clients and other relevant stakeholders if your system goes down.
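One way to picture a communication plan that survives an outage is as an ordered list of channels, where each is tried only if the one before it fails. The Python sketch below is an assumption-laden illustration: `notify_with_fallback` and the channel names are hypothetical, and the senders are stand-ins for whatever email, SMS, or status-page integrations an organisation actually uses.

```python
from typing import Callable, Optional, Sequence, Tuple

# A channel is a (name, send function) pair; both are hypothetical stand-ins.
Channel = Tuple[str, Callable[[str], None]]


def notify_with_fallback(message: str,
                         channels: Sequence[Channel]) -> Optional[str]:
    """Try each channel in order and return the name of the first that works.

    Returns None if every channel fails, so the caller can escalate
    (for example, by phoning key contacts directly).
    """
    for name, send in channels:
        try:
            send(message)
            return name
        except Exception:
            # Any failure on this channel means we move on to the next one.
            continue
    return None


if __name__ == "__main__":
    def primary_portal(msg: str) -> None:
        # Simulates the main system being down along with everything else.
        raise ConnectionError("patient portal unreachable")

    def external_status_page(msg: str) -> None:
        print(f"[status page] {msg}")

    used = notify_with_fallback(
        "Our systems are currently unavailable. We are working on a fix.",
        [("portal", primary_portal), ("status page", external_status_page)],
    )
    print(f"message delivered via: {used}")
```

The key design point is that at least one channel in the list should not depend on the infrastructure that is likely to fail, just as Meta had to rely on a platform it did not run.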
Security is paramount
In an interesting twist, the very protections Meta has put in place to stave off both physical and digital attacks contributed to the problems it faced on October 4, 2021. These security measures made it more difficult for the company's own engineers to gain access to the system. However, the company's stance is that the hours (and millions of dollars) lost during the outage were a worthy trade-off for the resilience of its system against malicious activity.
While we’re not about to dive into the data security and privacy concerns that have plagued Meta in recent years, the company’s message regarding security is valuable. Healthcare organisations deal with private and sensitive data, so security must be paramount when decisions are made about everything from systems and apps to access and automation.