How could that happen: All eyes on CrowdStrike's post-mortem
How can a single update of antivirus software interrupt air travel, TV broadcasts, office work, retail stores, and emergency services, all on the same day and across the globe?!
That was the question on everyone’s mind on July 19th, after it became clear that CrowdStrike’s update of the sensor configuration of its antivirus software Falcon was the reason behind all of those blue screens of death.
In a promptly published public statement on the technical details of the incident, CrowdStrike says:
“We understand how this issue occurred and we are doing a thorough root cause analysis to determine how this logic flaw occurred. This effort will be ongoing.”
The whole world is curious to find out why this particular software update didn’t go through standard engineering practices. CrowdStrike has promised to update its root-cause analysis as the investigation progresses.
“Fortunately, in the last decades of IT engineering, it has been established that companies of high esteem, such as CrowdStrike, publicly disclose technical and organizational details that led to incidents so that others may learn from their mistakes. I’m sure it will be an interesting and educational story,” says Mihovil Madjer, Product Director at Infobip.
How come it doesn’t happen more often?
Madjer explains that big tech companies like CrowdStrike usually have engineering practices in place to prevent incidents like this. Before a software change is pushed to all customers, the new version usually goes through several safety and risk-reduction steps:
“First, the change is tested in a lab environment to ensure it won’t break the software or system and that it behaves as intended. The Friday incident appears to have affected all Windows machines, but that’s not always the case. Sometimes, software can affect only certain versions of Windows. Since lab testing can only include a finite combination of software and hardware versions, additional steps are usually taken to ensure as little of a customer base is affected as possible.”
Canary deployment
Madjer says a technique called canary deployment can be used to send the update to just a fraction of customers and compare its behavior against the old version of software running on the customers’ side.
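To make the idea concrete, here is a minimal sketch of how a canary cohort could be selected and judged. It is an illustration only, not CrowdStrike's or Infobip's process: the function names, the hash-based bucketing, and the crash-rate tolerance are all assumptions made for this example.

```python
import hashlib


def in_canary(customer_id: str, fraction: float) -> bool:
    """Deterministically decide whether a customer is in the canary cohort.

    Hypothetical helper: hashing the ID keeps the same customers in the
    cohort on every run, so results are comparable between checks.
    """
    digest = hashlib.sha256(customer_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # bucket in 0..9999
    return bucket < fraction * 10_000


def canary_is_healthy(baseline_crash_rate: float,
                      canary_crash_rate: float,
                      tolerance: float = 0.001) -> bool:
    """Compare the new version (canary) against the old version (baseline).

    The tolerance is an assumed threshold, chosen here only for illustration.
    """
    return canary_crash_rate <= baseline_crash_rate + tolerance
```

With `fraction=0.05`, roughly 5% of customers would receive the update first, and the rollout would continue only if `canary_is_healthy` returns `True` for the metrics collected from that group.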
Lastly, even if the canary doesn’t show any issues, the new version is usually rolled out gradually to customers, a few percent at a time, and monitored for any defects.
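A gradual rollout of this kind can be sketched as a simple loop that widens the audience step by step and stops at the first sign of trouble. This is a simplified illustration under stated assumptions: `deploy_to` and `is_healthy` are hypothetical callbacks standing in for a real deployment system and real monitoring, and the stage percentages are invented for the example.

```python
from typing import Callable, Sequence


def staged_rollout(stages: Sequence[float],
                   deploy_to: Callable[[float], None],
                   is_healthy: Callable[[], bool]) -> float:
    """Roll out to a growing fraction of customers, checking health each time.

    Returns the fraction of customers left on the new version:
    the last stage if every check passed, or 0.0 after a rollback.
    """
    reached = 0.0
    for fraction in stages:
        deploy_to(fraction)      # expose the new version to this share
        reached = fraction
        if not is_healthy():     # monitoring reports a defect
            deploy_to(0.0)       # roll everyone back to the old version
            return 0.0
    return reached


if __name__ == "__main__":
    # Example run with stand-in callbacks that always report healthy.
    final = staged_rollout(
        stages=[0.01, 0.05, 0.25, 1.0],
        deploy_to=lambda f: print(f"deploying to {f:.0%} of customers"),
        is_healthy=lambda: True,
    )
    print(f"rollout finished at {final:.0%}")
```

The point of the design is that a defect discovered at the 1% or 5% stage affects only that share of customers before the rollback, instead of the entire customer base.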
Any of these techniques would have prevented an impact on such a vast, global scale. It appears the new version of the software was pushed instantly to all customers without testing.
YOLO deployment
“Push all software to all customers without testing” is what Gergely Orosz of The Pragmatic Engineer called YOLO Deployment:
“YOLO deploys are fine when you don’t care much if a deploy goes wrong, and it’s easy enough to revert. A deployment that could take down the majority of your customers is not one with which to experiment with this approach.”
As IT people eagerly await CrowdStrike’s update on the root cause analysis, let’s use what has already been called “the largest IT outage in history” as a wake-up call and a reminder to examine our own processes and workflows.