Engineering Learnings from the CrowdStrike Falcon Outage
What Happened?
On July 18, 2024, the world witnessed a global outage that disrupted major industries, including banking, airlines, healthcare, oil and gas, and governments. The outage was caused by a faulty software update from CrowdStrike that left Windows computers stuck in a continuous BSOD (Blue Screen of Death) loop.
(Billboards in Times Square showing the Windows BSOD as a result of the CrowdStrike incident)
CrowdStrike agents are installed virtually everywhere. When this update went out, it affected companies in a myriad of ways: some were forced to fall back to paper, hundreds of thousands of medical devices were affected, putting people’s lives at risk, flights were canceled and delayed, some TV stations couldn’t broadcast, and some banks couldn’t function.
This blog post focuses on engineering learnings from the CrowdStrike Falcon outage and reflects on how organizations can avoid similar failures through standard engineering practices.
What is this post about?
I will break down engineering practices and cultures that I have seen in my years of experience that, if implemented, would have prevented this incident.
Incidents happen very frequently in engineering organizations. We should promote engineering cultures where we take learnings from public incidents and apply them as lessons within our own organizations.
The incident at CrowdStrike could have happened to virtually every organization globally (including other security vendors). I run an Attack Surface Management security startup myself (FullHunt.io), and I know well that this kind of incident happens. The best takeaway here is to reflect on your own engineering culture and see how such an incident could be prevented in the first place.
YES - CrowdStrike could have prevented this incident and it should not have happened. Let’s just boldly state this fact.
Background: How did this incident happen?
The incident occurred when CrowdStrike pushed a global update to the CrowdStrike Falcon sensor.
CrowdStrike released a sensor configuration update to Windows systems. This is the kind of definition update that endpoint security agents apply frequently in the background to enrich protected endpoints with new rules, signatures, configurations, and instructions for detecting anomalies and attacks.
CrowdStrike Falcon runs at the kernel level, which makes updates more dangerous than they have to be: any error in the agent or its definitions can result in a system crash.
According to CrowdStrike, and confirmed by researchers who reverse-engineered the update, the Falcon configuration update introduced a change to a Channel File that controls how Falcon evaluates named pipe execution on Windows systems. The change was designed to block malicious named pipes used by common C2 (Command and Control) frameworks, which is actually a great goal. Named pipe execution is a novel evasion method, and CrowdStrike designed a rule to block malicious usage of named pipes.
However, the configuration update also introduced a logic error that resulted in an operating system crash. The kernel driver attempted to access a memory address that does not exist, triggering a BSOD (Blue Screen of Death) every time the CrowdStrike process started.
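This class of failure is exactly what automated pre-release validation is meant to catch. As a purely illustrative sketch (the channel file layout, magic value, and field names below are hypothetical assumptions, not CrowdStrike’s actual format), a definition update could be structurally validated before it ever ships:

```python
import struct

HEADER_FORMAT = "<4sHI"          # magic, version, rule count (hypothetical layout)
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)

def validate_channel_file(raw: bytes) -> list[str]:
    """Return a list of validation errors for a candidate definition update blob."""
    errors = []
    if len(raw) < HEADER_SIZE:
        errors.append("file is truncated: no complete header")
        return errors
    if raw.count(0) == len(raw):
        errors.append("file is all zero bytes: likely corrupted during packaging")
        return errors
    magic, version, rule_count = struct.unpack_from(HEADER_FORMAT, raw)
    if magic != b"CHNL":         # hypothetical magic value
        errors.append(f"unexpected magic bytes: {magic!r}")
    if rule_count == 0:
        errors.append("header declares zero rules: nothing to evaluate")
    expected_min_size = HEADER_SIZE + rule_count * 8   # assume 8-byte rule records
    if len(raw) < expected_min_size:
        errors.append("declared rule count exceeds file size: offsets would be invalid")
    return errors

if __name__ == "__main__":
    with open("channel_update.bin", "rb") as f:
        problems = validate_channel_file(f.read())
    if problems:
        raise SystemExit("Blocking release:\n" + "\n".join(problems))
    print("Channel file passed structural validation.")
```

A check like this is cheap to run on every build, and it refuses to ship anything the parser on the endpoint could choke on.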
Engineering Cultures are Important
Engineering cultures are important for technology companies, especially when your product affects the lives of millions of people every day. At the end of the day, every engineer would like to “write code, ship code fast”, but does that make the code reliable? Not always.
Engineers at all levels of seniority make mistakes. Mistakes happen often, and by everyone; we have to accept this fact. Humans make errors, AI introduces errors, and everything will eventually introduce errors at some point. That’s why standard engineering practices need to be put in place, and that’s how you prevent errors like this from reaching production.
I will dive deeper into engineering cultures after sharing some examples I have seen and implemented in engineering organizations that could have detected and prevented this CrowdStrike incident earlier in development, before it reached the production environments of millions of clients.
What are the Engineering Practices CrowdStrike could have implemented?
I’m writing these practices based on my experience, knowledge about the incident, and discussions with security engineering leaders about the CrowdStrike incident. While I do not have internal insights into how CrowdStrike maintains its Engineering program and Development Experience, we can safely assume that it is not implemented in a way that blocks faulty production releases, especially given the news that CrowdStrike had a very similar incident in April, where Debian and Rocky Linux builds were broken in production.
End-to-End Tests
While the CrowdStrike incident was caused by a definition update, that does not mean it cannot be tested. If the code can be read or executed, it can be tested.
End-to-end tests are a standard development practice that could have prevented a faulty change from shipping.
Test coverage is another practice built on end-to-end tests and test-driven development. With test coverage (also known as code coverage), you can make sure that an introduced change does not slip in untested logic. You can even write negative tests to make sure the logic behaves correctly under adverse conditions.
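To make this concrete, here is a minimal sketch of what positive and negative end-to-end tests for a definition update could look like, written with pytest. The `falcon_test_harness` module and its functions are hypothetical names for illustration, not real CrowdStrike APIs:

```python
import pytest

# Hypothetical engine under test: loads a definition update and evaluates events.
from falcon_test_harness import load_definitions, evaluate_named_pipe, DefinitionLoadError


def test_update_loads_without_crashing():
    """Positive check: the new channel file parses and loads cleanly."""
    engine = load_definitions("channel_updates/candidate.bin")
    assert engine.rule_count > 0


def test_malicious_named_pipe_is_blocked():
    """Positive check: the new rule actually blocks known-bad C2 pipe names."""
    engine = load_definitions("channel_updates/candidate.bin")
    assert evaluate_named_pipe(engine, r"\\.\pipe\known_c2_framework") == "block"


def test_benign_named_pipe_is_allowed():
    """Negative check: legitimate software using named pipes is not blocked."""
    engine = load_definitions("channel_updates/candidate.bin")
    assert evaluate_named_pipe(engine, r"\\.\pipe\mojo.browser_ipc") == "allow"


def test_corrupted_update_is_rejected_gracefully():
    """Negative check: a malformed file must raise a clean error, never crash."""
    with pytest.raises(DefinitionLoadError):
        load_definitions("channel_updates/all_zero_bytes.bin")
```

Coverage tooling such as coverage.py can then confirm that the code paths the new rule touches are actually exercised by these tests.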
Integration Tests
Integration tests are standard tests that should ideally be baked into the development cycle of definition updates. This works by deploying a disposable virtual machine with a CrowdStrike agent and then iterating through different malware, C2 implants, ransomware binaries, and regular software to see how the CrowdStrike agent with the latest definition update reacts to them.
This concept is used everywhere. FullHunt.io implements it to monitor new vulnerability checks being introduced to our customers, and we’re a small, bootstrapped company. Financial institutions have dedicated teams that focus on integration tests, and the aviation industry has been doing this for decades.
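Below is a minimal sketch of what such a loop can look like, assuming a hypothetical lab API (the `detonation_lab` module, `provision_vm`, and the result fields are illustrative placeholders, not a real SDK):

```python
# Sketch of an integration-test loop: spin up a disposable VM, install the agent
# with the candidate definition update, and replay a corpus of samples against it.
# All imported names are hypothetical placeholders for a lab's own tooling.
from detonation_lab import provision_vm, SampleCorpus

CANDIDATE_UPDATE = "channel_updates/candidate.bin"

def run_integration_suite() -> bool:
    corpus = SampleCorpus(
        malware_families=["c2_implants", "ransomware"],
        benign_software=["office_suites", "browsers", "line_of_business_apps"],
    )
    with provision_vm(image="windows-11-x64", snapshot=True) as vm:
        vm.install_agent(definition_update=CANDIDATE_UPDATE)
        for sample in corpus:
            result = vm.detonate(sample, timeout_seconds=120)
            # Hard failure conditions: the OS crashed, or the agent itself died.
            if result.os_crashed or not result.agent_alive:
                print(f"FAIL: {sample.name} crashed the endpoint")
                return False
            # Detection expectations: block malicious samples, leave benign ones alone.
            if sample.is_malicious and not result.detected:
                print(f"FAIL: missed detection for {sample.name}")
                return False
            if not sample.is_malicious and result.blocked:
                print(f"FAIL: false positive on {sample.name}")
                return False
    return True

if __name__ == "__main__":
    raise SystemExit(0 if run_integration_suite() else 1)
```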
Canary Builds and Deployments
Canary deployments work as an additional layer on top of integration tests. They check whether the new release builds properly and how it behaves once executed on a system, then wait for a period of time under heavy observability to make sure the change is not introducing critical errors. Canary deployments act as a gate that stops releases from being pushed to production when an error or issue is introduced.
Canary deployments are not new; they have been an engineering practice for years. They take time and effort to implement, but they are not impossible to build.
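A simplified sketch of a staged canary gate, assuming the deployment system exposes fleet-health metrics (the `fleet_api` client, stage fractions, and thresholds below are illustrative assumptions):

```python
import time

# Hypothetical fleet-management client; names are placeholders for illustration.
from fleet_api import deploy_update_to_fraction, crash_rate, rollback

CANARY_STAGES = [0.001, 0.01, 0.10, 1.0]   # fraction of the fleet per stage
BASELINE_CRASH_RATE = 0.0005               # assumed normal background crash rate
SOAK_MINUTES = 60

def staged_rollout(update_id: str) -> bool:
    for fraction in CANARY_STAGES:
        deploy_update_to_fraction(update_id, fraction)
        # Soak period: keep the stage small and watch health signals before widening.
        time.sleep(SOAK_MINUTES * 60)
        observed = crash_rate(update_id)
        if observed > BASELINE_CRASH_RATE * 5:
            # The gate: a crash-rate regression halts the rollout and rolls back.
            rollback(update_id)
            print(f"Rollout halted at {fraction:.1%}: crash rate {observed:.4%}")
            return False
        print(f"Stage {fraction:.1%} healthy, widening rollout")
    return True
```

The key design choice is that widening the rollout is the gated action: a release only reaches the full fleet after each smaller stage has soaked without a crash-rate regression.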
Engineering Culture: Make it simple to write quality code
As discussed earlier, humans make mistakes. Mistakes are inevitable; that’s why we build guardrails and controls to ensure they do not reach production.
1. Make all quality checks clear
Make all quality checks clear: all services, changes, and updates need to be written with automated end-to-end tests in mind. A Pull Request should never be merged without passing all checks. All engineering practices I have mentioned can be automated and implemented today for any product or software.
2. Gated Guardrails: Make it hard to introduce faulty code
This is not difficult; it’s an engineering culture matter more than a technical one. Setting gated guardrails with the current tooling on GitHub and GitLab is pretty straightforward. You can set controls that block a definition or agent update that does not function properly on a Windows machine. Code reviews should add an additional layer of confidence that introduced changes are safe to merge.
If every introduced change had to:
- Include and pass end-to-end tests with both positive and negative checks.
- Include integration tests that validate that the rules run correctly on the different supported environments.
- Pass canary builds that block releases from reaching production when an update consumes excessive computing resources or crashes the system with a BSOD.
Then this incident would have been much harder to introduce.
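Tying those three classes of checks together into a single merge and release gate can be as simple as a script the CI pipeline runs on every pull request. A sketch, where the three check functions are assumed wrappers around the tests described above:

```python
# Sketch of a release gate: every check must pass before a definition update
# can be merged or promoted. The imported functions are assumed wrappers around
# the end-to-end, integration, and canary checks described above.
from release_checks import run_e2e_tests, run_integration_tests, run_canary_build

REQUIRED_CHECKS = {
    "end_to_end_tests": run_e2e_tests,
    "integration_tests": run_integration_tests,
    "canary_build": run_canary_build,
}

def gate(update_id: str) -> bool:
    failures = [name for name, check in REQUIRED_CHECKS.items() if not check(update_id)]
    if failures:
        print("Release blocked. Failing checks: " + ", ".join(failures))
        return False
    print("All required checks passed; release may proceed.")
    return True

if __name__ == "__main__":
    import sys
    raise SystemExit(0 if gate(sys.argv[1]) else 1)
```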
3. Make it part of the engineering culture
Let’s face it: introducing tests at different levels can slow down builds and releases. It may therefore take significant engineering-culture work before they are accepted as the norm within an organization.
For a small startup, baking all of this in may not always be the best use of time and resources, unless the startup’s product is considered critical infrastructure. But if your company produces a product used by hundreds of millions of people around the world, this is the minimum expectation.
4. Make it elegant
I have seen companies invest time and effort to make this part of all engineering and development work by making testing easy and streamlined.
For example, abstracting all tests into GitHub Actions, GitLab Runners, CircleCI, TravisCI, or similar CI pipelines, and then ensuring that all services have testing enabled, is one way to start. A second step can be to introduce service scores for the services and teams that implement these practices correctly. Gating updates is also simple and elegant today through branch protection on GitHub and GitLab. Once these controls are gradually put in place, they will slowly become part of the culture.
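As a hypothetical illustration of such a service score, a small script can award points for each guardrail a service has enabled so that teams see where they stand (the fields and weights below are made up for this example):

```python
# Hypothetical service scorecard: award points for each guardrail a service has
# enabled, so teams can see their standing at a glance. Field names are illustrative.
GUARDRAIL_WEIGHTS = {
    "has_e2e_tests": 30,
    "has_integration_tests": 30,
    "has_canary_deployments": 25,
    "branch_protection_enabled": 15,
}

def service_score(service: dict) -> int:
    return sum(weight for key, weight in GUARDRAIL_WEIGHTS.items() if service.get(key))

services = [
    {"name": "sensor-definitions", "has_e2e_tests": True, "branch_protection_enabled": True},
    {"name": "update-distribution", "has_e2e_tests": True, "has_integration_tests": True,
     "has_canary_deployments": True, "branch_protection_enabled": True},
]

for svc in services:
    print(f"{svc['name']}: {service_score(svc)}/100")
```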
Another way to reinforce this habit with engineers is to make it part of KPIs and bonus plans, and to send gifts and/or acknowledgements to team members who work on championing these values.
5. Do not make it hard
Making it hard is the opposite of “Make it clear” and “Make it elegant”.
Testing is important, but making it hard and time-consuming for people will result in extremely poor developer experience and velocity.
An example to avoid is a dedicated team whose sole focus is testing new features developed by engineers and that has to approve every single pull request.
With “Do not make it hard”, service owners should make sure that automated tests are in place, baked into the CI pipeline, and close to the developer’s eyes, on GitHub for example. If an engineer pushes a change, they should get a continuous feedback loop on whether it can introduce breaking changes. This should be simple and elegant, that’s all.
If it’s hard for engineers to introduce changes, engineering and innovation will take a hit at first, followed by customers and users.
Kernel-Mode vs. User-Mode Architecture for Endpoint Security
My awesome friend Matt Suiche has written an excellent post here: Bob and Alice in Kernel Land (backstory: I was introduced to kernel security architecture for the first time thanks to Matt!).
The decision on whether to run security software in kernel mode vs. user mode depends on the software’s goals and design choices. Kernel-mode architecture provides deeper access and faster detection. However, it comes at the cost of extra security risk (compromised security software running in kernel mode can control everything on the machine), can introduce incidents like this one, and increases development and testing time.
Automatically disabling the kernel component on recurring BSOD events
Naveed Hamid shared an excellent idea: CrowdStrike Falcon should have the capability to disable kernel-level access on a subsequent reboot when kernel-level bugs occur and cause a BSOD. This would give CrowdStrike and its partners a stable environment in which to resolve such incidents. Kernel-level changes can cause a BSOD (Blue Screen of Death), and it is possible to identify recurring BSOD events. In cases where the machine lands in Windows safe mode, the architecture can also be designed to disable kernel-level components and run safely at the user level. FortiEDR has implemented this concept; here is a longer read about the approach: link.
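A sketch of the fallback logic behind that idea, assuming the agent’s user-mode service can persist a boot-health counter across reboots (the state file path, threshold, and mode names below are hypothetical):

```python
# Sketch of the fallback logic: count consecutive boots where the kernel
# component crashed before reaching a healthy state, and switch the agent to
# user-mode-only operation once a threshold is hit. The state file path,
# threshold, and returned mode names are hypothetical placeholders.
import json
from pathlib import Path

STATE_FILE = Path(r"C:\ProgramData\ExampleAgent\boot_health.json")  # hypothetical
CRASH_THRESHOLD = 3

def record_boot_outcome(kernel_component_healthy: bool) -> str:
    state = {"consecutive_crashes": 0}
    if STATE_FILE.exists():
        state = json.loads(STATE_FILE.read_text())
    if kernel_component_healthy:
        state["consecutive_crashes"] = 0
    else:
        state["consecutive_crashes"] += 1
    STATE_FILE.parent.mkdir(parents=True, exist_ok=True)
    STATE_FILE.write_text(json.dumps(state))
    if state["consecutive_crashes"] >= CRASH_THRESHOLD:
        # Fall back to user-mode-only protection so the machine stays bootable
        # while the faulty kernel-level update is investigated and fixed.
        return "user-mode-only"
    return "kernel-mode"
```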
Thanks
I would like to thank E Coleen C, Naveed Hamid, Matt Suiche, and Mohammed Morsy for their great thoughts on this incident and on the engineering learnings and lessons we can take away from it.
Best Regards,
Mazin Ahmed