Preventing Catastrophic Software Deployment Issues: Lessons from the CrowdStrike Incident

The recent incident involving a faulty update from cybersecurity company CrowdStrike highlighted the potential risks of software deployment. The update caused widespread disruptions, affecting airports, healthcare facilities, and emergency call centers. Despite having a sophisticated DevOps pipeline in place, the buggy code managed to slip through, causing a significant blow to CrowdStrike’s reputation and stock price.

CrowdStrike acknowledged the gravity of the situation and quickly deployed a fix. However, the root cause of the problem is still under investigation. Dan Rogers, CEO of LaunchDarkly, a company that uses feature flags to control software deployment, explained that most of the software issues users experience stem not from infrastructure problems but from faulty software releases. With feature flags, companies can control the speed of a rollout and disable problematic features before they cause widespread issues.
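
To make the feature-flag idea concrete, here is a minimal Python sketch. It is not LaunchDarkly's API; the `FeatureFlag` class, its fields, and the flag name are hypothetical, and a real system would store flag state in a shared service rather than in memory.

```python
import hashlib


class FeatureFlag:
    """Hypothetical in-memory flag with a percentage rollout and a kill switch."""

    def __init__(self, name: str, rollout_percent: int = 0, enabled: bool = True):
        self.name = name
        self.rollout_percent = rollout_percent  # share of users who get the feature
        self.enabled = enabled                  # False means the kill switch is thrown

    def _bucket(self, user_id: str) -> int:
        # Hash the flag name and user ID so each user lands in a stable bucket 0..99.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).hexdigest()
        return int(digest, 16) % 100

    def is_on(self, user_id: str) -> bool:
        # The kill switch overrides the rollout percentage.
        if not self.enabled:
            return False
        return self._bucket(user_id) < self.rollout_percent

    def kill(self) -> None:
        # Disable the feature for everyone without shipping new code.
        self.enabled = False


flag = FeatureFlag("new-parser", rollout_percent=10)
if flag.is_on("user-42"):
    pass  # run the new code path
else:
    pass  # run the existing code path
```

Because the bucket depends only on the flag name and user ID, raising `rollout_percent` keeps every user who already had the feature and adds new ones, which is what lets a team widen a release gradually.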

In the case of CrowdStrike, the problem occurred at the operating system kernel level, making it more challenging to fix. However, a slower deployment process might have alerted the company to the issue earlier. Jyoti Bansal, CEO of Harness Labs, emphasized that even companies with good release practices can experience similar problems. He noted that in larger organizations with multiple engineering teams, inconsistencies in software release practices can lead to code slipping through the cracks.

To minimize the risk of bugs slipping through during software deployment, both CEOs recommend following standard release hygiene practices. This includes comprehensive testing before deployment and a controlled rollout process. Rogers highlighted the importance of progressive rollouts, where changes are initially released to a small subset of users before expanding further. If issues arise, companies can quickly roll back to the previous version.
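
A progressive rollout with automatic rollback can be sketched in a few lines of Python. The stage percentages and the `set_percent` and `is_healthy` callbacks are assumptions made for illustration; in practice they would call a deployment system and a monitoring system.

```python
from typing import Callable, Sequence

# Assumed stage sizes: the share of users receiving the new version at each step.
STAGES = (1, 5, 25, 50, 100)


def progressive_rollout(
    set_percent: Callable[[int], None],
    is_healthy: Callable[[int], bool],
    stages: Sequence[int] = STAGES,
) -> bool:
    """Widen the release stage by stage; roll back to 0% on the first failure."""
    for percent in stages:
        set_percent(percent)
        if not is_healthy(percent):
            set_percent(0)  # roll back: everyone returns to the previous version
            return False
    return True


# Usage with stand-in callbacks that record each step.
history = []
ok = progressive_rollout(history.append, lambda percent: percent < 25)
print(ok, history)
```

In this run the health check fails at the 25% stage, so the function stops there and rolls back instead of continuing to 50% and 100%.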

Bansal also suggested utilizing “canary deployments,” which involve small controlled test deployments before a progressive rollout. This approach helps catch issues that may not have been detected during lab testing. Combining thorough DevOps testing with controlled deployment can help identify and address potential problems effectively.
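
One common way to judge a canary is to compare its error rate with that of the version already in production. The Python sketch below assumes that approach; the `Metrics` type, the one-percentage-point threshold, and the minimum request count are illustrative choices, not values from either company.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Avoid dividing by zero when no traffic has been observed yet.
        return self.errors / self.requests if self.requests else 0.0


def canary_passes(
    baseline: Metrics,
    canary: Metrics,
    max_increase: float = 0.01,   # assumed tolerance: one percentage point
    min_requests: int = 100,      # assumed minimum traffic before judging
) -> bool:
    """Return True if the canary's error rate is close enough to the baseline's."""
    if canary.requests < min_requests:
        return False  # too little traffic to draw a conclusion
    return canary.error_rate - baseline.error_rate <= max_increase


baseline = Metrics(requests=10_000, errors=50)
print(canary_passes(baseline, Metrics(requests=500, errors=3)))
print(canary_passes(baseline, Metrics(requests=500, errors=50)))
```

Only a canary that passes this kind of check would move on to the wider progressive rollout; a failing one is withdrawn while it still affects a small group.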

When analyzing software testing processes, Rogers emphasized the importance of considering platform, people, and processes. All three elements need to work together to ensure successful software deployment. To prevent engineers or teams from bypassing the deployment pipeline, it’s crucial to establish a standardized approach that doesn’t hinder productivity. Bansal added that while automation can be beneficial, finding the right balance between security and release velocity is challenging.

While the exact cause of the CrowdStrike incident is still unknown, sound software deployment practices can help minimize the risk of catastrophic failures. Bugs may occasionally slip through, but following standardized processes and implementing controlled rollouts can limit their impact.