The Fire Drills Your System Needs During the Pandemic

During the coronavirus pandemic, with much of the global marketplace moving online, many companies find themselves unprepared for the extra traffic that online shopping and remote work generate. Knowing your system’s limitations is crucial to planning for them, and Castra is here to help you fire drill your system so you know how to fix it when it fails.

Look to the Cloud

Cloud computing offers the ability to scale capacity quickly and is a company’s best bet for sudden traffic increases. However, the reliability of your system still depends on your ability to choose the right configurations and capacity for scaling services. You need to understand how your system behaves under varying conditions in order to keep it up and running under duress.

Chaos Engineering

This is where chaos engineering comes into play. Using cloud platforms to “fire drill” your system brings to light any issues that could keep you from avoiding online disasters and maintaining business continuity. Testing exposes holes and vulnerabilities and shows how well you can maintain uptime, reliability, and bandwidth during disasters and emergency surges.

One way to do this is to schedule a Game Day: a day dedicated to running chaos engineering experiments against your infrastructure and services to see whether you can handle various failures. Game Days usually run between two and four hours, simulating a number of carefully developed test cases. These cases are based not only on past incidents but also on hypothetical future impacts to your system.

There are a number of chaos engineering tools out there, but the process can be daunting for smaller businesses that don’t have the advantage of an in-house IT team. As a result, while larger corporations have been running these types of fire drills, smaller businesses remain more at risk of online issues.

Controlled Disruptions of the System

Chaos engineering is all about controlled disruptions of your cloud-based system. By studying how your system reacts, you can identify the weak areas and work to improve resiliency. Proactively identifying and addressing these weaknesses helps you to break away from the reactive incident response model. What kind of disruptions might be used? Here are a few examples:

  • Killing a process on a Linux server
  • Inducing errors for a segment of live traffic serving customers in production
  • Stopping, rebooting, and terminating virtual machines
  • Removing network services, routers, and load balancers
  • Simulating the failure of an entire region
  • Introducing latency between services, simulating missing messaging topics, injecting random errors, and crashing Docker containers
  • Mimicking the unavailability of third-party APIs or adding latency to their responses
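
As a small-scale illustration of the latency and random-error disruptions listed above, here is a minimal Python sketch. It is not taken from any particular chaos engineering tool: the names inject_chaos, InjectedFault, fetch_price, and call_with_fallback are hypothetical, and the error rate and delay are arbitrary values chosen for the example. It wraps a stand-in for a third-party API call so that every request is delayed and a fraction of requests fail, which lets you check that the caller falls back gracefully.

```python
import random
import time
from functools import wraps


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def inject_chaos(error_rate=0.0, latency_s=0.0, rng=None, sleep=time.sleep):
    """Delay every call and fail a fraction of them.

    Hypothetical helper for illustration; not part of any real library.
    """
    if not 0.0 <= error_rate <= 1.0:
        raise ValueError("error_rate must be between 0 and 1")
    rng = rng or random.Random()

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)  # simulate a slow network hop or API
            if rng.random() < error_rate:
                raise InjectedFault(f"injected failure in {func.__name__}")
            return func(*args, **kwargs)
        return wrapper
    return decorator


# Assumed values: 20% of calls fail and each call is delayed by 50 ms.
@inject_chaos(error_rate=0.2, latency_s=0.05, rng=random.Random(42))
def fetch_price(item_id):
    """Stand-in for a call to a third-party pricing API."""
    return {"item_id": item_id, "price": 9.99}


def call_with_fallback(item_id):
    """The behavior under test: degrade gracefully instead of crashing."""
    try:
        return fetch_price(item_id)
    except InjectedFault:
        return {"item_id": item_id, "price": None, "degraded": True}


if __name__ == "__main__":
    results = [call_with_fallback(i) for i in range(20)]
    degraded = sum(1 for r in results if r.get("degraded"))
    print(f"{degraded} of {len(results)} calls hit the fallback path")
```

Passing a seeded random.Random makes a run repeatable, which helps when comparing behavior before and after a fix. In a real environment the fault would more likely be injected at the network or infrastructure layer than inside application code.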

Advantages and Disadvantages

There are a number of advantages to using chaos engineering to test the faults in your system. Such experiments:

  • Allow for analyzing real system behavior in real-time.
  • Control the damage since you have the ability to stop the experiment at any time.
  • Allow you to build an efficient disaster recovery plan.
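
The ability to stop an experiment at any time can be sketched as a simple stop condition. The Python example below is a hedged illustration only: run_experiment and its parameters are hypothetical, the inject, rollback, and probe callables stand in for whatever your tooling provides, and the 5% error-rate threshold is an assumed value. The runner injects a fault, polls a health metric, aborts as soon as the metric crosses the threshold, and always rolls the fault back.

```python
import time


def run_experiment(inject, rollback, probe, max_error_rate,
                   duration_s, interval_s=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Run one fault-injection experiment with an automatic stop condition.

    inject and rollback start and undo the fault; probe returns the
    current error rate (0.0 to 1.0). All are hypothetical callables.
    """
    observations = []
    aborted = False
    inject()
    try:
        deadline = clock() + duration_s
        while clock() < deadline:
            rate = probe()
            observations.append(rate)
            if rate > max_error_rate:
                aborted = True  # impact exceeded the agreed limit: stop early
                break
            sleep(interval_s)
    finally:
        rollback()  # always undo the fault, even on abort or exception
    return {"aborted": aborted, "observations": observations}


if __name__ == "__main__":
    # Simulated error-rate readings; a real probe would query monitoring.
    readings = iter([0.01, 0.02, 0.08, 0.01])
    events = []
    report = run_experiment(
        inject=lambda: events.append("fault injected"),
        rollback=lambda: events.append("fault rolled back"),
        probe=lambda: next(readings),
        max_error_rate=0.05,  # assumed threshold for this example
        duration_s=10,
        interval_s=0.01,
    )
    print(events)
    print(report)
```

With these simulated readings the third observation exceeds the threshold, so the run stops early and the rollback still executes. Placing the rollback in a finally block is the key design choice: the fault is removed whether the experiment finishes, aborts, or the probe itself fails.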

There are also challenges to keep in mind, such as:

  • Keeping customer data safe while testing for loss of data.
  • Managing outage duration while testing for system failure.
  • Absorbing increased costs when experiments target network bandwidth or raw disk storage.
  • Interpreting the results and enacting changes to the system to mitigate any issues.

Those challenges make it even more important to partner with a strong IT team that can manage the delicate balance these live drills require.

The Power of 24x7 Detection

Having a set of expert eyes trained on your systems around the clock can complement your fire drill strategy and may even reduce the need for some of those drills. With Castra’s in-depth Elite solution, our Security Operations Center (SOC) watches your network, investigates security alarms, tunes the system for better visibility, and works with you when we find anomalies. You can focus on your business while we take care of your system!

As online systems are increasingly at risk due to high traffic and remote use, it becomes even more important to test and plan for any foreseeable failure that could hamper the customer experience. Want to learn more about how to keep your system up and running regardless of the rigors of an emergency like the coronavirus pandemic? Castra is here to help.