What Is Chaos Engineering? The Art of Creating a Resilient System

Quynh Pham

Publish: 01/09/2023

As technology advances at an exponential rate, it is easy to fall behind. A seemingly minor mistake can cause major damage and downtime, which many businesses cannot afford. That’s why companies look for ways to prepare their systems for worst-case scenarios.

One way to do so is chaos engineering. As strange and somewhat contradictory as the name sounds, it is a legitimate testing method that has been used by some industry giants like Netflix, Amazon, Google, and Microsoft.

What Is Chaos Engineering?

Chaos engineering is the practice of testing distributed computer systems by deliberately introducing unexpected disruptions to determine how resilient the system is and to pinpoint potential weak areas.

These unforeseen interruptions might be a sudden natural disaster that destroys the hardware, a power outage, or a cyber attack. In short, chaos engineering is a practice to see whether applications are strong enough to weather the “chaos” in production.

Chaos engineering is framed as experimenting rather than testing. In a test, the variables are already known, whereas chaos engineering experiments concentrate on random and unpredictable scenarios. Tests are also often binary: once the test is performed, the result determines whether something is true or not. Experiments, on the other hand, often produce fresh insights and reveal new data.

How Does Chaos Engineering Work?

The goal of chaos engineering is to gain new insights into the system. It does so by intentionally breaking the system, finding and identifying the weak points, and then working on improving the system.

Chaos engineering is particularly relevant to distributed computing. Distributed computing refers to a collection of computers that are connected over a network to share resources. Such a system can break down when unexpected events take place, costing businesses hundreds of thousands of dollars. Research by Information Technology Intelligence Consulting (ITIC) has found that a single hour of downtime can cost businesses an average of $100,000.

This number is even bigger for large and complex systems with unpredictable, interdependent components. Debugging is also tricky: the larger the system, the more chaotic its behavior.
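To put the hourly figure in context, the arithmetic below converts an availability target into expected yearly downtime and cost. It is a rough sketch: the function names are illustrative, and the default of $100,000 per hour is simply the average quoted above, not a figure for any particular business.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours, ignoring leap years


def annual_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year for an availability between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def annual_downtime_cost(availability: float, cost_per_hour: float = 100_000.0) -> float:
    """Expected yearly cost of downtime, assuming a flat cost per hour."""
    return annual_downtime_hours(availability) * cost_per_hour


if __name__ == "__main__":
    for availability in (0.99, 0.999, 0.9999):
        hours = annual_downtime_hours(availability)
        cost = annual_downtime_cost(availability)
        print(f"{availability:.2%} available: {hours:.2f} h/year, ${cost:,.0f}/year")
```

Even a system that is available 99.9% of the time is down for almost nine hours a year under this simple model, which is why small resilience gains can matter.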

Therefore, to uncover new knowledge, such as hidden bugs, performance bottlenecks, or other blind spots, chaos engineers look at problems that seemingly have an endless list of root causes. They focus on the less likely causes rather than the more obvious ones, and run one or more failure scenarios against the distributed system to see what they reveal.

Principles of Chaos Engineering

Chaos engineering is more than chaos experiments. The practice takes a systematic approach, using planned experiments to better understand how the system behaves when an unexpected failure occurs. It follows a set of principles.

Start with a Baseline

What is the system’s “steady” state? To answer this, chaos engineers must identify the measures of normal system output, such as system throughput, error rates, and latency percentiles.
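As a minimal sketch of what capturing a baseline could look like, the snippet below summarizes a window of request records into the three measures named above. The `Request` and `SteadyState` types and the choice of the 95th percentile are assumptions made for illustration; in practice these numbers would normally come from an existing monitoring system.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Request:
    latency_ms: float
    ok: bool


@dataclass(frozen=True)
class SteadyState:
    throughput_rps: float   # requests per second
    error_rate: float       # fraction of failed requests, 0..1
    p95_latency_ms: float   # 95th percentile latency


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def measure_steady_state(requests: Sequence[Request], window_seconds: float) -> SteadyState:
    """Summarize the requests observed during one measurement window."""
    if not requests or window_seconds <= 0:
        raise ValueError("need at least one request and a positive window")
    errors = sum(1 for r in requests if not r.ok)
    return SteadyState(
        throughput_rps=len(requests) / window_seconds,
        error_rate=errors / len(requests),
        p95_latency_ms=percentile([r.latency_ms for r in requests], 95),
    )
```

Recording the same summary over several normal windows gives a range to compare against, rather than a single number.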

Hypothesize

Assume that the steady state will continue in both the control group and the experimental group. For example, the hypothesis might be that the steady state will continue when a service is unavailable.

Testing

The next step is to set up a simulation of uncertainty in combination with load testing. Testers then need to watch for any changes occurring within one or more of the following four pillars of an application: compute, networking, storage, and application infrastructure. The testing might reveal a problem with critical processes or a surprising cause-and-effect connection.
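One simple way to simulate uncertainty is to wrap calls to a dependency in a component that randomly fails or delays them while load is applied. The sketch below assumes a hypothetical `fetch_profile` dependency and a home-grown `FaultInjector` class; it is not the API of any particular chaos tool.

```python
import random
import time
from typing import Callable, Optional, Tuple, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised instead of making the real call, to simulate an unavailable dependency."""


class FaultInjector:
    def __init__(
        self,
        failure_rate: float,
        added_latency_s: float = 0.0,
        seed: Optional[int] = None,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        if not 0.0 <= failure_rate <= 1.0:
            raise ValueError("failure_rate must be between 0 and 1")
        self.failure_rate = failure_rate
        self.added_latency_s = added_latency_s
        self._rng = random.Random(seed)  # a seed makes an experiment repeatable
        self._sleep = sleep

    def call(self, func: Callable[[], T]) -> T:
        """Delay the call, then either fail it or let it through."""
        if self.added_latency_s > 0:
            self._sleep(self.added_latency_s)
        if self._rng.random() < self.failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func()


def fetch_profile() -> str:
    """Hypothetical dependency call, standing in for a real service request."""
    return "profile-data"


def run_under_load(injector: FaultInjector, calls: int) -> Tuple[int, int]:
    """Issue `calls` requests through the injector and return (successes, failures)."""
    successes = failures = 0
    for _ in range(calls):
        try:
            injector.call(fetch_profile)
            successes += 1
        except InjectedFault:
            failures += 1
    return successes, failures
```

For example, `run_under_load(FaultInjector(failure_rate=0.2, seed=1), 1000)` would exercise the caller with roughly one in five calls failing, and the resulting counts can be compared with the baseline.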

Attempt to Refute the Hypothesis

Because the hypothesis is based on the system’s steady state, any difference between the control and experimental groups invalidates it. From there, the engineers isolate and study the system failures and use that knowledge to make corrections or modifications. Once those fixes are in place, the system is more stable and resilient.
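The comparison between the control and experimental groups can be sketched as a check against agreed tolerances. The metric names and thresholds below are illustrative assumptions; a real experiment would use whatever measures define its own steady state.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Metrics:
    error_rate: float       # fraction of failed requests, 0..1
    p95_latency_ms: float


@dataclass(frozen=True)
class Tolerance:
    max_error_rate_increase: float    # absolute increase allowed, e.g. 0.01
    max_latency_increase_pct: float   # relative increase allowed, in percent


def find_deviations(control: Metrics, experiment: Metrics, tolerance: Tolerance) -> List[str]:
    """List every way the experimental group left the steady state."""
    deviations = []
    error_increase = experiment.error_rate - control.error_rate
    if error_increase > tolerance.max_error_rate_increase:
        deviations.append(f"error rate rose by {error_increase:.2%}")
    latency_limit = control.p95_latency_ms * (1 + tolerance.max_latency_increase_pct / 100)
    if experiment.p95_latency_ms > latency_limit:
        deviations.append(
            f"p95 latency {experiment.p95_latency_ms:.0f} ms exceeds limit {latency_limit:.0f} ms"
        )
    return deviations


def hypothesis_holds(control: Metrics, experiment: Metrics, tolerance: Tolerance) -> bool:
    """The hypothesis survives only if no deviation was found."""
    return not find_deviations(control, experiment, tolerance)
```

An empty list means the experiment failed to refute the hypothesis; each entry in a non-empty list is a starting point for investigation.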

Chaos Engineering Best Practices

Even with the principles understood, chaos engineering is still complicated. When running chaos engineering experiments, try to follow these best practices to improve the chances of success.

  • Perform Experiments in the Production Environment: The experiments should be conducted in the production environment. The thought can be scary, as that is where the users are and where traffic spikes are very real. However, running the experiment in production is by far the most reliable way to get an accurate answer about the resilience of your system.
  • Maintain a Small Blast Radius: Blast radius refers to any harm or impact caused by the experiment. An experiment shouldn’t bring down the entire production environment. Since each chaos experiment requires coordination from numerous teams, keep the experiments small and intentional. If an experiment does negatively affect the system, make sure you have backup plans to keep it up and running.
  • Understand the System Well: In order to make accurate assertions when accidents happen, you first need to understand the system well.
  • Cover the Projected Frequency/Impact of Failure: It is impossible to attain 100% test coverage in software. Therefore, instead of spending hundreds of hours on every possible experiment, focus coverage on what is most likely to fail or most costly when it does. Coverage in this context means examining events that have a significant impact, like storage failure, and events that happen regularly, like network failure in a distributed system.
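One way to act on the last practice is to rank candidate failure scenarios by a simple risk score and experiment on the top few. The sketch below scores each scenario as expected incidents per year multiplied by hours of downtime per incident; both the scoring rule and the sample numbers are assumptions for illustration, not measured data.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class FailureMode:
    name: str
    incidents_per_year: float           # how often it is expected to happen
    downtime_hours_per_incident: float  # how much it hurts when it does

    @property
    def expected_downtime_hours(self) -> float:
        """Frequency times impact: expected hours of downtime per year."""
        return self.incidents_per_year * self.downtime_hours_per_incident


def prioritize(modes: Sequence[FailureMode], budget: int) -> List[FailureMode]:
    """Return the `budget` failure modes with the highest expected yearly downtime."""
    if budget < 0:
        raise ValueError("budget must not be negative")
    ranked = sorted(modes, key=lambda m: m.expected_downtime_hours, reverse=True)
    return ranked[:budget]


# Illustrative, made-up estimates.
CANDIDATES = [
    FailureMode("storage failure", incidents_per_year=0.5, downtime_hours_per_incident=40),
    FailureMode("network failure", incidents_per_year=24, downtime_hours_per_incident=0.5),
    FailureMode("cache node restart", incidents_per_year=50, downtime_hours_per_incident=0.02),
]

if __name__ == "__main__":
    for mode in prioritize(CANDIDATES, budget=2):
        print(f"{mode.name}: {mode.expected_downtime_hours:.1f} h/year expected")
```

With these sample numbers, a rare but severe storage failure and a frequent but brief network failure both rank ahead of a frequent, low-impact restart.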

Advantages of Chaos Engineering

There are several benefits when you push the limits of your application.

Boosts System Resiliency

The experiments allow teams and organizations to better understand how the system performs under stress. As a result, companies can take measures to strengthen it.

Increased Revenue

Minimizing downtime means businesses aren’t losing money in costly outages or unexpected problems. This also means that companies are given the space to scale up their business without compromising the system’s stability.

Improved Customer Satisfaction

Customers are used to a seamless online experience. Therefore, when your application performs well, responds quickly, and consistently meets customer demands, customers are left with a positive experience.

Facilitates Better Collaboration

The insights gathered from experiments are shared among teams across the company, not just among the engineers. Chaos engineering motivates teams to collaborate effectively during the experiment in order to achieve the desired outcome, as everyone benefits from it.

Enhanced Failure Recovery

In the event of a similar outage, organizations can expedite recovery as chaos testing provides a comprehensive understanding of the system’s capability and behavior under different outage scenarios.

Chaos Engineering Challenges

Those who wish to start implementing chaos engineering should also be aware of its challenges.

The first challenge is limited resources. As mentioned earlier, chaos engineering requires multiple teams, even departments, to work together to make it happen. Some businesses may not be able to spare that much time and staff.

Next is the lack of a strong monitoring system. During chaos engineering, the system’s health and metrics need to be carefully monitored and kept under control. Without that, the blast radius can easily get out of hand and bring the entire system down. The lack of visibility also makes it difficult to pinpoint the root causes of problems.
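A common safeguard is an abort condition: keep checking a health metric while the experiment runs and roll back as soon as it crosses a limit. The sketch below assumes the error-rate samples arrive from some monitoring source and that `rollback` is a caller-supplied function; both are hypothetical stand-ins.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class GuardResult:
    aborted: bool
    samples_checked: int  # includes the sample that triggered the abort, if any


def run_guarded_experiment(
    error_rate_samples: Iterable[float],
    max_error_rate: float,
    rollback: Callable[[], None],
) -> GuardResult:
    """Check each error-rate sample in turn; roll back on the first one above the limit."""
    checked = 0
    for rate in error_rate_samples:
        checked += 1
        if rate > max_error_rate:
            rollback()  # stop injecting faults and restore normal operation
            return GuardResult(aborted=True, samples_checked=checked)
    return GuardResult(aborted=False, samples_checked=checked)
```

Because the guard stops at the first breach, the experiment ends after one bad sample instead of running to completion while users are affected.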

Last but not least is the lack of clarity regarding the initial state of the system before the test is executed. Without a clear understanding of the system’s stable state, teams may find it difficult to fully grasp the real-world consequences of the test. As a result, chaos testing becomes significantly less effective and may even put other systems at risk.

Chaos Engineering Tools

You can always adopt tools to make the process of chaos engineering more efficient. There are both open-source tools and paid solutions available. Make sure you have already listed out the business requirements and goals before choosing one.

Chaos Monkey

Chaos Monkey, created by Netflix in 2010, was the first chaos engineering tool. It is an open-source application built to test systems running on AWS. Many businesses besides Netflix now use Chaos Monkey. With detailed documentation, it is a good starting point.
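The general idea of randomly terminating instances while limiting the blast radius can be illustrated with a toy, in-memory selection function. This is not Chaos Monkey’s code or API; the function name, the instance IDs, and the `max_fraction` cap are all hypothetical.

```python
import math
import random
from typing import List, Optional


def pick_instances_to_terminate(
    instance_ids: List[str],
    max_fraction: float,
    seed: Optional[int] = None,
) -> List[str]:
    """Randomly choose instances to terminate, never more than max_fraction of the fleet."""
    if not 0.0 <= max_fraction <= 1.0:
        raise ValueError("max_fraction must be between 0 and 1")
    limit = math.floor(len(instance_ids) * max_fraction)  # round down to stay under the cap
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(instance_ids, limit)


if __name__ == "__main__":
    fleet = [f"i-{n}" for n in range(8)]  # made-up instance IDs
    print(pick_instances_to_terminate(fleet, max_fraction=0.25, seed=7))
```

Capping the selection at a fraction of the fleet keeps an experiment small and intentional, and the remaining instances can continue serving traffic.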

Simian Army

Simian Army is a collection of cloud-based failure generation, abnormal condition detection, and resilience testing services (called “Monkeys”). It consists of many chaos engineering tools, including Latency Monkey, Janitor Monkey, Doctor Monkey, and Security Monkey.

Gremlin Platform

You can experiment with chaos engineering with the aid of the Gremlin service. It gives you a number of attacks to choose from, which are injected into the system as different failure scenarios. The effects or harm of these attacks can then be recorded.

Getting Started with Chaos Engineering

Chaos engineering has become a valuable practice on the increasingly complex World Wide Web. We have become more and more dependent on numerous complex systems, and cybersecurity has become a serious concern in recent years. Well-managed chaos engineering allows engineers to better understand how systems react under stress and, from there, build stronger and more resilient systems. Robust systems have become essential in the golden digital era.

Monitor and improve your system’s health as soon as possible with Orient Software’s experienced and dedicated QA and Testing team. It is time to seriously take your system’s stability into consideration. Contact us and get help from the best experts in the field.

Quynh Pham

Writer


Quynh is a content writer at Orient Software who is an avid learner of all things technology. She enjoys writing and communicating her findings.
