Chaos Engineering- Chaos Toolkit- S1E1- Introduction

This is the first in a series of articles on Chaos Engineering: what it is, why we need it, ways to implement it, and sample code for doing so. Let’s start with the introduction!

Chaos can occur anywhere in the system.

What is Chaos Engineering?

Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system’s capability to withstand turbulent and unexpected conditions. Using chaos engineering tools, we can induce chaos/failures/toxicity into our system to understand how our system reacts and how resilient it is to failures. Based on the results, we can improve the resilience of our system.

For example, we may not know how our system will react when:

  • A downstream system is down
  • A downstream system responds with a delay
  • Database calls are delayed
  • The network has added latency
  • One of our regions goes down

We are also unsure when a failed pod will come back up. We could rely on an orchestrator like Kubernetes and rest assured that if a pod goes down, it will be replaced automatically by its controller (a Deployment’s ReplicaSet, for example; the HPA only adjusts the replica count based on load), but have you seen it happen? How long does it take? Would the other pods be able to serve all the traffic in the meantime?

What if a pod never comes up and goes into an infinite restart loop?
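One way to put numbers on these questions is to poll a health check after a failure and time how long recovery takes. The sketch below is only an illustration: `measure_recovery_seconds` and `fake_check` are hypothetical names, and a real check would query your own service or cluster instead of the stand-in used here.

```python
import time
from typing import Callable, Optional


def measure_recovery_seconds(
    is_healthy: Callable[[], bool],
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
) -> Optional[float]:
    """Poll is_healthy until it returns True.

    Returns the elapsed seconds, or None if the timeout is reached first
    (for example, a pod stuck in a restart loop).
    """
    start = time.monotonic()
    while True:
        if is_healthy():
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


if __name__ == "__main__":
    # Stand-in for a real health check: reports healthy on the third poll.
    calls = {"n": 0}

    def fake_check() -> bool:
        calls["n"] += 1
        return calls["n"] >= 3

    elapsed = measure_recovery_seconds(fake_check, timeout_s=5, interval_s=0.1)
    if elapsed is None:
        print("service never recovered within the timeout")
    else:
        print(f"service recovered after {elapsed:.2f}s")
```

Returning None on timeout keeps the "never comes up" case explicit instead of letting the loop run forever.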

Otherwise, the only time we get to examine these scenarios is when they actually happen: a P1 ticket is opened and a developer’s sleep goes up in flames.

This is exactly where Chaos Engineering comes in: it helps you inject faults into your system and understand how the system reacts to failures. It greatly helps in laying down a fallback plan and improving it through repeated chaos experiments. This way, you are a step ahead both in avoiding failures and in handling them in production should they occur.

Principles of Chaos Engineering: The cycle is simple:

  1. Define a steady state as some measurable output of a system that indicates normal behaviour.
  2. Hypothesize that the steady state holds both before and after you induce chaos.
  3. Validate the result of your experiment against the hypothesis and improve the system where it fell short.
  4. Go through the cycle again until you reach the required confidence.
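
The cycle above can be sketched in a few lines of plain Python. This is a toy model, not how any particular tool implements it: `run_chaos_cycle` and the `system` dictionary are hypothetical, and the "fault" simply removes one replica from an in-memory counter.

```python
from typing import Callable


def run_chaos_cycle(
    steady_state_ok: Callable[[], bool],
    inject_fault: Callable[[], None],
    rollback: Callable[[], None],
) -> str:
    """Run one cycle: check steady state, inject a fault, check again, roll back."""
    # Steps 1 and 2: the steady state must hold before we touch anything.
    if not steady_state_ok():
        return "aborted: system was not in steady state before the experiment"
    try:
        inject_fault()
        # Step 3: validate the hypothesis while the fault is in effect.
        held = steady_state_ok()
    finally:
        # Always try to restore the system, even if the fault injection raised.
        rollback()
    return "passed: steady state held" if held else "failed: weakness found"


if __name__ == "__main__":
    # Toy system: steady state means at least 2 healthy replicas.
    system = {"healthy_replicas": 3}
    result = run_chaos_cycle(
        steady_state_ok=lambda: system["healthy_replicas"] >= 2,
        inject_fault=lambda: system.update(
            healthy_replicas=system["healthy_replicas"] - 1
        ),
        rollback=lambda: system.update(healthy_replicas=3),
    )
    print(result)
```

Step 4 of the cycle amounts to calling this again, with new faults or after fixes, until the results give you the confidence you need.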

What is Chaos Toolkit?

Chaos Toolkit is a free, open source project that aims to provide a common API to the Chaos Engineering tools the community needs. It aims to be the simplest and easiest way to start building your own Chaos Engineering experiments.

Chaos Toolkit Experiment:

A Chaos Toolkit experiment is a JSON document written to induce chaos into the system and study the results. It describes both the chaos activities and the order in which they should be applied.

A sample experiment written in JSON is available here.

An experiment can be broken down into the following elements:

1. Experiment version, title and description.

2. Steady State Hypothesis: The Steady State Hypothesis element describes what normal state looks like in our system before the Method element is applied. If the steady state is not met, the Method element is not applied and the experiment must bail out.

3. Probes property: The Steady State Hypothesis contains a probes property. Each Probe in it must define a tolerance property that acts as a gate for the experiment to carry on or bail out. Any Probe whose result falls outside its tolerance fails the experiment.

4. Method: Method describes the sequence of Probe and Action elements to apply.

5. Probe: A Probe collects information from the system during the experiment.

6. Action: An Action performs an operation against the system during the experiment, such as injecting a failure.

7. Rollbacks: Rollbacks declare the sequence of actions that attempt to put the system back to its initial state.

8. Secrets: Secrets declare values that need to be passed on to Actions or Probes in a secure manner.

9. Configuration: Configuration is meant to provide runtime values to actions and probes.
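
To make these elements concrete, here is a minimal sketch that assembles an experiment as a Python dictionary and prints it as JSON. The field names follow my understanding of the Chaos Toolkit experiment format and should be checked against the official documentation. Every value is a placeholder: the health URL, the `OPS_API_TOKEN` environment variable, and the `stop_instance.sh` and `start_instance.sh` scripts are assumptions made up for illustration.

```python
import json

experiment = {
    # 1. Version, title and description
    "version": "1.0.0",
    "title": "Service stays available when one instance is stopped",
    "description": "Illustrative only: URLs, scripts and names are placeholders.",
    # 9. Configuration: runtime values that activities can reference
    "configuration": {
        "service_url": "http://localhost:8080/health",
    },
    # 8. Secrets: declared per scope, here read from an environment variable
    "secrets": {
        "ops": {
            "api_token": {"type": "env", "key": "OPS_API_TOKEN"},
        },
    },
    # 2 and 3. Steady State Hypothesis with a probe gated by a tolerance
    "steady-state-hypothesis": {
        "title": "Health endpoint answers with HTTP 200",
        "probes": [
            {
                "type": "probe",
                "name": "health-endpoint-responds",
                "tolerance": 200,
                "provider": {
                    "type": "http",
                    "url": "${service_url}",
                    "timeout": 3,
                },
            }
        ],
    },
    # 4 and 6. Method: the ordered activities that introduce the chaos
    "method": [
        {
            "type": "action",
            "name": "stop-one-instance",
            "provider": {"type": "process", "path": "./stop_instance.sh"},
            "pauses": {"after": 5},
        }
    ],
    # 7. Rollbacks: attempt to restore the initial state
    "rollbacks": [
        {
            "type": "action",
            "name": "start-instance-again",
            "provider": {"type": "process", "path": "./start_instance.sh"},
        }
    ],
}

if __name__ == "__main__":
    print(json.dumps(experiment, indent=2))
```

Saving the printed output to a file such as experiment.json gives you a document in the shape described above. The secrets block is shown only to illustrate its shape; in this sketch no activity consumes it.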

How to run a Chaos Toolkit Experiment?

Well, first we will need Python to run Chaos Toolkit experiments, so it needs to be installed.

Prerequisites for running an experiment
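
With Python in place, the toolkit's command line can typically be installed with `pip install chaostoolkit` and an experiment started with `chaos run`. The sketch below wraps that command in a small script. The helper names `build_run_command` and `run_experiment` and the file name experiment.json are assumptions for illustration, and the script only prints a hint if the `chaos` command is not on your PATH.

```python
import shutil
import subprocess
from typing import List


def build_run_command(experiment_path: str) -> List[str]:
    """Return the CLI invocation that runs the given experiment file."""
    return ["chaos", "run", experiment_path]


def run_experiment(experiment_path: str) -> int:
    """Run the experiment and return the CLI exit code (1 if the CLI is missing)."""
    if shutil.which("chaos") is None:
        print("'chaos' command not found; try: pip install chaostoolkit")
        return 1
    completed = subprocess.run(build_run_command(experiment_path))
    return completed.returncode


if __name__ == "__main__":
    exit_code = run_experiment("experiment.json")
    print("chaos run exit code:", exit_code)
```

Running `chaos run experiment.json` directly in a terminal does the same thing; the wrapper is only useful if you want to trigger experiments from other Python code, such as a scheduled job.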

This concludes the introduction. Stay tuned for S1E2, which covers sample experiments, starting with pod kill.

References:
https://principlesofchaos.org/
https://chaostoolkit.org/
https://en.wikipedia.org/wiki/Chaos_engineering
https://www.gremlin.com/community/tutorials/chaos-engineering-tools-comparison/

Chaos Engineering- Chaos Toolkit- S1E1- Introduction was originally published in Walmart Global Tech Blog on Medium.

Article Link: Chaos Engineering- Chaos Toolkit- S1E1- Introduction | by Kishore Kumar Naidu | Walmart Global Tech Blog | Jan, 2024 | Medium