Chaos Engineering- Chaos Toolkit- S1E2- Pod Kill

In this episode, we will walk through a sample experiment that kills a pod in a Kubernetes cluster. If you have not seen the introduction to Chaos Engineering yet, please visit S1E1.

Kubernetes Pod Failure

Pod failures are bound to happen, and with an orchestration tool like Kubernetes we know that the pod's controller (for example, the ReplicaSet behind a Deployment) will automatically replace a failed pod, while the Horizontal Pod Autoscaler (HPA) adjusts the replica count based on load. But do we know how long it takes to bring up a new pod? Would the remaining pods be able to serve the traffic during peak hours? What if the new pod does not start and goes into an endless restart loop? Can we afford to have our system go down in Production in circumstances that are highly likely to happen?

If we wait for this to happen in Production, answering all of these questions will be a nightmare. That’s where Chaos Experiments help: by deliberately killing a pod in Production, we can see how our system reacts, decide whether it meets our expectations, and make it more resilient, robust and fault tolerant.

Extension chaosk8s:

The chaosk8s extension (published as the chaostoolkit-kubernetes package) contains activities, such as probes and actions, that you can call from your experiment through the Chaos Toolkit to perform Chaos Engineering against the Kubernetes API: killing a random pod, removing a statefulset or a node.

Kill a random pod:

The experiment takes a label as an argument and randomly kills a pod that matches that label. Because our pods are managed by Kubernetes, the controller that owns them (for example, a Deployment's ReplicaSet) automatically creates a replacement, which lets us verify that our system is resilient to pod failures. So, this experiment gives you a picture of how your system behaved while the pod was being killed and recovered.
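To make the selection step concrete, here is a minimal sketch of the idea: filter by label, optionally by a name pattern, then pick at random. It is an illustration only, not the extension's source code; the `select_pods` helper, the dict-based pod records and the pod names are all hypothetical, and nothing in it talks to a cluster.

```python
import random
import re


def select_pods(pods, label_selector, name_pattern=None, rand=False, qty=1, rng=random):
    """Return the pods a terminate-style action would target.

    `pods` is a list of dicts with "name" and "labels" keys, a stand-in
    for what the Kubernetes API would return for a namespace.
    """
    # Only simple "key=value" selectors are handled in this sketch.
    key, _, value = label_selector.partition("=")
    matching = [p for p in pods if p["labels"].get(key) == value]

    # Optionally narrow the candidates by a regular expression on the pod name.
    if name_pattern:
        pattern = re.compile(name_pattern)
        matching = [p for p in matching if pattern.search(p["name"])]

    qty = min(qty, len(matching))
    if rand:
        return rng.sample(matching, qty)  # random pick among the matches
    return matching[:qty]                 # otherwise take the first ones


# Hypothetical pods, for illustration only.
pods = [
    {"name": "cart-7d9f-abc12", "labels": {"app": "cart"}},
    {"name": "cart-7d9f-def34", "labels": {"app": "cart"}},
    {"name": "search-5c6b-xyz99", "labels": {"app": "search"}},
]
victims = select_pods(pods, "app=cart", name_pattern=r"^cart-", rand=True)
print([p["name"] for p in victims])
```

Because the pick is random, which of the two `cart` pods is printed varies from run to run.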

In the JSON below, you can replace the placeholders with your specifications.

{
    "type": "action",
    "name": "terminate_pods",
    "provider": {
        "type": "python",
        "module": "chaosk8s.pod.actions",
        "func": "terminate_pods",
        "arguments": {
            "label_selector": "app=<your_label>",
            "name_pattern": "<a regular expression for your pod name>",
            "rand": true,
            "ns": "<your_namespace>"
        }
    }
}
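The action above is only one step of an experiment. A Chaos Toolkit experiment file also needs a title, a description and a method, and usually a steady-state hypothesis. The sketch below shows one way the action might be wrapped into a complete file. The `build_experiment` helper, the probe name, the titles and the health-check URL are assumptions made for illustration, not taken from the original experiment.

```python
import json


def build_experiment(label, namespace, health_url):
    """Assemble a minimal pod-kill experiment as a Python dict."""
    # The action from the article, with the placeholders filled in.
    action = {
        "type": "action",
        "name": "terminate_pods",
        "provider": {
            "type": "python",
            "module": "chaosk8s.pod.actions",
            "func": "terminate_pods",
            "arguments": {
                "label_selector": f"app={label}",
                "rand": True,
                "ns": namespace,
            },
        },
    }
    # Assumed steady state: the service answers HTTP 200 on a health URL.
    probe = {
        "type": "probe",
        "name": "service-responds",
        "tolerance": 200,
        "provider": {"type": "http", "url": health_url, "timeout": 3},
    }
    return {
        "title": "Service keeps responding when one pod is killed",
        "description": "Kill a random pod matching a label and check the service.",
        "steady-state-hypothesis": {
            "title": "Service is healthy",
            "probes": [probe],
        },
        "method": [action],
    }


# Hypothetical values; replace them with your own label, namespace and URL.
experiment = build_experiment("cart", "default", "http://localhost:8080/health")
print(json.dumps(experiment, indent=4))
```

Saving the printed JSON to a file such as `experiment.json` gives you something you can pass to `chaos run`. The steady-state hypothesis is checked before and after the method, which is how the run can tell whether the service stayed healthy.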

How to run a Chaos Toolkit Experiment?

Well, first you need Python 3 to run Chaos Toolkit experiments. Create and activate a virtual environment in which to run them, and install the Chaos Toolkit along with its Kubernetes extension. The table below lists the steps and commands to run an experiment on macOS.

Prerequisites for running an experiment
GitHub: please visit my GitHub for a sample pod kill experiment.
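The setup described in this section (a virtual environment, the toolkit and its Kubernetes extension, then a run) might look like the sketch below. The directory name `chaos-venv`, the file name `experiment.json` and both helper functions are assumptions; the script only prints the commands unless you call `run_steps` yourself, and it assumes the macOS/Linux virtual environment layout with a `bin` directory.

```python
import shlex
import subprocess
from pathlib import Path


def setup_and_run_commands(venv_dir="chaos-venv", experiment="experiment.json"):
    """Return the commands for a fresh Chaos Toolkit setup and a run."""
    bin_dir = Path(venv_dir) / "bin"  # macOS/Linux venv layout
    return [
        # 1. Create an isolated virtual environment.
        ["python3", "-m", "venv", venv_dir],
        # 2. Install the Chaos Toolkit CLI and the Kubernetes extension.
        [str(bin_dir / "pip"), "install", "-U", "chaostoolkit", "chaostoolkit-kubernetes"],
        # 3. Run the experiment with the CLI installed into that environment.
        [str(bin_dir / "chaos"), "run", experiment],
    ]


def run_steps(commands):
    """Execute each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Dry run: show the commands without executing anything.
commands = setup_and_run_commands()
for cmd in commands:
    print(shlex.join(cmd))
```

Calling `run_steps(commands)` would execute them for real. That kills a pod in whichever cluster your current Kubernetes credentials point to, so check the context first.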

Experiment Output:

Pod kill experiment

Changes in cluster:

HPA in action

Conclusion: We can clearly see the pod terminating and a new pod initializing. While the pod was down our steady state was still met, which means the other pods were able to serve the incoming requests. We can now analyse our system's behaviour over the duration of pod termination and initialisation, and add more replicas if needed to support business traffic.

Coming up next in S1E3 is ToxiProxy, a catchy name for an agent that injects network-level latency!
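One way to do that analysis is to read the journal that `chaos run` writes (by default `journal.json`) and pull out whether the steady state held and how long each activity took. The sketch below assumes the journal carries `status`, `deviated`, `steady_states` and `run` keys; the `sample` journal is hand-written for illustration and is not output from the experiment in this post.

```python
import json


def load_journal(path="journal.json"):
    """Read a journal file written by a Chaos Toolkit run."""
    with open(path) as handle:
        return json.load(handle)


def summarize_journal(journal):
    """Extract steady-state results and per-activity durations."""
    steady = journal.get("steady_states") or {}
    activities = [
        (run["activity"]["name"], run.get("status"), run.get("duration"))
        for run in journal.get("run", [])
    ]
    return {
        "status": journal.get("status"),
        "deviated": journal.get("deviated"),
        "steady_before": (steady.get("before") or {}).get("steady_state_met"),
        "steady_after": (steady.get("after") or {}).get("steady_state_met"),
        "activities": activities,
    }


# Hand-written sample with made-up values, for illustration only.
sample = {
    "status": "completed",
    "deviated": False,
    "steady_states": {
        "before": {"steady_state_met": True},
        "after": {"steady_state_met": True},
    },
    "run": [
        {"activity": {"name": "terminate_pods"}, "status": "succeeded", "duration": 0.4},
    ],
}
summary = summarize_journal(sample)
print(summary)
```

With a real run, `summarize_journal(load_journal())` gives the same summary for your own experiment. The time a replacement pod takes to become ready is not in the journal by itself; you would need a probe or cluster events to measure it.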

For a detailed list of all probes and actions that you can perform in a Chaos Experiment, please take a look at the following resources.

References:
https://principlesofchaos.org/
https://chaostoolkit.org/drivers/kubernetes/
https://en.wikipedia.org/wiki/Chaos_engineering

Chaos Engineering- Chaos Toolkit- S1E2- Pod Kill was originally published in Walmart Global Tech Blog on Medium.

Article Link: Chaos Engineering- Chaos Toolkit- S1E2- Pod Kill | by Kishore Kumar Naidu | Walmart Global Tech Blog | Jan, 2024 | Medium