How to Break Things the Right Way: 4 Basic Chaos Engineering Experiments
Some failures evade unit tests and straightforward debugging. That’s when chaos engineering becomes essential — the deliberate injection of failure shows how your system behaves under real-world stress. In this article, we’ll walk you through the steps of chaos engineering. You’ll learn how to run experiments and get ready-to-use manifests for common issues — from inconsistent configurations to connection leaks.
Whether you’re new to chaos engineering or just looking for ideas to level up existing experiments, this guide is for you.
Why Chaos Engineering Matters
Your system may seem perfect, yet incidents still happen: services crash, performance degrades, and outages occur, including security-related ones. Some of them are hard to reproduce and take a long time to investigate.
These types of problems are hard to cover with standard unit or integration tests, as they often surface only under high load or in complex service chains. But those failures can be effectively simulated with chaos engineering. This practice helps you understand how the system as a whole — not just individual components — responds and prepares your team to act during real incidents (outages, degradations, or partial failures).
Fault Tolerance Testing or Chaos Engineering?
These two are often confused, yet serve different purposes. Fault tolerance testing is a subset of the broader chaos engineering practice.
| | Fault Tolerance Testing | Chaos Engineering |
| --- | --- | --- |
| Goal | Verify how the system behaves under predictable, known failures (e.g., service crash, network outage) | Explore how the system responds to unpredictable, rare, or compound failures and uncover hidden weaknesses |
| Nature | Typically manual or automated tests, planned in advance | Structured experiments that introduce uncertainty and simulate chaos |
| Scope | Usually targets a single component or service | Involves the whole system and interactions between components |
| Tools | Failover setups, redundancy, failure simulation scripts, cloud-based fault injection tools | Chaos Mesh, Gremlin, Litmus |
The main idea of chaos engineering lies in conducting deliberate experiments to uncover systemic weaknesses. These experiments typically involve multiple teams and stakeholders and are designed to simulate real-world failure conditions in a controlled way.
Core Principles of Chaos Engineering
The practice is usually summarized by a handful of principles:
- Define the system’s steady state and build a hypothesis around it.
- Vary real-world events: crashes, latency, resource exhaustion, dependency failures.
- Run experiments as close to production as it is safe to do.
- Minimize the blast radius of every experiment.
- Automate experiments so they run continuously, not as one-off stunts.
Chaos Mesh: A Powerful Tool for Experimenting
Chaos engineering is still a relatively young discipline — roughly 15 years old. Early tools were fairly basic and could randomly take down services or even entire clusters. Modern, second-generation frameworks support far more nuanced experiments — not only “killing” services, but combining service failures with other conditions to simulate complex real-world failure modes.
Over time, chaos experiments have become more intentional and precise. This led to the need for flexible, safe, and extensible solutions that could be embedded into real infrastructure without breaking it. That’s exactly the context in which Chaos Mesh emerged in 2019 — an open-source platform created by PingCAP, originally designed to test their distributed database, TiDB.
Chaos Mesh enables controlled fault injection not just into services, but deeper into critical layers like the file system, network, process scheduler, HTTP layer, Kubernetes controllers, and more. All of this is done using CRDs and YAML — familiar tools for Kubernetes engineers. This ease of integration is one of the reasons Chaos Mesh quickly became one of the most widely adopted tools in the chaos engineering ecosystem.
Chaos Mesh Capabilities
1. Perfect for Kubernetes — and Beyond
Chaos Mesh was designed as a native Kubernetes solution. It integrates seamlessly into clusters using standard Kubernetes mechanisms — Custom Resource Definitions (CRDs), controllers, admission webhooks — making it especially convenient for DevOps and SRE teams already operating in cloud environments.
Yet Chaos Mesh isn’t limited to Kubernetes. It also supports bare-metal nodes and virtual machines, enabling fault injection in heterogeneous infrastructures. This is crucial in real-world scenarios, where many companies run hybrid architectures — part of the workload might be stored in Kubernetes, while other parts run on dedicated hardware or legacy systems.
2. Wide Range of Built-in Experiments
Chaos Mesh comes with a rich set of pre-defined failure scenarios, each packaged as a separate CRD. That means every failure type — whether it’s an HTTP delay, a file system read error, or a pod restart — is defined as a native Kubernetes resource.
This eliminates the need to write custom scripts or manually assemble experiments. You simply describe the desired behavior in YAML and apply it like any other Kubernetes resource.
3. Composite Scenarios and Workflow Support
Chaos Mesh goes beyond isolated failure injection by allowing the creation of complex, multi-step experiments. These can be structured as chains or trees of actions, with fine-grained control over timing, parallelism, and dependencies.
This enables you to simulate a full-blown incident timeline, closely mimicking what might happen in real production outages. Such flexibility makes Chaos Mesh a powerful tool for systems where it’s important not only to survive the first failure, but also to withstand cascading effects that follow.
How Chaos Mesh Works and Where to Begin
Chaos Mesh is composed of two main components:
1. Chaos Dashboard
A web-based interface for creating and running experiments through an intuitive UI. Each experiment is represented by a Kubernetes-native manifest, which means you can also interact with the system via client libraries in various programming languages or directly through the Kubernetes API.
2. Chaos Operator
Includes several key components:
- chaos-controller-manager — schedules experiments and manages their lifecycle via the CRDs.
- chaos-daemon — a privileged DaemonSet running on every node that performs the actual fault injection (network interference, process killing, file system faults).
Through the Chaos Dashboard, users can select the target runtime (Kubernetes or host-level) and specify the type of fault to inject, offering fine-grained control over the blast radius and experiment environment.
Running Experiments: Four Common Scenarios
Cascading Requests
Imagine: a user is interacting with a service that reliably communicates with a database. Traffic is routed through a load balancer. At a certain point, something unusual happens on the backend, so one of the service replicas crashes.
The load balancer continues distributing the same volume of traffic, but now across only two remaining replicas. As pressure builds, more and more errors start appearing.
Now the user hits F5, and the retries pile on top of background requests that the system is already struggling to handle, because earlier requests have exceeded their timeouts. Suddenly, everything fails at once.
It’s a nasty failure mode — especially if it happens in production. Fortunately, we can simulate and analyze this scenario using a built-in experiment from Chaos Mesh.
PodChaos Experiment Example
```yaml
kind: PodChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-a
  name: service-a-single-pod-failure
spec:
  selector:
    namespaces:
      - service-a
    labelSelectors:
      app: service-a
    pods:
      service-a:
        - service-a-ar12b
  mode: all
  action: pod-failure
  duration: 20m
```
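If you want to repeat an experiment like this on a regular cadence rather than as a one-off, Chaos Mesh provides a Schedule resource that wraps a chaos spec in a cron expression. A minimal sketch, assuming the same namespace and labels as above (the cron expression, name, and limits are illustrative):

```yaml
apiVersion: chaos-mesh.org/v1alpha1
kind: Schedule
metadata:
  name: nightly-pod-kill
  namespace: service-a
spec:
  # Run every night at 02:00
  schedule: "0 2 * * *"
  historyLimit: 5
  concurrencyPolicy: Forbid
  type: PodChaos
  podChaos:
    selector:
      namespaces:
        - service-a
      labelSelectors:
        app: service-a
    mode: one
    action: pod-kill
```

Running the experiment on a schedule turns a single drill into a standing check that replica failure stays survivable as the service evolves.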
What You Can Learn From This Experiment
- Whether the load balancer detects the failed replica and stops routing traffic to it.
- How the remaining replicas behave under the redistributed load, and whether errors cascade.
- Whether timeouts, retries, and circuit breakers are tuned to contain the failure.
- How quickly alerts fire and how quickly the pod recovers once the experiment ends.
Upstream Service Unavailability
Let’s say your service has evolved. Its functionality was expanded by adding new upstream services that provide additional data.
In this simplified example, there are two services with two replicas each — but in real-world systems, there could be dozens. Imagine a data aggregator system that pulls from various external sources, which can be added dynamically. This is a common architectural pattern.
Now, imagine Service B, an upstream dependency, becomes unavailable to Service A. On paper, this shouldn’t be catastrophic — after all, Service B is just one of many data sources, and it could be temporarily replaced with a stub or fallback without blocking the entire system.
But in real life, the unavailability of a single upstream service can sometimes lead to a full cascading system failure.
Let’s be clear: the system should be able to function for several minutes (or even hours) without that missing data. Instead, it locks up completely. To simulate and analyze this behavior, we can use a NetworkChaos experiment, which lets us manipulate the networking stack of the application.
NetworkChaos Experiment Example
```yaml
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-a
  name: service-a-to-service-b-net-part
spec:
  selector:
    namespaces:
      - service-a
    labelSelectors:
      app: service-a
  mode: all
  action: partition
  duration: 5m
  direction: to
  target:
    selector:
      namespaces:
        - service-b
      labelSelectors:
        app: service-b
    mode: all
```
This experiment partitions the network between Service A and Service B. The key elements here are:
- action: partition — cuts connectivity entirely rather than merely degrading it.
- direction: to — only traffic from Service A to Service B is affected; unrelated traffic is untouched.
- target — a second selector that defines the other side of the partition (Service B).
- duration: 5m — the partition heals automatically after five minutes.
Network partitioning is just one of several actions available through NetworkChaos. Others include packet delays, packet duplication, bandwidth throttling, packet loss or corruption.
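For instance, instead of a full partition you can inject latency into the same A-to-B path, which is often closer to what production actually looks like. A sketch assuming the same two services (the latency and jitter values are illustrative):

```yaml
kind: NetworkChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-a
  name: service-a-to-service-b-delay
spec:
  selector:
    namespaces:
      - service-a
    labelSelectors:
      app: service-a
  mode: all
  action: delay
  delay:
    latency: 300ms
    jitter: 50ms
  duration: 5m
  direction: to
  target:
    selector:
      namespaces:
        - service-b
      labelSelectors:
        app: service-b
    mode: all
```

A slow dependency is frequently more dangerous than a dead one: requests hold connections and threads open instead of failing fast, so this variant is worth running alongside the partition.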
What You Can Learn From This Experiment
- Whether Service A degrades gracefully when an upstream dependency disappears, or locks up waiting on it.
- Whether fallbacks, stubs, or cached data actually kick in when Service B goes silent.
- How timeouts and retries interact: do queued requests pile up and trigger a cascading failure?
File System Failures
This scenario is especially relevant if your service handles a lot of data.
Let’s slightly revise our previous example: instead of having services A and B communicate, we now introduce a storage layer. Imagine that your system is now part of a larger user-facing data pipeline.
For instance, your service might be responsible for aggregating data to generate a unified system dashboard. A user kicks off the pipeline in the evening, expecting results to be ready by morning. But something goes wrong, and the storage becomes unavailable. The next morning, support receives complaints, and your team spends hours debugging the root cause.
IOChaos Experiment Example
```yaml
kind: IOChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-b
  name: io-fault-service-b
spec:
  selector:
    namespaces:
      - service-b
    labelSelectors:
      app: service-b
  mode: all
  action: fault
  errno: 5
  path: /var/tmp/data/**/*
  methods:
    - READ
    - WRITE
  percent: 50
  volumePath: /var/tmp/data
  duration: 10m
```
Here errno: 5 corresponds to EIO (a generic I/O error), volumePath points at the mounted volume itself, and path narrows the fault to files matching the glob. With percent: 50, only half of the matching calls fail — an intermittent failure that is much harder to diagnose than a dead disk.
When this experiment runs, you may see the same failure pattern as in a real user incident — delayed or failed pipelines, incomplete dashboards, and unhappy users.
The fix does not have to be elaborate. For example, even a basic retry mechanism might be enough to prevent full pipeline failure; alternatively, routing writes to a secondary storage location during an outage could provide resilience. There are many potential fixes, but the key is that chaos engineering reveals whether the service is prepared to handle these failure modes at all.
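To check whether retries with sensible timeouts would actually help, IOChaos can also slow file operations down instead of failing them, using the latency action. A hedged sketch under the same assumptions as the fault experiment above (the delay value is illustrative):

```yaml
kind: IOChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-b
  name: io-latency-service-b
spec:
  selector:
    namespaces:
      - service-b
    labelSelectors:
      app: service-b
  mode: all
  action: latency
  delay: 500ms
  path: /var/tmp/data/**/*
  methods:
    - READ
    - WRITE
  percent: 50
  volumePath: /var/tmp/data
  duration: 10m
```

Running both variants tells you whether your mitigation handles hard errors and slow I/O, which often require different defenses.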
What You Can Learn From This Experiment
- Whether I/O errors surface as clear, actionable failures or silently produce incomplete results.
- Whether retries and fallback storage paths exist and actually work under partial failure.
- How long it takes your monitoring to flag a storage layer that is only half broken.
Contract Testing Through Fault Injection
Let’s say you have a core service (Service A) that communicates with an upstream service (Service B). Service A sends a request to the /hello endpoint on Service B. If it receives a 200 OK response, everything works as expected.
Now, imagine a failure occurs and Service B starts returning a 500 Internal Server Error instead. This is a fairly normal situation, but the team responsible for Service A didn’t anticipate or handle the error properly, leading to a cascading failure.
This is where HTTPChaos can help — a tool for injecting faults at the HTTP layer.
HTTPChaos Experiment Example
```yaml
kind: HTTPChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-b
  name: replace-hello-response-code
spec:
  selector:
    namespaces:
      - service-b
    labelSelectors:
      app: service-b
  mode: all
  target: Response
  port: 80
  path: /hello
  method: GET
  replace:
    code: 500
  duration: 10m
```
This experiment uses HTTPChaos, a transparent proxy deployed inside the Kubernetes cluster that can intercept and modify HTTP traffic between services. It targets the /hello endpoint on Service B and forces it to return a 500 status code in response to all GET requests. The scope is limited using selector, targeting only Service B — but it can be narrowed further to specific pods or headers, even down to traffic from a single replica of Service A.

The selector field in HTTPChaos allows you to minimize the blast radius and test very specific scenarios — a critical capability when working in shared environments (e.g., staging or multi-tenant clusters).
You’re not limited to just modifying HTTP status codes. Chaos Mesh also supports:
- delaying requests or responses;
- aborting connections entirely;
- replacing request paths, methods, headers, or bodies;
- patching response bodies and headers with additional content.
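For example, to test how Service A copes with a slow (rather than failing) upstream, the same proxy can hold responses back. A sketch assuming the setup above, with an illustrative two-second delay:

```yaml
kind: HTTPChaos
apiVersion: chaos-mesh.org/v1alpha1
metadata:
  namespace: service-b
  name: delay-hello-response
spec:
  selector:
    namespaces:
      - service-b
    labelSelectors:
      app: service-b
  mode: all
  target: Response
  port: 80
  path: /hello
  method: GET
  delay: 2s
  duration: 10m
```

If Service A’s client timeout for this call is shorter than the injected delay, this experiment will immediately show whether the timeout path is handled as carefully as the error path.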
What You Can Learn From This Experiment
- Whether Service A validates and handles non-200 responses, or blindly assumes success.
- Whether an upstream 500 is contained by fallbacks, or cascades through the system.
- Whether the contract between teams covers error semantics — not just the happy path.
Beyond the Basics: More Ways to Use Chaos Mesh
In this article, we’ve explored four common types of experiments using Chaos Mesh. But that’s just the beginning. Chaos engineering can take you much deeper into the behavior of your systems and the processes around them.
Chaos Mesh helps uncover not only technical faults, but also configuration-related issues. This is especially important if your environments (dev, staging, production) differ significantly — for example, fewer replicas on dev, no real traffic, or missing observability. These differences can lead to false positives, creating an illusion of stability.
Chaos engineering doesn’t just expose system-level issues — it also reveals organizational bottlenecks. Running meaningful experiments often requires collaboration across multiple teams: development, SRE, infrastructure, security. This process exposes real communication paths and helps evaluate how fast and effectively your organization responds when an incident occurs.
And at the company-wide level, chaos engineering can be used to run Disaster Recovery Plan (DRP) exercises — full-scale simulations of catastrophic outages.
Chaos Mesh is also well-suited for uncovering problems that don’t show up on standard metrics — like memory leaks, accumulating active or shadow sessions or resource buildup caused by internal service behavior, monitoring tools, or background jobs.
These are common in systems with high database connection churn, where each service maintains its own connection pool. Over time, connections accumulate and may lead to degradation or even full system failure.
Chaos Engineering Best Practices
To get the most value from chaos experiments, consider making them a routine part of your engineering culture:
- Start with a hypothesis and define the steady state you expect the system to maintain.
- Begin in staging with a minimal blast radius, then gradually move experiments closer to production.
- Always have a rollback plan and an abort switch before injecting a fault.
- Observe everything: an experiment without metrics, logs, and traces teaches you nothing.
- Automate recurring experiments and treat their manifests as code — reviewed and versioned.
- Share findings across teams and turn each uncovered weakness into a tracked fix.