Getting Started with Chaos Engineering: Tools and Principles for Resilience Testing
What Chaos Engineering Is
Chaos engineering is the practice of running deliberate experiments on a system to uncover weaknesses. The core idea: any sufficiently complex system has failure modes you don’t know about, and the cheapest way to find them is to cause small, controlled failures and observe. The Principles of Chaos Engineering document defines the foundational methodology.
It’s not ‘break production randomly.’ It’s a disciplined experimental process: hypothesis, controlled blast radius, measurement, abort conditions.
The Experiment Loop
Form a hypothesis. ‘If we kill 20% of the order service pods, customer order completion rate stays above 99.5%.’
Define a blast radius. Keep the scope limited wherever the experiment runs, and especially in production: a specific availability zone, a percentage of traffic, a single service.
Define abort conditions. Specific metrics or thresholds that stop the experiment immediately.
Run the experiment. Observe.
Document findings. Fix what broke. Re-run to verify.
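The loop above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any chaos tool: `Experiment`, `run_experiment`, and the injected `inject`, `rollback`, and `read_metric` callables are hypothetical names, and the metric readings are stubbed rather than pulled from real monitoring.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    hypothesis: str      # what we expect to stay true
    blast_radius: str    # human-readable scope of the fault
    abort_below: float   # stop immediately if the metric drops below this
    pass_above: float    # hypothesis holds only if every sample stays above this


def run_experiment(
    exp: Experiment,
    inject: Callable[[], None],
    rollback: Callable[[], None],
    read_metric: Callable[[], float],
    samples: int,
) -> dict:
    """Inject a fault, sample a metric, and abort early on the abort condition."""
    observed: List[float] = []
    aborted = False
    inject()
    try:
        for _ in range(samples):
            value = read_metric()
            observed.append(value)
            if value < exp.abort_below:
                aborted = True
                break
    finally:
        rollback()  # always undo the fault, even on abort or error
    confirmed = (not aborted) and all(v > exp.pass_above for v in observed)
    return {"aborted": aborted, "confirmed": confirmed, "observed": observed}


# Stubbed readings stand in for a real monitoring query (hypothetical values).
readings = iter([99.9, 99.7, 97.0, 99.8])
events: List[str] = []
exp = Experiment(
    hypothesis="Killing 20% of order service pods keeps completion rate above 99.5%",
    blast_radius="order service, one availability zone",
    abort_below=98.0,
    pass_above=99.5,
)
result = run_experiment(
    exp,
    inject=lambda: events.append("inject"),
    rollback=lambda: events.append("rollback"),
    read_metric=lambda: next(readings),
    samples=4,
)
print(result)
```

In this run the third reading crosses the abort threshold, so sampling stops, the rollback still executes, and the hypothesis is not marked confirmed. Putting the rollback in a `finally` block is the design choice worth keeping in any real implementation.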
Where to Start
Don’t start with chaos in production. Start in staging, with the most boring possible experiments. Kill one pod. Verify the replacement pod comes up. Measure what happened during the gap.
From there, add network latency injection, CPU saturation, dependency failures. Each experiment teaches you something about the system; build the catalog of known behaviors slowly.
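For the first experiment described above (kill one pod, measure the gap), one way to quantify the gap is to poll the ready pod count and compute how long the service ran under capacity. The sketch below assumes the samples have already been collected; `recovery_gap` and the sample values are hypothetical, and how you collect them (a metrics query, a polling script) is left open.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, int]  # (seconds since experiment start, ready pod count)


def recovery_gap(samples: List[Sample], desired_ready: int) -> Optional[float]:
    """Return seconds between the first under-capacity sample and the first
    later sample back at capacity, or None if no dip or no recovery was seen."""
    dip_start: Optional[float] = None
    for t, ready in samples:
        if dip_start is None:
            if ready < desired_ready:
                dip_start = t
        elif ready >= desired_ready:
            return t - dip_start
    return None


# Hypothetical samples taken every 5 seconds around a single pod kill.
samples = [(0.0, 3), (5.0, 2), (10.0, 2), (15.0, 2), (20.0, 3), (25.0, 3)]
gap = recovery_gap(samples, desired_ready=3)
print(gap)
```

The result is only as precise as the polling interval, so a short interval matters if the replacement pod is expected to come up quickly.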
Tools
Chaos Mesh and Litmus are the dominant open-source chaos platforms for Kubernetes. Both provide CRDs for declaring experiments, run controllers to execute them, and integrate with monitoring for measurement.
Gremlin is the commercial equivalent with a broader scope (VMs, containers, applications, network). Worth evaluating once you’ve outgrown the open-source tools or need the enterprise features.
For specific failure modes: tc (Linux traffic control) for network manipulation, stress-ng for resource pressure, AWS Fault Injection Simulator for AWS-specific scenarios.
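As a rough illustration of the low-level tools, the sketch below builds (but does not run) a `tc` netem command that adds latency and a `stress-ng` command that loads the CPU for a bounded time. The helper names, the `eth0` interface, and the chosen values are assumptions for the example. Actually running the `tc` commands requires root privileges on the target host.

```python
from typing import List


def netem_delay_cmd(interface: str, delay_ms: int) -> List[str]:
    """Build a tc command that adds fixed egress latency on one interface."""
    return ["tc", "qdisc", "add", "dev", interface, "root",
            "netem", "delay", f"{delay_ms}ms"]


def netem_clear_cmd(interface: str) -> List[str]:
    """Build the tc command that removes the root qdisc, undoing the delay."""
    return ["tc", "qdisc", "del", "dev", interface, "root"]


def cpu_stress_cmd(workers: int, seconds: int) -> List[str]:
    """Build a stress-ng command that runs CPU workers for a bounded time."""
    return ["stress-ng", "--cpu", str(workers), "--timeout", f"{seconds}s"]


add_cmd = netem_delay_cmd("eth0", 100)   # interface name is an assumption
clear_cmd = netem_clear_cmd("eth0")
stress_cmd = cpu_stress_cmd(workers=2, seconds=60)

for cmd in (add_cmd, clear_cmd, stress_cmd):
    print(" ".join(cmd))
```

Building the commands as argument lists keeps them easy to review before anything touches a host, and pairs each injection with its cleanup. They could later be passed to `subprocess.run(cmd, check=True)` inside an experiment that has abort conditions in place.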
Cultural Prerequisites
Chaos engineering requires a culture where ‘we found a bug’ is rewarded, not punished. Teams that punish discovery don’t do chaos engineering — they hide problems until users find them.
Leadership support is necessary. The first time a chaos experiment causes a real incident (and one will), the response determines whether the practice continues.
Related Reading
- See our deeper guide at /devops/incident-management-runbooks/.
Game Days and Cultural Practice
Beyond ongoing chaos experiments, scheduled game days bring teams together for larger-scope failure simulation. Take down an entire region in a test environment. Simulate a database failover under load. Practice runbooks against real failures.
The cultural value is significant. Engineers who’ve practiced failure recovery in non-emergencies handle real incidents more calmly. Runbooks rehearsed in game days tend to hold up better than runbooks whose only test has been a real incident.
Chaos and Compliance
For regulated industries, chaos engineering can intersect with compliance requirements positively. Demonstrating resilience to specific failure modes (region loss, dependency outages, security control failures) is often easier with chaos engineering evidence than with audit interviews.
Document chaos experiments alongside DR tests. Both serve similar audit purposes. The combination is stronger than either alone.
Hypothesis-Driven Experiments
A chaos experiment without a hypothesis is just disruption. Each experiment should state what we believe about the system’s behavior under specific conditions, what evidence would confirm or refute that belief, and what the abort conditions are.
‘Killing pods doesn’t affect users’ is a hypothesis. ‘Let’s see what happens when we kill pods’ is not. The discipline of articulating hypotheses sharpens the experiment design and makes results meaningful.
Document hypotheses before running experiments. Post-experiment, document whether they were confirmed, refuted, or surfaced unexpected findings. The accumulated knowledge base teaches the team about the system over time.
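One lightweight way to keep that record is a small structured entry per experiment, written before the run and completed afterwards. The sketch below is an assumption about what such a record might contain; `HypothesisRecord`, `Outcome`, `record_result`, and the example finding are hypothetical and not taken from any chaos platform.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Outcome(Enum):
    CONFIRMED = "confirmed"
    REFUTED = "refuted"


@dataclass
class HypothesisRecord:
    belief: str            # what we believe about the system
    evidence: str          # what would confirm or refute the belief
    abort_condition: str   # when the experiment must stop
    outcome: Optional[Outcome] = None  # filled in after the run
    unexpected_findings: List[str] = field(default_factory=list)


def record_result(
    record: HypothesisRecord, hypothesis_held: bool, findings: List[str]
) -> HypothesisRecord:
    """Complete a record after the experiment has run."""
    record.outcome = Outcome.CONFIRMED if hypothesis_held else Outcome.REFUTED
    record.unexpected_findings.extend(findings)
    return record


# Written before the experiment.
rec = HypothesisRecord(
    belief="Killing one order service pod does not affect users",
    evidence="Order completion rate stays above 99.5% during the experiment",
    abort_condition="Completion rate drops below 98%",
)

# Completed afterwards; the finding text is purely illustrative.
log: List[HypothesisRecord] = []
log.append(record_result(rec, hypothesis_held=True,
                         findings=["Replacement pod started slower than assumed"]))
print(rec.outcome.value, len(rec.unexpected_findings))
```

Keeping unexpected findings as a separate list means a confirmed hypothesis can still carry the surprises that surfaced along the way.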
Tooling Choices
Chaos Mesh has the strongest open-source CRD model and Kubernetes-native UX. Litmus has a broader experiment library and tighter Argo integration. Both are CNCF-hosted projects with active communities; as of this writing they sit at the incubating level rather than graduated. AWS Fault Injection Simulator is documented separately in the AWS FIS documentation.
Gremlin (commercial) covers more ground — VMs, containers, applications, network failures — with a polished UI. Worth the cost for organizations with chaos engineering as a real practice, less so for occasional experiments.
AWS Fault Injection Simulator (now named AWS Fault Injection Service) targets AWS-specific scenarios: AZ outages, EBS volume failures, RDS reboots. Useful for AWS-heavy infrastructure with limited Kubernetes focus.
Team Culture and Practices
The tooling matters; the culture matters more. Teams with strong DevOps practices and middling tools usually outperform teams with state-of-the-art tools and weak culture.
Core practices: shared ownership of production reliability, blameless incident response, regular retrospectives, deliberate investment in developer experience. None require specific tools; all require sustained leadership attention.
Maturity grows gradually. A team that adopts blameless postmortems but still has weekly all-hands-on-deck deployments hasn’t internalized the practice. Watch the behaviors during stress, not the documented procedures.
Continuous Improvement Cadence
The DevOps movement’s emphasis on continuous improvement isn’t a slogan; it’s a practical requirement. Systems decay, requirements change, and tools age. Maintaining a healthy engineering organization requires deliberate, ongoing investment.
Quarterly retrospectives at the team and organization level surface what’s working and what isn’t. The output is concrete commitments to change — not abstract aspirations.
Track changes from retrospectives. Teams that don’t follow through on retrospective actions eventually stop running retrospectives. Demonstrated follow-through builds the trust that makes future retrospectives valuable.
Hiring and Team Building
DevOps and platform engineering hiring is competitive. The job market pays well; experienced engineers are in demand. Building teams requires investment in both compensation and culture.
What attracts strong candidates: meaningful work on systems they can directly impact, clear ownership boundaries, modern tooling, sustainable on-call practices, growth opportunities. What drives them away: legacy systems with no migration plan, unclear ownership, oppressive on-call, no investment in their growth.
Internal mobility matters too. Engineers in adjacent disciplines (backend development, networking, security) often become strong platform engineers with appropriate support. Building hiring pipelines that include internal transfers expands the talent pool.
Vendor Selection and Tool Procurement
DevOps organizations buy many tools. Each tool selection is a multi-year commitment with implicit ongoing costs (licenses, training, integration). The selection process deserves more attention than it usually gets.
Standard evaluation criteria: feature fit, total cost of ownership over three years, exit cost (how hard is it to migrate away later), vendor stability, and integration with existing tooling.
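The cost criteria can be made concrete with simple arithmetic. The sketch below is one possible way to add them up, with hypothetical cost categories and made-up figures; real evaluations will have their own line items.

```python
def three_year_tco(annual_license: float, annual_training: float,
                   one_time_integration: float) -> float:
    """Total cost of ownership over three years: recurring costs times three,
    plus one-time integration work."""
    return 3 * (annual_license + annual_training) + one_time_integration


def tco_with_exit(tco: float, exit_cost: float) -> float:
    """Add the estimated cost of migrating away at the end of the period."""
    return tco + exit_cost


# Illustrative figures only, in the same currency unit.
tco = three_year_tco(annual_license=20_000, annual_training=5_000,
                     one_time_integration=15_000)
total = tco_with_exit(tco, exit_cost=10_000)
print(tco, total)
```

Keeping exit cost as a separate step makes it visible in the comparison instead of burying it in a single total.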
Avoid the trap of evaluating only on features in initial demos. The features matter; so do the rough edges that surface in week three of actual use. Trial periods and reference customers in similar environments surface what marketing doesn’t.
Practical Next Steps
For teams beginning their DevOps or platform engineering journey, the temptation is to adopt every recommended practice at once. The teams that succeed tend to focus on one or two specific improvements at a time, build the habit, and then expand scope.
Concrete next steps worth considering: instrument deployment frequency and lead time, even informally. Run a blameless retrospective after the next incident. Document the platform decisions that have been made implicitly. Survey the team about pain points and tackle the top two.
Each of these is a small investment with compounding returns. The team that runs retrospectives quarterly accumulates institutional learning that the team without them doesn’t. The team that tracks DORA metrics, such as deployment frequency and lead time, over a year sees trends it would otherwise miss.
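Informal instrumentation of deployment frequency and lead time can start from nothing more than a list of commit and deploy timestamps. The sketch below assumes such a list exists; the function names and the sample timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median
from typing import List, Tuple

Deploy = Tuple[datetime, datetime]  # (commit time, deploy time)


def deploys_per_week(deploys: List[Deploy], window_days: int) -> float:
    """Average number of deployments per week over the observation window."""
    return len(deploys) / (window_days / 7)


def median_lead_time(deploys: List[Deploy]) -> timedelta:
    """Median time from commit to deployment."""
    seconds = [(deployed - committed).total_seconds()
               for committed, deployed in deploys]
    return timedelta(seconds=median(seconds))


# Hypothetical two-week sample.
deploys = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 15, 0)),    # 6 h
    (datetime(2024, 1, 3, 10, 0), datetime(2024, 1, 4, 10, 0)),   # 24 h
    (datetime(2024, 1, 8, 8, 0), datetime(2024, 1, 8, 10, 0)),    # 2 h
    (datetime(2024, 1, 10, 12, 0), datetime(2024, 1, 10, 22, 0)), # 10 h
]
print(deploys_per_week(deploys, window_days=14))
print(median_lead_time(deploys))
```

The median is used rather than the mean so that one unusually slow change does not dominate the number.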
The work doesn’t end. Engineering organizations are living systems that decay without active maintenance. The practices described throughout this article are tools for sustained improvement, not destinations to reach and stop.
Key Takeaways
The most important point throughout this guide: practical engineering decisions depend on specific context. Best-practice recommendations are starting points, not destinations. The right answer for your team depends on your scale, your existing tooling investment, your team’s experience, and the specific constraints you face.
Three principles are worth carrying forward regardless of specific tool choices. First, measure what you change. Engineering improvements without measurement become folklore: claims without evidence. Track the metrics that show whether interventions are working.
Second, default to simpler architectures and tools. Complexity has cost. Each additional moving part is something to monitor, debug, upgrade, and eventually replace. Choose the simplest thing that meets your actual requirements, not the most sophisticated thing you could build.
Third, invest continuously in the boring foundations. Reliable CI, good observability, sensible access controls, and clear documentation pay back across every project. Skipping these for short-term feature velocity accumulates debt that eventually consumes the velocity it was supposed to enable.
The teams that operate well over the long term are usually not the teams with the most exotic tooling. They’re the teams with disciplined fundamentals, deliberate decision-making, and continuous incremental improvement.
Frequently Asked Questions
Is chaos engineering only for big companies?
No. The principles scale down. A team of five running monthly chaos experiments in staging gets meaningful value.
Should I run chaos in production?
Eventually, with controlled blast radius. Production is where the real failure modes live. Don’t skip the staging step.
How does this relate to disaster recovery testing?
Related but different. DR tests validate planned recovery procedures. Chaos engineering finds unplanned failure modes. Both have a place.
How do I get leadership buy-in?
Frame it as risk reduction. A problem found by a chaos experiment costs far less than the same problem found by a customer.