
Simulating an Outage with AWS Fault Injection Service


AWS gives you everything you need to build highly available systems: multiple regions and availability zones spread all over the world, CloudWatch for monitoring, metrics, and alarms, Auto Scaling, Load Balancing, replication, the list goes on and on. If you follow the guidelines in the AWS Well-Architected Framework, your system SHOULD be able to continue working and recover if part of your infrastructure fails. But how do you know that’s the case? Enter AWS Fault Injection Service (FIS), or Chaos Engineering as a Service.

WARNING! AWS FAULT INJECTION SERVICE CARRIES OUT REAL ACTIONS ON AWS RESOURCES IN YOUR ACCOUNT. Before you use AWS FIS to run experiments in production, AWS strongly recommends that you complete a planning phase and run the experiments in a pre-production environment.

What is Chaos Engineering?

Chaos Engineering is the practice of deliberately causing disruptions in your systems to gauge their resilience and reliability. These intentional failures let teams see how their systems react when the unexpected happens, but in a controlled environment. This allows you to identify gaps in your systems before they cause an outage. By purposely pushing your systems to their limits and simulating outages, you can gain confidence that a minor glitch won’t snowball into a major outage.

What is AWS Fault Injection Service?

AWS Fault Injection Service (FIS), formerly called AWS Fault Injection Simulator, is a fully managed AWS service designed to help teams carry out chaos engineering experiments on their AWS workloads. The service allows you to inject faults into your application and infrastructure in a controlled manner to simulate real-world situations. By purposefully causing these failures, FIS can help you test the resilience of your systems and better prepare you for a real outage. It can help you expose weak links in your system, which in turn allows you to fix cracks you didn’t know existed.

Setting up AWS Fault Injection Service

To use AWS FIS, you first need to create an experiment template. The template defines the targets (the resources to affect), the actions (the faults to introduce), and the stop conditions. For example, an action can shut down an EC2 instance unexpectedly, and a stop condition can halt the experiment when a CloudWatch alarm fires. The best part is that templates can be defined in the console UI or with JSON via the AWS CLI. Once you have defined the targets, actions, and stop conditions, your experiment is ready. Now for the fun part: running the experiment.
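To make the three parts of a template concrete, here is a minimal sketch in Python that assembles one as JSON. It assumes a single action (stopping one tagged EC2 instance and restarting it after five minutes) and a single CloudWatch alarm as the stop condition. The role ARN, alarm ARN, account ID, tag, and the names oneTaggedInstance and stopInstance are all placeholders invented for illustration, not values from a real account.

```python
import json


def build_experiment_template(role_arn, alarm_arn, tag_key, tag_value):
    """Return a dict shaped like an FIS experiment template."""
    return {
        "description": "Stop one tagged EC2 instance, restart it after 5 minutes",
        # Targets: which resources the experiment may touch.
        "targets": {
            "oneTaggedInstance": {
                "resourceType": "aws:ec2:instance",
                "resourceTags": {tag_key: tag_value},
                "selectionMode": "COUNT(1)",  # pick one matching instance
            }
        },
        # Actions: which fault to inject, and into which target.
        "actions": {
            "stopInstance": {
                "actionId": "aws:ec2:stop-instances",
                "parameters": {"startInstancesAfterDuration": "PT5M"},
                "targets": {"Instances": "oneTaggedInstance"},
            }
        },
        # Stop conditions: halt the experiment if this alarm fires.
        "stopConditions": [
            {"source": "aws:cloudwatch:alarm", "value": alarm_arn}
        ],
        # IAM role that FIS assumes to carry out the actions.
        "roleArn": role_arn,
    }


# All values below are placeholders (assumptions), not real resources.
template = build_experiment_template(
    role_arn="arn:aws:iam::111122223333:role/fis-experiment-role",
    alarm_arn="arn:aws:cloudwatch:us-east-1:111122223333:alarm:high-5xx-rate",
    tag_key="Environment",
    tag_value="staging",
)
print(json.dumps(template, indent=2))
```

If you save the printed JSON as template.json, one way to register it would be aws fis create-experiment-template --cli-input-json file://template.json. Scoping the target by tag with COUNT(1) keeps the blast radius to a single instance, which fits the "start small" advice later in this post.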

Running Experiments with AWS Fault Injection Service

You’ve defined your experiment, and now is the moment of truth. Did your preparation work? The easiest way to kick off an experiment is with the AWS CLI, using the command aws fis start-experiment --experiment-template-id "template-id", where template-id is the ID of the experiment template that you created earlier. This will trigger the experiment you defined. Depending on how complex your experiment and environment are, the execution time may vary. AWS FIS runs in the background, so while it carries out the actions you defined, it is best to monitor your systems in CloudWatch throughout the experiment to get a sense of how they are responding in real time. Remember that by running these experiments you are voluntarily and purposefully introducing chaos into your environment. The goal is not to disrupt your services but to better understand their behavior when the unexpected happens. Each experiment gives you valuable insight into your systems, and running experiments regularly lets you keep confidence that your systems have the resiliency you expect.
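If you would rather start the experiment from a script than from the CLI, a rough sketch could look like the following. It assumes a boto3 FIS client is passed in, uses its start_experiment and get_experiment calls, and polls until the experiment reaches a finished status. The helper name run_experiment, the polling interval, and the exact set of terminal statuses are assumptions made for this example, so check them against the current FIS documentation.

```python
import time

# Statuses treated as "finished" here. This list is an assumption for the
# sketch; verify it against the current FIS API reference.
TERMINAL_STATUSES = {"completed", "stopped", "failed", "cancelled"}


def run_experiment(fis, template_id, poll_seconds=15, sleep=time.sleep):
    """Start an experiment from a template and wait until it finishes.

    `fis` is expected to behave like boto3.client("fis").
    Returns (experiment_id, final_state_dict).
    """
    started = fis.start_experiment(experimentTemplateId=template_id)
    experiment_id = started["experiment"]["id"]

    while True:
        state = fis.get_experiment(id=experiment_id)["experiment"]["state"]
        print(f"{experiment_id}: {state['status']}")
        if state["status"] in TERMINAL_STATUSES:
            return experiment_id, state
        sleep(poll_seconds)


# Example usage (requires boto3 and AWS credentials; "template-id" is a
# placeholder for the ID of your own experiment template):
#
#   import boto3
#   experiment_id, state = run_experiment(boto3.client("fis"), "template-id")
```

Passing the client in as a parameter keeps the function easy to try against a stub before pointing it at a real account.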

Analyzing the Results of AWS Fault Injection Service Experiments 

You gained valuable insight by watching your resources respond in real time. Now it’s important to analyze the results. The integration of FIS and CloudWatch allows you to review what happened throughout the duration of your experiment. You can pair this information with the experiment report from FIS, which includes information about run time and which resources were impacted. Now that you have your data, did your system perform as you expected? Were there gaps that caused a larger outage than you anticipated or found acceptable? These are all learning opportunities where you can fortify your application. Remember, AWS FIS is all about learning how your system will respond to an outage. Study the results carefully and ask yourself: if the failure you simulated happened in the real world, would you be happy with the outcome?
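One simple way to answer "did the system perform as expected?" is to compare a metric before and during the experiment window. The sketch below assumes you have already exported datapoints as (timestamp, value) pairs, for example from CloudWatch. The function name compare_windows and the sample numbers are made up purely for illustration.

```python
from datetime import datetime, timedelta
from statistics import mean


def compare_windows(datapoints, start, end):
    """Summarize a metric before and during the experiment window.

    datapoints: iterable of (timestamp, value) pairs.
    start, end: datetimes marking the experiment window (inclusive).
    """
    before = [value for ts, value in datapoints if ts < start]
    during = [value for ts, value in datapoints if start <= ts <= end]
    return {
        "baseline_mean": mean(before) if before else None,
        "experiment_mean": mean(during) if during else None,
        "experiment_peak": max(during) if during else None,
    }


# Illustrative data only: error counts per minute around a fake experiment.
t0 = datetime(2024, 1, 1, 12, 0)
sample = [
    (t0, 1),
    (t0 + timedelta(minutes=1), 1),
    (t0 + timedelta(minutes=2), 4),  # experiment starts here
    (t0 + timedelta(minutes=3), 6),
]
summary = compare_windows(
    sample,
    start=t0 + timedelta(minutes=2),
    end=t0 + timedelta(minutes=3),
)
print(summary)
```

A large jump from the baseline mean to the experiment peak is a prompt to dig into what failed over slowly or not at all.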

Best Practices for AWS Fault Injection Service

To get the most out of AWS FIS, there are some recommended practices to follow. First, start small. Begin with simple experiments on non-critical workloads. As you gain experience and better understand your systems, you can scale up your experiments to be more complex and affect more important workloads.

Second, you want to automate your experiments. AWS FIS can integrate with EventBridge to run your experiments at regular intervals. This is an easy way to ensure that you are constantly testing your system.

Third, ensure that you have effective monitoring. CloudWatch can help you monitor your experiment, and setting up effective and informative alarms can help you determine if anything has gone awry. 
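As an illustration of an "effective and informative" alarm, here is a hedged sketch that builds the parameters for a CloudWatch alarm on Application Load Balancer 5XX responses from targets. The threshold, period, alarm name, and load balancer identifier are assumptions chosen for the example, and the helper build_5xx_alarm is hypothetical. An alarm like this could also be referenced as a stop condition in an experiment template.

```python
def build_5xx_alarm(alarm_name, load_balancer_dimension, threshold=50):
    """Return keyword arguments suitable for CloudWatch put_metric_alarm."""
    return {
        "AlarmName": alarm_name,
        "Namespace": "AWS/ApplicationELB",
        "MetricName": "HTTPCode_Target_5XX_Count",
        "Dimensions": [
            {"Name": "LoadBalancer", "Value": load_balancer_dimension}
        ],
        "Statistic": "Sum",
        "Period": 60,             # evaluate one-minute buckets
        "EvaluationPeriods": 2,   # two consecutive breaching minutes
        "Threshold": threshold,   # assumed value; tune for your traffic
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
    }


# Placeholder names (assumptions), not real resources.
alarm_params = build_5xx_alarm(
    alarm_name="high-5xx-rate",
    load_balancer_dimension="app/my-load-balancer/0123456789abcdef",
)

# Example usage (requires boto3 and AWS credentials):
#
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**alarm_params)
```

Treating missing data as not breaching is a design choice for this sketch: a quiet period with no 5XX responses should not trip the alarm and stop the experiment.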

Fourth, in addition to scheduling your experiments, incorporate chaos engineering into your CI/CD process. This is an easy way to ensure that any changes you make to your systems continue to meet your resiliency standards.
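One possible shape for a pipeline gate is a small function that turns the final experiment state into a process exit code. The sketch assumes the state is a dict with a status and an optional reason, which is how the FIS API reports experiment state. The function name ci_exit_code and the pass/fail policy are assumptions for illustration.

```python
def ci_exit_code(state):
    """Map a final FIS experiment state to a CI exit code.

    state: dict such as {"status": "completed", "reason": "..."}.
    Returns 0 when the experiment completed, 1 otherwise (for example
    when a stop condition fired or the experiment failed).
    """
    status = state.get("status")
    if status == "completed":
        print("Chaos experiment completed; resiliency check passed.")
        return 0
    reason = state.get("reason", "no reason given")
    print(f"Chaos experiment ended with status {status!r}: {reason}")
    return 1


# Example usage at the end of a pipeline step:
#
#   import sys
#   sys.exit(ci_exit_code(final_state))
```

Treating a stopped experiment as a failure means a tripped CloudWatch alarm blocks the deployment, which is usually the behavior you want from a resiliency check.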

Finally, do a post-experiment analysis. Review the CloudWatch metrics after your experiment runs. Review the post-experiment report. Just running the experiments without reviewing the results and taking any necessary action will do nothing to improve your application.

Conclusion

AWS Fault Injection Service can be a fantastic addition to your SRE tool belt. When leveraged properly, it can improve your workload availability and disaster recovery posture. If you are eager to get started engineering chaos with AWS FIS, then drop us a line at sales@ipponusa.com. Let us be your little chaos monkeys.

Post by Jonathan Scharf
February 21, 2024
