Stress testing is something everyone who cares about performance should be doing. I run software at AWS, so it is easy to spin up more boxes, but that is also more expensive, and it can become a trap: inefficient software with memory leaks, hidden bugs, and untuned servers will all show up under scale or stress tests, whichever happens first. Unfortunately, stress testing is still not very popular among microservices developers. Creating and running stress tests is hard and tedious because it involves lots of moving parts. A single microservice can call several other microservices, and downstream dependency graphs can get complex quickly; that is just one of the many challenges involved. Today I want to share some experience in building and running a stress test and chaos platform. Do you know how many concurrent users or requests per second (RPS) your microservices can handle? Do you know what it will COST to scale your services? Do you know if you can SCALE your services in a COST-EFFECTIVE way? All these questions can be answered with proper stress tests.
Why Build a Stress Test / Chaos Platform?
I run stress tests with Gatling. Gatling is a great tool, not least because you write your tests in Scala (not XML or some complex UI like JMeter). The Gatling Scala DSL is simple and effective for testing microservices via HTTP / REST APIs. So why build a platform? Why is running locally not enough? First of all, running locally is fine for getting the stress test scenarios right, but as a baseline it is completely wrong, since you don't have the same hardware/infrastructure as production. Secondly, there are several other questions we need to answer in order to know what's going on, and these questions are impossible to answer locally:
- Do you know the previous results? Are you faster or slower?
- Does your service have more or less latency? What was it before?
- Did you increase or decrease the RPS (requests per second)?
- What about resource usage? Are you CPU, memory, IO, or network intensive?
- What about chaos? If a downstream dependency dies, what happens to your service?
- Do you test all of that manually, or is it automated?
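To give a feel for the Gatling Scala DSL mentioned above, here is a minimal simulation sketch. The service name, URL, user counts, and thresholds are all made up for illustration, and the code needs the Gatling library on the classpath to actually run:

```scala
// Sketch of a Gatling simulation with assertions (Gatling 3.x Scala DSL).
// Service name, URL, user counts, and thresholds are illustrative only.
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class CartSimulation extends Simulation {
  val httpProtocol = http.baseUrl("http://cart-service.internal.example.com")

  // A single scenario hitting one endpoint and checking the status code.
  val scn = scenario("GetCart")
    .exec(http("get_cart").get("/cart").check(status.is(200)))

  // Ramp 100 users over 60 seconds, then assert on the aggregate results.
  setUp(scn.inject(rampUsers(100).during(60.seconds)))
    .protocols(httpProtocol)
    .assertions(
      global.successfulRequests.percent.gt(95), // most requests must succeed
      global.responseTime.percentile3.lt(800)   // 95th-percentile latency bound (ms)
    )
}
```

The `assertions` block is what makes the same script reusable for chaos verification later: the run fails automatically if the success rate or latency falls outside the expected bounds.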
I always remember the Google SRE book saying "Hope is not a strategy." Most developers don't care and/or don't have the tools to answer these questions in a productive way. To be able to answer them, we need the following capabilities:
- Set up and run stress tests
- Collect and store data for later comparison (a baseline)
- Analyze results
- Isolation: make sure different people's tests don't interfere with each other
The last capability is complicated since, in theory, it would be possible to have a dedicated production environment per developer, but in practice this is not COST-EFFECTIVE. So some sort of SHARED environment is needed. Sharing an environment requires some kind of scheduling/placement, so we only run stress tests concurrently when they don't use the same downstream dependencies, keeping the tests fully isolated.
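The scheduling rule above boils down to a set-intersection check. This is a minimal sketch in plain Scala; the names (`TestRun`, `canRunConcurrently`, `nextBatch`) are my own illustration, not the real platform's API:

```scala
// Sketch: only schedule stress tests together when their downstream
// dependency sets do not overlap, so concurrent runs stay isolated.
object TestScheduler {
  case class TestRun(name: String, downstreamDeps: Set[String])

  // Two tests may share the environment only if they touch disjoint dependencies.
  def canRunConcurrently(a: TestRun, b: TestRun): Boolean =
    a.downstreamDeps.intersect(b.downstreamDeps).isEmpty

  // Greedily pick a batch of mutually non-conflicting tests from the queue.
  def nextBatch(queue: List[TestRun]): List[TestRun] =
    queue.foldLeft(List.empty[TestRun]) { (batch, t) =>
      if (batch.forall(canRunConcurrently(_, t))) t :: batch else batch
    }.reverse
}
```

Anything that conflicts simply waits for the next batch, which is cheaper than per-developer environments while still keeping results trustworthy.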
What about chaos engineering? My current project runs on AWS using the NetflixOSS stack. I run chaos tests with Chaos Monkey from the Simian Army. How do we check that, when a downstream dependency goes down, a service falls back to another AZ or to a static fallback and recovers from that failure in an automated way? The trick was to use stress tests for chaos verification: a stress test keeps running while the chaos test is running. This way we reuse the same stress tests for chaos, and we don't need to write two sets of verifications.
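The fallback chain the chaos run is meant to verify can be sketched as a simple chain of options: try the primary AZ, then the other AZ, then a static response. All names here are illustrative, not the real service code:

```scala
// Sketch of the fallback behavior a chaos test verifies:
// primary AZ first, then the other AZ, then a static fallback.
object FallbackClient {
  def call(primaryAz: () => Option[String],
           otherAz: () => Option[String],
           staticFallback: String): String =
    primaryAz().orElse(otherAz()).getOrElse(staticFallback)
}
```

While Chaos Monkey kills the primary dependency, the stress test keeps asserting on success rates, so a broken fallback chain shows up as failed Gatling assertions rather than requiring a separate verification suite.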
What We Built / How It Works
There are 3 phases in the whole stress test / chaos process: Phase 1, plan and code; Phase 2, execution; Phase 3, analysis. Let's get started with Phase 1.
Phase 1 — Plan and Code
In Phase 1 (plan and code) you need to think about your expectations in terms of failure. This is only needed if you are running a chaos test; if you are just running a stress test, you don't need to worry about it. For the chaos test, we need to think about what should happen in each failure scenario and how the code/service should recover from it. Once you have that in mind, you can write your Gatling script. It is possible to write assertions in Gatling/Scala to check whether your assumptions were correct. You will write Scala code, and you might test it locally just to make sure the code is correct. Then you can move on and push this code to GitHub. We often create a project to make this easy, and since folks often have multiple chaos/stress scenarios, a project is handy. So this project is pushed to GitHub. Phase 2 is execution. Let's explain what happens with the stress test / chaos code.
Phase 2 — Execution
The stress test / chaos platform runs in Jenkins (so we did not have to build a new UI). However, we use Jenkins only as the UI: the Gatling machine runs as a separate AMI, so if Jenkins fails we don't lose the tests. This is how we make sure the stress test and chaos platform is reliable, because Jenkins is not. The Jenkins job is pretty simple and receives some parameters: the GitHub project URL, the scenario the developer wants to run, the number of users (threads) the test should run with, the duration of the test, and whether we should apply chaos, and if so to which IP. This information is passed from Jenkins to the Scala code, so we can re-run the same test multiple times while increasing the load or the duration, which is quite handy. That is the main Jenkins job, but there is a second job, which we call Profile. This job runs your stress test multiple times with different user counts (threads), which we call rounds: first with 1 user, then 10 users, then 100, then 1k, 5k, 10k, and so on. The rounds come in as a parameter, so you can specify whatever sequence you want. Why do we do this? Because it gives us an automated way to know when the service breaks, so we can tell how many users the service can handle and what the latency looks like as users increase.
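The Profile job's round logic can be sketched in a few lines. `Profiler`, the error-rate threshold, and `runRound` are my own stand-ins; in the real platform each round is an actual Gatling run whose results come back from the reports:

```scala
// Sketch of the "Profile" job: replay the same test with growing user
// counts and report the first round where the service breaks.
object Profiler {
  // runRound stands in for launching a Gatling run with `users` threads
  // and reading back its error rate (0.0 to 1.0).
  def breakingPoint(rounds: Seq[Int],
                    runRound: Int => Double,
                    maxErrorRate: Double = 0.05): Option[Int] =
    rounds.find(users => runRound(users) > maxErrorRate)
}
```

Running the rounds in increasing order means the first round that exceeds the error threshold is the service's capacity ceiling, and the latency recorded for each earlier round shows how the service degrades on the way there.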
Phase 3 — Analysis
Since all the data is in SignalFx, it is pretty easy to correlate Cassandra data with Gatling data and pretty much all the other information we have, like Hystrix metrics, OS-level metrics, and so on. Right now we do these comparisons manually, but in the future this will be done by a service that factors the results into an automated canary score, so if you degrade performance or increase latency, your deploy will fail. The most common problems/bottlenecks are file descriptors not being tuned in Linux, connection pools, thread configurations, lack of caching, and too much logging.
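A naive version of that canary check is just a baseline comparison with a tolerance. The name `CanaryCheck`, the choice of p99, and the 10% tolerance are illustrative assumptions, not the real scoring service:

```scala
// Sketch: fail the deploy when current p99 latency regresses more than
// a tolerance (default 10%) over the stored baseline from earlier runs.
object CanaryCheck {
  def passes(baselineP99Ms: Double,
             currentP99Ms: Double,
             tolerance: Double = 0.10): Boolean =
    currentP99Ms <= baselineP99Ms * (1 + tolerance)
}
```

A real canary score would combine several such signals (error rate, RPS, resource usage), but even this single check is enough to turn the stored baselines into an automated gate.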
Building a stress test / chaos platform is not that hard; getting developers to use it with discipline is the hard part. There are interesting engineering challenges here, don't get me wrong, but evangelism and support are the core part, I would say. Stress tests / chaos need to be on the Definition of Done or production-ready checklist of the microservices teams, otherwise people might not use them as much as they should.
Another big lesson for me was that not every developer likes doing stress testing / chaos engineering. I have worked with engineers who love it and others who hate it, so cultural FIT is something I care a lot about nowadays, and I make sure whoever works on my team wants to do the kind of work I'm doing and cares about DevOps.
Originally published at diego-pacheco.blogspot.com.