
Testing is at the same time a simple and a complex subject. Testing is fundamental: if you don't test, how do you know your software works? How do you know it won't stop working after some refactoring?

Unit tests are the most basic level of testing you can do. In the past, unit testing was the norm: when we delivered software, we needed to deliver unit tests with some level of coverage. When we talk about coverage, things can get tricky, since you might end up writing tests for parts of your application where they add little or no value whatsoever.

Coverage can be questionable, as your language might force you to write fewer or more tests. For instance, if you have a strongly typed language where you enforce as much as you can via the compiler, you might need fewer tests, since the compiler works in your favor. As you move to more dynamic, weakly typed languages, you might need more tests, since the compiler catches fewer things.

As much as the language affects how many tests you write, I can say the same, or even more, about software architecture. How we run our software today strongly affects how, and with what kind of strategies, we should test our software.

From Monolith to Microservices

  • Independence: different things can happen at the same time
  • Different teams can use different technologies and versions
  • Easier to scale
  • Easier to maintain and evolve, since there are different and independent code bases
  • Isolation: each microservice has its own:
    • Database
    • Operating System / VM
    • Configuration
    • Release Process

However, microservices are not a free lunch. There are several drawbacks or issues you need to address with microservices which were not present before, like:

  • How do we do joins? There is no central DB, hence the need for streaming.
  • Infrastructure complexity: microservices require DevOps engineering
  • The need for observability: health checks, centralized logs, and distributed tracing

There is one big effect of microservices: we moved from a CENTRALIZED solution to a DISTRIBUTED one. With distribution we will have far more FAILURE. Failure will happen, and we need to design for failure; that's something very hard to retrofit later. If we need to design for failure, we also need to test for failure, right? Yes. We need to test our microservices in a different way. That's why we need chaos engineering.

Chaos: A Different Strategy

  • A machine can have no more CPU available
  • A machine can have no more MEMORY available
  • A machine can have no more DISK available
  • A machine can have no more NETWORK available
  • An instance can be terminated at any time
  • An AZ can go down
  • A Region can go down

Is out software ready to deal with all these kinds of failures? This failure eventually will happen and you might learn on the worst time and in the worst way possible. So it’s better to test before it happens. This makes the chaos as the new normal. Is chaos is the new normal this could be the new Definition of Done? So now when we finish a story we can say we need to complement our testing strategy with chaos testing. If you are not doing this on the story level you should be doing at least at the delivery level. So you need to have some kind of Production Ready Checklist where applying chaos is one critical item.

Don’t forget Network testing

  • The network might completely fail
  • You may get 20% extra lag/latency
  • What happens if your call never returns? Will the code hang?
  • What happens if the return is completely messed up (corruption)?
  • What if the return is too big (a 10MB string, for instance)?

Would your code be ready to deal with these scenarios? Well, there is only one way to know: doing some kind of network failure testing. For some of these network failures you can use tools like Toxiproxy.
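Whatever proxy tool you use to inject the faults, your code still needs defensive handling on the receiving side for the corrupted and oversized responses listed above. A minimal sketch, assuming a JSON-over-HTTP service (the cap value and function name are illustrative; hung calls are handled separately with socket or client timeouts):

```python
import json

MAX_RESPONSE_BYTES = 1_000_000  # defensive cap; tune per service


def safe_parse(raw: bytes) -> dict:
    """Parse a service response defensively.

    Guards against two of the failure modes above: oversized payloads
    and corrupted (non-UTF-8 or non-JSON) bodies. A call that never
    returns is handled elsewhere, e.g. socket.settimeout() or your
    HTTP client's timeout parameter.
    """
    if len(raw) > MAX_RESPONSE_BYTES:
        raise ValueError("response too large")
    try:
        return json.loads(raw.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError("corrupted response") from exc


parsed = safe_parse(b'{"status": "ok"}')
```

A network-failure test then drives `safe_parse` with garbage and giant payloads and asserts it raises cleanly instead of crashing or allocating unbounded memory.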

What about Databases?

You might go even further and test the database itself. Do you know if your database delivers what it promises you? Is your database strongly consistent? Is the DB losing data or not? Well, there is chaos testing for databases too. Aphyr has been doing it for a while with Jepsen, and it's open source, so if your DB is not there you can add it.
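The core idea behind that kind of testing can be shown with a toy lost-write checker: record every write the store acknowledged, then verify each one is readable afterwards. This is a drastically simplified sketch of the concept, not Jepsen; `FlakyStore` and its `drop_every` knob are invented here to simulate a database that acknowledges writes and then loses some:

```python
class FlakyStore:
    """Toy key-value store that acknowledges every write but silently
    drops every Nth one, simulating a DB losing acknowledged data."""

    def __init__(self, drop_every=7):
        self._data = {}
        self._drop_every = drop_every
        self._writes = 0

    def put(self, key, value):
        self._writes += 1
        if self._writes % self._drop_every != 0:  # every Nth write is dropped
            self._data[key] = value
        return True  # the store always acknowledges

    def get(self, key):
        return self._data.get(key)


def find_lost_writes(store, n=20):
    """Write n keys, then return every acknowledged write we cannot read back."""
    acked = [k for k in range(n) if store.put(k, k)]
    return [k for k in acked if store.get(k) != k]


lost = find_lost_writes(FlakyStore())
# keys 6 and 13 (the 7th and 14th writes) were acknowledged but lost
```

A real checker additionally has to deal with concurrency and network partitions, which is exactly where tools like Jepsen earn their keep.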

Assertions on Chaos

  • Circuit breaker (provided by Netflix OSS Hystrix)
  • Timeouts: Hystrix or any good request library has them
  • Retries: at least 3 times? Or use a retry budget if the call is latency-sensitive
  • Fallback to other AZs and regions (Hystrix + Ribbon)
  • Error observability: if you fail, log it and send a metric somewhere
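To make the circuit-breaker item concrete, here is a minimal sketch of the pattern in plain Python. This is not the Hystrix API; the class name, thresholds, and fallback signature are all invented for illustration. After `max_failures` consecutive failures the circuit opens and calls fail fast to the fallback until `reset_timeout` elapses:

```python
import time


class CircuitBreaker:
    """Minimal circuit-breaker sketch: open after consecutive failures,
    fail fast while open, allow one probe call after reset_timeout."""

    def __init__(self, max_failures=3, reset_timeout=30.0):
        self.max_failures = max_failures
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_timeout:
                return fallback()      # open: fail fast, don't hit downstream
            self.opened_at = None      # half-open: let one probe call through
            self.failures = 0
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()  # trip the circuit
            return fallback()
        self.failures = 0              # success resets the failure count
        return result


breaker = CircuitBreaker(max_failures=2, reset_timeout=60)

def failing_call():
    raise ConnectionError("downstream is down")

results = [breaker.call(failing_call, lambda: "fallback") for _ in range(5)]
```

After the second failure the circuit trips, so calls three through five never touch the broken downstream; that is what keeps one dead dependency from dragging the whole service down.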

For Simian Army cases, you actually should not care if a machine dies or a microservice stops working. You need to worry about whether you disrupted the service for the final users, or whether the failure is perceived by the final user. A great way to test this is using a stress-test tool like Gatling: you can simulate a number of users, let's say 10k concurrent users per minute, then run the Simian Army and see whether a machine, AZ, or region dying affects the users or not.
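The "perceived by the final user" distinction is worth quantifying. A quick back-of-the-envelope sketch: if a single call fails 20% of the time but the client retries up to 3 times, the user-perceived failure rate drops to roughly 0.2³ ≈ 0.8%. The simulation below (all names and rates are illustrative, not from Gatling) checks that:

```python
import random


def call_with_retries(fn, attempts=3):
    """The user only perceives a failure if every attempt fails."""
    for i in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if i == attempts - 1:
                raise


rng = random.Random(0)

def flaky():
    if rng.random() < 0.2:  # 20% of individual calls fail
        raise ConnectionError
    return "ok"


N = 10_000
perceived_failures = 0
for _ in range(N):
    try:
        call_with_retries(flaky)
    except ConnectionError:
        perceived_failures += 1
# expected perceived failure rate is about 0.2 ** 3 = 0.8%
```

This is the same measurement a Gatling run against a cluster under Simian Army gives you, just at toy scale: chaos on the inside, an acceptable error rate on the outside.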

Software has changed a lot in the last 5 years. Software architecture changed; runtimes changed. So your tests need to change as well, otherwise you will not be ready for what you are building. Chaos testing requires discipline, but keeping the user experience great while not adding much latency pays off at the end of the day.


Written by

Brazilian, Software Architect, SWE(Java, Scala, Rust, Go) SOA & DevOps expert, Author. Working with EKS/K8S. (Opinions on my own)
