We currently live in a cloud-native, microservice-oriented world. This is a bigger trend than you might think. Gartner predicts that by 2020 companies will be building more algorithms on their own. As we transition from a centralized architecture with a monolithic UI, monolithic service, and monolithic database to a distributed-everything architecture, we need to scale and to work with different approaches to telemetry.
Observability: The concepts
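Wikipedia defines telemetry as the automated collection of measurements at remote points and their transmission to receiving equipment for monitoring. In practice, it is a big umbrella that covers several things: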
- Alerting and Visualizations
- Distributed Systems Tracking
- Log Aggregation
- Automated Canary Analysis
- Dynamic Thresholds with ML
You might also see lots of people talking about telemetry in an IoT context. Why should I care? Well, telemetry should be as big as your system is. As your infrastructure and architecture scale out, you will need to scale your telemetry platform too.
Monitoring: The Basics
The basic need is to know whether your system is up and running. In a microservices world this is not so simple: there are multiple middleware servers, caches, engines, and database clusters using different protocols and languages. People tend to rely on plugins for solutions like Nagios and Sensu. Nowadays you will likely need custom development as well, because you will almost certainly have to adapt the telemetry to your architecture.
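To make the "is it up?" question concrete, here is a minimal sketch of a custom check in Python. It only tests whether a TCP port accepts connections, which is the crudest possible signal. The function names and the target map are my own assumptions for illustration; they are not part of Nagios, Sensu, or any other tool.

```python
import socket
from typing import Dict, Tuple


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_all(targets: Dict[str, Tuple[str, int]], timeout: float = 1.0) -> Dict[str, bool]:
    """Probe every named target and report which ones are reachable."""
    return {
        name: is_port_open(host, port, timeout)
        for name, (host, port) in targets.items()
    }
```

A call such as `check_all({"cache": ("10.0.0.5", 6379)})` (a hypothetical address) returns a map of service name to up/down. A real check would usually go one step further and speak the service's own protocol, since an open port does not prove the service is healthy.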
Alerting and Visualization
Alerting is often done using a particular data model called time series. Not all solutions use a time series database (TSDB); some use RRD-style storage, for instance. But if you want scalability, you will need to consider one. The best solutions are OpenTSDB (HBase-based) and Apache Cassandra. Another database getting lots of traction is InfluxDB, which is very nice and easy to use; however, there are cases with issues at scale.
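As a rough illustration of the time series model, the sketch below keeps a metric name, a set of tags, and a list of (timestamp, value) points in memory, then applies a simple static-threshold alert rule. The `Series` class and the `breaches` function are assumptions of mine for the example; they do not mirror the API of OpenTSDB, Cassandra, or InfluxDB.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Series:
    """One time series: a metric name, identifying tags, and (timestamp, value) points."""
    metric: str
    tags: Dict[str, str]
    points: List[Tuple[int, float]] = field(default_factory=list)

    def append(self, timestamp: int, value: float) -> None:
        self.points.append((timestamp, value))


def breaches(series: Series, threshold: float, min_consecutive: int = 3) -> List[int]:
    """Return the timestamps at which the value has stayed above `threshold`
    for exactly `min_consecutive` points in a row, so each breach fires once."""
    fired = []
    run = 0
    for timestamp, value in series.points:
        run = run + 1 if value > threshold else 0
        if run == min_consecutive:
            fired.append(timestamp)
    return fired
```

Requiring several consecutive points above the threshold is one common way to avoid alerting on a single spike. The tags are what let you slice the same metric by host, region, or service later.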
Visualization is another key aspect, and it is not just a simple chart. One of the best solutions for me is Grafana. People often do all kinds of aggregation and window analysis to check trends and run analytics to spot issues, performance degradation, and potential incidents.
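A small sketch of what window analysis can look like: the function below groups raw points into fixed time windows and averages each one, which is the kind of downsampling a dashboard typically asks for before drawing a chart. The name `downsample` and the choice of a plain average are assumptions for illustration, not how Grafana itself works.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, List, Tuple


def downsample(points: Iterable[Tuple[int, float]], window_seconds: int) -> List[Tuple[int, float]]:
    """Group points into fixed windows and return (window_start, average) pairs."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for timestamp, value in points:
        # Align each timestamp to the start of the window it falls into.
        window_start = timestamp - (timestamp % window_seconds)
        buckets[window_start].append(value)
    return [(start, mean(values)) for start, values in sorted(buckets.items())]
```

Swapping `mean` for `max` or a percentile changes what the chart emphasizes: averages hide short spikes, while maximums make them stand out.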
Another key idea, very popular nowadays, is the use of advanced math to predict and visualize your telemetry data. Netflix has a very interesting solution for this called Atlas. Atlas is built with Scala, Akka, and Spray.
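To give a feel for the math involved, here is a deliberately simple sketch of a dynamic threshold: a value is flagged when it deviates from the mean of the previous few values by more than `k` standard deviations. This is my own toy example and not the technique Atlas uses; the function name and default parameters are assumptions.

```python
from collections import deque
from statistics import mean, pstdev
from typing import Iterable, List


def anomalies(values: Iterable[float], window: int = 5, k: float = 3.0) -> List[int]:
    """Return the indexes of values that deviate more than k standard deviations
    from the mean of the previous `window` values."""
    history = deque(maxlen=window)
    flagged = []
    for index, value in enumerate(values):
        # Only judge a value once a full window of history is available.
        if len(history) == window:
            center = mean(history)
            spread = pstdev(history)
            if abs(value - center) > k * spread:
                flagged.append(index)
        history.append(value)
    return flagged
```

Unlike a fixed threshold, the limit here moves with the recent behavior of the metric, so the same rule can watch a service that idles at 10 requests per second and one that idles at 10,000.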
Distributed Systems Tracking
One way of thinking is to consider everything that happens in your system as an immutable event. This is particularly interesting because you are then dealing with a stream problem. That’s exactly what Riemann does. Riemann is written in Clojure, and you can do all sorts of complex math because you are coding in Clojure. This is very sexy :-)
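The events-as-a-stream idea can be sketched in a few lines of Python. Each event is a frozen (immutable) record, and each processing step is a small function that takes a stream and returns a new stream. The field names and the `where` and `above` helpers are assumptions of mine for illustration; this is not Riemann's API, which is written in Clojure.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """An immutable fact about the system at a point in time."""
    host: str
    service: str
    metric: float
    time: int


def where(events: Iterable[Event], service: str) -> Iterator[Event]:
    """Pass through only the events that belong to one service."""
    return (event for event in events if event.service == service)


def above(events: Iterable[Event], threshold: float) -> Iterator[Event]:
    """Pass through only the events whose metric exceeds the threshold."""
    return (event for event in events if event.metric > threshold)
```

Steps compose by nesting, for example `above(where(stream, "api latency"), 500.0)`. Because the steps are lazy generators, nothing is computed until something downstream consumes the events, which fits a stream that never ends.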
Several solutions work with a concept called retention, and some solutions, such as Prometheus, do not keep every data point. It’s particularly interesting to store all events because you will possibly apply regression and compare with past values.
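As an example of why keeping past values helps, the sketch below fits a straight line through stored points with ordinary least squares and extrapolates it. It is a minimal illustration under my own assumptions (function names, a linear trend); real capacity forecasting usually needs something less naive.

```python
from typing import Sequence, Tuple


def linear_fit(points: Sequence[Tuple[float, float]]) -> Tuple[float, float]:
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(points: Sequence[Tuple[float, float]], x: float) -> float:
    """Extrapolate the fitted line to a new x (for example, a future timestamp)."""
    slope, intercept = linear_fit(points)
    return slope * x + intercept
```

With disk usage sampled daily, for instance, `predict` gives a rough estimate of usage on a future day. If retention has already thrown the old points away, there is nothing left to fit.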
Log Aggregation
No matter whether you are using containers like LXC and Docker or virtualization, you will have several servers, and being cloud-native or container-native requires a stateless and ephemeral solution. Having said that, you can’t store your logs on the local file system like you did before. You will need solutions like ELK (Elasticsearch, Logstash, and Kibana) or Graylog.
Some people use ELK as a telemetry solution (for storage and visualization). I think this is wrong; you can only do it with very simple, low-scale data.
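On the application side, a common way to cope with ephemeral servers is to stop writing log files and emit structured log lines to standard output, where a collector can pick them up and ship them to the aggregator. The sketch below does that with Python's standard `logging` module. The `JsonFormatter` class, the `build_logger` helper, and the chosen field names are my own assumptions; they are not a format required by Logstash or Graylog.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes JSON lines to stdout instead of a local file."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.propagate = False
    return logger
```

Calling `build_logger("orders").info("order %s created", 42)` prints a single JSON line. One object per line keeps parsing on the aggregator side trivial, and nothing is lost when the container disappears because nothing was stored in it.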
Brazilian Software Architect, Agile Coach, DevOps Practitioner, SOA Expert, Blogger, PS3/PSP/Wii Gamer, Guitarist in spare time, pagodeiro learning cavaco/pandeiro/viola. I love Functional Programming, especially Clojure. Currently working as Software Architect / Agile Coach with Scala, Akka and NetflixOSS.
Originally published at diego-pacheco.blogspot.com.