Pipelines From CI/CD to Self Operating Systems


Several years ago everybody was talking about, and trying to do, deploy automation. This is a solved problem, right? Well, several companies now have good pipelines, but most of the tech industry is not quite there. There are several reasons why release pipelines are more important than they look. First of all, Deploy != Release; I talked about that in my previous post about Testing in Production. Several companies doubt whether they need the newest forms of CD (Continuous Delivery) like GitOps, but I can say with almost 100% certainty that this is a must-have. So let’s step back and talk about Lean, Kanban and business agility. Even if your business is not asking you to increase your deploy and release frequency, you should be in pursuit of that. Lean has the concept of LEAD TIME, and DevOps is about Lean; if you have doubts, go read the amazing The Phoenix Project by Gene Kim. Lead time is how long it takes to get from idea to production. Production is the best place to be, so you must increase the frequency of both deploys and releases, and this means going to production as often as possible.

Why Production Matters

Production is the best place to be! Some features can only be understood in production, via A/B/n testing and with real domain observability. You might say: that’s too much for me, or, I don’t have basic testing in place, so why should I increase my deploy frequency? Experiments are super important for the Discovery track; some features can only be understood with real user traffic and several rounds of experiments. Experiments can go way beyond background color, CSS style and playing with words to see if we sell more. Don’t get me wrong, there is value in that. However, some experiments require backend changes, and therefore you need to have different versions of your backend services somehow, either via backward/forward compatibility or via multiple versions of your services running in prod.

Besides business experiments, there is something called growth. Growth teams became super popular at SV startups and big sites like Facebook. Growth teams are the modern sales and marketing departments. Guess what important tools they use? Experiments, A/B/n testing and other user experience and discovery tools. Still not enough of an argument for you? OK.

There are companies that have legal and compliance changes they need to release pretty quickly, and if that does not happen it means losing money and even being sued by customers. Still not enough of a reason? Then think of it this way: imagine we are playing a Brazilian football game, with 2 hypothetical scenarios. A) You score just 1 goal per game. B) You never score a goal; you just make sure the ball gets past midfield. Not increasing deploy frequency is playing defense, and playing defense is super dangerous. Not increasing deploy frequency is like piling up balls at midfield with zero goals. What’s the sense in that?

From CI to CD

Continuous integration started as a good idea, but thanks to Git Flow it got messed up. Git Flow sucks. There are several reasons: first, it is too complex and error-prone; second, it is a smell that shows your architecture is fragile. Git Flow became popular because of mobile development, which often had pretty bad architecture and a lack of modularity and isolation, so several people had to work on the same files.

CI is about having the code “integrated” all the time. This means everything lands in the trunk/master branch, so no long-lived branches; long-lived branches are the opposite of continuous integration. “Wow, but I have a Jenkins server doing builds all the time.” It does not matter, because the code is spread all over different branches.

Image for post

Here we have a very basic, high-level CI pipeline. First, the engineer codes on their local machine, makes several commits in Git and, at some point, pushes the changes. Then a Jenkins pipeline is triggered; it could be scheduled to run every 5 minutes or after every push, and IMHO after every push is much better. Once the code is checked out on the Jenkins machine, the build phase kicks in. Depending on your language, several things happen here: for frontend development with Angular / React, linters run; for backend development in Java, Go, Rust or Scala, the code is compiled and unit tests, snapshot tests and mocked contract tests run. Some advanced pipelines also run integration tests, E2E tests and sometimes stress tests. After the tests pass, coverage and test-result reports are generated and the binaries are deployed, often to an application server, an HTTP server or S3.
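
To make the flow above concrete, here is a minimal sketch in Python of a pipeline as an ordered list of stages that stops at the first failure. This is not how Jenkins is implemented; the `Stage` class, the `run_pipeline` function and the stage names are all made up for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Stage:
    """One pipeline step (hypothetical type, for illustration only)."""
    name: str
    run: Callable[[], bool]  # returns True when the step succeeds


def run_pipeline(stages: List[Stage]) -> Tuple[bool, List[str]]:
    """Run stages in order and stop at the first failing stage.

    Returns (success, names of the stages that were actually executed).
    """
    executed: List[str] = []
    for stage in stages:
        executed.append(stage.name)
        if not stage.run():
            return False, executed  # fail fast: later stages never run
    return True, executed


if __name__ == "__main__":
    # Every stage here is a stub that always passes; a real pipeline
    # would shell out to the linter, compiler, test runner and so on.
    stages = [
        Stage("checkout", lambda: True),
        Stage("lint", lambda: True),
        Stage("compile", lambda: True),
        Stage("unit-tests", lambda: True),
        Stage("reports", lambda: True),
        Stage("publish-binaries", lambda: True),
    ]
    ok, executed = run_pipeline(stages)
    print("pipeline ok:", ok, "stages:", executed)
```

The important design choice is failing fast: if unit tests break, nothing gets published.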

CD, which stands for Continuous Deployment, is the next level. It’s important to understand that continuous deployment can be done with almost the same tools as CI: Jenkins, database automation tools like Liquibase or Flyway, and unit tests. However, CD requires more things, like Feature Toggles / Canary and Split Traffic.

Feature Toggles / Canary and Split Traffic

Here is where things get interesting. Since you will be deploying to production a lot, you need to be able to enable and disable features and releases, and also to automatically release or roll back changes, in order to scale this process without going crazy. Split traffic is needed to reduce risk and make sure we don’t affect the user experience.
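
A feature toggle can be as small as a lookup that guards the new code path. The sketch below is an in-memory version, assuming a hypothetical `FeatureToggles` class and a made-up `new_discount` flag; a real setup would load the flags from a config service so they can change without a redeploy.

```python
from typing import Dict, List, Optional


class FeatureToggles:
    """In-memory toggle store (hypothetical; real ones read a remote config)."""

    def __init__(self, flags: Optional[Dict[str, bool]] = None) -> None:
        self._flags: Dict[str, bool] = dict(flags or {})

    def is_enabled(self, name: str, default: bool = False) -> bool:
        # Unknown flags fall back to the default, so new code stays dark.
        return self._flags.get(name, default)

    def enable(self, name: str) -> None:
        self._flags[name] = True

    def disable(self, name: str) -> None:
        self._flags[name] = False


def checkout_total_cents(prices_cents: List[int], toggles: FeatureToggles) -> int:
    """Sum the cart; apply a 10% discount only when the toggle is on."""
    total = sum(prices_cents)
    if toggles.is_enabled("new_discount"):
        total = total * 90 // 100
    return total


if __name__ == "__main__":
    toggles = FeatureToggles()
    cart = [1000, 2000]
    print(checkout_total_cents(cart, toggles))  # feature deployed but off
    toggles.enable("new_discount")              # "release" without a deploy
    print(checkout_total_cents(cart, toggles))
    toggles.disable("new_discount")             # instant rollback
    print(checkout_total_cents(cart, toggles))
```

This is the Deploy != Release idea in a few lines: the discount code ships to production switched off and is released later by flipping the flag.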

Canary means automation, and it is one of the highest and final levels of pipeline maturity. It’s important to have proper observability in place (centralized logs, dashboards, alerts and better instrumentation on the applications and microservices to send data to the canary solution). A good canary implementation will compare several metric dimensions like CPU usage, memory, network, HTTP errors, disk usage, latency percentiles like p75, p90, p99, p99.5 and much more. Netflix does that with more than 500 metrics. Once you have all this data, you can allow some threshold of decay, or say you need to be improving by 1%, or whatever makes sense for that service; therefore this should be configured on a per-service basis.
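
Here is a rough sketch of that comparison, assuming every metric is "lower is better" and that each service configures its own allowed regression per metric. The `MetricRule` type, the `canary_violations` function and the metric names are hypothetical; this is not how Netflix or any specific canary tool does it.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class MetricRule:
    """Per-service rule (hypothetical). Assumes lower values are better."""
    name: str
    # How much worse than baseline the canary may be, in percent.
    # A negative value means the canary must improve, e.g. -1.0 = 1% better.
    max_regression_pct: float


def canary_violations(
    baseline: Dict[str, float],
    canary: Dict[str, float],
    rules: List[MetricRule],
) -> List[str]:
    """Return the names of the metrics where the canary breaks its rule."""
    violations: List[str] = []
    for rule in rules:
        base = baseline[rule.name]
        cand = canary[rule.name]
        if base == 0:
            # No relative change can be computed against a zero baseline.
            regression_pct = 0.0 if cand == 0 else float("inf")
        else:
            regression_pct = (cand - base) / base * 100.0
        if regression_pct > rule.max_regression_pct:
            violations.append(rule.name)
    return violations


if __name__ == "__main__":
    rules = [
        MetricRule("p99_latency_ms", max_regression_pct=10.0),  # some decay ok
        MetricRule("http_5xx_rate", max_regression_pct=0.0),    # no decay
    ]
    baseline = {"p99_latency_ms": 200.0, "http_5xx_rate": 0.01}
    canary = {"p99_latency_ms": 230.0, "http_5xx_rate": 0.01}
    failed = canary_violations(baseline, canary, rules)
    print("rollback" if failed else "promote", failed)
```

An empty list means the canary can be promoted; anything else is a signal to roll back automatically.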

Image for post

CD has a more complete pipeline. It’s required to have automation for your infrastructure: databases, caches, full-text search engines, messaging, load balancers and any other component your services rely on. Often infrastructure deployment happens because of a service deployment, and it happens in separate pipelines. This work is often done using tools like Terraform, Ansible, CloudFormation, etc.

After the infrastructure is deployed, you push your code and run build and test as you would in your CI pipeline. The biggest difference for me is not the fact that Stress Tests / Load Tests are mandatory but what happens next: first of all, DEPLOY IN PRODUCTION. I’m sure some folks shat their pants just by reading deploy in production. Being a grown-up means doing hard things. Deploying to production should be a normal thing, not a myth, not a “special” access that only a DevOps team can have (if there is a DevOps team, there is no DevOps, and most likely your talent density is low).

Once you are in production you want to test in production and replay real user traffic, and as all the canary comparisons go well you want to roll out to your users. Blue-green is the basics; what you want to do is release to 1%, then 5%, 10%, 25%, 50%, 75%, 90% and 100% of your users, so if you get something wrong you won’t disrupt the majority of your users.
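
One simple way to implement this kind of progressive rollout is to hash each user into a stable bucket from 0 to 99 and compare it with the current rollout percentage. This is a sketch under that assumption; the function names are made up, and real traffic splitting usually happens in a load balancer or service mesh rather than in application code.

```python
import hashlib

# Rollout percentages taken from the steps listed in the post.
ROLLOUT_STEPS = [1, 5, 10, 25, 50, 75, 90, 100]


def user_bucket(user_id: str) -> int:
    """Map a user to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def sees_new_version(user_id: str, rollout_pct: int) -> bool:
    """A user in bucket b gets the new version once rollout_pct > b."""
    return user_bucket(user_id) < rollout_pct


def next_step(current_pct: int, canary_healthy: bool) -> int:
    """Advance to the next rollout step, or drop to 0 to roll back."""
    if not canary_healthy:
        return 0
    for step in ROLLOUT_STEPS:
        if step > current_pct:
            return step
    return 100  # already fully rolled out


if __name__ == "__main__":
    pct = 0
    while pct < 100:
        pct = next_step(pct, canary_healthy=True)
        exposed = sum(sees_new_version(f"user-{i}", pct) for i in range(10_000))
        print(f"{pct:>3}% target, {exposed} of 10000 sample users exposed")
```

Because the bucket depends only on the user ID, a user who got the new version at 5% keeps it at 10% and beyond, so nobody flips back and forth between versions during the rollout.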

Having Better Pipelines

It’s easy to mess up pipelines. Here are some tips and lessons learned that will help you build better ones:

* Do not use Git Flow; try simpler models like CI/CD or even GitHub Flow.
* Avoid multiple downstream jobs, because they increase pipeline complexity.
* Keep an eye on test performance: make sure you don’t have flaky or slow tests, and run everything in parallel.
* Avoid obscure steps that your team does not understand. Stick to the basics and slowly improve.
* Do not hard-code credentials; use HashiCorp Vault.
* Make sure you have notifications in place; email and Slack are important.
* Avoid long scheduling cycles like 1 time per day.
* Consider dedicated pipelines for infrastructure tasks like creating, searching, upgrading and destroying clusters.
* Try to use Terraform instead of CloudFormation because it is much better and simpler.
* Use tests for Terraform specs.
* Make sure you tune your Jenkins agents and use proper instance sizes; avoid t2.medium.
* Have more than 1 agent or you will have bottlenecks.
* Consider using spot instances and shutting down Jenkins overnight and on weekends to save costs.
* Automate your Jenkins service too :-)
* Make engineers responsible for their pipelines; this is not a “DevOps” team responsibility.
* Create templates and guides to help your engineers improve productivity.

GitOps

GitOps makes lots of sense if you are running workloads in Kubernetes; it does not matter if it is on EC2 with kops or with a managed service like EKS or GKE. The easiest way to go with EKS on AWS is eksctl, by the way. GitOps means that all Kubernetes cluster spec changes go to Git via pull request, which is super great. It does not matter if you use plain YAML specs or Helm charts (hopefully Helm v3, because v2 and earlier are a bit ugly).

Image for post

GitOps is not that different from CI/CD; there are just a few differences. First of all, it’s all triggered by Git hooks via pull requests. In your build phase, besides the application build, you also need to build the Docker image, or whatever other container or packaging format you might use; in the future maybe we will all be using WASM. Then you run your tests normally, and when you deploy, you first need to deploy the cluster changes and later on the application changes. With Kubernetes, canary, feature toggles and split traffic are much easier to do.
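
The core idea of "Git is the source of truth" can be sketched as a diff between the specs in Git and what is running in the cluster. The example below is a toy version using plain dictionaries; the `plan_changes` function and the resource names are assumptions for illustration, and real GitOps tooling does this against the Kubernetes API instead.

```python
from typing import Any, Dict, List


def plan_changes(
    desired: Dict[str, Any],
    live: Dict[str, Any],
) -> Dict[str, List[str]]:
    """Compare specs from Git (desired) with the cluster (live).

    Keys are resource names, values are their specs. Returns which
    resources must be created, updated or deleted to match Git.
    """
    to_create = sorted(name for name in desired if name not in live)
    to_update = sorted(
        name for name in desired
        if name in live and desired[name] != live[name]
    )
    to_delete = sorted(name for name in live if name not in desired)
    return {"create": to_create, "update": to_update, "delete": to_delete}


if __name__ == "__main__":
    # Hypothetical specs: what the merged pull request says should exist...
    desired = {
        "deployment/orders": {"image": "orders:1.4.0", "replicas": 3},
        "deployment/payments": {"image": "payments:2.1.0", "replicas": 2},
    }
    # ...versus what is currently running.
    live = {
        "deployment/orders": {"image": "orders:1.3.9", "replicas": 3},
        "deployment/legacy": {"image": "legacy:0.9.0", "replicas": 1},
    }
    print(plan_changes(desired, live))
```

Since every change arrives as a pull request, the diff is reviewable before it is applied and the Git history doubles as an audit log.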

Self Operating, Self Healing, Self Remediation Systems

The next level, which goes beyond GitOps, is self-operating, self-healing, self-remediation systems. Think of a Tesla car that drives and parks itself for you. You might want to check out some of my experiences with these systems in my previous blog posts:

* Experiences Building a Cassandra Orchestrator(CM)
* Lessons Learned using AWS Lambda as Remediation System

The whole idea is simple. If you think it through, it’s not crazy at all. AWS and GCP offer more and more services like that, the so-called “Managed Services” like RDS, ElastiCache, S3, MSK, EKS, and the list goes on and on.

If self-operating, self-healing, self-remediation systems are the future, why doesn’t everybody build them? First of all, it all depends on your use cases and scale. If you can, you should use managed services. If you use another open-source solution that is super important to your business and you have lots of use cases on it, then it would make sense to build a self-operating, self-healing, self-remediation system, or internally managed service for short. For microservices, GitOps should be enough; for the DATA layer you might need to go one level down.
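
At its simplest, a self-healing system is a loop: check health, try a known remediation, and escalate to a human only when that fails. The sketch below assumes hypothetical `is_healthy` and `restart` callbacks supplied by the caller; it is not the design of the Cassandra orchestrator or the Lambda remediation system mentioned above.

```python
from typing import Callable, Dict, Iterable


def heal(
    node_ids: Iterable[str],
    is_healthy: Callable[[str], bool],
    restart: Callable[[str], None],
    max_attempts: int = 3,
) -> Dict[str, str]:
    """Check every node and try to remediate the unhealthy ones.

    Returns a report mapping node id to "healthy", "remediated" or
    "escalate" (remediation failed, a human should take a look).
    """
    report: Dict[str, str] = {}
    for node in node_ids:
        if is_healthy(node):
            report[node] = "healthy"
            continue
        for _ in range(max_attempts):
            restart(node)
            if is_healthy(node):
                report[node] = "remediated"
                break
        else:
            # The loop finished without a successful health check.
            report[node] = "escalate"
    return report


if __name__ == "__main__":
    # Fake cluster state for the demo: node-2 recovers after a restart,
    # node-3 never does.
    state = {"node-1": True, "node-2": False, "node-3": False}

    def fake_is_healthy(node: str) -> bool:
        return state[node]

    def fake_restart(node: str) -> None:
        if node == "node-2":
            state[node] = True

    print(heal(state.keys(), fake_is_healthy, fake_restart))
```

In a real system this would run on a schedule or react to alerts, and the remediation would be something like replacing an instance instead of a stubbed restart.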

Improve Business Agility

Teams are agile; agile is 20+ years old. Now we need to go to the next step: organizations need to be agile, and they need to have agile goals. Think more in terms of impact and outcome instead of output/features for product discovery. There is no sense in doing that without improving reliability and time to market. Therefore, better pipelines are a must-have investment that will pay off in the long term.

Cheers,
Diego Pacheco

Originally published at http://diego-pacheco.blogspot.com on April 1, 2020.
