I have a background in software development. Currently, I’m leading a team of highly effective engineers, where we are responsible for some of the components we deliver to production, and I have to say that this is not the standard reality in the IT market. Developers are used to breaking things and moving fast; however, lots of developers never cared about production or about “polishing” their systems. This happens for several reasons, such as:
- Having different people develop and operate the software
- Having different departments for Development and Operations
- Having development teams focus too much on new features rather than real UX
So there is this culture clash when we have developers doing DevOps engineering (I say that from my own experience). We developers are not used to keeping something running no matter what, to being always available and reliable. It takes some time to get this “Stability Mindset”, but once you get it, it is hard to get it out of your veins.
IMHO, stability is not about freezing things. You still need to engineer new features, but you need to make sure you don’t disrupt your service. How do you make sure you don’t add disruptions? There are several practices, for instance:
- Testing, Testing, Testing
- Telemetry and Alerts Automation
- System internals observability (error observability, app metrics, health checkers, distributed tracing, etc.)
- Rollout strategy: change one client at a time, step by step.
- Have a fallback or rollback strategy if something goes wrong.
- Coding for Stability Mindset
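The rollout and rollback items above can be sketched in a few lines. This is only a minimal illustration under my own assumptions: `staged_rollout`, `deploy`, `is_healthy` and `rollback` are hypothetical names, and in a real setup they would call your deployment tooling and health checkers.

```python
from typing import Callable, Iterable, List


def staged_rollout(
    clients: Iterable[str],
    deploy: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    rollback: Callable[[str], None],
) -> List[str]:
    """Roll a change out one client at a time.

    Returns the clients that ended up on the new version. If any client
    fails its health check, every client changed so far is rolled back
    (most recent first) and an empty list is returned.
    """
    changed: List[str] = []
    for client in clients:
        deploy(client)
        changed.append(client)
        if not is_healthy(client):
            # Something went wrong: undo everything we touched and stop.
            for name in reversed(changed):
                rollback(name)
            return []
    return changed
```

The point of the design is that a bad change only ever reaches one extra client before it is detected, and the rollback path is part of the rollout code instead of something you improvise during an incident.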
Some time ago, lots of folks down here in Brazil were talking about “Defensive Programming”, and I really think this is something similar, but for stability. For instance, I work a lot with cloud, and lots of the things I build run in the background, so if I don’t have proper observability, lots of things could easily be going wrong and we would find out in the worst way possible. By proper observability I mean basic things; for everything you engineer, you add:
- OS automation (like Ansible + Jenkins)
- Notifications (I like to create Slack channels that ping me if something goes wrong)
- Log everything (don’t forget to add logrotate)
- Look at dashboards every day (it helps to build mental patterns and get to know your workloads by “feeling”, so you can smell when something is not OK)
- When you call any API, have: retries, fallbacks, timeouts.
- Expose system/app metrics so you can understand what’s happening internally
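To make the “retries, fallbacks, timeouts” item above more concrete, here is a minimal sketch. It is not tied to any specific library: `call_with_retries` and its parameters are hypothetical names I am assuming for illustration, and the wrapped `call` is expected to enforce the timeout itself (for example by passing it to its HTTP client).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    call: Callable[[float], T],
    fallback: Callable[[], T],
    retries: int = 3,
    timeout_seconds: float = 2.0,
    backoff_seconds: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try `call` up to `retries` times, then fall back.

    `call` receives the timeout in seconds and must apply it to the
    underlying request. `sleep` is injectable so tests do not wait.
    """
    for attempt in range(retries):
        try:
            return call(timeout_seconds)
        except Exception:
            # Real code should catch narrower errors (timeouts, connection
            # failures). Wait a bit longer after each failed attempt,
            # except after the last one.
            if attempt < retries - 1:
                sleep(backoff_seconds * (2 ** attempt))
    # Every attempt failed: answer with the fallback instead of an error.
    return fallback()
```

A typical fallback would return a cached or default value, so the caller keeps working in a degraded mode while the notifications tell you something is wrong.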
If you release some shared component or service without these things, you can’t have proper debuggability/observability, so you might not be proactive enough to improve your users’ experience. Another key thing for improving the experience of your users (developers) is great documentation and good abstractions. Don’t expose all parameters; hide things and have good defaults. This makes adoption easier. These concerns can be added later; however, they often take lots of time. I don’t think a component/service should be released without these items; at least, I don’t feel comfortable doing so.
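As a small sketch of “hide things and have good defaults”: the names below (`ClientConfig`, `create_client_config`) and the default values are hypothetical, chosen only to show the shape of the idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClientConfig:
    """Settings for a shared client; only `endpoint` is required."""

    endpoint: str
    timeout_seconds: float = 2.0   # assumed default, tune for your service
    max_retries: int = 3           # assumed default
    metrics_enabled: bool = True   # observability is on unless turned off


def create_client_config(endpoint: str, **overrides) -> ClientConfig:
    """Build a config from one required value plus optional overrides."""
    if not endpoint:
        raise ValueError("endpoint is required")
    # Unknown override names raise TypeError, so typos fail fast.
    return ClientConfig(endpoint=endpoint, **overrides)
```

Most consumers would only ever write `create_client_config("https://example.internal")`; the other knobs exist, but nobody has to understand them to get started.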
It’s usual to have components that are integrated into microservices code. In these cases you need to make sure to be as lean as possible. Add only what is strictly necessary; it might be better to “copy” some code than to add frameworks. Why? Because you can easily create JAR hell for your consumers.
That’s something very unusual for developers. We are often familiar with reuse; however, in these scenarios reuse can be very tricky and might not pay off in the long run. Who has never had issues with Guava? Or the AWS SDK? This could be considered another item of “Coding for Stability”, since you might avoid problems for your end users by having these concerns before building anything.
At the end of the day, we want to make developers more productive. We don’t want to make developers more “restricted” or be heavily “opinionated” about some solutions. There is a high need for independence of developer teams; however, if we don’t have the Stability Mindset running in our veins, we might provide a poor user experience and end up with something that is a problem and slows developers down rather than speeding up productivity. The Stability Mindset is very hard to get from the beginning, but as you practice you can get it and deliver better solutions.
Originally published at diego-pacheco.blogspot.com.