Last year I started to blog about something I’m calling “Stability Mindset”. I’m still working on the ideas, but I think I have had enough time and experience to evolve the concept into some basic guiding principles.
You might be wondering whether I’m just doing some kind of ugly rebranding of reliability, like Site Reliability Engineering (SRE), but I don’t think that is the case. I see them as different things, or at least this is a specific subset of SRE. If you are interested in the broader spectrum, you can check out Google’s insights on the matter: Google has an amazing book about SRE.
I thought a lot about whether this was just SRE for mere mortals, but I don’t think it is. I really think it is a different kind of problem, or at least a different subset. I’m writing this post because I want to capture some of my recent findings, and also because I did a retrospective exercise with my team today and promised I would share some of my thoughts.
Why does it start with a mindset?
Last year I think I did a fair job of grasping some of the good elements of what I consider this new mindset. Some of the major elements of the concept included:
- Code / Release practices like Retry/Fallbacks/Timeouts
- Proper Observability: Visibility, Dashboards, Alerts, metrics
- Software and Infrastructure Automation
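To make the first item a bit more concrete, here is a minimal sketch of how retry, timeout, and fallback can fit together. It is only an illustration under my own assumptions: the names `call_with_timeout` and `resilient_call`, the default values, and the exponential backoff are hypothetical choices, not something taken from a specific library.

```python
import concurrent.futures
import time


def call_with_timeout(func, timeout_seconds):
    """Run func in a worker thread and stop waiting after timeout_seconds."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func)
    try:
        # Raises a timeout error if func has not finished in time.
        return future.result(timeout=timeout_seconds)
    finally:
        # Do not block on a stuck worker; it finishes in the background.
        pool.shutdown(wait=False)


def resilient_call(primary, fallback, retries=3, timeout_seconds=1.0,
                   backoff_seconds=0.1):
    """Try primary up to `retries` times, then return fallback()."""
    for attempt in range(retries):
        try:
            return call_with_timeout(primary, timeout_seconds)
        except Exception:
            # Any failure (error or timeout) counts as one failed attempt.
            if attempt < retries - 1:
                # Wait a little longer before each new attempt.
                time.sleep(backoff_seconds * (2 ** attempt))
    # Every attempt failed: degrade gracefully instead of blowing up.
    return fallback()
```

A caller would wrap a remote call as `primary` and pass something cheap and safe, like a cached value, as `fallback`. Keep in mind that the timeout here only stops the waiting; the worker thread itself keeps running until the call returns.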
I still think these aspects are valid, but I do think there is more, and I will explain why in this post. Don’t get me wrong: everything I said in the previous post is still important, and this is more an evolution of the idea than a correction or replacement.
Explaining things can be hard. It’s easy to understand something and use your gut feeling to tell that something is wrong. Sometimes it’s hard to express this to other members of your team and be clear and direct. It can get even harder to tell them exactly what they should do and what they need to stop doing, in other words, specific behaviors and practices. Engineering means dealing with different kinds of problems, solutions, and platforms. Stability looks different when you are doing observability, when you are working on Cassandra backups, or when you are building an in-memory database. However, I do believe the core principles are much the same, and that’s what I try to capture and share today.
It’s not a process, but how easily can you do it?
Stability does not mean freezing. The old ops mentality often assumes that if you “control” the environment and don’t let much in, you are safe. This was never true, and it does not work at the level of innovation that businesses need to operate with today. IMHO stability does not need more process; stability needs more discipline. A process is something you can do for compliance, and you can check whether you are doing it or not. Compliance might work well for security, but as long as you have development, or people with a development background, it might not be enough. You can give people feedback that their solutions need to be more “stable”, but what does that mean? More importantly, how do they do it? What should they start doing? What do they need to stop doing? This need to explain things to others, added to the fact that some people get it more easily than others, made me stop and think about something more direct and clear. That’s why I tried to come up with these principles. The idea behind the principles is to make it easier to understand what stability means and thus to make the changes in behavior and practices needed to achieve it.
Senior means different things to different people
For one company, a senior engineer might be a person who can deliver high-quality work without any communication or support. For other companies, like mine, it means you can deliver quality work alone, but also that you can learn anything and teach others to do it. I’m sure other companies have other definitions of what a senior engineer is, and this is fine. You might be wondering why I’m bringing this up. The reason is that people have different backgrounds, and if you tell me how you measure someone, you will get behavior that matches that measurement. This means that if you don’t have this mindset, it may not be because of a lack of skills but because your company does not care, or does not need to care, about these things. In other words, I do believe in people, and I do believe people can learn and improve, so this is not black and white: you have it and you are great, or you don’t have it and you suck. No, not at all! Lean is one of the roots of agile and also one of the roots of DevOps. Lean believes a lot in the right “vision” and in empowering people. This is about empowerment: making people able to deliver more “quality work”, where quality means stability.
Asgard is not a place
Sorry if you have not watched the last Thor movie, but in it they say “Asgard is not a place”: at the end of the movie Asgard was a ship or a piece of land, but actually it was wherever the people were. That’s how I think about stability: it is not something fixed with exact rules (a.k.a. process). This is so true that for my team, for instance, production is whatever we do. We run software on AWS with multiple accounts; sometimes production means what is running on the PROD account, other times what is running on the SHARED-Infrastructure account, and other times even what is running on a TEST account. Stability should be a mindset that you apply to whatever you do, so it does not matter if you are working with Cassandra (the source of truth for my project) or something small like reporting tools for dependencies. When you are on a CORE team and there are people depending on your work, everything you do is Asgard: everything you do is production, so it all needs to be stable.
This is simple to understand and very hard to put into practice, since sometimes things can stop working even if you did not change a single line of code. I’m sure everyone is familiar with the Spectre and Meltdown vulnerabilities; they are a good and recent example of this. Another great example is Gradle: who has never had a third-party dependency download some extra jar and blow up their server bootup? I’ve been there.
All right, so right now these are the guiding principles of stability for me:
- Isolation
- Tests
- Observability
- Always Working
- Configuration Management
- Proactivity
- Brazilian “Skol” principle
Isolation: Isolation is key. You need proper separation between DevOps Engineering or Software Engineering work and production. So if your solution has a Jenkins job, you need two Jenkins jobs: one for development and another for production. Isolating only the “code” or the server machines is not enough. There is much more to isolation. For instance, we use agents (slaves) in our Jenkins, and if there are many jobs on the same agent they will queue up, so a development bug might affect production. Since Jenkins is part of your “production” solution, the fix is to have separate agents.
Tests: If you can automate, great. If you can’t, or it is too expensive, you should at least know your CORE features and re-test all of them after your PR. This makes you go slower, but it makes you more stable. Automating is always a good idea when you can; however, some things in the DevOps Engineering field are not testable at all, like CollectD for instance, or the effort to test is simply too high since there is a lot of remote, distributed, async work.
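One lightweight way to keep track of your CORE features is a small registry of checks that you run after every PR. The sketch below is just an illustration: `core_feature`, `run_core_checks`, and the two example checks are hypothetical names I made up, and the placeholder checks simply return `True` where a real probe would go.

```python
from typing import Callable, Dict, List, Optional

# Registry of CORE feature checks: feature name -> function returning True if OK.
CORE_CHECKS: Dict[str, Callable[[], bool]] = {}


def core_feature(name: str):
    """Decorator that registers a check under a CORE feature name."""
    def register(check: Callable[[], bool]) -> Callable[[], bool]:
        CORE_CHECKS[name] = check
        return check
    return register


def run_core_checks(
        checks: Optional[Dict[str, Callable[[], bool]]] = None) -> List[str]:
    """Run every check and return the names of the features that failed."""
    checks = CORE_CHECKS if checks is None else checks
    failed = []
    for name, check in checks.items():
        try:
            ok = bool(check())
        except Exception:
            # A check that crashes counts as a failed feature.
            ok = False
        if not ok:
            failed.append(name)
    return failed


@core_feature("backup")
def check_backup() -> bool:
    # Hypothetical placeholder: replace with a real probe of your backup.
    return True


@core_feature("restore")
def check_restore() -> bool:
    # Hypothetical placeholder: replace with a real probe of your restore.
    return True
```

Calling `run_core_checks()` after a PR returns the list of failed features, and an empty list means every registered CORE feature passed. Even when a check is a manual step, having it on the list keeps it from being forgotten.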
Observability: You need to know whether you are OK or not, so you need key metrics, basic dashboards, and good alerts. This is basic and I don’t think I need to explain much about it. In my current project I don’t do any customer-facing work, and most of the things my team does are in support of other teams, so it is easy to fall into situations where it takes some time to detect that there is a problem. So the golden rule for my team is to have observability of everything that can fail, and at least some alert or message in a Slack chat.
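As a small illustration of "at least some alert", here is a sketch of a heartbeat check that flags components which went silent. Everything in it is an assumption for the example: the function names, the idea of storing the last heartbeat as a Unix timestamp per component, and the `notify` callable, which stands in for whatever actually posts the message to your chat.

```python
import time


def find_silent_components(last_heartbeat, max_silence_seconds, now=None):
    """Return the names of components whose last heartbeat is too old."""
    now = time.time() if now is None else now
    return sorted(
        name
        for name, seen_at in last_heartbeat.items()
        if now - seen_at > max_silence_seconds
    )


def alert_on_silence(last_heartbeat, max_silence_seconds, notify, now=None):
    """Send one message per silent component through the notify callable."""
    silent = find_silent_components(last_heartbeat, max_silence_seconds, now)
    for name in silent:
        notify(f"ALERT: no heartbeat from {name} "
               f"for more than {max_silence_seconds}s")
    return silent


if __name__ == "__main__":
    # Hypothetical data: the backup job last reported 2 hours ago.
    current = time.time()
    heartbeats = {"backup-job": current - 7200, "metrics-agent": current - 30}
    alert_on_silence(heartbeats, max_silence_seconds=600, notify=print)
```

The point of checking for silence is that a job nobody looks at can stop running without raising any error, which is exactly the situation where it takes a long time to notice a problem.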
Always Working: All Brazilians know the expression “the obvious is hidden from sight”, and this is the core assumption of the stability mindset. You need to internalize the idea that the software, whatever you do, needs to be always working. All elements and steps in your solution are equally important. So if your solution has Java code, Ansible roles, Jenkins jobs, and a web UI, all these elements need to be working all the time, and you can’t just care about the Java part. Stability is all or nothing.
Configuration Management: Rock-solid configuration management practices, like code versioning, version control, and tags on GitHub, not only make the work safer but also give us the ability to easily fix a specific problem and develop new features in parallel.
Proactivity: This is an attitude. If you internalize that you have full ownership of everything you do, that you are the last barrier, and that if you fail people will see the impact, you will act more carefully and work beyond what people ask you to do. We work in quarters and we have goals, but you can’t just follow the goals; you need to take some extra care.
Brazilian “Skol” principle: Down here in Brazil we have a Pilsen beer called “Skol”. The beer is nothing special; however, there was a TV ad saying that the beer goes down well and drinks smoothly, that the beer is something like a circle, not a square. A square means a bad thing, and the idea here is that you take stuff that is wrong and you always make it right, like the boy scout: you always leave the environment in better shape than you found it. So no matter whether it was you who wrote the code wrongly or left it not working, you are working on it, so now it’s your responsibility to fix it. Sorry, I just realized my “Skol” metaphor will suck if you are not Brazilian or don’t know Brazilian references, so think of it as the boy scout rule: always leave it better than you found it, no matter whose fault it was. Some cultural references are very hard to explain because they just don’t translate well. If you are Brazilian, I hope you get it :D
What’s Next?
I’m talking about this subject because I care and am passionate about it, but also because I don’t see much material on it. Don’t get me wrong, there is great SRE material from many companies like Google, Twitter, Facebook, and Netflix; however, I’m not talking about scale or scaling problems but about a different kind of problem. I’m still coaching and coding everything, and still learning about the stability mindset and how to develop better and more stable software. I will keep posting my findings here. Hope you guys enjoyed it. Take care. Cheers,
Originally published at diego-pacheco.blogspot.com.