A quick note: I want to be clear that the purpose of this article is not to bash people who work in operations, traditionally known as sysadmins or DevOps. I have the utmost respect for people in those roles, and strongly believe people should do what they love most. Instead, I am focusing purely on my beliefs and what happened to me.
When I first started as an SRE at Azuqua it seemed like a perfect job. It meshed system operations with the one thing I loved most, writing software. I originally started as a "connector developer", which was essentially a role where one wrote in a JSON DSL that empowered another service to talk to third-party APIs. I was eager to do anything other than that (turns out writing JSON isn't that fun). My background, at the time, had been one of simple docker-compose environments and running Docker containers on VMs. I had no idea what "Kubernetes" was and only a basic idea of Docker Swarm.
I distinctly remember my first meeting with our then-CEO in a conference room, discussing my move to our Platform team and expressing my desire to work with containerization technology and write code to power it. At the time I had been told that we'd be creating an "SRE" team. I had no idea what that was. I made a comment about wanting to do DevOps work, as at the time I had figured that was the type of team that worked with Docker, wrote code to scale things, and more. I was quickly shot down and informed that DevOps was an anti-pattern and that SRE was the future.
Being told by your CEO that your entire life's goal is "an anti-pattern" is not easy to digest. So, what is the difference between DevOps and SRE? DevOps originated as an attempt to solve the traditional disconnect between system administrators and developers. Ideally these DevOps engineers would bridge that gap and help bring reliability to the forefront of developers' minds, while increasing velocity at the same time. In reality, that didn't happen. The gap between DevOps and developers is arguably only a little better than it used to be, if not the same. Developers still generally ping a DevOps team to deploy their application and fix production issues, and generally avoid the "I wrote bad code, how do I make it better" conversation.
SRE, born at Google, was meant to take this further. Instead of just keeping separate teams, SRE was meant to give teams a framework, not just a process, for rolling out code. I won't go in-depth on this; if you're interested this is a great article on it. While DevOps touched on this, it didn't go nearly as far as it should've. Docker even released a tool that I feel is a great example of this:
docker app bundle, which created a YAML manifest of an application meant to be handed to "the DevOps" team. From an SRE perspective this is something that'd have been automated.
Going back to my first SRE role at Azuqua, we lived and breathed the SRE book. We followed the model to the letter; app teams defined their SLO/SLIs and error budgets, and spearheaded conversations about reliability across teams. It was an amazing learning experience, and it worked well. However, all good things must come to an end and I eventually ended up leaving Azuqua post-acquisition by Okta. As I mentioned earlier, I have a ton of respect for people who enjoy their roles as DevOps engineers, but when I joined Okta it was very clearly not a role for me. While their team was called "SRE" there was no model with that name to be found. Instead it seemed reduced to a fad team name, meant to hire people, that followed all of the principles of a DevOps team with weeks between deployments (side note: waiting an arbitrary time to release things does not make it safer), among other problems. This was the first sign of problems with SRE I ran into.
During my brief time looking into new roles, I noticed something odd. Every time I looked at a job description for an SRE role, none of the responsibilities seemed to match what I had come to expect as the norm at Azuqua. The listings described teams with no monitoring framework, teams that were themselves responsible for deployments, and so on. There was no mention anywhere of the SRE model at Google, platforms being provided for developers, or the like.
When I started at my new company as an SRE I rolled in ready to participate. We had just recently created this team and I immediately set to work on advocating for us to follow the SRE book. We started rolling out an SLO/SLI framework and defining a consulting process. It went well for a few months until the company started to grow and we started to get a lot of pages for application problems. After investigating these pages, which clearly violated their SLO/SLIs, we had our first challenge to our new paradigm: how do you enforce your monitoring targets?
When you define SLO/SLIs, the idea is that key stakeholders will approve of the numbers, and if they aren't met then it's all hands on deck to try to solve the problem. If you miss any part of that process, they don't actually work. In our case we didn't have buy-in from higher levels to solve this problem. Instead of application teams fixing their performance problems, we ended up becoming the face to blame. Latency problems? SRE's fault. Our defined SLO/SLIs became useless text without real meaning.
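For anyone who hasn't worked with these numbers, here is a rough sketch of how an availability SLO turns into an error budget, and what "not met" means in practice. The 99.9% target, the request counts, and the function names are all hypothetical, picked only for illustration; real SLO tooling is considerably more involved.

```python
def error_budget(slo_target: float, total_requests: int) -> int:
    """Requests allowed to fail in the window before the SLO is missed."""
    return round((1 - slo_target) * total_requests)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative once it is blown)."""
    budget = error_budget(slo_target, total_requests)
    if budget == 0:
        # A 100% target leaves no room for failure at all.
        return 1.0 if failed_requests == 0 else float("-inf")
    return 1 - failed_requests / budget


def is_slo_met(slo_target: float, total_requests: int, failed_requests: int) -> bool:
    """True while failures stay within the agreed budget."""
    return failed_requests <= error_budget(slo_target, total_requests)


if __name__ == "__main__":
    # Hypothetical month: 1,000,000 requests against a 99.9% availability SLO.
    target, total = 0.999, 1_000_000
    for failed in (250, 1_500):
        print(failed, is_slo_met(target, total, failed), budget_remaining(target, total, failed))
```

The math is the easy part. The budget only means something if the stakeholders who signed off on the target agree on what happens once it goes negative.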
Soon after that the next branch of our newly defined system fell apart: consulting teams on direction. An issue frequently faced at startups is that you generally need to ship to stay afloat. In our case we needed to deliver a specific feature in another cloud. Instead of our team being consulted on the direction, it was done out of band, and only much later brought back under the purview of the team at great expense. It felt like being on a team with no power, but all the blame.
At this point, on my second try at SRE at a different company, I decided that SRE wasn't the position for me anymore. While it's a great principle, I strongly believe that it's impossible to roll out without buy-in from the exec level and enough resources being provided. Combined with the lack of qualified individuals who understand the SRE mindset, I don't believe it's possible to roll out the actual SRE model at a company unless you are a new startup.