The Platform Conundrum (and an Answer?)

A friend of mine shared an article with me titled Things I want as SRE/DevOps from Devs. It is an excellently written post about the pains of being an SRE, Platform Engineer, DevOps* etc., and about the lack of easily discoverable information that would help when solving stressful problems such as outages and unhealthy systems.

Having been a platform engineer for the last few years, and currently incubating a platform to be used by a large population of engineers at my workplace, I often hear two questions: how much abstraction is too much abstraction, and how can we reduce the cognitive load for developers, who are solving a business problem and shouldn’t have to worry too much about the platform itself? With these two questions in mind, here are my thoughts on the question posed at the end of that blog post.

Standards, and what are called paved roads these days, really help developers adopt best practices and patterns without having to think about what they are missing in terms of observability. A good platform should mandate what’s needed and agreed upon by the engineering team(s) and become a self-documenting artifact of that reality. The metadata for each service is a good starting point. It also enables discoverability, captures the dependency graph in a microservices world, and shows how far a service failure would cascade. In most* cases a service definition, which describes the service, its infrastructure needs, who owns it and so on, is a good enough abstraction. The underlying components need not be exposed at a level of detail where users have to think about the implementation of the platform they use. That said, the platform also needs to follow an inner source model so that it can be extended to meet a team’s very specific need. That means good documentation with explanations of the architecture, contribution guides, style guides etc.
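To make the idea of a service definition a little more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `ServiceDefinition` fields, the `impacted_by` helper and the service names in the catalog are hypothetical and do not come from any particular platform. It only shows how owner and dependency metadata could answer the "how far does a failure cascade" question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceDefinition:
    """Hypothetical service metadata; the field names are illustrative only."""
    name: str
    owner: str                          # owning team
    runtime: str = "container"          # infrastructure need, kept abstract
    depends_on: tuple[str, ...] = ()    # names of the services this one calls


def impacted_by(failed: str, catalog: list[ServiceDefinition]) -> set[str]:
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the dependency edges: upstream name -> services that call it.
    dependents: dict[str, set[str]] = {}
    for svc in catalog:
        for upstream in svc.depends_on:
            dependents.setdefault(upstream, set()).add(svc.name)

    # Walk the inverted graph from the failed service outwards.
    impacted: set[str] = set()
    stack = [failed]
    while stack:
        current = stack.pop()
        for name in dependents.get(current, ()):
            if name not in impacted:
                impacted.add(name)
                stack.append(name)
    impacted.discard(failed)  # only relevant if the graph contains a cycle
    return impacted


# A made-up catalog of four services.
catalog = [
    ServiceDefinition("payments-db", owner="data-platform", runtime="postgres"),
    ServiceDefinition("payments", owner="payments-team", depends_on=("payments-db",)),
    ServiceDefinition("checkout", owner="storefront-team", depends_on=("payments",)),
    ServiceDefinition("search", owner="discovery-team"),
]

print(sorted(impacted_by("payments-db", catalog)))
```

In this sketch the developer only declares what the service is, who owns it and what it depends on. How the platform provisions any of it stays out of sight.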

The abstraction should apply to platform usage itself, not to observability. Observability should be a first-class citizen: both the platform team and the service development team should have a shared view of the system and collaborate on monitoring and instrumentation that measure the SLIs accurately and purposefully.
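As a rough illustration of what a shared view of an SLI could look like, the sketch below computes a request-based availability SLI (good events divided by total events) and the fraction of error budget left against a target. The function names, the sample numbers and the choice to treat a window with no traffic as fully available are all assumptions made for this example.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """Request-based availability: the share of events that were good."""
    if total_events == 0:
        return 1.0  # assumption: a window with no traffic counts as healthy
    return good_events / total_events


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target   # budget the target permits
    actual_failure = 1.0 - sli           # budget actually used
    return 1.0 - actual_failure / allowed_failure


# Made-up numbers: 10,000 requests, 10 of them failed, target of 99.5%.
sli = availability_sli(good_events=9_990, total_events=10_000)
remaining = error_budget_remaining(sli, slo_target=0.995)
print(f"SLI: {sli:.4f}, error budget remaining: {remaining:.0%}")
```

The point is less the arithmetic and more that both teams agree on what counts as a "good event" and read the same number.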

As for the cognitive load problem, a developer who has to think about the operational aspects of the service every time they sit down to write business logic isn’t making productive use of their time. A platform that respects this and provides appropriate signals at the right time (as a part of CI, or even before a code check-in?) would help, instead of making them think in two different directions.
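One way such a signal could surface in CI is a small check that runs against the service's metadata and reports what is missing before the change is merged. The sketch below is only an assumption of how that might look: the required field names (`owner`, `slo_target`, `alert_channel`) and the `check_service_definition` function are hypothetical, not an existing tool.

```python
# Hypothetical standard agreed upon by the engineering teams.
REQUIRED_FIELDS = ("owner", "slo_target", "alert_channel")


def check_service_definition(definition: dict) -> list[str]:
    """Return human-readable problems; an empty list means the check passes."""
    problems = []
    for key in REQUIRED_FIELDS:
        if key not in definition or definition[key] in (None, ""):
            problems.append(f"missing required field: {key}")

    # Validate the SLO target only when one was supplied.
    target = definition.get("slo_target")
    if target is not None:
        is_number = isinstance(target, (int, float)) and not isinstance(target, bool)
        if not (is_number and 0 < target < 1):
            problems.append("slo_target must be a number strictly between 0 and 1")
    return problems


# A made-up definition that forgot its alerting details.
incomplete = {"owner": "payments-team", "slo_target": 0.995}
for problem in check_service_definition(incomplete):
    print(problem)
```

A CI job could fail the build whenever the returned list is not empty, so the developer gets the nudge at check-in time rather than during an outage.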

To conclude, I think a good platform is the answer. :-)

  • *DevOps: I think this word gets misused a lot in the industry. If a team is true to the DevOps philosophy, they’d operate the platform they run, figure out the observability pain points early, and put adequate guardrails in place themselves.
  • *most: There are always going to be edge cases and services that may not necessarily fit the paved roads.