Wilcox Development Solutions Blog

Deprecation and Adoption Across Teams

June 11, 2024

Leading Technical Change across the Herd.

Software has evolved from a single-team activity to a multi-team activity: multiple teams, each with their own microservices, where a single user request calls a service, which then calls another service to look something up, which calls another service, and so on. A single request may spawn many requests inside the microservice herd: this is a characteristic of microservice architecture, in REST as much as in GraphQL (federated supergraphs reduce some of the explicit complexity here - although only somewhat).

The promise of small teams and microservices was easily manageable teams (as they stay under Dunbar’s Number) making isolated changes with limited risk (an issue with one service may not bring down the rest of the services in the company). Small teams reduce apparent complexity within the team, as there are just fewer simultaneously moving pieces to think about.

In a mid to large sized development shop you’ll have multiple teams with multiple microservices under their stewardship. From an observability/reliability perspective one now wants to use distributed tracing tools (or maybe just engineers’ knowledge, if the system is still tractable) to understand the path of requests. You’ll probably find flocks of people and cliques of services: a user request about user preferences will interact with user-preference-related microservices, but not the ordering microservices, for example.

Here is where that promise of microservices starts to break down a little bit: some changes are not isolated. What if you have a change that impacts an interface relied on by another team? A new version of some interface, a machine or service that’s now getting shut off, a data field being deprecated?

It’s not just peer or direct dependency changes, but sometimes transitive dependencies: a service three calls away from you removes a field you use waaay up here.

But modern microservice organizations aren’t just teams calling other teams for information: no! Modern software development requires teams to be knowledgeable about frontend, maybe mobile, backend, database, CI/CD, containers, The Cloud, and SRE practices. This is a vast array of, in some cases, very specialized information practiced by a few experts. You may have one or two people in the org really good with Terraform, and everyone else just copies them. How do you scale this specialized expertise, and create technical earned interest?

One way may be establishing a Platform Engineering team, which can create common infrastructure, practices, and standards. This may involve creating common libraries, pipeline patterns, tools to make it easier to automatically enforce enterprise-level standards, and/or base Docker images secured per company standards. Maybe your Platform Engineering team is home to the company’s Kubernetes or Kafka operational experts.

So: peer teams, teams with transitive dependencies on one another, teams guiding and scaling specialized expertise. Sounds like you haven’t gotten rid of the complexity of the software; you’ve shifted it into team communication patterns. The complexity isn’t in the nodes, but in the connections.

For example, how are you informed of changes, made by other teams, that you have to react to? Changes to some service you use? Or changes that other stakeholders like Platform Engineering, Security, Enterprise Architecture, or technical leadership want to make across the microservice herd - changes to those common artifacts, standards, and services? These are changes you may need to write code to deprecate or adopt.

The way an organization learns to navigate this change is, I believe, how an organization moves up to the Process / Stabilization Growth Plain: a mature way to inform other teams of a change, without unexpected surprises or breakages, while allowing teams to schedule the required work as part of some future - but not too far away - sprint.

My experience pushing technical change through a large org

When presented with this problem at a previous gig, I created the idea of a Deprecation/Adoption Notice (a D.A.N). Through this system I was able to push 60+ changes across 10 teams and 100 microservices, successfully and calmly: from large changes such as adopting a new user authentication system, to library upgrades, to smaller changes like adopting new base Docker images across the herd. This system is still being used after 4 years of operation.

Once written, a notice can be distributed to the affected teams, who should then take action on it per the document’s contents.

On Deprecation / Adoption Notices

There’s some subtlety that goes into this document - some best practices, if you will.

The goal of this Deprecation / Adoption Notice framework is to use it only for things that cannot be automated away. For example, I may have a microservice that wants to deprecate some field it returns to clients. If this service is part of a GraphQL Supergraph then deprecation tags in GraphQL and the team collaboration features of Apollo GraphOS - in particular GraphOS Studio - make it relatively easy to mark a field as deprecated and see usage of that field decrease over time. This is harder in a REST-based model, as a field will be returned in a response whether it’s used or not: GraphQL by its nature is more explicit.
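
As a rough sketch of what that looks like (the type and field names here are hypothetical), marking a field with GraphQL’s built-in @deprecated directive is a one-line schema change that tooling like GraphOS Studio can then surface to clients:

```typescript
import { buildSchema, printSchema } from "graphql";

// Hypothetical user-preferences schema: `legacyScore` is marked with the
// built-in @deprecated directive so schema tooling can show it as deprecated
// and track remaining usage over time.
const schema = buildSchema(`
  type UserPreferences {
    id: ID!
    theme: String
    legacyScore: Int @deprecated(reason: "Use engagementScore; removal planned after the D.A.N deadline.")
    engagementScore: Int
  }

  type Query {
    preferences(userId: ID!): UserPreferences
  }
`);

// Printing the schema keeps the deprecation visible to anyone reading the SDL.
console.log(printSchema(schema));
```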

I have heard of orgs where the culture and test coverage are good enough that library updates are automatically pushed and merged to main via bot, then sent toward production. If you live in such an organization, awesome. Non-tech companies, though, often need some org-wide notice to get off unsupported languages or platforms. (Veteran developers know of places that were on Java 6 for waaayyy too long.)

Another important part is to understand the scope of the change, both in teams affected and in work impact (and ideally both are as small as possible). This increases the chance of success. When you ask other teams to do something for you, the instructions should be clear, easy to understand (thus easy to estimate), and ideally as small as possible. This means the D.A.N authoring team needs to make things clear and do their research: to make sure the instructions work, and to pass the plan through any Architecture Review Boards or Enterprise Architecture committees involved.

Understanding the scope may involve talking to tech leads beforehand: understanding how the desired change affects them, and getting them on board with next actions. In cultures with good platform engineering teams (i.e. Developer Starter Packs, a Reference Architecture, maybe an org-wide standardized platform of preferred tools and libraries) this commonality may mean the effort is known and consistent across teams. “Upgrade to the latest version of this logging library; you probably need to make changes a, b, and c” is more predictably scoped than “Please switch your logging libraries to output to standard out only, if they aren’t already”.

Some rough guidelines or a “golden path” give those in technology leadership an idea of how current events need to shape technology change. When Log4Shell came out, at a previous gig, technology leadership and architecture analyzed the vulnerability’s impact on our systems. While we were fairly clean (as the org had standardized on Logback), I was prepared to write a D.A.N describing how to mitigate the issue or upgrade Log4J.

With scope you can understand - very roughly - how much time it may take teams to adopt the change. Knowing the size of the effort gives you a deadline. For most places a deadline needs to be placed on actually doing the work, or other organizational priorities will overshadow it.

In a scrum environment my advice is that a D.A.N should have a deadline as follows:

  • 2 sprints to get through backlog refinement and prioritization PLUS
  • 1-2 sprints to do the work (calendar time: doesn’t matter if it’s a 1 hour change or a 1 week change, it’ll still probably take a sprint) PLUS
  • local time required to walk the change through the release cycle

In some places - those with the ability to ship changes to production quickly - you may see an org-wide change adopted in as little as 3-4 sprints! In larger, non-tech organizations - especially with larger changes - the cycle time may look more like 5-6 sprints.

In SAFe environments, or similar PI Planning setups, there may be even more lead time for very large changes. If the enterprise works in a way where only very small changes can skip PI planning, or where taking on significant unplanned effort mid-PI is culturally rejected, this lead time may be even longer (“first we have to wait until the next PI planning, then the refinement work can start”).

If leaders have enough power to change how PI planning is done, I’d personally lobby for a defined amount of every team’s capacity to be set aside for technology improvements and changes (in particular from D.A.Ns), so changes like this can be more agile, especially in cases where it’s needed. Security vulnerabilities or sudden vendor-related changes may not care about the outcomes and priorities set by your company’s Big Room Planning.

Likewise, there’s a project management aspect to a D.A.N. Project management is hard when there are many ordered dependencies: if microservice A needs to make a change first, then microservices B, C and D, then everyone else - that’s hard to manage. A constraint on the technical design certainly should be that teams can adopt the change on arbitrary timelines. This matters especially in a large microservice architecture: there’s good potential for recursive loops as a request traverses the microservice herd (microservice A calls an endpoint in microservice B, which calls an endpoint in microservice C, which calls a new endpoint back in microservice A). Software architecture is about constraints, and this one certainly affects the implementation and rollout of a change.

It’s important that your change allows gradual adoption. When you deploy your change to production you shouldn’t break people relying on the existing thing, and (ideally!) your change shouldn’t mean that it and every dependency need to go to production at the exact same time. Coordinating large, simultaneous releases of software across teams is something best avoided.

This gradual adoption means shipping version 2 of the thing while version 1 is still supported, if possible. However, after the deadline, clean up and delete version 1 - don’t let it linger. Get fully off the old thing: don’t support the two different systems indefinitely.
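
As a minimal sketch of that shape (routes, field names, and the Express setup are all hypothetical here): version 2 runs alongside version 1 during the notice period, and the version 1 handler is deleted outright once the deadline passes.

```typescript
import express from "express";

const app = express();

app.get("/v1/preferences/:userId", (req, res) => {
  // Old shape: still supported until the D.A.N deadline, then deleted.
  res.set("Sunset", "Sat, 01 Mar 2025 00:00:00 GMT"); // illustrative deadline date
  res.json({ userId: req.params.userId, legacyScore: 42 });
});

app.get("/v2/preferences/:userId", (req, res) => {
  // New shape teams are asked to adopt on their own schedule.
  res.json({ userId: req.params.userId, engagementScore: 42 });
});

app.listen(3000);
```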

Gradual adoption for the duration of the notice period; after that, turn the old thing off. Have a deadline, and clearly list the consequences of non-adoption. A deadline gives agile teams focus on what priority this item should have amongst the other items they need to deliver to customers. It means the change can be scheduled, not thrown on the backlog forever and forgotten about.

Sometimes enforcement of deadlines can be automated. In an organization where teams use CI/CD pipelines provided by (or including common components from) a Platform Engineering team, enforcement of changes can sometimes be done via the pipeline. For example, a change may be to update to the latest version of the common Event Bus library. CI could check that this library is at the correct version, even just by grepping the dependencies declaration file, then fail the build if a feature branch isn’t updated past the deadline.
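
As a sketch of how cheap that check can be - the library coordinates, file name, and required version below are all hypothetical - a small script in the shared pipeline can fail the build once the deadline has passed and the dependency is still behind:

```typescript
import { readFileSync } from "node:fs";

// Hypothetical enforcement script run by the shared pipeline: fail the build
// if the common event-bus client is older than the version the D.A.N requires.
// Assumes a Gradle-style dependency line such as:
//   implementation "com.example:event-bus-client:2.3.1"
const REQUIRED = [2, 4, 0]; // minimum version once the deadline has passed

const buildFile = readFileSync("build.gradle", "utf8");
const match = buildFile.match(/com\.example:event-bus-client:(\d+)\.(\d+)\.(\d+)/);

if (!match) {
  console.error("event-bus-client dependency not found in build.gradle");
  process.exit(1);
}

const actual = match.slice(1, 4).map(Number);
const isOlder = (a: number[], b: number[]) =>
  a[0] !== b[0] ? a[0] < b[0] : a[1] !== b[1] ? a[1] < b[1] : a[2] < b[2];

if (isOlder(actual, REQUIRED)) {
  console.error(`event-bus-client ${actual.join(".")} is below required ${REQUIRED.join(".")}`);
  process.exit(1); // fail the pipeline stage
}
console.log("event-bus-client version OK");
```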

A good change enables leverage (that technical earned interest again!). Let’s say an organization wants teams to build dashboards that trace execution time for requests as they bounce through the herd. One way to do that would be to require certain instrumentation code be placed at every request callsite across the herd. A better change may be to create a common requests library and have teams adopt that. This feels larger, but you’re asking developers to modify every callsite anyway, and a library gives you more leverage: switching out the underlying request-making mechanism, or switching out how exactly metrics are reported to the system, then simply means having teams upgrade to the latest version. (This example is also a good candidate for a wide D.A.N as - we can imagine - metrics of this nature are more useful if everyone’s doing it.)
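
A minimal sketch of such a library (all names here are invented for illustration): teams call the wrapper instead of fetch directly, and the platform team keeps ownership of how timing metrics get reported.

```typescript
// Hypothetical shared requests library. Teams call herdFetch() instead of
// fetch(); the library owns timing metrics, so swapping the metrics backend
// later is a library upgrade, not an edit at every callsite.
type MetricsSink = (name: string, durationMs: number, tags: Record<string, string>) => void;

// Default sink just logs; the platform team can later point this at StatsD,
// OpenTelemetry, etc. without teams changing their code.
let reportMetric: MetricsSink = (name, durationMs, tags) =>
  console.log(JSON.stringify({ name, durationMs, ...tags }));

export function setMetricsSink(sink: MetricsSink): void {
  reportMetric = sink;
}

export async function herdFetch(url: string, init?: RequestInit): Promise<Response> {
  const start = Date.now();
  try {
    return await fetch(url, init);
  } finally {
    reportMetric("http.client.request", Date.now() - start, { url });
  }
}
```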

However, automation goes hand in hand with organizational culture. I’ve explicitly made things that could have been automated into manual processes instead, so Product Owners had control and sign-off ability - even if the steps were almost entirely automated. “Do a prod deploy and you’ll automatically get this new thing; or, if you don’t plan on shipping a feature in this microservice, you can do this one thing to rebuild your service and comply with this notice” was something I repeated many, many times.

The last part of a Deprecation Notice - especially one that may look purely technical - is to ensure non-technical stakeholders understand the value of the change, and are (ideally) motivated to do their part. Maybe the value is better, more common tools a team can take advantage of (vs building their own). Maybe the value is fewer things individual teams need to own, or better foundations for the future. If Product Owners, Scrum Masters and other non-technical leaders can see how it benefits them, and you have them on board, implementation will go much better.

Having said that, sometimes adoption is just “adopt the latest and most secure version of this thing that we’re rolling out”, or “update to the latest version of these libraries that probably everyone in the org uses”. As a user, the velocity at which I get new features matters, as does my data staying secure.

Most importantly, you need a proper forum to announce your D.A.N. In larger organizations, announcing it at the Change Advisory Board while you’re sending your change on its way to prod is just going to surprise people (best case) or break production (worst case).

The correct forum differs depending on the organization: maybe it’s the architect group meeting, maybe it’s going to the overall project lead, maybe it’s during a Scaled Scrum ceremony, maybe you can post it as team NEWS. The forum must include technology leaders (architects, tech leads, etc.), Product Owners (or their leaders), Scrum Masters (or their leaders), and potentially even technology people managers: while you may have distributed drafts of the D.A.N to some of these leaders before, making sure all stakeholders have access to the announcement is key.

Reminders: regular reminders are important here, both for your stakeholders and for your reputation as a change author. Remind teams regularly: it’s far better than “you only told me once, and now we’re unprepared!”. A change author wants to be seen as a reliable and predictable partner: telling people what you’re going to do, reminding them of that, then continuing to remind them every so often and especially at key points - for example, just before most teams do their sprint planning, a week or so before the date the change should probably be deployed to a QA/UAT environment, a week before the deadline, and maybe even daily during that last week.

D.A.N template

An organizational standard needs to look standard. I suggest using something like the following template to build your notices:


  • Overview
  • Goals
  • Impact
  • Deadline
  • Implementation Details

The Overview and Goals sections are meant to be understandable by non-technical stakeholders, and may include information about why we’re making this change, what we hope to achieve from it, etc. This is where you ensure the non-technical stakeholders understand the value. Impact is for the more project-management-oriented Product Owners or Scrum Masters: how many teams are impacted and what the risk level of the change is - in larger organizations, not every D.A.N will apply to every team.

The Implementation Details are important. Developers will be asked to estimate and size the work involved. Especially at scale you want to pre-answer questions teams will have: showing developers an example of what to change, or how to perform the change, gives them something to base their sizing efforts on and helps them visualize how the change can be implemented in their own codebases.
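
For instance, an Implementation Details section might include a short before/after snippet like this (the library and client names below are invented for illustration) so a team can size the change at a glance:

```typescript
// Before (deprecated client, deleted after the D.A.N deadline):
//   import { LegacyAuthClient } from "@example/auth-legacy";
//   const auth = new LegacyAuthClient({ apiKey: process.env.AUTH_KEY });

// After (hypothetical org-standard client the notice asks teams to adopt):
import { AuthClient } from "@example/auth";

const auth = new AuthClient({
  clientId: process.env.AUTH_CLIENT_ID,         // issued per service
  clientSecret: process.env.AUTH_CLIENT_SECRET, // stored in the usual secrets manager
});

// Call pattern stays the same; only the construction changes.
export async function getToken(): Promise<string> {
  return auth.fetchToken();
}
```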

Conclusion

Small teams may reduce complexity by keeping a team under Dunbar’s number and reducing the size of work a team can realistically take on. The price we pay now lives in the relationships between teams and in any common platform or infrastructure created to empower them. Having a consistent way to communicate required changes, in a manner that does not surprise teams, is highly important for mid to large sized organizations.