The Stable Growth Plane and Stability

September 03, 2021

Intro

In my earlier article, “Software Projects and Planes of Growth”, I posited that software projects move through several planes of growth: foundations, testing, specialization, process, stabilization, and self-actualization.

I talked very little about what stabilization means in that article, and I’m going to define it a bit here.

Stabilization looks different for different software organizations. A product team of 7 with a single microservice and a front-end has simpler needs than many small teams with many small chunks of the project (aka the Spotify Model).

I’m very grateful for my diverse background here: from desktop app projects to Rails / Django monoliths, iOS apps, Node microservices with separate frontends, and leadership over a large microservice herd. There are a lot of variables and a lot of different ways teams are organized. But in all these cases, even if you have a team of 4 developers, you want commonly used things to be well-paved roads, not “well, let me ignore the previous work here and go do my own thing” without good reason. (Prevent rework / waste.)

But I believe a stable project is built on three foundations that I want to talk about here: the business functionality foundation, the technical foundation, and the organizational foundation.

The business functionality foundation

The question here is: do we understand the business domain knowledge, the user behaviors, and how the technology shapes how this work is implemented?

To borrow a line from WandaVision, “Software is just institutional knowledge executing”. When grooming a story it’s worth considering three things:

Do we have business knowledge of the thing we want to do (for example, when - exactly - should a customer be reminded of an overdue invoice?). Do we also understand if there are additional stakeholders at play (security folk not wanting account balances visible in that overdue notification, for example)?
Do we understand the user behaviors we want to see in the product (how should the user be notified? Does the user have options or do we use the email on file?)
How do the technical foundations we have shape the implementation? More on this later in the post, but how much should the technical foundations you have (or don’t have, or do have but that don’t match where you’re going) come into play as an engineer refines a ticket for estimation?

In a stable system you likely have a well trodden road for many of these items. Maybe there’s some experimenting (“we want to offer reminders, but not sure what method gets the best response rates”), but for the most part you already have email addresses in a database, some email sending mechanism, and operational dashboards to monitor that the emails you think you are sending really are.

This is that institutional knowledge: the business rules and the technology rules/patterns for how you do something.
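
As a rough illustration of institutional knowledge executing, here is a minimal sketch of the overdue invoice question from earlier. The three-day grace period, the `Invoice` fields, and the function name are all assumptions made up for this example, not rules from any real system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical business rule: remind once the invoice is this many days past due.
REMINDER_GRACE_DAYS = 3


@dataclass
class Invoice:
    due_date: date
    paid: bool
    customer_email: Optional[str]  # the "email on file", if there is one


def should_send_overdue_reminder(invoice: Invoice, today: date) -> bool:
    """Return True when the (assumed) business rule says to remind the customer."""
    if invoice.paid:
        return False
    if invoice.customer_email is None:
        # No way to reach the customer: a gap in what the system knows.
        return False
    return today > invoice.due_date + timedelta(days=REMINDER_GRACE_DAYS)
```

Every branch here is an answer somebody in the business had to give at some point. If nobody can say what the grace period is, that is discovery work, not coding work.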

How much is this functionality different from how the business currently works?

The Law of Software Process (Armour) says

All software development is simply a representation of knowledge that is the real activity of software development.

All software artifacts (requirements documents, design documents, code, test plans, management plans, execution scripts, compiler parameters) are knowledge repositories of various types

(The requirements documents part may sound familiar: Planning The Process Layer Cake).

When developers talk about needing some refactoring to implement a ticket, that could mean that the institutional knowledge currently represented in the software does not match the business’s current way of doing business.

Back to our invoice example, maybe you want to send SMS messages in addition to email. Maybe you don’t currently collect phone numbers at all: to notify people of overdue invoices via SMS now requires some UI changes to collect phone numbers, store them, maybe allow users to set a preference which way they prefer, etc. And likely how you deliver SMS messages is completely unexplored. But the business currently sends lots of email, sending an overdue invoice email is a well trodden path. (There’s even a dashboard with open rates!!)

In short: the better the institutional knowledge model in the software is, the fewer surprises there should be when developing some functionality. That email is easy to send, but sending an SMS sounds like a lot of work with lots of potential for surprise, as there’s little institutional knowledge around it.

If institutional knowledge is drastically changing (“but last sprint you said there was one real person per account, now there’s two?! And how does this affect the overdue invoice reminder feature we’re building this sprint?”), the project is still in growth. Likewise, if technical/security/operational knowledge is drastically changing, it’s likely still in growth (“Waaaiitttt…. We need to pick an email provider to send all these reminder emails…”).

The stable growth plane allows you to deliver features to the end user quicker, with better quality (aka: user experience!) and with better predictability (fewer mid-sprint surprises like “well, turns out this two day feature is 4 weeks of work…”).

How much of this is discovery work?

A project in my past was building a workflow system, but the source of these behaviors was (sometimes) business analysts parsing court documents. (Yes, really). In this case the executable software that came out of that was important, but often equally valuable was the “what does this really mean?” answer we could give our project sponsors. This institutional knowledge was independent of the Ruby code we were writing: some of it could have been turned into posters on office walls.

On the same project, one teammate took on the challenge of teaching the client how to learn, introspect, and think critically about the problem (including failure scenarios!) themselves. One lesson: when modelling software it’s often best - not easy, but best - to focus on the error paths first and the “happy path” last (that being the path through the system without any of the error situations happening).

It’s the institutional knowledge that was important: software is but institutional knowledge executing. It kind of doesn’t matter if that knowledge is executing in Ruby, C++, or someone’s scrap of paper cheatsheet: the institution was discovering its new way of working, then building tools so the folks out there in the field could correctly, easily, and readily follow company best practices.

The flow of information in discovery

How do we understand, document our understanding, then scale that down to the implementable bits? Even the top most item, where we don’t know what questions we should be asking, creates or interacts with institutional knowledge.

The Law Of Software Process calls information discovery levels “orders of ignorance”:

Level 0: “I have the answer, developing a system is a matter of transcribing what I know”
Level 1: I have the question and I know how to [discover] the answer
Level 2: I do not have the question. I do not know enough to frame a question that is contextual enough to elicit a definitive answer.

(There are two orders of ignorance not included here)

While refining and implementing tickets, in a stable growth plane project most of your work should start at that level 1 (“Questions we know how to ask”).

Horizon 1, stabilization, and the innovation dilemma

Stable projects are likely part of Geoffrey Moore’s “Horizon 1” activities. From The Unicorn Project:

Horizon 1 is your successful, cash cow where the business and operating models are predictable and well-known.

A project in the Stabilization growth plane likely needed to account for some growth in horizons 2 or 3 (growth horizons): not everyone is going to want to do it the same way, so platform providers must ensure escape hatches are along the way with adequate support when used. (Be that technology escape hatches or political cover to experiment with how to get the best open rate on that invoice reminder communication!)

The technical foundation

What foundations do we have either to better enable productive developers, or quality code shipped at speed to customers?

CI/CD

In the last 10 years, developer practices like good code review platforms, running tests, and (for some / all environments) automatically deploying on a commit or push went from fancy tech to table stakes. The tooling ranges from traditional “run it on a server yourself” solutions like Jenkins and TeamCity (sometimes with a pricey license attached) to dynamic solutions like Bitbucket Pipelines, GitHub Actions, or (my current favorite for my often unique one-off projects) AWS CodeBuild.

To me the sign of an unstable project is doing some work, running the tests, just to find out the last person who worked on the codebase broke them. So now I’m doing my own work + fixing broken nonsense that wasn’t done right the first time. A stable project is sitting down to the main work branch and knowing that if there are broken tests that I broke them. Knowing the command to release to a pre-prod environment is even better.

Releasing and testing software is certainly one of those “if it’s hurting when you do it, do it in smaller chunks more frequently” activities. Having a way for your development team to get code into pre-prod environments and through the development life cycle in known, easy, and “don’t have to think about it” ways is greatly important to developer experience.

For projects with many microservices in the herd, ideally the CI/CD pipelines share as much code as possible across all the microservices. What you don’t want is having to modify 3 dozen repos if you’re adding some new feature to the build pipelines (maybe someone got the idea to do static code analysis or something!). Using one consistent pipeline allows you to reduce maintenance, gives you a highly leveraged place for technical earned interest and allows you to scale the knowledge and support of your build-engineers (or build-engineer minded engineers).

I would also say that unit tests, at the very least, should be quick - on modern hardware, if a unit test suite is taking > 10 minutes you need to stabilize how you do tests. I have been in situations where compiles or unit test runs took 40+ minutes and those were not productive situations! There’s an entire section of industry working on the problem of developer productivity engineering: making builds go faster.

An integration test suite - running the application in a “kind of close to real” environment - is also important: if your tests are right but you broke a service that depends on you, ideally you find this out waaaaayyy before production! Shout out to all those Software Engineers In Test keeping software engineers honest.

What this foundation enables: it is a bad experience for everyone when a new feature accidentally breaks an old feature - or worse, breaks only with real customer data! Formally understanding the behaviors of your system and testing them is - among other things - a great way to rapidly develop quality software, without a developer having to navigate deep in your app 20 times an hour to check if this control or logic behaves how it should.

Quality and Observability

The bar has also been raised on observability and monitoring of your production services. Can you be paged when an outage is happening? Do you know about it before your customers notice? (Or, for organizations going through a digital transformation, can one team notice and alert on enterprise wide production outages before other teams?) Honeycomb.io’s “what is observability” page is an excellent 2 minute intro to observability and monitoring.

Dealing with problems at scale (“this subsystem is down for everyone”) is one thing, but gray errors are harder to notice and debug (“Huh, our customer support Twitter account gets really active the 29th of every month. How odd”). Something else that requires a bit of forethought and planning: serving the individual customer. Does an engineer, when working with customer service to resolve a support ticket, have enough information to dig into the logs or the observability solution to find out what went wrong? Or does customer support end with a “reboot your iPhone and that might help??! IDK good luck”?

A bit of testing in production goes a long way. Can you do synthetic transactions in critical parts of your system to make sure the common routes are up? Can you run a crawler on your own site to make sure your static site doesn’t link to a missing page? This is the “real life” version of the checks you’re doing with CI/CD!
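
The crawler idea can be sketched in a few lines. This is a minimal, offline version that assumes you already have the rendered pages in memory as a dict of path to HTML. The function name and the dict shape are hypothetical, and a real check would fetch pages over HTTP instead.

```python
from html.parser import HTMLParser
from typing import Dict, List, Tuple


class _LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in a page."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def find_broken_links(pages: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (source_page, missing_target) pairs for internal links.

    `pages` maps a site-relative path (e.g. "/about.html") to its HTML.
    Only links starting with a single "/" are checked; external and
    protocol-relative links are skipped.
    """
    broken = []
    for path, html in pages.items():
        collector = _LinkCollector()
        collector.feed(html)
        for href in collector.links:
            target = href.split("#", 1)[0]  # ignore in-page anchors
            is_internal = target.startswith("/") and not target.startswith("//")
            if is_internal and target not in pages:
                broken.append((path, target))
    return broken
```

Run something like this on a schedule against the deployed site and alert when the list is non-empty.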

What this foundation enables: the quality of customer experience is either the best you can deliver, or customer facing errors are found quickly. Because you don’t want Twitter to find them first.

Software Development Standards

Ideally a software architect/leader/senior+ engineer helps guide the herd away from reinventing the wheel too much: setting up standards and principles, steering teams down the road of excellent code, and avoiding as many common development pitfalls as possible. Unique situations should still be possible, but the development of the system is guided towards commonalities (because there lies goodness).

In a large Microservices herd situation, in my experience, some (ideally a very small percentage!) of the microservices will need to do something special. This means you need a few strategic escape ramps for that occasional time when people need to override the defaults. And if everyone’s overriding the default - maybe even in a similar way - it’s time to take that feedback and adjust how that part of things works, using the same learning loop your teams use when faced with external customers.
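
A minimal sketch of what “defaults plus escape ramps” could look like. The setting names and values in `HERD_DEFAULTS` are invented for illustration, and the override count is just one assumed way to spot a default that is ready to be revisited.

```python
from collections import Counter
from typing import Any, Dict

# Hypothetical herd-wide defaults every microservice starts from.
HERD_DEFAULTS: Dict[str, Any] = {
    "log_format": "json",
    "http_timeout_seconds": 5,
    "run_static_analysis": True,
}


def effective_config(overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Apply one service's overrides on top of the herd defaults.

    Overriding is allowed (the escape ramp), but only for settings the
    platform knows about, so a typo fails loudly.
    """
    unknown = set(overrides) - set(HERD_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown settings: {sorted(unknown)}")
    return {**HERD_DEFAULTS, **overrides}


def override_counts(all_overrides: Dict[str, Dict[str, Any]]) -> Counter:
    """Count how many services override each default (the feedback loop)."""
    counts: Counter = Counter()
    for overrides in all_overrides.values():
        counts.update(overrides.keys())
    return counts
```

If `override_counts` shows most of the herd changing the same setting, the default is probably wrong, not the teams.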

In a less distributed environment, say an iOS app or a frontend app, the patterns might be around, “do we have a standard non-modal dialog component, I want to use it here?”, “how exactly do we structure this markup?”, or “How do we do state machines or workflow management?”

Can a development style (from “where the braces go” to “patterns of where to put the code” to “it is actually faster to do this The Right Way than to do it The Wrong Way”) be created that protects and works to ensure a high quality of codebase? Can entire classes of errors be avoided because the development standards almost won’t let developers be lazy in certain ways? Yes! Ask me how I know.

Can we also scale specialized developer expertise through standards, communication channels or (better yet!) tools? Yes, but that may be the topic of another blog post.

Scaling / introducing development standards in a Microservices herd: but how?

I’ve mostly done this in green-field-ish environments, so if you have to make changes across a dozen microservices already in the wild, some form of adoption planning is likely your starting point.

Reference Implementations / Developer Starter Packs

Personally I’ve gotten really far with scaling development standards through reference implementations and developer starter packs. If you know applications will approximately follow the developer starter pack, you can identify, extract and abstract away cross cutting attributes and capabilities.

In addition to things like “this is where we put various kinds of source files”, a developer starter pack could include “This is how you interact with our CI/CD process”, or “this is how we, traditionally, solve this common problem”.

An example of a common problem is validating that incoming requests are acceptable vs. just plain garbage. If you are writing Node Microservices you might use express-validator for that. An example function or file showing developers how to do that doesn’t hurt, and is easily deleted.
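
To give a feel for the kind of example file a starter pack might carry, here is a minimal sketch of request validation in plain Python. It does not use express-validator (that is a Node library), and the payload fields and function name are assumptions for illustration only.

```python
from typing import Any, Dict, List


def validate_reminder_request(payload: Dict[str, Any]) -> List[str]:
    """Return a list of human-readable problems; an empty list means acceptable."""
    errors: List[str] = []

    invoice_id = payload.get("invoice_id")
    # bool is a subclass of int in Python, so reject it explicitly.
    if (
        not isinstance(invoice_id, int)
        or isinstance(invoice_id, bool)
        or invoice_id <= 0
    ):
        errors.append("invoice_id must be a positive integer")

    email = payload.get("email")
    # Deliberately loose check: real email validation is its own topic.
    if not isinstance(email, str) or "@" not in email:
        errors.append("email must be a string containing '@'")

    return errors
```

A request handler would call this first and reject the request when the list is non-empty, before any business logic runs.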

The starter kit’s pom.xml / package.json / Gemfile / requirements.txt being loaded with these common libraries for common idioms means one less thing developers have to worry about: “do we have a pattern / library I can use for this somewhere, do I have to find my own, or am I writing supporting code myself?”.

Common Herd wide libraries

Everyone using a set of common supporting libraries means the economies of scope and chances for technical earned interest become very high.

Examples of this are everyone logging with the same logging library or logging configuration (to encourage structured logging), parent POMs, web request making libraries, etc.
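
As a sketch of the shared logging idea, a herd-wide library could expose one function that hands every service a logger with the same structured (JSON) output. The names `JsonFormatter` and `get_herd_logger` and the choice of fields are hypothetical. Only Python’s standard `logging` and `json` modules are used.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        return json.dumps(entry)


def get_herd_logger(service_name: str) -> logging.Logger:
    """Return a logger configured the same way for every service."""
    logger = logging.getLogger(service_name)
    if not logger.handlers:  # avoid stacking handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter())
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A service would then call `get_herd_logger("billing").info("reminder sent")` and never touch formatter configuration. Changing the log shape for the whole herd becomes a one-library change.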

Knowledge Transfer Opportunities

Outside of architecture, I highly suggest a good developer wiki (perhaps even with planning documentation from the process layer cake to fill in the “why?”), from which you can grow good onboarding documentation. Onboarding buddies are another excellent way to get developers (especially in these highly remote times!) plugged into developer peer communities in the project. Wiki documentation helps scale knowledge to new team members quickly and easily.

The Unicorn Project also talks about how frustrating it is to have a complicated, manual local setup process. A smooth dev machine setup experience is so important: compare “clone these two Git repositories, run this Docker Compose file, then do the three things this shell script tells you to do” with having a developer struggle for weeks to even start their work.

Beyond onboarding and tooling, get / keep developers talking to each other in situations that don’t require scrum master or management supervision. I’m very much a fan of peer-led Special Interest Group meetings where developers share what they’re working on. In such a setting a casual comment could turn into an opportunity for one team to help another get unstuck (ask me how I know). You may also discover in this meeting that team A is doing the same thing as team B. That’s a chance for collaboration, consolidation, or idea strengthening!

Maybe even call it something funny, like Developer Only Wing Night. Everyone leaves on a DOWN note (when they hear that wings are not actually provided, maybe ;) )

Allow Developers To Focus More On Stuff That Matters To The Customer

There are sometimes entire areas of knowledge you can abstract away from developers. I’ve always wanted to work at a shop that uses Cloud Foundry: “You mean, just cf push and I don’t have to worry about digging out my Dockerfile knowledge??!! Sign me up!”. Or, another place I’d love to have my life automated away a bit: personally I find myself wanting to create static websites on AWS for myself, backed with CloudFront, S3, and CodeBuild/CodePipeline, but I don’t do it often enough with CloudFormation to hook that up right, so it remains undone.

Sure, we can set up large scale training programs that teach AWS or Docker, with online webinars or in-person training. It’s how I learned Android development: a one week in-person Big Nerd Ranch training program! Or there are peer-led lunch & learns or troubleshooting Confluence pages.

However, can a characteristic of your technical foundation be: “Hey, we can automate the weird incidental stuff for you, application developer! You have a hard enough job writing good quality code for the customer on a deadline, can we lighten your burden?”

What this foundation enables

Deming’s Out Of The Crisis says:

The aim of leadership should be to improve the performance of people and machine, to improve quality, to increase output, and simultaneously to bring pride of work to people

Stable technical foundations can:

Remove barriers that rob people of pride of workmanship (Out of the Crisis p 77; Phoenix Project p 44)

Leadership has a responsibility to put folks in a place where they can be successful. (Muhammed Meboob)

Maybe we call this whole thing (CI/CD, Quality and Observability, and dev standards) “developer experience”, or “foundational developer experience”.

The organizational foundation (influence)

Every org is different. Sometimes every department within a large corporation is different. A team, or set of teams, may have different “ways of working”. What works for this team / culture may not work in another place, Because Reasons(TM). This may be an explicit bit of organizational design, or it may be implicit culture and organizational design.

What your org or culture looks like isn’t a characteristic of the stable growth plane, but an influencer of what your implementation looks like.

Seeing Organizational Patterns, by Keidel, focuses on tradeoffs and balances between three things: three different parts of a hypothetical triangle. An example of a hypothetical triangle is the project management triangle: “Do you want it cheap, good or fast, pick two?”. (I feel the more Agile way of saying it is pick two: date, scope or cheap?, but that’s neither here nor there.)

The hypothetical triangle Keidel focuses on is Autonomy vs Control vs Cooperation.

Autonomy vs control is the classic field vs headquarters dilemma: nitty gritty vs big picture. Those in the field are “where the rubber meets the road”, as the famous tire commercial used to put it. They are in touch with customer needs and geographical nuances in a way that corporate managers can rarely be. What field personnel tend to lack, however, is a view of the whole….

Autonomy vs cooperation is equivalent to accountability vs synergy - the individual vs the group. The more an organization stresses individual or unit accountability, the less likely it is to benefit from spontaneous cooperation among individuals or units. Conversely, the greater the commitment to synergy, the more difficult it is to sort out each player’s contribution.

It’s sometimes easy, when looking from the perspective of an entire Microservices herd - or worse, multiple Microservices herds! - to find the wrong balance of Autonomy, Control and Cooperation.

When designing systems like this myself, I tend to lean more towards small bits of localized autonomy in defined parts - escape hatches in case a team has to do something Weird - but leveraging a fair bit of centralized control. However, that approach may not work when supporting a herd or herds with large technical sprawl (a likely outcome of high autonomy).

Small, isolated bits of functionality (microservices), sometimes created by small isolated teams, can feel fast and great, but see the summary of a Susan Fowler-Rigetti talk: Six Challenges Every Organization Will Face When Implementing Microservices. Managing these challenges while keeping in mind the ideas of Autonomy vs Control vs Cooperation means that some of these stabilizing principles, foundations, and tradeoffs look different (what feels like) every time you turn around. However, I posit that they will still be there, in some aspect, as a characteristic of being in the stable growth plane.

Conclusion

Software is institutional knowledge executing, whose implementation depends on the technical foundations it’s running on, and the organizational culture it’s running within. Constraints and uniqueness at every turn.

References

I cover a lot of material from a lot of different books here, so a list of my sources is likely important: