I was recently on the Tiny DevOps podcast and mentioned the Disaster-Release Ratio metric. Although I’ve written about it before, the podcast made me decide to dedicate a separate article to it.
Before we continue: I came up with this metric while working with a client, but someone else may well have thought of it before me and simply named it differently. If so, feel free to point this out to me.
What is a Disaster?
The Disaster-Release Ratio is a simple metric that focusses on “disasters.” These are bugs that make it to production that the team or company sees as showstopper bugs. Bugs that need an immediate fix. So either a hotfix or a rollback to the previous version of the software.
I’m deliberately not talking about less critical bugs where the user or the system isn’t blocked from performing the necessary tasks. If the bug can be fixed in the next release, I won’t consider it a “disaster.”
Sometimes, hotfixes can be combined, but you should still count the bugs separately. So even if 3 bugs are fixed with one new hotfix release, you still count 3 instead of 1.
What is a Release?
Just to be clear on the definition: a release happens every time you deploy a system to production. I wouldn’t count deployments to staging, QA and testing environments as releases. Usually, bugs in these environments won’t be considered disasters, but of course, your specific context may vary.
Don’t factor in any hotfix releases, because that will skew your ratio in a positive way, and we’re trying to detect a problematic situation.
What is the Disaster-Release Ratio?
Now that we have our definitions, we first need to track these numbers for a while. Or maybe you can check your historical records and build up the numbers after the fact.
Group these numbers for a given time period and then divide the number of disasters by the number of releases:
DRR = #disasters / #releases
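To make the calculation concrete, here is a minimal Python sketch. The function name, the input shape and the sample data are my own assumptions for illustration, not part of any tool. It expects one period label per disaster (so three bugs fixed by one hotfix show up as three entries) and one period label per regular release (hotfix releases left out), following the definitions above.

```python
from collections import Counter


def drr_by_period(disaster_periods, release_periods):
    """Return {period: DRR} where DRR = #disasters / #releases.

    disaster_periods: one label per disaster, e.g. "2024-01".
    release_periods: one label per regular (non-hotfix) production release.
    Periods without any regular release are skipped, because the ratio
    is undefined when there are zero releases.
    """
    disasters = Counter(disaster_periods)
    releases = Counter(release_periods)
    # Counter returns 0 for periods that had releases but no disasters.
    return {period: disasters[period] / count for period, count in releases.items()}


# Hypothetical data: one release per month, three disasters in January, one in February.
example_releases = ["2024-01", "2024-02"]
example_disasters = ["2024-01", "2024-01", "2024-01", "2024-02"]
print(drr_by_period(example_disasters, example_releases))
```

Grouping by month is just one choice. Any period works, as long as you use the same one for both counts.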
What Do I Do With It?
Now that we have our numbers, what can we do with them? The idea of the Disaster-Release Ratio is to get it as low as possible. Ideally, it’s just zero.
In the above example, we have a really bad situation where each release has showstopper bugs and most have more than one.
As the ratio is calculated with only two numbers, there are two ways of getting the number down.
Reducing the number of bugs is something we all want, right? Of course you should put effort into this. But how?
Start by identifying what is causing these bugs. Too much time pressure on the team? Inexperienced developers? Too much technical debt making their work impossible? Talk with the team and see what steps are necessary to fix it.
Introducing automated testing has been one of the easiest and cheapest ways of reducing bugs. While there is a learning curve that might slow developers down in the beginning, they will quickly start reaping the benefits.
The number of bugs will drop, reducing stress before and after every release and giving developers more time after a release to focus on the features for the next release.
In our example, the next months could look like the table below. I didn’t change the number of releases, but because the number of disasters drops, our ratio drops too.
Another way to get the Disaster-Release Ratio down is to increase your releases. Let’s show this in our table again:
We’ve moved from one release per month to two. I’ve left the number of disasters as it was, and because of our extra releases, our ratio drops.
Here’s where it gets interesting. You might say this doesn’t matter because we still have the same number of bugs and the same amount of developer time spent fixing them. And while that is true, increasing the number of releases in a given time period might have pushed you to change the way you work and release for the better.
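The arithmetic behind both levers is easy to check with a few lines of Python. The numbers here are made up for illustration and are not the figures from my tables; the function name is also just an assumption.

```python
def drr(disasters: int, releases: int) -> float:
    """Disaster-Release Ratio for a single period."""
    if releases <= 0:
        raise ValueError("need at least one regular release in the period")
    return disasters / releases


# Hypothetical month: 2 disasters and 1 regular release.
baseline = drr(disasters=2, releases=1)

# Lever 1: same release count, fewer disasters.
fewer_bugs = drr(disasters=1, releases=1)

# Lever 2: same disaster count, twice as many releases.
more_releases = drr(disasters=2, releases=2)

print(baseline, fewer_bugs, more_releases)  # prints: 2.0 1.0 1.0
```

Both levers land on the same ratio in this toy case, which is exactly why the second one looks like cheating at first glance.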
Very likely, you’ll have to invest more in automating your releases and making the process easy, fail-proof and fast. Ideally, you just push a button and things get deployed without any other manual action. This greatly reduces the risk of human error.
More frequent releases also narrow down the possible causes of a bug. If you do a big release with many changes to the code, finding the exact cause of a bug can end up being quite difficult. With so many pieces of the code changed, developers will need more time to find which change caused it. The change could also have been made several weeks ago, forcing developers to dig deep in their memory to remember what they were thinking at the time.
However, if your release only contains code changes from the past week (or less!), it should be fairly easy to find the code change that introduced the bug. And any thought processes and discussions can still be fresh in the team’s memory.
Reduce Your Disaster-Release Ratio
Measuring your Disaster-Release Ratio isn’t very hard. It might take you some time to gather the necessary numbers if you’re not tracking how many bugs made it to production though.
But the metric points us to two actions that are almost required to call yourself a DevOps team: automate tests and prefer small and regular releases over large releases with long periods of time between them.