Incident Analysis: Your Organization's Secret Weapon

I want to do a quick activity. Part of doing an incident review is coming from a position of inquiry. Why did something make sense? What information is missing? Who can I ask to find out more? We're going to take a few minutes to review an RCA. The purpose of this is not to call out bad examples. I want to see if it satisfies your curiosity. Here's the RCA in question. We have the event date, when it started, when the incident was detected, and the time it was resolved. We have our root cause: our DNS host scheduled a normal reboot for security patches, resulting in a simultaneous outage of DNS servers in the environment. As a result, all the connections between all hosts in the environment failed due to DNS lookup issues. They provide some more info. They share some of the customer impact. Then they list some of their action items.

That RCA we looked at had a lot of numbers in it. I know we want to quantify a lot of our incident reviews. I want to quote something that John Allspaw said back in 2018: "Where are the people in this tracking, and where are you?" The metrics that we're tracking today, like MTTR, MTTD, incident start, incident end, and detection, give us some interesting data points. But what are we actually learning from them? What is the purpose of recording those metrics? Allspaw posed this as an open question and a challenge to us, and we haven't changed much as an industry in this regard. Gathering useful data about incidents does not come for free. It's not easy. You have to give it time and space.

I'm going to talk about why giving this time and space in your organization can actually work in your favor, and I'll do it through multiple stories. I'll cover new paths for how you can do it in ways that aren't disruptive to your business, and next steps for you to embark on. There's a spoiler alert here, which is that sometimes a thorough analysis reveals things that we're not ready to see, hear, or change. It can be a mirror for our organization. The actual work involved is making sure it's safe to share these things so that we can actually improve.

When I was at Netflix, I was on a team with three other amazing software engineers. We spent years building a platform to safely inject failure in production, to help engineers understand and ask more questions about areas of their system that behaved unexpectedly when presented with the turbulent conditions we see in everyday engineering, like failures or added latency. It was amazing. We were happy to be working on such an interesting problem that could ultimately help the business understand its weak spots. There was a problem with this, though. Most of the time, the four of us were the ones using the tooling. We were using the tooling to create the chaos experiments, we were using it to run them, and we were the ones analyzing the results.

How did we start to do that? To know if something was or wasn't important to fix, we started looking at previous incidents. We started digging through some of them and wanted to find things like systems that were underwater a lot, or people that we relied upon a lot, or systems whose failure resulted in an incident that was a huge surprise, or incidents that were related to action items from previous incidents. We wanted to use this to bubble things up, help folks prioritize and give them context, and feed it back into the chaos tooling. However, through the process of doing this, we found that looking through incidents, studying these patterns, and learning from them had much greater power than just helping the organization create and prioritize chaos experiments better. Spending time on it opened my eyes to so much more: things that could help the business beyond just the technical, beyond just helping this chaos engineering tooling.

Here's the secret that we found. Incident analysis is not actually about the incident. The incident itself is a catalyst to understanding how your organization is structured in theory, versus how it's structured in practice. It exposes that delta for you. It's a catalyst to understanding where you need to improve your socio-technical system and how people work together. Because when something's on fire, all rules go out the window. You're just trying to stop the fire. It's a catalyst to showing you what your organization is good at, and what needs improvement. We all have thoughts around this, but the incident actually exposes them. It's a missed opportunity if we don't look into it.