At DevOpsDays Boston 2017, Matthew Boeckman presented on how emergent behavior in complex systems requires us to re-think our root cause analysis paradigms. His slides are here. I also had a great time talking meta with Matthew after hours, but that's for a later post.
Traditional RCA in Complex Systems
Unfortunately, traditional RCA focuses on what and who. Despite its roots in NASA, in the software world RCA is misaligned: it is built to find only one channel of causality. A fishbone diagram shows this:
This might be okay for simple systems (i.e. 3-tier web/app/data servers), but there's much more to real systems: networking, hosting, and operating environments. Beyond that, users access them in both benign and malevolent ways.
Waterfall encouraged us to minimize complexity by locking down state (i.e. promoting a "don't change" mentality). Waterfall (think 12-month cycles) also encourages us to think that change is the developer's fault. And there were a lot of constraints in the '80s and '90s, most of which no longer apply.
Root cause is fine for static models, but it breaks down when it comes to "lots of boxes": cloud-based, dynamic, distributed systems. It's very hard to trace the source of problems in this new world. Change vectors (A/B testing, reconfigurations, migrations, feature flags) abound; in fact, they're encouraged.
Our systems are far more complex than they were 20 years ago. They involve the whole stack, the whole team, and the whole organization.
Paul's Take: Occam's Anti-Razor
A heuristic we often employ is Occam's razor: in general, the simplest answer is often the right one. Coupled with confirmation bias, we (humans) often look for a single causal root to the problems we see. Then we build processes that inherit our bias. But what if operational failures occur because of multiple causes, chain reactions that exceed the typical "5 whys" RCA model?
Almost as quickly as the razor was introduced, Chatton, a contemporary, countered it with: "If three things are not enough to verify an affirmative proposition about things, a fourth must be added, and so on." Similarly, many ascribe a balance of simplicity and complexity in problem solving to the quote "Make things as simple as possible, but no simpler," often attributed to Einstein.
The idea is right fit: the right fit of simplicity and complexity to the problem at hand. With complex systems, we can't always assume that the simple answer is the most useful one in future scenarios.
Our Systems Aren't Trees, They're Forests
Emergence is about collective behaviors in the systems we connect and integrate over time, not simply the aggregate of behaviors emitted by individual subcomponents and nodes.
We need to develop, test, deploy, monitor, and resolve issues in them as the complex, semi-organic systems they are: part of an ecosystem of services and fallible subsystems. We can no longer afford to ignore better paradigms for dealing with them.
Enter Systems Thinking. Understanding why things emerge takes more than an ops dashboard and intuition. Sometimes analysis of complex problems requires a multivariate perspective.
Paul's advice: systems thinking is a much broader topic; if you haven't actually studied it, it would serve you well to listen to The Fifth Discipline by Peter Senge. While preparing my April presentation on IoT testing, I realized that systems thinking is a necessary mental tool moving forward.
Systems thinking helps us identify the activities, interactions, and ultimately the change vectors contributing to emergent behaviors. Understanding which dials and levers are involved in a problem enables later actions to resolve it. This feeling of being at home in the problem space is also similar to "cynefin," a Welsh/Gaelic term that, in Scottish (my heritage!), means:
"a place to live and belong. where the nature of what's around you feels right and welcoming"
Not at all coincidentally, the Cynefin framework as applied to emergent behavior helps us make quick decisions during and about incident management situations.
Staying Ahead of Emergent Behavior
The fact is that most workforces, small or large, are a revolving door. So is your current system state after multiple releases and infrastructure migrations. There be monsters. Software is dynamic, and so should your product discovery process, your learning loops, your incident management model, and so on.
The Cynefin framework gives us this quadrant visual to show that different kinds of issues need to be addressed differently:
The fact is, each of these quadrants assumes two things:
- The issue occurred already, so you need to fix it and learn from it
- Information needs to be radiated (sensed) to make "sense" of it
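Those two assumptions can be made concrete in how an incident record is shaped. Here's a minimal sketch in Python; the field names, the `Incident` type, and the example incident are all hypothetical illustrations, not fields from any real issue tracker:

```python
from dataclasses import dataclass
from enum import Enum

class CynefinDomain(Enum):
    """The Cynefin framework's decision domains."""
    OBVIOUS = "obvious"            # sense, categorize, respond
    COMPLICATED = "complicated"    # sense, analyze, respond
    COMPLEX = "complex"            # probe, sense, respond
    CHAOTIC = "chaotic"            # act, sense, respond

@dataclass
class Incident:
    summary: str
    resolved: bool                    # the issue already occurred and was fixed
    initial_domain: CynefinDomain     # how it looked when first triaged (sensed)
    resolved_domain: CynefinDomain    # how it was actually resolved, after the fact

# Hypothetical example: an incident that looked obvious at triage
# but turned out to be emergent/complex once investigated.
incident = Incident(
    summary="Checkout latency spike after feature-flag rollout",
    resolved=True,
    initial_domain=CynefinDomain.OBVIOUS,
    resolved_domain=CynefinDomain.COMPLEX,
)
print(incident.initial_domain != incident.resolved_domain)  # True
```

A mismatch between the two domain fields is exactly the learning signal the bullets above point at: the issue is fixed, and the sensed information tells you your first read was wrong.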
In my after-hours chat with Matthew, we dove into the issue of metrics. Measuring issues goes beyond mean-time-to-resolution (MTTR). Issues flagged with *how* they were resolved, using Cynefin categories, open up an opportunity for improvement.
Paul's Take: could this be a JIRA custom field? Just thinking out loud.
Tracking the delta on a specific issue (the approach someone thought should be used at first vs. what would have been better after the fact) is a way to measure success and improvement on a spot basis.
Then, over time, aggregates can show team and organizational responsiveness to dynamic, emergent behavior. Though neither of us has customer anecdotes or proof-of-concept clients, I challenge you, the reader, to try it out for a few sprints or whatever intervals you use.
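The spot-delta and aggregate ideas above can be sketched in a few lines of Python. Everything here is hypothetical (the sample records, the interval names, the `delta_rate_by_interval` helper); it's just one way to turn "initial category vs. actual category" into a trend you could watch across sprints:

```python
from collections import defaultdict

# Hypothetical resolved issues: (interval, initial_category, actual_category),
# where the categories are Cynefin domains recorded at triage and at retrospective.
issues = [
    ("sprint-1", "obvious", "obvious"),
    ("sprint-1", "obvious", "complex"),
    ("sprint-1", "complicated", "complex"),
    ("sprint-2", "complex", "complex"),
    ("sprint-2", "obvious", "obvious"),
    ("sprint-2", "complicated", "complicated"),
]

def delta_rate_by_interval(records):
    """Fraction of issues per interval whose first-guess Cynefin category
    differed from the category that actually fit after resolution."""
    totals, deltas = defaultdict(int), defaultdict(int)
    for interval, initial, actual in records:
        totals[interval] += 1
        if initial != actual:
            deltas[interval] += 1
    return {interval: deltas[interval] / totals[interval] for interval in totals}

print(delta_rate_by_interval(issues))
# sprint-1 has a 2/3 delta rate; sprint-2 has 0.0 -- a falling rate suggests
# the team is getting better at recognizing emergent behavior up front.
```

The same computation would work whether the category lives in a spreadsheet column or, as mused above, a tracker's custom field.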
We need to embrace emergent behavior and learn to approach incidents better using systems thinking and frameworks like Cynefin. Unlike with traditional RCA, we'll need to step out of our comfort zones, see what works, and learn from our mistakes.