Sunday, June 22, 2014

A Drift Into Failure

I'm still working towards catching up on my Christmas ready. I already wrote my missives on Thinking in Systems and A Pattern Language; next up is the DevOps favorite Drift into Failure.

The basic premise of Drift is that failures, even massive ones, don't (usually) happen because of a vast conspiracy or from the deeds of evil people. Massive failures occur from behavior that is considered completely normal, even accepted, as part of a daily routine. These routines give our perspectives tunnel vision and often don't allow us to see the underlying issue. Production goals, scarce resources and pressure on performance causes drift in these routines that slowly erode safe practices.

Fatal aircraft crashes and space shuttle disasters are often quoted in the book, however every operations or software engineer in IT has seen this play out a gazillion times before. The site goes down on a regular basis... and no one knows quite why. After digging and pushing new code and re-pushing bug fixes for many sleepless nights, one often finds out that the outage was due to a routine maintenance task gone awry. Maybe a query optimization cache was manually flushed within the production RDBMS, causing the entire cluster to freak out and create a bad query plan. It seemed perfectly sane at the time and even if every single person knew this was going to happen the day before, it likely wouldn't have been caught.

Drift points out how remediation and "root cause" reporting is often fruitless. The concept of high-reliability organizations was pushed in the 1980's as an entire school of thought, focused on errors and incidents as the basic units of failure. Dekker demonstrates that "the past is no good basis for sustaining a belief in future safety," and such a focus on root-cause analysis often does not prevent future incidents. The traditional "Swiss Cheese Model" for determining cause has attempted to see where all of the holes within established safety procedures line up, so as to create a long gap through which problems can drive themselves through. This type of reductionist thinking where atomic failures create linear consequences has turned out not to be predictive after all - instead we need to look at things through the lens of probability.

One of the best practices that anyone, including those supporting enterprise software, can encourage to avoid failure is to be skeptical during the quiet times and always invite in a wide range of viewpoints and opinions. Overconfidence can be your downfall, and dissent is always a healthy way to get new perspective. Dekker quotes Managing the Unexpected to point out that "success... breeds overconfidence in the adequacy of current practices and reduces the acceptance of opposing points of view." Those that were not technically qualified to make decisions often were the ones that made them, or outside pressures (event subtle ones) caused trade-offs in accepted practices. Redundancies that were supposed to make things highly available often make systems more complicated and, in turn, actually make them more likely to fail.

The best way to avoid a drift into failure is to invite outside opinion, even bring in disparate practice groups. Take minority opinions seriously. Don't distill everything to a slideshow. Be wary of adding redundancies and failsafes - often the most simple solution will be the most reliable. The recent re-invigoration in microservices is a great example of this - by simplifying the pieces of a complex system, we can allow each component to work in isolation and ignore the remainder of the system. This allows the system to grow, adapt and evolve without support systems usually provided for monolithic software stacks.

Drifting into failure occurs when an organization can't adapt to an increasingly complex environment. Never settle, always embrace diversity and keep exploring new ways to evolve. A great quote from Dekker is "if you stop exploring, however, and just keep exploiting [by only taking advantage of what you already know], you might miss out on something much better, and your system may become brittle, or less adaptive."