Wednesday, May 14, 2014

Post Hoc Ergo Propter Hoc

You really should have taken Latin in school.

You'd have undoubtedly discovered the phrase "post hoc ergo propter hoc." And you'd use those words every day in your IT career. It means, "after this, therefore because of this." It's a logical fallacy (which means it's a great way to pepper your conversation at hipster dinner parties) that concerns perceived causality between sequential events.

Why the hell are we talking about this?

Because in IT, when something breaks, the first thing we wonder aloud is, "what recently changed?" It's step 1 in the Troubleshooting 101 handbook. And for good reason: changes frequently have unintended consequences. So knowing the recent changes can help you find the source of the problem.

But here's the dark side of that logic. If you put a change into production that another engineer disapproves of, you're in for trouble. You can bet that every incident and outage afterwards will be blamed on your change (and, transitively, you). Now that thinking from above comes back to bite you: "well, the outage occurred after that change you put in, so your change caused the outage." Post hoc, ergo propter hoc.

An Example

Years ago, when it was not uncommon to have dozens of Windows Server 2003 VMs in your production environment, I worked as a systems engineer in an application hosting shop. I noticed lots of events in the application logs that indicated problems closing registry handles when users logged off their RDP sessions. I'd seen this problem before, so I prepared a change request to install the User Profile Hive Cleanup Service (UPHClean) on these VMs to fix the problem. Easy. Basic.

The proposed change was met with bemused hand-wringing. "Why are we doing this?" "Can't we just ignore those errors?" But the change had been tested and approved in our non-production environment, so the change manager OK'd the request. And sure enough, all of those registry errors went away.

A week later, we had an outage on a SQL server that took out an application. Immediately, the UPHClean process was blamed. "Well, the outage happened after your change, so your change caused the outage." Post hoc, ergo propter hoc.

The Point

Don't fall for this trap. It's perfectly acceptable to ask "what changed?" when troubleshooting a problem, but be careful about making the leap from "what changed" to "the change must be the problem." That leap leads to thrashing and flailing, and it can obscure the true root cause of the outage.
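
If you want that discipline in something more concrete than a Latin phrase, here's a rough sketch in Python. It is not a real change-management tool, just an illustration: the Change record, the suspects function, the host names, and the dates are all made up for the example, and it assumes (for illustration only) that the SQL server was not one of the VMs that received UPHClean. The idea is that "happened before the outage" gets a change onto the list, but only "actually touched the thing that broke" makes it a real suspect.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Change:
    """A hypothetical change record; the fields are assumptions for this sketch."""
    name: str
    applied_at: datetime
    hosts: frozenset  # hosts the change actually touched


def suspects(changes, outage_at, outage_host, window=timedelta(days=14)):
    """Split recent changes into real suspects and mere predecessors.

    A change is "recent" if it was applied within `window` before the outage.
    Only recent changes that touched the failed host are returned as suspects;
    the rest merely happened earlier, which is not evidence of anything.
    """
    recent = [
        c for c in changes
        if timedelta(0) <= outage_at - c.applied_at <= window
    ]
    touched = [c for c in recent if outage_host in c.hosts]
    merely_earlier = [c for c in recent if outage_host not in c.hosts]
    return touched, merely_earlier


# Made-up data loosely modeled on the story above.
changes = [
    Change("install UPHClean", datetime(2010, 3, 1),
           frozenset({"rdp-01", "rdp-02"})),
]

touched, merely_earlier = suspects(changes, datetime(2010, 3, 8), "sql-01")
print("suspects:", [c.name for c in touched])
print("merely earlier:", [c.name for c in merely_earlier])
```

With this made-up data, the UPHClean install lands in the "merely earlier" pile. It's still worth a glance, but it has to earn its way onto the suspect list with actual evidence.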