Wednesday, May 28, 2014

Some Advice on Testing

I've run into a few oddities lately regarding testing, and thought I'd fire off a post on the subject before I clear /tmp in /dev/brain for the day.

Testing a Negative

When you're creating a connectivity test, you should test with the expectation of receiving a response. For example, write your script to not just ping a host, but to expect a response for each ping. Do NOT write your test script to expect NO response. If you do, the test can no longer tell "blocked on purpose" from "unreachable for some other reason": a dead link or a downed device produces the same silence. I've seen a firewall testing script written this way; it failed all the time (pretty sure it was trying to ping a site with bewbs), but the operations team expected it to fail, and the failure was considered a success. This approach backfired when the firewall was offline for a few days before anyone realized it. Oops.
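Here's a rough sketch of what "expect a response" can look like in Python. To be clear about the assumptions: it uses a TCP connect rather than an ICMP ping (so it runs without special privileges), and the names responds and all_probes_answered are made up for this post, not from any tool mentioned above.

```python
import socket


def responds(host, port, timeout=2.0, connect=socket.create_connection):
    """Return True only if a TCP connection to host:port succeeds.

    Assumption: a TCP connect stands in for the ping in the text.
    """
    try:
        conn = connect((host, port), timeout=timeout)
    except OSError:
        # Refused, timed out, DNS failure, no route: all count as "no response".
        return False
    conn.close()
    return True


def all_probes_answered(host, port, count=4, probe=responds):
    """Pass only when every one of `count` probes gets a response."""
    return all(probe(host, port) for _ in range(count))
```

Something like all_probes_answered("intranet.example.com", 443) (hypothetical host) only passes when every probe comes back. Silence is always a failure, never a success.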

A Valid Test

If you're going to test, make sure the test is valid. I recently saw a web application test plan that neglected to test the new www server via https. The test passed, so the change was put into production. And wouldn't you know it, mod_ssl was missing from the new web server. Hilarity ensued. I mean panic and thrashing.
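A minimal sketch of a check that would have caught this, assuming the site should answer with a 200 on both schemes. The function names and the example hostname are hypothetical; only the standard library is used.

```python
import http.client
import urllib.error
import urllib.request


def fetch_status(url, timeout=5.0):
    """Return the HTTP status code for url, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, just not with a success code.
        return err.code
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # No usable answer at all (e.g. nothing is listening for https).
        return None


def check_site(host, schemes=("http", "https"), fetch=fetch_status):
    """Map each scheme to True if the site answered it with a 200."""
    return {scheme: fetch(f"{scheme}://{host}/") == 200 for scheme in schemes}


def site_passes(results):
    """The test plan passes only if every scheme passed."""
    return all(results.values())
```

With this shape, site_passes(check_site("www.example.com")) fails if https doesn't answer, even when plain http looks perfectly healthy.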

That's it. A quickie today.

Wednesday, May 14, 2014

Post Hoc Ergo Propter Hoc

You really should have taken Latin in school.

You'd have undoubtedly discovered the phrase "post hoc ergo propter hoc." And you'd use those words every day in your IT career. It means, "after this, therefore because of this." It's a logical fallacy (which means it's a great way to pepper your conversation at hipster dinner parties) that concerns perceived causality between sequential events.

Why the hell are we talking about this?

Because in IT, when something breaks, the first thing we wonder aloud is, "what recently changed?" It's step 1 in the Troubleshooting 101 handbook. And for good reason: changes frequently have unintended consequences. So knowing the recent changes can help you find the source of the problem.

But here's the dark side of that logic. If you put a change into production that another engineer disapproves of, you're in for trouble. You can bet that every incident and outage afterwards will be blamed on your change (and, transitively, you). Now that thinking from above comes back to bite you: "well, the outage occurred after that change you put in, so your change caused the outage." Post hoc, ergo propter hoc.

An Example

Years ago, when it was not uncommon to have dozens of Windows Server 2003 VMs in your production environment, I worked as a systems engineer in an applications hosting shop. I noticed lots of events in the application logs that indicated problems closing registry handles when users were logging off of their RDP sessions. I'd seen this problem before, so I prepared a change request to install the User Profile Hive Cleanup Service (UPHClean) on these VMs to fix the problem. Easy. Basic.

The proposed change was met with bemused hand-wringing. "Why are we doing this?" "Can't we just ignore those errors?" But the change had been tested and approved in our non-production environment, so the change manager OK'd the request. And sure enough, all of those registry errors went away.

A week later, we had an outage on a SQL server that took out an application. Immediately, the UPHClean process was blamed. "Well, the outage happened after your change, so your change caused the outage." Post hoc, ergo propter hoc.

The Point

Don't fall for this trap. It's perfectly acceptable to ask "what changed?" when troubleshooting a problem, but be careful about making the leap from "what changed" to "the change must be the problem." It can lead to thrashing and flailing, and can obfuscate the true root cause of the outage.

Wednesday, May 7, 2014

Building a New Home Lab

IT engineers and administrators loathe discussions about budgets. To us, it's just something that management should worry about. We just want to keep the systems up and running, and to play with (er, implement) some cool technology so our skills don't stagnate. We think, "You worry about the dollars, we'll worry about the infrastructure."

Except it doesn't work that way.

If you've ever wanted some perspective on IT budgets, challenge yourself to build and fund a new home lab. All of a sudden, every dollar matters.

I'm in the process of building a new lab, so I'm doing some research to see what I'll need to purchase in order to have a decent VSAN environment at my disposal. This is a well-covered topic: Duncan Epping has discussed hardware selection for VSAN, and Chris Wahl covered some options, too. But as you read through these articles, you'll soon realize that the technology is the easy part.

Start pricing out some of the solutions and you'll learn that a VSAN lab is significantly more costly than a vanilla vSphere lab. For a vSphere home lab, you could pick up a pair of servers from Craigslist for a few hundred bucks. Hell, you could even pick up a NetApp FAS2020 on eBay with a disk shelf for about $600. Add in a few networking devices, and you could have a serious home lab for under $1,000. But VSAN depends on new technologies that prevent you from just selecting a server based on the number of drive bays.

You'll need to refer to the VSAN HCL early and often so that your disk controllers, HDDs, and SSDs are VSAN ready. Sure, you can get VSAN to run on devices that aren't listed there, but if you're going to invest in a home lab, you should invest in the certified hardware. Otherwise, just go to VMware's HOL and whet your VSAN appetite there.

You'll quickly learn that budget is everything. Because now it's YOUR money. It's not just a line item on a spreadsheet that lives on a network share.

So over the next few weeks, I'll be sharing my research and designing my lab environment. And this time, the biggest design constraint will be cost.

Sunday, May 4, 2014

My Experience as a Thwack Ambassador

I just wrapped up my time as a Thwack Ambassador, and wanted to share some observations from the experience.

Thwack is Fun!

I've been a Thwack user for about a year now, and I've learned that Thwack is a funny, borderline goofy community of tech enthusiasts. I've been familiar with the Ambassador program for nearly as long, and have always enjoyed the discussions that follow from thought-provoking posts. I'm happy that my four posts at Thwack this month (Whose Fault Is It Anyway?, Rise of the Hybrid Engineer, Too Many Tools, and Root Cause Paralysis) initiated great discussions. And only a few comments were obvious point-chasing. So don't let the goofiness fool you: many users at Thwack are wicked smart, and not just on the topics of SolarWinds and monitoring.

Active Participation

I've seen many Ambassadors who post an article and then disappear until the next week. And I've seen what happens to the comments section as a result. I took a different approach, in part due to the great interaction I observed while Gideon Tam was an Ambassador in January of this year. Plus, the good people at Foskett Services LLC (who manage the Ambassador program for SolarWinds) encourage bloggers to interact with commenters. It makes for a more lively, dynamic discussion. The articles may be interesting, but it's the discussion that really brings them to life. Contrast that with the postings on a site like... this one. :)

People Just Want to be Heard

Who doesn't want someone to listen to their experience and advice? I learned that if you give people a platform, they'll use it to share stories of success and failure, primarily for the purpose of helping others avoid the same mistakes. Though to be fair, the point system rewards us all for our contributions. I think I've got one of everything in the Thwack store (except the backpack... just no use for it). 

A Note of Thanks

Many thanks to Stephen Foskett for getting me involved in this program! (I have to be honest, that was the best DM I've ever received.) Thanks also to Danielle Higgins and Claire Chaplais for all of the work they put into Thwack and the Ambassador program.

I'll now return to my regular Thwack user activity.

Disclosure stuff: The Thwack Ambassador role is a paid position. Read my posts there for yourself and determine if that introduced bias into my articles.