Thursday, November 20, 2014

No Service Stands Alone

I recently had the pleasure of attending a "lessons learned" meeting with two dozen colleagues. We've just recovered from a service interruption at work, and our CIO arranged for us to spend some quality time discussing what worked well and what didn't. The discussion was very productive, thanks in part to some ground rules that prevented us from defending our actions. As far as two-hour meetings go, this one wasn't too bad.

Credit: Danby @ BDN
I'll save you the sordid details of the service interruption. Suffice it to say that members from many technical disciplines were present; the service in question required infrastructure from across the IT spectrum. The group observed that, because the service was distributed, restoring it required collaboration between teams. In Washington, D.C., we call that "reaching across the aisle."

At this point in the discussion, the CIO interrupted us to make a simple, seemingly obvious, statement.

No service stands alone.

It was one of those moments that shocked us into silence. Among her many skills, our CIO can command a silence like no one else. The silence lasted 5 seconds, but you'd think it was a year. Yeah, she's that good.

But her point is worth considering. In any enterprise, no service worth providing lives entirely in the orthogonal confines of an organizational chart. Services span teams and technologies, and live and die by the success of each individual component.

That's why technology professionals must change their perspective and see the services that they support. Your storage engineer can't be blind to the Exchange workloads hosted on the SAN. Your vSphere administrator can't be blissfully unaware of the impact of slow performance on hosted applications. And everyone who shares partial responsibility for a mission-critical service must be accountable for its availability.

Service Monitoring

A deceptively easy way to change your perspective is to leverage monitoring tools that are "service aware." That is, in addition to monitoring individual devices like switches and servers, these tools can group devices into a "service." In EM7 vernacular, this is called an IT Service. SolarWinds is going to tackle this notion with their AppStack solution. In fact, I'd bet that most modern monitoring solutions have a similar capability. It's a logical evolution, after all.
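To make the idea of a "service aware" grouping concrete, here is a minimal sketch in Python. It is not how EM7 or AppStack work internally; the Device and Service classes, the team names, the device names, and the worst-of roll-up rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List, Set


class Health(IntEnum):
    """Hypothetical health levels; a higher value means worse."""
    OK = 0
    DEGRADED = 1
    DOWN = 2


@dataclass
class Device:
    name: str       # hypothetical device name
    team: str       # the team that owns the device
    health: Health


@dataclass
class Service:
    name: str
    devices: List[Device]

    def health(self) -> Health:
        # Assumed roll-up rule: the service is only as healthy
        # as its least healthy component.
        return max((d.health for d in self.devices), default=Health.OK)

    def teams_to_call(self) -> Set[str]:
        # Every team that owns an unhealthy component of this service.
        return {d.team for d in self.devices if d.health != Health.OK}


email = Service("email", [
    Device("san-array-01", "storage", Health.DEGRADED),
    Device("esx-host-07", "virtualization", Health.OK),
    Device("core-switch-02", "network", Health.OK),
])

print(email.name, email.health().name, sorted(email.teams_to_call()))
```

The design choice worth noticing is that the question being asked is "how is email doing?" rather than "how is the SAN doing?". The storage team's degraded array shows up as a degraded email service, which is the perspective the CIO was pushing for.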

So stop pretending that the infrastructure you support exists for any other reason than to enable services. And start putting the health and performance of the service first.

Sunday, November 9, 2014

Unperceived Existence of Collected Data

Can something exist without being perceived?

No, this blog hasn't taken a turn into the maddening world of metaphysics. Not yet.

I'm talking about event and performance logging, naturally. In the infrastructure racket profession (well, I think it's a vocation, but I'll get to that in a later post), we're conditioned to set up logging for all of our systems. It's usually for the following reasons:
  1. You were told to do it.
  2. Security told you to do it.
So you dutifully, begrudgingly configure your remote log hosts, or you deploy your logging agents, or you do some other manner of configuration to enable logging to satisfy a requirement. And then you go about your job. Easy.

But what about that data? Where does it go? And what happens to it once it's there?

Do you use any tools to exploit that data? Or does it just consume blocks on a spinning piece of rust in your data center? Have I asked enough rhetorical questions yet?

*   *   *

The pragmatic engineer seeks to acquire knowledge of all of her or his systems, and in times of service degradation or outage, such knowledge can reduce downtime. But knowledge of a system typically requires an understanding of "normal" performance. And that understanding can only come from the analysis of collected events and performance data.

If you send your performance data, for example, to a logging system that is incapable of presenting and analyzing that data, then what's the point of logging in the first place? If you can't put that data to work, and exploit the data to make informed decisions about your infrastructure, what's the point? Why collect data if you have no intent (or capacity) to use it?
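As a rough illustration of what "putting the data to work" can mean, the sketch below derives a baseline for "normal" from collected samples and flags a reading that strays too far from it. The sample values, the function names, and the three-standard-deviation threshold are all assumptions; a real monitoring tool would likely account for time of day, seasonality, and more.

```python
from statistics import mean, stdev


def baseline(samples):
    """Return (mean, standard deviation) of the collected samples."""
    return mean(samples), stdev(samples)


def is_abnormal(value, samples, threshold=3.0):
    """Flag a reading more than `threshold` standard deviations from the mean.

    The threshold of 3.0 is an arbitrary starting point, not a recommendation.
    """
    mu, sigma = baseline(samples)
    if sigma == 0:
        # Every historical sample was identical, so any change is notable.
        return value != mu
    return abs(value - mu) / sigma > threshold


# Hypothetical CPU utilization samples (percent) from a quiet week.
cpu_history = [40, 42, 38, 41, 39, 43, 40]

print(is_abnormal(41, cpu_history))
print(is_abnormal(95, cpu_history))
```

Without the history there is no baseline, and without the baseline a reading of 95% is just a number. That is the case for analyzing what you collect.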

Dashboarding for Fun and Profit (but mostly for Fun)

One great way to make your data meaningful is to present it in the only way that those management-types know: dashboards. It's okay if you just rolled your eyes. The word "dashboard" was murdered by marketing in the last 10 years. And what a shame. Because we all stare at a dashboard while we're driving to and from work, and we likely don't realize how powerful it is to have all of the information we need to make decisions about how we drive right in front of us. The same should be true for your dashboards at work.

So here are a few tips for you, dear readers, to guide you in the creation of meaningful dashboards:

  1. Present the data you need, not the data you want. It's easy to start throwing every metric you have available at your dashboard. And most tools will allow you to do so. You certainly won't get an error that says, "dude, lay off the metrics." But just because you can display certain metrics doesn't mean you should. For example, CPU and memory % utilization are dashboard stalwarts. Use them whenever you need a quick sense of health for a device. But do you really need to display your disk queue length for every system on the main dashboard? No.
  2. Less is more. Be selective not only in the types of data you present, but also in the quantity of data you present. Avoid filling every pixel with a gauge or bar chart; these aren't Victorian works, and horror vacui does not apply here. When you develop a dashboard, you're crossing into the realm of information architecture and design. Build your spaces carefully.
  3. Know your audience. You'll recall that I called out the "management-types" when talking about the intended audience for your dashboards. That was intentional. Hard-nosed engineers are often content with function over form; personally, I'll take a shell with grep, sed, and awk and I can make /var/log beg for mercy. But The Suits want form over function. So make the data work for them.
  4. Think services, not servers. When you spend your 8 hours a day managing hosts and devices, you tend to think about the infrastructure as a collection of servers, switches, storage, and software. But dashboards should focus on the services that these devices, when cooperating, provide. Again, The Suits don't care if srvw2k8r2xcmlbx01 is running at 100% CPU; they care that email for the Director's office just went down.
Don't ignore the dashboard functionality of your monitoring solution just because you're tired of hearing your account rep say "dashboard" so many times that the word loses all meaning. When used properly, and with a little bit of work on your part, a dashboard can put all of that event and performance data to work.
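For the engineers who would rather see tips 1 and 4 as code, here is a small sketch that collapses per-host metrics into one dashboard row per service and keeps only CPU and memory. The metric names, the host-to-service mapping, and every hostname other than srvw2k8r2xcmlbx01 are made up for illustration; your monitoring tool will have its own way of doing this.

```python
from collections import defaultdict


def service_rows(host_metrics, service_of):
    """Collapse per-host samples into one dashboard row per service.

    Only CPU and memory are carried forward (tip 1); anything else a host
    reports, such as disk queue length, stays off the main dashboard.
    """
    grouped = defaultdict(list)
    for host, metrics in host_metrics.items():
        grouped[service_of[host]].append(metrics)

    rows = []
    for service, samples in grouped.items():
        rows.append({
            "service": service,
            "hosts": len(samples),
            "max_cpu_pct": max(m["cpu_pct"] for m in samples),
            "max_mem_pct": max(m["mem_pct"] for m in samples),
        })
    # Put the busiest service at the top, where the audience will look first.
    rows.sort(key=lambda row: row["max_cpu_pct"], reverse=True)
    return rows


# Hypothetical inputs: what each host reports and which service it supports.
host_metrics = {
    "srvw2k8r2xcmlbx01": {"cpu_pct": 100, "mem_pct": 71, "disk_queue": 9},
    "mail-edge-02": {"cpu_pct": 35, "mem_pct": 48, "disk_queue": 1},
    "intranet-web-01": {"cpu_pct": 22, "mem_pct": 40, "disk_queue": 0},
}
service_of = {
    "srvw2k8r2xcmlbx01": "email",
    "mail-edge-02": "email",
    "intranet-web-01": "intranet",
}

for row in service_rows(host_metrics, service_of):
    print(row)
```

Three hosts become two rows, and the row at the top says "email" rather than a hostname nobody outside your team can read.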