Friday, December 6, 2013

HA vs FT

First things first: I'm not talking exclusively about vSphere here. I'm writing about concepts, expectations, and operations, not any one product in particular.

I'm wrapping up my first week at a new project. It was a first week like most others: spending time on administrative tasks, completing various applications for access to stuff, and meeting people that I'll be spending some quality time with in the near future. I even sat in on a few meetings that helped me start to wrap my head around the infrastructure, which is rapidly changing.

Dude, this picture doesn't even make sense.
One of the topics that came up in these meetings had to do with a pair of firewalls and their configuration as an active/passive pair. From what I gather, the conversation (during the design phase) went like this:

Management - "Are the firewalls designed to be redundant?"
Engineer - "Yes."

A seemingly innocuous, normal exchange. But the two sides interpreted the word "redundant" differently. Management, by redundant, meant that a failure at the hardware level would not affect the flow of traffic. Not even for a second. The engineer, upon hearing redundant, understood it to mean that yes, there were two firewalls, and if one failed the other would handle the load after a brief outage.

This is where HA vs FT becomes important. In a vSphere cluster, we know that HA will let us recover virtual machines, automatically, AFTER the hosts determine that a failure has occurred. In order to reduce unnecessary failover, there's some logic in HA that prevents VM recovery until multiple checks and heartbeats have failed. The effect of this logic is that a VM restart can take a minute or two to kick off (or three; I don't have my vSphere bible, aka the Clustering Deepdive book, with me at the moment). As vSphere people, we're accepting of this time. It's still WAY faster than any manual failover or recovery that we could do. But this does mean that there's an outage.

This truly is HA: High Availability. It's not the avoidance of outages; it's the rapid recovery from outages.
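To make that concrete, here's a minimal sketch in Python of the "wait for several missed heartbeats before failing over" idea. It's not how vSphere HA (or any particular firewall) actually implements it; the function names, the threshold, and the timing numbers are all made up for illustration. The point is just that the same logic that avoids unnecessary failovers also guarantees some amount of outage.

```python
def detect_failure(heartbeats, missed_threshold):
    """Return the tick at which a failure is declared, or None.

    heartbeats: sequence of booleans, True = heartbeat received on that tick.
    A failure is only declared after `missed_threshold` consecutive misses,
    so a single dropped heartbeat does not trigger a failover.
    (Hypothetical logic for illustration, not any vendor's algorithm.)
    """
    missed = 0
    for tick, received in enumerate(heartbeats):
        if received:
            missed = 0  # a good heartbeat resets the counter
        else:
            missed += 1
            if missed >= missed_threshold:
                return tick
    return None


def ha_outage_seconds(interval_s, missed_threshold, restart_s):
    """Rough outage estimate for an HA pair: detection delay plus recovery time.

    All three inputs are assumptions you would replace with real values.
    """
    return interval_s * missed_threshold + restart_s


# One blip at tick 2 is ignored; three misses in a row (ticks 4-6) trigger failover.
print(detect_failure([True, True, False, True, False, False, False], 3))

# Made-up numbers: 5 s heartbeats, 3 misses required, 60 s to restart.
print(ha_outage_seconds(5, 3, 60))
```

With those made-up numbers the business sees over a minute of downtime even though everything "worked." That number is what management needs to hear when they ask about redundancy.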

FT (fault tolerance) is a different beast altogether. Now we're talking active/active. And that brings up lots of other considerations (tracking sessions across devices, addressing, how to monitor, load balancing, et cetera). It's FT that management is expecting when they say "redundant," not HA. They're looking for a solution that has no impact on their business customers during a hardware failure.

Engineers will say, "But fault tolerant systems cost 10x more! Diminishing returns! Unnecessary complexity!" Those all may be true. But management needs to hear that and decide whether pursuing fault tolerance is worth it. Don't assume that it's too expensive. Find out what the functional requirements are at the start of your project, document them, and get approval. There's always money for a good solution, and there's rarely money for a bad one.

The management-to-engineering interface is always a challenge in the IT world. Learning to speak both languages helps to avoid the problem of miscommunicated and misunderstood requirements.