Wednesday, July 10, 2013

Power Saving Modes in vSphere and Cisco UCS

If you've ever had a slow Friday and spent time poking around in vCenter Server or UCS Manager, you've probably come across some promising eco-friendly features like Distributed Power Management (DPM) and N+1 PSU redundancy. If you haven't, here's a summary of these technologies.

VMware's DPM - DPM is a feature available to vSphere clusters that determines if the cluster's workload can be satisfied using a subset of cluster members. If so, the VMs are vMotioned to free up one or more hosts which are then powered down into stand-by mode. Your cluster's HA settings are taken into account, so using DPM won't violate your availability constraints. Should the cluster's workload suddenly increase, vCenter will wake-up the stand-by hosts, then redistribute the workload across the additional hosts. Cool Stuff indeed. You save on power and cooling costs for each server that DPM puts into stand-by mode.

Cisco's UCS N+1 PSU Redundancy - N+1 is sometimes a tricky thing to wrap your head around, since its meaning changes depending on context. In the case of UCS, N+1 means the number of PSUs required to provide non-redundant power to your chassis, plus one additional PSU. So with a 5108 chassis, with all four PSU slots populated, N+1 would mean 3 PSUs active and one in "power save" mode. If one of the active PSUs fails, you still have redundancy, and the fourth PSU will be brought online to restore N+1 redundancy.

So that's the good news. Here's the bad news: DPM basically confirms that you overbought on hardware. And N+1 PSU redundancy may not give you the redundancy you're looking for. Here's why.

If you find that DPM is shutting down servers in your cluster more often than not, you purchased more hardware than you needed. This indicates that you didn't properly assess your workloads prior to creating your logical and physical designs. And that indicates that maybe you didn't account for other design factors. And that is not cool. An erstwhile pessimist, I suspect this is why many vSphere clusters do not have DPM enabled.

On the topic of Cisco UCS, N+1 PSU redundancy, and a false sense of security: chances are that what you really want to use here is Grid Redundancy, not N+1 redundancy. Grid means that you have power from two PDUs running to your 5108, and you want to spread your PSUs across those two PDUs. So you connect PSU1 and 3 to PDUA, and PSU 2 and 4 to PDUB. All four PSUs are online, and should a PDU fail, you still have two PSUs running. With N+1 and PSUs spread across two PDUs, you could encounter a situation where only one PSU is active while the "power save" PSU is turned up. One PSU may not be able to provide sufficient power to your chassis and blades, which can be... you guessed it: not cool.

Looking back on this post, I'm not sure why I lumped these two together, other than that they both deal with power. DPM and PSU configuration options solve different problems. There's no shame in including these features in your designs. Just make certain that you understand the benefits and pitfalls of each.

ps - It's late, and I'm listening to the Beastie Boys, and I'm low on Yuengling. Were I so inclined, I could add a footnote for nearly each claim above. But the point here is that you need to understand what these options do for you, and that means understanding other design requirements like total power consumption of your b-series blades.
Mastodon