Tuesday, November 5, 2013

Wrestling OpenNMS

All jesting aside, I love OpenNMS.
I've been using OpenNMS since before it was cool. Though to be honest, so has everyone since OpenNMS is still not cool. But it is a great open-source NMS solution, and it has matured significantly in the ten years since my first exposure to the software.

The current release of OpenNMS is much easier to implement than previous versions, even more so when deploying via yum. You no longer have to treat each component as individual installations; the installer takes cares of everything from the postgres database up through the web application. (This is not an insignificant matter. Years ago, simply getting OpenNMS up and running was a mark of honor.)

Last week, I was asked to troubleshoot an issue with OpenNMS. The issue was that it was broken.

Turns out that the application couldn't run because the postgres instance wasn't running. Like any three-tiered web app, if the database layer is down, then the upper layers can't function. So I dug into the logs to find out what the matter was.

If you're not familiar with troubleshooting on linux, let me tell you about a great command to use when diagnosing a problem with a system service: systemctl <processname> status. In this case, I issued systemctl postgresql status and got the following:

FATAL: could not create shared memory segment: Invalid argument 
DETAIL: Failed system call was shmget(key=5432001, size=4194304, 03600).
HINT: This error usually means that PostgreSQL's request for a shared memory segment exceeded your kernel's SHMMAX parameter. You can either reduce the request size or reconfigure the kernel with larger SHMMAX.

You can see pretty easily what the problem is. Postgres is looking for more shared memory than is currently available on the VM (shared memory is different from "physical memory"). So how do we fix that? It's surprisingly simple. Just drop this line into /etc/sysctl:

kernel.shmmax=4194304

Now a reboot, and you're back in business. Technically, you could issue a different command (sysctl -w kernel.shmmax=4194304) and then start postgres, but I like to confirm that the change will survive a reboot so there are no surprises down the road.

You might be wondering why postgres stopped working. Ask the guy who made some changes to the config file without understanding the effect of his changes. (He owes me a few beers for this one. :))