Thursday, September 19, 2019

Notes from the Field: Debug is for Debugging

I often hear system administrators say that logging is both the best and worst part of IT. They rely on the information provided by logged data, but dislike the overhead that typically comes as part of the package: you have to store the logs somewhere, and with enough systems and enough time, you end up with more logs than you know what to do with.

In many cases, sysadmins will approach this problem by tuning the logs. This is most often accomplished by changing the logging levels from their default (this is typically INFO) to something a little more discerning, like WARN. (If you're feeling a little lost at this point, read through this documentation on Log4j Custom Log Levels). The impact of this change is that the system generating the logs will restrict the events that it sends across the wire (in the case of a remote syslog solution) or to a local file (in the case of application logs). In both cases, you end up conserving resources, either network or disk. You'll still get diagnostic information that can help understand why a system is malfunctioning, just not at the detailed level provided by INFO.

But if you take a step in the other direction, you can inflict serious damage to your application or server's performance. How, you ask? By enabling DEBUG level logging.

Often, developers will set logging to debug when they're, you know, debugging. This is a pretty logical set up, and is almost exclusive to the world of non-production. Set to debug, test, break stuff, fix it, test, and turn off debug. It's cool, that's why debug facilities are built into application frameworks.

However, a single "log-level=debug" in a configuration file and throw your whole systems into chaos.

Take Apache Tomcat, for example. When you enable debug logging for a production application, well, just read this statement from Apache's documentation:
When enabling debug logging it is recommended that it is enabled for the narrowest possible scope as debug logging can generate large amounts of information.
No, you don't need all this information bruh.
What they don't tell you is exactly what that means. It means a single site with fewer than 100 users
can generate 4GB of log data in a matter of hours. The I/O alone required to write to a log file that garrulous can bring a server to its knees, and can slow down Apache's ability to serve even static pages. And since many applications, especially those from the open-source community built to run on Windows, log to text files, you can be sure that if you're accustomed to using notepad to review log files, you're going to have a bad time with a 4GB .txt file with constant I/O.

(Incidentally, if you need a way to pull data out of a large file like that, PowerShell is your answer. You can open up a PS shell and do a get-content -file c:\temp\bigasslogfile.txt -tail 1000 | out-file -path c:\temp\smallasslogfile.txt and you're good to go. You'll get a smaller file that notepad can handle with the 1000 most recent lines, which is typically enough to get a sense for what's going on.)

Anyway, this is a reminder that debugging is for debugging. Don't do it in production, and if you must, be very careful not to overwhelm your network or disk subsystem with excessive logging.

Monday, July 29, 2019

No VAMI after vCSA Update

It happens. Upgrades go south. Even tried and true updates like those VMware releases for the vCenter Server Appliance suffer from the occasional bomb.

Last week, as I was applying the latest security patches to a quartet of virtual appliances that were previously running 6.5.0.30000, I ran into a strange issue. The update to 6.5.0.30100 ran without a hitch on my PSC, but it failed on my vCSA appliance. The progress window disappeared from view, and after 10 minutes of patiently waiting, I took the plunge and rebooted the VM.

Yes, it's possible that I interrupted something important with that reboot. But in all of the updates I've pushed out over the years, it's not common for the progress window to just go away and not provide any feedback on the status of the update operation.

After the reboot, vCenter was up and running ok, albeit on the .30000 version. After doing some checks to make sure I didn't need to revert to a snapshot (always take a snapshot), I decided to log back into the VAMI and try it again.

Except the VAMI was down.

Some head-scratching ensued. But after a few minutes of panic, I realized that it's an easy fix. Here's what to do:
  1. Log into your virtual appliance's console (easy to do if vCenter is still functional. If not, just log into the host directly. Good reminder that you should record the hostname for your vCSA before you start this type of task.)
  2. The process that is responsible for that nice VAMI interface is named vami-lighttp. It's probably not running, which you can confirm with a quick ps -ef | grep vami-lighttp.
  3. Start the process by issuing this command: /etc/init.d/vami-lighttp start.
  4. Verify that your VAMI is back online.
Now you can log back into the VAMI and re-run that update.