Thursday, September 19, 2019

Notes from the Field: Debug is for Debugging

I often hear system administrators say that logging is both the best and worst part of IT. They rely on the information provided by logged data, but dislike the overhead that typically comes as part of the package: you have to store the logs somewhere, and with enough systems and enough time, you end up with more logs than you know what to do with.

In many cases, sysadmins will approach this problem by tuning the logs. This is most often accomplished by changing the logging levels from their default (this is typically INFO) to something a little more discerning, like WARN. (If you're feeling a little lost at this point, read through this documentation on Log4j Custom Log Levels). The impact of this change is that the system generating the logs will restrict the events that it sends across the wire (in the case of a remote syslog solution) or to a local file (in the case of application logs). In both cases, you end up conserving resources, either network or disk. You'll still get diagnostic information that can help understand why a system is malfunctioning, just not at the detailed level provided by INFO.
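To put that in concrete terms, the change is usually a single line in the logging configuration. Here's a minimal log4j2.properties sketch (the appender name and pattern are just illustrative placeholders, not a recommendation):

    # log4j2.properties -- minimal sketch; appender name and pattern are placeholders
    appender.console.type = Console
    appender.console.name = Console
    appender.console.layout.type = PatternLayout
    appender.console.layout.pattern = %d %p %c{1.} [%t] %m%n

    # The line that matters: raise the root level from the usual INFO to WARN
    rootLogger.level = warn
    rootLogger.appenderRef.console.ref = Console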

But if you take a step in the other direction, you can inflict serious damage on your application or server's performance. How, you ask? By enabling DEBUG-level logging.

Often, developers will set logging to debug when they're, you know, debugging. This is a pretty logical setup, and it's almost exclusive to the world of non-production: set to debug, test, break stuff, fix it, test again, and turn off debug. It's cool; that's why debug facilities are built into application frameworks.

However, a single "log-level=debug" in a configuration file can throw your whole system into chaos.

Take Apache Tomcat, for example. When you enable debug logging for a production application, well, just read this statement from Apache's documentation:
When enabling debug logging it is recommended that it is enabled for the narrowest possible scope as debug logging can generate large amounts of information.
No, you don't need all this information bruh.
What they don't tell you is exactly what that means. It means a single site with fewer than 100 users can generate 4GB of log data in a matter of hours. The I/O alone required to write to a log file that garrulous can bring a server to its knees, and can slow down Apache's ability to serve even static pages. And since many applications, especially those from the open-source community built to run on Windows, log to text files, you can be sure that if you're accustomed to using Notepad to review log files, you're going to have a bad time with a 4GB .txt file under constant I/O.
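If you genuinely do need debug output from Tomcat in production, that "narrowest possible scope" advice means turning it up for a single logger, not the whole container. Tomcat's JULI configuration lives in conf/logging.properties, and a scoped change might look roughly like this (the logger name below is only an example; "debug" in java.util.logging terms is FINE):

    # conf/logging.properties -- a sketch, not a drop-in config
    # Raise verbosity for one component instead of the root logger; in a stock
    # logging.properties the handlers are typically already set to FINE, so
    # this one line is usually the whole change.
    org.apache.catalina.session.level = FINE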

(Incidentally, if you need a way to pull data out of a large file like that, PowerShell is your answer. You can open up a PS shell and do a Get-Content -Path c:\temp\bigasslogfile.txt -Tail 1000 | Out-File -FilePath c:\temp\smallasslogfile.txt and you're good to go. You'll get a smaller file that Notepad can handle, with the 1000 most recent lines, which is typically enough to get a sense of what's going on.)

Anyway, this is a reminder that debugging is for debugging. Don't do it in production, and if you must, be very careful not to overwhelm your network or disk subsystem with excessive logging.

Monday, July 29, 2019

No VAMI after vCSA Update

It happens. Upgrades go south. Even tried-and-true updates like the ones VMware releases for the vCenter Server Appliance suffer from the occasional bomb.

Last week, as I was applying the latest security patches to a quartet of virtual appliances that were previously running 6.5.0.30000, I ran into a strange issue. The update to 6.5.0.30100 ran without a hitch on my PSC, but it failed on my vCSA appliance. The progress window disappeared from view, and after 10 minutes of patiently waiting, I took the plunge and rebooted the VM.

Yes, it's possible that I interrupted something important with that reboot. But in all of the updates I've pushed out over the years, it's not common for the progress window to just go away and not provide any feedback on the status of the update operation.

After the reboot, vCenter was up and running ok, albeit on the .30000 version. After doing some checks to make sure I didn't need to revert to a snapshot (always take a snapshot), I decided to log back into the VAMI and try it again.

Except the VAMI was down.

Some head-scratching ensued. But after a few minutes of panic, I realized that it's an easy fix. Here's what to do:
  1. Log into your virtual appliance's console (easy to do if vCenter is still functional. If not, just log into the host directly. Good reminder that you should record the hostname for your vCSA before you start this type of task.)
  2. The process that is responsible for that nice VAMI interface is named vami-lighttp. It's probably not running, which you can confirm with a quick ps -ef | grep vami-lighttp.
  3. Start the process by issuing this command: /etc/init.d/vami-lighttp start.
  4. Verify that your VAMI is back online (the whole check-and-restart sequence is sketched below).
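Put together, the check-and-restart looks something like this from the appliance's bash shell (the curl check assumes curl is available on the appliance; the VAMI listens on port 5480):

    # Is the VAMI web server running? (the [v] keeps grep from matching itself)
    ps -ef | grep [v]ami-lighttp

    # If it isn't, start it back up
    /etc/init.d/vami-lighttp start

    # Confirm it's answering on port 5480 (-k because of the self-signed cert)
    curl -k -o /dev/null -w "%{http_code}\n" https://localhost:5480/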
Now you can log back into the VAMI and re-run that update.

Friday, July 5, 2019

It Only Gets Worse When You Try To Make It Better

I have realistic expectations, I say.
It's something I blurt out to ease the tension.
These are delicate matters, he says.

He opens a small tool pouch and selects a metallic instrument.
It's not a scalpel. But it looks like it is.
He approaches.

Nevermind the century-old exterior, he says. There's only so much that can be done.
He says something else, but I'm already gone, trying to add detail to a memory
of being a child and listening to the ballgame on a radio.

It's summer, and there are no seedless watermelons.
Everyone is drinking ginger ale.

And it's hot. The mimosa trees were cut down, so there's no shade, only a jagged shadow that the limbless trunk of a dead oak casts near the well cover.

I'm brought back when he says to use warm water and dish soap on a soft cloth.
Be gentle with the ivory.

When the time comes, throw it into the landfill and don't think about it again.

The dispassion of it all is routine.

Wednesday, March 20, 2019

Bias to Action

Prototyping in progress!
I’ve always wanted to teach. One of my favorite memories from my youth is teaching art classes to 5th graders as part of my senior studio class in high school. I like the planning that goes into a good lesson, and the unplanned opportunities to connect with students as they learn something new. I know I’m over-simplifying the demands on a full-time teacher, given that I’ve only spent a day or two here and there teaching anything to anyone. But still, those rare occasions gave me more joy than two decades in IT.

Recently I took a turn teaching engineering to a group of students at my local homeschool co-op. It was the second of two classes I taught this semester: the first lesson was based on circuit design using the wonderful and inspiring materials from Chibitronics. The creative folks behind Chibitronics have merged technology with art, and have been around since the maker wave began earlier this decade. It’s cool stuff, and if you’ve got kids, you should look into their kits and create something amazing.

The second class was something different. I spent some time reading a lesson plan from Stanford d.school on the concept of a “bias to action.” You can read the whole lesson plan here. A summary of the lesson: the class splits into groups of a few students each, and each group uses dry spaghetti and marshmallows to construct a tower. (In the lesson plan, they also include a length of tape and string, but I skipped that part to simplify the work.) Before anyone gets started, you talk briefly about a few important concepts: prototyping, failing fast, iteration, and the bias to action.

Prototyping

Prototyping is something that we don’t consider often in the infrastructure ops field, but the movement towards Infrastructure as Code promises to change that. We should build prototypes of scripts to deploy systems, of templates used to build applications, and of patterns to support web scale products. And we should hope that the prototypes fail, which leads us to the next concept: failing fast.

Failing Fast

Failing fast means you learn as soon as possible whether your proposed solution is going to work out or not. If your tower prototype collapses before your second floor is built, then you know that your foundation needs more work. You’d much rather the tower fall with two floors than ten. You prototype, fail fast to learn what didn’t work, and iterate.

Iteration

To iterate means to try again, but with the knowledge of your previous attempts (failures). Each iteration should be an improvement on your previous design. And the process repeats until you’ve developed a good solution (or in this case, a towering construction of carbohydrates). The improvement can be subtle; you're not going for a 1.0 to 2.0 release. You're looking for a 1.01a.

Bias to Action

So what’s the bias to action? That was the only bit that was new to me, too. And now that I’ve learned what it is, I find myself applying it to work ALL. THE. TIME. (Sorry co-workers, and get used to it.)

A bias to action is the tendency of an individual or group to try doing something instead of over-thinking or over-planning. It’s not a license to be reckless; it’s an approach designed to acquire empirical data quickly for the purposes of iterating on your design. An extreme counter-example would be to spend a year planning a 100-story tower, and then watch it fall as soon as you build the second floor. You’ve just invested a year’s worth of time on a design that, had you prototyped it early on, would have failed fast and provided feedback for your iteration.

In other words, it's a "let's try it and see what happens" approach.


So now I’ve got a phrase to describe how we should all approach our work. Through prototyping, failing fast, iterating, and a bias to action, we can modernize any infrastructure operation.