Aug. 13, 2019
I recently started an internship as a software developer, and I spent the last day and a half working on one of those problems that just makes me want to write about it. The project I’m working on is an IoT app where a device somewhere in the world with a 4G connection has to accumulate and send accurate logs to a cloud server. This sounds simple enough, but some of the particulars of the device makes getting logs a bit of a challenge.
Yesterday, my team lead/mentor was looking at some logs we received to check for accuracy. As an intern, I had the all-important job of watching our test machine running, manually timing it, and comparing the real times against our logs to check for accuracy. Unfortunately for me, the app wasn’t logging! Fortunately for me (seriously, this turned out to be really cool), my team lead has other projects to work on, so I got to dig into the problem.
Some important things to note: the app running on these devices is written in Java, which is fine, except I only know a little bit of Java; these devices are part of an embedded system, which is fine, except the end users only shut them down by cutting the power. These two things wound up being pretty important, but I’m getting ahead of myself.
One of the most important things to do to get to the bottom of any technology problem is to make sure you can re-create the issue. We’d been letting our test machine run overnight to accumulate some logs, and a peek at the file showed it hadn’t recorded any activity since we started it up the previous afternoon. It did have good records up to its last shutdown/restart cycle, just nothing after the last time it booted up.
I did what any IT guy would do, and turned the machine off and back on again, gently. It booted up, and wrote logs just fine. Then I did what no IT guy (or anyone else) should ever do, and I unplugged it while it was running and plugged it back in over and over again checking to see if it was logging each time. Unluckily, sometimes it was logging, and sometimes it wasn’t.
Upon further examination, the logs directory contained some files I wasn’t expecting. There was Program.log (expected), but also Program.log.lck and Program.log.1 and Program.log.1.lck. Some quick google research brought me to this StackOverflow question, which led me to Wikipedia, and finally to the Java docs themselves. All of that research led me to the conclusion that our hard shutdown wasn’t giving the JVM time to clean up after itself. To test this, I ran a series of tests both with the hard shutdown, with sending SIGKILL signals to the Java process, and with sending more gentle SIGTERM signals.
The Java running the main function on these devices creates a Logger object, which writes to a file using a FileHandler, that locks the log files while writing. This makes sense, and is definitely necessary to stop multiple processes from trying to write to the same file at the same time. For this application, though, there are a couple of things that make this a problem.
Because we’re not in control of when/how the machines are shut down, it’s possible for them to get shut down during a write operation. This doesn’t happen every time, but when it did happen the log would contain a corrupt/null line, and the application would be sometimes be logging to Program.log.1 instead of Program.log.
Another thing about these machines is that they are self updating. On bootup, they make a request to a cloud app to see if they need to pull down a code update. If they do pull an update, the updater rebuilds the app in the background and restarts the main application (without restarting the machine). This could also occur while the log file was open, and we’d get the same result: Logs being written to log.1, and lock files all over the place.
Because these machines are embedded we don’t really need to read these files directly, and it would be a giant pain to do even if we wanted to. Instead, the devices run a service that sends any recorded logs to the cloud. It was simple enough to add a check for the additional files and clean them up if they were present.
In a perfect world, we’d set these devices up with small-capacity batteries and send a gentle shutdown signal when the main power was cut. This isn’t a perfect world, however, so the solution that gets our logs to the cloud with the least amount of additional work won.
Concurrency is hard. Distributed systems are hard. Embedded systems are hard. The best solution is most often the one that works.
If anyone has any advice on how to gracefully handle hard shutdowns, let me know.