My friend and I got to talking the other week about how to manage automatically deploying OS updates to his device when the device is in the hands of customers, plugged into their network. Since an update failure would essentially render the device useless to the customer, forcing an expensive RMA and manual recovery by a technician, we wanted to investigate which standard-ish Linux utilities could be used to ramp up the reliability of the upgrade process.
What follows is essentially a brain dump of what I think a well put together system would look like. Be nice about typos; this is a very stream-of-thought document.
- UEFI firmware
- Supports Secure Boot
- Supports the bootloader modifying either EFI variables or data on the system boot partition.
- x86_64 processor
- Hardware watchdog
- Watchdog verified to work properly with standard Linux utilities. E.g., available via a package manager for a distribution, not some crazy custom program and/or proprietary driver.
- Internal, bootable storage with sufficient space to store two full copies of the firmware, with some scratch space left over.
Basic partitioning layout
GPT partition table. UUIDs determined statically for all instances of the product.
- UEFI system boot partition -- Assigned UUID 1
- Firmware-A -- Assigned UUID 2
- Firmware-B -- Assigned UUID 3
- Data -- Assigned UUID 4
Please note that there's no swap partition. Swap is highly useful for desktop or high-powered server situations, where overcommitting resources is a desirable capability, but an embedded setup really shouldn't run into a situation where it needs more RAM than is available.
Basic boot strategy
Using Secure Boot, a UEFI feature, we can use our private signing keys to ensure that the bootloader installed on the system is one that we deployed. Further, we can configure the bootloader to verify that the kernel it loads is also one that we deployed. Basic boot integrity is important, after all.
Verifying that the firmware image to boot is cryptographically signed, to ensure the image as a whole is known-good, is also worth considering, even though it may mean some CPU time spent verifying such a large amount of data.
Somewhere in either the EFI system variables, or the system boot partition, we need to store a variable that tells us which partition to load. Essentially this can be the PARTUUID of the GPT partition. We also need a variable that holds a counter, representing the number of times we've attempted to boot this partition without the OS changing the counter back to 0.
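As a sketch, assuming the two variables are called BootPartition and BootAttempts (names invented here), the bookkeeping looks like this. A plain file stands in for each EFI variable so the logic can be run anywhere; real efivarfs entries live under /sys/firmware/efi/efivars/, prefix their data with a 4-byte attribute header, and may need `chattr -i` before they can be written.

```shell
# Stand-in store for the two EFI variables (hypothetical names).
# STATE_DIR can be pointed at a real efivarfs mount, with the
# caveats noted above.
STATE_DIR="${STATE_DIR:-$(mktemp -d)}"
mkdir -p "$STATE_DIR"

get_var() {                 # get_var NAME DEFAULT -> prints stored value
    cat "$STATE_DIR/$1" 2>/dev/null || printf '%s' "${2:-}"
}

set_var() {                 # set_var NAME VALUE
    printf '%s' "$2" > "$STATE_DIR/$1"
}

# Which partition to load, as a GPT PARTUUID (illustrative value).
set_var BootPartition "2f8c3d1e-0000-4000-8000-000000000002"

# Bootloader side: bump the attempt counter before handing off.
attempts=$(get_var BootAttempts 0)
set_var BootAttempts $((attempts + 1))

echo "booting $(get_var BootPartition), attempt $(get_var BootAttempts 0)"
```

On a successful boot, the OS-side reset service simply runs the equivalent of `set_var BootAttempts 0`.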
We need two protection mechanisms to ensure we eventually manage to boot properly. The first mechanism is that many bootloaders can be configured to trigger some action when the kernel panics. This is a graceful failure, allowing us to fall back to a known good state.
The problem is what to do if the kernel simply hangs. Maybe there's some race condition, and it never quite gets to the panic that would allow our bootloader to recover. Here's where the hardware watchdog comes into play. If we assume that the hardware watchdog will reset the machine if (and only if) the machine fails to send it a heartbeat signal within some amount of time that should be sufficient to boot properly, we guarantee that even in a worst-case scenario, we'll always return to the bootloader to try booting again.
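On the OS side, systemd can drive the hardware watchdog for you: with RuntimeWatchdogSec= set, PID 1 opens /dev/watchdog and keeps petting it, so a hard userspace hang ends in a hardware reset. A minimal sketch (the 30-second interval is an arbitrary choice, not a recommendation):

```ini
# /etc/systemd/system.conf (fragment)
[Manager]
# systemd sends a keep-alive to /dev/watchdog at half this interval;
# if PID 1 ever stops responding, the board resets itself.
RuntimeWatchdogSec=30s
```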
The final bit of magic is in the counter saying how many times we've failed to fully boot. Just before executing the kernel, the bootloader increments the counter for how many times it's tried that partition by one. If the counter reaches some upper limit, then the bootloader tries to boot from the other Firmware partition. (If it's using A, it switches to B, and vice versa). The upper limit can be stored anywhere. Hardcoded, another EFI variable, somewhere on the system boot partition. It doesn't really matter.
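GRUB can implement this whole dance itself, persisting the counters in its grubenv file via load_env/save_env (manipulated from Linux with grub-editenv). A sketch follows; menu entry names, variable names, and the limit of 3 are assumptions. Note that the `decrement` command is a patch carried by some distributions' GRUB builds (it powers Fedora's boot counting); stock GRUB scripting has no arithmetic, so elsewhere you'd chain explicit comparisons or keep the counter in an EFI variable instead.

```
# grub.cfg sketch.  State lives in the grubenv file next to grub.cfg.
load_env                             # reads boot_part, boot_attempts_left

if [ -z "${boot_attempts_left}" ]; then
    set boot_attempts_left=3
fi

if [ "${boot_attempts_left}" = "0" ]; then
    # Retries exhausted: flip to the other firmware and start over.
    if [ "${boot_part}" = "A" ]; then
        set boot_part=B
    else
        set boot_part=A
    fi
    set boot_attempts_left=3
fi

decrement boot_attempts_left         # see caveat above about this command
save_env boot_part boot_attempts_left

if [ "${boot_part}" = "A" ]; then
    set default="firmware-a"
else
    set default="firmware-b"
fi
```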
Basic runtime strategy
Post boot actions
If the system boots up fully, then among other things, one of the processes that needs to be launched is some program that resets the "number of boot attempts" variable to 0. If this process doesn't launch, then after some number of reboots, the system will still switch to the other firmware. Frankly, this process failing to launch and execute its task should be considered a hard failure for BOOT, not for runtime. This is an important distinction.
Using the systemd init system, it's possible to set service dependencies. Say your device, ultimately, needs to run 3 different services that stay alive at all times. Your dependency graph might look something like this
Boot Counter Reset
// || \\
Service A Service B Service C
Where Service A, Service B, and Service C each have a systemd watchdog configured (different from the hardware watchdog; see the WatchdogSec= option in systemd.service), and use the sd_notify API to continually inform systemd that they are alive and functioning.
If any of Service A, B, or C fails to launch, initialize, and run, the boot counter reset program won't launch, and neither will the hardware watchdog heartbeat. This will cause the hardware watchdog to trigger a system restart. If this happens enough times, the alternate firmware will be loaded.
The boot counter reset service is of type "oneshot", as it doesn't stick around after setting the counter variable.
The services in systemd should probably be configured to restart on failure. This way, if there's a minor problem that doesn't happen consistently, the service will automatically heal itself. Obviously, putting a limit to the number of restarts is a wise choice. If the service fails too many times, the hardware watchdog heartbeat service will be shut down, which will automatically trigger a system reboot.
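As a sketch, the unit files for this arrangement might look like the following. Service names, binary paths, and the limits are all assumptions; the reset binary is whatever program clears the EFI variables.

```ini
# boot-counter-reset.service -- runs once, only after every critical
# service has successfully started.
[Unit]
Description=Reset boot-attempt counter after successful boot
Requires=service-a.service service-b.service service-c.service
After=service-a.service service-b.service service-c.service

[Service]
Type=oneshot
ExecStart=/usr/local/bin/reset-boot-counter

[Install]
WantedBy=multi-user.target

# service-a.service (fragment) -- software watchdog plus bounded restarts.
[Unit]
Description=Service A
# Give up after 5 failures within 5 minutes; the stalled heartbeat
# then lets the hardware watchdog reboot the system.
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
ExecStart=/usr/local/bin/service-a
# The service must call sd_notify("WATCHDOG=1") more often than this.
WatchdogSec=15s
Restart=on-failure
```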
An additional advanced feature would be to automatically detect too many service failures, and switch to the other firmware automatically. If a service is failing twice a minute, that's a good indication that something is wrong. Essentially, to accomplish this: when the hardware watchdog triggers (or is about to trigger, depending on how this gets handled), increment the number of times the hardware watchdog has triggered, and then reboot.
The bootloader should then check, in addition to the boot attempts number, the variable for the number of watchdog triggers, and if EITHER of them is above the threshold, to switch to the alternate firmware after clearing both counters.
To ensure system integrity, mitigate tampering, and generally give things fewer moving parts, the firmware image, when loaded by the kernel and mounted as the root partition, should be strictly read-only. The data partition is the only part of the system that should have mutable state during normal operations. This is basic deployment hygiene, and has obvious benefits and essentially no downside.
Among other reasons, a very compelling one is "why not?". There's almost no downside to this, aside from an incredibly small amount of boot-time CPU usage. A properly stripped-down firmware image will fit easily into the system's file cache, so after boot is finished, accessing the firmware filesystem will involve essentially no disk IO at all.
If the runtime cost of accessing parts of the compressed image is concerning, there's also always the option of copying the firmware (uncompressed) wholesale into a RAM-based filesystem, and using that as the root filesystem instead. That guarantees that no disk access happens after boot for root-partition-related usage.
There are, of course, other benefits to compressing your images, including faster firmware download times, faster firmware installation times, and faster cryptographic signature verification at boot (if you're using that).
I recommend using SquashFS with the XZ compression algorithm. Linux can boot a SquashFS image written directly to the raw partition as the root filesystem, WITHOUT the need for an initramfs. Aside from the filesystem being read-only, it works identically to any other normal filesystem, with great file-cache properties, possibly saving you some (or all) disk accesses.
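Building and deploying such an image is only a few commands. Everything below (paths, the 1 MiB block size, the placeholder PARTUUID) is illustrative rather than canonical:

```
# Build an XZ-compressed SquashFS image from a staged root tree.
mksquashfs ./rootfs firmware.img -comp xz -b 1M -noappend

# Write it raw to the inactive firmware partition.
dd if=firmware.img of=/dev/disk/by-partuuid/<inactive-uuid> bs=1M conv=fsync

# Kernel command line to boot it directly, no initramfs needed:
#   root=PARTUUID=<inactive-uuid> rootfstype=squashfs ro
```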
There are a couple ways to manage doing a firmware update. If you're using SquashFS with the XZ compression, you'll already have your firmware images as small as they are practically ever going to get without removing some of the files contained within them. This can drastically reduce your download size, when compared with downloading the uncompressed image.
An interesting thing about the way SquashFS handles the compressed data is that instead of being a single stream of bytes, a SquashFS image is actually an organized collection of compressed chunks of data. When compressing a given filesystem tree, the SquashFS creation tools will always compress the data in the same order, in fixed-size chunks. This means that a small change to the original data stays local to the parts of the SquashFS filesystem that represent the files that were modified.
When you consider common large-file transfer tools, such as rsync, this means that rsync can very efficiently update a squashfs image from an older version to a newer version, since only a small amount of the squashfs file will have changed. Of course, if you make a very large change, all bets are off.
Even better than rsync, though, is that you can pre-compute the differences between the current firmware and the newly deployed firmware. Each system that needs the firmware update can simply download the binary diff that you generate once, and then apply that binary diff directly to the raw partition that the firmware to be replaced is stored on. Of course, make sure you cryptographically sign the binary diff, and have the firmware upgrade process verify the signature before applying it.
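The sign-then-verify flow can be sketched with openssl. Here `update.delta` is a stand-in payload; in real life it would come from a diff tool such as bsdiff, and all file names are assumptions:

```shell
# Sign a pre-computed binary diff on the build server and verify it on
# the device before applying it to the spare partition.
cd "$(mktemp -d)"

# --- build server: one-time signing key pair ---
openssl genpkey -algorithm RSA -pkeyopt rsa_keygen_bits:2048 \
        -out signing.key 2>/dev/null
openssl pkey -in signing.key -pubout -out signing.pub

# --- build server: sign each released diff ---
printf 'pretend this is a binary diff' > update.delta
openssl dgst -sha256 -sign signing.key -out update.delta.sig update.delta

# --- device (which ships only signing.pub): verify, then apply ---
if openssl dgst -sha256 -verify signing.pub \
        -signature update.delta.sig update.delta >/dev/null; then
    echo "signature OK, applying diff to the spare partition"
    # e.g. bspatch <old> <new> update.delta   (assumed tooling)
else
    echo "bad signature, refusing to apply" >&2
fi
```

The device never holds the private key, only signing.pub baked into the firmware image.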
If, for some reason, the firmware stored on the currently-not-in-use partition is a version that has no binary diff available, the firmware upgrade system can either request that the server create a diff on-demand, or can simply download the full update and apply it directly. Either method is fully functional.
Depending on what kind of Linux distribution you're using, it might be important to consider the upgrade path from very old firmwares to the latest and greatest. Considering you're deploying read-only images, there really should be very little reason to worry about a failure, but do keep this in mind when deciding on how to handle the actual download process, and what method of applying the firmware you end up picking.
If you're using a rolling-release distribution, such as Gentoo, a lot of packages can change without much warning in simply a few months. Obviously properly testing things is important, but it might be worth making a decision between doing an occasional "package freeze" versus a continuous firmware generation / deployment / self test system on internal hardware that gets used for dogfooding. It's a tough call, and there are pros and cons for both sides.
The right answers for your setup are entirely up to you.
Upgrading the bootloader and kernel
The kernel is stored separately from the firmware image, in the system boot partition. Depending on what bootloader you're using, you might need to name the kernel a specific filename, or you might need to overwrite a configuration file to point to the new kernel. In either case, there are a few failure scenarios here.
Unlike the firmware image, where we're already booted into a known-working image that we can fall back on if the upgrade fails, the kernel is a narrower point of failure.
When writing the kernel to disk, always write to a temporary name, and then mv the kernel to its real filename once it's fully on disk. This helps protect against a power failure while the kernel is only half-written to disk.
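The atomic-replace pattern looks like this. ESP defaults to a scratch directory in this sketch; in production it would be the mounted system boot partition, and the paths are assumptions:

```shell
# Install a new kernel image with an atomic rename, so a power cut
# leaves either the old kernel or the new one, never half a file.
ESP="${ESP:-$(mktemp -d)}"
printf 'old kernel image' > "$ESP/vmlinuz"

# 1. Write the new image under a temporary name on the SAME filesystem
#    (a rename is only atomic within one filesystem).
printf 'new kernel image' > "$ESP/vmlinuz.tmp"

# 2. Push the data out to disk before the rename makes it "live".
sync

# 3. Atomically swap it into place, then flush the metadata too.
mv "$ESP/vmlinuz.tmp" "$ESP/vmlinuz"
sync
```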
Consider setting your firmware up so that if the main kernel fails cryptographic verification, you boot from the previous kernel.
Consider also that it's possible for your kernel to pass cryptographic verification, but still fail to boot properly. As discussed in the booting section, a failed boot should result in the firmware image being switched. This doesn't help if the problem is with the recent kernel upgrade.
Address this problem by setting (yet another) EFI variable before rebooting into the bootloader, indicating that we just updated the kernel. Perhaps with the value 1. The bootloader then changes that variable to indicate we're about to attempt to boot the new kernel, perhaps to the value 2. If the bootloader encounters the value 2 when starting up, we know the new kernel failed to boot, so we revert to the old kernel. Just enhance the boot-retries clearing program to also clear this value upon successful boot.
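The bootloader's side of that little state machine can be sketched as follows, again with a plain file standing in for the EFI variable. The values 0/1/2 and the kernel file names are assumptions:

```shell
# 0 = normal boot, 1 = updater staged a new kernel, 2 = new-kernel
# boot attempt in progress.  FLAG stands in for the EFI variable.
FLAG="${FLAG:-$(mktemp)}"

bootloader_pick_kernel() {
    case "$(cat "$FLAG" 2>/dev/null)" in
        1)  # Updater set this just before rebooting: try the new
            # kernel, and record that the attempt is in flight.
            printf '2' > "$FLAG"
            echo "vmlinuz-new" ;;
        2)  # Still 2 at boot time => the new kernel never came up.
            printf '0' > "$FLAG"
            echo "vmlinuz-old" ;;
        *)  echo "vmlinuz-old" ;;
    esac
}

printf '1' > "$FLAG"      # the updater marks the pending kernel
bootloader_pick_kernel    # first boot after the update: vmlinuz-new
bootloader_pick_kernel    # simulated crash and reboot: falls back
```

The successful-boot service clears the flag back to 0, alongside the boot-attempt counter.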
Of course, this strategy only saves you if your kernel fails to boot. If the kernel has other problems that prevent things from working properly *after* you've cleared the EFI variable, you'll encounter a reboot loop. Ultimately, it's *important* that you conduct serious regression testing before deploying updates.
Careless human action can still break things no matter how many double checks are added to the system. Perhaps only have the boot variable clearing program clear the variables after several minutes have passed. Or flag a variable that boot completed fully, but service initialization failed, to your high availability logic. It's up to you.
Generally, the process for upgrading the bootloader itself is very similar to that of the kernel. Write out the new file(s) under temporary names, copy the old files aside, then mv the new files to the names of the old files, to minimize the amount of time the system is in an inconsistent state in case of power failure.
Alternatively, consider writing the new bootloader to a new name, and then changing the UEFI bootloader priority / default choice using the EFI variables.
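With efibootmgr, that looks roughly like the following. Disk, partition, labels, and entry numbers are all assumptions:

```
# Register the freshly written loader as a new boot entry.
efibootmgr --create --disk /dev/sda --part 1 \
           --label "Loader v2" --loader '\EFI\product\loader-v2.efi'

# NOTE: --create also puts the new entry first in BootOrder; put the
# old loader back in front so it remains the default fallback.
efibootmgr --bootorder 0000,0001

# Then try the new entry exactly once on the next boot.
efibootmgr --bootnext 0001
```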
Depending on how your main board is designed, it might also be possible to have multiple EFI system partitions on the same disk, and to configure fallback settings so that if the main bootloader fails for some reason, the firmware falls back to the other. If your board can be configured to work this way, you can actually skip all the manual file writing and instead upgrade the bootloader and kernel as a single image, just like with the firmware images: replace the whole partition with the downloaded filesystem image, then change the default bootloader in the EFI variables.
It's probably not a good idea to do this :-)
If you insist, I heavily recommend getting a mainboard that has dual bios chips, with automatic failover.
Saving even more bandwidth
Consider BitTorrent for downloading the new images. Everything's cryptographically signed, right? It might make your customers a bit grumpy that you're using their bandwidth, though.
If rsync or binary diffs aren't your style, consider the zsync tool, which pre-computes the hashes that the rsync algorithm needs to do its job.
Strip your firmware to the point of being naked
The fewer things you have in your fully functional firmware images, the fewer things can go wrong. Smaller security surface, smaller bug surface, smaller bit-rot surface, smaller transfer-corruption surface. In general, you'll be much happier the smaller your images are.
- Strip out things like include files, compiler toolchains, and userland utilities (you probably don't need "ping", for instance).
- Kill debug symbol files.
- Remove all optional language files.
- Statically compile as many of your programs as you can (Of course, make sure this actually saves you space).
- Do a dependency analysis and ensure that 100% of the shared object files on the system are used by a binary that your system requires to function.
- Consider compiling everything with -Os instead of -O3.
- Remove man pages.
- Recompile programs to remove unused features (Gentoo is really nice for this :-) )
- Compile as many parts of your system with link-time-optimization as you can. Gentoo is great for this, but be warned there are several programs that may not compile this way.
- Remove example configuration files, and remove normal config files where you want the default values of settings, and the config file is optional.
- Remove the package manager of the Linux distribution you use, and all its associated utilities.
- Consider alternative implementations of libraries and/or programs. The Musl libc replacement is quite a bit smaller than, say, glibc. Be warned though, there may be porting issues. Consider toybox instead of busybox. That kind of thing.
- Consider removing bash from the system entirely :-P
- Remove, remove, remove.
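One way to do the shared-object dependency analysis from the list above is to walk every binary you ship with ldd and compare the output against what's installed in the image. A sketch; the awk filter just pulls the resolved path out of ldd's `libfoo.so => /path (0x...)` lines:

```shell
# Print the sorted, unique set of shared objects that the given
# binaries actually link against.
required_libs() {
    for bin in "$@"; do
        ldd "$bin" 2>/dev/null
    done | awk '$2 == "=>" && $3 ~ /^\// { print $3 }' | sort -u
}

# Example: anything under /lib or /usr/lib that never shows up in
# this output, across every binary you ship, is a removal candidate.
required_libs /bin/sh /bin/ls
```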
With the Linux kernel's OverlayFS, if you set up another partition on the device to hold the troubleshooting utilities, you can boot the system such that the contents of that second partition are overlaid on top of the first, appearing to be a single filesystem. Any file collisions are won by the layer mounted on top, with the file from that layer being shown.
In this way, your normal run state remains a LEAN MEAN EFFICIENT MACHINE, while you keep access to rarely needed tools.
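Assuming the firmware root is mounted read-only at / and the tools partition at /mnt/tools (both mount points are assumptions), a read-only overlay needs no upperdir or workdir at all. Note that OverlayFS gives the *leftmost* lowerdir priority, so the tools layer is listed first to win collisions:

```
mount -t overlay overlay \
      -o lowerdir=/mnt/tools:/ \
      /mnt/merged
```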
Small kernels are required
I seriously, fanatically, recommend a moduleless kernel. If your firmware needs some specific functionality, compile that functionality into the kernel directly. Anything that doesn't talk to hardware you have, or provide a feature one of your services needs, should be removed. Remove support for every filesystem you don't actually use. Remove support for 32-bit binaries. Remove support for multimedia devices. Kill the entire graphics drivers subsystem if you don't need graphics.
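As an illustrative starting point (a fragment, not a complete config; the exact symbol set depends on your hardware), the relevant kernel .config switches look like:

```
# No loadable modules at all:
# CONFIG_MODULES is not set

# Root filesystem support compiled in:
CONFIG_SQUASHFS=y
CONFIG_SQUASHFS_XZ=y

# Only if the system boot partition gets mounted at runtime:
CONFIG_VFAT_FS=y

# Watchdog core, plus whatever driver matches the board:
CONFIG_WATCHDOG=y

# Things to strip:
# CONFIG_IA32_EMULATION is not set
# CONFIG_DRM is not set
# CONFIG_SOUND is not set
```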
By removing as much stuff as you can, you make your entire environment more efficient (cache friendliness), reduce your security surface, and reduce your bug surface. All big wins.
Most importantly though, modules make booting *hard*.