Monday, February 8, 2016

Gstreamer video wall / grid

Related to my previous post: My brother also asked me to investigate if there was a way to make it easier to search through the videos. My idea was to display multiple videos side by side simultaneously, to make it easier to scan though what the camera captured.

So far, I've come up with this script
gst-launch-1.0 \
    videomixer \
        sink_1::zorder=1 sink_1::alpha=1.0 sink_1::ypos=0   sink_1::xpos=0    \
        sink_2::zorder=1 sink_2::alpha=1.0 sink_2::ypos=0   sink_2::xpos=630  \
        sink_3::zorder=1 sink_3::alpha=1.0 sink_3::ypos=0   sink_3::xpos=1260 \
        sink_4::zorder=1 sink_4::alpha=1.0 sink_4::ypos=496 sink_4::xpos=0    \
        sink_5::zorder=1 sink_5::alpha=1.0 sink_5::ypos=496 sink_5::xpos=630  \
        sink_0::zorder=1 sink_0::alpha=1.0 sink_0::ypos=496 sink_0::xpos=1260 \
        name=mix \
    uridecodebin uri=file:///path/to/video1.mkv ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! mix. \
    uridecodebin uri=file:///path/to/video2.mkv ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! mix. \
    uridecodebin uri=file:///path/to/video3.mkv ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! mix. \
    uridecodebin uri=file:///path/to/video4.mkv ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! mix. \
    uridecodebin uri=file:///path/to/video5.mkv ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! mix. \
    uridecodebin uri=file:///path/to/video6.mkv ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! mix. \
    mix. ! queue ! theoraenc ! progressreport name="Encoding Progress" ! matroskamux ! filesink location=/path/to/output/file.mkv
Note: The zorder and alpha attributes are only included for sake of example. Further, note that sink_0 is actually the last video attached, in this case video6. Personally I think this is a serious mistake on the part of the gstreamer developers, to use such confusing nomenclature, but it's their framework and they can implement it how they want to.

This pipeline does actually do the job, putting 6 videos in a 3x2 grid, and encoding the result to a single video file.

However, there are two nagging problems that I'm running into.

The first problem is that videomixer seems to be slower than molasses. It can take upward of ten minutes for this pipeline to finish merging six 10 second videos. I totally understand that the mixing process is cpu intensive, but it seems like it takes an excessive amount of time.

The second problem that I run into is that the resulting video, when I play it with VLC anyway, displays only the first frame. If I drag the scrollbar, I do see the other frames, but releasing the scrollbar makes the video revert right back to only displaying a single frame.

If anyone has any insights into how to improve this pipeline, I'd be very interested in any advice.

Merging videos using gstreamer

My brother came to me the other day asking if I knew of a way to merge multiple small (approximately 30 seconds each) video files into a single, longer, video file. He works for a retail business and wanted to review some security footage more easily.

The GStreamer framework, as of version 1.6, now has a plugin that makes doing this operation significantly easier than ever before. The concat plugin simply takes 1 to N number of inputs, and activates those inputs one at a time, playing each entirely, filtering the EOS event from all but the last element, and then moving to the next. In this way, we can make other elements in the pipeline treat these multiple sources of data as a single, continuous, stream.

The ultra-simplified explanation is that we're simply playing each small video one after another with no pause in between, and then re-encoding the resulting video as a single file.

The problem that I was running into is two fold. First, the some aspect of the contact element, when running from the commandline using the gst-launch-1.0 debugging / prototyping program, becomes very hostile to handling more than, for example, 10 files at a time. I was able to consistently use a pipeline that looks like this

gst-launch-1.0 concat name=c \
    uridecodebin uri=file:///path/to/video1.asf ! queue ! c. \
    uridecodebin uri=file:///path/to/video2.asf ! queue ! c. \
    uridecodebin uri=file:///path/to/video3.asf ! queue ! c. \
    uridecodebin uri=file:///path/to/video4.asf ! queue ! c. \
    uridecodebin uri=file:///path/to/video5.asf ! queue ! c. \
    c. ! videoconvert ! theoraenc ! progressreport name="Encoding\ Progress" ! matroskamux ! filesink location=/path/to/output/file.mkv ;
To merge a handful of files at a time, but trying to do more than roughly 10 made gst-launch throw up errors very consistently

My second problem, which gets drastically compounded by the first problem, is that my brother handed me a thumb drive with over 5 thousand video files on it. Personally, I have no interest in sitting there typing out each file name one at a time, and to the best of my, admittedly limited) knowledge, gstreamer simply doesn't have an element that opens each file that matches some pattern and attaches them to a sink in a way that would be useful for this situation.

So, I need to write a meta program that generates a program that calls gst-launch with only a few video files at a time. I figured that if I could merge them 5 at a time, I could merge the resulting videos 5 at a time, and repeat that pattern until there were only around a dozen big video files, at which point he can search through them manually on his own.

So here's what I've come up with so far. It's a bit of a big-huge-hack, but at least it'll let me get the job done and move onto other projects.
mkdir -p encode_results
echo "runfunction() {" >> $OUTPUTSCRIPT
echo "    until \$@" >> $OUTPUTSCRIPT
echo "    do" >> $OUTPUTSCRIPT
echo "        :" >> $OUTPUTSCRIPT
echo "    done" >> $OUTPUTSCRIPT
echo "}" >> $OUTPUTSCRIPT
echo "" >> $OUTPUTSCRIPT
for f in `ls *.ASF | sort -g` ;
    if [ $COUNTER -eq 0 ] ;
        echo "runfunction gst-launch-1.0  --no-fault --gst-debug-disable concat name=c \\" >> $OUTPUTSCRIPT
    echo "    uridecodebin uri=file://$VIDEOFOLDER/$f ! videorate ! video/x-raw,framerate=[1/1,10/1] ! queue ! c. \\" >> $OUTPUTSCRIPT
    if [ $COUNTER -eq $NUMBER_PER_SWEEP ] ;
        echo "    c. ! videoconvert ! theoraenc ! progressreport name=\"Encoding\ Progress\" ! matroskamux ! filesink location=$VIDEOFOLDER/encode_results/encode_result_${FIRST}_${f}.mkv &" >> $OUTPUTSCRIPT
        echo "" >> $OUTPUTSCRIPT
if [ $COUNTER -ne 0 ] ;
    echo "    c. ! videoconvert ! theoraenc ! progressreport name=\"Encoding\ Progress\" ! matroskamux ! filesink location=$VIDEOFOLDER/encode_results/encode_result_${FIRST}_${f}.mkv &" >> $OUTPUTSCRIPT
    echo "" >> $OUTPUTSCRIPT
You should be able to simply copy it into a bash prompt after changing the filenames at the top.

It'll generate a new script. That new script will all the encoding jobs in parallel. If any encoding job fails to complete successfully, then the "runfunction" will automatically start it again.

Thursday, February 4, 2016

How to manage an embedded Linux firmware update system

A friend of mine is working on an embedded Linux device that allows you to manage some security cameras that he has. I've also worked on setting up numerous systems like this for hobby projects in my personal time. Finally, my day job has started investigating this type of setup. Clearly this is a common scenario, so just for the purpose of getting all my thoughts down so I could digest them, I decided to write this post.

My friend and I got to talking the other week about how to manage automatically deploying OS updates to his device when the device is in the hands of customers, plugged into their network. Since an update failure would essentially render the device useless to a customer, causing an expensive RMA, and manual recovery operations by a technician, we wanted to investigate what standard-ish Linux utilities could be used to ramp up the reliability of the upgrade process.

What follows is essentially a brain dump of what I think a well put together system would look like. Be nice about typos, this is a very stream-of-thought document.


  • UEFI firmware
    • Supports Secure Boot
    • Supports either modifying EFI variables by the boot loader or modifying data on the system boot partition by the bootloader.
  • x86_64 processor
  • Hardware watchdog
    • Watchdog verified to work properly with standard Linux utilities. E.g., available via a package manager for a distribution, not some crazy custom program and/or proprietary driver.
  • Internal, bootable, storage with sufficient space to store two full copies of the firmware with some scratch space left over.

Basic partitioning layout

GPT partition table. UUIDs determined statically for all instances of the product.
  1. UEFI system boot partition  -- Assigned UUID 1
  2. Firmware-A  -- Assigned UUID 2
  3. Firmware-B  -- Assigned UUID 3
  4. Data  -- Assigned UUID 4
Please note that there's no swap partition. Swap is highly useful for desktop, or high power server type situations, where overcommitting resources is a desirable capability, but an embedded setup really shouldn't run into a situation where you need more ram than is available.

Basic boot strategy

Image integrity

Using SecureBoot, a UEFI feature, we can use our private signing keys to ensure that the bootloader installed on the system is one that we deployed. Further, we can configure the bootloader to verify that the kernel that loads is also one that we deployed. Basic boot integrity is important, after all.

Verifying that the firmware image to boot is cryptographically signed, to ensure the image as a whole is known-good, is a good thing to consider as well, even though it may involve some CPU time spent doing the verification on such a large amount of data.

Boot availability

Somewhere in either the EFI system variables, or the system boot partition, we need to store a variable that tells us which partition to load. Essentially this can be the PARTUUID of the GPT partition. We also need a variable that holds a counter, representing the number of times we've attempted to boot this partition without the OS changing the counter back to 0.

We need two protection mechanisms to ensure we eventually manage to boot properly. The first mechanism is that many bootloaders can be configured to trigger some action when the kernel panics. This is a graceful failure, allowing us to fall back to a known good state.

The problem is what to do if the kernel simply hangs. Maybe there's some race condition, and it never quite gets to the panic, which can allow our bootloader to recover. Here's where the hardware watchdog comes into play. If we assume that the hardware watchdog will reset the machine if (and only if) the machine fails to send it a heartbeat signal within some amount of time that should be sufficient to boot properly, we guarantee that even in a worst-case scenerio, we'll always return to the bootloader to try booting again.

The final bit of magic is in the counter saying how many times we've failed to fully boot. Just before executing the kernel, the bootloader increments the counter for how many times it's tried that partition by one. If the counter reaches some upper limit, then the bootloader tries to boot from the other Firmware partition. (If it's using A, it switches to B, and vice versa). The upper limit can be stored anywhere. Hardcoded, another EFI variable, somewhere on the system boot partition. It doesn't really matter.

Basic runtime strategy

Post boot actions

If the system boots up fully, then among other things, one of the processes that needs to be launched is some program that resets the "number of boot attempts" variable to 0. If this process doesn't launch, then after some number of reboots, the system will still switch to the other firmware. Frankly, this process failing to launch and execute it's task should be considered a hard failure for BOOT, not for runtime. This is an important distinction.

Using the systemd init system, it's possible to set service dependencies. Say your device, ultimately, needs to run 3 different services that stay alive at all times. Your dependency graph might look something like this

                Boot Counter Reset
                     //        ||      \\
        Service A  Service B  Service C

Where Service A, Service B, and Service C each have a systemd watchdog (different than the hardware watchdog, see the WatchdogSec= option of systemd.) configured, and use the sd_notify api to continually inform systemd that they are alive and functioning. 

If any of service A, B, or C fail to launch, initialize, and run, the hardware watchdog and boot counter reset program won't launch. This will cause the hardware watchdog to trigger a system restart. If this happens enough times, the alternate firmware will be loaded.

The boot counter reset service is of type "oneshot", as it doesn't stick around after setting the counter variable.

Additional watchdogging

The services in systemd should probably be configured to restart on failure. This way, if there's a minor problem that doesn't happen consistently, the service will automatically heal itself. Obviously, putting a limit to the number of restarts is a wise choice. If the service fails too many times, the hardware watchdog heartbeat service will be shut down, which will automatically trigger a system reboot.

An additional advanced feature would be to automatically detect too many service failures, and switch to the other firmware automatically. If a service is failing twice a minute, that's a good indication that there's something wrong. Essentially to accomplish this, when the hardware watchdog triggers (or is about to trigger, depending on how this gets handled) increment the number of times the hardware watchdog has triggered, and then reboot.

The bootloader should then check, in addition to the boot attempts number, the variable for the number of watchdog triggers, and if EITHER of them is above the threshold, to switch to the alternate firmware after clearing both counters.

Firmware image


To ensure system integrity, mitigate tampering, and generally just make things have less moving parts, the firmware image, when loaded by the kernel and mounted as the root partition, should be strictly read-only. The storage on the data partition is the only part of the system that should have a mutable state during normal operations. This is basic deployment hygiene, and has obvious benefits and essentially no downside.


Among other reasons, a very compelling one is "why not?". There's almost no downside to this, aside from an incredibly small amount of boottime CPU usage. A properly stripped down firmware image will fit easily into the systems file cache, and so after boot is finished, accessing the firmware file system will, essentially, involve no disk IO at all.

If the runtime cost to access parts of the compressed image are concerning, ,there's also always the option of copying the firmware (uncompressed) wholesale into a ram-based filesystem, and using that as the root filesystem instead. That will guarantee that no disk access happens after boot for root partition related usage.

There are, of course, other considerations for compressing your images. including faster firmware download times, faster firmware installation times, faster firmware cryptographic signature at boot (if you're using that).

Recommended type

I recommend using SquashFS with the XZ compression algorithm. Linux can boot a partition that is simply a SquashFS image written directly to the raw partition as the root filesystem, WITHOUT the need for an initramfs. Aside from the filesystem being read-only, it works identically to if you used any other normal filesystem, with great filecache properties, possibly saving you some (or all) disk accesses.

Deployment considerations

There are a couple ways to manage doing a firmware update. If you're using SquashFS with the XZ compression, you'll already have your firmware images as small as they are practically ever going to get without removing some of the files contained within them. This can drastically reduce your download size, when compared with downloading the uncompressed image.

An interesting thing about the way squashfs handles the compressed data, is that instead of being a single stream of bytes, squashfs is actually an organized collection of compressed chunks of data. When compressing a given filesystem tree, the squashfs creation tools will always compress the data in the same order, and it'll do it in specific byte sized chunks. This means that a small change to the original data will stay local to the parts of the squashfs filesystem that represent the files that were modified.


When you consider common large-file transfer tools, such as rsync, this means that rsync can very efficiently update a squashfs image from an older version to a newer version, since only a small amount of the squashfs file will have changed. Of course, if you make a very large change, all bets are off.

Binary diff

Even better than rsync, though, is that you can pre-compute the differences between the current firmware and the newly deployed firmware. Each system that needs the firmware update can simply download the binary diff that you generate once, and then apply that binary diff directly to the raw partition that the firmware to be replaced is stored on. Of course, make sure you cryptographically sign the binary diff, and have the firmware upgrade process verify the signature before applying it.

If, for some reason, the firmware stored on the currently-not-in-use partition is a version that has no binary diff available, the firmware upgrade system can either request that the server create a diff on-demand, or can simply download the full update and apply it directly. Either method is fully functional.


Depending on what kind of Linux distribution you're using, it might be important to consider the upgrade path from very old firmwares to the latest and greatest. Considering you're deploying read-only images, there really should be very little reason to worry about a failure, but do keep this in mind when deciding on how to handle the actual download process, and what method of applying the firmware you end up picking.

If you're using a rolling-release distribution, such as Gentoo, a lot of packages can change without much warning in simply a few months. Obviously properly testing things is important, but it might be worth making a decision between doing an occasional "package freeze" versus a continuous firmware generation / deployment / self test system on internal hardware that gets used for dogfooding. It's a tough call, and there are pros and cons for both sides.

The right answers for your setup are entirely up to you.

Upgrading the bootloader and kernel


The kernel is stored separately from the firmware image, in the system boot partition. Depending on what bootloader you're using, you might need to name the kernel a specific filename, or you might need to overwrite a configuration file to point to the new kernel. In either case, there are a few failure scenerios here.

Unlike the firmware image, where we're already booted into a known-working image that we can fall back on if the upgrade fails, the kernel is a narrower point of failiure. 

When writing the kernel to disk, always write to a temporary name, and then mv the kernel to it's real filename after it's on disk. This helps protect against power failure when the kernel is only half-written to disk. 

Consider setting your firmware up so that if the main kernel fails cryptographic verification, you boot from the previous kernel.

Consider also that it's possible for your kernel to pass cryptographic verification, but still fail to boot properly. As discussed in the booting section, a failed boot should result in the firmware image being switched. This doesn't help if the problem is with the recent kernel upgrade.

Address this problem by setting (yet another) efi variable, indicating that we just updated the kernel before rebooting into the bootloader. Perhaps with the value 1. Change that variable to indicate we're about to attempt to boot the new kernel, perhaps to value 2. If the bootloader encounters the value 2 when starting up, we know the new kernel failed to boot, so we'll revert to the old kernel. Just enhance the number-boot-retries clearing program to also clear this value upon successful boot.

Of course, this strategy only saves you if your kernel fails to boot. If the kernel has other problems that prevent things from working properly *after* you've cleared the efi variable, you'll encounter a reboot loop. Ultimately, it's *important* that you conduct serious regression testing before deploying updates. 

Careless human action can still break things no matter how many double checks are added to the system. Perhaps only make the boot variable clearing program clear the variables after some numerous amount of minutes. Or flag a variable that boot completed fully, but service initialization failed, to your high availability logic. It's up to you.


Generally, the process for upgrading the bootloader itself is very similar to that of the kernel. Write out the new file(s), copy the old files, mv the new files to the names of the old files, to minimize the amount of time the system is in an inconsistent state in case of power failure.

Alternatively, consider writing the new bootloader to a new name, and then changing the UEFI bootloader priority / default choice using the EFI variables

Depending on how your main board is designed, it might also be possible to have multiple EFI system partitions on the same disk, and configure fallback settings, where if the main bootloader fails for some reason, then it falls back to the other. If your board can be configured to work this way, you can actually skip all the manual file writing and instead upgrade the bootloader and kernel in a single image just like with the firmware images, just replacing the whole partition with the downloaded filesystem image, and then changing the default bootloader in the EFI.

Bios upgrade

It's probably not a good idea to do this :-)

If you insist, I heavily recommend getting a mainboard that has dual bios chips, with automatic failover.

Other thoughts

Saving even more bandwidth

Bittorrent for downloading the new images. Everything's cryptographically signed right? Might make your customers a bit grumpy that you're using their bandwidth though.

If rsync or binary diffs aren't your style, consider the zsync tool, which pre-computes the hashes that the rsync algorithm needs to do it's job.

Strip your firmware to the point of being naked

The less things you have in your fully functional firmware images, the less things can go wrong. Smaller security surface, smaller bug surface, smaller bit-rot surface, smaller transfer corruption surface, just in general you'll be much happier the smaller your images are.
  • Strip out things like include files, compiler toolkits, userland utilities (probably don't need "ping", for instance). 
  • Kill debug symbol files.
  • Remove all optional language files.
  • Statically compile as many of your programs as you can (Of course, make sure this actually saves you space).
  • Do a dependency analysis and ensure that 100% of the shared object files on the system are used by a binary that your system requires to function.
  • Consider compiling everything with -Os instead of -O3.
  • Remove man pages.
  • Recompile programs to remove unused features (gentoo is really nice for this :-) )
  • Compile as many parts of your system with link-time-optimization as you can. Gentoo is great for this, but be warned there are several programs that may not compile this way. 
  • Remove example configuration files, and remove normal config files where you want the default values of settings, and the config file is optional.
  • Remove the package manager of the linux distribution you use, and all it's associated utilities.
  • Consider alternative implementations of libraries and/or programs. The Musl libc replacement is quite a bit smaller than, say, glibc. Be warned though, there may be porting issues. Consider toybox instead of busybox. That kind of thing.
  • Consider removing bash from the system entirely :-P
  • Remove, remove, remove.
If you need some ability to log in an interact with a live system for diagnostic purposes, seriously consider making two versions of every firmware. One with nothing but the bare minimum, and then another with the various debug and troubleshooting tools included.

With the Linux kernels OverlayFS, if you setup another partition on the device to hold the troubleshooting utilities as a separate partition, you can boot the system such that the contents of the second partition are overlayed on top of the first, appearing to be a single filesystem. In this way, any file collisions have the partition mounted second win the collision, with the file from that partition being shown. 

In this way, your normal run state is still LEAN MEAN EFFICIENT MACHINE! While still providing access to rarely needed tools.

Small kernels are required

I seriously, fanatically, recommend a moduleless kernel. If your firmware needs some specific functionality, compile that functionality into the kernel directly. Anything that doesn't talk to the hardware you have, or provide a feature one of your services needs, should be removed. Remove support for all filesystems. Remove support for 32bit binaries. Remove support for multi-media devices. Kill the entire graphics drivers subsystem if you don't need graphics. 

By removing as much stuff as you can, you make your entire environment more efficient (cache friendlyness), reduce your security surface, and reduce your bug surface. All big wins.

Most importantly though, modules make booting *hard*.

Other deployment environments


Cloud / Virtual Machine

Thursday, September 10, 2015

Why C++ Lambdas and std::algorithm is better than a hand-written for-loop.

A conversation at work recently sparked an investigation on whether C++11 lambdas, combined with the use of functions from the standard library like std::for_each, or std::accumulate, etc, are better than writing hand-tuned for-loops. Note, however, that this post completely ignores the use of range-based for loops, which can be significantly more readable than either the old style loops, or the lambda version that I'm endorsing in this post. Your mileage may vary.

I'm personally on the side of using C++ lambda's whenever and wherever practical. Obviously don't go and rewrite every single line of code in your codebase, but if you're already going to be making changes to a function, and that function has a few hand-written for-loops, it'll probably be worth it to convert a couple of them to use std::for_each or similar. 

I'll try to touch on some major categories of reasons why I think this way.


The biggest concern a lot of people will have is that of performance.

An example usage of std::for_each.
void foo(const std::vector<int> & vect)
    std::for_each(vect.begin(), vect.end(), [](const int val) -> void { printf("%d", val); });
I know that an example this simple could be easily accomplished with std::bind as well, but I'm just trying to explain the general concept at this point :-).

Consider the STLPort (A very popular cross-platform STL implementation) implementation of std::for_each
// for_each.  Apply a function to every element of a range.
template <class _InputIter, class _Function>
for_each(_InputIter __first, _InputIter __last, _Function __f) {
  for ( ; __first != __last; ++__first)
  return __f;
Leaving aside investigations of what the _STLP_INLINE_LOOP macro will do to the resulting code, we can already see that this is a template function with template parameters for the iterator type, and the function object type.

Anywhere that this template function is used, the compiler will have 100% perfect knowledge of exactly what code is involved. Nothing is hidden in another translation unit, and there's no uncertainty about how the function is implemented at all. Your compiler should be able to inline this loop, in it's entirety, every single time it's used, even without link-time-optimization or similar fancyness. In fact, assuming you're not using an experimental or poorly implemented compiler, if it doesn't inline it's likely because of specific compilation options you've chosen, or it's decided that not-inlining will somehow be faster.

With our example, the inlined version should look something like this.
void foo(const std::vector<int> & vect)
    std::vector<int>::iterator begin = vect.begin();
    const std::vector<int>::iterator end = vect.end();
    auto function =  [](const int val) -> void { printf("%d", val); };
    for ( ; begin != end; ++begin)
We're getting pretty close to what most programmers would write for a for-loop already.

What about the "auto function = ..." line?

Essentially, a lambda function is simply a syntax shortcut in the language to define a new "functor" object, which is a struct that implements the operator() function. Unwinding the C++ syntax for lamdbas, we get code looking roughly like this.
struct randomly_generated_name_here
    void operator()(const int val)
        printf("%d", val);
void foo(const std::vector<int> & vect)
    std::vector<int>::iterator begin = vect.begin();
    const std::vector<int>::iterator end = vect.end();
    randomly_generated_name_here function;
    for ( ; begin != end; ++begin)
Clearly the compiler knows all of the implementation details of the "struct_randomly_generated_name_here" object. It might decide to inline the code for the operator() function of that object into the places where it's called. Additionally, a lambda's name is hidden from the programmer, and the compiler guarantees that there will never be name collisions. Unless you've captured the type of the lambda using a variable, such as I did in the examples above, or using either the decltype keyword, or a template function's variable type deduction capability, then once the scope that contains the lambda ends, that's it. As far as your program is concerned after that point the lambda never existed.

This is pretty compelling reason to believe that, for a lambda function that only gets used once, the compiler will simply inline it and be done with it. If you compile your program in debug mode, the compiler will almost certainly include various symbols related to lambda's that have been inlined in this kind of way, but a release mode compile, with optimizations turned on, and you might not even have any symbols for this lambda in your program at all. It may ultimately be inlined out of existence. In fact, I just compiled a small test program, the lambda's in the symbol table a few places for the debug version, but completely gone on the release version.

Here's a slightly more inlined version of our function
void foo(const std::vector<int> & vect)
    std::vector<int>::iterator begin = vect.begin();
    const std::vector<int>::iterator end = vect.end();
    for ( ; begin != end; ++begin)
        printf("%d", *begin);
One might argue that they would write the function like this
void foo(const std::vector<int> & vect)
    for(std::vector<int>::iterator begin = vect.begin();
         begin != vect.end();
        printf("%d", *begin);

But I think that for almost all situations, this version of the function is essentially identical to the other.

In fact, one might be able to make an argument that this second version is slightly worse, depending on the behavior of std::vector<int>::end(). If this function is const, then yes they should result in the same objectcode. Otherwise, the compiler might not be able to determine that end() has no side effects, and as such, it'll call end() every iteration through the loop, possibly at large computational expense if we're using a custom data structure instead of std::vector<int>. Using such a structure, that had an expensive end() function, would actually make the fake-manual-compiler-inlined version of the function measurably faster than the hand-written version that a human would write.

Now, you might argue that the resulting assembly code might be, or will be different for the two versions. There's every possibility. I can't guarantee what the compiler does with the code you hand it, obviously. However, even if the assembly output is different for the two versions, I would argue that in almost all situations, a properly optimizing compiler, doing optimizations just like what we did here by hand, should be able to produce code that is every bit as performant as a hand-tuned loop for almost any lambda.

I've written a few test programs and looked at the resulting assembly for them, and so far I haven't seen anything that makes me question the comparable performance of the two versions.


C++ standard library algorithms are designed to be composable. That is to say that the functions in <algorithm> and <numeric> are intended to be combined with each other to produce more interesting behavior. 

For example, consider the "remove erase" pattern for a datastructure. 
void remote_if_greater_than(std::vector<int> & vec, const int comp)
    vec.erase(std::remove_if(vec.begin(), vec.end(), std::bind(std::greater, _1, comp)), vec.end());
Or with a lambda
void remote_if_greater_than(std::vector<int> & vec, const int comp)
    vec.erase(std::remove_if(vec.begin(), vec.end(),
        [comp](const int val)
            return val > comp;
        }), vec.end());
The std::remove_if function rearranges the ordering of items in the collection so that the items to be removed are all at the end. It then returns an iterator to the first item of the "to be removed" set. std::vector<int>::erase has two versions, one for erasing only a single item, and another for erasing a set of items.

Basically, we rearrange the contents of the vector so that things to be removed are grouped at the end, and then we simply chop the end of the vector off. Fairly simple. 

Or perhaps we want to remove anything above the average?
void remove_above_average(std::vector<int> & vec)
    vec.erase(std::remove_if(vec.begin(), vec.end(), std::bind(std::greater, _1, std::accumulate(vec.begin(), vec.end(), 0)/vec.size()), vec.end());
Or with a lambda
void remove_above_average(std::vector<int> & vec)
    vec.erase(std::remove_if(vec.begin(), vec.end(),
        [comp = std::accumulate(vec.begin(), vec.end(), 0)/vec.size())] (const int val) -> bool
            return val > comp;
        }), vec.end());

Now, clearly lambda's aren't everything, given the comparative readability of the lambda code versus, but this post isn't about extolling the virtues of lambda's over various built-in functions, and this section is only intended to demonstrate the composibility of STL algorithms.

Consider a problem I was facing not too long ago, where I needed to compare two sequences. If sequence A had a member that sequence B did not, I needed to call function 1 with the member from sequence A as parameter. If sequence B had a member sequence A didn't, then I needed to call function 2 with the member from sequence B as parameter.

Essentially, I'm replacing sequence A with sequence B, and I need to "unhook" the items that are going away, and "hook up" the items that are incoming.

That could look something like this in psuedocode.
collection_t A;
collection_t B;
for each item in A, see if there is an equivalent item in B. If there is not, call function 1.
for each item in B, see if there is an equivalent item in A. If there is not, call function 2.

If I had to do this kind of comparison, writing the same forloop logic several times for minorly different data types could start to get annoying and error prone.

So I wrote a simple template function to do the work for me.
     * \brief Finds the first element in the collection for which pred returns false.
     * @param collOne [in] One of the collections to reconcile.
     * @param collTwo [in] One of the collections to reconcile.
     * @param pred    [in] A binary function object that accepts as the first argument, items from collOne, and the second argument, items from collTwo.
     * @param tranOne [in] A unary function object that accepts as argument items from collOne. This function will be called on any item from collOne for which pred matches NONE of the items from collTwo
     * @param tranTwo [in] A unary function object that accepts as argument items from collTwo. This function will be called on any item from collTwo for which pred matches NONE of the items from collOne
    template<typename ITERATOR_ONE_T, typename ITERATOR_TWO_T, typename COMPARITOR_T,
             typename TRANSFORM_ONE_T, typename TRANSFORM_TWO_T>
    void reconcile_sequences(const ITERATOR_ONE_T & beginOne, const ITERATOR_ONE_T & endOne, const ITERATOR_TWO_T & beginTwo, const ITERATOR_TWO_T & endTwo,
                             COMPARITOR_T pred, TRANSFORM_ONE_T tranOne, TRANSFORM_TWO_T tranTwo)
        std::for_each(beginOne, endOne,
                 [pred, tranOne, beginTwo, endTwo](const decltype(*beginOne) & one) -> void
                    if(std::none_of(beginTwo, endTwo, std::bind(pred, one, _1))
        for_each(beginTwo, endTwo,
                 [pred, tranTwo, beginOne, beginOne](const decltype(*beginTwo) & two) -> void
                    if(std::none_of(beginOne, endOne, std::bind(pred, _1, two))

Now, for any situation where I need to reconcile two different sequences, of any type, I can just do this.

collection_t A;
collection_t B;
reconcile_sequences(A.begin(), A.end(),
                    B.begin(), B.end(),
                    [](const A::value_type & aval, const B::value_type & bval) -> bool { /* comparison code */ },
                    [](A::value_type & aval) { /* A not found in B. Do the work on A here. */}
                    [](B::value_type & bval) { /* B not found in A. Do the work on B here. */});
Clean, concise, to the point. The details of how we iterate over the sequences are left to the imagination, but we only need to rely on a single loop implementation somewhere that takes function objects, and we're golden. 

One can imagine all sorts of simple algorithms that can either build off of existing std::algorithm functions to create more complex behavior, or that work well in conjunction with the existing ones. 

While none of this is, strictly speaking, a direct part of the lambda function concept, the ability to use these pre-built loops can make the life of a programmer significantly easier. Further, as my simple examples have shown, these functions are drastically enhanced by the flexibility of lambda functions to fit themselves into the nooks and crannies of any existing incompatibility.

It's like the Reese's peanut butter cup comercials from many years ago. A lot of people enjoy peanut butter, and a lot of people enjoy chocolate. I'd say that a lot of people think that the two together is just downright awesome.

A lot of people enjoy std::algorithm functions, and a lot of people enjoy lambda functions. I'd say that a lot of people think that the two together is just downright awesome!


Consider that reconcile_sequences function that I wrote in the previous section. If you had a slew of hand-written forloops that, effectively, accomplished the same task as my template function, and your manager came to you and said that the reconciliation process needed to be re-written to automatically use all available processing cores, how would you proceed?

Would you write some kind of shared function to centralize the multi-threading parts of the task into one place? Would you implement the same copy-pasted code in each place instead?

I'd assume that most programmers would end up writing a function that implemented all of the shared multi-threading logic in one place. The function may even have a signature that's very similar to my template function above. But to get to that point, they have to go back through all of the their hand written loops, and replace them with calls to a function like that. If you didn't have access to the std::algorithm collection, depending on what you're trying to do, this could be somewhat hard to pull off.

Now, assuming that you're using a relatively new version of GCC (one that supports the -fopenmp commandline option), pretend that you simply wrote a template function like I did, and used lambdas to do all of your calculations. All you'd need to do is ensure that your predicates had no side effects, and that your transform functions were thread-safe. If so, the work to switch from single threaded to multi-threaded (though, admittedly not perfectly multi-threaded) could be as simple as copying the function (lets name it "parallel_reconcile_sequences"), and in the copy replace all calls to "std::for_each" with "__gnu_parallel::for_each", and "std::none_of" to "__gnu_parallel::none_of", and suddenly your reconciliation algorithm is parallel!

Obviously this is slightly contrived, but the point that I'm trying to make is that the more "building block"-ish your code is, the easier it becomes to swap out parts of it later. This lets you evolve the implementation of various algorithms that your code uses without having to actually change the code that uses those algorithms. 

Or consider that you might change the type of the collection that your lambda's working against so that you can iterate over it more efficiently, you can create a specialized version of for_each, or similar, to go with your new iterator, without having to modify the contents of your lambda at all.

But it gets better than just the example above.

So you've got a lambda function, say one of the first examples, that prints every int given to it. Again, a very simple lambda, but it provides a simple representative example to work with.
void foo(const std::vector<int> & vect)
    std::for_each(vect.begin(), vect.end(), [](const int val) -> void { printf("%d", val); });
Now how would you approach needing to do the same operation on multiple for-loops? If you were writing hand-tuned loops, you'd simply copy-paste and change the variable names. But with lambda functions it can be considerably easier.
void foo(const std::vector<int> & vectOne, const std::vector<int> & vectTwo)
    auto lamb = [](const int val) -> void { printf("%d", val); };
    std::for_each(vectOne.begin(), vectOne.end(), lamb);
    std::for_each(vectTwo.begin(), vectTwo.end(), lamb);
Simply by storing the lambda in a variable, you can write your loop body once, but use it many times. The lambdas can also be stored in member variables of objects, or returned from functions, or otherwise reused over the lifetime of a program.

Remember that, behind the scenes, the compiler is effectively just making a struct that has an operator() function. Anything you could normally do with a normal struct, you can do with a lambda, including copies, pointers / references to, calling functions with it passed as a parameter, and so on.

I'm sure that you'll see that there are significant usages of being able to quickly and easily define new function objects and pass them around for


I think this is the big one, and I can make the case without too many pages of text.

Consider two loops

void(const std::vector<int> & foo)
    for(std::vector<int>::iterator it = foo.begin(), it != foo.end(); it++)
        printf("%d", *it);

void(const std::vector<int> & foo)
    std::for_each(foo.begin(), foo.end(), [](const int val){ printf("%d", val); });

Which one is easier to understand in it's entirety?

There's two situations to consider here. People who are experts on this code, and people who are not experts.

At a glance, an expert will be able to nearly instantly understand both versions of this particular loop without any problem, but a person who isn't an expert on this code will be able to understand the second version much faster.


A non expert, someone who isn't familiar with the section of code in question, or who isn't on amazing ground with C++ yet, is going to be able to read the second version, and only 2 keywords into the loop "std::for_each", will know instantly that whatever comes next is applied equally to the entire range. There's no possibility of short circuiting, stopping early, rewinding, etc. Now, obviously the function that's applied to each item can be selectively applied using an if statement or similar, but it's not possible for that function to effect the iteration, only the action taken on each item.

Imagine that we're working with a much more complex loop body. An algorithm that spans > 200 lines of code, and has many returns or breaks, or continues, in it. Even worse, a goto. How would you be able to tackle how this loop works? To be able to answer as simple a question as "Does this loop apply the loop body to every item?" you'd have to very carefully analyze each line of code.

Now take that same loop body, and translate it into a function suitable for use with std::for_each. Once you see std::for_each, it doesn't matter WHAT's in the loop body, you know for certain that, baring code that never returns, or that exits the program or thread, every single item in the collection will get the function applied to it. The same (or equivalent) loop bodies, but one version could take hours to answer the question of if every item is visited, and the other version takes seconds.

Given that the algorithm (and numeric!) library offers lots of very useful functions, such as copy, backwards copy, transform, for_each, none_of, find_if, and more and more, judicious use of them can drastically increase the ability of newbies to understand what your program's supposed to be doing.

I'd love to see any other rationale from readers, or any arguments toward one or the other kind of loop. Let me know what you think!

Friday, August 28, 2015

Why it's important to plan a project properly

My last post was about the commonly expounded phenomenon of shaving Yaks :

That post used, as an illustrative example, a project that I'm working on professionally.  Please understand that this post isn't intended to be critical of any particular person, project, company, or event, but is instead a gestalt of my experiences all throughout my career so far. Every company can have absolutely amazing people, customers, projects, and culture, and every company can have awful people, customers, projects, and culture. Frankly, a lot of companies have a mix of both good and bad, just like everything else in life. So please don't think I'm trying to criticize any particular organization, this post, and most of the posts that I have on my blog, are intended to be philosophical waxing, not criticism.

Even farther back in the past, I wrote a piece on how the industry that my father is in, residential construction, is remarkably, even eerily, similar to the field of Software Engineering. :

I had a conversation with my father about the yak shaving activities I was participating in (and not enjoying) at work, while he had no idea what i meant by yak shaving, as soon as I explained the whole deal around a task that has to be completed so that a task can be completed, so that a task can be completed, so that...... so that my actual assignment can move forward, he instantly recognized what I was talking about and was able to match my own frustrated rant with one of barely contained intensity born of 30+ years of shaving his own yaks!

An example of said yak shaving, (which involves significant amounts of me getting the details mixed up, and adding embellishments :) ).

He's at a customer site, building, say, a porch or something similar. Turns out that the customer's door frame is rotting out, and the customer agrees it needs to be replaced before the construction of the porch can continue. So dad calls up the local hardware store and asks about a specific model door frame, the guy on the phone assures him they have it. So he goes the half hour drive out to the store, only to find out they didn't have it! Next store closer is another half hour drive, luckily they have it in stock, but half way there he runs into road construction and has to take a detour. Yada yada yada, ultimately almost an entire day is blown on this project of building a porch before a single nail can be hammered into a single board!

Clearly yak shaving is common in lots of different fields.

We got to talking about mitigation strategies, and we were able to delve into the realm of timeline estimation, foreseeing problems, standard operations that helped prevent problems ahead of time, and so on.

On the subject of time estimates, clearly the answer is "it'll take as long as it'll take!", which is the howl of a hell hound for people trying to timebox a project, or a component of a project. My own time estimation is so notoriously bad that even doubling what my gut tells me is consistently an underestimate.

Over the years, my dad's developed a set of procedures for how to undertake operations, either big or small. Going into the details of all the procedures would be kind of hard to do, there are so many of them, and they're all gut knowledge that would probably take decades for him to convey to me, but the general gist is that he seems to build in risk avoidance at every turn. This consistently adds more time to the project compared to the best-case timeline, but that's not what concerns him. The biggest risk for a construction project, a risk that could see the death of his business, and the endless nightmare of pissed off customers and angry lawyers, is always the worst-case timeline of a project. The kind of worst-case involving a house collapsing while building it or similar disaster.

The first step, obviously, is proper planning. Assessing what the end result needs to be, visualizing it, and modeling it. Getting feedback from relevant stakeholders, such as the customer, an architect, possibly a designer or realator, and asking the relevant subcontractors what various options in the plan will do to the bottom line. On a project that will take up to a year of boots-on-the-ground time, this first phase of planning and review can easily take 3 months, or longer, especially if local government has additional concerns to throw into the mix.

For a software project, where the "agile" methodologies movement has been trying to push release dates from years to days, saying that you want to spend over a quarter of the total project timeline doing planning sounds like a sure way to get canned! Without proper project planning, you can certainly reduce your best-case timeline, but at the cost of dramatically increasing your worst-case. That's the reason for doing things this way, and it's to stave off the worst-case scenario of the project not working at all. In the software world, the cost to fix a defect in the software project drastically increases the later on in the project it's discovered. (

What's worse, if I remember my schooling properly, the defect injection rate of a project is at it's absolute highest at the beginning of a project, decreasing over time until the last phase of the project, upgrade / replace / discontinue. (What, you thought the last phase was delivery to customer? Ha!). So, the conclusion that I draw from this is that, since the most defects are introduced into the project when it's the least expensive to fix, and that the longer these defects stay in the project, the more costly they become, you should put as much effort into fixing defects in the early phases of the project as you possible can. In this scenario, a dollar of prevention can be worth a grand of cure!

The best, and cheapest, phase to correct defects is the project inception and planning phase!

Back to the point of shaving yaks, I've recently run into quite a few situations where poor project planning has cost a drastic amount of time late in the project. I went over some of the specifics about what's going on in the project that's causing so much yak shaving, but I really want to put emphasis on my feeling that the delays and yaks are, fundamentally, caused by a lack of defect correction early in the development process.

Of course, everything in the software world is interconnected, just like it is with the construction world. My dad recently did a basement remodel, and while working on it, over and over and over again, ran into huge delays that had nothing to do with the original scope of work. For example, a gas line needs to be moved out of the way, but to do that some electrical work also needs to be done, however doing that properly needs a new circuit put into the breaker and holes town in part of the ceiling that wasn't part of the original plan to route around an unexpected waterline. Get the point? Yaks.

Now, with a construction project at least you can tell just by looking that things are happening, but it can be really difficult to tell when a software project is stalled. "Boy, I haven't heard from my engineer in a couple days. What's he (or she) doing that I'm not seeing?" Well, it could be that your engineer just went off on a vacation without mentioning it, or it could be that he's spent 16 or more hours a day for an entire work week refactoring a delicate piece of complex C++ template code that's running up against compiler bugs that aren't documented very well and he just can't figure it out. Or it could be that he's had to rewrite the unit test framework for the project to account for an unanticipated incompatibility with the relevant standard, or it could be any number of things.

The real question is, knowing that these kinds of unexpected delays and "side tracks" are going to happen, what can you, as a project leader, do to either mitigate or prevent these issues. My opinion is that planning and documentation is the only proven method to handle it. If you're implementing a specific internet standard, for example, frankly the first step is going to need to be translating the incomprehensibly poorly written RFC documents into an actionable requirements document. Next step, involves the design of a project architecture, and spending at least several weeks working on that, and reviewing it, chewing on it, implementing small prototypes of the stuff in that architecture, and just in general ruminating. Why? Because, again, the early stages of a project are the source of the vast majority of project defects!

You wouldn't allow a home to be built without a blueprint. You wouldn't allow a blueprint that hasn't been reviewed by an architect. You wouldn't dig a foundation, ground well, or septic system without properly surveying the ground to ensure that you know whether to add special considerations or not. You certainly wouldn't start framing a house before seeing the blueprint, and you damn sure aren't going to start wiring the house for electricity without consulting with the customer and designer on how the rooms are designed. So why do we accept software projects that are started with no design, no specific details about what it needs to do, no review, no sign off, and components that depend on other components are started before the component they depend on are even built?

I don't know, I really don't.

But I do know this. That defect that managed to hide in your code, back when you forgot to do the inception and planning phase? That defect had it's teen and college years back in the initial development phase, it partied HARD with the other defects that you managed to kill while developing with whiskey and pizza, giving birth to it's own baby defects and puking all over your code at the same time, but this isn't a little defect anymore. It's a giant behemoth of a defect. A bureaucrat of the defect world, grown fat and lazy, but all the more powerful (like Jabba the Hutt), by the lack of attention that you've paid it over it's life.

This defect is your Yak. As soon as you hit the integration phase, this Yak is going to walk right up to you and lick your face. And it's going to keep doing that, invisible to your boss, but staring you in the face from all of 2 inches away for months, until you shave it. Try as you might, you're going to eventually shave this Yak, because it'll either drive you so crazy that you do it yourself over a weekend or 20, or it'll completely prevent progress on your project until it's been satisfied with how thoroughly you shave it.

Shave that Yak!

Wednesday, August 26, 2015

I shaved a yak, but I didn't like it

It seems to be a long standing tradition of every computer science professional to eventually write about the concept of Yak Shaving.

Before I get too far into this, I should point out that while I happen to be using a recent project that I'm working on professionally as an example, it's intended to be only for illustrative purposes. Please understand that this post isn't intended to be critical of any particular person, project, company, or event, but is instead a gestalt of my experiences all throughout my career so far. Every company can have absolutely amazing people, customers, projects, and culture, and every company can have awful people, customers, projects, and culture. Frankly, a lot of companies have a mix of both good and bad, just like everything else in life. So please don't think I'm trying to criticize any particular organization, this post, and most of the posts that I have on my blog, are intended to be philosophical waxing, not criticism.

That being said, instead of writing my own rendition of the definition of this prodigious term, I'll simply point you at another writer who already did a pretty good job describing the phenomenon (though I don't know if I agree with his conclusions):

Another take from Urban Dictionary:

And an excellent clip from the show Malcolm in the Middle:

Suffice to say that so many times while conducting a project, an engineer, ultimately, needs to go shave a yak. It's tedious, frustrating, and *very* hard to properly account for up front.

I want to expound a bit on a particular yak that I've had the distinct displeasure of shaving recently.

I'm building an Interactive Connectivity Establishment (RFC 5245) protocol implementation, in library form using C++. This particular implementation has some very specific performance constraints which make the process quite challenging.

I don't want to make this post about the performance constraints, but I will point out the biggest one, which explicitly forbids dynamic memory allocation or deallocation inside the library. This is something that a lot of C++ programmers take for granted. Every STL container can allocate memory dynamically internally, smartpointers, inherently, are about managing dynamic memory, and countless frameworks out there expect the ability to dynamically allocate memory whenever they want. It doesn't seem like it, but, this can become very hard to do, especially when combined with other performance considerations.

So anyway, what's the Yak?

This ICE library isn't my creation from start to finish. I was handed a 75% baked implementation, and asked to finish it. Originally, we thought this would only take a couple months, but it's been on-going for a year now. The most recent challenge involves the dynamic memory allocation issue. The library, as per the ICE standard, notifies it's API consumer whenever it discovers a new connection pathway. The specific mechanism that it was using was to allocate a chunk of memory that was around 5 kilobytes in size. In a normal environment, this would be no big deal, but I'm only able to get pre-allocated (and thus safe to use) chunks no larger than 128 bytes!

Skipping over a lot of details related to the HOW, the What ended up requiring *massive* changes to the library, and the code that uses the library. First, I had to redesign the struct containing the information on this communication path to only include information on one path at a time, instead of four at a time. Then redesign the communication flow between the client code and the ICE library, such as requiring the client to generate authentication information and pass that in at constructor time instead of having the ICE library generate that information and send it to the client, and various other changes. Then, had to create some new datastructures to play nicely with the code that uses this library, and rewrite the code that generates and/or uses these information structures on both sides of the API.

At each step of the way, not only did the change have to be made to the library itself, but the unit testing framework needed to be adjusted, the code that uses the library needed to be adjusted, and hundreds of lines of documentation needed adjustment for each item.

To sum up, since each change required 4 different chunks of code/docs be updated, at a minimum (and, like I said, a few times required changes to client code framework datastructures !), this particular Yak of making my library not allocate memory dynamically turned into a HUGE undertaking. Just adding up the very broad steps undertaken above, there were, at least, 4*4 = 16 individual large refactorings that had to happen to do what, conceptually, could have been a very simple undertaking. A few of them were difficult enough to work on that I got stuck on them for a week or more at a time. Nothing can demoralize a software engineer than a Yak that refuses to be shaved!

Consider also that, in general, each of these changes took well over a day of work, and very quickly, a simple task of "Go do this change" can explode into a month or more of engineering time.

This isn't a cautionary tale of not telling your engineers to refactor their code. Our performance requirements are VERY strict, and there's no way that we could use this code without having made these changes. This is more an illustration of how quickly the list of "high level" or "manager view" items that need to be addressed for business and technical reasons can magnify into dozens or hundreds of individual items that an engineer needs to address.

I shaved that yak, but I didn't like it!

Monday, June 8, 2015

Newly Remodeled Home Office!

I had the joy of starting the setup on my newly rebuilt home office over the last weekend. :-)

Hopefully I'll have it properly setup before too long. For now I just threw things in as quick as I could. Here's a few pictures of how it's laid out.