Thursday, September 10, 2015

Why C++ Lambdas and std::algorithm is better than a hand-written for-loop.


A conversation at work recently sparked an investigation on whether C++11 lambdas, combined with the use of functions from the standard library like std::for_each, or std::accumulate, etc, are better than writing hand-tuned for-loops. Note, however, that this post completely ignores the use of range-based for loops, which can be significantly more readable than either the old style loops, or the lambda version that I'm endorsing in this post. Your mileage may vary.

I'm personally on the side of using C++ lambda's whenever and wherever practical. Obviously don't go and rewrite every single line of code in your codebase, but if you're already going to be making changes to a function, and that function has a few hand-written for-loops, it'll probably be worth it to convert a couple of them to use std::for_each or similar. 

I'll try to touch on some major categories of reasons why I think this way.

Performance

The biggest concern a lot of people will have is that of performance.

An example usage of std::for_each.
void foo(const std::vector<int> & vect)
{
    std::for_each(vect.begin(), vect.end(), [](const int val) -> void { printf("%d", val); });
}
I know that an example this simple could be easily accomplished with std::bind as well, but I'm just trying to explain the general concept at this point :-).

Consider the STLPort (A very popular cross-platform STL implementation) implementation of std::for_each
// for_each.  Apply a function to every element of a range.
template <class _InputIter, class _Function>
_STLP_INLINE_LOOP _Function
for_each(_InputIter __first, _InputIter __last, _Function __f) {
  for ( ; __first != __last; ++__first)
    __f(*__first);
  return __f;
}
Leaving aside investigations of what the _STLP_INLINE_LOOP macro will do to the resulting code, we can already see that this is a template function with template parameters for the iterator type, and the function object type.

Anywhere that this template function is used, the compiler will have 100% perfect knowledge of exactly what code is involved. Nothing is hidden in another translation unit, and there's no uncertainty about how the function is implemented at all. Your compiler should be able to inline this loop, in it's entirety, every single time it's used, even without link-time-optimization or similar fancyness. In fact, assuming you're not using an experimental or poorly implemented compiler, if it doesn't inline it's likely because of specific compilation options you've chosen, or it's decided that not-inlining will somehow be faster.

With our example, the inlined version should look something like this.
void foo(const std::vector<int> & vect)
{
    std::vector<int>::iterator begin = vect.begin();
    const std::vector<int>::iterator end = vect.end();
    auto function =  [](const int val) -> void { printf("%d", val); };
    for ( ; begin != end; ++begin)
    {
        function(*begin);
    }
}
We're getting pretty close to what most programmers would write for a for-loop already.

What about the "auto function = ..." line?

Essentially, a lambda function is simply a syntax shortcut in the language to define a new "functor" object, which is a struct that implements the operator() function. Unwinding the C++ syntax for lamdbas, we get code looking roughly like this.
struct randomly_generated_name_here
{
    void operator()(const int val)
    {
        printf("%d", val);
    }
};
void foo(const std::vector<int> & vect)
{
    std::vector<int>::iterator begin = vect.begin();
    const std::vector<int>::iterator end = vect.end();
    randomly_generated_name_here function;
    for ( ; begin != end; ++begin)
    {
        function(*begin);
    }
}
Clearly the compiler knows all of the implementation details of the "struct_randomly_generated_name_here" object. It might decide to inline the code for the operator() function of that object into the places where it's called. Additionally, a lambda's name is hidden from the programmer, and the compiler guarantees that there will never be name collisions. Unless you've captured the type of the lambda using a variable, such as I did in the examples above, or using either the decltype keyword, or a template function's variable type deduction capability, then once the scope that contains the lambda ends, that's it. As far as your program is concerned after that point the lambda never existed.

This is pretty compelling reason to believe that, for a lambda function that only gets used once, the compiler will simply inline it and be done with it. If you compile your program in debug mode, the compiler will almost certainly include various symbols related to lambda's that have been inlined in this kind of way, but a release mode compile, with optimizations turned on, and you might not even have any symbols for this lambda in your program at all. It may ultimately be inlined out of existence. In fact, I just compiled a small test program, the lambda's in the symbol table a few places for the debug version, but completely gone on the release version.

Here's a slightly more inlined version of our function
void foo(const std::vector<int> & vect)
{
    std::vector<int>::iterator begin = vect.begin();
    const std::vector<int>::iterator end = vect.end();
    for ( ; begin != end; ++begin)
    {
        printf("%d", *begin);
    }
}
One might argue that they would write the function like this
void foo(const std::vector<int> & vect)
{
    for(std::vector<int>::iterator begin = vect.begin();
         begin != vect.end();
         ++begin)
    {
        printf("%d", *begin);
    }
}

But I think that for almost all situations, this version of the function is essentially identical to the other.

In fact, one might be able to make an argument that this second version is slightly worse, depending on the behavior of std::vector<int>::end(). If this function is const, then yes they should result in the same objectcode. Otherwise, the compiler might not be able to determine that end() has no side effects, and as such, it'll call end() every iteration through the loop, possibly at large computational expense if we're using a custom data structure instead of std::vector<int>. Using such a structure, that had an expensive end() function, would actually make the fake-manual-compiler-inlined version of the function measurably faster than the hand-written version that a human would write.

Now, you might argue that the resulting assembly code might be, or will be different for the two versions. There's every possibility. I can't guarantee what the compiler does with the code you hand it, obviously. However, even if the assembly output is different for the two versions, I would argue that in almost all situations, a properly optimizing compiler, doing optimizations just like what we did here by hand, should be able to produce code that is every bit as performant as a hand-tuned loop for almost any lambda.

I've written a few test programs and looked at the resulting assembly for them, and so far I haven't seen anything that makes me question the comparable performance of the two versions.

Composability

C++ standard library algorithms are designed to be composable. That is to say that the functions in <algorithm> and <numeric> are intended to be combined with each other to produce more interesting behavior. 

For example, consider the "remove erase" pattern for a datastructure. 
void remote_if_greater_than(std::vector<int> & vec, const int comp)
{
    vec.erase(std::remove_if(vec.begin(), vec.end(), std::bind(std::greater, _1, comp)), vec.end());
}
Or with a lambda
void remote_if_greater_than(std::vector<int> & vec, const int comp)
{
    vec.erase(std::remove_if(vec.begin(), vec.end(),
        [comp](const int val)
        {
            return val > comp;
        }), vec.end());
The std::remove_if function rearranges the ordering of items in the collection so that the items to be removed are all at the end. It then returns an iterator to the first item of the "to be removed" set. std::vector<int>::erase has two versions, one for erasing only a single item, and another for erasing a set of items.

Basically, we rearrange the contents of the vector so that things to be removed are grouped at the end, and then we simply chop the end of the vector off. Fairly simple. 

Or perhaps we want to remove anything above the average?
void remove_above_average(std::vector<int> & vec)
{
    vec.erase(std::remove_if(vec.begin(), vec.end(), std::bind(std::greater, _1, std::accumulate(vec.begin(), vec.end(), 0)/vec.size()), vec.end());
}
Or with a lambda
void remove_above_average(std::vector<int> & vec)
{
    vec.erase(std::remove_if(vec.begin(), vec.end(),
        [comp = std::accumulate(vec.begin(), vec.end(), 0)/vec.size())] (const int val) -> bool
        {
            return val > comp;
        }), vec.end());
}

Now, clearly lambda's aren't everything, given the comparative readability of the lambda code versus, but this post isn't about extolling the virtues of lambda's over various built-in functions, and this section is only intended to demonstrate the composibility of STL algorithms.

Consider a problem I was facing not too long ago, where I needed to compare two sequences. If sequence A had a member that sequence B did not, I needed to call function 1 with the member from sequence A as parameter. If sequence B had a member sequence A didn't, then I needed to call function 2 with the member from sequence B as parameter.

Essentially, I'm replacing sequence A with sequence B, and I need to "unhook" the items that are going away, and "hook up" the items that are incoming.

That could look something like this in psuedocode.
collection_t A;
collection_t B;
for each item in A, see if there is an equivalent item in B. If there is not, call function 1.
for each item in B, see if there is an equivalent item in A. If there is not, call function 2.

If I had to do this kind of comparison, writing the same forloop logic several times for minorly different data types could start to get annoying and error prone.

So I wrote a simple template function to do the work for me.
    /*
     * \brief Finds the first element in the collection for which pred returns false.
     * @param collOne [in] One of the collections to reconcile.
     * @param collTwo [in] One of the collections to reconcile.
     * @param pred    [in] A binary function object that accepts as the first argument, items from collOne, and the second argument, items from collTwo.
     * @param tranOne [in] A unary function object that accepts as argument items from collOne. This function will be called on any item from collOne for which pred matches NONE of the items from collTwo
     * @param tranTwo [in] A unary function object that accepts as argument items from collTwo. This function will be called on any item from collTwo for which pred matches NONE of the items from collOne
     */
    template<typename ITERATOR_ONE_T, typename ITERATOR_TWO_T, typename COMPARITOR_T,
             typename TRANSFORM_ONE_T, typename TRANSFORM_TWO_T>
    void reconcile_sequences(const ITERATOR_ONE_T & beginOne, const ITERATOR_ONE_T & endOne, const ITERATOR_TWO_T & beginTwo, const ITERATOR_TWO_T & endTwo,
                             COMPARITOR_T pred, TRANSFORM_ONE_T tranOne, TRANSFORM_TWO_T tranTwo)
    {
        std::for_each(beginOne, endOne,
                 [pred, tranOne, beginTwo, endTwo](const decltype(*beginOne) & one) -> void
                 {
                    if(std::none_of(beginTwo, endTwo, std::bind(pred, one, _1))
                    {
                        tranOne(val);
                    }                            
                 });
        for_each(beginTwo, endTwo,
                 [pred, tranTwo, beginOne, beginOne](const decltype(*beginTwo) & two) -> void
                 {
                    if(std::none_of(beginOne, endOne, std::bind(pred, _1, two))
                    {
                        tranTwo(val);
                    }                            
                 });
    }

Now, for any situation where I need to reconcile two different sequences, of any type, I can just do this.

collection_t A;
collection_t B;
reconcile_sequences(A.begin(), A.end(),
                    B.begin(), B.end(),
                    [](const A::value_type & aval, const B::value_type & bval) -> bool { /* comparison code */ },
                    [](A::value_type & aval) { /* A not found in B. Do the work on A here. */}
                    [](B::value_type & bval) { /* B not found in A. Do the work on B here. */});
Clean, concise, to the point. The details of how we iterate over the sequences are left to the imagination, but we only need to rely on a single loop implementation somewhere that takes function objects, and we're golden. 

One can imagine all sorts of simple algorithms that can either build off of existing std::algorithm functions to create more complex behavior, or that work well in conjunction with the existing ones. 

While none of this is, strictly speaking, a direct part of the lambda function concept, the ability to use these pre-built loops can make the life of a programmer significantly easier. Further, as my simple examples have shown, these functions are drastically enhanced by the flexibility of lambda functions to fit themselves into the nooks and crannies of any existing incompatibility.

It's like the Reese's peanut butter cup comercials from many years ago. A lot of people enjoy peanut butter, and a lot of people enjoy chocolate. I'd say that a lot of people think that the two together is just downright awesome.

A lot of people enjoy std::algorithm functions, and a lot of people enjoy lambda functions. I'd say that a lot of people think that the two together is just downright awesome!

Evolability

Consider that reconcile_sequences function that I wrote in the previous section. If you had a slew of hand-written forloops that, effectively, accomplished the same task as my template function, and your manager came to you and said that the reconciliation process needed to be re-written to automatically use all available processing cores, how would you proceed?

Would you write some kind of shared function to centralize the multi-threading parts of the task into one place? Would you implement the same copy-pasted code in each place instead?

I'd assume that most programmers would end up writing a function that implemented all of the shared multi-threading logic in one place. The function may even have a signature that's very similar to my template function above. But to get to that point, they have to go back through all of the their hand written loops, and replace them with calls to a function like that. If you didn't have access to the std::algorithm collection, depending on what you're trying to do, this could be somewhat hard to pull off.

Now, assuming that you're using a relatively new version of GCC (one that supports the -fopenmp commandline option), pretend that you simply wrote a template function like I did, and used lambdas to do all of your calculations. All you'd need to do is ensure that your predicates had no side effects, and that your transform functions were thread-safe. If so, the work to switch from single threaded to multi-threaded (though, admittedly not perfectly multi-threaded) could be as simple as copying the function (lets name it "parallel_reconcile_sequences"), and in the copy replace all calls to "std::for_each" with "__gnu_parallel::for_each", and "std::none_of" to "__gnu_parallel::none_of", and suddenly your reconciliation algorithm is parallel!

Obviously this is slightly contrived, but the point that I'm trying to make is that the more "building block"-ish your code is, the easier it becomes to swap out parts of it later. This lets you evolve the implementation of various algorithms that your code uses without having to actually change the code that uses those algorithms. 

Or consider that you might change the type of the collection that your lambda's working against so that you can iterate over it more efficiently, you can create a specialized version of for_each, or similar, to go with your new iterator, without having to modify the contents of your lambda at all.

But it gets better than just the example above.

So you've got a lambda function, say one of the first examples, that prints every int given to it. Again, a very simple lambda, but it provides a simple representative example to work with.
void foo(const std::vector<int> & vect)
{
    std::for_each(vect.begin(), vect.end(), [](const int val) -> void { printf("%d", val); });
}
Now how would you approach needing to do the same operation on multiple for-loops? If you were writing hand-tuned loops, you'd simply copy-paste and change the variable names. But with lambda functions it can be considerably easier.
void foo(const std::vector<int> & vectOne, const std::vector<int> & vectTwo)
{
    auto lamb = [](const int val) -> void { printf("%d", val); };
    std::for_each(vectOne.begin(), vectOne.end(), lamb);
    std::for_each(vectTwo.begin(), vectTwo.end(), lamb);
}
Simply by storing the lambda in a variable, you can write your loop body once, but use it many times. The lambdas can also be stored in member variables of objects, or returned from functions, or otherwise reused over the lifetime of a program.

Remember that, behind the scenes, the compiler is effectively just making a struct that has an operator() function. Anything you could normally do with a normal struct, you can do with a lambda, including copies, pointers / references to, calling functions with it passed as a parameter, and so on.

I'm sure that you'll see that there are significant usages of being able to quickly and easily define new function objects and pass them around for

Understandability

I think this is the big one, and I can make the case without too many pages of text.

Consider two loops

void(const std::vector<int> & foo)
{
    for(std::vector<int>::iterator it = foo.begin(), it != foo.end(); it++)
    {
        printf("%d", *it);
    }
}

void(const std::vector<int> & foo)
{
    std::for_each(foo.begin(), foo.end(), [](const int val){ printf("%d", val); });
}

Which one is easier to understand in it's entirety?

There's two situations to consider here. People who are experts on this code, and people who are not experts.

At a glance, an expert will be able to nearly instantly understand both versions of this particular loop without any problem, but a person who isn't an expert on this code will be able to understand the second version much faster.

Why?

A non expert, someone who isn't familiar with the section of code in question, or who isn't on amazing ground with C++ yet, is going to be able to read the second version, and only 2 keywords into the loop "std::for_each", will know instantly that whatever comes next is applied equally to the entire range. There's no possibility of short circuiting, stopping early, rewinding, etc. Now, obviously the function that's applied to each item can be selectively applied using an if statement or similar, but it's not possible for that function to effect the iteration, only the action taken on each item.

Imagine that we're working with a much more complex loop body. An algorithm that spans > 200 lines of code, and has many returns or breaks, or continues, in it. Even worse, a goto. How would you be able to tackle how this loop works? To be able to answer as simple a question as "Does this loop apply the loop body to every item?" you'd have to very carefully analyze each line of code.

Now take that same loop body, and translate it into a function suitable for use with std::for_each. Once you see std::for_each, it doesn't matter WHAT's in the loop body, you know for certain that, baring code that never returns, or that exits the program or thread, every single item in the collection will get the function applied to it. The same (or equivalent) loop bodies, but one version could take hours to answer the question of if every item is visited, and the other version takes seconds.

Given that the algorithm (and numeric!) library offers lots of very useful functions, such as copy, backwards copy, transform, for_each, none_of, find_if, and more and more, judicious use of them can drastically increase the ability of newbies to understand what your program's supposed to be doing.

I'd love to see any other rationale from readers, or any arguments toward one or the other kind of loop. Let me know what you think!