August 28, 2021

We're continuing our expedition into the world of manual vectorization. In this part, we will explain the most common technique for vectorizing conditional code (usually referred to as if-conversion). Here's the first part of this series. If you know nothing about vectorization, you probably want to take a look at it.

What is Conditional Code?

Conditional (a.k.a. branching) code is code that depends on conditions. For example, consider the code below:

if (c) { a = 2; }

The code snippet `a = 2` depends on the condition `if (c)`. Code snippets that do not have conditions are usually called "branchless".

Why Is Vectorizing Conditional Code Hard?

Let's start with the basics: not *all* conditional code is difficult to vectorize.
Difficulty arises when each lane of the vector output is computed based on some
condition that is *not the same across all lanes*. To understand
that, consider the loop below:

```c
float a[VECTOR_WIDTH];
float b[VECTOR_WIDTH];
float out[VECTOR_WIDTH];
// Fill the arrays

if (cond) {
  for (int i = 0; i < VECTOR_WIDTH; ++i)
    out[i] = a[i] + b[i];
}
```

If `cond` is non-zero, we do an element-by-element addition and save the result into `out`. Otherwise, `out` stays as it was. The important thing here is that we do the addition either for *all* the elements of `a` and `b` or for *none*. This conceptual idea is aligned with vectorized operations: a vectorized operation happens either for all the lanes or for none. So, it is pretty easy to vectorize the code above, as shown below:

```c
__m128 a_vector = _mm_loadu_ps(&a[0]);
__m128 b_vector = _mm_loadu_ps(&b[0]);

if (cond) {
  __m128 res = _mm_add_ps(a_vector, b_vector);
  _mm_storeu_ps(&out[0], res);
}
```

First, note that we're using slightly different intrinsics here because we're dealing with floating-point numbers. But the logic is the same, and floats will make our life easier in this article.

You can see that the condition remains as it was (i.e., scalar) and the rest of the code is vectorized as if the condition didn't exist. The reason is simple: the condition is either true for all the lanes or for none, so we either do the vectorized operation or we don't. We don't need to test the condition for each individual lane.

By the way, let's see out of curiosity what a compiler, e.g., Clang, does with this code. It did exactly what we did.

But what if our original code looked like this:

```c
float a[VECTOR_WIDTH];
float b[VECTOR_WIDTH];
float out[VECTOR_WIDTH];
...

for (int i = 0; i < VECTOR_WIDTH; ++i) {
  if (cond[i]) {
    out[i] = a[i] + b[i];
  }
}
```

Now `cond` got "moved" *inside* the loop and, most importantly, its value is different in every iteration. That's a problem... How do you vectorize that? It's like saying "I wanna do a sum but only for some lanes". But vector addition, at least as we've learned it, is done on all lanes. And generally, you can't do only a part of a vector operation.

A Generic Pattern for Conditional Code Vectorization

Avoiding special cases and formalities, we're going to describe a pattern for vectorizing loops that look like this:

```c
for (int i = 0; i < N; ++i) {
  if (cond[i]) { // Divergent condition and hence, divergent branch
    value[i] = some computation;
  }
}
```

A *divergent* or *varying* value is one that does not remain constant across iterations.

The definition of divergence that I presented above is over-simplified. We can go a little deeper by taking a look
at the work of Simon Moll and others.
To have a clearer description of what divergence is, we should describe it as the complement of uniformity.
Now, a uniform value is related to a loop and is one that is not varying *because of this loop* (it may
otherwise vary though).

When we're talking about innermost loops (i.e., loops that don't contain other loops), a uniform value is one that remains constant during the whole execution of the loop (i.e., it's a loop-invariant value). For example:

```c
for (int i = 0; i < N; ++i) {
  float v = 1.0 + 2.0;
  a[i] = v;
}
```

Here, `v` is a constant `3` across all iterations and we can actually *hoist it out* of the loop:

```c
float v = 1.0 + 2.0;

for (int i = 0; i < N; ++i) {
  a[i] = v;
}
```

However, the situation gets more interesting when we consider outer loops. Take a look at the loop nest below:

```c
for (int i = 0; i < 4; ++i) {
  float a = 0;
  for (int k = 0; k < n; ++k) {
    bool p = B[k] > 0;
    if (p) {
      float d = C[k][i];
      a += d;
    }
  }
  A[i] = a;
}
```

It is clear that `p` may not be constant during the execution of the `i`-loop; it can't be considered loop-invariant (and e.g., get moved outside of it). Yet it is considered *uniform* in that loop (and of course it is divergent in the `k`-loop) because it doesn't change in any way because of it. This is very important information because we can now vectorize that loop with the `if (p)` branch remaining *scalar*:

```c
// for (int i = 0; i < 4; ++i) { <-- vectorized
__m128 a = _mm_set_ps1(0.0);
for (int k = 0; k < n; ++k) {
  bool p = B[k] > 0;
  if (p) {
    __m128 d = _mm_loadu_ps(&C[k][i]);
    a = _mm_add_ps(a, d);
  }
}
_mm_storeu_ps(&A[i], a);
// }
```

Vectorization is trivial and that's because of the uniformity of `p`; otherwise we would have to do weird trickery like what we'll describe below.

The idea that I'll present can be further generalized with an `else`:

```c
for (int i = 0; i < N; ++i) {
  if (cond[i]) { // Divergent condition (and hence divergent branch)
    value[i] = some computation;
  } else {
    value[i] = some other computation;
  }
}
```

Basically, if we don't have the `else`, the "other computation" is the value that was there before in the output.

So, the big idea is that we'll compute *two* vector outputs. The first is going to be the one assuming that the condition is always true. The second will assume that the condition is always false. So, if we had this code:

```c
for (int i = 0; i < 4; ++i) {
  if (cond[i]) {
    value[i] = 1;
  } else {
    value[i] = 2;
  }
}
```

Conceptually, we're going to compute, let's say, `value1 = {1, 1, 1, 1}` and `value2 = {2, 2, 2, 2}`. Then, to construct the final output, in each element `i`, we're going to put the value from either `value1[i]` or `value2[i]`, depending on `cond[i]`. So, for instance, if we assume here that `cond = {true, true, false, true}`, then the final output vector will be `value = {1, 1, 2, 1}`. A high-level algorithm follows.

```
FOR j := 0 to 4
  IF cond[j]
    value[j] := value1[j]
  ELSE
    value[j] := value2[j]
  FI
ENDFOR
```

Note something important: when this algorithm starts, the vector values `value1, value2` are *already computed*! We don't compute them in this algorithm; we just copy the appropriate bits to `value`.

You might be thinking: "ok, that's cute, but how is that beneficial at all if I have to do all this post-processing?" That's a fair concern, but what if I told you that this algorithm is already implemented *in the hardware*? It's the family of "blend" instructions. For 32-bit float values in 4-wide vectors, we can use the intrinsic `__m128 _mm_blendv_ps(__m128 a, __m128 b, __m128 mask)`.

This intrinsic treats the `mask` argument as having four 32-bit lanes. It reads only the most-significant bit from each lane. If this bit is 1, then the corresponding lane of the output gets the corresponding lane of the *second* argument. For example, if the most-significant bit of the first lane of `mask` is 1, the first lane of `a` is 23 and the first lane of `b` is 46, then the first lane of the output will get the value 46 (so note here that if the MSB is 1, we copy from the *second*, not the first, argument). If this most-significant bit was 0 instead, then the first lane of the output would be 23. In essence, `mask` chooses whether each lane of the output will copy from the first or second argument.

The only issue now is how we construct this `mask` parameter, or more specifically, its most-significant bits. To set those most-significant bits according to the values of a vector `cond`, we'll use the intrinsic `_mm_cmp_ps`.

This intrinsic compares two 128-bit vectors lane-wise, meaning it compares pairs of lanes: the first lane of the first argument with the first lane of the second argument, and so on. For each pair of lanes, if the comparison yields true, then the corresponding lane in the output is filled with 1s (MSB included); otherwise, it is filled with 0s. The third argument of the intrinsic is the comparison we want to do. For instance, we may want to compare whether one lane is equal to the other, or less, or greater. Now, if you open the intrinsic's description, you'll see many more operators than you might expect. I'm going to skip the specifics, but if you want to learn more, you can see this nice post that summarizes them.

Let's return to our example:

```c
for (int i = 0; i < VECTOR_WIDTH; ++i) {
  if (cond[i]) {
    out[i] = a[i] + b[i];
  }
}
```

How can we use what we learned to vectorize that? Well, we'll have two vectors and we'll blend them. The first, call it `sum`, will have the element-wise sums in its lanes. The second will have the values of `out` *before* the loop. To create the `mask` for the blend, we'll load `cond` into a vector. But what will we compare it with?

Remember that the blend intrinsic will copy from the first argument (which we assume is `sum`) if the corresponding bit in `mask` is *0*, not 1. So, we need to set the most-significant bits in `mask` in such a way that if `cond[i]` is true, then the corresponding MSB is 0 and vice versa. To do that, we'll compare the `cond` vector for *equality* with a vector that has zero in all the lanes. Thus, if a lane is zero, the corresponding MSB will be 1 and vice versa.

The final code can be seen in this Godbolt snippet. This deals only with arrays of 4 elements, but it's easy to use the knowledge of the first part of this series and make it work for arrays of any size.

Correctness Concerns

We should be careful when using this technique because we might wreck the correctness. Consider, for example, the code below:

```c
for (int i = 0; i < 4; ++i) {
  if (cond[i]) {
    value[i] = 1 + b[i];
  } else {
    value[i] = 2 / b[i];
  }
}
```

What if, for some `i`, `cond[i]` is non-zero (i.e., we won't go to the `else`) and `b[i]` is zero? In the original program, we would *not* perform the division by 0, but in the vectorized program we will, resulting in a crash. Thus, we have changed the semantics of the program. So, when using this technique, we should be sure that we can execute both branches unconditionally.

Performance Concerns

The second thing we should consider is whether one path of the branch is significantly more expensive than the other. If we assume that they're about the same, then by using this technique we're doing double the amount of work compared to the original program. But because we're also vectorizing by, say, a factor of 4, the net result is usually a win. However, if one path of the branch is significantly slower, then executing it for all the iterations might result in a slowdown, especially if the probability of reaching this path is low (which depends on the condition), in which case the original program would rarely execute it.

Next Steps

By finishing these 3 articles, we have assembled a powerful arsenal for vectorizing code. So, I'd like to stop introducing new concepts for a bit and instead try to solve some cool problems using what we have learned. Not sure when the next article will appear but let's see...