We're continuing our expedition into the world of manual vectorization. In this part, we will explain the most common technique for vectorizing conditional code, usually referred to as if-conversion. Here's the first part of this series; if you know nothing about vectorization, you probably want to take a look at it first.
Conditional (a.k.a. branching) code is code that contains conditions, i.e., code whose behavior depends on them. For example, consider the snippet below:

if (c) { a = 2; }

The code snippet a = 2 depends on the condition if (c). Code snippets that do not contain conditions are usually called "branchless".
Let's start with the basics: not all conditional code is difficult to vectorize. Difficulty arises when each lane of the vector output is computed based on some condition that is not the same across all lanes. To understand that, consider the loop below:
float a[VECTOR_WIDTH];
float b[VECTOR_WIDTH];
float out[VECTOR_WIDTH];
// Fill the arrays
if (cond) {
  for (int i = 0; i < VECTOR_WIDTH; ++i)
    out[i] = a[i] + b[i];
}
If cond is non-zero, we do an element-by-element addition and save the results into out. Otherwise, out stays as it was. The important thing here is that either we do the addition for all the elements of a and b, or for none. This maps directly onto vectorized operations: a vectorized operation happens either for all the lanes or for none. So, it is pretty easy to vectorize the code above, as shown below:
__m128 a_vector = _mm_loadu_ps(&a[0]);
__m128 b_vector = _mm_loadu_ps(&b[0]);
if (cond) {
  __m128 res = _mm_add_ps(a_vector, b_vector);
  _mm_storeu_ps(&out[0], res);
}
First, note that we used slightly different intrinsics here because we're now dealing with floating-point numbers. The logic is the same, and in this article floats will make our life easier.
You can see that the condition remains as it was (i.e., scalar) and the rest of the code is vectorized as if the condition didn't exist. The reason is simple; the condition is either true for all the lanes or none, so we either do the vectorized operation or not. We don't need to test the condition for each individual lane.
By the way, out of curiosity, let's see what a compiler (e.g., Clang) does with this code. It does exactly what we did.
But what if our original code looked like this:
float a[VECTOR_WIDTH];
float b[VECTOR_WIDTH];
float out[VECTOR_WIDTH];
...
for (int i = 0; i < VECTOR_WIDTH; ++i) {
  if (cond[i]) {
    out[i] = a[i] + b[i];
  }
}
Now cond got "moved" inside the loop and, most importantly, its value can be different in every iteration. That's a problem... How do we vectorize that? It's like saying "I want to do a sum, but only for some lanes". But vector addition, at least as we've learned it, operates on all lanes; in general, you can't perform only part of a vector operation.
Avoiding special cases and formalities, we're going to describe a pattern for vectorizing loops that look like this:
for (int i = 0; i < N; ++i) {
  if (cond[i]) { // Divergent condition and hence, divergent branch
    value[i] = some computation;
  }
}
A divergent or varying value is one that does not remain constant across iterations.
The definition of divergence presented above is over-simplified. We can go a little deeper by taking a look at the work of Simon Moll and others. For a clearer description, we should define divergence as the complement of uniformity. A uniform value is defined relative to a loop: it is a value that does not vary because of that loop (it may vary for other reasons, though).
When we're talking about innermost loops (i.e., loops that don't contain other loops), a uniform value is one that remains constant during the whole execution of the loop (i.e., it's a loop-invariant value). For example:
for (int i = 0; i < N; ++i) {
  float v = 1.0 + 2.0;
  a[i] = v;
}
Here, v is a constant 3 across all iterations, and we can actually hoist it out of the loop:
float v = 1.0 + 2.0;
for (int i = 0; i < N; ++i) {
  a[i] = v;
}
However, the situation gets more interesting when we consider outer loops. Take a look at the loop nest below:
for (int i = 0; i < 4; ++i) {
  float a = 0;
  for (int k = 0; k < n; ++k) {
    bool p = B[k] > 0;
    if (p) {
      float d = C[k][i];
      a += d;
    }
  }
  A[i] = a;
}
It is clear that p may not be constant during the execution of the i-loop; it can't be considered loop-invariant (and, e.g., get moved outside of it). Yet it is considered uniform in that loop (while, of course, it is divergent in the k-loop), because it doesn't change in any way because of the i-loop. This is very important information, because we can now vectorize the i-loop with the if (p) branch remaining scalar:
// for (int i = 0; i < 4; ++i) { <-- vectorized
__m128 a = _mm_set_ps1(0.0);
for (int k = 0; k < n; ++k) {
  bool p = B[k] > 0;
  if (p) {
    __m128 d = _mm_loadu_ps(&C[k][0]);
    a = _mm_add_ps(a, d);
  }
}
_mm_storeu_ps(&A[0], a);
// }
Vectorization is trivial, and that's because of the uniformity of p; otherwise, we would have to do weird trickery like what we'll describe below.
The idea that I'll present can be further generalized with an else:
for (int i = 0; i < N; ++i) {
  if (cond[i]) { // Divergent condition (and hence divergent branch)
    value[i] = some computation;
  } else {
    value[i] = some other computation;
  }
}
Basically, if we don't have the else, the "other computation" is simply the value that was in the output before.
So, the big idea is that we'll compute two vector outputs. The first is going to be the one assuming that the condition is always true. The second will be assuming the condition is always false. So, if we had this code:
for (int i = 0; i < 4; ++i) {
  if (cond[i]) {
    value[i] = 1;
  } else {
    value[i] = 2;
  }
}
Conceptually, we're going to compute, let's say, value1 = {1, 1, 1, 1} and value2 = {2, 2, 2, 2}. Then, to construct the final output, in each element i we're going to put the value from either value1[i] or value2[i], depending on cond[i]. So, for instance, if we assume here that cond = {true, true, false, true}, then the final output vector will be value = {1, 1, 2, 1}. A high-level algorithm follows.
FOR j := 0 to 3
  IF cond[j]
    value[j] := value1[j]
  ELSE
    value[j] := value2[j]
  FI
ENDFOR
Note something important: when this algorithm starts, the vector values value1 and value2 are already computed! We don't compute them in this algorithm; we just copy the appropriate bits to value.
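For reference, here's the selection step above written as plain scalar C (my own sketch; the function name blend_scalar is mine, and value1 and value2 are assumed to be fully computed already):

```c
/* The high-level selection algorithm from above, as scalar C.
   It computes nothing; it only picks, per element, which of the
   two precomputed arrays to keep. */
void blend_scalar(const int *cond, const float *value1,
                  const float *value2, float *value) {
    for (int j = 0; j < 4; ++j)
        value[j] = cond[j] ? value1[j] : value2[j];
}
```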
You might be thinking, "OK, that's cute, but how is this beneficial at all if I have to do all this post-processing?" That's a fair concern, but what if I told you that this algorithm is already implemented in the hardware? It's the family of "blend" instructions. For 4-wide vectors of 32-bit floats, we can use the intrinsic __m128 _mm_blendv_ps(__m128 a, __m128 b, __m128 mask).
This intrinsic treats the mask argument as having four 32-bit lanes, but it reads only the most significant bit (MSB) of each lane. If this bit is 1, then the corresponding lane of the output gets the corresponding lane of the second argument. For example, if the MSB of the first lane of mask is 1, the first lane of a is 23, and the first lane of b is 46, then the first lane of the output will get the value 46 (so note here that if the MSB is set, we copy from the second argument, not the first). If this MSB were 0 instead, the first lane of the output would be 23. In essence, mask chooses, lane by lane, whether the output copies from the first or the second argument.
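Here's a tiny demonstration of that semantics (a sketch of mine, assuming a compiler like GCC or Clang that supports the target attribute for enabling SSE4.1 per function): we set only the first lane of the mask to all-1s, so only the first output lane comes from b.

```c
#include <immintrin.h>

/* Demonstrates _mm_blendv_ps: lane 0 of the mask has its MSB set,
   so lane 0 of the result is taken from b; the other lanes come from a. */
__attribute__((target("sse4.1")))
float first_lane_blend(void) {
    __m128 a = _mm_set1_ps(23.0f);
    __m128 b = _mm_set1_ps(46.0f);
    /* -1 is all-1 bits; _mm_set_epi32's last argument is lane 0.
       _mm_castsi128_ps reinterprets the integer bits as floats. */
    __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, 0, 0, -1));
    __m128 res = _mm_blendv_ps(a, b, mask);
    float out[4];
    _mm_storeu_ps(out, res);
    return out[0];
}
```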
The only issue now is how to construct this mask parameter, or more specifically, its most significant bits. To set them according to the values of a vector cond, we'll use the intrinsic _mm_cmp_ps.
This intrinsic compares two 128-bit vectors, lane-wise, meaning it compares pairs of lanes; the first lane of the first argument with the first lane of the second argument etc. For each pair of lanes, if the comparison yields true, then the corresponding lane in the output is filled with 1s (MSB included) otherwise it is filled with 0s. The third argument in the intrinsic is the comparison we want to do. For instance, we may want to compare whether one lane is equal to the other, or less, or greater. Now, if you open the intrinsic's description, you'll see many more operators than you might expect. I'm going to skip the specifics but if you want to learn more, you can see this nice post that summarizes them.
Let's return to our example:
for (int i = 0; i < VECTOR_WIDTH; ++i) {
  if (cond[i]) {
    out[i] = a[i] + b[i];
  }
}
How can we use what we've learned to vectorize that? Well, we'll have two vectors and we'll blend them. The first, call it sum, will have the element-wise sums in its lanes. The second will have the values of out before the loop. To create the mask for the blend, we'll load cond into a vector. But what will we compare it with?
Remember that the blend intrinsic copies from the first argument (which we assume is sum) if the corresponding MSB in mask is 0, not 1. So, we need to set the most significant bits in mask in such a way that if cond[i] is true, the corresponding MSB is 0, and vice versa. To do that, we'll compare the cond vector for equality with a vector that has zero in all lanes. Thus, if a lane of cond is zero, the corresponding MSB will be 1, and vice versa.
The final code can be seen in this Godbolt snippet. This deals only with arrays of 4 elements, but it's easy to use the knowledge of the first part of this series and make it work for arrays of any size.
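For completeness, here's my own sketch of how the pieces fit together for the 4-element case (the function name cond_add is mine, and I use the SSE2 comparison _mm_cmpeq_ps in place of the AVX _mm_cmp_ps with the equality operator; it may differ in details from the snippet):

```c
#include <immintrin.h>

/* If-converted version of:
   for (i = 0; i < 4; ++i) if (cond[i]) out[i] = a[i] + b[i]; */
__attribute__((target("sse4.1")))
void cond_add(const float *cond, const float *a, const float *b, float *out) {
    __m128 a_v  = _mm_loadu_ps(a);
    __m128 b_v  = _mm_loadu_ps(b);
    __m128 old  = _mm_loadu_ps(out);      // values if the condition is false
    __m128 sum  = _mm_add_ps(a_v, b_v);   // values if the condition is true
    /* Lanes where cond == 0 become all-1s, so the blend copies from `old`
       (the second argument) exactly where the condition is false. */
    __m128 mask = _mm_cmpeq_ps(_mm_loadu_ps(cond), _mm_setzero_ps());
    _mm_storeu_ps(out, _mm_blendv_ps(sum, old, mask));
}
```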
We should be careful when using this technique, because we might wreck correctness. Consider, for example, the code below:
for (int i = 0; i < 4; ++i) {
  if (cond[i]) {
    value[i] = 1 + b[i];
  } else {
    value[i] = 2 / b[i];
  }
}
What if, for some i, cond[i] is non-zero (i.e., we won't take the else) and b[i] is zero? The original program would never perform the division by zero, but the vectorized program will. With integers this crashes; with floats it silently produces an inf. Either way, we have changed the semantics of the program. So, when using this technique, we should be sure that both branches can be executed unconditionally.
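One common workaround, sketched below under my own assumptions (float values, the function name safe_div_blend is mine), is to neutralize the unsafe lanes before executing the path for all of them: replace the zero denominators with 1.0, since the results in those lanes are discarded by the blend anyway.

```c
#include <immintrin.h>

/* If-converted version of the loop above, with the division made safe:
   if (cond[i]) value[i] = 1 + b[i]; else value[i] = 2 / b[i]; */
__attribute__((target("sse4.1")))
void safe_div_blend(const float *cond, const float *b, float *value) {
    __m128 b_v  = _mm_loadu_ps(b);
    __m128 zero = _mm_setzero_ps();
    __m128 ones = _mm_set1_ps(1.0f);
    /* Where b == 0, divide by 1.0 instead; those lanes never reach the
       output because the final blend takes the other path there. */
    __m128 b_safe = _mm_blendv_ps(b_v, ones, _mm_cmpeq_ps(b_v, zero));
    __m128 then_v = _mm_add_ps(ones, b_v);                  // 1 + b[i]
    __m128 else_v = _mm_div_ps(_mm_set1_ps(2.0f), b_safe);  // 2 / b[i]
    /* MSB set where cond == 0, so the blend picks else_v there. */
    __m128 mask = _mm_cmpeq_ps(_mm_loadu_ps(cond), zero);
    _mm_storeu_ps(value, _mm_blendv_ps(then_v, else_v, mask));
}
```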
The second thing we should consider is whether one path of the branch is significantly more expensive than the other. If they're about the same cost, then by using this technique we're doing double the amount of work compared to the original program; but because we're also vectorizing by, say, a factor of 4, the net result is usually a win. However, if one path is significantly slower, then executing it for all the iterations might result in a slowdown, especially if the probability of taking that path is low (which depends on the condition), in which case the original program would rarely execute it.
With these three articles, we have assembled a powerful arsenal for vectorizing code. So, I'd like to stop introducing new concepts for a bit and instead try to solve some cool problems using what we have learned. I'm not sure when the next article will appear, but let's see...