We're continuing our expedition into the world of manual vectorization. In this part, we will explain the most common technique for vectorizing conditional code, usually referred to as if-conversion. Here's the first part of this series; if you know nothing about vectorization, you probably want to take a look at it first.
Conditional (a.k.a. branching) code is code that contains conditions, i.e., code whose behavior depends on them. For example, consider the snippet below:

if (c) { a = 2; }

The code snippet a = 2 depends on the condition if (c). Code snippets that do not contain conditions are usually called "branchless".
Let's start with the basics: not all conditional code is difficult to vectorize. Difficulty arises when each lane of the vector output is computed based on some condition that is not the same across all lanes. To understand that, consider the loop below:
float a[VECTOR_WIDTH];
float b[VECTOR_WIDTH];
float out[VECTOR_WIDTH];
// Fill the arrays
if (cond) {
  for (int i = 0; i < VECTOR_WIDTH; ++i)
    out[i] = a[i] + b[i];
}
If cond is non-zero, we do an element-by-element addition and save the results into out. Otherwise, out stays as it was. The important thing here is that either we do the addition for all the elements of a and b, or for none. This maps directly onto vectorized operations: a vectorized operation happens either for all the lanes or for none. So, it is pretty easy to vectorize the code above, as shown below:
__m128 a_vector = _mm_loadu_ps(&a[0]);
__m128 b_vector = _mm_loadu_ps(&b[0]);
if (cond) {
  __m128 res = _mm_add_ps(a_vector, b_vector);
  _mm_storeu_ps(&out[0], res);
}
First, note that we used slightly different intrinsics here because we're now dealing with floating-point numbers. The logic is the same, and in this article floats will make our life easier.
You can see that the condition remains as it was (i.e., scalar) and the rest of the code is vectorized as if the condition didn't exist. The reason is simple; the condition is either true for all the lanes or none, so we either do the vectorized operation or not. We don't need to test the condition for each individual lane.
By the way, out of curiosity, let's see what a compiler (e.g., Clang) does with this code. It does exactly what we did.
But what if our original code looked like this:
float a[VECTOR_WIDTH];
float b[VECTOR_WIDTH];
float out[VECTOR_WIDTH];
...
for (int i = 0; i < VECTOR_WIDTH; ++i) {
  if (cond[i]) {
    out[i] = a[i] + b[i];
  }
}
Now cond got "moved" inside the loop and, most importantly, its value can be different in every iteration. That's a problem... How do we vectorize that? It's like saying "I want to do a sum, but only for some lanes". But vector addition, at least as we've learned it, operates on all lanes; in general, you can't perform only part of a vector operation.
Avoiding special cases and formalities, we're going to describe a pattern for vectorizing loops that look like this:
for (int i = 0; i < N; ++i) {
  if (cond[i]) { // Divergent condition and hence, divergent branch
    value[i] = some computation;
  }
}
A divergent or varying value is one that does not remain constant across iterations.
The definition of divergence presented above is over-simplified. We can go a little deeper by taking a look at the work of Simon Moll and others. For a clearer description, we should define divergence as the complement of uniformity. A uniform value is defined relative to a loop: it is a value that does not vary because of that loop (it may vary for other reasons, though).
When we're talking about innermost loops (i.e., loops that don't contain other loops), a uniform value is one that remains constant during the whole execution of the loop (i.e., it's a loop-invariant value). For example:
for (int i = 0; i < N; ++i) {
  float v = 1.0 + 2.0;
  a[i] = v;
}
Here, v is a constant 3 across all iterations, and we can actually hoist it out of the loop:
float v = 1.0 + 2.0;
for (int i = 0; i < N; ++i) {
  a[i] = v;
}
However, the situation gets more interesting when we consider outer loops. Take a look at the loop nest below:
for (int i = 0; i < 4; ++i) {
  float a = 0;
  for (int k = 0; k < n; ++k) {
    bool p = B[k] > 0;
    if (p) {
      float d = C[k][i];
      a += d;
    }
  }
  A[i] = a;
}
It is clear that p may not be constant during the execution of the i-loop; it can't be considered loop-invariant (and, e.g., get moved outside of it). Yet it is considered uniform in that loop (while, of course, it is divergent in the k-loop), because it doesn't change in any way because of the i-loop. This is very important information, because we can now vectorize the i-loop with the if (p) branch remaining scalar:
// for (int i = 0; i < 4; ++i) { <-- vectorized
__m128 a = _mm_set_ps1(0.0);
for (int k = 0; k < n; ++k) {
  bool p = B[k] > 0;
  if (p) {
    __m128 d = _mm_loadu_ps(&C[k][0]);
    a = _mm_add_ps(a, d);
  }
}
_mm_storeu_ps(&A[0], a);
// }
Vectorization is trivial, and that's because of the uniformity of p; otherwise, we would have to do weird trickery like what we'll describe below.
The idea that I'll present can be further generalized with an else:
for (int i = 0; i < N; ++i) {
  if (cond[i]) { // Divergent condition (and hence divergent branch)
    value[i] = some computation;
  } else {
    value[i] = some other computation;
  }
}
Basically, if we don't have the else, the "other computation" is simply the value that was in the output before.
So, the big idea is that we'll compute two vector outputs. The first is going to be the one assuming that the condition is always true. The second will be assuming the condition is always false. So, if we had this code:
for (int i = 0; i < 4; ++i) {
  if (cond[i]) {
    value[i] = 1;
  } else {
    value[i] = 2;
  }
}
Conceptually, we're going to compute, let's say, value1 = {1, 1, 1, 1} and value2 = {2, 2, 2, 2}. Then, to construct the final output, in each element i we're going to put the value from either value1[i] or value2[i], depending on cond[i]. So, for instance, if we assume here that cond = {true, true, false, true}, then the final output vector will be value = {1, 1, 2, 1}. A high-level algorithm follows.
FOR j := 0 to 3
  IF cond[j]
    value[j] := value1[j]
  ELSE
    value[j] := value2[j]
  FI
ENDFOR
Note something important: when this algorithm starts, the vector values value1 and value2 are already computed! We don't compute them in this algorithm; we just copy the appropriate bits to value.
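For reference, here's the selection step above written as plain scalar C (my own sketch; the function name blend_scalar is mine, and value1 and value2 are assumed to be fully computed already):

```c
/* The high-level selection algorithm from above, as scalar C.
   It computes nothing; it only picks, per element, which of the
   two precomputed arrays to keep. */
void blend_scalar(const int *cond, const float *value1,
                  const float *value2, float *value) {
    for (int j = 0; j < 4; ++j)
        value[j] = cond[j] ? value1[j] : value2[j];
}
```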
You might be thinking, "OK, that's cute, but how is this beneficial at all if I have to do all this post-processing?" That's a fair concern, but what if I told you that this algorithm is already implemented in the hardware? It's the family of "blend" instructions. For 4-wide vectors of 32-bit floats, we can use the intrinsic __m128 _mm_blendv_ps(__m128 a, __m128 b, __m128 mask).
This intrinsic treats the mask argument as having four 32-bit lanes, but it reads only the most significant bit (MSB) of each lane. If this bit is 1, then the corresponding lane of the output gets the corresponding lane of the second argument. For example, if the MSB of the first lane of mask is 1, the first lane of a is 23, and the first lane of b is 46, then the first lane of the output will get the value 46 (so note here that if the MSB is set, we copy from the second argument, not the first). If this MSB were 0 instead, the first lane of the output would be 23. In essence, mask chooses, lane by lane, whether the output copies from the first or the second argument.
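Here's a tiny demonstration of that semantics (a sketch of mine, assuming a compiler like GCC or Clang that supports the target attribute for enabling SSE4.1 per function): we set only the first lane of the mask to all-1s, so only the first output lane comes from b.

```c
#include <immintrin.h>

/* Demonstrates _mm_blendv_ps: lane 0 of the mask has its MSB set,
   so lane 0 of the result is taken from b; the other lanes come from a. */
__attribute__((target("sse4.1")))
float first_lane_blend(void) {
    __m128 a = _mm_set1_ps(23.0f);
    __m128 b = _mm_set1_ps(46.0f);
    /* -1 is all-1 bits; _mm_set_epi32's last argument is lane 0.
       _mm_castsi128_ps reinterprets the integer bits as floats. */
    __m128 mask = _mm_castsi128_ps(_mm_set_epi32(0, 0, 0, -1));
    __m128 res = _mm_blendv_ps(a, b, mask);
    float out[4];
    _mm_storeu_ps(out, res);
    return out[0];
}
```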
The only issue now is how to construct this mask parameter, or more specifically, its most significant bits. To set them according to the values of a vector cond, we'll use the intrinsic _mm_cmp_ps.
This intrinsic compares two 128-bit vectors, lane-wise, meaning it compares pairs of lanes; the first lane of the first argument with the first lane of the second argument etc. For each pair of lanes, if the comparison yields true, then the corresponding lane in the output is filled with 1s (MSB included) otherwise it is filled with 0s. The third argument in the intrinsic is the comparison we want to do. For instance, we may want to compare whether one lane is equal to the other, or less, or greater. Now, if you open the intrinsic's description, you'll see many more operators than you might expect. I'm going to skip the specifics but if you want to learn more, you can see this nice post that summarizes them.
Let's return to our example:
for (int i = 0; i < VECTOR_WIDTH; ++i) {
  if (cond[i]) {
    out[i] = a[i] + b[i];
  }
}
How can we use what we've learned to vectorize that? Well, we'll have two vectors and we'll blend them. The first, call it sum, will have the element-wise sums in its lanes. The second will have the values of out before the loop. To create the mask for the blend, we'll load cond into a vector. But what will we compare it with?
Remember that the blend intrinsic copies from the first argument (which we assume is sum) if the corresponding MSB in mask is 0, not 1. So, we need to set the most significant bits in mask in such a way that if cond[i] is true, the corresponding MSB is 0, and vice versa. To do that, we'll compare the cond vector for equality with a vector that has zero in all lanes. Thus, if a lane of cond is zero, the corresponding MSB will be 1, and vice versa.
The final code can be seen in this Godbolt snippet. This deals only with arrays of 4 elements, but it's easy to use the knowledge of the first part of this series and make it work for arrays of any size.
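For completeness, here's my own sketch of how the pieces fit together for the 4-element case (the function name cond_add is mine, and I use the SSE2 comparison _mm_cmpeq_ps in place of the AVX _mm_cmp_ps with the equality operator; it may differ in details from the snippet):

```c
#include <immintrin.h>

/* If-converted version of:
   for (i = 0; i < 4; ++i) if (cond[i]) out[i] = a[i] + b[i]; */
__attribute__((target("sse4.1")))
void cond_add(const float *cond, const float *a, const float *b, float *out) {
    __m128 a_v  = _mm_loadu_ps(a);
    __m128 b_v  = _mm_loadu_ps(b);
    __m128 old  = _mm_loadu_ps(out);      // values if the condition is false
    __m128 sum  = _mm_add_ps(a_v, b_v);   // values if the condition is true
    /* Lanes where cond == 0 become all-1s, so the blend copies from `old`
       (the second argument) exactly where the condition is false. */
    __m128 mask = _mm_cmpeq_ps(_mm_loadu_ps(cond), _mm_setzero_ps());
    _mm_storeu_ps(out, _mm_blendv_ps(sum, old, mask));
}
```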
We should be careful when using this technique, because we might wreck correctness. Consider, for example, the code below:
for (int i = 0; i < 4; ++i) {
  if (cond[i]) {
    value[i] = 1 + b[i];
  } else {
    value[i] = 2 / b[i];
  }
}
What if, for some i, cond[i] is non-zero (i.e., we won't take the else) and b[i] is zero? The original program would never perform the division by zero, but the vectorized program will. With integers this crashes; with floats it silently produces an inf. Either way, we have changed the semantics of the program. So, when using this technique, we should be sure that both branches can be executed unconditionally.
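One common workaround, sketched below under my own assumptions (float values, the function name safe_div_blend is mine), is to neutralize the unsafe lanes before executing the path for all of them: replace the zero denominators with 1.0, since the results in those lanes are discarded by the blend anyway.

```c
#include <immintrin.h>

/* If-converted version of the loop above, with the division made safe:
   if (cond[i]) value[i] = 1 + b[i]; else value[i] = 2 / b[i]; */
__attribute__((target("sse4.1")))
void safe_div_blend(const float *cond, const float *b, float *value) {
    __m128 b_v  = _mm_loadu_ps(b);
    __m128 zero = _mm_setzero_ps();
    __m128 ones = _mm_set1_ps(1.0f);
    /* Where b == 0, divide by 1.0 instead; those lanes never reach the
       output because the final blend takes the other path there. */
    __m128 b_safe = _mm_blendv_ps(b_v, ones, _mm_cmpeq_ps(b_v, zero));
    __m128 then_v = _mm_add_ps(ones, b_v);                  // 1 + b[i]
    __m128 else_v = _mm_div_ps(_mm_set1_ps(2.0f), b_safe);  // 2 / b[i]
    /* MSB set where cond == 0, so the blend picks else_v there. */
    __m128 mask = _mm_cmpeq_ps(_mm_loadu_ps(cond), zero);
    _mm_storeu_ps(value, _mm_blendv_ps(then_v, else_v, mask));
}
```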
The second thing we should consider is whether one path of the branch is significantly more expensive than the other. If they're about the same cost, then by using this technique we're doing double the amount of work compared to the original program; but because we're also vectorizing by, say, a factor of 4, the net result is usually a win. However, if one path is significantly slower, then executing it for all the iterations might result in a slowdown, especially if the probability of taking that path is low (which depends on the condition), in which case the original program would rarely execute it.
With these three articles, we have assembled a powerful arsenal for vectorizing code. So, I'd like to stop introducing new concepts for a bit and instead try to solve some cool problems using what we have learned. I'm not sure when the next article will appear, but let's see...