September 13, 2020

In this part, we're going to discuss the concept of reduction. Here's the first part of this series. If you know nothing about vectorization, you probably want to take a look at it.

Reduction (Variable)

A reduction variable (or simply "reduction")
is an operation that happens in a specfic variable, in every iteration of a loop. It is in the form:
`new_value = old_value OP data`

. That is, in every iteration, the new value
of the variable is an operation involving the old value and some data. Now, the important thing here
is that the `OP`

is the *same* in *every* iteration. Furthermore,
for this to make sense, the variable has to have some initial value.

The simplest kind of reduction operation / variable that you have seen is the sum of an array of integers.

int sum_of_array(int *a, int len) { int sum = 0; for (int i = 0; i < len; ++i) sum = sum + a[i]; }

Notice that a reduction is a special kind of a loop-carried dependency. This is when an iteration depends on an earlier one, usually the previous one, which is what happens in reductions since the new value depends on / is computed based on the old value, which was computed in the previous iteration.

In general, when loop-carried dependencies are present, we have a huge problem vectorizing / parallelizing the code because they enforce a sequential order. If the second iteration depends on the first, then the first has to be executed before the second and so we can't execute those two in parallel. But that's what happens with the second and third and inductively, the whole loop has to be executed serially. But reductions are special as you'll see below...

Don't confuse the reduction variables with induction variables. An induction variable is a
variable that gets increased or decreased by a fixed amount on every iteration of a loop
or is a linear function of another induction variable. For example, in the following loop,
`i`

and `j`

are induction variables:

for (int i = 0; i < len; ++i) j = 17 * i; }

`i`

gets increased by a fix amount on every iteration and
`j`

is a linear function of `i`

. Now, to be honest,
I don't know where did the names "induction" and "reduction" come from but it's important to remember
their differences because they're used quite a lot in compiler parlance (especially the induction variables).

Reduction variables are especially interesting when the operation is associative
(like addition or multiplication) because partial values of the final result can be computed
individually (it's even more interesting if it is also commutative, as we'll see below).
For example, let's say we have to sum the integers `1 + 2 + 3 + 4`

.
Instead of pedantically following this sequential order (i.e., `(((1 + 2) + 3) + 4)`

), we can also do `(1 + 2) + (3 + 4)`

,
that is, compute `1 + 2`

and `3 + 4`

*independently* and then
add these partial results to get the final result.

When something can be done independently, it is implied that it can be done in *parallel*. One way to do that is using multi-threading /
multi-processing. One *other* way to do that is using vectorization!

Note that floating-point (FP) addition and multiplication are commutative but not (necessarily)
associative. That is, you can be sure that `a + b = b + a`

but not
that `(a + b) + c = a + (b + c)`

. So, the compiler won't generally
do FP optimizations that are based on associativity, *unless you tell it that you don't care *(e.g. with `-ffast-math`

). Most of the time, you don't care (unless
you're doing e.g. high-precision scientific computing), so you may as well use this flag (which is enabled by default
in some compilers on high levels of optimization).

Vectorizing Accumulation

If you want to parallelize accumulation (or whatever similar operation), I think there are two obvious and intuitive
ways to approach. The fisrt is the multi-threading approach. Here the idea is very simple. If we have P threads,
and N data, then give N / P data to each thread. That is, if we have P threads, each thread can take an equal (what
is equal in general is not as simple as N / P, but this is out of topic) part of the data to handle *in parallel*.
And then they will finish in roughly the 1 / P of the time that would be required for a single
thread.

Note: Instead of the term "thread" above, we could also use the term "processor", "node" etc. The point
is that we have a system in which more than one instructions can be run in *parallel*.

Vectorization, however, doesn't work like that. You have only *one* instruction to run at any point. What you have that is "many" are the *lanes*. So, how do we take
advantage of that ? Well, let's say we want to add the numbers 1, 2, 3, 4, 5, 6, 7, 8, and they're stored in this
order in memory. Just put pairs of two in every pair of lanes
in (two) vector registers.

Then, just do normal, element-wise, vectorized addition. This will give us a vector with 4 partial results, which we can add to get the final result.

That's ok but how do you get the data in the lanes ? That is, if I load (the first) 4 values from the array, those will be 1,2,3,4. Then, if I load another 4, I would load the values 5,6,7,8. That doesn't agree with what I described above..

They key insight (which is probably quite intuitive) is that addition is not only associative but also commutative! This means you can change the order of the operands and you'll get the same result i.e. `(1+2) + (3+4) + (5+6) + (7+8) = (1+5) + (2+6) + (3+7) + (4+8)`

. The right-hand-side of this is precisely what is convenient for us since is the *natural* order in
which the data fill the lanes.

Now, for a big array, we start with a register with 0 in all lanes (the initial value). Then we add to it the first 4-packed values. Then we add the output with the next set of 4 values etc. In the final output register, each lane will have a partial result and to get the final result, we just add the four lanes together. Let's see how do we do that in code!

We already know how to load values and do addition of 4-packed values from the previous post, which means that what is basically missing is how to add the 4 lanes of a register. This is an operation that is known as a "horizontal add" (from the obvious visual interpretation) and we'll use the intrinsic _mm_hadd_epi32 for 4-packed 32-bit integer values.

Actually, the horizontal add operation adds the 1st lane to the 2nd and the 3rd to the 4th and saves the two results in the two least significant lanes of the output. We need to repeat another horizontal add to get the final result in the least significant lane of the output.

int add_all_lanes_epi32(__m128i v) { v = _mm_hadd_epi32(v, v); v = _mm_hadd_epi32(v, v); return // Extract least significant lane from `v`; }

Next, we need to learn how do we extract the least significant lane from a register. I know that it would
make a lot of sense to just be able to do `v[<lane index>]`

but remember
that this whole "vectorization" interface with the intrinsics, types and stuff is built *over C*.
So, if you write that, the C compiler will complain that `v`

is not of array / pointer
type since you can only index arrays and pointers (and you can't overload operators in C, in which case we would
overload `[]`

for `__m128i`

). Fortunately, there's an intrinsic
for our job that again, does not have the easiest-to-remember name: _mm_cvtsi128_si32. I usually wrap it into a function

int get_low_32_bits_si128(__m128i v) { return _mm_cvtsi128_si32(v); }

So, our final function to add all lanes is:

int add_all_lanes_epi32(__m128i v) { v = _mm_hadd_epi32(v, v); v = _mm_hadd_epi32(v, v); return get_low_32_bits_si128(v); }

Now, we're ready to write the whole thing, in the spirit of the previous article:

int sum_of_array(int *A, int len) { int end = (len % 4 == 0) ? len : len - 4; int i = ; __m128i acc = // Register with all lanes equal to 0; for (int i = 0; i < end; i += 4) { __m128i v = loadu_si128(&A[i]); acc = add4(v); } int final_result = add_all_lanes_epi32(acc); for (; i < len; ++i) { final_result += a[i]; } return final_result; }

To finalize that, we need an intrinsic that returns us a register that has all lanes equal to zero: _mm_set1_epi32. Quiz: Why is this named set1 (instead of e.g. set) ? Is there a set0 and if yes, why ?

Just to mention that if we wanted to make a version that handles errors, we would have to handle overflow etc. but those are beside the point.

Vectorizing Other Reductions and Conclusion

What we learned in this article is not just how to vectorize accumulation. This is a generic framework to vectorize any reduction operation as long as it fits the precondtions we discussed. In the next article we will examine the peculiarities of vectorizing conditional code!