Look, I know what you're thinking: “Four years later is just way too long to publish a follow-up brooo!!” Ok, I know, but at least consider that until fairly recently, the only people who read this blog were my mom and a couple of relatives bound by debt. But to my pleasant surprise, I recently learned that some (other) people follow this blog—this series in particular—and they wanted me to continue it. So, here we go: we will finally apply everything you learned three and a half years ago (I hope your memory is doing well!) to an actual problem: convolution.
Now, a warning: Convolution is a general concept that has many applications in Computer Science.1 Here, we'll focus on image convolutions. Image convolution is a great candidate because it's incredibly simple to code a naive implementation, and just a bit hard (but not too hard) to vectorize. Plus, you get to produce cool effects, like blurring images.
For clarity, in this article I will only include parts of the code. You can find all the source code in this repository.
This is not a computer vision article, so we won't delve too deep into the specifics of image convolution and how to produce great visual results, nor are we going to consider all the different types of convolutions. But, we will consider a simple convolution that is cool enough.
For the rest of this article we'll deal with 3x3 convolutions. In these, a pixel in the output image is produced by a 3x3 "box" (or square) of pixels in the input image. More specifically, every pixel from the input image is somehow combined with its 8 surrounding pixels (collectively making a 3x3 box) to create a pixel in the output image. Even more specifically, the output pixel is a linear combination of the 9 input pixels: each of the 9 pixels is multiplied by a constant number, and finally all 9 products are summed up. This sum becomes the output pixel.
So, we need 9 constant numbers, or a 3x3 set of constant numbers. Generally, these 9 numbers are the same for every pixel (to produce a uniform effect), so we need just 9 numbers in total for the whole image. This 3x3 box of numbers is called a kernel.2 Now, assuming the kernel has values:
float blur_kernel[3][3] = {{k0, k1, k2},
{k3, k4, k5},
{k6, k7, k8}};
and we have a pixel at position in[y][x], the output pixel out[y][x] is computed as:
// top left               // top middle        // top right
res  = k0*in[y-1][x-1]  + k1*in[y-1][x]    + k2*in[y-1][x+1];
// middle left            // center             // middle right
res += k3*in[y][x-1]    + k4*in[y][x]      + k5*in[y][x+1];
// bottom left            // bottom middle      // bottom right
res += k6*in[y+1][x-1]  + k7*in[y+1][x]    + k8*in[y+1][x+1];
out[y][x] = res;
To visualize this, imagine the image is a piece of paper with the pixels as squares, and the kernel is another, 3x3 piece of paper (with squares of the same size). To compute the output value of a pixel, you place the kernel on top of the input pixel such that the center of the kernel aligns with the pixel perfectly. Then, the top-left square of the kernel aligns with the pixel to the top-left of the input pixel, and in general all the surrounding pixels align with a square of the kernel. Then, you just multiply each square of the kernel with the pixel it aligns with. You finally sum up all these products. This animation should help:
We don't care what the values of the kernel are, but since we started the discussion, it would be good to note the following: Consider a kernel with 1s for all 9 elements (which happens to be a useful kernel, as we'll see shortly). Let's say the input pixel and all its 8 surrounding pixels have the value 100. If we do the convolution, the result will be: 1x100 + 1x100 + ... = 9x100 = 900, which is outside the usual range of 0-255. But if we divide each element by 9, we get 100, which is what we expect: the output pixel should be the same as the input pixel, because the input pixel and all its surroundings have the same value.

In this article, we'll be using the normalized box-blur kernel (although what we'll say applies to any 3x3 kernel and generalizes to other sizes):
float blur_kernel[3][3] = {{1.0f / 9, 1.0f / 9, 1.0f / 9},
{1.0f / 9, 1.0f / 9, 1.0f / 9},
{1.0f / 9, 1.0f / 9, 1.0f / 9}};
It turns out when you want to work with images, you need to deal with loading (and storing) them. This is not great for us because this article is not about image formats; ideally, we'd like to magically get the raw pixels in an array in memory. This magic takes some effort to do with standard image formats like JPEG and PNG because they're compressed, so we need to decompress them (which is not trivial) to load an image and compress them to store it. We don't want to deal with that, so let's use an uncompressed format instead.
Netpbm provides a series of very simple formats. At first, we'll use the PGM format, which is grayscale. This means there's a single channel, i.e., every pixel has a single value—usually 0-255, and that's what we'll use too—from black to white: 0 is total black and 255 is total white. The values in between are (more than 50) shades of grey.
Here's an example of a 3158x4210 image in the PGM format:
P5
3158 4210
255
<raw bytes for pixel values>
You can read more about it in the PGM specification, but basically we only care about the dimensions (second line) and the raw bytes. In the repository I include code for reading and writing PGM images. I won't explain it here, it's super simple.
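If you're curious anyway, here's roughly what a minimal P5 reader looks like (a sketch, not necessarily the repo's version; it assumes the PgmImage struct we'll define in the next section and ignores '#' comments, which the spec allows in the header):

#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

PgmImage read_pgm(const char *filepath) {
  FILE *f = fopen(filepath, "rb");
  assert(f != NULL);
  int width, height, maxval;
  // Header: magic number, dimensions, maximum pixel value.
  int matched = fscanf(f, "P5 %d %d %d", &width, &height, &maxval);
  assert(matched == 3 && maxval == 255);
  fgetc(f); // Consume the single whitespace byte before the raw pixels.
  uint8_t *pixels = (uint8_t *)malloc((size_t)width * height);
  size_t n = fread(pixels, 1, (size_t)width * height, f);
  assert(n == (size_t)width * height);
  fclose(f);
  return (PgmImage){.width = width, .height = height, .pixels = pixels};
}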
At the end of the article, we'll also work with colored images (i.e., RGB). Netpbm provides a similarly simple format for RGB images, PPM. It's basically like PGM, except it has bytes for all three channels.
In the repo, I have provided two images in JPEG, PGM, and PPM. But you can easily work with any image. For example, to convert a .jpg image to .pgm, you can install ImageMagick on Linux and use its convert tool:
convert window.jpg window.pgm
These explanations may have scared you into thinking we're dealing with something complicated, but hopefully this section will convince you otherwise. We first define a struct for images:
struct PgmImage {
int width, height;
uint8_t *pixels;
};
and the kernel:
static float blur_kernel[3][3] = {{1.0f / 9, 1.0f / 9, 1.0f / 9},
{1.0f / 9, 1.0f / 9, 1.0f / 9},
{1.0f / 9, 1.0f / 9, 1.0f / 9}};
Now, our main function looks like this:
int main() {
const char *filepath = "./images/window.pgm";
PgmImage img = read_pgm(filepath);
uint8_t *pixels_out = (uint8_t *)malloc(img.width * img.height);
PgmImage blurred = {
.width = img.width, .height = img.height, .pixels = pixels_out};
box_blur(img, blurred);
write_pgm(blurred, "./images/window_blurred.pgm");
return 0;
}
We pre-allocate the blurred image for reasons that will become apparent shortly. Now, we define box_blur as follows:
void box_blur(PgmImage img_in, PgmImage img_out) {
assert(img_in.height == img_out.height);
assert(img_in.width == img_out.width);
int width = img_in.width;
int height = img_in.height;
for (int y = 1; y < height - 1; y++) {
for (int x = 1; x < width - 1; x++) {
float sum = 0.0f;
for (int ky = -1; ky <= 1; ky++) {
for (int kx = -1; kx <= 1; kx++) {
uint8_t pixel = img_in.pixels[(y + ky) * width + (x + kx)];
float pixel_f = (float)pixel;
float kernel_val = blur_kernel[ky + 1][kx + 1];
sum += pixel_f * kernel_val;
}
}
img_out.pixels[y * width + x] = (uint8_t)sum;
}
}
}
It's basically a dead simple implementation of the method we described earlier. y and x iterate over the whole image; ky and kx iterate over a 3x3 window.
Note that we don't handle edges: we're not going through the outermost rectangle of pixels (i.e., we start one pixel in on both width and height).3 Again, this article is not about computer vision, but you can handle these separately relatively easily. That said, you can already run this over your image and you will see it blurred! To see a clearer effect you can perform multiple iterations, in which case main looks like this:4
// ... same
for (int it = 0; it < 15; ++it) {
box_blur(img, blurred);
// Swap
uint8_t *tmp = img.pixels;
img.pixels = blurred.pixels;
blurred.pixels = tmp;
}
// ... same
Let's do a quick analysis of the performance of this code. First, locality. At first sight, the locality is not bad because we're iterating over the image in row-major order, which matches the image's layout. However, at a second glance, we can observe that we're jumping rows in the two innermost loops, which is bad for cache locality. Nevertheless, this is not as bad as you may imagine, because the cold cache lines are reused. For example, let's say we are processing the pixel at position y, x. When we read in[y-1][x], we probably get a cache miss, and we load a cache line. This cache line has, say, 64 bytes, i.e., 64 pixels. 3 of these pixels are used in the computation of y, x. But also, 2 of them are reused in the computation of y, x+1 (the next position we'll process). What is more, the pixel in[y-1][x+2], which is not used in y, x, is used in y, x+1, and it's hot, because we just loaded it.
Moreover, when we consider performance, we should take a look at what the compiler does. One clear opportunity for the compiler is to unroll the kx and ky loops, because they have a constant, compile-time-known trip count. Clang 19 does that at -O2. In this godbolt snippet, you can see that there is no loop code (e.g., a back-edge jump to the beginning of the loop) for the two loops. Besides that, I passed -Rpass=loop-unroll to get a compiler report from the loop-unrolling pass. Indeed, it tells us it unrolled these two loops. GCC doesn't do it at -O2 (e.g., in this godbolt snippet, you can clearly see the jump at line 51, which is the back-edge of the innermost loop, as Godbolt's highlighting suggests), but it does do it at -O3 (godbolt).5
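For reference, reproducing the unrolling report locally would look something like this (the file name is hypothetical):

clang -O2 -Rpass=loop-unroll -c box_blur.c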
It's time to start (manually) vectorizing the code! But, 2D convolution is a bit hard to tackle right away. So, we'll first vectorize 1D convolution which is slightly easier but still interesting, and it's a building block for 2D convolution.
The core idea of 1D convolution is the same as that of 2D convolution: the value of an output pixel is the sum of products of the corresponding input pixel and its surrounding pixels. But here, instead of being affected by surrounding pixels in 2 dimensions, we just consider pixels in one dimension. In simple English, this means we consider surrounding pixels only in the same row. Even more simply, we will just consider a usual case: only the left and the right pixel. Here's a video with an animation. Similarly, we use a 1x3 kernel, or simply a 1D kernel with 3 elements. Thus, we can write code that looks as follows:
uint8_t *in_pixels = img_in.pixels;
uint8_t *out_pixels = img_out.pixels;
for (int i = 1; i < (width * height)-1; ++i) {
uint8_t left = in_pixels[i-1];
uint8_t center = in_pixels[i];
uint8_t right = in_pixels[i+1];
float left_f = (float)left;
float center_f = (float)center;
float right_f = (float)right;
float sum = kernel[0]*left_f + kernel[1]*center_f + kernel[2]*right_f;
uint8_t sum_uint8 = (uint8_t)sum;
out_pixels[i] = sum_uint8;
}
Note that since we're going over a single dimension, we don't need two outer loops. Similarly, because the kernel is small, we completely eliminate the two inner loops. For the kernel, we can use a simple one like:
static float kernel[3] = {0.33f, 0.33f, 0.33f};
Note that, as far as I know, 1D convolution is not really meant to be applied to images, because images are fundamentally 2-dimensional objects. However, if we run this, we get a sort of blurred image that doesn't look that bad.
For simplicity, in the rest of the article I'll assume we have vectors of size 4 (i.e., 4 lanes).
Ok, how do we vectorize this? If we eyeball the code, it seems like we can easily just store the 3 kernel values—call them k0, k1, k2—in a vector (the fourth lane is unused):

the 3 pixel values—call them s0, s1, s2—in another:

and just multiply the two vertically, which nicely fits the SIMD model:
But... At the end, we need to sum up the elements of the resulting vector. This addition is horizontal which does not fit the SIMD model. SSE gives us (link):
__m128 _mm_hadd_ps (__m128 a, __m128 b)
dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]
But horizontal additions are slow. In fact, observe that given the specification of this instruction, we need to do two horizontal additions (I'd suggest that you ponder why; the explanation is in the footnote).6 Generally, here are 2 characteristics of this method that you should worry about when you do SIMD, because they usually mean you have not modeled the problem correctly:

- The additions are horizontal, not vertical.
- We only produce a single output pixel at a time, even though the vectors have room for more (and one lane is wasted outright).
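For concreteness, here's what reducing one vector to a scalar sum with two hadds looks like (a minimal sketch; hsum_ps is my name for it):

#include <immintrin.h>

// Sum the 4 lanes of v using two horizontal adds.
static float hsum_ps(__m128 v) {
  __m128 h1 = _mm_hadd_ps(v, v);   // [v0+v1, v2+v3, v0+v1, v2+v3]
  __m128 h2 = _mm_hadd_ps(h1, h1); // every lane now holds v0+v1+v2+v3
  return _mm_cvtss_f32(h2);        // extract lane 0
}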
To solve this, let's start from what we want to achieve. The problem is basically the additions: we need to make those vertical. This means we need to somehow put the three products (that are to be added) in 3 different vectors:
But how are we going to do this and what are we going to put in the other lanes? First, note that to have products in vectors, these multiplications also need to happen vertically. In other words, to get the first vector above, we should have two vectors that look like:
Similarly for the other 2 products. Of course, we can get this by having k0, k1, and k2 in one vector and s0, s1, and s2 in the other, but we already established this won't work. The big idea is to broadcast the same value in one vector:
Observe that we need to multiply k0 with both s0 and s1 (and s2). We multiply it with s0 for the current pixel and with s1 for the next pixel. This now starts addressing the second bullet above: we start dealing with multiple pixels at once! So, we end up with:
The next big idea is that we do the same thing but starting from the next pixel and using k1 (s3 here is the pixel that comes after s2):
The idea is the same as before: k1 needs to be multiplied with s1 for this pixel and with s2 for the next pixel. So, we do this once more:
And now, we have addressed both problems (bullets) above at once: The products are structured so that they can be added vertically, and one addition will give us three output pixels.
One final thing to address is the ? in the fourth lane. There's no reason to limit this idea to only three pixels at once. We can do the same thing for four pixels at a time:
Now, we produce 4 output pixels at once.
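Putting the three steps together, and writing s0, s1, ... for consecutive input pixels, the lanes line up like this (a sketch of the scheme, not code from the repo):

// acc  = [k0*s0, k0*s1, k0*s2, k0*s3]
// acc += [k1*s1, k1*s2, k1*s3, k1*s4]
// acc += [k2*s2, k2*s3, k2*s4, k2*s5]
//
// Lane j now holds k0*s[j] + k1*s[j+1] + k2*s[j+2], which is the 1D
// convolution output for pixel j+1: one vector = 4 consecutive output pixels.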
It's time to write this in code. But there is a problem. If you look at the scalar version, there are many conversions. These conversions would happen implicitly in C, and I didn't need to introduce, e.g., left_f; I did it to show you explicitly how many conversions are needed. This is generally not good, because conversions are in some sense useless work. Unfortunately, they are also necessary: we are multiplying 8-bit unsigned integers with floats, so some conversion needs to happen for the multiplication to be possible. Converting the floats (i.e., the kernel values) to integers would give us inaccurate results, so we need to convert the integers to floats. There is a workaround to that, but for now let's live with it.
In short, all these conversions need to be made explicitly in vectorized code. Not only that, but there are no instructions to convert uint8_t directly to float (or the other way around). So, for example, we need to first convert them to 32-bit (signed)7 integers and then convert these to floats. With all that said, here's the code:
void conv_vec(PgmImage img_in, PgmImage img_out) {
// ...
__m128 kernel_vec[3];
// Repeat each kernel value in a 4-wide register
for(int i = 0; i < 3; ++i) {
kernel_vec[i] = _mm_set1_ps(kernel[i]);
}
int len = (width*height) - 4 - 1;
for(int i = 1; i < len; i+=4) {
// Zero accumulator
__m128 acc = _mm_setzero_ps();
for(int k = -1; k <= 1; ++k) {
// Load 4 uint8_t values into the low 32 bits of a 128-bit vector
__m128i uint8_vec = _mm_loadu_si32(in_pixels + i + k);
// Zero-extend the uint8_t values to 32-bit integers
__m128i uint32_vec = _mm_cvtepu8_epi32(uint8_vec);
// Convert the integers to floats
__m128 float_vec = _mm_cvtepi32_ps(uint32_vec);
// Compute the products
__m128 prod = _mm_mul_ps(kernel_vec[k + 1], float_vec);
// Accumulate the 4 parallel values
acc = _mm_add_ps(acc, prod);
}
// Convert the floats to 32-bit integers
__m128i int32_vec = _mm_cvtps_epi32(acc);
__m128i zero = _mm_setzero_si128();
// Pack 32-bit signed integers into 16-bit signed integers.
__m128i int16_vec = _mm_packs_epi32(int32_vec, zero);
// Pack 16-bit signed integers into 8-bit unsigned integers.
__m128i uint8_vec = _mm_packus_epi16(int16_vec, zero);
// Store. Note we go +4 every time.
_mm_storeu_si32((out_pixels + i), uint8_vec);
}
}
We're handling all the conversions directly, but unfortunately this code doesn't produce the same results as the scalar version. The reason is subtle: it's the rounding mode. Both the implicit and the explicit casts in the scalar version round down. The conversion instructions we use do not necessarily round down. The good news is that we can specify the rounding mode by setting the value of a register called MXCSR. We can use the following code:
// Get current MXCSR value
unsigned int mxcsr_value = _mm_getcsr();
// Clear rounding control bits (13-14)
mxcsr_value &= ~(3 << 13);
// Set rounding control to "round down" (01)
mxcsr_value |= (1 << 13);
// Set the modified value back to MXCSR
_mm_setcsr(mxcsr_value);
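As an aside, if you don't want to fiddle with the bits yourself, xmmintrin.h provides a helper macro that does the same thing:

#include <xmmintrin.h>

// Equivalent to the manual MXCSR manipulation above.
_MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN);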
Now this code produces almost the same result as the scalar version. If we run both and write some checking code like:
int len = (img.height * img.width);
for (int i = 0; i < len; ++i) {
if (pixels_ser[i] != pixels_vec[i]) {
printf("diff: %d\n", (img.height * img.width)-i);
assert(false);
}
}
We'll see that diff is 3: in the vectorized version we're going in steps of 4, and we don't handle the last couple of pixels. You can just have some scalar residual iterations to handle that, as I explained in Part 1.
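Such residual iterations could look something like this (a sketch; it assumes i is declared before the vectorized loop so that it keeps its final value):

// Scalar residual loop: finish the pixels the 4-wide loop didn't cover.
for (; i < (width * height) - 1; ++i) {
  float sum = kernel[0] * (float)in_pixels[i - 1] +
              kernel[1] * (float)in_pixels[i] +
              kernel[2] * (float)in_pixels[i + 1];
  out_pixels[i] = (uint8_t)sum;
}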
Now that we've done all this, 2D convolution is actually pretty easy. We can do a 2D convolution by doing three 1D convolutions. To see this, consider doing a 1D convolution on row y-1 with the first 3 elements k0, k1, and k2 (i.e., the first row) of the 2D kernel. The first element of the output will be:
Now, consider doing a 1D convolution on the next row (i.e., y) using the next 3 kernel elements k3, k4, and k5. The first element of the output will be:
Finally, we do a 1D convolution on the row after that, i.e., y+1. The first element of the output will be:
Notice now that if we add these 3 elements, we'll get the convolution for pixel in[y, 1] (you can refer back to the introduction). This generalizes to all pixels. So, here's how to translate that to code: for every row y, we'll do a 1D convolution on row y-1 with the first 3 kernel values, then another 1D convolution on row y with the middle 3 kernel values, and finally another 1D convolution on row y+1 with the final 3 kernel values. Then, we'll just add them vertically.
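Spelled out for the first (leftmost inside) pixel of a row, in the notation of the introduction, the three partial sums are:

r0 = k0*in[y-1][0] + k1*in[y-1][1] + k2*in[y-1][2]; // 1D conv on row y-1
r1 = k3*in[y][0]   + k4*in[y][1]   + k5*in[y][2];   // 1D conv on row y
r2 = k6*in[y+1][0] + k7*in[y+1][1] + k8*in[y+1][2]; // 1D conv on row y+1
out[y][1] = r0 + r1 + r2; // exactly the 3x3 convolution for in[y][1]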
But, before presenting the code, let's make our life easier by avoiding the casts we dealt with in the previous section. To do that, we'll first convert the input pixels to float. The output pixels will be float too, but we need to finally convert them to uint8_t to store them in a .pgm file. So, our main will look something like:
int main() {
PgmImage img = read_pgm(...);
uint8_t *pixels_in_u8 = img.pixels;
float *pixels_in = (float *)malloc(img.width * img.height * sizeof(float));
for (int i = 0; i < img.width * img.height; ++i) {
pixels_in[i] = (float)pixels_in_u8[i];
}
float *pixels_out = (float *)malloc(img.width * img.height * sizeof(float));
conv_2d(pixels_in, pixels_out, img.width, img.height);
uint8_t *pixels_out_u8 = (uint8_t *)malloc(img.width * img.height * sizeof(uint8_t));
for (int i = 0; i < img.width * img.height; ++i) {
pixels_out_u8[i] = (uint8_t)pixels_out[i];
}
PgmImage blurred = {
.width = img.width, .height = img.height, .pixels = pixels_out_u8};
write_pgm(blurred, ...);
}
The useful code now will be the following:
void conv_2d(float *in_pixels, float *out_pixels, int width, int height) {
float *tmp_rows[3];
for(int i = 0; i < 3; ++i) {
tmp_rows[i] = (float *)malloc(width * sizeof(float));
}
int len_1d = width - 4 - 1;
for (int y = 1; y < height - 1; y++) {
// Compute 3 1D convolutions over consecutive rows.
conv_1d(in_pixels + (y-1) * width, tmp_rows[0], len_1d, blur_kernel[0]);
conv_1d(in_pixels + y * width, tmp_rows[1], len_1d, blur_kernel[1]);
conv_1d(in_pixels + (y+1) * width, tmp_rows[2], len_1d, blur_kernel[2]);
// Sum them up and store.
for(int x = 0; x <= width - 4 - 1; x+=4) {
__m128 sum1 = _mm_add_ps(_mm_load_ps(&tmp_rows[0][x]), _mm_load_ps(&tmp_rows[1][x]));
__m128 sum2 = _mm_add_ps(sum1, _mm_load_ps(&tmp_rows[2][x]));
_mm_storeu_ps(&out_pixels[y * width + x], sum2);
}
}
}
We'll also modify conv_1d to work with float:
void conv_1d(float *in_pixels, float *out_pixels, int len, float kernel[3]) {
int i;
__m128 kernel_vec[3];
// Repeat each kernel value in a 4-wide register
for (i = 0; i < 3; ++i) {
kernel_vec[i] = _mm_set1_ps(kernel[i]);
}
for (i = 1; i < len; i += 4) {
// Zero accumulator
__m128 acc = _mm_setzero_ps();
for (int k = -1; k <= 1; ++k) {
// Load 4-float data block (unaligned access)
__m128 data_block = _mm_loadu_ps(in_pixels + i + k);
// Make the products
__m128 prod = _mm_mul_ps(kernel_vec[k + 1], data_block);
// Accumulate the 4 parallel values
acc = _mm_add_ps(acc, prod);
}
// Store. Note we go +4 every time.
_mm_storeu_ps((out_pixels + i), acc);
}
}
Dealing with multi-channel images is not conceptually different. We just do convolution on each channel individually. The problem is the image layout. More specifically, the layout of PPM images is as follows: each pixel is a triplet of three 0-255 values, i.e., an RGB triplet. The triplets are laid out linearly one after the other:
RGB RGB RGB ...
As you can see, the problem is that the different channels are mixed. Ideally, we want to have:
RRR ...
GGG ...
BBB ...
In other words, the image is in an Array-Of-Structures (AOS) layout, while we want a Structure-of-Arrays (SOA) layout. So, we have to split the colors before the convolution, run three 2D convolutions, and recombine them afterwards to store the image. The code for doing the splitting and combining is simple. You can find it in the repository.
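For reference, here's roughly what it looks like (a sketch; the function names are mine and the repo's version may differ):

// Split an interleaved RGB buffer (AOS) into three per-channel buffers (SOA).
void split_channels(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b,
                    int num_pixels) {
  for (int i = 0; i < num_pixels; ++i) {
    r[i] = rgb[3 * i + 0];
    g[i] = rgb[3 * i + 1];
    b[i] = rgb[3 * i + 2];
  }
}

// The inverse: interleave the three channels back into an RGB buffer.
void combine_channels(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                      uint8_t *rgb, int num_pixels) {
  for (int i = 0; i < num_pixels; ++i) {
    rgb[3 * i + 0] = r[i];
    rgb[3 * i + 1] = g[i];
    rgb[3 * i + 2] = b[i];
  }
}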
To avoid the article getting too long, here are some exercises you can try on your own for a deeper understanding.
3. Skipping the edges also sidesteps an indexing subtlety: because the image is stored as a single contiguous array, the pixel "to the right" of the rightmost pixel on row y is the leftmost (inside) pixel on row y+1. In other words, if we read the middle-right neighbor of the pixel y, width-1, that will read y, width, which would be out of bounds in 2D terms. But because it's a single array, it will read in[y][width], which is in bounds of the flat array (it's the first pixel of row y+1). That doesn't mean it's a good idea to treat the leftmost pixel of the next row as a surrounding pixel of the rightmost pixel in a row, because they may have no visual connection.↩
5. Instructions whose mnemonics end in ss (e.g., mulss, addss) are scalar floating-point instructions.↩

6. With a single _mm_hadd_ps, half of the result comes from b (the second argument), so one hadd of a vector with itself only gets us the pairwise sums. There's nothing we can do better, because there's a dependency of the final result of the second element on the first, and of the third on the second. So, we can't parallelize anything. Finally, we'd issue a second horizontal add to get the final sum.↩