A Beginner's Guide to Vectorization By Hand: Part 4 - Convolution


Feb 15, 2025




Introduction

Look, I know what you're thinking: “Four years later is just way too long to publish a follow-up brooo!!” Ok, I know, but at least consider that until fairly recently the only people who read this blog were my mom and a couple of relatives who are bound to me by debt. But to my pleasant surprise, I recently learned that some (other) people follow this blog—this series in particular—and they wanted me to continue it. So, here we go: We will finally apply everything you learned three and a half years ago (I hope your memory is doing well!) to an actual problem: convolution.

Now, a warning: Convolution is a general concept that has many applications in Computer Science.1 Here, we'll focus on image convolutions. Image convolution is a great candidate because it's incredibly simple to code a naive implementation, and just a bit hard (but not too hard) to vectorize. Plus, you get to produce cool effects, like blurring images.


Source Code

For clarity, in this article I will only include parts of the code. You can find all the source code in this repository.


Image Convolution 101

This is not a computer vision article, so we won't delve too deep into the specifics of image convolution and how to produce great visual results, nor are we going to consider all the different types of convolutions. But, we will consider a simple convolution that is cool enough.

For the rest of this article we'll deal with 3x3 convolutions. In a 3x3 convolution, a pixel in the output image is produced from a 3x3 "box" (or square) of pixels in the input image. More specifically, every pixel of the input image is somehow combined with its 8 surrounding pixels (which collectively make a 3x3 box) to create a pixel in the output image. Even more specifically, the output pixel is a linear combination of the 9 input pixels: each of the 9 pixels is multiplied by a constant number, and all 9 products are summed up. This sum becomes the output pixel.

So, we need 9 constant numbers, or a 3x3 set of constant numbers. Generally, these 9 numbers are the same for every pixel (to produce a uniform effect), so we need just 9 numbers in total for the whole image. This 3x3 box is called a kernel.2 Now, assuming the kernel has values:

float blur_kernel[3][3] = {{k0, k1, k2},
                           {k3, k4, k5},
                           {k6, k7, k8}};

and we have a pixel at position in[y][x], the output pixel out[y][x] is computed as:

float res;
       // top left          // top middle      // top right
res  = k0*in[y-1][x-1] + k1*in[y-1][x]   + k2*in[y-1][x+1];
       // middle left       // center          // middle right
res += k3*in[y][x-1]   + k4*in[y][x]     + k5*in[y][x+1];
       // bottom left       // bottom middle   // bottom right
res += k6*in[y+1][x-1] + k7*in[y+1][x]   + k8*in[y+1][x+1];

out[y][x] = res;

To visualize this, imagine an image is a piece of paper with the pixels as squares, and the kernel is another piece of paper that is 3x3 (with squares of the same size). To compute the output value of a pixel, you place the kernel on top of the input pixel such that the center of the kernel aligns with the pixel perfectly. Then, the top-left square of the kernel aligns with the pixel to the top-left of the input pixel, and in general every square of the kernel aligns with one of the surrounding pixels. Then, you just multiply each square of the kernel with the pixel it aligns with. You finally sum up all these products. This animation should help:

2D Convolution Animation

We don't care what the values of the kernel are, but since we started the discussion, it would be good to note the following:

  • The center number in a kernel is almost always the largest because the source pixel should contribute the most (more than its surroundings).
  • Kernels tend to have a symmetry (e.g., many times the middle left and the middle right are the same; similarly for top middle and bottom middle).
  • Depending on the effect you want, there are different kernels available. You can see some here.
  • Kernels tend to be normalized. In practice, this means that we divide each element by a factor such that the sum of all 9 elements equals 1. The reason is that this keeps the final sum (i.e., the value of the output pixel) within the same range as the input pixel (usually 0-255). An easy example to see this is the kernel with 1s for all 9 elements (which happens to be a useful kernel, as we'll see shortly). Let's say the input pixel and all its 8 surrounding pixels have the value 100. If we do the convolution, the result will be: 1x100 + 1x100 + ... = 9x100 = 900, which is outside the usual range of 0-255. But if we divide each element by 9, then we will get 100, which is what we expect: i.e., the output pixel should be the same as the input pixel because the input pixel and all its surroundings have the same value. (A small code sketch of this follows the list.)

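In code, normalization is a one-time preprocessing step over the kernel. Here's a minimal sketch (normalize_kernel is a hypothetical helper, not something from the repository):

void normalize_kernel(float kernel[3][3]) {
  // Sum all 9 elements.
  float sum = 0.0f;
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      sum += kernel[i][j];
  // Divide each element by the sum, so that the elements now sum to 1.
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      kernel[i][j] /= sum;
}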
In this article, we'll be using the normalized box-blur kernel (although what we'll say applies to any 3x3 kernel and generalizes to other sizes):

float blur_kernel[3][3] = {{1.0f / 9, 1.0f / 9, 1.0f / 9},
                           {1.0f / 9, 1.0f / 9, 1.0f / 9},
                           {1.0f / 9, 1.0f / 9, 1.0f / 9}};

Interlude: Image Formats

It turns out when you want to work with images, you need to deal with loading (and storing) them. This is not great for us because this article is not about image formats; ideally, we'd like to magically get the raw pixels in an array in memory. This magic takes some effort to do with standard image formats like JPEG and PNG because they're compressed, so we need to decompress them (which is not trivial) to load an image and compress them to store it. We don't want to deal with that, so let's use an uncompressed format instead.

Netpbm provides a series of very simple formats. At first, we'll use the PGM format, which is grayscale. This means there's a single channel, i.e., every pixel has a single value—usually 0-255 and that's what we'll use too—from black to white: 0 is total black and 255 is total white. The values in between are (more than 50) shades of grey.

Here's an example of a 3158x4210 image in the PGM format:

P5
3158 4210
255
<raw bytes for pixel values>

You can read more about it in the PGM specification, but basically we only care about the dimensions (second line) and the raw bytes. In the repository I include code for reading and writing PGM images. I won't explain it here; it's super simple.
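
To give you an idea, here's a minimal sketch of a reader (it uses the PgmImage struct we'll define in the next section, and assumes a well-formed P5 file with a maxval of 255 and no comment lines; the code in the repository may differ):

PgmImage read_pgm(const char *filepath) {
  FILE *f = fopen(filepath, "rb");
  assert(f);
  PgmImage img;
  int maxval;
  // Parse the header: magic number, dimensions, maxval. fscanf skips the
  // whitespace between them.
  int matched = fscanf(f, "P5 %d %d %d", &img.width, &img.height, &maxval);
  assert(matched == 3 && maxval == 255);
  // A single whitespace byte separates the maxval from the raw pixel bytes.
  fgetc(f);
  img.pixels = (uint8_t *)malloc(img.width * img.height);
  fread(img.pixels, 1, img.width * img.height, f);
  fclose(f);
  return img;
}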

At the end of the article, we'll also work with colored images (i.e., RGB). Netpbm provides a similarly simple format for RGB images, PPM. It's basically like PGM, except it has bytes for all three channels.

In the repo, I have provided two images in JPEG, PGM, and PPM. But you can easily work with any image. For example, to convert a .jpg image to .pgm, you can install ImageMagick on Linux and use its convert tool:

convert window.jpg window.pgm

A Scalar Implementation

These explanations may have made you worry that we're dealing with something complicated, but hopefully this section will convince you otherwise. We first define a struct for images:

struct PgmImage {
  int width, height;
  uint8_t *pixels;
};

and the kernel:

static float blur_kernel[3][3] = {{1.0f / 9, 1.0f / 9, 1.0f / 9},
                                  {1.0f / 9, 1.0f / 9, 1.0f / 9},
                                  {1.0f / 9, 1.0f / 9, 1.0f / 9}};

Now, our main function looks like this:

int main() {
  const char *filepath = "./images/window.pgm";
  PgmImage img = read_pgm(filepath);

  uint8_t *pixels_out = (uint8_t *)malloc(img.width * img.height);
  PgmImage blurred = {
      .width = img.width, .height = img.height, .pixels = pixels_out};

  box_blur(img, blurred);
  write_pgm(blurred, "./images/window_blurred.pgm");

  return 0;
}

We pre-allocate the blurred image for reasons that will become apparent shortly. Now, we define box_blur as follows:

void box_blur(PgmImage img_in, PgmImage img_out) {
  assert(img_in.height == img_out.height);
  assert(img_in.width == img_out.width);

  int width = img_in.width;
  int height = img_in.height;

  for (int y = 1; y < height - 1; y++) {
    for (int x = 1; x < width - 1; x++) {
      float sum = 0.0f;

      for (int ky = -1; ky <= 1; ky++) {
        for (int kx = -1; kx <= 1; kx++) {
          uint8_t pixel = img_in.pixels[(y + ky) * width + (x + kx)];
          float pixel_f = (float)pixel;
          float kernel_val = blur_kernel[ky + 1][kx + 1];
          sum += pixel_f * kernel_val;
        }
      }

      img_out.pixels[y * width + x] = (uint8_t)sum;
    }
  }
}

It's basically a dead simple implementation of the method we described earlier. y and x iterate over the whole image. ky and kx iterate over a 3x3 window.

Note that we don't handle edges: we're not going through the outermost rectangle of pixels (i.e., we start one pixel in on both width and height).3 Again, this article is not about computer vision, but you can handle these separately relatively easily. That said, you can already run this over your image and you will see it blurred! To see a clearer effect you can perform multiple iterations, in which case main looks like this:4

// ... same
for (int it = 0; it < 15; ++it) {
  box_blur(img, blurred);
  // Swap
  uint8_t *tmp = img.pixels;
  img.pixels = blurred.pixels;
  blurred.pixels = tmp;
}
// ... same

Performance Notes

Let's do a quick analysis of the performance of this code. First, locality. At first sight, the locality is not bad because we're iterating over the image in row-major order, which matches the image's layout. However, at a second glance, we can observe that we're jumping rows in the two innermost loops, which is bad for cache locality. Nevertheless, this is not as bad as you may imagine because the loaded cache lines are reused. For example, let's say we are processing the pixel at position y, x. When we read in[y-1][x], we probably get a cache miss, and we load a cache line. This cache line has, say, 64 bytes, i.e., 64 pixels. 3 of these pixels are used in the computation of y, x. But also, 2 of them are reused in the computation of y, x+1 (the next position we'll process). What is more, the pixel in[y-1][x+2], which is not used in y, x, is used in y, x+1, and it's hot because we just loaded it.

Moreover, when we consider performance, we should take a look at what the compiler does. One clear opportunity for the compiler is to unroll the kx and ky loops because they have a constant, compile-time-known trip count. Clang 19 does that at -O2. In this godbolt snippet, you can see that there is no loop code (e.g., a back-edge jump to the beginning of the loop) for the two loops. Besides that, I passed the -Rpass=loop-unroll flag to get a compiler report from the loop-unrolling pass. Indeed, it tells us it unrolled these two loops. GCC doesn't do it at -O2 (e.g., in this godbolt snippet, you can clearly see the jump at line 51, which is the back-edge of the innermost loop, as Godbolt's highlighting suggests), but it does do it at -O3 (godbolt).5
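
If you want to reproduce this locally, the invocation looks something like the following (box_blur.c is a placeholder name; the godbolt links contain the exact flags I used):

clang -O2 -Rpass=loop-unroll -fno-vectorize -fno-slp-vectorize -S box_blur.c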


Vectorized 1-Dimensional Convolution

It's time to start (manually) vectorizing the code! But, 2D convolution is a bit hard to tackle right away. So, we'll first vectorize 1D convolution which is slightly easier but still interesting, and it's a building block for 2D convolution.

The core idea of 1D convolution is the same as that of 2D convolution: the value of an output pixel is the sum of products of the corresponding input pixel and its surrounding pixels. But here, instead of being affected by surrounding pixels in 2 dimensions, we just consider pixels in one dimension. In simple English, this means we consider surrounding pixels only in the same row. Even more simply, we will just consider a usual case: only the left and the right pixel. Here's a video with an animation. Similarly, we use a 1x3 kernel, or simply a 1D kernel with 3 elements. Thus, we can write code that looks as follows:

uint8_t *in_pixels = img_in.pixels;
uint8_t *out_pixels = img_out.pixels;

for (int i = 1; i < (width * height)-1; ++i) {
  uint8_t left   = in_pixels[i-1];
  uint8_t center = in_pixels[i];
  uint8_t right  = in_pixels[i+1];

  float left_f   = (float)left;
  float center_f = (float)center;
  float right_f  = (float)right;

  float sum = kernel[0]*left_f + kernel[1]*center_f + kernel[2]*right_f;
  uint8_t sum_uint8 = (uint8_t)sum;
  out_pixels[i] = sum_uint8;
}

Note that since we're iterating over a single dimension, we don't need two outer loops. Similarly, because the kernel is small, we completely eliminate the two inner loops. For a kernel, we can use a simple one like:

static float kernel[3] = {0.33f, 0.33f, 0.33f};

Note that as far as I know, 1D convolution is not supposed to be applied on images because images are fundamentally 2-dimensional objects. However, if we run this, we get a sort of blurred image that doesn't look that bad.


Vectorization

For simplicity, in the rest of the article I'll assume we have vectors of size 4 (i.e., 4 lanes).

Ok, how do we vectorize this? If we eyeball the code, it seems like we can easily just store the 3 kernel values—call them k0, k1, k2—in a vector (the fourth lane is unused):

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0} & \texttt{k1} & \texttt{k2} & \texttt{?} \\\hline\end{array} \]

the 3 pixel values—call them s0, s1, s2—in another:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{s0} & \texttt{s1} & \texttt{s2} & \texttt{?} \\\hline\end{array} \]

and just multiply the 2 vertically, which nicely fits the SIMD model:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0*s0} & \texttt{k1*s1} & \texttt{k2*s2} & \texttt{?} \\\hline\end{array} \]

But... At the end, we need to sum up the elements of the resulting vector. This addition is horizontal, which does not fit the SIMD model. SSE gives us (link):

__m128 _mm_hadd_ps (__m128 a, __m128 b)

dst[31:0] := a[63:32] + a[31:0]
dst[63:32] := a[127:96] + a[95:64]
dst[95:64] := b[63:32] + b[31:0]
dst[127:96] := b[127:96] + b[95:64]

But horizontal additions are slow. In fact, observe that given the specification of this instruction, we need to do two horizontal additions (I'd suggest that you ponder why; the explanation is in the footnote).6 More generally, here are 2 characteristics of this method which you should worry about whenever you do SIMD, because they usually mean you have not modeled the problem correctly:

  • The computations happen serially and not in parallel. The horizontal additions are basically just two standard additions, which means we're not really doing SIMD.
  • We compute one element of the output at a time. Ideally, you should load multiple elements of the input and compute multiple elements of the output.
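
To make these problems concrete, here's roughly what this naive approach looks like in code. This is only for illustration (we're about to do better); kernel, in_pixels, out_pixels, width, and height are as in the scalar version:

__m128 kernel_vec = _mm_setr_ps(kernel[0], kernel[1], kernel[2], 0.0f);

for (int i = 1; i < (width * height) - 1; ++i) {
  // Gather the 3 pixels (plus a 0 in the unused lane) and multiply
  // vertically: k0*s0, k1*s1, k2*s2, 0.
  __m128 pixel_vec = _mm_setr_ps((float)in_pixels[i - 1], (float)in_pixels[i],
                                 (float)in_pixels[i + 1], 0.0f);
  __m128 prod = _mm_mul_ps(kernel_vec, pixel_vec);

  // Two horizontal additions to reduce the lanes to a single sum.
  __m128 had1 = _mm_hadd_ps(prod, prod);  // lane 0: k0*s0+k1*s1, lane 1: k2*s2
  __m128 had2 = _mm_hadd_ps(had1, had1);  // lane 0: the full sum

  out_pixels[i] = (uint8_t)_mm_cvtss_f32(had2);
}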

To solve this, let's start from what we want to achieve. The problem is basically the additions: we need to make them vertical. This means we need to somehow put the three products (that are to be added) in 3 different vectors:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0*s0} & \texttt{?} & \texttt{?} & \texttt{?} \\\hline\end{array} \]
\[ \begin{array}{|c|c|c|c|}\hline\texttt{k1*s1} & \texttt{?} & \texttt{?} & \texttt{?} \\\hline\end{array} \]
\[ \begin{array}{|c|c|c|c|}\hline\texttt{k2*s2} & \texttt{?} & \texttt{?} & \texttt{?} \\\hline\end{array} \]

But how are we going to do this and what are we going to put in the other lanes? First, note that to have products in vectors, these multiplications also need to happen vertically. In other words, to get the first vector above, we should have two vectors that look like:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0} & \texttt{?} & \texttt{?} & \texttt{?} \\\hline\end{array} \]
\[ \begin{array}{|c|c|c|c|}\hline\texttt{s0} & \texttt{?} & \texttt{?} & \texttt{?} \\\hline\end{array} \]

Similarly for the other 2 products. Of course we can get this by having k0, k1, and k2 in one vector and s0, s1, and s2 in the other, but we already established this won't work. The big idea is to broadcast the same value in one vector:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0} & \texttt{k0} & \texttt{k0} & \texttt{?} \\\hline\end{array} \]
\[ \begin{array}{|c|c|c|c|}\hline\texttt{s0} & \texttt{s1} & \texttt{s2} & \texttt{?} \\\hline\end{array} \]

Observe that we need to multiply k0 with both s0 and s1 (and s2). We multiply it with s0 for the current pixel and s1 for the next pixel. This now starts addressing the second bullet above: we start dealing with multiple pixels at once! So, we end up with:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0*s0} & \texttt{k0*s1} & \texttt{k0*s2} & \texttt{?} \\\hline\end{array} \]

The next big idea is that we do the same thing but starting from the next pixel and using k1 (s3 here is the pixel in order after s2):

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k1*s1} & \texttt{k1*s2} & \texttt{k1*s3} & \texttt{?} \\\hline\end{array} \]

The idea is the same as before: k1 needs to be multiplied with s1 for this pixel and with s2 for the next pixel. So, we do this once more:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k2*s2} & \texttt{k2*s3} & \texttt{k2*s4} & \texttt{?} \\\hline\end{array} \]

And now, we have addressed both problems (bullets) above at once: the products are structured so that they can be added vertically, and these vertical additions give us three output pixels at once.

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0*s0 + k1*s1 + k2*s2} & \texttt{k0*s1 + k1*s2 + k2*s3} & \texttt{k0*s2 + k1*s3 + k2*s4} & \texttt{?} \\\hline\end{array} \]

One final thing to address is the ? in the fourth lane. There's no reason to limit this idea to only three pixels at once. We can do the same thing for four pixels at a time:

\[ \begin{array}{|c|c|c|c|}\hline\texttt{k0*s0} & \texttt{k0*s1} & \texttt{k0*s2} & \texttt{k0*s3} \\\hline\end{array} \]
\[ \begin{array}{|c|c|c|c|}\hline\texttt{k1*s1} & \texttt{k1*s2} & \texttt{k1*s3} & \texttt{k1*s4} \\\hline\end{array} \]
\[ \begin{array}{|c|c|c|c|}\hline\texttt{k2*s2} & \texttt{k2*s3} & \texttt{k2*s4} & \texttt{k2*s5} \\\hline\end{array} \]

Now, we produce 4 output pixels at once.

It's time to write this in code. But there is a problem. If you look at the scalar version, there are many conversions. These conversions would happen implicitly in C, and I didn't need to, e.g., introduce left_f. But I did it to show you explicitly how many conversions are needed. This is generally not good because conversions are, in some sense, useless work. Unfortunately, they are also necessary: we are multiplying 8-bit unsigned integers with floats, so some conversion needs to happen for that to work. Converting the floats (i.e., the kernel values) to integers would give us inaccurate results, so we need to convert the integers to floats. There is a workaround for that, but for now let's live with it.

In short, all these conversions need to be made explicitly in vectorized code. Not only that, but there are no instructions to convert uint8_t directly to float (or the other way around). So, for example, we need to first convert them to 32-bit (signed)7 integers and then convert these to floats. With all that said, here's the code:

void conv_vec(PgmImage img_in, PgmImage img_out) {
  // ...

  __m128 kernel_vec[3];

  // Repeat each kernel value in a 4-wide register
  for(int i = 0; i < 3; ++i) {
    kernel_vec[i] = _mm_set1_ps(kernel[i]);
  }

  int len = (width*height) - 4 - 1;

  for(int i = 1; i < len; i+=4) {
    // Zero accumulator
    __m128 acc = _mm_setzero_ps();

    for(int k = -1; k <= 1; ++k) {
      // Load 4 uint8_t values (32 bits) into the low part of a 128-bit vector
      __m128i uint8_vec = _mm_loadu_si32(in_pixels + i + k);

      // Zero-extend the uint8_t values to 32-bit integers
      __m128i uint32_vec = _mm_cvtepu8_epi32(uint8_vec);

      // Convert the integers to floats
      __m128 float_vec = _mm_cvtepi32_ps(uint32_vec);

      // Compute the products
      __m128 prod = _mm_mul_ps(kernel_vec[k + 1], float_vec);

      // Accumulate the 4 parallel values
      acc = _mm_add_ps(acc, prod);
    }

    // Convert the floats to 32-bit integers
    __m128i int32_vec = _mm_cvtps_epi32(acc);

    __m128i zero = _mm_setzero_si128();

    // Pack 32-bit signed integers into 16-bit signed integers.
    __m128i int16_vec = _mm_packs_epi32(int32_vec, zero);

    // Pack 16-bit signed integers into 8-bit unsigned integers.
    __m128i uint8_vec = _mm_packus_epi16(int16_vec, zero);

    // Store. Note we go +4 every time.
    _mm_storeu_si32((out_pixels + i), uint8_vec);
  }
}

We're handling all the conversions directly, but unfortunately this code doesn't produce the same results as the scalar version. The reason is subtle: it's the rounding mode. Both the implicit and the explicit casts in the scalar version round down (for our non-negative values). The conversion instructions we use do not necessarily round down. The good news is that we can specify the rounding mode by setting the value of a register called MXCSR. We can use the following code:

// Get current MXCSR value
unsigned int mxcsr_value = _mm_getcsr();
// Clear rounding control bits (13-14)
mxcsr_value &= ~(3 << 13);
// Set rounding control to "round down" (01)
mxcsr_value |= (1 << 13);
// Set the modified value back to MXCSR               
_mm_setcsr(mxcsr_value);
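
As an aside, xmmintrin.h also provides a helper macro that does this bit manipulation for you: _MM_SET_ROUNDING_MODE(_MM_ROUND_DOWN). And if you only care about the float-to-int conversion itself, _mm_cvttps_epi32 always converts with truncation, regardless of MXCSR.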

Now this code produces almost the same result as the scalar version. If we run both and write some checking code like:

int len = (img.height * img.width);
for (int i = 0; i < len; ++i) {
  if (pixels_ser[i] != pixels_vec[i]) {
    printf("diff: %d\n", (img.height * img.width)-i);
    assert(false);
  }
}

We'll see that diff is 3. In the vectorized version we're going in steps of 4 and we don't handle the last couple of pixels. You can just have some scalar residual iterations to handle that, as I explained in Part 1.
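
In case it's been a while, here's a sketch of what those residual iterations could look like; it assumes we declare i outside the vectorized loop, so that when that loop exits, i holds the first unprocessed index:

int i;
for (i = 1; i < len; i += 4) {
  // ... the vectorized loop from above ...
}
// Scalar residual iterations for the last few pixels.
for (; i < (width * height) - 1; ++i) {
  float sum = kernel[0] * (float)in_pixels[i - 1] +
              kernel[1] * (float)in_pixels[i] +
              kernel[2] * (float)in_pixels[i + 1];
  out_pixels[i] = (uint8_t)sum;
}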


Vectorized 2-Dimensional Convolution

Now that we've done all this, 2D convolution is actually pretty easy. We can do 2D convolutions by doing three 1D convolutions. To see this, consider doing 1D convolution on the row y-1 with the first 3 elements k0, k1, and k2 (i.e., the first row) of the 2D kernel. The first element of the output will be:

\[ \begin{array}{|c|}\hline\texttt{k0*in[y-1, 0] + k1*in[y-1, 1] + k2*in[y-1, 2]} \\\hline\end{array} \]

Now, consider doing a 1D convolution on the next row (i.e., y) using the next 3 kernel elements k3, k4, and k5. The first element of the output will be:

\[ \begin{array}{|c|}\hline\texttt{k3*in[y, 0] + k4*in[y, 1] + k5*in[y, 2]} \\\hline\end{array} \]

Finally, we do a 1D convolution on the row after that, i.e., y+1, using the last 3 kernel elements k6, k7, and k8. The first element of the output will be:

\[ \begin{array}{|c|}\hline\texttt{k6*in[y+1, 0] + k7*in[y+1, 1] + k8*in[y+1, 2]} \\\hline\end{array} \]

Notice now that if we add these 3 elements, we'll get the convolution for pixel in[y, 1] (you can refer back to the introduction). This generalizes to all pixels. So, here's how to translate that to code: for every row y, we'll do a 1D convolution on row y-1 with the first 3 kernel values, then another 1D convolution on row y with the middle 3 kernel values, and finally another 1D convolution on row y+1 with the final 3 kernel values. Then, we'll just add the three results vertically.

But, before presenting the code, let's make our life easier by avoiding the casts we dealt with in the previous section. To do that, we'll first convert the input pixels to float. The output pixels will be float too, but we need to finally convert them to uint8_t to store them in a .pgm file. So, our main will look something like:

int main() {
  PgmImage img = read_pgm(...);

  uint8_t *pixels_in_u8 = img.pixels;

  float *pixels_in = (float *)malloc(img.width * img.height * sizeof(float));
  for (int i = 0; i < img.width * img.height; ++i) {
    pixels_in[i] = (float)pixels_in_u8[i];
  }
  float *pixels_out = (float *)malloc(img.width * img.height * sizeof(float));

  conv_2d(pixels_in, pixels_out, img.width, img.height);

  uint8_t *pixels_out_u8 = (uint8_t *)malloc(img.width * img.height * sizeof(uint8_t));
  for (int i = 0; i < img.width * img.height; ++i) {
    pixels_out_u8[i] = (uint8_t)pixels_out[i];
  }

  PgmImage blurred = {
    .width = img.width, .height = img.height, .pixels = pixels_out_u8};

  write_pgm(blurred, ...);
}


The useful code now will be the following:

void conv_2d(float *in_pixels, float *out_pixels, int width, int height) {
  float *tmp_rows[3];

  for(int i = 0; i < 3; ++i) {
    tmp_rows[i] = (float *)malloc(width * sizeof(float));
  }

  int len_1d = width - 4 - 1;
  for (int y = 1; y < height - 1; y++) {
    // Compute 3 1D convolutions over consecutive rows.
    conv_1d(in_pixels + (y-1) * width, tmp_rows[0], len_1d, blur_kernel[0]);
    conv_1d(in_pixels + y * width, tmp_rows[1], len_1d, blur_kernel[1]);
    conv_1d(in_pixels + (y+1) * width, tmp_rows[2], len_1d, blur_kernel[2]);

    // Sum them up and store.
    for(int x = 0; x <= width - 4 - 1; x+=4) {
      __m128 sum1 = _mm_add_ps(_mm_load_ps(&tmp_rows[0][x]), _mm_load_ps(&tmp_rows[1][x]));
      __m128 sum2 = _mm_add_ps(sum1, _mm_load_ps(&tmp_rows[2][x]));
      _mm_storeu_ps(&out_pixels[y * width + x], sum2);
    }
  }
}

We'll also modify conv_1d to work with float:

void conv_1d(float *in_pixels, float *out_pixels, int len, float kernel[3]) {

  int i;
  __m128 kernel_vec[3];

  // Repeat each kernel value in a 4-wide register
  for (i = 0; i < 3; ++i) {
    kernel_vec[i] = _mm_set1_ps(kernel[i]);
  }

  for (i = 1; i < len; i += 4) {
    // Zero accumulator
    __m128 acc = _mm_setzero_ps();

    for (int k = -1; k <= 1; ++k) {
      // Load 4-float data block (unaligned access)
      __m128 data_block = _mm_loadu_ps(in_pixels + i + k);
      // Make the products
      __m128 prod = _mm_mul_ps(kernel_vec[k + 1], data_block);

      // Accumulate the 4 parallel values
      acc = _mm_add_ps(acc, prod);
    }

    // Store. Note we go +4 every time.
    _mm_storeu_ps((out_pixels + i), acc);
  }
}

RGB Convolution

Dealing with multi-channel images is not conceptually different. We just do convolution on each channel individually. The problem is the image layout. More specifically, the layout of PPM images is as follows. Each pixel is a triplet of three 0-255 values, i.e., an RGB triplet. The triplets are laid out linearly one after the other:

RGB RGB RGB ...

As you can see, the problem is that the different channels are mixed. Ideally, we want to have:

RRR ...
GGG ...
BBB ...

In other words, the image is in an Array-Of-Structures (AOS) layout, while we want a Structure-of-Arrays (SOA) layout. So, we have to split the colors before the convolution, run three 2D convolutions, and recombine them afterwards to store the image. The code for doing the splitting and combining is simple. You can find it in the repository.
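
In case you're curious, here's a minimal sketch of that splitting and recombining (the helpers in the repository may be named and structured differently):

// AOS -> SOA: split the interleaved RGB triplets into 3 planar channels.
void split_channels(const uint8_t *rgb, uint8_t *r, uint8_t *g, uint8_t *b,
                    int num_pixels) {
  for (int i = 0; i < num_pixels; ++i) {
    r[i] = rgb[3 * i + 0];
    g[i] = rgb[3 * i + 1];
    b[i] = rgb[3 * i + 2];
  }
}

// SOA -> AOS: recombine the 3 planar channels into interleaved triplets.
void combine_channels(const uint8_t *r, const uint8_t *g, const uint8_t *b,
                      uint8_t *rgb, int num_pixels) {
  for (int i = 0; i < num_pixels; ++i) {
    rgb[3 * i + 0] = r[i];
    rgb[3 * i + 1] = g[i];
    rgb[3 * i + 2] = b[i];
  }
}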


Exercises

To avoid the article getting too long, here are some exercises you can try on your own for a deeper understanding.

  • Measure! Conversion to float vs casts, compiler unrolling vs manual unrolling, etc.
  • Enable auto-vectorization in the compiler and compare the performance.
  • A 2D convolution with a separable kernel (like our box blur) can be broken down into 2 1D convolutions: one on rows and one on columns. Try implementing that.
  • Try wider vectors (e.g., AVX, which gives you 8 float lanes).


Want to join the discussion? Head over to Reddit.

Don't want to miss any updates? You can follow this RSS feed or sign up for my newsletter.



Footnotes

  1. E.g., you may have heard of this mildly successful thing called a “convolutional neural network”.
  2. This is one of the many usages of the term kernel in Computer Science.
  3. Note that we can't simply compute the first and last rows, because some of the surrounding pixels would result in out-of-bounds memory accesses. However, this is not true for the first and last columns, because the pixels are stored in a single linear array (instead of, e.g., each row being its own array): the pixel just outside the right edge of row y is the leftmost pixel of row y+1. In other words, if we read the middle-right neighbor of the pixel at y, width-1, that is nominally y, width, which would be out of bounds. But because it's a single array, in[y][width] is the same memory location as in[y+1][0], which is in bounds. That doesn't mean it's a good idea to treat the leftmost pixel of the next row as a surrounding pixel of the rightmost pixel of a row, because the two may have no visual connection.
  4. Note that the more iterations we do, the darker the image becomes. As far as I can understand, this is because we don't handle the edges. In particular, after the first iteration the resulting image has black pixels (i.e., zeroes) on the outermost rectangle, and this darkness slowly and transitively makes its way to the inner part of the image.
  5. I disabled vectorization just to avoid confusion because we're not dealing with vectorization just yet. The instructions you see (ending mostly in ss) are scalar floating-point instructions.
  6. The way to do that would be to place a zero as the fourth element. This instruction adds the first and second elements and places the result in the first element, and adds the third and fourth elements and places the result in the second. Because the fourth element is 0, the second element becomes what was previously the third. The rest (i.e., what is put in the third and fourth elements) is useless because we just pass a dummy b (the second argument). There's nothing we can do better because the sum is a serial dependency chain: each partial sum depends on the previous one, so we can't parallelize anything. Finally, we'd issue a second horizontal add to add the first two elements and get the final sum.
  7. It's ok: 8-bit unsigned values are obviously within the range of 32-bit signed integers, so there won't be any overflow.