Case study: High performance convolution using OpenCL __local memory

Download source code for this case study

1. Introduction

In a previous case study, we analyzed how to create a simple 7×7 filter in OpenCL that was compatible with any GPU. Back then, performance was less of an issue than compatibility: we didn’t even use images, because some older GPUs don’t support any extensions (not even images).

This case study will focus on improving convolution performance when convolving with small image kernels. Generally speaking, FFT-based convolution is faster for kernels of roughly 30×30 and beyond [1], but in practice such large window sizes are less common.

We’ll start by implementing convolution using images and then improve the filter performance by using the __constant OpenCL memory and caching image information in the __local space.

2. Mathematical concept

Image convolution consists of applying a given filter, or kernel, usually of small size, to a given image in order to obtain some desired result. Results include image smoothing, edge detection, image sharpening, and many more: each output pixel is simply a weighted sum of the input pixels in a small window around it, with the weights given by the filter values. There are many resources available online that explain image convolution, such as references [1] and [2] below.

For an in-depth explanation of filters and digital image processing, please refer to NPTel Digital Image Processing Video Course [3]. If the reader is unfamiliar with image processing this is a great resource to learn the basics.

3. Simple image convolution

The convolution can be performed simply by accessing windows of the original image, multiplying pixel values by the filter (kernel) values and writing the result into the convolved image. In this case study we are not going to deal with pixels at the border (i.e., pixels that don’t have enough neighbors). As such, if the filter width is W, there will be a border of size W/2 around the image which won’t be filtered. We encourage the reader to implement this correction as an exercise if it is needed.

__kernel void BasicConvolve(__read_only image2d_t imgSrc,
                            __global float * kernelValues,
                            __global int * kernelSize,
                            __write_only image2d_t imgConvolved)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP | //Clamp to zeros
                          CLK_FILTER_NEAREST; //Don't interpolate
    //Kernel size (ideally, odd number)
    //global_size should be [width-w/2, height-w/2]
    //Writes answer to [x+w/2, y+w/2]
    int w = kernelSize[0];
    int x = get_global_id(0);
    int y = get_global_id(1);
    float4 convPix = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
    float4 temp;
    uint4 pix;
    int2 coords;
    for (int i = 0; i < w; i++)
    {
       for (int j = 0; j < w; j++)
       {
            coords.x = x+i; coords.y = y+j;
            pix = read_imageui(imgSrc, smp, coords);
            temp = (float4)((float)pix.x, (float)pix.y, (float)pix.z, (float)pix.w);
            convPix += temp * kernelValues[i + w*j];
       }
    }
    coords.x = x + (w>>1); coords.y = y + (w>>1);
    pix = (uint4)((uint)convPix.x, (uint)convPix.y, (uint)convPix.z, (uint)convPix.w);
    write_imageui(imgConvolved, coords, pix);
}

This OpenCL code should be clear by now; if it’s not, please refer to the Image2D tutorial.
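For reference, here is a minimal sketch of the host-side setup that could drive this kernel, assuming an 8-bit RGBA image and an already created context, queue and built program. The helper name runBasicConvolve and the variable names are ours, for illustration only (they may differ from the downloadable source), and error checking is omitted:

#include <CL/cl.h>

//Creates the image objects and filter buffers, launches BasicConvolve and
//reads the result back. srcPixels/dstPixels are width*height RGBA8 arrays.
void runBasicConvolve(cl_context ctx, cl_command_queue queue, cl_program program,
                      const unsigned char *srcPixels, unsigned char *dstPixels,
                      size_t width, size_t height,
                      const float *filterValues, int filterWidth)
{
    cl_int err;
    cl_image_format fmt = { CL_RGBA, CL_UNSIGNED_INT8 }; //matches read_imageui/write_imageui

    cl_mem imgSrc = clCreateImage2D(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                    &fmt, width, height, 0, (void*)srcPixels, &err);
    cl_mem imgDst = clCreateImage2D(ctx, CL_MEM_WRITE_ONLY, &fmt, width, height, 0, NULL, &err);
    cl_mem kernVals = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     filterWidth * filterWidth * sizeof(float),
                                     (void*)filterValues, &err);
    cl_mem kernSize = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                     sizeof(int), &filterWidth, &err);

    cl_kernel k = clCreateKernel(program, "BasicConvolve", &err);
    clSetKernelArg(k, 0, sizeof(cl_mem), &imgSrc);
    clSetKernelArg(k, 1, sizeof(cl_mem), &kernVals);
    clSetKernelArg(k, 2, sizeof(cl_mem), &kernSize);
    clSetKernelArg(k, 3, sizeof(cl_mem), &imgDst);

    //One work-item per output pixel, excluding the unfiltered border
    size_t global[2] = { width - filterWidth / 2, height - filterWidth / 2 };
    clEnqueueNDRangeKernel(queue, k, 2, NULL, global, NULL, 0, NULL, NULL);

    size_t origin[3] = { 0, 0, 0 };
    size_t region[3] = { width, height, 1 };
    clEnqueueReadImage(queue, imgDst, CL_TRUE, origin, region, 0, 0, dstPixels, 0, NULL, NULL);
}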

4. Using the __constant space

The __constant space is a memory region to which work-items have faster access. Section 5 below gives more details about the memory model. In this particular case, using __constant memory yields a 4% speed gain. What’s interesting, though, is that all we need to do to implement this optimization is replace __global with __constant. The limitation, of course, is that OpenCL kernels can only take a limited number of __constant arguments, and buffers stored in the __constant space have limited size (queryable, but at least 64 KB). In other words, if a set of kernel arguments won’t exceed 64 KB and won’t be modified, there’s no reason not to allocate them in __constant space.

The only change to the above kernel to implement this optimization is as follows:

__kernel void ConvolveConst(__read_only image2d_t imgSrc,
                            __constant float * kernelValues,
                            __constant int * kernelSize,
                            __write_only image2d_t imgConvolved)
{
    //(...)
}
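Since the limits mentioned above are queryable, a small host-side check can tell whether a given buffer is eligible for the __constant space. This is only a sketch; the helper name is ours:

#include <CL/cl.h>
#include <stdio.h>

//Prints the __constant limits of a device and tells whether a buffer
//of 'neededBytes' bytes fits. 'device' is assumed to be already selected.
void checkConstantLimits(cl_device_id device, size_t neededBytes)
{
    cl_ulong maxConstBufferSize = 0;
    cl_uint  maxConstArgs = 0;

    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                    sizeof(maxConstBufferSize), &maxConstBufferSize, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_ARGS,
                    sizeof(maxConstArgs), &maxConstArgs, NULL);

    printf("__constant buffer size: %lu bytes, max __constant args: %u\n",
           (unsigned long)maxConstBufferSize, maxConstArgs);
    printf("Buffer of %lu bytes %s in __constant space\n",
           (unsigned long)neededBytes,
           neededBytes <= maxConstBufferSize ? "fits" : "does NOT fit");
}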

5. Caching information in the __local space

5.1 The __local space

The OpenCL __local space is a very fast memory that can be shared between work-items belonging to the same work-group, as summarized in the picture below, taken directly from the OpenCL Spec [4] (our red marks):

The information implicitly displayed in the picture above is that the __constant cache and __local memories are much faster than __global memory. This means that if your kernel needs to access the same values in __global memory many times, it is a good idea to cache the information using __constant (if the data fits into __constant space, i.e., its size is less than 64 KB) or __local (if the access pattern is known beforehand).

Let’s take image convolution as an example. A 5×5 filter needs to access 25 elements of the original image per pixel, plus all 25 values of the filter (image kernel), roughly yielding a total of 25·W·H (W and H being the image width and height) __global memory fetches from the image and another 25·W·H __global memory fetches to retrieve the filter. When we did image convolution without using the __constant space, we had to read a full 50·W·H values from __global memory. We then moved the filter into __constant space, a much faster memory, which is possible because its total size is less than 64 KB, for around a 4% performance gain. This effect is greatly amplified as the number of accesses to the buffer increases.

On the other hand, it’s not possible to store the entire image in __constant space, so we have to resort to __local memory. This is possible because the access pattern is known in advance, that is, the memory accesses are the same for any image (which would not be the case if we were computing a histogram, for example). As we’ll see in the performance comparison, if the performance of a given kernel is an issue and it’s possible to use __local memory, by all means do so: in our tests the gain was 25% and beyond.

For vendor-specific memory optimizations, please refer to references [5] and [6].

5.2 Caching image values in __local space

First of all, it is important to know that image sampling is a fast, cached operation on modern GPUs, which means that with __local caching we’re going from fast to even faster access. Had the image been stored in the __global space, we’d be going from slow to fast and the impact would be even greater.

That said, let’s create a strategy to cache the image in __local space. What we want to do is:

  • Copy all data that the work-items are going to use into __local space;
  • Access the cached data to perform the calculations;
  • Output the answer into the correct position.

As discussed before, the work-items of a work-group share common __local memory. Let’s assume that a work-group has size 64 (8×8); in practice we’ll use 16×16, the smaller size is just for simplicity. As seen in previous tutorials, we retrieve group ids and local ids as follows:

//Identification of this workgroup
int i = get_group_id(0);
int j = get_group_id(1);
//Identification of work-item
int idX = get_local_id(0);
int idY = get_local_id(1);
int ii = i*BLOCK_SIZE + idX; // == get_global_id(0);
int jj = j*BLOCK_SIZE + idY; // == get_global_id(1);
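Note that BLOCK_SIZE must be a compile-time constant inside the kernel and must match the local work size used on the host. One way to keep them in sync (a sketch; the downloadable source may organize this differently, and the helper name is ours) is to inject the value as a build option:

#include <CL/cl.h>

#define BLOCK_SIZE 16 //must match the value compiled into the kernel

//Builds the program with BLOCK_SIZE defined and enqueues ConvolveLocal
//with a matching 16x16 local work size. Error checking is omitted and the
//global size is assumed to be a multiple of BLOCK_SIZE covering the
//region to be filtered.
void launchConvolveLocal(cl_program program, cl_device_id device,
                         cl_command_queue queue, size_t width, size_t height)
{
    cl_int err;
    clBuildProgram(program, 1, &device, "-D BLOCK_SIZE=16", NULL, NULL);
    cl_kernel kernel = clCreateKernel(program, "ConvolveLocal", &err);
    //... set the kernel arguments as in the BasicConvolve example ...

    size_t local[2]  = { BLOCK_SIZE, BLOCK_SIZE };
    size_t global[2] = { width, height };
    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, NULL);
}

Passing the value with -D also makes it easy to experiment with different work-group sizes without editing the kernel source.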

We’d like to store the desired information in __local memory. So, we create a __local variable and populate it with the values we need, using barrier to block execution and guarantee that all needed data is available before any work-item reads values from P:

__kernel void ConvolveLocal(__read_only image2d_t imgSrc,
                            __constant float * kernelValues,
                            __constant int * kernelSize,
                            __write_only image2d_t imgConvolved)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP | //Clamp to zeros
                          CLK_FILTER_NEAREST; //Don't interpolate
    int w = kernelSize[0];
    int wBy2 = w>>1; //w divided by 2
    //This would be P if we didn't have to cache extra data
    __local uint4 P[BLOCK_SIZE][BLOCK_SIZE];
    //Identification of this workgroup
    int i = get_group_id(0);
    int j = get_group_id(1);
    //Identification of work-item
    int idX = get_local_id(0);
    int idY = get_local_id(1);
    int ii = i*BLOCK_SIZE + idX; // == get_global_id(0);
    int jj = j*BLOCK_SIZE + idY; // == get_global_id(1);
    int2 coords = (int2)(ii, jj);
    //Reads pixels
    P[idX][idY] = read_imageui(imgSrc, smp, coords);
    barrier(CLK_LOCAL_MEM_FENCE);
}

The problem is that the above approach is incomplete, because work-items whose local_id is close to BLOCK_SIZE need access to elements beyond the cached block, as shown in the picture below (consider a 5×5 filter):

We can circumvent this problem by having some of the work-items load more than one pixel: whenever a work-item’s local_id is less than the filter width, it loads additional pixels from the original image:

In OpenCL code, this is what the load looks like:

__kernel void ConvolveLocal(__read_only image2d_t imgSrc,
                            __constant float * kernelValues,
                            __constant int * kernelSize,
                            __write_only image2d_t imgConvolved)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP | //Clamp to zeros
                          CLK_FILTER_NEAREST; //Don't interpolate
    int w = kernelSize[0];
    int wBy2 = w>>1; //w divided by 2
    //Goes up to 15x15 filters
    __local uint4 P[BLOCK_SIZE+14][BLOCK_SIZE+14];
    //Identification of this workgroup
    int i = get_group_id(0);
    int j = get_group_id(1);
    //Identification of work-item
    int idX = get_local_id(0);
    int idY = get_local_id(1);
    int ii = i*BLOCK_SIZE + idX; // == get_global_id(0);
    int jj = j*BLOCK_SIZE + idY; // == get_global_id(1);
    int2 coords = (int2)(ii, jj);
    //Reads pixels
    P[idX][idY] = read_imageui(imgSrc, smp, coords);
    //Needs to read extra elements for the filter in the borders
    if (idX < w)
    {
        coords.x = ii + BLOCK_SIZE;
        coords.y = jj;
        P[idX + BLOCK_SIZE][idY] = read_imageui(imgSrc, smp, coords);
    }
    if (idY < w)
    {
        coords.x = ii;
        coords.y = jj + BLOCK_SIZE;
        P[idX][idY + BLOCK_SIZE] = read_imageui(imgSrc, smp, coords);
    }
    //Also reads the lower right quadrant, needed by the work-items
    //near the bottom right corner of the block
    if (idX < w && idY < w)
    {
        coords.x = ii + BLOCK_SIZE;
        coords.y = jj + BLOCK_SIZE;
        P[idX + BLOCK_SIZE][idY + BLOCK_SIZE] = read_imageui(imgSrc, smp, coords);
    }
    barrier(CLK_LOCAL_MEM_FENCE);
    //////////////////////////////////////////////
    //Computes convolution into convPix (see 5.3)
    //////////////////////////////////////////////
    barrier(CLK_LOCAL_MEM_FENCE);
    //Only now, after the second barrier, is it safe to reuse P to hold
    //the result: no work-item still needs the original values of P
    P[idX+wBy2][idY+wBy2] = (uint4)((uint)convPix.x, (uint)convPix.y,
                                    (uint)convPix.z, (uint)convPix.w);
    coords = (int2)(ii+wBy2, jj+wBy2);
    write_imageui(imgConvolved, coords, P[idX+wBy2][idY+wBy2]);
}

Notice that we need to allocate a fixed amount of extra memory for the __local variable P, because a __local array declared inside the kernel must have a compile-time size; OpenCL doesn’t allow dynamic resizing there.
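As a sanity check, the host can also confirm that the cached tile fits into the device’s __local memory before settling on a BLOCK_SIZE. The sketch below uses our own helper name; for a 16×16 work-group the tile takes 30·30·16 = 14,400 bytes, comfortably within the 32 KB of __local memory of the Radeon 5770 used in section 6:

#include <CL/cl.h>
#include <stdio.h>

//Checks whether a (blockSize+14) x (blockSize+14) tile of uint4 pixels
//fits into the device's __local memory. A uint4 is 16 bytes.
int localTileFits(cl_device_id device, size_t blockSize)
{
    cl_ulong localMemSize = 0;
    clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                    sizeof(localMemSize), &localMemSize, NULL);

    size_t tileBytes = (blockSize + 14) * (blockSize + 14) * 16;
    printf("Tile needs %lu bytes, device has %lu bytes of __local memory\n",
           (unsigned long)tileBytes, (unsigned long)localMemSize);
    return tileBytes <= localMemSize;
}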

5.3 Implementing the filter

Now that all information is available, implementing the filter is straightforward:

//Computes convolution
float4 convPix = (float4)(0.0f, 0.0f, 0.0f, 0.0f);
float4 temp;
for (int ix = 0; ix < w; ix++)
{
    for (int jy = 0; jy < w; jy++)
    {
        //Each work-item convolves the window that starts at its own
        //position in the cached tile: P[idX+ix][idY+jy]
        temp = (float4)((float)P[idX+ix][idY+jy].x,
                        (float)P[idX+ix][idY+jy].y,
                        (float)P[idX+ix][idY+jy].z,
                        (float)P[idX+ix][idY+jy].w);
        convPix += temp * kernelValues[ix + w*jy];
    }
}

Saving the result into P[idX+wBy2][idY+wBy2] is convenient because that is the pixel this work-item is responsible for, and reusing P saves some memory. Note that, as shown in section 5.2, this store can only happen after the second barrier: until then, neighboring work-items may still need the original values stored in P.

6. Performance comparison

Using my hardware (a Radeon 5770), the “normal” implementations gave somewhat erratic timings while the __local implementation was rather consistent. The following results were obtained when filtering a 2.5-megapixel image (1818×1368):

The result we obtained was that, up to 7×7 kernels, __constant memory performed about 4% faster, while __local caching was at least 10% faster than any other implementation for every filter size. Explaining these numbers in detail is beyond the scope of this tutorial, as it would require diving into architecture and driver specifics.
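For readers who want to reproduce the comparison, the kernel times can be measured with OpenCL event profiling. The helper below is a sketch (our own naming) and assumes the command queue was created with CL_QUEUE_PROFILING_ENABLE:

#include <CL/cl.h>

//Enqueues 'kernel' and returns its execution time in milliseconds.
//'queue' must have been created with CL_QUEUE_PROFILING_ENABLE.
double timeKernelMs(cl_command_queue queue, cl_kernel kernel,
                    const size_t global[2], const size_t local[2])
{
    cl_event evt;
    cl_ulong start = 0, end = 0;

    clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local, 0, NULL, &evt);
    clWaitForEvents(1, &evt);

    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START, sizeof(start), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END, sizeof(end), &end, NULL);
    clReleaseEvent(evt);

    return (end - start) * 1e-6; //timestamps are in nanoseconds
}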

What we do want to point out, though, is that __local caching is an excellent tool for optimizing code in which the same memory buffer is accessed multiple times.

7. Conclusion

We’ve shown how to use the __constant and __local memory spaces to reduce __global memory accesses. The __local memory is much faster than __global memory and, in our tests, even faster than cached image reads, which makes it a suitable tool to accelerate OpenCL kernels that access the same memory buffer (image or variable) multiple times.

Proper use of barrier is necessary to ensure that all required elements are loaded into the __local buffer before any work-item starts computing with them, and that results are only written back into the buffer after all work-items are done reading it.

The key aspect when optimizing code with __constant is to check how much memory can be allocated in __constant space (the OpenCL spec guarantees at least 64 KB). __local memory, in turn, is most helpful when memory fetches occur in a predictable way.

Finally, we’ve seen that if memory access is what limits the performance of an OpenCL kernel, one should seriously consider the __constant and __local optimizations.

Download source code for this case study

8. References

[1] Linear Image Processing, http://www.dspguide.com/ch24/7.htm, retrieved April 2011.

[2] Image convolution examples, http://www.aishack.in/2010/08/image-convolution-examples/, retrieved April 2011.

[3] BISWAS, P. K., Digital Image Processing Video Lectures, Lecture 20 – Image Enhancement, http://nptel.iitm.ac.in/video.php?courseId=1079, retrieved April 2011.

[4] The OpenCL Specification, http://www.khronos.org/registry/cl/, retrieved April 2011.

[5] LAMB, C., OpenCL for NVidia GPUs, http://www.hotchips.org/archives/hc21/1_sun/HC21.23.2.OpenCLTutorial-Epub/HC21.23.250.Lamb-NVIDIA-OpenCL–for-NVIDIA-GPUs.pdf, retrieved April 2011.

[6] AMD OpenCL: An Introduction to OpenCL, http://www.amd.com/us/products/technologies/stream-technology/opencl/pages/opencl-intro.aspx, retrieved April 2011.

5 thoughts on “Case study: High performance convolution using OpenCL __local memory”

  1. I think that you have a bug in your code for loading the input image into local memory. You are only loading the extra elements to the right and below the block, but you also need to load the extra elements in the quadrant to the lower right: P[idX + BLOCK_SIZE][idY + BLOCK_SIZE].

  2. If some of the threads are loading more pixels into local memory than others, then the other threads (those loading only a single pixel) will be idle, waiting at the sync barrier. OTOH, if you have every thread read one pixel into local memory, then some of those threads will be idle when convolving with local memory. If the filter radius is small and the block size is large, this could be more efficient (maybe)?

  3. That is true: some work-items will wait at the barrier. However, that is necessary to load all the data needed by every work-item.

    And yes, big local work-groups are more efficient because fewer “out-of-bounds” border elements are loaded. Ideally, if all memory were superfast, none of this would be needed.

    Small filters combined with big work-groups work best.
