Case study: image filtering

Get the source code for this section.

Get source code of runtime execution image filter.

In this tutorial we’re going to implement image filtering with a 7×7 filter. The idea is to present an OpenCL algorithm that works on all cards compatible with the basic implementation of OpenCL, without extensions. Since we received some emails asking how to implement image filters that don’t use images and do basic things like inverting colors, we also posted a simple image filter in which the user can compile the filter at run time and choose which device will be used to compile the code. All these features can be seen in the following video:

It is very important to understand that we are NOT going to implement the fastest possible filter, but rather a general-purpose one that should work on all GPUs compatible with OpenCL. Not much software out there uses the GPU to speed up image processing, and I assume that manufacturers don’t want platform-specific algorithms or versions that will run only on a limited number of cards.

Good performance can be obtained using only the basic OpenCL implementation. This means that the following features, which could be used to boost performance, will not be used:

  • Byte addressable stores;
  • OpenCL images;
  • OpenCL/OpenGL interoperation to manipulate and display images.

Either way, if your GPU supports the features above, you may want to adapt the code presented here to better suit your hardware. What we WILL do is transfer data to the GPU as bytes (OpenCL uchar), because data transfer is a bottleneck and this optimization really makes a difference.

By the end of this tutorial I hope to convince the reader that OpenCL can provide a reasonable performance increase to applications, even on computers that don’t have the most powerful GPUs.

I have used AForge video components in this tutorial. They are open source and the license can be found here.

1. Screenshots and benchmarks

To make this more interesting, real-time webcam image processing has been implemented along with regular image filtering. On my hardware, the OpenCL version of the filter was 60x faster than the regular implementation. I have included a (slower) version of the algorithm that runs using work dimension = 1. It is not discussed in this tutorial but it may serve as a reference. In my tests I got a frame rate of around 13 FPS.

Usage of the software:

  • Use the Filter icon to modify the filter that is going to be applied. Each color has its own filter. You may replicate a filter to all colors;
  • Load a picture or start the webcam;
  • Use the buttons to apply the desired filter. When using a webcam, you can modify the filter in real-time. I suggest testing the following filters:

2. Image filtering basics

For those unfamiliar with image filtering, here is a very brief explanation. A filter is a series of calculations applied to an image to create effects. It consists of interpreting the image as a series of red/green/blue values and replacing each pixel with a value that depends on that pixel’s surroundings: a weighted sum of the neighborhood centered on it, with the weights given by the filter. Take a look at the picture below for a quick reminder:
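In code, the same idea for a single color channel can be sketched as follows. This is only an illustration: the function name `filter_pixel` and the single-channel, row-major layout are assumptions made for this example and are not taken from the tutorial's source code.

```c
#include <assert.h>

/* Hypothetical helper (not from the tutorial's source): weighted sum of the
   size x size neighborhood centered on (cx, cy) in a single-channel,
   row-major image. */
float filter_pixel(const unsigned char *img, int width, int height,
                   const float *filter, int size, int cx, int cy)
{
    int half = (size - 1) / 2;
    /* The whole neighborhood must lie inside the image. */
    assert(cx >= half && cx < width - half);
    assert(cy >= half && cy < height - half);

    float sum = 0.0f;
    for (int i = 0; i < size; i++) {        /* filter row */
        for (int j = 0; j < size; j++) {    /* filter column */
            int px = (cx - half + j) + width * (cy - half + i);
            sum += filter[i * size + j] * (float)img[px];
        }
    }
    return sum;
}
```

For a 3×3 image holding the values 1 to 9, an all-ones 3×3 filter gives 45 at the center pixel, and a filter with a single 1 in the middle gives back the original value, 5.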

This tutorial is not intended to explain details of filters or the effect they create. Take a look at these references for further information:

3. Setting up the filter

Most applications don’t require a big filter, and filtering time depends heavily on filter size. 3×3 filters usually do fine, 5×5 filters solve almost all practical problems, and it is very unusual to see anything above 7×7 in a real application. In this tutorial we will stick to a 7×7 filter that can still be processed in real time. The frame rate is not great, but the result is still decent. The input screen has been created using C#. It is possible to create color-specific filters and to copy/paste filters in the text format used by the textbox below the filter, as shown in the screen below.

You may look at the code implementation if you want to. It’s just an interface so discussing it is off-topic (not OpenCL related). As you can see, it’s a 7×7 filter setup.

The most obvious way to make the filtering faster is to reduce filter size, hard-code the filter values into the code and take advantage of filter symmetries. This is not the case here since the filter is dynamic.
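To give an idea of what exploiting a symmetry could look like, here is a minimal sketch for one row of a filter whose weights are mirrored left to right. It is not used in this tutorial's code; the function name `symmetric_row` and the mirrored-weights assumption are mine, for illustration only.

```c
#include <assert.h>

/* Hypothetical sketch: one row of a horizontally symmetric filter,
   i.e. w[j] == w[size - 1 - j]. Mirrored pixels are added first, so a
   7-tap row needs 4 multiplications instead of 7. */
float symmetric_row(const float *px, const float *w, int size)
{
    assert(size % 2 == 1);              /* a center tap must exist */
    int half = size / 2;
    float sum = w[half] * px[half];     /* center tap */
    for (int j = 0; j < half; j++)
        sum += w[j] * (px[j] + px[size - 1 - j]);
    return sum;
}
```

With pixels 1..7 and weights 1, 2, 3, 4, 3, 2, 1, both this version and the plain sum of products give 64.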

4. OpenCL Kernel

Let’s create a two-dimensional kernel to solve the problem. We want to filter and retrieve an image with colors. The data structure is:

Filter[3*(i*FILTERSIZE + j)] is the red component of pixel i,j;

Filter[3*(i*FILTERSIZE + j)+1] is the green component of pixel i,j;

Filter[3*(i*FILTERSIZE + j)+2] is the blue component of pixel i,j;

The same layout is used for the image and the filtered image, with the image width in place of FILTERSIZE:

// FILTERSIZE and CENTER are constants defined outside this listing.
kernel void ImgFilter(global uchar * image,
                      global float * Filter,
                      global float * FilteredImage,
                      global int * Width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int w = Width[0];
    int ind = 0;
    int ind2 = 0;
    float4 filteredVal = (float4)(0,0,0,0);
    // Accumulate the weighted neighborhood whose top-left corner is (x, y)
    for (int i = 0; i < FILTERSIZE; i++)
    {
        for (int j = 0; j < FILTERSIZE; j++)
        {
            ind = 3*(x+j + w*(y+i));
            ind2 = 3*(i*FILTERSIZE + j);
            filteredVal.x =  mad(Filter[ind2] , (float)image[ind],  filteredVal.x);
            filteredVal.y =  mad(Filter[ind2+1] , (float)image[ind+1],filteredVal.y);
            filteredVal.z =  mad(Filter[ind2+2] , (float)image[ind+2],filteredVal.z);
        }
    }
    // Store the result at the center of the neighborhood
    ind = 3*(x+CENTER + w*(y+CENTER));
    // Float bounds keep the clamp() call unambiguous for a float argument
    FilteredImage[ind] = clamp(filteredVal.x, 0.0f, 255.0f);
    FilteredImage[ind+1] = clamp(filteredVal.y, 0.0f, 255.0f);
    FilteredImage[ind+2] = clamp(filteredVal.z, 0.0f, 255.0f);
}
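When debugging a kernel like this one, a plain CPU version that follows the same indexing can help to check results. The sketch below is one possible way to write it in C, not the reference implementation shipped with the source code. The name `img_filter_cpu` is hypothetical, `filter_size` plays the role of FILTERSIZE, and CENTER is assumed to be (FILTERSIZE - 1) / 2, which matches the `mean` variable used in the host code. The loop ranges mirror the global size used for the two-dimensional launch (Width - FilterSize, Height - FilterSize).

```c
#include <assert.h>

/* Same role as OpenCL's clamp() for scalars. */
static float clampf(float v, float lo, float hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Hypothetical CPU sketch of the 2-D launch over interleaved RGB data. */
void img_filter_cpu(const unsigned char *image, const float *filter,
                    float *filtered, int width, int height, int filter_size)
{
    assert(filter_size % 2 == 1);
    int center = (filter_size - 1) / 2;   /* assumed value of CENTER */

    /* Each (x, y) pair corresponds to one work-item. */
    for (int y = 0; y < height - filter_size; y++) {
        for (int x = 0; x < width - filter_size; x++) {
            float acc[3] = {0.0f, 0.0f, 0.0f};
            for (int i = 0; i < filter_size; i++) {
                for (int j = 0; j < filter_size; j++) {
                    int ind = 3 * (x + j + width * (y + i));
                    int ind2 = 3 * (i * filter_size + j);
                    for (int c = 0; c < 3; c++)
                        acc[c] += filter[ind2 + c] * (float)image[ind + c];
                }
            }
            /* Write at the center of the neighborhood, clamped to [0, 255]. */
            int out = 3 * (x + center + width * (y + center));
            for (int c = 0; c < 3; c++)
                filtered[out + c] = clampf(acc[c], 0.0f, 255.0f);
        }
    }
}
```

Comparing the output of such a function with the values read back from the device, pixel by pixel, is a quick way to catch indexing mistakes.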

A few relevant optimizations are worth noting:

  • The image argument is sent as uchars (C# byte);
  • MAD optimization to compute a*b+c;
  • Indexes calculated only once.

5. Host Code

The host code contains two parts: copying the image to a byte array and processing the image using OpenCL.

5.1 Copying C# image into a byte array

We want to transfer the RGB values of the picture as bytes, not floats. Doing this allows us to transfer 1/4 of the data because sizeof(float) = 4 and sizeof(byte) = 1. This part uses the C# Bitmap LockBits functions, which you may want to study if you are not familiar with them. Remember the data structure being created: the byte array has to carry all 3 (RGB) components.
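To put numbers on that 1/4 figure, here is a small C sketch that computes the size of the RGB buffer in both representations. The function names and the 640×480 frame size are assumptions picked for the example, and the 4x ratio assumes a 4-byte float.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers: bytes needed for a width x height RGB buffer. */
size_t rgb_bytes_as_uchar(int width, int height)
{
    assert(width > 0 && height > 0);
    return (size_t)width * (size_t)height * 3 * sizeof(unsigned char);
}

size_t rgb_bytes_as_float(int width, int height)
{
    assert(width > 0 && height > 0);
    return (size_t)width * (size_t)height * 3 * sizeof(float);
}
```

For a 640×480 frame this is 921,600 bytes as uchar against 3,686,400 bytes as float, for every frame sent to the device.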

Full implementation is provided in ImageData class:

/// <summary>Copies bitmap data to local Data</summary>
/// <param name="bmp">Bitmap to copy</param>
private void ReadToLocalData(Bitmap bmp)
{
    //Lock bits
    BitmapData bmd = bmp.LockBits(new Rectangle(0, 0, bmp.Width, bmp.Height),
        System.Drawing.Imaging.ImageLockMode.ReadOnly, bmp.PixelFormat);
    //Read data (pointer access requires an unsafe context)
    unsafe
    {
        for (int y = 0; y < bmd.Height; y++)
        {
            //Rows are bmd.Stride bytes apart, which may include padding
            byte* row = (byte*)bmd.Scan0 + (y * bmd.Stride);
            for (int x = 0; x < bmd.Width; x++)
            {
                Data[3 * (x + width * y)] = row[x * PIXELSIZE];
                Data[3 * (x + width * y) + 1] = row[x * PIXELSIZE + 1];
                Data[3 * (x + width * y) + 2] = row[x * PIXELSIZE + 2];
            }
        }
    }
    //Unlock bits
    bmp.UnlockBits(bmd);
}
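The important detail in this copy is that source rows are `Stride` bytes apart and each pixel takes PIXELSIZE bytes, while the destination array is tightly packed with 3 bytes per pixel. The same repacking can be sketched in plain C as below; `pack_rows` and its parameter names are hypothetical and only mirror the roles of Scan0, Stride and PIXELSIZE in the C# code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: copy the first three bytes of every pixel from a
   padded source buffer into a tightly packed 3-bytes-per-pixel array. */
void pack_rows(const unsigned char *scan0, int stride, int pixel_size,
               int width, int height, unsigned char *data)
{
    assert(pixel_size >= 3 && stride >= width * pixel_size);
    for (int y = 0; y < height; y++) {
        /* Start of row y in the source, padding included. */
        const unsigned char *row = scan0 + (size_t)y * (size_t)stride;
        for (int x = 0; x < width; x++) {
            for (int c = 0; c < 3; c++)
                data[3 * (x + width * y) + c] = row[x * pixel_size + c];
        }
    }
}
```

Indexing the source as if it were tightly packed would work only when the stride happens to equal width * pixel_size, which is why the row pointer is recomputed from the stride on every row.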

5.2 Kernel execution

The kernel execution code simply copies data to OpenCL memory, runs the kernel, and reads the result back into the C# image byte array.

I have included code to run a kernel that implements the algorithm using work_dim = 1. You may take a look if you want. It is slower, though.

Full implementation is provided in the source code (CLFilter class). The host code to call the kernel is posted below:

/// <summary>Applies given filter to the image</summary>
/// <param name="imgDt">Image to be filtered</param>
/// <param name="Filter">Filter. [3*size*size]</param>
public static void ApplyFilter(ImageData imgDt,
                               float[] Filter,
                               bool useOpenCL,
                               bool useWorkDim2)
{
    int FilterSize = (int)Math.Sqrt(Filter.Length/3);
    if (Filter.Length != 3 * FilterSize * FilterSize)
        throw new Exception("Invalid filter");
    if (!Initialized && useOpenCL)
    {
        //OpenCL initialization: not shown in this listing, see the CLFilter class
    }
    //Writes filter to device (call not shown in this listing, see the CLFilter class)
    if (FilteredVals == null || FilteredVals.Length != imgDt.Height * imgDt.Width * 3)
    {
        //Filtered values
        FilteredVals = new float[imgDt.Height * imgDt.Width * 3];
        varFiltered = new CLCalc.Program.Variable(FilteredVals);
    }
    if (useOpenCL)
        varWidth.WriteToDevice(new int[] { imgDt.Width });
    //Executes filtering
    int mean = (FilterSize - 1) / 2;
    if (useOpenCL)
    {
        CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[] { imgDt.varData, varFilter, varFiltered, varWidth };
        if (useWorkDim2)
            kernelApplyFilterWorkDim2.Execute(args, new int[] { imgDt.Width - FilterSize, imgDt.Height - FilterSize });
        else
            kernelApplyFilter.Execute(args, new int[] { imgDt.Height - FilterSize });
        //Reads data back (call not shown in this listing, see the CLFilter class)
    }
    else
    {
        //Regular (non-OpenCL) implementation
        ApplyFilter(imgDt.Data, Filter, FilteredVals, new int[] { imgDt.Width }, imgDt.Height - FilterSize);
    }
    //Writes to image data; border pixels keep their original values
    for (int y = mean; y < imgDt.Height - mean - 1; y++)
    {
        int wy = imgDt.Width * y;
        for (int x = mean; x < imgDt.Width - mean - 1; x++)
        {
            int ind = 3 * (x + wy);
            imgDt.Data[ind] = (byte)FilteredVals[ind];
            imgDt.Data[ind + 1] = (byte)FilteredVals[ind + 1];
            imgDt.Data[ind + 2] = (byte)FilteredVals[ind + 2];
        }
    }
    //Writes filtered values
    //In the future this rewriting can be avoided
    //because byte_addressable will be widely available
    if (useOpenCL)
    {
        //Device write not shown in this listing, see the CLFilter class
    }
}
6. Conclusion

We have presented a simple yet fast way to compute image filters using only the basic OpenCL implementation, which makes our code compatible with all cards that support OpenCL. Even without using OpenCL images or returning data as bytes, we still get a 60x faster algorithm using OpenCL, which in turn makes it feasible to process real-time data from a webcam (13 FPS on my system).

Further optimization without losing compatibility would involve using filters smaller than 7×7, hard-coding the filter values and taking better advantage of symmetries of the filter.

