Case study: Extraction of color Haar features

Download OpenCL Haar color feature extraction example source code

1. Introduction

Computer vision has become pervasive in modern society, with applications ranging from robotic vision and position measurement to face detection and recognition, automatic fault detection in industry, and many more.

One of the challenges in computer vision is to extract invariant features of an object, that is, to recognize the same attributes of a given object independently of scale, rotation, occlusion, illumination and other characteristics. In this sense, the idea of using the integral image to extract features from an image, introduced by Viola and Jones [1], has proven to be a very useful tool.

It is quite difficult, however, to find a complete description of how to parallelize the computation of the image integral and the extraction of a complete set of features.

The objective of this article is to demonstrate how to use GPGPU through OpenCL to accelerate the extraction of Haar-like wavelet features from color images. This involves OpenCL acceleration of the computation of the image integral, generating sliding-window regions in the target picture, and preparing the data structures that receive the features.

We also introduce a complete, usable software package capable of extracting Haar-like features and performing linear classification of the data using a Logistic Regression classifier, and give an example of how to use the software to identify stop signs, as shown in the picture below:

2. Using the software

If all you want to do is use the software to classify some data, here are the instructions on how to do it. If you’re a developer interested in the actual algorithms, feel free to jump to section 3.

For now, I suggest leaving the vertical and horizontal divisions as 12 and 16 respectively. Since what is extracted from the images are color values, the dimension of each feature vector will be 12*16*3 = 576.

2.1 Obtaining training examples

In order to obtain training examples, it’s necessary to create folders:

  • One for negative examples (i.e., pictures that contain no samples of what you want to detect);
  • One for each category of positive training examples.

One sample is extracted from each file in the positive-example folders, and multiple samples are extracted from each file in the negative folder (since those images contain no positive samples). The features are saved in .DAT files.

In the software we included StopSign and Generic Pictures folders containing images of traffic stop signs and samples of what is not a stop sign, respectively.

You can also extract positive samples manually from images by loading them using the button Extract Haar features from picture. By clicking the images it is then possible to save marked pictures into a folder.

2.2 Loading and viewing examples

After extracting features from positive and negative samples the .DAT files can be loaded directly. From the File menu:

  • Load positive samples from their .DAT file;
  • Load negative samples from their .DAT file;
  • In the View features tab, click Display random samples.

It is also possible to reduce the number of training examples in a given category by using the Reduce button. The picture below shows the display of random samples and the menu items used to load sample files:

2.3 Training the classifier

Now that positive and negative samples are loaded, the Train Classifier tab becomes available. Please read further if you intend to modify the percentage of cross-validation samples. Just click Train, wait for the computation to finish and observe the quality indicators of the classifier (internal and cross-validation hit rates).

It is also possible to Save and Load the classification coefficients into files, so that it’s not necessary to train the classifier over the same data more than one time.

The picture below shows a typical training result using the Logistic Regression classifier. If the window size of the features is too large, the algorithm may overfit (see ref. [3]).

2.4 Using the trained classifier

In the Identify tab, once a classifier has been trained or loaded from file, just click Load image… and the software will load the file, extract features, classify and highlight the features on the screen.

Of course, some postprocessing needs to be done to ensure that there is no overlap between two or more windows, but that is application-specific and each case should be treated individually.
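One common choice, sketched here only as an illustration (the box format and scores below are assumptions, not part of the accompanying software), is greedy overlap suppression: keep the best-scoring window and drop any window that overlaps it too much.

```python
def iou(a, b):
    # a, b: (x0, y0, x1, y1) boxes; intersection-over-union overlap measure
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def suppress_overlaps(boxes_scores, thresh=0.3):
    # boxes_scores: list of ((x0, y0, x1, y1), score) pairs
    kept = []
    for box, score in sorted(boxes_scores, key=lambda p: -p[1]):
        # keep a box only if it does not overlap any already-kept box too much
        if all(iou(box, k) <= thresh for k, _ in kept):
            kept.append((box, score))
    return kept
```

The threshold and the greedy ordering are tuning choices; as noted above, each application should treat them individually.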

3. Fundamentals

Now we delve into the algorithms themselves. Basically, given an image to be classified, our strategy is to turn our attention to small parts of the image that have standardized sizes and classify that window as object or not object, as summarized in the picture below:

The focus of this article is how to use OpenCL to extract the features from a given image. The task of training the classifier is beyond the scope of this article; further information can be found in [3] and [4].

3.1 Motivation

The approach that will be used to extract the features is called sliding windows. Basically, we focus our attention on various subsets of a given image, downscaling each portion to a standard size, which is then classified as a positive or negative example of what we are trying to detect.

What is controlled in the sliding window is the (x,y) position and also the width and height of the window. The picture below shows various subimages extracted from a given reference.

However, it would be very time consuming to compute averages for all pixels inside the sliding window, which is why it’s a good idea to use the integral image trick.

The strategy to be able to quickly extract these features is the following:

  1. Compute the image integral;
  2. Generate a table of sliding windows;
  3. Use the image integral and the sliding windows information to extract the features.

Let’s cover each of these steps.

3.2 Downscaling with the image integral

The image integral (or summed area table) is a concept that allows one to quickly compute the average of the image window subtended by pixels (x0, y0) and (xf, yf) in constant time (see ref. [5]).

Let’s give an example using a 1D array: suppose we have the following sequence:

0  1  2  3  4  5  6  7  8  9 10 11 12
x    = [1  1  1  2  3  5  1  2  1  1  2  3  4]

its integral is given by:

intX = [1  2  3  5  8 13 14 16 17 18 20 23 27]

If we wanted to know the average of the elements from x[3] = 2 up to x[9] = 1, we could just compute (2+3+5+1+2+1+1) / 7 ≈ 2.14. But if intX, the integral of x, is available, we can reduce this to a single subtraction followed by a division: (intX[9] - intX[2]) / 7 = (18 - 3) / 7 ≈ 2.14. (The reason we read intX at index 2 rather than 3 is that we want to include x[3] in the sum; for big images this off-by-one becomes less important.)

Notice that we can compute the average between any two indices of the vector with only one subtraction and one division.
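The 1D example can be checked with a few lines of Python, using a cumulative sum in place of intX:

```python
import numpy as np

x = np.array([1, 1, 1, 2, 3, 5, 1, 2, 1, 1, 2, 3, 4])
intX = np.cumsum(x)   # [1 2 3 5 8 13 14 16 17 18 20 23 27]

# average of x[3..9]: one subtraction, one division
avg = (intX[9] - intX[2]) / 7.0
assert abs(avg - x[3:10].mean()) < 1e-12   # both equal 15/7 ≈ 2.14
```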

In the case of images, the strategy is to make the value at any point (x, y) in the summed area to be the sum of all the pixels above and to the left of (x, y), inclusive (ref. [5]). Let I(x,y) be the summed area table; we then can compute averages by just computing:

Avg = (I(xf, yf) – I(x0, yf) – I(xf, y0) + I(x0,y0)) / (area of the rectangle).
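A quick numpy sketch verifies the 2D formula; as in the 1D example, the (x0, y0) corners are read one pixel before the window so that the window itself is fully included:

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.integers(0, 256, size=(40, 40))
I = img.cumsum(axis=0).cumsum(axis=1)   # summed area table, inclusive sums

def window_avg(I, x0, y0, xf, yf):
    # (x0, y0) corners are exclusive, (xf, yf) inclusive
    total = I[yf, xf] - I[y0, xf] - I[yf, x0] + I[y0, x0]
    return total / ((xf - x0) * (yf - y0))

# constant-time average matches the brute-force mean over the same pixels
assert np.isclose(window_avg(I, 4, 9, 20, 30), img[10:31, 5:21].mean())
```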

3.3 Computing the sliding windows

We now need to define the positions and sizes of the sliding windows. To reduce memory fetches, it is a good idea to specify each sliding window by its (x, y) position and width, and then compute the window height as a function of the width. Simplistic as this may seem, it matters because the algorithm is memory intensive.

Another important aspect is the overlap between frames. Ideally, we’d like to slide the window one pixel at a time and increase the window size by one pixel at a time. This would be extremely slow and is generally unnecessary if the classifier is doing a decent job.

Instead, it is better to fix an overlap factor and decrease the window sizes by certain factors.

It works as follows:

  • In the case of the window size, start with a big size (possibly covering the whole image) and decrease down to, say, 10% of the total image size;
  • In the case of the sliding window step, let’s say that, given its width W, a total of N windows would cover the entire width of the picture without overlap. We then use OVERLAP * N windows in order to have some overlap between windows.

The generation of this information is fast enough not to require OpenCL acceleration (in comparison with the other steps of the algorithm).

This step is implemented by the function public static void ComputeSubFrames in the file CLImgHaar.cs, included in the source files.
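A rough Python equivalent of this generation step may help; the shrink factor, overlap constant and minimum size below are illustrative values, not the ones used by ComputeSubFrames, and square windows are used here for simplicity (the accompanying code derives the height from the width later):

```python
def compute_sub_frames(img_w, img_h, overlap=2.0, shrink=0.85, min_frac=0.1):
    """Generate a flat [x0, y0, w0, x1, y1, w1, ...] list of sliding windows."""
    frames = []
    w = min(img_w, img_h)                # start with a big window
    while w >= min_frac * min(img_w, img_h):
        step = max(1, int(w / overlap))  # OVERLAP * N windows span the width
        for y in range(0, img_h - w + 1, step):
            for x in range(0, img_w - w + 1, step):
                frames.extend([x, y, w])
        w = int(w * shrink)              # decrease window size by a factor
    return frames
```

Even in interpreted Python this loop is cheap compared to the per-pixel work, which is why the article leaves it on the host.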

4. Algorithms

4.1 Image integral

In OpenCL, we can compute the image integral by summing over the row values of the image and then summing over column values of the result, as follows:

__kernel void SumImgRows(__constant   int       * imgDim,
                         __read_only  image2d_t   bmp,
                         __write_only image2d_t   temp)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP | //Clamp to zeros
                          CLK_FILTER_NEAREST; //Don't interpolate
    uint4 pix;
    int4 pixInt = (int4)(0, 0, 0, 0);
    int2 coords;

    //One workitem per image row
    coords.y = get_global_id(0);
    for (int i = 0; i < imgDim[0]; i++)
    {
        coords.x = i;
        pix = read_imageui(bmp, smp, coords);
        pixInt += (int4)((int)pix.x, (int)pix.y, (int)pix.z, 0);
        write_imagei(temp, coords, pixInt);
    }
}

__kernel void SumImgCols(__constant   int       * imgDim,
                         __read_only  image2d_t   temp,
                         __write_only image2d_t   imgInt)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP | //Clamp to zeros
                          CLK_FILTER_NEAREST; //Don't interpolate
    int4 pix;
    int4 pixInt = (int4)(0, 0, 0, 0);
    int2 coords;

    //One workitem per image column
    coords.x = get_global_id(0);
    for (int i = 0; i < imgDim[1]; i++)
    {
        coords.y = i;
        pix = read_imagei(temp, smp, coords);
        pixInt += pix;
        write_imagei(imgInt, coords, pixInt);
    }
}

We then call these kernels sequentially:

if (CLbmp == null || bmp.Width != CLbmp.Width || bmp.Height != CLbmp.Height)
{
    CLbmp = new CLCalc.Program.Image2D(bmp);
    CLdim = new CLCalc.Program.Variable(new int[] { bmp.Width, bmp.Height });
    CLimgDivs = new CLCalc.Program.Variable(new int[] { Config.ImgDivisions[0], Config.ImgDivisions[1] });
    imgInt = new int[4 * CLbmp.Width * CLbmp.Height];
    CLImgtemp = new CLCalc.Program.Image2D(imgInt, CLbmp.Width, CLbmp.Height);
    CLImgInt = new CLCalc.Program.Image2D(imgInt, CLbmp.Width, CLbmp.Height);
}
else CLbmp.WriteBitmap(bmp);

//Forces to read features again
if (haarFeats != null) haarFeats.Values[0] = 0;

//Image integral
kernelSumImgRows.Execute(new CLCalc.Program.MemoryObject[] { CLdim, CLbmp, CLImgtemp }, CLbmp.Height);
kernelSumImgCols.Execute(new CLCalc.Program.MemoryObject[] { CLdim, CLImgtemp, CLImgInt }, CLbmp.Width);

Notice that, in this algorithm, we need a temporary image to hold the partial row sums; the final result comes out in the CLImgInt variable.
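The two kernels follow the separable structure of the sum: a cumulative sum along rows followed by one along columns gives the full summed area table. In numpy terms (per color channel):

```python
import numpy as np

rng = np.random.default_rng(1)
bmp = rng.integers(0, 256, size=(8, 8, 3))   # H x W x RGB test image

temp = bmp.cumsum(axis=1)        # SumImgRows: one workitem per row
img_int = temp.cumsum(axis=0)    # SumImgCols: one workitem per column

# same result as computing the 2D prefix sum in the other order
assert (img_int == bmp.cumsum(axis=0).cumsum(axis=1)).all()
```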

4.2 Extracting Haar features from sliding windows

Given an int[] array in CLsubFrames containing [x0, y0, width0, x1, y1, width1, … x[n-1], y[n-1], width[n-1]] information of positions of sliding windows and the image integral, we can extract an array of features.

The dimension of this array is:

haarFeats = new floatLinalg.floatMatrix(new float[lstSubFrames.Count / 3, 1 + 3 * (Config.ImgDivisions[0] - 1) * (Config.ImgDivisions[1] - 1)]);

That is, it has lstSubFrames.Count / 3 rows (the number of sliding windows) and 1 + 3 * (Config.ImgDivisions[0] - 1) * (Config.ImgDivisions[1] - 1) columns, which is the number of features per window. As detailed in the function public floatLinalg.floatMatrix GetHaarFeatures(out List<int> lstSubFrames), and following the earlier explanation of how to compute color averages using image integrals, we call one OpenCL workitem per window using:

kernelExtractFeats.Execute(new CLCalc.Program.MemoryObject[] { CLImgInt, CLdim, CLsubFrames, CLimgDivs, haarFeats.CLValues }, lstSubFrames.Count / 3);

Notice that the number of sliding windows is lstSubFrames.Count / 3. The CLimgDivs variable selects the dimension of the extracted features (12×16 divisions by default in this example). Each workitem extracts all desired features taking into consideration the position and window size it receives, as detailed below:

__kernel void ExtractFeats(__read_only image2d_t    imgInt,
                           __constant     int     * imgDim,
                           __global const int     * subFrames,
                           __constant     int     * imgDivs,
                           __global       float   * feats)
{
    const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                          CLK_ADDRESS_CLAMP | //Clamp to zeros
                          CLK_FILTER_NEAREST; //Don't interpolate
    int4 pix00, pix01, pix10, pix11;
    int2 coords;
    float temp; int idx;
    int ii = get_global_id(0);

    //number of features per window
    int nFeats = 1 + 3 * (imgDivs[0] - 1) * (imgDivs[1] - 1);

    int i3 = 3*ii;
    int x0 = subFrames[i3];
    int y0 = subFrames[i3+1];
    int w = subFrames[i3+2];
    int h = (w/3)<<2; //window height derived from the width

    int xAtual = x0, xAnt;
    int yAtual, yAnt;

    //Intercept = 1
    feats[nFeats * ii] = 1;

    for (int i = 1; i < imgDivs[0]; i++)
    {
        xAnt = xAtual;
        xAtual = x0 + (int)((float)i / (float)(imgDivs[0] - 1) * w);
        yAtual = y0;
        for (int j = 1; j < imgDivs[1]; j++)
        {
            yAnt = yAtual;
            yAtual = y0 + (int)((float)j / (float)(imgDivs[1] - 1) * h);

            //read the four cell corners from the image integral
            coords.x = xAnt; coords.y = yAnt;
            pix00 = read_imagei(imgInt, smp, coords);
            coords.x = xAtual; coords.y = yAnt;
            pix01 = read_imagei(imgInt, smp, coords);
            coords.x = xAnt; coords.y = yAtual;
            pix10 = read_imagei(imgInt, smp, coords);
            coords.x = xAtual; coords.y = yAtual;
            pix11 = read_imagei(imgInt, smp, coords);

            pix00 += pix11 - pix10 - pix01;

            //normalize by 255000 * cell area (guarding against zero area)
            temp = (float)(255000.0f * (xAtual - xAnt) * (yAtual - yAnt));
            temp = temp == 0 ? 1E5f : temp;
            temp = native_recip(temp);

            idx = 3 * ((i - 1) + (imgDivs[0] - 1) * (j - 1)) + nFeats*ii;
            feats[1 + idx] = temp * (float)pix00.x;
            feats[2 + idx] = temp * (float)pix00.y;
            feats[3 + idx] = temp * (float)pix00.z;
        }
    }
}

Finally, we have the desired features extracted from the image stored in the matrix haarFeats.CLValues, which doesn’t have to be read back to host memory and can be fed directly to the OpenCL Logistic Regression classifier.

One final note: in order to work with small values and not underflow the Logistic Regression classifier, instead of dividing only by (xAtual - xAnt) * (yAtual - yAnt), which would be the cell area, we also divide by a factor of 255000.0f.
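As a reference for readers who prefer host code, here is a hedged Python rendering of what each workitem computes; the helper name is hypothetical and the ordering of the per-cell entries is simplified relative to the kernel's memory layout:

```python
import numpy as np

def extract_feats(img_int, x0, y0, w, divs=(12, 16)):
    """img_int: H x W x 3 summed area table; returns one feature row."""
    h = (w // 3) << 2                      # window height derived from width
    feats = [1.0]                          # intercept term
    xs = [x0 + int(i / (divs[0] - 1) * w) for i in range(divs[0])]
    ys = [y0 + int(j / (divs[1] - 1) * h) for j in range(divs[1])]
    for i in range(1, divs[0]):
        for j in range(1, divs[1]):
            # sum over the cell via four reads of the integral image
            p = (img_int[ys[j], xs[i]] - img_int[ys[j - 1], xs[i]]
                 - img_int[ys[j], xs[i - 1]] + img_int[ys[j - 1], xs[i - 1]])
            norm = 255000.0 * (xs[i] - xs[i - 1]) * (ys[j] - ys[j - 1])
            norm = 1e5 if norm == 0 else norm  # guard against zero area
            feats.extend(p / norm)             # three color channels
    return feats
```

On a uniform gray image (all channels 128) every cell average comes out as 128 / 255000, which shows the effect of the extra normalization factor.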

4.3 Classifier

To classify the data we use the so-called “regularized logistic regression” algorithm, a fast technique to linearly separate data. If the application requires discriminating between very complicated shapes, or, more technically, if the classification boundaries are nonlinear, it is necessary either to augment the feature space with nonlinear features before using this algorithm or to employ a more sophisticated classifier such as a Support Vector Machine (SVM) or a neural network.
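To give the flavor of the classifier without digging into the repository, here is a minimal regularized logistic regression in numpy; it is a plain gradient-descent sketch, not the OpenCL implementation shipped with the software:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, lam=1.0, lr=0.1, iters=2000):
    """X: m x n features (first column = 1 intercept); y: m labels in {0, 1}."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(iters):
        grad = X.T @ (sigmoid(X @ theta) - y) / m
        grad[1:] += lam / m * theta[1:]   # don't regularize the intercept
        theta -= lr * grad
    return theta

def predict(theta, X):
    return sigmoid(X @ theta) >= 0.5
```

The intercept column here plays the same role as the leading 1 written by the ExtractFeats kernel, and the regularization term lam keeps the coefficients small on high-dimensional feature vectors.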

The discussion of classifiers is beyond the scope of this article, but you can get more information from references [2], [3] and [4]. Reference [3] is a particularly interesting set of lectures by Prof. Ng from Stanford. The reader is welcome to browse our repository and see the details of the algorithm (search for #region Logistic Regression).

5. Conclusion

In our tests, the feature extraction time for images of size 1000 x 1000 and bigger ranged from 0.03 to 0.05 s, while the classification times were less than 0.01 s. Thus, using OpenCL, it would be possible to process a 30 fps video stream for object detection.

In terms of performance, one should notice that the feature extraction process takes constant time once the image integral has been computed. Thus, the variation in computation time comes from computing the image integral itself, and the bigger the image, the more parallelism can be exploited on GPUs.

We note here that the features computed are color Haar features, meaning that all three color components (RGB) are averaged, unlike most methods found in the literature, which only deal with grayscale values. Once these color features have been extracted, it’s up to the programmer to convert the RGB into other color models or to derive modified features from the Haar features. Moreover, if an algorithm has already been implemented in C, it should be straightforward to port to OpenCL, given that operating on a per-sliding-window basis exposes plenty of parallelism.

Finally, in this article we have presented a method to use OpenCL to accelerate extraction of color Haar features from a given image and a complete software demonstrating how to use these features and the sliding windows method to classify pictures.

Download OpenCL Haar color feature extraction example source code

6. References

[1] Viola, Paul; Jones, Michael (2002). “Robust Real-time Object Detection”. International Journal of Computer Vision.


[3] Ng, Andrew. Machine Learning.

[4] Vandenberghe, Lieven and Boyd, Stephen. Convex Optimization. Cambridge University Press.

[5] Summed area table.
