OpenCL Tutorial

OpenCL Tutorial with OpenCLTemplate and Cloo

Welcome to the area dedicated to parallel processing and acceleration using OpenCL and the graphics card.

This area aims to provide a clear, summarized and practical way to use the graphics card for math calculations. If you’re interested in the architecture and implementation, check the OpenCL spec from the Khronos Group.
I suggest that the reader NOT skip any step, because understanding later steps often depends on having understood the previous ones. Besides, this page is not intended to give professional training in parallel processing. Instead, we’re trying to offer a practical way of learning for the non-professional OpenCL developer.
For your convenience, the topics have been grouped by difficulty level in a color scale:
Level: easy;
Level: intermediate;
Level: difficult.
Note that you should be familiar with C# and .NET to follow this tutorial.

The sample code for each section is available in that section.

I suggest you download OpenCLTemplate and use the OpenCL Editor to check whether your code is correct.

Important note: Most of this tutorial is general-purpose information about OpenCL. OpenCLTemplate just makes it faster to try the code and see what happens. It doesn’t matter if you are going to use the pure OpenCL API or some binding like OpenTK, Cloo (which I think is great) or OpenCL.NET. What is important is that there will be commands to load variables and execute kernels. You will always be able to use the OpenCL C99 code presented here.


You may click on the desired topic or use the menu to the left to access the topics.

1 – Installation and configurations;
2 – Overview about OpenCL and parallel processing;
3 – First OpenCL program;
4 – ATI Stream OpenCL Technical Overview;
5 – Capabilities and limitations;
6 – Why parallel processing?;
7 – Reading and writing variables;

8 – Command queues;
9 – Kernel execution structure;
10 – Basic aspects of OpenCL C99 language;
11 – Intermediate aspects of the C99 OpenCL language;
12 – Advanced aspects of the C99 OpenCL language;
13 – OpenCL C99 Atomics;
14 – OpenCL Image2D Variables;
15 – Synchronization;
16 – OpenCL/OpenGL Interop Framework;
17 – OpenCL/OpenGL Interoperation;
18 – OpenCL/OpenGL Interoperation with Textures;
19 – Optimization Strategies;
20 – Case study: matrix multiplication;
21 – Case study: image filtering;
22 – Case study: Low poly collision detection;
23 – Case study: geometric fitting of pipes;
24 – Case study: color tracking;
25 – Case study: High performance convolution using OpenCL __local memory;
26 – Case study: Extraction of color Haar features;
27 – Case study: heat transfer simulation using CLGL interop;
28 – Case Study: Efficient manipulation of Kinect data using OpenCL/GL Interop.


AMD Diagonal Sparse Matrix Vector Multiplication Case Study – Nice Case Study from AMD, definitely worth seeing. Some optimizations are clearly hardware-specific though.

AMD Reductions Case Study – Another interesting case study, which shows ways to compute vector sum/max/min operations efficiently. The concepts are applicable in general.

Installation and configurations

Let’s get started with the OpenCL tutorial using OpenCLTemplate. OpenCL is a framework developed to enable users to use their video cards (GPUs) as general-purpose processors. The advantage is that GPUs have immense parallel processing power. I myself have been able to run collision detection algorithms 120x faster on a GPU. Since …

Overview about OpenCL and parallel processing

Let’s get a basic overview of how OpenCL works. Take a look at the simplified scheme below: You can see that there is a HOST and there are DEVICES. A closer look shows that the host sends information to the devices, sends execution commands and retrieves data. How …

First OpenCL program

Get the source code for this example. We are ready to create our first OpenCL program now. This will be a very simple program because we haven’t covered much of the OpenCL C99 or the OpenCL API. I recommend you go to the Khronos Group website and download the latest OpenCL specification. Even if …

ATI Stream OpenCL Technical Overview

AMD has made this excellent series of videos that cover many fundamentals of OpenCL. After reading the setup operations and trying your first OpenCL program yourself, it’s much easier to understand the Overview. The reviews talk about the OpenCL C99 and the OpenCL API. The important thing to know if you want to use bindings …

Capabilities and limitations

Get the source code for this section. Like all technologies, parallel processing and OpenCL have strengths and weaknesses. The main strength of parallel processing (and thus OpenCL) is the ability to use all the processing power of CPUs and GPUs simultaneously to accelerate computing. The main disadvantage is that not all algorithms are 100% …

Why parallel processing?

1. Introduction  GPU computing is all about speeding up applications and making it viable to run complex interactive software in real time. Picture 1 shows some speedups obtained in scientific applications (HOBEROCK and TARJAN, [5]): Picture 1 – Speedups in scientific applications The image below shows accelerations of up to 30x in fluid dynamics simulation …

Reading and writing variables

As we saw before, the memory of the Device that is executing OpenCL is not directly accessible by the Host. Thus, it is necessary to provide a way to transfer data back and forth. This involves three steps: Create the variable space in Device memory; Write the variable to Device memory; Read the contents back …
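The three steps above can be sketched in plain C. This is only an emulation that runs without an OpenCL runtime: a separate heap allocation stands in for Device memory, and the names `DeviceBuffer`, `device_alloc`, `device_write`, `device_read` and `device_free` are hypothetical helpers invented for this illustration. The comments mention the OpenCL API calls that play a similar role in a real program.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a Device buffer. In real OpenCL this memory
   lives on the Device and the Host cannot dereference it directly. */
typedef struct {
    float *storage;   /* emulated Device memory */
    size_t count;     /* number of float elements */
} DeviceBuffer;

/* Step 1: create the variable space in (emulated) Device memory.
   Roughly the role clCreateBuffer plays in the OpenCL API. */
DeviceBuffer device_alloc(size_t count) {
    DeviceBuffer buf;
    buf.storage = (float *)calloc(count, sizeof(float));
    buf.count = buf.storage ? count : 0;
    return buf;
}

/* Step 2: write Host data to the Device (compare clEnqueueWriteBuffer). */
void device_write(DeviceBuffer *buf, const float *host, size_t count) {
    assert(count <= buf->count);
    memcpy(buf->storage, host, count * sizeof(float));
}

/* Step 3: read the contents back to the Host (compare clEnqueueReadBuffer). */
void device_read(const DeviceBuffer *buf, float *host, size_t count) {
    assert(count <= buf->count);
    memcpy(host, buf->storage, count * sizeof(float));
}

/* Release the emulated Device memory (compare clReleaseMemObject). */
void device_free(DeviceBuffer *buf) {
    free(buf->storage);
    buf->storage = NULL;
    buf->count = 0;
}
```

The point of keeping the copies explicit is that every transfer between Host and Device has a cost, so a real program tries to write once, run as many kernels as possible, and read back only the final result.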

Command queues

Command queues are a very important aspect of OpenCL. They contain instructions to inform which of the Devices of your Context (that is, the group of Devices you have chosen to use) is going to execute a particular command and also how it is going to do it. If you need more in-depth information about Contexts please refer to …

Kernel execution structure

Be warned that this tutorial is a bit longer than the others because there are many important aspects to cover. In OpenCL, you need to invoke kernel execution with the proper arguments to start executing the OpenCL C99 code. We’ve discussed that you can access the OpenCL API directly or using wrappers. Whichever way you choose, it is …

Basic aspects of OpenCL C99 language

1. Introduction No matter what binding you are going to use to create host OpenCL code, you will need to create OpenCL kernels using the OpenCL C99 code. You may choose to use the easy-to-use but to some extent limited OpenCLTemplate, the great C# Cloo, higher degree of control with C++ and native API calls, …

Intermediate aspects of the C99 OpenCL language

Download source code example for this section. The OpenCL C99 has some important differences when compared to regular C99. Once again, I would like to emphasize that I assume the reader knows the C# and C programming languages. I am not going to discuss pointers or structures in this tutorial. The aspects I consider …

Advanced aspects of the C99 OpenCL language

In this topic, we are basically going to discuss what makes OpenCL C99 different from regular C99: running in parallel. To understand this topic you must keep in mind at all times that the workers are being executed at the same time, sharing resources, and that hardware clocks will probably affect all operations being done. First …

OpenCL C99 Atomics

Get source code for this section (implemented with Cloo). 1. Introduction First of all, this topic is about atomic operations in the OpenCL C99 code. That means it doesn’t matter if you’re using Cloo, OpenCLTemplate or the API directly, C++, Java or whatever you like. I used Cloo in this example to provide more Cloo …

OpenCL Image2D Variables

Get the source code for this section. 1. Introduction The worked example of this section shows a custom 7×7 border filter that runs 285x faster on the GPU than it does on the CPU. GPUs are optimized to cache and sample textures quickly. This is due to the nature of the GPU itself as a component designed …


Synchronization

Get the source code for this section. Parallel algorithms aren’t usually fully parallel; they normally involve big parts of code which can be parallelized. This tutorial presents OpenCL Host code synchronization techniques to ensure that kernels that depend on previous operations effectively wait until the data they need is actually available. If you want to know about …

OpenCL/OpenGL Interop Framework

Download CLGLDemo and CLMandelbrot sources 1. Introduction Interoperation between OpenCL and OpenGL allows programmers to efficiently perform complex manipulation of data directly in the GPU memory. CMSoft brings to developers the new GLRender tool in OpenCLTemplate that automates the creation of an OpenGL scene coupled with a derived OpenCL context. It is possible to create and display …

OpenCL/OpenGL Interoperation

Get the source code for this section. This tutorial is about manipulating OpenGL data using OpenCL without the need to transfer data back and forth. This is probably one of the most important OpenCL features when coupled with 3D games or 3D software because data transfers are currently the bottleneck of OpenCL (and CUDA and …

OpenCL/OpenGL Interoperation with Textures

Download source code for this section. 1. Introduction In previous Interop tutorials we discussed how to interoperate OpenGL vertex buffer objects (VBOs) with OpenCL. In this section we’ll demonstrate how to use OpenCL to manipulate OpenGL texture objects. As in the case of VBOs, the main advantage of using OpenCL to manipulate OpenGL textures is that it …

Optimization Strategies

Download source code for this section Topics: Memory coalescing, __local reduction, kernel launches and vectorization Not included in this example: bank conflicts, flow divergence 1. Introduction The purpose of this article is to show the performance gain that can be obtained by proper utilization of parallel techniques known as memory coalescing and __local reduction. We’ll …
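As a rough illustration of the __local reduction idea named above, here is a plain C emulation of one work-group summing its values in a tree pattern. It is a sketch under stated assumptions, not the code from the article: `GROUP_SIZE`, `group_reduce_sum` and `reduce_sum` are hypothetical names, the `scratch` array stands in for a `__local` array, and the work-items that would run in parallel on a Device are emulated by the inner loop.

```c
#include <assert.h>
#include <stddef.h>

#define GROUP_SIZE 8  /* hypothetical work-group size; must be a power of two */

/* Emulates one work-group summing GROUP_SIZE values.
   scratch plays the role of a __local array. Each pass of the stride loop
   corresponds to the work done between two barrier(CLK_LOCAL_MEM_FENCE)
   calls in a real kernel. */
float group_reduce_sum(const float *input) {
    float scratch[GROUP_SIZE];

    /* Each work-item copies one value from global to local memory. */
    for (size_t lid = 0; lid < GROUP_SIZE; ++lid)
        scratch[lid] = input[lid];

    /* Halve the number of active work-items on every pass. Writes go to
       indices below stride and reads come from indices at or above it,
       so the parallel and sequential orders give the same result. */
    for (size_t stride = GROUP_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t lid = 0; lid < stride; ++lid)
            scratch[lid] += scratch[lid + stride];
    }
    return scratch[0];
}

/* Sums n values (n must be a multiple of GROUP_SIZE) one group at a time,
   then combines the per-group partial sums on the Host. */
float reduce_sum(const float *input, size_t n) {
    assert(n % GROUP_SIZE == 0);
    float total = 0.0f;
    for (size_t g = 0; g < n / GROUP_SIZE; ++g)
        total += group_reduce_sum(input + g * GROUP_SIZE);
    return total;
}
```

With 8 values per group the sum finishes in 3 passes instead of 7 sequential additions, which is where the gain comes from when the passes really do run in parallel.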

Case study: matrix multiplication

Get the source code for this section. This section is dedicated to processing matrix multiplication using the GPU. We are going to implement a class that multiplies two matrices without using __local variables and create another implementation using __local variables, to compare local sync performance versus simple worker processing performance. This section should be easy to …
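To make the version without __local variables concrete, here is a minimal plain C sketch in which each emulated work-item computes exactly one element of the result. It is an assumption about how such a kernel is typically organized, not the code from this section: `matmul_work_item` and `matmul` are hypothetical names, matrices are stored row-major, and the nested loops stand in for a 2D kernel launch.

```c
#include <assert.h>
#include <stddef.h>

/* Body of a hypothetical kernel: the work-item with global ids (row, col)
   computes a single element of C = A * B.
   A is n x k and B is k x m, both stored row-major. */
void matmul_work_item(const float *a, const float *b, float *c,
                      size_t k, size_t m, size_t row, size_t col) {
    float acc = 0.0f;
    for (size_t i = 0; i < k; ++i)
        acc += a[row * k + i] * b[i * m + col];
    c[row * m + col] = acc;
}

/* Host-side emulation of the 2D launch. On a Device these iterations
   would run in parallel, one per work-item. */
void matmul(const float *a, const float *b, float *c,
            size_t n, size_t k, size_t m) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t row = 0; row < n; ++row)
        for (size_t col = 0; col < m; ++col)
            matmul_work_item(a, b, c, k, m, row, col);
}
```

Every work-item here reads a full row of A and a full column of B straight from global memory, which is the repeated traffic that a __local variant tries to cut down.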

Case study: image filtering

Get the source code for this section. Get source code of runtime execution image filter. In this tutorial we’re going to implement image filtering with a 7×7 filter. The idea is to present an OpenCL algorithm which will work on all cards compatible with the basic implementation of OpenCL without extensions. Since we received some …

Case study: Low poly collision detection

Download source code for this section. 1. Introduction In this tutorial we will present an O(nm) collision detection algorithm suitable for detecting exact collision between 3D models containing n and m triangles. Since the complexity of the algorithm is O(nm), it is better used for models with around 100 polygons. Nonetheless, if you need to check …

Case study: geometric fitting of pipes

Get the source code for this case study. You will find 2 sources: one with easier-to-understand OpenCL C99 code and one with the C99 code optimized for the GPU. I am developing this code for OpenCL users who already have a good knowledge of the technology and want to see OpenCL being used for a real-life …

Case study: color tracking

Download this Case Study’s source code.   1. Introduction Tracking a set of colors in a video is a first approximation and initial guess for many applications. In fact, determining what parts of an image belong to skin, for example, is very important to track faces or hands. This Case Study presents a technique that is robust …

Case study: High performance convolution using OpenCL __local memory

Download source code for this case study. 1. Introduction In a previous case study, we analyzed how to create a simple 7×7 filter in OpenCL which was compatible with any GPU. Back then, performance was less of an issue than compatibility, as we didn’t even use images because there are some older GPUs which don’t support any extension (not even …

Case study: Extraction of color Haar features

Download OpenCL Haar color feature extraction example source code 1. Introduction Computer vision has become pervasive in our modern society, with applications ranging from robotic vision, measurement of position, face identification and recognition, automatic detection of failures in industry and many, many more. One of the challenges in computer vision is to extract invariant features …

Case study: heat transfer simulation using CLGL interop

Download source code for this section. Download the presentation. Watch on YouTube: Heat transfer simulation demonstration (05:38); Prerequisites and heat equation overview (07:48); Discretization of the heat equation (13:59); Sharing OpenGL textures with OpenCL (12:11); Mapping intensity to color using OpenCL (08:42); Simulating the heat equation system (28:49); Conclusion and wrap-up (06:14). 1. Introduction Heat transfer and, more generally, parabolic partial differential …

Case Study: Efficient manipulation of Kinect data using OpenCL/GL Interop

Download source code for this section. 1. Introduction Interactive technologies have become extremely important in a world where busy users demand intuitive devices which demand little to no learning time. In this modern scenario, tablets have emerged with their easy-to-use touchscreens, gaming consoles have been successfully exploring movement controls (Wii, PS3 eye, Kinect) and augmented …
