OpenCL Tutorial

OpenCL Tutorial with OpenCLTemplate and Cloo

Welcome to the area dedicated to parallel processing and acceleration using OpenCL and the graphics card.

This area aims to show, clearly and concisely, a practical way to use the graphics card for math calculations. If you’re interested in the architecture and implementation, check the OpenCL spec from the Khronos Group.
I suggest that you NOT skip any step, because understanding later steps will often depend on having understood the previous ones. This page is not intended to give professional training in parallel processing. Instead, it tries to offer a practical way of learning for the non-professional OpenCL developer.
For your convenience, the topics have been grouped by difficulty level in a color scale:
Level: easy;
Level: intermediate;
Level: difficult.
Note that you should be familiar with C# and .NET to follow this tutorial.

The sample code for each section is available within that section.

I suggest you download OpenCLTemplate and use the OpenCL Editor to check if your code is correct.

Important note: Most of this tutorial is general-purpose information about OpenCL. OpenCLTemplate just makes it faster to try the code and see what happens. It doesn’t matter whether you use the pure OpenCL API or a binding such as OpenTK, Cloo (which I think is great) or OpenCL.NET. In every case there will be commands to load variables and execute kernels, and you will always be able to use the OpenCL C99 code presented here.


You may click on the desired topic or use the menu to the left to access the topics.

1 – Installation and configurations;
2 – Overview about OpenCL and parallel processing;
3 – First OpenCL program;
4 – ATI Stream OpenCL Technical Overview;
5 – Capabilities and limitations;
6 – Why parallel processing?;
7 – Reading and writing variables;

8 – Command queues;
9 – Kernel execution structure;
10 – Basic aspects of OpenCL C99 language;
11 – Intermediate aspects of the C99 OpenCL language;
12 – Advanced aspects of the C99 OpenCL language;
13 – OpenCL C99 Atomics;
14 – OpenCL Image2D Variables;
15 – Synchronization;
16 – OpenCL/OpenGL Interop Framework;
17 – OpenCL/OpenGL Interoperation;
18 – OpenCL/OpenGL Interoperation with Textures;
19 – Optimization Strategies;
20 – Case study: matrix multiplication;
21 – Case study: image filtering;
22 – Case study: Low poly collision detection;
23 – Case study: geometric fitting of pipes;
24 – Case study: color tracking;
25 – Case study: High performance convolution using OpenCL __local memory;
26 – Case study: Extraction of color Haar features;
27 – Case study: heat transfer simulation using CLGL interop;
28 – Case Study: Efficient manipulation of Kinect data using OpenCL/GL Interop.


AMD Diagonal Sparse Matrix Vector Multiplication Case Study – Nice Case Study from AMD, definitely worth seeing. Some optimizations are clearly hardware-specific though.

AMD Reductions Case Study – Another interesting case study, which shows ways to compute vector sum/max/min operations efficiently. The concepts are applicable in general.

11 thoughts on “OpenCL Tutorial”

  1. Hi! I just started using OpenCLTemplate for realtime simulation, but after a few seconds of running it becomes laggy. The output window shows a lot of lines like this, just with different numbers:
    “WARNING! ComputeBuffer{T}(279597568) leaked.
    Disposing ComputeBuffer{T}(279597568) in Thread(2).”
    I’ve disposed of all the kernels and variables every time after they finish executing, but it doesn’t fix the problem. It seems that the Dispose function doesn’t work! Can you please help me?

    1. Wait! I solved the problem by reusing the variable through WriteToDevice() 😀
      But there’s another warning after creating all the CLCalc variables:
      “WARNING! ComputeCommandQueue(0) leaked.”

      And I’m still wondering about the Dispose() ?

      What should I do? Please help.

    2. I think this might be an issue with Cloo Dispose. What I do to circumvent this problem is to always reuse memory objects when their sizes don’t change.

      I do not know the details of your application, but there is a 99% chance it is lagging because of memory reallocation. Try reusing the objects when their sizes don’t change, or allocate them once with a size large enough to fit everything.

      E.g. declare the Variable once and only use WriteToDevice after that.

      Hope this helps

  2. How do I check whether a string is contained in an array of strings passed by the host?

    I have a string array. I want to pass this array to the GPU and verify whether a given string is contained in it.

    1. Hey! I’m not sure, but the code below should work.

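      public int findSubString(char[] original, char[] searchString)
      {
          int returnCode = 0; // 0 - not found, -1 - error in input, 1 - found
          int counter = 0;
          int ctr = 0;
          // Reject empty inputs and search strings longer than the text.
          if (original.Length < 1 || searchString.Length < 1 || original.Length < searchString.Length)
              return -1;
          // Stop early enough that the inner loop never reads past the end of original.
          while (ctr <= original.Length - searchString.Length && returnCode == 0)
          {
              if (original[ctr] == searchString[0])
              {
                  counter = 0;
                  for (int count = ctr; count < (ctr + searchString.Length); count++)
                  {
                      if (original[count] == searchString[counter]) { counter++; }
                      else { counter = 0; break; }
                  }
                  if (counter == searchString.Length) { returnCode = 1; }
              }
              ctr++;
          }
          return returnCode;
      }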
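
      For the GPU side of the question, one possible approach (a sketch, not something taken from the reply above) is to pack the strings into one flat buffer of fixed-width, zero-padded slots, since OpenCL kernels work on plain buffers rather than on arrays of strings. The plain C below shows the idea on the host so it can be tried without a device. The names pack_slot, slot_equals and find_string are hypothetical, and the fixed slot width is an assumption. In a kernel, the loop over gid would go away and each work-item would test the slot at get_global_id(0).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy s into a zero-padded slot of `width` bytes.
   Returns 1 on success, 0 if s does not fit in the slot. */
int pack_slot(char *slot, size_t width, const char *s)
{
    size_t len = strlen(s);
    if (len > width) return 0;
    memset(slot, 0, width);
    memcpy(slot, s, len);
    return 1;
}

/* Compare two zero-padded slots byte by byte.
   Written as a manual loop because OpenCL C has no memcmp. */
static int slot_equals(const char *a, const char *b, size_t width)
{
    for (size_t i = 0; i < width; i++) {
        if (a[i] != b[i]) return 0;
    }
    return 1;
}

/* Return the index of the first slot equal to needle, or -1 if none matches.
   `flat` holds `count` slots of `width` bytes each; `needle` is one slot. */
int find_string(const char *flat, size_t count, size_t width, const char *needle)
{
    assert(width > 0);
    /* On the GPU, each iteration would be one work-item: gid = get_global_id(0). */
    for (size_t gid = 0; gid < count; gid++) {
        if (slot_equals(flat + gid * width, needle, width)) return (int)gid;
    }
    return -1;
}
```

      Fixed-width slots waste some memory, but every work-item can compute its own offset as gid * width without an index table. The search string has to be packed into a slot of the same width before it is compared.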
