First OpenCL program

Get the source code for this example.

We are ready to create our first OpenCL program now. This will be a very simple program because we haven’t yet covered much of OpenCL C99 or the OpenCL API.

I recommend you go to the Khronos Group website and download the latest OpenCL specification. Even if you are not going to read it entirely, it is a great source of information. Just read the functions as you need them or as I point to them in this tutorial.

1. Setting up

1 – The first thing we have to do is create a new Visual C# Windows application.

2 – Next, we have to include a reference to Cloo and OpenCLTemplate. Use the menu Project -> Add reference to open the references dialog.

3 – Find the Cloo and OpenCLTemplate files (download them if you haven’t already) and add the references to your project. OpenCLTemplate depends on Cloo, so we should include both.

4 – Throw in a button to your form, name it btnOpenCL and set its Text property to OpenCL (this should be easy if you have any experience with visual languages).

Now we have the tools we need to start coding with OpenCL.

2. Creating the OpenCL source

The OpenCL source code is a string that follows the OpenCL C99 rules. Once again, refer to the Khronos Group website to download the OpenCL specification. You will probably want to download the C99 specification as well.

We will create a simple program that sums two vectors and stores the result in the first one. Let’s edit the click event of the button to execute our OpenCL code (which, in Visual C#, is done by double-clicking the button):

        private void btnOpenCL_Click(object sender, EventArgs e)
        {
            string vecSum = @"
                __kernel void
                floatVectorSum(__global float * v1,
                               __global float * v2)
                {
                    // Vector element index
                    int i = get_global_id(0);
                    v1[i] = v1[i] + v2[i];
                }";

            // (the click handler continues in the following sections)


Let’s cover some basic concepts of the OpenCL C99 here. There’s no need to worry, we’ll come back to the C99 topic later.

  1. Notice that the OpenCL source is a string which will be compiled by the OpenCL compiler;
  2. The “__kernel” qualifier tells OpenCL that we will call this function through the API;
  3. The “__global” qualifier tells OpenCL that v1 and v2 point to data in the device’s global memory;
  4. The “get_global_id(0)” call is an OpenCL built-in function that returns the index of the current worker. In our code, the task of the i-th worker is to sum the i-th components of v1 and v2. The (0) is the dimension id we want, which we will discuss later.

As you can see, this code is quite simple. What makes it simple is the fact that each sum is completely independent of the others.
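To make the worker model concrete, here is a minimal sketch in plain C (not OpenCL C, and not part of OpenCLTemplate) that emulates what the device does with the kernel above. The names vector_sum_work_item and run_vector_sum are made up for this illustration, and the loop variable plays the role of the value returned by get_global_id(0). A real device runs the workers in parallel and in no guaranteed order, which is only safe here because no worker touches another worker’s element.

```c
#include <assert.h>
#include <stddef.h>

/* Body of one worker: the same statement as in the floatVectorSum kernel.
   "gid" stands in for the value get_global_id(0) would return. */
void vector_sum_work_item(size_t gid, float *v1, const float *v2)
{
    v1[gid] = v1[gid] + v2[gid];
}

/* Hypothetical stand-in for the device: runs one worker per element.
   A real device would run these in parallel, in any order. */
void run_vector_sum(float *v1, const float *v2, size_t n)
{
    assert(v1 != NULL && v2 != NULL);
    for (size_t gid = 0; gid < n; gid++)
        vector_sum_work_item(gid, v1, v2);
}
```

Calling run_vector_sum(a, b, 4) on two four-element arrays leaves the element-wise sums in a and does not modify b, just as the kernel stores its result in v1.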

3. Creating the host code

Now that we have the OpenCL source, we need to proceed to create the host code that will initialize OpenCL, compile the code, transfer the necessary data to the device, execute the kernel and read the data back.

3.1 Initializing OpenCL and compiling the code

OpenCL initialization should identify all available devices and create the Command Queues, which are used to tell OpenCL which device should execute which command. I have encapsulated all this initialization into the InitCL function. Compiling the code shouldn’t be a problem. After compilation, it is necessary to use the OpenCL API to get host access to the OpenCL kernel so we can order its execution later. We’re going to keep adding code to the click event:

            //Initializes OpenCL Platforms and Devices and sets everything up
            OpenCLTemplate.CLCalc.InitCL();

            //Compiles the source codes. The source is a string array because the user may want
            //to split the source into many strings.
            OpenCLTemplate.CLCalc.Program.Compile(new string[] { vecSum });

            //Gets host access to the OpenCL floatVectorSum kernel
            OpenCLTemplate.CLCalc.Program.Kernel VectorSum = new OpenCLTemplate.CLCalc.Program.Kernel("floatVectorSum");

OpenCLTemplate sets the first GPU as the default device to execute OpenCL commands. If you want to select a different device, set OpenCLTemplate.CLCalc.Program.DefaultCQ to the number of the device that you want to use. The devices are listed by the OpenCLTemplate.exe program, and the device information is stored in the list OpenCLTemplate.CLCalc.CLDevices.

TIP: If you don’t want to write OpenCLTemplate every time, include the “using OpenCLTemplate” with the other “usings”.

3.2 Creating the vectors

We need to create the vectors we want to sum. This is done in host code. Let’s keep on adding code to the click event:

            //We want to sum 2000 numbers
            int n = 2000;

            //Create vectors with 2000 numbers
            float[] v1 = new float[n], v2 = new float[n];

            //Creates population for v1 and v2
            for (int i = 0; i < n; i++)
            {
                v1[i] = (float)i / 10;
                v2[i] = -(float)i / 9;
            }

3.3 Writing to the device memory

We need to send the data to the device memory. To do so, we create variables that will hold v1 and v2 in the DEVICE. The Variable constructor also copies the contents of the host array to the device.

            //Creates vectors v1 and v2 in the device memory
            OpenCLTemplate.CLCalc.Program.Variable varV1 = new OpenCLTemplate.CLCalc.Program.Variable(v1);
            OpenCLTemplate.CLCalc.Program.Variable varV2 = new OpenCLTemplate.CLCalc.Program.Variable(v2);

Remember: variables v1 and v2 are the vectors in HOST memory. Variables varV1 and varV2 are vectors in the DEVICE memory. These are usually different memories (unless you are targeting your CPU to execute the OpenCL code, which is better in some cases).

(This wraps the OpenCL API clEnqueueWriteBuffer function.)

3.4 Executing the kernel

You would call a regular C# function by invoking Function(param1, param2, etc). To invoke an OpenCL function you need to specify the parameters AND how many workers you want. Remember that you will always have to know: what does each worker do? How many workers do I need?

In our case, each worker sums the i-th component of varV1 and varV2 (remember, OpenCL executes on the DEVICE, which means it reads device memory). This means we need as many workers as the dimension of varV1 and varV2. If each worker summed two components of varV1 and varV2, only half as many workers would be needed. This will be discussed in Capabilities and Limitations.
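The trade-off between work per worker and number of workers can be sketched in plain C. This is an illustration only: the function names are hypothetical, and the loop stands in for the device launching workers. Each worker handles elements 2*gid and 2*gid + 1, so n elements need (n + 1) / 2 workers. Note that this variant has to know n to stay within bounds when n is odd, so a real kernel written this way would need the length as an extra argument, which the tutorial’s kernel does not have.

```c
#include <assert.h>
#include <stddef.h>

/* Number of workers needed when each worker sums two elements. */
size_t workers_for_pairs(size_t n)
{
    return (n + 1) / 2;   /* rounds up so an odd n is still covered */
}

/* One worker: sums elements 2*gid and 2*gid + 1, if they exist. */
void vector_sum_pair_work_item(size_t gid, float *v1, const float *v2, size_t n)
{
    size_t first = 2 * gid;
    for (size_t i = first; i < first + 2 && i < n; i++)
        v1[i] = v1[i] + v2[i];
}

/* Hypothetical stand-in for the device launching the workers. */
void run_vector_sum_pairs(float *v1, const float *v2, size_t n)
{
    assert(v1 != NULL && v2 != NULL);
    size_t workers = workers_for_pairs(n);
    for (size_t gid = 0; gid < workers; gid++)
        vector_sum_pair_work_item(gid, v1, v2, n);
}
```

For the 2000-element vectors used here, workers_for_pairs(2000) gives 1000 workers instead of 2000.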

The number of workers can be specified in multiple dimensions, i.e., the worker could be the i-th worker or the i-th, j-th worker for two dimensions. In our case, we only need one dimension.
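For two dimensions, each worker is identified by a pair of indices, which a kernel would obtain from get_global_id(0) and get_global_id(1). The plain C sketch below emulates such a launch with two nested loops. The function names and the row-major storage layout are assumptions made for this illustration, not something OpenCL or OpenCLTemplate prescribes.

```c
#include <assert.h>
#include <stddef.h>

/* One worker of a two-dimensional launch: (i, j) stand in for
   get_global_id(0) and get_global_id(1). The matrices are stored
   row-major (an assumption of this sketch), so element (i, j)
   lives at index i * cols + j. */
void matrix_sum_work_item(size_t i, size_t j, size_t cols,
                          float *m1, const float *m2)
{
    size_t idx = i * cols + j;
    m1[idx] = m1[idx] + m2[idx];
}

/* Emulates a launch with rows x cols workers. */
void run_matrix_sum(float *m1, const float *m2, size_t rows, size_t cols)
{
    assert(m1 != NULL && m2 != NULL);
    for (size_t i = 0; i < rows; i++)
        for (size_t j = 0; j < cols; j++)
            matrix_sum_work_item(i, j, cols, m1, m2);
}
```

On the host side this would presumably correspond to a two-element workers array such as new int[] { rows, cols }; treat that mapping as an assumption and check it against the OpenCLTemplate source.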

We still need to tell the VectorSum kernel that its arguments are varV1 and varV2, and that it will need one worker for each element of varV1:

            //Arguments of VectorSum kernel
            OpenCLTemplate.CLCalc.Program.Variable[] args = new OpenCLTemplate.CLCalc.Program.Variable[] { varV1, varV2 };

            //How many workers will there be? We need "n", one for each element
            int[] workers = new int[1] { n };

            //Execute the kernel
            VectorSum.Execute(args, workers);

Notice that what __global float * v1 and __global float * v2 in the kernel actually see are the varV1 and varV2 variables in device memory.

(This wraps the OpenCL API clEnqueueNDRangeKernel function.)

3.5 Retrieving data

The last step is to retrieve the information we need from the device memory.

            //Read device memory varV1 to host memory v1
            varV1.ReadFromDeviceTo(v1);
        } // end of the btnOpenCL_Click handler

(This wraps the OpenCL API clEnqueueReadBuffer function.)
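The whole round trip (write to the device, execute, read back) can be pictured with a plain C sketch in which separately allocated arrays play the role of device memory. Everything here is hypothetical scaffolding for illustration: vector_sum_round_trip is not an OpenCL or OpenCLTemplate function, and memcpy merely stands in for the transfers that the write and read calls perform. The point it shows is that the host array only changes at the final read-back step.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Emulates the host/device round trip for the vector sum.
   Returns 0 on success, -1 if the "device" buffers cannot be allocated.
   host_v1 is only updated by the final read-back step. */
int vector_sum_round_trip(float *host_v1, const float *host_v2, size_t n)
{
    assert(host_v1 != NULL && host_v2 != NULL);
    if (n == 0)
        return 0;   /* nothing to transfer or compute */

    size_t bytes = n * sizeof(float);

    /* "Device" memory: separate buffers, like varV1 and varV2. */
    float *dev_v1 = malloc(bytes);
    float *dev_v2 = malloc(bytes);
    if (dev_v1 == NULL || dev_v2 == NULL) {
        free(dev_v1);
        free(dev_v2);
        return -1;
    }

    /* Step 1: write host data to the device. */
    memcpy(dev_v1, host_v1, bytes);
    memcpy(dev_v2, host_v2, bytes);

    /* Step 2: execute one worker per element, on device memory only. */
    for (size_t gid = 0; gid < n; gid++)
        dev_v1[gid] = dev_v1[gid] + dev_v2[gid];

    /* Step 3: read the result back into host memory. */
    memcpy(host_v1, dev_v1, bytes);

    free(dev_v1);
    free(dev_v2);
    return 0;
}
```

If step 3 were skipped, host_v1 would still hold its original values even though the sum had been computed, which is why the read-back is needed before inspecting v1 on the host.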

4. Conclusion

Try playing with the code you just made. Create some breakpoints to check if the sum was done correctly. Try changing the kernel so that it calculates the product of the elements of the vectors. Have some fun with the code.

OpenCL coding doesn’t get much more complicated than this unless you want to create games or very heavy real-time applications. If you are like me and need to do big calculations at once and display results in the end, this is just fine. If you want to make an OpenGL game or a real-time image processing application, I suggest you consider switching to C++ and doing your own fine-tuning for OpenCL.

There are still many points which have not been covered, but this code shows the basics of running OpenCL. If you want to ask something, use the Forum so the answer can help others as well.



18 thoughts on “First OpenCL program”

  1. Hi guys, I’m an Italian programmer and I’m approaching with OpenCL.

    I’m doing some tests with OpenCL template, but I’ve some problems..
    Please can you help me?

    For example:
    __kernel void Test(__global int* count)

    C# CODE:
    OpenCLTemplate.CLCalc.Program.Kernel test;
    test= new OpenCLTemplate.CLCalc.Program.Kernel(“Test”);

    int[] count = new int[1];
    OpenCLTemplate.CLCalc.Program.Variable _result;
    OpenCLTemplate.CLCalc.Program.Variable[] _args;
    _result = new OpenCLTemplate.CLCalc.Program.Variable(count);
    _args = new OpenCLTemplate.CLCalc.Program.Variable[] { _result };
    test.Execute(_args, new int[] { 100, 100 });

    When I get the result, instead of have (100*100), I always have random numbers.

    Thank you very much.


      1. Thank you so much.

        I have a problem.
        I installed all step.
        1. Visual studio 2015.
        2.Driver GPU Geforce (newest)
        3. CUDA 8.0
        4. Dowload OpenCLTemplate File

        I dont know, file OpenCLTemplate.EXE not execute. Why?

        Please help me again. Thanks!

        1. Is your GPU included in this list?

          NVidia GPUs

          NVidia supports OpenCL well on devices with Compute Capability 1.3 and up, which are the following:

          GeForce GTX 260 and up.
          GeForce GTX 400 series.
          GeForce GTX 500 series.
          Tesla C/S 1060 and up.
          Quadro FX 4800 and 5800
          Older GPUs (with compute capability 1.0 to 1.2) won’t get really good speed-ups. It might be of interest when you have SLI. The GPUs not listed above from the following series have minimal support for OpenCL, but the list could have some mistakes due to generalisations:
          GeForce 100 series.
          GeForce 200 series.
          GeForce 8000 series.
          GeForce 9000 series.
          Tesla C/D/S 870.
          Quadro FX. Check this table if it mentions CUDA-cores.
          Recent drivers have the OpenCL runtime. The SDK can be downloaded from It might seem a bit strange you need the CUDA-SDK to develop OpenCL, but they chose to bundle the two.

          1. My graphics card is Nvidia GeForce 920M. It has support Open CL.

            My computer has Processer: Intel(R) Core(TM)i5-5200U, CPY 2.2GHz. Ram: 8Gb, Window 10, 64 bit.

            I installed:
            1. Visual Studio 2015
            2. Driver Nvidia Geforce 920M
            3. CUDA Toolkit 8.0

            I excute OpenCLTemplate.exe but it do not run. Why?
            Please help me! Actually, i very want to study Open CL.

            Thank so much!

      2. So, OpenCLTemplate.exe is probably too old for the new openCL versions. Douglas said he will update the program this weekend. Then we shall give you a proper feedback.

          1. Hi Thuan! I’m sorry, Douglas didn’t have time to do it, only he knows how to do. I’m still asking him not to forget. He first thought that the problem was the and updated it. Maybe it will work for you now. But I think you still don’t have the right drivers or compatible hardware.
            You could try some official software from

    1. You can just debug the program in Visual Studio or print to the console with Console.WriteLine(output).

  2. I have a problem when add reference of opencltemplate.dll.
    after adding this file, there is error sign which say ‘this project cannot be viewed in object browser because it is unavailable or not yet built….’
    What should I do ?

    1. I’m not sure, but it probably has to do with the .net framework version you are using, which might not be compatible. Try downgrading the version.
