1. Introduction
No matter which binding you use to create OpenCL host code, you will need to write your kernels in OpenCL C99. You may choose the easy-to-use but somewhat limited OpenCLTemplate, the great C# Cloo, the higher degree of control offered by C++ and native API calls, and so on, but the C99 source will always be portable across them.
In my opinion, this is what makes this part of the tutorial the most important one, next to the kernel execution structure.
If you haven’t done so yet, I suggest you download OpenCLTemplate. Even if you don’t want to use OpenCLTemplate as a binder, you can still use CodeChecker to compile your programs. It is practical because it highlights the most common OpenCL functions and lets you verify that your code compiles before you have to recompile your entire program. As you can see in the screen below, it also serves as a quick reference for common functions.
1.1 Main Differences
Below is a summary of the most relevant differences between regular C99 and OpenCL C99, with my take on whether each difference is positive or negative:

- Function qualifier __kernel (N/A): This is necessary in order to communicate with the Host code. The "__kernel" (or just "kernel") qualifier tells the OpenCL API that the function can be called from the Host code. Functions qualified with kernel have to return void, but they can be called by other functions and, in this case, they behave as regular functions. In other words, kernel functions are regular functions in every way, with the extra feature of being callable from the Host.

- Work-item functions (N/A): OpenCL C99 can’t live without these functions, like get_global_id() and get_global_size(). These are the functions that allow the OpenCL programmer to assign a different task to each worker.

- Built-in vectors (very positive): Vector data types make life much easier when developing code that deals with 3D. Using vectors allows the programmer to parallelize code that would otherwise have to run sequentially. For example, you could multiply 4 floats by 2 (4 multiplications) or multiply one float4 by 2 (much faster). This is what GPUs are meant to do!

- Built-in math functions and common functions (very positive): This is very important. It is necessary to know which functions are built in, because math and common functions have hardware implementations. For instance, you don’t want to interpolate with z = min + (max - min)*t; when you could use z = mix(min, max, t);. Hardware functions, such as the native_ functions and mad, are much better if you don’t have strict precision requirements.

- Built-in geometric functions (positive): If you are not manipulating 3D content, the geometric functions may not be so useful. Either way, it is very good to have hardware-implemented cross and dot products for vector manipulation.

- Compilation at execution time (positive): Usually, the user won’t create his own OpenCL C99 code for runtime compilation. But the fact that it is possible is just amazing; it brings incredible possibilities to software. I mean, two years ago it would have been very difficult to almost instantly create a calculator with all the built-in trigonometric functions.

- Address space qualifiers (N/A): These control where you can use a given variable and how fast you can access it. It is important to use the faster __local and __private memory to speed up the algorithm, but it may not always be possible to avoid accessing objects in global memory.

- No recursion allowed (negative): Not many algorithms need to run recursively nowadays, but not being able to use recursion is surely a step backwards. Recursion would probably sacrifice too much performance to be allowed in parallel programming models.

- Templating not allowed (negative): It is necessary to compile one OpenCL program for each data type you want to use. This is a point where NVidia CUDA is better, but it is still no big deal: it is always possible to replace strings in the source to get the desired effect.

- Dynamic allocation not allowed (very negative): This may look a bit old-fashioned, and it surely resembles old FORTRAN times. This limitation may make it difficult to adapt an algorithm for broad applications, but I don’t think it will be a problem that is impossible to overcome.

- No built-in random number generator (very negative): The reason I consider the absence of random number generators a very negative aspect is that it impacts evolutionary and learning algorithms, which use random numbers to evolve. Since these algorithms are parallel by their very nature, this limits a class of algorithms that would benefit the most from parallel processing.
2. The OpenCL C99 structure
I assume that the programmer reading this tutorial is familiar with C and C#. If this is not the case I suggest referring to the ISO/IEC 9899:1999 specification and learning C#, for example, from Microsoft’s MSDN.
2.1 Flow Control
Flow control in OpenCL C99 uses regular commands programmers already know:
while (condition) { }
if (condition) { } else { }
for (int i = 0; i < N; i++) { }
I assume that the programmer reading this tutorial is already familiar with this.
2.2 Kernel qualifiers
The kernel or __kernel qualifier indicates that a function may be called directly by the Host code via the OpenCL API. There is not much to explain here: a kernel function has the following properties:
- The host can create a handle to call the kernel function via the API;
- __kernel or kernel functions (the __ is optional) must return void;
- kernel functions can be called from other functions, in which case they behave like normal functions.
As a last reminder, keep in mind that C functions have to be defined before they can be used.
2.3 Work-item functions
Work-item functions are the only way workers (or threads, or work-items, as you prefer) can identify themselves and know which part of the data they will handle. Each worker needs to know its identification, i.e., that it is the i-th worker (or the (i, j)-th if the work dimension is 2, and the (i, j, k)-th if the dimension is 3).
So we have the following work-item functions (the argument is the dimension index: 0, 1 or 2):

- get_global_id(): the work-item function that lets the worker know its ID among all workers;
- get_local_id(): the function used to identify the worker inside its local workgroup;
- get_global_size(): returns the total number of global workers in the desired dimension. Unlike the ID functions, this value could just as easily have been passed in as a kernel argument.
- get_local_size(): returns the size of the local workgroup. We will discuss local IDs and local workgroups in the advanced aspects of OpenCL C99, because using them will usually also involve synchronization inside kernels.
You may want to go back to the kernel execution structure tutorial and take another look at the OpenCL C99 code that shows local and global IDs, now that we have covered the work-item functions.
3. Worked example
Worked examples and source code are usually very useful for understanding the concepts.
Let’s create a simple OpenCL program that sums the components of a vector v[n]. Remember, since we can’t lock structures in OpenCL C99, we can’t just write:
kernel void sum(global float * vec, global float * sum)
{
    int i = get_global_id(0);
    sum[0] += vec[i];
}
And then invoke the kernel with work_dim = 1 and n workers, i.e., global_size = int[1] { n }.
THAT WILL NOT WORK!!! If you are still unfamiliar with why it won’t work I advise you to study OpenCL Capabilities and Limitations.
3.1 Description of the algorithm
So what we need to do is divide the task and sum parts of the vector in a parallel fashion and compute the final result after this parallel step. Take a look at the picture below to understand the strategy:
- We will sum each row using float4 vectors;
- Then we sum the vector results of the sum of rows;
- The final result is the sum of the 4 components of the last vector.
To implement this, it is necessary to interpret the vector as a matrix. Notice that a matrix can be rewritten as a vector: given a matrix M[n,m], we can rewrite M as a vector v by setting

v[i*m + j] = M[i,j];
This brings the limitation that the vector we want to sum must have length n*m. The strategy is to sum the values in the columns and, after that, sum the sums of the columns. We don’t want the number of rows n to be low, because we will have one worker summing each row and GPUs like to process many threads. On the other hand, we don’t want the number of columns to be low either, because we want to keep the work-items busy. We will impose one extra limitation: the number of columns has to be a multiple of 4, to take advantage of the GPU’s powerful vector sum capabilities.
I am going to focus on the OpenCL C99 code, workitem functions and limitations we are discussing in the topic. If you want to study the Host code and full implementation, I suggest you download the source code for this section (OpenCLTemplate and Cloo versions included).
The algorithm to be implemented sums the components of a vector v of length numRows * numCols, with numCols a multiple of 4.
The test vector v is built so that v[i] = i. This makes it easy to check the result: we know that 0+1+2+…+(n-1) = n(n-1)/2.
We will create the vector we want to manipulate in Host memory and copy it to the Device. Then, we will read the sum back into Host memory.
You may want to extend the results further and use this algorithm to evaluate integrals. The main advantage of this approach is that you can evaluate function values in parallel and no big amounts of data need to be transferred between Host and Device; only the final result of the integral is copied. There are two versions of the Host code: OpenCLTemplate and Cloo. Check the screenshot of the program below:
3.2 Summing the columns
Despite the fact that we are summing a matrix, we are going to use work_dim = 1, because each worker will sum one entire row. Since we need to know the number of rows AND columns of the associated matrix, it is necessary to pass the number of columns as a parameter. To make the algorithm more versatile, and to allow a simple function change to turn it into a maximum or minimum calculation, I have decided to create separate functions that combine the elements of the matrix. Notice the use of get_global_id(), the fact that the function F (in its scalar and vector forms) comes first, and that I needed to write two versions of the same function.
float4 VectorF(float4 a, float4 b)
{
    return a + b;
}

float ScalarF(float a, float b)
{
    return a + b;
}

__kernel void SumCols(__global float4 * v, __global int * numColsby4)
{
    //Worker identification
    int i = get_global_id(0);

    //Work sizes. Remember, numCols divided by 4
    int nRows = get_global_size(0);
    int nCols = numColsby4[0];

    float4 localSum = v[i * nCols];
    for (int j = 1; j < nCols; j++)
    {
        localSum = VectorF(localSum, v[j + i * nCols]);
    }

    v[i * nCols] = localSum;
}
Function F is the function to sum the vectors and, later on, the scalars to calculate the answer. You can easily calculate the maximum or minimum by replacing a+b with fmax(a,b) or fmin(a,b). Notice that we use the workitem function get_global_id to inform the worker which row it should sum.
3.3 Final sum and result
We still need to sum the row sums to compute a single vector that is the sum of all the vectors. After this, we return the sum of the four components of this vector, and that’s it.
__kernel void SumColSums(__global float4 * v, __global int * numColsby4, __global int * numRows, __global float * sum)
{
    //This is a task

    //numCols divided by 4, numRows informations
    int nCols = numColsby4[0];
    int nRows = numRows[0];

    float4 localSum = v[0];
    for (int i = 1; i < nRows; i++)
    {
        localSum = VectorF(localSum, v[i * nCols]);
    }

    sum[0] = ScalarF(localSum.x, ScalarF(localSum.y, ScalarF(localSum.z, localSum.w)));
}
The above kernel is a single task, executed by a single worker. Nonetheless, it still takes advantage of the GPU hardware by summing float4 vectors, and this is still better than copying a large set of data to Host memory.
All the Host needs to do now is copy the vector sum back to Host memory.
3.4 Remarks
It is interesting to notice that this algorithm produces a more robust sum with respect to round-off errors. If you just sum the numbers sequentially, the last numbers are added to an already very large accumulator, which increases round-off error.
You may want to play with the functions to calculate maximum or minimum elements of the vector instead of calculating its sum.
It would also be possible to use this code to quickly evaluate the integral of a function, with the advantage of being able to generate the function values using OpenCL C99 code. This is good for two main reasons: the evaluation itself is much faster, and it avoids data transfer between Host memory and Device memory.
In my tests I was able to get a performance boost of 68x, disregarding the time necessary for Host-Device data transfer (5000*5000 elements: 0.008 s on the GPU and 0.012 s on the CPU with OpenCL, against 0.52 s on the CPU using regular code).
4. Conclusion
We have implemented a simple yet very useful algorithm to efficiently use the GPU to sum vectors taking advantage of the GPU builtin vector sum. Using my hardware it was possible to speed up the algorithm by 68x disregarding data transfer to the Device.
Notice that the absence of templating requires one F function to sum vectors and another F function to sum the components of the final vector when assembling the result. Also, since this is C, notice that it is necessary to define functions before using them.
You may want to experiment with calculating maximum or minimum element of a vector by modifying the source code.
Download the source code for this section. There are two examples: one uses OpenCLTemplate and the other one uses Cloo.