# Capabilities and limitations

Like all technologies, parallel processing and OpenCL have strengths and weaknesses. The main strength of parallel processing (and thus OpenCL) is the ability to use all the processing power of CPUs and GPUs simultaneously to accelerate computing. The main disadvantage is that not all algorithms are 100% parallel, and it can be hard to re-code algorithms to run in parallel.

But what can OpenCL do and what CAN’T it do? Let’s discuss some important topics.

There are code examples in this section. I will briefly explain what they do and we will cover more details afterwards.

## 1. What can be done with OpenCL?

OpenCL can accelerate parallel mathematical calculations by a lot. If your algorithm can be easily parallelized, this is just for you. The possibilities are countless and we will cover some of them in the Case Studies of this tutorial. Some examples:

- Function root-finding and optimization;
- Image filtering and processing;
- Vertex and normal vector calculation for 3D models;
- Evolutionary algorithms like Genetic Algorithms or Ant Colony Systems;
- Particle interaction (not just collision);
- Mathematical scripting: you can compile code at execution time (this is awesome, wait until programmers discover this);
- MUCH more.

Data transfer and number of working threads are the key issues, along with doing most of the calculations with vectors.

| You want to do this: | What you don’t want: |
| --- | --- |
| Transfer all data the device needs (and no more) | Transfer data needlessly |
| Launch ONLY the amount of workers needed to do the job | Launch workers that do the same thing |
| Have each worker do a decent amount of math | Launch workers that don’t really do much |
| Read data back after the job is done | Execute kernels many times, transferring data in between executions (this can be hard to avoid!) |
| Convert the information into a vector and then operate with the vectors. | Operate with scalars that you could transform into a vector and make less operations. |

## 2. Is it possible to do the job with OpenCL?

This is an important question. Theoretically, the answer will always be “yes: just copy the code you would use on the CPU”. I’m not talking about that. Some code can’t run in parallel without some adaptation.

It is MANDATORY to keep in mind that the threads execute in parallel and that they will be accessing and storing data at the same time. An example follows:

### 2.1 Calculating the maximum of a list

1 2 3 1 7 10 11 30 28 65 33 1 2 3 10

The most straightforward way to do this is:

    float max = -1;
    for (int i = 0; i < x.Length; i++)
    {
        if (max < x[i]) max = x[i];
    }

How would you think about converting this to OpenCL? Well, my guess would be: we need to compare all the elements of x with the value of max. I can have the i-th worker compare x[i] with max and voilà, max will have the answer.

Create a new Visual Studio project for this. The Load event of the form should initialize OpenCL and compile the code:

    OpenCLTemplate.CLCalc.Program.Kernel kernelCalcMax;

    private void Form1_Load(object sender, EventArgs e)
    {
        // OpenCL source. Naive kernel: every worker compares its own
        // element against max[0] and may write to max[0].
        string calcMaxSrc = @"
            __kernel void
            calcMax(__global float * x,
                    __global float * max)
            {
                // Vector element index
                int i = get_global_id(0);
                if (max[0] < x[i]) max[0] = x[i];
            }";

        // Initialize OpenCL and compile only if acceleration is available
        CLCalc.InitCL();
        if (CLCalc.GLAcceleration == CLCalc.GLAccelerationType.UsingGL)
        {
            CLCalc.Program.Compile(new string[] { calcMaxSrc });
            kernelCalcMax = new CLCalc.Program.Kernel("calcMax");
        }
    }

As the program is loaded, we declare the OpenCL source, initialize OpenCL and compile the program.

I created a simple form for this test. The user types the list of numbers in the txtList textbox. The Click event of the calculate-maximum button reads the list from the textbox, calculates the true maximum and runs the OpenCL code:

    private void btnMaxWithOpenCL_Click(object sender, EventArgs e)
    {
        string[] s = txtList.Text.Split();

        float[] x = new float[s.Length];

        // Try to parse the input; entries that fail to parse become 0.
        // Absolute values are used, so -1 is a safe initial maximum.
        for (int i = 0; i < x.Length; i++)
        {
            float.TryParse(s[i], out x[i]);
            x[i] = Math.Abs(x[i]);
        }

        // Reference result: find the maximum of x on the CPU
        float[] max = new float[] { -1 };
        for (int i = 0; i < x.Length; i++)
        {
            if (max[0] < x[i]) max[0] = x[i];
        }

        lblMax.Text = max[0].ToString();

        // OpenCL calculation
        max[0] = -1;
        CLCalc.Program.Variable varx = new CLCalc.Program.Variable(x);
        CLCalc.Program.Variable varMax = new CLCalc.Program.Variable(max);

        // One worker per list element
        int[] workers = new int[] { x.Length };
        CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[] { varx, varMax };

        kernelCalcMax.Execute(args, workers);

        // Read the result back from the device into max
        varMax.ReadFromDeviceTo(max);

        lblCLMax.Text = max[0].ToString();
    }

Try executing the code now. OpenCL does not reliably get the correct result! On my machine I always get 2 or 3 as the result.

WHY?

Let’s recall the OpenCL code:

    __kernel void
    calcMax(__global float * x,
            __global float * max)
    {
        // Vector element index
        int i = get_global_id(0);
        if (max[0] < x[i]) max[0] = x[i];
    }

Remember, there is no set order in which the workers will execute. So this may occur (note: x[0] = 1, x[1] = 2):

1. Worker 0 reads max and x[0]. Since max is -1, the if test passes;
2. Worker 1 reads max and x[1]. max is still -1, so the test passes;
3. Worker 1 writes max = 2;
4. Worker 0 (which got to the writing point later) writes max = 1.
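The interleaving in the steps above can be replayed deterministically on the host. The following is a minimal sketch in plain C rather than OpenCL; the `Worker` struct and the `worker_read`, `worker_write` and `replay_race` names are hypothetical helpers introduced only for illustration. Each worker's `if` test is split into a read phase and a write phase so that the order of the four steps can be forced by hand.

```c
#include <assert.h>
#include <stddef.h>

/* One worker's private state between reading and writing. */
typedef struct {
    float x_i;       /* the element this worker owns */
    int will_write;  /* outcome of the "if (max < x[i])" test */
} Worker;

/* Read phase: look at the shared max and decide whether to write. */
void worker_read(Worker *w, const float *shared_max, float x_i) {
    assert(w != NULL && shared_max != NULL);
    w->x_i = x_i;
    w->will_write = (*shared_max < x_i);
}

/* Write phase: act on the (possibly stale) decision from the read phase. */
void worker_write(const Worker *w, float *shared_max) {
    assert(w != NULL && shared_max != NULL);
    if (w->will_write) *shared_max = w->x_i;
}

/* Replays the four steps for x[0] = 1, x[1] = 2 and returns the final max. */
float replay_race(void) {
    const float x[2] = {1.0f, 2.0f};
    float max = -1.0f;
    Worker w0, w1;
    worker_read(&w0, &max, x[0]);   /* step 1: test passes, max is -1 */
    worker_read(&w1, &max, x[1]);   /* step 2: test passes, max is still -1 */
    worker_write(&w1, &max);        /* step 3: max becomes 2 */
    worker_write(&w0, &max);        /* step 4: max becomes 1, overwriting 2 */
    return max;
}
```

Calling `replay_race()` returns 1 even though the true maximum is 2, because worker 0 decided to write before worker 1 had stored its value.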

And this is how we get the wrong result. Of course, it is possible to solve this with a divide-and-conquer algorithm, but that is beyond the scope of this section. I suggest you use the processor to find the maximum.
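For reference only, one common form of the divide-and-conquer idea is a pairwise (tree) reduction. The sketch below is plain C that simulates it on the host and is not taken from OpenCLTemplate; `max_pass` and `reduce_max` are hypothetical names. The point it tries to show is that in each pass worker `i` writes only to slot `i` and reads only slots that no worker writes in that pass, so the conflict described above does not arise. On a real device each pass would be a separate kernel launch (or a barrier-separated step), which is an assumption not covered here.

```c
#include <assert.h>
#include <stddef.h>

/* One reduction pass over the first n elements of buf.
   Each loop iteration stands in for one worker: worker i compares
   buf[i] with buf[i + half] and writes only to buf[i]. */
void max_pass(float *buf, size_t n, size_t half) {
    for (size_t i = 0; i < half; i++) {
        size_t j = i + half;
        if (j < n && buf[j] > buf[i]) buf[i] = buf[j];
    }
}

/* Reduces buf in place and returns the maximum, left in buf[0].
   Requires n > 0. The contents of buf are overwritten. */
float reduce_max(float *buf, size_t n) {
    assert(buf != NULL && n > 0);
    while (n > 1) {
        size_t half = (n + 1) / 2;  /* round up so an odd middle element survives */
        max_pass(buf, n, half);
        n = half;
    }
    return buf[0];
}
```

With the 15-element list from section 2.1 this takes four passes (8, 4, 2 and 1 workers) instead of 15 sequential comparisons, at the cost of overwriting the input buffer.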

## 3. How to debug the code?

Right now, the tools to debug the OpenCL code are not very good, in my opinion. What I have been doing is writing C# code in an OpenCL fashion, i.e., functions that don’t return data (void) and receive vectors of floats and ints.

I use a for loop to go through what would be the OpenCL code. In the actual OpenCL code I remove the for loop and include the get_global_id’s and get_global_size’s as necessary. This will be discussed in the OpenCL C99 section.
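The approach described above can be sketched as follows. The text uses C#; this sketch uses plain C for brevity, and the vector-sum kernel and the names `vec_sum_kernel` and `run_vec_sum` are assumptions chosen for illustration. The kernel body is a `void` function over arrays that receives the worker index as a parameter, and a `for` loop plays the role of the OpenCL runtime.

```c
#include <assert.h>
#include <stddef.h>

/* Kernel body written "OpenCL style": returns void, works on arrays.
   gid stands in for the value get_global_id(0) would return. */
void vec_sum_kernel(size_t gid, const float *a, const float *b, float *out) {
    out[gid] = a[gid] + b[gid];
}

/* Host-side stand-in for launching `workers` work-items, one after another. */
void run_vec_sum(size_t workers, const float *a, const float *b, float *out) {
    assert(a != NULL && b != NULL && out != NULL);
    for (size_t gid = 0; gid < workers; gid++) {
        vec_sum_kernel(gid, a, b, out);
    }
}
```

When porting to OpenCL, the loop disappears and `gid` is replaced by `get_global_id(0)`. Note that the loop runs the workers strictly in order, so it can catch indexing and arithmetic mistakes but not ordering problems like the one in section 2.1.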

## 4. What is really important?

You can find the complete paper here.

I will only discuss NVidia’s high priority recommendations. OpenCLTemplate is designed to be an easy general-purpose tool so you can quickly test your OpenCL code. If you need extreme fine-tuning I recommend you directly call the OpenCL API functions with the proper fine-tunings. Besides, some tunings are platform specific, which makes things even more difficult.

| NVidia recommendation | Relevance (my opinion) | Comments |
| --- | --- | --- |
| To get the maximum benefit from OpenCL, focus first on finding ways to parallelize sequential code. | Crucial | I can’t stress this enough. You NEED to create efficient parallel code. And this involves calculating with VECTORS instead of scalars, e.g., transform a series of 4 sums of floats into a single sum of float4. |
| Use the effective bandwidth of your computation as a metric when measuring performance and optimization benefits. | Low | What really matters is how much faster the computation is. |
| Minimize data transfer between the host and the device, even if it means running some kernels on the device that do not show performance gains when compared with running them on the host CPU. | Very important | Like we discussed before, we want to transfer the least possible data. If some data can be generated on the device using OpenCL, do it. |
| Ensure global memory accesses are coalesced whenever possible. | Medium | Coalesced access means, roughly speaking, accessing OpenCL memory in worker order. This is only possible in very specific cases. |
| Minimize the use of global memory. Prefer shared memory access where possible. | Important | Not only shared memory, local memory too. If feasible, the first thing each kernel should do is copy needed resources to local memory. Shared memory is beyond scope here. |
| Avoid different execution paths within the same warp. | Important | This has to do with the quality of the parallel code. Each worker should do the same work with different data. |
| Use the `-cl-mad-enable` build option. | Medium | This allows the compiler to optimize instructions of the type a*b+c. If your code doesn’t have this type of calculation, then there’s nothing to worry about. You might lose precision though! Either way, you can always use the specific OpenCL C99 function. |

### 4.1 Data transfer example

As another simple example, let’s write code that copies 1000 elements of a vector to Device memory and then reads them back, executing 9000 read/write operations. We will compare the execution time with an empty loop, just to see how important it is to minimize data transfer.

I created a simple form for this example, and the difference is huge. My Device spent 2 seconds transferring sizeof(float)*9E6 bytes = 34.3 MB of data back and forth. Besides, reading data from the device forces synchronization, i.e., the host code can’t proceed until the Device has finished processing everything that has been queued, which may not be wanted. You can look into the source if you want to know exactly how it was done.
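The arithmetic behind the figures quoted above (1000 floats per transfer, 9000 transfers, 2 seconds) can be checked with a small sketch. This is plain C with hypothetical helper names (`total_bytes`, `effective_mib_per_s`); it only does the bookkeeping and does not touch any device.

```c
#include <assert.h>
#include <stddef.h>

/* Total bytes moved: elements per transfer x number of transfers x element size. */
double total_bytes(size_t elements, size_t transfers, size_t elem_size) {
    return (double)elements * (double)transfers * (double)elem_size;
}

/* Effective bandwidth in MiB/s (1 MiB = 1024 * 1024 bytes). */
double effective_mib_per_s(double bytes, double seconds) {
    assert(seconds > 0.0);
    return bytes / (1024.0 * 1024.0) / seconds;
}
```

For example, `total_bytes(1000, 9000, sizeof(float))` gives 36,000,000 bytes on a platform with 4-byte floats, which is about 34.3 MiB, and `effective_mib_per_s` with 2 seconds gives roughly 17 MiB/s. Moving data in many small pieces costs far more than the raw size suggests.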

## 5. Conclusion

OpenCL is a very powerful tool if all calculations can be done in a parallel way. It’s ok to call multiple kernels, provided you do not need to transfer large amounts of data between kernel executions.

There are some other optimizations that include having the number of workers a multiple of 32 and some other hardware-specific tunings. In my opinion, we can just let the API decide how to create the local work groups because eventually the manufacturers will optimize the workgroup creation.
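If you do want the number of workers to be a multiple of 32, one common pattern is to round the worker count up and have the extra workers return immediately. The sketch below is plain C with hypothetical names (`round_up`, `scale_kernel`); the scaling kernel is only a placeholder for whatever the real kernel does.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of `multiple` that is greater than or equal to n. */
size_t round_up(size_t n, size_t multiple) {
    assert(multiple > 0);
    return ((n + multiple - 1) / multiple) * multiple;
}

/* Kernel-style body with a bounds guard. gid stands in for
   get_global_id(0); n is the real number of elements. */
void scale_kernel(size_t gid, size_t n, float *x, float factor) {
    if (gid >= n) return;  /* padding worker: nothing to do */
    x[gid] *= factor;
}
```

For 1000 elements, `round_up(1000, 32)` gives 1024 workers, and the last 24 do nothing. The guard matters: without it the padding workers would read and write past the end of the buffer.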