Advanced aspects of the C99 OpenCL language

In this topic, we are going to discuss what makes OpenCL C99 different from regular C99: it runs in parallel.

To understand this topic, you must keep in mind at all times that the workers execute simultaneously, share resources, and that the hardware scheduler, not your code, decides the relative timing of their operations.

First we will look at worker execution paths: keeping them uniform, roughly speaking, ensures that the workers share the load evenly and get the most out of the GPU.

Then, we will discuss local variables and workgroup synchronization, which allow us to take advantage of the fast __local memory that can be shared among workers inside the same workgroup.

1. Avoiding different execution paths in the same warp

In the Capabilities and Limitations section, we discussed that it is important to avoid different execution paths in the same warp.

A warp is not the whole kernel launch: it is a small group of work-items (typically 32 on NVidia hardware; AMD calls its equivalent a wavefront, typically 64 work-items) that the GPU executes in lockstep, sharing a single instruction stream. The work-items you launch with clEnqueueNDRangeKernel are split by the hardware into warps.

But what exactly does it mean to avoid different execution paths?

It may be easier to use an example. The worst thing that can happen in a kernel, concerning execution paths, is:

kernel void myKernel(global float* data)
{
   int i = get_global_id(0);
   if (data[i] > 0.0f)          // an arbitrary, data-dependent condition
   {
      data[i] = sqrt(data[i]);  // do work
   }
}

As you can see, some work-items will be launched and do nothing at all. This is not good. As a rule of thumb, keep in mind that launching a worker is not free, and you want every worker to do useful work.

Remember that vector sum kernel almost every OpenCL tutorial posts as an example? It is not very efficient, because each worker performs only a single addition.
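
If each work-item instead handles several elements, the cost of launching it gets amortized. Here is a minimal sketch of that idea (the kernel name and the strided-loop pattern are mine, for illustration, not the tutorial's actual code):

kernel void vectorSumManyPerWorker(global const float* a,
                                   global const float* b,
                                   global float* result,
                                   int n)
{
   int i = get_global_id(0);
   int stride = get_global_size(0);

   // Each work-item sums n/stride pairs instead of a single one,
   // so every launched worker does a meaningful amount of work.
   for (int k = i; k < n; k += stride)
      result[k] = a[k] + b[k];
}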

Another thing to avoid is:

kernel void myKernel(global float* data)
{
   int i = get_global_id(0);
   if (data[i] > 0.0f)
   {
      data[i] = sqrt(data[i]);             // do work
   }
   else
   {
      data[i] = 2.0f * data[i] + 1.0f;     // do something completely different
   }
}

You would prefer something like:

kernel void myKernel(global float* data)
{
   int i = get_global_id(0);
   if (data[i] > 0.0f)
   {
      data[i] = 2.0f * data[i] + 1.0f;     // do work
   }
   else
   {
      data[i] = 0.5f * data[i] - 1.0f;     // same operations, same order, different data
   }
}
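
In fact, when the two branches share the same operations, you can often drop the branch altogether and let the condition choose the data. A minimal sketch of the same kernel using the built-in select() (for scalars, select(a, b, c) returns b when c is non-zero; the coefficients match the example above):

kernel void myKernelNoBranch(global float* data)
{
   int i = get_global_id(0);
   int positive = (data[i] > 0.0f);          // scalar comparisons return 0 or 1

   // Every work-item runs the exact same multiply-add; only the
   // coefficients differ, so the warp never diverges.
   float a = select(0.5f, 2.0f, positive);
   float b = select(-1.0f, 1.0f, positive);
   data[i] = a * data[i] + b;
}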

I don’t work for AMD or NVidia, so I can’t give you the implementation details of exactly why this is bad. What I do know is that the work-items in a warp share one instruction stream, so when they disagree on a branch, the hardware has to run the two paths one after the other, with part of the warp sitting idle each time.

Of course, it is not always possible to completely get rid of different execution paths. In those cases, try to split the execution only at the end, so that the workers keep the same execution path for as long as possible.

In the Collision case study, you can see me applying exactly what I’m describing here. There is an algorithm to detect line-triangle crossing that can be summarized like this:

void CheckCollision(line L, triangle T)
{
   Calculate important vectors
   Calculate collision point
   if (Collision detected)
   {
      Save information to global memory
   } 
}

As you can see, there is a slight deviation from a straight execution path in the end, where I write to global memory, but the execution remains aligned most of the time.

At times it may be impossible to keep the execution paths aligned. In those cases, try at least to make sure that the workload of each worker is the same on average.

2. Local workgroup synchronization

2.1. Workgroups and __local variables

Variables qualified with __local (or local) are shared among the work-items that belong to the same workgroup. Nowadays a typical local workgroup size is 256 work-items.
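
If your kernel depends on a particular workgroup size, you can bake it in with the reqd_work_group_size attribute, which also lets you size __local arrays to match. A sketch, assuming the size 256 mentioned above and an illustrative kernel name (barrier is explained in section 2.2):

kernel __attribute__((reqd_work_group_size(256, 1, 1)))
void reverseInGroup(global const float* in, global float* out)
{
   __local float scratch[256];        // one slot per work-item in the group
   int li = get_local_id(0);

   scratch[li] = in[get_global_id(0)];
   barrier(CLK_LOCAL_MEM_FENCE);      // see section 2.2
   out[get_global_id(0)] = scratch[255 - li];   // reversed within the workgroup
}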

Workgroups are groups of work-items that can share resources, and they should be used wisely, because accesses to __local memory are much faster than accesses to __global memory. Why not make all memory fast and shared? Well, ask ATI and NVidia; it has to do with the hardware itself. It must be rather difficult to implement memory that is at the same time global, shared and very fast. The picture below should help things become clear:

[Figure: the OpenCL memory model. Global memory is accessible from every work-item; each workgroup's local memory is visible only inside that workgroup.]

Notice in the picture above how a global variable can be accessed from anywhere while local variables, in contrast, cannot be accessed outside their workgroup.
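
To make the picture concrete, here is a small sketch (with an illustrative kernel name) showing how the ID built-ins relate; each work-item has a global ID, plus a group ID and a local ID inside its workgroup:

kernel void whoAmI(global int* globalIds)
{
   int gid   = get_global_id(0);    // unique across the whole launch
   int group = get_group_id(0);     // which workgroup this work-item is in
   int lid   = get_local_id(0);     // index inside that workgroup
   int lsize = get_local_size(0);   // work-items per workgroup

   // The IDs are tied together (assuming no global offset):
   // gid == group * lsize + lid
   globalIds[gid] = group * lsize + lid;
}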

Declaring a local variable is rather simple. For example, the command:

__local float x[LOCAL_WORKSIZE];

will declare a local variable x which can be accessed by any work-item inside the same workgroup. Local variables are tightly related to the next section, barriers, because you will need to synchronize their reads and writes. For example, if you simply write:

kernel void localVarExample()
{
   int i = get_local_id(0);   // index within the workgroup (size 10 assumed)
   __local int x[10];
   x[i] = i;
   if (i > 0) { int y = x[i-1]; }   // may read a slot that was never written!
}

The result is undefined, because the work-item that reads x[i-1] may get to that point before work-item i-1 has written x[i-1]. What you need is:

kernel void localVarExample()
{
   int i = get_local_id(0);   // index within the workgroup (size 10 assumed)
   __local int x[10];
   x[i] = i;
   barrier(CLK_LOCAL_MEM_FENCE);    // wait until every work-item has written
   if (i > 0) { int y = x[i-1]; }   // now the read is safe
}
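
Besides declaring __local arrays with a fixed size inside the kernel, you can also take a local pointer as a kernel argument and let the host choose its size at runtime, by calling clSetKernelArg with the byte size and a NULL pointer. A sketch with illustrative names:

kernel void localAsArgument(global const int* in,
                            global int* out,
                            local int* scratch)   // size chosen by the host
{
   int li = get_local_id(0);

   scratch[li] = in[get_global_id(0)];
   barrier(CLK_LOCAL_MEM_FENCE);

   // Reading a neighbour's slot is safe only after the barrier.
   int neighbour = (li + 1) % get_local_size(0);
   out[get_global_id(0)] = scratch[neighbour];
}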

2.2. Workgroup synchronization

Quoting the Khronos OpenCL specification:

“Barrier: There are two types of barriers – a command-queue barrier and a work-group barrier.
The OpenCL API provides a function to enqueue a command-queue barrier command. This barrier command ensures that all previously enqueued commands to a command-queue have finished execution before any following commands enqueued in the command-queue can begin execution.
The OpenCL C programming language provides a built-in work-group barrier function. This barrier built-in function can be used by a kernel executing on a device to perform synchronization between work-items in a work-group executing the kernel. All the work-items of a work-group must execute the barrier construct before any are allowed to continue execution beyond the barrier.”

In this section, we will focus on the OpenCL C99 barrier; more specifically, on the barrier(CLK_LOCAL_MEM_FENCE); command. Synchronization using the OpenCL API will be discussed in the Synchronization tutorial. This barrier ensures that all work-items inside a workgroup stop at the barrier before proceeding, synchronizing local memory reads/writes and work-item execution. Once again quoting the specification:

“All work-items in a work-group executing the kernel on a processor must execute this function before any are allowed to continue execution beyond the barrier. This function must be encountered by all work-items in a work-group executing the kernel. If barrier is inside a conditional statement, then all work-items must enter the conditional if any work-item enters the conditional statement and executes the barrier. If barrier is inside a loop, all work-items must execute the barrier for each iteration of the loop before any are allowed to continue execution beyond the barrier.
The barrier function also queues a memory fence (reads and writes) to ensure correct ordering of memory operations to local or global memory.”

The information above is crucial because one can easily create deadlocks if barrier is not used correctly. Check this example:

kernel void example(global float * x)
{
   int i = get_global_id(0);
   if (i == 0) barrier(CLK_LOCAL_MEM_FENCE);   // only work-item 0 reaches the barrier...
   else x[i] = i;                              // ...the others never will
}

Don’t do that!!! It will deadlock, because work-item 0 will not proceed until all the other work-items in its workgroup reach the barrier, but they never will! You need to make sure that ALL work-items reach the same number of barriers.
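
A classic correct use of barrier inside a loop is a local-memory reduction, where every work-item hits the same barrier on every iteration. A sketch, assuming a power-of-two workgroup size and a scratch buffer sized by the host to one float per work-item:

kernel void localSum(global const float* in,
                     global float* partialSums,
                     local float* scratch)
{
   int li = get_local_id(0);
   scratch[li] = in[get_global_id(0)];
   barrier(CLK_LOCAL_MEM_FENCE);

   // Halve the number of active work-items each round. The barrier is
   // outside the if, so EVERY work-item executes it on every iteration.
   for (int offset = get_local_size(0) / 2; offset > 0; offset /= 2)
   {
      if (li < offset)
         scratch[li] += scratch[li + offset];
      barrier(CLK_LOCAL_MEM_FENCE);
   }

   // Work-item 0 writes this workgroup's partial sum.
   if (li == 0)
      partialSums[get_group_id(0)] = scratch[0];
}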

3. Final words

I have been thinking all day about a very simple example that didn’t need to do anything useful, but all I could come up with were examples that did nothing AND were complicated. Either way, there is plenty of code in this section to demonstrate the concepts discussed.

You may want to check the matrix multiplication case study, which is the simplest OpenCL C99 synchronization example I could think of.
