OpenCL C99 Atomics

Get source code for this section (implemented with Cloo).

1. Introduction

First of all, this topic is about Atomics operations in the OpenCL C99 code. That means it doesn’t matter if you’re using Cloo, OpenCLTemplate or the API directly, C++, Java or whatever you like. I used Cloo in this example to provide more Cloo source code and resources.

In parallel processing, synchronization is a very important issue that deserves special attention. So far, we have discussed host code synchronization and memory barriers. But one question that arises is: how to create semaphors using OpenCL? How to create synchronization that depends on the input data? In this section, we will create a semaphor system using OpenCL C99 atomic operations as well as discuss atomic basics and a few functions.

As we discussed, all workers in a workgroup have to reach all barriers in the code or the OpenCL code may behave unexpectedly.

On a side note here, the OpenCL 1.1 Spec changed all atomic functions that used to start with atom_ to atomic_. The code will be presented using the atom_ functions because of current drive availability but future code should use atomic_ instead of atom_. Also, bear in mind that not all hardware

2. Enabling atomics

OpenCL atomics extensions for 32 bit integers are enabled by:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable

32 bit atomics is a core feature in OpenCL 1.1 and thus will not have to be enabled in the future. 64 bit atomics, on the other hand, are still an optional feature and have to be enabled using (OpenCL 1.1):

#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable

OpenCL atomics is an extension that is not yet widely supported. Before using the extension, it is a good idea to check that the hardware supports it. Using Cloo, there is a very easy way to do so:

ICollection<ComputePlatform> cp = ComputePlatform.Platforms;
foreach (ComputePlatform plat in cp)
{
    for (int ii = 0; ii < plat.Devices.Count; ii++)
    {
        List<string> extensions = new List<string>();
        int jj = 0;
        foreach (string s in plat.Devices[ii].Extensions)
        {
            extensions.Add(plat.Devices[ii].Extensions[jj]);
            jj++;
        }
    }
}

You can use the code above as a template to check with Cloo if the extension you are trying to use is supported. As you noticed, for each device in the list, the List<string> extensions will contain all the supported extensions. Of course you can use the float[]  .Extensions directly from the device but this is a fast way to query the extensions before anything else.

3. What are atomic operations?

What is an atomic operation? To put it simply, an atomic operation is one that cannot be interrupted.

Atomics is not only a OpenCL C99 feature. In fact, atomics appeared in regular C code exactly to make it possible to create synchronization points.

Think about this: two workers (or threads) want to increment the value of variable num:

__kernel void test(global int * num)
{
  num[0]++;
}

In this case, something that may happen is:

Worker 0 reads num[0] (zero).
Worker 1 reads num[0] (zero) <- this would not happen if first operation was atomic
Worker 0 writes num[0] = 0+1 = 1
Worker 1 writes num[0] = 0+1 = 1

We have discussed this before in topic Capabilities and Limitations.

Atomics come to solve this issue. The following code would fix the problem:

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
__kernel void test(global int * num)
{
  atom_inc(&num[0]);
}

If you want a better understanding of how atomic operations work, I suggest taking a look at theAtomic operation wikipedia entry or just browse around for regular atomic operations. The important thing is the guarantee that, when using an atomic operations, you can be sure that the operation on a given memory location will be thread-safe, which means no other worker will have access to that specific pointer of memory while the atomic operation is executing.

Atomic functions are really useful and not very hard to understand: they simply perform some task uninterruptedly.

4. Simple atomic operations tests

First of all, refer to the Khronos Group OpenCL Specification for a complete list of atomic operations. Atomics only apply to integers and longs (except for a single function for floats).

Now let us perform some simple tests. Let’s suppose we want all kernels to increment the value of a single variable. We have just discussed this topic in section 3 and the two comparison codes are posted below.

Also, we may want to compare the maximum value of a list. The list will be a permutation of an array so that num[i] = i+1, which means that the maximum value should be num.Length. The source code below should be straightforward and self-explanatory.

#pragma OPENCL EXTENSION cl_khr_global_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_base_atomics : enable
#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable
#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable
__kernel void
kernelAtomInc(__global int * num)
{
   atom_inc(&num[0]);
}
__kernel void
kernelNoAtomInc(__global int * num,
                __global int * semaphor)
{
   num[0]++;
}
__kernel void maxAtomics(__global int * list,
                         __global int * maxNum)
{
   int i = get_global_id(0);
   int prevMax = atom_max(&maxNum[0],list[i]);
}
__kernel void maxNoAtomics(__global int * list,
                           __global int * maxNum)
{
   int i = get_global_id(0);
   int localMax = max(maxNum[0], list[i]);
   maxNum[0] = localMax;
}

Not surprisingly, the codes without atomics simply don’t work, as you can see in the screenshot below:

Note: in the source code available for download, the semaphors discussed in section 5. Creating OpenCL semaphors below have been implemented so the results should be correct. To see a screenshot like the one above just comment out the GetSemaphor and ReleaseSemaphor functions in the OpenCL C99 source code.

5. Creating OpenCL semaphors

Well, what if we need to lock resources for longer operations? Atomics functions only allow us to perform one single operation at a time. But we may need to execute a series of operations uninterruptedly. Let’s create a simple function that will allow us to create a simple semaphor structure.

As you might have already guessed, atomic operations will play a very important role. Our strategy will be the following:

1. Create a semaphor variable and initialize it with 0;
2. Kernels that need access to the resource protected by the semaphor (a position in an array, another variable of different type, etc.) write the value “1” into the semaphor to tell the other workers that resource is busy;
3. If the value that was written in the semaphor is 1, the resource was already busy so the worker needs to wait;
4. Else, the worker can proceed;
5. Finally, the worker releases the resource by writing 0 to it so another worker can grab that 0 and start working.

To perform atomic writing and reading to and from the device we will employ the function atomic_xchg(global int * p, int val) which “Swaps the old value stored at location p with new value given by val. Returns old value” as per the Khronos OpenCL spec. This means that this function will perform the following operations WITHOUT INTERRUPTION:

int old = *p;
*p = val;
return old;

In the code above p is a memory pointer and the contents of p are retrieved by using *p (if you are not familiar with pointers further C language study is highly recommended).

So let’s create simple functions to implement the semaphor:

void GetSemaphor(__global int * semaphor) {
   int occupied = atom_xchg(semaphor, 1);
   while(occupied > 0)
   {
     occupied = atom_xchg(semaphor, 1);
   }
}

void ReleaseSemaphor(__global int * semaphor)
{
   int prevVal = atom_xchg(semaphor, 0);
}

As you can see, the GetSemaphor function writes the value 1 in the location pointed by the semaphor and reads its current contents. If the value is zero, the function may procceed because the resource is available. If not, it keeps writing 1 to the semaphor and only procceeds if the value read is zero.

When the resource is no longer being used, the function ReleaseSemaphor simply sets its value to zero. At this point, another worker will acquire the semaphor. How can we guarantee that no two workers will acquire thhe semaphor simultaneously? Exactly because atom_xchg is an atomic operation.

If a kernel needs to lock resources with the functions above the code will look like this:

__kernel void maxNoAtomics(__global int * list,
                           __global int * maxNum,
                           __global int * semaphor)
{
  int i = get_global_id(0);  GetSemaphor(&semaphor[0]);
  {
     int localMax = max(maxNum[0], list[i]);
     maxNum[0] = localMax;
  }
  ReleaseSemaphor(&semaphor[0]);
}

Obviously, you can create the semaphor as an array and use each individual index to lock a particular resource (in OpenCL this will most likely be a memory region). Just don’t forget to pass the semaphor as a pointer (ie, &semaphor[index]).

6. Conclusion

Atomic operations and semaphors are very important to synchronize kernels and memory access in parallel programming. This section covered important aspects of OpenCL atomics and showed a easy way to implement semaphors in OpenCL C99 code using atomic operations.

Get source code for this section (implemented with Cloo).

4 thoughts on “OpenCL C99 Atomics”

  1. Hi,

    I used this example, however when I get up to 3 threads, the lock gets stuck.
    I personally can’t see why it happens. Do you have any ideas?
    (I even tried putting the lock and unlock directly after eachother, didnt’t work..)

  2. GPU threads can get stuck easily depending on the implementation and the code. For example, some implementations will simply loop forever waiting for strict SIMD sync.

    What I would suggest as the easiest workaround is to redesign your code to directly use atomic operations when you need multiple threads to write to the same variable. Note that, if too many workitems get to the atomic, you may be effectively “serializing” your code back.

    Hope this helps

  3. Hi,

    Isn’t it a problem that the waiting thread is in polled wait? I don’t know the internals of GPU and OpenCL, but maybe if we implement a Mutex with a real hardware implemented wait, other threads get the chance to be scheduled (the waiting ones won’t use execution cores for looping in a polled wait).
    Cannot it be combined with a tricky use of barriers?

    1. The OpenCL implementations vary a little bit regarding barriers. They all comply to the spec but the behavior changes when we use them as mutex.

      The real issue is to make sure that ALL workitems reach all barriers!

      I think dynamic parallelism will help a lot.

Leave a Reply

Your email address will not be published. Required fields are marked *