Kernel execution structure

Be warned that this tutorial is a bit longer than the others because there are many important aspects to cover.

In OpenCL, you need to invoke kernel execution with the proper arguments in order to run your OpenCL C99 code.

We’ve discussed that you can access the OpenCL API directly or through wrappers. Whichever way you choose, it is important to know which raw API function you are using in order to better understand what happens in the background.

In my opinion, Cloo is emerging as a great choice of OpenCL wrapper, which is why I will refer to it and write the examples using Cloo wherever possible and needed.

Before we proceed, let’s take a look at the overall structure of a kernel execution. Once again, whichever way you choose to access the OpenCL API (directly, via Cloo, OpenCLTemplate, or OpenCL.NET), you will need all the elements below:

  1. Identify the Platform (an OpenCL implementation available on the computer);
  2. Identify OpenCL compliant Devices;
  3. Create a Context;
  4. Create Memory Objects inside the Context;
  5. Use the OpenCL C99 source code to create a Program bound to the Context;
  6. Get handles to the kernels inside the program;
  7. Use command queues to enqueue Kernel execution according to the desired Execution Structure;
  8. Upon being executed by Command Queues, Kernels produce results that are stored in Memory Objects.

1. Calling the kernel

1.1 OpenCLTemplate

We are ready now to discuss the kernel execution structure. The first step is to select which Command Queue will execute the kernel.

First of all, you need to compile a program and create the kernel, informing the kernel name string. Then, you call the kernel.Execute() method to invoke it.

In OpenCLTemplate, if you don’t explicitly specify a Command Queue from CLCalc.Program.CommQueues (in-order) or CLCalc.Program.AsyncCommQueues (out-of-order), OpenCLTemplate will default to CLCalc.Program.CommQueues[CLCalc.Program.DefaultCQ]. This is what we did in our first program:

//Initializes OpenCL Platforms and Devices and sets everything up
OpenCLTemplate.CLCalc.InitCL();

//Compiles the source codes. The source is a string array because the user may want
//to split the source into many strings.
OpenCLTemplate.CLCalc.Program.Compile(new string[] { vecSum });

//Gets host access to the OpenCL floatVectorSum kernel
OpenCLTemplate.CLCalc.Program.Kernel VectorSum = new OpenCLTemplate.CLCalc.Program.Kernel("floatVectorSum");

//(...)

//Execute the kernel
VectorSum.Execute(args, workers);
//It would be the same to call
VectorSum.Execute(CLCalc.Program.CommQueues[CLCalc.Program.DefaultCQ], args, workers, new GASS.OpenCL.CLEvent[0], new GASS.OpenCL.CLEvent());

As you can see, in OpenCLTemplate you can simply call kernel.Execute(arguments, workers). This is for the sake of simplicity. We will talk about workers later on.

1.2 Cloo

If you happen to be using Cloo to wrap OpenCL, here is a piece of code to help you get started. Make sure to download Cloo first and add the reference. The commands are pretty straightforward once you’ve seen them. Notice that Cloo initializes itself automatically, but you do need to declare the Command Queue explicitly. One more thing to notice: the Execute command now lives inside the Command Queue, which in my opinion is a bit less intuitive.

//Number of Platforms
int numPlats = ComputePlatform.Platforms.Count;

//Create a Context spanning all devices of the first Platform
ComputeContextPropertyList Properties = new ComputeContextPropertyList(ComputePlatform.Platforms[0]);
ComputeContext Context = new ComputeContext(ComputeDeviceTypes.All, Properties, null, IntPtr.Zero);

//Vector sum source code
string vecSum = @"
__kernel void
floatVectorSum(__global float * v1,
               __global float * v2)
{
    // Vector element index
    int i = get_global_id(0);
    v1[i] = v1[i] + v2[i];
}
";

//Get a list of devices (adjust the indexes to the devices available on your machine)
List<ComputeDevice> Devs = new List<ComputeDevice>();
Devs.Add(ComputePlatform.Platforms[0].Devices[0]);
Devs.Add(ComputePlatform.Platforms[0].Devices[2]);
Devs.Add(ComputePlatform.Platforms[0].Devices[1]);

//Create and build a new OpenCL program
ComputeProgram prog = null;
try
{
    prog = new ComputeProgram(Context, vecSum);
    prog.Build(Devs, "", null, IntPtr.Zero);
}
catch
{
    //In real code, inspect the build log here instead of swallowing the error
}

//Create the kernel
ComputeKernel kernelVecSum = prog.CreateKernel("floatVectorSum");

//In case you want to create all kernels at the same time
ICollection<ComputeKernel> Kernels = prog.CreateAllKernels();

//Create host variables to pass to device memory
float[] v1 = new float[10], v2 = new float[10];
for (int i = 0; i < v1.Length; i++)
{
    v1[i] = i;
    v2[i] = 2 * i;
}

//Create OpenCL buffers (copies data to the Device)
//Something very positive is that you declare the type of your buffer
ComputeBuffer<float> bufV1 = new ComputeBuffer<float>(Context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer, v1);
ComputeBuffer<float> bufV2 = new ComputeBuffer<float>(Context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.UseHostPointer, v2);

//You have to set kernel arguments by assigning them manually
//This has an API-like fashion
kernelVecSum.SetMemoryArgument(0, bufV1);
kernelVecSum.SetMemoryArgument(1, bufV2);

//Create the command queue
ComputeCommandQueue Queue = new ComputeCommandQueue(Context, ComputePlatform.Platforms[0].Devices[1], ComputeCommandQueueFlags.None);

//Enqueue the Execute command
Queue.Execute(kernelVecSum, null, new long[] { v1.Length }, null, null);

//Enqueue the read command (copies results back to the host)
v1 = Queue.Read<float>(bufV1, null);

You can work this out yourself by now. If you have any doubts don’t hesitate to post in the Forum.

1.3 OpenCL API

The OpenCL API for enqueuing kernel execution is a bit tricky. You first need to assign arguments to the kernel you want to execute using clSetKernelArg. Then you use the API commands clEnqueueTask or clEnqueueNDRangeKernel to execute the kernel with the proper events and workers.

clEnqueueTask simply enqueues a command to be executed with a single worker, taking no advantage of parallelization. This may be useful if your algorithm is mainly parallel but has a single part that needs to be executed serially.

clEnqueueNDRangeKernel deserves more detailed attention. In the OpenCL specification you will find this syntax:

cl_int clEnqueueNDRangeKernel (cl_command_queue command_queue,
                               cl_kernel        kernel,
                               cl_uint          work_dim,
                               const size_t *   global_work_offset,
                               const size_t *   global_work_size,
                               const size_t *   local_work_size,
                               cl_uint          num_events_in_wait_list,
                               const cl_event   *event_wait_list,
                               cl_event         *event)

We have already discussed the command_queue argument: it lets us choose the device that will execute the kernel, along with some of its characteristics. The kernel argument is the kernel we want to execute, which has to be created with clCreateKernel from a program created with clCreateProgramWithSource or clCreateProgramWithBinary.

We will discuss the other arguments in the next topics.

If you really need to run OpenCL through the raw API rather than OpenCLTemplate or Cloo, you should carefully read the documentation of the API functions I mentioned.

2. Workers, work dimension and work sizes

We have previously discussed a few aspects about workers and work dimensions. We shall now have a closer look into the subject.

2.1 Assigning tasks to the workers

If you want good performance, it is important that OpenCL workers don’t take different execution paths within the same warp, as we discussed in Capabilities and Limitations. You want to think beforehand about what each worker is going to do. Think about this: you want to sum a vector. You can have each worker sum one of its components, OR you can have each worker sum four components at a time by arranging them into a gentype4 vector (gentype stands for some data type, like float, int, etc.).

What you DON’T want is this: suppose you are going to sum two symmetric matrices M and N (a symmetric matrix is one where M[i,j] = M[j,i]). This is what you need:

for (int i = 0; i < n; i++)
    for (int j = 0; j < n; j++)
        if (j <= i) result[i,j] = M[i,j] + N[i,j];

Notice that the program will only sum if j <= i. In OpenCL this is a really bad choice because you will launch many threads that won’t do anything, i.e., in the same warp you will have threads that follow different execution paths. More generally, you usually DON’T WANT THIS:

kernel yourKernel(<args>)
{
    if (<condition that will happen sometimes>)
        <work a lot>
    else
        <work very little>
}

2.2 Work dimension

Let’s start this topic with a simple example: suppose you’re a school teacher and your class has 16 students. You want them to sum two arrays containing 4 vectors each, in a total of 16 numbers per array, and after that you want them to normalize each vector of the resulting array. The vectors are:

x = [t0.xyzw t1.xyzw t2.xyzw t3.xyzw] = 
[t0.x t0.y t0.z t0.w t1.x t1.y t1.z t1.w t2.x t2.y t2.z t2.w t3.x t3.y t3.z t3.w]

y = [v0.xyzw v1.xyzw v2.xyzw v3.xyzw] = 
[v0.x v0.y v0.z v0.w v1.x v1.y v1.z v1.w v2.x v2.y v2.z v2.w v3.x v3.y v3.z v3.w]

The picture below illustrates the scene:

The numbers are written in a paper stuck in the board. As you can see, each child needs to read only the two numbers he/she has to sum.

Let’s say you want the children to sum working in one dimension (work_dim = 1). This is what happens:

  1. You assign a number to each child, corresponding to an index of the array (0 to 15);
  2. Each child goes to the board and reads the two numbers he/she has to sum;
  3. They go back and sum the numbers;
  4. Each child gets up and writes his/her answer;
  5. Children read each other’s answers to be able to calculate the normalization factor (sync point);
  6. They all normalize their vector components.

Now, you may want them to work in two dimensions (work_dim = 2). Then this is what you do:

  1. Assign each child two numbers: one to tell them which vector they should sum (from 0 to 3) and the other to tell them which component of the vector (0 to 3);
  2. Execute steps 2 to 6 above.

So, as you can see, not much changes. Yes, this is correct! The work dimension is just a way to better organize your kernel launch structure. If you want to use an extra index, you will be using work_dim = 3. In OpenCL terms, what we did before was:

work_dim = 1 case:

Enqueue “children sum” with work_dim = 1

int[] global_work_size = new int[1] {16};

work_dim = 2 case:

Enqueue “children sum” with work_dim = 2

int[] global_work_size = new int[2] {4, 4};

Roughly speaking, in terms of for loops, it would look like this:

work_dim = 1:

for (int i = 0; i < get_global_size(0); i++)
{
    i = get_global_id(0);
    //Kernel code here
}

work_dim = 2:

for (int i = 0; i < get_global_size(0); i++)
{
     for (int j = 0; j < get_global_size(1); j++)
     {
          i = get_global_id(0);
          j = get_global_id(1);
          // Kernel code here
     }
}

work_dim = 3:

for (int i = 0; i < get_global_size(0); i++)
{
     for (int j = 0; j < get_global_size(1); j++)
     {
          for (int k = 0; k < get_global_size(2); k++)
          {
            i = get_global_id(0);
            j = get_global_id(1);
            k = get_global_id(2);
            //Kernel code here
          } 
     }
}

As we will see when we get to the OpenCL C99 section, get_global_id() is a function that retrieves the worker identifier and get_global_size() is a function that retrieves the total number of work_items (the work_size) in a given dimension.

Important: This is just an example to show the basics of work_dimension and work_sizes. This code as presented here would be executed sequentially while kernel code is executed in PARALLEL.

2.3 Local and global work sizes

Consider the same example as before: your class has 16 students and you want them to sum two vectors of length 16. Let’s consider the case where we have work_dim = 1.

This time, the children will form groups of 4 (marked in gray) and, instead of having to go back to the board to read the results they need, they will read the numbers from the classmates in the same group. In that case, you would have int[] global_work_size = new int[1] {16} and int[] local_work_size = new int[1] {4}.

Workgroups are important because you can use memory barriers to synchronize workers from inside the OpenCL C99 code itself, a topic we will discuss when we get to Synchronization. In my opinion, explicit local_work_size management will be handled better and better by the manufacturers’ drivers, so this is not a point I want to focus on.

2.4 Worked example

Well, nothing like some code to help understand what is happening, right?

This time we will use Cloo to create code that will help us better understand the worker and work size structures. What we will do is call a kernel in which the workers simply write their global_ids, local_ids and group_ids (which group they are in) to memory. If you are familiar with CUDA you will immediately recognize that global_id = group_id*local_size + local_id.

Take a look at the OpenCL C99 code we will call. It is quite simple but you should make sure you understand it:

__kernel void
kernelExample (__global       int * getglobalID0,
               __global       int * getglobalID1,
               __global       int * getglobalID2,
               __global       int * getlocalID0,
               __global       int * getlocalID1,
               __global       int * getlocalID2,
               __global       int * getgroupID0,
               __global       int * getgroupID1,
               __global       int * getgroupID2)
{
    int i = get_global_id(0);
    int j = get_global_id(1);
    int k = get_global_id(2);
    getglobalID0[i] = i; 
    getglobalID1[j] = j; 
    getglobalID2[k] = k;

    getlocalID0[i] = get_local_id(0); 
    getlocalID1[j] = get_local_id(1); 
    getlocalID2[k] = get_local_id(2);

    getgroupID0[i] = get_group_id(0); 
    getgroupID1[j] = get_group_id(1); 
    getgroupID2[k] = get_group_id(2); 
}

Now let’s analyze the results presented in the screen below:

  • global_id(0) goes from 0 to 19 as expected since the 0-th dimension is 20;
  • global_id(1) goes from 0 to 15 as expected since the 1-st dimension is 16;
  • Since our work_dim = 2, global_id(2) is always zero;
  • Notice that global_id = group_id*local_size + local_id.

Take a look at the source code for a deeper insight. You can also experiment with the code and check what your device does if you set the local_sizes to null. At the moment my graphics card sets the local_sizes to 1, which is terrible, but I think this will improve soon. I mean, will we really have to implement factorization algorithms?

It is also a good example of a Cloo setup.

Download the source code for this section
