Command queues

Command queues are a very important aspect of OpenCL. Each command queue is bound to one of the Devices of your Context (that is, the group of Devices you have chosen to use), so the queue you pick determines which Device executes a particular command and how it executes it. If you need more in-depth information about Contexts, please refer to the OpenCL API documentation on the Khronos Group website.

Side note: In theory, Command Queues should be covered before reading and writing variables (buffer elements), but I think they are easier to explain now that you already know there are tasks that need to be executed. I'm speaking from my own experience here: I tried hard to understand what a Command Queue was, gave up and skipped it, and when I got to the tasks part it all dawned on me (ok, I studied a lot too).

The table below summarizes the information that is related to a specific Command queue:

| Command queue characteristic | When to specify | Available options | Explanation |
|---|---|---|---|
| Device | Upon Command Queue creation | One of the OpenCL-enabled devices in your computer | Tells which of your OpenCL-compliant devices will execute the commands assigned to that command queue. |
| Execution mode | Upon Command Queue creation | In-order (default) or out-of-order | Commands are always enqueued in order. In in-order mode (the default) they are also executed in queue order, and usually no synchronization is required. Out-of-order execution means that OpenCL may start executing the next enqueued command before the previous one ends. This mode will probably require explicit synchronization in host code. |
| Type of command to execute | When a command needs to be executed | Memory commands, synchronization commands, kernel execution commands | Any time you need to perform an operation on a device, you issue the command to one of the available Command Queues. These commands may write to or read from the device, execute kernels, or enqueue synchronization points for out-of-order Command Queues. |

OpenCLTemplate and Cloo handle the task of loading Platforms and creating Contexts for you. OpenCLTemplate also creates all the in-order and out-of-order queues, available in CLCalc.Program.CommQueues (in-order) and CLCalc.Program.AsyncCommQueues (out-of-order). If nothing is chosen, OpenCLTemplate executes the commands on one of the in-order command queues bound to one of the GPUs, or to the CPU if no GPU is found. You can control which device is used by changing CLCalc.Program.DefaultCQ if you want in-order execution, or by explicitly using the AsyncCommQueues to execute commands out of order.

You may be asking yourself why I did this with OpenCLTemplate. The reason is that programmers who are not used to parallel programming may find the concept of a command queue difficult to understand. I just wanted to make things easy, and I also think that Cloo has better resources for advanced OpenCL programming. On the other hand, if you don't need many advanced features, OpenCLTemplate is easier and does the job.

Using Cloo, you need to create your own command queue with a statement that looks like this:

ComputeCommandQueue Queue = new ComputeCommandQueue(Context, ComputePlatform.Platforms[0].Devices[1], ComputeCommandQueueFlags.None);

1.1 Synchronous vs asynchronous

Take a look at the picture below for a better understanding of the in-order and out-of-order execution models.

What this picture shows is that the host always enqueues commands in order; what differs is how the queue executes them. In the example above, you could have two command queues attached to the same GPU device with different behaviors.

The blue commands are issued to the command queue that executes them in order. As you can see, command 2 is executed only after command 1 has been completed.

The red commands are issued to the command queue that executes them out of order. This time, command 2 can start without waiting for command 1 to complete. Is this behavior good? Only your application can tell. For instance, it may be difficult to create a structure that launches all the independent workers at the same time, and it may be simpler to just launch them asynchronously. The downside is that, if you ever need synchronization, you'll need to do it manually.
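To make the blue and red cases a bit more concrete, here is a minimal sketch of a toy timing model in plain C#. It is not OpenCL or Cloo code: QueueTimeline and StartTimes are names I made up for this illustration, and the model assumes that the device has room to run every independent command at once, which a real device may not.

```csharp
using System;

// Toy timing model, not real OpenCL: QueueTimeline and StartTimes are
// hypothetical names used only for this illustration.
public static class QueueTimeline
{
    // Returns the time at which each command starts, given the durations of
    // the commands in the order the host enqueued them.
    public static double[] StartTimes(double[] durations, bool inOrder)
    {
        var starts = new double[durations.Length];
        double previousFinish = 0.0;
        for (int i = 0; i < durations.Length; i++)
        {
            // In-order: wait for the previous command to finish.
            // Out-of-order: no dependency, so this model lets it start at once.
            starts[i] = inOrder ? previousFinish : 0.0;
            previousFinish = starts[i] + durations[i];
        }
        return starts;
    }

    public static void Main()
    {
        double[] durations = { 3.0, 1.0, 2.0 };
        Console.WriteLine("in-order:     " + string.Join(", ", StartTimes(durations, true)));
        Console.WriteLine("out-of-order: " + string.Join(", ", StartTimes(durations, false)));
    }
}
```

With durations of 3, 1 and 2 time units, the in-order start times are 0, 3 and 4, while in this model every out-of-order command starts at 0. A real out-of-order queue only promises that it may overlap commands, not that it will.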

The table below compares in-order and out-of-order execution models.

| Aspect | In-order mode | Out-of-order mode | Comments |
|---|---|---|---|
| Ease of use | Easy | Harder | In-order mode is easier to use, since you only need to guarantee that the workers are independent. Out-of-order mode requires synchronization. |
| Explicit synchronization | Not required | Required | As I said before, any synchronization between kernels has to be done explicitly in out-of-order mode. |
| Parallelization of kernels enqueued at different times | Not possible | Possible | This is the biggest limitation of in-order mode. For example, you can't tell the Device that you will launch 5 independent kernels and then a wrapper kernel that needs all 5 to have finished: in in-order (synchronous) mode the only option is to execute one kernel at a time. In out-of-order mode you can enqueue the 5 kernels, then a barrier, then the wrapper. |
| Coordination of parallel tasks with occasional synchronization points | Not possible | Possible | This is another limitation of in-order mode. In out-of-order mode you'd need to use Events to do this. It is likely to be necessary when implementing evolutionary algorithms, heuristics or meta-heuristics. |
| Coordination of tasks among different devices | Possible, but needs Events | Possible | Another very important point about command queues. Even if each device executes its commands in order, you usually can't sync different devices sharing the same workload without explicit synchronization. |
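The "5 independent kernels plus a wrapper" case from the table can be sketched with another toy timing model. Again, this is plain C# and not OpenCL or Cloo code: BarrierModel, OutOfOrderTotal and InOrderTotal are hypothetical names, a null entry stands for a barrier, and the model assumes that the device can run all five kernels at the same time.

```csharp
using System;

// Toy timing model, not real OpenCL: BarrierModel and its methods are
// hypothetical names used only for this illustration.
public static class BarrierModel
{
    // Out-of-order queue: each entry is a kernel duration, or null for a barrier.
    // A kernel starts as soon as the most recent barrier allows it to, and a
    // barrier is passed only when everything enqueued before it has finished.
    public static double OutOfOrderTotal(double?[] commands)
    {
        double earliestStart = 0.0; // raised each time a barrier is passed
        double latestFinish = 0.0;  // latest finish time seen so far
        foreach (double? command in commands)
        {
            if (command == null)
                earliestStart = latestFinish;
            else
                latestFinish = Math.Max(latestFinish, earliestStart + command.Value);
        }
        return latestFinish;
    }

    // In-order queue: one command at a time, so the durations simply add up.
    // A barrier changes nothing here because the order is already enforced.
    public static double InOrderTotal(double?[] commands)
    {
        double total = 0.0;
        foreach (double? command in commands)
            total += command ?? 0.0;
        return total;
    }

    public static void Main()
    {
        // Five independent kernels, a barrier, then the wrapper kernel.
        double?[] commands = { 2.0, 3.0, 1.0, 4.0, 2.0, null, 1.0 };
        Console.WriteLine("in-order total:     " + InOrderTotal(commands));
        Console.WriteLine("out-of-order total: " + OutOfOrderTotal(commands));
    }
}
```

In this model the in-order queue needs 13 time units, while the out-of-order queue needs 5: the longest of the five kernels (4) plus the wrapper (1). The barrier is what keeps the wrapper from starting too early.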

As you can see, complicated algorithms may require a greater level of control in order to ensure performance. I may write a Case Study involving out-of-order execution and synchronization, but this is not the right moment to discuss it, and for now I prefer to keep it simple.

1.2 Synchronization

Synchronization is a very important aspect of OpenCL, but this is not the place to discuss it in depth. You will need synchronization when using queues in out-of-order mode.

Basically, it consists of telling OpenCL which commands need to be executed first. There are so many ways to synchronize (inside OpenCL C99 kernels, explicitly from the host, with enqueued synchronization commands, implicitly through blocking reads of variables) that we will discuss this in a specific topic.

Commands enqueued to a command queue take an Event List argument, which holds the events the command has to wait for before it starts. Each command also generates an Event of its own, which is set to a "Command Executed" value (CL_COMPLETE in the OpenCL API) after the command has been executed.
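The wait-list idea can also be sketched as a toy model. This is plain C#, not OpenCL or Cloo code, and every name in it (Command, EventModel, FinishTimes) is made up for this illustration. Each command lists the earlier commands whose events it waits on, and it starts only after all of them have completed.

```csharp
using System;

// Toy model of event wait lists, not real OpenCL: Command, EventModel and
// FinishTimes are hypothetical names used only for this illustration.
public sealed class Command
{
    public readonly string Name;
    public readonly double Duration;
    public readonly int[] WaitList; // indices of earlier commands to wait for

    public Command(string name, double duration, params int[] waitList)
    {
        Name = name;
        Duration = duration;
        WaitList = waitList;
    }
}

public static class EventModel
{
    // Returns the time at which each command's event completes on an
    // out-of-order queue: a command starts once every event in its wait
    // list has completed, or at time 0 if the list is empty.
    public static double[] FinishTimes(Command[] commands)
    {
        var finish = new double[commands.Length];
        for (int i = 0; i < commands.Length; i++)
        {
            double start = 0.0;
            foreach (int j in commands[i].WaitList)
            {
                if (j < 0 || j >= i)
                    throw new ArgumentException("Wait lists may only name commands enqueued earlier.");
                start = Math.Max(start, finish[j]);
            }
            finish[i] = start + commands[i].Duration;
        }
        return finish;
    }

    public static void Main()
    {
        var commands = new[]
        {
            new Command("write A", 2.0),
            new Command("write B", 3.0),
            new Command("kernel", 4.0, 0, 1),   // waits for both writes
            new Command("read result", 1.0, 2), // waits for the kernel
        };
        double[] finish = FinishTimes(commands);
        for (int i = 0; i < commands.Length; i++)
            Console.WriteLine(commands[i].Name + " completes at t=" + finish[i]);
    }
}
```

Here the two writes overlap, the kernel starts at t=3 when the slower write completes, and the read completes at t=8. Only the dependencies you actually declare are enforced, which is the whole point of using events instead of running everything in order.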

The Atomics extensions also play an important role in synchronization, because atomic functions are non-interruptible operations that guarantee that a given piece of memory is not corrupted while it's being updated. We'll discuss that in a later topic.
