Download source code for this section

Topics: Memory coalescing, __local reduction, kernel launches and vectorization

Not included in this example: bank conflicts, flow divergence

[notice]Feel free to use this material as you want, but please include a reference to this page if you do so. Thank you.[/notice]

## 1. Introduction

The purpose of this article is to show the performance gain that can be obtained by proper utilization of parallel techniques known as memory coalescing and __local reduction.

We’ll start by explaining the concepts of memory coalescing and __local reductions, how they map into the hardware and why it may be a good idea to implement these optimizations.

After that, we’ll present a hands-on example showing various implementations of the computation of a dot product.

## 2. Optimization concepts

### 2.1 Memory Coalescing

The concept of memory coalescing is related to how modern GPUs retrieve data from internal memory. In order to make better use of the available hardware, whenever a memory fetch in __global space is requested from a kernel, the GPU reads a minimum number of elements in a single transaction. For the purposes of this article we will assume that the minimum number of elements read is 4; since each element is typically a 32-bit word, in a float vector this means 4 floats * 4 bytes per float = 16 bytes per fetch.

Coalesced memory access is data access in a sequential pattern: the i-th work-item accesses the i-th component of a __global vector.

Example of coalesced access:

```c
__kernel void coalesced(__global float * v)
{
    int i = get_global_id(0);
    float val = v[i];
}
```

Example of non-coalesced access. In the example below, half of the data retrieved in each memory transaction is never used. If the kernel speed is limited by memory access, the kernel below will be effectively 2x slower:

```c
__kernel void noncoalesced(__global float * v)
{
    int i = get_global_id(0);
    float val = v[2*i];
}
```

The picture below summarizes the GPU access patterns and what happens when the access is not coalesced:

In practice, whenever the structure of the data is known, it is a good idea to access the values sequentially in the kernels. In terms of scalability, as the GPU hardware progresses, it is possible that the number of elements retrieved will increase and well-structured coalesced kernels will benefit more from cutting-edge hardware.

Of course, if the structure of the OpenCL variables (buffers) is data dependant or otherwise not structured, there is no choice but to use non-coalesced access. What is important to know is that, in these cases, it is more likely that memory access will be the bottleneck of the OpenCL code.

### 2.2 __local reduction

Just like OpenGL, the graphics library, each vendor writes its own driver to implement the OpenCL functions.

Thus, when an application calls an OpenCL function, OpenCL.DLL redirects the request to the proper driver. For example, if you have an NVidia GeForce GPU and an AMD processor, OpenCL calls that use the GPU as the processing device will be redirected to NVidia's driver, whereas calls to the CPU will be redirected to the AMD driver. Likewise, if the GPU is AMD's and the processor is Intel's, calls will be directed to AMD's driver when the GPU is the compute device and to Intel's when the CPU is the compute device.

This means that every OpenCL function call has an overhead: it takes some time for the request to reach the appropriate driver and for the work to be effectively scheduled for processing. The whole process only takes a few milliseconds. Sometimes, though, the overhead may be comparable to the computation time itself, especially considering the speedups the GPU is able to achieve.

As a result, it is usually good practice to keep kernel launches to a minimum, even if that means not exploiting the parallel potential of the code to its full capability. This effect is even stronger when using wrapper APIs such as JOCL, Cloo or OpenCLTemplate, which don't include the OpenCL.h headers directly.

The picture below summarizes the enqueuing route:

__local reduction enables the programmer to use __local buffers to exploit the memory shared by work-items of the same workgroup, avoiding sequential kernel execution calls.

For example, one strategy to compute the sum of the components of a vector of dimension 1024 would be:

- Call 512 work-items to compute v[i] = v[i] + v[i+512]
- Call 256 work-items to compute v[i] = v[i] + v[i+256]
- (…)
- Call 1 work-item to compute v[i] = v[i] + v[i+1]

This approach not only involves many calls to kernel execution, but also increases the number of __global memory reads.
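To make the cost of this strategy concrete, here is a minimal host-side C sketch (a hypothetical helper, not part of the article's source code) in which each pass of the outer loop stands for one kernel launch of the halving scheme above:

```c
#include <assert.h>

/* Hypothetical illustration: each iteration of the outer loop models one
 * kernel launch. Pass with "half" work-items computes v[i] += v[i + half].
 * n is assumed to be a power of two (1024 in the text). */
float sum_by_repeated_launches(float *v, int n)
{
    for (int half = n / 2; half >= 1; half /= 2) {
        /* one simulated kernel launch with "half" work-items */
        for (int i = 0; i < half; i++)
            v[i] += v[i + half];
    }
    return v[0];  /* the total sum accumulates in v[0] */
}
```

Note that a vector of dimension 1024 requires 10 simulated launches (log2(1024)), each paying the full enqueue overhead discussed above.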

Since it is possible to share information using the __local space, the same sum could be performed locally: each workGROUP (of 256 work-items, for example) generates one reduced result, as follows:

```c
//Reduction part
int p = get_local_id(0);
int q = get_group_id(0);
int maxp = get_local_size(0);

__local float values[256];
values[p] = val;
barrier(CLK_LOCAL_MEM_FENCE);

maxp = maxp >> 1;
while (maxp > 0)
{
    if (p < maxp) values[p] += values[p + maxp];
    maxp = maxp >> 1;
    barrier(CLK_LOCAL_MEM_FENCE);
}

if (p == 0) resps[q] = values[0];
```

Notice that this approach reduces the degree of parallelism in the final stages and slows down due to branch execution divergence (a topic for later discussion), but it avoids launching subsequent kernels and repeating accesses to __global memory.

Please refer to the OpenCL C99 advanced tutorial if you need more information about the __local workspace and barrier() calls.

### 2.3 Kernel launches

Ideally, considering the structure of the SIMD chips of modern GPUs, the number of work-items should be an integer multiple of the workgroup size, usually 256 or 512. Work-items are launched simultaneously in batches of this size when possible. In practice, this means that some processing power is wasted if the number of work-items is not a multiple of the workgroup size.

Why not simply launch kernels with just 256 work-items then? There are benefits to doing so, such as __local shared memory. The problem is what's called memory latency: memory fetch instructions are not instantaneous, and the GPU's processors will just sit idle waiting for data if there are too few workgroups.

Roughly speaking, the GPU schedules for execution the workgroups whose data is already available and suspends workgroups as they request data. In conclusion, if the kernel is memory-fetch intensive, the ideal situation is to have a large number of workgroups so that this latency-hiding process can take place.

Incidentally, this thread scheduling is one of the reasons the execution order of workitems is completely unpredictable.

Having considered these aspects, a worksize of at least 1024 = 4*256 can have a very positive influence on performance because 1) the workgroups are fully populated with workitems and 2) the number of workgroups is enough to hide latency. This demonstrates why GPUs need a large amount of workitems to reach peak performance.
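The padding implied above can be sketched as a small helper (hypothetical, for illustration only) that rounds the global work size up to a full multiple of the workgroup size, so that every workgroup is fully populated:

```c
#include <assert.h>

/* Hypothetical helper: round n up to the next multiple of workgroup_size.
 * The extra (padded) work-items would simply exit early in the kernel. */
int rounded_global_size(int n, int workgroup_size)
{
    return ((n + workgroup_size - 1) / workgroup_size) * workgroup_size;
}
```

For example, 1000 items with a workgroup size of 256 would be rounded up to 1024, yielding 4 fully populated workgroups.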

### 2.4 Vectorization

Vectorization is an AMD-specific optimization, but it may well become a trend: since a vector sum can always be serialized, other drivers can simply serialize the computation. This optimization takes advantage of the capability of certain hardware to perform more than one floating-point operation per cycle:

Example: non-vectorized sum

```c
float x[4];
x[0] = v[0] + w[0];
x[1] = v[1] + w[1];
x[2] = v[2] + w[2];
x[3] = v[3] + w[3];
```

Vectorized sum:

```c
float4 x = (float4)(v[0], v[1], v[2], v[3]) + (float4)(w[0], w[1], w[2], w[3]);
```

Vectorization can be applied to almost all single-variable operations (see the Khronos Group OpenCL Specification).
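The serialization argument above can be made concrete in plain C with a hypothetical float4-style struct (for illustration only): a scalar-only device would unroll the vector addition into four scalar additions, while vector hardware can execute the lanes in one cycle.

```c
#include <assert.h>

/* Hypothetical 4-lane vector type mirroring OpenCL's float4. */
typedef struct { float s0, s1, s2, s3; } float4_t;

/* Component-wise addition: this is exactly the serialized form a
 * scalar-only driver would fall back to. */
float4_t add4(float4_t a, float4_t b)
{
    float4_t r;
    r.s0 = a.s0 + b.s0;
    r.s1 = a.s1 + b.s1;
    r.s2 = a.s2 + b.s2;
    r.s3 = a.s3 + b.s3;
    return r;
}
```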

## 3. Specific case study: dot product

### 3.1 Definition

A dot product is a mathematical operation used to compute, among other things, projections of vectors onto a basis and residuals. Its definition is as follows: given two vectors, v1[] and v2[], of the same length n, the dot product of v1 and v2 is the sum of the products of corresponding components, v1[0]*v2[0] + v1[1]*v2[1] + … + v1[n-1]*v2[n-1]. A reference CPU implementation:

```csharp
public static double ExactDotProduct(double[] v1, double[] v2)
{
    if (v1.Length != v2.Length) throw new Exception("Vectors must have the same length");

    double resp = 0;
    for (int i = 0; i < v1.Length; i++)
    {
        resp += v1[i] * v2[i];
    }
    return resp;
}
```

### 3.2 Naive implementation

The naive implementation of the dot product is the way we implemented the calculation in previous tutorials, simply because it's easier. The idea is to launch a fixed number of work-items and compute the initial and final indexes each one should process. For example, suppose we have vectors of length 10 and 4 work-items. Work-item i computes ans[i]:

```c
ans[0] = v1[0]*v2[0] + v1[1]*v2[1]
ans[1] = v1[2]*v2[2] + v1[3]*v2[3] + v1[4]*v2[4]
ans[2] = v1[5]*v2[5] + v1[6]*v2[6]
ans[3] = v1[7]*v2[7] + v1[8]*v2[8] + v1[9]*v2[9]
```

Then the host code has a final load: sum the 4 components of ans.

This code is implemented as follows:

```c
__kernel void NaiveDot(__global const float * v1,
                       __global const float * v2,
                       __global float * resps,
                       __constant int * NN)
{
    int n = NN[0];
    int i = get_global_id(0);
    int nWorkItems = get_global_size(0);

    int jmin = (n * i) / nWorkItems;
    int jmax = (n * (i + 1)) / nWorkItems;

    float val = 0.0f;
    for (int j = jmin; j < jmax; j++)
    {
        val += v1[j] * v2[j];
    }
    resps[i] = val;
}
```
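The index arithmetic of this partition can be checked in plain C (hypothetical helpers mirroring the kernel's jmin/jmax computation, for illustration):

```c
#include <assert.h>

/* Work-item i of nWorkItems processes indices j in [jmin, jmax),
 * exactly as in the NaiveDot kernel. */
int naive_jmin(int n, int i, int nWorkItems) { return (n * i) / nWorkItems; }
int naive_jmax(int n, int i, int nWorkItems) { return (n * (i + 1)) / nWorkItems; }
```

With n = 10 and 4 work-items this reproduces the partition shown earlier: work-item 0 gets indices 0..1, work-item 1 gets 2..4, work-item 2 gets 5..6 and work-item 3 gets 7..9.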

There are a number of problems with this approach. First of all, the memory access pattern is not even close to being coalesced. The first element accessed by work-item i is (n*i)/nWorkItems, which means that much of the data retrieved in each memory transaction is wasted.

Another problem is that the number of iterations of the for loop may differ between work-items, in which case some processing is lost to flow divergence, a topic to be discussed in other articles.

### 3.3 Optimization strategy

These will be the steps to optimize the code:

- Preprocess and prepare the input vectors so that the number of remaining elements to be processed is a multiple of a chosen number of work-items;
- Use coalesced memory access;
- Use __local reductions to exploit workgroup shared memory.

### 3.4 Preprocessing

As discussed before, there is an ideal number of work-items per kernel launch that fully populates the SIMD chip and also hides latency. The problem is that, more often than not, the data does not come in the most convenient format.

Suppose NWORKITEMS is the ideal number of work-items for a particular GPU, and that we want to compute the dot product of 2 n-dimensional vectors. The preprocessing consists in performing a partial computation to reduce the remaining operations to an integer multiple of NWORKITEMS.

For example: if n = 10 and NWORKITEMS = 8, and the reduced answers are stored in ans[8]:

- Initialize ans[i] = 0, i=0, …, 7
- Preprocess: ans[0] = v1[8]*v2[8]; ans[1] = v1[9]*v2[9];
- Finish computation by adding the appropriate values to the answer vector.
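The split described above can be sketched in plain C (a hypothetical helper, for illustration): write n = k*NWORKITEMS + p, fold the p trailing elements into the answer vector, and leave exactly k*NWORKITEMS elements for the main kernel.

```c
#include <assert.h>

/* Hypothetical helper: decompose n = k*nworkitems + p.
 * The p trailing products go into ans[0..p-1] via the pre-sum kernel;
 * the main kernel then processes the remaining k*nworkitems elements. */
void split_work(int n, int nworkitems, int *k, int *p)
{
    *k = n / nworkitems;
    *p = n - (*k) * nworkitems;
}
```

For n = 10 and NWORKITEMS = 8 this gives k = 1 and p = 2, matching the example: ans[0] and ans[1] absorb the products of indices 8 and 9.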

The following computation preprocesses the float[] vectors to reduce the remaining dimension to an integer multiple of NWORKITEMS:

```csharp
public static double CoalescedLocalRedCLDotProduct(CLCalc.Program.Variable CLv1, CLCalc.Program.Variable CLv2, bool useMad)
{
    if (CLv1.OriginalVarLength != CLv2.OriginalVarLength) throw new Exception("Vectors must have the same length");

    double resp = 0;
    int n = CLv1.OriginalVarLength;
    double[] resps = new double[Config.NWORKITEMS];

    CLn.WriteToDevice(new int[] { n });
    CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[] { CLv1, CLv2, CLresps, CLn };

    //Write n = k*NWORKITEMS + p. Preprocess to eliminate p`s and leave summation only to a multiple of NWORKITEMS
    int k = n / Config.NWORKITEMS;
    int p = n - k * Config.NWORKITEMS;

    //Clears partial responses
    kernelClear.Execute(args, Config.NWORKITEMS);

    //Sums the p last elements into the p first elements
    kernelPreSum.Execute(args, p);
```

The associated OpenCL kernels are:

```c
__kernel void ClearResps(__global const float * v1,
                         __global const float * v2,
                         __global float * resps,
                         __constant int * NN)
{
    int i = get_global_id(0);
    resps[i] = 0.0f;
}

__kernel void PreSum(__global const float * v1,
                     __global const float * v2,
                     __global float * resps,
                     __constant int * NN)
{
    int i = get_global_id(0);
    int p = get_global_size(0);
    int n = NN[0];

    int ind = n - p + i;
    resps[i] = v1[ind] * v2[ind];
}
```

### 3.5 Coalesced implementation

Now that the number of remaining items to be processed is an integer multiple of NWORKITEMS (n = k*NWORKITEMS), each work-item iterates k times using steps of NWORKITEMS. This means that in the j-th iteration, work-item i will access v1[i + j*NWORKITEMS]:

j=0: workitem i accesses v1[i] and v2[i] (coalesced)

j=1: workitem i accesses v1[i+NWORKITEMS] and v2[i+NWORKITEMS] (coalesced)

etc.
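In plain C, the access pattern of a single work-item can be modeled as follows (a hypothetical sequential helper, for illustration only — on the GPU all work-items run this loop concurrently, and at each j their accesses are contiguous, hence coalesced):

```c
#include <assert.h>

/* Sequential model of the coalesced kernel: work-item i visits indices
 * i, i + nWorkItems, i + 2*nWorkItems, ... for k iterations. */
float coalesced_partial_dot(const float *v1, const float *v2,
                            int i, int nWorkItems, int k)
{
    float val = 0.0f;
    int ind = i;
    for (int j = 0; j < k; j++) {
        val += v1[ind] * v2[ind];
        ind += nWorkItems;
    }
    return val;
}
```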

This is performed using the following kernel:

```c
__kernel void CoalescedDot(__global const float * v1,
                           __global const float * v2,
                           __global float * resps,
                           __constant int * KK)
{
    int i = get_global_id(0);
    int nWorkItems = get_global_size(0);
    int k = KK[0];

    float val = 0.0f;
    int ind = i;
    for (int j = 0; j < k; j++)
    {
        val += v1[ind] * v2[ind];
        ind += nWorkItems;
    }
    resps[i] += val;
}
```

Notice that the host code will have to sum the 1024 reduced values computed in this manner.

### 3.6 Coalesced implementation with __local reduction

After all work-items have completed their tasks, it's possible to use the fast __local memory to perform reductions without having to launch additional kernels, while still exploiting the parallel structure of the SIMD hardware. This also means that the postprocessing host code will only have to process NWORKITEMS/WORKGROUPSIZE answers stored in the resps[] vector.

```c
__kernel void CoalLocalDot(__global const float * v1,
                           __global const float * v2,
                           __global float * resps,
                           __constant int * KK)
{
    int i = get_global_id(0);
    int nWorkItems = get_global_size(0);
    int k = KK[0];

    float val = 0.0f;
    int ind = i;
    for (int j = 0; j < k; j++)
    {
        val += v1[ind] * v2[ind];
        ind += nWorkItems;
    }
    val += resps[i];

    //Reduction part
    int p = get_local_id(0);
    int q = get_group_id(0);
    int maxp = get_local_size(0);

    __local float values[256];
    values[p] = val;
    barrier(CLK_LOCAL_MEM_FENCE);

    maxp = maxp >> 1;
    while (maxp > 0)
    {
        if (p < maxp) values[p] += values[p + maxp];
        maxp = maxp >> 1;
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    if (p == 0) resps[q] = values[0];
}
```

[important]Notice that as maxp gets divided by 2, the utilization of work-items drops by half. This reduces the level of parallelization of the algorithm but still reduces the need to use __global fetches.[/important]

## 4. Results

The pictures below show the results obtained in real tests. We encourage the reader to download the sample code and verify the results.

## 5. Conclusion

Our tests demonstrate that it is possible to achieve speedups on the order of 25x compared to the CPU double-precision computation of the dot product for vectors with close to a million elements.

We also show that coalesced memory access and __local reduction significantly reduce the GPU computation time because of better use of each memory fetch and reduced need to access __global memory and launch kernels.

Finally, we conclude that, when the data structure allows it, it is very effective to preprocess the data so that each workgroup is fully loaded with work-items (by using multiples of 1024), and that __local reduction speeds up the computation while reducing kernel launch overhead.

Next articles will discuss other optimization topics such as flow divergence and we intend to update the source code in this section to include new optimization tests.

I have an optimization problem. In my program I call a kernel about 400 times, and the total duration is 15 ms; if I comment out all the kernel code, the duration is 10 ms. How can I optimize this, given that it is an iterative process?

Dear pavel;

For proper advice we would need more information about your code.

However, I assume that the next kernel call needs information that has been processed by the previous call, am I right?

One way to optimize this is to use the new OpenCL 2.0 Dynamic Parallelism. Be sure that your GPU supports these kinds of instructions.

An example is shown here:

https://www.khronos.org/assets/uploads/developers/library/2013-siggraph-opencl-bof/OpenCL-BOF_SIGGRAPH-2013.pdf

We haven’t added tutorials lately but we may work on the new OpenCL 2.0 features if programmers find it useful.

Regards

Douglas

Most of the time is spent on global memory access. I think by commenting the kernel code you removed the accesses to global variables. One optimization: use local/private variables where possible.

Hey, there! Great article. I was wondering where you derive __constant int * KK from in the kernel? Is it a buffer?

Cheers. 🙂

Hello Dan!

KK comes from the following code:

int n = CLv1.OriginalVarLength;

double[] resps = new double[Config.NWORKITEMS];

CLn.WriteToDevice(new int[] { n });

CLCalc.Program.Variable[] args = new CLCalc.Program.Variable[] { CLv1, CLv2, CLresps, CLn };

It's the length of the vector, passed inside an int array. It may look odd to make an array just to pass one value, but that's how it is done in OpenCL.

Thanks so much for the reply! I’ve made the changes, but it crashes the computer now (lol). I suspect some sort of infinite loop is going on somewhere here. Mind taking a look?

__kernel void DotProduct(

__global const float * v1,

__global const float * v2,

__global float * resps,

__constant int * n

){

int i = get_global_id(0);

int nWorkItems = get_global_size(0);

int ind = i;

int k = n[0];

float val = 0.0f;

for (int j = 0; j > 1;

while(maxp > 0) {

if (p > 1;

barrier(CLK_LOCAL_MEM_FENCE);

}

if (p == 0)

resps[q] = values[0];

}

Disregard that last comment. It seems to have cut much of the text off for some reason. Thanks, anyways. 🙂

Hi Dan! Try posting your code here: http://dontpad.com/openclcodedan