Get the source code for this section.

This section is dedicated to processing matrix multiplication using the GPU.

We are going to implement a class that multiplies two matrixes without using __local variables and create another implementation using __local variables, to compare local sync performance versus simple worker processing performance.

This section should be easy to understand provided you know how to multiply matrixes and that you read and understood the advanced aspects of OpenCL.

Look at the screenshot below:

As you can see, it was possible to achieve a 298x increase in speed with some accuracy loss (due to drivers, should get fixed soon) and a 105x increase with very low accuracy loss. Of course, the comparison is a bit unfair because we’re comparing compute times for double-precision calculation and single-precision calculation. If you want, you may try to compare using floats in the processor. In my case the speedups were 247x and 87x respectively.

## 1. Definition of the problem

What we want to do is: given two matrixes M1 and M2 having dimensions M1[P,Q] and M2[Q,R], respectively, find the matrix product M3[P,R] = M1*M2. The picture below should refresh your mind on how to multiply two matrixes:

As you can see, the [i.j] element of the answer is calculated by:

M3[i,j] = sum (M1[i,k]*M2[k,j], k = 0, 1, …, Q)

### 1.1 Serial solution

The serial solution is quite straightforward, I’ll post it here only to refresh the mind of those who might have forgotten linear algebra and matrix multiplication. You may also want to check that this is consistent with the picture above.

public double[,] MultiplyNoOpenCL(float[,] M1, float[,] M2) { //M pxq, N qxr int p = M1.GetLength(0); int q = M1.GetLength(1); int r = M2.GetLength(1); if (q != M2.GetLength(0)) throw new Exception("Matrix dimensions do not match for multiplication"); double[,] resp = new double[p, r]; for (int i = 0; i < p; i++) for (int j = 0; j < r; j++) for (int k = 0; k < q; k++) resp[i, j] += M1[i, k] * M2[k, j]; return resp; }

## 2. OpenCL calculation without __local

Matrix multiplication gives us an interesting opportunity to use work dimension two: worker i,j will calculate the i-th, j-th component of the answer matrix, thus giving us global_work_size = [P, R].

Recall that we need to express the matrixes using only vectors. We shall reinterpret the matrixes like this:

M1[i,j] = vecM1[i + P*j]

M2[i,j] = vecM2[i + Q*j]

M3[i,j] = vecM3[i + P*j]

This allows us to easily transfer data between host and device. The host code to perform these transformations can be found in functions VectorToMatrix and MatrixToVector. Take a look at the source code if you want to know the details.

Like mentioned before, we will throw P*R workers using work_dim = 2, global_work_size = [P,R], in a way so that each worker will compute one component of the answer. To do this we need to get (now transformed into) vectors M1, M2 and M3 and also the value of Q:

__kernel void floatMatrixMult(__global float * MResp, __global float * M1, __global float * M2, __global int * q) { // Vector element index int i = get_global_id(0); int j = get_global_id(1); int p = get_global_size(0); int r = get_global_size(1); MResp[i + p * j] = 0; int QQ = q[0]; for (int k = 0; k < QQ; k++) { MResp[i + p * j] += M1[i + p * k] * M2[k + QQ * j]; } }

This is very similar to the serial code and shouldn’t be a problem to understand.

## 3. OpenCL calculation with __local

This is the main point of this case study because it shows how to take advantage of workgroups by using barrier and __local variables. What we want at this point is to split the matrix multiplication problem into submatrixes that fit in the local workgroup size. Take a look at the picture below:

Notice that we have split each matrix into 2×2 submatrixes and we can compute each 2×2 submatrix of the answer as the sum of products of the first submatrixes, as if the submatrixes were elements of a matrix. So what we do now is:

- Define a BLOCK_SIZE, which would usually be 16×16 (local_work_size);
- All dimensions P, Q and R should be multiples of BLOCK_SIZE;
- Take submatrixes subM1 [P/BLOCK_SIZE, Q/BLOCK_SIZE] and subM2[Q/BLOCK_SIZE, R/BLOCK_SIZE] and calculate subM3[P/BLOCK_SIZE, R/BLOCK_SIZE];
- Notice that the element [x,y] of matrix subM1[i,j] corresponds to element M1[i*BLOCK_SIZE+x, j*BLOCK_SIZE+y] and thus has to be accessed (in order to copy from global to local memory) by using M1[i*BLOCK_SIZE+x + P*(j*BLOCK_SIZE+y)];
- The same reason from 4 tells us that subM2[i,j][x,y] = M2[i*BLOCK_SIZE+x + Q*(j*BLOCK_SIZE+y)] and subM3[i,j][x,y] = M3[i*BLOCK_SIZE+x + P*(j*BLOCK_SIZE+y)];
- We can now copy the submatrixes into __local memory and perform the submatrixes multiplication using __local memory.

Now on to the code. Notice that it is necessary to ensure that local copies have been performed before continuing by using a barrier() call. Since what needs sync is local memory we use a local memory synchronization (CLK_**LOCAL**_MEM_FENCE):

#define BLOCK_SIZE 8 __kernel __attribute__((reqd_work_group_size(BLOCK_SIZE, BLOCK_SIZE, 1))) void floatMatrixMultLocals(__global float * MResp, __global float * M1, __global float * M2, __global int * q) { //Identification of this workgroup int i = get_group_id(0); int j = get_group_id(1); //Identification of work-item int idX = get_local_id(0); int idY = get_local_id(1); //matrixes dimensions int p = get_global_size(0); int r = get_global_size(1); int qq = q[0]; //Number of submatrixes to be processed by each worker (Q dimension) int numSubMat = qq/BLOCK_SIZE; float4 resp = (float4)(0,0,0,0); __local float A[BLOCK_SIZE][BLOCK_SIZE]; __local float B[BLOCK_SIZE][BLOCK_SIZE]; for (int k=0; k<numSubMat; k++) { //Copy submatrixes to local memory. Each worker copies one element //Notice that A[i,k] accesses elements starting from M[BLOCK_SIZE*i, BLOCK_SIZE*j] A[idX][idY] = M1[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*k+idY)]; B[idX][idY] = M2[BLOCK_SIZE*k + idX + qq*(BLOCK_SIZE*j+idY)]; barrier(CLK_LOCAL_MEM_FENCE); for (int k2 = 0; k2 < BLOCK_SIZE; k2+=4) { float4 temp1=(float4)(A[idX][k2],A[idX][k2+1],A[idX][k2+2],A[idX][k2+3]); float4 temp2=(float4)(B[k2][idY],B[k2+1][idY],B[k2+2][idY],B[k2+3][idY]); resp += temp1 * temp2; } barrier(CLK_LOCAL_MEM_FENCE); } MResp[BLOCK_SIZE*i + idX + p*(BLOCK_SIZE*j+idY)] = resp.x+resp.y+resp.z+resp.w; }

Notice that we also take advantage of the fact that the BLOCK_SIZE is a multiple of 4 by using vector operation with variable **resp**. Also notice that even the copy to and from global memory is split among work-items.

The __attribute__ qualifier is used to give some hints or requirements to the OpenCL driver. In this case, we are informing the API that the kernel won’t give correct answers unless local_work_size is set to [BLOCK_SIZE, BLOCK_SIZE]. Check the OpenCL specification for more details.

When the dimension of the problem gets too big, some of the results are not computed properly. I assume that this is because the space for __local storage wears out or this might be just some ATI driver bug. Another thing you might have noticed is that I used local_work_size = [8,8], this time because of a known issue in ATI OpenCL drivers. I hope this gets fixed soon. Another limitation is that the dimensions of the matrixes have to be multiples of the BLOCK_SIZE you are using.

## 4. Conclusion

A method for efficiently calculating matrix multiplication has been presented, both using __local (and barrier) and without __local qualifiers. As of now, some drivers present issues to manage __local memory and algorithms should be carefully planned if using __local memory is intended because the calculations give wrong results but no exception is thrown.

I would recommend using the __local algorithm to compute the products of matrixes up to 400×400 and NOT to use __locals otherwise. This is because of driver issues which I think will be solved in the near future.