Overview of OpenCL and parallel processing

Overview of OpenCL

Let’s get a basic overview of how OpenCL works. Take a look at the simplified scheme below:

[Figure: simplified scheme of the OpenCL host and devices]

You can see that there is a HOST and there are DEVICES. A closer look shows that the host sends data to the devices, issues execution commands and retrieves results.

How does that work?

The HOST executes the code you write as usual, using C#, C++ or the language of your choice. This is the program itself, which is running on the processor, being managed by the operating system and all that stuff.

The DEVICES execute OpenCL code, written in OpenCL C99 (which we will discuss later). There is a specific OpenCL compiler for the CPU, for the GPU and for accelerator cards (NVIDIA Tesla and AMD FireStream for now).

The question is: ok, I can run my regular Windows program by double-clicking it. How do I run the OpenCL program? That’s where the OpenCL API comes into play. The OpenCL API has functions to identify devices, compile programs, send and receive information and run OpenCL programs on the chosen device.

So, this is basically how it works (there are more steps but we will stick to these for now):

OpenCL Code:
1 – Create the OpenCL code you want to run using OpenCL C99 language;

Host Code:
2 – Create your program (using C#, for example);
3 – Create the data you want to process;
4 – Use the OpenCL API to transfer data to the devices;
5 – Use the OpenCL API to launch the execution;
6 – Retrieve any needed data.

Sample code

This is a sample code to illustrate the steps above. Don’t worry about understanding the syntax now, focus on understanding how OpenCL works. Our next step will be a first program.

//Starts using OpenCLTemplate to set up OpenCL
using OpenCLTemplate;

private void Test_Click(object sender, EventArgs e)
{
    //Initializes devices and sets up everything
    CLCalc.InitCL();

    //Creates variables that will be passed to OpenCL
    float[] x = new float[] { 1, 2, 3, 0.123f };
    float[] y = new float[] { 1, 2, 1, 1 };

    //This is the OpenCL source code. A string! It will not be compiled
    //by your compiler. It will be compiled by the OpenCL compiler.
    string s = @"
        kernel void
        sum(global float4 * x, global float4 * y)
        {
            x[0] = x[0] + y[0];
        }";

    //Use the API to compile the program
    CLCalc.Program.Compile(new string[] { s });

    //Gets a handle to the OpenCL function we will call
    CLCalc.Program.Kernel sum = new CLCalc.Program.Kernel("sum");

    //Copies variables x and y to the OpenCL device memory
    CLCalc.Program.Variable varx = new CLCalc.Program.Variable(x);
    CLCalc.Program.Variable vary = new CLCalc.Program.Variable(y);

    //Tells OpenCL that the arguments of the "sum" kernel are x and y
    CLCalc.Program.Variable[] args = { varx, vary };

    //This is the number of work_items. We will discuss this later.
    int[] max = new int[] { 1 };

    //Use the OpenCL API to execute the sum code with the
    //arguments and work_items we specified
    sum.Execute(args, max);

    //Reads the x variable from device memory and stores it in x.
    varx.ReadFromDeviceTo(x);
}
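
After ReadFromDeviceTo, x should hold { 2, 4, 4, 1.123f }: each 4-element array is treated as a single float4, and the kernel adds the two vectors component by component.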

Parallel processing

So what’s the big deal? Why is OpenCL so useful for computational mathematics? Because it can do parallel processing. Consider the following example: how do you sum two n-dimensional vectors? Let’s do it in C# first.

int n = 1000;
float[] v1 = new float[n];
float[] v2 = new float[n];
float[] v3 = new float[n];

for (int i = 0; i < n; i++)
{
    v3[i] = v1[i] + v2[i];
}

What happens inside the loop? The program calculates v3[0], then v3[1] and so on, until v3[n-1]. Now let’s take a look at the OpenCL version; the kernel below is real OpenCL code, and the host part is shown as pseudocode.

OpenCL code:

__kernel void
floatVectorSum(__global float * v1,
               __global float * v2)
{
    // Vector element index
    int i = get_global_id(0);
    v1[i] = v1[i] + v2[i];
}

Host pseudocode:

Initialize OpenCL
Create v1, v2 and v3
Copy v1 and v2 to OpenCL device memory
Set v1 and v2 as arguments to floatVectorSum
Tell OpenCL to execute floatVectorSum with 1000 workers (one per vector element)
Read the result back from v1 (the kernel stores the sum there) into v3
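
To make the pseudocode concrete, here is a minimal sketch of what the host code could look like, using the same OpenCLTemplate calls as the earlier sample (the kernel writes its result into v1, so the host reads it back into v3; the method name VectorSumTest is just a placeholder):

//A sketch of the host code for the vector sum, following the pattern above
using OpenCLTemplate;

private void VectorSumTest()
{
    //Initialize OpenCL
    CLCalc.InitCL();

    //Create v1, v2 and v3
    int n = 1000;
    float[] v1 = new float[n];
    float[] v2 = new float[n];
    float[] v3 = new float[n];

    //The kernel source shown above, as a string
    string src = @"
        __kernel void
        floatVectorSum(__global float * v1,
                       __global float * v2)
        {
            int i = get_global_id(0);
            v1[i] = v1[i] + v2[i];
        }";

    //Compile the kernel and get a handle to it
    CLCalc.Program.Compile(new string[] { src });
    CLCalc.Program.Kernel floatVectorSum = new CLCalc.Program.Kernel("floatVectorSum");

    //Copy v1 and v2 to OpenCL device memory
    CLCalc.Program.Variable varV1 = new CLCalc.Program.Variable(v1);
    CLCalc.Program.Variable varV2 = new CLCalc.Program.Variable(v2);

    //Set v1 and v2 as arguments to floatVectorSum
    CLCalc.Program.Variable[] args = { varV1, varV2 };

    //Tell OpenCL that there will be 1000 workers (work_items)
    int[] workers = new int[] { n };
    floatVectorSum.Execute(args, workers);

    //Read the result back from device memory; the kernel stored it in v1
    varV1.ReadFromDeviceTo(v3);
}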

See the difference? With OpenCL we can have MANY workers, each executing a small piece of the work, instead of a single worker doing the whole job. The 1000 sums are executed at the same time, in parallel. Keep in mind that this is only possible because each v3[i] depends only on v1 and v2 values, not on the other results.
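
To see why that restriction matters, consider a loop in which each result depends on the previous one, a running sum for example. A quick sketch (reusing the hypothetical v1, v3 and n from above):

//Each iteration needs the result of the previous one (v3[i - 1]),
//so the 1000 sums cannot simply be handed to 1000 independent workers.
v3[0] = v1[0];
for (int i = 1; i < n; i++)
{
    v3[i] = v3[i - 1] + v1[i];
}

Loops like this need a different strategy (or have to stay sequential), while the element-wise sum parallelizes trivially.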

Let’s compare regular code and OpenCL:

Topic                                                          | Regular Code | OpenCL
Workers executing code (threads)                               | Usually one  | Typically thousands
Hardware-accelerated vector function evaluation?               | No           | Yes, on GPUs and accelerators
Possible to use all available processing devices for computing?| No           | Yes
Multithreading needs explicit locking of objects?              | Yes          | No
Necessary to manually set arguments and number of workers?     | No           | Yes
Easy to use existing algorithms?                               | Yes          | No
Special development needed to make the code run in parallel?   | No           | Yes
Broad availability and support?                                | Yes          | No

Diego and I have believed since 2003 that parallel processing would be the future, so we had been eagerly looking forward to OpenCL's arrival. Engineers and scientists can get awesome results from OpenCL, and if you use the computer for math and heavy calculations, OpenCL is the way to go. On the other hand, if you develop web components and interfaces, you should ask yourself whether you can really benefit from OpenCL.
