GPU computing is all about speeding up applications and making it viable to run complex interactive software in real time. Picture 1 shows some speedups obtained in scientific applications (HOBEROCK and TARJAN):
Picture 1 – Speedups in scientific applications
The image below shows accelerations of up to 30x in fluid dynamics simulation (BERNARZ):
Picture 2 – CFD Example
Keep in mind that speedups of around 200x are possible and 30x is rather common. A 30x speedup means that code which used to take 1 minute now runs in 2 seconds.
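The arithmetic behind these figures can be checked with a short sketch. The function name `accelerated_runtime` is hypothetical and introduced only for illustration; the 30x and 200x factors are the ones quoted above.

```python
def accelerated_runtime(original_seconds: float, speedup: float) -> float:
    """Return the runtime after applying a given speedup factor."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return original_seconds / speedup


if __name__ == "__main__":
    one_minute = 60.0
    # Compare the "rather common" and the "possible" speedups on a 1-minute job.
    for factor in (30, 200):
        print(f"{factor}x: {accelerated_runtime(one_minute, factor):.1f} s")
```

The same one-minute job would take about a third of a second at 200x, which is the difference between a batch job and an interactive tool.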
The computing environment has changed significantly in the recent past with the flood of ultra-small handheld electronic devices such as cell phones and tablet computers. In contrast, cloud computing, where superservers take care of complex collaborative environments, keeps growing in importance and usage.
Both servers and small devices can greatly benefit from parallel processing and GPU acceleration: tablet and notebook computers could potentially match the speed of today’s desktop computers, and servers can perform much faster. As soon as OpenCL and vendors’ drivers become mature, GPU acceleration is very likely to become the standard.
In this work we intend to demonstrate that GPU processing is a viable option for almost any software to be developed in the future. The main reasons are:
|High availability of GPUs||Almost any personal computer now has a GPU, which stays idle most of the time if the user only needs text editors, spreadsheets and office programs in general|
|Dramatic performance increase||It is possible to accelerate a really wide variety of algorithms by 2x just by parallelizing the inner loops of computations. Compare this to the enormous effort programmers need (writing pieces of code in Assembly, building small clusters) to get a 20% or 30% acceleration, and suddenly GPUs become very attractive|
|Relative ease of use||Even non-optimized C code copied and pasted from an innermost for-loop will probably accelerate a given application. Besides, there is free, open code that one can use right away without having to write a single line of GPGPU code|
|Heavy industry support||There are various forums where you can post your idea and ask study groups to post code showing how to perform some task in parallel. In fact, we at CMSoft implement some of these ideas for demonstration purposes|
|Rising interest in parallel processing||More and more manufacturers look favorably on this technology, as it will enable smaller devices to reach a performance comparable to what we only see today in desktop computers|
|Highly developed hardware||GPUs have evolved incredibly because of games. Now that their power can be used for general computation, researchers still struggle to efficiently use the smorgasbord of available resources, such as texture caches, workgroup shared memory, native floating-point operations and vectorization, to name a few|
|Numerous open possibilities||Very few programs use GPU acceleration, for reasons that will be presented later on. Scientific software like MATLAB and AnSys has just begun using GPGPU, and GPU versions of math algorithms like FFT and BLAS (linear algebra) have just come to life. Even Photoshop, Vegas and Camtasia, which are designed to manipulate images, don’t use all that GPUs have to offer. There are countless open possibilities|
One of the key aspects of the parallel model is that it has a mixed structure: the CPU runs a Host code that controls memory transfers and Device execution. This means that, in this new programming paradigm, the programmer has to explicitly expose the parallelization of the code. In other words, developers must tell the computer what parts of the code should be executed serially and what parts can run simultaneously.
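A minimal sketch of this split, written in plain Python rather than OpenCL so that it runs anywhere: `saxpy_kernel` plays the role of the Device code executed once per index, and `host_launch` plays the role of the Host code that prepares buffers and schedules the work. Both names, and the use of a thread pool as a stand-in for the Device, are assumptions made for illustration. A real OpenCL program would instead compile a kernel at runtime and enqueue it on a command queue.

```python
from concurrent.futures import ThreadPoolExecutor


def saxpy_kernel(global_id, a, x, y, out):
    """Device-side work: each invocation handles exactly one element."""
    out[global_id] = a * x[global_id] + y[global_id]


def host_launch(a, x, y, workers=4):
    """Host-side work: allocate the output, then launch one kernel call per index."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")  # serial check
    out = [0.0] * len(x)  # serial part: buffer allocation
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Parallel part: iterations are independent, so their order does not matter.
        futures = [pool.submit(saxpy_kernel, i, a, x, y, out) for i in range(len(x))]
        for f in futures:
            f.result()  # re-raise any error from a work item
    return out  # serial part: results are back on the host


if __name__ == "__main__":
    print(host_launch(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))
```

The thread pool only models the structure of the Host/Device split; it is not expected to run faster than a serial loop. The point is that the developer, not the compiler, decides that the loop body is safe to run for all indices at once.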
Much more could be said about vendors’ commitment to the new GPGPU (general-purpose computation using GPUs) technology, but this quote from AMD’s blog summarizes it very well:
“Specifically, Intel has released an alpha OpenCL SDK for Intel Core processors, while ARM introduced an embedded graphics chip with OpenCL support. In fact, ARM went so far as to say OpenCL will be on every smartphone by 2014. And, quite frankly, we couldn’t be happier than to see this happen.” 
Put this together with the ARM/NVidia partnership to build CPU cores (see the NVidia entry in the references) and there’s even stronger evidence of the parallel processing trend:
“With Project Denver, we are designing a high-performing ARM CPU core in combination with our massively parallel GPU cores to create a new class of processor,” said Jen-Hsun Huang, president and CEO of NVIDIA.
Let us summarize industry support with a picture:
Picture 3 – OpenCL industry support
The reasons above make a strong case, and each point is developed further in the next sections.
2. CPU vs GPU approach
An immediate question is: why can the GPU accelerate some applications so much, and why is it less useful for others? The GPU is not inherently better than the CPU; one has to remember that their purposes are different and that the parallel model involves a CPU (Host) controlling memory transfers and scheduling Device operations (GPUs and other compatible hardware). Remember that CPUs perform general tasks, serial or not, scheduled by the OS, while GPUs excel at simple tasks that can be performed using a heavily parallel model, such as graphics, which was the reason GPUs were created in the first place.
However, it wouldn’t be wise to ignore the huge difference in peak computing performance between CPUs and GPUs right now:
Picture 4 – Comparison of peak performance
3. Why OpenCL?
At the moment, people interested in using GPGPU have few choices: the slightly more mature NVidia-specific CUDA, AMD’s Stream, Microsoft DirectCompute or OpenCL, the latter being the only open standard.
NVidia pioneered GPU computing with CUDA, which is the primary reason why CUDA is the most used GPGPU tool in scientific research. Any new technology needs some company to believe in it before others come in and, when it comes to GPGPU, NVidia really embraced it.
CUDA also comes with a large set of libraries, such as CUBLAS and CUSPARSE (linear algebra), CUFFT (Fourier transform) and PhysX (physics routines for games) and, as such, may be a suitable option for research teams and laboratories.
On the other hand, as mentioned, products like Photoshop can take great advantage of parallel processing because, let’s face it, GPUs have always been designed to deal with images, polygons and the like. However, it’s hard to ship a product that only gets acceleration if the consumer has a specific brand of GPU or a specific operating system, as with DirectCompute (Microsoft has its own history of challenging open standards with DirectX). Thus, OpenCL as an open standard fills the gap that prevented manufacturers from considering GPGPU as a viable algorithm-accelerating tool.
Quoting AMD’s blog again:
“While AMD can acknowledge that proprietary solutions helped pioneer GPGPU (including CUDA and Brook+, a standard which AMD supported for years), history has proven that open solutions tend to benefit the industry more in the long run. And in order for an open solution to really be successful, it requires broad industry support.”
A last word on this: it’s important to have more than one standard available. We’ve seen this with OpenGL and DirectX. For example, DirectX went after OpenGL’s quad-buffered stereo using shutter technology, and OpenGL is behind DirectX when it comes to tessellation. The point here is that standard developers tend to innovate more when there are competitors and, in the long run, programmers win because they have more tools and consumers win because they get access to a broader set of options.
Advantages and disadvantages of OpenCL compared to CUDA/ATI Stream:
|OpenCL||CUDA/ATI Stream|
|Code compilation at runtime||No runtime compilation|
|Any vendor can become compliant||Brand-specific|
|Source code reusable across platforms||No source code reuse across platforms|
|Doesn’t allow all vendor-specific optimizations||Allows vendor-specific optimizations|
|Code has to be manually fine-tuned for each vendor||Compilers fine-tune the code for the hardware|
4. What’s coming up: AMD Fusion, Intel Larrabee
Another good reason to consider parallel processing is the arrival of hybrid CPU-GPU processors from AMD and Intel, namely Fusion and Larrabee. (The text below is adapted from the Larrabee microarchitecture and AMD Fusion entries in the references. The original description was about Larrabee, but Fusion will probably follow.)
Differences with current GPUs
Larrabee will differ from older discrete GPUs such as the GeForce 200 Series and the Radeon 4000 series in three major ways:
- Larrabee/Fusion will use the x86 instruction set with specific extensions.
- They will probably feature cache coherency across all their cores.
- They will probably include very little specialized graphics hardware, instead performing tasks like z-buffering, clipping, and blending in software, using a tile-based rendering approach.
This will probably make Larrabee/Fusion more flexible than current GPUs, allowing more differentiation in appearance between games or other 3D applications. Intel’s SIGGRAPH 2008 paper mentions several rendering features that are difficult to achieve on current GPUs: render target read, order-independent transparency, irregular shadow mapping, and real-time raytracing.
More recent GPUs such as ATI’s Radeon HD 5xxx and Nvidia’s GeForce 400 Series feature increasingly broad general-purpose computing capabilities via DirectX11 DirectCompute and OpenCL, as well as Nvidia’s proprietary CUDA technology, giving them many of the capabilities of the Larrabee.
Differences with CPUs
The x86 processor cores in Larrabee/Fusion will probably differ in several ways from the cores in current Intel CPUs such as the Core 2 Duo or Intel Core, mainly by including a greater number of simpler cores.
Theoretically Fusion/Larrabee’s x86 processor cores are able to run existing PC software, or even operating systems. A different version of Larrabee might sit in motherboard CPU sockets using QuickPath, but Intel has not yet announced plans for this. Though Larrabee Native’s C/C++ compiler includes auto-vectorization and many applications can execute correctly after recompiling, maximum efficiency may require code optimization using C++ vector intrinsics or inline Larrabee assembly code. However, as in all GPGPU, not all software benefits from utilization of a vector processing unit. One tech journalism site claims that Larrabee graphics capabilities are planned to be integrated in CPUs based on the Haswell microarchitecture.
In this work we tried to demonstrate how general-purpose computing using GPUs (GPGPU) has become a viable option to accelerate any demanding software, especially now that an open standard is available. One key aspect to keep in mind is that, in the parallel model, the developer needs to explicitly determine which parts of the code can run in parallel and which need to remain serial.
The CPU architecture is very different from a GPU in the sense that it needs to perform all types of computation and handle interrupts and events, while the GPU is dedicated to very specific tasks.
Considering the advantages of an open standard when compared to a proprietary package and observing future architectures, it’s safe to say that the industry is pointing towards massively parallel computing and OpenCL is going to play a very important role in this process.
AMD Blogs, http://blogs.amd.com/fusion/2010/11/19/opencl-momentum-grows/, as per DEC/2010;
Larrabee Microarchitecture, http://en.wikipedia.org/wiki/Larrabee_(microarchitecture), as per DEC/2010;
AMD Fusion family of APUs, http://sites.amd.com/us/Documents/48423B_fusion_whitepaper_WEB.pdf, as per DEC/2010;
BERNARZ, Tomasz. Numerical Simulation in Fluid Dynamics Using GPU: a Practical Introduction. OzViz 2010 OpenCL Workshop, Dec. 2010;
HOBEROCK, J., TARJAN, David. Introduction to massively parallel computing. Stanford Engineering Everywhere, http://see.stanford.edu/see/courses.aspx, materials at http://code.google.com/p/stanford-cs193g-sp2010/, Dec. 2010;
NVidia to Build Custom CPU Cores based on ARM, http://www.dailytech.com/article.aspx?newsid=20590, Jan. 2011.