Using OpenCL Image2D Variables

 

Get the source code for this section. 

 

1. Introduction

 

The worked example of this section shows a custom 7x7 border filter that runs 285x faster on the GPU than it does on the CPU. 

GPUs are optimized to cache and sample textures quickly. This follows from the nature of the GPU itself: it is a component designed to draw textures interpolated between vertices at given coordinates, as anyone who has worked with OpenGL will recognize.

With GPGPU computing and OpenCL, the question that arises is: with so many hardware-built resources for using textures, why not use textures in OpenCL to store data? There are many advantages and a few drawbacks.

If you know the OpenGL Shading Language, you may already be familiar with fragment shader texture samplers and multitexturing. This is a very powerful technique precisely because of the GPU's ability to sample and cache images.

From the AMD OpenCL Optimization Case Study: "OpenCL™ images (...) can provide additional memory bandwidth in certain cases by using the texture caches on GPUs, coupled with texture sampling hardware for reading from these caches."

So, why use images? The main reasons are:

- Image accesses are cached and optimized by means of samplers;
- Images allow the user to use all the GPU memory. Textures can be 8192x8192 pixels of floats, which means 2^26 elements * 4 components (RGBA) per element * 4 bytes per float = 2^30 bytes = 1024 MB = 1 GB.
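
The size arithmetic in the second item can be checked with a short host-side C sketch. The helper name image_bytes is purely illustrative (it is not part of OpenCL), and 8192x8192 is the size quoted above; the actual maximum depends on the device.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes needed for a width x height image with `channels` components of
   `bytes_per_channel` bytes each. Hypothetical helper, host side only. */
uint64_t image_bytes(uint64_t width, uint64_t height,
                     uint64_t channels, uint64_t bytes_per_channel)
{
    assert(width > 0 && height > 0);
    return width * height * channels * bytes_per_channel;
}
```

Calling image_bytes(8192, 8192, 4, 4) gives 2^30 bytes, i.e. the 1 GB figure above.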

These reasons are quite powerful if you think of it. Normal variables, or buffer objects, have size limitations and are rather slow to access. So why NOT use only images? There are some limitations:

- Images may be complicated to use because it's usually necessary to read and write batches of 4 elements for better optimization;
- It's not possible to use atomic operations on image elements;
- Within a kernel, images can only be read_only or write_only;
- When using OpenGL interop, it's necessary to manipulate buffer arrays;
- The hardware may simply not be compatible;
- Buffer elements are simpler to manipulate.

So, what's the ideal situation to use images? The answer is quite simple: when the variable to be stored is relatively large and will be accessed many times. Examples:

1. An image that will be filtered using a 5x5 mask. Supposing the number of pixels is n², the number of memory reads will be 25n²;
2. A vector in a matrix-vector multiplication. If the matrix is not sparse, each vector element will be read n times.

Finally, bear in mind that, in this tutorial, we will cover how to use Image2D elements for the more general purpose of storing regular data, not just pictures.

 

 

2. Creating, writing and reading Image2Ds 

 

2.1 Using OpenCLTemplate 

 

Creating Image2Ds using OpenCLTemplate is quite an easy task, much like creating regular variables. What changes the most is how the images are used inside the kernel and the fact that it is necessary to specify the width and height of the image. In section 3, Manipulating Image2Ds using OpenCL C99 code, we'll discuss a clever way to store regular vectors using OpenCL Image2D elements. OpenCLTemplate simplifies the problem by creating RGBA textures of float, int or byte (byte is read in OpenCL C99 as uchar), or directly from a given C# System.Drawing.Bitmap.

OpenCLTemplate creates Image2Ds using the RGBA color format with floats, ints and bytes. While this may restrict some options, it is very convenient because it fully automates the image creating procedure.

A 5x6 image with RGBA components would be represented as follows:

[Figure: representation of a 5x6 RGBA image]

Thus the image vector v needs to have dimension byte[5 * 6 * 4], where 4 is the number of RGBA components.

Conversely, if you want to store an ordinary vector in an Image2D object, keep in mind that the vector will be reinterpreted as RGBA, so the number of stored elements has to be a multiple of 4. Afterwards, accessing its elements will also need to take this into consideration. For example, the above structure could be used to store a vector of length 6x5x4 = 120 which has nothing to do with an actual picture.
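
The relation between a pixel component and its position in the flat host vector can be sketched in C as follows. This assumes the usual tightly packed, row-major layout (rows stored one after another, 4 components per pixel); the function name rgba_index is illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Index, in the flat host array, of component c (0..3) of pixel (x, y)
   in a row-major RGBA image that is `width` pixels wide. */
size_t rgba_index(size_t x, size_t y, size_t c, size_t width)
{
    assert(x < width && c < 4);
    return 4 * (x + width * y) + c;
}
```

For the 5x6 image above (width 5, height 6), the last component of the last pixel is rgba_index(4, 5, 3, 5), which is element 119 of the 120-element vector.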

Creating, reading and writing an Image2D variable with OpenCLTemplate is as follows:

 

float[] matrix = new float[5 * 6 * 4];
for (int i = 0; i < matrix.Length; i++) matrix[i] = (float)i;
CLCalc.Program.Image2D img = new CLCalc.Program.Image2D(matrix, 5, 6);
img.WriteToDevice(matrix);
img.ReadFromDeviceTo(matrix);

 

This is very similar to creating, reading and writing OpenCLTemplate variables, except that now it's necessary to specify the width and height of the matrix and that each element is a 4-component number.

Another implementation makes reading/writing Bitmaps easier:

 

Bitmap bmp = new Bitmap(5, 6);
CLCalc.Program.Image2D img = new CLCalc.Program.Image2D(bmp);
Bitmap img2 = img.ReadBitmap();
pic.Image = img2;

 

The thing to remember is that each element ("pixel") of an image has 4 components. Loading images with OpenCLTemplate is easy enough and we're ready to manipulate the images using OpenCL. You may want to skip to section 3 if this is enough for your application, or continue reading if you need more control over the image format and how to read/write submatrices. Also remember that the storage order in C# is BGRA, which means that the X component of the vectors that are read corresponds to Blue, Y to Green, Z to Red and W to Alpha.

 

2.2 Using Cloo 

 

At this moment, using Cloo to manipulate Image2Ds is pretty much the same as using the API itself. It's more powerful, though, as you get to choose the desired image format and type. The line:

 

ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Rgba, ComputeImageChannelType.Float);

 

is used to choose the pixel format and data type. This means that Cloo Image2Ds created with this format will have 4 components per pixel (RGBA) and that each one of them will hold a floating point number (normal float). This task is automated in OpenCLTemplate: the image formats it creates internally are RGBA with types float (4 bytes), int (4 bytes) or byte (uchar). Cloo will probably implement this in a friendlier way in the near future. For now, just as a reference, the way to load images using Cloo can be derived from the following example (10x10 RGBA images of floats; notice that the C# dimension of the float arrays is 4*10*10 because there are 4 components per pixel):

float[] img = new float[400];
float[] img2 = new float[400];

ComputeImage2D CLimg;
ComputeImage2D CLimg2;

for (int i = 0; i < img.Length; i++) img[i] = i * 0.537f;

unsafe
{
    fixed (float* imgPtr = img)
    {
        ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Rgba, ComputeImageChannelType.Float);
        CLimg = new ComputeImage2D(Context, ComputeMemoryFlags.ReadWrite | ComputeMemoryFlags.CopyHostPointer, format, 10, 10, 10 * 4 * sizeof(float), (IntPtr)imgPtr);
        CLimg2 = new ComputeImage2D(Context, ComputeMemoryFlags.ReadWrite, format, 10, 10, 0, IntPtr.Zero);
    }

    ComputeKernel kernelImgTeste = prog.CreateKernel("imgTeste");
    kernelImgTeste.SetMemoryArgument(0, CLimg);
    kernelImgTeste.SetMemoryArgument(1, CLimg2);
    Queue.Execute(kernelImgTeste, null, new long[] { 10, 10 }, null, null);

    fixed (float* imgPtr = img2)
    {
        Queue.Read(CLimg2, true, new long[] { 0, 0, 0 }, new long[] { 10, 10, 1 }, 10 * 4 * sizeof(float), 0, (IntPtr)imgPtr, null);
    }
}

 

Notice that the rowPitch argument is closely related to the chosen ImageFormat: row dimension (10) * number of components per pixel (4 - RGBA) * sizeof(float). Using Cloo it's possible to read sub-images, something that has not been incorporated into OpenCLTemplate for the sake of simplicity and considering that general computation will simply take Image2D variables as regular vectors.
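
To make the role of the row pitch concrete, here is a host-side C sketch that computes it for a tightly packed RGBA float image and uses the same layout to copy a sub-image out of a host array. It only illustrates the memory layout and is not the Cloo or OpenCL API; both function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Row pitch in bytes of a tightly packed RGBA float image. */
size_t rgba_float_row_pitch(size_t width)
{
    return width * 4 * sizeof(float);
}

/* Copies the w x h sub-image whose top-left pixel is (x0, y0) out of a
   tightly packed RGBA float image `src_width` pixels wide into dst,
   which is itself tightly packed (w pixels per row). */
void copy_sub_image(const float *src, size_t src_width,
                    size_t x0, size_t y0, size_t w, size_t h, float *dst)
{
    assert(x0 + w <= src_width);
    for (size_t row = 0; row < h; row++) {
        const float *src_row = src + 4 * (x0 + src_width * (y0 + row));
        memcpy(dst + 4 * w * row, src_row, 4 * w * sizeof(float));
    }
}
```

For the 10x10 image above, rgba_float_row_pitch(10) equals the 10 * 4 * sizeof(float) passed as rowPitch.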

 

2.3 Using the API

 

Image2Ds are OpenCL memory objects created using the following function (from the OpenCL spec):

 cl_mem clCreateImage2D (cl_context context,
cl_mem_flags flags,
const cl_image_format *image_format,
size_t image_width,
size_t image_height,
size_t image_row_pitch,
void *host_ptr,
cl_int *errcode_ret)

These parameters are not complicated, but describing them in detail is beyond the scope of this tutorial. For more information check the OpenCL Spec at the Khronos Group. It is important to check for supported image formats, number of components and number of bytes per component.

 

3. Manipulating Image2Ds using OpenCL C99 code

 

3.1 Reading and writing Image2Ds 

 

Image2Ds are memory objects (variables) received by OpenCL kernels in a special way:


- Their type is image2d_t;
- They can either be read_only or write_only (not both);
- They need an image sampler to be read;
- They use the special read_imageXX and write_imageXX functions to read/write respectively.

In this tutorial we will focus on reading the images using their true width and height in pixels (not normalized coordinates, i.e. fractions of the width and height) and without any type of filtering. This allows us to read the actual RGBA components as they were written by the Host code.

Having made these assumptions, this is how to pass image arguments to kernels:

 

__kernel void imgTeste(__read_only  image2d_t img1,
                       __write_only image2d_t img2)

 

The sampler that reads true height/width without any filtering is:

const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
                      CLK_ADDRESS_CLAMP | //Clamp to zeros
                      CLK_FILTER_NEAREST; //Don't interpolate
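
What this sampler does can be emulated on the host with a few lines of C. The sketch below assumes an RGBA float image stored row-major in a flat array; float4_t and read_rgba_clamped are stand-in names, not OpenCL identifiers. With CLK_ADDRESS_CLAMP, reads outside an RGBA image return zeros, which is what the early return reproduces.

```c
#include <assert.h>

typedef struct { float x, y, z, w; } float4_t;  /* stand-in for OpenCL float4 */

/* Host-side emulation of read_imagef with unnormalized integer coordinates,
   CLK_ADDRESS_CLAMP and CLK_FILTER_NEAREST on an RGBA float image:
   in-range reads return the stored pixel, out-of-range reads return zeros. */
float4_t read_rgba_clamped(const float *img, int width, int height, int x, int y)
{
    float4_t zero = {0.0f, 0.0f, 0.0f, 0.0f};
    assert(width > 0 && height > 0);
    if (x < 0 || y < 0 || x >= width || y >= height) return zero;
    const float *p = img + 4 * (x + width * y);
    float4_t v = {p[0], p[1], p[2], p[3]};
    return v;
}
```

This behavior is convenient for filters: pixels near the border can sample outside the image without explicit bounds checks in the kernel.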

 

Using OpenCLTemplate, we can write float, int or byte types. In OpenCL C99 these need to be read and written using the functions read/write_imagef, read/write_imagei and read/write_imageui, respectively. When writing, an int2 indicates the coordinate that will receive the var4 value, where var4 is a float4, int4 or uint4. If the image type is byte, be careful to write only values from 0 to 255 in the uint4.

The following code will read a float4 stored in coordinates (x,y) in the image. Remember, you should read float4's from float images, int4's from int and uint4's from byte images:

 

int2 coord = (int2)(get_global_id(0), get_global_id(1));

float4 val = read_imagef(img1, smp, coord);

 

And finally, this will write a modified float4 into the second image:

 

val.w = 0;

write_imagef(img2, coord, val);

 

When dealing with vectors, this technique is relatively easy. For more types of samplers I suggest reading the OpenCL specification. Source code for this simple test is inside the button Image Test, available in the source code for this section.

 

3.2 Interpreting Image2Ds as regular vectors

 

One common problem when using Image2D to store values is: how is it possible to store large vectors? A single image row of the standard maximum width, 8192 float4's = 32768 floats, already holds quite a lot of data, but many vectors are longer than that. A simple trick can be applied to create very large vectors by using several rows. The mathematical problem here consists in finding a one-to-one relation between the Image2D (x,y) coordinates and the i-th vector component.

We have seen previously that a way to store a matrix as a vector is to reassign M[x,y] = v[x+WIDTH*y] = v[i], where WIDTH is the number of columns of matrix M. The inverse problem can be computed as (remember, x, y and i are integers and the divisions are floored, e.g., 2/3 = 0):

y = i / WIDTH;
x = i mod WIDTH;

Where mod is the remainder operator. The problem with this approach is that divisions and modulus are somewhat expensive operations whereas bitwise operations are very cheap. There's a pretty good explanation at Wikipedia. If WIDTH is a power of 2, i.e., there's some n such that WIDTH = 2^n (two to the power of n), these operations can be computed as very cheap bitwise operations:

y = i >> n;
x = i & (WIDTH-1);

The operation on y is called bit shifting and the operation on x is called bitmasking. In the specific case where we want to use the full available image width, 8192 = 2^13, these operations are simply:

y = i >> 13;
x = i & 0x1fff;
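
The equivalence between the division/modulus form and the shift/mask form can be checked with a small C sketch. The function names are illustrative; the equivalence relies on WIDTH being a power of two and on i being non-negative.

```c
#include <assert.h>

#define LOG2_WIDTH 13                 /* 8192 = 2^13 */
#define WIDTH (1 << LOG2_WIDTH)

/* Division/modulus form of the inverse mapping i -> (x, y). */
void index_to_xy_div(int i, int *x, int *y)
{
    assert(i >= 0);
    *y = i / WIDTH;
    *x = i % WIDTH;
}

/* Equivalent shift/mask form, valid because WIDTH is a power of two
   and i is non-negative. */
void index_to_xy_bits(int i, int *x, int *y)
{
    assert(i >= 0);
    *y = i >> LOG2_WIDTH;
    *x = i & (WIDTH - 1);
}
```

Both functions return the same pair for every non-negative i, and x + WIDTH * y recovers i.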

While this allows extracting the i-th float4, extracting the i-th actual component of the vector would involve reading the x, y, z or w component as needed. If you happen to create image vectors with WIDTH = 8192, each row of the image can store 32768 = 2^15 floats. You can compute the necessary number of rows in the image by:

 

//What row dimensions should the image have?
//Make maximum possible, 4*WIDTH = 8192*4 = 2^15
//remember, each pixel has 4 floats
int rowDim = 1 + ((x.Length - 1) >> 15);
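
The row count above is the ceiling of the vector length divided by 2^15. A minimal C sketch of the same expression, with the hypothetical name rows_for_floats, makes the boundary cases easy to check:

```c
#include <assert.h>

/* Number of image rows needed to hold n floats when each row is
   8192 RGBA pixels wide, i.e. 2^15 floats per row. Requires n >= 1. */
int rows_for_floats(int n)
{
    assert(n >= 1);
    return 1 + ((n - 1) >> 15);
}
```

A vector of exactly 32768 floats still fits in one row, while 32769 floats need two. Because the image is allocated in whole rows, the host array copied into it has to be padded to rows * 32768 floats.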

 

and the following OpenCL C99 function extracts the ind-th component of the vector from the image:

 

float ReadVecFromImg(int ind, __read_only image2d_t img)
{
   const sampler_t smp = CLK_NORMALIZED_COORDS_FALSE | //Natural coordinates
         CLK_ADDRESS_CLAMP | //Clamp to zeros
         CLK_FILTER_NEAREST; //Don't interpolate

   if (ind < 0) return 0;

   //Divide desired position by 4 because there are 4 components per pixel
   int imgPos = ind >> 2;

   //Image width is 8192 = 2^13: x is the remainder, y is the quotient
   int2 coords;
   coords.x = imgPos & 0x1fff;
   coords.y = imgPos >> 13;

   //Reads the float4
   float4 temp = read_imagef(img, smp, coords);

   //Computes the remainder of ind / 4 to check if function should return x,y,z or w component.
   imgPos = ind & 0x0003;

   if (imgPos < 2)
   {
      if (imgPos == 0) return temp.x;
      else return temp.y;
   }
   else
   {
      if (imgPos == 2) return temp.z;
      else return temp.w;
   }
}
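
A host-side reference in plain C can help validate this kind of lookup before running it on the device. The sketch below assumes the image is a row-major RGBA float array whose width is a power of two, with x = pixel mod width and y = pixel / width as derived earlier. The name read_vec_from_img_ref and the log2_width parameter are illustrative, chosen so that small widths can be tested.

```c
#include <assert.h>

/* Host-side reference for reading element `ind` of a vector stored in an
   RGBA float image whose width is 2^log2_width pixels. `img` is the flat
   row-major host copy of the image with `height` rows. As with a
   CLK_ADDRESS_CLAMP sampler, out-of-range reads give 0. */
float read_vec_from_img_ref(const float *img, int log2_width, int height, int ind)
{
    if (ind < 0) return 0.0f;
    int pixel = ind >> 2;                        /* 4 floats per pixel */
    int x = pixel & ((1 << log2_width) - 1);     /* column: pixel mod width */
    int y = pixel >> log2_width;                 /* row: pixel / width */
    if (y >= height) return 0.0f;
    int component = ind & 3;                     /* 0..3 selects x, y, z or w */
    return img[4 * (x + (y << log2_width)) + component];
}
```

With this layout the lookup returns element ind of the original vector, so comparing the device result against the host vector is a quick sanity check.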

 

The source code for this section, inside button Vector from Image Test, shows an example where a vector is stored as an image and then its contents are read into a regular buffer variable.

 

4. Worked example: custom filter

 

In this example, we will compare the filter using images with the Image Filtering case study, which did not use images to preserve compatibility with GPUs that don't support OpenCL images. The reference image to which a generic 7x7 border detection algorithm was applied has dimensions 4653x5000. The source code for this section contains all code. Since a Bitmap is copied to the device memory, it's necessary to read and write data using read_imageui and write_imageui, using uint4 types: 

 

int x0 = get_global_id(0);
int y0 = get_global_id(1);
int2 coord = (int2)(x0+3, y0+3);
uint4 centralValue = (uint)49*read_imageui(img1, smp, coord);
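
The snippet shows only the central term of the filter. As a rough CPU reference, the C sketch below assumes (this is an assumption, not something stated in the text or taken from the downloadable source) that the border response is 49 times the central pixel minus the sum of the whole 7x7 window. It works on a single 8-bit channel, whereas the kernel processes uint4 pixels; border_response is a hypothetical name.

```c
#include <assert.h>

/* Assumed filter (not confirmed by the text): 49 * centre minus the sum of
   the 7x7 window, applied to one channel of a row-major image.
   (x0, y0) is the top-left corner of the window, so the centre is at
   (x0 + 3, y0 + 3) as in the kernel snippet. */
int border_response(const unsigned char *img, int width, int height, int x0, int y0)
{
    assert(x0 >= 0 && y0 >= 0 && x0 + 7 <= width && y0 + 7 <= height);
    int sum = 0;
    for (int dy = 0; dy < 7; dy++)
        for (int dx = 0; dx < 7; dx++)
            sum += img[(x0 + dx) + width * (y0 + dy)];
    int centre = img[(x0 + 3) + width * (y0 + 3)];
    return 49 * centre - sum;
}
```

Under this assumption a flat region gives a response of zero, and each output pixel requires 49 reads, which is the access pattern that benefits from the texture cache.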

 

For further reference please download the source code for this section and check the Form1_Load function. The screenshot below shows the reduced image and a portion of the full scale image. The right image was not reduced and it shows only a small piece of the first girl's head.

 

 

The filter, of course, is not customizable as in the case study, but the transfer time of the 7x7 mask is very low compared to the time to transfer such a big picture. The computation times were 4 minutes 45 seconds (285 s) without GPU acceleration, 3.3 s using OpenCL without images and less than 1 s (0.98 s) using images, which gives an impressive acceleration of roughly 285x.
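
The ratios behind these figures can be recomputed from the quoted timings with a trivial C helper (the function name is illustrative):

```c
#include <assert.h>

/* Speed-up of a run taking `fast_s` seconds relative to one taking `slow_s`. */
double speedup(double slow_s, double fast_s)
{
    assert(fast_s > 0.0);
    return slow_s / fast_s;
}
```

With the timings above, speedup(285.0, 1.0) gives the quoted 285x (using 0.98 s instead gives slightly more than 290x), and speedup(3.3, 0.98) gives a little over 3x for images versus buffers.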

 

 

5. Conclusion

 

In comparison with the Case Study involving OpenCL filtering without images, the image filtering algorithm which uses images performed about 3x faster but is not compatible with all GPUs. In the future, when all GPUs support OpenCL images, this will probably be the algorithm of choice. The filter without GPU acceleration took 4 minutes and 45 seconds to complete using the same image, which yields an acceleration of roughly 285x.

Images should be the storage of choice when handling very large amounts of data and/or data that needs many accesses. Not all GPUs are compatible with OpenCL images, though, and OpenCL/GL interoperation still requires manipulation of buffer objects. 

 




Written by Douglas Andrade   
Monday, 26 July 2010 18:57
Copyright © 2014 CMSoft. All Rights Reserved.