Performance-Portable Programming

The OpenCL language and ecosystem could be far more portable than they are today. A static OpenCL runtime coupled with an OpenCL-to-C translator would make OpenCL code runnable, in principle, anywhere C and C++ can be compiled.

High-performance multicore processors, on multi-processor blades, in multi-blade racks, are being used in medical imaging, video surveillance and homeland security applications to deliver the very high performance that is required. But the CPU-centric systems used in these imaging applications leave much to be desired in the way of size, power and cost.

For example, large racks populated with CPU blades and loud fans have been deployed in MRI and CT scanning equipment, often with ASICs and FPGAs as additional performance-boosting offload engines. But conventional sequential microprocessors and coding languages have run their course and are ill-prepared for the faster real-time performance and smaller system sizes that next-generation, high-performance multicore applications demand. Current-generation imaging applications have exhausted every conventional method for squeezing additional performance out of their multicore platforms: microprocessor vendors first deepened on-die caches, then added single instruction, multiple data (SIMD) instruction set extensions to process media streams, and finally implemented multi-threading and clock-frequency scaling. Deeper caches, however, best serve traditional code and static (re-used) data, while imaging applications typically operate on data sets an order of magnitude larger than any cache. Although the other performance-enhancing techniques have helped image processing, this article looks at an efficient and scalable way to attack the problem by stepping outside the traditional architecture.

A fresh approach is needed for these applications, and it can now be realized thanks to advances in graphics processing units (GPUs) and the innovative, open and royalty-free programming standards for general-purpose computation on heterogeneous systems developing around them, such as the Open Computing Language (OpenCL).

Figure 1: Typical OpenCL Software Framework

Big Performance in Small Packages
Driven by the demands of PC gamers and movie studios, graphics processors have evolved in a relatively short time into specialized supercomputers that pack teraflops of floating-point compute horsepower onto a single PCI Express add-in graphics card. With hundreds of computing cores, modern GPUs are far more scalable than the handful of cores offered in a CPU-centric paradigm. And the massively parallel processing logic once dedicated to specific 3D graphics tasks has become more flexible and programmable, enabling these processors to address a wider range of applications. While GPUs are not suitable for accelerating every application, those with characteristics similar to graphics workloads can achieve significant improvements in performance and power efficiency by exploiting the GPU. The best-suited applications are characterized by many largely independent tasks, ideally operating on large, regularly structured data sets. The low-power GPUs used for these applications deliver performance ranging from hundreds of gigaflops to over a teraflop.

In addition to high-performance discrete graphics processors, smaller die geometries have enabled the first family of heterogeneous multicore processors that combine a CPU and GPU on a single die. These processors, also known as accelerated processing units (APUs), can greatly reduce the size and power of a comparably performing solution while increasing the data bandwidth between the CPU and graphics cores. These solutions offer the same parallel-processing benefits as discrete GPUs alongside low-power multicore x86 processors for general-purpose tasks. The highly integrated APU solutions can deliver from 80 gigaflops to several hundred gigaflops.

GPUs are optimized primarily for graphics tasks, and are best suited to certain types of parallel workloads. Because of their data-parallel execution logic, GPUs are effective at tackling problems that can be decomposed into a large number of independent parallel tasks. When it comes to graphics rendering, the particular work items are operations on vertices and pixels, including texturing and shading calculations. In a parallel-processing application, a set of operations is typically executed in parallel on each item in a data set. Applications well-suited for GPU execution, such as image processing, have large data sets, high parallelism and minimal dependency between data elements. A common form for a data set to take in a parallel-processing application is a 2D grid because this fits naturally with the rendering model built into GPUs. Many computations naturally map into grids: matrix algebra, image processing, physically based simulation, and so on.

To take advantage of the programmability of GPUs, the OpenCL™ programming language was developed to give developers a platform for creating C-based applications that run on the CPU but can offload parallel kernels to the GPU. GPUs provide dozens to hundreds of very powerful math engines with fast local RAM, and OpenCL kernels can take advantage of the GPU's dedicated texture-processing hardware to perform various 2D filtering operations on certain memory reads.

Leveraging OpenCL, high-performance GPUs or APUs are well-positioned to reduce the size and power of imaging and vision systems dramatically.

Figure 2: An OpenCL static runtime library enables broader portability

OpenCL Opens Embedded Doors
Originated by Apple and turned over to the Khronos Group for standardization, OpenCL is a framework for writing programs that can exploit heterogeneous platforms consisting of CPUs, GPUs and other processors. OpenCL includes a language (based on C99) for writing kernels (functions that execute on OpenCL devices), along with APIs that are used to define and then control the platforms. Each kernel is typically the body of a loop or routine; the developer specifies the kernel, memory regions and data set to process using OpenCL constructs.

OpenCL enables parallel computing using task-based and data-based parallelism, and presents the GPU resources in a very clean manner as having many instantiations of a single processor type, buffers and memory spaces.

OpenSURF: Community Development with a Vision
As an example of the use of OpenCL with a GPGPU, one group within the open-source community analyzed and profiled the components of the Speeded Up Robust Features (SURF) computer vision algorithm written in OpenCL. SURF analyzes an image and produces feature vectors for points of interest (“ipoints”).

SURF features have been used to perform operations such as object recognition, feature comparison and facial recognition. A feature vector describes an ipoint, consisting of the location of the point in the image, the local orientation at the detected point, the scale at which the interest point was detected, and a descriptor vector (typically 64 values) that can be compared with the descriptors of other features.

To find points of interest, SURF applies a fast-Hessian detector that uses approximated Gaussian filters at different scales to generate a stack of Hessian matrices. SURF utilizes an integral image, which allows scaling of the filter instead of the image. The location of an ipoint is calculated by finding the local maxima or minima in the image at different scales using the generated Hessian matrices. The local orientation at an ipoint maintains invariance to image rotation. Orientation (the fourth stage of the pipeline in the diagram) is calculated using the wavelet responses in the X and Y directions in the neighborhood of the detected ipoint. The dominant local orientation is selected by rotating a circle segment covering an angle of 60 degrees around the origin. At each position, the x- and y-responses within the segment of the circle are summed and used to form a new vector. The orientation of the longest vector becomes the feature orientation.

Figure 3: An OpenCL kernel multiplying a list of small matrices, being translated into C code

To demonstrate the power of familiar C-style structures in OpenCL with the data-parallel compute capability of GPGPUs, SURF code for calculating and normalizing descriptors is shown here.

The calculation of the largest response is done using a local memory-based reduction. The 64-element descriptor is calculated by dividing the neighborhood of the ipoint into 16 regular subregions. Haar wavelets are calculated in each region, and each region contributes four values to the descriptor. Thus, 16 × 4 = 64 values are used by applications based on SURF to compare descriptors.

The goal of using OpenCL with GPGPUs is to extract as much parallelism out of the framework as possible. In SURF, execution performance is determined by the characteristics of the data set rather than the size of the data. This is because the number of ipoints detected in the non-max suppression stage of the algorithm helps to determine the workgroup dimensions for the orientation and descriptor kernels. Computer vision frameworks like SURF also have a large number of tunable parameters (e.g., a detection threshold, which changes the number of points detected in the suppression stage) that greatly impact the performance of an application.

In heterogeneous computing, knowledge about the architecture of the targeted set of devices is critical to reap the full benefits of the hardware. For example, selected kernels in an application may be able to exploit vectorized operations available on the targeted device, and if some of the kernels can be optimized with vectorization in mind, the overall application may be sped up significantly. However, it is important to gauge the contributions of each kernel to the overall application runtime. Then informed optimizations can be applied to obtain the best performance.

In a heterogeneous computing scenario, an application starts out executing on the CPU, and then the CPU launches kernels on a second device (e.g., a GPU). The data transferred between these devices must be managed efficiently to minimize the impact of communication. Data manipulated by multiple kernels should be kept on the same device where the kernels are run. In many cases, data is transferred back to the CPU host, or integrated into CPU library functions. Analysis of program flow can pinpoint sections of the application where it would be beneficial to modify data management, leading to more efficient use of the overall system.

Programmable GPUs offer a solution to help drive down the size, power and cost of high-performance multicore applications. They can be programmed and optimized efficiently using OpenCL, a royalty-free language based upon C, in order to unlock the full potential of standardized processing hardware for the rapid development of next-generation, high-performance multicore applications such as medical imaging, video surveillance and security.



John A. Stratton is a senior architect for MulticoreWare, Inc. John has been teaching GPU computing since the first university course on the subject in spring 2007, and developing compilers and runtimes for kernel-based accelerated programming models since that year. He has received several awards for outstanding research, teaching and technology development, most recently given the “Most Valuable Entrepreneurial Leadership in a Startup” award by the University of Illinois Research Park for his work with MulticoreWare.
