Addressing the High-Performance Embedded Challenge with VSIPL and VSIPL++
VSIPL enhances programmer productivity – and abstraction – in signal- and image-processing.
Ask a high-performance embedded programmer how they write programs, and the answer is most likely to be some combination of C and low-level assembly. That’s been the state of the art for decades, and common wisdom is that’s what you have to do to get high performance. Unfortunately, it’s also becoming increasingly clear that this approach is no longer a viable solution. Hardware is getting more complicated, more heterogeneous, and changing more quickly – the next generation of high performance hardware is no longer a single-core CPU, and increasingly, it is no longer just a CPU at all. Today, we’re talking about GPUs. Three years ago the IBM Cell Broadband Engine (BE) was the talk of the day; who knows what we’ll be talking about three years from now?
Programs have to be entirely rewritten to target these new hardware platforms – not just translated from one flavor of assembly to another, but redesigned to take advantage of the new architecture. And, as the platforms are becoming more powerful, the programs to use them are getting larger. The expense of rewriting low-level code for each new generation of hardware is no longer sustainable.
We need a programming model that works with this new reality. It needs to give us good performance – although we are increasingly accepting that it need not be perfect; if it means faster time to market, we can make up the difference in newer hardware. It needs to be portable across architectures, so that we can preserve our investment and upgrade to future hardware. And finally, it needs to enable programmer productivity – programs are simply too large and expensive to write in an all-low-level-code fashion.
Common Alternative Programming Models
The solutions that most people reach for when programming GPUs are NVIDIA’s CUDA language or, for a cross-platform solution, OpenCL. The CUDA and OpenCL languages are useful for writing programs at a C-like low level, but they have many of the problems inherent in writing low-level programs. OpenCL is supported on a remarkably wide range of processors – not only GPUs from most of the major manufacturers, but multicore Intel and AMD CPUs, the Cell/BE, and even Altera FPGAs – but this does not imply that the same program will perform well on all of them. Low-level GPU programs in any language need to be carefully arranged to use the available hardware resources, and we can expect that new hardware will mean rearranging everything to retain good performance. One CUDA function I was recently working on ran 30 percent slower when I moved it to the latest NVIDIA Kepler hardware, while better-arranged functions ran nearly twice as fast. And this need for careful tuning is a problem not just for porting existing software, but for writing high-performance software in the first place.
An alternate solution is to use vendor-provided libraries – IPP and MKL from Intel, SAL from Mercury Computer, AXISLib from GE Intelligent Systems, Vector from Curtiss-Wright, CUBLAS, CUFFT, and NPP from NVIDIA, ACML and APPML from AMD, and the list goes on. Although libraries are inherently less flexible than low-level languages like OpenCL, they provide significant productivity benefits in the domains where they do apply. Most of the algorithm can typically be expressed as a series of high-level library function calls rather than low-level implementations. This allows the programmer to concentrate their development efforts on the remaining small portions of the program that cannot be expressed in the library, so that equal performance to hand-tuned code is obtained with significantly reduced effort.
These vendor-provided libraries also provide some measure of future-proofing; the vendors will update them for each new generation of their hardware. However, they are not portable to other platforms. The majority have proprietary interfaces, and in many cases hardware assumptions such as the use of GPU device-memory pointers are built into the data representations, requiring hardware-specific code to manage the data.
Open-architecture, domain-specific standards such as VSIPL and VSIPL++ – the Vector, Signal, and Image Processing Libraries in C and C++ respectively – provide a better solution than either low-level but portable programming languages like OpenCL, or high-level vendor-specific libraries like IPP or CUBLAS. VSIPL and VSIPL++ are high-level libraries that provide similar productivity benefits to vendor-provided libraries; the algorithms are written as a series of coarse-grained function calls that abstract away the implementation details, allowing the programmer to write concise code. Using an open-architecture standard allows cross-platform portability and ensures future compatibility: changes to the standardized API are infrequent, minimal, and well-documented. Several hardware vendors support VSIPL and VSIPL++ on their platforms, and there are also independent implementations such as VSI/Pro from Runtime Computing Solutions and Sourcery VSIPL++ from Mentor Graphics.
To some extent, open source libraries such as OpenCV (for image processing) also fit this category. However, their interfaces are not standardized, so there is no assurance of stability, nor a clear distinction between intended behavior and implementation details that may change without notice. On the other hand, the availability of the source code as a basis for vendor-specific optimizations ensures some level of cross-compatibility.
The VSIPL and VSIPL++ Data Model
An important aspect of VSIPL and VSIPL++, beyond the high-level function calls, is the use of data abstractions to truly separate the algorithm from the implementation. Modern heterogeneous systems require a number of data representations beyond the simple float* pointer of a CPU, to represent the multiplicity of memory spaces within the system such as GPU device memory, local storage on a coprocessor, and so forth – or even to represent simple distinctions such as row-major or column-major storage of multidimensional arrays. Thus, when algorithmic code uses a library that operates on these low-level data representations directly, the programmer has to be aware of the various memory locations and explicitly manage all of the data movement between them, and the resulting code will be specific to that hardware regardless of the portability of the function names.
By contrast, VSIPL and VSIPL++ wrap data in abstract objects that encapsulate the details of the representation behind a common interface. The programmer writes the algorithmic code to use data types such as Vector&lt;float&gt;, and the code does not change if that encapsulation resolves to a simple pointer into CPU memory on one hardware platform or to an automatically-managed combination of GPU device pointers on another. The library is free to use optimizations such as caching between GPU device memory and CPU system memory, providing much-improved performance without affecting the API.
An Example VSIPL++ Application
The productivity and performance benefits of VSIPL++ are illustrated by the Synthetic Aperture Radar (SAR) example that Mentor Graphics ships with the Sourcery VSIPL++ library. This program implements one of the “HPEC Challenge” synthetic benchmarks defined by MIT Lincoln Laboratory, and represents a realistic model of real-world algorithms. It is also a good test case for portability, as it was originally written for an x86 CPU target, and has since been ported to the Cell/BE and then to several generations of NVIDIA CUDA GPUs.
|Figure 1: Schematic of the HPEC Challenge Synthetic Aperture Radar benchmark application.|
Figure 1 shows a diagram of the algorithm with the major operations. The incoming radar data is transformed through a sequence of steps that primarily involve FFTs, corner turns (transposes), and elementwise operations, which are then passed through an application-specific, polar-to-rectangular “range loop” interpolation to form the final reconstructed image.
For the optimized Cell/BE and GPU versions of the program, the majority of the program simply works as written, and provides good performance – see Table 1 for a comparison. The one exception is the range loop interpolation, which can’t be expressed well as VSIPL++ calls and so is written in nine lines of plain C++ code. For good performance on Cell/BE and CUDA GPUs, we wrote target-specific optimized low-level implementations of this portion of the code. This amounted to 208 additional lines of low-level source code on Cell/BE, and 193 lines for CUDA GPUs. The expansion from a simple C or C++ implementation to a target-specific optimized one is obvious, but the important point is that we only had to do this for a small piece of the program.
|Table 1: Performance results for x86 and CUDA versions of the SAR example program.|
|Figure 2: A portion of the SAR example program, in MATLAB and VSIPL++ versions.|
The initial implementation shows the productivity benefits of using the VSIPL++ API; it has 203 lines of source code, compared to 391 lines for a completely un-optimized C reference implementation (using the FFTW library for the FFT operations) from MIT Lincoln Laboratory. The computational code is much more akin to MATLAB than C, as shown in Figure 2.
As shown in Table 1, this provided significant additional performance improvements, not only in the range loop interpolation itself but also in an adjacent FFT computation, as computing the interpolation on the GPU allows the data caching mechanism to avoid a CPU-to-GPU data transfer that would immediately follow it.
The results that we see with this example are typical of VSIPL++ applications, and we see similar results with VSIPL, although the code is somewhat more verbose. We can get excellent performance on about 95 percent of the program using VSIPL++ function calls, and even when the remainder is hand-optimized to complete the performance picture, there is still a substantial improvement in programmer productivity – and this productivity increase also provides long-term portability benefits, since the portions of the program that rely on VSIPL++ can simply be recompiled for a new target.
Thus, writing signal- and image-processing programs with VSIPL or VSIPL++ results in significant increases in programmer productivity that reduce time to market and programming costs even on new and complicated heterogeneous hardware, and results in a program that is portable across current and future platforms while providing very competitive performance. Outside of the signal and image processing domain, the same story applies; only the library names are different. Use open-standard libraries where you can for most of the program, and spend your hand-optimization effort only on the parts of the program that really need it.
Dr. Moses works for the High Performance Computing (HPC) solutions team in the Embedded Software Division at Mentor Graphics. Dr. Moses participates directly in the development of the Sourcery VSIPL++ library and other high-performance library products. He has worked extensively on the Cell/B.E. and NVIDIA CUDA ports of Sourcery VSIPL++. Dr. Moses holds a Ph.D. in Mechanical Engineering from Stanford University where he conducted advanced research into algorithms for computational fluid dynamics simulation.