Better, Faster Systems with Heterogeneous Design
How the Heterogeneous System Architecture (HSA) specification addresses the key issues in using accelerator execution units for processing.
Heterogeneous systems typically consist of a CPU, GPU and other accelerator devices, all of which are integrated in a single platform with a shared high-bandwidth memory system. The special accelerators are used to obtain both power and performance benefits. While the shared memory system eliminates data copies between the CPU and accelerator-owned memory, the different memory access models for most of the accelerators (e.g. for cache coherency) may still result in “expensive” software synchronization overhead when passing data between the different system components.
Intermediaries Can Diminish Gains
Typically, accelerators rely on a software driver application programming interface (API) as an intermediary. This adds significant overhead to every use of the accelerator’s functions, so applications must queue up enough work to the accelerator to amortize it. While GPUs have demonstrated extraordinary gains in compute performance—in the range of several teraflops—the additional overhead caused by the software APIs diminishes some of these gains.
The need for a standard for heterogeneous computing has been driven by the popularity of general-purpose GPUs (GPGPUs), but it also applies to other programmable accelerators. One of the first computing platforms and programming models for GPGPUs is Compute Unified Device Architecture (CUDA), developed by NVIDIA; however, this is a proprietary interface. Several industry players have introduced open APIs for heterogeneous computing, such as DirectCompute extensions for GPGPU computing on Windows, and Renderscript for heterogeneous computing on Android. Similarly, the Khronos Group announced an open standard framework for heterogeneous computing, the OpenCL standard, which supports both task-level and data-level parallelism and allows platform control and program execution on compute devices.
The HSA specifications define virtual memory, memory coherency, architected dispatch mechanisms, and power-efficient signals as platform requirements. The architecture uses accelerators called kernel agents to reduce or eliminate software overhead in performance-critical dispatch paths. All these definitions help to dramatically reduce the overhead and latency needed to dispatch work to the accelerator.
The design allows high-level compilers and data-parallel and managed runtimes to target the accelerator hardware directly, without the translation steps typically necessary to interface with a high-level API in the dispatch path. The architecture also allows compute kernels running on the accelerator to efficiently call back to the host for OS services like file I/O, networking, and similar functions that would otherwise not be available, allowing the accelerator to operate as a true peer of the host CPU.
We’ve outlined some features of the HSA system architecture, including the use of agents, kernels, and the runtime, but we’ve yet to address the programmer’s model for using the architecture.
Paul Blinzer works on a wide variety of platform system software architecture projects, and specifically on the Heterogeneous System Architecture (HSA) system software, at Advanced Micro Devices, Inc. (AMD) as a Fellow in the System Software group. He lives in the Seattle, WA area and, since the early ’90s, has worked in various roles on system-level driver development, system software development, graphics architecture, and graphics & compute acceleration. Paul is the chairperson of the “System Architecture Workgroup” of the HSA Foundation. He has a degree in Electrical Engineering (Dipl.-Ing) from TU Braunschweig, Germany.
Jarmo Takala received his M.Sc. (hons) degree in Electrical Engineering and Dr. Tech. degree in Information Technology from Tampere University of Technology (TUT), Tampere, Finland. From 1992 to 1995, he was a Research Scientist at Tampere-based VTT-Automation. Between 1995 and 1996, he was a Senior Research Engineer at Nokia Research Center in Tampere. From 1996 to 1999, he was a Researcher at TUT. Since 2000, he has been a professor of computer engineering at TUT, and he is currently Dean of the Faculty of Computing and Electrical Engineering. From 2007 to 2011 he was Associate Editor and Area Editor for IEEE Transactions on Signal Processing, and in 2012–2013 he was the Chair of the IEEE Signal Processing Society’s Design and Implementation of Signal Processing Systems Technical Committee. Currently he is Co-Editor-in-Chief of the Journal of Signal Processing Systems.
Dr. John Glossner is president of the HSA Foundation. He also serves as CEO of Optimum Semiconductor Technologies, Inc., dba General Processor Technologies, the US division of China-based Wuxi DSP in partnership with Beijing-based Hua Xia GPT. Dr. Glossner received his Ph.D. in Electrical Engineering from TU Delft in the Netherlands, M.S. degrees in electrical engineering and engineering management from NTU, and holds a B.S.E.E. degree from Penn State. He has published over 120 articles and has been issued 36 patents.