GPGPU-based HPEC Solves Toughest Mil/Aero ISR Problems

GPGPU computing in deployed HPEC systems brings mobile supercomputing to the battlefield in small packages.

Of all the capabilities required in defense and avionics, Intelligence, Surveillance and Reconnaissance (ISR) is among the most demanding. The levels of signal and image processing, the high bandwidth and the sheer volume of data to be processed require High Performance Embedded Computing (HPEC) technology. Rugged, deployed HPEC modules and systems must deliver high-performance, efficient multiprocessing within a power budget and in a form factor that suits both existing and emerging applications. As if that were not enough, mil/aero systems, with their emphasis on size, weight and power (SWaP), often need to repurpose legacy algorithms, code and interfaces while mandating scalable, size-modular HPEC systems.

Radar, electronic warfare (EW) and ISR systems are “fed” by a collection of high-performance sensors delivering streams of data into a cluster of CPUs. Processing that data, particularly image sensor data, can be improved with onboard processing, contends GE Intelligent Platforms, alleviating the need to downlink sensor data to command posts. This tip-of-the-spear processing fuses sensors with the algorithms that interpret their data, but requires rugged, deployable HPEC systems that are small, scalable and very powerful (Figure 1).

The designer’s dream is an HPEC system powerful enough to do ISR/EW sensor processing and interpret the results — all in a SWaP-optimized deployed platform. This “FlexDAR” goal is achievable with the latest general purpose/graphics processing unit (GPGPU) COTS technologies.

Figure 1: Notional battlefield showing an increased number of sensors pushing the data processing boundaries for CPUs. Getting the desired data (of tank, inset) amid all that’s going on here requires extreme processing, often on mobile platforms optimized for SWaP.

Tegra K1: From Games to Gimbals

A leader in rugged COTS systems targeting DoD, MoD and mil/aero platforms, GE Intelligent Platforms has signed an agreement with NVIDIA that makes it NVIDIA’s preferred provider of harsh-environment solutions featuring the Tegra K1, which is based on the high-performance Kepler graphics architecture. Tegra K1 is the follow-on to NVIDIA’s Tegra 4 and is optimized for extreme gaming on low-power, multicore mobile platforms.

GE Intelligent Platforms expects to integrate the recently announced K1 into a variety of board architectures to deliver rugged, high-performance GPGPU-based embedded and graphics number crunching in harsh environments. The combination merges general-purpose (albeit ultra-high-performance) processing with leading-edge game-console graphics and image processing, a powerful pairing for HPEC. In fact, the K1’s numbers are game-changers for deployed HPEC systems.

NVIDIA’s Tegra K1 was introduced at CES 2014 with much fanfare and promising benchmarks ideal for SWaP-constrained COTS mil/aero applications. The processor includes embedded CUDA support and the new GPUDirect RDMA capability. These additions allow for a scalable platform that’s ideal for SMP and AMP HPEC arrays. NVIDIA has published a whitepaper describing the Kepler “CUDA Compute Architecture.”

The Tegra K1 is an ARM SoC that delivers roughly an order of magnitude more compute than comparable SoCs. It is the first to integrate a 32-bit quad-core ARM Cortex-A15 (2.3 GHz, 18.4 GFLOPS) with a 192-core programmable CUDA-enabled GPU (a 16 x 12 array at 792 MHz, 304.1 GFLOPS total). Collectively, the K1 delivers roughly 325 GFLOPS, dwarfing “comparable” competitors such as TI’s OMAP or Qualcomm’s Snapdragon 800, which deliver about 50 GFLOPS. A fifth “battery saver” core extends battery life by performing routine housekeeping and low-level computation without waking the other cores (Figure 2).

Figure 2: Block diagram of the NVIDIA Tegra K1 SoC, which GE Intelligent Platforms will exclusively offer in a ruggedized form. Note the four ARM Cortex A15 cores for general purpose processing plus 192 CUDA cores in the GPU. Not shown is the fifth “battery saver” ARM core for lighter-weight housekeeping and GUI rendering.

Fabbed by TSMC on a 28 nm process, the K1 boasts about 40 percent more overall performance than its Tegra 4 predecessor (per NVIDIA) and posts an impressive SPECint2000 score of 1403. The device runs various operating systems on the ARM cores, plus OpenGL 4.4, DX11 and CUDA 6 on the GPU.


The K1’s Kepler architecture includes the GPUDirect RDMA feature, which promises to open up real-time applications by giving remote DMA (RDMA) devices direct access to GPU memory via a unified address space. In HPEC applications, this means end-point sensor nodes or I/O devices can access the GPU’s memory space directly. Before this NVIDIA technology, GPUs suffered a latency problem because data bound for the GPU was staged through the system’s CPU, which not only decreased bandwidth but also added setup overhead, bus contention and copy/move latency.

That latency was a problem for HPEC applications like ISR that need to move massive amounts of data from off-board I/O such as Gigabit Ethernet, 10 Gigabit Ethernet or InfiniBand. Off-board co-processor and multiprocessor nodes were throttled by the same GPGPU latency.

Using GPUDirect, external PCIe devices (FPGA co-processors, Ethernet I/O, InfiniBand I/O, FMC/XMC mezzanine boards or solid state drives) can send data to and from the GPU without incurring CPU overhead (Figure 3). This can reduce latency by up to a factor of five, improve PCIe utilization and decrease CPU overhead, thanks to the availability of GPUDirect drivers for PCIe and InfiniBand. The RDMA mechanism can be applied to front-end sensor interfaces and also to inter-processor communications, facilitating novel SMP architectures such as hypercubes via NVIDIA’s “dynamic parallelism” (Figure 4).

For example, GPUDirect RDMA can be integrated with an InfiniBand HCA endpoint and/or a 10 Gigabit Ethernet Network Interface Card (NIC) to allow scalable networks of GPUs across several CPU nodes. Removing the CPU and system-memory bottlenecks opens up opportunities in domains such as electronic warfare (EW) and ISR, where high latency had previously restricted GPU use.

Figure 3: The GPUDirect RDMA can relieve system memory bottlenecks, allowing the GPU to perform intensive operations.
Figure 4: “Dynamic parallelism” due to GPUDirect moves data between on- and off-board resources such as Network Interface Cards (NIC) via RDMA. (Courtesy: NVIDIA).

Thanks to the game-changing (no pun intended) Tegra K1’s performance, a CUDA-enabled GPU not only plays nicely with end-point co-processors and sensors; it also offers an alternative to FPGA boards or custom ASICs, which are expensive and can consume many man-hours of both hardware and software developers’ time.

ISR designers using a K1 HPEC GPGPU system can apply the dynamically programmable processor to tasks that previously needed static devices like FPGAs or ASICs. For example, digitizing data, sending it to the GPU via the CPU and then transmitting it would previously have taken 1 ms, an unacceptably high latency for radar or EW. Using GPUDirect, the data is transmitted in a mere 20 µs and uses less bandwidth. The reclaimed bandwidth can be applied to more demanding sensor data, such as higher radar resolution or a wider scan range.

Legacy Migration; SWaP-optimized Architecture

Tegra K1 can of course be programmed in NVIDIA’s CUDA language or in C. The ability to run C/C++ code on the GPU (in addition to the four ARM cores) not only means easier coding for a wider pool of programmers; it makes GPU memory access and math operations familiar and portable from other processors. That is, a high-level language (C/C++) can express an algorithm directly, instead of forcing it through a graphics-oriented programming model. Tasks assigned to the GPU therefore need not be graphics-related at all, which is particularly beneficial for signal processing, FFT/IFFT, FIR filtering or anything else that does not fit a graphical framework.
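As an illustration of a task with no graphical framing, the sketch below shows a minimal FIR filter in plain C++. The inner loop body is the kind of kernel one would offload to the CUDA cores, with each output sample computed by one GPU thread; the function and variable names here are ours, not from any GE or NVIDIA library.

```cpp
#include <cstddef>
#include <vector>

// Minimal FIR filter: y[n] = sum over k of h[k] * x[n - k].
// In a CUDA port, each output sample y[n] would map to one GPU thread.
std::vector<float> fir(const std::vector<float>& x, const std::vector<float>& h) {
    std::vector<float> y(x.size(), 0.0f);
    for (std::size_t n = 0; n < x.size(); ++n) {
        for (std::size_t k = 0; k < h.size() && k <= n; ++k) {
            y[n] += h[k] * x[n - k];
        }
    }
    return y;
}
```

For example, a two-tap sum `fir({1, 2, 3, 4}, {1, 1})` yields `{1, 3, 5, 7}`; swapping the taps for windowed-sinc coefficients turns the same loop into a band-selection filter.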

On the other hand, Tegra K1 is so powerful, thanks to its four A15 cores and 192 CUDA cores, that existing code can run on the device without re-optimization, says GE Intelligent Platforms, with at least a 2-3x performance improvement over baseline; recompiled and optimized code can yield order-of-magnitude improvements, as noted above. GE told this author that a Tegra K1-based PC graphics card purchased from a retailer such as Best Buy could have AESA radar algorithms for MTI, GMTI or SAR running within a matter of weeks. The key takeaways are simplicity and hardware about as commercial-off-the-shelf as one can get, both ideal characteristics for COTS-based HPEC systems.

But beyond GPUDirect RDMA and CUDA, the breakthrough with Tegra K1 is that it operates at less than 10 W, considerably less than the 40 W required by the closest competing GPU. And compared to the typical FPGA or ASIC-based HPEC system, well, there is simply no comparison.

As a result, CUDA code developed for supercomputers running at 300 W and above can be deployed on a 10 W system in an avionics platform, a vehicle CANbus or even an industrial automation motion controller, making GPGPU performance accessible to a wider range of platforms. This makes K1-based HPEC systems flexible enough to operate, for example, as both a single board computer (SBC) and a sensor processor, typically two separate roles.

The fact that the four ARM cores can run Linux in parallel with the GPU cores running CUDA is significant for multi-use systems and for repurposing a scalable HPEC design across many mil/aero platforms such as UAV, UAS, pod-mounted SIGINT, or manpack-mobile EW/jamming. Program managers can “design once, deploy many times.”

Finally, a smaller SWaP profile also opens up new deployment possibilities, such as wearable equipment: night vision glasses and augmented reality (AR) reticles. The smaller form factor made possible by the K1’s performance also enables new levels of imaging and processing in space-confined areas: micro UAVs, Video Object Extraction (VOE)/H.264 applications, intelligent local-processing displays (replacing “dumb” displays of the same dimensions), software defined radio (SDR), personal robots, and sensor gimbals and turrets.

Programmable GPU: Multi-use Has Advantages

GE Intelligent Platforms is able to leverage its experience in rugged, harsh-environment system design to deploy Tegra K1 in scalable GPGPU HPEC modules and systems. The low power consumption and 300+ GFLOPS performance of the K1, plus the portability of both C/C++ and CUDA, allow rugged systems to be built for all platforms and SWaP budgets. The typical module-to-system continuum is shown in Figure 5. Of particular note is the code compatibility between the hardware elements, allowing a system design, along with its proven algorithms and approved code, to be re-scaled and re-deployed on a differently sized and cooled platform (such as from an avionics bay to a man-portable unit).

Figure 5: GE Intelligent Platforms’ range of GPGPU-ready HPEC modules and systems, made possible by NVIDIA’s Tegra K1 GPU.

An example of a state-of-the-art module is the company’s current IPN251 OpenVPX 6U board, which consumes about 100 W (approximately 40 W for the CPU, 40 W for the GPU and 20 W for ancillaries such as 10 Gigabit Ethernet) and manages thermal issues effectively in conduction-cooled and VITA 48 (REDI) versions (Figure 6). As shown in Figure 5, the 6U system can be scaled down to a single-GPU 3U LRU operating at 100 W total, with SWaP benefits that enable smaller UAVs with less weight, improved cooling, and lower power or longer battery life.

Figure 6: The 6U OpenVPX IPN251 board sports an Intel Core i7 and NVIDIA Tegra K1 GPU (Kepler architecture). The inherent benefits of the i7 and K1 help manage thermal issues, allowing the board to scale into larger ATR systems (xN). A fully populated chassis might consume 2 kW, which isn’t high considering the processing power exceeds tens of TFLOPS in a single rugged box.

Like the GPGPU HPEC systems described above, FPGA boards and custom ASICs can also address these performance problems, particularly in mil/aero projects, but they incur a heavy cost penalty in both hardware and software development time. Using a GPGPU, a small team of engineers can produce and run a prototype in a matter of weeks without spending thousands of dollars on compiler tools. For a modest budget, high-performance programmable GPU media processors can be embedded into small, low-power-budget systems.

The performance offered by GPUs is also advantageous over FPGAs. The current 6U board from GE Intelligent Platforms, the IPN251, delivers 645 GFLOPS at 38 W and includes a fast, ruggedized Intel Core i7 processor for general-purpose software. FPGAs are still used for I/O and ADC/DAC duties, digitizing data and shipping it to the GPU, but it is the GPU that performs the heavy math-intensive operations and can dynamically change algorithms and parameters in, for example, the radar and EW spectrum. The comparison given by the company: the three months taken to implement an algorithm in an FPGA can be reduced to one week using CUDA.

Finally, FPGAs are at best programmed in OpenCL, which has limited DSP libraries, and more likely in RTL; by contrast, millions of lines of code and high-end math libraries are available to license for CUDA, maintains GE Intelligent Platforms. To add insult to injury, NVIDIA GPUs run OpenCL as well.

The reduced SWaP and significantly increased performance of the NVIDIA Tegra K1, combined with GE Intelligent Platforms’ legacy of rugged design for harsh mil/aero environments, will produce opportunities for the high-performance, efficient, thermally managed HPEC designs of the future.

Editor’s note: This article is sponsored by GE Intelligent Platforms.
