From Visual Studio to FPGA Hardware
A snapshot of the current state of the art in FPGA software-to-hardware compilation.
Accelerating software by moving it to massively parallel hardware continues to develop as an attractive methodology. The merit of hardware acceleration is significant, but there are hurdles that should be budgeted for.
To start, it’s important to understand that with hardware acceleration there are three primary forms of parallelism: 1) blocks of code that execute in parallel by splitting the data set into multiple parts; 2) pipelined (bucket brigade) blocks of code that operate on streaming data; and 3) hybrid parallelism that combines both forms.
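As a rough illustration of the first two forms, the sketch below expresses each one in plain C99. The function names (sum_slice, sum_split, stage_scale, stage_clip, pipeline) and the arithmetic are invented for this example and do not come from any particular tool. On a desktop compiler the code simply runs sequentially; the point is the structure that a C-to-FPGA compiler could map onto parallel hardware.

```c
#include <stddef.h>
#include <stdint.h>

/* Form 1: data parallelism. Each call touches only its own slice of
 * the data, so the two calls in sum_split() share no state and could
 * run side by side in hardware. */
static uint32_t sum_slice(const uint8_t *data, size_t begin, size_t end)
{
    uint32_t acc = 0;
    for (size_t i = begin; i < end; i++)
        acc += data[i];
    return acc;
}

uint32_t sum_split(const uint8_t *data, size_t n)
{
    size_t mid = n / 2;
    uint32_t lo = sum_slice(data, 0, mid);   /* independent of hi */
    uint32_t hi = sum_slice(data, mid, n);   /* independent of lo */
    return lo + hi;
}

/* Form 2: pipelining (bucket brigade). Each sample passes through
 * stage_scale and then stage_clip. In hardware, stage_clip could work
 * on sample i while stage_scale already works on sample i + 1. */
static uint8_t stage_scale(uint8_t x) { return (uint8_t)(x >> 1); }
static uint8_t stage_clip(uint8_t x)  { return x > 100 ? 100 : x; }

void pipeline(const uint8_t *in, uint8_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = stage_clip(stage_scale(in[i]));
}
```

The hybrid form would amount to running several copies of the pipeline, each on its own slice of the data.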
On the merit side, each unrolling of a critical-path loop into an independently streaming process can double wall-clock performance. Modern FPGAs, with several million usable (as opposed to advertised) gates, can host a dozen or more independent streams. Accordingly, we are seeing solid 10x acceleration in “parallelizable” designs in application domains such as image processing, encryption and network filtering. The process is not trivial. While student groups from Rochester Institute of Technology to the University of Naples are delivering remarkable speedups, adoption by industry is concentrated among a few thousand classic early adopters.
This article covers the current state of the art in software-to-hardware compilation, provides some realistic tips, and suggests a vision of how to make it more usable by the next wave of developers.
FPGAs explained for the non-hardware folk
Most software developers write code for microcontrollers or microprocessors. C remains the dominant language for design starts. CPUs and microcontrollers typically feature a single core or a small number of cores. They achieve throughput by increasing clock speeds but are constrained by having to share limited cores and common memory. Think of it as driving ever faster through a single or dual toll booth on a bridge.
FPGAs run at lower clock speeds than microprocessors. They achieve throughput by having very flexible input and output, so non-sequential tasks can be designed into parallel processes. While a conventional processor only does one operation at a time, a properly engineered FPGA design will concurrently perform hundreds or thousands of operations. Most designs for FPGAs are developed using an HDL (hardware description language) like VHDL or Verilog. They are not particularly difficult languages but are sufficiently arcane that C programmers do not generally take to them. We have seen more cross-over from HDL engineers learning C than the other way around.
Gate count can be deceptive. Microprocessors use the available silicon with high efficiency. FPGAs can use a significant fraction of their gates for routing. And, you may not be certain of resource availability. FPGAs have great “blocks” of special purpose gates such as DSP, but they are limited. When you’re out of special blocks, you’re out, and the routing software will use less efficient gates, reducing performance and space efficiency.
Historically, FPGAs are descendants of PALs, GALs and PLDs. These precursor devices were much smaller. The earliest ones were small enough to program in assembly. As the devices grew, the early HDLs emerged and improved, and design shifted to HDLs like Data I/O’s ABEL, MMI’s PALASM and others. Jump to today and the same shift is occurring. HDLs like VHDL and Verilog are more time consuming but offer better control over resources. C to HDL to RTL (register-transfer level: the gate level machine code that characterizes the FPGA) tools like Impulse C, ROCCC, C2H, Vivado and others offer a higher level of abstraction but may not deliver the same QoR (quality of results) initially. More about this later when we talk about the design flow.
High Level Language, or HLL, programming typically refers to C. SystemC, C# and C++ are all great languages with ardent users. But for this article we’re sticking with ANSI C, as most IP is created in this version of C.
Design code is entered into a C development tool like GCC or Visual Studio. Ideally the system architect identifies a portion of the particular design to try in hardware, so one can maintain comparable C code files to compile both to FPGA and to microprocessor. This maintains equivalence and lets the designer “break one thing at a time”. The microprocessor-oriented C can be crudely “wrapped” and imported into the C to FPGA environment – where it will underperform until refactoring. Refactoring into individual streaming processes enables the compiler to better parallelize the code.
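One way to picture this refactoring is a filter rewritten to consume one sample at a time instead of walking a whole array. The sketch below is plain C with hypothetical names (fir_state, fir_reset, fir_step) and an arbitrary tap count. It does not use the stream API of any specific C-to-FPGA tool; in a real flow the function arguments would presumably be replaced by that tool's stream reads and writes.

```c
#include <stdint.h>
#include <string.h>

#define TAPS 4  /* arbitrary filter length chosen for the example */

/* Hypothetical per-process state: the most recent TAPS samples. */
typedef struct {
    int32_t window[TAPS];
} fir_state;

/* Clear the sample history before the first sample arrives. */
void fir_reset(fir_state *s)
{
    memset(s, 0, sizeof *s);
}

/* Consume one input sample and produce one output sample.
 * All state lives in *s, so the process has no hidden dependence on
 * a global array and can sit on a stream of incoming data. */
int32_t fir_step(fir_state *s, const int32_t coeff[TAPS], int32_t sample)
{
    int32_t acc = 0;
    int k;

    /* Shift the history by one position and insert the new sample. */
    for (k = TAPS - 1; k > 0; k--)
        s->window[k] = s->window[k - 1];
    s->window[0] = sample;

    /* Multiply-accumulate over a fixed, small trip count: the kind of
     * loop a compiler can unroll into parallel multipliers. */
    for (k = 0; k < TAPS; k++)
        acc += coeff[k] * s->window[k];

    return acc;
}
```

Because the same source builds with an ordinary desktop compiler, the microprocessor and FPGA versions can be kept side by side and compared as the design is moved over piece by piece.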
Identifying the critical path to focus on starts with the basics. You are hunting for key blocks of code that are largely free of serial data dependencies, are heavily used by the system (eating up a lot of clock cycles), and offer opportunities for parallelism (traditional parallelism or pipelining), as shown in Figure 1. While open source profiling tools are not fully realized, they can be useful. Commercial profiling tools are available with improved visibility and reporting to make run-time analysis easier and more accurate. It is not rocket science; the point is to chase clock cycle reduction. Typical design modules that are amenable to parallelism include encryption, image processing, FIR, FFT and any process that wants to sit on a bus and look at data streaming by.
Figure 1: The software to hardware stack generates necessary hardware interfaces post-optimization.
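Before reaching for a full profiler, a first pass at finding cycle-hungry blocks can be as simple as timing a candidate function on the desktop. The sketch below uses only the standard clock() call from time.h; time_kernel, candidate_kernel and sink are hypothetical names, and the loop inside candidate_kernel merely stands in for whatever block is under suspicion.

```c
#include <time.h>

/* Any function with no arguments and no return value can be timed. */
typedef void (*kernel_fn)(void);

/* Written to by the kernel so the compiler cannot discard the work. */
volatile unsigned long sink;

/* Stand-in for a suspected hot spot (hypothetical workload). */
void candidate_kernel(void)
{
    unsigned long acc = 0;
    for (unsigned long i = 0; i < 1000; i++)
        acc += i;
    sink = acc;
}

/* Run fn `repeats` times and return the CPU time consumed, in seconds.
 * clock() measures processor time, not wall-clock time, and its
 * resolution is coarse, so use enough repeats to get a stable figure. */
double time_kernel(kernel_fn fn, int repeats)
{
    clock_t start = clock();
    for (int i = 0; i < repeats; i++)
        fn();
    clock_t stop = clock();
    return (double)(stop - start) / CLOCKS_PER_SEC;
}
```

Timing each candidate this way and comparing it against the run time of the whole application gives a crude ranking of where clock cycles go, which is usually enough to decide what to try in hardware first.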
The point here is generally to offload the microprocessor and bypass limitations introduced by the von Neumann architecture. Some FPGA-enabled boards make this particularly easy by including PCIe connections to the host, so you can rapidly experiment with partitioning, moving code between FPGA resources and the system processor with single lines of code. In addition to partitioning, you are refactoring. Again, refactoring here means breaking C algorithms into coarse-grained logic: single processes that can be machine parallelized into multiple streaming processes. The C to FPGA compilers will unroll as much as they can, but you’ve got to refactor into logic that makes it easier for them to do so, all while retaining behavioral equivalency. A great feature of FPGAs is the ability to simplify algorithms by creating a behavioral model that provides the functionality of the original microprocessor code but eliminates the overhead generally associated with generic library routines from BLAS, LAPACK, and so on.
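To make the point about generic library overhead concrete, the sketch below contrasts a general-purpose dot product with a version specialized to what one design actually needs. Both functions are hypothetical: dot_generic is only loosely modeled on the calling style of numerical libraries (run-time length and strides) and is not the actual BLAS interface, and the fixed length of 8 in dot_fixed is an arbitrary assumption.

```c
#include <stddef.h>

/* Library-style routine: length and strides are only known at run
 * time, so the address arithmetic and loop bound cannot be fixed
 * when the hardware is generated. */
long dot_generic(size_t n, const int *x, size_t incx,
                 const int *y, size_t incy)
{
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (long)x[i * incx] * y[i * incy];
    return acc;
}

#define VEC_LEN 8  /* the only length this hypothetical design uses */

/* Behavioral model for the design at hand: unit stride and a
 * compile-time trip count, which a compiler can unroll fully. */
long dot_fixed(const int x[VEC_LEN], const int y[VEC_LEN])
{
    long acc = 0;
    for (int i = 0; i < VEC_LEN; i++)
        acc += (long)x[i] * y[i];
    return acc;
}
```

For unit strides and a length of 8 the two functions return the same value, so the specialized version can stand in for the generic one without changing behavior.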
Verification occurs at every stage. Visual Studio verifies functional operation and equivalency to the microprocessor stack. Later in the tool flow you verify that the design will operate in the target FPGA, and what clock cycle reduction is possible given the available FPGA resources. After machine compilation, the HDL output can be directed to an industry standard HDL simulator to provide cycle-accurate verification (Figure 2).
Figure 2: Desktop simulation output.
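The desktop stage of this verification can be sketched as a small test bench that feeds the same stimulus to the original code and to the refactored streaming version, then counts disagreements. Everything here is illustrative: moving_sum_ref, moving_sum_step, moving_sum_state and count_mismatches are hypothetical names, and the three-sample moving sum is just a convenient stand-in for a real algorithm.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference: the original array-oriented microprocessor code.
 * out[i] is the sum of in[i] and up to two preceding samples. */
void moving_sum_ref(const int32_t *in, int32_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        int32_t acc = in[i];
        if (i >= 1) acc += in[i - 1];
        if (i >= 2) acc += in[i - 2];
        out[i] = acc;
    }
}

/* Refactored: a streaming process that keeps its own two-sample
 * history and handles one sample per call. */
typedef struct {
    int32_t prev1;  /* previous sample */
    int32_t prev2;  /* sample before that */
} moving_sum_state;

int32_t moving_sum_step(moving_sum_state *s, int32_t x)
{
    int32_t y = x + s->prev1 + s->prev2;
    s->prev2 = s->prev1;
    s->prev1 = x;
    return y;
}

/* Test bench: run both versions on the same input and return how many
 * outputs differ. `scratch` must have room for n reference results. */
size_t count_mismatches(const int32_t *in, int32_t *scratch, size_t n)
{
    moving_sum_state s = {0, 0};
    size_t bad = 0;

    moving_sum_ref(in, scratch, n);
    for (size_t i = 0; i < n; i++) {
        if (moving_sum_step(&s, in[i]) != scratch[i])
            bad++;
    }
    return bad;
}
```

A result of zero mismatches on representative input is the desktop-level evidence of behavioral equivalence; cycle-accurate behavior still has to be checked later in an HDL simulator.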
Now come some leaps that can go wrong. To compile all the way to FPGA gates, the optimizing compiler hands off synthesizable HDL files to a place and route tool. This can be one provided by the FPGA manufacturer or one provided by another EDA firm. This step can entail one heck of a coffee break. Place and route for several million gates, taking into account all the special resources involved, can take hours. This is probably the biggest contrast between compiling software to FPGA hardware and compiling it to a microprocessor (Figure 3). There are newer FPGAs that enable partial reconfiguration, so if your particular process of focus is constrained to an area that can be isolated, iteration times can be significantly reduced. Quartus, Vivado and Impulse all support partial reconfiguration. However, the practical usability of this technique remains to be fully field tested.
Figure 3: Building HDL: complex and not very speedy, compared to compiling for a microprocessor.
In the software to hardware process, another option is to run software on an embedded core inside the FPGA. The cores come in soft- and hard-core versions. Soft cores are programmed into general purpose gates, while hard cores are dedicated silicon on the device. While FPGA-hosted processors are slower than those of the host machine, they are typically Harvard architecture, with physically separate storage and signal pathways for instructions and data. This gives them a memory access advantage and direct communication with the FPGA logic, bypassing host to hardware overhead. Sometimes this can be a less efficient use of silicon, but multiple cores can be added as the design requires. Cores can be proprietary designs provided by the FPGA vendor, which are useful even if unfamiliar. Increasingly, FPGA suppliers are standardizing on ARM cores. This trend is expected to continue. The use of cores in FPGAs as SoC (system on chip) solutions is an intrinsic benefit. The on-board core can consolidate microcontroller or light microprocessing tasks on one chip.
Development environments such as Impulse CoDeveloper also interoperate with full-featured, heavily used tools such as Microsoft™ Visual Studio. A practical example is provided in Impulse App Note 112 (“IATAPP-112”) by Michael Kreeger.
For example, while installing Impulse C and CoDeveloper, the Visual Studio plug-in is automatically installed and just has to be selected during setup. When Visual Studio is thereafter launched, the top pull-down menu can be used to select “File->New Project…”, which creates a subdirectory for the new project. If beginning from existing code, those source files are copied into the solution directory. Header and source files are added to the Visual Studio project and then to the source files folder in the Solution Explorer. To verify before hardware generation, a “debug” software project is built, which makes it possible to test the application in desktop simulation. Next, select the hardware environment from the configuration manager to build the HDL for the target FPGA. This enables the synthesizable HDL to be exported to the appropriate place and route tool to generate RTL for the target FPGA. The whole process may take an hour or so.
As a new methodology this tends to be minimally disruptive. The pragmas and additions are pretty common sense. Glitches may arise from “plumbing” (a whole different topic), such as getting things lined up with PCIe drivers, DMA, DDR and all the devilish details. Our two cents is that the growing body of known good examples and reference designs makes this process less risky. On the tool side we’d like to see shorter place and route times, better back annotation and more useful profiling. Overall as the FPGAs and FPGA based acceleration cards mature, and the body of available IP expands, this technique becomes more mainstream.
Brian Durwood co-founded Impulse Accelerated Technologies in 2002 with David Pellerin, a co-worker from the ABEL® days at Data I/O. Impulse has grown to be the most widely used C to FPGA tool, with customers from NASA to Detroit to Wall Street. Mr. Durwood was previously a VP at Tektronix, a VP at Virtual Vision and an Analyst at NBC. Mr. Durwood is a graduate of Brown and Wharton. Impulse now offers tools, IP and design/integration services.
Nick Granny is a scientist, engineer, and entrepreneur who has been supporting the EDA and high-performance computing communities for more than 25 years. Currently Nick is co-founder and CEO of MNB Technologies, a small company that develops artificial intelligence-based EDA tools and provides technical services to the Impulse C user community. Prior to MNB, Nick was the lead staff scientist in Mentor Graphics’ research into FPGA-accelerated computing and was also a key member of the development and launch team for the IKOS/Mentor VirtuaLogic emulation system. Earlier in his career, Nick was an embedded systems engineering consultant to regional electric power utilities and top-tier critical care medical apparatus manufacturers. Nick is a medically retired US Naval Officer and further serves his community as an adjunct computer sciences instructor and course development consultant to Indiana’s state-wide community college system.