Arm Bus Reconfigurable Accelerators with Embedded FPGA



It’s not difficult to come up with applications ready to reap the rewards reconfigurable I/O offers.

Embedded FPGA is rapidly being adopted in a wide range of applications, including data centers, networking, deep learning, artificial intelligence, aerospace, and defense. Now that the technology is available and proven in silicon on multiple popular process nodes, customers are finding more and more ways to take advantage of its flexibility. Some customers integrate embedded FPGA in the data path or the control path of the chip. However, a very common usage is to connect embedded FPGA to a processor bus as a reconfigurable accelerator or a reconfigurable I/O processor. The benefit of this latter approach is that the accelerator is not bound to a single fixed function.

This article explains how chip designers can use embedded FPGA to implement reconfigurable accelerators on Arm buses (AXI, AHB, even APB). Specialized accelerators can achieve much higher performance than software running on a processor such as Arm, ARC, or MIPS, and a reconfigurable accelerator can be reprogrammed to accelerate multiple tasks instead of just one. In addition, new accelerators can be added at any time, just like a firmware update.

Figure 1 shows how embedded FPGA can be used as a reconfigurable accelerator. For this example, we use an EFLX® embedded FPGA array from Flex Logix. However, embedded FPGA platforms are now available from a range of companies, since the technology has been proven in silicon and customers have been announced.

Figure 1

EFLX Embedded FPGA I/O Resources
An EFLX-2.5K embedded FPGA IP core has 2,520 six-input LUTs, 632 inputs, and 632 outputs. In 16nm, the I/O can operate at around 1GHz, and in 28nm it can operate at greater than 500MHz. Larger arrays are also possible, up to 7×7. In an N×M array, the number of I/O is N*632 inputs and N*632 outputs, and the number of LUTs is N*M*2,520. Thus, even the smallest embedded FPGA has I/O resources and speed sufficient to connect to the widest, fastest Arm buses.
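The sizing formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function names are hypothetical, the LUT counts are the ones quoted for the four accelerators later in this article, and the core-count estimate is a lower bound from LUT count alone that ignores routing and utilization limits.

```python
LUTS_PER_CORE = 2520  # LUTs in one EFLX-2.5K core, from the text
IO_PER_CORE = 632     # inputs (and, separately, outputs) per core, from the text
MAX_DIM = 7           # largest array dimension quoted in the text


def array_resources(n, m):
    """Return (luts, inputs, outputs) for an N x M array using the article's formulas."""
    if not (1 <= n <= MAX_DIM and 1 <= m <= MAX_DIM):
        raise ValueError("array dimensions must be between 1 and 7")
    luts = n * m * LUTS_PER_CORE
    io = n * IO_PER_CORE
    return luts, io, io


def min_cores_for(luts_needed):
    """Lower bound on core count from LUTs alone (hypothetical helper)."""
    return -(-luts_needed // LUTS_PER_CORE)  # ceiling division


if __name__ == "__main__":
    # LUT counts quoted for the example accelerators in this article.
    designs = {"AES-128": 1142, "SHA-256": 1634, "JPEG encoder": 11364, "256-point FFT": 8360}
    for name, luts in designs.items():
        print(name, "needs at least", min_cores_for(luts), "core(s)")
    print("7x7 array:", array_resources(7, 7))
```

By this estimate, AES-128 and SHA-256 each fit in one core, the FFT needs at least four, and the JPEG encoder at least five.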

Accelerator Examples on the AXI/AHB Bus
We will now show four examples of how embedded FPGA can be used as an accelerator on the AXI/AHB bus. These examples were selected because open-source RTL is available for each, which allows the performance to be independently verified. They are also a good choice because the performance of the same function on an Arm core is available, providing an Arm-to-embedded FPGA acceleration comparison. We selected Arm because it is by far the most popular processor core. However, we expect the conclusion would be the same for any processor core.

Many other types of accelerators can be programmed in embedded FPGA.

AES-128
Figure 2 is a block diagram of an embedded FPGA configured as an AES-128 accelerator. In this example, even the AXI4-Stream bus logic for data movement and the APB bus logic for control are implemented in the embedded FPGA. Because this interface functionality won’t change, it could instead be hardwired outside the array; implementing it inside the array shows off the performance of the embedded FPGA.

Figure 2

As shown in Table 1, the RTL for this AES-128 accelerator requires 1,142 LUTs and fits in a single EFLX-2.5K IP core, which is available in multiple process nodes. In TSMC16FFC, the AES-128 accelerator runs at a worst-case frequency of 374MHz (-40/125C, 0.72Vjunction, Slow-Slow corner).

Table 1

This performance is 136 to 300 times faster than AES-128 software code running on an Arm Cortex-M4 in the same process, depending on the clock speed assumed for the Cortex-M4.
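The spread in that range follows from simple arithmetic. The sketch below assumes, as a simplification not stated in the article, that software throughput scales linearly with processor clock, so the speedup is inversely proportional to the assumed clock. The function names are hypothetical, and the clock is expressed as a relative value because the actual Cortex-M4 frequencies behind the figures are not given here.

```python
def implied_clock_ratio(speedup_low, speedup_high):
    """Ratio between the fastest and slowest assumed CPU clocks.

    Assumes software throughput is proportional to CPU clock, so the
    speedup over software is inversely proportional to that clock.
    """
    return speedup_high / speedup_low


def speedup_at(ref_speedup, ref_clock, clock):
    """Scale a known speedup from a reference CPU clock to another clock."""
    return ref_speedup * ref_clock / clock


if __name__ == "__main__":
    ratio = implied_clock_ratio(136, 300)
    print("fast-to-slow clock ratio implied by the range: %.2f" % ratio)
    # Relative clocks: 1.0 is the slow assumption, `ratio` the fast one.
    print("speedup at the fast clock: %.0f" % speedup_at(300, 1.0, ratio))
```

Under that assumption, the two ends of the range correspond to processor clock assumptions that differ by a factor of roughly 2.2.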

SHA-256
An embedded FPGA configured as a SHA-256 accelerator is pictured in Figure 3. In this example, the AXI4 slave RTL is external to the embedded FPGA and is used both for accelerator data movement and configuration of the accelerator registers. The AXI4 slave logic is external for lowest bus latency for data movement.
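To make the software side of this arrangement concrete, here is a minimal behavioral sketch in Python of how a processor might drive such an accelerator through one memory-mapped interface that carries both configuration and data. The register offsets, command values, and class name are hypothetical and are not taken from the Flex Logix RTL. Python's hashlib stands in for the hardware datapath and also handles the message padding that a real driver or the hardware would need to perform.

```python
import hashlib

BLOCK_BYTES = 64  # block size stated in the article


class Sha256AcceleratorModel:
    """Behavioral stand-in for the accelerator behind an AXI4 slave."""

    # Hypothetical register map, for illustration only.
    CTRL = 0x00     # write 1 to start a new message, 2 to finalize
    STATUS = 0x04   # reads 1 once the digest is valid
    DATA_IN = 0x40  # write window for one block of up to 64 bytes
    DIGEST = 0x80   # read window for the 32-byte digest

    def __init__(self):
        self._hash = hashlib.sha256()
        self._digest = None

    def write(self, addr, payload):
        if addr == self.CTRL:
            command = int.from_bytes(payload, "little")
            if command == 1:
                self._hash, self._digest = hashlib.sha256(), None
            elif command == 2:
                self._digest = self._hash.digest()
        elif addr == self.DATA_IN:
            if len(payload) > BLOCK_BYTES:
                raise ValueError("block larger than 64 bytes")
            self._hash.update(payload)
        else:
            raise ValueError("write to unmapped address")

    def read(self, addr):
        if addr == self.STATUS:
            return (1 if self._digest is not None else 0).to_bytes(4, "little")
        if addr == self.DIGEST and self._digest is not None:
            return self._digest
        raise ValueError("read from unmapped or not-ready address")


def sha256_via_accelerator(dev, message):
    """Driver-style sequence: configure, move data block by block, poll, read."""
    dev.write(dev.CTRL, (1).to_bytes(4, "little"))
    for offset in range(0, len(message), BLOCK_BYTES):
        dev.write(dev.DATA_IN, message[offset:offset + BLOCK_BYTES])
    dev.write(dev.CTRL, (2).to_bytes(4, "little"))
    while dev.read(dev.STATUS) != (1).to_bytes(4, "little"):
        pass  # real firmware would poll or wait for an interrupt
    return dev.read(dev.DIGEST)


if __name__ == "__main__":
    data = b"embedded FPGA" * 50
    assert sha256_via_accelerator(Sha256AcceleratorModel(), data) == hashlib.sha256(data).digest()
    print("model digest matches hashlib")
```

A model like this can double as a golden reference when checking the output of the real accelerator.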

Figure 3

The RTL for this SHA-256 accelerator, operating on 64-byte data blocks, requires 1,634 LUTs and fits in a single EFLX-2.5K IP core, which is available in multiple process nodes. In TSMC16FFC, the SHA-256 accelerator runs at a worst-case frequency of 171MHz (-40/125C, 0.72Vjunction, Slow-Slow corner).

Table 2

This performance is approximately 40 times faster than SHA-256 software code running on an Arm Cortex-M4 in the same process.

JPEG Encoder
Figure 4 presents an embedded FPGA configured as a JPEG encoder. In this example, the AXI4-Stream and APB interface logic are shown implemented in the embedded FPGA itself, but this RTL can easily be put outside and hardwired, as it won’t need to be reconfigured.

Figure 4

This RTL (Figure 5) requires 11,364 LUTs and a significant amount of memory (2 x 256Kbyte dual port RAMs), which need to be attached to the embedded FPGA. The number of signals required to attach to memory is very small compared to the I/Os available.
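A rough signal count illustrates the point about memory I/O. The sketch below assumes a 32-bit data word and true dual-port RAMs with an address, write data, a write enable, and a port enable per port. None of these interface details come from the article, and the function name is hypothetical.

```python
IO_PER_CORE = 632  # inputs (and, separately, outputs) of one EFLX-2.5K core, from the text


def ram_signals(size_bytes, data_bits, ports):
    """Return (fpga_outputs, fpga_inputs) needed to drive one external RAM.

    Assumed per-port interface (not from the article): address, write data,
    write enable, and port enable driven by the FPGA; read data returned to it.
    """
    words = size_bytes * 8 // data_bits
    addr_bits = (words - 1).bit_length()
    outputs_per_port = addr_bits + data_bits + 2
    inputs_per_port = data_bits
    return ports * outputs_per_port, ports * inputs_per_port


if __name__ == "__main__":
    outs, ins = ram_signals(256 * 1024, data_bits=32, ports=2)
    total_outs, total_ins = 2 * outs, 2 * ins  # two RAMs, as in the article
    print("FPGA outputs used:", total_outs, "of", IO_PER_CORE, "on a single core")
    print("FPGA inputs used: ", total_ins, "of", IO_PER_CORE, "on a single core")
```

Under these assumptions, both RAMs together need 200 outputs and 128 inputs, which fits within the I/O of even one core, and the JPEG encoder needs a multi-core array in any case.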

Figure 5

In TSMC16FFC (worst-case conditions), the encoder runs at 149MHz. This is approximately 31 times the throughput of JPEG encoder software code running on an Arm Cortex-M4 in the same process.

Table 3

256-Point FFT
An embedded FPGA can be configured as a 256-point FFT accelerator acting as a slave/master on an AXI4-Stream bus, with the AXI RTL implemented in the EFLX array (Figure 6).

Figure 6

The RTL for this accelerator requires 8,360 LUTs and 16 external RAMs (256 words each, dual port). In this example, the RAM is attached inside the array for greater performance. (See Figure 7 and Table 4.)

Figure 7

The worst-case performance for TSMC16FFC is 303MHz. A benchmark versus an Arm processor is not available, but given the high degree of parallelism in the MACs and memory references, we expect the performance of this reconfigurable accelerator to be much higher than that of a typical processor core.
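To give a sense of how much work is available to parallelize, the sketch below counts the operations in a textbook radix-2 FFT. This is an assumption for illustration only: the generated RTL may use a different radix or structure, and the function name is hypothetical.

```python
def radix2_fft_work(n):
    """Return (stages, butterflies, real_multiplies) for a radix-2 FFT of size n.

    Textbook counts: log2(n) stages of n/2 butterflies, each with one complex
    multiply, counted here as four real multiplies.
    """
    if n < 2 or n & (n - 1):
        raise ValueError("n must be a power of two and at least 2")
    stages = n.bit_length() - 1
    butterflies = (n // 2) * stages
    real_multiplies = 4 * butterflies
    return stages, butterflies, real_multiplies


if __name__ == "__main__":
    stages, butterflies, multiplies = radix2_fft_work(256)
    print(stages, "stages,", butterflies, "butterflies,", multiplies, "real multiplies")
```

For 256 points this gives 8 stages and 1,024 butterflies. A processor core works through these largely one after another, while an FPGA implementation can operate on many of them in the same clock cycle.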

Table 4

Conclusion
Embedded FPGA is available now from a number of suppliers, including Flex Logix, Achronix, and Menta, in processes from 65nm to 14nm. This technology offers more flexibility than fixed-function accelerators and, as the examples above show, can deliver 30 to 300 times the performance of software running on a processor core. This provides chip designers with significant competitive advantages and is applicable to a wide range of industries. As more companies take advantage of this approach, we expect embedded FPGAs to become as mainstream as Arm processors are today.

Resources

Figures 2, 3, 4: Download Verilog source code from:
www.flex-logix.com/accelerators

Figure 6: Download Verilog source code from:
http://www.spiral.net/hardware/sort/sort.html


Tony Kozaczuk is Director, Solutions Architecture, Flex Logix Technologies, Inc.