Extending Battery Life in IoT Devices
Use embedded FPGAs to perform low-level chores.
Most Internet of Things (IoT) integrated circuits are battery operated, and users do not want to change batteries (consider smoke alarms chirping in the middle of the night). Ironically, during the life of an IoT chip, little time is spent doing complex processing. Rather, most time and energy are spent on low-level, repetitive tasks. The good news is that battery life can be extended, at least in the cases we analyzed, by using an embedded FPGA instead of the processor for those low-level, repetitive tasks. The embedded FPGA does not replace the processor, which is still used for the more complex tasks. This article considers an example using the ARM® Cortex®-M4 processor only because data was available for comparison. The conclusion is not specific to ARM; it applies to any embedded processor on the market today.
Introduction to Embedded FPGAs
An FPGA combines an array of programmable/reconfigurable logic blocks with a programmable interconnect fabric. In an FPGA chip, the outer rim consists of a combination of GPIO, SERDES and specialized PHYs such as DDR3/4. In advanced FPGAs, the I/O ring is roughly 1/4 of the chip and the “fabric” is roughly 3/4 of the chip. The “fabric” itself is mostly interconnect in today’s FPGA chips: 20-25% of the fabric area is programmable logic and 75-80% is programmable interconnect. An embedded FPGA is an FPGA fabric without the surrounding ring of GPIO, SERDES, and PHYs (Figure 1). Instead, an embedded FPGA connects to the rest of the chip using standard digital signaling, enabling very wide, very fast on-chip interconnects.
This analysis uses the power specifications of the Flex Logix EFLX embedded FPGA. Flex Logix provides high-density, high-performance, energy-efficient embedded FPGA hard IP. Its EFLX platform is a scalable architecture of silicon-proven IP that enables embedded FPGAs from ~100 LUTs to >100K LUTs. The smaller core (EFLX-100) has 120 LUTs with 152 inputs and 152 outputs. The larger core (EFLX-2.5K) has 2,520 LUTs with 632 inputs and 632 outputs. The cores can be “tiled” to form arrays. The EFLX-100 cores can be tiled to build arrays up to 5×5, or 3,000 LUTs of reconfigurable logic with 760 inputs and 760 outputs. The EFLX-2.5K cores can be tiled to build arrays up to 7×7, or 123K LUTs of reconfigurable logic with 4,424 inputs and 4,424 outputs. The EFLX cores come in two versions that can be mixed together in arrays. One version is all logic, and the other has 22-bit MACs for DSP functions. The small core can have 2 MACs, and the large core can have 40 MACs. The cores also support memory structures, which can be part of the array for small memory sizes or outside the array for larger memory sizes.
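The tiling figures above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the `Core` class and `array_resources` function are hypothetical names, and the scaling rule (LUTs grow with the number of cores, inputs and outputs grow with the array's side length) is inferred from the quoted numbers rather than taken from Flex Logix documentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    """Per-core figures as quoted in the text (hypothetical helper type)."""
    name: str
    luts: int
    inputs: int
    outputs: int


EFLX_100 = Core("EFLX-100", luts=120, inputs=152, outputs=152)
EFLX_2_5K = Core("EFLX-2.5K", luts=2520, inputs=632, outputs=632)


def array_resources(core, n):
    """Return (luts, inputs, outputs) for an n x n array of identical cores.

    Assumption: LUTs scale with n*n, while inputs and outputs scale with n.
    This matches the quoted 5x5 and 7x7 totals but is an inference.
    """
    return (core.luts * n * n, core.inputs * n, core.outputs * n)


small = array_resources(EFLX_100, 5)   # (3000, 760, 760)
large = array_resources(EFLX_2_5K, 7)  # (123480, 4424, 4424)
```

Both results agree with the array sizes quoted above (3,000 LUTs with 760 inputs and outputs; roughly 123K LUTs with 4,424 inputs and outputs).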
The embedded FPGA is programmed using Verilog or VHDL, which are input to a synthesis tool such as Synopsys’ Synplify. The EFLX Compiler takes the synthesized output and packs/places/routes the array and generates a bit stream that programs the EFLX array to emulate the RTL. The bit stream for a single EFLX-100 is ~50K bits and can be stored in the same flash memory that stores the embedded processor code. The EFLX array can be reprogrammed at any time, just like an embedded processor.
Embedded FPGAs can also be integrated as accelerators on the processor bus, as programmable logic in the control path, as I/O processors, or as some combination of these.
Using Embedded FPGAs to Offload DSP Functionality
An embedded FPGA can offload some DSP algorithms, running them in fewer clock cycles and with less energy than an ARM Cortex M4F CPU. The analysis, done in the TSMC 40ULP process as an example, shows that the ARM Cortex M4 consumes ~1.5-4.75x more energy than EFLX for the same DSP computations. The actual ARM power will be higher, because instruction fetch and memory access power were not factored into the power analysis for the ARM CPU. The ARM Cortex M4 requires ~17-31x more clock cycles than EFLX for the same DSP functions. The total energy for running the DSP algorithm on the EFLX embedded FPGA is frequency independent, and the data throughput of EFLX is fully proportional to EFLX frequency (e.g., no cache misses). A 5-tap FIR filter and a single-stage biquad filter were the DSP algorithms used for the analysis.
Embedded FPGA Power Architecture Features
A typical FPGA architecture is shown in Figure 2.
An embedded FPGA has none of the most power-hungry blocks (PLL, SERDES, PHYs). Its connections with the rest of the chip are through standard CMOS “pins” or drivers.
The EFLX embedded FPGA in TSMC 40ULP has power gating (Figure 3) at the array level with state retention (down to 0.5 V) and fine-grain clock gating at the logic-block level, reducing both dynamic and static power.
Processor vs Embedded FPGA Data Flow and Power Profile
Figure 4 shows the data flow of running the DSP algorithm in an ARM Cortex CPU versus an EFLX array. For the ARM CPU, executing the DSP algorithm requires cache, flash and SRAM resources. For EFLX, only a DMA bus master and the EFLX array are used to run the DSP algorithm, while the rest of the SoC can be powered off. The memory controller and external memory are used in both cases.
Energy Analysis Details
Two filters were used for the analysis: a 5-tap FIR filter and a single-stage biquad filter (Figure 5). Both filters were chosen because of their relevance in low-power, battery-powered applications (FIR filters are extensively used for data communications, audio echo cancellation and smoothing data, and biquad filters are extensively used for audio equalization and motor control).
Running the filter algorithms on an ARM Cortex M4F took 4080 cycles for the biquad and 8080 cycles for the FIR for 256 data samples (Figure 6). Running the same algorithms in EFLX took 256 cycles for 256 data samples, which was 17.5 times and 31.6 times faster, respectively, than running on the ARM CPU (Table 1).
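For readers less familiar with the two filters, the following is a minimal reference sketch of what each one computes per sample. It is not the RTL or the ARM code used in the analysis; the function names are hypothetical, the coefficients are arbitrary placeholders, and the biquad is written in direct form I, which is one common structure among several.

```python
def fir(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n-k]."""
    history = [0] * len(taps)          # x[n], x[n-1], ..., newest first
    out = []
    for x in samples:
        history = [x] + history[:-1]   # shift the delay line by one sample
        out.append(sum(c * h for c, h in zip(taps, history)))
    return out


def biquad(samples, b, a):
    """Single-stage biquad, direct form I:
    y[n] = b0*x[n] + b1*x[n-1] + b2*x[n-2] - a1*y[n-1] - a2*y[n-2]
    """
    b0, b1, b2 = b
    a1, a2 = a
    x1 = x2 = y1 = y2 = 0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x                 # update input delay elements
        y2, y1 = y1, y                 # update output delay elements
        out.append(y)
    return out


# Placeholder coefficients, chosen only to make the behavior easy to follow.
impulse = [1, 0, 0, 0, 0, 0]
fir_out = fir(impulse, [1, 2, 3, 4, 5])          # impulse response = the taps
bq_out = biquad([1, 0, 0], (1, 0, 0), (-0.5, 0))  # decaying feedback response
```

Each output sample of either filter needs five multiply-accumulate operations. A processor executes these largely one after another, while parallel hardware can perform them side by side.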
- M4 energy per cycle estimated at 8 pJ, based on TSMC 40 nm G process at 0.9 V, 25 °C
- M4 power will be higher due to cache, instruction fetch power
- External memory accesses are normalized between M4 and EFLX
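The processor-side energy follows directly from the cycle counts and the 8 pJ per cycle estimate. The sketch below works through that arithmetic only; the function name is hypothetical. The EFLX-side energy figures come from Table 1, which is not reproduced here, so they are deliberately left out rather than guessed.

```python
M4_PJ_PER_CYCLE = 8.0   # estimate quoted in the notes above
SAMPLES = 256           # block size used in the analysis


def m4_energy_nj(cycles, pj_per_cycle=M4_PJ_PER_CYCLE):
    """Core energy in nanojoules for a given cycle count.

    Excludes cache, instruction fetch and memory access energy,
    as the notes above also do.
    """
    return cycles * pj_per_cycle / 1000.0


biquad_nj = m4_energy_nj(4080)   # about 32.64 nJ per 256-sample block
fir_nj = m4_energy_nj(8080)      # about 64.64 nJ per 256-sample block

# Energy per sample in picojoules.
biquad_pj_per_sample = 4080 * M4_PJ_PER_CYCLE / SAMPLES   # 127.5 pJ
fir_pj_per_sample = 8080 * M4_PJ_PER_CYCLE / SAMPLES      # 252.5 pJ
```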
The 16-bit FIR filter and 16-bit biquad filter (Figure 7) used 3 EFLX-100 cores and the 32-bit FIR filter used 5 EFLX-100 cores using only ~20% of the total LUTs available (the number of cores is driven by the number of MACs required for the DSP). This low utilization provided extra resources for additional functionality.
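The link between MAC count and core count can be sketched as follows. This is an illustration under stated assumptions, not the EFLX Compiler's mapping: it assumes one MAC per filter coefficient (fully parallel, one sample per clock), 2 MACs per EFLX-100 DSP core as quoted earlier, and that the 3-core figure applies to each 16-bit filter individually. The 32-bit case is not modeled, because the article does not say how 32-bit products map onto 22-bit MACs.

```python
import math

MACS_PER_EFLX_100 = 2   # DSP version of the small core, as quoted in the text


def cores_for_macs(macs_needed, macs_per_core=MACS_PER_EFLX_100):
    """Smallest number of cores that provides at least macs_needed MACs."""
    return math.ceil(macs_needed / macs_per_core)


# Assumption: one MAC per coefficient so all products are formed in parallel.
FIR_5TAP_MACS = 5    # five taps
BIQUAD_MACS = 5      # b0, b1, b2, a1, a2

fir_cores = cores_for_macs(FIR_5TAP_MACS)      # 3
biquad_cores = cores_for_macs(BIQUAD_MACS)     # 3
```

Under these assumptions both 16-bit filters come out at 3 cores, which is consistent with the figure given above, and it explains why LUT utilization stays low: the cores are added for their MACs, not for their logic.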
The ARM M4 consumed ~1.5x-4.75x more energy than the embedded FPGA for the same function. The higher energy was caused by the M4 taking ~17x-31x more clock cycles to perform the same function. This did not include the memory access power for code and cache references, which will further increase the processor's total energy relative to the embedded FPGA.
The total energy to perform the DSP computation on EFLX is frequency independent at higher frequencies. Leakage energy is a large contribution to total energy at the lower frequencies. The leakage energy contribution at low frequencies (Figure 8) can be eliminated by power gating EFLX between computation cycles, using EFLX’s power gating and state retention feature.
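The frequency behavior described above can be illustrated with a simple first-order model: dynamic energy depends only on the number of cycles, while leakage energy is leakage power multiplied by the time the array stays powered. The sketch below is a rough model, not measured data. The function name is hypothetical, and the dynamic energy per cycle and the leakage power are placeholder values chosen only to show the trend.

```python
def total_energy_nj(freq_mhz, cycles, dyn_pj_per_cycle, leak_uw):
    """First-order energy for one block of computation, in nanojoules.

    dynamic = cycles * energy per cycle         (independent of frequency)
    leakage = leakage power * (cycles / freq)   (grows as frequency drops)
    """
    dynamic_pj = cycles * dyn_pj_per_cycle
    active_time_us = cycles / freq_mhz          # cycles / MHz = microseconds
    leakage_pj = leak_uw * active_time_us       # uW * us = pJ
    return (dynamic_pj + leakage_pj) / 1000.0


CYCLES = 256            # one sample per cycle, 256 samples (from the text)
DYN_PJ_PER_CYCLE = 2.0  # placeholder value, not an EFLX specification
LEAK_UW = 10.0          # placeholder value, not an EFLX specification

fast = total_energy_nj(100.0, CYCLES, DYN_PJ_PER_CYCLE, LEAK_UW)
slow = total_energy_nj(0.1, CYCLES, DYN_PJ_PER_CYCLE, LEAK_UW)
```

With these placeholder values, the fast run is dominated by the dynamic term and the slow run by leakage. Running the block quickly and then power gating the array until the next block corresponds to the fast case, assuming ideal gating with no leakage while gated and no wake-up cost.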
The DSP logic is the main contributor of dynamic power for these specific benchmarks. Figure 9 shows the DSP logic in the center consuming most of the power while the rest of the core (top and bottom of each core) are idle and consuming very little power.
Using an embedded FPGA to offload some DSP algorithms takes fewer clock cycles and consumes less energy than running those functions on an ARM Cortex M4F. As the above analysis demonstrated, EFLX has higher data throughput than the ARM Cortex M4F across all frequencies. The ARM Cortex M4F requires ~17x-31x more clock cycles than EFLX to perform the same function and consumes ~1.5x-4.75x more energy than EFLX for the same computations. In fact, actual ARM Cortex M4F power will be even higher due to instruction fetch and memory accesses that were not factored into this analysis.
This conclusion is not specific to ARM and will apply to any processor. The ARM Cortex M4 was used simply because data was available for comparison. The total energy to perform DSP computations on EFLX is frequency independent. The leakage energy contribution at low frequencies can be eliminated by power gating EFLX between computation cycles. Data throughput of EFLX is fully proportional to EFLX frequency (e.g., no cache misses). 5-tap FIR filters and single-stage biquad filters are examples of DSP algorithms that can yield these types of results.
Tony Kozaczuk is Director, Solutions Architecture, Flex Logix Technologies, Inc. Originally from Buenos Aires, Argentina, Kozaczuk earned his BSEE at San Francisco State University. His team’s role at Flex Logix is to provide support to customers to evaluate architectural alternatives for using EFLX to achieve the best result. Kozaczuk has more than twenty years’ experience architecting systems and ICs at National, Sun and Intel. Most recently at Intel, he was Lead System Architect for multiple generations of Intel CPU Cores, and led system clocking architecture for all client systems.