SoC FPGA Relies on Many Cores and 14 nm Tri-Gate Process
Altera’s multicore FPGA uses Intel’s tri-gate (FinFET) 14 nm 3D process technology to combine logic, four ARM A53 cores and OpenCL as a heterogeneous SoC replacement.
At the core of Altera’s recent Stratix 10 announcement is…multiple cores. Four ARM Cortex-A53s to be exact, plus a bunch of DSP blocks, more logic than the company has ever put on one device, double the on-chip clock rate and software designed to create a multicore system instead of just a dense FPGA. In effect, it’s a system-on-chip (SoC) that bests the company’s recent Arria 10 SoC.
But also at the core of the 2014 next-gen FPGA family is Intel’s tri-gate (FinFET) 14 nm 3D process technology, to which Altera has the exclusive FPGA rights for 12 years. It’s Intel’s tri-gate transistors that give Altera the edge—at least on paper—in the war of high density FPGAs. Details on Altera’s Stratix 10 (the follow-on FPGA to the Stratix V) are still scarce, though the company’s second pre-announcement added more details and we can guess at others.
The new multicore Stratix 10 FPGA SoC will tape out in 2014 and may ship by late 2014. Let’s take a look at some of the elements that make up the Stratix 10’s performance goals.
Tri-Gate’s Secret Sauce
Altera is relying heavily on four factors to achieve the aggressive specs for Stratix 10: a refined architecture, the ARM Cortex-A53, software tools such as OpenCL, and Intel’s tri-gate 14 nm process.
Intel has consistently kept pace with Moore’s Law through new process innovations. Following two-dimensional high-k (dielectric) metal gate process improvements at 45 nm in 2007, Intel unveiled 3D transistors in 2011 at the 22 nm feature size. As oxide thickness and transistor dimensions approach mere angstroms, the company essentially made transistors “smaller” by raising the silicon channel into a vertical fin and wrapping the gate around three of its sides instead of laying it across a flat channel. Using the Z axis this way allows more transistors per unit area, and each one is smaller and more power efficient. The concept of a non-planar transistor whose gate wraps around a fin-shaped channel is called a “finFET,” although Intel calls its version “tri-gate” (Figure 1).
|Figure 1: Intel’s tri-gate transistors propagate Moore’s Law, increase silicon density and reduce power.|
In early 2013, Intel and Altera inked a deal that allowed Altera to become the only FPGA company with access to Intel’s 14 nm tri-gate process. Altera’s CEO John Daane said at the time that he believed Intel was two to four years ahead of the competition. (This is debatable as TSMC moves from 20 nm to 16 nm.) Intel is already producing 22 nm 4th generation Haswell processors on tri-gate and 14 nm Broadwell (5th generation Core) production is scheduled for 2014.
Despite a recent delay that necessitated design rule changes to Intel’s 14 nm process, Altera as an Intel foundry customer will certainly follow shortly after Intel’s own Broadwell production. According to Chris Balough, Altera’s senior director of SoC product marketing, design software for the Stratix 10 will be available Q1 2014 and first silicon will tape out sometime in 2014.
In June and in November, Altera began predicting some pretty big numbers for Stratix 10…so much so that the nomenclature leaped from Stratix V (available now) to Stratix 10. The company is making what Balough described to me as a “hyperbolic claim, something we never guessed was possible.” Logic fabric speed will double (relative to 28 nm devices) to over 1 GHz. Even though density increases, the SoC device will also achieve a 70 percent power savings with a target 4x. Other technology targets are shown in Table 1.
|Table 1: Stratix 10 performance targets. Table headings: “Industry’s First Gigahertz FPGAs and SoCs” and “Break the Bandwidth Barrier with Unimaginable High-Speed Interface Rates.” (Source: Altera.)|
There is credibility in these claims, albeit based upon scaling factors. The new, in-production Arria 10 SoC is based upon TSMC’s 20 nm process and is 15 percent faster than Arria V, consumes 40 percent less power and boasts a 50 percent processor system improvement. For Stratix 10, the magic again is Intel’s 14 nm tri-gate process as described in the whitepaper “The Breakthrough Advantage for FPGAs with Tri-Gate Technology” (http://www.altera.com/literature/wp/wp-01201-fpga-tri-gate-technology.pdf).
Altera has provided no details yet on architecture other than the notional marketing chart shown in Figure 2. Fundamentally, Stratix 10 processing will take place in FPGA gates, four ARM Cortex-A53 processors and a series of DSP blocks. Software tools meld it all together into a powerful one-chip system.
|Figure 2: Altera’s Stratix 10 SoC FPGA relies on four elements to achieve what the company calls “unimaginable” performance. Intel’s 14 nm tri-gate process is the root of the performance discontinuity from current 28 nm FPGAs. (Source: Altera.)|
DSP Blocks and ARM Cores
Current-generation Stratix V devices contain two columns of DSP blocks surrounded by constellations of logic array blocks. Each DSP block (Figure 3) can be configured for up to eight 9 x 9 bit multipliers, four 18 x 18 bit multipliers or one 36 x 36 bit multiplier. These DSP blocks run at 333 MHz and provide data throughput performance of 2.67 giga-MACs per block. The largest Stratix V device (EP1S80) has 22 DSP blocks.
|Figure 3: Stratix V (current generation) DSP block.|
With the forecasted 1 GHz fabric on Stratix 10 over Stratix V’s 333 MHz, one might expect a 3x performance increase with no logic changes. However, Altera’s Balough told me they’re expecting a 6x throughput increase for Stratix 10 to “greater than 10 TFLOPs.” It’s entirely likely the number of DSP blocks will double, RAM block sizes will increase and there will be some fine tuning within the blocks and routing fabric. However, it’s also possible the 6x claim is taking into account other FPGA resources for data movement and manipulation such as the ARM processors.
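The arithmetic behind these figures can be checked with a short sketch. It uses only the numbers quoted above; the variable and function names are illustrative, and it assumes each multiplier completes one multiply-accumulate per clock cycle, which matches the quoted per-block figure to within rounding (2.67 giga-MACs corresponds to a clock of 333.3 MHz).

```python
# Back-of-envelope check of the quoted DSP figures.
# All names are illustrative; the numbers come from the article text.

MULTIPLIERS_PER_BLOCK = 8      # eight 9 x 9 multipliers per DSP block
CURRENT_CLOCK_HZ = 333e6       # quoted DSP block clock
TARGET_CLOCK_HZ = 1e9          # forecast Stratix 10 fabric clock
CLAIMED_THROUGHPUT_GAIN = 6.0  # throughput increase quoted by Altera


def block_gmacs(multipliers, clock_hz):
    """Peak giga-MACs per second for one DSP block.

    Assumption: every multiplier finishes one MAC per clock cycle.
    """
    return multipliers * clock_hz / 1e9


# Speedup available from the faster clock alone, with no logic changes.
clock_gain = TARGET_CLOCK_HZ / CURRENT_CLOCK_HZ

# Remaining factor that must come from more blocks, wider blocks,
# better routing or other resources.
residual_gain = CLAIMED_THROUGHPUT_GAIN / clock_gain

print(f"per-block peak: {block_gmacs(MULTIPLIERS_PER_BLOCK, CURRENT_CLOCK_HZ):.3f} GMAC/s")
print(f"gain from clock alone: {clock_gain:.2f}x")
print(f"gain still to be explained: {residual_gain:.2f}x")
```

The residual factor of roughly two is what a doubling of the DSP block count would supply, though that doubling is a guess and not something Altera has confirmed.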
An FPGA’s DSP sub-systems are a major reason FPGAs are chosen for high-performance algorithms like video CODECs or data-plane processing. On the Stratix 10, there are also the four ARM Cortex-A53 processors. Altera press materials cite an 8x throughput improvement over 28 nm FPGAs, and Altera says ARM claims the A53 has the highest power efficiency of any 64-bit processor. That’s probably not hard to believe considering ARM’s leadership in mobile and low-power devices…and Intel’s struggle to catch up.
The four Cortex-A53 cores shown in Figure 4 are members of ARM’s Cortex-A50 series announced in 2012 and are based on the ARMv8 64-bit architecture. The A53 chosen for Stratix 10 is the “little brother” to the A57, ARM’s single-threaded, deep pipeline monster targeting servers. Both devices are designed for gigahertz performance in heterogeneous SoCs. For example, Stratix 10’s A53s will target applications that offload x86 host processors, while AMD’s HieroFalcon server accelerator 64-bit SoC (2H 2014) uses the Cortex-A57 to complement AMD’s on-chip x86.
|Figure 4: Stratix 10 contains four Cortex-A53 cores, ARM’s most impressive multi-threaded 64-bit core.|
Stratix 10’s A53s are software compatible with previous generation 32-bit ARM Cortex-A9s in Altera Cyclone, Arria and Stratix devices. This allows design sockets to upgrade to the latest FPGAs while migrating forward operating systems, unmodified software and IP cores (Figure 5). The A53 can access up to 256 TB of memory with ECC for on-core L1/L2 caches; both features pinpoint data center and high-end heterogeneous computing applications. To fully capitalize on Stratix 10’s logic, DSP and A53 resource elements, the company is planning a suite of heterogeneous-focused tools.
|Figure 5: Stratix 10’s quad Cortex-A53 cores can run in 32-bit mode for upgradeability from previous generation devices.|
Sweet Tool Suites
Altera’s Chris Balough asserts that “all of Stratix 10’s processing elements are compelling enough by themselves,” but the “unimaginable performance” comes into play when they work together as a system. Altera is counting on OpenCL and the SoC Embedded Design Suite (EDS) to capitalize on the FPGA’s heterogeneous elements.
|Intel® QuickAssist Technology Capitalizes on FPGA Resources
Altera was the first FPGA company on record to support Intel’s QuickAssist technology, a set of instructions and APIs that seek to offload host CPU instructions to a co-processor like an FPGA. XtremeData was the board vendor that demoed it at the 2008 Intel Developer Forum with a Xeon and MathWorks’ Simulink graphical block diagramming software.
Altera’s Stratix 10, with faster transceivers, DSP blocks and 1 GHz fabric, is likely to be a favorite co-processor for Intel host CPUs like the latest 4th Generation Core devices (code named Haswell) or Intel’s next-generation 14 nm tri-gate Broadwell CPUs. According to Intel, QuickAssist (http://www.intel.com/content/www/us/en/io/quickassist-technology/quickassist-technology-developer.html) is ideal for computational workloads including cryptography, data compression and pattern matching—all applications that are algorithmically “heavy” and can take advantage of an FPGA’s parallelism via mixed resources.
In the case of Altera’s Stratix 10 next-gen devices, CPU offload via QuickAssist could also utilize high-level language OpenCL to construct heterogeneous “hardware macros” which balance workloads between the FPGA fabric, DSP blocks and the decision-making capability of the ARM Cortex-A53. All of these resources, of course, work as an accelerator sub-system to Intel’s Xeon, Core and now the newest Rangeley Atom CPUs—all of which contain Intel QuickAssist Technology.
The company has endorsed C-based design entry using OpenCL for several years via the SDK for OpenCL product. First demoed by Altera at SuperComputing 2012, the OpenCL design flow allows designers to work in C and easily mix logic and on-chip resources into heterogeneous multicore architectures. The high-level design flow alleviates the typical RTL coding required for most FPGA tasks. It also mimics the way embedded developers mix and match processing resources at the board or system level.
OpenCL code is also portable and lets designers load-level or performance-tune applications to take advantage of the massively parallel nature of FPGAs and the Cortex-A53 CPUs. In October of this year, Altera announced the SDK for OpenCL is conformant to the OpenCL 1.0 standard and is listed on the Khronos Group list of conformant products. This means that Altera can “provide a validated cross-platform programming environment” designed to accelerate algorithms at significantly lower power versus other cross compiler alternatives.
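The data-parallel style that OpenCL encourages can be sketched in a few lines. This is a plain Python stand-in for the execution model, not OpenCL C and not Altera’s SDK: the names vector_mac_kernel and enqueue_nd_range are hypothetical, and the loop runs work-items one after another where a real runtime is free to run them concurrently.

```python
# Python stand-in for the OpenCL execution model (illustrative only).
# Function names are hypothetical and not part of any OpenCL API.


def vector_mac_kernel(gid, a, b, acc):
    """Body of one work-item: multiply-accumulate at index gid."""
    acc[gid] += a[gid] * b[gid]


def enqueue_nd_range(kernel, global_size, *args):
    """Invoke the kernel once per work-item index.

    The work-items are independent, so a real runtime may execute
    them in parallel; here they run sequentially.
    """
    for gid in range(global_size):
        kernel(gid, *args)


a = [1.0, 2.0, 3.0, 4.0]
b = [0.5, 0.5, 0.5, 0.5]
acc = [10.0, 10.0, 10.0, 10.0]

enqueue_nd_range(vector_mac_kernel, len(a), a, b, acc)
print(acc)  # [10.5, 11.0, 11.5, 12.0]
```

Because the designer writes only the per-element body and no work-item depends on another, the same description can in principle be mapped onto FPGA logic, DSP blocks or CPU cores without rewriting the algorithm.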
Practically speaking, OpenCL allows Stratix 10 designers to port or create hardware accelerators to take advantage of the FPGA’s parallelism. This has always been possible when coding in RTL but not necessarily practical and usually not easily portable. The OpenCL SDK is complemented by EDS which provides FPGA-adaptive debug. Altera’s Chris Balough describes EDS as “a native multicore debug environment with intrinsic FPGA debug capabilities built in.” EDS supports real-time, in-system, whole-chip debug and visualization, including the ARM Cortex-A53 cores. That’s because EDS includes Altera’s existing DS-5 Altera Edition software, ARM’s own development tool that’s been tailored for Altera FPGA devices.
Both the OpenCL SDK and EDS software suites exist today for production devices, and will soon be released supporting the Stratix 10 so lead customers and early adopters can start their designs.
Conclusions: Altera’s Design Targets
We’ve written quite a bit about Altera’s burgeoning roadmap over the past twelve months. During an interview in July 2012 with Brad Howe, Altera’s senior VP of R&D, he talked about the need for FPGAs to evolve past their growing complexity and focus instead on “silicon convergence” (http://eecatalog.com/fpga/2012/07/17/fpgas-the-best-of-both-worlds-until-theyre-not/). He was of course referring not only to heterogeneous FPGA SoCs, but also to reducing design complexity with OpenCL.
Stratix 10, with its “unimaginable performance” made possible by Intel’s 14 nm tri-gate process, epitomizes the future of big density FPGAs. “It was an explicit choice to make [Stratix 10] a heterogeneous computing platform,” said Altera’s Chris Balough.
We’re anxious to see the company release details on the architecture, cite power estimates based upon notional reference designs and detail more of the specs. As we went to press, Altera revealed some data on Stratix 10’s transceivers (4x bandwidth and 28 Gbps backplane switching), and indicated that the device is 3D-capable for packaging with ASSPs and memories.
Also, it will be interesting to see if Altera targets Big Data applications with Stratix 10 as I had predicted in June 2013 (“Does Altera have ‘Big Data’ Communications on the Brain?” http://eecatalog.com/caciufo/2013/06/05/a-slew-of-recent-altera-high-performance-announcements-over-the-last-three-months-can-only-mean-one-thing-the-company-is-targeting-big-data-in-a-big-way/#comment-20464 ). The company’s string of acquisitions and posturing sure points that way.
We’ll have to wait until 2014 for concrete information on the Stratix 10 heterogeneous SoC FPGA and its heavy lift capabilities. Until then, all this “unimaginable” information remains a technology announcement. But a compelling one, indeed.
Chris A. Ciufo is editor-in-chief for embedded content at Extension Media, which includes the EECatalog print and digital publications and website, Embedded Intel® Solutions, and other related blogs and embedded channels. He has 29 years of embedded technology experience, and has degrees in electrical engineering, and in materials science, emphasizing solid state physics. He can be reached at firstname.lastname@example.org.