Artificial Intelligence: Where FPGAs Surpass GPUs



Accelerators offload repetitive calculations within cloud services, but what if new AI computing models do not conform to the orderly mold of array-based, data-parallel algorithms that GPUs are so good at processing? Intel thinks FPGAs are the answer. So does Microsoft.

Artificial Intelligence (AI) will transform how we engage with the world and is already the fastest-growing workload for data centers. Field Programmable Gate Arrays (FPGAs) can accelerate AI-related workloads, so it made perfect sense for Intel to purchase Altera, a leading FPGA company, in December 2015 for $16.7 billion. Intel has been integrating Altera's IP to improve performance and power efficiency and to enable reprogrammable custom chips, which account for a growing share of server chip shipments. Intel's Data Center Group is the most profitable group at Intel, driven by the growth of "big data" and cloud servers, and AI is one of the fastest-growing drivers of cloud services.

In a competitive world where milliseconds count, FPGAs can create an edge for a growing number of AI applications.

Figure 1: Hardware microservices on FPGA, as depicted in a data center use case. Interconnected FPGAs form a separate plane of computation that can be managed and used independently from the CPU. (Source: Microsoft, Hot Chipsvii)

For those who need a refresher, FPGAs are integrated circuits that can be programmed (and reprogrammed) for specialized tasks. A processor core executes one instruction stream at a time; a quad-core processor executes four. One difference making an impact is that FPGAs are not as top-heavy as processors, including Graphics Processing Units (GPUs). Processors need an Operating System (OS) as part of a software stack to manage memory and juggle processor capacity. FPGAs carry no such baggage: they genuinely execute in parallel, as deterministic hardware circuits committed to each task for the duration of execution. Unencumbered by an OS, FPGAs are fast and avoid the reliability concerns of "another moving part" in a platform where integrated systems span several disciplines. In a nutshell, FPGAs execute programs in a hardware implementation rather than software.i
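To make the contrast concrete, consider a four-tap FIR filter, a classic repetitive kernel. The Python sketch below (illustrative only, with made-up coefficients) computes it the way a processor does, one multiply-accumulate at a time; an FPGA tool chain would instead synthesize four physical multipliers feeding an adder tree, producing one result per clock cycle with no OS in sight.

```python
# Illustrative sketch, not vendor code: a 4-tap FIR filter.
# On a CPU this loop runs as a sequence of scheduled instructions;
# on an FPGA the same math becomes four parallel hardware multipliers
# feeding an adder tree, evaluated deterministically every clock cycle.
COEFFS = [0.25, 0.50, 0.75, 1.00]  # hypothetical filter taps

def fir_step(window):
    """One output sample, computed sequentially in software."""
    return sum(c * x for c, x in zip(COEFFS, window))

samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
outputs = [fir_step(samples[i:i + 4]) for i in range(len(samples) - 3)]
print(outputs)  # [7.5, 10.0, 12.5]
```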

Figure 2: Demand for data centers is projected to grow to a Total Available Market (TAM) of $65 billion. (Source: Intel)

One reason FPGAs have become so attractive in embedded technologies is that they have been steadily improving, and a system gains speed as software functionality is replaced with hardware. A hardware implementation sounds inflexible, but FPGAs can be reprogrammed at any point, even after the end product has been deployed. FPGAs can be customized to an embedded system's exact requirements, creating a higher-performance alternative to processors that require layers of software. Applications with repetitive functions run especially fast on the "bare metal" of an FPGA. In a wide range of embedded systems, microprocessors coupled with the custom logic of FPGAs can replace Application Specific Standard Products (ASSPs) and Digital Signal Processors (DSPs).

Artificial Intelligence and FPGAs
AI is driving demand for High-Performance Computing (HPC), especially since cloud services let AI scientists and engineers pay for only what they use. Startups working in AI no longer need to raise funding to install a supercomputer in the basement: data scientists can rent high-performance computing in the cloud, use it to train a deep learning model, export the model once training is complete, and be charged only for what they used. For many researchers, the AI-tuned cloud platform is the only answer, as grants and other funding shrink to the point where universities and startups cannot afford the capital to establish HPC centers of their own. According to Tractica, a market intelligence firm, revenue from AI (machine learning, deep learning, natural language processing, machine vision, machine reasoning, and other AI-related technologies) will reach nearly $59.8 billion by 2025. Markets leading the adoption of AI include the aerospace, advertising, consumer, financial, and medical sectors, with many more seeking advantages in AI. Although Tractica put global AI spending in 2016 at just $1.4 billion, it expects exponential growth.

“Artificial intelligence has applications and use cases in almost every industry vertical and is considered the next big technological shift, similar to past shifts like the industrial revolution, the computer age, and the smartphone revolution,” says Tractica research director Aditya Kaul.ii  AI holds promise for business processes and new business models in the automotive, entertainment, investment, and legal industries, as well.

Acceleration-as-a-Service (AaaS) boosts the performance of CPU-based workloads in cloud servers. According to Intel's Altera site, "Cloud users can leverage FPGAs to accelerate a variety of workloads such as machine learning, genomics, video transcoding, big data analytics, financial computation, and database acceleration. Several cloud service providers are offering their cloud users access to Intel FPGAs within their infrastructures. This approach gives its users the ability to complete complex tasks faster than in virtualized systems." The Acceleration Stack for Intel® Xeon® CPUs with FPGAs is software that minimizes power consumption while maximizing performance. Stand-alone FPGAs, however, are notoriously difficult to program; the words "quick start" in the Intel Accelerator Functional Unit Simulation Environment Quick Start User Guide are engineering hyperbole. Then again, most engineers joined the profession precisely for the constant supply of challenges. FPGAs are quickly being adopted as accelerators in AI and related technologies, and Intel claims that the Acceleration Stack is "a new collection of software, firmware, and tools that allow software developers to leverage the power of Intel FPGAs much more easily than before."iii
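As a rough illustration of what acceleration-as-a-service looks like from the software side, the sketch below shows the common dispatch pattern: try an accelerated kernel, fall back to the CPU path. The module name fpga_runtime and its matmul call are hypothetical placeholders, not Intel's actual API.

```python
# Hedged sketch of the offload pattern; the FPGA names are hypothetical.
import numpy as np

def matmul_cpu(a, b):
    return a @ b  # plain software path on the host CPU

def matmul(a, b):
    """Use an FPGA-accelerated kernel when a runtime is present,
    otherwise fall back to the CPU implementation."""
    try:
        import fpga_runtime               # hypothetical vendor runtime
        return fpga_runtime.matmul(a, b)  # hypothetical accelerated call
    except ImportError:
        return matmul_cpu(a, b)

a = np.ones((64, 64), dtype=np.float32)
b = np.ones((64, 64), dtype=np.float32)
print(matmul(a, b)[0, 0])  # 64.0 via the CPU fallback
```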

Big Data and IoT are still growing. AI (which includes machine learning and deep learning) also analyzes large amounts of data and increasingly relies on neural networks. Neural networks, which still run on silicon chips, belong to a branch of computing patterned after the human brain, known as cognitive or neuromorphic computing. Neural networks enable learning in which the computer effectively programs itself from massive training data sets rather than being programmed by a human. Humans still select the initial training data, and once a model is trained, new data sets can be loaded to retrain it for a new problem. Neural nets can also identify similarities, detect anomalies, and form "associative memory."

Neuromorphic computing began decades ago but was quickly put on the back burner, remaining a curiosity as Moore's Law kicked in to deliver the ever faster and smaller processors whose architecture we know so well. Neuromorphic computing has a different architecture: chains of identical elements (neurons) simultaneously store and process information, collaborating with each other via a neural bus. Each neuron is akin to a tiny processor that stores information (memory) and reacts, much like a single cell or synapse in the brain. Huge numbers of neurons acting together produce remarkable results. Deep Neural Networks (DNNs) are massively parallel chains of neurons that have demonstrated exceptional performance at recognizing images in machine learning tasks.
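For readers who want to see the arithmetic behind all this, here is a minimal NumPy sketch of a DNN forward pass: each layer is a weight matrix followed by a nonlinearity, and every neuron in a layer can, in principle, compute at the same time. The weights are random stand-ins; training would set them from data.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)   # simple neuron nonlinearity

# Three layers: 8 inputs -> 16 hidden -> 16 hidden -> 4 outputs
weights = [rng.standard_normal((8, 16)),
           rng.standard_normal((16, 16)),
           rng.standard_normal((16, 4))]

def forward(x):
    for w in weights:
        x = relu(x @ w)   # all neurons in a layer compute in parallel
    return x

print(forward(rng.standard_normal(8)))  # activations of the 4 output neurons
```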

DNNs require a high level of computing performance, which makes acceleration services attractive, and FPGAs are playing their part. Dr. Randy Huang, an FPGA Architect with the Intel Programmable Solutions Group, states, "The tested Intel® Stratix® 10 FPGA outperforms the GPU when using pruned or compact data types versus full 32-bit floating point data (FP32). In addition to performance, FPGAs are powerful because they are adaptable and make it easy to implement changes by reusing an existing chip, which lets a team go from an idea to prototype in six months—versus 18 months to build an ASIC."iv
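A quick sketch clarifies what "pruned or compact data types" means in practice. Below, the same weights are stored as FP32 and as symmetric 8-bit integers (the single-scale scheme is one common, simple choice; the numbers are toy data). Narrow types cut memory traffic and let an FPGA pack far more multipliers into the same fabric.

```python
import numpy as np

w_fp32 = np.random.default_rng(1).standard_normal(6).astype(np.float32)

scale = np.abs(w_fp32).max() / 127.0          # one scale for the tensor
w_int8 = np.clip(np.round(w_fp32 / scale), -127, 127).astype(np.int8)
w_back = w_int8.astype(np.float32) * scale    # dequantized approximation

print(w_fp32)   # original 32-bit weights
print(w_back)   # close match at a quarter of the bits
```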

Industry verticals have caught the AI bug, but applying AI to new applications leaves researchers searching for ways to handle challenging candidate models. GPUs have held the lead in accelerating computation thus far. However, FPGA technology keeps advancing and is becoming the preferred choice in newer AI applications. One reason is that FPGAs beat GPUs wherever custom data types exist or irregular parallelism tends to develop. Parallel computing has introduced execution complexities far beyond the good old days of single-core microcontrollers, and computational hardware can become imbalanced when irregular parallelism evolves. Some problems do not fit the neat mold of array-based, data-parallel algorithms that GPUs are so good at, and computer science is evolving at a phenomenal pace, inspecting each new hardware advance and looking for more. Add to this the news that DNNs are a challenge to deploy in large cloud services.
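What does "irregular parallelism" look like? A sparse matrix-vector product in compressed sparse row (CSR) form is a handy toy example: each row has a different number of nonzeros, so lockstep SIMD lanes sit partly idle, while an FPGA pipeline can be shaped to the actual access pattern. The data below is made up for illustration.

```python
import numpy as np

# CSR arrays for a 4x4 sparse matrix with uneven rows (3, 0, 1, 2 nonzeros)
values  = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
cols    = np.array([0, 2, 3, 1, 0, 3])
row_ptr = np.array([0, 3, 3, 4, 6])  # row i spans values[row_ptr[i]:row_ptr[i+1]]
x = np.ones(4)

y = np.zeros(4)
for i in range(4):
    for k in range(row_ptr[i], row_ptr[i + 1]):  # trip count varies per row
        y[i] += values[k] * x[cols[k]]
print(y)  # [ 6.  0.  4. 11.]
```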

Figure 3: The spectrum of processors available for AI-related computing. (Source: Microsoft, Hot Chips)

Microsoft has joined the race to produce a better AI platform, recently announcing its choice of Intel's Stratix 10 FPGA for its deep learning platform, dubbed "Project Brainwave."v  Project Brainwave is a real-time AI cloud platform that processes data as fast as it receives it; cloud services are increasingly processing live data streams (e.g., chatbots, mapping, voice recognition). Google, another large player in cloud services, embarked several years ago on a project to create an AI-related chip called the Tensor Processing Unit (TPU), designed specifically to accelerate neural network computing. Norm Jouppi, Distinguished Hardware Engineer at Google, puts it simply enough: "…we started a stealthy project at Google several years ago to see what we could accomplish with our own custom accelerators for machine learning applications…. Our goal is to lead the industry on machine learning and make that innovation available to our customers."vi

Figure 4: FPGA fabric is great for irregular (and regular) computation. (Source: Microsoft, Hot Chipsvii)

Mount Everest
Is Microsoft’s answer to the problem more hardware-savvy? After all, FPGAs are a much more flexible hardware solution than application-specific chips like the TPU. Microsoft is best known for desktop and server software, but its lesser-known, albeit long, history of developing embedded products shows in its decision to adopt FPGAs into its platform. Even intrepid developers new to FPGAs know they come with a steep learning curve and rightly view them as the Mount Everest of platforms to program. Building an Application-Specific Integrated Circuit (ASIC) is easier, but the cost can be months added to release-to-market dates. Literally, once the die is cast, a “re-do” of an ASIC requires designers and layout engineers to create a new set of masks, after which the ASIC goes through every step of transforming from a lump of silicon to a finished, packaged chip. Comparatively speaking, meeting the same challenges with an FPGA is extraordinarily faster. Still, no one disputes that AI needs hardware optimized for the specific operations that many machine learning models require to create the highest-performing neural nets.

Figure 5: Google’s proprietary Tensor Processing Unit (TPU) board includes a custom ASIC developed for accelerating machine learning applications. (Source: Google)

Intel’s Stratix 10 FPGA positions Intel as a major hardware supplier for DNNs. Inevitably, we will see more from Intel as Altera IP is absorbed and put to good use throughout Intel’s technologies. Microsoft is betting on FPGAs to accelerate its AI cloud platform. According to Doug Burger, Distinguished Engineer at Microsoft and former Professor of Computer Sciences at the University of Texas at Austin, “By attaching high-performance FPGAs directly to our data center network, we can serve DNNs as hardware microservices, where a DNN can be mapped to a pool of remote FPGAs and called by a server with no software in the loop. This system architecture both reduces latency, since the CPU does not need to process incoming requests, and allows very high throughput, with the FPGA processing requests as fast as the network can stream them.”vii

FPGAs: Fine-grained Accelerators
A 2014 white paper titled “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services” (for which Burger is a contributing author) states, “FPGAs are now powerful computing devices in their own right, suitable for use as fine-grained accelerators.” The paper also states, “Our study has shown that FPGAs can indeed be used to accelerate large-scale services robustly in the data center. We have demonstrated that a significant portion of a complex data center service can be efficiently mapped to FPGAs, by using a low-latency interconnect to support computations that must span multiple FPGAs.”viii  At Hot Chips, a symposium on high-performance chips, Microsoft recently demonstrated the fruit of that study, Project Brainwave, running on Intel’s new 14nm Stratix 10 FPGA.

Figure 6: The Intel® Stratix® 10 FPGA chip. (Source: Intel)

One of the methods that Microsoft employs in tuning data centers for performance is to batch requests. Batching groups many small requests together and feeds them to the processor as one unit to improve hardware utilization. However, batching is ill-suited to real-time AI, since requests must queue up until a batch is full, which adds latency. To combat latency, Brainwave instead employs what Microsoft calls “persistent” neural nets, whose weights stay resident on-chip. Thus, when a single request arrives, all resources (compute units and on-chip memories) are used to process that request immediately; no batching is required.
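Back-of-the-envelope arithmetic shows why. With a hypothetical request arriving every 2 ms and 1 ms of compute, the first request in a batch waits for the rest of the batch to arrive before anything runs (the figures below are illustrative, not Microsoft's).

```python
arrival_gap_ms = 2.0   # hypothetical: one request every 2 ms
compute_ms     = 1.0   # hypothetical time to process a batch

for batch in (1, 8, 32):
    wait = (batch - 1) * arrival_gap_ms   # first request waits for the rest
    print(f"batch={batch:>2}  worst-case latency ~ {wait + compute_ms:.1f} ms")
# batch= 1  worst-case latency ~ 1.0 ms
# batch= 8  worst-case latency ~ 15.0 ms
# batch=32  worst-case latency ~ 63.0 ms
```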

What if a large model cannot fit in one FPGA? Brainwave accommodates large models by extending persistency across the entire data center. Massive numbers of FPGAs form a collaborative, persistent DNN hardware microservice that enables scale-out of models while performing at ultra-low latencies. Models are split across FPGAs using inter-layer pipeline parallelism over the data center network, with FPGAs communicating directly at less than 2µs per hop.vii  The DNN hardware microservice is thus shared across all the FPGAs (Figure 1).
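The quoted hop latency makes the scale-out arithmetic easy to check. If a model's layers are pipelined across several FPGAs, each extra chip adds at most one sub-2µs hop, a rounding error next to the compute itself (the total compute time below is a hypothetical figure, not from the source).

```python
total_compute_us = 200.0  # hypothetical total model compute, split across chips
hop_us = 2.0              # upper bound on FPGA-to-FPGA hop latency (source)

for n in (1, 4, 8):
    network = (n - 1) * hop_us
    print(f"{n} FPGAs: ~{total_compute_us + network:.0f} us end-to-end "
          f"({network:.0f} us of it network hops)")
# 1 FPGAs: ~200 us end-to-end (0 us of it network hops)
# 4 FPGAs: ~206 us end-to-end (6 us of it network hops)
# 8 FPGAs: ~214 us end-to-end (14 us of it network hops)
```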

Intel FPGAs accelerate DNN workloads, and Intel plans to offer pre-configured FPGA algorithms for customers to license. To learn more about Project Brainwave, see Microsoft’s Research Blog at microsoft.com. To learn more about machine learning, starting with the basics, visit Intel’s Nervana™ AI Academy.


Lynnette Reese is Editor-in-Chief, Embedded Intel Solutions and Embedded Systems Engineering, and has been working in various roles as an electrical engineer for over two decades. She is interested in open source software and hardware, the maker movement, and increasing the number of women working in STEM so she has a greater chance of talking about something other than football at the water cooler.

i National Instruments. “Introduction to FPGA Technology: Top 5 Benefits.” NI, www.ni.com/white-paper/6984/en/.

ii “Artificial Intelligence Software Revenue to Reach $59.8 Billion Worldwide by 2025.” Tractica, 2 May 2017, www.tractica.com/research/artificial-intelligence-market-forecasts/.

iii “Intel® FPGA Acceleration Hub.” Intel, 10 Oct. 2017, www.altera.com/solutions/acceleration-hub/acceleration-stack.html.

iv Barney, Linda. “Can FPGAs Beat GPUs in Accelerating Next-Generation Deep Learning?” The Next Platform, The Register, 21 Mar. 2017, www.nextplatform.com/2017/03/21/can-fpgas-beat-gpus-accelerating-next-generation-deep-learning/.

v “Intel Delivers ‘Real-Time AI’ in Microsoft’s New Accelerated Deep Learning Platform.” Intel Newsroom, Intel, 22 Aug. 2017, newsroom.intel.com/news/intel-delivers-real-time-ai-microsofts-accelerated-deep-learning-platform/.

vi Jouppi, Norm. “Google Supercharges Machine Learning Tasks with TPU Custom Chip.” Google Cloud Platform Blog, Google, 18 May 2016, cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html.

vii Burger, Doug. “Microsoft Unveils Project Brainwave for Real-Time AI.” Microsoft Research, Microsoft, 22 Aug. 2017, www.microsoft.com/en-us/research/blog/microsoft-unveils-project-brainwave/.

viii Putnam, Andrew, et al. “A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services.” IEEE Micro, vol. 35, no. 3, 2015, pp. 10–22., doi:10.1109/mm.2015.42.