Bringing Deep Learning to IoT Edge Devices



How the Power, Performance and Area trio can be optimized

Machine learning is fast becoming a defining feature of devices that have joined the internet of things (IoT). Home appliances are beginning to sport voice-driven interfaces that react intelligently to natural speech. Robots on the factory floor can now be taught to move material and program other machines simply by having the process demonstrated to them through a smartphone camera, and smartphones themselves keep getting smarter. These applications take advantage of what has so far been the most successful artificial-intelligence architecture for complex, multidimensional data: the deep neural network (DNN).

Intelligence is Moving to the Edge
An issue with applying DNN technology to embedded systems up to now has been its thirst for compute performance. Although less computationally intensive than training, the inferencing phase, during which input data is passed through a trained DNN for recognition and analysis, still requires billions of calculations per second for streaming data such as voice and video. As a result, processing is often offloaded to the cloud, where ample compute power is available. But this is not an ideal solution for edge devices.
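To get a feel for the numbers, the rough estimate below counts the multiply-accumulate operations of a single mid-sized convolutional layer running on streaming video. The layer shape and frame rate are illustrative assumptions, not figures from CEVA or from any particular network:

```python
# Back-of-envelope estimate of the compute demand of one convolutional layer.
# All dimensions below are assumed for illustration only.

def conv_macs(out_h, out_w, out_ch, in_ch, k):
    """Multiply-accumulate operations for one 2-D convolutional layer."""
    return out_h * out_w * out_ch * in_ch * k * k

# A single 3x3 convolution on a 112x112 feature map:
macs_per_frame = conv_macs(out_h=112, out_w=112, out_ch=128, in_ch=64, k=3)

fps = 30  # typical streaming-video frame rate
print(f"{macs_per_frame * fps / 1e9:.1f} GMAC/s for one layer at {fps} fps")
# ~27.7 GMAC/s -- and a full network contains dozens of such layers,
# which is how inference quickly reaches billions of operations per second.
```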

Mission-critical use cases, such as self-driving vehicles and industrial robots, rely on DNNs for their ability to recognize objects in real time and improve situational awareness. But latency, bandwidth, and network availability make cloud computing a poor fit here: implementers cannot afford the risk of the cloud failing to respond in a real-time situation.

Privacy is another issue. Although consumers appreciate the convenience of voice assistance in devices such as smart speakers, they are increasingly concerned about the personal information that may be inadvertently revealed when recordings of what they say are routinely transferred to the cloud. This concern is becoming even more acute with the onset of camera-equipped smart speakers and vision-enabled robot assistants. Seeking to reassure their customers, manufacturers are looking at ways to move more of the DNN processing to the edge, in the end devices themselves. The key problem is that the DNN processing does not suit the architecture of conventional embedded systems.

Conventional Embedded Processors are Not Enough for DNN Processing
Conventional embedded processors based on CPUs and GPUs cannot handle the DNN workload efficiently enough for low-power devices. IoT and mobile devices have very stringent constraints on power and area, while high performance is needed to handle real-time DNN processing. This trio of Power, Performance and Area, known as PPA, must be optimized for the task at hand.

One approach to tackling these issues is to provide hardwired engines for DNN processing that have access to on-chip scratchpad memory arrays. The problem with this approach is that developers require a high degree of flexibility. The structure of each DNN implementation needs to be tuned to its target application: a DNN designed and trained for speech recognition will have a different mixture of convolutional, pooling, and fully connected layers from one intended for video. Flexibility is also crucial to designing a future-proof solution, as machine learning is still a nascent, continuously evolving technology.
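As a rough illustration of how the layer mix differs by task, the two toy networks below contrast a vision-style model with a speech-style one. They are simplified PyTorch sketches invented for this example, not networks associated with NeuPro:

```python
# Illustrative only: two toy networks showing how the layer mix differs by task.
import torch.nn as nn

# Vision-style network: dominated by 2-D convolution and pooling layers.
vision_net = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 56 * 56, 10),          # assumes 224x224 input images
)

# Speech-style network: 1-D convolution over time plus larger fully connected layers.
speech_net = nn.Sequential(
    nn.Conv1d(40, 128, kernel_size=5, padding=2), nn.ReLU(),  # 40 mel-filterbank features
    nn.AdaptiveAvgPool1d(1),
    nn.Flatten(),
    nn.Linear(128, 256), nn.ReLU(),
    nn.Linear(256, 30),                    # e.g. 30 keyword classes
)
```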

Another commonly used approach is to add a vector processing unit (VPU) to the standard processing units. This enables more efficient calculations as well as the flexibility to handle different types of networks, but it is still not enough. Reading data from external DDR memory is one of the most power-consuming tasks associated with DNN processing, so data efficiency and memory access must also be scrutinized for a holistic solution. To maximize efficiency, scalability, and flexibility, the VPU is only one of the key modules required in an AI processor.
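A rough, order-of-magnitude comparison shows why external memory traffic dominates the power budget. The per-operation energies below are approximate, widely quoted 45 nm figures (after Horowitz, ISSCC 2014), used here only for their ratios; they are not CEVA's numbers and real values depend heavily on the silicon:

```python
# Order-of-magnitude comparison of where the energy goes in DNN inference.
# Per-operation energies are approximate, widely quoted 45 nm figures
# (after Horowitz, ISSCC 2014); actual values vary with the process and design.

PJ_DRAM_READ_32BIT = 640.0   # 32-bit read from external DRAM
PJ_SRAM_READ_32BIT = 5.0     # 32-bit read from a small on-chip SRAM
PJ_MAC_INT32       = 3.2     # 32-bit integer multiply-accumulate

print(f"DRAM fetch ~ {PJ_DRAM_READ_32BIT / PJ_MAC_INT32:.0f}x the energy of one MAC")
print(f"SRAM fetch ~ {PJ_SRAM_READ_32BIT / PJ_MAC_INT32:.1f}x the energy of one MAC")
# Fetching an operand from DDR can cost two orders of magnitude more energy
# than computing with it, so memory traffic, not arithmetic, sets the power budget.
```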

Enabling Optimal Bandwidth and Throughput
To address these requirements, CEVA built an architecture that meets the performance challenges of a DNN while maintaining the flexibility needed to handle the widest variety of embedded deep-learning applications. The NeuPro AI processor includes a specialized, optimized Deep Neural Network Inference Hardware Engine to handle the convolution, fully connected, activation, and pooling layers. In addition, it harnesses a powerful, programmable VPU for unsupported layers and inference software execution. This is complemented by the CEVA Deep Neural Network (CDNN) software framework, which provides optimized graph compilation and a run-time executable.

Figure 1: NeuPro’s scalable and flexible architecture is suitable for a variety of AI applications (Source: CEVA)
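A minimal sketch of the partitioning idea follows: layers the fixed-function engine supports run there, and anything else falls back to the programmable VPU. The layer names and dispatch function are hypothetical illustrations, not the CDNN compiler's actual API:

```python
# Hypothetical sketch of partitioning a network graph between a fixed-function
# hardware engine and a programmable VPU. Not the CDNN compiler's real API.

ENGINE_SUPPORTED = {"conv", "fully_connected", "activation", "pooling"}

def assign_targets(graph):
    """Tag each layer in a (name, op_type) graph with an execution target."""
    plan = []
    for name, op_type in graph:
        target = "hw_engine" if op_type in ENGINE_SUPPORTED else "vpu"
        plan.append((name, op_type, target))
    return plan

network = [
    ("conv1", "conv"), ("relu1", "activation"), ("pool1", "pooling"),
    ("custom_norm", "l2_normalize"),     # not supported by the fixed engine
    ("fc1", "fully_connected"),
]

for name, op, target in assign_targets(network):
    print(f"{name:12s} ({op}) -> {target}")
```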

To ensure data efficiency, the CEVA-NeuPro architecture uses techniques that minimize memory accesses and optimize the flow of data through the different layers. It supports full on-the-fly forward propagation and keeps all intermediate network layers in local memory, resulting in minimal DDR accesses. A further technique is to minimize reads from local memory by re-using loaded data whenever possible. Together these components offer a complete DNN solution with optimal PPA: high performance, low power, and high area efficiency.
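The illustrative estimate below shows the kind of saving on-the-fly propagation can buy, comparing the DDR traffic when every intermediate feature map round-trips through external memory against keeping them all in local memory. The layer sizes are assumed for the example and do not describe any particular NeuPro configuration:

```python
# Illustrative sketch of the DDR-traffic saving from keeping intermediate
# feature maps in local memory. Layer sizes are made up for the example.

# Feature-map sizes (bytes) produced by a chain of four layers, 8-bit activations.
feature_map_bytes = [112*112*64, 56*56*128, 28*28*256, 14*14*512]

# Naive flow: every intermediate result is written to DDR and read back by the
# next layer (one write + one read per intermediate map; the initial input read
# is ignored, since it is the same in both cases).
ddr_naive = sum(2 * b for b in feature_map_bytes[:-1]) + feature_map_bytes[-1]

# On-the-fly flow: intermediates stay in local memory; only the final
# output ever touches DDR.
ddr_on_the_fly = feature_map_bytes[-1]

print(f"naive:      {ddr_naive / 1e6:.1f} MB of DDR traffic per frame")
print(f"on-the-fly: {ddr_on_the_fly / 1e6:.2f} MB of DDR traffic per frame")
```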

Figure 2: Block diagram of the NeuPro AI Processor combining the NeuPro Engine and NeuPro VPU (Source: CEVA)

Further performance optimizations come from support for both 8- and 16-bit arithmetic. For some computations, the accuracy of 16-bit calculations is necessary; in other cases, almost the same results can be achieved with 8-bit calculations, which significantly reduces the workload and hence the power consumption. The NeuPro engine enables a balanced mix of these operations so that each layer is executed optimally, giving the best of both worlds: high accuracy and high performance.

Figure 3: Selecting 8 or 16-bit calculations per layer gives optimal accuracy and performance (Source: CEVA)
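The toy sketch below illustrates the precision trade-off by quantizing a weight tensor to 8 and to 16 bits and comparing the error. It is purely illustrative; NeuPro's internal number formats and rounding behavior are not described here:

```python
# Toy illustration of the per-layer precision choice: quantize a weight tensor
# to 8 or 16 bits and compare the resulting error. Not NeuPro's actual scheme.
import numpy as np

def quantize(x, bits):
    """Symmetric linear quantization to a signed integer grid of `bits` bits."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.round(x / scale) * scale     # quantize, then dequantize

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.1, size=10_000)

for bits in (8, 16):
    err = np.max(np.abs(weights - quantize(weights, bits)))
    print(f"{bits:2d}-bit: max error {err:.2e}")
# The 16-bit error is roughly 256x smaller; layers that tolerate the 8-bit
# error need half the memory traffic and much smaller multipliers.
```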

Together, the optimized hardware modules, the VPU, and the efficient memory system combine to deliver a solution that is scalable, flexible, and extremely efficient. On top of this, the CDNN enables streamlined development with push-button network conversion and ready-to-use library modules. The result is a holistic AI processor providing IoT device designers with the ability to make the most of localized machine learning in the next generation of products.


Liran Bar is Director of Product Marketing, Imaging & Vision DSP Core Product Line, CEVA.
