As Embedded Vision Takes Off, Software Frameworks Bear Watching



Careful consideration of the software framework requirements for embedded vision applications, including AR/VR, surveillance cameras, self-driving cars, drones, and mobile phones, is key.

Deep learning techniques such as convolutional neural networks (CNN) have significantly increased the accuracy—and therefore the adoption rate—of embedded vision for embedded systems. Starting with AlexNet’s win in the 2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC), deep learning has changed the market by drastically reducing the error rates for image classification and detection tasks (Figure 1). Deep learning has also changed how embedded vision is implemented. CNN graphs are not programmed—they are trained using a software framework and then mapped into the embedded vision hardware.

Figure 1: ImageNet Large Scale Visual Recognition Challenge Results showing deep learning surpassing human levels of accuracy.

Compared to a CPU or GPU system, an embedded vision system must offer equivalent performance at a fraction of the power and die area. Embedded vision applications are highly optimized heterogeneous systems, which means they have processing units that are optimized for their specific task: a scalar unit for control, a vector unit for pixel processing, and a dedicated CNN engine for executing deep learning networks. These units, specifically optimized for embedded vision applications, provide excellent performance for the smallest area and power. When evaluating software frameworks for the final vision application, requirements around availability, bit resolution, graph mapping tools, and optimization options for hardware should be taken into consideration.

First Question
The first question to ask when considering an embedded vision system is, “Which software frameworks are supported?” Not all software frameworks are supported by embedded vision system tools. Your best choices are the most popular frameworks, such as Caffe and TensorFlow, but they were not originally designed with embedded vision systems in mind: training of deep neural networks has traditionally targeted CPU and GPU solutions that can easily accept 32-bit floating-point coefficients. The Khronos Group is driving a new common standard, but it is not fully adopted yet. These options have emerged due to parallel development by different groups of developers, reflecting their preferences on deploying deep neural networks as well as the desired capabilities.

Bit Resolution for Area and Performance
An embedded designer has more to consider while using a software framework for training a CNN graph for an embedded vision processor. Designers must pay attention to the bit resolution of the CNN calculations, possible hardware optimizations to consider during training, and how best to take advantage of new coefficient or feature map pruning and compression techniques.

When area isn’t a concern, an embedded vision processor could use the 32-bit floating-point outputs of the software frameworks. However, 32-bit numbers require larger multiply-accumulate (MAC) units, more data storage, and higher memory bandwidth. All of these are hits on the design’s power and area budgets. The goal is to use the smallest possible bit resolution for a CNN graph without sacrificing any of the accuracy of the originally trained graph.
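The storage side of this trade-off is simple arithmetic. The sketch below assumes densely bit-packed coefficients and a hypothetical graph with 10 million coefficients; the function name and the graph size are illustrative and do not come from any framework or vendor tool.

```python
def coefficient_storage_bytes(num_coefficients: int, bits_per_coefficient: int) -> int:
    """Bytes needed to hold the coefficients, assuming dense bit-packing."""
    total_bits = num_coefficients * bits_per_coefficient
    return (total_bits + 7) // 8  # round up to a whole number of bytes


# Hypothetical graph size, chosen only for illustration.
NUM_COEFFICIENTS = 10_000_000

for bits in (32, 12, 8):
    size_mb = coefficient_storage_bytes(NUM_COEFFICIENTS, bits) / 1e6
    print(f"{bits:>2}-bit coefficients: {size_mb:.1f} MB")
```

Under these assumptions, moving from 32-bit to 12-bit coefficients cuts coefficient storage, and the bandwidth needed to fetch those coefficients, to 37.5% of the original.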

Based on careful analysis of popular CNN graphs, Synopsys has determined that CNN calculations on common classification graphs provide the same accuracy down to at least 10-bit calculations. The CNN engine for the DesignWare EV6x Embedded Vision Processor (Figure 2) uses highly optimized 12-bit multiplications. Embedded systems have to be stingy with memory, as too much memory adds to cost and power consumption. Caffe graphs with 32-bit floating-point output can be mapped to the Synopsys 12-bit CNN architecture with no loss in accuracy. This is important for designers whose customers have trained graphs that they do not want to retrain.
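To make the idea of reducing bit resolution concrete, here is a minimal sketch of one generic approach: symmetric linear quantization with a single scale per set of coefficients. This is an assumption chosen for illustration, not a description of how the Synopsys tools map a graph, and the sample weights are invented.

```python
def quantize(coefficients, bits):
    """Map floats to signed integers of the given width using one shared scale."""
    q_max = 2 ** (bits - 1) - 1  # 2047 for 12-bit, 127 for 8-bit
    largest = max(abs(c) for c in coefficients)
    scale = largest / q_max if largest > 0 else 1.0
    quantized = [max(-q_max, min(q_max, round(c / scale))) for c in coefficients]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the integer representation."""
    return [q * scale for q in quantized]


def max_abs_error(coefficients, bits):
    """Largest difference between an original and a restored coefficient."""
    quantized, scale = quantize(coefficients, bits)
    restored = dequantize(quantized, scale)
    return max(abs(a - b) for a, b in zip(coefficients, restored))


# Invented sample coefficients, for illustration only.
weights = [0.731, -0.052, 0.004, -0.998, 0.25, 0.0]

for bits in (12, 8):
    print(f"{bits}-bit: max abs error = {max_abs_error(weights, bits):.6f}")
```

The rounding error of each coefficient is bounded by half the scale step, so each extra bit roughly halves the worst-case error. Whether a given graph keeps its accuracy at a given width still has to be checked against a validation dataset.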

Software frameworks are starting to pay more attention to embedded systems, so it will become possible to train graphs for specific bit resolutions. By adding a few more graph layers, it is possible to retrain a graph for 8-bit multiplications, saving power and bandwidth. The Synopsys CNN engine also provides 8-bit multipliers for graphs trained at 8-bit resolution.

 

Graph Mapping Tools and Features
Graph mapping tools are critical to converting the output of the software framework to the appropriate bit resolution of your embedded vision processor. During training, a graph mapping tool converts the coefficients and graphs from the software framework into the format the embedded vision system recognizes for deployment (Figures 3a and 3b).

 

Figure 2: DesignWare EV6x Vision Processor includes scalar and vector units as well as a dedicated, tightly integrated CNN engine.

Hardware Optimization
Designers may be able to make optimizations based on the hardware during training. Say your hardware is better optimized for a certain convolution size. If, for example, 3×3 and 5×5 convolutions have better MAC utilization than 4×4 or other configurations, you may benefit from choosing those convolution sizes during training. To gain these benefits, the person doing the training must know which hardware platform has been chosen. This often isn’t the case.
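The effect of convolution size on MAC utilization can be illustrated with a toy model. The sketch below assumes a purely hypothetical engine that processes the taps of one convolution window in groups of nine MACs per cycle; real engines schedule work differently, so the numbers only show why some kernel sizes can fit a given datapath better than others.

```python
import math


def mac_utilization(kernel_size: int, macs_per_cycle: int) -> float:
    """Fraction of MAC slots doing useful work for one k x k window.

    Assumes a hypothetical engine that issues `macs_per_cycle` multiplies
    per cycle and cannot share a cycle between two windows.
    """
    taps = kernel_size * kernel_size
    cycles = math.ceil(taps / macs_per_cycle)
    return taps / (cycles * macs_per_cycle)


MACS_PER_CYCLE = 9  # hypothetical datapath width, for illustration only

for k in (3, 4, 5):
    print(f"{k}x{k}: {mac_utilization(k, MACS_PER_CYCLE):.1%} utilization")
```

In this toy model a 4×4 window occupies two cycles but fills only 16 of the 18 available MAC slots, while a 3×3 window fills its single cycle completely.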

As new CNN graph techniques have improved in accuracy, they have also increased in the number of layers and therefore the number of coefficients needed. The more coefficients required, the more memory storage, memory transfer bandwidth, and MAC operations are required. Again, these impact power and die size. New techniques to reduce the number of coefficients while preserving graph accuracy involve coefficient pruning and compression. Graph pruning is a technique that zeros out coefficients that are close to zero. Pruning is an iterative process that must be done as part of training because it requires access to the dataset. Pruning can significantly reduce the computations and bandwidth for a CNN graph. For pruning to be effective, the embedded CNN engine must support decompression, which handles the irregular network connections that result from pruning.
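A single pruning pass and a matching compress/decompress step can be sketched as follows. This is a minimal illustration under stated assumptions: the threshold, the sample weights, and the (index, value) storage format are all invented here, and the retraining between pruning passes that the iterative process requires is omitted.

```python
def prune(coefficients, threshold):
    """Zero out coefficients whose magnitude is below the threshold."""
    return [0.0 if abs(c) < threshold else c for c in coefficients]


def compress(coefficients):
    """Store only the non-zero coefficients as (index, value) pairs."""
    pairs = [(i, c) for i, c in enumerate(coefficients) if c != 0.0]
    return len(coefficients), pairs


def decompress(length, pairs):
    """Rebuild the dense coefficient list from the sparse pairs."""
    dense = [0.0] * length
    for index, value in pairs:
        dense[index] = value
    return dense


# Invented sample coefficients and threshold, for illustration only.
weights = [0.42, -0.003, 0.0008, -0.77, 0.01, 0.0, 0.15, -0.002]
pruned = prune(weights, threshold=0.05)
length, pairs = compress(pruned)

assert decompress(length, pairs) == pruned  # the round trip is lossless
print(f"kept {len(pairs)} of {length} coefficients")
```

The surviving coefficients sit at irregular positions, which is why the deployment hardware needs a way to expand or otherwise handle the compressed form rather than assuming a dense layout.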

Figures 3a and 3b: Training and Deployment/Inference phases

Conclusion
Vision technology is enabling a wide range of applications, such as augmented/virtual reality, automated drone control, and smart surveillance, which offer intelligent and responsive capabilities. Deep learning software frameworks for embedded vision implementations are being embedded into larger SoCs, and applications that incorporate deep learning techniques will continue to be an attractive approach for vision developers in specific markets. To take full advantage of deep learning, software developers will need to examine many aspects of the software frameworks they are considering, including the final hardware deployment.


Gordon Cooper is a Product Marketing Manager for Synopsys’ Embedded Vision Processor family. Cooper brings more than 20 years of experience in digital design, field applications and marketing at Raytheon, Analog Devices, and NXP to the role. Cooper also served as a Commanding Officer in the US Army Reserve, including a tour in Kosovo. He holds a Bachelor of Science degree in Electrical Engineering from Clarkson University.
