Automotive MCUs Take on Performance/Power/Cost
The arrival of multicore architectures poses new challenges for application designers.
The design challenge for automotive microcontrollers is to find a tradeoff between the ever-increasing demand for high processing speeds and the goals of low power consumption and low cost.
For a long time, silicon manufacturers increased processing speed and overall MCU performance with improvements in manufacturing and process technologies and with even more sophisticated core architectures, which meant increasing parallelism at the instruction level.
In this race to make processors faster, MCU designers have nearly exhausted the possibilities of alternative architectures, and MCU designs are becoming more complex and difficult to manage. At the same time, power consumption and heat dissipation are growing along with increases in MCU clock speed.
Designers are addressing the problem of balancing performance and power consumption by reducing clock speeds and redesigning the microcontroller’s architecture, putting multiple cores (two or more) on a die.
Yet the introduction of new MCU architectures creates fresh challenges for application designers, who have to take into account new application development paradigms.
Spurred by the fundamental need to boost performance, the evolution to multicore architectures requires distribution of the workload among the cores and applications by moving from a sequential to a parallel schema.
Shrinking Geometries and Growing Concerns
Automotive and other microcontrollers hunger for ever more computational power. The tradeoff between performance and power consumption (a key factor in the automotive sector) has been handled not only by increasing the operating frequency and implementing improvements at the core level (e.g. instruction-level parallelism), but also by capitalizing on forward leaps in manufacturing technology and reducing the size of single gates.
For some time the only approach to using as little power as possible focused on dynamic power, which comes from the charge and discharge activity on the outputs of millions of gates. Because dynamic power is proportional to the square of the supply voltage, reducing the voltage (and consequently the frequency) significantly lowers the power dissipated.
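The quadratic dependence can be written with the usual first-order model for CMOS dynamic power. The symbols are introduced here for illustration and do not appear in the original text: α is the switching activity factor, C the switched capacitance, V_dd the supply voltage and f the clock frequency.

```latex
P_{\text{dyn}} = \alpha \, C \, V_{dd}^{2} \, f
```

Under this model, scaling both the voltage and the frequency by a factor k < 1 scales dynamic power by roughly k^3, which is why voltage reduction was such an attractive lever.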
Unfortunately, the “reduce-the-voltage” method cannot be applied indefinitely. Static power loss due to current leakage through transistors (subthreshold and oxide leakage), which occurs even when the transistors are turned off and was considered negligible at geometries above 100nm, becomes a very prominent factor in the power budget as the technology shrinks.
Nowadays the performance loss due to a slower operating frequency can be compensated by multicore implementations that run the original tasks as parallel threads. Each thread executes on its own core, simultaneously with the others and at a slower operating frequency, so that the total computational power remains unchanged compared to the serial case.
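As a rough illustration of this argument, the sketch below compares one core at full speed with two cores at half speed, using the first-order dynamic power model (activity × capacitance × voltage squared × frequency). All numeric values are hypothetical, and the sketch assumes that the work parallelizes perfectly and that the supply voltage can be lowered at the reduced frequency; static leakage is ignored.

```python
def dynamic_power(alpha, cap, vdd, freq):
    """First-order CMOS dynamic power in watts: alpha * C * Vdd^2 * f."""
    return alpha * cap * vdd ** 2 * freq


# Hypothetical values, chosen only for illustration.
ALPHA = 0.2   # switching activity factor
CAP = 1e-9    # switched capacitance per core, in farads

# One core at 200 MHz and 1.2 V.
single_power = dynamic_power(ALPHA, CAP, vdd=1.2, freq=200e6)
single_throughput = 1 * 200e6  # cycles per second available

# Two cores at 100 MHz each, assuming the lower clock allows 0.9 V.
dual_power = 2 * dynamic_power(ALPHA, CAP, vdd=0.9, freq=100e6)
dual_throughput = 2 * 100e6    # same aggregate cycles per second

print(f"single core: {single_power * 1e3:.1f} mW")
print(f"dual core:   {dual_power * 1e3:.1f} mW")
```

With these assumed numbers the aggregate cycle budget is identical, while the dual-core configuration dissipates a little over half the dynamic power of the single core. Real gains depend on how much of the application can actually run in parallel.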
Several studies addressing the problem of delivering high performance while at the same time controlling leakage are in progress, and new technologies developed by the major silicon vendors are being tested to verify their effectiveness. STMicroelectronics has introduced a promising innovation in silicon process called Fully Depleted Silicon On Insulator, or FD-SOI. This is a planar process technology that delivers the benefits of reduced silicon geometries while simplifying the manufacturing process.
The technology enables much better transistor electrostatic characteristics versus conventional bulk technology. The buried oxide layer (Figure 1) lowers the parasitic capacitance between the source and the drain while efficiently confining the electrons flowing from the source to the drain, dramatically reducing performance-degrading leakage currents.
Multiple cores can run multiple instructions at the same time, increasing overall performance and allowing higher performance at lower energy. Each core in a multicore device is generally more energy-efficient, so the chip becomes more efficient than it would be with a single large monolithic core. Assuming that the die can physically fit into the package, multicore CPU designs require much less printed circuit board (PCB) space than multi-chip symmetric multiprocessing (SMP) designs.
A multicore processor uses slightly less power than multiple coupled single-core processors, principally because of the decreased power required to drive signals external to the chip. Maximizing the utilization of the computing resources provided by multicore processors requires adjustments to, or a redesign of, both the operating system (OS) support and the existing application software.
Some architectures use one core design repeated consistently (“homogeneous”), while others use a mixture of different cores, each optimized for a different role (“heterogeneous”). In other cases, to meet safety requirements, two cores can execute exactly the same instructions (in lockstep mode) so that their output results can be compared. These kinds of dual-core pairs (sometimes replicated on the same chip) are commonly used in Chassis and Safety designs, and in general in all applications targeting ISO 26262 ASIL-D compliance. Sometimes on these devices it’s possible to completely separate (decouple) the execution flows to ensure that software running on the primary core will not interrupt measurements and data processing on the secondary one.
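The idea behind lockstep comparison can be sketched in software, although on real devices the duplication and comparison are done in hardware. In the example below, `run_lockstep`, `accumulate` and `faulty_accumulate` are hypothetical names invented for illustration: two "cores" process the same inputs, and any divergence in their results is flagged immediately.

```python
class LockstepMismatch(Exception):
    """Raised when the two redundant executions disagree."""


def run_lockstep(core_a, core_b, inputs, state=0):
    """Feed the same inputs to both cores and compare results every step."""
    state_a = state_b = state
    for cycle, value in enumerate(inputs):
        state_a = core_a(state_a, value)
        state_b = core_b(state_b, value)
        if state_a != state_b:
            raise LockstepMismatch(f"cycle {cycle}: {state_a} != {state_b}")
    return state_a


def accumulate(state, value):
    """Reference computation: 32-bit wrapping accumulator."""
    return (state + value) & 0xFFFFFFFF


def faulty_accumulate(state, value):
    """Same computation with an emulated bit flip whenever value == 7."""
    result = accumulate(state, value)
    return result ^ 0x8 if value == 7 else result


print(run_lockstep(accumulate, accumulate, [1, 2, 3]))
try:
    run_lockstep(accumulate, faulty_accumulate, [1, 7, 3])
except LockstepMismatch as err:
    print("fault detected:", err)
```

The comparison detects that something went wrong but cannot tell which core is correct, which is why a mismatch is typically treated as a fault to be handled by the safety mechanisms of the system.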
STMicroelectronics has introduced the SPC56 Automotive MCU family, a versatile set of dual-core devices that address automotive Body applications such as parking control and door control, as well as Chassis and Safety applications.
SPC56AP60 and SPC56ELx belong to the Chassis and Safety family. They are both homogeneous multicore devices that embed the Power Architecture cores e200z0h (in the SPC56AP60) and e200z4d (in the SPC56). In these MCUs the second core can be activated only by the primary core.
Specifically, SPC56ELx can work either in lockstep mode (both cores execute the same code at the same time from power-up) or in decoupled mode (the execution flows are independent).
The Body family instead has a heterogeneous MCU (SPC56EC70), which uses an e200z4d as its primary core and an e200z0h as its secondary core (both Power Architecture). This device is focused on power saving during the idle state. The second core therefore exists mostly for low-power mode handling and doesn’t need to be as powerful as the main core.
Designing a multicore application has its roots in the classical serial (for single core) design flow.
With the goal of boosting performance, designers, starting from serial optimization techniques, must move toward a parallel schema—where independent execution flows come together to solve large problems.
The two main factors that developers must take into account in parallel programming are a deep knowledge of the application and a clear understanding of the architecture where the application will run.
Understanding the application allows the developer to decompose, where possible, the serial execution flow into a parallel one. Deeper analysis of the application use case, including hotspots identified by profiling serial implementations, can help in understanding the intrinsic nature of the problem, which in general can be decomposed by task or by data. In task decomposition, several tasks can be executed in parallel because there are no dependencies between them. In data decomposition, large data manipulations are split into smaller computational units.
Whatever decomposition is recognized or used, developers have to consider dependencies, whether in data manipulation or in task execution order, and finally choose a granularity that justifies the parallelization. Granularity, defined as the ratio of computation to communication, can be fine or coarse. With fine granularity, only small portions of computation are done between communication events. This usually favors load balance across the architecture, but it can hurt overall performance, since significant time can be wasted in communication or synchronization with the other tasks. Coarse-grained parallelism, on the other hand, tends to increase performance but degrades the load balance.
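The two decomposition styles can be contrasted with a small sketch. The function names and sensor values below are hypothetical, and the Python standard library thread pool stands in for the cores of an MCU. In standard CPython, threads do not speed up CPU-bound work, so the example only illustrates how the work is structured.

```python
from concurrent.futures import ThreadPoolExecutor


def read_wheel_speed():
    """Hypothetical independent task: return wheel speed samples."""
    return [12.0, 12.5, 11.8, 12.1]


def read_steering_angle():
    """Hypothetical independent task: return a steering angle."""
    return 3.5


def task_decomposition():
    """Run two unrelated tasks side by side; neither depends on the other."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        speed_future = pool.submit(read_wheel_speed)
        angle_future = pool.submit(read_steering_angle)
        return speed_future.result(), angle_future.result()


def chunk_sum_of_squares(chunk):
    """The same operation, applied to one slice of the data."""
    return sum(x * x for x in chunk)


def data_decomposition(samples, workers=2):
    """Split one large computation into per-chunk units, then combine."""
    size = max(1, (len(samples) + workers - 1) // workers)
    chunks = [samples[i:i + size] for i in range(0, len(samples), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(chunk_sum_of_squares, chunks))


print(task_decomposition())
print(data_decomposition(list(range(10))))
```

In the first case the parallel units are different functions; in the second they are the same function applied to different slices of the data, followed by a final combining step.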
In general more parallelism is desirable, but this doesn’t guarantee better performance. A successful performance balance relies on scheduling strategies, algorithm implementation, the most suitable granularity and the smart use of computational resources.
Application performance is strongly influenced by the structure of the code, and this is even more evident on multicore architectures, where the parallel executions incur overhead that limits the expected execution time benefits.
The user needs to find the correct tradeoff among having enough tasks to keep all cores busy, having enough computation in each task to amortize the overheads, and determining the optimal task size for achieving the shortest execution time.
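This tradeoff can be explored with a toy cost model. The sketch below assumes equal-sized tasks, a fixed per-task overhead and cores that process tasks in rounds; the function names and numbers are hypothetical and the model ignores memory and communication effects.

```python
import math


def parallel_time(total_work, n_tasks, cores, overhead):
    """Estimated run time when total_work is split into n_tasks equal tasks.

    Each task costs its share of the work plus a fixed overhead, and the
    cores process the tasks in ceil(n_tasks / cores) rounds.
    """
    per_task = total_work / n_tasks + overhead
    rounds = math.ceil(n_tasks / cores)
    return rounds * per_task


def best_task_count(total_work, cores, overhead, max_tasks=64):
    """Return the task count with the shortest estimated run time."""
    return min(
        range(1, max_tasks + 1),
        key=lambda n: parallel_time(total_work, n, cores, overhead),
    )


for n in (1, 2, 4, 64):
    print(n, parallel_time(total_work=1000, n_tasks=n, cores=2, overhead=5))
print("best:", best_task_count(total_work=1000, cores=2, overhead=5))
```

With a single task one core sits idle, while with 64 tasks the accumulated overhead outweighs the benefit. Under these assumptions the shortest time is reached when the number of tasks matches the number of cores. Once task durations vary, more and smaller tasks become attractive again.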
Apart from the general development guidelines for reducing execution time and saving resources, such as compiler flags and low-level communications, the challenge in a multicore environment is to handle the overhead coming from inter-task interactions (e.g. synchronization, data communication), hardware bookkeeping (e.g. memory consistency), software overheads (e.g. libraries), and the start-up and termination time of each function.
Good load balancing across a multicore platform is a must to obtain a good scaling of the application, and the scheduling strategy has to be carefully chosen.
In general, a dynamic scheduling strategy introduces more overhead and is less scalable than static scheduling because it requires some global synchronization.
The weakness of static scheduling arises when task duration is variable, because it can generate an unbalanced load. In this case a very common solution is to split the work into shorter tasks, whose number is significantly larger than the number of cores, and to apply a dynamic strategy that allocates the work to idle cores while the parallel computation continues.
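The effect of variable task durations can be illustrated with a small simulation. This is a sketch under simplifying assumptions: the durations are hypothetical, static scheduling assigns contiguous blocks of tasks up front, dynamic scheduling hands the next task to whichever core becomes idle first, and the synchronization cost of the dynamic scheme is not modeled.

```python
import heapq


def static_makespan(durations, cores):
    """Pre-assign contiguous blocks of tasks to cores; return finish time."""
    if not durations:
        return 0
    size = -(-len(durations) // cores)  # ceiling division
    loads = [sum(durations[i:i + size]) for i in range(0, len(durations), size)]
    return max(loads)


def dynamic_makespan(durations, cores):
    """Give each task to the core that becomes idle first; return finish time."""
    finish_times = [0] * cores
    heapq.heapify(finish_times)
    for duration in durations:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + duration)
    return max(finish_times)


# One long task followed by several short ones (hypothetical durations).
tasks = [8, 1, 1, 1, 1, 1, 1, 2]
print("static: ", static_makespan(tasks, cores=2))
print("dynamic:", dynamic_makespan(tasks, cores=2))
```

With these durations the static split leaves one core with most of the work, while the dynamic allocation reaches the ideal finish time of half the total work. In practice the synchronization needed by the dynamic scheme reduces part of this advantage.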
In general a fair load balance may conflict with data locality, and this of course may cause the waste of local memory resources available at core level.
In order to improve data alignment for better resource utilization and to reduce communication cost, it is reasonable to distribute data across the multicore architecture and keep it close to the core with primary use. Taking this approach means a strong use of local memories (TCM or system memories close to the core) and data tiling techniques over caches. However, apart from memory bandwidth weakness, a shared data model also implies that cache coherence must be ensured to allow multicore cache behavior to be consistent, as cores are seen as attached directly to the main memory. Improper or insufficient distribution of memory resources and the overhead of data synchronization may cause parallelized applications on multicore architectures to perform worse than serial use cases on a single-core device.
Moreover, in a multi-threading environment the competition for shared data or bus bandwidth can cause performance to deteriorate. Beyond that point, additional threads only waste on-chip power, since execution performance drops. Careful analysis is needed to choose the right number and the proper distribution of threads.
Successfully navigating the transition from single-core devices to multicore architectures means users must change their design approach, both for application building and for computation methodology. A clear understanding of computational units and data flow content is needed to tackle application design and find a way to ensure better performance with lower power consumption.
Raising performance comes with tradeoffs, and the applications will not increase performance or meet expectations unless the developer makes the effort to adapt the application to the device where it will run.
Giovanni D’Aquino is a senior application engineer for STMicroelectronics in Catania, Italy. He joined the company in 2003 and has been working in R&D as a software developer and OS expert. He has nine years of experience in Power Architecture and currently works in the automotive microcontroller division as application leader for Chassis and Safety microcontrollers. D’Aquino graduated with a Master’s degree in Computer Engineering in 2002 and received his second Master’s, specializing in Management of Development’s Projects about ICT Security, in 2004.