Advances in Architecture Allow ARM Processors to Tackle Endpoints and Gateways for the Cloud



The plethora of devices that connect to the cloud, and even many of the devices in the cloud, are powered by embedded processors that deliver a wide range of performance to execute the desired tasks. At the simple end are endpoints that sense and collect data such as temperature, vibration, pressure, or other parameters. The endpoints, in turn, are linked to the next level via wireless or wired networks. This next level consists of more complex edge devices (gateways) that aggregate and preprocess the data collected from endpoints to reduce the quantity of data and then feed that data into the cloud, where it is routed to the appropriate server.
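As a rough illustration of the data reduction a gateway might perform, the sketch below collapses each window of raw endpoint readings into a single (min, mean, max) summary before forwarding it. The function name `summarize`, the window size, and the sample values are all hypothetical; a real gateway would pick a reduction suited to its sensors and network.

```python
from statistics import mean

def summarize(readings, window):
    """Collapse each window of raw readings into one (min, mean, max) tuple."""
    summaries = []
    for start in range(0, len(readings), window):
        chunk = readings[start:start + window]  # last chunk may be shorter
        summaries.append((min(chunk), mean(chunk), max(chunk)))
    return summaries

# Hypothetical temperature samples (degrees C) from one endpoint.
raw = [21.0, 21.5, 22.0, 21.5, 30.0, 30.5, 31.0, 30.5]
reduced = summarize(raw, window=4)

print(len(raw), "readings reduced to", len(reduced), "summaries")
```

Here eight readings become two summaries, so the gateway forwards a fraction of the original traffic while still reporting the range and average seen in each window.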

Endpoints, gateways, and other edge devices are evolving and are being tasked with more complex functions. That added complexity, in turn, requires higher-performance processors to handle the more demanding data-management, signal-processing and data-reduction algorithms. An evolved class of ARM processors provides an edge to designers…working on the IoT’s edge.

Embedded Processors Provide Intelligence

To provide the necessary “intelligence”, higher-performance but low-power embedded processors are needed in the endpoints. To that end, ARM has expanded its Cortex-M processor family with the addition of the Cortex®-M7 processor, which delivers twice the DSP performance of the previously released Cortex®-M4 while remaining software compatible with it. Based on data from various market research organizations such as Gartner, International Data Corp., and the Semiconductor Industry Association, as well as from ARM itself, ARM’s more than 200 Cortex-M series customers shipped about 4.3 billion Cortex-M series cores in 2014, and 1.6 billion in Q1 2015. Companies such as Atmel, Freescale Semiconductor, Marvell, STMicroelectronics, Texas Instruments and many other vendors have embedded these processor cores into various microcontrollers or custom application-specific integrated circuits.

The Cortex-M7 will be able to take on higher-end embedded applications in next-generation connected devices, vehicles, drones, street lighting, appliances, and many other applications where a full-blown Linux-based operating system is not required.

The Cortex-M7 can run IoT operating systems such as the ARM-developed mbed™ OS, which provides secure, reliable communications within the home or enterprise and out into the cloud. The Cortex-M7 can also run traditional embedded RTOSes for control applications.

The higher performance point of the Cortex-M7 processor is well suited to intelligent endpoint devices in industrial markets, to lighter-weight access points, and to applications that require real-time response along with high performance. Of course, ARM also offers still higher-performance processor cores in its Cortex-A series family, but those cores are not as suitable for real-time applications since they employ memory management units that limit the core’s ability to provide real-time deterministic behavior or respond to interrupts in just a few cycles.

To give the Cortex-M7 the performance and capabilities needed by the more demanding applications, designers crafted efficient memory interfaces such as tightly coupled memories (TCMs) for real-time response, and instruction and data caches for efficient access to large memories and powerful peripherals. Additionally, direct-memory access (DMA) into the tightly coupled memories via the slave version of ARM’s AHB bus, along with the AHBP port for reaching existing AHB peripherals and memories, gives the processor the ability to respond quickly and handle a wide range of I/O requirements (Figure 1).

Figure 1: The ARM® Cortex®-M7 processor core packs improved DSP compute capabilities as well as tightly coupled memories and a wide array of peripheral support functions that let it deliver double the DSP performance of the M4 as well as higher overall performance. (TCM = tightly coupled memory, FPU = floating-point unit, WIC = wake-up interrupt controller, ETM = embedded trace macrocell, ECC = error checking and correction, MPU = memory protection unit.)

The compatibility of the Cortex-M7 with previous M-series processors gives the Cortex-M7 a wide range of pre-built resources that companies can leverage to quickly develop system-on-a-chip solutions for use in endpoints, home gateways, edge devices, data aggregation hardware and many other applications. Compilers, libraries and even application code should all migrate easily from previous devices. This should shorten development times and allow SoCs that integrate the new Cortex-M7 core to be designed into products, possibly by the end of 2015.

Software Compatibility Shortens Development Time

Software and hardware compatibility with previous Cortex-M-series processors will allow designers to carry the hours to months of software development done on the older processors directly over to the Cortex-M7 core, greatly reducing the time needed to develop applications. Functions such as voice recognition, sensor fusion or performance optimization of control applications can be transferred directly to new designs and will execute more efficiently. Embedded designers will find that application code they spent time finely optimizing on older processors can be ported directly to new devices built around the enhanced Cortex-M7 core. Hence the performance enhancements of the Cortex-M7 can be utilized with little or no software work.

The CPU core in the Cortex-M7 processor has a six-stage superscalar pipeline with branch prediction, and the DSP extensions allow the core to perform single-cycle 16- or 32-bit multiply-accumulate (MAC) operations, single-cycle dual 16-bit MAC operations, as well as 8/16-bit SIMD (single-instruction/multiple-data) arithmetic. Also in the Cortex-M7 is a double-precision floating-point unit that delivers the higher accuracy required by applications such as GPS and precise positioning in the home or the enterprise.
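To make the dual 16-bit MAC concrete, here is a small software model of what one such operation computes: two signed 16-bit products taken from the halves of two 32-bit words and added to an accumulator (the behavior of an instruction such as SMLAD in the Cortex-M DSP extensions). This is only an illustrative sketch; the helper names are hypothetical, and accumulator overflow is not modeled.

```python
def to_signed16(value):
    """Interpret the low 16 bits of value as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def pack16x2(low, high):
    """Pack two signed 16-bit values into one 32-bit word, low half first."""
    return ((high & 0xFFFF) << 16) | (low & 0xFFFF)

def dual_mac(acc, word_a, word_b):
    """Model one dual 16-bit MAC: acc + a.low * b.low + a.high * b.high."""
    low = to_signed16(word_a) * to_signed16(word_b)
    high = to_signed16(word_a >> 16) * to_signed16(word_b >> 16)
    return acc + low + high

def dot_product(xs, ys):
    """Dot product of two even-length lists, consuming two pairs per MAC step."""
    acc = 0
    for i in range(0, len(xs), 2):
        acc = dual_mac(acc,
                       pack16x2(xs[i], xs[i + 1]),
                       pack16x2(ys[i], ys[i + 1]))
    return acc

samples = [100, -200, 300, -400]   # hypothetical 16-bit sensor samples
coeffs = [3, 4, -5, 6]             # hypothetical filter coefficients
result = dot_product(samples, coeffs)
print(result)
```

The four-tap dot product finishes in two MAC steps instead of four, which is the kind of saving that benefits filters and other signal-processing loops.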

Inside the Cortex-M7 core, instruction and data buses have been widened to 64 bits versus the 32-bit buses used in previous M-series processors. That enables multiple instructions to be fetched in each clock cycle. Additionally, the high-performance 64-bit AXI system bus provides a system interconnect capability that is new for Cortex-M-class cores. It is optimized for throughput by supporting multiple outstanding transactions and transaction queuing. Attached to the AXI bus are configurable instruction and data caches that provide low-latency buffering of information as it is fetched from slower memories. The resulting architecture enhances the processor’s ability to work with external memories to handle large data arrays and programs.

High-performance Microcontrollers Leverage the Cortex-M7 Core

Early licensees of the Cortex-M7 core include Atmel, Freescale and STMicroelectronics. These vendors and others are developing microcontrollers that can tackle endpoint, gateway and edge applications. For example, both Atmel and Freescale have crafted full general-purpose microcontrollers around the Cortex-M7 core (the SAM70 and Kinetis families, respectively), while STMicro has developed a chip, an offshoot of its STM32 F7 series of microcontrollers, that integrates most of the functionality of a home gateway.

Additionally, other development partners such as Emcraft Systems have crafted development kits based on the Cortex-M7. For example, the STM32F7-SOM-1A Module from Emcraft is based on the STMicro STM32F746 microcontroller that runs at up to 200 MHz, packs 320 kbytes of RAM and 1 Mbyte of Flash (Figure 2). In addition to the MCU, the board adds 32 Mbytes of SDRAM, 16 Mbytes of NOR flash, an Ethernet PHY and still other resources.

Figure 2: The STM32F7-SOM-1A Module from Emcraft Systems provides a designer with a full starter kit for designing applications using the M7 core. In addition to the board, the company has ported a µClinux software package to the platform.

The microcontrollers in Atmel’s SAM70 series can operate at frequencies of up to 300 MHz and deliver over 600 DMIPS of computational throughput. The architecture allows for lower power consumption, high-speed data stream handling and the ability to handle video streams. In the future Atmel expects to migrate the processor to a 40nm process that will enhance the performance by about another 30%. The company has optimized the SAM70 microcontroller family for automotive infotainment and telematics applications by adding Ethernet AVB and MediaLB support on the chip. Advanced analog interfaces and timers will also let the processors handle various motor control and robotics applications.

Designers at Freescale are also leveraging the improvements in the Cortex-M7: the enhanced pipeline supports execution of multiple instructions per clock, improving the throughput of the core. Power modes such as high-speed run mode and very-low-power run mode dynamically change the power management of Kinetis devices. High-speed run mode completes tasks as quickly as possible, while very-low-power run mode can be used to extract more processing from the Cortex-M7 core at lower CPU speeds.

High Speed Can Actually Lower Power Consumption

The higher processing performance can be used to perform functions in a shorter amount of time. Specifically, two aspects of the processing performance will affect end applications, especially those requiring lower power consumption. First, doing more work per clock cycle allows a task to be completed at lower system clock speeds. Digital filters that previously required 200 MHz can now run at 100 MHz. In addition, because the computational improvements are realized at all CPU speeds, designs can take advantage of low-power run modes. Second, another strategy for low-power design is completing tasks as quickly as possible. Along with the higher processing throughput, the Cortex-M7 supports higher CPU speeds. So when the new core is used to its fullest capabilities, the time spent processing in active modes can be reduced, which allows applications to spend more time in low-power modes.
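The two strategies can be compared with some back-of-the-envelope arithmetic. The sketch below computes the energy used in one periodic task slot, assuming the core runs the task and then sleeps for the rest of the period. Every number in it (cycle counts, power figures, the 10 ms period, and the assumption that active power scales linearly with clock speed) is hypothetical and chosen only to illustrate the reasoning, not taken from any datasheet.

```python
def energy_per_period_uj(cycles, clock_mhz, active_mw, sleep_mw, period_ms):
    """Energy in microjoules for one period: run the task, then sleep."""
    active_ms = cycles / (clock_mhz * 1000.0)  # MHz * 1000 = cycles per ms
    if active_ms > period_ms:
        raise ValueError("task does not fit in the period")
    sleep_ms = period_ms - active_ms
    return active_mw * active_ms + sleep_mw * sleep_ms  # mW * ms = uJ

PERIOD_MS = 10.0   # hypothetical task period
SLEEP_MW = 1.0     # hypothetical low-power-mode draw

# Older core: 200,000 cycles per task at 200 MHz, 100 mW while active.
baseline = energy_per_period_uj(200_000, 200, 100.0, SLEEP_MW, PERIOD_MS)

# Strategy 1: half the cycles per task, so run at 100 MHz (assumed 50 mW).
lower_clock = energy_per_period_uj(100_000, 100, 50.0, SLEEP_MW, PERIOD_MS)

# Strategy 2: half the cycles at the full 200 MHz, then sleep sooner.
race_to_idle = energy_per_period_uj(100_000, 200, 100.0, SLEEP_MW, PERIOD_MS)

print(baseline, lower_clock, race_to_idle)
```

Under these assumptions both strategies use a little over half the baseline energy. Which one wins in a real design depends on factors this sketch leaves out, such as voltage scaling, leakage, and wake-up overhead.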

With a top clock frequency of 200 MHz, the STM32 F7 microcontrollers take advantage of the six-stage pipeline and FPU to achieve a throughput of up to 1000 CoreMarks (Figure 3). In addition to the Cortex-M7 core, designers at STMicro included two independent mechanisms to reach zero-wait-state performance from both internal and external memories: the company’s Adaptive Real-Time (ART) Accelerator™ for internal embedded Flash, and the L1 cache for both execution and data access from internal and external memories.

Figure 3: The high-level of integration done by STMicro on its STM32 F7 microcontroller provides designers with a highly configurable solution that delivers double the performance of previous-generation M-series processors.
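A simple model shows why zero-wait-state access matters. In the sketch below, a fetch that hits in a cache or accelerator costs one cycle, while a miss pays an extra penalty of wait states. The six-wait-state figure and the hit rates are hypothetical placeholders, not STM32 F7 specifications.

```python
def average_fetch_cycles(hit_rate, wait_states):
    """Average cycles per fetch: hits cost 1 cycle, misses cost 1 + wait_states."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * 1.0 + (1.0 - hit_rate) * (1 + wait_states)

WAIT_STATES = 6  # hypothetical penalty for an uncached slow-memory access

for hit_rate in (0.0, 0.5, 0.9, 1.0):
    cycles = average_fetch_cycles(hit_rate, WAIT_STATES)
    print(f"hit rate {hit_rate:.0%}: {cycles:.2f} cycles per fetch")
```

With no cache, every fetch in this model costs seven cycles; as the hit rate approaches 100 percent, the average approaches the single-cycle, zero-wait-state ideal.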

Designers of the STM32 F7 optimized the entire system by integrating multiple new peripheral support functions around the Cortex-M7 processor core. Numerous timers, a crypto/hash processor, a true random number generator, multichannel DACs and ADCs, a high-throughput AXI and multi-AHB bus matrix, and many other features are all integrated on the chip (refer back to Figure 3). One of the more novel features is a large SRAM with a scattered architecture. The SRAM contains a total of 320 kbytes that is divided into a 240-kbyte and a 16-kbyte block on the bus matrix plus 64 kbytes of data TCM RAM; the chip also provides 16 kbytes of instruction TCM RAM and 4 kbytes of backup SRAM. This scattered approach lets the large SRAM block support large data buffers and multiple software stacks, while the backup SRAM allows data retention in the lowest power modes for quick recovery, and the data and instruction TCM blocks support critical real-time data and program execution.

The high performance of the Cortex-M7 and its compatibility with previous-generation Cortex-M series processors give designers a jumpstart in developing next-generation microcontrollers or SoC solutions. The ability to operate at frequencies of 200 MHz and above will actually let the processors conserve energy and lower system power consumption in many endpoints and lightweight access points.

In the future, the transition to smaller process nodes will further enhance the performance while lowering the operating power to deliver leading edge solutions in systems ranging from endpoints such as sensor nodes and smart wearables to critical monitoring and maintenance functions for intelligent edge devices and routers.

This article was sponsored by ARM.
