Blazing Algorithms for Rich Audio/Visual Applications



The ARM® Cortex®-M7 takes on aggressive algorithms while remaining gentle with power and cost, but its performance depends on system choices around memory that enable fast, deterministic execution of tight code at the full processor clock rate.

The ARM Cortex-M series is indisputably the dominant processor family for embedded systems that need a combination of low power, cost effectiveness and moderate performance. Now the first implementations of its highest-performance member, the Cortex-M7, are emerging, and its capabilities make it a natural choice for the serious algorithms behind rich audio and visual capabilities in low-power, low-cost applications. In a 65nm embedded flash process, the Cortex-M7 can achieve a 1500 CoreMark score while running at 300 MHz, and its DSP performance is double that available in the Cortex-M4. A double-precision floating-point unit and a dual-issue instruction pipeline further position the Cortex-M7 for speed. These features make the Cortex-M7 ideal for applications such as drone image processing and the audio, voice control, object recognition and complex sensor fusion of automotive and higher-end Internet of Things (IoT) devices. But the performance of Cortex-M7-based microcontrollers depends on each designer's system choices around memory implementation.

Tightly Coupled Memories for Fast, Deterministic Code Execution

Algorithms such as FIR filters, FFTs or biquad filters often need to run quickly and deterministically for real-time response or seamless audio and video performance. To meet system performance and latency goals, these routines should operate at the full clock rate, unencumbered by cache misses, interrupts, context swaps and other execution surprises that work against deterministic timing.
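As a concrete illustration, the following C sketch shows the general shape of such a kernel using the CMSIS-DSP FIR functions. The tap count, block size and TCM section name are illustrative assumptions rather than values taken from this article, and the exact placement mechanism depends on the toolchain.

/* Minimal sketch of a block-based FIR kernel built on CMSIS-DSP
 * (arm_math.h). The section name ".dtcm_data" is hypothetical; it
 * must match whatever the linker script names the DTCM region. */
#include "arm_math.h"

#define NUM_TAPS    32
#define BLOCK_SIZE  64

/* Coefficients and filter state kept in data TCM for single-cycle access. */
__attribute__((section(".dtcm_data")))
static float32_t fir_coeffs[NUM_TAPS];
__attribute__((section(".dtcm_data")))
static float32_t fir_state[NUM_TAPS + BLOCK_SIZE - 1];

static arm_fir_instance_f32 fir;

void fir_setup(void)
{
    for (int i = 0; i < NUM_TAPS; i++)
        fir_coeffs[i] = 1.0f / NUM_TAPS;          /* simple moving-average taps */
    arm_fir_init_f32(&fir, NUM_TAPS, fir_coeffs, fir_state, BLOCK_SIZE);
}

/* Process one block of samples; with the code in ITCM and the data in
 * DTCM, the loop runs without cache-miss stalls. */
void fir_process(const float32_t *in, float32_t *out)
{
    arm_fir_f32(&fir, in, out, BLOCK_SIZE);
}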

This means that you can’t simply run code out of standard memory, and you can’t rely on standard memory for storing the data being processed. Typical flash memories are too slow to keep up with the Cortex-M7 core, and they require caching, causing long delays when a cache miss occurs. So the Cortex-M7 architecture provides a way to bypass the standard execution mechanism using “tightly coupled memories,” or TCM.

ARM’s core provides the TCM interface, and it’s intended for single-cycle random TCM access running at the full processor speed. A 64-bit instruction memory port (ITCM) supports the dual-issue processor architecture, and two data ports (DTCM) make possible two simultaneous, parallel 32-bit data accesses. The architecture does not, however, specify what type of memory or how much memory should be provided. Those decisions are left for designers implementing the Cortex-M7 core in a microcontroller (MCU) as opportunities for innovation and differentiation.

Figure 1. The TCM interface provides a single 64-bit instruction port and two 32-bit data ports.

It’s possible, for instance, to attach embedded flash memory to the TCM interface, but flash can’t run at the processor clock rate. Caching would be required, which threatens the determinism that the TCM is intended to provide. Code shadowing in DRAM is theoretically possible, but it would be cost-prohibitive. The only type of memory that is fast enough for direct, uncached access is SRAM.

SRAMs provide a number of benefits: an SRAM block can be connected directly to the TCM interface, SRAM technology is easily embedded on-chip, and SRAMs permit random accesses at the speed of the processor. The only SRAM drawback is the cost per bit, which is higher than that of flash and DRAM. That means it is critical to keep the size of the TCM limited, even if 65nm or 40nm technology allows a fair amount of SRAM integration at acceptable cost.

The architecture does not specify how the TCM is loaded, so MCU implementations may vary in how these memories are filled. A direct-memory access (DMA) engine would certainly be involved, but whether it's a single DMA or one of several loading data from various streams such as video or USB is left to the MCU designer.

The system programmer decides which code will execute out of the TCM. When preparing a software build, the programmer identifies the code segments and data blocks that should be allocated to the TCM, typically by embedding pragmas into the software and applying linker settings so that the build places code and data appropriately.
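As a sketch of how this might look with a GCC-based toolchain, the fragment below tags a routine and a buffer for TCM placement and then shadows the ITCM code image from flash at startup. The section names and linker symbols are hypothetical and depend on how the vendor's linker script names the TCM regions.

#include <stdint.h>
#include <string.h>

/* Hot routine intended to execute from ITCM (section name is assumed). */
__attribute__((section(".itcm_code"), noinline))
void biquad_kernel(float *samples, int n)
{
    for (int i = 0; i < n; i++)
        samples[i] *= 0.5f;                  /* placeholder for the real filter */
}

/* Working buffer intended to live in DTCM (section name is assumed). */
__attribute__((section(".dtcm_data"), used))
static float work_buffer[1024];

/* Hypothetical symbols emitted by the linker script: the load address of
 * the ITCM image in flash and its run-time start/end addresses in ITCM. */
extern uint8_t _itcm_load_start[], _itcm_start[], _itcm_end[];

/* Called early in the reset handler, before any ITCM-resident code runs,
 * to copy that code out of flash and into the TCM SRAM. */
void copy_code_to_itcm(void)
{
    memcpy(_itcm_start, _itcm_load_start, (size_t)(_itcm_end - _itcm_start));
}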

Multiple Ports for Faster SRAM Access

While TCM provides a straightforward mechanism for quick, tight execution of key routines, it's often useful to have system SRAM as a general-purpose, high-speed memory for the processor and for peripherals to use by means of a DMA. Although this memory is logically separate from the TCM, a single SRAM block could share TCM and general-purpose duties, allowing the split between TCM and system RAM to be tuned for each use case.

Peripheral data buffers implemented in general-purpose system SRAM are typically loaded by DMA transfers from system peripherals. The ability to load from a number of possible sources, however, raises the possibility of unnecessary delays and conflicts by multiple DMAs trying to access the memory at the same time.

In a typical example, three different entities might vie for DMA access to the SRAM: the processor (64-bit access, requesting 128 bits for this example) and two separate peripheral DMA requests (DMA0 and DMA1, 32-bit access each). Let’s assume that the processor has priority over the DMAs and that DMA0 has priority over DMA1.

This example is illustrated in Figure 2. The left side shows the transactions for a 64-bit-wide, single-banked memory; the right side shows the same transactions with the memory organized as four 32-bit banks.

Figure 2. By organizing the SRAM into banks, multiple DMA bursts can occur simultaneously with minimal latency.

With the single-bank memory, the processor would complete its access in two cycles, and in the next cycle, the DMA0 burst would start. DMA1 would be blocked until DMA0 was finished. Any higher-priority processor requests would interrupt the DMA loads, adding further delays. As shown above, an instruction fetch in Cycle 5 adds another cycle of delay to both DMA operations.

More efficient operation can be achieved by organizing the memory into banks with interleaved addresses, each of which can be accessed independently. In this arrangement, the DMA0 burst begins as soon as the processor is done with the first bank, while the processor continues accessing the higher banks. DMA1 no longer has to wait until the entire DMA0 transfer is complete; it can start one cycle after DMA0, as soon as DMA0 has finished with the first bank.
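To make the interleaving concrete, the short standalone program below maps word addresses to banks under one plausible scheme, with the bank chosen by address bits [3:2] to match four 32-bit banks; the actual mapping in a given MCU may differ.

#include <stdint.h>
#include <stdio.h>

/* Word-interleaved bank selection for four 32-bit banks: consecutive
 * 32-bit words land in consecutive banks. */
static unsigned bank_of(uint32_t addr)
{
    return (addr >> 2) & 0x3u;
}

int main(void)
{
    /* A 128-bit processor access covers words 0..3, one word per bank,
     * so a 32-bit DMA access to another word can proceed in parallel as
     * soon as its bank is free. */
    for (uint32_t addr = 0x0; addr < 0x20; addr += 4)
        printf("addr 0x%02lX -> bank %u\n", (unsigned long)addr, bank_of(addr));
    return 0;
}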

Having higher priority, the processor sees no latency difference between the two arrangements. But DMA0’s latency improves from 2 cycles to 1, and DMA1’s latency goes from 7 cycles down to 2. An instruction fetch during the DMA operations may even occur in parallel with the DMA operations, resulting in no additional latency. Lower latency improves performance, but it may also mean that peripheral FIFOs can be smaller.

Given these possible memory arrangements, it is left to the MCU designers to determine how flash memory, system SRAM and TCM work together. Are they all separate? Do they share address space? Are they on- or off-chip? These decisions will differentiate competing Cortex-M7 implementations.

A Specific Cortex-M7 TCM Example

One example of a Cortex-M7 MCU implementation is found in Atmel's SAM S70, SAM E70 and SAM V70/V71 families. These devices include an SRAM organized in four banks that can be used as general-purpose SRAM (which Atmel calls System SRAM) and as TCM. In the portion configured as TCM, instructions run with deterministic, single-cycle access at the full core speed of 300 MHz, with no delays for fetches or caching. Similarly, data located in the TCM can be accessed without caching penalties.

Figure 3. The Atmel SAM S70/E70 family uses SRAM that can serve as TCM and/or as System SRAM for high flexibility and utilization.

The instruction TCM (ITCM) and data TCM (DTCM), which must be sized alike, can be 32 KB, 64 KB or 128 KB each. This sizing is a build-time setting that cannot be changed on the fly.

The full SRAM block can be up to 384 KB in size, which means that System SRAM and TCM can be used at the same time. The amount of memory configured as TCM reduces the amount available as System SRAM; for example, if configured for 128 KB of ITCM and 128 KB of DTCM, the System SRAM is reduced to 128 KB. The System SRAM runs at up to 150 MHz, or half the processor speed, and is organized into four banks, reducing DMA latency. DMA access priorities, which can be fixed or round-robin, are the same for System SRAM as for the internal bus matrix.

A register bit activates or deactivates the TCM, so the target of a given access will depend on TCM activation. A device is always booted with ITCM deactivated, allowing code to be loaded from the boot memory. After booting, the TCM can be activated as needed.
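The Cortex-M7 core exposes TCM enable bits through control registers that CMSIS-Core defines in core_cm7.h; assuming these are the bits referred to above, a minimal sketch might look as follows. The device header name is a placeholder, and vendor-specific steps such as selecting the TCM size are omitted.

/* Enable or disable the TCMs via the Cortex-M7 TCM control registers
 * (SCB->ITCMCR / SCB->DTCMCR, defined by CMSIS in core_cm7.h).
 * "sam.h" is a placeholder for the actual device header. */
#include "sam.h"

void tcm_enable(void)
{
    __DSB();
    __ISB();
    SCB->ITCMCR |= SCB_ITCMCR_EN_Msk;     /* enable instruction TCM */
    SCB->DTCMCR |= SCB_DTCMCR_EN_Msk;     /* enable data TCM */
    __DSB();
    __ISB();
}

void tcm_disable(void)
{
    __DSB();
    __ISB();
    SCB->ITCMCR &= ~(uint32_t)SCB_ITCMCR_EN_Msk;
    SCB->DTCMCR &= ~(uint32_t)SCB_DTCMCR_EN_Msk;
    __DSB();
    __ISB();
}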

The ITCM address space overlaps the boot memory. If TCM is deactivated, then instruction accesses go to the boot memory. If TCM is activated, then accesses within the TCM address space come from the TCM; accesses above the TCM space come out of flash.

It's possible to protect the various spaces with the memory protection unit (MPU) to ensure that there are no inadvertent accesses to the wrong memory.
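As an illustration, the sketch below uses the CMSIS-Core ARMv7-M MPU helpers to mark the DTCM region as non-executable. The base address, region size and attribute settings are assumptions chosen for the example and should be checked against the device memory map.

/* Fence off the DTCM with the MPU using the CMSIS helpers in
 * mpu_armv7.h. Base address 0x20000000 and a 128 KB size are assumed
 * here; adjust both to the actual device configuration. */
#include "sam.h"              /* placeholder device header (pulls in core_cm7.h) */
#include "mpu_armv7.h"

void protect_dtcm(void)
{
    ARM_MPU_Disable();

    /* Region 0: DTCM, full read/write access, execute never,
     * non-cacheable (TCM accesses bypass the caches anyway). */
    ARM_MPU_SetRegion(
        ARM_MPU_RBAR(0u, 0x20000000u),
        ARM_MPU_RASR(1u,                       /* DisableExec      */
                     ARM_MPU_AP_FULL,          /* AccessPermission */
                     0u,                       /* TypeExtField     */
                     0u,                       /* IsShareable      */
                     0u,                       /* IsCacheable      */
                     0u,                       /* IsBufferable     */
                     0u,                       /* SubRegionDisable */
                     ARM_MPU_REGION_SIZE_128KB));

    /* Keep the default memory map for addresses not covered by a region. */
    ARM_MPU_Enable(MPU_CTRL_PRIVDEFENA_Msk);
}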

Cortex-M7 is the Next ARM Wave

The Cortex-M7 is coming into its own, taking on much more aggressive algorithms while remaining gentle with power and cost. Tightly coupled memories, one of the new Cortex-M7 features, make possible fast, deterministic execution of tight code at the full processor clock rate.

Exactly how this performance will be realized on a given MCU will depend on system choices made by the designers integrating the Cortex-M7 into the MCU. The highest speeds will be afforded by using SRAM to implement the TCM. This SRAM can be shared with a more general system SRAM capability, and, by organizing the SRAM in banks, DMA latency can be minimized.

These architectural features, while relatively simple, provide a dramatic boost to the performance available for implementing critical embedded functions in personal, industrial and automotive devices, particularly those with rich audio/visual requirements.

Authors

Lionel Perdigon is product marketing manager for Atmel's ARM-based Flash MCU product family. After completing an MSc degree at the University of Montpellier, he started his career as a hardware engineer designing GSM modules at Sierra Wireless. He then moved to the position of applications engineer, supporting European mobile phone manufacturers at Rohm. With Atmel since 2006, he has held several positions, from applications engineer to product marketing engineer, which led to his current position. Lionel Perdigon has over 13 years of experience in the electronics and semiconductor industry.

Jacko Wilbrink is senior marketing director for Atmel's ARM-based microcontrollers and has been directly involved with this product line since its inception. Holding an electronics engineering degree from the University of Twente, the Netherlands, he started his career as a design engineer with Philips Semiconductors. Subsequently he worked as a field applications engineer for European Silicon Structures, which led to his current position at Atmel. Jacko Wilbrink has over 25 years of experience in the semiconductor industry.

Contact Information

Atmel Corporation

1600 Technology Drive
San Jose, CA, 95110
USA

tel: 408.441.0311
fax: 408.487.2600
www.atmel.com/avr/