Choosing the Right Hardware/Software Configuration is Vital in Mitigating Development Risk
With increasing pressures for improved data throughput in telecommunications systems, developers are increasing data rates while mitigating risks through strategic hardware and software pairings. The key challenge – and opportunity – is to work within a development environment that is ‘hardware agnostic,’ using a hardware abstraction layer that will enable a solution to take advantage of new generations of processor technology yet require minimal – if any – changes to the software in response to changes in the underlying hardware platform.
Ever-Increasing Need for Speed
Major changes are happening within the telecommunication industry. Consider the iPhone and other smartphones and how quickly they have overwhelmed the existing service-provider infrastructure. A step back from flat data plans to bandwidth-limited data plans is a clear indication of how much the infrastructure is being oversubscribed. Statistics show that, driven by high-bandwidth Internet applications, the total traffic in the core network is growing at over 100 percent a year (Figure 1), indicating that individual core elements will be required to provide a corresponding increase in bandwidth. The pressure to upgrade existing infrastructure in order to provision higher bandwidth is being passed on to telecommunication equipment manufacturers (TEMs).

Only a few years ago, 10 Gigabit Ethernet was the state of the art in telecommunication ATCA systems. Today, however, everyone is talking about 40 Gigabit and even considering 100 Gigabit data rates within the ATCA chassis. Forty Gigabits per second is a lot of data to process, and the problem is compounded because the level of required processing continues to increase. Security and privacy concerns require data to be encrypted and continuously searched for possible virus signatures. Multiple protocols are encapsulated one on top of another, requiring efficient packet parsing and header manipulation capabilities.
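To put these data rates in perspective, the following minimal sketch computes the worst-case packet rate and the per-packet time budget on an Ethernet link. It assumes minimum-size 64-byte frames plus the standard 20 bytes of per-frame wire overhead (preamble, start-of-frame delimiter and inter-frame gap); the function names are illustrative only.

```python
# Illustrative arithmetic: worst-case frame rate and per-packet time budget.

PREAMBLE_AND_SFD_BYTES = 8   # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12    # minimum idle gap between frames, in byte times


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate for back-to-back frames of the given size."""
    wire_bits = (frame_bytes + PREAMBLE_AND_SFD_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bps / wire_bits


def budget_ns(link_bps: float, frame_bytes: int) -> float:
    """Time available to process one frame before the next one arrives."""
    return 1e9 / packets_per_second(link_bps, frame_bytes)
```

Under these assumptions, a 40 Gbit/s link carries roughly 59.5 million minimum-size packets per second, leaving about 16.8 ns of processing time per packet if a single core had to handle the full stream. That is why the work has to be spread over many cores and hardware offload engines.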
Silicon Vendors Step Up to the Challenge
On the hardware side, a number of silicon vendors are stepping up to the challenge and offering processors, including the Cavium OCTEON II, second-generation Intel Core i7 and NetLogic XLP, that are positioned for packet processing at 40 Gigabit data rates. To increase performance in these devices, a number of tasks are offloaded to hardware acceleration engines, such as encryption/decryption engines, regular expression search engines and data compression/decompression blocks. Such hardware offload engines deliver very high performance; however, software must be written specifically to take advantage of them.
The other technique that silicon vendors use to deliver higher packet-processing performance is providing feature-reduced and performance-optimized operating systems, so-called “bare-metal” operating systems. In this respect, Cavium created the Simple Executive OS and Intel introduced its DPDK software. Applications written for bare-metal operating systems are meant to run on individual processor cores and are highly efficient for packet-processing tasks. Because bare-metal operating systems avoid context switching and interrupts, and applications typically run to completion, they deliver very predictable and repeatable packet-processing performance and latency.
For instance, a simple packet forwarding function written in Cavium Simple Executive can perform packet forwarding with latency under 2μs, and this latency value is highly repeatable. Although bare-metal operating systems support programming in the standard C language and include some basic software libraries, they lack an extensive software library and the protocol support that is available for Linux. Therefore, implementing sophisticated packet processing directly in a bare-metal operating system requires significant programming effort and expertise in processor-specific features.
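The run-to-completion model described above can be sketched as a polling loop in which each packet is handled from start to finish before the next one is taken. This is a conceptual illustration in Python, not Cavium Simple Executive or DPDK code; the packet fields, the four-port mapping and the `max_polls` bound (a real loop would spin forever) are assumptions made for the example.

```python
from collections import deque


def forward(packet: dict) -> dict:
    """Hypothetical per-packet work: decrement TTL and choose an output port."""
    return {**packet, "ttl": packet["ttl"] - 1, "out_port": packet["dst"] % 4}


def run_to_completion(rx_queue: deque, tx_queue: list, max_polls: int) -> int:
    """Busy-poll the receive queue; no interrupts, no context switches.

    Returns the number of packets forwarded.
    """
    forwarded = 0
    for _ in range(max_polls):
        if not rx_queue:
            continue                 # nothing arrived: poll again
        packet = rx_queue.popleft()
        if packet["ttl"] <= 1:
            continue                 # expired packet: drop it
        tx_queue.append(forward(packet))  # finish this packet before the next
        forwarded += 1
    return forwarded
```

Because nothing can preempt the loop, the time spent on each packet depends only on the work done in `forward`, which is what makes latency repeatable in this model.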
Furthermore, applications written in such a way will be tightly coupled to a specific processor, making migration to different architectures painful and effort-intensive. This effectively locks TEMs into a specific processor family, leaving them at the processor vendor’s mercy when it comes to future roadmaps and next-generation processors.
Minimizing Avenues of Risk
Besides programming effort, complexity and dependency on a particular processor family, an equal challenge is risk mitigation. Release schedules for complex processors routinely slip, and the delays of several months commonly seen in rolling out new processors can significantly affect a TEM’s new product rollout schedule and rhythm.
To mitigate these risks, customers are looking for ways to reduce dependency on a particular silicon architecture and to minimize their software programming efforts while still maintaining high performance. Such a goal is difficult, but can be achieved by introducing a hardware-abstraction layer. Ideally the hardware-abstraction layer would provide the customer with a Linux-like application development environment that supports a large number of common Ethernet and IP protocols, and delivers the packet-processing performance that can only be achieved by using a bare-metal operating system and leveraging processor-specific hardware offload engines.
Furthermore, such a hardware-abstraction layer would reduce the customer’s dependency on the underlying processor architecture, thus allowing hardware to be changed without a major application software redesign.
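As a rough illustration of the idea, the sketch below hides a packet-authentication step behind a small interface so that the application code does not change when the backend does. All class and function names are hypothetical, and the "offload" backend is only a stand-in: it computes the result in software where a real backend would submit the buffer to a processor-specific engine.

```python
import hashlib
import hmac
from abc import ABC, abstractmethod


class CryptoEngine(ABC):
    """Hypothetical abstraction-layer interface seen by the application."""

    @abstractmethod
    def authenticate(self, key: bytes, payload: bytes) -> bytes:
        """Return an authentication tag for the payload."""


class SoftwareCrypto(CryptoEngine):
    """Portable backend: plain software HMAC-SHA256."""

    def authenticate(self, key: bytes, payload: bytes) -> bytes:
        return hmac.new(key, payload, hashlib.sha256).digest()


class OffloadCrypto(CryptoEngine):
    """Stand-in for a processor-specific backend (assumption, not a real API)."""

    def __init__(self) -> None:
        self.jobs_submitted = 0

    def authenticate(self, key: bytes, payload: bytes) -> bytes:
        self.jobs_submitted += 1  # a real backend would queue a hardware job here
        return hmac.new(key, payload, hashlib.sha256).digest()


def tag_packet(engine: CryptoEngine, key: bytes, payload: bytes) -> bytes:
    """Application code: identical regardless of which backend is plugged in."""
    return payload + engine.authenticate(key, payload)
```

Moving to a different processor family would then mean supplying another `CryptoEngine` implementation, while `tag_packet` and the rest of the application stay untouched.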
Facilitating Packet Processing Through Software
Faced with growing data volumes and escalating processing complexity, the priority for developers is typically to maximize performance and throughput. In terms of the performance challenges for next-generation networks, a standard networking stack uses services provided by the operating system and is subject to significant overheads associated with functions such as pre-emptions, threads, timers and locking. These processing overheads are imposed on each packet passing through the system, resulting in a major performance penalty for overall throughput.
Furthermore, although some improvements can be made to an operating system stack to support multicore architectures, performance fails to scale linearly over multiple cores for complex packet processing such as that required by 4G. A processor with eight cores, for example, may not process packets significantly faster than one with two cores for GTP-U-to-GRE encapsulations.
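One way to see why adding cores may help so little is Amdahl's law, which bounds the speedup when part of the work cannot run in parallel. The sketch below is an illustration only: attributing the poor scaling to a serialized fraction (for example, time spent holding shared locks) is an assumption, and the 50 percent figure is invented for the example rather than measured.

```python
def amdahl_speedup(cores: int, serial_fraction: float) -> float:
    """Upper bound on speedup over one core when a fraction of work is serialized."""
    if cores < 1 or not 0.0 <= serial_fraction <= 1.0:
        raise ValueError("cores must be >= 1 and serial_fraction in [0, 1]")
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# Hypothetical stack in which half of the per-packet work is serialized.
two_cores = amdahl_speedup(2, 0.5)
eight_cores = amdahl_speedup(8, 0.5)
```

With these assumed numbers, two cores give about 1.33 times the single-core throughput and eight cores about 1.78 times, so quadrupling the core count buys only a third more throughput.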
All in all, a standard operating system stack does a poor job of exploiting the potential packet-processing performance of a multicore processor. A superior solution is provided by specialized packet-processing software optimized for multicore architectures (Figure 2). In a well-designed implementation, the networking stack is split into two layers. The lower layer, typically called the fast path, processes the majority of incoming packets outside the operating system environment and without incurring any of the operating system overheads that degrade overall performance. Only those rare packets that require complex processing are forwarded to the operating system networking stack, which performs the necessary management, signaling and control functions.
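The two-layer split can be sketched as follows. This is a simplified model, not the implementation of any particular product: the `Packet` fields, the flow table keyed by destination address and the rules for what counts as an exception packet are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Packet:
    dst: str
    ttl: int
    is_control: bool = False  # e.g. signaling or routing-protocol traffic


@dataclass
class Dataplane:
    flow_table: Dict[str, int]                    # dst -> output port
    fast_count: int = 0
    slow_queue: List[Packet] = field(default_factory=list)

    def receive(self, pkt: Packet) -> Optional[int]:
        """Return the output port if handled in the fast path, else None."""
        port = self.flow_table.get(pkt.dst)
        if pkt.is_control or port is None or pkt.ttl <= 1:
            # Exception packet: hand it to the operating system stack.
            self.slow_queue.append(pkt)
            return None
        pkt.ttl -= 1              # common case: minimal work, no OS involvement
        self.fast_count += 1
        return port
```

In this model the operating system stack would populate `flow_table` as it learns routes, so that after the first exception packet for a destination, later packets to it stay in the fast path.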

Multicore Processors are Well-Suited
A multicore processor is well-suited to implementing this kind of software architecture. Most of the cores can be dedicated to running the fast path in order to maximize the overall throughput of the system, while only one core is required to run the operating system, the operating system networking stack and the application’s control plane. In practice, the designer will analyze the specific performance requirements for the various software elements in the system (applications, control plane, networking stack and fast path), deciding on the most appropriate allocation of cores to balance the overall system workload.
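The core-allocation decision can be sketched as a small helper. It is a deliberately simple model under stated assumptions: loads are expressed in "cores' worth" of work estimated by the designer, at least one core is always reserved for the operating system and control plane, and the function name and return format are hypothetical.

```python
import math


def allocate_cores(total_cores: int, fast_path_load: float, control_load: float) -> dict:
    """Split cores between the control plane (OS, stack, apps) and the fast path.

    Loads are estimates in units of one fully busy core.
    """
    if total_cores < 2:
        raise ValueError("need at least one control core and one fast-path core")
    control = max(1, math.ceil(control_load))   # always keep one core for the OS
    control = min(control, total_cores - 1)     # leave at least one fast-path core
    fast_path = total_cores - control
    return {
        "control": control,
        "fast_path": fast_path,
        "fast_path_headroom": fast_path - fast_path_load,
    }
```

For example, on an eight-core device with an estimated fast-path load of 5.5 cores and a control load of 0.6 cores, this helper assigns one core to the control plane and seven to the fast path, leaving 1.5 cores of headroom. A negative headroom value would signal that the workload does not fit on the device.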
Until recently, the only restriction when configuring the platform was that, since the cores running the fast path were running outside the operating system, they had to be dedicated exclusively to the fast path and not shared with other software. With the recent evolution towards a hybrid fast-path model, the system can now be reconfigured dynamically as traffic patterns change in order to share the CPU resources allocated to the control plane and the fast path.
Splitting the networking stack in this way has no impact on the functionality of application software, which interfaces to the same operating system networking stack as before. Existing applications do not need to be rewritten or recertified, but they run significantly faster because the underlying packet processing is accelerated through the fast-path environment. In a typical 4G application, such as a packet data network gateway (PGW) or serving gateway (SGW), when the standard operating system networking stacks are replaced by optimized packet-processing software based on the fast-path concept, the networking performance of the processor subsystems will typically increase by seven to ten times (Figure 3).
Achieving System Performance Goals
This massive increase in performance means the system as a whole will likewise be able to manage seven to ten times more users with the same hardware. This type of fast-path-based implementation can allow the designer to meet system throughput goals that may have been unachievable on a single multicore processor when using a standard operating system stack. Because the same hardware serves more users, these gains in system performance also translate directly into improvements in energy efficiency and cost per user.
Considering the high performance that customers need to achieve, performing benchmark testing and performance evaluation is a critical step in the hardware architecture selection process. A typical challenge is that, in order to get meaningful performance test results, the test setup needs to include many protocols and packet processing tasks that will be required by the final application.
For instance, it is fairly easy to create a test setup for a simple packet-forwarding application using Cavium Simple Executive. Such a performance test will deliver high packet-forwarding rates and can be used as an upper limit on the realistic application performance that can be achieved. However, adding protocols such as Network Address Translation and IPsec increases the software programming effort dramatically.
Hardware/Software Optimization Enables 40Gbit – and Beyond
Due to increasing demands from users and service providers, the telecommunication industry is aggressively moving towards 40Gbit data rates in ATCA systems. However, packet processing at such data rates is only feasible by using hardware-specific acceleration features and fully utilizing all available processor cores through bare-metal operating systems. Such architectures pose a number of significant risks in terms of performance estimation, silicon availability and application-software migration from one processor family to another. An appropriate combination of hardware and software can mitigate these risks by enabling customers to run accurate performance tests, reduce hardware-specific software programming effort and time-to-market, and migrate application software painlessly from one processor family to another.
Gene Juknevicius is a technologist and architect at GE Intelligent Platforms. He has participated in the PICMG, AMC and MicroTCA committees, is currently an active member of the SCOPE Alliance and is responsible for new product definition and architecture at GE Intelligent Platforms. He received his M.S. degree in electrical engineering from Stanford University.
Charlie Ashton is VP of marketing and business development at 6WIND and is responsible for 6WIND’s global marketing initiatives and partnerships worldwide with semiconductor companies, subsystem providers and embedded software companies. Charlie has extensive experience in the embedded systems industry, with his career including leadership roles in both engineering and marketing at software, semiconductor and systems companies. He led the introduction of new products and the development of new business at Green Hills Software, Timesys, Motorola (now Freescale), AMCC, AMD and Dell.






