Simplifying Multicore Migration
Moving to Multicore
Multi-chip, multiprocessing and multi-execution-path technologies have been available for decades and have been used to solve complex problems and improve performance. However, because single-core technology could address the vast majority of compute needs with constantly increasing performance, the tools and the buzz were centered on single-core processors.
The computing world is changing, as it should. Three factors continue to be the key drivers:
- Power: Raising the clock frequency of a single core is approaching physical limits. Whether the product is battery-operated or tethered by a power cord, developers are seeking to improve the performance/watt ratio.
- Performance: Application features and functionality will continue to expand, which fuels demand for more compute performance.
- Consolidation: More capabilities continue to be integrated on a single chip. Applications currently running on multiple processors will move to multicore, and applications currently running on multicore will move to denser multicore chips.
The programming model chosen for multicore applications must have the flexibility to address all three factors. While addressing power has been the focus of many single-core applications moving to multicore, the discussion presented addresses the latter two factors – consolidation and performance.
The broad range of platform technologies offers the system designer choices for the platform that best fits the application’s requirements. A sample of the hardware options includes homogeneous and heterogeneous multicore processors, processors with accelerator integration, and processors with optimized mechanisms for interprocessor (IPC)/core communications (transports). The software paradigms executing on multicore processors also have variants such as symmetric multiprocessing (SMP), asymmetric multiprocessing (AMP), CPU affinity and virtualization, plus combinations and permutations of these models.
The programming model used in any multicore software paradigm needs to be consistent, flexible enough to scale, and able to enable optimizations across the various hardware and software multicore and multiprocessor platforms that are available today or being planned.
The Multicore Association (MCA) (see http://www.multicore-association.org) communications API (MCAPI) (see http://www.multicore-association.org/workgroup/mcapi.php) and TI DSPs are used to demonstrate a programming model for applications consolidating to denser processors and then looking for better ways to execute the application in a multicore environment.
MCAPI has a concept called “node” which is defined as a logical abstraction that can be mapped to many entities such as process, thread, hardware accelerator or a processor core. For this example, a node will be the entire application that executes on a single DSP.
The application is encapsulated with MCAPI calls and we assume that the IPC transports are available.
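To make the idea of encapsulation concrete, the following is a minimal sketch in Python rather than a use of the MCAPI C API. The names `InMemoryTransport` and `Node` are hypothetical and exist only for illustration; the in-memory queues stand in for whatever IPC transport the platform provides. The point it shows is that the application logic talks only to its node wrapper, never to the transport directly.

```python
from collections import deque


class InMemoryTransport:
    """Hypothetical stand-in for an IPC transport (not an MCAPI type)."""

    def __init__(self):
        self._queues = {}

    def register(self, node_id):
        # Each node gets its own receive queue.
        self._queues[node_id] = deque()

    def send(self, dst, payload):
        self._queues[dst].append(payload)

    def recv(self, node_id):
        queue = self._queues[node_id]
        return queue.popleft() if queue else None


class Node:
    """Application logic wrapped so that all I/O goes through the transport."""

    def __init__(self, node_id, transport, handler):
        self.node_id = node_id
        self._transport = transport
        self._handler = handler
        transport.register(node_id)

    def send(self, dst, payload):
        self._transport.send(dst, payload)

    def step(self):
        """Handle one pending message; return the handler's result or None."""
        message = self._transport.recv(self.node_id)
        if message is None:
            return None
        return self._handler(message)


transport = InMemoryTransport()
producer = Node("N1", transport, handler=lambda m: m)
consumer = Node("N2", transport, handler=lambda samples: sum(samples))

producer.send("N2", [1, 2, 3])
result = consumer.step()  # the consumer sums the samples it received
```

Because the handlers never reference the transport, swapping the in-memory queues for a different transport would not require changes to the application logic, which is the property the encapsulation is meant to provide.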
Now that our application is set up for MCAPI as encapsulated modules, the first pass on testing could be performed using any of the following environments:
- Windows, provided that non-Windows calls are handled, for example by an OS simulator running on Windows
- Hardware simulator where communications links are simulated
- Existing single-core hardware
- Multicore hardware
Initial testing on target hardware can be challenging because of the number of variables introduced, and this subject is better handled by the debug experts. A simulated environment provides more control. Eventually, however, the application will be moved to the target platform.
Applying the Programming Model
Let’s look at moving a single multi-processor application to a multicore processor using Texas Instruments’ (TI) DSPs such as the TMS320C6474™ and then the TMS320C6678™ DSP.
The existing application runs on six single-core DSPs. The source code for each DSP is encapsulated using MCAPI calls, so the code for each DSP becomes an MCAPI node. By using Poly-Platform’s graphical menus and wizards, the encapsulation is performed quickly and completely through code insertion and code generation.
The MCAPI-enabled application communicates with other DSPs in the architecture, which may be single-core DSPs and/or a multicore C6474. That is, the same application source code, as node(s), may be run as:
- A single-core DSP communicating to other single-core DSPs
- A multicore DSP communicating to cores on the same chip
- A single-core DSP communicating to cores on a multicore DSP processor
- A multicore DSP communicating to local cores or DSPs on a different multicore processor
By extension, when moving to denser processors, that is, processors with more cores on a chip, the MCAPI-enabled application can move without source code changes.
In the previous example, note that the application had six nodes that map one-to-one in a dual C6474 configuration. When moving to the C6678, which has eight cores, the designer/architect has two extra cores. The additional two cores could be used as backup or to expand functionality or to improve the overall system performance. For this system, the two cores will be used to improve performance.
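The one-to-one mapping and the two spare cores can be sketched as a small topology table. This is an illustration only: the chip labels, the node names `N1` to `N6`, and the helper `map_one_to_one` are assumptions made for the example and are not Poly-Platform or MCAPI constructs. The core counts are three per C6474 and eight for the C6678.

```python
# Cores per chip for the two target configurations.
PLATFORMS = {
    "dual_c6474": {"C6474_A": 3, "C6474_B": 3},
    "single_c6678": {"C6678": 8},
}

# Hypothetical names for the six application nodes.
NODES = ["N1", "N2", "N3", "N4", "N5", "N6"]


def map_one_to_one(nodes, chips):
    """Assign each node to the next free (chip, core) slot, in order.

    Returns the node-to-slot mapping and the list of unused slots.
    """
    slots = [(chip, core) for chip, cores in chips.items() for core in range(cores)]
    if len(nodes) > len(slots):
        raise ValueError("not enough cores for a one-to-one mapping")
    mapping = dict(zip(nodes, slots))
    spare = slots[len(nodes):]
    return mapping, spare


mapping_c6474, spare_c6474 = map_one_to_one(NODES, PLATFORMS["dual_c6474"])
mapping_c6678, spare_c6678 = map_one_to_one(NODES, PLATFORMS["single_c6678"])
```

Only the platform description changes between the two calls; the node list, which stands in for the application source, is the same in both.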
Optimizing the Application for Multicore
The MCAPI-enabled application nodes provide flexibility in deploying the nodes throughout the system. Nodes can be grouped onto cores for systems with a memory-protected operating environment and/or replicated to improve systems performance. By grouping the nodes, the system architect may be able to balance the operational load by having nodes with lesser resource needs share a resource. Should a node need substantially more resources, the node could be assigned to a resource and replicated, which would allow more data sets to be processed in parallel.
To quickly try alternate node mappings to cores, the designer could use Poly-Platform’s graphical user interface to reconfigure the topology, build and run the new topology using the same application source code. The communications infrastructure is quickly re-modeled and code is generated for the new topology.
The following diagram shows an application with four nodes mapped to three configurations. In the top configuration, all nodes operate on a single core. In the two dual-core configurations, the nodes are allocated differently. The middle mapping shows an optimized configuration in which the resource load of the first two nodes is approximately the same as that of the latter two nodes. Should the architect find that the nodes have very different resource loads, the lower configuration may be the best fit for the application. It is important to note that the nodes are redeployed using the same application source code.
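One way to reason about which mapping fits is to compare the load on the busiest core for each candidate. The sketch below does that for three candidate mappings of four nodes. The load figures are invented for illustration, and the exact node grouping in the lower configuration is an assumption (one heavy node alone, three lighter nodes sharing the other core).

```python
def core_loads(mapping, loads):
    """Sum the load of the nodes assigned to each core."""
    return [sum(loads[node] for node in nodes) for nodes in mapping]


def best_mapping(candidates, loads):
    """Return the candidate name whose busiest core carries the least load."""
    return min(candidates, key=lambda name: max(core_loads(candidates[name], loads)))


# Each mapping is a list of cores; each core is a list of node names.
CANDIDATES = {
    "single_core": [["N1", "N2", "N3", "N4"]],
    "dual_split_2_2": [["N1", "N2"], ["N3", "N4"]],
    "dual_split_1_3": [["N1"], ["N2", "N3", "N4"]],
}

# Hypothetical per-node loads, in arbitrary units.
even_loads = {"N1": 30, "N2": 20, "N3": 25, "N4": 25}
skewed_loads = {"N1": 60, "N2": 15, "N3": 15, "N4": 20}

choice_even = best_mapping(CANDIDATES, even_loads)      # 2+2 split balances 50/50
choice_skewed = best_mapping(CANDIDATES, skewed_loads)  # isolating N1 gives 60/50
```

Minimizing the busiest core is only one possible criterion; a real system would likely also weigh communication cost between nodes, which this sketch ignores.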
Returning to the application running on TI DSPs, the two additional cores could be used by the node(s) that are more resource intensive. Suppose that N2 is substantially more resource intensive than the other nodes. The architect may then replicate N2 and deploy the replicas on the seventh and eighth cores, which should improve system performance. The new configuration is seen in the following diagram.
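Replication implies that incoming data sets must be distributed across the N2 instances. A minimal sketch of one possible policy, round-robin dispatch, follows. The class `ReplicatedNode`, the core indices, and the placeholder workload are all hypothetical, and the code runs sequentially, so it models only the distribution of work, not the parallel speedup.

```python
from itertools import cycle


class ReplicatedNode:
    """Distributes data sets across replicas of one node, round-robin."""

    def __init__(self, name, cores, work):
        self.name = name
        self._work = work
        self._next_core = cycle(cores)
        # Record which results each replica produced.
        self.processed = {core: [] for core in cores}

    def submit(self, data_set):
        """Send one data set to the next replica and return (core, result)."""
        core = next(self._next_core)
        result = self._work(data_set)
        self.processed[core].append(result)
        return core, result


# Assumption: N2 keeps its original core (index 1) and gains replicas on the
# two spare cores (zero-based indices 6 and 7).
n2 = ReplicatedNode("N2", cores=[1, 6, 7], work=lambda data: sum(data))

assignments = [n2.submit([i, i + 1]) for i in range(6)]
```

With six data sets and three instances, each instance handles two data sets; other policies, such as dispatching to the least-loaded replica, would fit the same structure.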
Alternatively, the additional cores could be used to improve a single node’s performance. As the developer learns more about the application’s behavior, a resource-intensive segment of a node may be identified. That segment may be separated from the original node, encapsulated as described earlier, and made into a new node. The new node could then be moved to the additional core(s), where an improvement in performance could be realized.
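The separation step can be illustrated with a small before-and-after sketch. The functions `preprocess` and `heavy_segment` are hypothetical placeholders for application code, and a standard-library queue stands in for the communication link between the two resulting nodes. The sketch shows only that the split preserves the result, not how the new node is deployed.

```python
from queue import Queue


def preprocess(frame):
    """Placeholder for the lightweight part of the original node."""
    return [value * 2 for value in frame]


def heavy_segment(samples):
    """Placeholder for the resource-intensive segment."""
    return sum(value * value for value in samples)


def monolithic_node(frame):
    """Original node: both segments run back to back on one core."""
    return heavy_segment(preprocess(frame))


def front_node(frame, outbox):
    """After the split: the original node forwards its intermediate data."""
    outbox.put(preprocess(frame))


def heavy_node(inbox):
    """After the split: the new node runs only the heavy segment."""
    return heavy_segment(inbox.get())


link = Queue()  # stands in for the channel between the two nodes
front_node([1, 2, 3], link)
split_result = heavy_node(link)
original_result = monolithic_node([1, 2, 3])
```

Checking that the split and monolithic versions agree on the same input is a simple regression test to run before the new node is moved to another core.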
The approach of moving the application to a multicore platform and then identifying areas for performance improvements could be called “divide and conquer” – a common approach to engineering problems. With the divide-and-conquer approach, moving to multicore becomes manageable because the developers understand the application partitioning. Debugging and further optimizations are more readily managed too.
Programming Model for Today and for the Next Generation
Multicore platforms are in use today, and more applications will be moving to them. Most applications have been written as serial tasks or processes, which should be well-defined execution blocks. Architects can employ the MCAPI-based programming model described in this paper to move their applications to multicore platforms with minimal effort.
An MCAPI-enabled application also has the benefit of moving quickly to the next-generation multicore platform.
For this discussion, the nodes were deployed to homogeneous cores. The programming model readily extends to heterogeneous cores, processor platforms and operating systems. Thus, a node could be allocated to the core that best fits the node’s workload. As an example, an application could have nodes running on a server platform communicating with nodes running on embedded processors.
The economic benefit for the manufacturer is that the same application source code could be used across a product line in which the underlying platforms differ in compute power. An entry-level product could have a platform with a single core or only a few cores. A mid-range product could have more compute power. The high-end product could have a heterogeneous platform. Regardless of the product’s underlying platform, the same application source would be used.
By adopting the programming model based on the MCA standards, the developer is able to start using multicore platforms today. As more is learned about the application’s operation, the application can be modified to operate as multiple nodes, resulting in improved performance and improved scalability for platforms with more cores and for next-generation platforms. The programming model works for today’s platforms and prepares the application for tomorrow’s platforms.
Ted Gribb is PolyCore Software’s vice president of sales. Prior to joining PolyCore Software, Gribb had sales management positions for Wind River, Diab Data and Mentor Graphics. Previously, he held management positions in software engineering. Gribb received a Bachelor of Science degree in mathematics from DeSales University.