Multicore Processor Performance Analysis: Revealing the Truth
Figure 1. Two dual-core, shared-memory architectures depicted with different memory subsystems. There are performance advantages and disadvantages associated with each approach.

The multicore technology era is solidly upon us, and nearly every processor vendor is offering or developing products and architectures to support the imminent demand. Meanwhile, system developers nervously contemplate their options as they realize that adopting multicore technology presents as many challenges as it does benefits. One of those challenges lies in analyzing the potential performance of a processor or system-on-a-chip (SoC) that is based on multicore technology.

Not surprisingly, putting multiple execution cores into a single processor (or continuing to increase clock frequency) does not guarantee a proportional multiple of processing power. Furthermore, you have no assurance that a multicore processor will deliver a dramatic increase in your application’s throughput. Despite the pessimism, the right combination of processor and programming techniques can scale well with the number of cores. However, this will depend on how the processor is designed and how you write your application program.

Industry-standard benchmark tests from EEMBC demonstrate a multicore system’s behavior in a wide variety of scenarios, which helps model how your own system will function. Results of these benchmarks reveal some interesting truths that should make you pay close attention to the combined effects of the multicore processor, memory subsystem, operating system, and other system-level characteristics.

Multicore Processor Design and Performance Analysis

One factor to consider when analyzing multicore performance is scalability where contexts exceed resources. In familiar terms, assume that a program is composed of a varying number of threads (it’s not unreasonable to have hundreds of threads in a relatively complex program).
If the number of threads exactly matched the number of processor cores, performance could conceivably scale linearly. Realistically, however, the number of threads will exceed the number of cores, and performance will depend on other factors such as memory and I/O bandwidth, inter-core communications, OS scheduling support, and synchronization efficiency.

The memory bandwidth of a multicore processor depends on the memory subsystem design, which in turn depends on the underlying multicore architecture. Multicore implies either a shared- or distributed-memory architecture. Shared memory, typically associated with homogeneous multicore systems, is accessed through a bus and controlled by some locking mechanism to prevent multiple cores from accessing the same memory simultaneously. It provides a straightforward programming model because each processor can directly access the memory (Figure 1). However, the shared memory structure can become a bottleneck when too many cores try to access it simultaneously, which means this memory architecture doesn’t scale well as the number of cores increases.

Unless your application is running on ‘bare metal’ (i.e., directly on the processor hardware without operating system support), OS scheduling will also play a big role in determining how a multicore implementation behaves. Scheduling refers to the way processes are assigned priorities in a priority queue, but it is also determined by the availability of on-chip processing resources (which depends partly on the OS’s ability to monitor the availability of hardware resources such as cores or hyperthreads).

High-Level Multicore Benchmark Categories

There are several ways in which a multicore processor can be utilized. These include asymmetric multiprocessing (AMP), functional partitioning, and parallelization. Among other things, AMP provides a centralization of distributed processing, where all cores can run entirely different and theoretically unrelated tasks.
Essentially, this is the same as multiple separate machines running in one package. Even though there may be very little interaction between the cores, overall performance will be limited by the system-level memory bandwidth.

Functional partitioning is possible with multiprocessors (i.e., separate processor chips), but a system can potentially benefit from the cores’ proximity in a multicore chip, especially for data sharing between cores. An example of functional partitioning is where core_1 runs a security application, core_2 runs a routing algorithm, core_3 enforces policy, and so on. Depending on the workload presented by each of these functions, additional functions can possibly be added or swapped in as needed.

Finally, multicore technology can be used to increase an application’s parallelization, or concurrency. This is perhaps the greatest benefit of multicore technology from a performance perspective, but it also raises the most challenges because it requires a careful analysis of how threads share data. Although inter-thread communications will benefit from the cores’ proximity, cache utilization and system-level shared resources will also have an impact.

Multicore benchmarks need to take all of these usage models into account, but this is more easily said than done. Just as there’s never been a simple way to measure the performance of “normal” single-core processors and reduce it to a single measure of “goodness,” it’s far more difficult to measure the performance of a multicore device and produce a single figure of merit.

An Initial Exploration of Results

One of the first things we’ve learned working with multicore benchmarks is that the combined interactions of all the factors outlined above result in marked performance differences even among quite similar platforms.
Tests on two dual-core processors, for example, show quite different rates of speed-up depending on the number of concurrent streams and which specific benchmarks are running (Figure 2). From this information, you can tailor your software to align with the benchmark characteristics that yielded the highest performance on that specific processor.
Figure 2. Comparing two dual-core platforms demonstrates how results can vary and depend on multiple factors.

Benchmarks for Different Types of Parallelism

From a parallelism perspective, a multicore benchmark must target three fundamental areas of concurrency: data throughput, computational throughput, and data decomposition.

Benchmarks that analyze data throughput show how well a solution scales as the number of data inputs grows. This can be accomplished by duplicating the same computation and applying it to multiple different datasets. A real-world example of this method is the decoding of multiple different JPEG images (as may occur when viewing a web page). It is interesting to determine the point at which performance begins to degrade as the number of data inputs increases. In developing such a benchmark test, the biggest challenge is that the code must be thread-safe to support simultaneous execution by multiple threads. In particular, it must let multiple threads access the same shared data while ensuring that any given piece of shared data is accessed by only one thread at a time, all without compromising the required throughput.

To demonstrate computational throughput, the approach above can be extended by developing tests that initiate more than one task at a time, implementing concurrency over both the data and the code. This demonstrates the scalability of a solution for general-purpose processing. As an example, consider the execution of MPEG decode(x) followed by MPEG encode(x), which is similar to what you might find in a set-top box where the satellite signal is received, decoded, and encoded into a different-quality signal for storing on the hard disk. As a benchmark, this application requires synchronization between the contexts and a method for determining when the benchmark completes.
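The decode-then-encode scenario can be sketched as a two-stage pipeline. This is only an illustration, not MultiBench code: the `decode` and `encode` functions are trivial stand-ins, and the names `run_pipeline` and `SENTINEL` are hypothetical. It shows one common way to synchronize two contexts (a bounded queue) and to determine completion (an end-of-stream marker plus joining both threads).

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker, not part of any real codec API


def decode(frame):
    """Stand-in for a real decoder: turn bytes into a list of sample values."""
    return list(frame)


def encode(samples):
    """Stand-in for a real encoder: halve each sample to mimic a lower-quality signal."""
    return bytes(s // 2 for s in samples)


def run_pipeline(frames):
    """Decode and re-encode every frame using one thread per stage."""
    decoded = queue.Queue(maxsize=4)  # bounded hand-off keeps the stages in step
    results = []                      # written only by the encode stage

    def decode_stage():
        for frame in frames:
            decoded.put(decode(frame))  # blocks while the queue is full
        decoded.put(SENTINEL)           # signal that no more work is coming

    def encode_stage():
        while True:
            samples = decoded.get()     # blocks until the decoder delivers
            if samples is SENTINEL:
                break
            results.append(encode(samples))

    stages = [threading.Thread(target=decode_stage),
              threading.Thread(target=encode_stage)]
    for stage in stages:
        stage.start()
    for stage in stages:
        stage.join()  # the run is complete only when both contexts have finished
    return results
```

Calling `run_pipeline([b"\x02\x04", b"\x0a"])` returns the re-encoded frames in their original order. In CPython, threads like these illustrate the synchronization structure rather than real speed-up; a compiled benchmark would typically use native threads for the same pattern.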
Data decomposition is where an algorithm is divided into multiple threads that work on a common data set, demonstrating support for fine-grain parallelism. In this situation, the algorithm could be working on a single audio and video data stream, but the code can be split so as to distribute the workload among different threads, each of which can be handled by a different processor core. These threads are distributed based on the number of available processor cores. Of course, efficient processing is possible only because the cores within the multicore device are close together and can support high-bandwidth, low-latency data transfers.

The Effects of Concurrency

EEMBC developed several hundred workloads for its first suite of multicore benchmarks, referred to as MultiBench. Taken together, these workloads provide a very comprehensive view of multicore processor behavior. Some results derived from the data throughput tests can serve as examples of the methods described above.

Let’s look at how two different brands of quad-core processors perform on the benchmark suite. Both chips share the same x86 instruction-set architecture (in other words, they’re both PC-compatible processors), both have four cores within a single device, and both are connected to 4 GB of 667-MHz DDR2 memory. About the only differences are the processors themselves and the way they’re connected to memory, which is dictated by the processor vendor. They are, for the most part, similar and competitive chips. Yet they exhibit different behaviors on the same set of benchmarks.

The test results shown in the following charts are for the SWM (SingleWorkerMark), MWM (MultiWorkerMark), and MIM (MultiItemMark) benchmarks. For now, the only significant difference is that the SWM test is single-threaded, while the other two are multithreaded benchmarks.
Figure 3. A single “Brand X” x86 quad-core processor with a single memory subsystem. Note: SWM (SingleWorkerMark), MWM (MultiWorkerMark), and MIM (MultiItemMark) are EEMBC’s code names for new scores that will represent the aggregation of different groups of benchmark scores.

Figure 3 illustrates how the processor from “Brand X” performs on three different MultiBench tests as workloads increase. Looking at the horizontal scale, we can see that the workload increases from one context (on the left of the chart) up to 20 contexts (on the right of the chart). The vertical axis on this and subsequent charts has been scaled so that the performance of a single context is always 1.0. This makes it easier to see how performance scales, or fails to scale, with increasing workloads.

The good news is that the throughput of the Brand X quad-core processor increases as the workload increases. If it didn’t improve, there’d be no point in using a multicore processor in this application. The bad news is that it doesn’t increase linearly. As we can see, the maximum performance with 20 contexts is just shy of 3x the baseline performance with one context. So with four processor cores working on 20 tasks, overall throughput roughly triples (which many would consider a reasonable performance increase). However, performance on the multithreaded MIM test is a bit disappointing, maxing out at less than 2.0x the baseline performance. The other bit of good news is that performance at least doesn’t decrease with increasing workloads. As we’ll see, this can sometimes happen.

Looking now at Figure 4, we see the exact same tests conducted on a nearly identical system but with a competing “Brand Y” processor. Unlike the previous example, there’s a pronounced “kink” in the results graph.
This processor’s performance on the MWM benchmark increases linearly up to four contexts (one context per processor core) but then actually declines as more contexts are added. Its performance on the SWM and MIM benchmarks is somewhat more intuitive: performance gradually increases but then plateaus at around 8 to 12 contexts.
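The chart normalization (single-context performance fixed at 1.0) and the search for a “kink” can be expressed in a few lines. The sketch below is illustrative only: the function names are hypothetical, and the throughput values are made-up numbers chosen to mimic the shape described for the MWM curve, not measured MultiBench results.

```python
def normalize_to_baseline(throughput_by_contexts):
    """Scale raw throughput so that the single-context result equals 1.0."""
    baseline = throughput_by_contexts[1]
    return {n: value / baseline
            for n, value in sorted(throughput_by_contexts.items())}


def peak_contexts(scaled):
    """Return the context count with the highest normalized throughput."""
    return max(scaled, key=scaled.get)


def declines_after_peak(scaled):
    """True if any context count beyond the peak scores lower than the peak."""
    peak = peak_contexts(scaled)
    return any(scaled[n] < scaled[peak] for n in scaled if n > peak)


# Made-up raw scores (arbitrary units), keyed by number of contexts.
raw = {1: 50.0, 2: 100.0, 4: 200.0, 8: 150.0}
scaled = normalize_to_baseline(raw)
print(scaled)                       # normalized curve, 1.0 at one context
print(peak_contexts(scaled))        # context count where throughput peaks
print(declines_after_peak(scaled))  # whether the curve has a "kink"
```

With these made-up inputs the curve peaks at four contexts and then falls, which is the pattern a developer would want to detect before choosing how many contexts to run on a given processor.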
Figure 4. A single “Brand Y” x86 quad-core processor with a single memory subsystem.

Figure 5 shows our first dual quad-core setup: two processor chips, each with four cores, for a total of eight processor cores. As in the first test, the system has 4 GB of DDR2 memory, but in this case the two processors share it. All the memory is local to one of the processors, and the other processor accesses it through a shared link between the two chips. This gives one of the processors a built-in advantage, although both processors can access all of the available memory.

Here we see that performance scales better than it did with the single-processor (four-core) system from Figure 3. Peak performance on SWM is about 3.75x the baseline, much improved from before. It’s nowhere near 8x, but it’s a substantial improvement nonetheless. The two multithreaded benchmarks (MWM and MIM) also show steady growth, even suggesting that performance might have grown further had the workload been dialed up even more.
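One way to put the “3.75x, not 8x” observation in perspective is parallel efficiency, a standard metric (speed-up divided by core count) that the benchmark charts themselves do not report. The sketch below applies it to the approximate figures quoted in the text; rounding “just shy of 3x” up to 3.0 is an assumption, and the function name is hypothetical.

```python
def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear scaling achieved: speed-up divided by core count."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return speedup / cores


# Approximate peak speed-ups quoted in the text (not exact measurements).
single_chip = parallel_efficiency(3.0, 4)   # one quad-core chip, Figure 3
dual_chip = parallel_efficiency(3.75, 8)    # two quad-core chips, Figure 5

print(f"single quad-core: {single_chip:.2f}")
print(f"dual quad-core:   {dual_chip:.2f}")
```

Under these assumptions, absolute throughput rises when the second chip is added, but each core contributes less than it did in the single-chip system, which is consistent with the shared memory subsystem acting as a limiting factor.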
Figure 5. Dual “Brand X” x86 quad-core processors sharing a single memory subsystem.

Now the Work Begins

The results that we’ve shown here merely scratch the surface of the data that MultiBench can produce. The good news is that, with an industry standard, most (if not all) vendors are using the same methodology to measure the performance of their multicore processors (at least the ones with SMP architectures). I should also point out that we purposely avoided indicating specifically which processors were used to generate this data. Our purpose here was to demonstrate the variety of results and to convince you that no single number is enough to determine a multicore processor’s performance level.

As the multicore revolution progresses, we can expect to see the number of cores per chip roughly double with each processor generation, making it even more important to understand multicore processor behavior. And the work doesn’t stop with this first generation of MultiBench. EEMBC is currently developing subsequent versions of benchmark suites that will help analyze heterogeneous processors (i.e., SoCs), as well as Application Specific Standard Benchmarks (ASSBs) that will perform tests based on real-world scenarios. We can only hope that all this effort will help to provide you with the right tools to make good decisions about your next multicore processor project.
Markus Levy is founder and president of EEMBC. He is also president of The Multicore Association and chairman of Multicore Expo. Mr. Levy was previously a senior analyst at In-Stat/MDR and an editor at EDN magazine, focusing in both roles on processors for the embedded industry. He is also a volunteer firefighter.





















