Enterprise-class Printers Deliver Top-notch Performance by Leveraging Combined CPU and GPU
The use of accelerated processing units (APUs) that combine a high-performance multicore CPU and a highly parallel graphics processor on a single chip lets 2D and 3D printers deliver market-leading performance.
High-performance multi-function printers and copiers are designed to provide print-on-demand services and deliver print speeds ranging from 50 to many hundreds of pages per minute. Such systems have, in the past, often required custom application-specific ICs (ASICs) or field-programmable gate arrays (FPGAs) to deliver the high computational throughput needed to scan the image, print the image, and manage/move the paper through the system. However, that is changing as high-performance graphics processors and general-purpose CPUs combine their compute capabilities to deliver leading-edge performance while lowering system costs.
Simpler System Programming
The inner workings of a legacy multi-function printer can loosely be grouped into four subsystems—a control CPU that manages task scheduling, a scan controller (usually an ASIC or FPGA) to manage the scan image pipeline, a print/scan control engine that handles all the mechanics of printing and paper movement, and a print controller (typically an ASIC or FPGA) that prepares the image for printing (Figure 1). Since each section is optimized for the function it performs, printer designers often have to deal with different, possibly custom, processor architectures and multiple programming languages. These multiple engines make programming the system a challenge: a change in one engine's software can affect how the other engines react, requiring software updates across all the subsystems.
To reduce the software overhead and simplify the printer subsystems, graphics engines programmed using OpenCL or other languages can replace some of the custom compute blocks, simplifying the programming through the use of a high-level language rather than the assembly-level or register-transfer level (RTL) coding typically required by a custom compute block. Additionally, thanks to advances in integration and in CPU and graphics-engine performance, the entire processing pipeline can now be reduced to just a chip or two. The integration of the CPU and graphics processor onto a single chip, coupling them together through the implementation of the Heterogeneous System Architecture specification (HSA 1.0), allows a single instruction stream to control both processors, thus dramatically simplifying system programming.
Scalar and Parallel Team Up
The goal of companies implementing HSA is to create applications that seamlessly blend scalar processing on the CPU with parallel processing on the GPU or other parallel processing units—using high-bandwidth shared memory access, thus enabling greater application performance and lower power consumption. The HSA Foundation is defining key interfaces for parallel computation utilizing CPUs, GPUs, DSPs, and other programmable and fixed-function devices, thus supporting a diverse set of high-level programming languages and creating the next generation in general-purpose computing. In fact, as reported by AnandTech and other sources, there is now a C++ compiler that’s HSA compliant. (For more about HSA go to www.hsafoundation.com.)
The use of multicore CPUs closely coupled with highly parallel graphics engines, such as those found in the G- and R-series APUs from Advanced Micro Devices and eventually in devices from other HSA Foundation members, will greatly reduce system complexity. The APUs can function as the control CPU as well as integrate partial or full image scan and print functionality thanks to the high-performance GPU compute elements. Thus, only a small, low-cost ASIC or FPGA might be required to handle the analog front-end functions for the scan and print operations (Figure 2).
Figure 3 shows a scan image processing pipeline for a 2D printer. The pipeline, now implemented inside the APU, takes output from the analog front end and performs filtering, scaling, and color-space conversion to correct the image. The filtering operations executed in this section typically include 3×3, 5×5, or 7×7 filters to reduce image noise, plus scaling for image enlargement or reduction. Both the filtering and scaling are best done on the GPU due to its highly parallel architecture. Similarly, the color-space conversion from the input format to CMYK is also best executed on the GPU.
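Color-space conversion parallelizes well because each pixel is processed independently. The sketch below shows the textbook RGB-to-CMYK formula in Python; it is only an illustration of the per-pixel math, not a production conversion, which would use calibrated ICC color profiles and run as a GPU kernel.

```python
def rgb_to_cmyk(r, g, b):
    """Naive RGB -> CMYK conversion for 8-bit channels.

    Textbook formula only; real printers use calibrated color
    profiles. Each pixel is independent, so a GPU can convert
    the whole image in parallel.
    """
    if (r, g, b) == (0, 0, 0):
        return (0.0, 0.0, 0.0, 1.0)       # pure black: all key ink
    c = 1 - r / 255
    m = 1 - g / 255
    y = 1 - b / 255
    k = min(c, m, y)                       # extract the black (key) component
    return tuple(round((x - k) / (1 - k), 4) for x in (c, m, y)) + (round(k, 4),)
```

Because the same few arithmetic operations run on every pixel with no cross-pixel dependencies, this is exactly the kind of loop that maps one-to-one onto GPU work items.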
The next portion of the pipeline performs various image enhancements such as tone reproduction curve (TRC) adjustment and halftoning. The TRC adjustment modifies the CMYK format to compensate for non-linear tone reproduction. This operation can be handled by the CPU portion of the APU or can be accelerated if the algorithms are run on the GPU. Halftoning algorithms perform error diffusion to create smooth transitions and sharp edges. Such algorithms are usually proprietary to each printer vendor but are readily executed on the GPU portion of the APU. Lastly, the image data is compressed and stored in memory (either RAM or on the system’s hard-disk drive). Lossless compression saves memory space, and by using the GPU the compression and decompression operations have no impact on printer performance.
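Vendor halftoning algorithms are proprietary, as noted above, but the classic public Floyd-Steinberg error-diffusion kernel illustrates the principle: quantization error at each pixel is distributed to its unprocessed neighbors so that average tone is preserved. This Python sketch works on a grayscale image for clarity; a real pipeline would halftone each CMYK plane.

```python
def floyd_steinberg(pixels):
    """1-bit halftoning via Floyd-Steinberg error diffusion.

    pixels: list of rows of grayscale values in 0..255.
    Returns rows of 0/255. Uses the classic 7/16, 3/16,
    5/16, 1/16 error-distribution weights; illustration only.
    """
    h, w = len(pixels), len(pixels[0])
    img = [list(row) for row in pixels]        # working copy accumulates error
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            old = img[y][x]
            new = 255 if old >= 128 else 0     # threshold to 1 bit
            out[y][x] = new
            err = old - new                    # quantization error to diffuse
            if x + 1 < w:
                img[y][x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1][x - 1] += err * 3 / 16
                img[y + 1][x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1][x + 1] += err * 1 / 16
    return out
```

Note that classic error diffusion carries a serial dependency along the scan order; GPU implementations typically process the image in a wavefront or block-parallel fashion to recover parallelism.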
The printer’s print pipeline processes image data in memory to create a printed page (Figure 4). The APU’s CPU can take over and manage the print pipeline, working with data in PostScript, Adobe PDF, or PCL3 print languages. The first step in the sequence performs vector image processing, which consists of parsing the data, creating an object list, and then rendering the objects. Parsing is essentially the lossless decompression of compressed vector data; it is a mostly serial process and is best done on the CPU. Similarly, the object-list generation is typically combined with the parsing operation and is also best done on the CPU.
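The parse and object-list steps can be sketched with a toy page-description format. Real PDL interpreters (PostScript, PDF, PCL) are far more involved; this hypothetical two-command format only shows why the step is inherently serial: each record must be consumed in order before the objects can be handed to the renderer.

```python
def build_object_list(stream):
    """Parse a toy page-description stream into an object list.

    Hypothetical format for illustration: one command per line,
    either "rect x y w h" or "line x1 y1 x2 y2". The loop is
    serial (CPU work); the resulting list feeds the renderer.
    """
    objects = []
    for lineno, line in enumerate(stream.strip().splitlines(), 1):
        parts = line.split()
        kind, args = parts[0], [float(v) for v in parts[1:]]
        if kind == "rect" and len(args) == 4:
            objects.append({"type": "rect", "bbox": tuple(args)})
        elif kind == "line" and len(args) == 4:
            objects.append({"type": "line", "points": tuple(args)})
        else:
            raise ValueError(f"line {lineno}: unknown command {line!r}")
    return objects
```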
The rendering operation can leverage the highly parallel architecture of a GPU, which significantly speeds up the calculations when programmed using OpenCL or other compute languages. The combined CPU-GPU workload and large image sizes make this rendering step an excellent fit for the HSA architecture. When printing large documents, the printer will often include a compression-decompression block that allows it to store long documents. Lossless compression and decompression executed on the GPU ensure that the documents can be held in the system memory with no loss in quality. The last major block in the chain processes the raster image, performing color processing including color separation and possibly other image enhancement steps. This stage can readily leverage the GPU and help keep CPU loading to a minimum.
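The lossless round trip that makes in-memory page storage safe can be demonstrated with simple run-length encoding. Production printers use stronger lossless codecs (Flate, JBIG2, and the like); this Python sketch only shows the property the pipeline relies on: decode(encode(page)) reproduces the page exactly, and raster data with long runs compresses well.

```python
def rle_encode(data: bytes) -> bytes:
    """Byte-oriented run-length encoding as (count, value) pairs.

    Illustration of lossless compression only; real printer
    pipelines use stronger codecs. Runs are capped at 255 so the
    count fits in one byte.
    """
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes((run, data[i]))
        i += run
    return bytes(out)

def rle_decode(data: bytes) -> bytes:
    """Exact inverse of rle_encode: expand (count, value) pairs."""
    out = bytearray()
    for i in range(0, len(data), 2):
        out += bytes([data[i + 1]]) * data[i]
    return bytes(out)
```

Runs of identical bytes are common in binary halftoned raster data, which is why even this simple scheme shrinks typical page buffers substantially.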
3D Print Pipeline
The 3D printing industry is going through a revolution. 3D printers are being used to print everything from toys and mechanical prototypes to anatomical models and more. The 3D printer pipeline takes a 3D model of an object and slices it layer by layer to parameterize the model. The sliced model is then printed one layer at a time by a low-cost microcontroller that manages print-head motion as well as melting/deposition of the material used by the printer.
Today the 3D slicing operation is typically done on a PC before sending data to the printer, making the process of 3D printing tedious and slow. The 3D slicer, which is well suited for GPU compute acceleration, can be integrated with data management and error control in an APU. This move to an APU-based software architecture for 3D printers can enable faster printing of 3D models and reduce manual intervention (Figure 5).
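The core of slicing is intersecting every triangle of the mesh with a horizontal plane at each layer height, and since every triangle is independent, the work maps naturally onto GPU compute. The Python sketch below handles one triangle at one z-height; degenerate cases (a vertex lying exactly on the plane, or a triangle coplanar with it), which a real slicer must handle, are ignored for clarity.

```python
def slice_triangle(tri, z):
    """Intersect one mesh triangle with the plane at height z.

    tri: three (x, y, z) vertices. Returns the intersection
    segment as ((x1, y1), (x2, y2)), or None if the triangle
    does not cross the plane. Simplified sketch: vertices lying
    exactly on the plane are not handled.
    """
    pts = []
    for i in range(3):
        (x1, y1, z1), (x2, y2, z2) = tri[i], tri[(i + 1) % 3]
        if (z1 - z) * (z2 - z) < 0:            # edge endpoints straddle the plane
            t = (z - z1) / (z2 - z1)           # linear interpolation parameter
            pts.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return (pts[0], pts[1]) if len(pts) == 2 else None
```

A slicer runs this intersection over every triangle for every layer and then links the resulting segments into closed contours for tool-path generation; the per-triangle independence is what makes the step a good fit for GPU acceleration.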
The use of the GPU and the HSA architecture allows printers to deliver top-notch performance while simplifying the programming model through the use of OpenCL and other high-level compute languages that support the HSA implementation. Additionally, processor suppliers such as AMD provide a large library of compute and control functions for printing applications in the form of a vertical development kit (VDK) that will simplify the implementation of the printer software. The move to GPU-based software architectures can give OEMs the scalability and flexibility for faster, more economical improvements, thereby reducing the cost of ownership for consumers.
Multicore processors such as the AMD R-series provide designers with two or four CPU cores, as well as multiple GPU compute engines, to deliver the throughput needed for print speeds of over 100 pages/minute in enterprise-class printers. For less-demanding printer requirements, the company’s G-series of embedded processors can provide a lower-power, cost-effective solution.