Navigating Thermal Challenges for High-End ATCA Server Blades
Reliability concerns don’t allow for active CPU cooling on ATCA blades, but even as core counts rise and performance enhancements proliferate, designers are finding ways to tame high temps.
An ATCA server blade introduced in 2010 typically had a capacity of 64 to 96GB main memory, while today’s products can support up to 512GB of memory. The 2010 blade had six cores per CPU, while the latest Intel® Xeon® Processor E5-2600 v3 product family raises the usable core count on an ATCA blade to as high as 14 cores per processor. Future processor generations will likely bring the core count even higher.
Other enhancements in that short time (relatively speaking, in terms of deployed telecom systems) include a larger main CPU cache, an enhanced CPU instruction set and new application-accelerating functions such as encryption for security or 256-bit integer vector instructions for image processing and signal analysis. Intel has also improved the throughput of data paths inside CPUs, between CPUs and to the main memory, added support for DDR4 memory, upgraded I/O connectivity with PCIe Gen3 and integrated the host controllers directly into the CPUs.
The only performance factor that has remained relatively stable over the various processor generations is the average clock frequency. However, performance can be optimized by clocking individual processor cores at higher frequencies, if the application can benefit from this approach. Available on ATCA server blades, these advances have increased performance per ATCA slot and optimized energy efficiency. As a result CAPEX and OPEX measured per subscriber or per network packet can be greatly improved.
Power Predicament: It Could be Worse
Power consumption of the processor and subsystems has grown in the wake of added performance. To compensate for the increased power usage, processor manufacturers introduce new technologies that allow silicon dies to shrink. Together with lower general operating voltages applied to the silicon, these techniques significantly reduce power losses. The resulting lower power and cost are mostly exploited by new classes of processors intended for mobile and appliance applications.
For processors targeted at high-performance servers, the capabilities added in each new processor generation gobble up these savings. In fact, the average power consumption of the Intel Xeon processor family has increased over recent years. Power efficiency could have gotten worse as a result. But with performance improving much faster than average power consumption over the same period, power efficiency, measured as performance per watt, has risen.
The power increases in the last four years are in the range of 15 to 45 W per Intel Xeon processor. This doesn’t sound like much, but if a given cooling environment does not improve in the same timeframe, the increases can be significant.
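The efficiency argument above can be made concrete with a little arithmetic. The following is a rough sketch only: it uses the core counts quoted earlier (6 and 14 cores per CPU) as a crude stand-in for performance, and it assumes a hypothetical baseline of 95 W per processor, a value chosen purely for illustration and not taken from any datasheet. The function name `efficiency_gain` is likewise made up for this example.

```python
def efficiency_gain(perf_old, power_old_w, perf_new, power_new_w):
    """Return the ratio of new to old performance per watt."""
    if min(perf_old, power_old_w, perf_new, power_new_w) <= 0:
        raise ValueError("all inputs must be positive")
    return (perf_new / power_new_w) / (perf_old / power_old_w)


# Core counts per CPU quoted in the article (2010 blade vs. latest blade),
# used here only as a crude proxy for performance.
CORES_2010 = 6
CORES_LATEST = 14

# ASSUMPTION: hypothetical baseline wattage, for illustration only.
BASELINE_POWER_W = 95.0

# Power increases quoted in the article: 15 to 45 W per processor.
gain_small_increase = efficiency_gain(
    CORES_2010, BASELINE_POWER_W, CORES_LATEST, BASELINE_POWER_W + 15.0
)
gain_large_increase = efficiency_gain(
    CORES_2010, BASELINE_POWER_W, CORES_LATEST, BASELINE_POWER_W + 45.0
)

print(f"Performance per watt, +15 W case: x{gain_small_increase:.2f}")
print(f"Performance per watt, +45 W case: x{gain_large_increase:.2f}")
```

Under these assumptions, performance per watt still improves even at the upper end of the power increase, because the performance proxy grows faster than the wattage. Real gains depend on the workload and on how well it scales across cores.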
For example, consider a scenario where compute blades are replaced in the field without upgrading the enclosure’s cooling subsystem. Insufficient cooling could then result in lower achievable CPU performance. If the desire is to also exploit higher clocked CPUs, the processor wattages can grow even more. This puts a lot of pressure on board designers.
Another concern for designers can be the silicon’s thermal specifications. Silicon manufacturers typically define the maximum temperature a silicon device is allowed to operate at under normal conditions. This is either specified for the surface of the device package (TCase) or for the silicon substrate (TJunction). Thermal specification of silicon devices with high power dissipation such as CPUs can be tight. Designers must pay special attention when designing the processor cooling solution. On ATCA, where active CPU cooling is not an option due to reliability concerns, cooling becomes a major design challenge.
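To show why a tight TCase limit is demanding, the required heat sink performance can be estimated with the common case-to-ambient thermal resistance relation: allowed resistance = (TCase limit minus local air temperature) divided by dissipated power. The sketch below is an illustration under stated assumptions, not a design procedure: the 85°C TCase limit, the 105 W power figure and the two air temperatures are hypothetical example values, and the function name is invented for this example.

```python
def max_case_to_ambient_resistance(tcase_max_c, local_ambient_c, power_w):
    """Return the largest allowed case-to-ambient thermal resistance in degC/W.

    A passive heat sink must achieve this value or lower for the package
    surface to stay at or below tcase_max_c while dissipating power_w.
    """
    if power_w <= 0:
        raise ValueError("power must be positive")
    headroom_c = tcase_max_c - local_ambient_c
    if headroom_c <= 0:
        raise ValueError("local ambient is at or above the TCase limit")
    return headroom_c / power_w


# ASSUMPTIONS: all three values are hypothetical and for illustration only.
TCASE_MAX_C = 85.0   # not taken from any processor datasheet
POWER_W = 105.0      # example processor power dissipation

for ambient_c in (40.0, 55.0):  # example local air temperatures
    r_ca = max_case_to_ambient_resistance(TCASE_MAX_C, ambient_c, POWER_W)
    print(f"{ambient_c:.0f} degC air: heat sink needs <= {r_ca:.3f} degC/W")
```

The warmer the air reaching the heat sink, the smaller the allowed thermal resistance and the larger or more efficient the passive heat sink has to be. Note that the air arriving at the CPU is usually warmer than the shelf inlet air, because upstream components preheat it, so the local value is the one that matters.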
Problems Leading to Lower Achievable Performance
Board real estate and cooling capabilities of the target enclosures set the stage for board designers. Typically, designers have to balance board functionality with achievable performance. The ideal processing board would have hundreds of processor cores, run at 5 GHz, feature multiple disk drives, provide lots of default I/O and still have sufficient real estate for modular I/O extensions.
In reality, if performance is maxed out, board functionality and configuration flexibility can be limited. By the same token, adding a high degree of configuration flexibility—such as mezzanine cards or storage devices—limits the size of the cooling solution on remaining board real estate. This results in lower achievable performance.
Both the high-performance board and the multifunctional board have their advantages, but they can’t co-exist satisfactorily on the same design. Board designs that ignore this fact will create problems when they are integrated into a given shelf solution. The promised performance range may then not be fully achievable.
Processors running at full pace that are exposed to higher ambient temperature and higher software load will eventually overheat, which is not acceptable. Overheating calls the stability of operation into question, reduces the component’s lifetime, pushes the exhaust air temperature above the allowed limits and in the extreme can cause serious damage.
In order to prevent this, Intel CPUs typically contain a mechanism called clock throttling. The device measures the on-die temperature and, when certain limits are reached, periodically gates the processor clock for some time. This allows the device to dissipate less power and stay within the thermal limits. A further protection mechanism ultimately terminates operation if even the temporary clock gating doesn’t bring down the temperature.
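The behavior described above can be illustrated with a toy simulation. This is a simplified sketch, not Intel's actual throttling algorithm: it assumes a first-order thermal model, a fixed 50 percent clock duty cycle while throttled, and power that scales linearly with that duty cycle. All names, thresholds and parameter values are hypothetical.

```python
def protection_action(die_temp_c, throttle_temp_c, shutdown_temp_c):
    """Decide what the (simplified) thermal protection does at this temperature."""
    if die_temp_c >= shutdown_temp_c:
        return "shutdown"
    if die_temp_c >= throttle_temp_c:
        return "throttle"
    return "run"


def simulate(power_full_w, ambient_c, r_th_c_per_w, tau_s=30.0,
             throttle_temp_c=90.0, shutdown_temp_c=105.0,
             throttled_duty=0.5, dt_s=1.0, steps=600):
    """Return a list of (die temperature in degC, action) tuples, one per time step.

    ASSUMPTIONS: first-order thermal model with time constant tau_s, and
    dissipated power proportional to the clock duty cycle.
    """
    temp_c = ambient_c
    trace = []
    for _ in range(steps):
        action = protection_action(temp_c, throttle_temp_c, shutdown_temp_c)
        if action == "shutdown":
            # Last line of defense: operation is terminated.
            trace.append((temp_c, action))
            break
        duty = throttled_duty if action == "throttle" else 1.0
        # Temperature the die would settle at for the current power level.
        steady_c = ambient_c + power_full_w * duty * r_th_c_per_w
        # Move a fraction dt/tau of the way toward that steady state.
        temp_c += (steady_c - temp_c) * dt_s / tau_s
        trace.append((temp_c, action))
    return trace


# Hypothetical scenarios: same 105 W processor and 55 degC air, but heat
# sinks of different quality (thermal resistance in degC/W).
for label, r_th in (("good cooling", 0.25), ("marginal cooling", 0.40),
                    ("insufficient cooling", 1.00)):
    trace = simulate(power_full_w=105.0, ambient_c=55.0, r_th_c_per_w=r_th)
    throttled = sum(1 for _, action in trace if action == "throttle")
    print(f"{label}: {throttled} throttled steps of {len(trace)}, "
          f"final action: {trace[-1][1]}")
```

In this model, the marginal case keeps running but spends a share of its time with a gated clock, which is exactly the lost performance the article refers to. The insufficient case cannot be saved even by throttling and ends in shutdown.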
While these are useful protection mechanisms, they are counterproductive to overall system performance. Alternatively, the developer could proactively reduce clock frequency or use fewer of the available cores. The effect remains the same: the performance offered by the processor variant can only be partially exploited by the application.
Avoiding the Costs of Complete Shelf Replacement
Board products that claim to offer both a high degree of flexibility and high performance should be treated with caution. Integrators should carefully determine whether the promised performance is achievable in the target shelf environment under all circumstances, such as maximum ambient temperature or high software load. If processing performance is not crucial, such a board may be the right product. If excellent performance is paramount, it is probably the wrong product decision.
In recent years, new shelf products have become available that provide very strong cooling capabilities in order to support high performance applications. Such shelves are necessary for supporting blades with 400W or even higher power dissipation and enable the ultimate performance experience. Having said this, not every new compute blade that is delivered into communications markets will be integrated into such a high-performance shelf. In fact, there are thousands of installed shelves that have served at key locations in communications networks for many years. From an economic standpoint, it is understandable that service providers replace equipment as-needed and extend their infrastructure capabilities with the growing demand for bandwidth and service capabilities.
Upgrading individual payload blades is a much leaner approach than a complete shelf replacement. Shelves are therefore often kept in service for many years, and they likely don’t have the cooling capabilities the new shelf generation is able to provide. Such installed shelves often comply with the CP-TA B.4 cooling class (now managed by PICMG), which defines airflows of up to 40 CFM per ATCA slot. The Artesyn Centellis™ 4440 is an example of this product class. As such shelves are still commonly in use, it is important that the latest server blades provide satisfactory performance in such installations.
A key aspect of designing for communication applications is designing for NEBS Level 3 compliance. The NEBS requirements cover different aspects of the design such as safety, EMC compliance, earthquake resistance and thermal behavior. NEBS requirements define a maximum ambient air temperature of 40°C during normal operation and up to 55°C ambient air temperature during exceptional operation for a limited amount of time per year (such as during a loss of the room air conditioning). As CP-TA B.4 shelves are still commonly in use, it should be a primary goal that ATCA blade products fit in these environments while providing outstanding performance. It is also paramount that there is no degradation of performance across the entire NEBS L3 temperature range.
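The airflow and temperature figures above can be related with a basic energy balance: the temperature rise of the cooling air equals the dissipated power divided by the product of mass flow and specific heat. The sketch below is a back-of-the-envelope estimate, not a substitute for thermal simulation. It assumes sea-level air properties at roughly room temperature, perfectly mixed exhaust air, and a hypothetical blade power of 300 W. The function and constant names are invented for this example.

```python
CFM_TO_M3_PER_S = 0.000471947   # one cubic foot per minute in m^3/s

# ASSUMPTIONS: approximate properties of dry air near sea level and 20 degC.
AIR_DENSITY_KG_PER_M3 = 1.2
AIR_SPECIFIC_HEAT_J_PER_KG_K = 1005.0


def air_temperature_rise_c(power_w, airflow_cfm,
                           density=AIR_DENSITY_KG_PER_M3,
                           specific_heat=AIR_SPECIFIC_HEAT_J_PER_KG_K):
    """Return the average inlet-to-exhaust air temperature rise in degC."""
    if power_w < 0 or airflow_cfm <= 0:
        raise ValueError("power must be >= 0 and airflow must be > 0")
    mass_flow_kg_per_s = airflow_cfm * CFM_TO_M3_PER_S * density
    return power_w / (mass_flow_kg_per_s * specific_heat)


SLOT_AIRFLOW_CFM = 40.0   # upper airflow figure quoted for CP-TA B.4
BLADE_POWER_W = 300.0     # ASSUMPTION: hypothetical blade power dissipation

rise_c = air_temperature_rise_c(BLADE_POWER_W, SLOT_AIRFLOW_CFM)
for inlet_c in (40.0, 55.0):   # NEBS normal and exceptional ambient limits
    print(f"inlet {inlet_c:.0f} degC -> average exhaust about "
          f"{inlet_c + rise_c:.0f} degC")
```

Because air density falls with rising temperature and altitude, the real temperature rise at 55°C or at elevated sites would be somewhat larger than this estimate. Components located downstream on the board also see air that has already been preheated, which is one reason heat sink placement matters so much on a densely populated blade.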
For example, the Artesyn ATCA-7480 packet processing blade is based on the most recent family of Intel® Xeon® Processors E5-2600 v3. It can host two processors with 75 or 105 W thermal design power. These variants have extended temperature range capability and allow a blade to be fitted into a CP-TA B.4 compliant shelf under NEBS L3 conditions. With 12 cores per socket and 1.8 or 2.2 GHz clock frequency respectively and up to 512GB of main memory, the blade is optimized for excellent performance.
Furthermore, high-power CPUs (120W) are supported by the blade design. This means the blade can be integrated into high-performance shelves such as the Artesyn Centellis™ 8000 family, and enables data center class performance with ATCA technology. Processor derivatives with up to 14 cores per processor or up to 2.5GHz clock frequency can also be supported.
Artesyn undertook extensive thermal design and simulation efforts to enable the performance envelope for the different cooling environments. The ATCA-7480 supports the operation of the board in a maximum ambient air temperature of 55°C as defined by NEBS, when installed in a CP-TA B.4 compliant shelf. The design goals have been proven by thermal qualification during design verification and testing. The selected thermal solution makes the product sufficiently robust in shelves with CP-TA B.4 air cooling without compromising the available compute performance. It also adds sufficient headroom for squeezing the ultimate performance out of the product when installed in a shelf with enhanced airflow.
Along with the increasing performance of the Intel® Xeon® processor family comes the need for higher power provisioning and tighter thermal specifications. As a result, cooling becomes more challenging for board and system designers, particularly when products need to comply with the more stringent physical requirements of telecommunications equipment. When done right, designs can make use of the full performance capabilities under stringent thermal conditions as defined by NEBS.
Chris Engels is a senior technical marketing manager for the Embedded Computing business of Artesyn Embedded Technologies. In his current role, Engels focuses on AdvancedTCA®, server architectures and product definitions, in addition to inbound and outbound technical marketing. He also represents Artesyn in several standards development organizations.
Engels has 18 years of experience in the embedded computing industry. Prior to assuming his current role, he was responsible for the development of embedded computing and board-level products for industrial and telecommunications applications. He is a graduate of Germany’s RWTH Aachen University, where he earned his diploma for electrical engineering.