News Story
Reliability and Availability Analysis of Data Center Thermal Management System Presented at CEEE Consortium
CALCE and CEEE researchers Amir Hossein Zabihi Tari and Dr. Diganta Das presented a reliability analysis of thermal management systems (TMS) for edge data centers at the recent CEEE Consortium meeting. Their work, conducted in collaboration with Dr. Andres Sarmiento and Prof. Michael Ohadi, demonstrated CALCE's expertise in translating thermal management system designs into actionable availability and reliability metrics for high-power computing infrastructure.

The team's analysis focused on a 2.4 kW liquid-cooled unit, combining physics-of-failure principles with reliability block diagram (RBD) modeling to evaluate the design against strict availability targets (minimum 95% availability, which corresponds to a maximum of 438 hours of downtime per year). By grouping components such as fasteners, seals, and structural elements into cold plate assemblies that are replaced together, they demonstrated how maintenance choices affect cost and downtime while the system still meets its required uptime.
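
The relationship between the availability target and the downtime budget, and the way an RBD combines block availabilities, can be illustrated with a short Python sketch. This is not the team's model: the block names, MTBF and MTTR values, and the redundant-pump layout are hypothetical placeholders, and the blocks are assumed to fail independently.

```python
HOURS_PER_YEAR = 8760  # 365-day year


def annual_downtime_hours(availability):
    """Expected downtime per year for a given steady-state availability."""
    return (1.0 - availability) * HOURS_PER_YEAR


def steady_state_availability(mtbf_h, mttr_h):
    """Availability of a repairable block: MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)


def series(*blocks):
    """Series RBD: every block must be up, so availabilities multiply."""
    result = 1.0
    for a in blocks:
        result *= a
    return result


def parallel(*blocks):
    """Parallel RBD: the group is down only if all blocks are down."""
    unavailable = 1.0
    for a in blocks:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


# Hypothetical block data (illustrative only, not from the study).
pump = steady_state_availability(mtbf_h=20_000, mttr_h=8)
cold_plate_assembly = steady_state_availability(mtbf_h=50_000, mttr_h=24)
heat_exchanger = steady_state_availability(mtbf_h=80_000, mttr_h=12)

# Assumed layout: two redundant pumps feeding one cold plate assembly
# and one heat exchanger in series.
system = series(parallel(pump, pump), cold_plate_assembly, heat_exchanger)

print(f"Downtime budget at 95% availability: {annual_downtime_hours(0.95):.0f} h/yr")
print(f"Sketch system availability: {system:.5f}")
print(f"Sketch system downtime: {annual_downtime_hours(system):.1f} h/yr")
```

Treating the cold plate assembly as a single block mirrors the grouped-replacement idea: one repair time covers the fasteners, seals, and structural elements inside it.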

Over the course of the study, this reliability framework will be scaled up to a 1.5 MW ISO-40 container system housing 12 racks (126 kW per rack, 16 servers each). The team's concurrent design approach integrates thermal performance, reliability modeling, degradation analysis (TIM pump-out, O-ring aging, microchannel clogging), and simulation-based design of experiments to identify the redundancy levels, maintenance strategies, and component groupings that best meet demanding edge data center requirements.
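
As a rough illustration of what comparing redundancy levels can look like, the sketch below checks the container's total power from the stated rack figures and then finds the smallest number of identical cooling units that meets an availability target when a fixed number must be running. The unit availability, the number of required units, and the k-out-of-n assumption of independent, identical units are hypothetical and are not taken from the study.

```python
from math import comb

RACKS = 12
KW_PER_RACK = 126
total_kw = RACKS * KW_PER_RACK  # 1512 kW, i.e. about 1.5 MW


def k_out_of_n_availability(k, n, a):
    """Probability that at least k of n independent units, each with
    availability a, are up at the same time (binomial sum)."""
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


def min_units_for_target(k, a, target, max_n=10):
    """Smallest n (up to max_n) whose k-out-of-n availability meets the
    target, or None if no n in range does."""
    for n in range(k, max_n + 1):
        if k_out_of_n_availability(k, n, a) >= target:
            return n
    return None


# Hypothetical inputs (illustrative only).
unit_availability = 0.98   # assumed availability of one cooling unit
required_units = 3         # assumed number of units needed to carry the load
target = 0.95              # availability target quoted in the article

print(f"Total rack power: {total_kw} kW")
for n in range(required_units, required_units + 3):
    a_sys = k_out_of_n_availability(required_units, n, unit_availability)
    print(f"{required_units}-out-of-{n}: availability {a_sys:.5f}")
print("Minimum units:", min_units_for_target(required_units, unit_availability, target))
```

In this toy case a single spare unit is enough to clear the target; a full study would also weigh the cost and maintenance burden of each added unit.
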
The presentation highlighted CALCE's role as an industry partner, turning complex reliability modeling into guidance on component selection, redundancy planning, and grouped maintenance strategies. Consortium members gained direct insight into how CALCE's physics-of-failure toolkit addresses the unique challenges of edge infrastructure, ensuring systems that are not only thermally efficient but also robust, maintainable, and capable of meeting mission-critical uptime demands.
This work is conducted under the grant “Flexnode, Inc., Cooling System Reliability Evaluation,” Award # 25020573. The broader project is funded in part by the U.S. Department of Energy’s Advanced Research Projects Agency-Energy (ARPA-E).
For more information about the presentation, contact Dr. Diganta Das.
Published March 9, 2026