
Inside the 100,000-GPU xAI Colossus Cluster That Supermicro Helped Build for Elon Musk

Computing hall of the XAI Colossus Data Center

Today we are releasing our tour of the xAI Colossus supercomputer. For those who have heard stories about Elon Musk's xAI building a giant AI supercomputer in Memphis, this is that cluster. With 100,000 NVIDIA H100 GPUs, this multi-billion dollar AI cluster stands out not only for its size, but also for the speed at which it was built. The teams built this massive cluster in just 122 days. Today we can show you the inside of the building.

Of course, we have a video for this one that you can find on X or on YouTube.

Normally at STH we do everything completely independently. This was different. Supermicro is sponsoring this piece because it is easily the most expensive project for us this year. Also, some things will be blurred, or I will be intentionally vague, due to the sensitivity behind building the largest AI cluster in the world. We received special permission from Elon Musk and his team to show this.

Supermicro Liquid Cooled Racks at xAI

The basic building block for Colossus is the liquid-cooled Supermicro rack. Each rack holds eight 4U servers, each with eight NVIDIA H100 GPUs, for a total of 64 GPUs per rack. Those eight GPU servers, plus a Supermicro coolant distribution unit (CDU) and associated hardware, make up one GPU compute rack.

XAI Colossus Data Center Supermicro Liquid Cooled Nodes Low Angle

These racks are arranged in groups of eight, for 512 GPUs per group, plus networking, to provide mini-clusters within the much larger system.
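As a rough back-of-the-envelope check on those numbers, here is a minimal Python sketch of the arithmetic. The per-server, per-rack, and per-group figures come from the text above. The totals it derives are only estimates: they assume the 100,000 figure is exact and that every rack is fully populated, neither of which we can confirm.

```python
import math

# Figures stated in the article.
GPUS_PER_SERVER = 8    # eight NVIDIA H100 GPUs per 4U server
SERVERS_PER_RACK = 8   # eight 4U servers per rack
RACKS_PER_GROUP = 8    # racks are arranged in groups of eight

# Assumption: treat the headline GPU count as exact.
TOTAL_GPUS = 100_000

gpus_per_rack = GPUS_PER_SERVER * SERVERS_PER_RACK
gpus_per_group = gpus_per_rack * RACKS_PER_GROUP

# Round up, since a partially filled rack or group is still a rack or group.
servers_needed = math.ceil(TOTAL_GPUS / GPUS_PER_SERVER)
racks_needed = math.ceil(TOTAL_GPUS / gpus_per_rack)
groups_needed = math.ceil(TOTAL_GPUS / gpus_per_group)

print(f"GPUs per rack:        {gpus_per_rack}")
print(f"GPUs per rack group:  {gpus_per_group}")
print(f"Estimated servers:    {servers_needed}")
print(f"Estimated GPU racks:  {racks_needed}")
print(f"Estimated groups:     {groups_needed}")
```

Under those assumptions, the cluster works out to roughly 12,500 GPU servers in a bit over 1,500 GPU compute racks, or just under 200 of the 512-GPU groups. The real counts may differ.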

XAI Colossus Data Center Supermicro 4U Universal GPU Liquid Cooled Server

Here xAI uses the Supermicro 4U Universal GPU system. These are the most advanced AI servers on the market today for a couple of reasons. One is the degree of liquid cooling. The other is how serviceable they are.

XAI Colossus Data Center Supermicro 4U Universal GPU Liquid Cooled Server Close

We first saw the prototype of these systems about a year ago at Supercomputing 2023 (SC23) in Denver. We were unable to open one of these systems in Memphis because they were busy running training jobs while we were there. One example of the serviceability is that the system sits on trays that can be serviced without removing the system from the rack. The 1U rack manifold brings in cool liquid and carries out warmed liquid for each system. Quick disconnects make it fast to get the liquid cooling out of the way, and last year we showed how they can be removed and installed one-handed. Once they are removed, the trays can be pulled out for service.

Supermicro 4U Universal GPU system for liquid-cooled NVIDIA HGX H100 and HGX H200 at SC23 3

Luckily, we have images of the prototype of this server so we can show you what's inside these systems. Aside from the 8-GPU NVIDIA HGX tray, which uses custom Supermicro liquid cooling blocks, the CPU tray shows why this is a next-generation design that is unparalleled in the industry.

Supermicro 4U Universal GPU system for liquid-cooled NVIDIA HGX H100 and HGX H200 at SC23 6

The two x86 CPU liquid cooling blocks in the SC23 prototype above are fairly common. The unique part is on the right. Supermicro's motherboard integrates the four Broadcom PCIe switches used in almost every HGX AI server today, rather than placing them on a separate board. Supermicro then uses a custom liquid cooling block to cool those four PCIe switches. Other AI servers in the industry start as air-cooled designs and have liquid cooling added afterward. Supermicro's design is liquid-cooled from the ground up, and it all comes from one vendor.

Supermicro SYS 821GE TNHR NVIDIA H100 and NVSwitch liquid cooling blocks 8

It is similar to cars, where some are first designed to run on gas and later have an electric powertrain fitted into the same chassis, as opposed to EVs that are designed to be electric from the ground up. This Supermicro system is the latter, while other HGX H100 systems are the former. We have had hands-on time with most of the public HGX H100/H200 platforms since their launch, as well as some of the hyperscale designs. Make no mistake, there is a big gap between this Supermicro system and the others, including some of Supermicro's other designs that can be either liquid or air cooled and that we have reviewed before.