
Nvidia GTC: Hyperscaler Happiness and Enterprise Indigestion

Nvidia is focused on driving forward the state of the art for the hyperscaler users of AI hardware.

SAN JOSE, Calif. — On Monday, March 18, Jensen Huang, the CEO of Nvidia, gave his annual keynote at the company’s GTC conference here, setting out its technology roadmap for the next year. Every large technology company does an event like this, but right now Nvidia is leading the industry and speeding up its release cadence to a point that is hard for everyone else to match. It’s hard for competitors, hard for standards bodies, and especially hard for enterprises that are trying to find their feet in the new world of AI deployments. Nvidia is focused on driving forward the state of the art for the hyperscaler users of AI hardware, and everyone else just has to try to keep up. I think this is the most significant computing technology announcement we are likely to see this year.

The GPU servers currently being used in volume production, such as the AWS p5, have two Intel CPUs and eight Nvidia H100 GPUs per node, and these nodes can be clustered. The eight GPUs share memory via NVlink at 900 GBytes/s, but the Intel CPUs don’t have an NVlink interface. Larger configurations are linked by eight 400 Gbit/s Ethernet or Infiniband networks per node, and with hundreds to thousands of GPUs there is a lot of network traffic.

To speed up AI training workloads on the network, both NVswitch and Infiniband switches have built-in processing power to efficiently perform operations like centrally averaging the output of all the GPUs. This was announced a few years ago as the Scalable Hierarchical Aggregation and Reduction Protocol (SHARP) architecture, but it’s not clear to me how heavily it’s being used.
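To make the SHARP discussion concrete, here is a minimal sketch of the gradient-averaging collective that training frameworks typically run through NCCL; when SHARP is enabled, the switches can perform the reduction in-network instead of the GPUs doing it themselves. This is an illustrative PyTorch example under that assumption, not Nvidia’s implementation, and the script name in the launch comment is a placeholder.

```python
# Minimal sketch of the allreduce that SHARP accelerates. Launch with, e.g.:
#   torchrun --nproc_per_node=8 allreduce_sketch.py
import torch
import torch.distributed as dist

def average_gradients(model: torch.nn.Module) -> None:
    """Average each gradient tensor across all ranks (GPUs)."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradients from every rank; NCCL chooses the transport
            # (NVlink, Infiniband) and can offload the reduction to SHARP.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            # Divide by the number of ranks to turn the sum into an average.
            param.grad /= world_size

if __name__ == "__main__":
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    model = torch.nn.Linear(1024, 1024).cuda()
    loss = model(torch.randn(32, 1024, device="cuda")).sum()
    loss.backward()
    average_gradients(model)
    dist.destroy_process_group()
```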

Last year in his keynote, Huang announced the Grace Hopper GH200 combined CPU/GPU architecture. The Grace CPU is Nvidia’s first to use the Arm architecture rather than Intel’s, and it has an NVlink interface to couple it directly to the Hopper H100-based GPU.

NVlink replaces the network for up to 256 GH200s combined into a single system connected by 900 GBytes/s interfaces to NVswitch chips. That allows up to 120 Terabytes of CPU-hosted memory, and 24 Terabytes of GPU-hosted high bandwidth memory, as a single shared memory system image.

This very large memory is needed to train the largest models, as fitting the model into GPU memory is a big bottleneck for training. The most significant thing about this system architecture is that the GPU is driving the NVlink interconnect, with the CPUs around the edge. This is reversed from more conventional architectures where the CPUs are driving interconnection traffic, and GPUs are attached to them.

The difference is that when a GH200 GPU wants to send data to another GH200 GPU, it goes directly, by writing to shared high-bandwidth memory. This is far faster than an H100 sending data over the PCIe bus to an Intel CPU, then over a network to another Intel CPU, then over PCIe to the other H100 GPU.
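As a rough illustration of the difference, here is a small PyTorch sketch of a direct device-to-device copy; on NVlink-connected GPUs with peer access available, CUDA can move the data GPU-to-GPU without staging it through host memory. It shows the general concept rather than Nvidia’s internal mechanism, and assumes a machine with at least two GPUs.

```python
# Sketch: direct GPU-to-GPU transfer, avoiding a bounce through CPU memory.
import torch

assert torch.cuda.device_count() >= 2, "needs at least two GPUs"

src = torch.randn(1 << 20, device="cuda:0")   # ~4 MB tensor on GPU 0
dst = src.to("cuda:1", non_blocking=True)     # copy directly to GPU 1
torch.cuda.synchronize()

# Report whether CUDA allows peer access between the two devices; when it
# does (e.g. over NVlink), the copy above does not pass through the host.
print("peer access 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
```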

GH200 Now Shipping

GH200 systems have now started shipping and will deploy in volume this year. At GTC there is a display area where 11 partners are showing them, and benchmarks are beginning to be published, although we’ve heard that the hardware and software stack is currently not quite ready for mainstream production use. Nvidia told me:

“The largest early systems are in Supercomputing including the Swiss National Computer Center ‘Alps’ system, Los Alamos National Labs ‘Venado’, Texas Advanced Computing Center ‘Vista’ and the Jülich Supercomputing Centre’s JUPITER system with close to 24,000 GH200 Superchips. These systems and more will be coming online throughout 2024. Combined, these GH200-powered centers represent some 200 exaflops of AI performance to drive scientific innovation.”

Blackwell GPU

This year Huang announced the next generation Blackwell architecture GPU and the GB200 system. The Blackwell name is in honor of David Blackwell, an African American mathematician and game theory pioneer.

To get around limitations in the maximum size of chips that can be made, the Blackwell GPU is made from two of the largest possible chips, connected directly together with a 10 Terabyte/s chiplet interface, along with 192 Gbytes of high bandwidth HBM3e memory chiplets inside the package for a total of 208 billion transistors.

Blackwell also includes high-speed encryption and decompression engines so it can operate directly on encrypted and compressed data without involving the CPU. One Grace CPU is combined with two Blackwell GPU packages to make a GB200 node (GH200 has one CPU and one GPU). There are 480 GBytes of memory on Grace and 384 GBytes of high bandwidth memory on the two Blackwells, for a total of 864 GBytes in each GB200.

The main performance changes from GH200 to GB200 are four times higher performance for AI training workloads, 30 times higher for inference workloads, and 25 times better energy efficiency overall for the Blackwell GPU. The raw speed is 10 Petaflops of FP8 and 20 Petaflops of FP4. NVlink doubles its bandwidth to 1.8 Terabytes/s, and the maximum number of GPUs in the shared memory cluster more than doubles, from 256 to 576.

The internal architecture for training workloads is similar between Hopper and Blackwell, but by getting two Hopper-equivalent chiplets into each package, and having two packages per GB200 module, there’s four times the performance for FP8 compared to GH200. Blackwell adds FP4 for inference that doubles the performance (which gets us to 8x) and there are additional optimizations for inference that give rise to the 30x claim. The 25x energy efficiency claim appears to be a blended mix of the 30x inference and 4x training workloads.

Netting this out, on a GPU-to-GPU basis, Blackwell is twice the performance of Hopper for training.
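Here is a back-of-the-envelope sketch, in Python, that reconciles those multipliers using only the relationships described above; the numbers are the article’s, not official spec-sheet figures.

```python
# Reconciling the claimed GB200 vs. GH200 speedups from the article's reasoning.
hopper_equiv_dies_per_blackwell = 2   # two reticle-limit dies per Blackwell package
blackwell_packages_per_gb200 = 2      # GB200 = one Grace CPU + two Blackwell GPUs
hopper_gpus_per_gh200 = 1             # GH200 = one Grace CPU + one Hopper GPU

# FP8 training: the node-level gain comes from the extra silicon.
fp8_gain_per_node = hopper_equiv_dies_per_blackwell * blackwell_packages_per_gb200
print("GB200 vs GH200, FP8 training:", fp8_gain_per_node, "x")        # 4x

# FP4 doubles inference throughput at the number-format level (further
# inference-specific optimizations take Nvidia's claim to roughly 30x).
print("with FP4 for inference:", fp8_gain_per_node * 2, "x")          # 8x

# Per GPU package, rather than per node, the training gain is:
per_gpu_gain = fp8_gain_per_node * hopper_gpus_per_gh200 / blackwell_packages_per_gb200
print("per-GPU training gain:", per_gpu_gain, "x")                    # 2x
```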

Connecting It All

To connect this all together there’s a new fifth-generation NVswitch design that supports the 1.8 Terabytes/s interfaces from each Blackwell GPU. The NVswitch chips have a much higher performance SHARP v4 processing capacity of 3.6 Teraflops of FP8 for computing shared averages between GPUs.

There’s an entry-level DGX B200 system that’s air-cooled and rated at half the performance of the GB200 node. It’s designed as a plug-in replacement for the existing H100-based systems, with eight single Blackwell GPUs and two Intel CPUs.

The GB200 is delivered in a water-cooled, 120kW rack-sized system package called the GB200 NVL72. This contains 18 boards, each with two GB200 nodes, for a total of 72 Blackwell GPUs, and nine switch boards, each with two NVswitch chips providing 14.4 Terabytes/s of bandwidth per board. The NVL72 is designed to support trillion-parameter model training and inference and will be available first from AWS, Azure and Google Cloud later in 2024, then via the usual partners.

Eight of the NVL72 racks can be interconnected using NVlink to make the full-size 576-GPU DGX SuperPOD with 240 Terabytes of memory in a single shared domain. Using Infiniband to cluster these together in 72- or 576-GPU memory domains, tens of thousands of GPUs can be used to train the largest models.
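A quick sanity check of those counts, using only the figures quoted above (this is a sketch, not an official bill of materials):

```python
# NVL72 rack and SuperPOD totals implied by the article's figures.
boards_per_nvl72 = 18
gb200_nodes_per_board = 2
blackwell_gpus_per_gb200 = 2
grace_cpus_per_gb200 = 1

gpus_per_rack = boards_per_nvl72 * gb200_nodes_per_board * blackwell_gpus_per_gb200
cpus_per_rack = boards_per_nvl72 * gb200_nodes_per_board * grace_cpus_per_gb200
print(f"NVL72 rack: {gpus_per_rack} Blackwell GPUs, {cpus_per_rack} Grace CPUs")  # 72, 36

racks_per_superpod = 8
print(f"SuperPOD: {racks_per_superpod * gpus_per_rack} GPUs")                     # 576

# Memory per GB200 node from above: 480 GB on Grace plus 384 GB of HBM.
mem_per_node_gb = 480 + 384
nodes_per_superpod = racks_per_superpod * boards_per_nvl72 * gb200_nodes_per_board
superpod_mem_tb = nodes_per_superpod * mem_per_node_gb / 1000
print(f"SuperPOD memory: ~{superpod_mem_tb:.0f} TB")  # ~249 TB, in line with the quoted 240 TB
```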

There was also a doubling in the bandwidth of the Infiniband and Ethernet product line to 800Gbit/s. Quoting Nvidia:

“The Quantum-X800 platform sets a new standard in delivering the highest performance for AI-dedicated Infrastructure. It includes the Nvidia Quantum Q3400 switch and the Nvidia ConnectX-8 SuperNIC, which together achieve an industry-leading end-to-end throughput of 800Gb/s. This is 5x higher bandwidth capacity and a 9x increase of 14.4 Tflops of In-Network Computing with Nvidia’s Scalable Hierarchical Aggregation and Reduction Protocol (SHARPv4) compared to the previous generation.

“The Spectrum-X800 platform delivers optimized networking performance for AI cloud and enterprise infrastructure. Utilizing the Spectrum SN5600 800Gb/s switch and the Nvidia BlueField-3 SuperNIC, the Spectrum-X800 platform provides advanced feature sets crucial for multitenant generative AI clouds and large enterprises.”

I’m impressed by both the pace of development and the extremely high performance of the systems and interconnects. The main challenge appears to be building software that can operate these systems reliably and keep them running. That’s what I’d expect for an architectural transition like this, and it highlights the biggest issue with very large coherent memory clusters: failure rates increase in proportion to cluster size, and a failure is likely to crash the entire node.

That’s annoying when a node is eight H100 GPUs and you lose one of 32 nodes in a 256-GPU system, but it’s a much bigger issue if an entire 256-GPU GH200 SuperPOD fails. Each Blackwell chip includes extensive automated self-test and predictive failure modeling to help offset the large size of the system.

The 72-GPU GB200 NVL72 system package is likely a good compromise between the failure rate, the ability to physically package, ship and cool a unit, and an industry-leading capacity to train extremely large AI models efficiently. It will be interesting to see whether the largest configurations use the NVL72 as their building block or configure for the 576-GPU GB200 SuperPOD domain size.

Deal with AWS

The previously announced deal with AWS to deploy a cluster of GH200s for Nvidia, called Project Ceiba, has been upgraded (and delayed somewhat) to be a GB200-based system with over 20,000 GPUs instead. This raises an interesting point: since the GB200 is so much faster than the GH200 and follows perhaps a year behind it, how many customers will want to wait and switch their orders to the GB200?

There’s currently a supply shortage, so everyone is waiting many months for deliveries anyway. For training workloads that use FP8, the per-GPU gain is offset by using twice the silicon area (and cost), so it is less of an issue. As more workloads go to production and inference for very big models becomes more important, moving to Blackwell is going to make a lot more sense.

The bigger challenge for customers buying GPUs is that they need to decide what the real useful life of a GPU is before it’s obsolete. It’s far shorter than that of CPUs, so instead of depreciating them over 5 or 6 years, as seems to be common for most hardware nowadays, it likely makes sense to depreciate GPUs over 2 or 3 years.

That makes them even more expensive to own, and perhaps it’s better to rent whatever the latest GPU is from cloud providers and let them deal with depreciation. Old GPU capacity tends to end up as a cheap option on the spot market, but at some point it will cost more to power than it’s worth as a GPU.
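To see why the shorter useful life matters, here is a simple straight-line depreciation sketch; the purchase price is an illustrative placeholder, not a quote.

```python
# Annualized cost under different depreciation schedules (illustrative numbers).
def annual_depreciation(purchase_price: float, salvage_value: float, years: int) -> float:
    """Straight-line depreciation per year."""
    return (purchase_price - salvage_value) / years

gpu_node_price = 300_000.0   # hypothetical price for a GPU node
salvage_value = 0.0          # assume it is worthless at end of life

for years in (2, 3, 5, 6):
    cost = annual_depreciation(gpu_node_price, salvage_value, years)
    print(f"{years}-year schedule: ${cost:,.0f} per year")
# A 2- or 3-year schedule substantially increases the annual cost of
# ownership compared to the 5- or 6-year schedules used for most hardware.
```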

New Capabilities for Building AI Apps and More

Huang announced some new capabilities that make it easier for customers to deploy AI-based applications. Nvidia has expanded its range of API-based services, but has also packaged up individual LLM components from many partners into deployable microservice containers called Nvidia Inference Microservices (NIMs), which will be available from ai.Nvidia.com and in online marketplaces.

These include Helm charts for deployment on Kubernetes clusters. This is a very helpful capability, as managing AI applications that keep breaking, month by month, as the software stack they depend on ages is a big headache. Having Nvidia or its partners package up and maintain a well-tested stack as a container seems valuable.
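As an illustration of how such a packaged microservice might be consumed once it is deployed (for example, via its Helm chart onto a Kubernetes cluster), here is a hypothetical sketch; the port, URL path, and model name are assumptions for illustration, not documented NIM specifics.

```python
# Hypothetical client call to a locally deployed inference microservice.
import json
import urllib.request

NIM_URL = "http://localhost:8000/v1/chat/completions"   # assumed local endpoint

payload = {
    "model": "example-llm",                              # placeholder model id
    "messages": [{"role": "user", "content": "Summarize the GTC keynote."}],
}
request = urllib.request.Request(
    NIM_URL,
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(request) as response:
    body = json.loads(response.read())
    print(body["choices"][0]["message"]["content"])
```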

There were a lot more announcements for markets like automotive, healthcare, quantum computing, digital twins, and cool virtual reality demos that you can watch in the keynote or read about elsewhere, but I’m most impressed by the new hardware specifications and the interconnect. They go far beyond the industry-standard designs from the CXL consortium, which are based on extensions of the PCIe bus interface, offer an order of magnitude less bandwidth than NVlink, and are several years behind in their development.

I’ve been excited to see the development of CXL over the last few years. It has its place for memory pooling and more dynamic fabric management, but it’s not going to be competitive for GPU-based AI workloads. Enterprises that want to buy standard interfaces and have a long life for their installed GPU hardware are currently out of luck.

I’ve been predicting the onset of very large memory systems for several years, and I named the architecture pattern Petalith, for petascale monolith. I think it will take a while for the software to catch up and become optimized for these systems, but it will unlock the ability to build new kinds of applications: not just AI models themselves, but applications that use models as a common building block.

It’s impressive to see Nvidia executing well: the people I know who work there are happy, its market capitalization has grown a lot recently, and the rest of the industry is trying to figure out how to compete.

Arm, AWS and Google are sponsors of The New Stack.

