Note about “THE NVLINK-NETWORK SWITCH: NVIDIA’S SWITCH CHIP FOR HIGH COMMUNICATION-BANDWIDTH SUPERPODS”

ztex, Tony, Liu
4 min read · Jan 8, 2025


This post collects some takeaways from the presentation about the design of NVLink and the NVSwitch chip.

NVLink and NVSwitch are interconnect technologies developed by NVIDIA. Their purpose is to provide a strong communication solution for GPU clusters.

Inevitably, we need to exchange data across multiple SMs or even multiple GPUs. Of course, we would like the data to end up in the shared L2 cache, but sometimes it does not. That is where NVLink comes in.

NVLink is 4x faster than PCIe Gen5 and offers a wider interface.

NVLink removes many layers from the usual communication protocols and the corresponding software stack (e.g., the presentation layer); that work is handled by the CUDA runtime instead.
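
As a rough illustration of what "letting the CUDA runtime handle it" looks like from the application side, here is a minimal sketch (mine, not from the talk) that enables peer access between two GPUs and copies a buffer directly from device 0 to device 1. When the devices are connected by NVLink, such peer copies travel over it rather than over PCIe.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Minimal sketch: direct GPU-to-GPU copy through the CUDA runtime.
// When the two devices are connected by NVLink, cudaMemcpyPeer uses
// that link; otherwise it falls back to PCIe.
int main() {
    const size_t bytes = 1 << 20;  // 1 MiB payload
    float *buf0 = nullptr, *buf1 = nullptr;

    // Check whether device 0 can access device 1's memory directly.
    int canAccess = 0;
    cudaDeviceCanAccessPeer(&canAccess, 0, 1);
    if (!canAccess) { printf("No peer access between GPU 0 and GPU 1\n"); return 0; }

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);
    cudaMalloc(&buf0, bytes);

    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(0, 0);
    cudaMalloc(&buf1, bytes);

    // Copy directly from GPU 0's memory to GPU 1's memory,
    // with no staging through host memory.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    printf("Peer copy done\n");
    return 0;
}
```

Note that there is no socket, packet, or session management in sight; the runtime and the link hardware take care of the transfer.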

In addition, since NVLink does not need features like adaptive routing or packet reordering, more NVLink ports can be packed onto a single chip.

2018 was a turning point for NVLink. Before that, NVLink was a point-to-point bus protocol; after that, with the help of NVSwitch, it became a fabric (an all-to-all connectivity pattern).

Moreover, NVLink can now run BETWEEN servers instead of just within a server; this is called the NVLink network.

One powerful feature of the NVLink/NVSwitch chip is SHARP. It can be considered an offload technique: operations such as COPY and ALLREDUCE can be done inside the switch chip instead of on the GPUs.
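
To see where this sits in application code, below is a minimal sketch (my own, not from the presentation) of a single-process NCCL all-reduce across all visible GPUs. On NVSwitch systems that support it, NCCL can offload the reduction into the switch (NVLink SHARP), but the call looks the same from the application's point of view either way.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <nccl.h>

// Minimal sketch: single-process all-reduce across all visible GPUs.
// On NVSwitch-based systems NCCL may offload the reduction into the
// switch (NVLink SHARP); the application-level call is unchanged.
int main() {
    int nDev = 0;
    cudaGetDeviceCount(&nDev);

    std::vector<int> devs(nDev);
    for (int i = 0; i < nDev; ++i) devs[i] = i;

    std::vector<ncclComm_t> comms(nDev);
    ncclCommInitAll(comms.data(), nDev, devs.data());

    const size_t count = 1 << 20;  // elements per GPU
    std::vector<float*> sendbuf(nDev), recvbuf(nDev);
    std::vector<cudaStream_t> streams(nDev);

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaMalloc(&sendbuf[i], count * sizeof(float));
        cudaMalloc(&recvbuf[i], count * sizeof(float));
        cudaStreamCreate(&streams[i]);
    }

    // One all-reduce call per GPU, grouped so NCCL launches them together.
    ncclGroupStart();
    for (int i = 0; i < nDev; ++i)
        ncclAllReduce(sendbuf[i], recvbuf[i], count, ncclFloat, ncclSum,
                      comms[i], streams[i]);
    ncclGroupEnd();

    for (int i = 0; i < nDev; ++i) {
        cudaSetDevice(i);
        cudaStreamSynchronize(streams[i]);
        cudaFree(sendbuf[i]);
        cudaFree(recvbuf[i]);
        ncclCommDestroy(comms[i]);
    }
    printf("All-reduce across %d GPUs done\n", nDev);
    return 0;
}
```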

But sometimes SHARP is not enough, and all you need is simply more bandwidth.

One example is the recommender engine, which relies on a data structure called the embedding table. The problem is that the embedding table is too big to fit on one GPU, so it has to be split and distributed across multiple GPUs. As a result, the GPUs talk to each other all the time, and tremendous bandwidth is required.
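
To make the traffic pattern concrete, here is a small illustrative sketch; the helper names (`owner_of`, `fetch_row`) and the sizes are made up. The embedding table is sharded row-wise across GPUs, so every lookup whose row lives on another GPU becomes a cross-GPU copy, which is why aggregate GPU-to-GPU bandwidth ends up being the bottleneck.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Illustrative sketch only (owner_of/fetch_row are made-up helpers):
// an embedding table too large for one GPU is sharded row-wise, so any
// lookup whose row lives on another GPU turns into a cross-GPU copy.
constexpr int  kEmbDim     = 128;      // floats per embedding row
constexpr long kRowsPerGpu = 1 << 18;  // tiny here; real shards are huge

int  owner_of(long row)    { return static_cast<int>(row / kRowsPerGpu); }
long local_index(long row) { return row % kRowsPerGpu; }

// Copy one embedding row into a buffer on the GPU doing the lookup.
// Remote rows use a peer copy, which rides NVLink when it is available.
void fetch_row(float* dst, int dst_gpu, float* const shard[], long row) {
    int  src_gpu = owner_of(row);
    long offset  = local_index(row) * kEmbDim;
    cudaMemcpyPeer(dst, dst_gpu, shard[src_gpu] + offset, src_gpu,
                   kEmbDim * sizeof(float));
}

int main() {
    int nGpus = 0;
    cudaGetDeviceCount(&nGpus);

    // Allocate one shard of the table per GPU.
    std::vector<float*> shard(nGpus);
    for (int g = 0; g < nGpus; ++g) {
        cudaSetDevice(g);
        cudaMalloc(&shard[g], kRowsPerGpu * kEmbDim * sizeof(float));
    }

    // GPU 0 looks up a row that happens to live on the last GPU.
    cudaSetDevice(0);
    float* out = nullptr;
    cudaMalloc(&out, kEmbDim * sizeof(float));
    fetch_row(out, 0, shard.data(), (long)(nGpus - 1) * kRowsPerGpu + 42);

    cudaDeviceSynchronize();
    return 0;
}
```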

For security reasons, the globally shared physical address space across GPUs is abandoned. Instead, the NVLink network uses "addresses on the network," so that each GPU keeps its own isolated physical address space.

This also allows dynamic configuration, with help from the GPU MMU and the link TLB hardware: servers can be torn down or reconfigured for restart or for a different computation as needed.

Summary

This is a note about the NVLink-Network switch. It covers the history of NVLink, the new features of the 4th-generation NVLink, and some takeaways about the switch chip. It also discusses the NVLink-enabled server generations and the benefits of the NVLink network.

The NVLink network is a high-bandwidth, low-latency interconnect, built on the NVLink interface, that connects multiple GPUs together. It is a key component of NVIDIA's DGX H100 SuperPOD, a high-performance computing system designed for AI workloads. It can improve the performance of a wide variety of AI applications, including training large language models and running inference for recommender systems, and it scales to connect thousands of GPUs.

Let me know if you have anything to share about NVLink or Nvidia GPU architecture.

video: https://youtu.be/S117CO2KL-0?si=RkDMnWbX9sNxZnsl&t=3939

slides: https://hc34.hotchips.org/assets/program/conference/day2/Network%20and%20Switches/NVSwitch%20HotChips%202022%20r5.pdf
