Note: Virtual Memory Systems in Multi-GPU Architectures

ztex, Tony, Liu
Feb 8, 2025


Introduction

A virtual memory system on a GPU provides conveniences such as programmability and isolation.

But it also faces challenges:

  • The multi-level TLBs on the GPU have to handle a high degree of concurrency, and the bandwidth demands placed on the translation services of the IOMMUs can be tremendous.
  • On-demand paging and page eviction, both part of the automatic data management in UM, can also create significant performance bottlenecks.

Virtual Address Translation on a GPU

Modern GPUs use virtual addresses in their CUs. Virtual addressing abstracts away the physical location of the data, enabling many useful features. For example, virtual addressing enables Unified Memory (UM), which relieves the programmer from the burden of performing explicit memory management. The GPU hardware and driver can effectively move data from device to device by changing the virtual to physical address mapping. Virtual addressing also enables address-space isolation for running concurrent applications on the GPU.
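As a small, concrete example of what UM buys the programmer, the sketch below allocates one managed buffer with cudaMallocManaged and touches it from both the CPU and a GPU kernel; the driver migrates the pages on demand instead of requiring explicit copies. The kernel and sizes are just placeholders for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel: scales a vector in place.
__global__ void scale(float *x, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main() {
    const int n = 1 << 20;

    // Unified Memory: one pointer valid on both CPU and GPU.
    // The driver migrates pages on demand instead of requiring
    // explicit cudaMemcpy calls.
    float *x;
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;      // touched on the CPU first

    scale<<<(n + 255) / 256, 256>>>(x, n, 2.0f);  // pages migrate to the GPU on first access
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);                  // pages migrate back on CPU access
    cudaFree(x);
    return 0;
}
```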

The GPU driver, which is vendor specific, is responsible for handling VM-related software functionality such as page allocation and page-fault handling.

Virtual addresses need to be translated to physical addresses before data can be accessed in the GPU L1 cache. Modern GPUs provide dedicated hardware for address translation, which includes multi-level TLBs and multi-threaded page table walkers (PTWs).
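To make the lookup path concrete, here is a toy, host-side sketch of a two-level TLB backed by a page-table walk. Everything here, from the flat page table to the structure names, is invented for illustration and does not model any particular GPU's translation hardware.

```cuda
#include <cstdint>
#include <cstdio>
#include <unordered_map>

// Toy model of the translation path: L1 TLB -> L2 TLB -> page table walk.
constexpr uint64_t kPageSize = 4096;

struct Tlb {
    std::unordered_map<uint64_t, uint64_t> entries;            // VPN -> PFN
    bool lookup(uint64_t vpn, uint64_t *pfn) const {
        auto it = entries.find(vpn);
        if (it == entries.end()) return false;
        *pfn = it->second;
        return true;
    }
    void fill(uint64_t vpn, uint64_t pfn) { entries[vpn] = pfn; }
};

std::unordered_map<uint64_t, uint64_t> page_table;              // stands in for the table walked by the PTWs

uint64_t translate(Tlb &l1, Tlb &l2, uint64_t vaddr) {
    uint64_t vpn = vaddr / kPageSize, off = vaddr % kPageSize, pfn;
    if (l1.lookup(vpn, &pfn)) return pfn * kPageSize + off;     // L1 TLB hit
    if (l2.lookup(vpn, &pfn)) {                                 // L2 TLB hit
        l1.fill(vpn, pfn);
        return pfn * kPageSize + off;
    }
    // Miss in both TLBs: a page table walker resolves the mapping.
    // (On real hardware a missing entry raises a page fault to the
    // IOMMU/driver; here .at() would simply throw.)
    pfn = page_table.at(vpn);
    l2.fill(vpn, pfn);
    l1.fill(vpn, pfn);
    return pfn * kPageSize + off;
}

int main() {
    Tlb l1, l2;
    page_table[0x42] = 0x99;                                    // pretend mapping: VPN 0x42 -> PFN 0x99
    printf("paddr = 0x%llx\n",
           (unsigned long long)translate(l1, l2, 0x42 * kPageSize + 0x10));
    return 0;
}
```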

Data Movement in Multi-GPU Systems

Communication in multi-GPU systems occurs through two major mechanisms: 1) demand paging and 2) direct cache access through Remote Direct Memory Access (RDMA).

Demand Paging: Demand paging is one mechanism to exchange data between multiple GPUs. It requires support from the GPU driver to orchestrate the page-migration process, as well as support from the hardware to ensure the TLBs and pipelines are flushed, so as to avoid any data coherency issues. When one GPU generates a page fault for a page that is resident on another GPU, the driver will initiate a TLB shootdown, pipeline flush, and cache flush on the other GPU. Once those steps are complete, the data is migrated from one device to another.
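From the programming-model side, the sketch below shows where this path is exercised: GPU 0 touches a managed buffer first, and GPU 1's later accesses would fault and trigger the migration described above unless the pages are moved ahead of time with cudaMemPrefetchAsync. The device IDs, sizes, and kernel are arbitrary placeholders, and the code assumes a system with at least two GPUs.

```cuda
#include <cuda_runtime.h>

__global__ void touch(char *p, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) p[i] += 1;
}

int main() {
    const size_t bytes = 1 << 22;                      // 4 MiB of managed memory (arbitrary)
    char *buf;
    cudaMallocManaged(&buf, bytes);

    cudaSetDevice(0);
    touch<<<(bytes + 255) / 256, 256>>>(buf, bytes);   // pages end up resident on GPU 0
    cudaDeviceSynchronize();

    cudaSetDevice(1);
    // Without the prefetch below, GPU 1's first accesses fault and the driver
    // migrates the pages (TLB shootdown and flush on GPU 0, as described above).
    cudaMemPrefetchAsync(buf, bytes, 1 /* dstDevice */, 0);
    touch<<<(bytes + 255) / 256, 256>>>(buf, bytes);
    cudaDeviceSynchronize();

    cudaFree(buf);
    return 0;
}
```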

Direct Cache Access Through RDMA: The requested data is accessed remotely using the interconnect between the devices (i.e., the CPUs and GPUs) without migrating the page. Unlike demand paging, direct cache access happens entirely in hardware, at the granularity of a cache line.

On a GPU L1 cache miss for data that resides physically in remote memory, the request is sent to the RDMA engine. The RDMA engine then routes this request to the remote GPU’s L2 cache. Once the remote GPU’s L2 cache services this request, the data is returned to the local GPU’s L1 cache through the RDMA engine.
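For explicit (non-UM) allocations on NVIDIA hardware, peer-to-peer access gives a similar remote-access behavior: once enabled, a kernel running on one GPU can dereference a pointer that is physically resident on another GPU, and the loads are serviced over the interconnect instead of migrating pages. A minimal sketch, assuming two peer-capable GPUs (device numbering is arbitrary):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_remote(const float *remote, float *out, int n) {
    // Each load from `remote` is serviced from the other GPU's memory
    // over the interconnect; no page is migrated.
    float s = 0.0f;
    for (int i = threadIdx.x; i < n; i += blockDim.x) s += remote[i];
    atomicAdd(out, s);
}

int main() {
    const int n = 1 << 20;
    int can = 0;
    cudaDeviceCanAccessPeer(&can, 0, 1);
    if (!can) { printf("GPU 0 cannot access GPU 1\n"); return 0; }

    cudaSetDevice(1);
    float *remote;                                     // physically resident on GPU 1
    cudaMalloc(&remote, n * sizeof(float));
    cudaMemset(remote, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);                  // let GPU 0 dereference GPU 1's memory
    float *out;
    cudaMalloc(&out, sizeof(float));
    cudaMemset(out, 0, sizeof(float));

    sum_remote<<<1, 256>>>(remote, out, n);            // runs on GPU 0, reads GPU 1's memory
    cudaDeviceSynchronize();

    float h = 0.0f;
    cudaMemcpy(&h, out, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %f\n", h);
    return 0;
}
```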

Demand paging enables better utilization of data locality, as all memory accesses after the initial migration are local (i.e., on the same GPU). However, demand paging transfers a large amount of data (usually 4KB–2MB) over a relatively slow inter-device fabric, introducing long latency. Moreover, it requires modifications to the page table, resulting in instruction pipeline flushes and TLB shootdowns on the device that currently holds the page. Alternatively, direct cache access does not pay the cost of a page migration, but it incurs the cost of a remote memory access for every reference to a page residing on another GPU.
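With Unified Memory, the programmer can bias this trade-off with hints: cudaMemAdviseSetPreferredLocation keeps the pages resident on one GPU, and cudaMemAdviseSetAccessedBy maps them into another GPU so that its accesses go over the interconnect instead of faulting and migrating. A short sketch (device numbers are arbitrary):

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1 << 22;
    float *buf;
    cudaMallocManaged(&buf, bytes);

    // Keep the pages resident on GPU 0...
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetPreferredLocation, 0);
    // ...and map them into GPU 1, so GPU 1's accesses go over the
    // interconnect (direct remote access) instead of triggering migration.
    cudaMemAdvise(buf, bytes, cudaMemAdviseSetAccessedBy, 1);

    // A mostly-GPU-0 workload with occasional GPU 1 reads would now avoid
    // the page-migration cost while GPU 0 keeps its accesses local.
    cudaFree(buf);
    return 0;
}
```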

Address Translation Process on a GPU

  1. The process starts with one of the CUs of GPU 2 generating a memory access request, with the L1 TLB performing address translation. When the address translation request arrives at the IOMMU for servicing, the IOMMU detects that the page is not present on GPU 2 and triggers a page fault, which is handled by the GPU driver.
  2. The driver, upon receiving the page fault request, flushes GPU 1 to invalidate the page on GPU 1, ensuring translation coherence. The flushing process includes flushing in-flight instructions in the CU pipelines, in-flight transactions in the caches and TLBs, and the contents of the caches and TLBs.
  3. The Page Migration Controller migrates the page from GPU 1 to GPU 2 and notifies the driver when the migration is complete.
  4. Then, the driver can let GPU 1 replay the flushed memory transactions before continuing execution.
  5. Now that the page table has been updated with the newly migrated page, the IOMMU can send the updated address translation back to the L1 and L2 TLBs of GPU 2.
  6. Finally, with the updated translation information, GPU 2 can finish the memory access.

We cannot skip the GPU flushing process for three reasons: 1) the TLB may buffer stale translation information, 2) the L2 cache may hold data that is more recent than the data in DRAM, and 3) in-flight memory transactions in the L1 caches, the L2 caches, and the TLBs can have stale translations that have to be discarded.
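To tie the six steps together, here is a toy, host-side walk-through of the same sequence written as plain functions. Every structure and name below is invented purely for illustration; it mirrors the list above and is not real driver code.

```cuda
#include <cstdio>
#include <set>

// Toy model of the fault -> flush -> migrate -> replay sequence above.
struct Gpu {
    int id;
    std::set<int> resident_pages;   // pages whose data lives in this GPU's DRAM
    std::set<int> tlb;              // cached translations (just page numbers here)

    void flush() {
        // Step 2 of the list above, modeled only as dropping cached translations;
        // a real GPU also drains pipelines and flushes caches.
        tlb.clear();
        printf("GPU %d: flushed\n", id);
    }
};

void migrate_page(Gpu &src, Gpu &dst, int page) {
    // Step 1: dst faulted on `page`; the driver finds it resident on src.
    src.flush();                    // step 2: translation coherence on src
    src.resident_pages.erase(page); // step 3: the Page Migration Controller moves the page
    dst.resident_pages.insert(page);
    // Step 4: src may now replay its flushed transactions (not modeled here).
    dst.tlb.insert(page);           // step 5: updated translation sent to dst's TLBs
    printf("page %d migrated: GPU %d -> GPU %d\n", page, src.id, dst.id);
    // Step 6: dst retries the original memory access and now hits locally.
}

int main() {
    Gpu gpu1{1, {42}, {42}};
    Gpu gpu2{2, {}, {}};
    migrate_page(gpu1, gpu2, 42);   // GPU 2 touched page 42, which lived on GPU 1
    return 0;
}
```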


