NVIDIA CUDA Compiler Driver Process
Introduction
Typically, after writing some CUDA code, you can simply run your CUDA kernels on the GPU. This is convenient, and most of the time you do not need to worry about the details.
This article explains what happens at the compilation stage and how your CUDA kernel is deployed onto the GPU.
CUDA source
CUDA source code is a mixture of host and device code. Host code is compiled with a conventional C++ compiler and runs on the CPU, while device code is compiled with the NVIDIA compiler and assembler and runs on the GPU.
After compilation, the compiled GPU functions are embedded as fat binary images in the host object file. In the linking stage, specific CUDA runtime libraries are added to support remote SPMD procedure calling and to provide explicit GPU manipulation, such as allocation of GPU memory buffers and host-GPU data transfers.
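For example, a single .cu file like the minimal sketch below (file and kernel names are illustrative) contains both kinds of code: nvcc separates them, compiles each part with the appropriate toolchain, and embeds the GPU code back into the host object file.

    // hello.cu: one file containing both host and device code
    #include <cstdio>

    __global__ void hello_kernel() {   // device code: compiled by nvcc, runs on the GPU
        printf("hello from thread %d\n", threadIdx.x);
    }

    int main() {                       // host code: compiled by the host C++ compiler, runs on the CPU
        hello_kernel<<<1, 4>>>();      // launch the kernel with 4 GPU threads
        cudaDeviceSynchronize();       // wait for the GPU to finish
        return 0;
    }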
What is a fat binary? Why do we need it?
As NVIDIA GPUs evolve to support new features, the instruction set architecture naturally changes. To keep applications working across different generations, the NVIDIA compiler toolchain supports compiling for multiple architectures in the same application executable or library.
CUDA also relies on the PTX virtual GPU ISA to provide forward compatibility, so that already deployed applications can run on future GPU architectures.
nvcc, the CUDA compiler driver, uses a two-phase compilation model: first it compiles the device source code to PTX virtual assembly, and then it compiles the PTX to architecture-specific binary code (SASS).
The CUDA driver can also perform this second stage at run time, compiling the PTX virtual assembly just in time (JIT) before running it, but the overhead of doing so is significant.
To reduce this overhead, there are two tricks: fat binaries and JIT caching.
A fat binary includes binary code for multiple architectures along with the PTX code.
At run time, the CUDA runtime looks for binary code matching the current GPU architecture and runs it if found; otherwise, the driver JIT-compiles the embedded PTX.
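As a rough sketch (file name and architectures are illustrative), a single nvcc invocation can embed SASS for several existing architectures plus PTX for the newest one, so that future GPUs can fall back to JIT-compiling the PTX:

    // saxpy.cu: build a fat binary with SASS for sm_70 and sm_80 plus PTX for compute_80
    //
    //   nvcc saxpy.cu -o saxpy \
    //     -gencode arch=compute_70,code=sm_70 \
    //     -gencode arch=compute_80,code=sm_80 \
    //     -gencode arch=compute_80,code=compute_80   (code=compute_80 embeds the PTX)
    __global__ void saxpy(int n, float a, const float *x, float *y) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }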
Runtime Fatbin Creation
So far, the fatbin has been created at compile time rather than at run time. CUDA Toolkit 12.4 introduced a new library, nvFatbin, for creating fatbins at run time.
Until then, to generate a fatbin you had to rely on the command-line tool fatbinary, which is ill-suited for dynamic code generation: you had to write the generated code to a file, invoke fatbinary with exec() or similar, and then collect the outputs. What we want is a programmatic mechanism, something like a dynamic linking library for CUDA.
The nvFatbin library creates a fatbin directly from the input files. It does not do any linking or compilation itself and does not depend on the CUDA driver. It can even be run on systems without a GPU.
With the introduction of nvFatbin, generating flexible libraries dynamically is easier than ever.
For example, TensorRT wants to store both CUBINs for existing architectures and PTX for future architectures. That way, optimized versions of the code are used when possible, while the code remains compatible with architectures that do not exist yet.
With this, we can make sure the code that is optimal for the current architecture is also compatible with future architectures.
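Below is a minimal sketch of that flow, assuming the nvFatbin C API shipped with CUDA 12.4 (nvFatbinCreate, nvFatbinAddCubin, nvFatbinAddPTX, nvFatbinSize, nvFatbinGet, nvFatbinDestroy). The input buffers, entry names, architecture strings, and option handling here are placeholders and error checking is omitted, so the exact argument formats should be verified against the nvFatbin documentation.

    #include <nvFatbin.h>   // header of the nvFatbin library (CUDA 12.4+)
    #include <cstddef>
    #include <cstdlib>

    // cubin_data/ptx_text and their sizes are assumed to come from earlier
    // compilation (for example, NVRTC output); they are placeholders here.
    void *build_fatbin(const void *cubin_data, size_t cubin_size,
                       const char *ptx_text, size_t ptx_size, size_t *out_size) {
        nvFatbinHandle handle;
        nvFatbinCreate(&handle, nullptr, 0);   // no extra options (an assumption; see the docs)

        // Add an architecture-specific CUBIN plus PTX for forward compatibility.
        // The architecture/name/option strings below are illustrative.
        nvFatbinAddCubin(handle, cubin_data, cubin_size, "sm_80", "kernels_sm80");
        nvFatbinAddPTX(handle, ptx_text, ptx_size, "compute_80", "kernels_ptx", "");

        nvFatbinSize(handle, out_size);        // query the size of the resulting fatbin
        void *fatbin = std::malloc(*out_size);
        nvFatbinGet(handle, fatbin);           // copy the fatbin into the buffer
        nvFatbinDestroy(&handle);
        return fatbin;                         // ready to load, e.g. with cuModuleLoadData
    }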
CUDA kernel launch process: compilation, linking, runtime and kernel execution
Compilation
The CUDA compilation process begins with the NVIDIA CUDA Compiler (NVCC), which separates the device code (GPU code) from the host code (CPU code):
- Preprocessing: The input program is preprocessed twice, once for device compilation and once for host compilation
- Device Code Compilation: NVCC compiles the device code into CUDA binary (cubin) and/or PTX intermediate code
- Host Code Compilation: The host code is compiled using a standard C++ host compiler (e.g., gcc, clang)
- Fatbinary Creation: The compiled device code (cubin and/or PTX) is packaged into a fatbinary (a sketch of how to inspect these stages follows this list)
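The intermediate result of each stage can be inspected with standard nvcc options; the file and kernel names below are illustrative.

    // kernel.cu: a small compilation target for inspecting the stages above
    __global__ void scale(float *x, float s) { x[threadIdx.x] *= s; }

    // nvcc -ptx    kernel.cu -o kernel.ptx      device code -> PTX virtual assembly
    // nvcc -cubin  kernel.cu -arch=sm_80        device code -> architecture-specific cubin
    // nvcc -fatbin kernel.cu -o kernel.fatbin   package cubin and/or PTX into a fatbinary
    // nvcc -c      kernel.cu -o kernel.o        full compile: fatbinary embedded in the host object
    // nvcc --keep  kernel.cu                    keep all intermediate files for inspection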
Linking
After compilation, the linking process combines the compiled host and device code:
- Device Code Linking: If separate compilation is used, device code from multiple object files is linked together using the --device-link (or -dlink) option, as sketched after this list
- Host Code Linking: The host object file is linked with the embedded fatbinary and the necessary CUDA runtime libraries
- Final Executable: The result is a single executable containing both host and device code
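A small separate-compilation sketch (file names are illustrative): a __device__ function defined in one translation unit and called from a kernel in another can only be resolved by the device-link step.

    // particle.cu: defines a __device__ function used from another translation unit
    __device__ float scale(float v) { return 2.0f * v; }

    // main.cu: calls scale(), which is resolved at device-link time
    extern __device__ float scale(float v);

    __global__ void apply(float *data, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) data[i] = scale(data[i]);
    }

    // Build with separate compilation:
    //   nvcc -dc particle.cu main.cu                  relocatable device code in each object
    //   nvcc -dlink particle.o main.o -o gpu.o        the device-link step
    //   g++ particle.o main.o gpu.o -lcudart -o app   host link with the CUDA runtime
    //   (add -L/usr/local/cuda/lib64 or the equivalent CUDA library path if needed)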
Runtime
When the CUDA program is executed, the CUDA runtime system handles the interaction between the host and the device:
- Context Creation: A CUDA context, which is a virtual execution space for the GPU, is created when the first CUDA function is called
- Memory Allocation: The program allocates memory on both the host (CPU) and device (GPU)
- Data Transfer: Necessary data is transferred from host memory to device memory (a short sketch follows this list)
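A minimal sketch of these runtime steps using the CUDA runtime API (sizes and values are illustrative):

    #include <cuda_runtime.h>
    #include <vector>

    int main() {
        cudaSetDevice(0);                        // choose a GPU; the context itself is created
                                                 // lazily on the first call that needs it

        std::vector<float> h_data(1024, 1.0f);   // host (CPU) memory
        float *d_data = nullptr;
        cudaMalloc(&d_data, h_data.size() * sizeof(float));   // device (GPU) memory

        // Host -> device transfer; results come back later with cudaMemcpyDeviceToHost
        cudaMemcpy(d_data, h_data.data(), h_data.size() * sizeof(float),
                   cudaMemcpyHostToDevice);

        cudaFree(d_data);
        return 0;
    }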
Kernel Execution
The actual execution of a CUDA kernel involves several steps, illustrated by the sketch at the end of this list:
- Kernel Launch: The host code launches the kernel using the <<<...>>> syntax, specifying the execution configuration (grid and block dimensions)
- Thread Hierarchy: The kernel is executed as a grid of thread blocks. Each thread block contains a number of threads that can work in parallel
- Scheduling: The GPU’s hardware scheduler distributes thread blocks across available Streaming Multiprocessors (SMs)
- Execution: Each thread executes the kernel code, operating on its assigned data based on its unique thread and block indices
- Synchronization: If necessary, threads within a block can synchronize using __syncthreads()
- Completion: The kernel execution completes when all threads have finished their work
- Result Retrieval: The results are transferred back from device memory to host memory
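Putting these steps together, here is a minimal sketch of a complete launch (names and sizes are illustrative; managed memory is used to keep it short, and __syncthreads() is included only to illustrate intra-block synchronization):

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique global thread index
        if (i < n) c[i] = a[i] + b[i];
        __syncthreads();                                 // barrier for the threads of this block
    }

    int main() {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMallocManaged(&a, bytes);                    // managed memory, visible to host and device
        cudaMallocManaged(&b, bytes);
        cudaMallocManaged(&c, bytes);
        for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

        int block = 256;                                 // threads per block
        int grid  = (n + block - 1) / block;             // blocks in the grid
        add<<<grid, block>>>(a, b, c, n);                // launch with the execution configuration
        cudaDeviceSynchronize();                         // wait for all threads to complete

        // With explicit cudaMalloc, the results would be copied back here
        // with cudaMemcpy(..., cudaMemcpyDeviceToHost).
        printf("c[0] = %f\n", c[0]);
        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }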