Artificial intelligence is now a part of everyday life, powering everything from search engines to chatbots. But behind the scenes, it takes enormous compute power to train and run large language models (LLMs). This is where GPUs step into the spotlight: their architecture is uniquely designed to handle the scale of parallelism that modern AI demands. In this post, I’ll introduce some of the core ideas of CUDA programming and Nvidia GPU architecture, and highlight how Nvidia software and hardware work together.
CPUs are built for versatility. They have a small number of powerful cores, optimized for handling complex tasks one after another. GPUs, on the other hand, are built for raw parallelism. A single high-end GPU like the GB202 packs over 24,000 CUDA cores—orders of magnitude more than a typical CPU.
Of course, having thousands of cores only helps if you have thousands of tasks to run in parallel. That’s why GPU programming looks very different from CPU programming: instead of focusing on a handful of threads, you need to think in terms of thousands.
Additionally, the GPU and the host usually have separate memories, so explicit memory-transfer calls are required to move data between the CPU and the GPU. This is quite different from a CPU program, where memory loads and stores are implicit.
Finally, the GPU may adhere to a different floating-point representation or rounding behavior, which can cause the result of a GPU program to differ slightly from that of a CPU program.
TIP
It’s a good habit to verify the result of the GPU program against an equivalent CPU result.
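For example, after copying the GPU result back to the host, a simple element-wise comparison against a CPU reference catches most issues. A minimal sketch (the 1e-5f tolerance is just an illustrative choice):

#include <cstdio>
#include <cmath>

// Compare a GPU result (already copied back to the host) against a CPU reference.
bool verify(const float* h_ref, const float* h_gpu, int N) {
    for (int i = 0; i < N; ++i) {
        // Allow a small tolerance: GPU and CPU floating-point results can differ slightly.
        if (std::fabs(h_ref[i] - h_gpu[i]) > 1e-5f) {
            std::fprintf(stderr, "Mismatch at %d: %f vs %f\n", i, h_ref[i], h_gpu[i]);
            return false;
        }
    }
    return true;
}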
A CUDA program is essentially a C/C++ program with some CUDA extensions. The CUDA part, i.e. the device code, is compiled separately and linked with the host code.
Functions are marked with __host__ (can be omitted), __global__, or __device__ in the function definition, for example:
__global__ void vecAdd(float* d_a, float* d_b, float* d_c, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        d_c[i] = d_a[i] + d_b[i];
    }
}
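The other qualifiers work similarly. As a small sketch (these helper functions are illustrative, not part of the original example), a __device__ function can only be called from device code, while a function marked both __host__ and __device__ is compiled for both sides:

// Callable only from device code (e.g. from inside a kernel).
__device__ float scale(float x, float factor) {
    return x * factor;
}

// Compiled for both host and device; callable from either side.
__host__ __device__ float square(float x) {
    return x * x;
}

__global__ void vecSquareScale(float* d_a, float* d_out, int N, float factor) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        d_out[i] = scale(square(d_a[i]), factor);
    }
}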
NOTE
Host and device addresses are in different spaces, therefore you cannot pass a host address-space pointer as a parameter to a function marked __global__.
TIP
Use the prefixes “h_” and “d_” to differentiate pointers in the host and device address spaces.
To call a kernel, the grid size and block size need to be specified.
gridDim and blockDim will be detailed in the next section. You can also explicitly specify how many bytes of shared memory each block gets; shared memory will be discussed in the memory architecture section. By default, kernels are launched sequentially, one by one. If more streams are defined, multiple kernels can run simultaneously.
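For example, the vecAdd kernel above could be launched like this (the block size of 256 and the sharedMemBytes/stream names are illustrative choices, not from the original post):

int N = 1 << 20;
int blockSize = 256;
int gridSize = (N + blockSize - 1) / blockSize;  // round up so every element is covered

// <<<grid, block>>> is the basic launch form.
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);

// The optional third and fourth launch arguments specify the dynamic
// shared-memory bytes per block and the stream to launch on:
// vecAdd<<<gridSize, blockSize, sharedMemBytes, stream>>>(d_a, d_b, d_c, N);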
Another thing worth noting is that pointers like d_a, d_b, and d_c need to be allocated in device memory before being passed to the kernel. You need to explicitly “malloc” on the device (and later free it) by calling
cudaMalloc((void**)&d_data, size);
cudaFree(d_data);
To initialize the device memory, use cudaMemset or copy data over from the host with cudaMemcpy; the same call with cudaMemcpyDeviceToHost brings the results back to the host:
cudaMemset(d_array, 0, N * sizeof(int));                                // zero out device memory
cudaMemcpy(d_array, h_array, N * sizeof(int), cudaMemcpyHostToDevice); // initialize from host data
cudaMemcpy(h_array, d_array, N * sizeof(int), cudaMemcpyDeviceToHost); // copy results back
Here is a piece of sample code:
vec_add.cu
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cmath>

#define CUDA_CHECK(call)                                                     \
  do {                                                                       \
    cudaError_t err__ = (call);                                              \
    if (err__ != cudaSuccess) {                                              \
      fprintf(stderr, "CUDA error %s at %s:%d\n", cudaGetErrorString(err__), \
              __FILE__, __LINE__);                                           \
      exit(EXIT_FAILURE);                                                    \
    }                                                                        \
  } while (0)
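The rest of the program follows the pattern described above: allocate device buffers, copy the inputs over, launch the kernel, copy the result back, and verify it against a CPU reference. A minimal sketch of how that host code might look, assuming the vecAdd kernel and CUDA_CHECK macro defined earlier (the exact details of the original vec_add.cu may differ):

int main() {
    const int N = 1 << 20;
    const size_t size = N * sizeof(float);

    // Host buffers with some test data.
    std::vector<float> h_a(N, 1.0f), h_b(N, 2.0f), h_c(N, 0.0f);

    // Allocate device buffers and copy the inputs over.
    float *d_a, *d_b, *d_c;
    CUDA_CHECK(cudaMalloc((void**)&d_a, size));
    CUDA_CHECK(cudaMalloc((void**)&d_b, size));
    CUDA_CHECK(cudaMalloc((void**)&d_c, size));
    CUDA_CHECK(cudaMemcpy(d_a, h_a.data(), size, cudaMemcpyHostToDevice));
    CUDA_CHECK(cudaMemcpy(d_b, h_b.data(), size, cudaMemcpyHostToDevice));

    // Launch the kernel with enough blocks to cover all N elements.
    int blockSize = 256;
    int gridSize = (N + blockSize - 1) / blockSize;
    vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);
    CUDA_CHECK(cudaGetLastError());
    CUDA_CHECK(cudaDeviceSynchronize());

    // Copy the result back and verify against the CPU.
    CUDA_CHECK(cudaMemcpy(h_c.data(), d_c, size, cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; ++i) {
        if (std::fabs(h_c[i] - (h_a[i] + h_b[i])) > 1e-5f) {
            fprintf(stderr, "Mismatch at %d\n", i);
            return 1;
        }
    }
    printf("OK\n");

    CUDA_CHECK(cudaFree(d_a));
    CUDA_CHECK(cudaFree(d_b));
    CUDA_CHECK(cudaFree(d_c));
    return 0;
}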
In the kernel invocation, you specify two 3D structures, gridSize and blockSize, which define the thread organization and have a profound impact on the performance of the kernel.
The blocks are organized in a 3D structure called a grid. Each block is a group of threads arranged in at most 3 dimensions. Each thread is aware of its position within the block and its block’s position within the grid through CUDA-defined variables, which can be used to retrieve the data needed for its workload. Two 3D intrinsic coordinates, threadIdx and blockIdx, along with two 3D dimensions, blockDim and gridDim, can be used to calculate the global coordinate of the thread within the grid.
int i = blockIdx.x * blockDim.x + threadIdx.x;
TIP
You can think of the organization of threads like a 6D array:
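Concretely, a unique global index can be derived from all six coordinates; a sketch of such a helper (not from the original post):

// Flatten the 6D coordinate (3D block index within the grid, 3D thread
// index within the block) into a single global thread index.
__device__ int globalThreadIndex() {
    int blockId = blockIdx.x
                + blockIdx.y * gridDim.x
                + blockIdx.z * gridDim.x * gridDim.y;
    int threadInBlock = threadIdx.x
                      + threadIdx.y * blockDim.x
                      + threadIdx.z * blockDim.x * blockDim.y;
    return blockId * (blockDim.x * blockDim.y * blockDim.z) + threadInBlock;
}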
The shape of the grid doesn’t matter performance-wise. For example, (16, 16, 1) and (256, 1, 1) both consist of 256 blocks, and from the GPU’s side there is no real difference. What really matters is the number of blocks. Each block can only run on a single SM (Streaming Multiprocessor), so to fully use all the GPU compute resources, at least as many blocks as there are SMs should be defined if you don’t want to waste resources. The number of SMs varies among GPU devices. On an A100 there are 108 SMs, so at least 108 blocks should be given, no matter how the grid is shaped, e.g. there is no difference between (4, 27, 1) and (108, 1, 1). The same applies to block dimensions: the shape of the block doesn’t affect performance, but the total number of threads per block matters.
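If you want the grid size to adapt to whatever GPU the code runs on, the SM count can be queried at runtime. A small sketch reusing vecAdd and CUDA_CHECK from above (launching exactly one block per SM is just the minimum illustrated here; real kernels usually launch many more blocks):

cudaDeviceProp prop;
CUDA_CHECK(cudaGetDeviceProperties(&prop, 0));   // properties of device 0
int numSMs = prop.multiProcessorCount;           // e.g. 108 on A100

// Launch at least one block per SM so that no SM sits idle.
int blockSize = 256;
int gridSize  = numSMs;
vecAdd<<<gridSize, blockSize>>>(d_a, d_b, d_c, N);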
In addition to grids and blocks, the threads of a block are further divided into warps, which are the scheduling units on an SM: the threads of a warp execute the same instruction and are scheduled to run at the same time. This is the SIMT part of the GPU, and it will be discussed further when the architecture is explained.
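Inside a kernel, a thread’s warp and its lane within that warp can be derived from the thread index; a small sketch assuming a 1D block (this kernel is illustrative, not from the original post):

__global__ void warpInfo() {
    int tid    = blockIdx.x * blockDim.x + threadIdx.x;
    int warpId = threadIdx.x / warpSize;   // which warp within the block (warpSize is 32 on current GPUs)
    int laneId = threadIdx.x % warpSize;   // position within the warp
    if (laneId == 0) {
        // Device-side printf requires #include <cstdio>.
        printf("global thread %d is lane 0 of warp %d in block %d\n",
               tid, warpId, blockIdx.x);
    }
}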
Writing a runnable CUDA program is easy, but writing one that can unleash the full power of the GPU is definitely more difficult. It requires a good understanding of the GPU architecture and memory hierarchy.
GPU cores are essentially vector processing units that can apply a single instruction to a large amount of data. Each core is called an SP (Streaming Processor) or CUDA core and has its own pipeline. In each cycle, a warp scheduler issues 1 or 2 (if they are independent) instructions to one warp, which usually consists of 32 threads. These 32 threads execute the instruction on 32 SPs simultaneously. This is why the execution model is called SIMT (Single Instruction, Multiple Threads): one instruction is broadcast and executed on different pipelines at the same time. SIMD is similar to SIMT; the main difference is that in SIMT the degree of parallelism is expressed in software as the number of threads, rather than fixed by the hardware vector width. Modern SMs have 4 warp schedulers, so at most 4 instruction sequences can be issued to 4 warps simultaneously.
Multiple SPs execute instructions under the control of a single SM (Streaming Multiprocessor), and one GPU can have many SMs. As mentioned above, the A100 has 108 SMs.