# 3D Cuda Kernel in Numba

There aren’t enough tutorials on writing 3D Cuda kernel in Numba so I decided to share my learnings here. In this small example, we will write a Cuda kernel (calculate_mean) to calculate the mean value of neighboring voxels for each voxel in a 3D volume.

There are few things we need to keep in mind while launching a kernel for example there are certain limits on the maximum number of Cuda blocks and the maximum number of threads inside each block. Taking this into consideration we calculate a feasible set of block and thread dimensions.

Here, the dimension of volume is 3200x1280x58.

`# Define the dimensions of the volumeIMGSIZx = 3200IMGSIZy = 1280IMGSIZz = 58`

To obtain the mean of neighboring voxels of each voxel we define the kernel calculate_mean below:

`# Cuda Kernel to calculate mean of neighboring voxels for each voxelimport numbafrom numba import njit, prangefrom numba import cuda@cuda.jit(debug=True)def calculate_mean(outbuf, inbuf):    # Calculate the index of the voxel being considered        ind_x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x    ind_y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y    ind_z = cuda.blockIdx.z * cuda.blockDim.z + cuda.threadIdx.z        # Or you could use the cuda.grid    # ind_x, ind_y, ind_z = cuda.grid(3)    if ind_x < 0 or  ind_y < 0 or  ind_z < 0:        return        if ind_x > IMGSIZx-1 or ind_y > IMGSIZy-1 or ind_z > IMGSIZz-1:        return     sum1    = 0.0    counter = 0    for ind_nr_z in range(ind_z-2, ind_z+3):        for ind_nr_y in range(ind_y-2, ind_y+3):            for ind_nr_x in range(ind_x-2, ind_x+3):                if ( ind_nr_x<0 or ind_nr_y<0 or ind_nr_z<0):                    continue                if ind_nr_x>(IMGSIZx-1) or ind_nr_y>(IMGSIZy-1) or ind_nr_z>(IMGSIZz-1):                    continue                                sum1 = sum1 + inbuf[ind_nr_z, ind_nr_y, ind_nr_x]                counter = counter+1    outbuf[ind_z, ind_y, ind_x] = sum1/counter    return`

In the code above code, we calculate the index of the current voxel using the block dimension and thread index and store it in the variables ind_x, ind_y, ind_z.

A 3-dimensional loop calculates the sum of the voxel values in the neighboring region of size 5x5x5. We ignore the indexes outside the range of volume.

To launch and test the kernel we create a random NumPy matrix as input and a zero-initialized matrix for storing the output below.

`# Create random input matrix for which to calculate voxel wise meaninbuf = np.random.rand(58, 1280, 3200)# Create output matrix to store the output outbuf  = np.zeros((58, 1280, 3200), np.float32)`

We define the number of blocks per grids in BLOCKS_PER_GRID and threads per block in THREADS_PER_BLOCK. As can be observed the product of blocks and threads is equal to the dimension of the volume. As a rule of thumb, THREADS_PER_BLOCK should be less than 512 or 1024 (depending on GPU) and BLOCKS_PER_GRID should be less than 65535 in each dimension. Here the THREADS_PER_BLOCK are 8*8*2 = 128 which is fine.

`# Code to Launch the Cuda kernel in Numba# 8*400 = 3200 = IMGSIZx# 8*160 = 1280 = IMGSIZy# 2*29  = 58   = IMGSIZzTHREADS_PER_BLOCK = (8, 8, 2)BLOCKS_PER_GRID   = (400, 160, 29)calculate_mean[BLOCKS_PER_GRID, THREADS_PER_BLOCK](outbuf, inbuf) cuda.synchronize()`

Next, we test the result for a randomly selected voxel index. Re-run the rest if the selected voxel is at the edge.

`# Checking result for a random voxel (c_z, c_y, c_x)c_x = random.randint(0, IMGSIZx)c_y = random.randint(0, IMGSIZy)c_z = random.randint(0, IMGSIZz)result_manual = np.mean(inbuf[c_z-2:c_z+3,c_y-2:c_y+3,c_x-2:c_x+3])result_cuda   = outbuf[c_z, c_y, c_x]print(result_manual, result_cuda)`

## More from Pranjal Sahu

PhD Candidate in Computer Science

## How to assign Role-based Access in GCP Kubernetes Engine

Get the Medium app