3D Cuda Kernel in Numba

Courtesy: https://medium.com/rapids-ai/the-life-of-a-numba-kernel-a-compilation-pipeline-taking-user-defined-functions-in-python-to-cuda-71cc39b77625

There aren’t enough tutorials on writing 3D Cuda kernel in Numba so I decided to share my learnings here. In this small example, we will write a Cuda kernel (calculate_mean) to calculate the mean value of neighboring voxels for each voxel in a 3D volume.

There are few things we need to keep in mind while launching a kernel for example there are certain limits on the maximum number of Cuda blocks and the maximum number of threads inside each block. Taking this into consideration we calculate a feasible set of block and thread dimensions.

Here, the dimension of volume is 3200x1280x58.

# Define the dimensions of the volumeIMGSIZx = 3200
IMGSIZy = 1280
IMGSIZz = 58

To obtain the mean of neighboring voxels of each voxel we define the kernel calculate_mean below:

# Cuda Kernel to calculate mean of neighboring voxels for each voxelimport numba
from numba import njit, prange
from numba import cuda
@cuda.jit(debug=True)
def calculate_mean(outbuf, inbuf):
# Calculate the index of the voxel being considered

ind_x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x
ind_y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y
ind_z = cuda.blockIdx.z * cuda.blockDim.z + cuda.threadIdx.z

# Or you could use the cuda.grid
# ind_x, ind_y, ind_z = cuda.grid(3)
if ind_x < 0 or ind_y < 0 or ind_z < 0:
return

if ind_x > IMGSIZx-1 or ind_y > IMGSIZy-1 or ind_z > IMGSIZz-1:
return

sum1 = 0.0
counter = 0
for ind_nr_z in range(ind_z-2, ind_z+3):
for ind_nr_y in range(ind_y-2, ind_y+3):
for ind_nr_x in range(ind_x-2, ind_x+3):
if ( ind_nr_x<0 or ind_nr_y<0 or ind_nr_z<0):
continue
if ind_nr_x>(IMGSIZx-1) or ind_nr_y>(IMGSIZy-1) or ind_nr_z>(IMGSIZz-1):
continue

sum1 = sum1 + inbuf[ind_nr_z, ind_nr_y, ind_nr_x]
counter = counter+1
outbuf[ind_z, ind_y, ind_x] = sum1/counter
return

In the code above code, we calculate the index of the current voxel using the block dimension and thread index and store it in the variables ind_x, ind_y, ind_z.

A 3-dimensional loop calculates the sum of the voxel values in the neighboring region of size 5x5x5. We ignore the indexes outside the range of volume.

To launch and test the kernel we create a random NumPy matrix as input and a zero-initialized matrix for storing the output below.

# Create random input matrix for which to calculate voxel wise mean
inbuf = np.random.rand(58, 1280, 3200)
# Create output matrix to store the output
outbuf = np.zeros((58, 1280, 3200), np.float32)

We define the number of blocks per grids in BLOCKS_PER_GRID and threads per block in THREADS_PER_BLOCK. As can be observed the product of blocks and threads is equal to the dimension of the volume. As a rule of thumb, THREADS_PER_BLOCK should be less than 512 or 1024 (depending on GPU) and BLOCKS_PER_GRID should be less than 65535 in each dimension. Here the THREADS_PER_BLOCK are 8*8*2 = 128 which is fine.

# Code to Launch the Cuda kernel in Numba# 8*400 = 3200 = IMGSIZx
# 8*160 = 1280 = IMGSIZy
# 2*29 = 58 = IMGSIZz
THREADS_PER_BLOCK = (8, 8, 2)
BLOCKS_PER_GRID = (400, 160, 29)
calculate_mean[BLOCKS_PER_GRID, THREADS_PER_BLOCK](outbuf, inbuf) cuda.synchronize()

Next, we test the result for a randomly selected voxel index. Re-run the rest if the selected voxel is at the edge.

# Checking result for a random voxel (c_z, c_y, c_x)c_x = random.randint(0, IMGSIZx)
c_y = random.randint(0, IMGSIZy)
c_z = random.randint(0, IMGSIZz)
result_manual = np.mean(inbuf[c_z-2:c_z+3,c_y-2:c_y+3,c_x-2:c_x+3])
result_cuda = outbuf[c_z, c_y, c_x]
print(result_manual, result_cuda)

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store