# 3D Cuda Kernel in Numba

There aren’t enough tutorials on writing 3D Cuda kernel in Numba so I decided to share my learnings here. In this small example, we will write a Cuda kernel (**calculate_mean) **to calculate the mean value of neighboring voxels for each voxel in a 3D volume.

There are few things we need to keep in mind while launching a kernel for example there are certain limits on the maximum number of Cuda blocks and the maximum number of threads inside each block. Taking this into consideration we calculate a feasible set of block and thread dimensions.

Here, the dimension of volume is 3200x1280x58.

# Define the dimensions of the volumeIMGSIZx = 3200

IMGSIZy = 1280

IMGSIZz = 58

To obtain the mean of neighboring voxels of each voxel we define the kernel c**alculate_mean** below:

# Cuda Kernel to calculate mean of neighboring voxels for each voxelimport numba

from numba import njit, prange

from numba import cuda@cuda.jit(debug=True)

def calculate_mean(outbuf, inbuf):

# Calculate the index of the voxel being considered

ind_x = cuda.blockIdx.x * cuda.blockDim.x + cuda.threadIdx.x

ind_y = cuda.blockIdx.y * cuda.blockDim.y + cuda.threadIdx.y

ind_z = cuda.blockIdx.z * cuda.blockDim.z + cuda.threadIdx.z

# Or you could use the cuda.grid

# ind_x, ind_y, ind_z = cuda.grid(3) if ind_x < 0 or ind_y < 0 or ind_z < 0:

return

if ind_x > IMGSIZx-1 or ind_y > IMGSIZy-1 or ind_z > IMGSIZz-1:

return

sum1 = 0.0

counter = 0 for ind_nr_z in range(ind_z-2, ind_z+3):

for ind_nr_y in range(ind_y-2, ind_y+3):

for ind_nr_x in range(ind_x-2, ind_x+3):

if ( ind_nr_x<0 or ind_nr_y<0 or ind_nr_z<0):

continue if ind_nr_x>(IMGSIZx-1) or ind_nr_y>(IMGSIZy-1) or ind_nr_z>(IMGSIZz-1):

continue

sum1 = sum1 + inbuf[ind_nr_z, ind_nr_y, ind_nr_x]

counter = counter+1 outbuf[ind_z, ind_y, ind_x] = sum1/counter

return

In the code above code, we calculate the index of the current voxel using the block dimension and thread index and store it in the variables **ind_x**, **ind_y**, **ind_z.**

A 3-dimensional loop calculates the sum of the voxel values in the neighboring region of size 5x5x5. We ignore the indexes outside the range of volume.

To launch and test the kernel we create a random NumPy matrix as input and a zero-initialized matrix for storing the output below.

# Create random input matrix for which to calculate voxel wise mean

inbuf = np.random.rand(58, 1280, 3200)# Create output matrix to store the output

outbuf = np.zeros((58, 1280, 3200), np.float32)

We define the number of blocks per grids in **BLOCKS_PER_GRID** and threads per block in **THREADS_PER_BLOCK. **As can be observed the product of blocks and threads is equal to the dimension of the volume. As a rule of thumb, **THREADS_PER_BLOCK** should be less than 512 or 1024 (depending on GPU) and **BLOCKS_PER_GRID** should be less than 65535 in each dimension. Here the **THREADS_PER_BLOCK** are 8*8*2 = 128 which is fine.

# Code to Launch the Cuda kernel in Numba# 8*400 = 3200 = IMGSIZx

# 8*160 = 1280 = IMGSIZy

# 2*29 = 58 = IMGSIZzTHREADS_PER_BLOCK = (8, 8, 2)

BLOCKS_PER_GRID = (400, 160, 29)calculate_mean[BLOCKS_PER_GRID, THREADS_PER_BLOCK](outbuf, inbuf) cuda.synchronize()

Next, we test the result for a randomly selected voxel index. Re-run the rest if the selected voxel is at the edge.

# Checking result for a random voxel (c_z, c_y, c_x)c_x = random.randint(0, IMGSIZx)

c_y = random.randint(0, IMGSIZy)

c_z = random.randint(0, IMGSIZz)result_manual = np.mean(inbuf[c_z-2:c_z+3,c_y-2:c_y+3,c_x-2:c_x+3])

result_cuda = outbuf[c_z, c_y, c_x]print(result_manual, result_cuda)