I’ve been busy with other things, but I woke up early and decided to get some CUDA studying in. I did talk with the hiring manager for the position that I’m interested in, who (as I expected) clarified that I didn’t actually need to know CUDA for this position. I’m still interested, though I should focus more on the Leetcode-style exercises that are more likely to come up in the interview.
That said, I haven’t been entirely ignoring this. I’ve been watching some 3Blue1Brown videos in my spare time, like this one on convolution. My calculus is definitely rusty (I don’t fully remember how to take an integral), but I’m mostly just trying to gain some intuition here so that I know what people are talking about if they say things like, “take a convolution”.
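(Jotting down the definition the video builds up to, for future me: the convolution of two functions f and g is

$$(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, d\tau$$

i.e. slide one function past the other and, at each offset t, integrate how much the two overlap.)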
For today, I started by looking through the source of the sample code I got running last time. Thanks to the book I’ve been reading, a lot of the code makes sense and I feel like I can at least skim the code and understand what’s going on at a syntax level, for example:
__global__ void increment_kernel(int *g_data, int inc_value) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  g_data[idx] = g_data[idx] + inc_value;
}
Writing this mostly for my own understanding:
The __global__ identifier marks this as a kernel – code that is called from the host but runs on the device. It takes in a pointer to an array g_data and an int inc_value. This kernel will be run for each element in the g_data array, and each instance of the kernel will operate on the element at the index calculated in idx. Each thread block of blockDim threads will have a unique blockIdx, and each thread in that block will have a unique threadIdx. Since we are working on 1D data (i.e. a single array, not a 2D or 3D array), we only care about the x property of each of these index variables. Then, we increment the value at index idx by inc_value.
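To tie the indexing back to an actual launch, here’s roughly what invoking this kernel looks like (the block size of 512 and the name d_data are my own illustration, assuming n divides evenly and d_data was already allocated on the device):

// Illustrative launch: n total threads, split into blocks of 512.
// d_data is assumed to be a device pointer to n ints (via cudaMalloc).
int n = 16 * 1024 * 1024;
dim3 threads(512);            // blockDim.x == 512
dim3 blocks(n / threads.x);   // blockIdx.x ranges over n / 512 blocks
increment_kernel<<<blocks, threads>>>(d_data, 26);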
Ok, writing this up I think I have one question, which is about the .x property. The book explains that you can use the .x, .y, and .z properties to easily split up 2D or 3D data, but it also talks about ways to turn 2D or 3D data into a 1D representation. So are the .y and .z properties just “nice” because they allow us to leave 2D data as 2D, or do they actually allow us to do something that re-representing the 2D data as 1D and just using .x doesn’t?
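To make the question concrete, here are the two styles I’m comparing, in a made-up example (the kernels and names like width and height are just for illustration):

// 2D style: use .x and .y directly; memory is still a flat array underneath.
__global__ void scale_2d(float *data, int width, int height, float s) {
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < width && y < height) {
    data[y * width + x] *= s;  // flatten by hand at the last moment
  }
}

// 1D style: treat the data as one long array and only use .x.
__global__ void scale_1d(float *data, int width, int height, float s) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < width * height) {
    data[idx] *= s;  // idx == y * width + x for some (x, y)
  }
}

Both kernels touch exactly the same elements, which is part of why I suspect the answer is “mostly convenience”.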
Ok, continuing on:
int main(int argc, char *argv[]) {
  int devID;
  cudaDeviceProp deviceProps;

  printf("[%s] - Starting...\n", argv[0]);
Here we start the main function, set up some variables, and let the user know that we’re starting.
  // This will pick the best possible CUDA capable device
  devID = findCudaDevice(argc, (const char **)argv);

  // get device name
  checkCudaErrors(cudaGetDeviceProperties(&deviceProps, devID));
  printf("CUDA device [%s]\n", deviceProps.name);
Some questions here. What does it mean by “best”? Fortunately, the source for findCudaDevice is available to us. First it checks to see if a device is specified by a command line flag, and if not, it grabs the device “with highest Gflops/s”.
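I haven’t traced the helper all the way down, but the idea seems to boil down to something like this (a paraphrase of the logic, not the actual helper_cuda.h source):

#include <cuda_runtime.h>

// Rough sketch: rank each device by a crude throughput estimate and
// return the index of the winner.
int pickBestDevice() {
  int count = 0;
  cudaGetDeviceCount(&count);
  int best = 0;
  long long bestScore = -1;
  for (int dev = 0; dev < count; ++dev) {
    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, dev);
    // clock rate (kHz) times multiprocessor count as a stand-in for Gflops/s
    long long score = (long long)props.clockRate * props.multiProcessorCount;
    if (score > bestScore) {
      bestScore = score;
      best = dev;
    }
  }
  return best;
}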
  int n = 16 * 1024 * 1024;
  int nbytes = n * sizeof(int);
  int value = 26;

  // allocate host memory
  int *a = 0;
  checkCudaErrors(cudaMallocHost((void **)&a, nbytes));
  memset(a, 0, nbytes);
Setting some variables first, but then we allocate some host memory. I was curious about cudaMallocHost. In the other examples I’d seen, host memory was usually created by just using malloc (or simply assumed to already be allocated, in the book). cudaMallocHost creates “pinned” memory, which is locked into RAM and not allowed to swap. This allows us to use e.g. cudaMemcpy without the performance overhead of constantly checking to make sure that the host memory has not been swapped to disk.
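To keep the two allocation styles straight in my head, here’s a toy side-by-side (my own snippet, not from the sample):

#include <cuda_runtime.h>
#include <cstdlib>
#include <cstring>

int main() {
  const int n = 16 * 1024 * 1024;
  const size_t nbytes = n * sizeof(int);

  int *pageable = (int *)malloc(nbytes);     // ordinary, swappable memory
  int *pinned = 0;
  cudaMallocHost((void **)&pinned, nbytes);  // page-locked ("pinned") memory

  int *d = 0;
  cudaMalloc((void **)&d, nbytes);
  memset(pageable, 0, nbytes);
  memset(pinned, 0, nbytes);

  // Both copies are legal; the pinned one lets the driver skip the internal
  // staging copy it needs for pageable memory, and pinned memory is also
  // what makes cudaMemcpyAsync truly asynchronous.
  cudaMemcpy(d, pageable, nbytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d, pinned, nbytes, cudaMemcpyHostToDevice);

  cudaFree(d);
  cudaFreeHost(pinned);  // pinned memory has its own free function
  free(pageable);
  return 0;
}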
I’m still not used to the C convention of handling errors via macros like checkCudaErrors instead of language constructs like try/catch or if (err != nil). It just feels like an obsolete way of doing error handling that’s easy to forget.
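For my notes, the pattern boils down to something like this (my paraphrase of the idea, not the exact helper_cuda.h source):

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Evaluate a CUDA call; if it returned an error code, report where it
// happened and bail out. __FILE__/__LINE__ are what make the macro version
// more useful than a plain function here.
#define checkCudaErrors(call)                                    \
  do {                                                           \
    cudaError_t err = (call);                                    \
    if (err != cudaSuccess) {                                    \
      fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,     \
              __LINE__, cudaGetErrorString(err));                \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

At least the macro gets the file and line number of the failing call for free, which a try/catch wouldn’t.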
That’s all I had time for this morning, but it’s fun to understand more and more about this as I continue to learn!