Category Archives: CUDA

CUDA – Four

I’ve been busy with other things, but I woke up early and decided to get some CUDA studying in. I did talk with the hiring manager for the position that I’m interested in, who (as I expected) clarified that I didn’t actually need to know CUDA for this position. I’m still interested, though I should focus more on the Leetcode-style exercises that are more likely to come up in the interview.

That said, I haven’t been entirely ignoring this. I’ve been watching some 3Blue1Brown videos in my spare time, like this one on convolution. My calculus is definitely rusty (I don’t fully remember how to take an integral), but I’m mostly just trying to gain some intuition here so that I know what people are talking about if they say things like, “take a convolution”.

For today, I started by looking through the source of the sample code I got running last time. Thanks to the book I’ve been reading, a lot of the code makes sense and I feel like I can at least skim the code and understand what’s going on at a syntax level, for example:

__global__ void increment_kernel(int *g_data, int inc_value) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  g_data[idx] = g_data[idx] + inc_value;
}

Writing this mostly for my own understanding:

The __global__ qualifier marks this as a kernel – code that is called from the host but runs on the device. It takes a pointer to an array g_data and an int inc_value. The kernel runs once per element of g_data, and each instance operates on the element at the index calculated in idx. Each thread block of blockDim threads has a unique blockIdx, and each thread within that block has a unique threadIdx. Since we are working on 1D data (i.e. a single array, not a 2D or 3D array), we only care about the x property of each of these index variables. Finally, we increment the value at index idx by inc_value.
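For my own reference, launching this kernel from the host looks roughly like this. This is a sketch, not the actual sample code; n, threadsPerBlock, and d_data are names I made up, with d_data being a device pointer that came from cudaMalloc:

  int threadsPerBlock = 256;
  int blocks = n / threadsPerBlock;  // assumes n is an exact multiple of 256
  increment_kernel<<<blocks, threadsPerBlock>>>(d_data, inc_value);

Each of the blocks * threadsPerBlock threads then computes its own idx and touches exactly one element.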

Ok, writing this up I think I have one question, which is about the .x property. The book explains that you can use the .x, .y, .z properties to easily split up 2D or 3D data, but also talks about ways to turn 2D or 3D data into a 1D representation. So are the .y, .z properties just “nice” because they allow us to leave 2D data as 2D, or do they actually allow us to do something that re-representing the 2D data as 1D data and just using .x doesn’t?
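To make my own question concrete, here is how I think the same 2D data could be indexed either way. This is my own sketch (not from the sample), assuming a width-by-height array stored in row-major order:

// 2D launch: grid and block are dim3 values, so .x and .y map to columns and rows
__global__ void touch_2d(int *data, int width) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  data[row * width + col] += 1;
}

// 1D launch: flatten everything, then recover row/col from a single index
__global__ void touch_1d(int *data, int width) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  int row = idx / width;
  int col = idx % width;
  data[row * width + col] += 1;  // same address as the 2D version
}

Both end up computing the same addresses, so my guess is that .y and .z are mostly a convenience (and that the hardware limits how big each grid dimension can be), but I’d like to confirm that as I read further.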

Ok, continuing on:

int main(int argc, char *argv[]) {
  int devID;
  cudaDeviceProp deviceProps;

  printf("[%s] - Starting...\n", argv[0]);

We start the main function, set up some variables, and let the user know that we’re starting.


  // This will pick the best possible CUDA capable device
  devID = findCudaDevice(argc, (const char **)argv);

  // get device name
  checkCudaErrors(cudaGetDeviceProperties(&deviceProps, devID));
  printf("CUDA device [%s]\n", deviceProps.name);

Some questions here. What does it mean by "best"? Fortunately, the source for findCudaDevice is available to us. First it checks to see if a device is specified by a command line flag, and if not, it grabs the device "with highest Gflops/s".
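Out of curiosity, the building blocks for that kind of selection seem simple enough to use directly. Something like this (my own sketch, reusing the sample's checkCudaErrors helper, not the actual findCudaDevice code) enumerates the devices and their properties:

  int count = 0;
  checkCudaErrors(cudaGetDeviceCount(&count));
  for (int i = 0; i < count; ++i) {
    cudaDeviceProp props;
    checkCudaErrors(cudaGetDeviceProperties(&props, i));
    printf("Device %d: %s, %d multiprocessors, compute %d.%d\n",
           i, props.name, props.multiProcessorCount, props.major, props.minor);
  }
  checkCudaErrors(cudaSetDevice(0));  // or whichever device looks best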

  int n = 16 * 1024 * 1024;
  int nbytes = n * sizeof(int);
  int value = 26;

  // allocate host memory
  int *a = 0;
  checkCudaErrors(cudaMallocHost((void **)&a, nbytes));
  memset(a, 0, nbytes);

We set some variables first, and then allocate some host memory. I was curious about cudaMallocHost. In the other examples I’d seen, host memory was usually created by just using malloc (or simply assumed to already be allocated, in the book). cudaMallocHost allocates "pinned" memory, which is locked into RAM and can’t be swapped out. That lets the GPU copy to and from it directly, which makes transfers like cudaMemcpy faster and is what allows asynchronous copies (cudaMemcpyAsync) to actually overlap with other work.
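As I understand it, the pinned allocation is what makes the asynchronous copies later in this sample worthwhile. A rough sketch of the pattern (simplified, and the device pointer name d_a is mine, not the sample's):

  int *d_a = 0;
  checkCudaErrors(cudaMalloc((void **)&d_a, nbytes));

  // Because 'a' is pinned, these copies can genuinely overlap with other work
  checkCudaErrors(cudaMemcpyAsync(d_a, a, nbytes, cudaMemcpyHostToDevice, 0));
  increment_kernel<<<n / 512, 512>>>(d_a, value);
  checkCudaErrors(cudaMemcpyAsync(a, d_a, nbytes, cudaMemcpyDeviceToHost, 0));
  checkCudaErrors(cudaDeviceSynchronize());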

I’m still not used to the C convention of handling errors via macros like checkCudaErrors instead of language constructs like try/catch or Go-style if err != nil checks. It just feels like an outdated way of doing error handling that’s easy to forget.
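For reference (and to demystify it a bit for myself), the macro boils down to something like the following. This is my paraphrase rather than the exact helper_cuda.h code:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Simplified stand-in for the checkCudaErrors macro from helper_cuda.h
#define checkCudaErrors(call)                                        \
  do {                                                               \
    cudaError_t err = (call);                                        \
    if (err != cudaSuccess) {                                        \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                   \
              cudaGetErrorString(err), __FILE__, __LINE__);          \
      exit(EXIT_FAILURE);                                            \
    }                                                                \
  } while (0)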

That’s all I had time for this morning, but it’s fun to understand more and more about this as I continue to learn!

CUDA – Three

I ran a CUDA program 🙂

It was a rough experience 🙃

Honestly, getting started with pretty much any programming language involves a lot of banging your head against the toolchain and slowly untangling old tutorials that reference things that don’t exist anymore. This was still easier than some Python setups I’ve done before.

I started with a pretty sparse Windows installation. I keep my computers relatively clean and wipe them entirely about once a year, so all I had to start with was VSCode and … that’s about it. I am lucky that I happen to already have a Windows machine (named Maia) with an RTX 2080, which supports CUDA.

I installed MSVC (the Microsoft C++ compiler) and the NVIDIA CUDA Toolkit.

Then I tried writing some C++ (not even CUDA yet) in VSCode, and I couldn’t get it to compile. I kept getting an error that #include <iostream> was not valid. As I mentioned, I haven’t written C++ in about 10 years, so I knew I was likely missing something. I putzed around installing and poking at various things. Eventually I swapped MSVC out for MinGW (G++ for Windows), and that allowed me to compile and run my "hello world" C++ code. Hooray!

Now I tried writing a .cu CUDA file. Even though NVIDIA provides an official extension for .cu files, and I had everything installed according to the CUDA quick start guide, VSCode just did … nothing when I tried to run the .cu file with the C++ CUDA compiler selected. So I went off searching for other things to do.

Eventually I decided to install Visual Studio, which is basically a heavyweight cousin of VSCode, and I don’t know why they named them the same thing except that giant corporations love to do that for whatever reason.

I got VS running and also downloaded Git (and then GitHub Desktop, since my CLI Git wasn’t reading my SSH keys for whatever reason).

Next, I downloaded the cuda-samples repo from NVIDIA’s GitHub, and it didn’t run – it turns out that the CUDA Toolkit version number is hard-coded in two places in the config files, and it said 12.4 while I had version 12.5. But that was a quick fix, fortunately.

Finally, I was able to run one on my graphics card! I still haven’t *written* any CUDA, but I can at least run it if someone else writes it. My hope for tomorrow is to figure out the differences between my non-running project and their running project to put together a plan for actually writing some CUDA from scratch. Or maybe give up and just clone their project as a template!

 

CUDA – Two

I have an art sale coming up in three days, so I’m spending most of my focus time finishing up the inventory for that. But in my spare time between holding the baby and helping my older kid sell lemonade, I’ve started exploring a few of the topics I’m interested in from the previous post.

Convolutions

Something I was reading mentioned convolutions, and I had no idea what that meant, so I tried to find out! I read several posts and articles, but the thing that made convolutions click for me was a video by 3Blue1Brown. The video has intuitive visualizations. Cheers to good technology and math communicators.

Sliding a kernel over data feels intuitive to me, and it looks like one of the cool things about this is that you can do this with extreme parallelism. I’m pretty sure this is covered early on in the textbook, so I’m not going to worry about understanding this completely yet.
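Just to make the idea concrete for future me, here is roughly what I imagine a naive 1D convolution kernel looks like in CUDA, where each thread computes one output element independently (which is where the parallelism comes from). This is completely untested, written from what I’ve skimmed so far, so it may well be wrong:

// out[i] = sum over j of in[i + j - radius] * filter[j]
__global__ void convolve_1d(const float *in, const float *filter, float *out,
                            int n, int radius) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;

  float acc = 0.0f;
  for (int j = -radius; j <= radius; ++j) {
    int k = i + j;
    if (k >= 0 && k < n) {            // treat out-of-range input as zero
      acc += in[k] * filter[j + radius];
    }
  }
  out[i] = acc;
}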

It seems like convolutions are important for image processing, especially things like blur and edge detection, but also in being able to do feature detection – it allows us to search for a feature across an entire image, and not just in a specific location in an image.

One thing I don’t understand yet is how to build a convolution kernel for complicated feature detection. One of the articles I read mentioned that you could use feature detection convolution for something like eyes, which I assume requires a complicated kernel that’s trained with ML techniques. But I don’t quite understand what that kernel would look like or how you would build it.

Parallel Processing

I started reading Programming Massively Parallel Processors, and so far it’s just been the introduction. I did read it out loud to my newborn, so hopefully he’ll be a machine learning expert by the time he’s one.

Topics covered so far have been the idea of massive parallelism, the difference between CPU and GPU, and a formal definition of “speed up”.
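If I’m remembering the definition correctly, speed up is just the ratio of the running time before the improvement to the running time after it:

  speedup = T_before / T_after

so a task that took 100 seconds on a single CPU core and 4 seconds on a GPU would have a speed up of 25×.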

I do like that the book is focused on parallel programming and not ML. It allows me to focus on just that one topic without needing to learn several other difficult concepts at the same time. I peeked ahead and saw a chapter on massively parallel radix sort, and the idea intrigues me.

Differentiation and Gradient Descent

Again, 3B1B had the best video on this topic that I could find. The key new idea here was that you can encode the weights of a neural network as an enormous vector, and then map that vector to a fitness score via a function. Finding the minimum of this function gives us the best neural network for whatever fitness evaluation method we’ve chosen. It hurts my brain a bit to think in that many dimensions, but I just need to get used to that if I’m going to work with ML. I don’t fully understand what differentiation means in this context, but I’m starting to get some of the general concept (we can see a “good direction” to move in).
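Writing out the one-dimensional version helps me see the mechanics. This is just my own toy C++ sketch of gradient descent on a single parameter (not from the video; the function and learning rate are arbitrary):

// Minimize f(w) = (w - 3)^2 by repeatedly stepping against the derivative.
#include <cstdio>

int main() {
  double w = 0.0;                       // initial guess for the single "weight"
  const double learning_rate = 0.1;
  for (int step = 0; step < 100; ++step) {
    double gradient = 2.0 * (w - 3.0);  // d/dw of (w - 3)^2
    w -= learning_rate * gradient;      // move downhill
  }
  printf("w ended up near %f\n", w);    // should be close to 3
  return 0;
}

In a real network, w is that enormous vector instead of a single number, and the gradient tells us which direction to nudge every weight at once.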

I haven’t worked with gradients since Calc III in college, which was over a decade ago, but I’ve done it once and I can do it again 💪. It also looks like I need to understand the idea of total derivative versus partial derivative, which feels vaguely familiar.

Moving Forward

Once the art sale is over, I’ll hopefully have more focus time for this 🙂 For now, it’ll be bits and pieces here and there. For learning CUDA in particular, it looks like working through the textbook is going to be my best bet, so I’m going to focus some energy there.

From Grand Rapids,
Erty

 

CUDA – One

First, some backstory. I was laid off from Google in January and I’ve taken the last six months off, mostly working on art glass and taking care of my kids (one of whom was just born in April, and is sleeping on my chest as I write this). I’m slowly starting to look for work again, with a target start date of early September 2024. If you’re hiring or know people who are, please check out my résumé.

A friend of mine recently let me know about a really interesting job opportunity, which will require working with code written in (with?) CUDA. The job is ML related, so I’ll be focusing my learning in that direction.

I don’t know anything about CUDA. Time to learn! And, why not blog about the process as I go along.

First step: come up with some resources to help me learn. I googled something like “learn cuda” and found this Reddit post on the /r/MachineLearning subreddit. It looks like I’ll probably be learning a couple of related topics as I go through this journey:

 

CUDA

This is the goal. It looks like CUDA is a language + toolkit for writing massively parallel programs that run on graphics cards but aren’t necessarily for graphics. Basically, making the GPU compute whatever we want. If we use this for, say, matrix multiplications, we can accelerate training of ML models.
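From the examples I’ve skimmed so far, the kind of code involved looks roughly like this naive matrix multiply kernel (copied in spirit from tutorials, not something I can write or verify myself yet):

// C = A * B for square n x n matrices, one GPU thread per output element
__global__ void matmul_naive(const float *A, const float *B, float *C, int n) {
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n && col < n) {
    float sum = 0.0f;
    for (int k = 0; k < n; ++k) {
      sum += A[row * n + k] * B[k * n + col];
    }
    C[row * n + col] = sum;
  }
}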

Python and C++

C++? I haven’t written C++ since college a decade ago. I think I remember some of it, but I’ve always been intimidated by the size of the language, the number of “correct” ways to write it, and the amount of magic introduced by macros. I also don’t like the whole .h / .cc thing, but I suppose I’ll just have to get used to that.

I’m pretty good at Python, having written several tens of thousands of lines of it at Google, so I’m not super worried about that.

PyTorch or TensorFlow

Some folks on the Reddit post linked above recommend a specific tutorial on the PyTorch website, which looks interesting. It seems like PyTorch is an ML library written in Python (based on Torch, which was written in Lua).

PyTorch was created at Meta and is now governed by a foundation under the Linux Foundation. TensorFlow is Google’s. Both use C++, Python, and CUDA.

Matrix Math

In college, I was only briefly introduced to matrix math, and most of that exposure came from a graphics course that I audited. Based on my brief reading about all of this, it seems like the major advantage of using graphics cards to train ML is that they can do matrix math really, really fast. It’s up to me to brush up on this while I explore the other things. I don’t yet have a specific study plan for this.

Parallelism

According to redditor surge_cell in that previously linked thread, “There are three basic concepts – thread synchronization, shared memory and memory coalescing which CUDA coder should know in and out of [sic]”. I’ve done some work with threading and parallelism, but not recently. Most of my work at Google was asynchronous, but I didn’t have to manage the threading and coalescing myself (e.g. async in JS).
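I don’t really know what those three things mean yet, but so that future me has something concrete to come back to, here is a sketch of a block-wise array sum that (as far as I can tell from examples) touches all three. Treat it as a guess rather than working code:

// Sum an array: each block loads a tile into shared memory with coalesced
// reads, then does a synchronized tree reduction to produce one partial sum.
__global__ void block_sum(const float *in, float *block_totals, int n) {
  extern __shared__ float tile[];                 // shared memory for this block
  int i = blockIdx.x * blockDim.x + threadIdx.x;

  tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;     // neighboring threads read
  __syncthreads();                                //   neighboring addresses

  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      tile[threadIdx.x] += tile[threadIdx.x + stride];
    }
    __syncthreads();                              // thread synchronization
  }
  if (threadIdx.x == 0) {
    block_totals[blockIdx.x] = tile[0];
  }
}

If I’m reading the docs right, this would be launched with a third launch parameter for the shared memory size, something like block_sum<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_totals, n).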

Resources

Ok – so, what am I actually going to do?

I browsed some YouTube videos, but the ones that I’ve watched so far have been pretty high level. It looks like NVIDIA has some CUDA training videos … from 12 years ago. I’m sure the language is quite different now. I also want deeper training than free YouTube videos will likely provide, so I need to identify resources that will give me a deep knowledge of the architecture, languages, and toolkits.

First, Iā€™ll try to do the Custom CUDA extensions for PyTorch tutorial. See how far I can get and make notes of what I get stuck on.

Second, one of the Reddit posts recommended a book called Programming Massively Parallel Processors by Hwu, Kirk, and Hajj, so I picked up a copy of that (4th Ed). I’m going to start working through it. It looks like there are exercises, so I’ll be able to actually practice what I’m learning, which will be fun.

Finally, I’ll try implementing my own text prediction model in ML. I know you can do this cheaply by using something like 🤗 (aka HuggingFace), but the point here is to learn CUDA, and using someone else’s pretrained model is not going to teach me CUDA. I’m optimizing for learning, not for accurate or powerful models.

Questions

There’s a lot I don’t know, but here are my immediate questions.

  1. I have an NVIDIA card in my Windows computer, but I don’t have a toolchain set up to write CUDA code for it. I’m also not used to developing C++ on Windows, so I’ll need to figure out how to get that running as well. I have a feeling this won’t be particularly tricky, it’ll just take time.
  2. I have a lot of unknown unknowns about CUDA – I’m not even sure what I don’t know about it. I think I’ll have more questions here as I get into the materials and textbooks.
  3. It seems like there are a few parts of ML with varying levels of difficulty. If you use a pretrained model, it seems pretty trivial (~20 lines of Python) to make it do text prediction or what have you. But training the models is really, really difficult and involves getting a lot of training data. Or, perhaps not difficult, but expensive and time consuming. Designing the ML pipeline seems moderately difficult, and is probably where I’ll spend most of my time. But I need to understand more about this.

Thatā€™s it for Day One

If you’re reading this and you see something I’ve done wrong already, or know of a resource that helped you learn the tools that I’m talking about here, please do reach out!

From Grand Rapids,
Erty