Category Archives: CUDA

CUDA – Three

I ran a CUDA program 🙂

It was a rough experience 🙃

Honestly, getting started with pretty much any programming language involves a lot of banging your head against the toolchain, and slowly untangling old tutorials that reference things that don’t exist anymore. Honestly, this was easier than some python setups I’ve done before.

I started with a pretty sparse windows installation. I keep my computers relatively clean and wipe them entirely about once a year, so all I had to start was VSCode and … that’s about it. I am lucky that I happen to already have a windows machine (named Maia) that has a GTX 2080, which supports CUDA.

I installed MSVC (the microsoft C++ compiler) and the NVIDIA toolkit.

Then I tried writing some C++, not even CUDA in VSCode and I couldn’t get it to compile. I kept getting the error that #include <iostream>was not valid. As I mentioned, I haven’t written C++ in about 10 years, so I knew that I was likely missing. I putzed around installing and poking various things. Eventually I switched out MSVC for MINGW (G++ for windows) and this allowed me to compile and run my “hello world” C++ code. Hooray!

Now I tried writing a .cu CUDA file. While NVIDIA provides an official extension for .cu files, and I had everything installed according to the CUDA quick start guide, But VSCode just did … nothing when I tried to run the .cu file with the C++ CUDA compiler selected. So I went off searching for other things to do.

Eventually I decided to install Visual Studio, which is basically a heavy version of VSCode and I don’t know why they named them the same thing except that giant corporations love to do that for whatever reason.

I got VS running and also downloaded Git (and then Github Desktop, since my CLI Git wasn’t reading my SSH keys for whatever reason.

Finally, I downloaded the CUDA-samples repo from NVIDIA’s Github, and it didn’t run – turns out that the CUDA Toolkit version number is hard-coded in two places in the config files, and it was 12.4 while I had version 12.5. But that was a quick fix, fortunately.

Finally, I was able to run one on my graphics card! I still haven’t *written* any CUDA, but I can at least run it if someone else writes it. My hope for tomorrow is to figure out the differences between my non-running project and their running project to put together a plan for actually writing some CUDA from scratch. Or maybe give up and just clone their project as a template!


CUDA – Two

I have an art sale coming up in three days, so I’m spending most of my focus time finishing up the inventory for that. But in my spare time between holding the baby and helping my older kid sell lemonade, I’ve started exploring a few of the topics I’m interested in from the previous post.


Something I was reading mentioned convolutions, and I had no idea what that meant, so I tried to find out! I read several posts and articles, but the thing that made Convolutions click for me was a video by 3 Blue 1 Brown. The video has intuitive visualizations. Cheers to good technology and math communicators.

Sliding a kernel over data feels intuitive to me, and it looks like one of the cool things about this is that you can do this with extreme parallelism. I’m pretty sure this is covered early on in the textbook, so I’m not going to worry about understanding this completely yet.

It seems like convolutions are important for image processing, especially things like blur and edge detection, but also in being able to do feature detection – it allows us to search for a feature across an entire image, and not just in a specific location in an image.

One thing I don’t understand yet is how to build a convolution kernel for complicated feature detection. One of the articles I read mentioned that you could use feature detection convolution for something like eyes, which I assume requires a complicated kernel that’s trained with ML techniques. But I don’t quite understand what that kernel would look like or how you would build it.

Parallel Processing

I started reading Programming Massively Parallel Processors, and so far it’s just been the introduction. I did read it out loud to my newborn, so hopefully he’ll be a machine learning expert by the time he’s one.

Topics covered so far have been the idea of massive parallelism, the difference between CPU and GPU, and a formal definition of “speed up“.

I do like that the book is focused on parallel programming and not ML. It allows me to focus on just that one topic without needing to learn several other difficult concepts at the same time. I peeked ahead and saw a chapter on massively parallel radix sort, and the idea intrigues me.

Differentiation and Gradient Descent

Again, 3B1B had the best video on this topic that I could find. The key new idea here was that you can encode the weights of a neural network as an enormous vector, and then map that vector to a fitness score via a function. Finding the minimum of this function gives us the best neural network for whatever fitness evaluation method we’ve chosen. It hurts my brain a bit to think in that many dimensions, but I just need to get used to that if I’m going to work with ML. I don’t fully understand what differentiation means in this context, but I’m starting to get some of the general concept (we can see a “good direction” to move in).

I haven’t worked with gradients since Calc III in college, which was over a decade ago, but I’ve done it once and I can do it again 💪. It also looks like I need to understand the idea of total derivative versus partial derivative, which feels vaguely familiar.

Moving Forward

Once the art sale is over, I’ll hopefully have more focus time for this 🙂 For now, it’ll be bits and pieces here and there. For learning CUDA in particular, it looks like working through the textbook is going to be my best bet, so I’m going to focus some energy there.

From Grand Rapids,


CUDA – One

First, some backstory. I was laid off from Google in January and I’ve taken the last six months off, mostly working on art glass and taking care of my kids (one of whom was just born in April, and is sleeping on my chest as I write this). I’m slowly starting to look for work again, with a target start date of early September 2024. If you’re hiring or know people who are, please check out my résumé.

A friend of mine recently let me know about a really interesting job opportunity, which will require working with code written in (with?) CUDA. The job is ML related, so I’ll be focusing my learning in that direction.

I don’t know anything about CUDA. Time to learn! And, why not blog about the process as I go along.

First step: come up with some resources to help me learn. I googled something like “learn cuda” and found this Reddit post on the /r/MachineLearning subreddit. It looks like I’ll probably be learning a couple of related topics as I go through this journey:



This is the goal. It looks like CUDA is a language + toolkit for writing massively parallel programs on graphics cards, that aren’t necessarily for graphics. Basically, making the GPU compute whatever we want. If we use this for, say, matrix multiplications, we can accelerate training of ML models.

Python and C++

C++ ? I haven’t written C++ since college a decade ago. I think I remember some of it, but I’ve always been intimidated by the size of the language, the number of “correct” ways to write it, and the amount of magic introduced by macros. I also don’t like the whole .h / .cc thing, but I suppose I’ll just have to get used to that.

I’m pretty good at Python, having written several tens of thousands of lines of it at Google, so I’m not super worried about that.

PyTorch or TensorFlow

Some folks on the Reddit post linked above recommend a specific tutorial on the PyTorch website, which looks interesting. It seems like PyTorch is a ML library written in Python (based on Torch, which was written in Lua).

PyTorch is Meta, now under Linux. TensorFlow is Google. Both use C++, Python, and CUDA.

Matrix Math

In college, I was only briefly introduced to matrix math, and most of that exposure was a graphics course that I audited. Based on my brief reading about all of this, it seems like the major advantage of using graphics cards to train ML is that they can do matrix math really, really fast. It’s up to me to brush up on this while I explore the other things. I don’t yet have a specific study plan for this.


According to redditor surge_cell in that previously linked thread, “There are three basic concepts – thread synchronization, shared memory and memory coalescing which CUDA coder should know in and out of [sic]”. I’ve done some work with threading and parallelism, but not recently. Most of my work at Google was asynchronous, but I didn’t have to manage the threading and coalescing myself (e.g. async in JS)


Ok – so, what am I actually going to do?

I browsed some YouTube videos, but the ones that I’ve watched so far have been pretty high level. It looks like NVIDIA has some CUDA training videos … from 12 years ago. I’m sure the language is quite different now. I also want deeper training than free YouTube videos will likely provide, so I need to identify resources to use that will give me a deep knowledge of the architecture, languages, and toolkits.

First, I’ll try to do the Custom CUDA extensions for PyTorch tutorial. See how far I can get and make notes of what I get stuck on.

Second, One of the Reddit posts recommended a book called Programming Massively Parallel Processors by Hwu, Kirk, and Hajj, so I picked up a copy of that (4th Ed). I’m going to start working through that. It looks like there are exercises so I’ll be able to actually practice what I’m applying, which will be fun.

Finally, I’ll try implementing my own text prediction model in ML. I know you can do this cheaply by using something like 🤗 (aka HuggingFace) but the point here is to learn CUDA, and using someone else’s pretrained model is not going to teach me CUDA. I’m optimizing for learning, not for accurate or powerful models.


There’s a lot I don’t know, but here are my immediate questions.

  1. I have an NVIDIA card in my windows computer, but I don’t have a toolchain set up to write CUDA code for it. I’m also not used to developing C++ on windows, so I’ll need to figure out how to get that running as well. I have a feeling this won’t be particularly tricky, it’ll just take time.
  2. I have a lot of unknown unknowns about CUDA – I’m not even sure what I don’t know about it. I think I’ll have more questions here as I get into the materials and textbooks.
  3. It seems like there’s a few parts of ML with various difficulties. If you use a pretrained model, it seems pretty trivial (~20 lines of python) to make it do text prediction or what have you. But training the models is really, really difficult and involves getting a lot of training data. Or, perhaps not difficult, but expensive and time consuming. Designing the ML pipeline seems moderately difficult, and is probably where I’ll spend most of my time. But I need to understand more about this.

That’s it for Day One

If you’re reading this and you see something I’ve done wrong already, or know of a resource that helped you learn the tools that I’m talking about here, please do reach out!

From Grand Rapids,