My experience learning CUDA to accelerate Conway’s Game of Life

Brendan Wilhelmsen
12 min readMay 18, 2021

--

A while ago, during NVIDIA’s GPU Technology Conference (GTC), I was fortunate enough to attend a couple of courses in CUDA led by one of their developers. In anticipation of these courses, I decided to create my own project to work on which would utilize GPU programming with CUDA. For this, I settled on creating Conway’s Game of Life (abbreviated CGL going forward).

Why/How did I decide on this project?

Choosing a project for this situation was tricky because my desired result wasn’t a particular product/program, but rather a knowledge of a technology that was new to me (CUDA). Additionally, I needed the project to meet a few criteria.

  1. It must be suitable for GPU utilization, which means lots of repetitive calculations.
  2. It needs to be interesting to me (no offense to anyone who finds random mathematical equations on matrices intriguing).
  3. It has to have a simple enough concept where a basic implementation can be created without too much hassle so I can focus on applying CUDA to it (maybe next time ray-tracing).

After browsing forums and recommendations sent to me by my mentors, I decided that CGL fit all that criteria and I began drafting my outline.

Initial concerns when planning my project

When planning the project, two main concerns stuck out to me.

  1. CLG technically takes place on an infinite array.
  2. I had absolutely no idea how to implement graphics in C/C++.

Obstacle one was tricky because I had a couple of different ideas that I needed to choose between. First off I ruled out an infinite world altogether as handling potentially infinite data was not my goal in this project. I found that I was stuck deciding between two options, option one was to make a cyclic world (an idea stolen from Marek Fiser who made a very similar project) or to just use a fixed size grid. Of the two options, I had to again remind myself that the focus of the project needs to be on CUDA so I went with a fixed size which is easier to implement and can be easily adjusted in the header file to test different quantities of data.

Obstacle two brought upon some interesting research regarding which libraries to use for graphics. When you google “how to implement graphics in C,” most things lead to the library “graphics.h.” After trying to implement it and doing some more research I concluded that the library is pretty outdated and I should look into better options. My attention was then diverted to deciding between OpenGL and SDL. Of the two I decided to go with SDL. While SDL is less powerful than OpenGL, with less power comes greater simplicity, and since I’m not trying to build graphics for a new PS5 exclusive I figured SDL would still be plenty for me.

Learning SDL

Before putting together any of the logic for the actual program, I wanted to get some graphics showing on my screen so I could display my output properly. If I didn’t do this my only way to view output would be to try and format a large 2D array to print in the standard output of my terminal (not fun).

To be perfectly honest, the hardest part of learning SDL for me was figuring out how to install it and get the “SDL2.h” header to properly compile in my code. This was my first time installing a third-party library and linking it in compilation, but after a day’s worth of seeing, “‘SDL2.h’ no such file or directory found,” I finally achieved a successful compilation.

From that point on with SDL, it was smooth sailing or at least as smooth as I could have hoped. Of course, I still ran into issues, but no one thing hung me up for an excessively long amount of time.

My favorite part of learning to use graphics was how satisfying each success felt. From creating a window to drawing a grid in that window to making it interactable, every step felt like its own accomplishment.

Interactable grid made with SDL (I promise all the grid lines are identical, for some reason my screenshot makes some seem more defined than others).

Initial implementation of CGL

Now that I had my plan and an understanding of how to use graphics I was ready to begin actually creating the program.

This first implementation of the program is meant to run just on the CPU and have each index of the grid evaluated in serial. For this, I set up a 2D array initialized that would reflect the initial state given by the user in the interactable window. Once satisfied with a starting state, pressing “Enter” on the keyboard begins the simulation, iterating through and evaluating each element of the 2D array. Each cell in the grid is evaluated based on its eight neighbors the result is stored in a new array as a 1 or 0 depending on whether it will exist in the next scene or not. At the end of the process, the updated array is filled in with what the next scene should look like. Using this updated array and SDL, I update the visual display to reflect the changes that took place. The final step is to swap the updated and original arrays so the updated one is evaluated for the next scene.

At this point, I had what I would consider a working implementation of CGL. The final touches I added were a timer to clock how long the program execution took and a command-line prompt to select the runtime parameters. The parameters included choosing whether to run it with a delay between each scene or not (either to experience it visually or to clock the performance). Once implementing CUDA, the plan was to add another option to select whether you wanted to run the code on the CPU or the GPU.

I was just going to use the “Hi :)” that I drew earlier as a joke, but turns out it actually produces some pretty awesome patterns when you run it!

Accelerating the program with CUDA (what we’ve all been waiting for)

Compiling

Similar to my initial experience with SDL, a massive challenge in using CUDA was just getting my code to compile. “Well Brendan, it sounds like you learned nothing from all that time you spent setting up SDL.” Yes, I know how it sounds, but before you jump to that conclusion, give me a chance to explain myself (aka make excuses).

Setting up CUDA was very different from setting up SDL because CUDA isn’t just another C library that I need to download and link in compilation, it’s a whole other language with its own compiler. When CUDA’s compiler compiles code it converts using the following path: CUDA →C++ →Object files. Now we look at the path for our C code: C →Object files. We are left with two types of object files, some based on C-code and some based on C++-code. This isn’t a problem for the C-code calling C functions and the C++-code calling C++ functions, but as soon as one tries to call the other, an error appears saying the function definition doesn’t exist despite the name and prototype matching.

Image from CUDA NVCC documentation giving a visual depiction of the compilation trajectory.

Why is this?

Using the name list (‘nm’) Linux command (huge shoutout to my mentor for his assistance on this step) I was able to see that C and C++ object files name function differently. Since overloading functions is possible in C++, the object files provide more information about the functions causing them to not match their respective definition in the C object files.

Once I understood what was happening, Googling for answers became much easier since I knew what to look for. Before too long I found a solution using the ‘extern’ keyword to force the C++ function to be prototyped as C functions. Once properly implemented I was able to successfully compile and could begin work on optimization.

Initial implementation of CUDA

When dealing with CUDA, a tip I learned from my instructor at the NVIDIA course is to implement it iteratively. What this means is to start with the most basic implementation and modify it from there. There are multiple places where your CUDA code can bottleneck so It’s important to start simple and work on optimizing the aspects that require the most optimization. This can be difficult to do if you shoot straight for what you assume will be optimal right off the bat.

For my first implementation, I needed my code to line up with the following sequence:

  1. Set up data on CPU
  2. Allocate space on GPU
  3. Copy data from CPU to GPU
  4. Launch kernel and operate on GPU data
  5. Copy data back to CPU

All the allocation/copy steps are single-line statements that can be found in the CUDA Runtime API. The kernel launch, while technically only one line, is much more complicated due to its parameters. It differs from our standard programming functions by taking an additional four parameters representing:

  1. Number of thread blocks (grid dimension)
  2. Number of threads per thread block (block dimension)
  3. Size of dynamically allocated shared memory (I don’t mess with this one in this project)
  4. Which stream the kernel launch will use

In this implementation, I don’t worry about dynamic memory and streams, so I omit those and just fill in the grid and block dimensions. My initial choice for the dimensions was very simple: Grid height = Number of blocks & Grid width = Threads per block. This way I would launch exactly one thread per cell to evaluate (this isn’t exactly how CUDA works, but for now let’s just pretend it is). This kernel launch configuration works great until I start messing with very large grid dimensions which would result in trying to launch more threads than what is available.

Regarding dealing with the memory transfers, to make it easier I used CUDA’s managed memory that utilized unified memory to allows the use of one pointer to access the memory on GPU and CPU. The benefit of managed memory is the simplicity in writing the code since I don’t have to explicitly write in all the memory transfers. The downside is that the memory transfers are happening behind the scenes so if let it run on its own I relinquish some control over how the memory is actually being managed (admittedly it does a pretty good job of managing on its own).

There are two final modifications left on the code before I consider the basic implementation complete.

  1. Flatten my 2D arrays to 1D arrays
  2. Remove conditional statements from the kernel

Upon trying to implement my arrays as they were, I learned that CUDA kernels do not play nice with 2D arrays. This resulted in me having to go through all my code and edit it to flatten the 2D arrays and use an indexing formula to convert between (x, y) coordinates and indexes.

Removing conditional statements is a technique to prevent warp divergence. When a CUDA kernel is launched the threads are grouped in chunks of 32 called warps. Since all threads in a warp need to execute the same set of instructions, a conditional statement can cause one thread to ‘diverge’ from the others. If one thread is performing a different action than the others, all the others standby until the divergent thread has been completed. This ruins parallelization and can lead to unnecessary slowdowns in the program.

And just like that my first implementation of CUDA in my program was completed.

So was there any difference in performance?

Yes. And greatly so.

When running smaller datasets the difference from the CPU was less noticeable because the improvements in calculation speed were canceled out by the memory transfers back and forth between the host and device. However, when running larger data sets the difference was much more noticeable.

Below is a visual analysis of the code being run using a CUDA implementation similar to what is described above. This runs 20 iterations on a 20,000x20,000 cell grid. You can see that the time spent on the kernel (thin blue lines) pales in comparison to the time spent performing memory transfers. And in the breakdown on the left side, 96.5% of the total time on the GPU was spent performing memory transfers.

This execution took approximately 4.3 seconds while the CPU running the same amount of data took just over 100 seconds. Despite this being many times faster than a serial implementation on the CPU, there is certainly room for improvement.

It is important to note that I am using a solid state drive rather than a hard disk drive in my computer. Had I ran this using a HDD, I would expect the pageable memory transfers to take significantly more time. At this point, since we are on the topic of hardware, here is a rundown of the system which I’m running this on.

I’m not entirely sure why it’s unable to recognize my graphics car, but I am running an NVIDIA GeForce GTX 1660Ti. In the program report a couple images up, you can see it listed there on the left panel as well.

Further optimization — Pinned memory

The obvious bottleneck in the code as pictured above is the memory transfers, so my goal here is to optimize them further. To do this, I look to address the “pageable” aspect that is labeled on all of them. To understand what this means, why it’s significant, and how to improve it, I had to do some more research.

What is it?

When memory is allocated, it doesn’t always get stored on the stack. Often when you allocate memory it ends up allocating it on a different storage device and uses the stack to ‘page’ it to quickly access.

Why is it significant?

When transferring memory from the host to device, if that memory needs to be paged it takes the path: storage →stack →device; instead of the more direct path: stack →device. The extra step in having to page the data can lead to a significant drop in performance.

How to improve?

To optimize this, I can take advantage of the CUDA function “cudaMallocHost()” which will ensure that the memory allocated is pinned to the stack and doesn’t have to be paged for each transfer.

To implement this, I had to refactor some code to get the pinned memory set up outside my original CUDA function. After figuring out how to use it properly I ended up with a result that looks like the following.

Clearly, they still take far more time than the kernel itself, but despite that, the memory transfers were around 33% faster. This dropped the execution time of the program from around 4.3 seconds to 2.8 seconds. Also, note that the memory copies are no longer noted with “pageable” confirming that it was done properly.

Before wrapping up this project there was one more technique I wanted to implement.

Final optimization — Asynchronous memory management & Streams

My final idea to optimize the memory transfers is to break them up and desynchronize the transfer. If for example, I break each memory transfer into two parts I can run the first and second half simultaneously thus halving the transfer speed. To accomplish this, I need to learn to use streams. In CUDA, streams are essentially a way to multi-thread kernel calls.

To implement this I created an array of streams and divide my array of data into chunks based on the number of streams I have. Then I create a loop and for each stream, I launch a kernel each with a different chunk of data from the last.

Streams in CUDA are interesting because they can all execute simultaneously, but they aren’t guaranteed to. So ideally many chunks of data will get executed simultaneously improving the overall speed of calculation. However, this still doesn’t solve the issue of memory transfers taking so long.

Thankfully, CUDA provides a function “cudaMemcpyAsync()” which is nearly identical to their standard “cudaMemcpy()” except it takes a stream as an additional parameter. This allows me to copy just specific chunks of data at a time to overlap memory transfers the same way I would overlap the kernel calculations.

Also a note, after calling the memory transfer on all the streams, the streams need to be synchronized to ensure all the memory has finished transferring before moving forward with the program.

Below now shows my final iteration where I managed to successfully overlap kernel calculations and memory transfers.

As you can see I have launched 5 separate streams which are experiencing overlap and greatly increasing performance. This iteration of my code executed in approximately 0.9 seconds, clearly much faster than anything prior.

Final times:

Serial implementation — CPU: 100.9 seconds

Unpinned memory — GPU: 4.3 seconds

Pinned memory — GPU: 2.8 seconds

Utilizing streams — GPU: 0.9 seconds

Throughout this project, I definitely learned a lot more than I had expected to at the start. Yes, I learned CUDA, but also now have a much better understanding of how to include third-party libraries in compilation, use multiple different languages within one program (at least C, C++, and CUDA), and work with the SDL graphics library. Overall, this was an awesome project which helped me improve as a programmer and I look forward to doing more like it in the future.

--

--

Brendan Wilhelmsen

I love to code, problem solve, and play music. I am currently a student attending the Elite Software Engineering Program at Qwasar Silicon Valley.