compute shaders in opengl 4.3

21 Nov 2021 :: 12 min read

if you’re like me and hate intro guff then feel free to skip to the section “let’s get started then” to cut the shit and learn how to make your gpu go brr

now i’ll be the first to admit: i’m not great at graphics programming. i’ve got the most cursory knowledge of opengl 3.3 (nevermind 4.3) if you can even call it that. frankly, i still don’t even completely get what a vertex array object even is. however, i am comfortable enough in my knowledge to at least make things work (most of the time, anyway). the bulk of what i know comes from the excellent series of written guides over at learnopengl. i really cannot recommend it enough if you want to get comfortable with not only writing applications with opengl, but also the concepts you’ll often come up against in 3d graphics.

maybe it was hubris that forced me to do this, but i decided that as a part of one of my final units in my undergrad uni course i would write compute shaders. it’s not even something that was strictly necessary but compute shaders were always this really cool arcane art that i wanted to have control over. so against my better judgement, i set out to learn how to harness their power.

okay but what are compute shaders?

i’m glad you asked! compute shaders allow us to, outside of the regular rendering pipeline, run arbitrary shader code on the gpu so we can exploit the cool properties of its architecture. most notably, gpus are insanely fast at parallel floating point computations. in layman’s terms it just do maths real fast like. this is great because while cpus are fast, they tend to be faster with branching operations (so things like conditional statements) due to how their architecture is set up.

this isn’t a new concept, by the way. in fact we have a name for it already: general purpose gpu programming (often shortened down to gpgpu). moreover, we actually already have ways to do this outside of compute shaders using things like cuda or opencl. now, if you’re going to ask me “well why don’t we just use those instead of compute shaders?” then i’ll just point you to this stack overflow post with an answer from someone a lot smarter than me and carry on.

the current compute shader situation

so as one would usually do when trying to learn something, i started searching around online. one of the first resources you’ll come across is the official documentation on compute shaders from khronos group. there’s absolutely lots of useful information here, such as the version of opengl that we first saw compute shaders in and some of the quirks they have. it’s… pretty complicated though. i mean it makes sense: the docs are technical because everything about this is technical. regardless, as useful as some of the information is, on its own it wouldn’t be enough for my tiny brain.

so now onto looking for tutorials, of which there are… not many. you see, compute shaders first became available in opengl 4.3 which was released in 2012. that’s really not long ago, and when you also consider that graphics programming is a niche practice, with compute shaders being an even more niche subset of it… yeah it’s sorta obvious that this would be the case. most of what i found that talked about compute shaders did so in the context of unity. don’t get me wrong, this is absolutely cool but it’s also just not that helpful for me.

in all my scouring online i really only found a handful of potentially useful resources. one (which i now can’t find) looked promising, but the provided sample code didn’t compile and some of the things they were talking about seemed a bit off base so i gave it a skip. the next one i found was this pretty detailed guide on real time raytracing using compute shaders by anton gerdelan. it looked promising for sure, but maybe i was just too thick to comprehend much of what was being talked about, especially since my use case was going to be more general than what they were talking about. the last thing i found was this set of lecture slides from oregon state university which actually is a really good primer on what compute shaders are and how they work. if you want a proper understanding of them then absolutely read through those slides because they’ve helped me.

so yeah. not a lot of resources going around. but obviously i’ve figured it out at this point (which is why i’m writing this!) so how did i go about figuring it out? well, all i really did is use my existing opengl knowledge, read the opengl docs a lot, and poke around in my good mate cat flynn’s implementation of the previously mentioned real time raytracer. the following then is documentation of my findings, and how to replicate what i’ve achieved.

let’s get started then

i’m working off the following assumptions:

you’ve got pretty solid knowledge of opengl already, or you’ve completed the textures lesson on learnopengl
you already have a project set up with windowing and an opengl context set to 4.3 core profile
we want to be able to pass in arbitrary data to the gpu, perform maths on it, and then read out the data from the gpu

i also will have all of my code that i reference in this article available publicly in this repo (correct commit hash already linked to). for those interested, the tech used in my implementation is as follows:

glfw
glad
glm

a quick primer on how compute shaders work at a high level

compute shaders are, in concept, pretty simple. as previously mentioned we’re making the gpu process data for us, and it roughly goes down like this:

1. we send data to the gpu
2. we tell the gpu, through shader code, to perform a set of operations on that data
3. we wait for the gpu to finish processing
4. we retrieve the output from the gpu

to dig a little deeper, we can also tell the gpu how many work groups to dispatch during step 2 which is done by defining 3d dimensions for the work group to be bound by. that might sound a bit weird, but it’s really not all that bad. just know that the number of work groups you’ll end up with is x*y*z, and since we’ll be using a local size of 1 in our shader, that’s also the number of workers. this means that, provided your gpu can handle it, you can define a data set of some arbitrary size and then assign a single worker to each point of data. this becomes really important to understand later on, so keep this in mind.
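to make the x*y*z bit concrete, here’s a tiny cpu-side sketch of the arithmetic. Dim3 and total_invocations() are hypothetical names made up for illustration (they’re not in the repo), and the local size is the thing we declare inside the shader later on:

```cpp
#include <cstdint>

// hypothetical helper type: a plain 3d size, standing in for the three
// arguments to glDispatchCompute() or the local_size_* values in the shader
struct Dim3 {
    std::uint32_t x, y, z;
};

// total number of workers (shader invocations) for one dispatch:
// (number of work groups) * (workers per work group)
std::uint64_t total_invocations( Dim3 groups, Dim3 local_size ) {
    const std::uint64_t group_count = std::uint64_t( groups.x ) * groups.y * groups.z;
    const std::uint64_t per_group = std::uint64_t( local_size.x ) * local_size.y * local_size.z;
    return group_count * per_group;
}
```

so total_invocations( Dim3{ 10, 1, 1 }, Dim3{ 1, 1, 1 } ) comes out as 10: ten work groups with one worker each, which is enough for one worker per value in a 10 element data set.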

so let’s go over the parts in this machine that make it tick.

part 1: the shader program

this should really not seem all that strange to you at this point. we go through the regular motions of compiling a shader, creating a program, linking the program with our new shader, etc. the only thing of note really is that when we create our shader we pass GL_COMPUTE_SHADER through to the glCreateShader() function for what i hope are obvious reasons (if it’s not obvious why we do this, then read the docs on this function). in my implementation i abstract everything away into compute.h so we can just construct the Compute class to work with our compute shader and its associated program.

part 2: input/output

this is the part i probably struggled with the most for the longest time. initially i tried to get shader storage buffer objects working since they felt like the best fit for my purpose but that kinda fell through. instead i use a single texture for input/output, which might make you scratch your head a bit. “textures are images!” i hear you say, and yeah you’re absolutely right. but textures are way cool for a couple reasons:

in opengl they’re just tightly packed values for the components of each pixel
they’re actually pretty easy to work with (for the most part)

so yeah, we can totally use them to store arbitrary values! and hey, if you wanna render them later then you can by all means do that (in fact, this may be useful for certain debugging purposes!).

so let’s create the texture that we will use to input/output data to/from the compute shader (all of this also in compute.h):

// generate texture
glGenTextures( 1, &out_tex );
glActiveTexture( GL_TEXTURE0 );
glBindTexture( GL_TEXTURE_2D, out_tex );

// ???
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_NEAREST );
glTexParameteri( GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_NEAREST );

// create empty texture
glTexImage2D( GL_TEXTURE_2D, 0, GL_R32F, size.x, size.y, 0, GL_RED, GL_FLOAT, NULL );
glBindImageTexture( 0, out_tex, 0, GL_FALSE, 0, GL_READ_WRITE, GL_R32F );

this whole jig comes in 3 distinct parts:

1. generating the texture and binding it (you should be familiar with this already)
2. setting the scaling filter mode for the texture to nearest neighbour
3. initialising the texture

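i actually don’t know for sure why we need to set parameters in step 2. all i know is in my testing it wouldn’t let the shader modify the values without them being set. the most likely explanation, as far as i can tell, is that the default minification filter expects mipmaps, so a texture with only a single level counts as incomplete until we pick a non-mipmap filter like GL_NEAREST, and image access to an incomplete texture isn’t valid. so it stays then. the more interesting part is the final step where we create the empty texture. this defines a few things, so let’s take it bit by bit.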

first we have our call to glTexImage2D() which the docs tell us takes the following arguments:

void glTexImage2D(  GLenum target,
   GLint level,
   GLint internalformat,
   GLsizei width,
   GLsizei height,
   GLint border,
   GLenum format,
   GLenum type,
   const void * data);

i won’t go into detail on everything here since not all of it is really that important to us. what does matter however is the internalformat, width, height, format, and type:

internalformat and format indicate to opengl what format the texture will be in. for our purposes we’re just going to send a tightly packed array of 32-bit floats, which is GL_R32F and GL_RED respectively. if you want to get more details on why we set the values like that or what else is available, i’ll again point you towards the relevant documentation.
width and height should be pretty self explanatory, but it should be noted that width * height will end up being the number of discrete data points we can operate on.
type indicates to opengl what the type of data we’re sending is. in this case it’s pretty simple — they’re float values so we say they’re floats.
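since the data is just a tightly packed array of floats laid out row by row, it helps to see how an index into that array lines up with a texel position. this is a small cpu-side sketch; Texel, flat_index() and texel_of() are hypothetical names for illustration and not something opengl or the repo provides:

```cpp
#include <cstddef>

// hypothetical helper type: the (x, y) position of one value in the texture
struct Texel {
    std::size_t x, y;
};

// index into the flat float array for the texel at (x, y):
// rows are stored one after another, each `width` floats long
std::size_t flat_index( Texel t, std::size_t width ) {
    return t.y * width + t.x;
}

// the inverse: which texel a given array index ends up in
Texel texel_of( std::size_t index, std::size_t width ) {
    return Texel{ index % width, index / width };
}
```

for a texture that is 10 wide, the value at array index 23 lands at x = 3, y = 2. with a height of 1 (which is what we end up using later) the index and the x position are the same thing.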

after this we call glBindImageTexture() which is vital, because this will bind our texture so that we are able to access it from within the compute shader. the docs tell us it takes the following arguments:

void glBindImageTexture(  GLuint unit,
   GLuint texture,
   GLint level,
   GLboolean layered,
   GLint layer,
   GLenum access,
   GLenum format);

the only things of interest to us are unit, access, and format:

unit can be thought of as the index for the binding (this is super important later on)
access defines how we should be allowed to access the texture (since ours is input and output we set it to read and write, but you may want this to be different in other circumstances!)
format is the same deal as with our glTexImage2D() call

all things done correctly, now we should have an input and output to our compute shader set up! you can of course change this to have more textures set up in different ways to meet your needs, but this should be pretty straightforward for you to do.

part 3: the shader

now we get to the really cool stuff: the actual data processing! opengl uses a c-like language for shaders called glsl, which if you don’t know how to write then you’ll need to go over at least the shaders section on learnopengl. if you want to follow along with the exact code i’ll be talking about here, then look at the file shader.comp.

before we get into writing any glsl code though we should probably talk about what we have access to. some of the other things you might be used to in vertex or fragment shaders aren’t available to us here, and instead we get the following inputs:

in uvec3 gl_NumWorkGroups;
in uvec3 gl_WorkGroupID;
in uvec3 gl_LocalInvocationID;
in uvec3 gl_GlobalInvocationID;
in uint gl_LocalInvocationIndex;

i won’t bother going over every single one here (if you want to know what they’re all for then check out the relevant documentation), i’ll just focus on what’s important to us: gl_GlobalInvocationID. recall earlier when i mentioned how we define the size of the work group:

we can also tell the gpu how many work groups to dispatch […] which is done by defining 3d dimensions for the work group to be bound by.

well, gl_GlobalInvocationID is the index of the current worker across the whole dispatch (as opposed to gl_LocalInvocationID, which is its index within its own work group)! this lets us know which worker we currently are, which we can use to figure out which piece of data from the input set we should be looking at.
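if you’re wondering how the global id relates to the other inputs, glsl defines it as gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID, where gl_WorkGroupSize is a built in constant holding the local size (it’s not in the list above because it’s not an input). here’s a cpu-side sketch of that formula; UVec3 and global_invocation_id() are hypothetical names for illustration only:

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// hypothetical stand-in for glsl's uvec3
using UVec3 = std::array<std::uint32_t, 3>;

// mirrors the glsl definition, component by component:
// gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID
UVec3 global_invocation_id( const UVec3& work_group_id,
                            const UVec3& work_group_size,
                            const UVec3& local_invocation_id ) {
    UVec3 id{};
    for ( std::size_t i = 0; i < 3; ++i ) {
        id[ i ] = work_group_id[ i ] * work_group_size[ i ] + local_invocation_id[ i ];
    }
    return id;
}
```

with a local size of 1 in every dimension the local id is always zero, so the global id is just the work group id. that’s why we can use it directly as a texture position.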

there’s only a couple things left before we get to the meat of the shader:

layout(local_size_x = 1, local_size_y = 1, local_size_z = 1) in;
layout(r32f, binding = 0) uniform image2D out_tex;

the first line defines the size of the local work group, which is a further subset of the global worker set. please don’t ask me how to use this cause frankly i do not know (at least, right now). maybe i’ll look into it in the future (and write another article!) but for now, if you’re interested in them then dig a bit deeper with this relevant piece of documentation.
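for what it’s worth, a common pattern (not something the repo does, so treat this as an assumption on my part) is to pick a local size bigger than 1 and then dispatch just enough work groups to cover the data. the arithmetic for one axis is a rounded-up division; groups_needed() is a hypothetical helper name:

```cpp
#include <cstdint>

// hypothetical helper: how many work groups to dispatch along one axis so
// that groups * local_size covers at least `items` data points.
// local_size must be greater than zero.
std::uint32_t groups_needed( std::uint32_t items, std::uint32_t local_size ) {
    // integer division rounded up instead of down
    return ( items + local_size - 1 ) / local_size;
}
```

so 10 items with a local size of 8 needs 2 work groups, which gives 16 workers. the 6 spare workers fall outside the data, so a shader written this way would need to check its position against the image size before reading or writing.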

the second line is more interesting to us since it grabs our texture we stored earlier on. note the start of this line, because this is where we indicate that the texture is of format r32f and at binding 0 — in other words what we set earlier when creating and binding our texture! then we just do some other pretty standard stuff when it comes to declaring variables in glsl, so i won’t bother going over that.

one thing i will note is that i think i remember reading somewhere that uniforms aren’t ideal in compute shaders. frankly i don’t know why that is or if that even is the case. hopefully someone a lot smarter than me will tell me why, and then proceed to tell me off for using something so slow or whatever.

with all that out of the way, now we have the body of our shader!

void main() {
    // get position to read/write data from
    ivec2 pos = ivec2( gl_GlobalInvocationID.xy );

    // get value stored in the image
    float in_val = imageLoad( out_tex, pos ).r;

    // store new value in image
    imageStore( out_tex, pos, vec4( in_val + 1, 0.0, 0.0, 0.0 ) );
}

so yeah, this is pretty straightforward. the steps we take are:

get the position of the texture to read/write from based on our global invocation id
read in the value from the texture using the built in imageLoad() glsl function
store a new, modified value into the texture using the built in imageStore() glsl function

just a couple notes on the above. notice that upon calling imageLoad() we also read out the r component of the return value. this is because it will always return a vec4 regardless of the format of the image. with that in mind it’s probably pretty obvious why we then need to create a vec4 to pass through to imageStore() — it always expects a vec4 for colour, regardless of the format of the image.
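if it helps to see the data flow without any gpu involved, here’s a rough cpu stand-in for a single dispatch of the shader. emulate_dispatch() is a hypothetical function for illustration, and it assumes a local size of 1 with one work group per texel. on the gpu the workers run in parallel and in no guaranteed order, so the loops here are only a model:

```cpp
#include <cstddef>
#include <vector>

// cpu model of one dispatch over a width x height r32f image stored as a
// flat, row by row array of floats. each loop iteration plays the role of
// one worker.
void emulate_dispatch( std::vector<float>& image, std::size_t width, std::size_t height ) {
    for ( std::size_t y = 0; y < height; ++y ) {        // gl_GlobalInvocationID.y
        for ( std::size_t x = 0; x < width; ++x ) {     // gl_GlobalInvocationID.x
            const std::size_t texel = y * width + x;
            const float in_val = image[ texel ];        // imageLoad( out_tex, pos ).r
            image[ texel ] = in_val + 1.0f;             // imageStore( out_tex, pos, ... )
        }
    }
}
```

calling it on the values 0 through 9 with a width of 10 and a height of 1 leaves the vector holding 1 through 10, since every worker only ever touches its own texel.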

some quick last things before we put this all together

so now all the core parts are established, all that’s left really is to establish ways for our program to use this all. the following code will be present in compute.h.

we’ll declare methods on the Compute class which let us use the compute program:

void use() {
    glUseProgram( id );
    glActiveTexture( GL_TEXTURE0 );
    glBindTexture( GL_TEXTURE_2D, out_tex );
}

void dispatch() {
    // just keep it simple, 2d work group
    glDispatchCompute( work_size.x, work_size.y, 1 );
}

void wait() {
    glMemoryBarrier( GL_ALL_BARRIER_BITS );
}

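use() will set up all the things we need in order to… well use the program. dispatch() will start the compute shader using the given number of work groups. wait() is a slightly misleading name: glMemoryBarrier() doesn’t actually block our program, it tells opengl that whatever we do after it (like reading the texture back) has to see the writes the shader made. the actual waiting happens when we read the values back out.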

for setting values on the gpu for the shader to work with we have the following:

void set_values( float* values ) {
    glTexImage2D( GL_TEXTURE_2D, 0, GL_R32F, work_size.x, work_size.y, 0, GL_RED, GL_FLOAT, values );
}

really all we’re doing is the same thing as when we were creating the empty texture, but this time we’re actually sending data instead of a null pointer.

for getting values from the gpu we have the following:

std::vector<float> get_values() {
    unsigned int collection_size = work_size.x * work_size.y;
    std::vector<float> compute_data( collection_size );
    glGetTexImage( GL_TEXTURE_2D, 0, GL_RED, GL_FLOAT, compute_data.data() );

    return compute_data;
}

this is a bit more complicated, but still pretty straightforward. we’ll calculate the size of the texture to read back in, initialise a vector of that size, and then pass through a pointer to the underlying array to glGetTexImage(). hopefully the rest of the arguments make sense, but if they don’t then check out the docs here. then to close it all off, we just return the vector of floats we got back from opengl.

it’s go time

we finally have everything we need to process data on the gpu using compute shaders! have a look in main.cpp for the following code:

// initialise compute stuff
Compute compute_shader( "shader.comp", glm::uvec2( 10, 1 ) );
compute_shader.use();
float values[ 10 ] = { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 };
compute_shader.set_values( values );

// inside the main render loop
compute_shader.use();
compute_shader.dispatch();
compute_shader.wait();
auto data = compute_shader.get_values();

for ( auto d : data ) {
    std::cout << d << " ";
}
std::cout << std::endl;

at this point you should understand what we’ve done here (since we just wrote it all!), and now running it will get us this output:

1 2 3 4 5 6 7 8 9 10 
2 3 4 5 6 7 8 9 10 11 
3 4 5 6 7 8 9 10 11 12 
4 5 6 7 8 9 10 11 12 13 
5 6 7 8 9 10 11 12 13 14 
6 7 8 9 10 11 12 13 14 15 
7 8 9 10 11 12 13 14 15 16 
8 9 10 11 12 13 14 15 16 17 
9 10 11 12 13 14 15 16 17 18 
10 11 12 13 14 15 16 17 18 19 
11 12 13 14 15 16 17 18 19 20
...

and we have LIFTOFF!!

the shader is now constantly incrementing the values in the texture for us! this is a really simple example, and frankly it’s a pretty poor use case, but that’s not the point. hopefully through reading this you’ve gained an understanding of how to implement a trivial solution with compute shaders in opengl, and can now build upon this to create your own cool things with it.