What are CUDA graphs? How are they implemented? What does it take to actually use them in PyTorch?
Further reading.
Hello everyone, and welcome to the PyTorch Dev podcast. Today I want to talk about CUDA graphs, an NVIDIA mechanism for reducing kernel launch overhead by, essentially, bundling all your CUDA kernel launches together into one big replayable unit that you can run really fast.
Why do CUDA graphs exist? To understand this question, we have to think a little bit about how the CUDA programming model works. The way it works (see my previous podcast, "Enough CUDA to be dangerous") is that there are a bunch of kernels the CUDA GPU knows how to run; your host code, regular old CPU code, figures out which kernels it wants to run and queues them on a stream. Whenever the CUDA driver gets these kernel launches, it goes ahead and runs them on your GPU. If your data is really big and each operation takes a long time on the GPU, then after a short launch latency, the time it takes to get to the first CUDA launch, you basically just keep queueing kernels on the stream, and CUDA runs them as fast as possible as the previous work finishes.
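To make the launch-and-queue model concrete, here's a little PyTorch sketch (nothing CUDA-graph-specific yet, just ordinary ops on made-up shapes):

```python
import torch

# Each line below enqueues one or more kernels on the current CUDA stream and
# returns to the host almost immediately; the GPU works through the queue in order.
x = torch.randn(4096, 4096, device="cuda")
y = x @ x            # queued
z = torch.relu(y)    # queued, runs after the matmul finishes
w = z.sum()          # queued

# The host only blocks when it actually needs a result.
torch.cuda.synchronize()
print(w.item())      # .item() would also have synchronized implicitly
```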
But sometimes your code is too small and runs too fast, or maybe NVIDIA's graphics cards are just way too fast, and you've got a problem: you can't keep up with the GPU, you can't feed it enough work to keep it utilized. When you're in this regime, where your tensors are really small and you have a lot of itty-bitty kernel launches, the kernel launch overhead can be pretty killer. CUDA graphs are a solution to this problem.
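Here's a rough way to see that regime for yourself; the numbers depend entirely on your GPU and driver, so treat this as an illustrative sketch rather than a benchmark:

```python
import time
import torch

a = torch.randn(32, 32, device="cuda")    # deliberately tiny
torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(10_000):
    a = a * 1.0001 + 0.0001                # a couple of tiny elementwise kernels per iteration
torch.cuda.synchronize()
elapsed = time.perf_counter() - start

# With tensors this small, most of the time is host-side launch overhead, not GPU
# math: the GPU finishes each kernel long before the next one arrives.
print(f"{elapsed * 1e6 / 10_000:.1f} us per iteration")
```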
What a CUDA graph lets you do is take a whole bunch of kernel launches and bundle them up into one giant graph launch, so you don't pay the per-kernel launch overhead. You've gotten rid of all that launch overhead, and because you're no longer re-running the host code each time, your CPU overhead and CPU utilization are lower too. Then you can just go ahead and replay the graph over and over again.
Okay, so that's the concept behind CUDA graphs. But if I told you, hey, I need you to go implement CUDA graphs for me, you might think about it a bit and realize this is actually not so easy to do. Normally, what you would imagine is: hey, I want some sort of graph representing the entirety of the computation I want to do, and then I'm going to feed it to some sort of internal engine that compiles it into one monolithic thing I can hand off to the GPU. But no such graph representation exists for CUDA. CUDA was designed from the very beginning as a streaming API.
What's actually going on is that, in PyTorch, we've got loads and loads of CUDA kernels all over the place. They don't even necessarily have publicly visible names; they can live in an anonymous namespace. And they've got all these arguments you're calling them with: the tensors they operate on, the various values you pass in the kernel's parameter buffer, like whatever scalar you want to multiply things by, anything like that. How would you actually assemble a graph out of this?
CUDA graphs, like many other wonderful technologies, such as the JIT TorchScript tracer, require you to go and run your CUDA kernels first and record a CUDA graph that you can then run again in the future. That being said, there is an API for explicitly building CUDA graphs and modifying them after the fact, but that's not the preferred way of generating one. The preferred way of generating a CUDA graph is to actually run your code once, which produces a bunch of CUDA kernel launches, and when you do those launches, CUDA records everything about how you launched them: what tensors you're passing, what parameters you're passing, all of that gets recorded as is.
So that means it's totally hard coded. If you use some CUDA memory inside your region of CUDA calls, that memory is going to be the very same memory that a subsequent run of the CUDA graph uses. Because remember, NVIDIA has no idea what the parameters you're passing to the kernels mean. It's totally flexible: you can pass anything you want, any structs you want. So CUDA has no way of just swapping out pointers if you wanted to use different memory the next time you run the graph.
When you're doing CUDA graphs, you have to make sure you allocate your memory in a persistent way, so that the next time you want to run your code, you can reuse that memory. So the model behind CUDA graphs is that you run your CUDA code with a special setting on the memory allocator so the memory gets kept for later. Once you're done, you get this CUDA graph, and for whatever the input CUDA tensors were, you fill them in with the new inputs you want to run on. Then you can say, "Okay, NVIDIA, go run the CUDA graph," and bang, bang, bang, it runs the kernels exactly as they ran previously.
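In terms of the Python API that was landing around this time (the torch.cuda.CUDAGraph class and the torch.cuda.graph context manager), the pattern looks roughly like this. It's a sketch: exact details may vary by PyTorch version, and I'm skipping the warmup iterations the docs recommend before capture, to keep the focus on the memory story.

```python
import torch

# These buffers are the "hard coded" memory: every replay reads and writes exactly
# these addresses, so they have to stay alive as long as the graph does.
static_x = torch.zeros(1024, device="cuda")

g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):                  # capture: launches are recorded, not run for real
    static_out = torch.sin(static_x) * 2

# To "call" the graph on new data: copy it into the captured input buffer...
static_x.copy_(torch.randn(1024, device="cuda"))
g.replay()                                 # ...then re-issue the recorded kernels in one go
result = static_out.clone()                # clone it, since the next replay overwrites static_out
```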
Oh, and one last thing: how exactly do CUDA graphs know which kernel launches to record? They're stream-based. Remember, a stream in CUDA is a queue that keeps track of operations and the order they need to run in. If you put things on the same stream, they're guaranteed to run in the order they were put in; if you have multiple streams, they can run in any order relative to each other. It's a little hard to use streams correctly because it's a very fine-grained form of parallelism, and sometimes your GPU physically just can't exploit it, but it is a useful API.
When you record a CUDA graph, you're not recording every CUDA launch globally; you're recording the launches on specific streams. PyTorch is not that great at being stream-friendly. By default, PyTorch runs on the default stream, which synchronizes with everything; it's very easy to use and you don't have to worry much about it. But sometimes you want multiple streams, and then you have to actually write your code differently. It's easy to get this wrong, because if you forget to handle streams properly and everyone runs your code on the default stream, chances are things will just work out, so the bug goes unnoticed. So Michael Carilli, the NVIDIA engineer who's been working a lot on CUDA graph support in PyTorch, has also had to fix a bunch of stream bugs, especially in our autograd engine, to make everything work out.
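Concretely, capture at the stream level looks something like the sketch below, using the lower-level CUDAGraph.capture_begin / capture_end calls so the stream handling is visible (the torch.cuda.graph context manager does this bookkeeping for you, as I understand it):

```python
import torch

x = torch.ones(1024, device="cuda")
g = torch.cuda.CUDAGraph()

# Capture is per-stream, and the default stream can't be captured, so the work
# has to happen on an explicitly created side stream.
side = torch.cuda.Stream()
side.wait_stream(torch.cuda.current_stream())    # pick up any pending default-stream work

with torch.cuda.stream(side):
    for _ in range(3):                           # a few warmup iterations before capturing
        y_warm = x * 2 + 1
    g.capture_begin()
    y_static = x * 2 + 1                         # these launches get recorded, not executed
    g.capture_end()

torch.cuda.current_stream().wait_stream(side)    # rejoin the default stream
g.replay()                                       # later: re-run the recorded launches
```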
So that's basically most of what you needed to know about CUDA graphs. They are a way of running a bunch of CUDA kernels all together at once, and they hard code all the parameters, so that just leads to some UX problems that you have to be aware of if you want to use them.
I want to recap something that I talked about in the random number generators podcast, which was about the Philox random number generator, because this has a very interesting interaction with CUDA graphs. This is kind of bonus material. I've already said the most important thing about CUDA graphs, but this is interesting and I want to talk about it.
So I said that everything gets hard coded, and in particular, the random number state gets hard coded when you record your CUDA graph. Think about it: what I said in the RNG podcast is that the CUDA RNG state actually lives on the CPU, not on the CUDA device. You just pass the seed and the offset directly in the kernel parameters, and the CUDA kernel sets up the Philox state from them and does its sampling. It's pretty cool and very nice. And it's a complete disaster for CUDA graphs, because it means you're going to get the same random numbers every single time you run your CUDA graph. Okay, maybe that's okay. But usually it's not, and you really do want different random numbers every time.
So how do you solve a problem like this? Clearly you need some way of feeding in which part of the sequence you're at, or the seed, or something like that, via CUDA memory, because the parameters are totally hard coded, right? It can't be anything passed in the parameters, so it has to live on the CUDA device. But then how do you get it to the device? Do I have to, when I launch my kernel, first do a host-to-device copy of the RNG state to CUDA memory, and then run the kernel that way? That doesn't sound so great.
To be fair, it wouldn't be that bad, because remember, everything is asynchronous. As long as the host memory is pinned, which is not too hard to arrange, you can just trigger the copy asynchronously and have the transfer happen whenever CUDA gets around to it. But there's a better way to do it.
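For reference, the pinned-memory trick itself looks something like this in PyTorch; this is just the general mechanism, not the actual RNG machinery:

```python
import torch

# Page-locked ("pinned") host memory lets the copy engine do the transfer
# asynchronously with respect to the host.
host_buf = torch.tensor([1234], dtype=torch.int64, pin_memory=True)
dev_buf = torch.zeros(1, dtype=torch.int64, device="cuda")

dev_buf.copy_(host_buf, non_blocking=True)   # enqueued on the current stream; the host doesn't wait
# Any kernel queued on the same stream after this copy will see the new value.
```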
The better way is to pass in a pointer to a little bit of CUDA memory that doesn't say what the seed or the offset should be, but instead is an offset correction. What's the idea? We're going to impose a restriction: if you want to use CUDA graphs with RNGs, you have to reuse the same seed, because the seed we're sending up in the parameters is hard coded and we can't do anything about it. But then, on subsequent calls to the CUDA graph, all I want is to advance the random number stream by however much I've consumed since the capture, right?
So the only extra piece of information I need is this offset. What I can do is: when I'm running normal PyTorch code and there are no CUDA graphs involved, I set a bit in the kernel parameters saying, "Hey, this is non-capturing; just use the seed and the offset directly, you don't have to do anything else." But if I'm in capturing mode, I set a different bit, and I send a pointer to the memory holding the offset adjustment, saying, "Hey, when you compute the RNG state, use the seed, use the offset, but also read this extra offset out of memory and apply it as an adjustment."
At the very beginning, the adjustment is zero, because whatever the seed and offset were at the time of recording are the correct ones. But later, when I want to rerun the CUDA graph, all I need to do is a small host-to-device write setting that adjustment to wherever the RNG currently is. Then when I run the CUDA graph, it reads the adjustment out of that memory and offsets the random numbers exactly how I need them to be.
There's one last thing I need for this, which is to know how many random numbers my CUDA graph consumes, but that's not too hard to figure out: you just record what the RNG state was at the beginning and what it was at the end. This was not obvious to us at the very beginning, and Michael Carilli, Natalia, and I spent a while thinking about how to solve it. But I think this solution is very elegant, and once again, it comes out of having to solve the problem of "CUDA graphs hard code everything in the parameters."
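If it helps, here's a plain-Python model of the idea. To be clear, this is not PyTorch's actual data structure (the real thing lives in C++ and these names are made up); it's just a sketch of the decision the kernel makes:

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class PhiloxArgs:
    seed: int                               # baked into the kernel's launch parameters
    offset: int                             # also baked in at capture time
    # In capture mode the kernel additionally reads an adjustment out of a bit of
    # device memory; a one-element list stands in for that pointer here.
    offset_adjustment: Optional[List[int]] = None

def effective_offset(args: PhiloxArgs) -> int:
    """What the kernel actually feeds into its Philox counter."""
    if args.offset_adjustment is None:
        return args.offset                            # normal, non-captured launch
    return args.offset + args.offset_adjustment[0]    # captured: hard-coded offset + live adjustment

# At capture time the adjustment is zero, so a replay reproduces the recorded behavior.
captured = PhiloxArgs(seed=1234, offset=0, offset_adjustment=[0])
assert effective_offset(captured) == 0

# Before a later replay, the host writes the generator's current position into that
# device memory (a small host-to-device copy), and the very same recorded kernel
# now draws a fresh stretch of the Philox stream.
captured.offset_adjustment[0] = 4096    # however many numbers were consumed since capture
assert effective_offset(captured) == 4096
```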
Apparently, in an old version, someone was actually going into the CUDA graph post facto and editing all of the RNG kernel parameters to update them. That was a terrible idea, and it's exactly why we needed a better solution to this problem.
Okay, so that's the end of the fun technical digression. How can you actually use CUDA graphs in practice? We're working on landing the last PRs that give a nice user API. But there is something very important about CUDA graphs, which is that if you want to deploy them and use them in a production setting, you need to be able to run your code once up front to actually get the CUDA graph in question.
This is why things like TorchDeploy are actually very important for CUDA graphs. If you want to use CUDA graphs for, say, GPU inference, which is a situation where overhead matters a lot, you still need to bootstrap the CUDA graph at the very beginning before you can run it. If you can run Python code in your environment, and that's what TorchDeploy is all about, then you can use the slow Python code to build the CUDA graph and then pass it off to some C++ engine that just repeatedly replays it in the future. That'll be really good: you use Python for the slow initialization, and then everything else doesn't need to touch Python at all. And that's, I think, one of the main draws of CUDA graphs.
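Sketching what that split might look like with the torch.cuda.graph API from earlier (the bootstrap helper here is hypothetical, not an actual PyTorch or TorchDeploy API):

```python
import torch

def bootstrap_graph(model, example_input):
    """One-time, Python-side setup: capture the model once and hand back the pieces."""
    model = model.cuda().eval()
    static_in = example_input.cuda().clone()

    # Warmup on a side stream, following the pattern the PyTorch docs recommend.
    s = torch.cuda.Stream()
    s.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(s), torch.no_grad():
        for _ in range(3):
            model(static_in)
    torch.cuda.current_stream().wait_stream(s)

    g = torch.cuda.CUDAGraph()
    with torch.no_grad(), torch.cuda.graph(g):
        static_out = model(static_in)
    return static_in, static_out, g

# Slow path: Python builds the graph once at startup...
net = torch.nn.Sequential(torch.nn.Linear(256, 256), torch.nn.ReLU())
static_in, static_out, g = bootstrap_graph(net, torch.randn(32, 256))

# ...fast path: the serving loop (which could live entirely behind a C++ engine)
# just copies inputs into the captured buffer and replays.
for _ in range(3):
    static_in.copy_(torch.randn(32, 256, device="cuda"))
    g.replay()
    prediction = static_out.clone()
```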
All right, that's everything I wanted to say for today. Talk to you next time.