PyTorch Developer Podcast

PT2 extension points

Episode Summary

We discuss some extension points for customizing PT2 behavior across Dynamo, AOTAutograd and Inductor.

Episode Transcription

Hello everyone, and welcome to the PyTorch Developer Podcast. Today, I want to talk about extension points to PyTorch 2. A lot of the work we're doing in PyTorch 2 involves adding new features, which sometimes have implications all over the stack. PyTorch 2's stack has a lot of different layers, and so sometimes planning out a change like this can be quite daunting because it's like, well, to add this feature, I need to understand Dynamo, and I need to understand AOT Autograd, and I need to understand Inductor.

Most people who work on PyTorch 2 full-time don't understand every layer of the stack. They really only work on one layer of the stack at a time. So asking someone to know about all of these things just so that they can add a new feature, that's a bit of a lift.

Fortunately, there are a number of pre-existing extension points in PyTorch 2 which you can use to implement functionality that otherwise doesn't exist right now. And furthermore, we even have some things that conceptually make sense but are just not implemented yet. But they could be implemented if someone wanted to go out and do them.

So in today's podcast, I want to walk us through some of the extension points in the PyTorch 2 stack and tell you about how these work and why they're consistent with the overall architecture of PyTorch 2. Because one of the main themes about these extension points is that the easy-to-implement extensions involve only a change to one part of our stack, without changing any of the global invariants throughout our stack. And we have some limited cases where we have a way to customize the behavior of something all the way through, but that tends to be a lot more work because you have to tell every subsystem how to deal with the thing in question.

So to get started, let's first quickly look at the topmost layer of the stack, namely Dynamo. Dynamo is all about understanding, for any given piece of Python code, what exactly it is doing, and capturing it into an FX graph that is well-behaved enough that we can run AOT Autograd on it to trace out an actual set of functions in the end.

So if we are thinking about what we can do in the Dynamo frontend that isn't too difficult to do, one of the easiest and easiest-to-understand extensions is just adding support for other function calls.

So, you know, what is Dynamo's job in life? Dynamo's job is to look at bytecode, figure out what it's doing, and then put an appropriate function call into the graph so that AOT Autograd handles it. So you can change whether or not something is put into the graph simply by marking it as allow_in_graph.

Now there are restrictions. When you mark something as allow_in_graph, the function you place in the graph has to, quote-unquote, work with AOT Autograd, because what you are saying is that this function is well-behaved enough that AOT Autograd can trace through it. And so what that means is that it has to support a fake implementation, where you can run it with fake tensors without actually having to have real data. It needs to not have side effects, or it can only have side effects in limited situations where it is only allowed to mutate tensors. And if it mutates a tensor, it needs to be able to tell AOT Autograd that it is doing so.

Additionally, the function needs to only operate on basic types that are supported by FX. Normally these are the set of types that are supported by TorchScript, so that's tensor, list of tensors, int, you know, basic primitive types. If you've got a custom data type and you want that custom data type to be preserved inside of the FX graph that Dynamo is producing, that is much more of a lift. But just putting in another function and asking it to be directly traced through, that's something that you can do quite easily inside Dynamo itself.
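
To make this concrete, here's a minimal sketch of the allow_in_graph extension point, using the torch._dynamo API; my_helper is just a hypothetical function standing in for something you want placed into the graph as a single call rather than traced line by line by Dynamo.

```python
import torch
import torch._dynamo as dynamo

def my_helper(x):
    # Imagine this is something Dynamo has trouble tracing on its own.
    return x.sin() + 1

# Tell Dynamo to put calls to my_helper directly into the FX graph;
# AOT Autograd will then trace through it with fake tensors.
dynamo.allow_in_graph(my_helper)

@torch.compile
def f(x):
    return my_helper(x) * 2

print(f(torch.randn(4)))
```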

A step up from just putting in a function for regular tracing is the so-called higher-order operators mechanism. We call them higher-order operators because typically the reason they exist is that they are operators that take in not just regular arguments, but also arbitrary callables, which themselves tend to contain more graph operations. So typically, a higher-order operator with one of these callables will call that callable maybe never, or once, or twice, or whatever.

So a canonical example of a higher-order operator is the cond operator, which takes in two callables for the true side and the false side, and at runtime only executes one of them. These higher-order operators can be pretty restrictive. They are typically not allowed to have side effects. The bodies of these functions are typically not allowed to interact with the Python state in any non-trivial way. And when you implement a new higher-order operator, one of the things to know is that most of our basic infrastructure doesn't work on them automatically. So you have to say exactly, for example, how you want all of the AOT Autograd passes to work on them. But this is also a well-known extension point, and when people want to add new operations that are a bit more complicated, usually they use the higher-order operator mechanism.
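
For flavor, here's what cond looks like in use, a small sketch assuming a recent PyTorch where torch.cond is exposed.

```python
import torch

def true_fn(x):
    return x.sin()

def false_fn(x):
    return x.cos()

@torch.compile
def f(x):
    # Both branches get captured into the graph; only one runs at runtime.
    return torch.cond(x.sum() > 0, true_fn, false_fn, (x,))

print(f(torch.randn(4)))
```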

So you can extend Dynamo by modifying what it is willing to output to give AOT Autograd. You can also extend Dynamo by making changes to how Dynamo processes the Python code it is operating over. For example, when Dynamo is processing some code that is calling some API, let's say a NumPy call, I can have Dynamo transparently translate this API call into an equivalent Torch function call. And this is the mechanism by which we implemented our NumPy interoperability layer.

So if you have some code that does some operations on NumPy ndarrays, we actually support transparently compiling this into PyTorch operations. And you can often take a standard NumPy program and automatically get it running on CUDA without any modifications. This is a very local change, because all that's going on is Dynamo is producing a new set of torch operations where previously it would have just hit a graph break on the NumPy operations. So this change only requires you to know how to deal with Dynamo.
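
Here's a small sketch of that interop path: a plain NumPy function compiled with torch.compile, so the array math is traced into torch operations under the hood.

```python
import numpy as np
import torch

def numpy_fn(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    # Plain NumPy code; under torch.compile it is traced into torch ops.
    return np.sum(x * y, axis=1)

compiled = torch.compile(numpy_fn)

x = np.random.randn(8, 4)
y = np.random.randn(8, 4)
print(compiled(x, y))  # matches numpy_fn(x, y)
```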

Similarly, if you have some sort of custom user library code or another C extension that you need to interoperate with, one of the things that you could do is add support for it in Dynamo, simply by teaching Dynamo what the semantics of those operations are. We actually don't have a public API extension mechanism for doing this right now, because we just haven't implemented it yet. But in principle, Dynamo is unable to handle anything that goes into C extensions, and it often also can't handle Python code that is too complicated, that uses too many features. But you can always teach Dynamo internally to have a special case for this sort of situation and handle it in some direct way.

We actually have had some discussions about what a good API for this might look like. One really promising idea is the concept of polyfills. A polyfill, a term from JavaScript, is a plain implementation of some feature that is normally provided natively by your runtime, written in JavaScript in the case of the web. So a polyfill would make sense in Dynamo because, if you've got some code which doesn't work with Dynamo because it's implemented in C, and you write an equivalent implementation of it in Python, then Dynamo can just transparently trace into the Python implementation and understand what your program is doing. So this is a really promising way of letting people who own C libraries and want to interoperate with Dynamo make things work.
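
Since there's no public polyfill API to point at, this is only a conceptual sketch of the idea: fast_clamp_polyfill is a made-up, pure-Python stand-in for a hypothetical C-implemented function that Dynamo would otherwise graph break on.

```python
import torch

def fast_clamp_polyfill(x, lo, hi):
    # Same semantics as the hypothetical C implementation, but written in
    # plain Python/torch so Dynamo can trace straight through it.
    return torch.clamp(x, min=lo, max=hi)

@torch.compile
def f(x):
    # In a real polyfill mechanism, Dynamo would swap the C function for
    # the Python version automatically; here we just call it directly.
    return fast_clamp_polyfill(x, -1.0, 1.0) * 2

print(f(torch.randn(4)))
```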

And finally, one really interesting possibility that Michael Suo has been investigating is allowing Dynamo to trace non-standard tensor types into the graph entirely. And this is a good segue into the AOT Autograd segment of this podcast episode, because to do this, Dynamo is actually the easy part, right? So to handle an arbitrary class, in our particular case, we wanted to reuse the mechanism from TorchScript called TorchBind, which lets you take arbitrary C++ classes and make them available in TorchScript programs. And all you need to do in Dynamo is just say, okay, well, if I see some operations on one of these TorchBind classes, all I need to do is go ahead and put these operations in the graph. So this is actually the easy part as far as Dynamo is concerned; you just need a way of, once again, dry running these operations without having real data.

The real problem is, once you have these operations in the graph, what exactly is AOT Autograd going to do with them?

So remember, AOT Autograd is the part of our stack which is responsible for taking the output Python graph that was produced by Dynamo and then actually using all of the semantics, all the layers of PyTorch, including autograd, including functionalization, all of these things, to trace out a low-level ATen representation, which is suitable for handing to the backend compiler. So this is the part that actually knows all the smarts about how all of the various subsystems in traditional eager PyTorch work.

And this is, for example, the place where, when you add a new higher-order op, you now have to specify how this higher-order op should interact with each of the various things like tracing or functionalization or fake tensors, because that's what AOT Autograd is going to use.

So if we talk about something like TorchBind, then if you do add support for TorchBind, which isn't complete yet, you have these weird objects which aren't actual tensor operations. And so if you wanted AOT Autograd to work with them, you'd also have to teach AOT Autograd how to either partition them away, which is a very valid thing to do, right? Like, before you go from Dynamo to AOT Autograd, you could partition your graph up into multiple pieces and only feed AOT Autograd the pieces that AOT Autograd actually understands.

In fact, this is what we do for DDP optimization. It's an option you can use when you are running PyTorch 2 with distributed data parallel. And what it does is chunk up our graph so that you get pipelining with DDP, where every chunk starts sending its gradients to the other nodes before you finish running everything else, so you're not waiting for all the communications at the very end. And that's done by splitting up the graph before we pass it to AOT Autograd.
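
For reference, the knob for this lives in Dynamo's config; a minimal sketch, assuming the torch._dynamo.config.optimize_ddp flag:

```python
import torch
import torch._dynamo

# Split the Dynamo graph at DDP bucket boundaries so gradient communication
# can overlap with the rest of the backward pass.
torch._dynamo.config.optimize_ddp = True

# Then wrap and compile as usual (sketch only, not a full distributed setup):
# ddp_model = torch.nn.parallel.DistributedDataParallel(model)
# compiled_model = torch.compile(ddp_model)
```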

So you can conceivably get rid of things AOT Autograd can't understand by partitioning them into their own subgraphs before AOT Autograd handles them. You can also make AOT Autograd handle things directly. With higher-order ops, you can just specify how exactly the various layers should handle them. Or, for example, Brian Hirsh recently added support for tensor subclasses.

So in fact, tensor subclasses are a really nice extension point in PyTorch 2. And the reason they're so nice is because, you know, well, tensor subclasses act like normal tensors. So they typically don't need that many changes on the Dynamo side. That's not entirely true. For example, DTensor is built on tensor subclasses, and it has some extra API on top. And sometimes Dynamo needs to be taught how to understand that API and translate it into the graph.

And once you get to AOT Autograd, the real question is basically how to go ahead and desugar this tensor subclass into a simplified program that doesn't have any tensor subclass operations in it. So tensor subclasses maybe require some Dynamo work, and have some support in AOT Autograd, but they evaporate by the time you get to the backend compiler. So you don't actually need to work on Inductor if you do something like this.
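
To give a flavor of the extension point, here's a toy sketch of the wrapper-subclass pattern with __torch_dispatch__: it just logs every ATen op it sees and delegates to an inner plain tensor. Real subclasses like DTensor are considerably more involved, so treat this as illustration only.

```python
import torch
from torch.utils._pytree import tree_map

class LoggingTensor(torch.Tensor):
    """Toy wrapper subclass: logs every ATen op, then runs it on an inner tensor."""

    @staticmethod
    def __new__(cls, elem):
        # Wrapper-subclass boilerplate: the outer tensor mirrors elem's
        # metadata but holds no data of its own.
        return torch.Tensor._make_wrapper_subclass(
            cls, elem.shape, dtype=elem.dtype, device=elem.device
        )

    def __init__(self, elem):
        self.elem = elem

    @classmethod
    def __torch_dispatch__(cls, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        print(f"dispatching {func}")
        unwrap = lambda t: t.elem if isinstance(t, LoggingTensor) else t
        out = func(*tree_map(unwrap, args), **tree_map(unwrap, kwargs))
        wrap = lambda t: LoggingTensor(t) if isinstance(t, torch.Tensor) else t
        return tree_map(wrap, out)

x = LoggingTensor(torch.randn(4))
y = x + 1          # prints something like "dispatching aten.add.Tensor"
print(type(y), y.elem)
```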

And of course, AOT Autograd has a bunch of other knobs which you can use. For example, we have decompositions, which are the entire way we break down operations into simpler forms for the compiler. And you can do pre-autograd decompositions; you can also do post-autograd decompositions. These are all valid things to do, and you can customize them.
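
Here's a hedged sketch of supplying a set of decompositions to a custom backend, using the internal torch._decomp and torch._dynamo.backends.common.aot_autograd helpers; these are not fully public APIs, so take this as illustrative only.

```python
import torch
from torch._decomp import get_decompositions
from torch._dynamo.backends.common import aot_autograd

aten = torch.ops.aten

# Pick decomposition rules for a couple of composite ops, so the backend
# sees them broken down into simpler ATen ops.
decomps = get_decompositions([aten.native_layer_norm, aten.gelu])

def inner_compiler(gm, example_inputs):
    gm.graph.print_tabular()  # inspect the decomposed ATen graph
    return gm.forward         # just run the traced graph as-is

my_backend = aot_autograd(fw_compiler=inner_compiler, decompositions=decomps)

@torch.compile(backend=my_backend)
def f(x):
    return torch.nn.functional.gelu(x)

f(torch.randn(8))
```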

You can obviously implement custom operators, which are just like the regular operators that PyTorch has natively. If you go ahead and use this API and implement what all the various operations on them should be, you can actually just preserve them all the way to Inductor, and Inductor will just call you when it actually wants to run the operation.
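
As an illustration, here's a sketch of a custom operator with a fake implementation so it can flow through AOT Autograd and be preserved all the way down to Inductor, assuming the torch.library.custom_op API from recent releases; mylib::scaled_add is a made-up op name.

```python
import torch

# Define the op; Inductor will call back into this implementation at runtime.
@torch.library.custom_op("mylib::scaled_add", mutates_args=())
def scaled_add(x: torch.Tensor, y: torch.Tensor, alpha: float) -> torch.Tensor:
    return x + alpha * y

# Fake (meta) implementation so AOT Autograd can trace it with fake tensors.
@scaled_add.register_fake
def _(x, y, alpha):
    return torch.empty_like(x)

@torch.compile
def f(x, y):
    return scaled_add(x, y, alpha=2.0)

print(f(torch.randn(4), torch.randn(4)))
```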

And finally, once you get to Inductor, there are a few more things you can do. So, for example, at the Inductor level, you can introduce the concept of a new IR node, which lets you control how exactly code generation works when you actually go ahead and generate the Python code or the C++ code that's going to represent the operation. Usually you don't need to, because just being able to call some external function is usually good enough, and we have built-in support for that. But it's something you can do, and people have added a lot of IR nodes to Inductor, for better or for worse.

There's also the ability to take custom Triton kernels and send them all the way through Inductor. This is some work by Oguz Ulgen. It's pretty nice because often people are writing these Triton kernels for the very most important pieces of their model, and it's nice to have that interoperate with PyTorch 2.
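
Here's a compact sketch of what that looks like: a user-defined Triton kernel called inside a torch.compile'd function, assuming triton is installed and a CUDA device is available.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n
    x = tl.load(x_ptr + offs, mask=mask)
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

@torch.compile
def add(x, y):
    out = torch.empty_like(x)
    n = out.numel()
    # torch.compile captures this user-defined kernel and hands it to Inductor.
    add_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out

a = torch.randn(4096, device="cuda")
b = torch.randn(4096, device="cuda")
print(add(a, b))
```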

And Inductor also has some facilities for doing code generation. So, for example, let's say that you are doing matrix multiplies. We have the ability to generate epilogues and fuse them in. And this capability basically says, hey, Inductor knows how to generate simple code for pointwise operations. So if you've got some complicated CUDA kernel, but you have a spot where you just want to paste in some arbitrary extra code that the user provided, that's something you can do.

We also have some examples of people wanting to go ahead and add first-class concepts to the Inductor IR. For example, when we were working on nested tensor, you need to generate kernels that are pretty different from normal pointwise kernels. This is probably the hardest thing to do, because obviously, to get this concept all the way down to Inductor, you have to have made Dynamo and AOT Autograd play ball. So it's definitely a choice of last resort.

So we've talked about a bunch of extension points which the PyTorch 2 stack provides. Some of them have public APIs, and you can use them directly. Some of them are just ideas that are architecturally consistent with how PyTorch 2 works, but just haven't been implemented yet. So someone has to roll up their sleeves and implement them.

That's it for our whirlwind tour of all the things you can extend PyTorch 2 with. Hopefully in some later podcast episodes, we can dig into some of these things in more detail. Thanks for listening.