PyTorch Developer Podcast

Inductor - IR

Episode Summary

Inductor IR is an intermediate representation that lives between ATen FX graphs and the final Triton code generated by Inductor. It was designed to faithfully represent PyTorch semantics and accordingly models views, mutation and striding. When you write a lowering from ATen operators to Inductor IR, you get a TensorBox for each Tensor argument which contains a reference to the underlying IR (via StorageBox, and then a Buffer/ComputedBuffer) that says how the Tensor was computed. The inner computation is represented via define-by-run, which allows for a compact definition of the IR, while still allowing you to extract an FX graph if you desire. Scheduling then takes buffers of Inductor IR and decides what can be fused. Inductor IR may have too many node types; this would be a good thing to refactor in the future.

Episode Transcription

Hello everyone, and welcome to the PyTorch Dev podcast. Today, I want to discuss Inductor IR. Inductor IR is an intermediate representation that lies after the ATen graph, but before the actual Triton code generation.

If you consider the overall PyTorch 2 stack, once we finish capturing the graph with Dynamo, we have several FX nodes referring to ATen operations. For Inductor to compile this code into Triton code, it will first take that ATen graph and perform a series of transformations, first converting it into Inductor IR, then scheduling that IR, and finally generating the code from the scheduled nodes. This process is how you obtain the Triton goodness, as well as the wrapper Python or C++ code that brings it all together.

As you might imagine, knowing how to work with Inductor IR is crucial if you plan to work on the compiler in Inductor at all. Now, I must confess that I don't have a perfect organization for all the topics I want to discuss in this episode. It will be somewhat of a grab bag of topics.

First, I want to address some of the motivating design considerations behind Inductor IR. Please note that I did not create Inductor IR, and I am still figuring things out, as most of us are in PyTorch core. There are some aspects of the current state of Inductor IR that are not ideal and could use some refactoring. Moreover, as the code is rapidly changing, this podcast might become outdated. I recommend joining the conversation on PyTorch GitHub if you are interested in contributing.

You might ask why we need an intermediate representation between ATen operators and the actual Triton code. Why can't we just directly generate Triton code from each ATen operation? There are several reasons. We don't want to generate a Triton kernel per ATen operation. We want to perform fusion, meaning we need a way to represent the result of fusing a sequence of ATen operations, such as a series of pointwise operations, together. ATen graphs are simple in that they only call a sequence of operations and are done, which means they lack the concept of a fused node containing multiple operations grouped together.

In Inductor IR, we tackled this problem differently. There are a few ways to approach understanding Inductor IR, and I will go through them in this episode.

One way is to examine the code and see the classes and data structures defined for the IR. If you look at torch/_inductor/ir.py, you'll see a class called IRNode and several subclasses, such as Loops, BaseView, Layout, Buffer, and MutableBox. Each of these is a distinct concept, and IRNode is the base class that brings everything together.

Another way to understand how Inductor IR works is by checking out the lowerings. A lowering in Inductor takes a particular ATen operation and produces a bunch of IR nodes, that is, Inductor IR representing that operation. For example, if I'm lowering an addition between two tensors, I would get something like a tensor as the result, but in Inductor IR.
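To make the shape of a lowering concrete, here is a toy sketch in the spirit of Inductor's lowering table. The real API lives in torch/_inductor/lowering.py (register_lowering) and differs in detail; every name and signature below is illustrative, not the actual Inductor interface.

```python
# Toy lowering table: maps ATen op names to functions that build toy IR.
lowerings = {}

def register_lowering(op_name):
    """Register a function that lowers the named ATen op to toy IR."""
    def decorator(fn):
        lowerings[op_name] = fn
        return fn
    return decorator

class Pointwise:
    """Toy pointwise IR node: holds a define-by-run body (index -> value)."""
    def __init__(self, inner_fn):
        self.inner_fn = inner_fn

@register_lowering("aten.add")
def lower_add(a, b):
    # a and b are callables from index -> value, define-by-run style.
    return Pointwise(lambda idx: a(idx) + b(idx))

# Lower an add of two "tensors" given by index functions.
node = lowerings["aten.add"](lambda i: i * 2.0, lambda i: 1.0)
assert node.inner_fn(3) == 7.0
```

The key point is that the lowering does not emit any Triton code; it only builds IR describing how each element of the result is computed.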

Now, let's discuss TensorBox, which represents a tensor in Inductor IR. This box contains a pointer to a StorageBox, representing the actual backing data store. This setup is useful because, just as in eager PyTorch, multiple tensors can reference the same storage, so we may have multiple TensorBoxes referencing the same StorageBox.

One important thing to note is that Inductor IR faithfully models PyTorch semantics. If you understand how PyTorch's eager mode works and then look at how Inductor IR works, you'll find that they match. Inductor IR can handle eager mutation and views, and it can manage arbitrary indexing arithmetic depending on the view in question.

This concept was important to Jason Ansel when he was designing Inductor because many compilers don't fully understand PyTorch's idea of views and strides. As a result, they have to do a lot of work to deal with strange patterns that show up when people write PyTorch programs in practice. Inductor is all about being able to compile PyTorch as it is, and it builds in all these concepts that are crucial to PyTorch.

For instance, the ability to perform strided indexing allows Inductor to generate arbitrary indexing expressions to fetch the correct data. We need to be able to simplify these indexing expressions, which is one of the reasons why we use SymPy in Inductor.
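SymPy itself handles this kind of cleanup. Here is a small self-contained example; the index expression is invented for illustration (Inductor additionally layers its own helpers, such as FloorDiv, on top of SymPy):

```python
import sympy

# Symbolic indices into a 2-D tensor.
i0, i1 = sympy.symbols("i0 i1", integer=True, nonnegative=True)

# Flat offset of element (i0, i1) under strides (1, 64), e.g. a
# transposed view of a row-major (64, 64) tensor.
offset = i0 + 64 * i1

# Composing views can leave behind redundant arithmetic that we want
# simplified away before generating indexing code:
messy = (4 * i0 + 256 * i1) / 4

# SymPy proves the messy expression equals the clean offset.
assert sympy.simplify(messy - offset) == 0
```

Simplifying indexing expressions like this is what lets the generated kernels use clean, cheap address arithmetic even after many stacked views.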

Inductor IR faithfully models PyTorch semantics. We have the tensor/storage distinction via the boxes. Eventually, you get to a buffer, which represents the data in question. You can perform views on the buffers, and mutation can occur on the IR. When a mutation occurs, we swap out the contents of a StorageBox with a new buffer that represents the state after the mutation.
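A toy model of the box structure and the mutation swap; the class names mirror the real ones, but the implementations here are simplified stand-ins:

```python
class Buffer:
    """Stand-in for an Inductor buffer: named backing data."""
    def __init__(self, name):
        self.name = name

class StorageBox:
    """Points at the buffer currently backing a storage."""
    def __init__(self, data):
        self.data = data

class TensorBox:
    """A tensor over a storage; many TensorBoxes may share one StorageBox."""
    def __init__(self, storage):
        self.storage = storage

storage = StorageBox(Buffer("buf0"))
t1 = TensorBox(storage)
t2 = TensorBox(storage)  # another tensor aliasing the same storage

# An in-place op (say, add_) produces a new buffer describing the
# post-mutation data; we swap it into the StorageBox.
storage.data = Buffer("buf1")

# Both aliases now observe the mutated data, matching eager semantics.
assert t1.storage.data.name == "buf1"
assert t2.storage.data.name == "buf1"
```

Because every TensorBox goes through the StorageBox indirection, the swap is visible to all aliases at once, which is exactly the behavior eager-mode mutation has.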

The most intriguing buffer you'll usually see when looking at Inductor IR is the so-called ComputedBuffer. This buffer states that we performed some computation, like a pointwise operation or reduction, that produced the data in question. Inside these computed buffers, you finally have the IR nodes like Pointwise and Reduction that represent the actual computation being done in PyTorch.

Interestingly, these nodes don't contain FX graphs representing the operations being fused together in a pointwise operation or reduction. Instead, they're defined by a technique we call "define-by-run": the graph is maintained as a plain Python function. The function takes in arguments corresponding to the inputs the IR graph would have had, and inside the function we call all the operations that make up the computation in question.
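Because a define-by-run body is just a function, you can recover a graph from it by calling it with a tracing proxy instead of a real value. This is roughly the idea behind extracting an FX graph from an inner function; the Proxy class below is a made-up miniature, not the real torch.fx.Proxy:

```python
class Proxy:
    """Records the operations applied to it instead of computing values."""
    def __init__(self, expr):
        self.expr = expr
    def __add__(self, other):
        return Proxy(f"add({self.expr}, {other.expr})")
    def __mul__(self, other):
        return Proxy(f"mul({self.expr}, {other.expr})")

def body(x, y):
    # Inner function of a fused pointwise node: (x + y) * y
    return (x + y) * y

# Running the body on proxies yields a record of the graph.
trace = body(Proxy("x"), Proxy("y"))
assert trace.expr == "mul(add(x, y), y)"
```

So nothing is lost by storing a function instead of a graph: the graph form is always recoverable by re-running the function symbolically.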

One significant advantage of the define-by-run approach is that it allows you to write compact definitions. For example, if you have two pointwise bodies and you want to compose them together into a single pointwise body, you can do so easily with define-by-run: it is just function composition.
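Here is what that composition looks like in miniature. The helper names are illustrative, not the real Inductor ops-handler API; the point is only that fusing pointwise bodies is ordinary function nesting:

```python
def make_load(data):
    """Body that loads an input at an index."""
    def body(idx):
        return data[idx]
    return body

def pointwise_mul(a_body, b_body):
    """Fuse two bodies into an elementwise multiply."""
    return lambda idx: a_body(idx) * b_body(idx)

def pointwise_relu(inner):
    """Fuse a relu on top of an existing body."""
    return lambda idx: max(inner(idx), 0.0)

a = make_load([1.0, -2.0, 3.0])
b = make_load([2.0, 2.0, -2.0])

# "Fusing" mul followed by relu is just nesting the bodies; evaluating
# the fused body at an index runs both operations with no intermediate
# buffer ever materialized.
fused = pointwise_relu(pointwise_mul(a, b))
assert [fused(i) for i in range(3)] == [2.0, 0.0, 0.0]
```

A code generator can then iterate the fused body over the output indices and emit a single kernel for the whole chain.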

The Inductor IR process results in a large pile of buffers and unfused computation. We have the concept of a buffer that has been realized versus a buffer that is just computation. When a buffer is realized, we forbid fusing into it. We've essentially said we guarantee this data is going to exist in physical form at this point in the IR.

On the other hand, unfused computation, which hasn't been realized, is allowed to fuse or potentially run multiple times if that's profitable for the situation. The scheduler, which runs after we've lowered all our ATen operations to Inductor IR, is responsible for deciding the most profitable fusion to do at any point in time.
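A toy sketch of the realized-versus-unrealized distinction as a fusion constraint; the names and the one-line policy are illustrative, since the real scheduler weighs many more factors:

```python
class ToyBuffer:
    def __init__(self, name, realized=False):
        self.name = name
        # realized=True means we guarantee this data exists in memory.
        self.realized = realized

def can_fuse_into_consumer(producer):
    # If the producer has been realized, its output must exist in
    # physical form, so we keep it as a separate kernel; unrealized
    # computation is fair game to fuse (or even recompute).
    return not producer.realized

mul_out = ToyBuffer("mul_out")                 # unfused pointwise
saved = ToyBuffer("saved", realized=True)      # e.g. needed later

assert can_fuse_into_consumer(mul_out)      # fuse into the consumer
assert not can_fuse_into_consumer(saved)    # stays materialized
```

Realization is therefore the lever by which lowerings opt out of fusion when a buffer genuinely needs to exist, for example because an operation cannot be expressed as a fused loop body.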

Regarding the practical aspects of working with Inductor IR, one thing I found confusing when I first read it is that there are many IR nodes. While the most basic ones are the most important, there's a multitude of other nodes doing all sorts of other things. There's a node for collectives, a node for convolutions, and many more.

When working with an IR node, there are several things you can customize. For example, you can report the read/write dependencies, which the scheduler needs to decide the order of operations.
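To illustrate why read/write dependencies matter, here is a minimal sketch of ordering toy nodes from their declared reads and writes using the standard library's topological sorter; the node structure is invented and much simpler than the scheduler's real dependency tracking:

```python
from graphlib import TopologicalSorter

class Node:
    def __init__(self, name, reads, writes):
        self.name, self.reads, self.writes = name, reads, writes

nodes = [
    Node("kernel0", reads=["arg0"], writes=["buf0"]),
    Node("kernel1", reads=["buf0", "arg1"], writes=["buf1"]),
    Node("kernel2", reads=["buf0"], writes=["buf2"]),
]

# Map each written buffer to its producer, then derive predecessor sets.
writers = {w: n.name for n in nodes for w in n.writes}
deps = {n.name: {writers[r] for r in n.reads if r in writers} for n in nodes}
order = list(TopologicalSorter(deps).static_order())

# A producer must be scheduled before every node that reads its output.
assert order.index("kernel0") < order.index("kernel1")
assert order.index("kernel0") < order.index("kernel2")
```

The same dependency information also tells the scheduler which nodes are adjacent candidates for fusion in the first place.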

Another thing you can change is the concept of side effects. If an IR node has a side effect, we're not allowed to eliminate it.

Additionally, we keep track of origins for IR nodes. We keep track of which ATen FX node produced a particular IR node. This tracking is useful for generating meaningful kernel names if you have that enabled in Triton.

That concludes our overview of Inductor IR. As I mentioned earlier, it's highly in flux, and I don't claim to be the world expert on Inductor IR. However, I hope this gives you an idea of how to navigate this crucial Inductor IR data structure. Thanks for listening. See you next time.