PyTorch Developer Podcast

Unbacked SymInts

Episode Notes

This podcast goes over the basics of unbacked SymInts. You might want to listen to this one before listening to https://pytorch-dev-podcast.simplecast.com/episodes/zero-one-specialization. Some questions we answer (h/t Gregory Chanan):

- Are unbacked symints only for export?  Because otherwise I could just break / wait for the actual size.  But maybe I can save some retracing / graph breaks perf if I have them too?  So the correct statement is "primarily" for export?

- Why am I looking into the broadcasting code at all?  Naively, I would expect the export graph to be just a list of ATen ops strung together.  Why do I recurse that far down?  Why can't I annotate DONT_TRACE_ME_BRO?

- How does 0/1 specialization fit into this?  I understand we may want to 0/1 specialize in a dynamic shape regime in "eager" mode (is there a better term?), but that doesn't seem to matter for export?

- So far we've mainly been talking about how to handle our own library code.  There is a worry about pushing complicated constraints downstream, similar to torchscript.  What constraints does this actually push?

Episode Transcription

Hello, everyone, and welcome to the PyTorch Dev Podcast. This podcast is a little bit out of order from the previous podcast about 0/1 specialization. So if you haven't listened to the 0/1 specialization podcast, try listening to this one first, which is going to be about unbacked symbolic integers in general for PyTorch 2, in both eager mode and export. This podcast is coming about because we've been talking more about 0/1 specialization and also about the stack of PRs that I've been working on regarding unbacked symbolic integers. And there have been a lot of questions: what the heck are unbacked symbolic integers? What exactly is going on with them? What are the consequences of adding this feature? So I wanted to record this podcast to talk a little bit about what exactly is going on here and answer some of these questions. Gregory Chanan isn't joining me, but he sent me a list of questions that he had regarding the feature, and I'm going to use those to drive the discussion in this podcast.

Okay, so let's start off with the basics. So what is an unbacked symbolic integer? So to answer that question, I first need to mention what a backed symbolic integer is. So a backed symbolic integer refers to our symbolic shapes that we're passing through our program. You know, we have a bunch of input tensors. Instead of statically specializing on these tensors, we give them symbolic sizes, which just say, hey, you're going to do the symbolic execution on these sizes and you're not actually going to burn in any particular size. So if you do a view operation based on the size of something else, we'll pull out the symbolic size for that particular tensor that I'm reading out the shape from and pass it on to the view without burning in whatever the actual value is. So if that value changes in the future, then I can actually, you know, just reuse the same graph in this situation.
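
[Editor's note: a minimal sketch of this idea, using a hypothetical toy function rather than anything from the episode. The view size is expressed in terms of another tensor's symbolic size, so the compiled graph does not burn in one particular value.]

```python
import torch

# Hypothetical toy example: the view size comes from x's symbolic size, so
# under dynamic shapes the graph does not burn in the concrete value 4 or 8.
def f(x, y):
    return y.view(x.shape[0], -1) + 1

compiled = torch.compile(f, dynamic=True)
print(compiled(torch.randn(4), torch.randn(4, 3)).shape)  # torch.Size([4, 3])
print(compiled(torch.randn(8), torch.randn(8, 3)).shape)  # same graph, new size
```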

Now, the thing about having symbolic integers like this is that if someone writes some Python code that says, if x is equal to two, then do one thing, otherwise do something else, there really isn't any way to keep things symbolic in this case, because we need to actually know which branch we're going to go down. Now, of course, there are some program analysis techniques that would allow you to trace through both branches and do some sort of fancy stuff in that situation. But we're generally talking about straight-line traces in PyTorch 2, and we don't have anything that fancy. So we need to actually have an answer in this case. And so when you have a condition on a symbolic integer, we do what's called a guard. We look at what the actual value is, the sort of backing value. And this is where the term backed versus unbacked comes from. We look at the backing value. This is also referred to as a hint inside of our code base, because the hint basically says what kind of size we might expect this tensor to be in practice.

We look at the backing value, the hint of the tensor, and then we do the condition based on the actual value that we have in the backing value. And then we go ahead and we say, okay, well, if it's true, then I'm going to go down the true path. Otherwise, I'm going to go down the false path. And importantly, I will add a guard, a guard that is executed at the beginning of the graph, which just says whether or not I've actually fulfilled this condition. So the next time that I run my graph, will I actually go through the same conditional branch or not? And these conditional branches can happen anywhere in PyTorch code. They can happen in user code, where a user does some condition on what the shape of a tensor is. And they can also happen in library code, where inside of the PyTorch library we're looking at sizes and making decisions based on whether or not the sizes are big, to do one thing or another. For example, when you're running convolution, we will look at the size of your input tensor to decide which particular convolution algorithm we're going to use.
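
[Editor's note: a rough, hypothetical sketch of what a guard arising from user code looks like; the branch is resolved using the backing value, and a guard on the condition protects the compiled graph.]

```python
import torch

# A condition on a symbolic size: the branch is resolved with the backing
# value (the hint), and a guard like "s0 > 4" protects the compiled graph.
@torch.compile(dynamic=True)
def g(x):
    if x.shape[0] > 4:
        return x * 2
    return x + 1

g(torch.randn(8))  # traces the "> 4" branch and installs the guard
g(torch.randn(2))  # guard fails on this input, so the other branch is compiled
```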

Okay, so to summarize: we have symbolic integers, but they have backing values, hints, and if we do a condition on them, then we peek at the backing value and use that to resolve what the condition is, inserting a guard in that situation. So what is an unbacked symbolic integer? Well, an unbacked symbolic integer is simply when you just don't have a backing value. And there are two reasons why you might not have a backing value.

So one is that you might just not have a backing value at all. For example, say you have a tensor that was produced by a nonzero call. What the actual size of this tensor is going to be is not known to you unless you actually run the operation, because it's data dependent. So we don't know what the value is. We have no idea what it could be. And so we have no choice but to give you an unbacked symbolic integer in this case, because we don't have a backing value. We don't know what it is.
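
[Editor's note: a quick, made-up illustration of that data dependence; the output size of nonzero cannot be known without looking at the values.]

```python
import torch

# The number of rows in nonzero's output depends on the data itself.
print(torch.nonzero(torch.tensor([0.0, 1.0, 0.0, 3.0])).shape)  # torch.Size([2, 1])
print(torch.nonzero(torch.tensor([1.0, 2.0, 3.0, 4.0])).shape)  # torch.Size([4, 1])
```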

The other example of when they might be useful is when you want to intentionally prevent guards from occurring on a variable. Let's say you're doing export. And so with export, you might want to produce a graph that can work for any batch size. So if you're going to make a graph that works for any batch size, then you would like to say, OK, well, I don't want you to be able to guard on a batch size being zero or one. I just want to, you know, like say, hey, you know, you did no conditional jumps on the value of batch. So my entire program is indifferent to whatever the batch size was. And so you might just intentionally feed in an unbacked symbolic integer for the dimension for your batch dimension, just so that you could make sure that you error out if some code, either user code or library code, attempts to actually do a guard on it in that question.
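
[Editor's note: for flavor, here is roughly what asking for a fully dynamic batch dimension looks like with today's torch.export API. The export interface has changed since this episode was recorded, so treat this as an illustrative sketch rather than the exact mechanism being described.]

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return (x * 2).sum(dim=1)

# Declaring the batch dimension dynamic means the exported graph must not
# specialize on any particular batch size (including 0 or 1).
batch = Dim("batch")
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
print(ep)
```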

So one question that people often ask me, because a lot of our discussion has been revolving around export, since that's sort of what's been driving the work on unbacked symbolic integers recently, is: are unbacked symbolic integers only for export? And the answer is no, because you can also use them for the nonzero case, if you are actually going to compile in that situation. And you might also use them to just say, hey, I want to compile this model for eager mode, but I really, really don't want to have any guards on this value, because I really only want to compile one graph in this case. Unbacked symbolic integers would be useful there too. That being said, primarily we are working on unbacked symbolic integers right now because we are trying to do something with export. So most of the discussion that's happening right now is all about export, because that's what we're spending most of the time thinking about.

I was in a discussion with Sam Gross, and Sam was asking me, well, about this nonzero compilation case: is that a real use case? Because you might want to just graph break, and then you run the nonzero, and then you run the graph afterwards, and isn't that good enough? And the answer is, well, yes, that is mostly good enough. But there are some situations where you will miss optimization opportunities with this. In particular, suppose you have some sort of data dependent operation, say nonzero, or more realistically a packing operation, where you have some padded tensors and you pack them into a smaller tensor that doesn't have any of the padding values. And by the way, the output of this packing operation is dynamic, because what you pack depends on how much padding there was inside the original tensors, and that's a data dependent concept. So after you pack, you might want to run some pointwise operations, and here it would be profitable to fuse those pointwise operations into the packing operation, which is what's producing the data in the first place. And this happens with jagged slash nested tensors, where often you have a bunch of input tensors, you want to pack them into a smaller tensor with no padding, and then do the operation on it. So this is a profitable optimization; it's something that I've been told by the folks working with jagged tensors that they want, and it was one of the reasons why you might want to support this. But as I said, most of the discussion that's happening right now in PyTorch development is all about export. So that's what we're doing.
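
[Editor's note: a small, made-up sketch of that padded-to-packed pattern; the names and shapes are illustrative, not from the episode. The packed length is data dependent, and the relu afterwards is the kind of pointwise op you would like to fuse into the packing step.]

```python
import torch

def pack_and_activate(padded, lengths):
    # Mask selecting the non-padded entries of each row.
    mask = torch.arange(padded.shape[1]) < lengths.unsqueeze(1)
    packed = padded[mask]      # output length depends on the data (sum of lengths)
    return torch.relu(packed)  # pointwise op that is profitable to fuse

out = pack_and_activate(torch.randn(3, 5), torch.tensor([2, 5, 1]))
print(out.shape)  # torch.Size([8])
```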

So then, okay, so we've got unbacked symbolic integers. And a lot of our discussion with unbacked symbolic integers is that unbacked symbolic integers work a lot like symbolic integers, but all your guards fail, right? So when you try to actually use them, you end up with a pretty common situation, which is you try to feed unbacked symbolic integers into your model and they don't work, because there's a guard. And now you're like, well, why is there a guard on my code? And you look into a bunch of the cases, and there are all sorts of different scenarios. I actually talk through a bunch of these scenarios inside the Dynamic Shapes Manual, so you can check that out for more details. But one of the examples that has been causing folks quite a bit of trouble, as in, do we really want to do unbacked SymInts this way, is the so-called broadcasting example. So let's unpack the broadcasting example for a moment. The broadcasting example says, hey, you have got a tensor, and let's say it's got an unbacked symbolic integer, and you want to add some other tensor to it, and let's say maybe it's also got an unbacked symbolic integer in it. And it just so happens that the sizes of the two tensors are equal, so they will add together no problem. So we happen to know out of band that everything is going to be okay. But when you run this code, what PyTorch in the library code is going to do is attempt to test for broadcasting. Namely, it's going to check and see if any given size on the left-hand tensor is 1, because if so, it can broadcast to the right-hand side. And it will test if the right-hand side is 1, and if so, it can broadcast to the left-hand side. Broadcasting being, you know, just replicating the size-1 dim as many times as necessary to fill in the other size. So if you just run the library code as is, without any changes, what we'll do is we will test if the input tensor size is one, then we will test if the right-hand side tensor size is one, and then we will test if their sizes are equal. But I just told you that I was passing in a tensor that was unbacked. And so if I do a condition on it, if I actually say, hey, tell me if the tensor size is one, and that size is unbacked, then that will just immediately fail, saying, hey, you tried to guard on an unbacked symbolic integer. But actually, in this particular case, the guard was completely unnecessary, because the sizes would have ended up being the same on both sides and you would have been fine. You didn't need to broadcast because they were just equal. So this is the sort of situation where you end up with: hey, an unbacked symbolic integer caused a guard failure, and now I need to go modify PyTorch library code.
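
[Editor's note: for concreteness, a condensed, pure-Python sketch of that broadcast shape check, not PyTorch's actual implementation. Each comparison below becomes a guard when the size is symbolic, and the size-equals-one tests are exactly where an unbacked size blows up, even though the two sizes happen to be equal.]

```python
def broadcast_dim(lhs, rhs):
    # Each of these comparisons is a guard when lhs/rhs are symbolic sizes.
    if lhs == 1:        # guard: is the left size 1?
        return rhs
    if rhs == 1:        # guard: is the right size 1?
        return lhs
    if lhs != rhs:      # guard: equality check
        raise RuntimeError(f"sizes {lhs} and {rhs} are not broadcastable")
    return lhs

print(broadcast_dim(4, 4))  # fine eagerly; with an unbacked u0, "u0 == 1" has no answer
```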

Now, when I told people this, there were a few questions, like, is this a real problem? Because this seems like a dumb issue to have: obviously the broadcasting code is going to be fine, and surely there's some simple solution to this problem. And one question that people had was, why am I looking into the broadcasting code at all? Naively, I would expect the export graph to just be a list of ATen ops strung together. So why do I have to recurse into the pointwise operation, down to the place where I actually run all this broadcasting logic? Because when I look at my graph, all I'm getting is an add operation, so there's no broadcasting to be seen. So why does this matter for tracing? And to answer this question, I have to say, well, the reason why you're going into this code is because when you run the add operation, you get out some result tensor, and that result tensor has sizes on it. What are those sizes going to be? Well, to figure out what those sizes are going to be, you have to run the shape propagation rules for addition. And those shape propagation rules are what actually do the broadcasting. So when you do the shape propagation, that's when you actually do the broadcasting checks, that's when you do the one check, and that triggers the guard. So remember, guards aren't just executing on user code, they're also executing on library code. And in particular, they're executing in the shape propagation code, even if that shape propagation code is completely invisible in the final exported program you get.

So then you might be like, well, okay, Ed, I can see that to compute what the output size is going to be, I have to run this operation. But what if I said, hey, I just don't want to do any of this, because I don't need to know what the output shapes are? Maybe I just don't care. I'm going to sum over them or do something very simple to them in the end, and I don't need a very fine-grained expression that tells me exactly how to compute the output size in terms of the inputs. So for one, yes, this is a thing you could do. Two, you typically don't want to do this in eager mode, because if you were to guard on the output size... well, remember, the user can do whatever they want. In particular, they can pass it to another operation where that size needs to be checked for equality against something else. So if you want to guard on it, then you actually need to be able to express the guard in terms of the input sizes. You need to know how to actually do the computation from the graph inputs all the way to the end. It's not like a traditional JIT system where, when you realize that you violated some constraint for your trace, you can bail out; we have to move all of these bailout checks to the beginning of the graph when we compile them. But hey, if we're doing export, we're not going to really poke at these with guards. Would that be fine as well? And the answer there is, yeah, sort of.

So what we can do is we can say, okay, we don't know anything about the output sizes of this tensor. We just want to say, hey, it's something. And as long as you don't look at it too hard, as long as you don't try to do any reasoning about it, it's fine. And we can do this. In fact, I do this for the non-overlapping-and-dense check on tensors. So when you make a tensor, one of the Boolean fields we pre-compute is: is this tensor non-overlapping and dense? Sometimes this is obvious, but if you pass in a bunch of strides, it's very non-obvious. You have to sort the strides and then look and make sure they all line up exactly correctly. It's very complicated and causes a lot of guards. So what I do instead is I just return an opaque is-non-overlapping-and-dense function application. It takes in all the sizes and strides for the tensor, and that's it; you don't get to know anything else about what this quantity is. And so the point is that as long as you never actually try to touch this quantity in any meaningful way, like you never try to condition on it, you never try to test it for equality with anything else, that's fine, and this works perfectly okay. It only blows up if you actually try to do something with it. And it probably will blow up if you try to do something with it, because you said, well, I don't know anything about this, so there's no way to do any reasoning about it. And this is one of the reasons why, when Horace looked at the situation, he was like, well, this seems kind of bad, because you're just pushing off the problem until later. And the answer is, yes, I'm pushing off the problem until later; it pays to be lazy if you end up not having to do the work at all.
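
[Editor's note: a rough sketch of that "opaque quantity" idea, using a sympy uninterpreted function; the names here are illustrative rather than PyTorch's actual internals.]

```python
import sympy

# The property is recorded as an uninterpreted function of sizes and strides,
# so it can be stored and passed around, but nothing can be proven about it.
IsNonOverlappingAndDense = sympy.Function("is_non_overlapping_and_dense")

s0, s1 = sympy.symbols("s0 s1", integer=True, positive=True)
prop = IsNonOverlappingAndDense(s0, s1, s1, 1)  # sizes (s0, s1), strides (s1, 1)
print(prop)
# This is fine as long as nobody conditions on it; asking for a concrete
# True/False is exactly the point where this scheme gives up.
```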

Another question, and we're going to relate this to the 0/1 specialization episode, is: how does 0/1 specialization fit into all of this? We might want to 0/1 specialize in a dynamic shape regime, but does that actually matter for export? And the answer is, yeah, well, 0/1 specialization is kind of mixing up a few topics here. One thing that I mentioned about 0/1 specialization is that it is a trace time optimization, right? You don't have to 0/1 specialize tensors up front when they feed into your program. You can just say, well, I'm not going to assume that this zero-size tensor is always going to be zero; I'm going to try to run my program anyway. The reason why 0/1 specialization is so useful for PyTorch, though, is a sort of empirical observation, which is that there's a lot of code in PyTorch that does all sorts of zero/one tests. So basically you're going to specialize on zero and one anyway when that code does the test and guards on the quantity, so you might as well do it earlier in the program. If you just say, well, I'm not going to do it up front, you'll just collect up a bunch of places where you actually do the 0/1 specialization later. So it's sort of irrelevant. For export, you just turn off 0/1 specialization, you pass in an unbacked symbolic integer, and then you just deal with the guards one by one, at least in my proposal for how to do unbacked symbolic integers.

Okay, one last thing that I want to talk about here, which is: why has the unbacked symbolic integer stack of PRs been kind of controversial? What you find the stack of PRs doing is saying, hey, I had some model, say ResNet, and I wanted to run it with an unbacked symbolic integer for the batch size. So I put in one of these unbacked symbolic integers, I ran it, and whenever there was a guard failure, I went and tweaked PyTorch library code until it no longer had this problem. And so people look at these diffs and they say, hey, does this mean that I have to write my PyTorch library code in this funny way in the future? That sure sounds like having to TorchScript my code, and, well, TorchScripting my code was very painful, and I don't want to have to do this again for another thing.

So I don't know exactly how to argue this one way or another, but my general thinking is that, yes, you have to modify your code, but I don't think it is as bad as TorchScript. Here are a few reasons why I think this is not as bad as TorchScript.

So one is that really all of the really complicated cases have been inside PyTorch library code, in very low-level operations like empty, reshape, and is-contiguous. And so one of the ideas that I was hoping would be true with my patch set is that I fix these low-level problems, and then, you know, most code is not written in a branchy way, right? You don't have people re-implementing broadcasting everywhere; they usually just call an operation that broadcasts. And if that broadcast implementation knows how to tiptoe around unbacked symbolic integers, then that's fine.

So the hope is that there's a fat tail of very complicated operators that we have to handle internally, and the rest will kind of just work out, because most people aren't writing their models to condition on what the batch size is going to be.

The other thing that I think is a little different is that with TorchScript, it was an all-or-nothing deal, right? You had to get all of your code end-to-end TorchScriptable to actually get something useful. With unbacked symbolic integers, you don't have to actually get everything going. If you're not doing export, and you're not saying, hey, I must compile all of my program in a single traced block from head to toe, then you're allowed to not use unbacked symbolic integers all the way through. In fact, I would not recommend using an unbacked symbolic integer in that case. You can just say, okay, well, this is fine: I'm going to make sure that it works for sizes that are greater than two, and if you happen to send me a batch size of one, I'm just going to go ahead and recompile my program for the batch-size-equals-one case. No problem, what's the big deal, right? It's just a 2x cost in the number of compiled graphs, and I still have one that can handle all of the variable cases. So really the only time you need to squeeze into this regime is if you are trying to export, and it is a dynamically sized model, so you want the varying batch size, and you're in a situation where you can't ship multiple graphs, you have to ship one graph. And to that I say, well, what did you expect? You're going to have to write your code so that it doesn't actually do any branching on the batch size. It's just some sort of irreducible complexity, at least in my opinion.

Okay, so this is an ongoing conversation. I recorded this to help share information. We might have an updated recording later once we have some more alignment, so I'll also link that in the podcast if that actually happens. All right, thank you very much for listening. See you all next time.