Traditionally, unsigned integer support in PyTorch was not great; we only supported uint8. Recently, we added support for uint16, uint32 and uint64. Bare bones functionality works, but I'm entreating the community to help us build out the rest. In particular, for most operations, we plan to rely on PT2 to build anything else, but if you have an eager kernel you really need, send us a PR and we'll put it in. While most of the implementation was straightforward, there are some weirdnesses related to type promotion inconsistencies with NumPy and dealing with the upper range of uint64. There is also upcoming support for the sub-byte dtypes uint1 through uint7, and these will be implemented exclusively via PT2.
Hello everyone, and welcome to the PyTorch Dev Podcast. Today, I want to talk about unsigned integer support that we recently added to PyTorch. PyTorch has supported unsigned integers, but only 8-bit ones. You can use uint8, also known as byte, but you don't get any of the other unsigned integer types: uint16, uint32, uint64.
The reason for this is mostly historical. Torch, the TH library that PyTorch was originally built off of, didn't have support for these dtypes. As a result, we never really added them. Most people could deal with having only the signed integer variants. That being said, it's somewhat inconvenient not having them for several reasons.
One reason is that sometimes you want the little bit of extra range that you get from an unsigned integer, say a 16-bit unsigned integer, because with a signed integer you're losing half of the range if you're only using it for indexing. Also, unsigned integers are great for doing bit manipulation, because most of the bitwise operations are well-defined on them, as opposed to signed integers, where if you overflow, that's undefined behavior, and who knows what's going to happen.
I finally got fed up with this and on my plane ride back from holidays, I decided to go ahead and implement it. We now have unsigned integer support in PyTorch. It's a bit restricted. One of the problems, and probably the reason why TH didn't have unsigned integer support to begin with, is you have to pay a cost whenever you add a new dtype to PyTorch. For every kernel that you want to support your particular dtype, you actually need to generate code for it.
When you add a new dtype to PyTorch, that actually ends up being a lot of extra binary size for all the new kernels you have to add. We are already a very large binary if you've ever had to download PyTorch. And adding some more binary size for some dtypes that people mostly don't use in deep learning just didn't seem like a good trade-off for us.
It gets especially bad when you consider the combinatorial explosion of operations. For example, let's say that I want to do an operation between a signed integer and an unsigned integer. Well, if I want to avoid doing a conversion, I have to actually generate a fused kernel for this case to do the operation in one go. Sometimes you can't even do the operation conveniently without such a kernel. If you want to do a comparison, say an equality test between a signed int64 and an unsigned uint64, how are you even going to do that? You can't do the conversion, because the conversion will overflow. However, if you're okay with overflow semantics, then that's fine; that's another question that I'll get to shortly.
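To make that concrete, here's a tiny plain-Python illustration of why neither side of an int64/uint64 comparison can simply be cast to the other's type:

```python
# Neither dtype's range contains the other, so an equality kernel can't just
# cast one operand to the other operand's type without overflowing.
INT64_MIN, INT64_MAX = -2**63, 2**63 - 1
UINT64_MAX = 2**64 - 1

print(INT64_MIN < 0)             # True: representable in int64, not in uint64
print(UINT64_MAX > INT64_MAX)    # True: representable in uint64, not in int64
```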
With unsigned integer support in PyTorch, I made a compromise. We'll add a few new kernels. A few extra kernels isn't going to break the camel's back. The main problem is when you take the entirety of PyTorch's operator space and multiply it by another three dtypes. So we're gonna take only the most important operations: construction, filling it with some constant, equality, but not addition, not multiplication, not those types of things. And those are the only things we're going to implement.
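As a rough sketch of what that bare-bones surface looks like (assuming a recent build with the new dtypes), something like this works in eager mode, while arithmetic does not:

```python
import numpy as np
import torch

# Bare-bones eager support: construction, filling with a constant, equality,
# and NumPy interop. Arithmetic kernels are deliberately not stamped out.
t = torch.tensor([1, 2, 3], dtype=torch.uint16)
filled = torch.full((3,), 2, dtype=torch.uint16)
print(t == filled)                                              # equality works
print(torch.from_numpy(np.array([1, 2, 3], dtype=np.uint16)))   # NumPy interop
# t + filled would need an eager add kernel that isn't built by default
```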
Essentially, it's enough to get you a little bit of interoperability with, say, NumPy, which also supports unsigned integers, but not enough to do anything all that useful. Beyond that, we're going to pursue a twofold strategy.
One is that if you, a user, come to us and say, "Hey, I've got this use case, and I would really like support for unsigned integers," then we're like, "Okay, fine. If you ask us for it, then we'll add it." We'll add things if they actually are useful and used by someone. We're not going to add them just because, "Oh, we have integer matrix multiply, so I guess we have to add a 16-bit unsigned integer matrix multiply." No, we probably don't actually want to spend binary size on that. But if you've got a good use case for it, just send in the bug report, and chances are it's pretty easy: you just have to modify one of the macros that iterates through all the dtypes and stamps out versions of the code for each of them. Just go ahead and add the unsigned types to that, and you'll get a kernel for it.
I expect to accept a trickle of operations like this over time. The other strategy we have, and this is not entirely implemented yet, some of it is but not all of it, is that we're going to use PyTorch 2 to implement all the operations. PyTorch 2 lets us do codegen on the fly for integer types. So it doesn't matter that you don't have, out of the box, equality tests between int64 and uint64; you can just generate them on the fly if you torch.compile the operation in question.
This leans into a general idea in PyTorch: PyTorch 2 is this cool thing, it's a compiler, and normally we tell people to use it on their entire models. But there are also bits of the regular PyTorch library that we could implement with PyTorch 2. A dtype like the unsigned integers from uint16 to uint64 is a good example: if we don't want to add all the kernel support in eager mode, we can still get it cheaply by using torch.compile.
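Here's a sketch of what that looks like in practice. I'm assuming a build where Inductor can actually codegen for uint16; as mentioned, this part of the plan is only partially implemented:

```python
import torch

# There is no dedicated eager uint16 add kernel, but torch.compile can
# generate code for the operation on the fly.
@torch.compile
def add_u16(a, b):
    return a + b

a = torch.tensor([1, 2, 3], dtype=torch.uint16)
b = torch.tensor([4, 5, 6], dtype=torch.uint16)
print(add_u16(a, b))
```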
Please send us any contributions you might like. This is the sort of thing where I've gone ahead and put in the basic infrastructure, so the basics work and are tested, but everything else doesn't. So if you're willing to roll up your sleeves and make some changes to PyTorch at the same time as you're trying to use unsigned integers for some use case of yours, I think this is a great way to make a contribution. In fact, Thomas Viehmann messaged me on Slack and was like, "Hey, I've got a fix. Do you want me to send it in?" And I was like, "Yes, please. Absolutely."
I want to talk a little bit about a few things to know about the unsigned integer implementation, because I thought it was going to be trivial. We already support integers, we support signed integers, and we support uint8. So surely it's just doing the same old thing. Well, not quite.
Here are the main things that are problematic. One, we need to decide what our semantics are regarding signed-unsigned overflow situations. For example, if I have negative one and I compare it against hexadecimal 0xffffffff, well, on a bit level these are the same thing. But if you ask Python, Python's like, "Well, no, these are not the same number. One of them is negative, and one of them is a very large number." So we need to decide whether we're following C semantics or Python semantics. Actually, before this podcast I should have checked what the NumPy semantics here are, but I didn't. So we'll need to check what the NumPy semantics are, we'll need to check what the existing semantics for uint8 and int8 are, and then make a call about what exactly we want to do.
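To make the two candidate semantics concrete, here's a plain-Python sketch of the question (this is the question itself, not PyTorch's decided behavior):

```python
import struct

a, b = -1, 0xFFFFFFFF

# Python bigint semantics: these are different numbers.
print(a == b)  # False

# C-style bit semantics: reinterpret -1 as an unsigned 32-bit value.
(a_bits,) = struct.unpack("<I", struct.pack("<i", a))
print(a_bits == b)  # True
```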
In particular, we actually have a class in C++ called c10::Scalar, which represents essentially any sort of scalar type that Python is able to represent. And I ran into a problem while I was implementing this. I was like, "Okay, well, I can store signed integers in this, and I can store unsigned integers in this, and I can also store booleans and floats and whatever. And if someone asks me what the equality between these two ints is, what should I do?" There isn't an easy answer to this question. In the end, I believe I decided that this semantically represents a Python bigint, which can be arbitrary precision, so no, these should not be the same thing. But I'm not convinced that the actual kernels should necessarily operate the same way. Of course, in torch.compile it doesn't really matter; you can get whatever semantics you want. We just need a way of actually spelling it out, and usually there is a way of spelling it out.
We also have a promotion problem regarding our compatibility with NumPy. Suppose that you have a uint8 tensor. This existed in PyTorch before the new support we added. And you do a sum on it. What type does it promote to? Well, the naive answer is uint8, which is not correct; it's not what we do. And it's also probably not what you want, because if you're using these integer tensors, you usually want them to denote actual integers, so you probably don't want them to overflow when you run out of range. But if you have a big pile of uint8s, you're definitely going to overflow an 8-bit integer. So we actually promote this to int64.
Now, why int64 and not uint64? Well, remember, we didn't have support for uint64, so producing a uint64 tensor was not possible. So we just gave it the next best type, and it's not like you're really going to miss that top half of the range, from 2 to the 63 on up. However, this is not compatible with NumPy's behavior: when you do a NumPy sum on a uint8 ndarray, it'll give you a uint64. So we're inconsistent.
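Concretely, the inconsistency looks roughly like this (the NumPy result assumes a typical 64-bit platform):

```python
import numpy as np
import torch

t = torch.ones(1000, dtype=torch.uint8)
print(t.sum().dtype)   # torch.int64: promoted so the sum doesn't overflow

a = np.ones(1000, dtype=np.uint8)
print(a.sum().dtype)   # uint64: NumPy promotes but keeps it unsigned
```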
And if we ever do want to add the sum operation for the bigger sizes, uint16, uint32, we have a choice to make, right? We can be consistent with how we currently do it for uint8 and produce an int64 tensor, or we can be inconsistent with that but match NumPy semantics and produce a uint64 tensor. This is especially poignant for the uint64 tensor, where we almost certainly want to be inconsistent with uint8, because it would be extremely strange if you summed over a uint64 tensor and got an int64 tensor back; you'd just be throwing away the top half of the range for no reason. This is something we have to figure out.
Another thing that's a bit of a pain in PyTorch today is our handling of the very top of the uint64 range. We have lots of places in the PyTorch code base where we've hard-coded int64. For example, the randint call takes an integer min and an integer max, and those are represented in C++ as int64. Well, you're going to have a hard time representing a randint call on a uint64 dtype that covers the entire range of uint64, because you just can't: you don't have enough space in your int64. So we need to do something about that as well.
I think probably the right call is to add a new overload to randint that takes in a Scalar, because Scalar is this union type: it's got a tag, so I can say, "Oh, this is big; you need to store it in a uint64 instead of an int64." But it's something that we'll have to do. For now, random number generation doesn't really work with uint64, buyer beware. Probably your best bet is to generate two uint32s and then use some bit twiddling to concatenate them together.
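Here's a sketch of that workaround. `rand_uint64` is a hypothetical helper, and it leans on NumPy for the 64-bit bit twiddling plus the NumPy interop mentioned earlier:

```python
import numpy as np
import torch

def rand_uint64(shape):
    # Generate the high and low 32-bit halves separately, combine them in
    # NumPy (where uint64 arithmetic is fully supported), and hand the
    # result to PyTorch.
    hi = np.random.randint(0, 2**32, size=shape, dtype=np.uint64)
    lo = np.random.randint(0, 2**32, size=shape, dtype=np.uint64)
    combined = (hi << np.uint64(32)) | lo
    return torch.from_numpy(combined)

print(rand_uint64((4,)))
```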
Finally, one last thing I want to say is that we've added all of the large uint types, but we're also considering adding a bunch of small, sub-byte-size unsigned integers: uint1 through uint7. These are kind of strange because they're not byte-sized, and in fact we can't implement them in C++ in the traditional way.
But remember, we've got this awesome compiler that we can use to do things. So our plan of record for the sub-byte-size unsigned integer types is that we are going to implement them via PyTorch 2. The idea is that, hey, you can't typically do a uint1 operation directly, but you can reinterpret the data as a uint8 tensor and then do whatever byte-level, bit-level operations you need to simulate the operation in question.
Sometimes this is not very convenient to do; if you want to do addition, for example, the carries are probably kind of a pain, but you can do it. Especially because you probably don't actually have uint1 hardware, you'd have to simulate it with a bunch of larger-size operations anyway. On CUDA, the performance probably won't even be that bad, assuming you are bandwidth-bound rather than compute-bound. There are going to be a bunch of steps to getting all this working, but we definitely don't expect any good C++ eager mode support, so it's all going to be via PyTorch 2. We do have some people working on this because sub-byte quantization is very popular, so the client team is working on this.
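As a toy illustration of the reinterpret-as-uint8 idea (hand-written here, not the actual PyTorch 2 lowering), you can pack eight uint1 values into one byte and implement a bitwise op on the packed representation:

```python
import torch

def pack_uint1(bits):
    # bits: a uint8 tensor of 0s and 1s whose length is a multiple of 8.
    # Pack each group of eight bits into a single uint8 byte.
    weights = torch.tensor([1, 2, 4, 8, 16, 32, 64, 128], dtype=torch.uint8)
    return (bits.view(-1, 8) * weights).sum(dim=1).to(torch.uint8)

def packed_and(a, b):
    # Elementwise "uint1 AND" becomes a plain bitwise AND on the packed bytes.
    return a & b

x = pack_uint1(torch.tensor([1, 0, 1, 0, 1, 0, 1, 0], dtype=torch.uint8))
y = pack_uint1(torch.tensor([1, 1, 0, 0, 1, 1, 0, 0], dtype=torch.uint8))
print(packed_and(x, y))  # one byte holding the eight ANDed bits
```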
That's everything I wanted to say about unsigned integers. See you next time.