Hello, I have integrated torchao into my training, but I think it's not very clear what inference should look like.
Should I use the converted FP8 linear layer to do inference? Is delayed scaling supposed to work in inference?
Or, should I use the original linear layer to do inference?
Thanks in advance if you can help clarify!
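For context, here is a minimal sketch of the kind of training integration I mean, using torchao's float8 API (the model and sizes are placeholders):

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

# Placeholder model standing in for the real network.
model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.ReLU(),
    nn.Linear(4096, 1024),
).cuda().to(torch.bfloat16)

# Swap eligible nn.Linear modules for Float8Linear for FP8 training.
convert_to_float8_training(model)

# Training proceeds as usual: master weights stay in high precision
# and are cast to float8 on the fly for the matmuls.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
x = torch.randn(16, 1024, device="cuda", dtype=torch.bfloat16)
loss = model(x).sum()
loss.backward()
optimizer.step()
```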
Do you need distributed inference, or are you doing inference on a single GPU?
For single GPU, I think using the original model definition and loading the quantized weights should ideally "just work", @vkuzo to confirm. If not, please file an RFC in ao.
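A minimal sketch of that flow, assuming dynamic scaling so that the checkpoint keys match the original nn.Linear keys (names and sizes are placeholders):

```python
import torch
import torch.nn as nn
from torchao.float8 import convert_to_float8_training

def build_model():
    # Placeholder for the real model definition.
    return nn.Sequential(nn.Linear(1024, 4096), nn.Linear(4096, 1024))

# Training: Float8Linear keeps master weights in high precision, so the
# checkpoint holds ordinary weight tensors.
train_model = build_model().cuda().to(torch.bfloat16)
convert_to_float8_training(train_model)
# ... run training ...
torch.save(train_model.state_dict(), "ckpt.pt")

# Inference: load the checkpoint into the original, unconverted model.
# strict=False guards against any extra float8 buffers (e.g. from
# delayed scaling) that the training checkpoint may carry.
infer_model = build_model().cuda().to(torch.bfloat16)
infer_model.load_state_dict(torch.load("ckpt.pt"), strict=False)
infer_model.eval()
```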
For Distributed Inference, we are building DTensor + Quantized Tensor support in torchchat. (We are yet to publish a demo.) There is also a simple ao + TP example in the ao repo: link.
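Not a substitute for the linked example, but here is a rough sketch of how an ao inference quantization could compose with PyTorch tensor parallelism. `quantize_` and `float8_weight_only` are the torchao.quantization APIs; the sharding plan and the single-node mesh setup are assumptions for illustration:

```python
import torch
import torch.nn as nn
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor.parallel import (
    ColwiseParallel,
    RowwiseParallel,
    parallelize_module,
)
from torchao.quantization import quantize_, float8_weight_only

# Assumes launch via torchrun, one process per GPU on a single node.
mesh = init_device_mesh("cuda", (torch.cuda.device_count(),))

model = nn.Sequential(
    nn.Linear(1024, 4096),
    nn.Linear(4096, 1024),
).cuda().to(torch.bfloat16)

# Quantize the linear weights to float8 for inference.
quantize_(model, float8_weight_only())

# Shard the first linear column-wise and the second row-wise.
parallelize_module(model, mesh, {"0": ColwiseParallel(), "1": RowwiseParallel()})

x = torch.randn(8, 1024, device="cuda", dtype=torch.bfloat16)
with torch.no_grad():
    y = model(x)
```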
If the original model definition works for single-GPU inference, does that mean I could just use my current distributed inference setup as-is? The ao + TP example appears to use the torchao-converted FP8 linear layer for inference, which is different from what you suggest for single GPU. To be honest, I'm a little confused.