Thanks for the suggestion! I've been tinkering with it, but it seems to require GPUs (for the training step), which I don't currently have available to test quantization with.
I can't promise this in the near future, but if GGUF conversion from GPTQ is implemented (AutoGGUF is mostly focused on llama.cpp), I can take a closer look.
Just an approximation, but the fp16 weights are ~14GB; the script loads them into CPU memory first and then moves the layers onto the GPU in 4-bit dynamically as it trains. So it should fit on an RTX 4090, or at minimum a GPU with around 12-16GB of VRAM once gradients, activations, optimizer states, and CUDA overhead are included.
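A rough back-of-envelope sketch of the figures above, assuming a ~7B-parameter model (since the fp16 weights are quoted at ~14GB); the training-overhead multiplier is my own rough assumption, not something stated in the thread or the EfficientQAT docs:

```python
# Hypothetical memory estimate; the 7B parameter count and the 3x training
# overhead multiplier are assumptions for illustration only.
def weight_memory_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB."""
    return n_params_billion * 1e9 * (bits_per_weight / 8) / 1e9

fp16_gb = weight_memory_gb(7, 16)  # ~14 GB, held in CPU RAM
int4_gb = weight_memory_gb(7, 4)   # ~3.5 GB once layers are moved to GPU in 4-bit

# Gradients, activations, optimizer states, and CUDA context are assumed here
# to add roughly 2-3x on top of the quantized weights, which lands in the
# 12-16 GB VRAM range mentioned above.
print(f"fp16 weights: ~{fp16_gb:.1f} GB, 4-bit weights: ~{int4_gb:.1f} GB")
print(f"estimated training VRAM: ~{int4_gb * 3:.1f} GB plus CUDA overhead")
```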
Apparently it is a new method for doing quantization? Here are the Reddit thread and the GitHub repo, so you can judge whether it is worth rolling into AutoGGUF.
Quantize 123b to 35%
EfficientQAT GitHub
Thank you for AutoGGUF; I am looking forward to handling quantizations without being an acolyte of the command line. :)