Model files are big? #6
Comments
The actual model sizes are: The fp32 weights are provided to allow users to reduce precision to suit their needs. We will consider providing the weights in f16 since this is a common complaint :) Thank you for pointing it out!
Ok, the size seems about right then.

```
# took the sizes from disk; the Hugging Face page shows them divided by 1000**3
>>> (10_161_140_290 + 4_656_666_941) / 1024 / 1024 / 1024
13.800158380530775
>>> (3_638_525_952 * 4) / 1024 / 1024 / 1024
13.5545654296875
```

f16 weights would be nice, to download less stuff.
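Since the released checkpoints are fp32, one way to get the f16 savings locally today is to cast and re-save the weights yourself. A minimal sketch with transformers (the output directory name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "stabilityai/stablelm-base-alpha-3b"

# Load the fp32 checkpoint, casting the weights to fp16 as they are loaded.
model = AutoModelForCausalLM.from_pretrained(repo, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(repo)

# Re-save locally; the resulting .bin shards are roughly half the size.
model.save_pretrained("stablelm-base-alpha-3b-fp16")
tokenizer.save_pretrained("stablelm-base-alpha-3b-fp16")
```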
@jon-tow on this topic, do you expect these models to quantize well down to 4 bits (or lower) via GPTQ and/or other quantization strategies? I don't see why not, since GPTQ seems to be a general technique that works well across different transformer models. But I'm asking because part of the reason behind Stable Diffusion's success is how well it runs on consumer hardware. So I'm wondering whether these models will follow a similar goal of running very well on consumer hardware, and therefore consider quantization from the very beginning?
Hi, @andysalerno! I do expect these models to quantize quite well. They're pretty wide, which should help reduce bandwidth-boundedness compared to models of similar size when quantized.
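Not GPTQ specifically, but as a rough sketch of what 4-bit inference looks like on the transformers side, recent versions can quantize the linear layers at load time via bitsandbytes (the model id is just an example; this assumes a CUDA GPU plus the accelerate and bitsandbytes packages):

```python
from transformers import AutoModelForCausalLM

repo = "stabilityai/stablelm-base-alpha-7b"  # example model id

# Quantize the linear-layer weights to 4-bit on the fly via bitsandbytes.
model = AutoModelForCausalLM.from_pretrained(
    repo,
    device_map="auto",
    load_in_4bit=True,
)

# Rough in-memory footprint after quantization, in GiB.
print(model.get_memory_footprint() / 1024**3)
```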
There's a 4.9GB ggml 4-bit GPTQ quantization of StableLM-7B up on Hugging Face which works in llama.cpp for fast CPU inference. (For comparison, LLaMA-7B in the same format is 4.1GB, but StableLM-7B is actually closer to 8B parameters than 7B.)
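If you'd rather drive that from Python than from the llama.cpp CLI, the llama-cpp-python bindings expose the same CPU inference path, assuming your build supports this model family (the file name below is a placeholder for whatever the downloaded 4-bit ggml file is called):

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# Path to the downloaded 4-bit ggml file (placeholder name).
llm = Llama(model_path="./stablelm-7b-ggml-q4_0.bin")

out = llm("Q: What is StableLM?\nA:", max_tokens=64)
print(out["choices"][0]["text"])
```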
For the sake of convenience (half the download size/RAM/VRAM), I've uploaded 16-bit versions of the tuned models to the HF Hub:
Yeah, we need a Colab for this stuff that doesn't crash from running out of RAM lol
Hm, how do you actually run this? I tried it and got:
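A minimal way to run one of the 16-bit tuned checkpoints with transformers looks roughly like this (the model id and the prompt tags are assumptions based on the model card; adjust to whichever upload you grabbed):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "stabilityai/stablelm-tuned-alpha-7b"  # example model id

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.float16,
    device_map="auto",  # requires the accelerate package
)

# The tuned models expect the <|USER|>/<|ASSISTANT|> tags from the model card.
prompt = "<|USER|>What is your favorite animal?<|ASSISTANT|>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
tokens = model.generate(**inputs, max_new_tokens=64, do_sample=True, temperature=0.7)
print(tokenizer.decode(tokens[0], skip_special_tokens=True))
```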
Hi @jon-tow @python273, why do we have multiple .bin files inside
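Presumably this refers to the sharded pytorch_model-*.bin files: transformers splits large checkpoints into several shards indexed by pytorch_model.bin.index.json, and from_pretrained reassembles them automatically, so no manual merging is needed:

```python
from transformers import AutoModelForCausalLM

# The multiple pytorch_model-0000x-of-0000y.bin shards are stitched back
# together automatically by from_pretrained via pytorch_model.bin.index.json.
model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")
```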
https://huggingface.co/stabilityai/stablelm-base-alpha-3b/tree/main
Looks like 3B is 14.7GB, and if I understand correctly, it's supposed to be f16. Even with f32, it should be about 11.2G. With f16, 5.6G. Am I missing something?
For reference, LLaMA 7B (f16) is 12.6G.
upd: I guess it's actually f32, but it still seems a little bigger than it should be?
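One way to sanity-check the numbers is to count the parameters and compute the expected checkpoint size directly (sketch; this loads the full fp32 checkpoint into RAM):

```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("stabilityai/stablelm-base-alpha-3b")
n_params = sum(p.numel() for p in model.parameters())

print(f"parameters: {n_params:,}")                      # ~3.64B rather than 3.0B
print(f"fp32 size:  {n_params * 4 / 1024**3:.1f} GiB")  # ~13.6 GiB
print(f"fp16 size:  {n_params * 2 / 1024**3:.1f} GiB")  # ~6.8 GiB
```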