[NEW] Llama3.2 weight converters 🦙 #255
Have you managed to train with [...] When using your conversion script above for the Llama 3.2 3B model, it works fine for [...] (using your llama 3.2 yaml script in [...])
Very nice PR @TJ-Solergibert! Thanks
I added a few small questions before merging.
@@ -0,0 +1,73 @@
"""
is this supposed to be pushed? 👀
# NOTE: this scale is for µTransfer,
# in SP, we use sqrt(1/d_h)
softmax_scale = 1 / query_states.shape[-1] if self.is_using_mup else None
- attn_output = flash_attn_varlen_func(
+ attn_output = flash_attn_func(
Yes, this is faster, but only for causal masks. How do you deal with the KV cache at inference? Are generations the same with and without use_kv_cache?
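For the softmax_scale line in the diff above, here is a minimal sketch of the two scaling conventions the code comment mentions. It is plain Python for illustration only, not Nanotron code: the helper names `softmax_scale_for` and `softmax` and the example scores are assumptions. When `softmax_scale` is None, FlashAttention falls back to 1/sqrt(head_dim), which is the SP case.

```python
import math


def softmax_scale_for(head_dim: int, is_using_mup: bool) -> float:
    """Scale applied to the Q.K^T scores before the softmax (hypothetical helper)."""
    if is_using_mup:
        return 1.0 / head_dim  # muTransfer: 1/d_h
    return 1.0 / math.sqrt(head_dim)  # SP: sqrt(1/d_h)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


head_dim = 64
raw_scores = [8.0, 16.0, 24.0]  # made-up unscaled attention scores

sp_probs = softmax([s * softmax_scale_for(head_dim, False) for s in raw_scores])
mup_probs = softmax([s * softmax_scale_for(head_dim, True) for s in raw_scores])

# The muTransfer scale is smaller, so the attention distribution is flatter.
print(sp_probs)
print(mup_probs)
```

The point of the sketch is only that the two modes produce different attention distributions for the same scores, so a converted checkpoint has to be evaluated with the scale it was trained with.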
Hi!
In this branch I reintroduce the Llama model & conversion scripts, updated to the current main branch, to support the Llama3.1 and Llama3.2 1B & 3B models.
The main changes are the following:
- [...] transformers LlamaRotaryEmbedding layer. Now this will be the only class in llama.py. I think it shouldn't break generations for the inference case WITHOUT LlamaConfig.rope_interleaved = True
- [...] in CausalSelfAttention.forward, are there any tests?
- [...] config.optimizer.finetuning flag in order to (True) just load the weights or (False) load the weights, optimizer & LR scheduler, instead of config.checkpoints.load_optimizer & config.checkpoints.load_lr_scheduler
- Switched from flash_attn_varlen_func to flash_attn_func, as the latter achieves greater performance. Keep in mind that we aren't using any feature of the varlen function, so it's recommended to stick with flash_attn_func
- [...] LlamaConfig.rope_interleaved? It was useful for training when using FlashAttention RoPEs and now seems to be used in the inference code as well. IMO we should unify all 3 cases (training, inference with rope_interleaved & inference without rope_interleaved) within a single RoPE
Results
You can run the conversion & generation tests using the scripts in tools/converters. As I already mentioned in the previous PR (#174), although we need at least 1 GPU (to init the ParallelContext), we run the conversion on the CPU.
As can be seen from the following table, we observe slight differences between the 2 backends. These differences come from the QKV projections in the CausalSelfAttention layer (Nanotron computes them in a single GEMM vs. 3 separate GEMMs in HF) and from the LayerNorm layer (Nanotron uses an optimized one from FlashAttention vs. the basic PyTorch LayerNorm in HF). Also note that the differences increase if we use TP, which is expected, as the GEMM sizes are different and trigger different GEMM algorithms.
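Why a different GEMM partitioning changes the output at all can be illustrated with a toy reduction. The sketch below is an assumption-laden illustration in pure Python (float64), not Nanotron or HF code: `dot_sequential` and `dot_sharded` are hypothetical helpers, and splitting the reduction into `tp` contiguous shards only mimics how a tensor-parallel GEMM sums partial results.

```python
import random


def dot_sequential(a, b):
    """Dot product accumulated left to right in a single pass."""
    acc = 0.0
    for x, y in zip(a, b):
        acc += x * y
    return acc


def dot_sharded(a, b, tp):
    """Split the reduction dimension into `tp` contiguous shards,
    reduce each shard separately, then sum the partial results."""
    n = len(a)
    assert n % tp == 0, "length must be divisible by tp"
    shard = n // tp
    partials = [
        dot_sequential(a[i * shard:(i + 1) * shard], b[i * shard:(i + 1) * shard])
        for i in range(tp)
    ]
    return sum(partials)


# Floating-point addition is not associative:
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6

random.seed(0)
a = [random.uniform(-1.0, 1.0) for _ in range(4096)]
b = [random.uniform(-1.0, 1.0) for _ in range(4096)]

# Same mathematical result, different summation order per "TP size".
results = {tp: dot_sharded(a, b, tp) for tp in (1, 2, 4)}
for tp, value in results.items():
    print(tp, repr(value))
```

The printed values may differ in their last digits depending on the shard count. In lower precision (bf16/fp16) and with real GEMM kernels, the same effect is larger, which is consistent with the differences growing under TP.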
To run the Nanotron generations with different TP sizes:
TODO (Preferably in other PRs):
- [...] nanotron/tools/converters/delete/generate_hf_predictions.py & nanotron/tools/converters/delete/generate_nanotron_predictions.py scripts
- [...] apply_rotary_pos_emb [...] CausalSelfAttention.forward
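On the rope_interleaved question raised above, the following is a minimal sketch of the two RoPE layouts for a single head vector, in pure Python. Everything here is illustrative and assumed: the function names are hypothetical, the base of 10000.0 is a placeholder, and the Llama 3.1/3.2 frequency scaling is left out. The "half" layout rotates dimension i with dimension i + d/2 (the rotate_half convention), while the interleaved layout rotates adjacent even/odd pairs.

```python
import math


def rope_angles(pos, dim, base=10000.0):
    """One rotation angle per pair of dimensions (assumed base, no scaling)."""
    return [pos * base ** (-2.0 * i / dim) for i in range(dim // 2)]


def rope_half(x, pos, base=10000.0):
    """Non-interleaved layout: pair (x[i], x[i + d/2])."""
    d, h = len(x), len(x) // 2
    out = [0.0] * d
    for i, theta in enumerate(rope_angles(pos, d, base)):
        c, s = math.cos(theta), math.sin(theta)
        out[i] = x[i] * c - x[i + h] * s
        out[i + h] = x[i] * s + x[i + h] * c
    return out


def rope_interleaved(x, pos, base=10000.0):
    """Interleaved layout: pair (x[2i], x[2i + 1])."""
    d = len(x)
    out = [0.0] * d
    for i, theta in enumerate(rope_angles(pos, d, base)):
        c, s = math.cos(theta), math.sin(theta)
        out[2 * i] = x[2 * i] * c - x[2 * i + 1] * s
        out[2 * i + 1] = x[2 * i] * s + x[2 * i + 1] * c
    return out


def to_interleaved(x):
    """Permute a half-layout vector into the interleaved layout."""
    h = len(x) // 2
    out = []
    for i in range(h):
        out.extend([x[i], x[i + h]])
    return out


x = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 0.1, 3.0]
pos = 7

# The two layouts agree up to a fixed permutation of the head dimensions.
via_half = to_interleaved(rope_half(x, pos))
via_interleaved = rope_interleaved(to_interleaved(x), pos)
print(via_half)
print(via_interleaved)
```

Since the layouts differ only by a permutation of the head dimensions, unifying the three cases into a single RoPE seems feasible in principle, provided the Q/K projection weights are permuted consistently during conversion. Whether that is how the conversion scripts in this PR handle it is not confirmed here.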