
fix: cross entropy for transformers>4.45 #123

Merged · 9 commits · Feb 7, 2025

Conversation

@anhuong (Collaborator) commented Feb 5, 2025

  • Checks the transformers version and creates a new custom loss function for llama, granite, mistral, and mixtral models.
  • Adds the shard_checkpoint function from transformers, as it is missing in later versions.

Tested and saw cross-entropy switching correctly based on the transformers version. Individual benchmark images are posted below; here is a single image comparing the previous benchmark to the new one. Overall the train_loss is on par, memory use is on par or a little higher, and train_tokens_per_second is in some cases higher and in others lower.

[Image: benchmark comparison screenshot, Feb 7, 2025]

closes: #98

@anhuong requested a review from fabianlim as a code owner on February 5, 2025 04:40
@anhuong (Collaborator, Author) left a comment

Had some questions when going through the code, thanks so much Fabian. Also, once the new benchmark is complete I will add the results to scripts/benchmarks/refs with the CSV and the requirements file that shows the updated dependencies.

shift_labels = shift_labels.to(shift_logits.device)

reduction = "sum" if num_items_in_batch is not None else "mean"
assert ignore_index == -100, "FastForCausalLMLoss currently supports only hardcoded ignore index -100."
@anhuong (Collaborator, Author):

What is the -100 ignore_index? I see that ignore_index is the target value that is ignored and does not contribute to the input gradient, but for CausalLMLoss what is at index -100?

@fabianlim (Contributor):

-100 is used extensively throughout HF; while they provide some means for the user to change it, almost nobody will bother to change it.

It is the label that is at -100. For a label with that value, we will ignore that token's contribution to the loss.
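For illustration, here is a minimal standalone example of how ignore_index=-100 behaves with torch's cross entropy (plain PyTorch, not the fused kernel in this PR):

```python
import torch
import torch.nn.functional as F

# 3 positions, vocab size 5; the second label is -100 and is excluded from the loss
logits = torch.randn(3, 5)
labels = torch.tensor([2, -100, 4])

loss = F.cross_entropy(logits, labels, ignore_index=-100, reduction="mean")

# equivalent to averaging only over the non-ignored positions
mask = labels != -100
manual = F.cross_entropy(logits[mask], labels[mask], reduction="mean")
assert torch.allclose(loss, manual)
```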


reduction = "sum" if num_items_in_batch is not None else "mean"
assert ignore_index == -100, "FastForCausalLMLoss currently supports only hardcoded ignore index -100."
loss = Fast_CrossEntropyLoss.apply(
@anhuong (Collaborator, Author):

Can you describe the difference between Fast_CrossEntropyLoss and FastCrossEntropyLoss?

@fabianlim (Contributor):

  • Fast_CrossEntropyLoss is the autograd function; we inherit this from Unsloth.
  • FastCrossEntropyLoss is a specialization of torch.nn.CrossEntropyLoss that serves as a convenience wrapper, implemented using Fast_CrossEntropyLoss.
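To illustrate the relationship, here is a generic sketch of that pattern in plain PyTorch (the class names and kernel bodies below are placeholders, not the actual Unsloth Triton implementation):

```python
import torch
import torch.nn.functional as F


class Fast_CrossEntropyLossSketch(torch.autograd.Function):
    """Autograd function: in the real code, forward/backward call fused Triton kernels."""

    @staticmethod
    def forward(ctx, logits, labels):
        # placeholder for the fused forward kernel: per-token losses
        losses = F.cross_entropy(logits, labels, ignore_index=-100, reduction="none")
        ctx.save_for_backward(logits, labels)
        return losses

    @staticmethod
    def backward(ctx, grad_output):
        logits, labels = ctx.saved_tensors
        # placeholder for the fused backward kernel: d(loss)/d(logits) = softmax - one_hot
        probs = torch.softmax(logits.float(), dim=-1)
        one_hot = F.one_hot(labels.clamp(min=0), logits.size(-1)).to(probs.dtype)
        grad = (probs - one_hot) * grad_output.unsqueeze(-1)
        grad[labels == -100] = 0.0
        return grad.to(logits.dtype), None


class FastCrossEntropyLossSketch(torch.nn.CrossEntropyLoss):
    """Convenience module that dispatches to the autograd function above."""

    def forward(self, input, target):
        losses = Fast_CrossEntropyLossSketch.apply(input, target)
        return losses.sum() / (target != -100).sum()
```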

# added by [email protected]

# adapted from transformers.loss.loss_utils.ForCausalLMLoss
def FastForCausalLMLoss(
@anhuong (Collaborator, Author):

Would we need to create a similar FastForCausalLMLoss for liger kernel cross entropy?

@fabianlim (Contributor):

Yes, I think we will have a new function for liger cross entropy with the same API; then it's plug and play. But it should only be used if the transformers version is advanced enough.
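For example, such a function could keep the same signature as FastForCausalLMLoss and swap the fused kernel in at the marked line. A hedged sketch under those assumptions (LigerForCausalLMLossSketch is a hypothetical name, and plain F.cross_entropy stands in for the Liger kernel; the shift/flatten logic mirrors transformers.loss.loss_utils.ForCausalLMLoss):

```python
import torch.nn.functional as F


def LigerForCausalLMLossSketch(
    logits, labels, vocab_size, num_items_in_batch=None, ignore_index=-100, **kwargs
):
    # shift so that tokens < n predict token n, as in ForCausalLMLoss
    shift_logits = logits[..., :-1, :].contiguous().view(-1, vocab_size)
    shift_labels = labels[..., 1:].contiguous().view(-1).to(shift_logits.device)

    # a Liger fused cross-entropy kernel would replace F.cross_entropy here
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = F.cross_entropy(
        shift_logits.float(), shift_labels, ignore_index=ignore_index, reduction=reduction
    )
    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss
```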

Comment on lines 128 to 131
rule_id="granite-custom-loss",
trigger=ModelPatcherTrigger(
check=replace_custom_loss_when_triggered(
GraniteForCausalLM, custom_loss_type="granite-custom-loss"
@anhuong (Collaborator, Author):

Thoughts on calling this granite-custom-crossent-loss instead, to be specific that the custom loss is for cross entropy?

@fabianlim (Contributor):

I feel custom-loss is OK, because it mostly refers to the fact that we are using the new custom loss feature.

Signed-off-by: Anh Uong <[email protected]>
@anhuong (Collaborator, Author) commented Feb 5, 2025

I also noticed that in the new transformers version there is a lot of slowness after loading the checkpoint, on this log line:

The new embeddings will be initialized from a multivariate normal distribution that has old embeddings' mean and covariance. As described in this article: https://nlp.stanford.edu/~johnhew/vocab-expansion.html. To disable this, use `mean_resizing=False`

Is this something we want to do? Is it expected that loading the embeddings will take a long time to run?

@fabianlim (Contributor):

@anhuong the slowness, I feel, is due to the recent changes in fms-hf-tuning, and is there because we now resize the embedding layer if there are special tokens. Previously we didn't do that.
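For reference, the mean-resizing step can be skipped as the log message suggests, by passing mean_resizing=False when resizing embeddings. A minimal sketch, assuming a transformers version that supports this argument (the model name is only a placeholder, and whether disabling it is desirable for convergence is a separate question):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# adding special tokens grows the vocabulary, which forces an embedding resize
tokenizer.add_special_tokens({"pad_token": "<pad>"})

# mean_resizing=False skips estimating the old embeddings' mean/covariance,
# so new rows use the default init and the resize is much faster
model.resize_token_embeddings(len(tokenizer), mean_resizing=False)
```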

Signed-off-by: Anh Uong <[email protected]>
@anhuong (Collaborator, Author) commented Feb 5, 2025

Able to verify that the correct cross-entropy is triggered based on the transformers version for granite, llama, mistral, and mixtral models.

With transformers==4.48

***************** Module Forwards Patching *************
Rule: llama-custom-loss Module:                           Class: LlamaForCausalLM Num:  1
INFO:framework.py:Rule: llama-custom-loss Module:                           Class: LlamaForCausalLM Num:  1
Rule: llama-rms       Module: input_layernorm           Class: LlamaRMSNorm    Num: 32
INFO:framework.py:Rule: llama-rms       Module: input_layernorm           Class: LlamaRMSNorm    Num: 32
Rule: llama-rms       Module: model                     Class: LlamaRMSNorm    Num:  1
INFO:framework.py:Rule: llama-rms       Module: model                     Class: LlamaRMSNorm    Num:  1
Rule: llama-rms       Module: post_attention_layernorm  Class: LlamaRMSNorm    Num: 32
INFO:framework.py:Rule: llama-rms       Module: post_attention_layernorm  Class: LlamaRMSNorm    Num: 32
Rule: llama-rope      Module:                           Class: LlamaForCausalLM Num:  1
INFO:framework.py:Rule: llama-rope      Module:                           Class: LlamaForCausalLM Num:  1
***************** Accelerator Patching *************

With transformers==4.45

***************** Module Forwards Patching *************
Rule: llama-cross-ent Module:                           Class: LlamaForCausalLM Num:  1
INFO:framework.py:Rule: llama-cross-ent Module:                           Class: LlamaForCausalLM Num:  1
Rule: llama-rms       Module: input_layernorm           Class: LlamaRMSNorm    Num: 32
INFO:framework.py:Rule: llama-rms       Module: input_layernorm           Class: LlamaRMSNorm    Num: 32
Rule: llama-rms       Module: model                     Class: LlamaRMSNorm    Num:  1
INFO:framework.py:Rule: llama-rms       Module: model                     Class: LlamaRMSNorm    Num:  1
Rule: llama-rms       Module: post_attention_layernorm  Class: LlamaRMSNorm    Num: 32
INFO:framework.py:Rule: llama-rms       Module: post_attention_layernorm  Class: LlamaRMSNorm    Num: 32
Rule: llama-rope      Module:                           Class: LlamaForCausalLM Num:  1
INFO:framework.py:Rule: llama-rope      Module:                           Class: LlamaForCausalLM Num:  1
***************** Accelerator Patching *************
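For context on how the switch above works, the patch rule selection is gated on the installed transformers version, roughly along these lines (a simplified sketch, not the exact fms-acceleration code; the 4.46 cutoff follows the linked issue, and the rule names mirror the log output above):

```python
import transformers
from packaging import version

TRANSFORMERS_GE_4_46 = version.parse(transformers.__version__) >= version.parse("4.46")

if TRANSFORMERS_GE_4_46:
    # newer transformers route loss through the custom loss-function hook
    rule_id = "llama-custom-loss"
else:
    # older transformers: patch the cross entropy inside the model forward
    rule_id = "llama-cross-ent"
```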

Comment on lines +772 to +776
# added by [email protected]
# adapted from transformers.modeling_utils.shard_checkpoint
# from transformers v4.46, removed in later versions
# TODO: split_torch_state_dict_into_shards from huggingface_hub library
def shard_checkpoint(
@anhuong (Collaborator, Author):

After transformers v4.46, this method no longer exists in transformers, so I copied it in here to start. The warning message in the original function says to migrate to split_torch_state_dict_into_shards, as noted in the TODO item here. That method is similar but the differences require more investigation - https://github.com/huggingface/huggingface_hub/blob/main/src/huggingface_hub/serialization/_torch.py#L302

@fabianlim (Contributor):

ok this is fine for now
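For future reference, a hedged sketch of what the TODO migration could look like, assuming huggingface_hub's split_torch_state_dict_into_shards keeps its current return structure (shard_with_hub is a hypothetical helper, not part of this PR):

```python
from huggingface_hub import split_torch_state_dict_into_shards


def shard_with_hub(state_dict, max_shard_size="5GB"):
    split = split_torch_state_dict_into_shards(state_dict, max_shard_size=max_shard_size)
    # rebuild per-file state dicts, similar to what shard_checkpoint returned
    shards = {
        filename: {name: state_dict[name] for name in tensors}
        for filename, tensors in split.filename_to_tensors.items()
    }
    # the index maps each tensor to its shard file when the checkpoint is sharded
    index = (
        {"metadata": split.metadata, "weight_map": split.tensor_to_filename}
        if split.is_sharded
        else None
    )
    return shards, index
```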

@@ -0,0 +1,89 @@
bf16,epoch,fp16,framework_config,learning_rate,lora_alpha,lora_dropout,mem_nvidia_mem_reserved,mem_peak_torch_mem_alloc_in_bytes,mem_torch_mem_alloc_in_bytes,model_name_or_path,num_gpus,peft_method,per_device_train_batch_size,r,target_modules,torch_dtype,train_loss,train_runtime,train_samples_per_second,train_steps_per_second,train_tokens_per_second
,0.07,,none,2.00E-05,,,15116,11267745280,6770300416,bigcode/gpt_bigcode-santacoder,1,,4,,,bfloat16,2.33703125,47.6604,8.393,2.098,17188.262
@anhuong (Collaborator, Author) commented Feb 7, 2025

I ran the benchmarks and found the failure in auto-gptq that I commented on with the fix above; I am rerunning the benchmarks for auto-gptq. I also ran benchmarks for the granite3.1 model, but for comparison I had to run against the granite-gptcode model. Do we want to update to the granite3.1 model? It did run successfully with it as well.

Here are the charts comparing against a100_80gb.csv (excluding the failed auto-gptq runs). They show that train_loss is on par, memory use is on par or a little higher, and train_tokens_per_second is in some cases higher and in others lower.
[Charts: peak memory, allocated memory, train_loss, and train_tokens_per_second comparisons]

@fabianlim (Contributor):

these look quite decent

Signed-off-by: Anh Uong <[email protected]>
@fabianlim (Contributor) left a comment

LGTM, the benches look good.

One question about the new benches you ran: rather than have them in a separate file, it is better to replace the previous a100_80gb and update the requirements. But did you run all the cases or only a subset?

@anhuong (Collaborator, Author) commented Feb 7, 2025

I will rename the benchmarks and requirements. I had run all of the benchmarks except for auto-gptq due to the error that came up, so I ran those separately after the fix. I also did not run baseline-bnb but am running it now separately; what is the purpose of this benchmark? I will add all of the runs to the benchmark. I updated the above description with the summary results and image that include auto-gptq; here are the individual images. Overall they continue to look good, and the only outlier identified was:

framework_config           peft_method  model_name_or_path              num_gpus  per_device_train_batch_size  reference  metric                   new
accelerated-peft-bnb       lora         bigcode/gpt_bigcode-santacoder  2         2                            6840.355   train_tokens_per_second  8548.922
accelerated-peft-bnb-foak  lora         bigcode/gpt_bigcode-santacoder  2         2                            10345.932  train_tokens_per_second  11994.044

[Charts: peak memory, allocated memory, train_loss, and train_tokens_per_second comparisons]

@anhuong (Collaborator, Author) commented Feb 7, 2025

I have replaced the benchmark and requirements with my full runs.

@fabianlim (Contributor):

> I also did not run the baseline-bnb but running now separately, what is the purpose of this benchmark?

The purpose of this is to have a baseline so we can compare the accelerations. The baseline could also change due to different transformers versions.

@anhuong (Collaborator, Author) commented Feb 7, 2025

Makes sense. The benchmark I added is the complete benchmark, including the baseline, and it matches the original in the number of runs. With this, I will merge in this change.

@anhuong merged commit 24bdadb into foundation-model-stack:main on Feb 7, 2025
7 checks passed
@fabianlim (Contributor):

sounds good!

Successfully merging this pull request may close these issues.

FOAK Cross Entropy Loss Will Not Work with New Loss Functions After Transformers 4.46