Adding BetterTransformer support for ProphetNet #648
Conversation
Hello, I had a couple of questions about how to add BetterTransformer support for ProphetNet. FYI, this is my first open source contribution, so please bear with me.

(1) Just so that I am clear on what is going on within this ticket: we are attempting to replace the transformer encoder layers used in these models so that they integrate with BetterTransformer?

(2) I am having a hard time understanding how to identify which layers within ProphetNet need to be replaced. For example, within the documentation, I see lines that are used to identify the layers that need to be replaced, and if I understood correctly, layers which contain (Attention) are the ones that need to be replaced? My question is: how do you know what to pass to the replacement layer?

Thanks again for your patience!
Hi, that's great! To answer your questions:

```python
from transformers import ProphetNetModel, ProphetNetConfig

cfg = ProphetNetConfig()
model = ProphetNetModel(cfg)
print(model)
```

In the printed output you can find some submodules like the encoder layers and their attention blocks, which are the ones to replace.
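For example, to list the candidate submodules programmatically, you could run something like this (just a sketch built on the snippet above; matching on class names is a heuristic, not an official API):

```python
from transformers import ProphetNetConfig, ProphetNetModel

model = ProphetNetModel(ProphetNetConfig())

# Print submodules whose class names look like encoder layers or attention blocks;
# these are the natural candidates for a BetterTransformer replacement.
for name, module in model.named_modules():
    cls_name = type(module).__name__
    if "EncoderLayer" in cls_name or "Attention" in cls_name:
        print(name, "->", cls_name)
```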
Hi @adit299, your branch is out of sync with main; could you update it?
Hi @fxmarty, I assume this is referring to the merge conflicts? Yes, I am familiar with Git and will resolve them. I will let you know if I am unable to.
Hello @fxmarty, thanks for your detailed response to my initial question! Those links were super helpful. Also, please do let me know if there are any issues with the way I resolved the merge conflicts. I had some additional questions:

(1) Just a quick question about how these attributes relate to the Transformer Encoder module mentioned in the paper: for example, does in_proj_weight refer to the weights being applied to the input embedding (and a similar idea for out_proj_weight)? Perhaps this question is out of scope for the task, but any clarification is appreciated!

(2) In Step 3 of https://huggingface.co/docs/optimum/bettertransformer/tutorials/contribute, it says: "After the first forward pass, the hidden states needs to be nested using the attention mask." I am confused about what is meant by nesting, and how do we check the shape of the attention mask?
Hi @adit299, the merge is good, thanks! For your questions:

in_proj_weight and out_proj_weight correspond to the attention projection parameters in PyTorch's nn.MultiheadAttention: in_proj_weight is the fused weight that projects the attention input into queries, keys and values, and out_proj_weight is the output projection applied after attention, so neither is the token input embedding itself.
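To see what those attributes look like on the PyTorch side, here is a small illustrative snippet (not ProphetNet-specific; the sizes are just an example):

```python
import torch.nn as nn

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2)

# in_proj_weight stacks the query, key and value projection weights: (3 * embed_dim, embed_dim)
print(mha.in_proj_weight.shape)   # torch.Size([24, 8])

# out_proj is the linear layer applied to the attention output: (embed_dim, embed_dim)
print(mha.out_proj.weight.shape)  # torch.Size([8, 8])
```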
You can get a feel for it with:

```python
import torch

hidden_states = torch.rand(4, 6, 8)  # batch size, sequence length, hidden size
attention_mask = torch.ones(4, 6, dtype=torch.bool)
attention_mask[2][4:] = False
attention_mask[3][3:] = False
attention_mask[1][2:] = False
attention_mask = torch.reshape(attention_mask, (attention_mask.shape[0], attention_mask.shape[-1]))
res = torch._nested_tensor_from_mask(hidden_states, attention_mask)
```

I'm not sure about the shape of the attention mask. Just a note: could you add the model to the tests?
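To make the "nesting" part concrete, you can inspect the result of the snippet above (a small addition, assuming the code above has been run):

```python
# The nested tensor stores one variable-length sequence per batch entry,
# keeping only the unmasked tokens, so padding is dropped rather than carried around.
print(res.is_nested)      # True
for seq in res.unbind():
    print(seq.shape)      # torch.Size([6, 8]), torch.Size([2, 8]), torch.Size([4, 8]), torch.Size([3, 8])
```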
Hello @fxmarty, thanks again for the detailed response; that made things clear. I just added the model to the tests. A couple of things:

(1) I'm noticing that I get an error when I run the tests locally (I have attached the full stacktrace as well). I am currently debugging it; any advice or insight you could give would be appreciated!
```python
if hidden_states.shape[0] != attention_mask.shape[0]:
    hidden_states = hidden_states.transpose(1, 0)
```
Maybe the error you get is due to those lines; you could try to remove them, but I'm not sure.
I just tried that... no luck, those tests are still failing.
By using print statements, I can see that hidden_states is of shape: torch.Size([2, 4, 16]) and attention_mask is of shape: torch.Size([8, 4]). So there appears to be a discrepancy in the batch_size value?
Hi @adit299, I'll look asap, probably on Friday. In the meantime you can compare with the shapes for e.g. BERT to see how it differs.
Hello,

The ProphetNet documentation (https://huggingface.co/docs/transformers/model_doc/prophetnet) states that "ProphetNet is an encoder-decoder model...", which is why I added "hf-internal-testing/tiny-random-ProphetNetModel" under the ALL_ENCODER_DECODER_MODELS_TO_TEST list within test_bettertransformer_encoder.py. I also tried running the tests with "hf-internal-testing/tiny-random-ProphetNetModel" under the ALL_ENCODER_MODELS_TO_TEST list instead, and one test fails with the failure shown in the attached stacktrace. This error seems similar to the one referenced here: #494 (comment)

So, I guess my questions are:

(1) Since the layer being replaced within ProphetNet is an encoder layer, does this mean that we just need to add it under the ALL_ENCODER_MODELS_TO_TEST list, or would it be under the ALL_ENCODER_DECODER_MODELS_TO_TEST list? (Both lists are within test_bettertransformer_encoder.py.) What differentiates the models within these two lists?

(2) I had a look at the test_bettertransformer_encoder.py file and was having a hard time understanding how the tests differ between models in the ALL_ENCODER_MODELS_TO_TEST list and the ALL_ENCODER_DECODER_MODELS_TO_TEST list. Any clarification on how the tests differ would be appreciated.

Thanks again for the speedy responses!
Hi @adit299, you are right that the naming of the test file might not be the best; basically, this file tests the BetterTransformer feature for text models. About your questions: the tests are different because things can vary a bit for seq2seq models, for instance the inputs. In your case the test fails because no inputs are provided for the decoder, so you should add the model to the ALL_ENCODER_DECODER_MODELS_TO_TEST list.
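For reference, the change could look roughly like this (a hypothetical excerpt; the other entries and the exact file path are assumptions, not copied from the repo):

```python
# tests/bettertransformer/test_bettertransformer_encoder.py (hypothetical excerpt)
ALL_ENCODER_DECODER_MODELS_TO_TEST = [
    "hf-internal-testing/tiny-random-bart",             # assumed existing entry
    "hf-internal-testing/tiny-random-ProphetNetModel",  # entry added in this PR
]
```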
Hello @michaelbenayoun, thank you for the clarification! I guess this confirms that the attention mask in ProphetNet is being constructed differently from BERT's. I believe I have found the lines of code that are causing this difference: prior to these lines executing, the attention_mask has shape [2, 4], which lines up with the dimensions of the hidden_states. However, after these lines execute, the dimensions are [8, 1, 4], so some sort of padding/reshaping is happening here. If I can figure out what the purpose of this reshaping is and how to deal with it, I should be able to fix the remaining test failures.
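One plausible reading, and this is only an assumption on my part until I check the ProphetNet source, is that the [2, 4] mask gets repeated once per attention head (2 examples * 4 heads = 8 rows); if so, a per-example mask could be recovered before nesting, along these lines:

```python
import torch

batch_size, num_heads, seq_len = 2, 4, 4

# Hypothetical extended mask of shape (batch_size * num_heads, 1, seq_len),
# assuming the per-example mask was tiled once per attention head.
extended_mask = torch.zeros(batch_size * num_heads, 1, seq_len)

# Keep a single copy per example to get back to (batch_size, seq_len);
# the indexing below assumes the head dimension varies slowest after the repeat.
per_example_mask = extended_mask.view(num_heads, batch_size, seq_len)[0]
print(per_example_mask.shape)  # torch.Size([2, 4])
```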
Hello, I have been having issues with the model not producing an attention mask of the correct dimension and not producing the same logits as the original model. I have attached the full stacktrace. I am not sure what the issue is; I suspect one of the parameters within ProphetNetEncoderLayerBetterTransformer is not set correctly, but I am a bit lost on how to proceed.
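For context, here is roughly how I am comparing the two models locally (a sketch; the checkpoint is the tiny test model already used in the test suite, and the inputs and tolerance are arbitrary):

```python
import torch
from transformers import AutoModel
from optimum.bettertransformer import BetterTransformer

model = AutoModel.from_pretrained("hf-internal-testing/tiny-random-ProphetNetModel").eval()
bt_model = BetterTransformer.transform(model, keep_original_model=True)

input_ids = torch.tensor([[1, 2, 3, 4]])
attention_mask = torch.ones_like(input_ids)
decoder_input_ids = torch.tensor([[1, 2, 3, 4]])

with torch.no_grad():
    out = model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids)
    bt_out = bt_model(input_ids=input_ids, attention_mask=attention_mask, decoder_input_ids=decoder_input_ids)

# The converted model should reproduce the original hidden states within a small tolerance.
print(torch.allclose(out.last_hidden_state, bt_out.last_hidden_state, atol=1e-4))
```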
Hi @adit299, I'll try to have a look soon!
@adit299 what's the status of this task?
What does this PR do?
Opening up a draft PR to start discussion on how to add BetterTransformer support for ProphetNet.
Fixes #488
Before submitting