Some questions about the pre-trained mask strategy #7
In the Span-aware entity masking section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is unclear to me. Can you explain how the loss is handled? Looking forward to your reply.
"choose 15% of the mask for all entities". The 15% probability is not strictly enforced. We select one entity randomly, mask it, check if the total number of masked tokens reaches 15% of all tokens. If not reached, repeat this process. |
For example, if the sentence length is 100, we want to mask 15 tokens. If the randomly picked entity has 17 tokens, we mask all 17 tokens even though that exceeds 15.
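For concreteness, here is a minimal sketch of that selection loop, assuming entities are available as (start, end) token spans; the function and variable names are illustrative, not the repository's actual code:

```python
import random

def span_aware_entity_mask(num_tokens, entity_spans, mask_ratio=0.15):
    """Pick whole entities at random until roughly `mask_ratio` of the
    tokens are masked. Returns the set of masked token positions.
    `entity_spans` is a list of (start, end) index pairs, end exclusive."""
    target = int(num_tokens * mask_ratio)
    masked = set()
    candidates = list(entity_spans)
    random.shuffle(candidates)
    for start, end in candidates:
        if len(masked) >= target:
            break
        # mask the whole entity, even if this overshoots the target
        masked.update(range(start, end))
    return masked
```

With num_tokens=100 (target 15) and a 17-token entity picked first, the loop masks all 17 tokens and stops, matching the example above.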
The loss computation just follows other masked language models; only the masking strategy is customized.
Thank you very much for the help. What I want to ask is: do you only use MLM in the loss calculation, and do you use SpanBERT's Span Boundary Objective (SBO) loss? Looking forward to your reply.
While reading the code for title generation, I found that its decoding strategy differs from the one used for FOS generation; it looks more like a prefix LM. I don't quite understand why the same generation strategy as FOS is not used. Can you share the reason? Thanks.
No, we only use MLM. It is possible to use SpanBERT's loss as well; it might help the model learn span information more efficiently.
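For reference, an MLM-only objective is just cross-entropy over the masked positions. A generic PyTorch sketch (not the project's actual training code) looks like this:

```python
import torch
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """logits: (batch, seq_len, vocab_size) model outputs.
    labels: (batch, seq_len) holding the original token ids at masked
    positions and -100 everywhere else, so unmasked tokens are ignored."""
    return F.cross_entropy(
        logits.view(-1, logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
```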
We have tried various ways to train and run inference. I suppose you have read the code. We actually tried to use GLM and other advanced masking strategies to train the model and obtain parameters better suited to sequence generation tasks. As far as I know, if your downstream tasks are mainly comprehension tasks, such as cloze tasks, training with MLM works well, but I remember there is research indicating that GPT-style training can also achieve good results. For sequence generation, pure MLM training is, I think, somewhat harder.
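As a rough illustration of the prefix-LM style attention mentioned above (a generic sketch, not the actual implementation): the context, e.g. the abstract, is attended bidirectionally, while the generated part, e.g. the title, is attended causally.

```python
import torch

def prefix_lm_attention_mask(prefix_len, total_len):
    """Build a (total_len, total_len) mask: 1 = may attend, 0 = blocked.
    The first `prefix_len` tokens (the context) attend to each other
    bidirectionally; the remaining tokens attend causally, seeing the
    full prefix plus earlier generated tokens only."""
    mask = torch.tril(torch.ones(total_len, total_len))
    mask[:, :prefix_len] = 1.0  # every position can see the whole prefix
    return mask
```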
Hello, I would like to ask about the first step of pre-training: for a paper's title, abstract, and body text, what kind of JSON format works best? Thanks.
In the Span-aware entity masking section of the paper it is mentioned that "If the sampled length is less than the entity length, we will only mask out the entity. For text contents and entity contents, we mask 15% of the tokens for each respectively."
I have two points of confusion here:
First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence, assuming Geo(p) == 6 and entity_len == 7, here it means mask_len == 7 ? but when Geo(p) == 6 and entity_len == 5, what to do? Can you help with an example?
Second: "we mask 15% of the tokens for each respectively", for entity, I am very confused, this is to choose 15% of the tokens for each entity OR choose 15% of the mask for all entities? Combined with the first question, here is how to guarantee a 15% probability?
Looking forward to your reply.