
Some questions about the pre-trained mask strategy #7

Open · biandh opened this issue Nov 7, 2022 · 9 comments

biandh commented Nov 7, 2022

In the "Span-aware entity masking" section of the paper it is mentioned that "If the sampled length is less than the entity length, we will only mask out the entity. For text contents and entity contents, we mask 15% of the tokens for each respectively."

I have two points of confusion here.

First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence. Assuming Geo(p) == 6 and entity_len == 7, does this mean mask_len == 7? And when Geo(p) == 6 and entity_len == 5, what should be done? Can you give an example?

Second: "we mask 15% of the tokens for each respectively". For entities I am confused: does this mean choosing 15% of the tokens for each entity, OR choosing 15% of the masked tokens across all entities? Combined with the first question, how is the 15% probability guaranteed here?

Looking forward to your reply.

biandh (Author) commented Nov 8, 2022

In the "Span-aware entity masking" section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

Looking forward to your reply.

Somefive (Collaborator) commented Nov 9, 2022

Second: "we mask 15% of the tokens for each respectively", for entity, I am very confused, this is to choose 15% of the tokens for each entity OR choose 15% of the mask for all entities? Combined with the first question, here is how to guarantee a 15% probability?

"choose 15% of the mask for all entities". The 15% probability is not strictly enforced. We select one entity randomly, mask it, check if the total number of masked tokens reaches 15% of all tokens. If not reached, repeat this process.

Somefive (Collaborator) commented Nov 9, 2022

First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence, assuming Geo(p) == 6 and entity_len == 7, here it means mask_len == 7 ? but when Geo(p) == 6 and entity_len == 5, what to do? Can you help with an example?

For example, if the sentence length is 100, we want to mask 15 tokens. If we randomly pick one entity that has 17 tokens, we mask all 17 tokens, even though that is more than 15.

Somefive (Collaborator) commented Nov 9, 2022

> In the "Span-aware entity masking" section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

The loss computation just follows other masked language models. Only the masking strategy is customized.
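In other words, the loss is the usual masked-LM cross-entropy computed only at masked positions. A generic sketch, assuming the common BERT-style convention of labelling unmasked positions with -100 (this is not the OAG-BERT training code):

```python
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Cross-entropy over masked positions only.
    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len), with -100
    at positions that were not masked so they are ignored."""
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
```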

biandh (Author) commented Nov 10, 2022

> In the "Span-aware entity masking" section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

> The loss computation just follows other masked language models. Only the masking strategy is customized.

Thank you very much for the help. What I meant to ask is: do you only use MLM in the loss calculation, or do you also use SpanBERT's Span Boundary Objective (SBO) loss?

Looking forward to your reply.

biandh (Author) commented Nov 10, 2022

While reading the code for title generation, I found that the decoding strategy is different from the one used for FOS; it looks more like a Prefix LM approach. I don't quite understand why the same generation strategy as for FOS is not used. Could you share the reason? Thanks.

Somefive (Collaborator) commented:

> Thank you very much for the help. What I meant to ask is: do you only use MLM in the loss calculation, or do you also use SpanBERT's Span Boundary Objective (SBO) loss?

No, we only use the MLM loss. It would be possible to use SpanBERT's SBO loss; maybe it could help the model learn span information more efficiently.
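For reference, SpanBERT's Span Boundary Objective predicts each masked token from the hidden states of the two tokens just outside the span plus an embedding of the token's position within the span, through a small feed-forward head. A rough sketch of such a head (an illustration only; per the answer above, OAG-BERT does not use it):

```python
import torch
import torch.nn as nn

class SpanBoundaryObjective(nn.Module):
    """Rough sketch of a SpanBERT-style SBO head: predict each masked token
    from the left/right span-boundary hidden states plus a relative position
    embedding, via a small feed-forward network."""

    def __init__(self, hidden_size, vocab_size, max_span_len=30):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, vocab_size),
        )

    def forward(self, left_boundary, right_boundary, rel_positions):
        # left_boundary, right_boundary: (num_targets, hidden_size)
        # rel_positions: (num_targets,) position of each target inside its span
        h = torch.cat(
            [left_boundary, right_boundary, self.pos_emb(rel_positions)], dim=-1
        )
        return self.mlp(h)  # (num_targets, vocab_size) logits
```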

Somefive (Collaborator) commented:

> While reading the code for title generation, I found that the decoding strategy is different from the one used for FOS; it looks more like a Prefix LM approach. I don't quite understand why the same generation strategy as for FOS is not used. Could you share the reason? Thanks.

We have tried various ways to train and do inference. I suppose you read the code in cogdl? The code there is not fully equivalent to the strategy in the paper, since several updates were made afterwards, but the general ideas are the same. The code in cogdl is mostly for inference rather than training (if I remember correctly and no further updates were made). The MLM loss was our first attempt at learning entity information, targeting comprehension. However, for language generation tasks, the so-called "Prefix LM" is more helpful for generating sequences, in terms of both efficiency and quality.

We actually tried using GLM and other advanced masking strategies to train the model and obtain parameters better suited to sequence generation tasks.

As far as I know, if your downstream tasks are mainly comprehension work, like cloze tasks, training with MLM could work well. But I remember there is research indicating that GPT-style training can also achieve good results. For sequence generation, I think pure MLM training is somewhat harder.
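To make the "Prefix LM"-style generation concrete: one common way to generate a title with a BERT-style masked LM is to keep the context as a fully visible prefix, append a [MASK], predict it, and repeat. A hedged sketch, assuming a HuggingFace-style masked-LM model and tokenizer and simple greedy decoding (this is not the cogdl implementation):

```python
import torch

@torch.no_grad()
def generate_title(model, tokenizer, context_ids, max_new_tokens=32):
    """Prefix-LM-style decoding sketch: append [MASK] after the context,
    predict it greedily, append the prediction, and repeat until [SEP].
    `model` and `tokenizer` stand in for HuggingFace-style objects."""
    mask_id = tokenizer.mask_token_id
    sep_id = tokenizer.sep_token_id
    ids = list(context_ids)
    for _ in range(max_new_tokens):
        inp = torch.tensor([ids + [mask_id]])
        logits = model(input_ids=inp).logits      # (1, len, vocab)
        next_id = int(logits[0, -1].argmax())     # greedy pick for the [MASK] slot
        if next_id == sep_id:                     # stop at [SEP]
            break
        ids.append(next_id)
    return tokenizer.decode(ids[len(context_ids):])
```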

blackbird11111 commented:

Hello, I would like to ask about the first step of pre-training: what kind of JSON format works best for the title, abstract, and body text of a paper? Thank you.
