
Some questions about the pre-trained mask strategy #7

Open · biandh opened this issue Nov 7, 2022 · 9 comments

biandh commented Nov 7, 2022

In the "Span-aware entity masking" section of the paper it is mentioned that "If the sampled length is less than the entity length, we will only mask out the entity. For text contents and entity contents, we mask 15% of the tokens for each respectively."

I have two points of confusion here.

First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence. Assuming Geo(p) == 6 and entity_len == 7, does this mean mask_len == 7? And when Geo(p) == 6 and entity_len == 5, what should be done? Can you give an example?

Second: "we mask 15% of the tokens for each respectively". For entities I am confused: does this mean choosing 15% of the tokens for each entity, OR choosing 15% of the masked tokens across all entities? Combined with the first question, how is the 15% probability guaranteed here?

Looking forward to your reply.

biandh (Author) commented Nov 8, 2022

In the "Span-aware entity masking" section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

Looking forward to your reply.

Somefive (Collaborator) commented Nov 9, 2022

Second: "we mask 15% of the tokens for each respectively", for entity, I am very confused, this is to choose 15% of the tokens for each entity OR choose 15% of the mask for all entities? Combined with the first question, here is how to guarantee a 15% probability?

"choose 15% of the mask for all entities". The 15% probability is not strictly enforced. We select one entity randomly, mask it, check if the total number of masked tokens reaches 15% of all tokens. If not reached, repeat this process.

Somefive (Collaborator) commented Nov 9, 2022

First: "If the sampled length is less than the entity length, we will only mask out the entity." I can't understand the meaning of this sentence, assuming Geo(p) == 6 and entity_len == 7, here it means mask_len == 7 ? but when Geo(p) == 6 and entity_len == 5, what to do? Can you help with an example?

For example, if the sentence length is 100, we want to mask 15 tokens. If we randomly pick one entity that has 17 tokens, we mask all 17 tokens, even though that is more than 15.

Somefive (Collaborator) commented Nov 9, 2022

> In the "Span-aware entity masking" section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

The loss computation just follows other masked language models. Only the masking strategy is customized.
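In other words, the loss is the usual masked-LM cross-entropy computed only at masked positions. A generic sketch, assuming the common BERT-style convention of labelling unmasked positions with -100 (this is not the OAG-BERT training code):

```python
import torch.nn.functional as F

def mlm_loss(logits, labels):
    """Cross-entropy over masked positions only.
    logits: (batch, seq_len, vocab_size); labels: (batch, seq_len), with -100
    at positions that were not masked so they are ignored."""
    vocab_size = logits.size(-1)
    return F.cross_entropy(
        logits.view(-1, vocab_size),
        labels.view(-1),
        ignore_index=-100,
    )
```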

biandh (Author) commented Nov 10, 2022

> In the "Span-aware entity masking" section, the paper says: "we expect OAG-BERT to memorize them well and thus develop a span-aware entity masking strategy combining the advantages of both ERNIE [55] and SpanBERT [17]." This is still unclear to me. Can you explain how the loss is computed?

> The loss computation just follows other masked language models. Only the masking strategy is customized.

Thank you very much for the help. What I meant to ask is: do you only use MLM in the loss calculation, or do you also use SpanBERT's Span Boundary Objective (SBO) loss?

Looking forward to your reply.

biandh (Author) commented Nov 10, 2022

While reading the code for title generation, I found that the decoding strategy is different from the one used for FOS; it looks more like a Prefix LM approach. I don't quite understand why the same generation strategy as for FOS is not used. Could you share the reason? Thanks.

Somefive (Collaborator) commented:

> Thank you very much for the help. What I meant to ask is: do you only use MLM in the loss calculation, or do you also use SpanBERT's Span Boundary Objective (SBO) loss?

No, we only use the MLM loss. It would be possible to use SpanBERT's SBO loss; maybe it could help the model learn span information more efficiently.
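For reference, SpanBERT's Span Boundary Objective predicts each masked token from the hidden states of the two tokens just outside the span plus an embedding of the token's position within the span, through a small feed-forward head. A rough sketch of such a head (an illustration only; per the answer above, OAG-BERT does not use it):

```python
import torch
import torch.nn as nn

class SpanBoundaryObjective(nn.Module):
    """Rough sketch of a SpanBERT-style SBO head: predict each masked token
    from the left/right span-boundary hidden states plus a relative position
    embedding, via a small feed-forward network."""

    def __init__(self, hidden_size, vocab_size, max_span_len=30):
        super().__init__()
        self.pos_emb = nn.Embedding(max_span_len, hidden_size)
        self.mlp = nn.Sequential(
            nn.Linear(3 * hidden_size, hidden_size),
            nn.GELU(),
            nn.LayerNorm(hidden_size),
            nn.Linear(hidden_size, vocab_size),
        )

    def forward(self, left_boundary, right_boundary, rel_positions):
        # left_boundary, right_boundary: (num_targets, hidden_size)
        # rel_positions: (num_targets,) position of each target inside its span
        h = torch.cat(
            [left_boundary, right_boundary, self.pos_emb(rel_positions)], dim=-1
        )
        return self.mlp(h)  # (num_targets, vocab_size) logits
```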

Somefive (Collaborator) commented:

> While reading the code for title generation, I found that the decoding strategy is different from the one used for FOS; it looks more like a Prefix LM approach. I don't quite understand why the same generation strategy as for FOS is not used. Could you share the reason? Thanks.

We have tried various ways to train and do inference. I suppose you read the code in cogdl? The code there is not fully equivalent to the strategy in the paper, since several updates were made afterwards, but the general ideas are the same. The code in cogdl is mostly for inference rather than training (if I remember correctly and no further updates were made). The MLM loss was our first attempt at learning entity information, targeting comprehension. However, for language generation tasks, the so-called "Prefix LM" is more helpful for generating sequences, in terms of both efficiency and quality.

We actually tried using GLM and other advanced masking strategies to train the model and obtain parameters better suited to sequence generation tasks.

As far as I know, if your downstream tasks are mainly comprehension work, like cloze tasks, training with MLM could work well. But I remember there is research indicating that GPT-style training can also achieve good results. For sequence generation, I think pure MLM training is somewhat harder.
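To make the "Prefix LM"-style generation concrete: one common way to generate a title with a BERT-style masked LM is to keep the context as a fully visible prefix, append a [MASK], predict it, and repeat. A hedged sketch, assuming a HuggingFace-style masked-LM model and tokenizer and simple greedy decoding (this is not the cogdl implementation):

```python
import torch

@torch.no_grad()
def generate_title(model, tokenizer, context_ids, max_new_tokens=32):
    """Prefix-LM-style decoding sketch: append [MASK] after the context,
    predict it greedily, append the prediction, and repeat until [SEP].
    `model` and `tokenizer` stand in for HuggingFace-style objects."""
    mask_id = tokenizer.mask_token_id
    sep_id = tokenizer.sep_token_id
    ids = list(context_ids)
    for _ in range(max_new_tokens):
        inp = torch.tensor([ids + [mask_id]])
        logits = model(input_ids=inp).logits      # (1, len, vocab)
        next_id = int(logits[0, -1].argmax())     # greedy pick for the [MASK] slot
        if next_id == sep_id:                     # stop at [SEP]
            break
        ids.append(next_id)
    return tokenizer.decode(ids[len(context_ids):])
```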

blackbird11111 commented:

Hello, I would like to ask about the first step of pre-training: what kind of JSON format works best for the title, abstract, and body text of a paper? Thank you.
