Hi, MATE-KD is excellent work on KD for NLP. I have a question about this paper's code.
In Section 4.1 of the paper, the authors say that two different teacher models (RoBERTa-large and BERT-base) are used in the two steps, but the code appears to use only one teacher model. Is that right?
Also, shouldn't the two steps be trained separately? The code instead interleaves them within a single training procedure: 10 steps updating the generator's parameters, then 100 steps updating the student model. That seems weird to me.
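To pin down what I mean, the released code seems to follow an interleaved schedule rather than two separate training phases. A minimal sketch of that schedule (the function and callback names here are mine for illustration, not taken from the repo):

```python
# Hypothetical sketch of the alternating schedule I see in the code:
# each cycle runs G_STEPS generator updates, then S_STEPS student updates,
# all inside one training loop rather than two separate phases.

G_STEPS = 10    # generator updates per cycle
S_STEPS = 100   # student updates per cycle

def run_schedule(num_cycles, update_generator, update_student):
    """Interleave generator and student updates within one training run."""
    log = []
    for _ in range(num_cycles):
        for _ in range(G_STEPS):
            update_generator()   # generator step (e.g. maximize divergence)
            log.append("G")
        for _ in range(S_STEPS):
            update_student()     # student step (e.g. minimize KD loss)
            log.append("S")
    return log

# Example: two cycles produce 2 * (10 + 100) = 220 updates,
# starting with a run of 10 generator updates.
schedule = run_schedule(2, lambda: None, lambda: None)
assert len(schedule) == 220
assert schedule[:10] == ["G"] * 10
```

My question is whether this interleaving is intended, or whether the paper's two steps were meant to be two fully separate training runs.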