Hi, thanks for the interesting work!
I've been running the updated code and observe that at the pretraining stage the loss converges to ~3 (slightly above 3). Does my training show a similar tendency to your official experiment setting? If this looks correct: in the original LLaVA-1.5 pretraining, the loss finally converges to ~2, so how should I interpret this difference?
Could you also share the rough converged loss value of the fine-tuning stage?
According to your paper, Sec. 3.1: "In our experiments, we show that ViT and position embedding parameters can be kept frozen during pretraining, and updating these parameters during the instruction-tuning stage is sufficient for good performance." This implies the ViT is fine-tuned, yet in another issue the author states that the ViT is frozen the whole time. Can you clarify this point? From my understanding, since the ViT positional embeddings have to change to adapt to dynamic aspect ratios (similar to Pix2Struct), the ViT needs to be fine-tuned.
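For clarity on what I mean by the position embeddings "changing": with dynamic aspect ratios, the learned 2D position-embedding grid has to be resized to each image's patch layout, e.g. by bicubic interpolation. Here is a minimal PyTorch sketch of that idea (the function name and shapes are my own illustration, not code from this repo):

```python
import torch
import torch.nn.functional as F

def resize_pos_embed(pos_embed: torch.Tensor, old_hw, new_hw) -> torch.Tensor:
    """Interpolate learned ViT patch position embeddings to a new grid.

    pos_embed: (1, old_h * old_w, dim) patch position embeddings
    (a class-token embedding, if present, would be split off and kept as-is).
    """
    old_h, old_w = old_hw
    new_h, new_w = new_hw
    dim = pos_embed.shape[-1]
    # (1, N, dim) -> (1, dim, old_h, old_w) so F.interpolate treats it as a 2D image
    grid = pos_embed.reshape(1, old_h, old_w, dim).permute(0, 3, 1, 2)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    # back to (1, new_h * new_w, dim)
    return grid.permute(0, 2, 3, 1).reshape(1, new_h * new_w, dim)

# e.g. adapt a 24x24 grid (336px image, 14px patches) to a 16x36 grid for a wide image
pe = torch.randn(1, 24 * 24, 1024)
pe_wide = resize_pos_embed(pe, (24, 24), (16, 36))
assert pe_wide.shape == (1, 16 * 36, 1024)
```

If the embeddings are merely interpolated like this, they could in principle stay frozen, which is why I'd appreciate clarification on whether they (and the ViT) are actually updated during instruction tuning.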
Many thanks!