Cannot reproduce results on vllava datasets #81
Comments
Please use the new scripts to train VideoLLaMA2 under the Video-LLaVA settings. The previous scripts use a different projector, which is not consistent with the projector in this experiment.
I am using the latest version, unless you updated it within the last two days.
Hi Yijiang, my colleague may not have stated this clearly 😂 We updated the fine-tuning script this afternoon; please check the latest commit and launch your training jobs (on the Video-LLaVA dataset) with the new script.
Oh, thanks! Sorry for the misunderstanding. I will try it tonight.
Hi @lixin4ever and @clownrat6,

We attached all the JSON files generated here:
The environment we use:
Hi, I am going to reproduce this experiment. May I ask how many GPUs you used and how many days the run took?
Hi Yijiang, we found that the latest codebase, which was migrated from the older one (i.e., v1.0) for better compatibility with Qwen2 (and other recent LLMs), indeed suffers from performance degradation when switching the language decoder to Mistral. However, due to a lack of resources, we currently have no GPUs to verify whether the code migration caused this issue. We will continue the verification in early October; please stay tuned.
Two A100/A800 nodes (i.e., 16 GPUs) for under 20 hours (pretraining + fine-tuning).
Oh, that is much faster than I thought, thank you. Are you training the full model or using LoRA?
Also, I just tried using 8×A100s for the pretraining stage, and it estimates that pretraining will take 48 hours. Could you please clarify whether the pretraining stage should include both the Valley and LLaVA-Image datasets?
Yes, both Valley and LLaVA-Image should be included. Regarding the time cost, I just checked the pretraining log of one run, and it took around 8 hours on 2 A800 nodes (i.e., 16 80G A800s).
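As a back-of-the-envelope check on the numbers above (a sketch that assumes throughput scales roughly linearly with GPU count, ignoring interconnect and dataloading differences between setups):

```python
# Reported setup: ~8 hours of pretraining on 16 GPUs (2 A800 nodes).
reported_gpus = 16
reported_hours = 8
gpu_hours = reported_gpus * reported_hours      # total GPU-hours for pretraining

# Under linear scaling, the same job on 8 GPUs should take gpu_hours / 8.
my_gpus = 8
expected_hours = gpu_hours / my_gpus            # expected wall-clock time on 8 GPUs

# The 48-hour estimate reported above is ~3x the linear-scaling expectation,
# which suggests a bottleneck other than GPU count (e.g., data loading or I/O).
observed_hours = 48
slowdown = observed_hours / expected_hours

print(gpu_hours, expected_hours, slowdown)      # 128 16.0 3.0
```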
Thank you for your response. May I ask for the checkpoint of the model trained on the Valley dataset? I am keen to see how it performs on my custom dataset.
I used 8 A800 80G GPUs. The local and global batch sizes follow the scripts.
Hi there, I have finished my experiment to reproduce the result with vllava. My result seems consistent with the reported one.
I will update the ActivityNet performance later today; the dataset is still downloading. I have made two modifications:
@williamium3000 I compared my trainer_state.json with yours and noticed your grad_norms are much higher than mine. In addition, my loss values are slightly lower. Here is my trainer_state.json. I ran my experiment on an 8×L40S machine.
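To make this kind of comparison concrete, here is a minimal sketch that summarizes loss and grad_norm from a run. It assumes the standard Hugging Face Trainer `trainer_state.json` layout, where training entries in `log_history` carry `loss` and `grad_norm` keys; the demo file it writes is synthetic, so point the path at the attached `trainer_state.json` files to compare real runs:

```python
import json

def summarize(path):
    """Mean loss and grad_norm over the training entries of a trainer_state.json."""
    with open(path) as f:
        # Keep only log entries that carry training metrics (they have a "loss" key).
        logs = [e for e in json.load(f)["log_history"] if "loss" in e]
    mean = lambda xs: sum(xs) / len(xs)
    return mean([e["loss"] for e in logs]), mean([e["grad_norm"] for e in logs])

# Tiny synthetic example so the script runs standalone; replace the path with a
# real trainer_state.json to summarize an actual run.
demo = {"log_history": [
    {"step": 10, "loss": 2.0, "grad_norm": 4.0},
    {"step": 20, "loss": 1.0, "grad_norm": 2.0},
    {"epoch": 1.0},  # epoch/eval entries have no "loss" and are skipped
]}
with open("demo_trainer_state.json", "w") as f:
    json.dump(demo, f)

loss, gnorm = summarize("demo_trainer_state.json")
print(f"mean loss={loss:.2f}  mean grad_norm={gnorm:.2f}")  # mean loss=1.50  mean grad_norm=3.00
```

Running this on both files side by side makes systematic differences in grad_norm (as observed above) easy to quantify.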
@zhuqiangLu How do you train with LoRA? Do you mean that you first pretrain on the Valley pretraining dataset and then use LoRA to fine-tune on the SFT dataset?
That is right. The pretraining is done on the Valley dataset with the official script. To enable LoRA for SFT, simply add
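The exact arguments were omitted above. As a hypothetical sketch modeled on LLaVA-style training scripts (which VideoLLaMA2's codebase resembles), LoRA is typically switched on with flags like the following; the argument names and values are assumptions to verify against the repo's own fine-tuning script:

```
# Hypothetical flags, modeled on LLaVA-style training scripts; verify the
# exact argument names against VideoLLaMA2's fine-tuning script.
--lora_enable True --lora_r 128 --lora_alpha 256 --lora_dropout 0.05
```

to the SFT launch command.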
I believe their original result was produced by full fine-tuning. I have not tried LoRA, though.
Using the v1.0 tag version of the codebase, I retrained VideoLLaMA2 with tcadapterv35 on the Video-LLaVA dataset. It seems that the v1.0 version can reproduce the results in the paper.
The results are:
I followed @zhuqiangLu and tried to reproduce his result with LoRA.
Although it is higher than the full fine-tuning result, it is still behind the reported results and those of @zhuqiangLu.
Could it be the dataset? My vllava dataset was downloaded from Hugging Face, and I simply used the JSONs provided by VideoLLaMA2.
Could you please try the v1.0 tag version of the code? You can check out this version with `git checkout v1.0`.
Yeah, we have observed a similar trend: v1.0 does seem better. However, I am wondering how @zhuqiangLu achieved such a high result. @zhuqiangLu, did you use v1.0 as well?
It was the default branch, but I currently have no GPUs available for this training. I will update when I finish training with the v1.0 branch.
Hello, I'm a PhD student from ZJU, and I also use VideoLLaMA2 in my own research. We created a WeChat group to discuss VideoLLaMA2 issues and help each other; would you like to join us? Please contact me: WeChat number: LiangMeng19357260600, phone number: +86 19357260600, e-mail: [email protected].
Dear authors of VideoLLaMA2,
Thanks for the great work. We tried to reproduce your results on the vllava datasets using the latest version of the code. However, we observe a large discrepancy on the three test datasets.
We directly used your code and followed your instructions to download the vllava datasets as well as the three test sets, i.e., MVBench, EgoSchema, and ActivityNet.
Could you hint at how you achieved the reported average of 45.1?
Best,
Yijiang