Inquiry Regarding Discrepancy in Results Using the Same Model and Methodology as Presented in Your Paper #16
Could you provide more experimental details? Moreover, our method is a decoding strategy that can be applied to any model to mitigate its hallucinations.
Specifically, we evaluated the unmodified LLaVA-1.5-7B model on the MSCOCO dataset using the POPE evaluation. Taking accuracy as an example:

| POPE split | Unmodified LLaVA-1.5-7B (ours) | Regular (VCD paper) | VCD (VCD paper) |
| --- | --- | --- | --- |
| Random | 88.5 | 83.3 | 87.7 |
| Popular | 87.3 | 81.8 | 85.38 |
| Adversarial | 85.2 | 78.96 | 80.88 |
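For reference, POPE accuracy here is just binary yes/no accuracy over the probing questions. A minimal sketch of the computation (the JSONL field names `label` and `answer` are an assumption, not the official evaluation script):

```python
# Sketch of POPE accuracy: binary yes/no accuracy over probing questions.
# The JSONL layout ("label", "answer") is an assumed format, for illustration.
import json

def pope_accuracy(path: str) -> float:
    correct = total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            label = record["label"].strip().lower()    # ground truth: "yes" / "no"
            answer = record["answer"].strip().lower()  # raw model output
            pred = "yes" if answer.startswith("yes") else "no"
            correct += pred == label
            total += 1
    return 100.0 * correct / total
```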
Hi, may I know which decoding strategy you are applying? Different decoding strategies can affect the baseline performance quite significantly. You can refer to our Appendix for more ablations.
We adopted greedy decoding. In your Appendix, the accuracy with VCD is 88.49, which is lower than the regular result of 88.5.
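Concretely, by greedy we mean something like the following sketch (GPT-2 is used only so the snippet is self-contained; the same `generate` flags apply to a LLaVA-1.5 checkpoint):

```python
# Sketch of greedy decoding with Hugging Face transformers. GPT-2 stands in
# so the example runs standalone; this is not the actual evaluation script.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Is there a dog in the image? Answer:", return_tensors="pt")

# do_sample=False with num_beams=1 is greedy search:
# a deterministic argmax over the vocabulary at each step.
greedy_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=8)
print(tok.decode(greedy_ids[0], skip_special_tokens=True))
```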
There are quite a few factors that can cause this kind of small performance difference (e.g., torch versions). You can try to reproduce our method in your environment to see whether it brings benefits. If you have further questions, you can also upload your evaluation scripts and checkpoints for discussion. Thanks.
What temperature coefficient was set on the LLaVA-1.5-7B model to obtain the results in Table 1 of your paper?
Our main paper states that we use direct sampling, without temperature normalization, top-k, or top-p sampling.
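In Hugging Face `generate` terms, direct sampling corresponds roughly to the sketch below (reusing the model and inputs from the greedy snippet above; the flag values are our reading of "direct sampling", not the paper's exact script):

```python
# Direct (multinomial) sampling: do_sample=True with temperature=1.0,
# top_k=0, and top_p=1.0 leaves the model's distribution untouched and
# samples from it directly. Reuses model/inputs from the greedy sketch.
sampled_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,   # no temperature normalization
    top_k=0,           # disable top-k filtering
    top_p=1.0,         # disable nucleus (top-p) filtering
    max_new_tokens=8,
)
print(tok.decode(sampled_ids[0], skip_special_tokens=True))
```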
Could the authors please explain the phenomenon where all indicators dropped by 2–4% after employing the VCD method on LLaVA-1.5-7B/13B?
Could you please provide more details? For future inquiries, please include the details up front to avoid confusion and save time for both sides. Thanks.
Hi Sicong, I also wonder why the results reported in your paper are much lower than the performance reported in the original LLaVA-1.5 paper. For example, the original LLaVA-1.5 paper reports POPE F1 scores on the three splits (Random, Adversarial, and Popular) of 87.3, 86.1, and 84.2, but in your paper the results are 81.33, 77.57, and 80.06.
I'm also very confused. Could @LengSicong please help explain and provide more details?
Hi, the checkpoint version and the decoding strategy should contribute the most. Different decoding configurations, such as the temperature, top-p, or top-k values, can greatly affect the results.
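To illustrate why these parameters matter, here is a simplified 1-D sketch of how they reshape the next-token logits before sampling (illustrative only, not the exact transformers implementation; with `temperature=1.0`, `top_k=0`, and `top_p=1.0` it reduces to the direct sampling discussed above):

```python
# Sketch of temperature / top-k / top-p filtering applied to next-token
# logits before sampling. Simplified 1-D version for illustration.
import torch

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    if top_k > 0:  # keep only the k largest logits
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:  # nucleus filtering: keep the smallest set covering top_p mass
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum > top_p
        remove[1:] = remove[:-1].clone()  # shift so the crossing token is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)

next_id = sample_next_token(torch.randn(32000))  # e.g. a 32k LLaMA-style vocab
```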
Thanks for your reply. In my understanding, both the MME and POPE evaluations use greedy search to ensure reproducibility of results, so there shouldn't be issues related to sampling parameters like top-k. May I ask whether the weights you are using are the official LLaVA-v1.5-7B weights?
No, it's direct sampling. We've conducted 5 runs for each experiment and report the average and standard deviation for reproducibility.
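Roughly along these lines (`run_pope_eval` is a hypothetical stand-in for one full evaluation pass, not a function from the released code):

```python
# Sketch of the reporting protocol described above: repeat the sampled
# evaluation with different seeds and report mean ± std over the runs.
import statistics
import torch

def run_pope_eval() -> float:
    raise NotImplementedError  # placeholder: run the model over POPE, return accuracy

accs = []
for seed in range(5):  # the authors report 5 runs per experiment
    torch.manual_seed(seed)  # vary the sampling randomness across runs
    accs.append(run_pope_eval())

print(f"acc = {statistics.mean(accs):.2f} ± {statistics.stdev(accs):.2f}")
```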
Could you please confirm whether there is an issue with the POPE and MME results for LLaVA-1.5-7B in your study? I have been unable to replicate the results presented in your tables using the same model and methodology you described. Moreover, the results from your paper are not only inconsistent with my findings but also appear to be outperformed by a completely unmodified model. Is this approach intended to be a method that negatively impacts performance?