Inquiry Regarding Discrepancy in Results Using the Same Model and Methodology as Presented in Your Paper #16
Could you provide more experimental details? Moreover, our method is a decoding strategy that can be applied to any model to mitigate its hallucinations.
Specifically, we evaluated the unmodified LLaVA-1.5-7B model on the MSCOCO dataset using the POPE evaluation. Taking accuracy as an example:

| POPE split | Unmodified LLaVA-1.5-7B (ours) | Regular (VCD paper) | VCD (VCD paper) |
| --- | --- | --- | --- |
| Random | 88.5 | 83.3 | 87.7 |
| Popular | 87.3 | 81.8 | 85.38 |
| Adversarial | 85.2 | 78.96 | 80.88 |
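For reference, POPE accuracy here is just binary yes/no accuracy over the probing questions. A minimal sketch of the computation (the JSONL field names `label` and `answer` are an assumption, not the official evaluation script):

```python
# Sketch of POPE accuracy: binary yes/no accuracy over probing questions.
# The JSONL layout ("label", "answer") is an assumed format, for illustration.
import json

def pope_accuracy(path: str) -> float:
    correct = total = 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            label = record["label"].strip().lower()    # ground truth: "yes" / "no"
            answer = record["answer"].strip().lower()  # raw model output
            pred = "yes" if answer.startswith("yes") else "no"
            correct += pred == label
            total += 1
    return 100.0 * correct / total
```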
Hi, may I know which decoding strategy you are applying? Different decoding strategies can affect the baseline performance quite significantly. You can refer to our Appendix for more ablations.
We adopted greedy decoding. In your Appendix, the accuracy with VCD is 88.49, which is lower than the regular result of 88.5.
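Concretely, by greedy we mean something like the following sketch (GPT-2 is used only so the snippet is self-contained; the same `generate` flags apply to a LLaVA-1.5 checkpoint):

```python
# Sketch of greedy decoding with Hugging Face transformers. GPT-2 stands in
# so the example runs standalone; this is not the actual evaluation script.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
inputs = tok("Is there a dog in the image? Answer:", return_tensors="pt")

# do_sample=False with num_beams=1 is greedy search:
# a deterministic argmax over the vocabulary at each step.
greedy_ids = model.generate(**inputs, do_sample=False, num_beams=1, max_new_tokens=8)
print(tok.decode(greedy_ids[0], skip_special_tokens=True))
```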
There are quite a few factors that can cause this kind of small performance difference (e.g., torch versions). You can try to reproduce our method in your environment to see whether it brings benefits. If you have further questions, you can also upload your evaluation scripts and checkpoints for discussion. Thanks.
What temperature coefficient was set on the LLaVA-1.5-7B model to obtain the results in Table 1 of your paper?
Our main paper states that we use direct sampling, without temperature normalization, top-k, or top-p sampling.
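In Hugging Face `generate` terms, direct sampling corresponds roughly to the sketch below (reusing the model and inputs from the greedy snippet above; the flag values are our reading of "direct sampling", not the paper's exact script):

```python
# Direct (multinomial) sampling: do_sample=True with temperature=1.0,
# top_k=0, and top_p=1.0 leaves the model's distribution untouched and
# samples from it directly. Reuses model/inputs from the greedy sketch.
sampled_ids = model.generate(
    **inputs,
    do_sample=True,
    temperature=1.0,   # no temperature normalization
    top_k=0,           # disable top-k filtering
    top_p=1.0,         # disable nucleus (top-p) filtering
    max_new_tokens=8,
)
print(tok.decode(sampled_ids[0], skip_special_tokens=True))
```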
Could the authors please explain the phenomenon where all indicators dropped by 2–4% after employing the VCD method on LLaVA-1.5-7B/13B?
Could you please provide more details? For future inquiries, please include the details up front to avoid confusion and save time for both sides. Thanks.
Hi Sicong, I also wonder why the results reported in your paper are much lower than the performance reported in the original LLaVA-1.5 paper. For example, the original LLaVA-1.5 paper reports POPE F1 scores on the three splits (Random, Adversarial, and Popular) of 87.3, 86.1, and 84.2, but in your paper the results are 81.33, 77.57, and 80.06.
I'm also very confused. Could @LengSicong please help explain and provide more details?
Hi, the checkpoint version and the decoding strategy should contribute the most. Different decoding configurations, such as the temperature, top-p, or top-k values, can greatly affect the results.
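To illustrate why these parameters matter, here is a simplified 1-D sketch of how they reshape the next-token logits before sampling (illustrative only, not the exact transformers implementation; with `temperature=1.0`, `top_k=0`, and `top_p=1.0` it reduces to the direct sampling discussed above):

```python
# Sketch of temperature / top-k / top-p filtering applied to next-token
# logits before sampling. Simplified 1-D version for illustration.
import torch

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    logits = logits / temperature  # <1 sharpens, >1 flattens the distribution
    if top_k > 0:  # keep only the k largest logits
        kth = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth, float("-inf"))
    if top_p < 1.0:  # nucleus filtering: keep the smallest set covering top_p mass
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum = torch.softmax(sorted_logits, dim=-1).cumsum(dim=-1)
        remove = cum > top_p
        remove[1:] = remove[:-1].clone()  # shift so the crossing token is kept
        remove[0] = False
        logits[sorted_idx[remove]] = float("-inf")
    return torch.multinomial(torch.softmax(logits, dim=-1), 1)

next_id = sample_next_token(torch.randn(32000))  # e.g. a 32k LLaMA-style vocab
```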
Thanks for your reply. In my understanding, both the MME and POPE evaluations use greedy search to ensure reproducibility of results, so there shouldn't be issues related to sampling parameters like top-k. May I ask whether the weights you are using are the official LLaVA-v1.5-7B weights?
No, it's direct sampling. We've conducted 5 runs for each experiment and report the average and standard deviation for reproducibility.
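Roughly along these lines (`run_pope_eval` is a hypothetical stand-in for one full evaluation pass, not a function from the released code):

```python
# Sketch of the reporting protocol described above: repeat the sampled
# evaluation with different seeds and report mean ± std over the runs.
import statistics
import torch

def run_pope_eval() -> float:
    raise NotImplementedError  # placeholder: run the model over POPE, return accuracy

accs = []
for seed in range(5):  # the authors report 5 runs per experiment
    torch.manual_seed(seed)  # vary the sampling randomness across runs
    accs.append(run_pope_eval())

print(f"acc = {statistics.mean(accs):.2f} ± {statistics.stdev(accs):.2f}")
```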
Could you please confirm whether there is an issue with the POPE and MME results for LLaVA-1.5-7B in your study? I have been unable to replicate the results presented in your tables using the same model and methodology you described. Moreover, the results from your paper are not only inconsistent with my findings but also appear to be outperformed by a completely unmodified model. Is this approach intended to be a method that negatively impacts performance?