Evaluation code for benchmark #16

Open · iisxuwei opened this issue Dec 22, 2023 · 3 comments

@iisxuwei

Hi, I'm very interested in your work and would like to know whether the evaluation code for the benchmark will be released. Also, isn't selecting only 100 images for evaluation too few and potentially unfair? Although it seems there is no alternative given the API limitations of GPT-4V.

Looking forward to your reply.

@jwyang (Member) commented Dec 23, 2023

@iisxuwei, thanks for your interest!

We will release the evaluation code for the benchmark very soon, during this holiday season. Stay tuned!

It is indeed unfair to compare with other methods that were evaluated on the full validation set, but unfortunately we did not have enough quota to call GPT-4V. We did evaluate some methods on our own samples and noted this in our table.

We are looking into how to set up a better evaluation pipeline for all methods.

Thanks,

@iisxuwei (Author)

Hi, I'm wondering about some of the metrics in the evaluation benchmark, like mIoU and Acc@0.5. On the benchmark page, the prompts for REC and RES are the same, and GPT's return is a mark number or a range of mark numbers. How do you compute the metrics from GPT's return?
By the way, I'm curious: if GPT's return is not a single number but several numbers, how do you judge it? Or should multiple rounds of dialogue be used to correct the output?
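
To make the question concrete, here is a rough sketch of how I imagine the metrics could be computed once the returned mark number is mapped back to its mask and box; the sample layout and field names below are my own assumptions, not the released evaluation code:

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, gt_mask: np.ndarray) -> float:
    """IoU between two binary masks."""
    inter = np.logical_and(pred_mask, gt_mask).sum()
    union = np.logical_or(pred_mask, gt_mask).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def box_iou(pred_box, gt_box) -> float:
    """IoU between two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(pred_box[0], gt_box[0]), max(pred_box[1], gt_box[1])
    x2, y2 = min(pred_box[2], gt_box[2]), min(pred_box[3], gt_box[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = ((pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
             + (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1]) - inter)
    return inter / union if union > 0 else 0.0

def evaluate(samples):
    """Each sample holds the mark number parsed from GPT-4V's reply ('pred_mark'),
    the per-mark masks/boxes from the segmenter, and the ground truth."""
    ious, hits = [], 0
    for s in samples:
        pred_mask = s["mark_masks"][s["pred_mark"]]  # mask of the chosen mark
        pred_box = s["mark_boxes"][s["pred_mark"]]   # box of the chosen mark
        ious.append(mask_iou(pred_mask, s["gt_mask"]))       # RES: averaged into mIoU
        hits += int(box_iou(pred_box, s["gt_box"]) >= 0.5)   # REC: Acc@0.5
    return {"mIoU": float(np.mean(ious)), "Acc@0.5": hits / len(samples)}
```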
Looking forward to your reply.

@iisxuwei (Author) commented Mar 1, 2024

Hi!
I've been following this project and am very interested in its progress. Could you please provide any recent updates?

P.S. I am very confused about the RefCOCOg results in the experimental section (Table 2).

  • The published benchmark data does not match the number of instances (177) reported in the paper.
  • The RefCOCOg benchmark lacks the mask data with the corresponding labels, making it impossible to reproduce the REC and RES results.

I hope you can reply to me as soon as possible. I am very interested in your work and would like to cite it.

Thank you!
