
Request for answer extraction script and clarification on some experiment details #1

jinghan23 opened this issue Jun 19, 2024 · 1 comment


Hi authors, thanks for your amazing work, which contributes a lot to long video understanding!
I'm reproducing your experiments on LLaVA-NeXT-Video. I've run into some problems and would like to know how you solved them.

  1. Would you mind providing details on which LLaVA-NeXT-Video model you tested: lmms-lab/LLaVA-NeXT-Video-34B-DPO, lmms-lab/LLaVA-NeXT-Video-7B-DPO, or the models without DPO?
  2. I experimented with lmms-lab/LLaVA-NeXT-Video-7B-DPO first and found that the current instruction doesn't ask the assistant to answer with one exact option. It sometimes replies with a paragraph of reasoning, which makes extracting answers difficult. Would you mind providing your answer extraction script, or shedding some light on how to evaluate the raw responses? (Or are you evaluating in a perplexity-based mode, e.g., appending each of the four options to the instruction separately and picking the one with the lowest perplexity?)
  3. When evaluating on LVBench with the official LLaVA-NeXT-Video repo, I found that some videos cannot be read because the decord library does not currently support the AV1 codec. I edited the video2dataset package as described in this issue and re-downloaded the LVBench videos using your download.sh, but four videos still fail to be processed. I would really appreciate it if you could share experiment details, such as how you download the videos and handle the AV1 codec.
  4. Section 4.4 of the paper makes a very interesting point about "using large language models (LLMs) to filter question-answer pairs", but I'm a little confused about what LLM filtering means. Does it mean providing only the instruction and question as input, without the video, and checking which option the LLM guesses? It's hard to understand how this method achieves an even higher score than providing the video as input. Would you mind briefly clarifying this interesting finding?

Thanks in advance for your helpful reply.

@huangshiyu13 (Member) commented Jun 20, 2024

Response:

  1. lmms-lab/LLaVA-NeXT-Video-34B-DPO
  2. You can use this script to get the final option (no perplexity-based mode); an illustrative extraction sketch follows this list.
  3. We save all the videos as mp4 files. You can convert the videos with ffmpeg first, then read them with decord (see the conversion sketch after this list).
  4. We found that the LLM can easily guess the answers to some questions, so we removed them from the original dataset (see the filtering sketch after this list). Below is a filtered-out example from the original dataset:
{
    'question': 'Why is the whole body of a man covered with white cloth?\n(A) He is sleepy\n(B) He is dead\n(C) He is tired\n(D) He is married',
    'answer': 'B'
}
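For point 2, the linked script is not reproduced in this thread. Purely as an illustration of what such option extraction might look like, here is a minimal Python sketch; the regex patterns and fallback behavior are assumptions, not the authors' actual logic:

```python
import re

def extract_option(response: str) -> str | None:
    """Pull a single option letter A-D out of a free-form model response.

    Illustrative sketch only: first try explicit patterns such as
    "(B)", "Answer: B", or "the answer is B", then fall back to the
    first standalone letter A-D anywhere in the text.
    """
    m = re.search(r"(?:answer\s*(?:is)?\s*[:\-]?\s*|\()\s*([ABCD])\b",
                  response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # Fallback: first standalone capital A-D token in the response.
    m = re.search(r"\b([ABCD])\b", response)
    return m.group(1) if m else None
```

For example, extract_option("The correct answer is (B), because the man is covered...") returns 'B'.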

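For point 3, a minimal sketch of the convert-then-read flow; the paths, ffmpeg flags, and frame-sampling step here are our assumptions, not the authors' exact pipeline:

```python
import subprocess
import decord

def convert_and_read(src_path: str, dst_path: str, num_frames: int = 32):
    """Transcode an AV1 video to H.264 so decord can open it, then sample frames."""
    # Re-encode the video stream to H.264 and drop audio; decord reads H.264 mp4 reliably.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-c:v", "libx264", "-an", dst_path],
        check=True,
    )
    vr = decord.VideoReader(dst_path)
    # Uniformly sample num_frames frame indices across the whole video.
    step = max(len(vr) // num_frames, 1)
    indices = list(range(0, len(vr), step))[:num_frames]
    return vr.get_batch(indices)  # (num_frames, H, W, 3) array of decoded frames
```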
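And for point 4, the filtering idea as described amounts to asking a text-only LLM the question without the video and discarding items it answers correctly. A hedged sketch follows; query_llm is a hypothetical callable standing in for whatever model the authors used, and their actual prompt and removal criterion are not specified in this thread:

```python
def filter_guessable(dataset, query_llm):
    """Keep only QA pairs that a text-only LLM cannot answer without the video.

    `dataset` is a list of dicts like the example above; `query_llm` is a
    hypothetical callable taking a prompt string and returning an option letter.
    """
    kept = []
    for item in dataset:
        prompt = ("Answer the following multiple-choice question with a "
                  "single option letter.\n" + item["question"])
        if query_llm(prompt) != item["answer"]:
            # The LLM could not guess the answer blind, so the question
            # genuinely requires watching the video: keep it.
            kept.append(item)
    return kept
```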