video-SALMONN-o1

Abstract

While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLM), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose the process direct preference optimization (DPO), which leverages contrastive step selection to achieve efficient step-level reward modeling tailored for multimodal inputs. Additionally, we introduce AVRBench, the first comprehensive audio-visual reasoning benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. Besides, process DPO achieves 6-8% improvements compared to the SFT-only model. Furthermore, enhanced reasoning enables video-SALMONN-o1 zero-shot synthetic video detection capabilities.

Process DPO and Contrastive Rollout

Demo Videos

Figure 13 Video

Question: What does the speaker imply by saying he 'didn't need to know that' at the end of the video? Choose from:
A. He regrets ordering so much food.
B. When you want to eat, just eat and don't prioritize calories.
C. He doesn't like knowing calorie counts of food.
D. He feels guilty about his food choices.
E. The calorie count was surprisingly low.
Answer the question step by step. Output each thinking step.

Reasoning:

The The speaker is making a joke about how fast food restaurants use calorie counts to manipulate customers into buying more food.
He says he didn't need to know that because he already knew he was going to eat a lot of calories.
His tone and delivery suggest he's not actually upset about the calorie count, but rather finds it humorous that he doesn't need to be told how unhealthy his choices are. (Combining audio-visual information)
Therefore, the speaker implies that when you want to eat, just eat and don't prioritize calories.
The answer is B

Figure 14 Video

Question: What is on the legend when he is hit by the turret? Choose from:
A. Sand.
B. Ice.
C. Thunder.
D. Fire.
Answer the question step by step.

Reasoning:

The question asks about the legend when he is hit by the turret.
The legend is a character in the game, and turrets are structures that fire at characters.
The legend is hit by a turret at around 0:15 seconds into the video.
When the legend is hit by the turret, there is a burst of fire.
Therefore, the answer is D. Fire.

Figure 15 Video

Question: Which of the following elements does not appear in the video? Choose from:
A. Iceberg.
B. Moon.
C. Earth.
D. River.
Answer the question step by step.

Reasoning:

The question asks which element is not present in the video.
We need to carefully watch the video and identify all the elements shown.
The video shows various natural landscapes like mountains, waterfalls, forests, deserts, oceans, and a view from space.
However, the video does not show the moon.
Therefore, the answer is B. Moon.

Figure 16 Video

Question: An AI-generated video contains unnatural distorted things, such as distorted hands or faces. Is the given video AI generated? Answer YES or NO. Answer step by step and output each step clearly.

Reasoning:

The video shows a man running towards a train, then collapsing in pain.
The background is blurred and the colors are distorted, suggesting motion blur.
AI-generated videos often use techniques like motion blur to create a sense of speed or urgency.
This can sometimes result in unnatural distortions in the image.
In this case, the blurring and distortion are more pronounced than what would be expected from natural camera shake or movement.
The hands of the man appear particularly distorted, which is not typical for real-life footage.
Therefore, based on the visual cues of the video, it is likely that the video is AI generated.
So the answer is YES.

Figure 17 Video

Question: An AI-generated video contains unnatural distorted things, such as distorted hands or faces. Is the given video AI generated? Answer YES or NO. Answer step by step and output each step clearly.

Reasoning:

The video shows a cyclist riding on a road with a field of flowers in the background.
The colors are vibrant and the scene is smooth, suggesting it might be AI generated.
Upon closer inspection, we can see that the cyclist's shadow is perfectly aligned with the cyclist, which is unusual for a moving object. This suggests a lack of natural motion blur or distortion, a common characteristic of AI-generated videos.
Therefore, based on the smoothness of the scene and the unnatural alignment of the cyclist's shadow, it is highly likely that the video is AI generated.
So the answer is YES.

Model Checkpoint

Coming soon...

Training

Coming soon...

Inference

Coming soon...

Name		Name	Last commit message	Last commit date
Latest commit History 12 Commits
images		images
video		video
README.md		README.md
index.html		index.html
script.js		script.js
styles.css		styles.css
videos.json		videos.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

video-SALMONN-o1

Abstract

Process DPO and Contrastive Rollout

Demo Videos

Model Checkpoint

Training

Inference

About

Releases

Packages

Languages

BriansIDP/video-SALMONN-o1

Folders and files

Latest commit

History

Repository files navigation

video-SALMONN-o1

Abstract

Process DPO and Contrastive Rollout

Demo Videos

Model Checkpoint

Training

Inference

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages