Can't hear the audio #112

Open
sjghh opened this issue Oct 25, 2024 · 12 comments

Comments

sjghh commented Oct 25, 2024

import sys
sys.path.append('./')
from videollama2 import model_init, mm_infer
from videollama2.utils import disable_torch_init

def inference():
    disable_torch_init()

    # Video Inference
    modal = 'video'
    modal_path = '/data/video-llama2-av/VideoLLaMA2-audio_visual/assets/00001.mp4'
    instruct = 'What exactly did the person in the video say?'

    model_path = '/data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV'
    model, processor, tokenizer = model_init(model_path)
    output = mm_infer(processor[modal](modal_path), instruct, model=model, tokenizer=tokenizer, do_sample=False, modal=modal)

    print(output)

if __name__ == "__main__":
    inference()

The output is: "The person in the video spoke a few words, but they were not audible."
I fed in a video that does have sound, but the model doesn't seem to pick it up. Is the audio branch not working properly? Also, I changed "mm_audio_tower" in VideoLLaMA2.1-7B-AV/config.json to point to the provided BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt. Is this the correct place to make that change? Thanks for your reply!

xinyifei99 (Collaborator) commented Oct 25, 2024

Thanks for your attention! Currently, our audio branch mainly focuses on understanding audio events and does not yet include speech recognition, so the model cannot recognize the specific content of what the speaker says. Also, you should switch to the audio_visual branch (https://github.com/DAMO-NLP-SG/VideoLLaMA2/tree/audio_visual) and clone that branch of the repository to run inference for audio-visual tasks.
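
A minimal sketch of the branch switch described above (assuming git and pip are available, and that the audio_visual branch ships a requirements.txt like the main branch does):

git clone -b audio_visual https://github.com/DAMO-NLP-SG/VideoLLaMA2.git VideoLLaMA2-audio_visual
cd VideoLLaMA2-audio_visual
# install the branch's dependencies before running audio-visual inference
pip install -r requirements.txt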

sjghh (Author) commented Oct 26, 2024

Thank you for your response. I have a few more questions.

First question: I have some video data that I want to fine-tune on. In va_joint.sh I use --data_path ${DATA_DIR}/stage3_video_audio.json,${DATA_DIR}/stage2_audio_subset_new.json,${DATA_DIR}/stage2_video_subset.json. How should I design this? My understanding is that stage3_video_audio.json and stage2_audio_subset_new.json use the same set of videos, while ${DATA_DIR}/stage2_video_subset.json uses the audio from those videos.

Second question: I want to further train using VideoLLaMA2.1-7B-AV. How should I modify va_joint.sh? Additionally, what should I pay attention to during this process? Is it possible to see the prompts you used in your paper?

Looking forward to your response, and thank you again!

xinyifei99 (Collaborator) commented Oct 26, 2024

For the first question: stage3_video_audio.json contains the newly added audio-video data for the joint training stage, stage2_video_subset.json is the video subset used in the stage-2 video training, and stage2_audio_subset_new.json is the audio subset used in the stage-2 audio training.
For the second question: for stage3_video_audio.json and stage2_video_subset.json, which store video data, the data formats fall mainly into the following two categories:
[screenshot of the two video data formats]
For stage2_audio_subset_new.json, which stores audio data, the data format is as follows:
[screenshot of the audio data format]
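
Since the screenshots above do not survive as text, here is a minimal sketch of the LLaVA-style layout the repo's custom data files use; the video entry follows the main branch's documented custom.json convention, while the "audio" key and the <audio> placeholder are assumptions — the maintainer's screenshots remain the authoritative reference:

cat > ${DATA_DIR}/custom.json <<'EOF'
[
  {
    "id": 0,
    "video": "videos/xxx.mp4",
    "conversations": [
      {"from": "human", "value": "<video>\nWhat is happening in this video?"},
      {"from": "gpt", "value": "A person is speaking to the camera."}
    ]
  },
  {
    "id": 1,
    "audio": "audios/xxx.wav",
    "conversations": [
      {"from": "human", "value": "<audio>\nWhat sound event can be heard?"},
      {"from": "gpt", "value": "A dog is barking."}
    ]
  }
]
EOF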

xinyifei99 reopened this Oct 26, 2024
sjghh (Author) commented Oct 27, 2024

Thank you again for your response. Can I use only stage3_video_audio.json for the fine-tuning of the model? If so, should I simply provide the .json file for joint training in line 45 of va_joint.sh like this: --data_path ${DATA_DIR}/stage3_video_audio.json? Additionally, I would like to train on VideoLLaMA2.1-7B-AV. Should I change line 43 from --model_path DAMO-NLP-SG/VideoLLaMA2.1-7B-16F to VideoLLaMA2.1-7B-AV?

Thank you for taking the time to answer my question amidst your busy schedule!

@xinyifei99 (Collaborator)

You can fine-tune the model using only stage3_video_audio.json, i.e. --data_path ${DATA_DIR}/stage3_video_audio.json; you can also set --model_path DAMO-NLP-SG/VideoLLaMA2.1-7B-AV to continue training from VideoLLaMA2.1-7B-AV.
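
Concretely, that amounts to two edits in scripts/custom/va_joint.sh (a sketch based on the flags quoted in this thread; exact line numbers may differ between versions):

    --model_path DAMO-NLP-SG/VideoLLaMA2.1-7B-AV \
    --data_path ${DATA_DIR}/stage3_video_audio.json \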

sjghh (Author) commented Oct 27, 2024

Thank you very much for your response. When executing bash scripts/custom/va_joint.sh, I encountered the following error:
Traceback (most recent call last):
  File "/data/VideoLLaMA2-audio_visual/videollama2/train.py", line 683, in <module>
    train()
  File "/data/VideoLLaMA2-audio_visual/videollama2/train.py", line 590, in train
    model.get_model().initialize_audio_modules(
  File "/data/VideoLLaMA2-audio_visual/./videollama2/model/videollama2_arch.py", line 126, in initialize_audio_modules
    self.config.mm_hidden_size_a = audio_tower_cfg.encoder_embed_dim
UnboundLocalError: local variable 'audio_tower_cfg' referenced before assignment
Is this error caused by my audio_tower not loading correctly? I was already able to run VideoLLaMA2 inference successfully. The va_joint.sh I used is as follows:
#!/bin/bash

# Environment Variables
ARG_WORLD_SIZE=${1:-1}
ARG_NPROC_PER_NODE=${2:-8}
ARG_MASTER_ADDR="127.0.0.1"
ARG_MASTER_PORT=16666
ARG_RANK=0

# Multiple conditions
if [ ! -n "$WORLD_SIZE" ] || [ ! -n "$NPROC_PER_NODE" ]; then
    WORLD_SIZE=$ARG_WORLD_SIZE
    NPROC_PER_NODE=$ARG_NPROC_PER_NODE
fi
if [ ! -n "$MASTER_ADDR" ] || [ ! -n "$MASTER_PORT" ] || [ ! -n "$RANK" ]; then
    MASTER_ADDR=$ARG_MASTER_ADDR
    MASTER_PORT=$ARG_MASTER_PORT
    RANK=$ARG_RANK
fi

echo "WORLD_SIZE: $WORLD_SIZE"
echo "NPROC_PER_NODE: $NPROC_PER_NODE"

# Training Arguments
GLOBAL_BATCH_SIZE=128
LOCAL_BATCH_SIZE=4
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]

# Log Arguments
export TRANSFORMERS_OFFLINE=1
export WANDB_PROJECT=audio_visual_stage3_qwen2
RUN_NAME=audio_visual_stage3_qwen2
DATA_DIR=/data/VideoLLaMA2-audio_visual/datasets
OUTP_DIR=work_dirs

torchrun --nnodes $WORLD_SIZE \
    --nproc_per_node $NPROC_PER_NODE \
    --master_addr=$MASTER_ADDR \
    --master_port=$MASTER_PORT \
    --node_rank $RANK \
    videollama2/train.py \
    --deepspeed scripts/zero2.json \
    --model_type videollama2_qwen2 \
    --model_path /data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV \
    --data_folder ${DATA_DIR} \
    --data_path ${DATA_DIR}/custom.json \
    --vision_tower /data/video-llama2-av/av-weight/siglip-so400m-patch14-384 \
    --audio_tower /data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV/audio_tower.bin \
    --pretrain_mm_mlp_adapter_a /data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-AV/mm_projector_a.bin \
    --mm_projector_type stc_connector_v35 \
    --mm_projector_a_type mlp2x_gelu \
    --va True \
    --tune_audio_tower True \
    --tune_adapter_llm True \
    --tune_mm_mlp_adapter_a True \
    --mm_vision_select_layer -2 \
    --image_aspect_ratio pad \
    --num_frames 16 \
    --bf16 True \
    --tf32 True \
    --fp16 False \
    --output_dir $OUTP_DIR/${WANDB_PROJECT}/VideoLLaMA2.1-7B-AV \
    --num_train_epochs 2 \
    --per_device_train_batch_size $LOCAL_BATCH_SIZE \
    --per_device_eval_batch_size 4 \
    --gradient_accumulation_steps $GRADIENT_ACCUMULATION_STEPS \
    --evaluation_strategy "no" \
    --save_strategy "steps" \
    --save_steps 2000 \
    --save_total_limit 2 \
    --learning_rate 2e-5 \
    --weight_decay 0. \
    --warmup_ratio 0.03 \
    --lr_scheduler_type "cosine" \
    --logging_steps 1 \
    --model_max_length 2048 \
    --gradient_checkpointing True \
    --dataloader_num_workers 4 \
    --lazy_preprocess True \
    --report_to tensorboard \
    --run_name $RUN_NAME
I have made the following changes to VideoLLaMA2.1-7B-AV/config.json:
"mm_audio_tower": "/data/video-llama2-av/av-weight/BEATs_iter3_plus_AS2M_finetuned_on_AS2M_cpt2.pt",
"mm_vision_tower": "/data/video-llama2-av/av-weight/siglip-so400m-patch14-384",
"_name_or_path": "/data/video-llama2-av/av-weight/VideoLLaMA2.1-7B-16F".
Thank you again for your help!

Zzitang commented Oct 27, 2024

I solved this by adding self. before every occurrence of audio_tower_cfg in videollama2/model/videollama2_arch.py.
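
To apply this workaround by hand, the lines to patch can be located first (a hedged helper command, not something provided by the repo):

grep -n "audio_tower_cfg" videollama2/model/videollama2_arch.py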

sjghh (Author) commented Oct 27, 2024

Thank you for your response. May I ask what GPU setup you used to get it running? I used 8 A100-40G GPUs, but I keep getting the following error:

[2024-10-27 17:04:36,988] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -9) local_rank: 6 (pid: 2829472) of binary: /opt/conda/envs/Videollama2/bin/python
Traceback (most recent call last):
  File "/opt/conda/envs/Videollama2/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/envs/Videollama2/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

In addition, I made the following adjustments:

GLOBAL_BATCH_SIZE=32
LOCAL_BATCH_SIZE=1
--num_frames 8
--bf16 False
--tf32 True
--fp16 True

But it still cannot train properly.

@ffcarina

I encountered the same issue. Does further fine-tuning of the VideoLLaMA2.1-7B-AV model require a larger GPU? I modified the va_joint.sh script to fine-tune the AV model, but kept getting OOM errors. However, I was able to fine-tune the VideoLLaMA2-7B model on the same GPU before.
Could you kindly provide an official script for further fine-tuning the VideoLLaMA2.1-7B-AV model?
Thank you very much. Looking forward to your response.

@Huskyii24

Has your problem been solved? I'm also having issues with OOM

@ffcarina

No... Without an official response and unsure how to solve it, I have temporarily put it aside.

@Huskyii24

I used 6 A100-80G GPUs and set the local batch size to 2, and it worked...
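
For reference, with that setup (1 node, 6 GPUs, local batch size 2) the accumulation formula already in va_joint.sh works out as follows; the shell's integer arithmetic truncates, so the effective global batch size becomes 6*2*10 = 120 rather than 128 (a sketch of the existing formula, not a recommended change):

WORLD_SIZE=1
NPROC_PER_NODE=6
LOCAL_BATCH_SIZE=2
GLOBAL_BATCH_SIZE=128
# 128 / (1*6*2) = 10 after truncation
GRADIENT_ACCUMULATION_STEPS=$[$GLOBAL_BATCH_SIZE/($WORLD_SIZE*$NPROC_PER_NODE*$LOCAL_BATCH_SIZE)]
echo $GRADIENT_ACCUMULATION_STEPS   # prints 10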
