
Training #30

Closed
kaiw7 opened this issue Jan 11, 2025 · 7 comments

kaiw7 commented Jan 11, 2025

Hi, could you explain how to enable training with VGGSound only, without text-audio pairs? Also, does it support v2a generation shorter than 8s during inference? Many thanks.

hkchengrex (Owner) commented

VGGSound-only training: modify this file https://github.com/hkchengrex/MMAudio/blob/main/mmaudio/data/data_setup.py
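For a rough sense of the kind of change involved, here is a minimal sketch; every name in it is a placeholder, since the actual dataset classes and config plumbing live in data_setup.py itself:

```python
from torch.utils.data import ConcatDataset, Dataset

# Hypothetical sketch only -- the real data_setup.py has its own dataset
# classes and config handling. The idea is to return the VGGSound dataset
# by itself instead of concatenating it with the text-audio sources.
def setup_training_dataset(vggsound: Dataset, text_audio_sets: list[Dataset]) -> Dataset:
    # Mixed training would concatenate all sources:
    #   return ConcatDataset([vggsound, *text_audio_sets])
    # VGGSound-only training drops the text-audio datasets entirely:
    return vggsound
```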

<8s inference: Yes. The demo script already supports this. As with longer-duration generation, using a duration that differs significantly from the training duration might introduce artifacts.

kaiw7 commented Jan 12, 2025

Thanks a lot for your response. What do these two lines mean? Are they used during training? https://github.com/hkchengrex/MMAudio/blob/34bf089fdd2e457cd5ef33be96c0e1c8a0412476/config/data/base.yaml#L31C1-L32C22

kaiw7 commented Jan 12, 2025

In addition, I ran into this error during training. Do you have any ideas about how to resolve it?

```
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
/usr/bin/ld: cannot find -lcuda: No such file or directory
collect2: error: ld returned 1 exit status
[2025-01-12 06:52:22][r3][ERROR] - Error occurred at iteration 0!
[2025-01-12 06:52:22][r3][CRITICAL] - backend='inductor' raised:
```
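(For anyone who lands here with the same error: one commonly reported workaround, offered as an assumption rather than the fix actually used in this thread, is to put the CUDA driver stub library on the linker search path before the first `torch.compile` call, since the Inductor backend shells out to the system linker:)

```python
import os

# Hypothetical workaround: "cannot find -lcuda" usually means the linker that
# Inductor invokes cannot see libcuda.so. On many CUDA installs a stub copy
# lives under $CUDA_HOME/lib64/stubs; exposing it via LIBRARY_PATH before any
# torch.compile call lets the link step succeed.
cuda_home = os.environ.get("CUDA_HOME", "/usr/local/cuda")
stubs = os.path.join(cuda_home, "lib64", "stubs")
os.environ["LIBRARY_PATH"] = stubs + os.pathsep + os.environ.get("LIBRARY_PATH", "")

# If that is not viable, running the model eagerly (without torch.compile)
# sidesteps the Inductor backend entirely, at some speed cost.
```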

hkchengrex (Owner) commented

Thanks. Those two lines are for the evaluation caches. I have updated the readme to reflect this.

For the error: can you show the full stack trace?

kaiw7 commented Jan 25, 2025

Hi, thank you very much. I solved this issue. I have another question about the training script: does it support gradient accumulation to save GPU memory?

kaiw7 commented Jan 25, 2025

Also, for the 44k case, why is the number of samples 353280 rather than 352800?

hkchengrex (Owner) commented

  1. We did not implement gradient accumulation; you can implement it yourself (see the sketch after this list). Another route is to reduce the batch size, reduce the LR, and increase the number of iterations -- this is not equivalent to gradient accumulation, but it might be more efficient. The network should be fairly robust and should not break under reasonable changes like these.
  2. 352800 samples (8 s at 44100 Hz) is not divisible by the STFT hop size * VAE downsampling ratio, which is 1024. 353280 is the next integer that is divisible by 1024.
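For item 1, a minimal, self-contained sketch of gradient accumulation in PyTorch (toy model, optimizer, and data; none of this is MMAudio code):

```python
import torch
from torch import nn

# Toy stand-ins; in a real run these would be the actual network and loader.
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

accum_steps = 4  # effective batch size = per-step batch size * accum_steps
optimizer.zero_grad()
for i, (x, y) in enumerate(loader):
    loss = nn.functional.mse_loss(model(x), y) / accum_steps  # scale so the
    loss.backward()           # accumulated gradients average over the window
    if (i + 1) % accum_steps == 0:
        optimizer.step()      # update once per accum_steps micro-batches
        optimizer.zero_grad()
```

Dividing the loss by `accum_steps` keeps the gradient magnitude comparable to one large batch, so the LR does not need retuning.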
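And for item 2, the arithmetic checks out directly:

```python
hop = 1024                 # STFT hop size * VAE downsampling ratio
target = 8 * 44100         # 8 s at 44.1 kHz = 352800 samples
num_samples = -(-target // hop) * hop  # round up to the next multiple of hop
print(target % hop)        # 544 -> 352800 is not divisible by 1024
print(num_samples)         # 353280 (= 345 * 1024)
```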
