Questions regarding the DiT fine-tuning, optimization of input dataset #186

Open
jquintanilla4 opened this issue Nov 16, 2024 · 0 comments

Comments


jquintanilla4 commented Nov 16, 2024

I've gone over your scripts and your guide to try to understand them better, but I still have some questions I want to confirm with y'all. I'm interested in running a large training run with your 768 model, so I want to optimize my dataset for the best results.

In your video latent extraction script you have the following:

ANNO_FILE=annotation/video_text.jsonl
WIDTH=640
HEIGHT=384
NUM_FRAMES=121
  • This seems to be for the 384 model. Should these values be changed for the 768 model?
  • Also, 121 frames is about 5 seconds at 24fps, which matches the 384 model. Should that be changed to 240 frames (10s) for the 768 model? (My arithmetic on this is in the quick check after this list.)
  • Is ANNO_FILE meant to point to an empty file, which then gets populated by your script?
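
Just to show the arithmetic behind my frame-count guess (this is only my own reasoning, not something from your docs):

NUM_FRAMES = 121          # from the 384 config above
FPS = 24
print(NUM_FRAMES / FPS)   # ~5.04 s, which matches the 384 model's 5-second clips
print(240 / FPS)          # 10.0 s, my guess for a 10-second clip for the 768 model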

I'm confused, because otherwise you'd be stuck in a catch-22. Your video_text.jsonl is meant to have the following format:
{"video": video_path, "text": text prompt, "latent": extracted video vae latent, "text_fea": extracted text feature}
as seen in this link. But how can the annotation already include the video latent if this script is what extracts the video latent... or am I missing something?
Are we meant to start with a basic annotation file like:

{"video": "/path/to/video1.mp4", "text": "A dog running in the park"}
{"video": "/path/to/video2.mp4", "text": "A car driving down the street"}

From your experience training the original SD3-based model and mini_flux, what is the best data format for the 768 model? For example:
  • What is the ideal length (in frames) for each video?
  • What is the ideal frame rate for each video?
  • What is the ideal height/resolution for each video?
  • Is there a character limit on the captions?
  • Is there an ideal dataset size for a training run?

I'm also curious about experimenting with 360 video data sources, such as fisheye, cubemap, and equirectangular videos. These may require a different aspect ratio than 16:9 or 1:1, such as 2:1. Any thoughts/tips on how best to achieve good results? I would imagine I may need to fine-tune the VAE for that.
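
For context, my current plan (purely my own guess, not based on anything in your repo) is to pre-resize the equirectangular clips to a 2:1 resolution before running latent extraction, roughly like this:

import subprocess
from pathlib import Path

SRC_DIR = Path("/path/to/equirect_videos")   # 2:1 equirectangular sources
DST_DIR = Path("/path/to/resized")
DST_DIR.mkdir(parents=True, exist_ok=True)

# 768x384 keeps the 2:1 aspect ratio with both sides divisible by 8;
# whether that's actually a sensible target resolution for your VAE is
# exactly what I'm unsure about.
for src in sorted(SRC_DIR.glob("*.mp4")):
    dst = DST_DIR / src.name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-vf", "scale=768:384", "-r", "24",
         "-c:v", "libx264", "-pix_fmt", "yuv420p", str(dst)],
        check=True,
    )

Does that seem like a reasonable starting point, or would you handle non-16:9 sources differently?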

Thanks in advance for your help, and thanks for sharing your code and scripts with the community. Awesome work.
