Questions regarding the DiT fine-tuning, optimization of input dataset #186

Open
jquintanilla4 opened this issue Nov 16, 2024 · 0 comments

Comments


jquintanilla4 commented Nov 16, 2024

I've gone over your scripts and your guide to try to understand them better, but I still have some questions I want to confirm with y'all. I'm interested in running a large training run with your 768 model, so I want to optimize my dataset for the best results.

In your video latent extraction script you have the following:

ANNO_FILE=annotation/video_text.jsonl
WIDTH=640
HEIGHT=384
NUM_FRAMES=121
  • This seems to be for the 384 model. Should these values be changed for the 768 model?
  • Also, 121 frames is about 5 seconds at 24fps, which matches the 384 model. Should that be changed to 240 frames (10s) for the 768 model? (My arithmetic on this is in the quick check after this list.)
  • Is ANNO_FILE meant to point to an empty file, which then gets populated by your script?
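
Just to show the arithmetic behind my frame-count guess (this is only my own reasoning, not something from your docs):

NUM_FRAMES = 121          # from the 384 config above
FPS = 24
print(NUM_FRAMES / FPS)   # ~5.04 s, which matches the 384 model's 5-second clips
print(240 / FPS)          # 10.0 s, my guess for a 10-second clip for the 768 model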

I'm confused, because otherwise you'd be stuck in a catch-22. Your video_text.jsonl is meant to have the following format:
{"video": video_path, "text": text prompt, "latent": extracted video vae latent, "text_fea": extracted text feature}
as seen in this link. But how can the annotation already include the video latent if this script is what extracts the video latent... or am I missing something?
Are we meant to start with a basic annotation file like:

{"video": "/path/to/video1.mp4", "text": "A dog running in the park"}
{"video": "/path/to/video2.mp4", "text": "A car driving down the street"}

From your experience training the original SD3-based model and mini_flux, what is the best data format for the 768 model? For example:
  • What is the ideal length (in frames) for each video?
  • What is the ideal frame rate for each video?
  • What is the ideal height/resolution for each video?
  • Is there a character limit on the captions?
  • Is there an ideal dataset size for a training run?

I'm also curious about experimenting with 360 video data sources, such as fisheye, cubemap, and equirectangular videos. These may require a different aspect ratio than 16:9 or 1:1, such as 2:1. Any thoughts/tips on how best to achieve good results? I would imagine I may need to fine-tune the VAE for that.
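
For context, my current plan (purely my own guess, not based on anything in your repo) is to pre-resize the equirectangular clips to a 2:1 resolution before running latent extraction, roughly like this:

import subprocess
from pathlib import Path

SRC_DIR = Path("/path/to/equirect_videos")   # 2:1 equirectangular sources
DST_DIR = Path("/path/to/resized")
DST_DIR.mkdir(parents=True, exist_ok=True)

# 768x384 keeps the 2:1 aspect ratio with both sides divisible by 8;
# whether that's actually a sensible target resolution for your VAE is
# exactly what I'm unsure about.
for src in sorted(SRC_DIR.glob("*.mp4")):
    dst = DST_DIR / src.name
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(src),
         "-vf", "scale=768:384", "-r", "24",
         "-c:v", "libx264", "-pix_fmt", "yuv420p", str(dst)],
        check=True,
    )

Does that seem like a reasonable starting point, or would you handle non-16:9 sources differently?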

Thanks in advance for your help, and thanks for sharing your code and scripts with the community. Awesome work.
