I've gone over your scripts and your guide to try to understand them better, but I still have some questions I'd like to confirm with you all. I'm interested in running a large training run for your 768 model, so I want to optimize my dataset for the best results.
In your video latent extraction script, the resolution and frame-count settings seem to be set up for the 384 model. Should they be changed for the 768 model?
Also, 121 frames is about 5 seconds at 24fps, which matches the 384 model. Should that be changed to 240 frames (10s) for the 768 model?
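Just to double-check my arithmetic on the clip lengths (the frame counts are the ones above, and the 24 fps figure is taken from the 384 setup):

```python
# Quick arithmetic check on clip lengths, assuming 24 fps as in the 384 setup.
FPS = 24
print(121 / FPS)  # ~5.04 s, matches the 384 model's ~5 s clips
print(240 / FPS)  # 10.0 s, which is what I'd guess the 768 model wants
```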
Is the ANNO_FILE meant to point to an empty file, which then gets populated by your script?
I'm confused because otherwise you'd be stuck in a catch-22: your video_text.jsonl is meant to have the following format: {"video": video_path, "text": text prompt, "latent": extracted video vae latent, "text_fea": extracted text feature}
As seen in this link. But how can the annotation file already contain the video latent when the script is what extracts the video latent... or am I missing something?
Are we meant to start with a basic annotation file like:
{"video": "/path/to/video1.mp4", "text": "A dog running in the park"}
{"video": "/path/to/video2.mp4", "text": "A car driving down the street"}
From your experience training the original SD3-based model and mini_flux, what is the best data format for the 768 model? For example:
- What is the ideal length (in frames) for each video?
- What is the ideal frame rate for each video?
- What is the ideal height for each video?
- Is there a character limit for the captions?
- Is there an ideal dataset size for a training run?
I'm also curious about experimenting with 360° video data sources, such as fisheye videos, cubemap videos, and equirectangular videos. These may require a different aspect ratio from 16:9 or 1:1, such as 2:1. Any thoughts or tips on how best to achieve good results? I imagine I may need to fine-tune the VAE for that.
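To make the 2:1 idea concrete, this is the kind of sanity check I'd run when picking a resolution; the 8x spatial downsample factor is just my assumption and I haven't verified it against your VAE:

```python
# Hypothetical sanity check for picking a 2:1 training resolution.
# The 8x spatial downsample factor is an assumption on my part, not verified
# against this repo's video VAE.
SPATIAL_DOWNSAMPLE = 8

def latent_hw(width: int, height: int, factor: int = SPATIAL_DOWNSAMPLE):
    """Return the latent (height, width) if both dims divide evenly, else None."""
    if width % factor or height % factor:
        return None
    return height // factor, width // factor

# Candidate 2:1 resolutions in the same ballpark as the 768 model.
for w, h in [(1536, 768), (1024, 512), (768, 384)]:
    print((w, h), "->", latent_hw(w, h))
```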
Thanks in advance for your help, and thanks for sharing your code and scripts with the community. Awesome work.