This repo makes it straightforward to run CogVideo on Lambda, an affordable GPU cloud provider.
These instructions are meant to get you up and running as fast as possible so you can create videos using the minimum amount of GPU time.
Sign up for Lambda. Fill in your information, complete the email verification, and add a credit card.

Press the "Launch instance" button and provide your public SSH key. Your public key should be in the `~/.ssh/` folder, in a file named `id_rsa.pub`. You can see its content with `cat ~/.ssh/id_rsa.pub`. Copy and paste the result into Lambda and you'll be set. If you do not find this file, check out here how to generate an SSH key; it's really straightforward.
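If you need to create a key pair from scratch, the standard `ssh-keygen` tool does the job. This is just a minimal sketch assuming you want an RSA key at the default location; the email comment is a placeholder:

```bash
# Generate a new SSH key pair (accept the default path ~/.ssh/id_rsa when prompted)
ssh-keygen -t rsa -b 4096 -C "you@example.com"

# Print the public key so you can paste it into Lambda
cat ~/.ssh/id_rsa.pub
```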
Once the instance is launched, wait a minute or two until the status of the machine says "Running". Then copy the line under "SSH LOGIN", the one that looks like `ssh ubuntu@<ip-address>`, where `<ip-address>` will be a series of numbers in the form 123.456.789.012. Paste it into your terminal, type "yes" at the prompt that appears, and you'll have access to your new machine with an A100!
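If the connection is refused because ssh does not pick up the right key automatically (for example, because you keep it somewhere other than the default `~/.ssh/id_rsa`), you can point ssh at the key explicitly. The key path below is just an example:

```bash
# Connect using a specific private key (replace the path and IP with yours)
ssh -i ~/.ssh/id_rsa ubuntu@<ip-address>
```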
Once you have accessed the machine, clone and enter this repo by running `git clone https://github.com/krea-ai/CogVideo-lambda` and `cd CogVideo-lambda`.
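For convenience, here are the two commands together:

```bash
# Clone the repo and move into it
git clone https://github.com/krea-ai/CogVideo-lambda
cd CogVideo-lambda
```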
Then edit the file `one-prompt-per-line.txt` and write, one per line, the texts that you want to use for your generations. The text needs to be in simplified Chinese. You can open the file with `nano one-prompt-per-line.txt` and paste your prompts on separate lines. Once you've finished, exit nano with `CTRL + X`, press `y` to save the changes, and then press `ENTER` to write the file. Here's an example of how the file could look:
3 架机械金属飞马在一个有很多人注视着他们的城市中疾驰而过,在一个暴风雨的夜晚,闪电穿过云层。
两个忍者战斗
一只巨大的玉米哥斯拉跺着纽约市
一个女孩跑向地平线。

(In English, these mean: three mechanical metal pegasi gallop through a city full of onlookers on a stormy night, with lightning cutting through the clouds; two ninjas fight; a giant corn Godzilla stomps through New York City; a girl runs toward the horizon.)

There are 4 lines, so this file would generate 4 videos.
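A quick sanity check before launching anything, assuming the prompt file is named `one-prompt-per-line.txt` as above:

```bash
# Count the lines in the prompt file; each non-empty line becomes one generation
wc -l one-prompt-per-line.txt
```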
Now we will use a few simple tmux commands. Tmux lets us have multiple terminal sessions running at the same time, and they keep running in the background even if we leave the machine. Read more about it here.
The first command starts a new terminal session called "cog". Run `tmux new -s cog` and you'll automatically be inside a new terminal. Once inside, just run `bash install-download-and-run.sh`. Surprise surprise, this will install the required dependencies, download the CogVideo models, and run them. This will take around 25-30 minutes, the models are large!
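For the curious, the script roughly automates the setup described further down in this README: installing the Python requirements, compiling the local attention kernel, pointing `SAT_HOME` at a model directory, and launching the inference pipeline. The block below is only a simplified sketch of those steps, not the actual contents of `install-download-and-run.sh`, and the `SAT_HOME` path is a hypothetical choice:

```bash
# Simplified sketch of the setup the script automates (not the real script)
pip install -r requirements.txt

# Compile the local attention kernel (requires CUDA)
git clone https://github.com/Sleepychord/Image-Local-Attention
cd Image-Local-Attention && python setup.py install && cd ..

# Tell the code where to download/look for the pretrained models (hypothetical path)
export SAT_HOME=/home/ubuntu/sat_models

# Run inference; the models are downloaded on the first run
./scripts/inference_cogvideo_pipeline.sh
```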
You can detach from the terminal at any time by pressing `CTRL + B` and then `D` (Mac users: use CONTROL, not COMMAND). Make sure not to close the terminal, and do not press `CTRL + D`. After detaching, you'll be back in the terminal of your Lambda instance. You can leave this instance and the tmux session will keep running. Once you access the Lambda instance again, you can attach to the "cog" tmux session with the command `tmux a -t cog` and see how the generations are going.
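To recap, these are the only tmux commands you need here (`tmux ls`, which lists running sessions, is an extra that is handy if you forget the session name):

```bash
tmux new -s cog     # create a session named "cog"
# CTRL + B, then D  -> detach from the session (it keeps running)
tmux a -t cog       # re-attach to the "cog" session
tmux ls             # list running sessions
```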
Run the following command from your local terminal: `rsync -chavzP --stats ubuntu@<lambda-instance-ip>:/home/ubuntu/CogVideo-lambda/output ./`. The `<lambda-instance-ip>` is the same series of numbers that you used to access the Lambda instance, something like 123.456.789.012.
This will create a folder named `output` in the directory from which you ran the command. Inside you'll find all your generations!
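Since generation takes a while, you can also pull results periodically while the job is still running. Here is a minimal loop that re-syncs every 5 minutes (replace the IP placeholder with yours):

```bash
# Re-sync the output folder every 5 minutes while generations are still running
while true; do
  rsync -chavzP --stats ubuntu@<lambda-instance-ip>:/home/ubuntu/CogVideo-lambda/output ./
  sleep 300
done
```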
Make sure to terminate your Lambda instance once you're finished!
Feel free to open an issue in this repo with any problem you encounter so we can improve the installation script.
Have fun!
This is the official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers.
News! The demo for CogVideo is available!
News! The code and model for text-to-video generation are now available! Currently we only support simplified Chinese input.
(Sample video: CogVideo_samples.mp4)
- Read our paper CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers on arXiv for a formal introduction.
- Try our demo at https://wudao.aminer.cn/cogvideo/
- Run our pretrained models for text-to-video generation. Please use an A100 GPU.
- Cite our paper if you find our work helpful:
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}
The demo for CogVideo is at https://wudao.aminer.cn/cogvideo/, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.
Video samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.
CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.
- Hardware: Linux servers with Nvidia A100s are recommended, but it is also okay to run the pretrained models with smaller `--max-inference-batch-size` and `--batch-size`, or to train smaller models on less powerful GPUs.
- Environment: install dependencies via `pip install -r requirements.txt`.
- LocalAttention: Make sure you have CUDA installed and compile the local attention kernel.
git clone https://github.com/Sleepychord/Image-Local-Attention
cd Image-Local-Attention && python setup.py install
Our code will automatically download or detect the models into the path defined by the environment variable `SAT_HOME`. You can also manually download CogVideo-Stage1 and CogVideo-Stage2 and place them under `SAT_HOME` (in folders named `cogvideo-stage1` and `cogvideo-stage2`).
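For example, the environment variable and the expected layout could look like this; the directory location itself is just an illustrative choice:

```bash
# Choose a directory for the pretrained checkpoints (path is arbitrary)
export SAT_HOME=/home/ubuntu/sat_models
mkdir -p "$SAT_HOME"

# After downloading (automatically or manually), the layout should be:
#   $SAT_HOME/cogvideo-stage1/
#   $SAT_HOME/cogvideo-stage2/
ls "$SAT_HOME"
```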
./scripts/inference_cogvideo_pipeline.sh
Arguments useful in inference are mainly:

- `--input-source [path or "interactive"]`. The path of the input file with one query per line. A CLI is launched when "interactive" is used.
- `--output-path [path]`. The folder containing the results.
- `--batch-size [int]`. The number of samples generated per query.
- `--max-inference-batch-size [int]`. Maximum batch size per forward pass. Reduce it if you run out of memory (OOM).
- `--stage1-max-inference-batch-size [int]`. Maximum batch size per forward pass in Stage 1. Reduce it if you run out of memory.
- `--both-stages`. Run Stage 1 and Stage 2 sequentially.
- `--use-guidance-stage1`. Use classifier-free guidance in Stage 1, which is strongly suggested to get better results.
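Depending on how the wrapper script forwards its arguments (you may need to edit the flags inside `./scripts/inference_cogvideo_pipeline.sh` instead of passing them on the command line), a full run could look roughly like the sketch below. The model directory, input file name, and batch sizes are placeholders, not values prescribed by the repo:

```bash
# Illustrative settings only; adjust or edit them inside the script if needed
export SAT_HOME=/home/ubuntu/sat_models   # hypothetical model directory

./scripts/inference_cogvideo_pipeline.sh \
    --input-source one-prompt-per-line.txt \
    --output-path output \
    --both-stages \
    --use-guidance-stage1 \
    --batch-size 4 \
    --max-inference-batch-size 2
```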
You should set the environment variable `SAT_HOME` to specify the path where the downloaded models are stored.

Currently only Chinese input is supported.