
Running CogVideo on Lambda Labs

This repo makes it straightforward to run CogVideo on Lambda, an affordable GPU cloud provider.

These instructions are meant to get you up and running as fast as possible so you can create videos using the minimum amount of GPU time.

Instructions

Set up Lambda

Sign up for Lambda. Fill in your information, complete the email verification, and add a credit card.

Press the "Launch instance" button and introduce your public ssh key. Your public key should be on the folder in ~/.ssh/ in a file named id_rsa.pub. You can see its content with cat ~/.ssh/id_rsa.pub. Copy and paste the result to lambda and you'll be set. If you do not find this folder, check out here how you can generate an ssh key, it's really straightforward.

Once the instance is launched, wait a minute or two until the Status of the machine says "Running". Then copy the line under "SSH LOGIN", the one that looks like ssh ubuntu@<ip-address>, where <ip-address> will be a series of numbers in the form 123.456.789.012. Paste it into your terminal, type "yes" at the prompt that appears, and you'll have access to your new machine with an A100!

Write down your prompts

Once you have accessed the machine, clone and enter this repo by running git clone https://github.com/krea-ai/CogVideo-lambda and cd CogVideo-lambda. Then edit the file one-prompt-per-line.txt and write, one per line, the texts that you want to use for your generations. The text needs to be in Simplified Chinese. You can use nano one-prompt-per-line.txt and paste in the texts, one per line. Once you've finished, exit nano with CTRL + X, press y to save the changes, and then press ENTER to write the file. Here's an example of how the file could look:

3 架机械金属飞马在一个有很多人注视着他们的城市中疾驰而过,在一个暴风雨的夜晚,闪电穿过云层。
两个忍者战斗
一只巨大的玉米哥斯拉跺着纽约市
一个女孩跑向地平线。

There are 4 lines (in English: three mechanical metal pegasi galloping through a crowded city on a stormy night with lightning piercing the clouds, two ninjas fighting, a giant corn Godzilla stomping through New York City, and a girl running toward the horizon), so this file would generate 4 videos.

Generate!

Now we will use a few simple tmux commands. Tmux allows us to have multiple terminal sessions open at the same time that keep running in the background even if we leave the machine. Read more about it here.

The first command starts a new terminal session called "cog". Run tmux new -s cog and you'll automatically be inside a new terminal. Once inside, just run bash install-download-and-run.sh. Surprise surprise, this will install the required dependencies, download the CogVideo models, and run them. This will take around 25-30 minutes; the models are large!

You can detach from the session at any time by pressing CTRL + B and then D (Mac users: use CONTROL, not COMMAND). Make sure not to close the terminal, and do not press CTRL + D. After detaching, you'll be back in the terminal of your Lambda instance. You can leave the instance and the tmux session will keep running. Once you access the Lambda instance again, you can re-attach to the "cog" tmux session with the command tmux a -t cog and see how the generations are going.
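
For reference, these are the only tmux commands (plus one key binding) needed for this workflow:

tmux new -s cog   # create a new session named "cog"
tmux a -t cog     # re-attach to the "cog" session
tmux ls           # list running sessions
# CTRL + B, then D  detaches from the current session (it keeps running)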

Download results to your local computer

Run the following command from your local terminal: rsync -chavzP --stats ubuntu@<lambda-instance-ip>:/home/ubuntu/CogVideo-lambda/output ./. The <lambda-instance-ip> is the same series of numbers that you used to access the Lambda instance, something like 123.456.789.012.

This will create a folder named output in the directory from where you ran the command. Inside you'll find all your generations!

Terminate your Lambda instance

Make sure to terminate your Lambda instance once you're finished!

Doubts?

Feel free to open an issue in this repo with any problem you encounter so we can improve the installation script.

Have fun!

-@viccpoes x @krea_ai

CogVideo

This is the official repo for the paper: CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers.

News! The demo for CogVideo is available!

News! The code and model for text-to-video generation are now available! Currently we only support Simplified Chinese input.

CogVideo_samples.mp4
@article{hong2022cogvideo,
  title={CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers},
  author={Hong, Wenyi and Ding, Ming and Zheng, Wendi and Liu, Xinghan and Tang, Jie},
  journal={arXiv preprint arXiv:2205.15868},
  year={2022}
}

Web Demo

The demo for CogVideo is at https://wudao.aminer.cn/cogvideo/, where you can get hands-on practice on text-to-video generation. The original input is in Chinese.

Generated Samples

Video samples generated by CogVideo. The actual text inputs are in Chinese. Each sample is a 4-second clip of 32 frames, and here we sample 9 frames uniformly for display purposes.

Intro images

More samples

CogVideo is able to generate relatively high-frame-rate videos. A 4-second clip of 32 frames is shown below.

High-frame-rate sample

Getting Started

Setup

  • Hardware: Linux servers with Nvidia A100s are recommended, but it is also okay to run the pretrained models with smaller --max-inference-batch-size and --batch-size, or to train smaller models on less powerful GPUs.
  • Environment: install dependencies via pip install -r requirements.txt.
  • LocalAttention: Make sure you have CUDA installed and compile the local attention kernel.
git clone https://github.com/Sleepychord/Image-Local-Attention
cd Image-Local-Attention && python setup.py install
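
As a quick sanity check before compiling the kernel (assuming a standard PyTorch environment), you can confirm that the CUDA toolkit and a visible GPU are available:

nvcc --version
python -c "import torch; print(torch.cuda.is_available())"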

Download

Our code will automatically download or detect the models in the path defined by the environment variable SAT_HOME. You can also manually download CogVideo-Stage1 and CogVideo-Stage2 and place them under SAT_HOME (in folders named cogvideo-stage1 and cogvideo-stage2).
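
For example, to cache the models under /home/ubuntu/models (the path is only an illustration), you could do the following; with a manual download, the checkpoints would sit in the two folders named above:

export SAT_HOME=/home/ubuntu/models
mkdir -p $SAT_HOME
# manual download layout:
#   $SAT_HOME/cogvideo-stage1
#   $SAT_HOME/cogvideo-stage2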

Text-to-Video Generation

./scripts/inference_cogvideo_pipeline.sh

The main arguments useful in inference are the following (a sketch of an example invocation follows the list):

  • --input-source [path or "interactive"]. The path of an input file with one query per line. A CLI will be launched when using "interactive".
  • --output-path [path]. The folder that will contain the results.
  • --batch-size [int]. The number of samples to be generated per query.
  • --max-inference-batch-size [int]. Maximum batch size per forward. Reduce it if OOM.
  • --stage1-max-inference-batch-size [int]. Maximum batch size per forward in Stage 1. Reduce it if OOM.
  • --both-stages. Run both Stage 1 and Stage 2 sequentially.
  • --use-guidance-stage1. Use classifier-free guidance in Stage 1, which is strongly suggested to get better results.
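
As a rough sketch of how these flags fit together (the entry-point name, input file, and values below are illustrative; the actual wiring lives inside scripts/inference_cogvideo_pipeline.sh, which you can edit directly):

python cogvideo_pipeline.py \
    --input-source input.txt \
    --output-path ./output \
    --batch-size 4 \
    --max-inference-batch-size 4 \
    --both-stages \
    --use-guidance-stage1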

You should set the environment variable SAT_HOME to specify the path where the downloaded models are stored.

Currently only Chinese input is supported.
