This project is part of ECE324 at the University of Toronto.
This project uses a machine learning model to predict the next anime frame from the previous frames and the anime's audio. The idea is motivated by an issue in the anime industry: every frame needs to be drawn by hand, which makes producing anime expensive. Some producers have used computer-generated imagery to speed up the process, but viewers are often dissatisfied with the results.
This project uses the architecture described in the VideoGPT paper, shown in the figure above. VideoGPT combines two models: a Vector Quantized Variational Autoencoder (VQ-VAE) and a Transformer (GPT/Image-GPT). The VQ-VAE compresses video frames into discrete latent codes, which are then used as input to the Transformer. The Transformer generates future video frames based on these compressed latent codes.
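As a rough illustration of the VQ-VAE stage, the sketch below shows the core vector-quantization step: each continuous encoder output is replaced by the index of its nearest codebook entry, and those discrete indices are what the Transformer later models. This is a minimal NumPy sketch of the general technique, not the actual VideoGPT implementation; the codebook size and embedding dimension below are made-up example values.

```python
import numpy as np

def quantize(encoder_outputs, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry.

    encoder_outputs: (N, D) array of latents produced by the VQ-VAE encoder.
    codebook:        (K, D) array of learned embedding vectors.
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance between every latent and every codebook entry.
    dists = ((encoder_outputs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # discrete codes fed to the Transformer
    quantized = codebook[indices]    # vectors passed back to the VQ-VAE decoder
    return indices, quantized

# Toy example: 8 latent vectors, a codebook of 512 entries, embedding dim 64
# (illustrative sizes only, not VideoGPT's actual hyperparameters).
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 64))
codebook = rng.normal(size=(512, 64))
codes, _ = quantize(latents, codebook)
print(codes)  # the discrete tokens the Transformer would model
```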
The functions for converting the data are under the data folder.
The main work for data collection is done inside this folder.
We have created these scripts for the data collection pipeline:
generate_timestamps.py
generates the timestamp of the beginning of each cut that lasts longer than 20 frames (we discard any cut shorter than 20 frames so that the cuts are meaningful)
split_video.py
splits the video into cuts of 20 frame images and saves them
generate_npy_from_jpeg.py
converts the JPEG files into a single npy file and resizes the images to the desired size; for this project we resize to 64x64 pixels (see the sketch after this list)
visualize.py
converts the numpy arrays back into images and videos for visualizing the results
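The snippet below is a minimal sketch of what a conversion step like generate_npy_from_jpeg.py does, assuming each cut is a folder of sequentially named JPEG frames. The file layout and function name are illustrative assumptions, not the script's actual interface.

```python
import glob
import numpy as np
from PIL import Image

def cut_to_npy(cut_dir, out_path, size=(64, 64)):
    """Load a cut's JPEG frames, resize them to 64x64, and save one .npy array.

    The resulting array has shape (num_frames, 64, 64, 3) with uint8 pixels.
    """
    frame_paths = sorted(glob.glob(f"{cut_dir}/*.jpg"))
    frames = [np.asarray(Image.open(p).convert("RGB").resize(size)) for p in frame_paths]
    clip = np.stack(frames, axis=0)
    np.save(out_path, clip)
    return clip.shape

# Hypothetical usage: one 20-frame cut per directory.
# print(cut_to_npy("cuts/episode01_cut0001", "npy/episode01_cut0001.npy"))
```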
General Data Collection Pipeline:
From this data collection pipeline, we are able to generate approximately 20,000 batches of 20 frames of 64x64 images for training and testing. Note that VideoGPT uses 16 frames per sequence.
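Since each collected clip is 20 frames long while VideoGPT consumes 16-frame sequences, every clip needs to be trimmed (or cropped in time) before training. The sketch below shows one simple way this could be done; it is an illustrative assumption about the preprocessing, not a copy of the actual training code.

```python
import numpy as np

def sample_16_frame_window(clip, rng=None):
    """Randomly crop a contiguous 16-frame window from a longer clip,
    e.g. a (20, 64, 64, 3) array produced by the pipeline above."""
    rng = rng if rng is not None else np.random.default_rng()
    assert clip.shape[0] >= 16, "clip must contain at least 16 frames"
    start = rng.integers(0, clip.shape[0] - 16 + 1)  # 0..4 for a 20-frame clip
    return clip[start:start + 16]

clip = np.zeros((20, 64, 64, 3), dtype=np.uint8)  # placeholder clip; in practice this would be np.load(...) of a saved cut
window = sample_16_frame_window(clip)
print(window.shape)  # (16, 64, 64, 3)
```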
Because the pre-trained weights are specific to the model architecture, the architecture must stay the same for the weights to be usable. We therefore experimented with hyperparameters that do not change the architecture, including the batch size and the number of conditioned frames.
Due to memory constraints, increasing the batch size requires reducing the number of input and output frames. In our experiments, a batch size of 1 with 16 frames performed the best.
Increasing the number of conditioned frames improves the model's performance. In practice, we want to condition on as few frames as possible while still achieving acceptable performance.
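To make the conditioning trade-off concrete: at generation time the model is given the first n_cond frames of a clip as context and must predict the remaining frames. The sketch below only shows that split on an array; n_cond = 8 is an arbitrary example value, and the actual conditioning mechanism lives inside the VideoGPT Transformer.

```python
import numpy as np

def split_for_conditioning(clip, n_cond):
    """Split a clip into conditioning frames (given to the model as context)
    and target frames (the frames the model must predict)."""
    return clip[:n_cond], clip[n_cond:]

clip = np.zeros((16, 64, 64, 3), dtype=np.uint8)  # placeholder 16-frame clip
cond, target = split_for_conditioning(clip, n_cond=8)
print(cond.shape, target.shape)  # (8, 64, 64, 3) (8, 64, 64, 3)
```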