This project is part of ECE324 at the University of Toronto.
This project uses a machine learning model to predict the next anime frame from the previous frames and the anime's audio. The idea is motivated by an issue in the anime industry: every frame needs to be drawn by hand, which makes producing anime expensive. Some producers have used computer-generated imagery to speed up the process, but viewers are often dissatisfied with the results.
This project uses the architecture described in the VideoGPT paper, shown in the figure above. VideoGPT combines two models: a Vector Quantized Variational Autoencoder (VQ-VAE) and a Transformer (GPT/Image-GPT). The VQ-VAE compresses video frames into discrete latent codes, which are then used as input to the Transformer. The Transformer generates future video frames based on these compressed latent codes.
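As a rough illustration of the VQ-VAE stage, the sketch below shows the core vector-quantization step: each continuous encoder output is replaced by the index of its nearest codebook entry, and those discrete indices are what the Transformer later models. This is a minimal NumPy sketch of the general technique, not the actual VideoGPT implementation; the codebook size and embedding dimension below are made-up example values.

```python
import numpy as np

def quantize(encoder_outputs, codebook):
    """Map each continuous latent vector to the index of its nearest codebook entry.

    encoder_outputs: (N, D) array of latents produced by the VQ-VAE encoder.
    codebook:        (K, D) array of learned embedding vectors.
    Returns (indices, quantized) with shapes (N,) and (N, D).
    """
    # Squared Euclidean distance between every latent and every codebook entry.
    dists = ((encoder_outputs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    indices = dists.argmin(axis=1)   # discrete codes fed to the Transformer
    quantized = codebook[indices]    # vectors passed back to the VQ-VAE decoder
    return indices, quantized

# Toy example: 8 latent vectors, a codebook of 512 entries, embedding dim 64
# (illustrative sizes only, not VideoGPT's actual hyperparameters).
rng = np.random.default_rng(0)
latents = rng.normal(size=(8, 64))
codebook = rng.normal(size=(512, 64))
codes, _ = quantize(latents, codebook)
print(codes)  # the discrete tokens the Transformer would model
```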
The functions for converting the data are under the data folder.
The main work for data collection is done inside this folder.
We have created these scripts for the data collection pipeline:
generate_timestamps.py
generates the timestamp of the beginning of each cut that lasts longer than 20 frames (we discard any cut shorter than 20 frames so that the cuts are meaningful)
split_video.py
splits the video into cuts of 20 frame images and saves them
generate_npy_from_jpeg.py
converts the JPEG files into a single npy file and resizes the images to the desired size; for this project we resize to 64x64 pixels (see the sketch after this list)
visualize.py
converts the numpy arrays back into images and videos for visualizing the results
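The snippet below is a minimal sketch of what a conversion step like generate_npy_from_jpeg.py does, assuming each cut is a folder of sequentially named JPEG frames. The file layout and function name are illustrative assumptions, not the script's actual interface.

```python
import glob
import numpy as np
from PIL import Image

def cut_to_npy(cut_dir, out_path, size=(64, 64)):
    """Load a cut's JPEG frames, resize them to 64x64, and save one .npy array.

    The resulting array has shape (num_frames, 64, 64, 3) with uint8 pixels.
    """
    frame_paths = sorted(glob.glob(f"{cut_dir}/*.jpg"))
    frames = [np.asarray(Image.open(p).convert("RGB").resize(size)) for p in frame_paths]
    clip = np.stack(frames, axis=0)
    np.save(out_path, clip)
    return clip.shape

# Hypothetical usage: one 20-frame cut per directory.
# print(cut_to_npy("cuts/episode01_cut0001", "npy/episode01_cut0001.npy"))
```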
General Data Collection Pipeline:
From this data collection pipeline, we are able to generate approximately 20,000 batches of 20 frames of 64x64 images for training and testing. Note that VideoGPT uses 16 frames per sequence.
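Since each collected clip is 20 frames long while VideoGPT consumes 16-frame sequences, every clip needs to be trimmed (or cropped in time) before training. The sketch below shows one simple way this could be done; it is an illustrative assumption about the preprocessing, not a copy of the actual training code.

```python
import numpy as np

def sample_16_frame_window(clip, rng=None):
    """Randomly crop a contiguous 16-frame window from a longer clip,
    e.g. a (20, 64, 64, 3) array produced by the pipeline above."""
    rng = rng if rng is not None else np.random.default_rng()
    assert clip.shape[0] >= 16, "clip must contain at least 16 frames"
    start = rng.integers(0, clip.shape[0] - 16 + 1)  # 0..4 for a 20-frame clip
    return clip[start:start + 16]

clip = np.zeros((20, 64, 64, 3), dtype=np.uint8)  # placeholder clip; in practice this would be np.load(...) of a saved cut
window = sample_16_frame_window(clip)
print(window.shape)  # (16, 64, 64, 3)
```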
Because the pre-trained weights are specific to the model architecture, the architecture must stay the same for the weights to be usable. We therefore experimented with hyperparameters that do not change the architecture, including the batch size and the number of conditioned frames.
Due to memory constraints, increasing the batch size requires reducing the number of input and output frames. In our experiments, a batch size of 1 with 16 frames performed the best.
Increasing the number of conditioned frames improves the model's performance. In practice, we want to condition on as few frames as possible while still achieving acceptable performance.
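To make the conditioning trade-off concrete: at generation time the model is given the first n_cond frames of a clip as context and must predict the remaining frames. The sketch below only shows that split on an array; n_cond = 8 is an arbitrary example value, and the actual conditioning mechanism lives inside the VideoGPT Transformer.

```python
import numpy as np

def split_for_conditioning(clip, n_cond):
    """Split a clip into conditioning frames (given to the model as context)
    and target frames (the frames the model must predict)."""
    return clip[:n_cond], clip[n_cond:]

clip = np.zeros((16, 64, 64, 3), dtype=np.uint8)  # placeholder 16-frame clip
cond, target = split_for_conditioning(clip, n_cond=8)
print(cond.shape, target.shape)  # (8, 64, 64, 3) (8, 64, 64, 3)
```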