
MIRAI: Future Frame Prediction of Anime from Previous Frames and Audio Input

This project is a part of ECE324 at the University of Toronto

Motivation

This project uses a machine learning model to predict the next anime frame from the previous frames and the anime's audio. The idea is motivated by a problem in the anime industry: frames are drawn by hand, which makes producing anime expensive. Some producers have turned to computer-generated imagery to speed up the process, but viewers are often unsatisfied with the results.

Architecture

Architecture (figure)

This project uses the same architecture as the VideoGPT paper, shown in the figure above. VideoGPT combines two models: a Vector Quantized Variational Autoencoder (VQ-VAE) and a Transformer (GPT/Image-GPT). The VQ-VAE compresses video frames into discrete latent codes, which are then used as input to the Transformer. The Transformer generates future video frames based on the compressed latent codes.
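To give a rough, self-contained illustration of the VQ-VAE half of this pipeline (this is not the actual VideoGPT implementation; the tensor shapes, codebook size, and function name below are assumptions), the sketch below shows how continuous encoder outputs are snapped to their nearest codebook entries, which is what produces the discrete latent codes the Transformer is trained on.

```python
import torch

def quantize(z_e, codebook):
    """Map continuous encoder outputs to their nearest codebook entries.

    z_e:      (B, T, H, W, D) continuous latents from a VQ-VAE encoder
    codebook: (K, D) learned embedding vectors
    Returns the discrete code indices (B, T, H, W) and the quantized latents.
    """
    flat = z_e.reshape(-1, z_e.shape[-1])                  # (N, D) with N = B*T*H*W
    # Squared L2 distance from every latent vector to every codebook entry
    dists = (flat.pow(2).sum(1, keepdim=True)
             - 2 * flat @ codebook.t()
             + codebook.pow(2).sum(1))                     # (N, K)
    indices = dists.argmin(dim=1)                          # nearest code per latent vector
    z_q = codebook[indices].reshape(z_e.shape)             # quantized latents
    return indices.reshape(z_e.shape[:-1]), z_q

# Toy example: 2 clips, 4 downsampled frames, an 8x8 latent grid, 64-dim latents,
# and a codebook of 512 entries (all sizes are illustrative, not VideoGPT's).
z_e = torch.randn(2, 4, 8, 8, 64)
codebook = torch.randn(512, 64)
codes, z_q = quantize(z_e, codebook)
print(codes.shape, z_q.shape)  # torch.Size([2, 4, 8, 8]) torch.Size([2, 4, 8, 8, 64])
```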

Data Collection

The functions for converting the data are under the data folder, where the main data collection work is done. We created the following scripts for the data collection pipeline:

  • generate_timestamps.py generates the timestamps of the beginning of each cut that lasts longer than 20 frames (cuts shorter than 20 frames are discarded so that the remaining cuts are meaningful)
  • split_video.py splits the video into 20-frame cuts and saves the frames as images
  • generate_npy_from_jpeg.py converts the JPEG files of each cut into a single .npy file and resizes the images to the desired size; for this project we resize to 64x64 pixels (see the sketch after this list)
  • visualize.py converts the numpy arrays back into images and videos for visualizing the results
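To make the generate_npy_from_jpeg.py step concrete, here is a minimal sketch of the idea, assuming each cut's frames live in their own directory as sorted *.jpg files; the function name, paths, and directory layout are illustrative assumptions, not the repository's actual code.

```python
import glob
import os

import numpy as np
from PIL import Image

def jpegs_to_npy(cut_dir, out_path, size=(64, 64)):
    """Load the JPEG frames of one cut, resize them, and save them as one .npy file.

    Assumes cut_dir holds the 20 frames of a cut as *.jpg files whose
    sorted filenames match their temporal order.
    """
    frame_paths = sorted(glob.glob(os.path.join(cut_dir, "*.jpg")))
    frames = []
    for path in frame_paths:
        img = Image.open(path).convert("RGB").resize(size)   # downscale to 64x64
        frames.append(np.asarray(img, dtype=np.uint8))
    clip = np.stack(frames)                                   # (num_frames, 64, 64, 3)
    np.save(out_path, clip)
    return clip

# Example (hypothetical paths):
# jpegs_to_npy("data/cuts/cut_0001", "data/npy/cut_0001.npy")
```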

General Data Collection Pipeline (figure)

From this data collection pipeline, we generate approximately 20,000 clips of 20 frames at 64x64 resolution for training and testing. Note that VideoGPT uses 16 frames per clip.
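Because the stored clips are 20 frames long while VideoGPT expects 16, each clip needs to be trimmed (or windowed) before training. The snippet below is one simple, assumed way to do that, using the (20, 64, 64, 3) arrays produced by the pipeline above.

```python
import numpy as np

def to_16_frames(clip, offset=0):
    """Take a contiguous 16-frame window from a 20-frame clip.

    clip:   (20, 64, 64, 3) uint8 array from the data pipeline above.
    offset: start index of the window (0 to 4 for a 20-frame clip).
    """
    assert clip.shape[0] >= offset + 16, "clip too short for this offset"
    return clip[offset:offset + 16]

# Toy example with a random stand-in clip
clip = np.random.randint(0, 256, size=(20, 64, 64, 3), dtype=np.uint8)
print(to_16_frames(clip).shape)  # (16, 64, 64, 3)
```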

Results

Results (figure)

Results (table)

Hyperparameter Tuning

Since the pre-trained weights are specific to the model architecture, the architecture must stay the same in order to use them. We therefore experimented with hyperparameters that do not change the model architecture, namely the batch size and the number of conditioned frames.

Batch Size

Due to memory constraints, increasing the batch size forces us to reduce the number of input and output frames. Our experiments found that a batch size of 1 with 16 frames performs best.

Batch size results (figure)

Number of Conditioned Frames

Increasing the number of conditioned frames improves the model's performance. In practice, we want the model to condition on as few frames as possible while still achieving acceptable performance.

Number of conditioned frames results (figure)
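To make "number of conditioned frames" concrete: for each 16-frame clip, the first k frames are given to the model as context and the remaining 16 - k frames are what it has to predict. The sketch below only shows that split on a raw clip; it is not the VideoGPT training code, and the shapes follow the pipeline described above.

```python
import numpy as np

def split_condition_target(clip, n_cond):
    """Split a clip into conditioning frames and frames to be predicted.

    clip:   (16, 64, 64, 3) array, as produced by the pipeline above.
    n_cond: number of frames the model conditions on (1 <= n_cond < 16).
    """
    return clip[:n_cond], clip[n_cond:]

clip = np.random.randint(0, 256, size=(16, 64, 64, 3), dtype=np.uint8)
for n_cond in (1, 4, 8):
    cond, target = split_condition_target(clip, n_cond)
    print(n_cond, cond.shape, target.shape)
```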
