This is the PyTorch implementation of FastLTS (ACM MM'22), a non-autoregressive end-to-end model for unconstrained lip-to-speech synthesis.
The correctness of this open-source version is still under validation. Feel free to create an issue if you find any problems.
| Speaker | Checkpoint |
| --- | --- |
| Chemistry Lectures | Google Drive |
| Chess Analysis | Google Drive |
| Hardware Security | Google Drive |
- python >= 3.6
- pytorch >= 1.7.0
- numpy
- scipy
- pillow
- inflect
- librosa
- Unidecode
- matplotlib
- tensorboardX
- ffmpeg
sudo apt-get install ffmpeg
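The remaining Python dependencies can be installed with pip. This is a minimal sketch that assumes PyTorch (>= 1.7.0) with CUDA support is installed separately by following the official instructions on pytorch.org:

pip install numpy scipy pillow inflect librosa Unidecode matplotlib tensorboardX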
We adopt the same data format as Lip2Wav. Please download the datasets and follow the preprocessing procedure described in the Lip2Wav repository.
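For reference, the Lip2Wav repository provides a preprocessing script that is invoked roughly as follows; the script name and flags are taken from that repository's README and may differ across versions, so treat this as an illustrative sketch rather than an exact command:

python preprocess.py --speaker_root <DATA_DIR> --speaker chess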
Suppose we use the `chess` split of the Lip2Wav dataset. Use the following command for the first-stage training:
python train_stage1.py -d <DATA_DIR>/chess -l <LOG_DIR> -cd <CKPT_DIR>
An additional `-cp` argument can be used to restore training from a checkpoint.
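For example, to resume the first-stage training from a previously saved checkpoint (the checkpoint path is a placeholder):

python train_stage1.py -d <DATA_DIR>/chess -l <LOG_DIR> -cd <CKPT_DIR> -cp <PATH_TO_STAGE1_CKPT>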
Once the first-stage model converges, load its checkpoint for the second-stage training with the following command:
python train_stage2.py -d <DATA_DIR>/chess -l <LOG_DIR> -cd <CKPT_DIR> -fr <PATH_TO_STAGE1_CKPT>
An additional `-cp` argument can be used to restore training from a checkpoint; it should not be used together with `-fr`. Besides, we support distributed data parallel (DDP) training in stage 2, which can be turned on with `--ddp` and `--ngpu <GPU_NUM>`.
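For example, a second-stage run with DDP enabled on 4 GPUs (the GPU count is illustrative) could look like:

python train_stage2.py -d <DATA_DIR>/chess -l <LOG_DIR> -cd <CKPT_DIR> -fr <PATH_TO_STAGE1_CKPT> --ddp --ngpu 4

After both stages have been trained, synthesize speech for the test split with: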
python test.py -d <DATA_DIR>/chess -cp <PATH_TO_STAGE2_CKPT> -o <OUTPUT_DIR> -bs <BATCH_SIZE>
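For example, with an illustrative output directory and batch size:

python test.py -d <DATA_DIR>/chess -cp <PATH_TO_STAGE2_CKPT> -o generated/chess -bs 16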
Part of the code is borrowed from the following repos. We would like to thank the authors of these repos for their contributions.
- Lip2Wav-pytorch: https://github.com/joannahong/Lip2Wav-pytorch
- NATSpeech: https://github.com/NATSpeech/NATSpeech
- Hifi-GAN: https://github.com/jik876/hifi-gan
If you find this code useful in your research, please cite our work:
@inproceedings{wang2022fastlts,
  title={{FastLTS}: Non-Autoregressive End-to-End Unconstrained Lip-to-Speech Synthesis},
  author={Wang, Yongqi and Zhao, Zhou},
  booktitle={Proceedings of the 30th ACM International Conference on Multimedia},
  pages={5678--5687},
  year={2022}
}