Update

Dataset    Acoustic Feature    Word Embedding    PER (dev)
thchs30    fbank (dim = 40)    one-hot           17.54%

Introduction

This directory contains a PyTorch implementation of

Transformer Transducer: A Streamable Speech Recognition Model with Transformer Encoders and RNN-T Loss

a paper from Google posted as an arXiv preprint in February 2020. It shows that the Transformer Transducer model achieves state-of-the-art results in streaming speech recognition.

Features

Transformer Transducer (T-T) is a combination of the Transformer and RNN-T: it uses self-attention [1] instead of the LSTMs of the RNN Transducer to encode the acoustic features and the label (word-embedding) sequence separately. T-T also adopts the relative positional encoding introduced in Transformer-XL [2], as well as the transducer loss function [3] and the joint network [4] proposed by Alex Graves in 2012 and 2013 respectively. A minimal sketch of the joint network is given below.
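
As an illustration only, here is a minimal PyTorch sketch of an RNN-T style joint network that fuses the audio-encoder and label-encoder outputs. The class name, argument names, and dimensions are assumptions made for the sketch, not the exact module used in this repo.

    import torch
    import torch.nn as nn

    class JointNetwork(nn.Module):
        """Combine audio-encoder output (B, T, enc_dim) with label-encoder output (B, U, pred_dim)."""
        def __init__(self, enc_dim, pred_dim, joint_dim, vocab_size):
            super().__init__()
            self.enc_proj = nn.Linear(enc_dim, joint_dim)
            self.pred_proj = nn.Linear(pred_dim, joint_dim)
            self.out = nn.Linear(joint_dim, vocab_size)

        def forward(self, enc_out, pred_out):
            enc = self.enc_proj(enc_out).unsqueeze(2)      # (B, T, 1, joint_dim)
            pred = self.pred_proj(pred_out).unsqueeze(1)   # (B, 1, U, joint_dim)
            joint = torch.tanh(enc + pred)                 # (B, T, U, joint_dim)
            return self.out(joint)                         # (B, T, U, vocab_size) logits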

Environment

  • Kaldi
    used as a toolkit to extract the MFCC (dim = 39) or fbank (dim = 40) features
  • PyTorch >= 0.4
  • wraprnnt
    a wrapper around the RNN-T loss function (a usage sketch follows this list)
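
As a rough illustration of how a warp RNN-T loss binding is typically called: the import name (warprnnt_pytorch), the default blank index, and the tensor layout below are assumptions and may differ from the exact "wraprnnt" package this repo expects.

    import torch
    from warprnnt_pytorch import RNNTLoss   # assumed binding; adjust to your install

    criterion = RNNTLoss()                  # blank index usually defaults to 0

    B, T, U, V = 2, 50, 10, 30              # batch, frames, label length, vocab size
    logits = torch.randn(B, T, U + 1, V, requires_grad=True)   # joint-network output
    labels = torch.randint(1, V, (B, U), dtype=torch.int32)    # target unit ids (no blanks)
    frame_lens = torch.full((B,), T, dtype=torch.int32)
    label_lens = torch.full((B,), U, dtype=torch.int32)

    loss = criterion(logits, labels, frame_lens, label_lens)
    loss.backward()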

Usage

train

Before starting training, make sure that you have already extracted the acoustic features for the {train, dev} sets and prepared the modeling units (either characters or words), which may not be shipped with the original dataset. A data-preparation sketch is given after the command below.

python train.py 
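
A hedged data-preparation sketch: load Kaldi-extracted features with kaldiio and map transcripts to character-level unit ids. The file names (data/train/feats.scp, data/train/text) follow the usual Kaldi layout, and the vocabulary handling is illustrative rather than the exact pipeline train.py expects.

    import kaldiio

    # Lazily read "utt-id -> feature matrix" pairs produced by Kaldi.
    feats = {utt: mat for utt, mat in kaldiio.load_scp_sequential("data/train/feats.scp")}

    # Build a character-level vocabulary and integer transcripts from the Kaldi "text" file.
    vocab = {"<blank>": 0}
    transcripts = {}
    with open("data/train/text", encoding="utf-8") as f:
        for line in f:
            utt, *words = line.strip().split()
            chars = list("".join(words))
            for c in chars:
                vocab.setdefault(c, len(vocab))
            transcripts[utt] = [vocab[c] for c in chars]

    print(len(feats), "utterances,", len(vocab), "modeling units")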

Thanks to

Reference

[1] Vaswani et al., "Attention Is All You Need", NeurIPS 2017.
[2] Dai et al., "Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context", ACL 2019.
[3] Graves, "Sequence Transduction with Recurrent Neural Networks", 2012.
[4] Graves et al., "Speech Recognition with Deep Recurrent Neural Networks", ICASSP 2013.

Author

Email: [email protected]