Speech recognition by predicting the phonemes in a recording. Implemented RNNs and the dynamic programming algorithm, Connectionist Temporal Classification, to generate such labels.
Make sure you have the following dependencies installed:
- Python 3.6+
- PyTorch 2.0
- Numpy
- Matplotlib
- Wandb
- DataLoader, TensorDataset
To execute the code, run all the cells in the notebook.
Different architectures were considered to achieve a high cutoff:
- Simple Conv1D with BatchNorm and 2 pBLSTM layers
- Four linear layers with BatchNorm, Dropout, and GELU activation
Trained for 50 epochs in total. The performance significantly improved even after 50 epochs.
- Learning Rate: 1e-4
- Batch Size: 64
- Criterion: CTC Loss
- Optimizer: AdamW
- Scheduler: ReduceLR on Plateau
PyTorch's DataLoader was used to load the data. The following transforms were applied:
- TimeMasking with time_mask_param as 200
- FrequencyMasking with freq_mask_param as 5