PT2020 Transcription project.
In this repository, we explore different strategies for automatic transcription enrichment of ASR data, including tasks such as automatic capitalization (truecasing) and punctuation recovery.
- Multilingual Simultaneous Sentence End and Punctuation Prediction
- Towards better subtitles: A multilingual approach for punctuation restoration of speech transcripts
- Automatic truecasing of video subtitles using BERT: a multilingual adaptable approach
To replicate our winning submission to SEPP 2021, please go to the shared-task branch.
This project requires Python 3.6 or later.
Create a virtual environment (from outside the project folder):
virtualenv -p python3.6 caption-env
Activate venv:
source caption-env/bin/activate
Finally, run:
python setup.py install
If you wish to make changes to the code, run:
pip install -r requirements.txt
pip install -e .
python caption train -f {your_config_file}.yaml
python caption test \
--checkpoint=some/path/to/your/checkpoint.ckpt \
--test_csv=path/to/your/testset.csv
Launch tensorboard with:
tensorboard --logdir="experiments/lightning_logs/"
If you are running experiments on a remote server, you can forward a port on your localhost to the server's localhost.
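One common way to do this is SSH local port forwarding. The sketch below is an illustration, not part of the toolkit: the user name, host name, and the `REMOTE` and `FORWARD_CMD` variables are hypothetical placeholders, and it assumes TensorBoard is listening on its default port (6006) on the server. It only builds and prints the command so you can check it before running it in a local terminal.

```shell
# Hypothetical placeholders: replace with your own user and server name.
REMOTE="user@remote-server"
# TensorBoard's default port; change it if you started tensorboard with --port.
PORT=6006

# -N: do not run a remote command, only forward ports.
# -L: forward local PORT to localhost:PORT as seen from the server.
FORWARD_CMD="ssh -N -L ${PORT}:localhost:${PORT} ${REMOTE}"

# Print the command for inspection; run it in a local terminal to open the tunnel.
echo "${FORWARD_CMD}"
```

While the tunnel is open, TensorBoard should be reachable in your local browser at http://localhost:6006.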
To run the toolkit tests, use the following commands:
cd tests
python -m unittest
To make sure all the code follows the same style, we use Black.