This repository is an implementation of the Transformer using ⚡ PyTorch Lightning to translate Korean to English.
⚡ PyTorch Lightning is an open-source Python library that provides a high-level interface for PyTorch.
This is my first time using PyTorch Lightning, and I find it flexible and a convenient way to organize code 😄
pytorch-lightning>=0.9.0
sentencepiece==0.1.91
torchtext==0.7.0
torch>=1.5.0
For this project, I used 1,100,000 sentences from the AI HUB Korean-English AI Training Text Corpus.
DATASET | SENTENCES |
---|---|
TRAIN | 1,000,000 |
VALID | 5,000 |
TEST | 5,000 |
To use torchtext with this repo, check sample.tsv in the ./data folder for the expected data format.
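As a rough illustration of reading such a file, the sketch below parses tab-separated sentence pairs with Python's standard `csv` module. The two-column layout (Korean first, English second, no header row) and the names `read_pairs` and `SAMPLE` are assumptions made for illustration; sample.tsv remains the reference for the actual format.

```python
import csv
import io

# Hypothetical stand-in for a TSV file: one sentence pair per line,
# Korean first, English second (assumed layout; check sample.tsv).
SAMPLE = (
    "안녕! 내일 뭐해?\tHi! What are you doing tomorrow?\n"
    "어제 무슨 영화봤어?\tWhat movie did you watch yesterday?\n"
)

def read_pairs(handle):
    """Return a list of (korean, english) tuples from a tab-separated stream."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    pairs = []
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        pairs.append((row[0].strip(), row[1].strip()))
    return pairs

pairs = read_pairs(io.StringIO(SAMPLE))
print(len(pairs), pairs[0][1])
```

To read a real file, `read_pairs(open(path, encoding="utf-8", newline=""))` would work the same way, provided the file matches the assumed layout.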
To train,
python main.py --epochs 30
If you use a GPU,
python main.py --gpus 1 --epochs 30
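For readers wondering how flags like these typically reach the trainer, here is a minimal sketch using `argparse`. It is not the contents of main.py: the function names `parse_args` and `trainer_kwargs` and the default values are hypothetical, and the only assumption about Lightning is that `Trainer` accepts `max_epochs` and `gpus`, as it does in the 0.9-era releases listed above.

```python
import argparse

def parse_args(argv=None):
    """Parse the two flags used in the commands above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Train the translation model")
    parser.add_argument("--epochs", type=int, default=30, help="number of training epochs")
    parser.add_argument("--gpus", type=int, default=0, help="number of GPUs (0 = CPU)")
    return parser.parse_args(argv)

def trainer_kwargs(args):
    """Map the parsed flags to keyword arguments for pytorch_lightning.Trainer."""
    kwargs = {"max_epochs": args.epochs}
    if args.gpus > 0:
        kwargs["gpus"] = args.gpus  # only request GPUs when asked to
    return kwargs

args = parse_args(["--gpus", "1", "--epochs", "30"])
print(trainer_kwargs(args))  # {'max_epochs': 30, 'gpus': 1}
```

The resulting dictionary could then be passed on as `pl.Trainer(**trainer_kwargs(args))`.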
I uploaded my pretrained SentencePiece tokenizer files, but if you want to train a tokenizer on your own corpus, run code like the following.
import sentencepiece as spm  # Korean tokenizer
input_file = 'kor.txt'
vocab_size = 32000 # Choose your vocab size
model_name = 'kor'
model_type = 'bpe'
character_coverage = 0.9995
input_argument = '--input=%s --model_prefix=%s --vocab_size=%s --model_type=%s --character_coverage=%s --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 '
cmd = input_argument%(input_file, model_name, vocab_size, model_type, character_coverage)
spm.SentencePieceTrainer.Train(cmd)
import sentencepiece as spm  # English tokenizer
input_file = 'eng.txt'
vocab_size = 32000 # Choose your vocab size
model_name = 'eng'
model_type = 'bpe'
character_coverage = 1
input_argument = '--input=%s --model_prefix=%s --vocab_size=%s --model_type=%s --character_coverage=%s --pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3 '
cmd = input_argument%(input_file, model_name, vocab_size, model_type, character_coverage)
spm.SentencePieceTrainer.Train(cmd)
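The two snippets above differ only in the input file, the model prefix, and the character coverage, so the shared part can be factored into a helper. The sketch below is one possible way to do that; `build_spm_args` and `train_tokenizer` are hypothetical names, and `sentencepiece` is imported lazily so the argument string can be inspected without the package installed.

```python
def build_spm_args(input_file, model_prefix, vocab_size=32000,
                   model_type="bpe", character_coverage=1.0):
    """Build the SentencePiece training argument string used above."""
    return (
        f"--input={input_file} --model_prefix={model_prefix} "
        f"--vocab_size={vocab_size} --model_type={model_type} "
        f"--character_coverage={character_coverage} "
        "--pad_id=0 --unk_id=1 --bos_id=2 --eos_id=3"
    )

def train_tokenizer(**kwargs):
    """Train a tokenizer from the given settings (requires sentencepiece)."""
    import sentencepiece as spm  # imported here so the helper above works without it
    spm.SentencePieceTrainer.Train(build_spm_args(**kwargs))

kor_args = build_spm_args("kor.txt", "kor", character_coverage=0.9995)
eng_args = build_spm_args("eng.txt", "eng", character_coverage=1.0)
print(kor_args)

# To actually train (needs kor.txt on disk):
# train_tokenizer(input_file="kor.txt", model_prefix="kor", character_coverage=0.9995)
```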
With ⚡ PyTorch Lightning, you can easily monitor training with TensorBoard or other loggers.
%load_ext tensorboard
%tensorboard --logdir lightning_logs/
BLEU | BLEU1 | BLEU2 | BLEU3 | BLEU4 |
---|---|---|---|---|
26.28 | 56.7 | 33.3 | 21.2 | 14.0 |
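To see how the columns relate, the sketch below assumes that BLEU1 to BLEU4 are the 1- to 4-gram precisions (in percent) and that the overall score follows the standard definition: the brevity penalty times the geometric mean of those precisions. Under that assumption, which the table itself does not state, the brevity penalty can be recovered from the reported numbers.

```python
import math

# Scores copied from the table above.
bleu = 26.28
precisions = [56.7, 33.3, 21.2, 14.0]  # BLEU1..BLEU4, assumed n-gram precisions (%)

# Geometric mean of the four n-gram precisions.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# Under the standard definition, BLEU = brevity_penalty * geo_mean.
implied_bp = bleu / geo_mean

print(f"geometric mean: {geo_mean:.2f}, implied brevity penalty: {implied_bp:.3f}")
```

The geometric mean comes out at roughly 27.4, so the reported BLEU of 26.28 would correspond to a brevity penalty of about 0.96, meaning the translations were slightly shorter than the references on average.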
To translate, set the checkpoint in translate.py after training finishes, then run the file.
python translate.py
Examples:
kor : 안녕! 내일 뭐해?
eng : Hi! What are you doing tomorrow?
kor : 어제 무슨 영화봤어?
eng : What movie did you watch yesterday?
kor : 인공지능 공부는 재밌어요!
eng : Artificial intelligence studies are fun!