- numpy
- cv2
- networkx
- matplotlib
Set the dataset, corpus and destination paths in the main of generate_textlines.py, then run it to generate synthetic line images and their transcriptions.
The dataset folder is expected to have the structure dataset_folder/{character classes}/character_images.png.
The corpus folder is expected to have the structure corpus_folder/{text files}.txt.
For each generated line, the destination folder will contain a pair of files destination_folder/{i.png, i.txt}.
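
The folder layouts described above can be checked before and after generation with a short script. This is only a minimal sketch based on the layout stated here. The helper names `check_dataset` and `list_line_pairs` are hypothetical and are not part of generate_textlines.py.

```python
from pathlib import Path


def check_dataset(dataset_dir):
    """Return {class_name: number_of_png_images} for each class subfolder."""
    counts = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = len(list(class_dir.glob("*.png")))
    return counts


def list_line_pairs(destination_dir):
    """Return (image, transcription) path pairs, skipping images with no .txt."""
    pairs = []
    for image in sorted(Path(destination_dir).glob("*.png")):
        text = image.with_suffix(".txt")
        if text.exists():
            pairs.append((image, text))
    return pairs
```

A class folder with a count of zero, or an image missing from the list of pairs, points to a path or layout problem worth fixing before preprocessing.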
The JSON file abbr_matchings.json maps text to sequences of symbols in the dataset.
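
The schema of abbr_matchings.json is not shown here, so the following is only an illustrative sketch. It assumes a flat JSON object whose keys are text fragments and whose values are lists of symbol (character class) names. Under that assumption, text could be converted to symbols by greedy longest-match lookup. The functions `load_matchings` and `to_symbols` and the example entries are hypothetical, and the project's actual matching logic may differ.

```python
import json


def load_matchings(path):
    """Load the text-to-symbols mapping (assumed: {"text": ["symbol", ...]})."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def to_symbols(text, matchings):
    """Convert text to a list of symbol names by greedy longest match."""
    # Try longer keys first so an abbreviation wins over its single letters.
    keys = sorted(matchings, key=len, reverse=True)
    symbols = []
    i = 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                symbols.extend(matchings[key])
                i += len(key)
                break
        else:
            # No entry matches here: assume the character names its own class.
            symbols.append(text[i])
            i += 1
    return symbols


# Hypothetical entries, for illustration only.
example = {"que": ["q_abbr"], "q": ["q"]}
print(to_symbols("quem", example))
```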
- tensor2tensor
Preprocess the generated data (the synthetic dataset, as pairs of i.png and i.txt files, must be in $TMP_DIR/ocr) and write it to $DATA_DIR, using the custom problem definition in t2t_usr:
$ t2t-datagen \
--t2t_usr_dir=t2t_usr \
--problem=ocr_latin \
--tmp_dir=$TMP_DIR \
--data_dir=$DATA_DIR
Train the transformer_sketch model on the generated dataset, using the custom problem definition in t2t_usr:
$ t2t-trainer \
--t2t_usr_dir=t2t_usr \
--problem=ocr_latin \
--tmp_dir=$TMP_DIR/ocr \
--data_dir=$DATA_DIR \
--model=transformer_sketch \
--hparams_set=transformer_small_sketch \
--output_dir=$OUTPUT_DIR
This project was developed by Elena Nieddu during Pi School's AI programme in Fall 2017.