In Codice Ratio

Synthetic dataset generation for sequence prediction models

Requirements

numpy
cv2 (the OpenCV Python bindings)
networkx
matplotlib

Usage

Set the dataset, corpus and destination paths in the main of generate_textlines.py, then run it to generate synthetic line images and their transcriptions.

The dataset folder is expected to have the structure dataset_folder/{character classes}/character_images.png, with one subfolder per character class.

The corpus folder is expected to have the structure corpus_folder/{text files}.txt.
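Before running the generator, it can help to check that both input folders follow the layout described above. The following is a minimal sketch, not part of the repository: the helper names index_dataset and list_corpus are hypothetical, and the only assumptions are that character images are PNG files inside class subfolders and that corpus files end in .txt.

```python
from pathlib import Path


def index_dataset(dataset_dir):
    """Map each character class (subfolder name) to its sorted PNG files."""
    classes = {}
    for class_dir in sorted(Path(dataset_dir).iterdir()):
        if class_dir.is_dir():
            classes[class_dir.name] = sorted(class_dir.glob("*.png"))
    return classes


def list_corpus(corpus_dir):
    """Return the sorted .txt files found directly inside the corpus folder."""
    return sorted(Path(corpus_dir).glob("*.txt"))
```

A class that maps to an empty list, or an empty corpus listing, usually points to a wrong path or an unexpected file extension.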

For each generated line i, the destination folder will contain an image and its transcription: destination_folder/{i.png, i.txt}.
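Since every generated line should come as an image and a transcription sharing the same name, a quick consistency check of the destination folder might look like the sketch below. The function find_pairs is a hypothetical helper, not something provided by generate_textlines.py.

```python
from pathlib import Path


def find_pairs(destination_dir):
    """Return (image, transcription) path pairs plus the stems missing a partner."""
    dest = Path(destination_dir)
    images = {p.stem: p for p in dest.glob("*.png")}
    texts = {p.stem: p for p in dest.glob("*.txt")}
    # Stems present in both sets form complete pairs.
    paired = sorted(images.keys() & texts.keys())
    # Stems present in only one set are missing their image or their text.
    unpaired = sorted(images.keys() ^ texts.keys())
    return [(images[s], texts[s]) for s in paired], unpaired
```

Note that stems are sorted as strings here, so "10" comes before "2"; that is enough for a sanity check but not for numeric ordering.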

The JSON file abbr_matchings.json maps text to sequences of symbols in the dataset.
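The schema of abbr_matchings.json is not described here, so the following is only an illustration of how such a mapping could be applied. It assumes a flat JSON object whose keys are text fragments and whose values are lists of symbol names, and it uses greedy longest-match replacement; both the schema and the matching strategy are assumptions, and the helpers load_matchings and to_symbols are hypothetical.

```python
import json


def load_matchings(path):
    """Load the mapping; assumed to be a flat {text: [symbol, ...]} object."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def to_symbols(text, matchings):
    """Greedy longest-match: replace known fragments, keep other characters as-is."""
    # Try longer fragments first so they win over their own prefixes.
    keys = sorted(matchings, key=len, reverse=True)
    symbols, i = [], 0
    while i < len(text):
        for key in keys:
            if key and text.startswith(key, i):
                symbols.extend(matchings[key])
                i += len(key)
                break
        else:
            # No fragment matches at position i: emit the character itself.
            symbols.append(text[i])
            i += 1
    return symbols
```

With a made-up mapping such as {"us": ["us_abbr"], "que": ["que_abbr"]}, the text "usque" would be turned into the two symbols us_abbr and que_abbr, while unmatched characters pass through unchanged.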

Model training

Requirements

tensor2tensor

Usage

Preprocess the generated data and write it to $DATA_DIR, using the custom problem definition (in t2t_usr). The synthetic dataset, in the form of i.png, i.txt pairs, must be in $TMP_DIR/ocr:

$ t2t-datagen \
    --t2t_usr_dir=t2t_usr \
    --problem=ocr_latin \
    --tmp_dir=$TMP_DIR \
    --data_dir=$DATA_DIR

Train the transformer_sketch model on the generated dataset, using the custom problem definition (in t2t_usr):

$ t2t-trainer \
    --t2t_usr_dir=t2t_usr \
    --problem=ocr_latin \
    --tmp_dir=$TMP_DIR/ocr \
    --data_dir=$DATA_DIR \
    --model=transformer_sketch \
    --hparams_set=transformer_small_sketch \
    --output_dir=$OUTPUT_DIR

Author

This project was developed by Elena Nieddu during Pi School's AI programme in Fall 2017.
