Support for named entity recognition (NER), language and knowledge graph models is planned for future releases which are now standalone scripts.
This project requires Python 3.6+
Create a conda environment (preferred)
conda env create --file conda_requirements.yml
conda activate vayu
- JsonL : One line per record in json format exported using following scala script using spark
- Having metadata columns like
cptCode
andcontactMethod
is helpful if present to exportmodel_info_advanced.json
- The tokens column is
List[List[str]]
/ Doc[sentences] where doc is List[sentence], sentence isList[str]
- Datasets should already be split into train, validation and test in (BIO CoNLL format for NER) for the scripts to
ingest.
vayu/datasets/train_valid_split.py
can be used to split datasets. - Having a static validation and test set instead of random is important for measure performance of new models compared to earlier ones.
python -m main do_train=true
-
Training flow by default trains -> tunes -> evaluates
-
By default, training gets configured using yaml situated in
conf/config.yaml
. -
To override any aspect of training, either pass via cmdline e.g.
python -m main training.optimizer_learning_rate=0.1 model=classification/cnn_lstm model.params.embedding_size=10
-
Configs are composed from multiple hierarchical small config files along with their defaults. This allows highly flexible configuration e.g. having same training, data params but different params for different models.
-
After training, system outputs the path of trained model which consists of:
a. complete config.yaml b. lightning_logs/version_0 will have by default top 3 models found during training based on validation loss. c. model_info_advanced.json, thresholds.json, test_results.yaml, tune_results.yaml as well as train.log which comprises of stdout of training.
python -m main do_train=true --multirun \
model.params.embedding_size=10,50,100 \
hydra.sweep.dir=multiruns/embedding_effect \
hydra.sweep.subdir=\${model.embedding_size}
Above will create separate directories for each embedding_size, this saves developer effort in running multiple configs.
python -m main do_train=true \
model=classification/boe \
pl_trainer.checkpoint_path=path/to/checkpoint
You can evaluate on a directory of json files or single file
- This assumes your config is present in
outputs/2020-05-27/16-28-42/.hydra/config.yaml
- If you want to evaluate model on a separate threshold, simply update
threshold
param inthresholds.json
- If you want to tune again on different dataset:
- backup and delete
thresholds.json
in pretrained model dir. - If
thresholds.json
is not deleted, then it will be used for prediction. - New threshold will be exported to a new directory to avoid overwriting.
- backup and delete
python -m main do_eval=true do_train=false \
pretrained_model_path=outputs/2020-05-27/16-28-42 \
data.test_path=path \ # path can be dir of files or single json file
data.val_path=path # if you want to tune on a new dataset
Here you can compare experiments
tensorboard --logdir="outputs"
- Does not support batch size > 1 when training throws OOM for a bigger batch size, if you lower max_chunks and
chunk_size which will support larger batch size
- Custom collate functions for Dataloader
- Categorical data featurization (pytorch-tabnet)
- Support for streaming json datasets loading w/o blowing up memory
- Support for loading multiple json datasets
- Training takes evaluation set as param from beginning, which is a limitation for now since we would want to evaluate on any dataset post training. Workaround for that is load model manually in an interpreter and run custom evaluate dataset through it
- Imbalanced batch sampler
- CNN + classification head
- CNN + LSTM + classification head
- HAN + classification head
- CNN + HAN + classification head
- Support gensim pretrained vectors (w2v, d2v, fastText and glove)
- Transformers + Classification Head
- CDR custom tokenizer
- Hybrid CDR tokenizer that leverages a whitelist along with BPE/wordpiece tokenization