Think of GitHub as your modern research journal. This repository template is intended to help quick-start the model building process. Every new model architecture should have its own repository. When you are happy with the architecture, evaluate its performance on your dataset.
This repository should NOT house your academic paper. If the model architecture works well, make a paper repo. The paper repo should have a script under ~/code that takes your data source, preprocesses it, and runs the model, saving the results to ~/results. This way others can more easily reproduce your work and apply your model to their own dataset.
When someone else uses your work, which academic paper should they cite?
{{Repo Name}} is the code embodiment of the model architecture described in {{Paper Name}}. It is in a separate repo from the paper to facilitate both code reuse and academic comparison. If you use this network layout in your paper, please use the citation given in the references.bib file.
TensorFlow networks are notoriously difficult to run when the local environment differs from what is expected. Small differences in the versions of Python/CUDA/cuDNN can have large impacts. If you are having problems getting the network to run at all, consider setting up your local environment to have the following versions:
When you are finished checking in the initial version of your architecture, make sure to update the versions below.
- CUDA (10.1)
- cuDNN (7.6.5 for CUDA 10.1)
- Python (3.8.4)
- tensorflow (2.2.0)
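A quick way to compare a local environment against the list above is sketched here. It is only an illustration: the function names and the `EXPECTED` table are hypothetical, and it checks Python and the installed tensorflow package only. CUDA and cuDNN still need to be checked by hand.

```python
import platform
from importlib import metadata

# Versions copied from the list above; update both places together.
EXPECTED = {"python": "3.8.4", "tensorflow": "2.2.0"}


def installed_versions():
    """Return the versions found locally; a missing package maps to None."""
    found = {"python": platform.python_version()}
    try:
        # Reads package metadata without importing tensorflow itself.
        found["tensorflow"] = metadata.version("tensorflow")
    except metadata.PackageNotFoundError:
        found["tensorflow"] = None
    return found


def find_mismatches(expected, found):
    """Return {name: (expected, found)} for every version that differs."""
    return {
        name: (want, found.get(name))
        for name, want in expected.items()
        if found.get(name) != want
    }


if __name__ == "__main__":
    mismatches = find_mismatches(EXPECTED, installed_versions())
    for name, (want, have) in mismatches.items():
        print(f"{name}: expected {want}, found {have}")
```

An empty result means the two checked versions match. It does not guarantee that the GPU stack is set up correctly.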
Below are the scripts needed to train the classification model. Any preprocessing steps the corpus needs (e.g., stopword removal, stemming, ...) should already be applied before the scripts below are run. The corpus is expected in a specific folder format. A state folder is used to keep state between steps. The academic boilerplate describing the process is provided as a starting point. All paths can be changed as desired. PowerShell is used as the overall scripting language so that the process can be paused easily.
Remember to copy these steps to your processing script in your academic paper.
- Open a PowerShell prompt
- Change into the ~/code folder
- Copy the config file to the state folder
copy config.yml d:/state/config.yml
- Update the config as desired
- Tokenize the corpus.
  - Remember to clean up the destination first.
Remove-Item "d:/corpus_tok/*.*"
python tokenize_corpus.py -in d:/corpus_raw -out d:/corpus_tok
- Capture the k-folds for reproducibility
python create_corpus_folds.py -in d:/corpus_tok -s d:/state
- Create the model.
python create_model.py -s d:/state
- Hyper-tune the model.
  - Clean up the working directories
  - Extract the hyper-tuning subsample
  - Run the hyper-tuning script
Remove-Item "d:/corpus_train/*.*"
Remove-Item "d:/corpus_test/*.*"
python extract_fold.py -in d:/corpus_tok -out d:/corpus_train -sub d:/state/sub/hypertune.train.csv
python extract_fold.py -in d:/corpus_tok -out d:/corpus_test -sub d:/state/sub/hypertune.test.csv
python hyper_tune_model.py -train d:/corpus_train -test d:/corpus_test -s d:/state
- Train the model
  For each fold:
  - Clean up the working directories
  - Extract the fold
  - Train the model
  - Validate the model
$folds = 10 #get from config.yml
for($i = 0; $i -lt $folds; $i++) {
    # Empty the working directories left over from the previous fold
    Remove-Item "d:/corpus_train/*.*"
    Remove-Item "d:/corpus_test/*.*"
    python extract_fold.py -in d:/corpus_tok -out d:/corpus_train -sub d:/state/sub/kfold.$i.train.csv
    python extract_fold.py -in d:/corpus_tok -out d:/corpus_test -sub d:/state/sub/kfold.$i.test.csv
    python train_model.py -in d:/corpus_train -s d:/state
    python validate_model.py -in d:/corpus_test -s d:/state
}
- Collect the overall results.
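The reason for capturing the folds up front is that every later run extracts exactly the same documents. A minimal sketch of that idea follows. It assumes each fold is an independent seeded 80/20 shuffle and that each CSV holds one relative document path per row. Both are assumptions, and create_corpus_folds.py may work differently. The function names are hypothetical.

```python
import csv
import random
from pathlib import Path


def split_documents(documents, train_fraction, seed):
    """Shuffle with a fixed seed and split into (train, test) lists."""
    shuffled = sorted(documents)  # sort first so the input order does not matter
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def write_folds(documents, sub_dir, folds=10, train_fraction=0.8, seed=0):
    """Write kfold.<i>.train.csv and kfold.<i>.test.csv, one document per row."""
    sub_dir = Path(sub_dir)
    sub_dir.mkdir(parents=True, exist_ok=True)
    for i in range(folds):
        # A different but fixed seed per fold keeps every fold repeatable.
        train, test = split_documents(documents, train_fraction, seed + i)
        for name, rows in (("train", train), ("test", test)):
            with open(sub_dir / f"kfold.{i}.{name}.csv", "w", newline="") as handle:
                csv.writer(handle).writerows([doc] for doc in rows)
```

Because the split depends only on the document names and the seed, rerunning it after a restart reproduces the same files.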
The corpus is expected to be in the folder format below. Documents are expected to be in .txt format, 1 document per file. This layout allows N levels of classification without the need for an external control file.
- Root (d:/corpus)
  - Class 1 (d:/corpus/1)
  - Class 2 (d:/corpus/2)
  - ...
  - Class N (d:/corpus/#)
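With the layout above, the class of a document can be read straight from its folder name. The sketch below illustrates that. `list_corpus` is a hypothetical helper rather than one of the repository scripts, and it assumes the first folder under the root is the class label.

```python
from pathlib import Path


def list_corpus(root):
    """Return sorted (class_label, relative_path) pairs for every .txt file."""
    root = Path(root)
    pairs = []
    for path in root.rglob("*.txt"):
        relative = path.relative_to(root)
        if len(relative.parts) < 2:
            continue  # a file directly under the root has no class folder
        # Assumption: the first folder below the root names the class.
        pairs.append((relative.parts[0], relative.as_posix()))
    return sorted(pairs)
```

For example, `list_corpus("d:/corpus")` would pair a file stored at d:/corpus/1/a.txt with the label `1`.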
To allow the process to be stopped and restarted, a state folder is used. Reproducibility by design is enabled by a subsample folder that captures the random folds.
- Root (d:/state)
  - Config (d:/state/config.yml)
  - Subsample
    - Hyper-tuning training split (d:/state/sub/hypertune.train.csv)
    - Hyper-tuning test split (d:/state/sub/hypertune.test.csv)
    - K-fold training (d:/state/sub/kfold.#.train.csv)
    - K-fold test (d:/state/sub/kfold.#.test.csv)
  - Model weights (d:/state/weights)
  - Validation results (d:/state/results)
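Before resuming a paused run, it can help to confirm that the state folder still holds everything the fold loop reads. The sketch below maps the layout above to paths and reports what is missing. `state_paths` and `missing_inputs` are hypothetical helpers, not part of the repository scripts.

```python
from pathlib import Path


def state_paths(root, folds):
    """Map each item of the state layout to its path under root."""
    root = Path(root)
    sub = root / "sub"
    return {
        "config": root / "config.yml",
        "hypertune_train": sub / "hypertune.train.csv",
        "hypertune_test": sub / "hypertune.test.csv",
        "kfold_train": [sub / f"kfold.{i}.train.csv" for i in range(folds)],
        "kfold_test": [sub / f"kfold.{i}.test.csv" for i in range(folds)],
        "weights": root / "weights",
        "results": root / "results",
    }


def missing_inputs(root, folds):
    """List the files a fold run reads that are not on disk yet."""
    paths = state_paths(root, folds)
    needed = [paths["config"], *paths["kfold_train"], *paths["kfold_test"]]
    return [path for path in needed if not path.exists()]
```

An empty list from `missing_inputs("d:/state", 10)` suggests the fold capture step has already run and training can pick up where it stopped.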
Below is the suggested text to add to the Methods and Materials section of your paper when using this architecture.
After preprocessing, the corpus was tokenized. Hyper-tuning used a 5% sub-sample evenly distributed across each class using a training/validation split of 80%/20%. The hyper-tuning sample was pre-calculated to aid in reproducibility. Model building used 10-fold cross-validation using a training/validation split of 80%/20%. Folds for cross-validation were pre-calculated to aid in reproducibility.
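The hyper-tuning sample described in that text can be sketched as follows. This reads "evenly distributed across each class" as taking 5% of every class. That reading is an assumption, as are the function name and the seed handling. The repository scripts may draw the sample differently.

```python
import random
from collections import defaultdict


def stratified_subsample(labelled_docs, fraction=0.05, train_fraction=0.8, seed=0):
    """Sample `fraction` of each class, then split each class sample train/test.

    labelled_docs is an iterable of (class_label, document) pairs.
    """
    by_class = defaultdict(list)
    for label, doc in labelled_docs:
        by_class[label].append(doc)

    rng = random.Random(seed)  # fixed seed so the sample can be regenerated
    train, test = [], []
    for label in sorted(by_class):
        docs = sorted(by_class[label])
        count = max(1, round(len(docs) * fraction))  # keep at least one per class
        sample = rng.sample(docs, count)
        cut = int(len(sample) * train_fraction)
        train += [(label, doc) for doc in sample[:cut]]
        test += [(label, doc) for doc in sample[cut:]]
    return train, test
```

Sampling inside each class keeps the class proportions of the full corpus, so a rare class is not dropped from the sample by chance.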