
Model Template v0.1


You should think of GitHub as your modern research journal. This repository template is intended to help quick-start the model-building process. Every new model architecture should have its own repository. When you are happy with the architecture, evaluate its performance on your dataset.

This repository should NOT house your academic paper. If the model architecture works well, make a paper repo. The paper repo should have a script under ~/code that takes your data source, preprocesses it, and runs the model, saving the results to ~/results. This way others can apply your model to their own dataset and more easily reproduce your work.

When someone else uses your work, what academic paper should they cite?

{{Repo Name}} is the code embodiment of the model architecture described in {{Paper Name}}. It is in a separate repo from the paper to facilitate both code reuse and academic comparison. If you use this network layout in your paper, please use the citation as described in the references.bib file.

Environment Setup

TensorFlow networks are notoriously difficult to run when the local environment differs from what is expected. Small differences in the versions of Python/CUDA/cuDNN can have large impacts. If you are having problems getting the network to run at all, consider setting up your local environment with the following versions:

When you are finished checking in the initial version of your architecture, make sure to update the versions below.
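A quick way to capture the exact versions in play is to print them from inside the environment itself. Below is a minimal sketch, assuming TensorFlow 2.x (tf.sysconfig.get_build_info() is available from TF 2.3 on); the script name is just an example.

    # check_versions.py - print the versions this environment is running
    import sys
    import tensorflow as tf

    print(f"Python:     {sys.version.split()[0]}")
    print(f"TensorFlow: {tf.__version__}")

    # On GPU builds, the build info includes the CUDA/cuDNN versions
    # TensorFlow was compiled against; CPU builds may omit these keys.
    build = tf.sysconfig.get_build_info()
    print(f"CUDA:       {build.get('cuda_version', 'n/a')}")
    print(f"cuDNN:      {build.get('cudnn_version', 'n/a')}")
    print(f"GPUs:       {tf.config.list_physical_devices('GPU')}")

Recording this output alongside the versions above makes it easier to spot the Python/CUDA/cuDNN mismatches described in this section.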

Scripts

Below are the scripts needed to train the classification model. Any preprocessing steps the corpus needs (e.g. stopword removal, stemming, ...) should already be applied before the scripts below are run. The corpus is expected in a specific folder format. A state folder is used to keep state between steps. The academic boilerplate describing the process is provided as a starting point. All paths can be changed as desired. PowerShell is used as the overall scripting language so that the process can be paused easily.

Remember to copy these steps to your processing script in your academic paper.

  1. Open a PowerShell prompt
  2. Change into the ~/code folder
  3. Copy the config file to the state folder
    copy config.yml d:/state/config.yml
    
  4. Update the config as desired
  5. Tokenize the corpus.
    • Remember to clean up the destination first.
    Remove-Item "d:/corpus_tok/*.*"
    python tokenize_corpus.py -in d:/corpus_raw -out d:/corpus_tok
    
  6. Capture the k-folds for reproducibility
    python create_corpus_folds.py -in d:/corpus_tok -s d:/state
    
  7. Create the model.
    python create_model.py -s d:/state
    
  8. Hyper-tune the model.
    1. Cleanup the working directories
    2. Extract the hyper-tuning subsample
    3. Run the hyper-tuning script
    Remove-Item "d:/corpus_train/*.*"
    Remove-Item "d:/corpus_test/*.*"
    python extract_fold.py -in d:/corpus_tok -out d:/corpus_train -sub d:/state/sub/hypertune.train.csv
    python extract_fold.py -in d:/corpus_tok -out d:/corpus_test -sub d:/state/sub/hypertune.test.csv
    python hyper_tune_model.py -train d:/corpus_train -test d:/corpus_test -s d:/state
    
  9. Train the model. For each fold:
    1. Cleanup the working directory
    2. Extract the fold
    3. Train the model
    4. Validate model
    $folds = 10 #get from config.yml
    for($i = 0; $i -lt $folds; $i++) {
       Remove-Item "d:/corpus_train/*.*"
       Remove-Item "d:/corpus_test/*.*"
       python extract_fold.py -in d:/corpus_tok -out d:/corpus_train -sub d:/state/sub/kfold.$i.train.csv
       python extract_fold.py -in d:/corpus_tok -out d:/corpus_test -sub d:/state/sub/kfold.$i.test.csv
       python train_model.py -in d:/corpus_train -s d:/state
       python validate_model.py -in d:/corpus_test -s d:/state
    }
    
  10. Collect the overall results (a sketch of this step follows the list).
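The template does not ship a collection script for step 10, so the sketch below is only an illustration of the idea. It assumes validate_model.py writes one CSV per fold under d:/state/results, with a header row of metric names and numeric rows of values; it then averages the per-fold numbers into a single summary.

    # collect_results.py - hypothetical sketch: average per-fold metrics
    # from d:/state/results into one overall summary.
    import csv
    import glob
    from collections import defaultdict

    totals = defaultdict(float)
    rows = 0
    for path in glob.glob("d:/state/results/*.csv"):
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                for metric, value in row.items():
                    totals[metric] += float(value)
                rows += 1

    if rows == 0:
        print("no result files found")
    else:
        for metric, total in totals.items():
            print(f"{metric}: {total / rows:.4f}")

Adapt the file layout to whatever validate_model.py actually emits.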

Corpus Folder

The corpus is expected to be in the folder format below. Documents are expected to be in .txt format, one document per file. This layout allows N classes of documents without the need for an external control file. A sketch of reading this layout follows the list.

  • Root (d:/corpus)
    • Class 1 (d:/corpus/1)
    • Class 2 (d:/corpus/2)
    • ...
    • Class N (d:/corpus/#)
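As an illustration of how a script can consume this layout, the sketch below walks the tree and yields (label, text) pairs; the folder names double as class labels, which is what removes the need for a control file. The helper name and root path are examples only.

    # iter_corpus.py - sketch: read the class-per-folder corpus layout.
    from pathlib import Path

    def iter_corpus(root):
        # Each subfolder of the root is a class; each .txt file is one document.
        for class_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
            for doc in sorted(class_dir.glob("*.txt")):
                yield class_dir.name, doc.read_text(encoding="utf-8")

    for label, text in iter_corpus("d:/corpus"):
        print(label, len(text))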

State Folder

To allow the process to be stopped and re-started, a state folder is used. Reproducibility by design is enabled through a subsample folder that captures the random folds. A small bootstrap sketch follows the list.

  • Root (d:/state)
    • Config (d:/state/config.yml)
    • Subsample
      • Hyper-tuning training split (d:/state/sub/hypertune.train.csv)
      • Hyper-tuning test split (d:/state/sub/hypertune.test.csv)
      • K-fold training (d:/state/sub/kfold.#.train.csv)
      • K-fold test (d:/state/sub/kfold.#.test.csv)
    • Model weights (d:/state/weights)
    • Validation results (d:/state/results)
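Because the scripts assume these folders exist, a small bootstrap such as the sketch below (folder names taken from the layout above) can create them before the first run.

    # init_state.py - sketch: create the state folder layout described above.
    from pathlib import Path

    root = Path("d:/state")
    for sub in ("sub", "weights", "results"):
        (root / sub).mkdir(parents=True, exist_ok=True)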

Academic boilerplate

Below is the suggested text to add to the Methods and Materials section of your paper when using this architecture.

After preprocessing, the corpus was then tokenized. Hyper-tuning used a 5% sub-sample evenly distributed across each class using a training/validation split of 80%/20%. The hyper-tuning sample was pre-calculated to aid in reproducibility. Model building used 10-fold cross-validation with a training/validation split of 80%/20%. Folds for cross-validation were pre-calculated to aid in reproducibility.
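One way to produce the pre-calculated splits described above is with scikit-learn. This is a sketch under assumptions, not the template's create_corpus_folds.py: file paths (relative to the tokenized corpus root) are the sampling unit, class labels come from the folder names, a fixed random_state stands in for whatever seed the config provides, and standard stratified folds are shown.

    # precalc_folds.py - sketch: pre-calculate the hyper-tuning subsample
    # and the 10 cross-validation folds as CSV file lists.
    # Assumes d:/state/sub already exists (see the State Folder section).
    from pathlib import Path
    from sklearn.model_selection import StratifiedKFold, train_test_split

    root = Path("d:/corpus_tok")
    files, labels = [], []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for doc in class_dir.glob("*.txt"):
            files.append(str(doc.relative_to(root)))
            labels.append(class_dir.name)

    def write_csv(path, names):
        Path(path).write_text("\n".join(names) + "\n")

    # 5% stratified subsample for hyper-tuning, then an 80/20 split of it.
    _, sub_files, _, sub_labels = train_test_split(
        files, labels, test_size=0.05, stratify=labels, random_state=42)
    tune_train, tune_test = train_test_split(
        sub_files, test_size=0.20, stratify=sub_labels, random_state=42)
    write_csv("d:/state/sub/hypertune.train.csv", tune_train)
    write_csv("d:/state/sub/hypertune.test.csv", tune_test)

    # 10 stratified folds for cross-validation.
    skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
    for i, (train_idx, test_idx) in enumerate(skf.split(files, labels)):
        write_csv(f"d:/state/sub/kfold.{i}.train.csv", [files[j] for j in train_idx])
        write_csv(f"d:/state/sub/kfold.{i}.test.csv", [files[j] for j in test_idx])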
