Below is a list of the corpus tools we use at Mind Mimic Labs.
They are intended as building blocks for both general research in our lab and publication boilerplate.
Each tool should be considered stand-alone and includes both code (~/code) and documentation (~/docs).
A combined requirements.txt file for all the tools is in the root of the repo.
The documentation explains what the code is for, how to run it, and what publication boilerplate to put in the Methods and Materials section.
Unless otherwise noted, all scripts follow the same execution path.
- Open a command prompt.
- Change into the ~/code folder.
- Run python {{scriptname}}.py -in d:/corpus_in -out d:/corpus_out. You should change the input and output paths as desired.
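To make the shared calling convention concrete, here is a minimal sketch of how a script could accept the -in and -out flags shown above. It is not taken from the repo: the names parse_args, process_corpus and main, the restriction to .txt files, and the use of str.lower as a stand-in transformation are all assumptions made for illustration.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse the -in / -out flags used in the command above."""
    parser = argparse.ArgumentParser(description="Hypothetical corpus tool skeleton")
    # `in` is a Python keyword, so the values are stored under explicit names.
    parser.add_argument("-in", dest="folder_in", required=True, type=pathlib.Path,
                        help="folder containing the input corpus")
    parser.add_argument("-out", dest="folder_out", required=True, type=pathlib.Path,
                        help="folder that receives the processed corpus")
    return parser.parse_args(argv)


def process_corpus(folder_in, folder_out, transform):
    """Apply `transform` to every .txt file, mirroring the input folder layout."""
    written = []
    for source in sorted(folder_in.rglob("*.txt")):
        target = folder_out / source.relative_to(folder_in)
        target.parent.mkdir(parents=True, exist_ok=True)
        text = source.read_text(encoding="utf-8")
        target.write_text(transform(text), encoding="utf-8")
        written.append(target)
    return written


def main(argv=None):
    args = parse_args(argv)
    # str.lower is only a placeholder for whatever a real tool does per document.
    return process_corpus(args.folder_in, args.folder_out, str.lower)
```

A script built this way would call main() from an if __name__ == "__main__": guard, so that python {{scriptname}}.py -in d:/corpus_in -out d:/corpus_out works from the ~/code folder.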
The list of current scripts is below.
In general, run documents-to-corpus first, then the other scripts.
Individual papers/projects/repos will instruct on the exact order in their README.md's Tabula Rasa section.
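When several scripts are run in sequence, it can help to generate the commands rather than type them. The sketch below only builds the command lines. It assumes each script reads the previous script's output folder, which is an assumption and not something this README states. The build_commands helper and the work_root folder are hypothetical.

```python
import sys
from pathlib import Path


def build_commands(script_names, corpus_in, work_root):
    """Return one argv list per script, feeding each output folder to the next script."""
    commands = []
    current_in = Path(corpus_in)
    for name in script_names:
        # Each step writes to its own folder under work_root (hypothetical layout).
        current_out = Path(work_root) / name
        commands.append([
            sys.executable, f"{name}.py",
            "-in", current_in.as_posix(),
            "-out", current_out.as_posix(),
        ])
        current_in = current_out
    return commands
```

Each returned list can be passed to subprocess.run(cmd, cwd=code_folder, check=True), where code_folder is the local path to ~/code. The actual order should still come from the project's Tabula Rasa section.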
Data Pre-Processing
- remove-stop-words
- lowercase-corpus
- stem-corpus
- remove-whitespace-from-corpus
- remove-punctuation-from-corpus
- remove-numbers-from-corpus
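As a rough illustration of what the pre-processing steps above do to a single document, here is a minimal sketch using only the standard library. It is not the repo's implementation: the function names, the tiny STOP_WORDS set, and the choice to collapse whitespace to single spaces are assumptions. Stemming is left out because it normally relies on a third-party stemmer.

```python
import re
import string

# Hypothetical, deliberately tiny stop list; a real tool would use a fuller one.
STOP_WORDS = {"a", "an", "and", "of", "the", "to"}


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Deletes ASCII punctuation only; other Unicode punctuation is left alone.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text):
    return re.sub(r"\d+", "", text)


def remove_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def collapse_whitespace(text):
    # Assumption: "removing whitespace" means collapsing runs to one space.
    return " ".join(text.split())


def preprocess(text):
    """Apply the steps in one possible order; the order is an assumption."""
    for step in (lowercase, remove_punctuation, remove_numbers,
                 remove_stop_words, collapse_whitespace):
        text = step(text)
    return text


print(preprocess("The 3 Cats,  and the DOG!"))  # prints: cats dog
```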
Formatting
Deep Learning
- vectorize-corpus
- normalize-corpus-by-padding
- normalize-corpus-by-truncation
- normalize-corpus-by-windowing
- normalize-corpus-by-zipfs-law
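The padding, truncation and windowing scripts all bring documents to a common length. The sketch below shows the usual meaning of those three terms on a list of token ids. It is an illustration under assumptions (the function names, a pad value of 0, non-overlapping windows), not the behavior of the scripts themselves. Vectorizing and Zipf's-law normalization are not sketched here.

```python
def pad(tokens, length, pad_value=0):
    """Extend a short sequence with pad_value until it reaches `length`."""
    return tokens + [pad_value] * max(0, length - len(tokens))


def truncate(tokens, length):
    """Keep only the first `length` tokens."""
    return tokens[:length]


def window(tokens, length, pad_value=0):
    """Split into consecutive, non-overlapping windows; pad the last one."""
    return [pad(tokens[i:i + length], length, pad_value)
            for i in range(0, len(tokens), length)]


doc = [1, 2, 3, 4, 5]
print(pad(doc, 8))       # [1, 2, 3, 4, 5, 0, 0, 0]
print(truncate(doc, 3))  # [1, 2, 3]
print(window(doc, 2))    # [[1, 2], [3, 4], [5, 0]]
```

Padding keeps every token at the cost of filler values, truncation discards the tail, and windowing keeps everything by turning one long document into several fixed-length samples.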
Misc