Building Blocks

Python MIT license

Below is a list of the corpus tools we use at Mind Mimic Labs. They are intended as building blocks for general research in our lab and as a source of publication boilerplate. Each tool is stand-alone and includes both code (~/code) and documentation (~/docs). A single requirements.txt in the root of the repo covers all the tools. The documentation for each tool explains what the code is for, how to run it, and what publication boilerplate to put in the Methods and Materials section.

Scripts

Unless otherwise noted, all scripts follow the same execution path.

  1. Open a command prompt
  2. Change into the ~/code folder.
  3. Run python {{scriptname}}.py -in d:/corpus_in -out d:/corpus_out, changing the input and output paths as needed.
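
For readers writing a new tool, the shared -in / -out convention above could be sketched as follows. This is a minimal illustration, not the code of any script in this repo: the names parse_args, process_text, run and main are hypothetical, and the assumption that a corpus is a folder tree of UTF-8 .txt files is ours.

```python
import argparse
import pathlib


def parse_args(argv=None):
    # "in" is a Python keyword, so the values are stored under explicit dest names.
    parser = argparse.ArgumentParser(description="Hypothetical building-block script")
    parser.add_argument("-in", dest="folder_in", type=pathlib.Path, required=True,
                        help="folder containing the input corpus")
    parser.add_argument("-out", dest="folder_out", type=pathlib.Path, required=True,
                        help="folder that receives the processed corpus")
    return parser.parse_args(argv)


def process_text(text):
    # Placeholder transformation; each real tool would do its own work here.
    return text.lower()


def run(folder_in, folder_out):
    # Mirror every .txt file (assumed format) from folder_in into folder_out.
    written = []
    for source in sorted(folder_in.rglob("*.txt")):
        target = folder_out / source.relative_to(folder_in)
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(process_text(source.read_text(encoding="utf-8")),
                          encoding="utf-8")
        written.append(target)
    return written


def main(argv=None):
    args = parse_args(argv)
    return run(args.folder_in, args.folder_out)
```

Calling main() from an `if __name__ == "__main__":` guard would make the file runnable in the same way as step 3 above.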

The current scripts are listed below. In general, run documents-to-corpus first, then the other scripts. Individual papers, projects, and repos give the exact order in the Tabula Rasa section of their README.md.

Data Pre-Processing

  1. remove-stop-words
  2. lowercase-corpus
  3. stem-corpus
  4. remove-whitespace-from-corpus
  5. remove-punction-from-corpus
  6. remove-numbers-from-corpus
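
As a rough illustration of what the pre-processing steps listed above typically do to a single document, here is a standard-library sketch. It is not the repo's implementation: the function names, the tiny stop-word set, and the choice to collapse (rather than delete) whitespace are all assumptions, and stemming is left out because it needs a third-party stemmer.

```python
import re
import string

# Tiny illustrative stop-word set; an assumption, not the list the repo uses.
STOP_WORDS = {"a", "an", "the", "of", "and"}


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Delete every ASCII punctuation character.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text):
    # Delete runs of digits.
    return re.sub(r"\d+", "", text)


def remove_stop_words(text):
    return " ".join(word for word in text.split() if word not in STOP_WORDS)


def remove_whitespace(text):
    # Assumed meaning: collapse runs of whitespace into single spaces.
    return " ".join(text.split())


def preprocess(text):
    # Lowercase first so that stop-word matching is case-insensitive.
    for step in (lowercase, remove_punctuation, remove_numbers,
                 remove_stop_words, remove_whitespace):
        text = step(text)
    return text
```

The order used in preprocess is only one reasonable choice; follow the order given by the individual project where it matters.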

Formatting

  1. documents-to-corpus
  2. flatten-corpus
  3. re-encode-corpus

Deep Learning

  1. vectorize-corpus
  2. normalize-corpus-by-padding
  3. normalize-corpus-by-truncation
  4. normalize-corpus-by-windowing
  5. normalize-corpus-by-zipfs-law
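
The first three normalization strategies above can be pictured on a single vectorized document, represented here as a list of integer token IDs. This is a hedged sketch of the general techniques only: the function names, the pad value of 0, and the non-overlapping windows are assumptions, and the Zipf's-law variant is not covered.

```python
def pad(sequence, length, pad_value=0):
    # Extend short sequences with pad_value; longer ones are returned unchanged.
    return sequence + [pad_value] * max(0, length - len(sequence))


def truncate(sequence, length):
    # Keep only the first `length` tokens.
    return sequence[:length]


def window(sequence, length):
    # Split into consecutive, non-overlapping windows; the last may be shorter.
    return [sequence[i:i + length] for i in range(0, len(sequence), length)]
```

Padding and truncation yield one fixed-length sequence per document, while windowing yields several shorter sequences, so it keeps tokens that truncation would discard.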

Misc

  1. corpus-to-reading-level
  2. (un)nest-corpus