Building blocks for text pre-processing for Deep Learning

MindMimicLabs/building-blocks

Building Blocks

Python MIT license

Below is a list of the corpus tools we use at Mind Mimic Labs. They are intended as building blocks for both general research in our lab and publication boilerplate. Each tool is stand-alone and includes both code (~/code) and documentation (~/docs). A single requirements.txt covering all the tools is in the root of the repo. The documentation for each tool explains what the code is for, how to run it, and what publication boilerplate to put in the Methods and Materials section.

Scripts

Unless otherwise noted, all scripts follow the same execution path.

  1. Open a command prompt
  2. Change into the ~/code folder.
  3. Run python {{scriptname}}.py -in d:/corpus_in -out d:/corpus_out, replacing {{scriptname}} with the script's name and the input and output paths with your own.
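
The shared -in / -out convention above can be sketched as a small command-line skeleton. This is a minimal illustration, not the code found in ~/code: the function names, the choice of .txt files, and the UTF-8 encoding are all assumptions made for the example.

```python
import argparse
import pathlib


def parse_args(argv=None):
    """Parse the -in / -out flags used in step 3."""
    parser = argparse.ArgumentParser(description="Hypothetical corpus tool skeleton")
    # `in` is a Python keyword, so both flags get explicit destination names.
    parser.add_argument("-in", dest="folder_in", required=True, type=pathlib.Path)
    parser.add_argument("-out", dest="folder_out", required=True, type=pathlib.Path)
    return parser.parse_args(argv)


def process_corpus(folder_in, folder_out, transform):
    """Apply `transform` to every .txt file under folder_in (assumed layout),
    writing results to the same relative path under folder_out."""
    count = 0
    for source in sorted(folder_in.rglob("*.txt")):
        target = folder_out / source.relative_to(folder_in)
        target.parent.mkdir(parents=True, exist_ok=True)
        text = source.read_text(encoding="utf-8")  # assumed encoding
        target.write_text(transform(text), encoding="utf-8")
        count += 1
    return count


def main(argv=None):
    """Entry point: lowercases the corpus as a stand-in transform."""
    args = parse_args(argv)
    return process_corpus(args.folder_in, args.folder_out, str.lower)
```

Calling main(["-in", "d:/corpus_in", "-out", "d:/corpus_out"]) would mirror the command in step 3. Keeping the transform as a plain function argument lets one skeleton serve several text-to-text tools.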

The current scripts are listed below. In general, run documents-to-corpus first, then the other scripts. Individual papers, projects, and repos give the exact order in the Tabula Rasa section of their README.md.

Data Pre-Processing

  1. remove-stop-words
  2. lowercase-corpus
  3. stem-corpus
  4. remove-whitespace-from-corpus
  5. remove-punctuation-from-corpus
  6. remove-numbers-from-corpus
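
To give a feel for what these pre-processing steps do to a single document, here is a rough sketch of most of them as plain string functions. It is not the repo's implementation: the function names, the tiny stop-word list, and the order of the steps are assumptions, and stemming is left out because it would need a third-party stemmer.

```python
import re
import string

# Tiny illustrative stop-word list (an assumption); a real run would use a fuller one.
STOP_WORDS = {"a", "an", "the", "of", "and", "is"}


def lowercase(text):
    return text.lower()


def remove_punctuation(text):
    # Drops ASCII punctuation only; other Unicode punctuation is left alone.
    return text.translate(str.maketrans("", "", string.punctuation))


def remove_numbers(text):
    return re.sub(r"\d+", "", text)


def remove_stop_words(text):
    return " ".join(w for w in text.split() if w.lower() not in STOP_WORDS)


def remove_whitespace(text):
    # Collapses runs of whitespace to single spaces and trims the ends.
    return " ".join(text.split())


def preprocess(text):
    """Chain the steps in one assumed order."""
    steps = (lowercase, remove_punctuation, remove_numbers,
             remove_stop_words, remove_whitespace)
    for step in steps:
        text = step(text)
    return text


print(preprocess("The  3 Cats, and a Dog!"))  # prints "cats dog"
```

The order matters: removing punctuation before stop words, for example, lets "the," match the stop-word list.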

Formatting

  1. documents-to-corpus
  2. flatten-corpus
  3. re-encode-corpus
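
As a hedged illustration of two of the formatting tools, the sketch below flattens a nested folder into a single folder and re-encodes raw bytes. The behavior is assumed from the script names alone: the underscore-joined file names, the UTF-8 target, and both function names are hypothetical.

```python
import pathlib
import shutil


def flatten_corpus(folder_in, folder_out, separator="_"):
    """Copy every file under folder_in into folder_out with no subfolders.
    The relative path parts are joined with `separator` to keep names unique.
    folder_out is assumed to sit outside folder_in."""
    folder_out.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in sorted(folder_in.rglob("*")):
        if source.is_file():
            flat_name = separator.join(source.relative_to(folder_in).parts)
            shutil.copyfile(source, folder_out / flat_name)
            copied.append(flat_name)
    return copied


def re_encode(raw, source_encoding, target_encoding="utf-8"):
    """Decode bytes from a known source encoding and encode them again."""
    return raw.decode(source_encoding).encode(target_encoding)
```

For example, re_encode(data, "latin-1") would turn Latin-1 bytes into UTF-8 bytes. The source encoding has to be known or detected separately.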

Deep Learning

  1. vectorize-corpus
  2. normalize-corpus-by-padding
  3. normalize-corpus-by-truncation
  4. normalize-corpus-by-windowing
  5. normalize-corpus-by-zipfs-law
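
The vectorizing and length-normalizing ideas behind these tools can be sketched on plain token lists. This is an assumption-laden illustration rather than the repo's code: the function names, the use of 0 as the padding id, and the padding of the last window are choices made for the example. The Zipf's law variant is not sketched because its exact behavior is not described here.

```python
def vectorize(tokens, vocab=None):
    """Map tokens to integer ids in order of first appearance.
    Id 0 is reserved for padding (an assumption)."""
    vocab = {} if vocab is None else vocab
    ids = []
    for token in tokens:
        if token not in vocab:
            vocab[token] = len(vocab) + 1
        ids.append(vocab[token])
    return ids, vocab


def pad(ids, length, pad_value=0):
    """Right-pad a sequence up to `length`; longer sequences are unchanged."""
    return ids + [pad_value] * max(0, length - len(ids))


def truncate(ids, length):
    """Keep only the first `length` ids."""
    return ids[:length]


def window(ids, length, pad_value=0):
    """Split a sequence into non-overlapping windows of `length`,
    padding the final window so every window has the same size."""
    return [pad(ids[i:i + length], length, pad_value)
            for i in range(0, len(ids), length)]


ids, vocab = vectorize(["a", "b", "a", "c", "b"])
print(ids)             # prints [1, 2, 1, 3, 2]
print(pad(ids, 7))     # prints [1, 2, 1, 3, 2, 0, 0]
print(truncate(ids, 3))  # prints [1, 2, 1]
print(window(ids, 2))  # prints [[1, 2], [1, 3], [2, 0]]
```

Padding and truncation keep one sequence per document, while windowing turns a long document into several fixed-length training examples.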

Misc

  1. corpus-to-reading-level
  2. (un)nest-corpus
