Vectorize Corpus

A corpus is a collection of documents, and a document is a sequence of words. Deep learning networks typically operate on integers, not words. Vectorizing a corpus is the process of collecting all the words in all the documents and assigning each distinct word a unique integer. Word order within each document is preserved by vectorization.
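The idea can be illustrated with a minimal sketch. This is not the code from vectorize-corpus.py; the function name `vectorize`, whitespace tokenization, and the choice to number words in order of first appearance are all assumptions made for illustration.

```python
def vectorize(documents):
    """Map each distinct word to an integer, keeping word order per document.

    Hypothetical helper: ids are assigned in order of first appearance,
    and words are split on whitespace. The real script may differ.
    """
    vocab = {}        # word -> integer id
    vectorized = []   # one list of ids per document
    for doc in documents:
        ids = []
        for word in doc.split():
            if word not in vocab:
                vocab[word] = len(vocab)  # next unused integer
            ids.append(vocab[word])
        vectorized.append(ids)
    return vectorized, vocab


corpus = ["the cat sat", "the dog sat down"]
vectors, vocab = vectorize(corpus)
print(vectors)  # [[0, 1, 2], [0, 3, 2, 4]]
```

Each document keeps its length and word order; only the representation of each word changes.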

Run the script

These scripts are run slightly differently from the general form. An additional parameter, control (-ctrl), names a control file that is needed to reverse the vectorization. The control file can also be used to filter or vectorize additional documents later. Please update step 4 in the general form to be:

  • Run: python vectorize-corpus.py -in d:/corpus_in -out d:/corpus_out -ctrl d:/control.csv
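To show why a control file makes the operation reversible, here is a rough sketch. The actual layout of control.csv is not described here, so the two-column `word,id` format below is an assumption, as are the helper names `save_control`, `load_control`, `devectorize`, and `vectorize_known`.

```python
import csv


def save_control(vocab, path):
    """Write the word -> id mapping as rows of `word,id` (assumed layout)."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        for word, idx in vocab.items():
            writer.writerow([word, idx])


def load_control(path):
    """Read the mapping back into a dict of word -> int id."""
    with open(path, newline="", encoding="utf-8") as f:
        return {word: int(idx) for word, idx in csv.reader(f)}


def devectorize(ids, vocab):
    """Reverse operation: turn a list of ids back into words."""
    reverse = {idx: word for word, idx in vocab.items()}
    return " ".join(reverse[i] for i in ids)


def vectorize_known(text, vocab):
    """Vectorize a later document with an existing mapping.

    Words missing from the mapping are dropped, which acts as a filter.
    """
    return [vocab[w] for w in text.split() if w in vocab]
```

With a mapping such as `{"the": 0, "cat": 1, "sat": 2}`, `devectorize([0, 1, 2], vocab)` recovers the original words, and `vectorize_known("the bird sat", vocab)` keeps only the words already in the mapping.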

Academic Boilerplate

For the purposes of an academic paper, this script should not be reported as a real transformation. Instead, treat it as an accessibility step: the processing it performs carries about as much methodological weight as specifying a path on disk. If boilerplate text is required, consider the following:

After preprocessing, documents were vectorized.