- find a corpus at http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml, e.g. inaugural, and materialize it:

    python materialize_nltk_corpus.py inaugural
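If the corpus is not already present in your local nltk data directory, you can usually fetch it first with nltk's downloader (shown here for inaugural; whether this step is needed depends on your setup):

    python -c "import nltk; nltk.download('inaugural')"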
Next, set the required environment variables:

    source ./settings.sh

Or, just set the variables by hand:
    export HADOOP_VERSION=     # the version of hadoop you are using, e.g. 2.5.1
    export AVRO_VERSION=       # if you are using avro, the version, e.g. 1.7.7
    export HADOOP_HOME=        # the location of your hadoop installation
    export RELATIVE_PATH_JAR=  # the location of the hadoop streaming jar within HADOOP_HOME
    export NLTK_HOME=          # the location of your corpus, mappers and reducers
    export AVRO_JAR=           # if you are using avro, the location of the avro jar
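For example, a filled-in set of variables might look like the following; the paths, versions and jar names below are illustrative assumptions, so adjust them to match your installation:

    export HADOOP_VERSION=2.5.1
    export HADOOP_HOME=/usr/local/hadoop
    export RELATIVE_PATH_JAR=share/hadoop/tools/lib/hadoop-streaming-$HADOOP_VERSION.jar
    export AVRO_VERSION=1.7.7
    export AVRO_JAR=$HOME/avro/avro-mapred-$AVRO_VERSION-hadoop2.jar
    export NLTK_HOME=$HOME/nltk-hadoop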
On the sampa cluster, you may also need to execute

    source /shared/patents/settings.sh

in order to get hadoop, linuxbrew, python packages and nltk data to work.
You may also want to ensure that the mapper and reducer scripts are executable, for example as shown in the snippet below.
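A minimal sketch, assuming all of the mapper and reducer scripts live at the top level of $NLTK_HOME:

    chmod +x $NLTK_HOME/*.py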
Then run the tfidf pipeline (note that this depends upon avro, nltk and scikit-learn):

    ./mapred_tfidf --input INPUT_DIR --output OUTPUT_DIR

- run with the --help flag to view all options
- run with --force to automatically overwrite intermediate directories
On the cluster, this can all be done by executing ./run.sh, which sets the appropriate environment variables and uses the appropriate hdfs directories:

    cd /shared/patents/nltk-hadoop
    ./run.sh
To see the cosine similarities of all documents:

    ls $OUTPUT_DIR/part-*

To see the tfidf metrics for each document/word pair:

    ls $tfidf/part-*
With nose installed, run the tests with:

    nosetests
Hadoop streaming accepts any command as a mapper or reducer, but to use the map_reduce_utils
module, the basic pattern is as follows:
first, write a mapper like the abstract one below:
    #!/usr/bin/env python
    import sys

    import map_reduce_utils as mru


    def mapper():
        for in_key, in_value in mru.json_loader():
            out_key = {}    # the key that is emitted by hadoop as json
            out_value = {}  # the value that is emitted by hadoop as json
            mru.mapper_emit(out_key, out_value, sys.stdout)

    if __name__ == '__main__':
        mapper()  # feel free to pass arguments here as well
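As a concrete sketch of this pattern, a hypothetical mapper that emits a count of 1 for every word in each document might look like the following; the field names 'filename', 'words' and 'count' are assumptions about what an upstream job emits, not part of the map_reduce_utils API:

    #!/usr/bin/env python
    import sys

    import map_reduce_utils as mru


    def word_count_mapper():
        for in_key, in_value in mru.json_loader():
            # in_key is assumed to identify the document, and in_value is
            # assumed to hold the document's (cleaned, stemmed) word list
            for word in in_value['words']:
                out_key = {'filename': in_key['filename'], 'word': word}
                out_value = {'count': 1}
                mru.mapper_emit(out_key, out_value, sys.stdout)

    if __name__ == '__main__':
        word_count_mapper()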
then, write a reducer similar to:
    #!/usr/bin/env python
    import sys

    import map_reduce_utils as mru


    def reducer():
        for in_key, key_stream in mru.reducer_stream():
            values = []  # will contain each value associated with in_key
            for in_value in key_stream:
                values.append(in_value)
            # now, values contains all of the values stored as dicts, so we
            # can do our "reduction" with arbitrary python. note that you
            # don't need to store all of the in_values if, for example, we
            # only need a running sum
            out_key = {}    # the key that is emitted by hadoop as json
            out_value = {}  # the value that is emitted by hadoop as json
            mru.reducer_emit(out_key, out_value, sys.stdout)
            # you can also emit more than one key-value pair here, for
            # example one for each key-value pair where key = in_key:
            for value in values:
                out_key = {}    # the key that is emitted by hadoop as json
                out_value = {}  # the value that is emitted by hadoop as json
                mru.reducer_emit(out_key, out_value, sys.stdout)

    if __name__ == '__main__':
        reducer()  # feel free to pass arguments here as well
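Continuing the hypothetical word-count example, a matching reducer that keeps only a running sum (rather than storing every value) and passes the input key through unchanged might look like:

    #!/usr/bin/env python
    import sys

    import map_reduce_utils as mru


    def word_count_reducer():
        for in_key, key_stream in mru.reducer_stream():
            # keep a running sum instead of accumulating every in_value
            total = 0
            for in_value in key_stream:
                total += in_value['count']
            out_value = {'count': total}
            mru.reducer_emit(in_key, out_value, sys.stdout)

    if __name__ == '__main__':
        word_count_reducer()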
now, in your main driver (let's call it run_hadoop.py for future reference), invoke your mapper and reducer:
    import map_reduce_utils as mru

    # input_dir contains the lines piped into the mapper; output_dir is where
    # the results will be placed
    mru.run_map_reduce_job('mapper.py', 'reducer.py', input_dir, output_dir)

    # note that we can pass arguments or arbitrary commands as mappers and
    # reducers, and use the output of one job as the input of the next job
    # to chain MR jobs
    mru.run_map_reduce_job('second_mapper.py --arg 1', 'wc -l',
                           output_dir, second_MR_job_output_dir)
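Putting the pieces together, a minimal run_hadoop.py for the hypothetical word-count mapper and reducer sketched above might look like this; the script names and directory names are placeholders:

    #!/usr/bin/env python
    import map_reduce_utils as mru

    if __name__ == '__main__':
        # count words per document, then chain a job that counts how many
        # key-value pairs the first job emitted
        mru.run_map_reduce_job('word_count_map.py', 'word_count_red.py',
                               'corpus_dir', 'word_counts_dir')
        mru.run_map_reduce_job('cat', 'wc -l',
                               'word_counts_dir', 'line_count_dir')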
Before running the previous code, however, remember to define the appropriate environment variables. For example, in a shell, run:

    source settings.sh
    python run_hadoop.py
Note that:

- You don't need to use avro and json. If you want, you can specify the input and output format when invoking map_reduce_utils.run_map_reduce_job, as well as the tokenizers for the generators in both the mapper and reducer.
- You can run just a map job (i.e. no reducer) with map_reduce_utils.run_map_job (see the sketch after this list).
- To see a concrete example of a mapper and reducer, look at word_join_map.py and word_join_red.py.
- To see a concrete example of invoking a hadoop job, look at mapred_tfidf.py.
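A map-only job looks similar; the sketch below assumes that run_map_job takes the mapper command, input directory and output directory in the same order as run_map_reduce_job, which is an assumption — check the function's docstring for its exact signature:

    import map_reduce_utils as mru

    # run only the mapper over input_dir; no reducer is applied
    mru.run_map_job('word_count_map.py', input_dir, output_dir)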
After cleaning and stemming a document, we obtain a list of words, d, for that document. The tfidf score of a word w in d is defined as follows:

- let n be the number of times w appears in d
- let N be the length of d
- let D be the number of documents in the corpus
- let m be the number of documents in which the word w appears at least once

Then:

- tf = n / N (tf is the 'term frequency' of the word)
- idf = D / m (idf is the 'inverse document frequency' of the word)
- log_idf = log(D / m) (log_idf is the log inverse document frequency)
- tfidf = tf * idf
- tf_log_idf = tf * log_idf
These naming conventions are used in certain places in the codebase, for example in the docstrings for many mapper and reducer functions.
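As a quick worked example of these quantities (the numbers are made up, and the natural logarithm is used for log_idf here; the codebase may use a different base):

    from math import log

    n, N = 3, 100   # w appears 3 times in a document d of length 100
    D, m = 50, 10   # the corpus has 50 documents, 10 of which contain w

    tf = n / float(N)            # 0.03
    idf = D / float(m)           # 5.0
    log_idf = log(D / float(m))  # ~1.609
    tfidf = tf * idf             # 0.15
    tf_log_idf = tf * log_idf    # ~0.048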