Learning semantic relations with distributional similarity.
- A natural language processing pipeline based on DKPro Core, which uses Pig for local or Hadoop-based execution.
- Annotation: segmentation, part-of-speech tagging, lemmatization, and dependency parsing based on Stanford NLP. Any of these components can be replaced with an alternative implementation or model, e.g., the Stanford parser with the Berkeley parser, or a PCFG model with an RNN parser. This list provides an overview of available models.
- Feature extraction: all subtrees along a dependency parse that involve two tokens of a specific type are extracted as features. The token type is specified generically; implemented options are common nouns, proper nouns, and named entities, and other token types can be added easily. Features are weighted using the Lexicographer's mutual information[1].
- Classification with logistic regression as implemented in scikit-learn; see simsets for details.
- Clustering with Chinese Whispers. Extrinsic cluster evaluation uses various measures; see clustering_utils and evaluate_cw_clustering.
- Some ROOT[2] code to plot histograms of many samples and/or dimensions.
- Evaluation for both classification and clustering is done using the BLESS data set.
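
To illustrate the feature weighting mentioned in the feature extraction step, here is a minimal Python sketch of Lexicographer's mutual information (LMI) in one common formulation: the joint count of a token and a feature multiplied by their pointwise mutual information. The function name `lmi_weights`, the input format (a list of `(token, feature)` observations), and the feature strings are assumptions made for illustration only. The pipeline itself computes its weights in Pig, and its exact normalization may differ.

```python
import math
from collections import Counter


def lmi_weights(pairs):
    """Compute LMI for every observed (token, feature) pair.

    Uses the formulation LMI(t, f) = n(t, f) * log2(n(t, f) * N / (n(t) * n(f))),
    where N is the total number of observations. This is an illustrative
    sketch, not the project's implementation.
    """
    pairs = list(pairs)
    total = len(pairs)                                 # N
    joint = Counter(pairs)                             # n(t, f)
    token_counts = Counter(t for t, _ in pairs)        # n(t)
    feature_counts = Counter(f for _, f in pairs)      # n(f)

    weights = {}
    for (token, feature), n_tf in joint.items():
        # Pointwise mutual information of the token and the feature.
        pmi = math.log2(
            n_tf * total / (token_counts[token] * feature_counts[feature])
        )
        # LMI scales PMI by the joint count, which damps rare co-occurrences.
        weights[(token, feature)] = n_tf * pmi
    return weights


if __name__ == "__main__":
    # Hypothetical observations; the feature strings are made up.
    observations = [
        ("dog", "nsubj:bark"),
        ("dog", "nsubj:bark"),
        ("cat", "nsubj:meow"),
        ("cat", "nsubj:meow"),
    ]
    for pair, weight in sorted(lmi_weights(observations).items()):
        print(pair, round(weight, 3))
```

Multiplying by the joint count is what distinguishes LMI from plain PMI: pairs seen only once or twice no longer dominate the ranking.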
cd sensim
mvn package -Dmaven.test.skip=true -Phadoop-job
cd src/main/pig/
pig -P <propertyfile> -m <parameterfile> pipeline.pig &> <logfile>
# for local execution: as above, but replace the last command with
pig -x local -P properties -m parameters pipeline.pig
Import this Maven project into the IDE of your choice and run the method testCoreNLPAnnotator() in CoreNLPAnnotatorTest.java.
[1] http://wortschatz.uni-leipzig.de/~sbordag/papers/BordagMC08.pdf
[2] http://root.cern.ch