LCDMV

This is an implementation for several grammar induction models described in the the paper:

Hiroshi Noji, Yusuke Miayo, and Mark Johnson. 2016. Using Left-corner Parsing to Encode Universal Structural Constraints in Grammar Induction. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.

The models include DMV (dependency model with valence) and several extensions such as our structurally constrained model on the maximum degree of center-embedding using left-corner parsing.

Setup: Universal Dependencies

The primary datasets for experiment in LCDMV are Universal Dependencies (UD).

We assume for training and testing the input data is one of UD treebank, but it should be modified so that it follows the conventional CoNLL-X format. Specifically,

The comment lines strating with # must be removed; and
The word annotation lines (e.g., 2-3 haven't) must also be removed.

We also recommend to remove all punctuations prior to experiments.

The scripts script/to_conllx.py and script/remove_punct.py performs these preprocessing. They can be used as:

cat /xxx/UD_French/fr-ud-train.conllu | script/to_conllx.py | script/remove_punct.py > fr-ud-train-nopunct.conllx

All components in the system are included in the pre-assembled jar file lcdmv.jar. You can also build your own jar by ./bin/sbt assembly.

Learning

Baseline

The following command performs learning of the baseline featurized-DMV model with no structural constraints (except the UD function words constraint):

java -jar lcdmv.jar -actionType featureTrain -numIters 100 -trainPath /xxx/yyy-train.conllx \
-maxLength 15 -posdef cpos -numThreads 0 -unheadPOS udfunc -saveModelPath zzz-model.ser

The software supports many other options, which can be shown with java -jar lcdmv.jar -help. Note that not all options may be compatible with others.

E-step in EM is performed in parallel, and -numThreads specifies the number of threads in this computing. 0 is to use the maximum number of threads allowed (depending on the number of cores in the machine).

Constraints on left-corner stack depth

The following command reproduces the results with the constraint that we found the best: the maximum number of embedded tokens up to 3:

java -jar lcdmv.jar -actionType featureTrain -numIters 100 -trainPath /xxx/yyy-train.conllx \
-maxLength 15 -posdef cpos -numThreads 0 -unheadPOS udfunc -saveModelPath zzz-model.ser \
-maxDepth 1 -maxChunkSize 3

where -maxDepth 1 -maxChunkSize 3 means the stack depth >= 2 in the chart will be conditionally prohibited, in that it is allowed when the number of tokens in the embedded chunk is up to 3.

Dependency length

For the another baseline of dependency length constraint, the following setting reproduces the results:

java -jar lcdmv.jar -actionType featureTrain -numIters 100 -trainPath /xxx/yyy-train.conllx \
-maxLength 15 -posdef cpos -numThreads 0 -unheadPOS udfunc -saveModelPath zzz-model.ser \
-distanceBeta 0.1

Parsing

After learning, the following command outputs the predicted trees on test sentences and calculate (UAS) score against the gold treebank. The score is outputted in the standard output.

java -jar lcdmv.jar -actionType evaluate -testPath /xxx/yyy-test.conllx
-maxLength 40 -posdef cpos -loadModelPath zzz-model.ser -outputPath zzz-guess.conllx

Citation

If you use the code in research publications, please cite:

@InProceedings{noji-miyao-johnson:2016:EMNLP2016,
  author    = {Noji, Hiroshi  and  Miyao, Yusuke  and  Johnson, Mark},
  title     = {Using Left-corner Parsing to Encode Universal Structural Constraints in Grammar Induction},
  booktitle = {Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing},
  month     = {November},
  year      = {2016},
  address   = {Austin, Texas},
  publisher = {Association for Computational Linguistics},
  pages     = {33--43},
  url       = {https://aclweb.org/anthology/D16-1004}
}

Others

The paper above also used our hacked version of Naseem et al. (2010)'s grammar induction system, which can be available here.

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
bin		bin
lib		lib
project		project
script		script
src		src
LICENSE		LICENSE
README.md		README.md
build.sbt		build.sbt
lcdmv.jar		lcdmv.jar

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LCDMV

Setup: Universal Dependencies

Learning

Baseline

Constraints on left-corner stack depth

Dependency length

Parsing

Citation

Others

About

Releases

Packages

Languages

License

hiroshinoji/lcdmv

Folders and files

Latest commit

History

Repository files navigation

LCDMV

Setup: Universal Dependencies

Learning

Baseline

Constraints on left-corner stack depth

Dependency length

Parsing

Citation

Others

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages