README

04/05/2012

Nile -- a hierarchical, syntax-based discriminative alignment package

This document describes how to use and run Nile.

============================================
CONTENTS
============================================
I. REQUIREMENTS
II. PREPARING YOUR DATA
III. TRAINING
IV. USER-DEFINED FEATURES
V. ITERATIVE VITERBI TRAINING & INFERENCE
VI. TESTING
VII. OTHER OPTIONS
VIII. QUESTIONS/COMMENTS
IX. REFERENCES

============================================
I. REQUIREMENTS
============================================
Nile currently depends on a few packages for
logging, ui, implementation, and parallelization:
  A. python-gflags: a commandline flags module for python
     (http://code.google.com/p/python-gflags/)

  B. pyglog: a logging facility for python based on google-glog
     (http://www.isi.edu/~riesa/software/pyglog)

  C. svector: a python module for sparse vectors, by David Chiang
     A version is included in this distribution under svector/
     You can download the latest version from:
     http://www.isi.edu/~chiang/software/svector.tgz

  D. An MPI Implementation. We use MPICH2.
     (http://www.mcs.anl.gov/research/projects/mpich2/)
     Download the latest stable release for your architecture.
     Then follow the Installer's Guide, available in the documentation
     section of the website.
     See Section II for more information.

  E. Boost MPI Python bindings.
     (http://www.boost.org/users/download/)
     Installation instructions:
     http://www.boost.org/doc/libs/1_49_0/doc/html/mpi.html

============================================
II. PREPARING YOUR DATA
============================================

A.  To train an alignment model you will need some data.
    We use some simple canonical filenames below in describing each, but
    you can call them anything you'd like.

    1. train.f: a file of source-language sentences, one per line.
    2. train.e: a file of target-language sentences, one per line.
    3. train.a: a file of gold-standard alignments for each sentence pair
                in train.f and train.e; each line in the file should be a
                sequence of space-separated strings encoding a single link in
                f-e format.
                e.g.: 0-1 1-2 2-2 2-3 4-5
    4. train.e-parse: a file of target-language parse trees, one for each line
                in train.e; trees should be in standard Penn Treebank format, e.g.:
                (TOP (S (NP (DT the) (NN man)) (VP (VBD ate))))
                We use tokens -RRB- and -LRB- to represent right and left parentheses,
                respectively (see below).
    5. train.f-parse: a file of source-language parse-trees, one for each line
                in train.f (OPTIONAL)

    Also prepare heldout development and test data in the same manner.
    Source-tree files are optional, but all others are required.

    Throughout the rest of this document we use the same filename extensions
    as above for our development and test data, e.g.:
    dev.e <-- target-language sentences in heldout development data
    dev.f <-- source-language sentences in heldout development data
    test.e <-- target-language sentences in heldout test data
    test.f <-- source-language sentences in heldout test data

    ADDITIONAL NOTES:
    (i) Why use a heldout development (dev) and test set?
        After every epoch of training Nile checks it's
        current performance on this dev set. When performance is no longer
        increasing on this dev set, we say that we've converged and we stop
        training. (Section III(D) and III(E))

        Since we select the alignment model to ultimately use based on
        performance on our development set, we need a second heldout set
        to as a way to predict performance on truly unseen future data.
        Using the model we select based on development performance in
        Section III(E), we align our test.* data and note the accuracy.

    (ii) We relabel parentheses tokens before parsing,
         i.e. "(" -> -LRB- and ")" -> -RRB-. For example:
         $ sed -e 's/(/-LRB-/g' -e 's/)/-RRB-/g' < input > input.clean

         And then we parse, by doing:
         java -Xmx2600m -Xms2600m -jar berkeleyParser.jar \
              -gr eng_grammar.gr \
              -binarize \
              -maxLength 1000 < input.clean > output

         Because of the way cube pruning works, you will encounter far fewer
         search errors if you binarize your trees before training by using the
         -binarize flag.

    (ii) In case of sentences that failed to parse:
         Use a blank line, a 0 on a line by itself,
         or the Berkeley parser default failure string: (())
         to tell Nile to skip the affected sentence pair.

C.  Tables from GIZA++ output (Brown et al., 1993; Och and Ney, 2003)
    We run GIZA++ Model-4 on a large corpus, and compute p(e|f) and p(f|w)
    word association tables from simply counting links in the final Viterbi
    alignment.

    If you don't have time to run Model-4, that's fine. We've seen benefits
    from using counts from just HMM or Model-1 training.

    p(e|f) file format:
    <e-word> <f-word> p(e|f)

    p(f|e) file format:
    <f-word> <e-word> p(f|e)

D.  Alignment files from GIZA++ (OPTIONAL)
    You can pass up to two third-party alignment files to the trainer with flags
    --a1 and --a2 in nile.py. For --a1 we use intersection of Model-4 alignments
    from e->f and f->e directions. For --a2 we use grow-diag-final-and
    symmetrizatized alignments.

    These alignments will allow the trainer to fire indicator features for making
    the same predictions as your supplied alignments. Feel free to substitute any
    other type of alignments here as input. Using GIZA++ Model-4 intersection and
    grow-diag-final-and alignments here, we generally see a large F-score increase.

E.  Vocabulary files. We'll need to give the trainer (and aligner) some
    vocabulary files it will use to filter potentially large p(e|f) and p(f|e)
    data files. Keeping these full data files in memory can be prohibitively
    expensive.

    Concatenate your training and development e and f files and run
    prepare-vocab.py:
    $ cat train.e dev.e | ./prepare-vocab.py > e.vcb
    $ cat train.f dev.f | ./prepare-vocab.py > f.vcb

    Use these files as input to nile.py with flags --evcb and --fvcb.

============================================
III. TRAINING
============================================

Training a new model with Nile involves (1) specifying your data files
as commandline arguments, and (2) invoking training mode. We provide a
sample training script, train.sh, in this distribution invoking only the
flags required to get going.

A. Cluster computing
   The sample training script uses the Portable Batch System (PBS), a popular
   networked subsystem for controlling jobs on a computing cluster. You can
   remove the PBS directives at the top of the file if you are running locally
   on a single machine (we strongly recommend machines with multiple CPUs), or
   just modify the file to suit your architecture.

B. MPI
   Take note of where your MPI binaries, libraries,
   and MPI Python bindings live. Then modify the MPI Initialization section
   with the appropriate paths.

C. Training Name
   Every training run has a name. Your run's default name is:
   d<date>.k<beam-size>.n<cpu-pool-size>.<langpair>.target-tree.0

D. Running the program. On PBS, do:
   $ qsub train.sh
   Or, on a local machine with multiple CPUs, do:
   $ ./train.sh

D. Inspecting accuracy on the held-out data:
   To inspect held-out F-scores, do:
   $ grep F-score-dev <name>.err

   To sort held-out F-scores in descending order do:
   $ grep F-score-dev <name>.err | awk '{print $2}' | cat -n | sort -nr -k 2

E. Convergence
   If the highest-scoring epoch, H, is much earlier than your current
   epoch number, you have probably converged. Kill the training job
   and extract weights from epoch H:
   $ ./weights <name> H

   Weights will be written to file:
   <name>.weights-H

=====================================================
IV. USER-DEFINED FEATURES
=====================================================
You can add your own feature functions to Features.py
or maintain several different Feature modules for different
language pairs.

If you want to have Feature modules for, say,
Arabic-English and Chinese-English, name them:
Features_ar_en.py and Features_zh_en.py respectively.

Setting the --langpair flag with argument LANG1_LANG2 will
tell Nile to use these modules. Nile will look for a file called:
Features_LANG1_LANG2.py.

nile.py --e train.e \
        --f train.f \
        ...
        --langpair ar_en

=====================================================
V. ITERATIVE VITERBI TRAINING & INFERENCE (optional)
=====================================================
This procedure is somewhat time-consuming because you will need to train several
models, and align your data several times. However, if you have the time, the
improvement in alignment quality may be worth it.

Parse trees for both target and source text are required for this procedure.

1. Train a target-tree model as in section III.
2. Train a source-tree model by:
    (a) Transform your gold-standard data to e-f format;
        source-tree models will read and output alignments in
        e-f format as opposed to f-e format.
        $ perl -pe 's/(\d+)-(\d+)/$2-$1/g' < train.a.f-e > train.e.e-f

    (b) flip the argument flags for your e and f data when you run Nile. For example:
          python nile.py \
            --e train.f \
            --f train.e \
            --gold train.a.e-f \
            --ftrees train.e-parse \
            --etrees train.f-parse \
            --fdev dev.e \
            --edev dev.f \
            --ftreesdev dev.e-parse \
            --etreesdev dev.f-parse \
            --golddev dev.a.e-f \
            --fvcb e.vcb \
            --evcb f.vcb \
            --pfe GIZA++.m4.pef \
            --pef GIZA++.m4.pfe \
            --a1 train.m4i.e-f \
            --a2 train.m4gdfa.e-f \
            --a1_dev dev.m4i.e-f \
            --a2_dev dev.m4gdfa.e-f \
            --langpair zh-en \
            --train \
            --k 128

3. The next step of training involves learning target-tree and
   source-tree models again, but this time giving as input the
   outputs of the models learned in the first round. You do this
   with the --inverse and --inverse_dev flags.

  (a) Run Nile in --align mode and align your training data and then dev data with the
      source-tree model you've learned.

  (b) Flip the alignment links to f-e format and supply these to your next target-tree
      training with --inverse and --inverse_dev, e.g.:

      nile.py --e train.e --f train.f --a train.a.f-e --inverse train-st.a.f-e ... etc.

      Nile will fire features to softly enforce agreement between the two models.

  (c) Analogously, for your next source-tree model, flip the aligned 1-best alignments
      of your training and dev data from the target-tree model to e-f format, and supply
      it to Nile with the --inverse and --inverse_dev flags:

      nile.py --e train.f --f train.e --a train.e.e-f --inverse train-tt.a.e-f ... etc.

============================================
VI. TESTING
============================================
At test time, it is important to use the same types of parameters and input data you used
during training. If you trained a model with a beam of K=128, then keep that beam at test time.
If you used GIZA++ Model-4 alignments as input with flags --a1 and --a2, then similarly also
supply alignment predictions from GIZA++ at test time. Finally, binarize your trees on test
data the same way you did for training and development data.

  A. Preparing Vocabulary files:
     As with the training and development data, prepare source and target vcb files.
     $ ./prepare-vocab.py < test.e > test.e.vcb
     $ ./prepare-vocab.py < test.f > test.f.vcb

     Use these files as arguments for the --evcb and --fvcb flags to nile.py
     in your testing script.

  B. Editing test.sh
     Edit the WEIGHTS= line in test.sh for your weights filename.

  C. Set Nile to "align" mode.
     At test time, we replace the --train flag with the
     --align flag when running nile.py.

  D. Running test.sh
     Then, run test script test.sh. Using PBS, do:
     $ qsub test.sh
     Or, on a local machine with multiple CPUs:
     $ ./test.sh

     By default, alignment output is written in f-e format to:
     <name>.weights-H.test-output.a

  E. Evaluation
     If you are aligning data for which you have gold-standard alignments,
     you can calculate F-measure using our provided scripts. Remember,
     alignments are output in f-e format and should be compared against
     data in the same format.
     Usage: ./Fmeasure.py <your-file> <gold-file>


============================================
VII. OTHER OPTIONS
============================================
  A. L1 Regularization (Feature Selection; experimental)
  Nile implements a parallelized version of L1 Regularization via projection after each epoch.
  (Hastie 1996; Duchi et al., 2008; Martins et al. 2011)

  Enabling this feature will allow you to learn a much smaller model that, in our experiments,
  should achieve essentially the same accuracy or better. This is useful for scaling to very large
  training sets, and you may also see some generalization benefits.

  To enable, set Nile's L1 Tau coefficient variable to 1 with commandline flag:
  --tau 1

 B. Debiasing (experimental)
 While L1 yields sparse solutions, these solutions are known to be biased in magnitude which may
 negatively affect accuracy. After learning a sparse model with L1, we train a new model but this
 time, we only allow the features we learned in our sparse model to fire. This is called Debiasing.

  To enable:
  1. Turn debiasing mode on:
    --debiasing
  2. Tell Nile about the weight vector you learned during the L1 feature selection step,
  use --debiasing_weights and supply a weight vector in svector format:
    --debiasing_weights <sparse-model weights>
  (Make sure you have removed the --tau flag from your Nile invocation.)

C. Advanced Perceptron Updates
  1. Changing the default Oracle.
     In selecting the oracle towards which we update, Chiang et al. (2008)
     find that in their task, modify the traditional selection criterion
     from minimized loss to a linear combination of minimized loss and
     model score. We call this the "hope" oracle, because we have more
     of a chance to reach it; it has high model score and low loss.
     To use a "hope" oracle, use flag and argument:
     --oracle hope
  2. Changing the default hypothesis.
     In selecting the hypothesis that we update our model away from
     (and towards the oracle), we can select a hypothesis somewhat
     analogously to selecting the "hope" oracle as described above.
     In this case, we modify the default selection criterion from
     maximum model score to a linear combination of maximum model
     score and maximum loss. We call this the "fear" hypothesis;
     it has the nefarious property of having both a high model score
     (our model likes it), but also very high loss (it is a bad alignment).
     To use a "fear" hypothesis, use flag and argument:
     --hyp fear
  3. Changing the default learning rate.
     There is a single learning rate parameter used in the standard perceptron
     update which affects the magnitude of each update. It is set to 1.0 by default.
     To use a different learning rate, use:
     --learning_rate <new learning rate>

============================================
VIII. QUESTIONS/COMMENTS
============================================

Troubleshooting:
If you've gone through this brief guide and are having trouble getting
this software to work for you, send mail to Jason Riesa <riesa@isi.edu>.

Technical correspondence also welcomed.

If you are interested in contributing to this project please also let us know!

============================================
IX. REFERENCES
============================================
Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, Robert L. Mercer.
The Mathematics of Statistical Machine Translation: Parameter Estimation.
Computational Linguistics, Volume 19, Number 2, pages 263-311. June 1993.

David Chiang, Yuval Marton, and Philip Resnick. Online Large-Margin Training
of Syntactic and Structural Translation Features. 2008. Proceedings of EMNLP,
pages. 224-233.

John Duchi, Shai Shalev-Schwartz, Yoram Singer, and Tushar Chandra. Efficient
Projections onto the L1-Ball for Learning in High Dimensions. 2008. Proceedings of ICML.

Andre F. T. Martins, Noah A. Smith, Pedro M. Q. Aguiar, Mario A. T. Figueiredo.
Structured Sparsity in Structured Prediction. 2011. Proceedings of EMNLP,
pages 1500-1511.

Franz Josef Och and Hermann Ney. A Systematic Comparison of Various Statistical Alignment Models.
Computational Linguistics, Volume 29, Number 1, pages 19-51. March 2003.

Jason Riesa and Daniel Marcu. Hierarchical Search for Word Alignment. 2010.
Proceedings of ACL, pages 157-166.

Jason Riesa, Ann Irvine, and Daniel Marcu. Feature-Rich Language-Independent
Syntax-Based Alignment for Statistical Machine Translation. 2011.
Proceedings of EMNLP, pages 497-507.

Jason Riesa and Daniel Marcu. Automatic Parallel Fragment Extraction from Noisy Data.
2012. Proceedings of the NAACL HLT. To appear.

Robert Tibshirani. Regression shrinkage and selection via the lasso. 1996.
J. Royal. Statist. Soc B., Vol. 58, No. 1, pages 267-288.