Project 2 of NLP2, in which we implement a latent-variable conditional random field (LV-CRF) for the task of translating a source sentence `x` into a target sentence `y`. Latent inversion transduction grammar (ITG) trees mapping between `x` and `y` act as the latent variables. The trees are stored compactly as hypergraph forests, in which each hyperedge is featurized into a vector `phi` and has a local potential function. Stochastic optimization of a weight vector `w` is performed to fit the model to observed translation pairs `(x, y)`. For more details, read the project description or the paper that partly inspired it.
See the final report for our findings.
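As a minimal sketch of how a hyperedge is scored (the function and variable names here are illustrative, not the project's actual API): the edge's feature vector `phi` is dotted with the weight vector `w` and exponentiated to give the local potential.

```python
import numpy as np

def edge_potential(w, phi):
    """Local potential of a hyperedge: exp(w . phi(edge)).

    w   : weight vector (numpy array)
    phi : feature vector of the edge (numpy array, same length as w)
    """
    return np.exp(np.dot(w, phi))

# Toy example with 3 features: the score is exp(0.5*1 + -1.0*0 + 2.0*1) = exp(2.5)
w = np.array([0.5, -1.0, 2.0])
phi = np.array([1.0, 0.0, 1.0])
print(edge_potential(w, phi))  # exp(2.5) ≈ 12.18
```

The probability of a derivation is then proportional to the product of its edge potentials, normalized over the forest.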
The following papers are useful reference material for the CRF model: for example, we can take some of their plots and figures as inspiration.
- Use `save-parses.py` to save the parse forests of a number of sentence pairs from a corpus. In `translations` you can set `k` and `null` to control how many translations (`k`) and insertions (`null`) to make. Set the size of the corpus in `read_data` and the maximal sentence length just below it.
- Use `train.py` to load these parses and train on them. For SGD we scale the learning rate each time we make a weight-vector update (i.e. each minibatch); see section 5.2 of this paper on SGD tricks. This introduces a new hyperparameter `lmbda`, which controls the rate of scaling. We now start with a high learning rate of around 1 to 10 and let the formula scale it down during training.
- Use `predict.py` to load a trained weight vector `w` and some `parses` in the right format, and predict the best translations (Viterbi and sampled). Write these to a prediction .txt file in the folder `predict`. These can be used to compute BLEU scores with respect to a reference with this command.
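The learning-rate scaling presumably follows the decay schedule recommended in the SGD-tricks paper referenced above; a sketch with illustrative names (`delta_0` and `lmbda` match the hyperparameter names used elsewhere in this README):

```python
def scaled_learning_rate(delta_0, lmbda, t):
    """Bottou-style decay: eta_t = delta_0 / (1 + delta_0 * lmbda * t).

    delta_0 : initial learning rate (here a high value, around 1 to 10)
    lmbda   : hyperparameter controlling the rate of scaling
    t       : number of weight-vector updates (minibatches) made so far
    """
    return delta_0 / (1.0 + delta_0 * lmbda * t)

# With delta_0=10 and lmbda=0.01 the rate starts at 10 and halves after 10 updates:
print(scaled_learning_rate(10, 0.01, 0))   # 10.0
print(scaled_learning_rate(10, 0.01, 10))  # 5.0
```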
Let's train with two types of parses: small sentences of length 10 with only 2 translations (plus `-EPS-`, so 3), and small sentences of length 10 with only 4 translations (plus `-EPS-`, so 5). With the new parallel parser we can now easily do `max_sents=40000`. See the settings below and the link to the Dropbox where they are located.
-
  ```python
  ch_en, en_ch, _, _ = translations(path='data/lexicon', k=3, null=3, remove_punct=True)
  corpus = read_data(max_sents=40000)
  corpus = [(ch, en) for ch, en in corpus if len(en.split()) < 10]
  ```
  Link to training parses. Link to dev parses.
-
  ```python
  ch_en, en_ch, _, _ = translations(path='data/lexicon', k=5, null=5, remove_punct=True)
  corpus = read_data(max_sents=40000)
  corpus = [(ch, en) for ch, en in corpus if len(en.split()) < 10]
  ```
  Link to training parses. Link to dev parses.
**Note:** When you select the sentences of a certain length you get fewer than 40k! The first example, with `< 10`, gives 28372 parses. To make sure that in the final training we can fairly compare the runs for the two different parse types, let's only use parses 0-28k.
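To keep that comparison fair, both parse sets can simply be truncated to the same prefix; a trivial sketch, assuming the saved parses are held in lists:

```python
def truncate_parses(parses, n=28000):
    """Keep only the first n parses so both parse types train on the same data prefix."""
    return parses[:n]

# Stand-ins for the two parse sets (real ones are loaded from disk)
parses_3trans = list(range(28372))
parses_5trans = list(range(30000))
assert len(truncate_parses(parses_3trans)) == len(truncate_parses(parses_5trans)) == 28000
```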
**Note:** Tim has made a parallel version of save-parses! You can now use the branch `parallel` to check it out for yourself. If you have 4 cores you can simply run `python save-parses.py --num-cores 8` and see the magic of parallel computing unfold in front of your eyes. Warning: expect massive speedups (4x or more) and some beautiful wind-tunnel effects from your desktop/laptop.
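The parallel parser presumably distributes sentence pairs over worker processes; a minimal sketch using Python's `multiprocessing` (the parse function and corpus here are stand-ins, not the project's actual parser):

```python
from multiprocessing import Pool

def parse_pair(pair):
    """Stand-in for parsing one (chinese, english) sentence pair into a forest."""
    ch, en = pair
    return (ch, en, len(en.split()))  # a real parser would return the ITG forest here

def parse_corpus(corpus, num_cores=4):
    """Parse all sentence pairs, num_cores at a time."""
    with Pool(num_cores) as pool:
        return pool.map(parse_pair, corpus)

if __name__ == '__main__':
    corpus = [('zh1', 'a b c'), ('zh2', 'd e')]
    print(parse_corpus(corpus, num_cores=2))
```

Since parsing each pair is independent, the speedup is close to linear in the number of cores.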
- DONE Train on `eps-40k-ml10-3trans` for one iteration with these settings. (Took 11 hours.)
- DONE Train on `eps-40k-ml10-5trans` for one iteration with these settings. (Took 13 hours.)
- One iteration over the whole of `eps-40k-ml10-3trans`: weights. See training settings here.
- One iteration over the whole of `eps-40k-ml10-5trans`: weights. See training settings here.
We have some wonderful training-set translations! See also the reference translations of the training set.
Viterbi translations

- Translations for `eps-40k-ml10-3trans` and their probabilities. Results: `BLEU = 4.04, 45.7/7.2/2.0/0.5` (200 sentences).
- Translations for `eps-40k-ml10-5trans`. Results: `BLEU = 0.00, 32.3/2.6/0.1/0.0` (200 sentences).
Sampled translations

- Translations for `eps-40k-ml10-3trans` and their sample frequency. Results: `BLEU = ...` (200 sentences).
We have obtained the following translations with the above trained weights. See also the reference translations of the dev-set.
Viterbi translations

- Translations for `eps-40k-ml10-3trans`. Results: `BLEU = 0.00, 75.6/12.5/1.5/0.2` (200 sentences); `BLEU = 0.00, 74.7/11.5/1.6/0.2` (500 sentences).
- Translations for `eps-40k-ml10-5trans`. Results: `BLEU = 0.00, 65.4/6.4/0.4/0.0` (200 sentences); `BLEU = 0.00, 65.4/6.5/0.3/0.0` (500 sentences).
Sampled translations

- Translations for `eps-40k-ml10-3trans` and their sample frequency. Results: `BLEU = ...` (200 sentences).
As an interesting baseline we use the IBM1 word-translations to generate sentence-translations by monotonically translating the Chinese sentences word-by-word using this code.
This achieves the following results:
- Translations of the training set. Results: `BLEU = 7.22, 60.6/12.8/3.4/1.3` (200 sentences).
- Translations of the dev set. Results: `BLEU = 0.00, 83.8/18.4/3.4/0.4` (200 sentences).
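The monotone baseline amounts to replacing each Chinese word with its most probable English translation under the IBM1 lexicon, keeping source order. A sketch with a toy lexicon (the `ch_en` structure here — word to candidate-probability dict — is an assumption, not necessarily the project's actual format):

```python
def monotone_translate(sentence, ch_en):
    """Translate word by word in source order: no reordering, no insertions.

    ch_en : maps each Chinese word to a dict of English candidates with probabilities.
    Unknown words are passed through unchanged.
    """
    out = []
    for word in sentence.split():
        candidates = ch_en.get(word)
        if candidates:
            out.append(max(candidates, key=candidates.get))  # most probable translation
        else:
            out.append(word)
    return ' '.join(out)

# Toy IBM1 lexicon with illustrative probabilities
ch_en = {'我': {'i': 0.8, 'me': 0.2}, '爱': {'love': 0.9, 'like': 0.1}, '你': {'you': 1.0}}
print(monotone_translate('我 爱 你', ch_en))  # i love you
```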
Here is a small selection of individual comparisons of translations.
See these translations for our best result so far! This was achieved by training 1 iteration over 1300 sentences of maximal length 9, parsed with `eps=True` and maximally 3 epsilon insertions, with minibatch size 1, `delta_0=10`, `lmbda=0.01`, `scale_weight=2` and `regularizer=False`. See the correct translations for reference. (Also note that later iterations get worse, which you can see here.) Lastly: we achieve a BLEU score of 3.44 on these translations (hurray!): `BLEU = 3.44, 49.8/6.2/1.1/0.5 (BP=0.967, ratio=0.968, hyp_len=1222, ref_len=1263)`.
- The problem with derivations for which `p(y,d|x) = nan` lies in the weights vector `w`. It still occurs, even with the hack described above, though only with long sentences. I think this is because for a long sentence the derivation has many edges, and then the `sum([estimated_weights[edge] for edge in derivation])` that we use in `join_prob` to compute `p(y,d|x)` gets upset. NOTE: this is not really an issue: we still get Viterbi estimates! We just cannot compute the correct probability.
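One standard fix for the `nan` is to work entirely in log space: store log-potentials per edge and add them, so the product of many edge potentials never underflows or overflows. A sketch with illustrative names (this is not the project's actual `join_prob`):

```python
def log_joint_prob(log_weights, derivation, log_Z):
    """log p(y,d|x) = sum of edge log-potentials minus the log partition function.

    log_weights : dict mapping edge -> log-potential, i.e. w . phi(edge)
    derivation  : iterable of edges making up the derivation
    log_Z       : log of the normalizer (inside score of the forest root)
    """
    return sum(log_weights[edge] for edge in derivation) - log_Z

# Toy example: exp(-750) underflows to 0.0 in floating point,
# but the log-space computation stays exact.
log_weights = {'e1': -400.0, 'e2': -350.0}
print(log_joint_prob(log_weights, ['e1', 'e2'], log_Z=-749.0))  # -1.0
```

The normalizer itself should also be accumulated with log-sum-exp (e.g. `scipy.special.logsumexp`) rather than by summing raw potentials.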