# Neural Language Modeling

## Pre-trained models

| Model | Description | Dataset | Download |
|---|---|---|---|
| `transformer_lm.gbw.adaptive_huge` | Adaptive Inputs (Baevski and Auli, 2018), 1026M params | Google Billion Words | download (.tar.bz2) |
| `transformer_lm.wiki103.adaptive` | Adaptive Inputs (Baevski and Auli, 2018), 247M params | WikiText-103 | download (.tar.bz2) |
| `transformer_lm.wmt19.en` | English LM (Ng et al., 2019) | WMT News Crawl | download (.tar.gz) |
| `transformer_lm.wmt19.de` | German LM (Ng et al., 2019) | WMT News Crawl | download (.tar.gz) |
| `transformer_lm.wmt19.ru` | Russian LM (Ng et al., 2019) | WMT News Crawl | download (.tar.gz) |

## Example usage

We require a few additional Python dependencies for preprocessing:

```bash
pip install fastBPE sacremoses
```

To sample from a language model using PyTorch Hub:

```python
import torch

# List available models
torch.hub.list('pytorch/fairseq')  # [..., 'transformer_lm.wmt19.en', ...]

# Load an English LM trained on WMT'19 News Crawl data
en_lm = torch.hub.load('pytorch/fairseq', 'transformer_lm.wmt19.en', tokenizer='moses', bpe='fastbpe')
en_lm.eval()  # disable dropout

# Move model to GPU
en_lm.cuda()

# Sample from the language model
en_lm.sample('Barack Obama', beam=1, sampling=True, sampling_topk=10, temperature=0.8)
# "Barack Obama is coming to Sydney and New Zealand (...)"

# Compute perplexity for a sequence
en_lm.score('Barack Obama is coming to Sydney and New Zealand')['positional_scores'].mean().neg().exp()
# tensor(15.1474)

# The same interface can be used with custom models as well
from fairseq.models.transformer_lm import TransformerLanguageModel
custom_lm = TransformerLanguageModel.from_pretrained('/path/to/model/dir', 'checkpoint100.pt', tokenizer='moses', bpe='fastbpe')
custom_lm.sample('Barack Obama', beam=5)
# "Barack Obama (...)"
```

## Training a transformer language model with the CLI tools

### 1) Preprocess the data

First download and prepare the WikiText-103 dataset:

```bash
cd examples/language_model/
bash prepare-wikitext-103.sh
cd ../..
```

Next preprocess/binarize the data:

```bash
TEXT=examples/language_model/wikitext-103
fairseq-preprocess \
    --only-source \
    --trainpref $TEXT/wiki.train.tokens \
    --validpref $TEXT/wiki.valid.tokens \
    --testpref $TEXT/wiki.test.tokens \
    --destdir data-bin/wikitext-103 \
    --workers 20
```

### 2) Train a language model

Next we'll train a basic transformer language model on wikitext-103. For more advanced examples (e.g., using adaptive inputs), please see the Transformer LM README.

To train a basic LM (assumes 2 GPUs):

```bash
fairseq-train --task language_modeling \
  data-bin/wikitext-103 \
  --save-dir checkpoints/transformer_wikitext-103 \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 2048 --update-freq 16 \
  --fp16 \
  --max-update 50000
```

If you run out of memory, try reducing `--max-tokens` (max number of tokens per batch) or `--tokens-per-sample` (max sequence length). You can also adjust `--update-freq` to accumulate gradients and simulate training on a different number of GPUs.
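
As a rough sketch (not part of the original recipe), the same command could be run on a single GPU by halving `--max-tokens` and doubling `--update-freq`, which keeps the effective number of tokens per update of the 2-GPU command above:

```bash
# Sketch: single-GPU variant of the recipe above (assumes 1 GPU).
# --max-tokens is halved and --update-freq doubled so that the effective
# tokens-per-update matches the original 2-GPU setting.
fairseq-train --task language_modeling \
  data-bin/wikitext-103 \
  --save-dir checkpoints/transformer_wikitext-103 \
  --arch transformer_lm --share-decoder-input-output-embed \
  --dropout 0.1 \
  --optimizer adam --adam-betas '(0.9, 0.98)' --weight-decay 0.01 --clip-norm 0.0 \
  --lr 0.0005 --lr-scheduler inverse_sqrt --warmup-updates 4000 --warmup-init-lr 1e-07 \
  --tokens-per-sample 512 --sample-break-mode none \
  --max-tokens 1024 --update-freq 32 \
  --fp16 \
  --max-update 50000
```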

### 3) Evaluate

```bash
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
    --max-sentences 2 \
    --tokens-per-sample 512 \
    --context-window 400
# | Evaluated 245569 tokens in 56.1s (4379.02 tokens/s)
# | Loss: 3.4164, Perplexity: 30.46
```

*Note:* The `--context-window` option controls how much context is provided to each token when computing perplexity. When the window size is 0, the dataset is chunked into segments of length 512 (`--tokens-per-sample`) and perplexity is computed over each segment normally. However, this results in worse (higher) perplexity, since tokens that appear earlier in each segment have less conditioning. When the maximum window size is used (511 in this case), we compute perplexity for each token fully conditioned on 511 tokens of context. This slows down evaluation significantly, since we must run a separate forward pass for every token in the dataset, but results in better (lower) perplexity.
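
For reference, an evaluation with the maximum 511-token window would just be the command above with `--context-window` changed; expect it to run much more slowly:

```bash
# Sketch: evaluate with the maximum context window (511 = tokens-per-sample - 1).
# Slower than --context-window 400, but each token gets full conditioning.
fairseq-eval-lm data-bin/wikitext-103 \
    --path checkpoints/transformer_wikitext-103/checkpoint_best.pt \
    --max-sentences 2 \
    --tokens-per-sample 512 \
    --context-window 511
```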

## Convolutional language models

Please see the convolutional LM README for instructions to train convolutional language models.