A wrapper around tensor2tensor to flexibly train, interact, and generate data for neural chatbots.
The wiki contains my notes and summaries of over 150 recent publications related to neural dialog modeling.
💾 Run your own trainings or experiment with pre-trained models
✅ 4 different dialog datasets integrated with tensor2tensor
🔀 Seamlessly works with any model or hyperparameter set in tensor2tensor
🚀 Easily extendable base class for dialog problems
Run setup.py, which installs the required packages and steps you through downloading additional data:
python setup.py
You can download all trained models used in this paper from here. Each training contains two checkpoints, one for the validation loss minimum and another after 150 epochs. The data and the trainings folder structure match each other exactly.
python t2t_csaky/main.py --mode=train
The mode argument can be one of the following four: {generate_data, train, decode, experiment}. In the experiment mode you can specify what to do inside the experiment function of the run file. A detailed explanation of what each mode does is given below.
You can control the flags and parameters of each mode directly in the config file. For each run that you initiate, this file will be copied to the appropriate directory, so you can quickly access the parameters of any run. There are some flags that you have to set for every mode (the FLAGS dictionary in the config file):
- t2t_usr_dir: Path to the directory where my code resides. You don't have to change this, unless you rename the directory.
- data_dir: The path to the directory where you want to generate the source and target pairs, and other data. The dataset will be downloaded one level higher from this directory into a raw_data folder.
- problem: This is the name of a registered problem that tensor2tensor needs. Detailed in the generate_data section below. All paths should be from the root of the repo.
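As a rough illustration, the three required flags could look like this in the config file. Only the dictionary name FLAGS and the key names come from the description above; every value is a placeholder, and the raw_data_dir helper is a hypothetical function that is not part of the repo. It only shows one reading of "one level higher from this directory".

```python
import os

# Only the key names come from the README; every value is a placeholder.
FLAGS = {
    "t2t_usr_dir": "t2t_csaky",               # directory holding the wrapper code
    "data_dir": "data_dir/DailyDialog/base",  # where source/target pairs are generated
    "problem": "daily_dialog_chatbot",        # a registered problem name
}


def raw_data_dir(data_dir):
    """Hypothetical helper: the raw_data folder one level above data_dir."""
    parent = os.path.dirname(os.path.normpath(data_dir))
    return os.path.join(parent, "raw_data")


print(raw_data_dir(FLAGS["data_dir"]))
```

All paths here are relative to the root of the repo, as required above.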
This mode downloads and preprocesses the data and generates source and target pairs. Currently there are 6 registered problems that you can use besides the ones given by tensor2tensor:
- persona_chat_chatbot: This problem implements the Persona-Chat dataset (without the use of personas).
- daily_dialog_chatbot: This problem implements the DailyDialog dataset (without the use of topics, dialog acts or emotions).
- opensubtitles_chatbot: This problem can be used to work with the OpenSubtitles dataset.
- cornell_chatbot_basic: This problem implements the Cornell Movie-Dialog Corpus.
- cornell_chatbot_separate_names: This problem uses the same Cornell corpus, but the names of the speaker and addressee of each utterance are appended, resulting in source utterances like the one below.
BIANCA_m0 what good stuff ? CAMERON_m0
- character_chatbot: This is a general character-based problem that works with any dataset. Before using this, the .txt files generated by any of the problems above have to be placed inside the data directory, and after that this problem can be used to generate tensor2tensor character-based data files.
The PROBLEM_HPARAMS dictionary in the config file contains problem specific parameters that you can set before generating data:
- num_train_shards/num_dev_shards: Number of files over which the generated train or dev data is sharded.
- vocabulary_size: Size of the vocabulary that we want to use for the problem. Words outside this vocabulary will be replaced with an unknown-word token.
- dataset_size: Number of utterance pairs to use. Set it to 0 to use the full dataset.
- dataset_split: Specify a train-val-test split for the problem.
- dataset_version: This is only relevant to the opensubtitles dataset. Since there are several versions of this dataset, you can specify the year of the version that you want to download.
- name_vocab_size: This is only relevant to the cornell problem with separate names. You can set the size of the vocabulary containing only the personas.
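The sketch below shows what a PROBLEM_HPARAMS dictionary might look like and how two of its entries could behave. The key names come from the list above; all values, the shape of dataset_split, the `<unk>` token string, and the three helper functions are assumptions made for illustration and are not the repo's actual implementation.

```python
from collections import Counter

# Key names are from the README; all values are illustrative assumptions.
PROBLEM_HPARAMS = {
    "num_train_shards": 1,
    "num_dev_shards": 1,
    "vocabulary_size": 16384,
    "dataset_size": 0,  # 0 means: use the full dataset
    "dataset_split": {"train": 80, "val": 10, "test": 10},  # assumed format
    "dataset_version": 2012,   # only used by the opensubtitles problem
    "name_vocab_size": 3000,   # only used by cornell_chatbot_separate_names
}


def limit_pairs(pairs, dataset_size):
    """Keep the first dataset_size pairs, or all of them when it is 0."""
    return pairs if dataset_size == 0 else pairs[:dataset_size]


def build_vocab(utterances, vocabulary_size):
    """Return the vocabulary_size most frequent whitespace-separated words."""
    counts = Counter(word for utt in utterances for word in utt.split())
    return {word for word, _ in counts.most_common(vocabulary_size)}


def replace_oov(tokens, vocab, unk="<unk>"):
    """Replace every token outside vocab with the (assumed) unknown token."""
    return [tok if tok in vocab else unk for tok in tokens]


vocab = build_vocab(["a a b", "a b c"], vocabulary_size=2)
print(replace_oov("a b c".split(), vocab))
```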
This mode allows you to train a model with the specified problem and hyperparameters. The code just calls the tensor2tensor training script, so any model that is in tensor2tensor can be used. Besides these, there is also a subclassed model with small modifications:
- gradient_checkpointed_seq2seq: Small modification of the LSTM-based seq2seq model, so that custom hparams can be used throughout. Before calculating the softmax, the LSTM hidden units are projected to 2048 linear units as here. Finally, I tried to implement gradient checkpointing for this model, but it is currently removed because it didn't give good results.
There are several additional flags that you can specify for a training run in the FLAGS dictionary in the config file, some of which are:
- train_dir: Name of the directory where the training checkpoint files will be saved.
- model: Name of the model: either one of the above or a tensor2tensor defined model.
- hparams: Specify a registered hparams_set, or leave empty if you want to define hparams in the config file. In order to specify hparams for a seq2seq or transformer model, you can use the SEQ2SEQ_HPARAMS and TRANSFORMER_HPARAMS dictionaries in the config file (check it for more details).
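A minimal sketch of a training setup with hparams defined in the config file rather than through a registered hparams_set. The FLAGS keys and the TRANSFORMER_HPARAMS dictionary name come from the text above; the values and the keys inside TRANSFORMER_HPARAMS are assumptions (they are borrowed from common tensor2tensor Transformer hparams, so check the config file for the real ones). The helper is hypothetical and only shows how a dictionary can be turned into the comma-separated name=value string that tensor2tensor accepts as hparams overrides.

```python
# Key names are from the README; values are placeholders.
FLAGS = {
    "train_dir": "train_dir/example_run",  # where checkpoints are saved
    "model": "transformer",                # any tensor2tensor model name
    "hparams": "",                         # empty: use the dictionary below
}

# Assumed keys; the real dictionary in the config file may differ.
TRANSFORMER_HPARAMS = {
    "hidden_size": 512,
    "num_hidden_layers": 6,
    "num_heads": 8,
}


def hparams_override_string(hparams):
    """Hypothetical helper: format a dict as 'name=value,name=value'."""
    return ",".join(
        "{}={}".format(name, value) for name, value in sorted(hparams.items()))


if not FLAGS["hparams"]:
    print(hparams_override_string(TRANSFORMER_HPARAMS))
```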
With this mode you can decode from the trained models. The following parameters affect the decoding (in the FLAGS dictionary in the config file):
- decode_mode: One of interactive, file, or dataset. In interactive mode you can chat with the model using the command line. In file mode you specify a file with source utterances for which to generate responses. In dataset mode the provided validation data is randomly sampled and responses are output for the samples.
- decode_dir: Directory where you place the file to decode from. The output responses are also saved here.
- input_file_name: Name of the file that you have to give in file mode (placed in the decode_dir).
- output_file_name: Name of the file, inside decode_dir, where output responses will be saved.
- beam_size: Size of the beam, when using beam search.
- return_beams: If False, return only the top beam; otherwise return beam_size beams.
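The decoding flags can be sketched as follows. The key names and the three decode_mode values come from the list above; the values, file names, and both helper functions are hypothetical and only illustrate how the flags relate to each other. The select_beams helper assumes the beams are already sorted best first.

```python
import os

# Key names are from the README; values are placeholders.
FLAGS = {
    "decode_mode": "file",  # one of: interactive, file, dataset
    "decode_dir": "decode_dir/example_run",
    "input_file_name": "inputs.txt",
    "output_file_name": "outputs.txt",
    "beam_size": 4,
    "return_beams": False,
}


def decode_paths(flags):
    """Hypothetical helper: input and output file paths inside decode_dir."""
    return (
        os.path.join(flags["decode_dir"], flags["input_file_name"]),
        os.path.join(flags["decode_dir"], flags["output_file_name"]),
    )


def select_beams(beams, return_beams):
    """Return all beams if return_beams is set, otherwise only the top one.

    Assumes beams is a list sorted from best to worst.
    """
    return list(beams) if return_beams else list(beams[:1])


print(decode_paths(FLAGS))
print(select_beams(["hi !", "hello .", "hey"], FLAGS["return_beams"]))
```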
The following results are from these two papers.
Loss and Metrics of Transformer Trained on Cornell
TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. For an explanation of the metrics see the paper.
Responses from Transformer and Seq2seq Trained on Cornell and Opensubtitles
S2S is a simple seq2seq model with LSTMs trained on Cornell, others are Transformer models. Opensubtitles F is pre-trained on Opensubtitles and finetuned on Cornell.
Loss and Metrics of Transformer Trained on DailyDialog
TRF is the Transformer model, while RT means randomly selected responses from the training set and GT means ground truth responses. For an explanation of the metrics see the paper.
Responses from Transformer Trained on DailyDialog
Check the issues for some additions where help is appreciated. Any contributions are welcome ❤️
Please try to follow the code syntax style used in the repo (flake8, 2 spaces indent, 80 char lines, commenting a lot, etc.)
New problems can be registered by subclassing WordChatbot or, even better, CornellChatbotBasic or OpensubtitleChatbot, because they implement some additional functionality. Usually it's enough to override the preprocess and create_data functions. Check the documentation for more details and see daily_dialog_chatbot for an example.
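The overriding pattern can be sketched as below. Note that the WordChatbot class here is a self-contained stand-in written for this example, not the class from the repo: the method signatures, the generate_pairs driver, and the MyDialogChatbot name are all assumptions. In the real code you would import the base class from the repo, register the subclass with tensor2tensor, and match the signatures used in daily_dialog_chatbot.

```python
class WordChatbot:
    """Stand-in base class for illustration only, NOT the repo's class."""

    def preprocess(self, raw_dialogs):
        raise NotImplementedError

    def create_data(self, dialogs):
        raise NotImplementedError

    def generate_pairs(self, raw_dialogs):
        # Hypothetical driver: clean the raw data, then build the pairs.
        return self.create_data(self.preprocess(raw_dialogs))


class MyDialogChatbot(WordChatbot):
    """Hypothetical new problem that overrides the two functions."""

    def preprocess(self, raw_dialogs):
        # Lowercase every utterance and collapse repeated whitespace.
        return [[" ".join(utt.lower().split()) for utt in dialog]
                for dialog in raw_dialogs]

    def create_data(self, dialogs):
        # Each utterance is the source for the utterance that follows it.
        pairs = []
        for dialog in dialogs:
            pairs.extend(zip(dialog[:-1], dialog[1:]))
        return pairs


raw = [["Hi  there", "Hello", "How are you ?"]]
for source, target in MyDialogChatbot().generate_pairs(raw):
    print(source, "->", target)
```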
New models and hyperparameters can be added by following the tensor2tensor tutorial.
- Richard Csaky (If you need any help with running the code: [email protected])
This project is licensed under the MIT License - see the LICENSE file for details.
Please include a link to this repo if you use it in your work and consider citing the following paper:
@InProceedings{Csaky:2017,
title = {Deep Learning Based Chatbot Models},
author = {Csaky, Richard},
year = {2019},
publisher={National Scientific Students' Associations Conference},
url ={https://tdk.bme.hu/VIK/DownloadPaper/asdad},
note={https://tdk.bme.hu/VIK/DownloadPaper/asdad}
}