Multimodal emotion recognition using speech and text. The proposed model is inspired by the encoder-decoder architecture used in Neural Machine Translation (NMT). In our multimodal model, we use a main modality (text) and an auxiliary modality (speech). The main modality performs the classification given its own input and intermediate representations generated by the auxiliary modality.
This repository was developed as part of my semester project at the Chair for Mathematical Information Science at ETH Zürich in Spring 2019. The full report with a detailed description of our approach is available here.
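The exact architecture (including the attention variant) is described in the report; for intuition only, here is a minimal conceptual sketch of the general idea in `tf.keras`, where a speech encoder produces intermediate states that condition the text branch performing the classification. All layer sizes, feature dimensions and the four-class label set below are assumptions made for illustration, not the repository's actual model.

```python
# Minimal conceptual sketch in tf.keras (TF 1.13) -- not the repository's actual
# model. Sizes, feature dimensions and the 4-class setup are illustrative assumptions.
import tensorflow as tf

NUM_CLASSES = 4                     # assumed emotion classes
VOCAB_SIZE, MAX_TOKENS = 10000, 50  # assumed text vocabulary size / sequence length
AUDIO_STEPS, AUDIO_DIM = 300, 40    # assumed audio frames x features per frame

# Auxiliary modality: speech encoder producing intermediate representations.
audio_in = tf.keras.layers.Input(shape=(AUDIO_STEPS, AUDIO_DIM), name="audio")
_, audio_h, audio_c = tf.keras.layers.LSTM(128, return_state=True)(audio_in)

# Main modality: the text branch consumes the speech encoder's states,
# loosely mirroring how an NMT decoder consumes encoder states.
text_in = tf.keras.layers.Input(shape=(MAX_TOKENS,), name="text")
text_emb = tf.keras.layers.Embedding(VOCAB_SIZE, 128)(text_in)
text_feat = tf.keras.layers.LSTM(128)(text_emb, initial_state=[audio_h, audio_c])

probs = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(text_feat)
model = tf.keras.Model(inputs=[text_in, audio_in], outputs=probs)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```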
- Python 3 (tested with Python 3.7)
- TensorFlow (tested with version 1.13.1)
- An exhaustive list of requirements can be found in `requirements.txt`
- A GPU is not required, although it significantly accelerates training (tested with CUDA 10.0)
The project comprises the directories `data`, `model`, `parameters` and `preprocessing`. After training a model, two other folders are created, namely `graphs` and `pretrained-models`.
The `data` directory contains all the data used, in its raw and processed stages. The dataset used is the IEMOCAP dataset, which is available under a license agreement at the previous link. The data actually used by our model is preprocessed from the raw dataset. Our preprocessed data can be made available upon request, provided that the person already has access to the IEMOCAP dataset.
The `model` directory contains all of the implemented models. All the models have three main files:

- `process_[model]_data.py`
- `train_[model].py`
- `evaluate_[model].py`

where `[model]` is the model of interest (`text`, `audio`, `multimodal`, `multimodal_attention`). `process_[model]_data.py` contains the data input pipeline of the model of interest: loading, batching and splitting the data into training, validation and test sets (a sketch of such a split is shown below). `train_[model].py` contains the main training loop. `evaluate_[model].py` evaluates the performance of the model on the validation and test sets and creates the confusion matrix for the test set.
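As a rough illustration of the splitting step only: the 80/10/10 proportions, function name and fixed seed below are assumptions, not the exact code in `process_[model]_data.py`.

```python
# Illustrative train/validation/test split over parallel NumPy arrays;
# proportions, names and the fixed seed are assumptions.
import numpy as np

def split_data(features, labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle and split parallel arrays into train/val/test subsets."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(features))
    n_train = int(train_frac * len(idx))
    n_val = int(val_frac * len(idx))
    splits = {
        "train": idx[:n_train],
        "val": idx[n_train:n_train + n_val],
        "test": idx[n_train + n_val:],
    }
    return {name: (features[i], labels[i]) for name, i in splits.items()}
```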
The `parameters` directory contains the file `parameters.py`, which holds all of the relevant parameters for the simulation, divided into sections.
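The actual parameter names and values differ; the hypothetical excerpt below only illustrates the sectioned layout.

```python
# Hypothetical excerpt of /parameters/parameters.py -- the real parameter names
# and values in the repository may differ; only the sectioned layout is the point.

# ---------- general ----------
num_classes = 4
batch_size = 32

# ---------- text model ----------
text_vocab_size = 10000
text_lstm_units = 128

# ---------- audio model ----------
audio_num_features = 40
audio_lstm_units = 128

# ---------- multimodal models ----------
multimodal_hidden_units = 256
```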
The `preprocessing` directory contains the file `prepare_raw_audio.py`, which reads all the raw audio files used by our model, truncates or zero-pads them, and saves them to the expected directory within the `/data` folder.
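The core operation is of the following form; this is a minimal sketch, and the function name and target length (here 10 s at 16 kHz) are assumptions rather than the script's actual values.

```python
# Illustrative fixed-length truncation / zero padding of a raw waveform;
# the function name and target length are assumptions.
import numpy as np

def truncate_or_pad(signal, target_len=160000):
    """Return `signal` cut or zero-padded to exactly `target_len` samples."""
    signal = np.asarray(signal, dtype=np.float32)
    if len(signal) >= target_len:
        return signal[:target_len]
    return np.pad(signal, (0, target_len - len(signal)), mode="constant")
```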
The `graphs` directory is created once the model starts training. It contains two folders, `/graph_train` and `/graph_val`, with the information that can be visualized on TensorBoard, including the model's graph and the training and validation accuracies and losses.
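These folders are the usual output of TF 1.x summary writers; the sketch below shows how such logs are typically produced. The scalar name and dummy value are assumptions, not the repository's code.

```python
# Illustrative TF 1.x summary logging into graphs/graph_train and graphs/graph_val.
import tensorflow as tf

loss_value = tf.placeholder(tf.float32, name="loss_value")
loss_summary = tf.summary.scalar("loss", loss_value)

with tf.Session() as sess:
    train_writer = tf.summary.FileWriter("graphs/graph_train", sess.graph)
    val_writer = tf.summary.FileWriter("graphs/graph_val")
    summary = sess.run(loss_summary, feed_dict={loss_value: 0.42})
    train_writer.add_summary(summary, global_step=0)
    train_writer.close()
    val_writer.close()
```

TensorBoard can then be pointed at the `graphs` directory to inspect both runs.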
The `pretrained-models` directory is created after the model is trained. It stores the whole model (graph and weights), which can be used for inference at a later stage.
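One common TF 1.x mechanism for this is `tf.train.Saver`; whether the repository uses exactly this is an assumption, and the toy graph below is purely illustrative.

```python
# Illustrative only: persisting graph + weights to pretrained-models with
# tf.train.Saver; the toy graph is an assumption.
import os
import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 3], name="inputs")
out = tf.layers.dense(x, 1, name="outputs")
saver = tf.train.Saver()

os.makedirs("pretrained-models", exist_ok=True)
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    saver.save(sess, "pretrained-models/model.ckpt")  # writes .meta (graph) + weights
```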
To train a model, the first step is obtaining all the data. If you already have all the preprocessed data in the correct folder within the `/data` directory, you are good to go! If you would like to truncate the raw audio files differently, you can edit that in `/preprocessing/prepare_raw_audio.py` and run it once. The preprocessed raw audio files will be saved to the correct directory.
Once you have all the data correctly placed, you can edit the model's parameters in `/parameters/parameters.py`. The parameters for all the models are in this single file, but they are organized in sections, so it is important that you edit the parameters in the correct section.
With the correct parameters set, it is time to train! The proportion of the data used to build the training, validation and test sets is hard-coded in `train_[model].py`. To train a model, run `train_[model].py`, where `[model]` is one of `text`, `audio`, `multimodal`, `multimodal_attention`. The model is evaluated on the validation set every 50 batches, but that can be changed in `train_[model].py` inside the training loop (see the sketch below).
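The overall shape of the loop is roughly the following. This is a self-contained toy sketch: only the evaluate-every-50-batches pattern reflects the README, while the tiny graph and random data are assumptions.

```python
# Illustrative TF 1.x training loop with periodic validation every EVAL_EVERY batches.
# The toy graph and random data are assumptions, not the repository's actual model.
import numpy as np
import tensorflow as tf

EVAL_EVERY = 50

x = tf.placeholder(tf.float32, [None, 8])
y = tf.placeholder(tf.int64, [None])
logits = tf.layers.dense(x, 4)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=y, logits=logits))
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
accuracy = tf.reduce_mean(tf.cast(tf.equal(tf.argmax(logits, 1), y), tf.float32))

train_x, train_y = np.random.randn(640, 8).astype(np.float32), np.random.randint(0, 4, 640)
val_x, val_y = np.random.randn(64, 8).astype(np.float32), np.random.randint(0, 4, 64)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(200):
        idx = np.random.choice(len(train_x), 32)
        sess.run(train_op, feed_dict={x: train_x[idx], y: train_y[idx]})
        if step % EVAL_EVERY == 0:
            val_acc = sess.run(accuracy, feed_dict={x: val_x, y: val_y})
            print("step %d, validation accuracy %.3f" % (step, val_acc))
```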
After training, the model is evaluated on the test set and the full model is saved to `/pretrained-models`.
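For later inference, a checkpoint written with `tf.train.Saver` (an assumption about the saving mechanism, as in the sketch above) can be restored like this:

```python
# Illustrative restore of a checkpoint from pretrained-models for inference;
# assumes the model was saved with tf.train.Saver as sketched above.
import tensorflow as tf

with tf.Session() as sess:
    saver = tf.train.import_meta_graph("pretrained-models/model.ckpt.meta")
    saver.restore(sess, "pretrained-models/model.ckpt")
    # the graph and weights are now loaded into `sess` and ready for inference
```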
A significant part of the code for this project builds on the code from multimodal-speech-emotion.