This is a TensorFlow implementation of a bidirectional, self-attentive LSTM as proposed by Coskun et al. (2018) in their paper Human Motion Analysis with Deep Metric Learning. The model differs in the following ways from the one proposed in the paper:
- No layer normalization on the LSTM
- Does not use triplet loss
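The resulting architecture can be sketched roughly as below. This is a hedged illustration, not the repository's exact code: the layer sizes, the class count, and the Dense-based attention are my assumptions; the attention follows the A = softmax(W2 tanh(W1 Hᵀ)) formulation of Lin et al. (2017) that Coskun et al. build on.

```python
import tensorflow as tf

def build_alstm(seq_len, n_features, lstm_size=128, D=64, R=10,
                embedding_size=64, n_classes=15):
    """Sketch: bidirectional LSTM + structured self-attention."""
    inputs = tf.keras.Input(shape=(seq_len, n_features))
    # Bidirectional LSTM, without layer normalization (see the note above)
    H = tf.keras.layers.Bidirectional(
        tf.keras.layers.LSTM(lstm_size, return_sequences=True))(inputs)
    # Structured self-attention (Lin et al., 2017):
    # A = softmax(W2 tanh(W1 H^T)), expressed here as bias-free Dense layers
    u = tf.keras.layers.Dense(D, activation="tanh", use_bias=False)(H)  # (b, T, D)
    A = tf.keras.layers.Dense(R, use_bias=False)(u)                     # (b, T, R)
    A = tf.keras.layers.Softmax(axis=1)(A)   # attention weights over time steps
    # M = A^T H: R weighted averages of the hidden states
    M = tf.keras.layers.Lambda(
        lambda t: tf.matmul(t[0], t[1], transpose_a=True))([A, H])      # (b, R, 2*lstm_size)
    x = tf.keras.layers.Flatten()(M)
    # FC embedding with Leaky ReLU instead of plain ReLU
    x = tf.keras.layers.Dense(embedding_size)(x)
    x = tf.keras.layers.LeakyReLU()(x)
    outputs = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs)
```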
**TODO:** fix the TQDM progress bar (currently not working).
The model is ready-to-use for classification tasks on the Human3.6M dataset. It works with the 2D and 3D joint position data. Please visit http://vision.imar.ro/human3.6m/ in order to contact the maintainers of the dataset and request access. I do not own the dataset and do not have permission to redistribute the data.
python main.py --path PATH_TO_DATA
`environment.yml` contains the environment that I used to work on and run the model. NB: the environment was created for an M1 Apple Silicon chip, so not all packages may be compatible across platforms.
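To recreate the environment, something like the following should work (assuming conda is installed; the placeholder environment name is whatever the `name:` field in `environment.yml` defines):

```shell
# Recreate the environment from the provided file (assumes conda is
# available; on non-ARM machines some pinned packages may need swapping).
conda env create -f environment.yml
# Activate it; the name is taken from the "name:" field in environment.yml.
conda activate <env-name>
```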
The following are required:
- Python 3
- `tensorflow`
- `tensorflow-addons`
- `cdflib` (to work with the `.cdf` files which contain the joint position data)
- `numpy`
- `sklearn`
The following arguments can be supplied when running the script:
- `--path`: path to the data. The model can work with both the 2D and 3D joint position data of the Human3.6M dataset
- `--seq_len`: maximum length of the sequences. Sequences in the dataset will be cut down to this length. Should be >= the shortest sequence length
- `--downsample_rate`: rate by which to downsample existing sequences (e.g., 5 means only keep every 5th frame)
- `--normalize`
- `--onehot`
- `--add_noise`
- `--noise_factor`
- `--shuffle_size`
- `--train_test_split`
- `--lstm_size`
- `--dropout_rate`
- `--R`
- `--D`
- `--embedding_size`
- `--classification`
- `--batch_size`
- `--epochs`
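A typical invocation might look like the following. The path and flag values are purely illustrative examples, not recommendations from the repository:

```shell
# Illustrative run; adjust the path and hyperparameters to your setup.
python main.py --path /data/human36m \
               --seq_len 100 \
               --downsample_rate 5 \
               --batch_size 32 \
               --epochs 50
```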
*A-LSTM network architecture (figure taken from Coskun et al., 2018)*
The network largely follows the architecture from Coskun et al. (2018). Some minor tweaks have been made:
- Leaky ReLU activation on the FC layers (instead of normal ReLU)
The model uses joint position data from the Human3.6M dataset, a large motion capture dataset, as first introduced in Human3.6m: Large scale datasets and predictive methods for 3D human sensing in natural environments by Ionescu et al. It can work with both the 2D and 3D joint position data. I neither own the dataset nor have permission to redistribute it, so please visit http://vision.imar.ro/human3.6m/ and follow the instructions there in order to get access.
Technically, the model can work with any kind of joint position/joint angle data, as long as it is shaped appropriately when fed into the network.
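A minimal loading-and-shaping sketch is given below. The CDF variable name `"Pose"` and the `(1, frames, joints*dims)` layout are my assumptions about the Human3.6M export, not something this repository guarantees:

```python
# Hedged sketch: load one Human3.6M joint-position file with cdflib and
# reshape it to (frames, features) for the network.
import numpy as np

def preprocess(pose, seq_len=None, downsample_rate=1):
    """Flatten to (frames, features), downsample, and cut to seq_len."""
    pose = np.asarray(pose)
    pose = pose.reshape(-1, pose.shape[-1])   # (frames, joints*dims)
    pose = pose[::downsample_rate]            # keep every n-th frame
    if seq_len is not None:
        pose = pose[:seq_len]                 # cut down to the maximum length
    return pose.astype(np.float32)

def load_sequence(path, **kwargs):
    import cdflib  # imported lazily so preprocess() works without cdflib
    cdf = cdflib.CDF(path)
    # "Pose" is an assumed variable name for the H3.6M joint positions
    return preprocess(cdf.varget("Pose"), **kwargs)
```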
Useful resources:
- https://towardsdatascience.com/create-your-own-custom-attention-layer-understand-all-flavours-2201b5e8be9e
- https://jalammar.github.io/visualizing-neural-machine-translation-mechanics-of-seq2seq-models-with-attention/
- https://omoindrot.github.io/triplet-loss
- https://aiden.nibali.org/blog/2016-09-06-neural-network-implementation-tricks/
Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer Normalization. https://doi.org/10.48550/arxiv.1607.06450
Coskun, H., Tan, D. J., Conjeti, S., Navab, N., & Tombari, F. (2018). Human Motion Analysis with Deep Metric Learning. Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 11218 LNCS, 693–710. https://doi.org/10.48550/arxiv.1807.11176
Lin, Z., Feng, M., dos Santos, C. N., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A Structured Self-attentive Sentence Embedding. 5th International Conference on Learning Representations, ICLR 2017 - Conference Track Proceedings. https://doi.org/10.48550/arxiv.1703.03130