Awesome Multimodal Machine Learning

By Paul Liang, Machine Learning Department and Language Technologies Institute, CMU, with help from members of the MultiComp Lab at LTI, CMU. If there are any areas, papers, or datasets I missed, please let me know!

Course content + workshops

Check out our comprehensive tutorial paper Foundations and Recent Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions.

Tutorials on Multimodal Machine Learning at CVPR 2022 and NAACL 2022, slides and videos here.

New course 11-877 Advanced Topics in Multimodal Machine Learning Spring 2022 @ CMU. It will primarily be reading and discussion-based. We plan to post discussion probes, relevant papers, and summarized discussion highlights every week on the website.

Public course content and lecture videos from 11-777 Multimodal Machine Learning, Fall 2020 @ CMU.

Table of Contents

Survey Papers
Core Areas
Architectures
- Multimodal Transformers
- Multimodal Memory
Applications and Datasets
Workshops
Tutorials
Courses

Research Papers

Survey Papers

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions, arXiv 2023

Multimodal Learning with Transformers: A Survey, TPAMI 2023

Trends in Integration of Vision and Language Research: A Survey of Tasks, Datasets, and Methods, JAIR 2021

Experience Grounds Language, EMNLP 2020

A Survey of Reinforcement Learning Informed by Natural Language, IJCAI 2019

Multimodal Machine Learning: A Survey and Taxonomy, TPAMI 2019

Multimodal Intelligence: Representation Learning, Information Fusion, and Applications, arXiv 2019

Deep Multimodal Representation Learning: A Survey, arXiv 2019

Guest Editorial: Image and Language Understanding, IJCV 2017

Representation Learning: A Review and New Perspectives, TPAMI 2013

A Survey of Socially Interactive Robots, 2003

Core Areas

Multimodal Representations

Identifiability Results for Multimodal Contrastive Learning, ICLR 2023 [code]

Unpaired Vision-Language Pre-training via Cross-Modal CutMix, ICML 2022

Balanced Multimodal Learning via On-the-fly Gradient Modulation, CVPR 2022

Unsupervised Voice-Face Representation Learning by Cross-Modal Prototype Contrast, IJCAI 2021 [code]

Towards a Unified Foundation Model: Jointly Pre-Training Transformers on Unpaired Images and Text, arXiv 2021

FLAVA: A Foundational Language And Vision Alignment Model, arXiv 2021

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021

MultiBench: Multiscale Benchmarks for Multimodal Representation Learning, NeurIPS 2021 [code]

Perceiver: General Perception with Iterative Attention, ICML 2021 [code]

Learning Transferable Visual Models From Natural Language Supervision, arXiv 2021 [blog] [code]

VinVL: Revisiting Visual Representations in Vision-Language Models, arXiv 2021 [blog] [code]

12-in-1: Multi-Task Vision and Language Representation Learning, CVPR 2020 [code]

Watching the World Go By: Representation Learning from Unlabeled Videos, arXiv 2020

Learning Video Representations using Contrastive Bidirectional Transformer, arXiv 2019

Visual Concept-Metaconcept Learning, NeurIPS 2019 [code]

OmniNet: A Unified Architecture for Multi-modal Multi-task Learning, arXiv 2019 [code]

Learning Representations by Maximizing Mutual Information Across Views, arXiv 2019 [code]

ViCo: Word Embeddings from Visual Co-occurrences, ICCV 2019 [code]

Unified Visual-Semantic Embeddings: Bridging Vision and Language With Structured Meaning Representations, CVPR 2019

Multi-Task Learning of Hierarchical Vision-Language Representation, CVPR 2019

Learning Factorized Multimodal Representations, ICLR 2019 [code]

A Probabilistic Framework for Multi-view Feature Learning with Many-to-many Associations via Neural Networks, ICML 2018

Do Neural Network Cross-Modal Mappings Really Bridge Modalities?, ACL 2018

Learning Robust Visual-Semantic Embeddings, ICCV 2017

Deep Multimodal Representation Learning from Temporal Data, CVPR 2017

Is an Image Worth More than a Thousand Words? On the Fine-Grain Semantic Differences between Visual and Linguistic Representations, COLING 2016

Combining Language and Vision with a Multimodal Skip-gram Model, NAACL 2015

Deep Fragment Embeddings for Bidirectional Image Sentence Mapping, NIPS 2014

Multimodal Learning with Deep Boltzmann Machines, JMLR 2014

Learning Grounded Meaning Representations with Autoencoders, ACL 2014

DeViSE: A Deep Visual-Semantic Embedding Model, NeurIPS 2013

Multimodal Deep Learning, ICML 2011

Multimodal Fusion

Robust Contrastive Learning against Noisy Views, arXiv 2022

Cooperative Learning for Multi-view Analysis, arXiv 2022

What Makes Multi-modal Learning Better than Single (Provably), NeurIPS 2021

Efficient Multi-Modal Fusion with Diversity Analysis, ACMMM 2021

Attention Bottlenecks for Multimodal Fusion, NeurIPS 2021

VMLoc: Variational Fusion For Learning-Based Multimodal Camera Localization, AAAI 2021

Trusted Multi-View Classification, ICLR 2021 [code]

Deep-HOSeq: Deep Higher-Order Sequence Fusion for Multimodal Sentiment Analysis, ICDM 2020

Removing Bias in Multi-modal Classifiers: Regularization by Maximizing Functional Entropies, NeurIPS 2020 [code]

Deep Multimodal Fusion by Channel Exchanging, NeurIPS 2020 [code]

What Makes Training Multi-Modal Classification Networks Hard?, CVPR 2020

Dynamic Fusion for Multimodal Data, arXiv 2019

DeepCU: Integrating Both Common and Unique Latent Information for Multimodal Sentiment Analysis, IJCAI 2019 [code]

Deep Multimodal Multilinear Fusion with High-order Polynomial Pooling, NeurIPS 2019

XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification, IEEE TNNLS 2019 [code]

MFAS: Multimodal Fusion Architecture Search, CVPR 2019

The Neuro-Symbolic Concept Learner: Interpreting Scenes, Words, and Sentences From Natural Supervision, ICLR 2019 [code]

Unifying and merging well-trained deep neural networks for inference stage, IJCAI 2018 [code]

Efficient Low-rank Multimodal Fusion with Modality-Specific Factors, ACL 2018 [code]

Memory Fusion Network for Multi-view Sequential Learning, AAAI 2018 [code]

Tensor Fusion Network for Multimodal Sentiment Analysis, EMNLP 2017 [code]

Jointly Modeling Deep Video and Compositional Text to Bridge Vision and Language in a Unified Framework, AAAI 2015

A co-regularized approach to semi-supervised learning with multiple views, ICML 2005

Multimodal Alignment

Reconsidering Representation Alignment for Multi-view Clustering, CVPR 2021 [code]

CoMIR: Contrastive Multimodal Image Representation for Registration, NeurIPS 2020 [code]

Multimodal Transformer for Unaligned Multimodal Language Sequences, ACL 2019 [code]

Temporal Cycle-Consistency Learning, CVPR 2019 [code]

See, Hear, and Read: Deep Aligned Representations, arXiv 2017

On Deep Multi-View Representation Learning, ICML 2015

Unsupervised Alignment of Natural Language Instructions with Video Segments, AAAI 2014

Multimodal Alignment of Videos, MM 2014

Deep Canonical Correlation Analysis, ICML 2013 [code]

Multimodal Pretraining

Align before Fuse: Vision and Language Representation Learning with Momentum Distillation, NeurIPS 2021 Spotlight [code]

Less is More: ClipBERT for Video-and-Language Learning via Sparse Sampling, CVPR 2021 [code]

Transformer is All You Need: Multimodal Multitask Learning with a Unified Transformer, arXiv 2021

Large-Scale Adversarial Training for Vision-and-Language Representation Learning, NeurIPS 2020 [code]

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020 [code]

Integrating Multimodal Information in Large Pretrained Transformers, ACL 2020

VL-BERT: Pre-training of Generic Visual-Linguistic Representations, arXiv 2019 [code]

VisualBERT: A Simple and Performant Baseline for Vision and Language, arXiv 2019 [code]

ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks, NeurIPS 2019 [code]

Unicoder-VL: A Universal Encoder for Vision and Language by Cross-modal Pre-training, arXiv 2019

LXMERT: Learning Cross-Modality Encoder Representations from Transformers, EMNLP 2019 [code]

VideoBERT: A Joint Model for Video and Language Representation Learning, ICCV 2019

Multimodal Translation

Zero-Shot Text-to-Image Generation, ICML 2021 [code]

Translate-to-Recognize Networks for RGB-D Scene Recognition, CVPR 2019 [code]

Language2Pose: Natural Language Grounded Pose Forecasting, 3DV 2019 [code]

Reconstructing Faces from Voices, NeurIPS 2019 [code]

Speech2Face: Learning the Face Behind a Voice, CVPR 2019 [code]

Found in Translation: Learning Robust Joint Representations by Cyclic Translations Between Modalities, AAAI 2019 [code]

Natural TTS Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions, ICASSP 2018 [code]

Crossmodal Retrieval

Learning with Noisy Correspondence for Cross-modal Matching, NeurIPS 2021 [code]

MURAL: Multimodal, Multitask Retrieval Across Languages, arXiv 2021

Self-Supervised Learning from Web Data for Multimodal Retrieval, arXiv 2019

Look, Imagine and Match: Improving Textual-Visual Cross-Modal Retrieval with Generative Models, CVPR 2018

Scene-centric vs. Object-centric Image-Text Cross-modal Retrieval: A Reproducibility Study, ECIR 2023

Multimodal Co-learning

Self-Supervised Learning in Event Sequences: A Comparative Study and Hybrid Approach of Generative Modeling and Contrastive Learning, arXiv 2024

Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision, ICML 2021

Multimodal Co-learning: Challenges, Applications with Datasets, Recent Advances and Future Directions, arXiv 2021

Vokenization: Improving Language Understanding with Contextualized, Visual-Grounded Supervision, EMNLP 2020

Foundations of Multimodal Co-learning, Information Fusion 2020

Missing or Imperfect Modalities

A Variational Information Bottleneck Approach to Multi-Omics Data Integration, AISTATS 2021 [code]

SMIL: Multimodal Learning with Severely Missing Modality, AAAI 2021

Factorized Inference in Deep Markov Models for Incomplete Multimodal Time Series, arXiv 2019

Learning Representations from Imperfect Time Series Data via Tensor Rank Regularization, ACL 2019

Multimodal Deep Learning for Robust RGB-D Object Recognition, IROS 2015

Analysis of Multimodal Models

M2Lens: Visualizing and Explaining Multimodal Models for Sentiment Analysis, IEEE TVCG 2022

Decoupling the Role of Data, Attention, and Losses in Multimodal Transformers, TACL 2021

Does my multimodal model learn cross-modal interactions? It’s harder to tell than you might think!, EMNLP 2020

Blindfold Baselines for Embodied QA, NIPS 2018 Visually-Grounded Interaction and Language Workshop

Analyzing the Behavior of Visual Question Answering Models, EMNLP 2016

Knowledge Graphs and Knowledge Bases

MMKG: Multi-Modal Knowledge Graphs, ESWC 2019

Answering Visual-Relational Queries in Web-Extracted Knowledge Graphs, AKBC 2019

Embedding Multimodal Relational Data for Knowledge Base Completion, EMNLP 2018

A Multimodal Translation-Based Approach for Knowledge Graph Representation Learning, *SEM 2018 [code]

Order-Embeddings of Images and Language, ICLR 2016 [code]

Building a Large-scale Multimodal Knowledge Base System for Answering Visual Queries, arXiv 2015

Interpretable Learning

Multimodal Explanations by Predicting Counterfactuality in Videos, CVPR 2019

Multimodal Explanations: Justifying Decisions and Pointing to the Evidence, CVPR 2018 [code]

Do Explanations make VQA Models more Predictable to a Human?, EMNLP 2018

Towards Transparent AI Systems: Interpreting Visual Question Answering Models, ICML Workshop on Visualization for Deep Learning 2016

Generative Learning

MMVAE+: Enhancing the Generative Quality of Multimodal VAEs without Compromises, ICLR 2023 [code]

On the Limitations of Multimodal VAEs, ICLR 2022 [code]

Generalized Multimodal ELBO, ICLR 2021 [code]

Multimodal Generative Learning Utilizing Jensen-Shannon-Divergence, NeurIPS 2020 [code]

Self-supervised Disentanglement of Modality-specific and Shared Factors Improves Multimodal Generative Models, GCPR 2020 [code]

Variational Mixture-of-Experts Autoencoders for Multi-Modal Deep Generative Models, NeurIPS 2019 [code]

Few-shot Video-to-Video Synthesis, NeurIPS 2019 [code]

Multimodal Generative Models for Scalable Weakly-Supervised Learning, NeurIPS 2018 [code1] [code2]

The Multi-Entity Variational Autoencoder, NeurIPS 2017

Semi-supervised Learning

Semi-supervised Vision-language Mapping via Variational Learning, ICRA 2017

Semi-supervised Multimodal Hashing, arXiv 2017

Semi-Supervised Multimodal Deep Learning for RGB-D Object Recognition, IJCAI 2016

Multimodal Semi-supervised Learning for Image Classification, CVPR 2010

Self-supervised Learning

DABS: A Domain-Agnostic Benchmark for Self-Supervised Learning, NeurIPS 2021 Datasets & Benchmarks Track [code]

Self-Supervised Learning by Cross-Modal Audio-Video Clustering, NeurIPS 2020 [code]

Self-Supervised MultiModal Versatile Networks, NeurIPS 2020 [code]

Labelling Unlabelled Videos from Scratch with Multi-modal Self-supervision, NeurIPS 2020 [code]

Self-Supervised Learning of Visual Features through Embedding Images into Text Topic Spaces, CVPR 2017

Multimodal Dynamics: Self-supervised Learning in Perceptual and Motor Systems, 2016

Language Models

Neural Language Modeling with Visual Features, arXiv 2019

Learning Multi-Modal Word Representation Grounded in Visual Context, AAAI 2018

Visual Word2Vec (vis-w2v): Learning Visually Grounded Word Embeddings Using Abstract Scenes, CVPR 2016

Unifying Visual-Semantic Embeddings with Multimodal Neural Language Models, ICML 2014 [code]

Adversarial Attacks

Attend and Attack: Attention Guided Adversarial Attacks on Visual Question Answering Models, NeurIPS Workshop on Visually Grounded Interaction and Language 2018

Attacking Visual Language Grounding with Adversarial Examples: A Case Study on Neural Image Captioning, ACL 2018 [code]

Fooling Vision and Language Models Despite Localization and Attention Mechanism, CVPR 2018

Few-Shot Learning

Language to Network: Conditional Parameter Adaptation with Natural Language Descriptions, ACL 2020

Shaping Visual Representations with Language for Few-shot Classification, ACL 2020

Zero-Shot Learning - The Good, the Bad and the Ugly, CVPR 2017

Zero-Shot Learning Through Cross-Modal Transfer, NIPS 2013

Bias and Fairness

PromptStyler: Prompt-driven Style Generation for Source-free Domain Generalization, ICCV 2023 [project page]

Worst of Both Worlds: Biases Compound in Pre-trained Vision-and-Language Models, arXiv 2021

Towards Debiasing Sentence Representations, ACL 2020 [code]

FairCVtest Demo: Understanding Bias in Multimodal Learning with a Testbed in Fair Automatic Recruitment, ICMI 2020 [code]

Model Cards for Model Reporting, FAccT 2019

Black is to Criminal as Caucasian is to Police: Detecting and Removing Multiclass Bias in Word Embeddings, NAACL 2019 [code]

Gender Shades: Intersectional Accuracy Disparities in Commercial Gender Classification, FAccT 2018

Datasheets for Datasets, arXiv 2018

Man is to Computer Programmer as Woman is to Homemaker? Debiasing Word Embeddings, NeurIPS 2016

Human in the Loop Learning

Human in the Loop Dialogue Systems, NeurIPS 2020 workshop

Human And Machine in-the-Loop Evaluation and Learning Strategies, NeurIPS 2020 workshop

Human-centric dialog training via offline reinforcement learning, EMNLP 2020 [code]

Human-In-The-Loop Machine Learning with Intelligent Multimodal Interfaces, ICML 2017 workshop

Architectures

Multimodal Transformers

Pretrained Transformers As Universal Computation Engines, AAAI 2022

Perceiver: General Perception with Iterative Attention, ICML 2021

FLAVA: A Foundational Language And Vision Alignment Model, arXiv 2021

PolyViT: Co-training Vision Transformers on Images, Videos and Audio, arXiv 2021

VATT: Transformers for Multimodal Self-Supervised Learning from Raw Video, Audio and Text, NeurIPS 2021 [code]

Parameter Efficient Multimodal Transformers for Video Representation Learning, ICLR 2021 [code]

Multimodal Memory

Multimodal Transformer with Variable-length Memory for Vision-and-Language Navigation, arXiv 2021

History Aware Multimodal Transformer for Vision-and-Language Navigation, NeurIPS 2021 [code]

Episodic Memory in Lifelong Language Learning, NeurIPS 2019

ICON: Interactive Conversational Memory Network for Multimodal Emotion Detection, EMNLP 2018

Multimodal Memory Modelling for Video Captioning, CVPR 2018

Dynamic Memory Networks for Visual and Textual Question Answering, ICML 2016

Applications and Datasets

Language and Visual QA

TAG: Boosting Text-VQA via Text-aware Visual Question-answer Generation, arXiv 2022 [code]

Learning to Answer Questions in Dynamic Audio-Visual Scenarios, CVPR 2022

SUTD-TrafficQA: A Question Answering Benchmark and an Efficient Network for Video Reasoning over Traffic Events, CVPR 2021 [code]

MultiModalQA: complex question answering over text, tables and images, ICLR 2021

ManyModalQA: Modality Disambiguation and QA over Diverse Inputs, AAAI 2020 [code]

Iterative Answer Prediction with Pointer-Augmented Multimodal Transformers for TextVQA, CVPR 2020

Interactive Language Learning by Question Answering, EMNLP 2019 [code]

Fusion of Detected Objects in Text for Visual Question Answering, arXiv 2019

RUBi: Reducing Unimodal Biases in Visual Question Answering, NeurIPS 2019 [code]

GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering, CVPR 2019 [code]

OK-VQA: A Visual Question Answering Benchmark Requiring External Knowledge, CVPR 2019 [code]

MUREL: Multimodal Relational Reasoning for Visual Question Answering, CVPR 2019 [code]

Social-IQ: A Question Answering Benchmark for Artificial Social Intelligence, CVPR 2019 [code]

Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering, ICML 2019 [code]

Learning to Count Objects in Natural Images for Visual Question Answering, ICLR 2018 [code]

Overcoming Language Priors in Visual Question Answering with Adversarial Regularization, NeurIPS 2018

Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding, NeurIPS 2018 [code]

RecipeQA: A Challenge Dataset for Multimodal Comprehension of Cooking Recipes, EMNLP 2018 [code]

TVQA: Localized, Compositional Video Question Answering, EMNLP 2018 [code]

Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering, CVPR 2018 [code]

Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, CVPR 2018 [code]

Stacked Latent Attention for Multimodal Reasoning, CVPR 2018

Learning to Reason: End-to-End Module Networks for Visual Question Answering, ICCV 2017 [code]

CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning, CVPR 2017 [code] [dataset generation]

Are You Smarter Than A Sixth Grader? Textbook Question Answering for Multimodal Machine Comprehension, CVPR 2017 [code]

Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding, EMNLP 2016 [code]

MovieQA: Understanding Stories in Movies through Question-Answering, CVPR 2016 [code]

VQA: Visual Question Answering, ICCV 2015 [code]

Language Grounding in Vision

Core Challenges in Embodied Vision-Language Planning, arXiv 2021

MaRVL: Multicultural Reasoning over Vision and Language, EMNLP 2021 [code]

Grounding 'Grounding' in NLP, ACL 2021

The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes, NeurIPS 2020 [code]

What Does BERT with Vision Look At?, ACL 2020

Visual Grounding in Video for Unsupervised Word Translation, CVPR 2020 [code]

VIOLIN: A Large-Scale Dataset for Video-and-Language Inference, CVPR 2020 [code]

Grounded Video Description, CVPR 2019

Show, Control and Tell: A Framework for Generating Controllable and Grounded Captions, CVPR 2019

Multilevel Language and Vision Integration for Text-to-Clip Retrieval, AAAI 2019 [code]

Binary Image Selection (BISON): Interpretable Evaluation of Visual Grounding, arXiv 2019 [code]

Finding “It”: Weakly-Supervised Reference-Aware Visual Grounding in Instructional Videos, CVPR 2018

SCAN: Learning Hierarchical Compositional Visual Concepts, ICLR 2018

Visual Coreference Resolution in Visual Dialog using Neural Module Networks, ECCV 2018 [code]

Gated-Attention Architectures for Task-Oriented Language Grounding, AAAI 2018 [code]

Using Syntax to Ground Referring Expressions in Natural Images, AAAI 2018 [code]

Grounding language acquisition by training semantic parsers using captioned videos, ACL 2018

Interpretable and Globally Optimal Prediction for Textual Grounding using Image Concepts, NeurIPS 2017

Localizing Moments in Video with Natural Language, ICCV 2017

What are you talking about? Text-to-Image Coreference, CVPR 2014

Grounded Language Learning from Video Described with Sentences, ACL 2013

Grounded Compositional Semantics for Finding and Describing Images with Sentences, TACL 2013

Language Grounding in Navigation

ALFWorld: Aligning Text and Embodied Environments for Interactive Learning, ICLR 2021 [code]

Hierarchical Cross-Modal Agent for Robotics Vision-and-Language Navigation, ICRA 2021 [code] [video] [project page]

Improving Vision-and-Language Navigation with Image-Text Pairs from the Web, ECCV 2020

Towards Learning a Generic Agent for Vision-and-Language Navigation via Pre-training, CVPR 2020 [code]

VideoNavQA: Bridging the Gap between Visual and Embodied Question Answering, BMVC 2019 [code]

Vision-and-Dialog Navigation, arXiv 2019 [code]

Hierarchical Decision Making by Generating and Following Natural Language Instructions, arXiv 2019 [code]

Stay on the Path: Instruction Fidelity in Vision-and-Language Navigation, ACL 2019

Are You Looking? Grounding to Multiple Modalities in Vision-and-Language Navigation, ACL 2019

Touchdown: Natural Language Navigation and Spatial Reasoning in Visual Street Environments, CVPR 2019 [code]

Reinforced Cross-Modal Matching and Self-Supervised Imitation Learning for Vision-Language Navigation, CVPR 2019

The Regretful Navigation Agent for Vision-and-Language Navigation, CVPR 2019 [code]

Tactical Rewind: Self-Correction via Backtracking in Vision-and-Language Navigation, CVPR 2019 [code]

Multi-modal Discriminative Model for Vision-and-Language Navigation, NAACL SpLU-RoboNLP Workshop 2019

Self-Monitoring Navigation Agent via Auxiliary Progress Estimation, ICLR 2019 [code]

From Language to Goals: Inverse Reinforcement Learning for Vision-Based Instruction Following, ICLR 2019

Read, Watch, and Move: Reinforcement Learning for Temporally Grounding Natural Language Descriptions in Videos, AAAI 2019

Learning to Navigate Unseen Environments: Back Translation with Environmental Dropout, NAACL 2019 [code]

Attention Based Natural Language Grounding by Navigating Virtual Environment, IEEE WACV 2019

Mapping Instructions to Actions in 3D Environments with Visual Goal Prediction, EMNLP 2018 [code]

Vision-and-Language Navigation: Interpreting Visually-Grounded Navigation Instructions in Real Environments, CVPR 2018 [code]

Embodied Question Answering, CVPR 2018 [code]

Look Before You Leap: Bridging Model-Free and Model-Based Reinforcement Learning for Planned-Ahead Vision-and-Language Navigation, ECCV 2018

Multimodal Machine Translation

Unsupervised Multimodal Neural Machine Translation with Pseudo Visual Pivoting, ACL 2020

Multimodal Transformer for Multimodal Machine Translation, ACL 2020

Neural Machine Translation with Universal Visual Representation, ICLR 2020 [code]

Visual Agreement Regularized Training for Multi-Modal Machine Translation, AAAI 2020

VATEX: A Large-Scale, High-Quality Multilingual Dataset for Video-and-Language Research, ICCV 2019 [code]

Latent Variable Model for Multi-modal Translation, ACL 2019

Distilling Translations with Visual Awareness, ACL 2019

Probing the Need for Visual Context in Multimodal Machine Translation, NAACL 2019

Emergent Translation in Multi-Agent Communication, ICLR 2018

Zero-Resource Neural Machine Translation with Multi-Agent Communication Game, AAAI 2018

Learning Translations via Images with a Massively Multilingual Image Dataset, ACL 2018

A Visual Attention Grounding Neural Model for Multimodal Machine Translation, EMNLP 2018

Adversarial Evaluation of Multimodal Machine Translation, EMNLP 2018

Doubly-Attentive Decoder for Multi-modal Neural Machine Translation, ACL 2017 [code]

An empirical study on the effectiveness of images in Multimodal Neural Machine Translation, EMNLP 2017

Incorporating Global Visual Features into Attention-based Neural Machine Translation, EMNLP 2017 [code]

Multimodal Pivots for Image Caption Translation, ACL 2016

Multi30K: Multilingual English-German Image Descriptions, ACL Workshop on Language and Vision 2016 [code]

Does Multimodality Help Human and Machine for Translation and Image Captioning?, ACL WMT 2016

Multi-agent Communication

Multi-agent Communication meets Natural Language: Synergies between Functional and Structural Language Learning, ACL 2020

Emergence of Compositional Language with Deep Generational Transmission, ICML 2019

On the Pitfalls of Measuring Emergent Communication, AAMAS 2019 [code]

Emergent Translation in Multi-Agent Communication, ICLR 2018 [code]

Emergent Communication in a Multi-Modal, Multi-Step Referential Game, ICLR 2018 [code]

Emergence of Linguistic Communication From Referential Games with Symbolic and Pixel Input, ICLR 2018

Emergent Communication through Negotiation, ICLR 2018 [code]

Emergence of Grounded Compositional Language in Multi-Agent Populations, AAAI 2018

Emergence of Language with Multi-agent Games: Learning to Communicate with Sequences of Symbols, NeurIPS 2017

Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog, EMNLP 2017 [code1] [code2]

Learning Cooperative Visual Dialog Agents with Deep Reinforcement Learning, ICCV 2017 [code]

Multi-agent Cooperation and the Emergence of (natural) Language, ICLR 2017

Learning to Communicate with Deep Multi-agent Reinforcement Learning, NIPS 2016

Learning multiagent communication with backpropagation, NIPS 2016

The Emergence of Compositional Structures in Perceptually Grounded Language Games, AI 2005

Commonsense Reasoning

Adventures in Flatland: Perceiving Social Interactions Under Physical Dynamics, CogSci 2020

A Logical Model for Supporting Social Commonsense Knowledge Acquisition, arXiv 2019

Heterogeneous Graph Learning for Visual Commonsense Reasoning, NeurIPS 2019

SocialIQA: Commonsense Reasoning about Social Interactions, arXiv 2019

From Recognition to Cognition: Visual Commonsense Reasoning, CVPR 2019 [code]

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge, NAACL 2019

Multimodal Reinforcement Learning

MiniHack the Planet: A Sandbox for Open-Ended Reinforcement Learning Research, NeurIPS 2021 [code]

Imitating Interactive Intelligence, arXiv 2020

Grounded Language Learning Fast and Slow, ICLR 2021

RTFM: Generalising to Novel Environment Dynamics via Reading, ICLR 2020 [code]

Embodied Multimodal Multitask Learning, IJCAI 2020

Learning to Speak and Act in a Fantasy Text Adventure Game, arXiv 2019 [code]

Language as an Abstraction for Hierarchical Deep Reinforcement Learning, NeurIPS 2019

Hierarchical Decision Making by Generating and Following Natural Language Instructions, NeurIPS 2019 [code]

Habitat: A Platform for Embodied AI Research, ICCV 2019 [code]

Multimodal Hierarchical Reinforcement Learning Policy for Task-Oriented Visual Dialog, SIGDIAL 2018

Mapping Instructions and Visual Observations to Actions with Reinforcement Learning, EMNLP 2017

Reinforcement Learning for Mapping Instructions to Actions, ACL 2009

Multimodal Dialog

Two Causal Principles for Improving Visual Dialog, CVPR 2020

MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations, ACL 2019 [code]

CLEVR-Dialog: A Diagnostic Dataset for Multi-Round Reasoning in Visual Dialog, NAACL 2019 [code]

Talk the Walk: Navigating New York City through Grounded Dialogue, arXiv 2018

Dialog-based Interactive Image Retrieval, NeurIPS 2018 [code]

Towards Building Large Scale Multimodal Domain-Aware Conversation Systems, arXiv 2017 [code]

Visual Dialog, CVPR 2017 [code]

Language and Audio

Lattice Transformer for Speech Translation, ACL 2019

Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation, ACL 2019

Audio Caption: Listen and Tell, ICASSP 2019

Audio-Linguistic Embeddings for Spoken Sentences, ICASSP 2019

From Semi-supervised to Almost-unsupervised Speech Recognition with Very-low Resource by Jointly Learning Phonetic Structures from Audio and Text Embeddings, arXiv 2019

From Audio to Semantics: Approaches To End-to-end Spoken Language Understanding, arXiv 2018

Natural TTS Synthesis by Conditioning Wavenet on Mel Spectrogram Predictions, ICASSP 2018 [code]

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning, ICLR 2018

Deep Voice 2: Multi-Speaker Neural Text-to-Speech, NeurIPS 2017

Deep Voice: Real-time Neural Text-to-Speech, ICML 2017

Text-to-Speech Synthesis, 2009

Audio and Visual

Music Gesture for Visual Sound Separation, CVPR 2020

Co-Compressing and Unifying Deep CNN Models for Efficient Human Face and Speaker Recognition, CVPRW 2019

Learning Individual Styles of Conversational Gesture, CVPR 2019 [code]

Capture, Learning, and Synthesis of 3D Speaking Styles, CVPR 2019 [code]

Disjoint Mapping Network for Cross-modal Matching of Voices and Faces, ICLR 2019

Wav2Pix: Speech-conditioned Face Generation using Generative Adversarial Networks, ICASSP 2019 [code]

Learning Affective Correspondence between Music and Image, ICASSP 2019 [dataset]

Jointly Discovering Visual Objects and Spoken Words from Raw Sensory Input, ECCV 2018 [code]

Seeing Voices and Hearing Faces: Cross-modal Biometric Matching, CVPR 2018 [code]

Learning to Separate Object Sounds by Watching Unlabeled Video, CVPR 2018

Deep Audio-Visual Speech Recognition, IEEE TPAMI 2018

Look, Listen and Learn, ICCV 2017

Unsupervised Learning of Spoken Language with Visual Context, NeurIPS 2016

SoundNet: Learning Sound Representations from Unlabeled Video, NeurIPS 2016 [code]

Visual, IMU and Wireless

ViFiT: Reconstructing Vision Trajectories from IMU and Wi-Fi Fine Time Measurements, MobiCom 2023 ISACom Workshop [code]

ViTag: Online WiFi Fine Time Measurements Aided Vision-Motion Identity Association in Multi-person Environments, SECON 2022 [code]

Vi-Fi: Associating Moving Subjects across Vision and Wireless Sensors, IPSN 2022 [code]

Media Description

Towards Unsupervised Image Captioning with Shared Multimodal Embeddings, ICCV 2019

Video Relationship Reasoning using Gated Spatio-Temporal Energy Graph, CVPR 2019 [code]

Joint Event Detection and Description in Continuous Video Streams, WACVW 2019

Learning to Compose and Reason with Language Tree Structures for Visual Grounding, TPAMI 2019

Neural Baby Talk, CVPR 2018 [code]

Grounding Referring Expressions in Images by Variational Context, CVPR 2018

Video Captioning via Hierarchical Reinforcement Learning, CVPR 2018

Charades-Ego: A Large-Scale Dataset of Paired Third and First Person Videos, CVPR 2018 [code]

Neural Motifs: Scene Graph Parsing with Global Context, CVPR 2018 [code]

No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling, ACL 2018

Generating Descriptions with Grounded and Co-Referenced People, CVPR 2017

DenseCap: Fully Convolutional Localization Networks for Dense Captioning, CVPR 2016

Review Networks for Caption Generation, NeurIPS 2016 [code]

Hollywood in Homes: Crowdsourcing Data Collection for Activity Understanding, ECCV 2016 [code]

Show and Tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge, TPAMI 2016 [code]

Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, ICML 2015 [code]

Deep Visual-Semantic Alignments for Generating Image Descriptions, CVPR 2015 [code]

Show and Tell: A Neural Image Caption Generator, CVPR 2015 [code]

A Dataset for Movie Description, CVPR 2015 [code]

What’s Cookin’? Interpreting Cooking Videos using Text, Speech and Vision, NAACL 2015 [code]

Microsoft COCO: Common Objects in Context, ECCV 2014 [code]

Video Generation from Text

Image Generation from Scene Graphs, CVPR 2018

Learning to Color from Language, NAACL 2018

Generative Adversarial Text to Image Synthesis, ICML 2016

Affect Recognition and Multimodal Language

End-to-end Facial and Physiological Model for Affective Computing and Applications, arXiv 2019

Affective Computing for Large-Scale Heterogeneous Multimedia Data: A Survey, ACM TOMM 2019

Towards Multimodal Sarcasm Detection (An _Obviously_ Perfect Paper), ACL 2019 [code]

Multi-modal Approach for Affective Computing, EMBC 2018

Multimodal Language Analysis with Recurrent Multistage Fusion, EMNLP 2018

Multimodal Language Analysis in the Wild: CMU-MOSEI Dataset and Interpretable Dynamic Fusion Graph, ACL 2018 [code]

Multi-attention Recurrent Network for Human Communication Comprehension, AAAI 2018 [code]

End-to-End Multimodal Emotion Recognition using Deep Neural Networks, arXiv 2017

AMHUSE - A Multimodal dataset for HUmor SEnsing, ICMI 2017 [code]

Decoding Children’s Social Behavior, CVPR 2013 [code]

Collecting Large, Richly Annotated Facial-Expression Databases from Movies, IEEE Multimedia 2012 [code]

The Interactive Emotional Dyadic Motion Capture (IEMOCAP) Database, 2008 [code]

Healthcare

Multimodal Co-Attention Transformer for Survival Prediction in Gigapixel Whole Slide Images, ICCV 2021

PET-Guided Attention Network for Segmentation of Lung Tumors from PET/CT Images, GCPR 2020 [code]

Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis, IEEE TMI 2020

Leveraging Medical Visual Question Answering with Supporting Facts, arXiv 2019

Unsupervised Multimodal Representation Learning across Medical Images and Reports, ML4H 2018

Multimodal Medical Image Retrieval based on Latent Topic Modeling, ML4H 2018

Improving Hospital Mortality Prediction with Medical Named Entities and Multimodal Learning, ML4H 2018

Knowledge-driven Generative Subspaces for Modeling Multi-view Dependencies in Medical Data, ML4H 2018

Multimodal Depression Detection: Fusion Analysis of Paralinguistic, Head Pose and Eye Gaze Behaviors, TAC 2018

Learning the Joint Representation of Heterogeneous Temporal Events for Clinical Endpoint Prediction, AAAI 2018

Understanding Coagulopathy using Multi-view Data in the Presence of Sub-Cohorts: A Hierarchical Subspace Approach, MLHC 2017

Machine Learning in Multimodal Medical Imaging, 2017

Cross-modal Recurrent Models for Weight Objective Prediction from Multimodal Time-series Data, ML4H 2017

SimSensei Kiosk: A Virtual Human Interviewer for Healthcare Decision Support, AAMAS 2014

Dyadic Behavior Analysis in Depression Severity Assessment Interviews, ICMI 2014

Audiovisual Behavior Descriptors for Depression Assessment, ICMI 2013

Robotics

Detect, Reject, Correct: Crossmodal Compensation of Corrupted Sensors, ICRA 2021

Multimodal sensor fusion with differentiable filters, IROS 2020

Concept2Robot: Learning Manipulation Concepts from Instructions and Human Demonstrations, RSS 2020

See, Feel, Act: Hierarchical Learning for Complex Manipulation Skills with Multi-sensory Fusion, Science Robotics 2019

Early Fusion for Goal Directed Robotic Vision, IROS 2019

Simultaneously Learning Vision and Feature-based Control Policies for Real-world Ball-in-a-Cup, RSS 2019

Probabilistic Multimodal Modeling for Human-Robot Interaction Tasks, RSS 2019

Making Sense of Vision and Touch: Self-Supervised Learning of Multimodal Representations for Contact-Rich Tasks, ICRA 2019

Evolving Multimodal Robot Behavior via Many Stepping Stones with the Combinatorial Multi-Objective Evolutionary Algorithm, arXiv 2018

Multi-modal Predicate Identification using Dynamically Learned Robot Controllers, IJCAI 2018

Multimodal Probabilistic Model-Based Planning for Human-Robot Interaction, arXiv 2017

Perching and Vertical Climbing: Design of a Multimodal Robot, ICRA 2014

Multi-Modal Scene Understanding for Robotic Grasping, 2011

Strategies for Multi-Modal Scene Exploration, IROS 2010

Autonomous Driving

Deep Multi-modal Object Detection and Semantic Segmentation for Autonomous Driving: Datasets, Methods, and Challenges, IEEE TITS 2020 [website]

nuScenes: A multimodal dataset for autonomous driving, CVPR 2020 [dataset]

Multimodal End-to-End Autonomous Driving, arXiv 2020

Finance

Self-Supervised Learning in Event Sequences: A Comparative Study and Hybrid Approach of Generative Modeling and Contrastive Learning, arXiv 2024

A Multimodal Event-driven LSTM Model for Stock Prediction Using Online News, TKDE 2020

Multimodal Deep Learning for Finance: Integrating and Forecasting International Stock Markets, 2019

Multimodal deep learning for short-term stock volatility prediction, 2018

Human AI Interaction

Multimodal Human Computer Interaction: A Survey, HCI 2005

Affective multimodal human-computer interaction, Multimedia 2005

Building a multimodal human-robot interface, IEEE Intelligent Systems 2001

Multimodal Content Generation

Non-Linear Consumption of Videos Using a Sequence of Personalized Multimodal Fragments, IUI 2021

Generating Need-Adapted Multimodal Fragments, IUI 2020

Workshops

Multimodal KDD 2023: International Workshop on Multimodal Learning, KDD 2023

Multimodal Representation Learning: Perks and Pitfalls, ICLR 2023

Social Intelligence in Humans and Robots @ ICRA 2021

LANTERN 2021: The Third Workshop Beyond Vision and LANguage: inTEgrating Real-world kNowledge @ EACL 2021

Multimodal workshops @ CVPR 2021: Multimodal Learning and Applications, Sight and Sound, Visual Question Answering, Embodied AI, Language for 3D Scenes.

Multimodal workshops @ NAACL 2021: MAI-Workshop, ALVR, ViGIL.

ICLR 2021 workshop on Embodied Multimodal Learning.

NeurIPS 2020 workshop on Wordplay: When Language Meets Games.

ACL 2020 workshops on Multimodal Language (proceedings) and Advances in Language and Vision Research.

Multimodal workshops @ ECCV 2020: EVAL, CAMP, and MVA.

Multi-Modal Video Reasoning and Analyzing Competition, ICCV 2021

Grand Challenge and Workshop on Human Multimodal Language, ACL 2020, ACL 2018

Advances in Language and Vision Research, ACL 2020

Visually Grounded Interaction and Language, NeurIPS 2019, NeurIPS 2018

Emergent Communication: Towards Natural Language, NeurIPS 2019

Workshop on Multimodal Understanding and Learning for Embodied Applications, ACM Multimedia 2019

Beyond Vision and Language: Integrating Real-World Knowledge, EMNLP 2019

The How2 Challenge: New Tasks for Vision & Language, ICML 2019

Visual Question Answering and Dialog, CVPR 2019, CVPR 2017

Multi-modal Learning from Videos, CVPR 2019

Multimodal Learning and Applications Workshop, CVPR 2019, ECCV 2018

Habitat: Embodied Agents Challenge and Workshop, CVPR 2019

Closing the Loop Between Vision and Language & LSMD Challenge, ICCV 2019

Multi-modal Video Analysis and Moments in Time Challenge, ICCV 2019

Cross-Modal Learning in Real World, ICCV 2019

Spatial Language Understanding and Grounded Communication for Robotics, NAACL 2019

YouTube-8M Large-Scale Video Understanding, ICCV 2019, ECCV 2018, CVPR 2017

Language and Vision Workshop, CVPR 2019, CVPR 2018, CVPR 2017, CVPR 2015

Sight and Sound, CVPR 2019, CVPR 2018

The Large Scale Movie Description Challenge (LSMDC), ICCV 2019, ICCV 2017

Wordplay: Reinforcement and Language Learning in Text-based Games, NeurIPS 2018

Interpretability and Robustness in Audio, Speech, and Language, NeurIPS 2018

Multimodal Robot Perception, ICRA 2018

WMT18: Shared Task on Multimodal Machine Translation, EMNLP 2018

Shortcomings in Vision and Language, ECCV 2018

Computational Approaches to Subjectivity, Sentiment and Social Media Analysis, EMNLP 2018, EMNLP 2017, NAACL-HLT 2016, EMNLP 2015, ACL 2014, NAACL-HLT 2013

Visual Understanding Across Modalities, CVPR 2017

International Workshop on Computer Vision for Audio-Visual Media, ICCV 2017

Language Grounding for Robotics, ACL 2017

Computer Vision for Audio-visual Media, ECCV 2016

Language and Vision, ACL 2016, EMNLP 2015

Tutorials

Tutorial on MultiModal Machine Learning, ICML 2023, CVPR 2022, NAACL 2022

Recent Advances in Vision-and-Language Research, CVPR 2020

Connecting Language and Vision to Actions, ACL 2018

Machine Learning for Clinicians: Advances for Multi-Modal Health Data, MLHC 2018

Multimodal Machine Learning, ACL 2017, CVPR 2016, ICMI 2016

Vision and Language: Bridging Vision and Language with Deep Learning, ICIP 2017

Courses

CMU 11-777, Multimodal Machine Learning

CMU 11-877, Advanced Topics in Multimodal Machine Learning

CMU 05-618, Human-AI Interaction

CMU 11-777, Advanced Multimodal Machine Learning

Stanford CS422, Interactive and Embodied Learning

CMU 16-785, Integrated Intelligence in Robotics: Vision, Language, and Planning

CMU 10-808, Language Grounding to Vision and Control

CMU 11-775, Large-Scale Multimedia Analysis

MIT 6.882, Embodied Intelligence

Georgia Tech CS 8803, Vision and Language

University of Virginia CS 6501-004, Vision & Language

Machine Learning Career: A Comprehensive Guide