Fusical: Multimodal Fusion for Video Sentiment

Boyang Tom Jin, Leila Abdelrahman, Cong Kevin Chen, Amil Khanzada

Tue, Oct 27, 2020

Abstract

Determining the emotional sentiment of a video remains a challenging task that requires multimodal, contextual understanding of a situation. In this paper, we describe our entry into the EmotiW 2020 Audio-Video Group Emotion Recognition Challenge to classify group videos, which contain large variations in language, people, and environment, into one of three sentiment classes. Our end-to-end approach consists of independently training models for different modalities, including full-frame video scenes, human body keypoints, embeddings extracted from audio clips, and image-caption word embeddings. We further develop novel combinations of modalities, such as laughter and image captioning, along with transfer learning. We use fully connected (FC) fusion ensembling to aggregate the modalities, achieving a best test accuracy of 63.9%, which is 16 percentage points higher than that of the baseline ensemble.
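
To make the fusion step concrete, the following is a minimal sketch of late fusion with a single fully connected layer. It is not the architecture from the paper. It assumes each modality model emits a probability vector over the three sentiment classes, and that these vectors are concatenated and passed through one FC layer followed by a softmax. The function names, the example probabilities, and the hand-set weights are all hypothetical. In practice the weights would be learned from held-out predictions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fc_fusion(modality_outputs, weights, bias):
    """Fuse per-modality class probabilities with one fully connected layer.

    modality_outputs: list of per-modality probability vectors (hypothetical inputs)
    weights: one row per output class, one column per concatenated feature
    bias: one value per output class
    """
    # Concatenate all modality outputs into a single feature vector.
    features = [p for probs in modality_outputs for p in probs]
    # One FC layer: a weighted sum of the features per output class.
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(logits)


# Hypothetical outputs of four modality models over three classes (illustrative values only).
outputs = [
    [0.6, 0.3, 0.1],  # full-frame scene model
    [0.5, 0.3, 0.2],  # body keypoint model
    [0.2, 0.5, 0.3],  # audio embedding model
    [0.7, 0.2, 0.1],  # image-caption model
]

# Hand-set weights that simply sum each class across modalities (an assumption;
# a trained fusion layer would learn these values instead).
n_classes, n_modalities = 3, len(outputs)
weights = [
    [1.0 if i % n_classes == c else 0.0 for i in range(n_classes * n_modalities)]
    for c in range(n_classes)
]
bias = [0.0] * n_classes

fused = fc_fusion(outputs, weights, bias)
predicted_class = max(range(n_classes), key=lambda c: fused[c])
```

With these hand-set weights the layer reduces to summing each class across modalities. A learned layer could instead weight some modalities more heavily than others.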

Type

Conference paper

Publication

In Proceedings of the 2020 International Conference on Multimodal Interaction

Deep Learning, Affective Computing