Paper Title
MISA: Modality-Invariant and -Specific Representations for Multimodal Sentiment Analysis
Paper Authors
Abstract
Multimodal Sentiment Analysis is an active area of research that leverages multimodal signals for affective understanding of user-generated videos. The predominant approach to this task has been to develop sophisticated fusion techniques. However, the heterogeneous nature of the signals creates distributional modality gaps that pose significant challenges. In this paper, we aim to learn effective modality representations to aid the process of fusion. We propose a novel framework, MISA, which projects each modality to two distinct subspaces. The first subspace is modality-invariant, where the representations across modalities learn their commonalities and reduce the modality gap. The second subspace is modality-specific, which is private to each modality and captures its characteristic features. These representations provide a holistic view of the multimodal data, which is used for fusion that leads to task predictions. Our experiments on popular sentiment analysis benchmarks, MOSI and MOSEI, demonstrate significant gains over state-of-the-art models. We also consider the task of Multimodal Humor Detection and experiment on the recently proposed UR_FUNNY dataset. Here too, our model fares better than strong baselines, establishing MISA as a useful multimodal framework.
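
The two-subspace idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The modality names, the dimensions, the use of plain linear maps with random weights, the sharing of one projection across modalities for the invariant subspace, and fusion by simple concatenation are all assumptions made for illustration. The abstract does not specify any of these, and no training objective is shown.

```python
import random

# Hypothetical sizes and modality names, chosen only for illustration.
D_IN, D_SUB = 8, 4
MODALITIES = ["language", "visual", "acoustic"]

rng = random.Random(0)


def make_matrix(rows, cols):
    """Random rows x cols matrix standing in for learned weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(vec, matrix):
    """Linear projection of a length-d_in vector by a d_in x d_out matrix."""
    n_out = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(n_out)]


# Assumption: one projection shared by all modalities (invariant subspace)
# and one private projection per modality (specific subspace).
shared_w = make_matrix(D_IN, D_SUB)
private_w = {m: make_matrix(D_IN, D_SUB) for m in MODALITIES}


def represent(features):
    """Map each modality's feature vector into both subspaces."""
    invariant = {m: project(x, shared_w) for m, x in features.items()}
    specific = {m: project(x, private_w[m]) for m, x in features.items()}
    return invariant, specific


def fuse(invariant, specific):
    """Assumed fusion: concatenate all invariant and specific vectors."""
    fused = []
    for m in MODALITIES:
        fused.extend(invariant[m])
        fused.extend(specific[m])
    return fused


# Toy input: one random feature vector per modality.
features = {m: [rng.uniform(-1.0, 1.0) for _ in range(D_IN)] for m in MODALITIES}
invariant, specific = represent(features)
fused = fuse(invariant, specific)
```

In this sketch each modality contributes two vectors of length `D_SUB`, so `fused` has `len(MODALITIES) * 2 * D_SUB` entries. A task head for sentiment or humor prediction would consume that vector.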
