Paper Title
Visually-Aware Audio Captioning With Adaptive Audio-Visual Attention
Authors
Abstract
Audio captioning aims to generate text descriptions of audio clips. In the real world, many objects produce similar sounds, and accurately recognizing such ambiguous sounds is a major challenge for audio captioning. In this work, inspired by inherent human multimodal perception, we propose visually-aware audio captioning, which uses visual information to help describe ambiguous sounding objects. Specifically, we introduce an off-the-shelf visual encoder to extract video features and incorporate these visual features into an audio captioning system. Furthermore, to better exploit complementary audio-visual contexts, we propose an audio-visual attention mechanism that adaptively integrates audio and visual context and removes redundant information in the latent space. Experimental results on AudioCaps, the largest audio captioning dataset, show that our proposed method achieves state-of-the-art results on machine translation metrics.
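The abstract does not specify how the adaptive integration of audio and visual context is computed. As a rough illustration only, the idea can be sketched as a learned scalar gate that mixes the two context vectors. Everything in the sketch is an assumption made for illustration and not the paper's architecture: the function name `adaptive_fuse`, the scalar sigmoid gate, and the parameters `gate_weights` and `gate_bias`.

```python
import math


def sigmoid(x):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def adaptive_fuse(audio_ctx, visual_ctx, gate_weights, gate_bias=0.0):
    """Mix two context vectors with a scalar gate (hypothetical sketch).

    The gate is computed from the concatenation of both contexts, so the
    mixing ratio depends on the inputs. This is an assumed stand-in for
    the adaptive audio-visual attention named in the abstract, not the
    authors' formulation.
    """
    if len(audio_ctx) != len(visual_ctx):
        raise ValueError("audio and visual contexts must have equal length")
    joint = list(audio_ctx) + list(visual_ctx)
    if len(gate_weights) != len(joint):
        raise ValueError("gate_weights must match the concatenated length")

    # Scalar gate in (0, 1): values near 1 favor audio, near 0 favor visual.
    gate = sigmoid(sum(w * x for w, x in zip(gate_weights, joint)) + gate_bias)

    # Convex combination of the two contexts, element by element.
    fused = [gate * a + (1.0 - gate) * v for a, v in zip(audio_ctx, visual_ctx)]
    return fused, gate


# Usage: with zero weights and zero bias the gate is 0.5, so the result
# is the element-wise average of the two contexts.
fused, gate = adaptive_fuse([1.0, 0.0], [0.0, 1.0], [0.0, 0.0, 0.0, 0.0])
print(fused, gate)
```

In a trained model the gate parameters would be learned jointly with the captioning decoder, and the gate would more plausibly be computed per decoding step or per feature dimension. The scalar form is used here only to keep the sketch short.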
