Paper title
Augmented Skeleton Based Contrastive Action Learning with Momentum LSTM for Unsupervised Action Recognition
Authors
Abstract
Action recognition from 3D skeleton data has become an important emerging topic in recent years. Most existing methods either extract hand-crafted descriptors or learn action representations through supervised learning paradigms that require massive amounts of labeled data. In this paper, we propose, for the first time, a contrastive action learning paradigm named AS-CAL that leverages different augmentations of unlabeled skeleton data to learn action representations in an unsupervised manner. Specifically, we first propose to contrast the similarity between augmented instances (query and key) of the input skeleton sequence, which are transformed by multiple novel augmentation strategies, in order to learn the inherent action patterns ("pattern-invariance") shared across different skeleton transformations. Second, to encourage learning this pattern-invariance with more consistent action representations, we propose a momentum LSTM, implemented as a momentum-based moving average of the LSTM-based query encoder, to encode the long-term action dynamics of the key sequence. Third, we introduce a queue to store the encoded keys, which allows our model to flexibly reuse preceding keys and build a more consistent dictionary to improve contrastive learning. Finally, by temporally averaging the hidden action states learned by the query encoder, we propose a novel representation named Contrastive Action Encoding (CAE) to represent human actions effectively. Extensive experiments show that our approach typically improves on existing hand-crafted methods by 10-50% in top-1 accuracy, and that it achieves performance comparable or even superior to numerous supervised learning methods.
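The four components named in the abstract (momentum-averaged key encoder, key queue, contrastive objective, temporally averaged encoding) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the toy vectors, the momentum value, the temperature, and the InfoNCE-style form of the loss are all assumptions made for illustration; the abstract does not specify them. Flat lists of floats stand in for the LSTM weights and hidden states.

```python
import math
from collections import deque


def momentum_update(key_params, query_params, m=0.999):
    """Move each key-encoder parameter a small step toward the query encoder."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


def temporal_average(hidden_states):
    """Average per-frame hidden vectors over time into a single vector."""
    steps = len(hidden_states)
    return [sum(col) / steps for col in zip(*hidden_states)]


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(query, positive_key, queued_keys, temperature=0.1):
    """Assumed loss form: cross-entropy over [positive, negatives], positive first."""
    q = l2_normalize(query)
    logits = [dot(q, l2_normalize(positive_key)) / temperature]
    logits += [dot(q, l2_normalize(k)) / temperature for k in queued_keys]
    peak = max(logits)  # subtract the max for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[0]


# Fixed-size FIFO dictionary of earlier keys; the oldest entry drops out first.
queue = deque(maxlen=4)
for old_key in ([0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]):
    queue.append(old_key)

# Toy hidden states for two augmented views of the same skeleton sequence.
query_hidden = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]]
key_hidden = [[0.9, 0.1], [1.0, 0.0], [0.8, 0.2]]

query_vec = temporal_average(query_hidden)  # plays the role of the CAE vector
key_vec = temporal_average(key_hidden)      # same pooling assumed for the key
loss = info_nce(query_vec, key_vec, list(queue))
queue.append(key_vec)  # the new key becomes a negative for later queries

# Hypothetical two-parameter encoders, with m lowered so the step is visible.
key_params = momentum_update([0.0, 0.0], [1.0, -1.0], m=0.9)
```

Because the key encoder only drifts slowly toward the query encoder, keys stored in the queue at different training steps remain comparable, which is the consistency argument the abstract makes for combining the momentum LSTM with the queue.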
