Paper title
Noise-robust voice conversion with domain adversarial training
Authors
Abstract
Voice conversion has made great progress in recent years in terms of speech quality and speaker similarity under studio-quality test conditions. In real applications, however, test speech from the source or target speaker can be corrupted by various environmental noises, which seriously degrade both speech quality and speaker similarity. In this paper, we propose a novel encoder-decoder based noise-robust voice conversion framework consisting of a speaker encoder, a content encoder, a decoder, and two domain adversarial neural networks. Specifically, we integrate the disentanglement of speaker and content representations with domain adversarial training. Domain adversarial training maps the speaker representations that the speaker encoder extracts from clean and noisy speech into the same space, and likewise for the content representations extracted by the content encoder. In this way, the learned speaker and content representations are noise-invariant, and the decoder can take both as input to predict the clean converted spectrum. Experimental results demonstrate that the proposed method can synthesize clean converted speech under noisy test scenarios in which the source and target speech may be corrupted by noise types either seen or unseen during training. Both speech quality and speaker similarity are also improved.
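The abstract does not say how the two domain adversarial networks are trained. A common way to implement domain adversarial training is a gradient reversal layer placed between an encoder and a clean-versus-noisy classifier, and the sketch below assumes that choice; it is not taken from the paper. All names (`grl_forward`, `grl_backward`, `domain_step`, `lam`) and the toy logistic classifier are hypothetical and stand in for whichever networks the authors actually use.

```python
import math

def grl_forward(x):
    """Gradient reversal layer: acts as the identity in the forward pass."""
    return list(x)

def grl_backward(grad, lam=1.0):
    """Backward pass: flip the sign of the incoming gradient and scale it by lam."""
    return [-lam * g for g in grad]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def domain_step(feature, w, is_noisy, lam=1.0):
    """One forward/backward pass of a toy logistic clean-vs-noisy classifier.

    `feature` stands in for a speaker or content representation (assumption).
    Returns (loss, gradient for the classifier weights, gradient sent to the encoder).
    """
    h = grl_forward(feature)
    p = sigmoid(sum(wi * hi for wi, hi in zip(w, h)))
    y = 1.0 if is_noisy else 0.0
    loss = -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
    dz = p - y                        # d(loss)/d(logit) for binary cross-entropy
    grad_w = [dz * hi for hi in h]    # the classifier descends on the domain loss
    grad_h = [dz * wi for wi in w]    # gradient with respect to the representation
    grad_feature = grl_backward(grad_h, lam)  # the encoder receives the reversed gradient
    return loss, grad_w, grad_feature

loss, grad_w, grad_feature = domain_step([0.5, -1.0], [0.2, 0.4], is_noisy=True)
```

Because the encoder receives the negated gradient, a descent step on it increases the classifier's loss, which pushes representations of clean and noisy speech toward the same space while the classifier itself keeps trying to tell them apart.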
