Paper title
Virtuoso: Massive Multilingual Speech-Text Joint Semi-Supervised Learning for Text-To-Speech
Paper authors
Abstract
This paper proposes Virtuoso, a massively multilingual speech-text joint semi-supervised learning framework for text-to-speech synthesis (TTS) models. Existing multilingual TTS systems typically support tens of languages, a small fraction of the thousands of languages in the world. One difficulty in scaling multilingual TTS to hundreds of languages is collecting high-quality speech-text paired data in low-resource languages. This study extends Maestro, a speech-text joint pretraining framework for automatic speech recognition (ASR), to speech generation tasks. To train a TTS model from various types of speech and text data, different training schemes are designed to handle supervised (paired TTS and ASR data) and unsupervised (untranscribed speech and unspoken text) datasets. Experimental evaluation shows that 1) multilingual TTS models trained with Virtuoso achieve significantly better naturalness and intelligibility than baseline models in seen languages, and 2) they can synthesize reasonably intelligible and natural-sounding speech for unseen languages where no high-quality paired TTS data is available.
