Authors: Yuan-Jui Chen*, Tao Tu*, Cheng-chieh Yeh, Hung-yi Lee
Abstract: End-to-end text-to-speech (TTS) has shown great success on large quantities of paired text plus speech data. However, laborious data collection remains difficult for at least 95% of the languages in the world, which hinders the development of TTS in different languages. In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available. We show that such TTS systems can be effectively constructed by transferring knowledge from a high-resource (source) language. Since a model trained on the source language cannot be directly applied to the target language due to input space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Benefiting from this learned mapping, pronunciation information can be preserved throughout the transfer procedure. Preliminary experiments show that we only need around 15 minutes of paired data to obtain a relatively good TTS system. Furthermore, analytic studies demonstrate that the automatically discovered mapping correlates well with phonetic expertise.
All phrases below are unseen during training. Since our goal is to investigate transfer learning for languages with small amounts of data, all audio files on this demo page were synthesized using Griffin-Lim for a faster experiment cycle, as stated in the paper.
Samples for each target language with a small amount of data
These samples correspond to Section 4.3 of our paper. Here we demonstrate the performance of our three transfer learning methods on different languages: Mandarin, French and German. In the following, "N min" means that the model was trained with only N minutes of paired data in the target language. "Unified", "Learned" and "Separate" denote unified symbol space, learned symbol space and separate symbol space, respectively. "Scratch" is the model trained without transferring knowledge from high-resource data. "phn2phn" denotes the setting that uses phoneme input in both the source and target languages, and "phn2char" denotes the setting that uses phoneme input in the source language but character input in the target language.
We can observe that:
All three transfer learning methods ("Unified", "Learned", "Separate") outperform training from scratch.
The methods with pronunciation information from the source language ("Unified", "Learned") outperform the method without it ("Separate").
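As a rough illustration of how the three settings could differ at the input layer, the sketch below builds a target-language symbol embedding table in each mode. It is a toy under stated assumptions, not the implementation from the paper: the function name `build_target_embeddings`, the `mapping_logits` argument, the tiny symbol sets and the embedding size are all hypothetical, and the "learned" branch assumes the mapping is a softmax-weighted combination of source embeddings.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_embedding(dim, rng):
    """Freshly initialised embedding that carries no source-language knowledge."""
    return [rng.uniform(-0.1, 0.1) for _ in range(dim)]


def build_target_embeddings(mode, source_emb, target_symbols, dim, rng,
                            mapping_logits=None):
    """Return {target symbol: embedding} for one of the three (assumed) settings."""
    if mode not in ("unified", "learned", "separate"):
        raise ValueError(f"unknown mode: {mode}")
    source_symbols = list(source_emb)
    target_emb = {}
    for sym in target_symbols:
        if mode == "unified" and sym in source_emb:
            # Shared symbol space: reuse the source embedding of the same symbol.
            target_emb[sym] = list(source_emb[sym])
        elif mode == "learned":
            # Learned space (assumption): weighted mix of all source embeddings.
            weights = softmax(mapping_logits[sym])
            target_emb[sym] = [
                sum(w * source_emb[s][d] for w, s in zip(weights, source_symbols))
                for d in range(dim)
            ]
        else:
            # Separate space, or a unified-space symbol the source never saw.
            target_emb[sym] = random_embedding(dim, rng)
    return target_emb


rng = random.Random(0)
dim = 4
# Hypothetical source inventory and target inventory.
source_emb = {s: random_embedding(dim, rng) for s in ["a", "i", "s", "t"]}
target_symbols = ["a", "y", "t"]
# Hypothetical logits: one row per target symbol, one column per source symbol.
mapping_logits = {
    "a": [5.0, 0.0, 0.0, 0.0],
    "y": [0.0, 5.0, 0.0, 0.0],
    "t": [0.0, 0.0, 0.0, 5.0],
}

tables = {
    mode: build_target_embeddings(mode, source_emb, target_symbols, dim, rng,
                                  mapping_logits)
    for mode in ("unified", "learned", "separate")
}
for mode, table in tables.items():
    print(mode, [round(v, 3) for v in table["y"]])
```

In this toy, only the "unified" and "learned" tables start from source embeddings, which is one way to picture why those two settings can keep pronunciation information while "separate" cannot.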
Mandarin (phn2phn)
Amount
Text
Unified
Learned
Separate
Scratch
25 min
還能笑得這麼開心
25 min
想在這樣的楓紅裡野餐
15 min
祝你快快寫出好玩的新遊戲
15 min
原來我們只差一天
Method
Text
30 min
25 min
20 min
15 min
Unified
就原諒我這個虛榮的願望吧
Learned
就要在大阪多待一天了
Separate
好像輕輕嘆口氣
French (phn2phn)
Amount
Text
Unified
Learned
Separate
Scratch
15 min
le conseil supérieur de la magistrature est composé de cinq membres titulaires
15 min
votre dossier de candidature doit être accompagné des éléments suivants
10 min
il est moins souvent utile de savoir si la réponse est exacte ou erronée
10 min
c’est dans ces centres sociaux non spécialisés en toxicomanie
Method
Text
15 min
12.5 min
10 min
Unified
elle ne sent que par sa tête
Learned
ils ont réalisé pour vous des ouvrages présentant la synthèse de leurs observations
Separate
certaines mauvaises langues disent même que certains agriculteurs plantent justement du maïs
French (phn2char)
Amount
Text
Learned
Separate
Scratch
15 min
il est alors utile de poursuivre les conseils diététiques précédents
15 min
les agents peuvent communiquer coopérer
10 min
il ne faut pas aller trop vite
10 min
les cercles de couleurs transparentes ne se superposent que partiellement
Method
Text
15 min
12.5 min
10 min
Learned
il est moins souvent utile de savoir si la réponse est exacte ou erronée
Separate
elle est constituée de la liste des titres
German (phn2phn)
Amount
Text
Unified
Learned
Separate
Scratch
25 min
Sie hatte zwei Töchter die längst verheiratet waren
25 min
Wohin sie führen fragte ich Sie lachte und sagte Weiter
15 min
Ich wurde gestraft und erfuhr daß ich etwas sehr Böses getan hatte
15 min
Er grüßte blieb stehen er wollte mich ansprechen
Method
Text
30 min
25 min
20 min
15 min
Unified
An den Fenstern weiße Gardinen und braune Kinderköpfchen die lustig herauslugten
Learned
Ich habe ja noch beinahe ein halbes Jahrhundert vor mir
Separate
Was die Andern taten das war für sie das Richtige
German (phn2char)
Amount
Text
Learned
Separate
Scratch
25 min
Nun aber sind mir fast seine Gesichtszüge entschwunden
25 min
Ich saß stundenlang und tat nichts und dämmerte so hin
15 min
Von einer Reise zu meinen Töchtern konnte keine Rede sein
15 min
Ich tat was ich konnte es war auch wirklich nicht zu viel
Method
Text
30 min
25 min
20 min
15 min
Learned
Ich hatte solche Unlust zu meinen Töchtern zu reisen
Separate
Immer nur wandle ich am Strand entlang weiter und weiter
Symbol Mapping
The mapping results correspond to Section 2.3.2 and Section 4.3 of our paper.
We derive the mappings from 15 minutes of paired data in the target languages.
For better presentation and comparison, we mapped all phonemes to IPA.
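To make the presentation step concrete, the sketch below shows one way a soft symbol mapping could be read off and displayed in IPA. Everything in it is illustrative: the ARPAbet-style source symbols, the Pinyin-style target symbols, the helper names `top_mappings` and `describe`, and especially the weights, which are made-up numbers and not results from our experiments.

```python
# Hypothetical symbol inventories with their IPA equivalents.
SOURCE_TO_IPA = {"AA": "ɑ", "IY": "i", "UW": "u", "S": "s", "SH": "ʃ"}  # ARPAbet-style
TARGET_TO_IPA = {"a": "a", "i": "i", "u": "u", "x": "ɕ"}                # Pinyin-style

SOURCE_SYMBOLS = list(SOURCE_TO_IPA)

# Made-up mapping weights: one row per target symbol, one column per source
# symbol (same order as SOURCE_SYMBOLS). Each row sums to 1.
TOY_WEIGHTS = {
    "a": [0.80, 0.05, 0.05, 0.05, 0.05],
    "i": [0.05, 0.85, 0.04, 0.03, 0.03],
    "u": [0.06, 0.04, 0.84, 0.03, 0.03],
    "x": [0.02, 0.08, 0.02, 0.28, 0.60],
}


def top_mappings(weights, k=2):
    """Return the k source symbols with the largest mapping weight."""
    ranked = sorted(zip(SOURCE_SYMBOLS, weights),
                    key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


def describe(target_symbol, k=2):
    """Render one target symbol and its top-k source symbols in IPA."""
    parts = [
        f"{SOURCE_TO_IPA[source]} ({weight:.2f})"
        for source, weight in top_mappings(TOY_WEIGHTS[target_symbol], k)
    ]
    return f"{TARGET_TO_IPA[target_symbol]} -> " + ", ".join(parts)


for symbol in TOY_WEIGHTS:
    print(describe(symbol))
```

Converting both sides to IPA before printing puts symbols from different inventories in a single notation, so a mapping can be checked directly against phonetic knowledge.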