Audio samples from "End-to-end Text-to-speech for Low-resource Languages by Cross-Lingual Transfer Learning"

Paper: arXiv
Authors: Yuan-Jui Chen*, Tao Tu*, Cheng-chieh Yeh, Hung-yi Lee
Abstract: End-to-end text-to-speech (TTS) has shown great success on large quantities of paired text plus speech data. However, laborious data collection remains difficult for at least 95% of the languages in the world, which hinders the development of TTS in different languages. In this paper, we aim to build TTS systems for such low-resource (target) languages where only very limited paired data are available. We show such TTS can be effectively constructed by transferring knowledge from a high-resource (source) language. Since the model trained on the source language cannot be directly applied to the target language due to input space mismatch, we propose a method to learn a mapping between source and target linguistic symbols. Benefiting from this learned mapping, pronunciation information can be preserved throughout the transferring procedure. Preliminary experiments show that we only need around 15 minutes of paired data to obtain a relatively good TTS system. Furthermore, analytic studies demonstrated that the automatically discovered mapping correlates well with phonetic expertise.

All phrases below are unseen during training. Since our goal is to investigate transfer learning for languages with small amounts of data, all audio files on this demo page were synthesized using Griffin-Lim for a faster experiment cycle, as stated in the paper.
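For readers unfamiliar with Griffin-Lim, the following is a minimal NumPy sketch of the general algorithm: it alternates between imposing a known magnitude spectrogram and re-estimating the phase from the resynthesized signal. It is not the configuration used for the samples on this page. The function names, the FFT size, the hop length, the iteration count and the test tone are all assumptions chosen for illustration.

```python
import numpy as np

def stft(x, n_fft=256, hop=64):
    """Hann-windowed frames -> complex spectrogram, shape (frames, n_fft // 2 + 1)."""
    window = np.hanning(n_fft)
    starts = range(0, len(x) - n_fft + 1, hop)
    return np.array([np.fft.rfft(x[s:s + n_fft] * window) for s in starts])

def istft(spec, n_fft=256, hop=64):
    """Weighted overlap-add inverse of the stft above."""
    window = np.hanning(n_fft)
    length = n_fft + hop * (spec.shape[0] - 1)
    signal = np.zeros(length)
    weight = np.zeros(length)
    for t, frame in enumerate(spec):
        s = t * hop
        signal[s:s + n_fft] += np.fft.irfft(frame, n=n_fft) * window
        weight[s:s + n_fft] += window ** 2
    # Guard against division by zero where the window weight vanishes.
    return signal / np.maximum(weight, 1e-8)

def griffin_lim(magnitude, n_iter=30, n_fft=256, hop=64, seed=0):
    """Estimate a waveform whose STFT magnitude approximates `magnitude`."""
    rng = np.random.default_rng(seed)
    # Start from a random phase, since only the magnitude is known.
    phase = np.exp(2j * np.pi * rng.random(magnitude.shape))
    for _ in range(n_iter):
        signal = istft(magnitude * phase, n_fft, hop)
        # Keep the given magnitude, adopt the phase of the resynthesized signal.
        phase = np.exp(1j * np.angle(stft(signal, n_fft, hop)))
    return istft(magnitude * phase, n_fft, hop)

# Illustrative input: a 440 Hz tone at an assumed 16 kHz sampling rate.
t = np.arange(2048) / 16000.0
tone = np.sin(2 * np.pi * 440.0 * t)
magnitude = np.abs(stft(tone))
waveform = griffin_lim(magnitude)
```

Because the phase is only estimated, the result typically sounds less natural than the output of a neural vocoder, but no vocoder training is required, which fits the fast experiment cycle mentioned above.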

Samples for each target language with a small amount of data

These samples correspond to Section 4.3 of our paper. Here we demonstrate the performance of our three transfer learning methods on three target languages: Mandarin, French and German. Throughout, "N min" means that the model was trained with only N minutes of paired data in the target language. "Unified", "Learned" and "Separate" denote the unified, learned and separate symbol spaces, respectively. "Scratch" is the model trained without transferring knowledge from high-resource data. "phn2phn" denotes the setting that uses phoneme input in both the source and target languages, and "phn2char" denotes the setting that uses phoneme input in the source language but character input in the target language.

We can observe that

Mandarin (phn2phn)

Amount Text Unified Learned Separate Scratch
25 min 還能笑得這麼開心
25 min 想在這樣的楓紅裡野餐
15 min 祝你快快寫出好玩的新遊戲
15 min 原來我們只差一天

Method Text 30 min 25 min 20 min 15 min
Unified 就原諒我這個虛榮的願望吧
Learned 就要在大阪多待一天了
Separate 好像輕輕嘆口氣

French (phn2phn)

Amount Text Unified Learned Separate Scratch
15 min le conseil supérieur de la magistrature est composé de cinq membres titulaires
15 min votre dossier de candidature doit être accompagné des éléments suivants
10 min il est moins souvent utile de savoir si la réponse est exacte ou erronée
10 min c’est dans ces centres sociaux non spécialisés en toxicomanie

Method Text 15 min 12.5 min 10 min
Unified elle ne sent que par sa tête
Learned ils ont réalisé pour vous des ouvrages présentant la synthèse de leurs observations
Separate certaines mauvaises langues disent même que certains agriculteurs plantent justement du maïs

French (phn2char)

Amount Text Learned Separate Scratch
15 min il est alors utile de poursuivre les conseils diététiques précédents
15 min les agents peuvent communiquer coopérer
10 min il ne faut pas aller trop vite
10 min les cercles de couleurs transparentes ne se superposent que partiellement

Method Text 15 min 12.5 min 10 min
Learned il est moins souvent utile de savoir si la réponse est exacte ou erronée
Separate elle est constituée de la liste des titres

German (phn2phn)

Amount Text Unified Learned Separate Scratch
25 min Sie hatte zwei Töchter die längst verheiratet waren
25 min Wohin sie führen fragte ich Sie lachte und sagte Weiter
15 min Ich wurde gestraft und erfuhr daß ich etwas sehr Böses getan hatte
15 min Er grüßte blieb stehen er wollte mich ansprechen

Method Text 30 min 25 min 20 min 15 min
Unified An den Fenstern weiße Gardinen und braune Kinderköpfchen die lustig herauslugten
Learned Ich habe ja noch beinahe ein halbes Jahrhundert vor mir
Separate Was die Andern taten das war für sie das Richtige

German (phn2char)

Amount Text Learned Separate Scratch
25 min Nun aber sind mir fast seine Gesichtszüge entschwunden
25 min Ich saß stundenlang und tat nichts und dämmerte so hin
15 min Von einer Reise zu meinen Töchtern konnte keine Rede sein
15 min Ich tat was ich konnte es war auch wirklich nicht zu viel

Method Text 30 min 25 min 20 min 15 min
Learned Ich hatte solche Unlust zu meinen Töchtern zu reisen
Separate Immer nur wandle ich am Strand entlang weiter und weiter

Symbol Mapping

The mapping results correspond to Section 2.3.2 and Section 4.3 of our paper. We derive the mappings from 15 minutes of paired data in the target languages. For better presentation and comparison, we mapped all the phonemes to IPA.

Mandarin (phn2phn)

Phoneme (EN) Phoneme (ZH) Phoneme (EN) Phoneme (ZH)
n n ʃ ʂ
e e b p
d t ɡ k
z ɕ s s
æ a f f
w w i i
ɑ ɑ m m
h h ə ə
j j
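
As a small illustration of how the Mandarin (phn2phn) table above can be read, the sketch below copies its (English, Mandarin) pairs into a list and separates the pairs that share the same IPA symbol from those that do not. The names `EN_TO_ZH` and `split_pairs` are hypothetical helpers introduced only for this example; the pairs themselves are taken from the table.

```python
# (English phoneme, Mandarin phoneme) pairs copied from the table above.
EN_TO_ZH = [
    ("n", "n"), ("ʃ", "ʂ"), ("e", "e"), ("b", "p"), ("d", "t"), ("ɡ", "k"),
    ("z", "ɕ"), ("s", "s"), ("æ", "a"), ("f", "f"), ("w", "w"), ("i", "i"),
    ("ɑ", "ɑ"), ("m", "m"), ("h", "h"), ("ə", "ə"), ("j", "j"),
]

def split_pairs(pairs):
    """Split pairs into those with identical IPA symbols and those that differ."""
    identical = [p for p in pairs if p[0] == p[1]]
    different = [p for p in pairs if p[0] != p[1]]
    return identical, different

identical, different = split_pairs(EN_TO_ZH)
print(f"{len(identical)} of {len(EN_TO_ZH)} pairs share the same IPA symbol")
for en, zh in different:
    print(f"EN {en} -> ZH {zh}")
```

The pairs that differ are still phonetically close. For example, English b, d and ɡ map to Mandarin p, t and k, which are unaspirated stops in Mandarin.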

French (phn2char)

Phoneme (EN) Character (FR) Phoneme (EN) Character (FR)
n n ɑ a
p p t t
k c ɡ g
z s d d
b b v v
w o ɝ e
m m i i
h r f f

German (phn2char)

Phoneme (EN) Character (DE) Phoneme (EN) Character (DE)
ŋ n z s
a e p p
t t ʃ c
ɡ g d d
b b ɔ o
k k v w
ʊ u ɪ i
ɑ a æ ä
m m h h
j j l l
f f