Parameter-Efficient Learning for Text-to-Speech Accent Adaptation

Abstract

This paper presents parameter-efficient learning (PEL) for low-resource accent adaptation in text-to-speech (TTS). A resource-efficient adaptation from a frozen pre-trained TTS model is developed using only 0.8% to 1.2% of the original trainable parameters while achieving competitive voice synthesis performance. Motivated by the theoretical foundation of optimal transport (OT), this study carries out PEL for TTS where an auxiliary unsupervised loss based on OT is introduced to maximize a difference between the pre-trained source domain and the (unseen) target domain, in addition to its supervised training loss. Further, we leverage this unsupervised loss refinement to boost system performance via either sliced Wasserstein distance (SWD) or maximum mean discrepancy (MMD). The merit of this work is demonstrated by implementing PEL solutions based on residual adapter learning and model reprogramming, evaluated on Mandarin accent adaptation. Experimental results show that the proposed methods achieve naturalness competitive with parameter-efficient decoder fine-tuning, and that the auxiliary unsupervised loss improves model performance empirically.
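To make the two discrepancy measures named in the abstract concrete, the following is a minimal NumPy sketch of a sliced Wasserstein distance and a maximum mean discrepancy between two batches of feature vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian kernel, the `bandwidth` and `n_projections` defaults, and the requirement of equal batch sizes are all hypothetical choices, and how such a term is weighted against the supervised loss is not specified here.

```python
import numpy as np


def sliced_wasserstein(x, y, n_projections=64, seed=0):
    """Monte Carlo estimate of the squared sliced 2-Wasserstein distance.

    x, y: arrays of shape (n, d); both must have the same number of rows
    (an assumption of this sketch, so sorted projections pair up directly).
    """
    rng = np.random.default_rng(seed)
    # Draw random directions and normalise them onto the unit sphere in R^d.
    directions = rng.normal(size=(x.shape[1], n_projections))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    # Project to 1-D; sorting yields the optimal 1-D transport coupling.
    x_proj = np.sort(x @ directions, axis=0)
    y_proj = np.sort(y @ directions, axis=0)
    return float(np.mean((x_proj - y_proj) ** 2))


def mmd_rbf(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD with a Gaussian kernel."""

    def kernel(a, b):
        # Pairwise squared Euclidean distances, clipped at zero for stability.
        sq = (a**2).sum(axis=1)[:, None] + (b**2).sum(axis=1)[None, :] - 2.0 * a @ b.T
        return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

    return float(kernel(x, x).mean() + kernel(y, y).mean() - 2.0 * kernel(x, y).mean())


# Toy "source" and "target" features; the shapes are arbitrary placeholders.
rng = np.random.default_rng(1)
source = rng.normal(size=(128, 16))
target = rng.normal(loc=0.5, size=(128, 16))
print("SWD:", sliced_wasserstein(source, target))
print("MMD:", mmd_rbf(source, target))
```

Both functions return zero when the two batches are identical and grow as the batches drift apart, which is the property an auxiliary domain-discrepancy term relies on. In a training setup the same computations would be written in a differentiable framework so that gradients can reach the adapter or reprogramming parameters.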

Audio samples

All audio samples were produced with the same Parallel WaveGAN vocoder.

  • zh-TW accent
    1. 但不是一開始就在現場,而是中途後才加入 / He was not at the scene at the beginning, but joined partway through.

    Ground Truth             FT               Decoder-FT
    Input reprogram          Adapter          Input + latent reprogram
    Input reprogram - SWD    Adapter - SWD    Input + latent reprogram - SWD
    Input reprogram - MMD    Adapter - MMD    Input + latent reprogram - MMD

    2. 但也不算太差,可以說是部中等以上水準的影集 / It's not too bad; you could say it is a TV series of above-average quality.

    Ground Truth             FT               Decoder-FT
    Input reprogram          Adapter          Input + latent reprogram
    Input reprogram - SWD    Adapter - SWD    Input + latent reprogram - SWD
    Input reprogram - MMD    Adapter - MMD    Input + latent reprogram - MMD

    3. 甚麼動物最沒有方向感 / Which animal has the least sense of direction?

    Ground Truth             FT               Decoder-FT
    Input reprogram          Adapter          Input + latent reprogram
    Input reprogram - SWD    Adapter - SWD    Input + latent reprogram - SWD
    Input reprogram - MMD    Adapter - MMD    Input + latent reprogram - MMD

    4. 我那時老愛去漫畫店 / Back then, I always loved going to the comic shop.

    Ground Truth             FT               Decoder-FT
    Input reprogram          Adapter          Input + latent reprogram
    Input reprogram - SWD    Adapter - SWD    Input + latent reprogram - SWD
    Input reprogram - MMD    Adapter - MMD    Input + latent reprogram - MMD

    5. 魷魚遊戲的企劃發想始於12年前 / The concept for Squid Game was first conceived 12 years ago.

    Ground Truth             FT               Decoder-FT
    Input reprogram          Adapter          Input + latent reprogram
    Input reprogram - SWD    Adapter - SWD    Input + latent reprogram - SWD
    Input reprogram - MMD    Adapter - MMD    Input + latent reprogram - MMD