Abstract
We compare phone labels and articulatory features as input for cross-lingual transfer learning in text-to-speech (TTS) for low-resource languages (LRLs). Experiments with FastSpeech 2 and the LRL West Frisian show that using articulatory features outperforms using phone labels in both intelligibility and naturalness. For LRLs without pronunciation dictionaries, we propose two novel approaches: a) using a massively multilingual model for grapheme-to-phone (G2P) conversion in both training and synthesis, and b) using a universal phone recognizer to create a makeshift dictionary. Results show that the G2P approach performs largely on par with using a ground-truth dictionary, and that the phone recognition approach, while generally performing worse, remains a viable option for LRLs less suitable for the G2P approach. Within each approach, using articulatory features as input outperforms using phone labels.
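The difference between the two input representations can be illustrated with a minimal sketch. The feature inventory, the phone table, and the helper names below are hypothetical simplifications chosen for illustration; they are not the feature set or the model used in the paper. The sketch only shows why a phone that is absent from the source-language inventory has no usable one-hot label, while its articulatory feature vector still overlaps with phones seen during training.

```python
# Toy comparison of phone labels vs. articulatory feature vectors as TTS input.
# ASSUMPTION: FEATURES and ARTICULATORY are a simplified, hypothetical subset,
# not the feature set used in the paper.

FEATURES = ["voiced", "nasal", "labial", "alveolar", "velar", "stop", "fricative"]

ARTICULATORY = {
    "p": {"labial", "stop"},
    "b": {"voiced", "labial", "stop"},
    "t": {"alveolar", "stop"},
    "d": {"voiced", "alveolar", "stop"},
    "s": {"alveolar", "fricative"},
    "z": {"voiced", "alveolar", "fricative"},
    "m": {"voiced", "nasal", "labial"},
    "x": {"velar", "fricative"},
}


def one_hot(phone, inventory):
    """Phone-label input: one dimension per phone in a fixed inventory."""
    return [1 if p == phone else 0 for p in inventory]


def articulatory_vector(phone):
    """Articulatory input: one binary dimension per feature in FEATURES."""
    active = ARTICULATORY[phone]
    return [1 if f in active else 0 for f in FEATURES]


def shared_active(u, v):
    """Count the dimensions that are active in both binary vectors."""
    return sum(a & b for a, b in zip(u, v))


# Hypothetical source-language inventory that lacks the phone "x".
source_inventory = ["p", "b", "t", "d", "s", "z", "m"]

# With phone labels, the unseen phone has no representation at all.
unseen_label = one_hot("x", source_inventory)

# With articulatory features, it shares "fricative" with the seen phone "s".
overlap = shared_active(articulatory_vector("x"), articulatory_vector("s"))
```

In this toy setup, two distinct one-hot labels never share an active dimension, whereas related phones such as "p" and "b" share their place and manner features.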
Original language | English |
---|---|
Pages | 5461-5465 |
DOIs | |
Status | Published - 20 Aug 2023 |
Event | Interspeech 2023 - Convention Centre, Dublin, Ireland. Duration: 20 Aug 2023 → 24 Aug 2023. https://interspeech2023.org |
Conference
Conference | Interspeech 2023 |
---|---|
Country/Region | Ireland |
City | Dublin |
Period | 20/08/2023 → 24/08/2023 |
Internet address | https://interspeech2023.org |