Text-to-speech (TTS) technology has been advancing rapidly, but it still struggles to produce natural, expressive speech across varied scenarios. New research from Amazon AGI suggests that a TTS model trained on massive amounts of data can exhibit emergent abilities and achieve remarkable improvements.
What are emergent abilities in text-to-speech models?
Emergent abilities are capabilities that appear once a model reaches sufficient scale, without being explicitly trained for them. In text-to-speech, they show up as more natural and expressive generated speech: synthesized voices sound more human-like, which improves user interaction and engagement.
The researchers at Amazon AGI, whose goal is artificial general intelligence, hypothesized that TTS models would show emergent abilities as they grow larger and train on more data, similar to what has been observed in large language models (LLMs).
To test this hypothesis, they trained three versions of a model called Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), with different sizes and amounts of training data.
How did they train and evaluate BASE TTS?
The largest version of BASE TTS, BASE-large, has 980 million parameters and uses 100,000 hours of public domain speech, primarily in English but also in German, Dutch, and Spanish. The medium version, BASE-medium, has 400 million parameters and uses 10,000 hours of speech, while the smallest version, BASE-small, has 150 million parameters and uses 1,000 hours of speech.
The researchers evaluated the three models on sentences that are challenging for TTS systems, such as:
- Compound nouns: The Beckhams chose to lease a delightful stone-built rustic vacation cottage in the countryside.
- Emotions: Jennie couldn’t contain her excitement as she exclaimed, “Wow! Are we going to the Maldives? I can’t believe it!” She jumped up and down with joy.
- Foreign words: Mr. Henry, renowned for his mise en place, orchestrated a seven-course meal, each dish a pièce de résistance.
- Paralinguistics (i.e. readable non-words): Tom whispered to Lucy, “Shh, we have to be quiet so we don’t wake up your baby brother,” as they quietly walked past the nursery.
- Punctuations: Her brother sent her a strange message: ‘Urgent situation at home! Call immediately! Mom and Dad are concerned… #familymatters.’
- Questions: However, the question of Brexit still lingers: Will the ministers be able to find the answers in time, despite all the challenges they have faced?
- Syntactic complexities: De Moya, who recently received the Lifetime Achievement Award, starred in a blockbuster movie in 2022, despite receiving mixed reviews.
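The categories above could be organized into a simple batch harness for exercising any synthesizer against hard inputs. This is an illustrative sketch, not the paper's actual tooling: `TOUGH_SENTENCES`, `evaluate`, and the stand-in callable are all hypothetical names, and the sentences are abbreviated from the examples listed here.

```python
# Illustrative only: a minimal harness that batches "tough sentence"
# categories through any TTS callable. All names here are hypothetical.
TOUGH_SENTENCES = {
    "compound_nouns": ("The Beckhams chose to lease a delightful stone-built "
                       "rustic vacation cottage in the countryside."),
    "emotions": ("Jennie couldn't contain her excitement as she exclaimed, "
                 "\"Wow! Are we going to the Maldives? I can't believe it!\""),
    "foreign_words": ("Mr. Henry, famous for his meticulous preparation, "
                      "organized a seven-course meal."),
    "paralinguistics": ("Tom whispered to Lucy, \"Shh, we have to be quiet so "
                        "we don't wake up your baby brother.\""),
    "punctuations": ("Her brother sent her a strange message: 'Urgent "
                     "situation at home! Call immediately!'"),
    "questions": ("Will the ministers be able to find the answers in time, "
                  "despite all the challenges they have faced?"),
    "syntactic_complexities": ("De Moya, who recently received the Lifetime "
                               "Achievement Award, starred in a blockbuster "
                               "movie in 2022."),
}

def evaluate(tts_fn):
    """Run a TTS callable over every category and collect its outputs."""
    return {cat: tts_fn(sent) for cat, sent in TOUGH_SENTENCES.items()}

# Stand-in callable: a real run would synthesize and score audio instead.
results = evaluate(len)
```

With a real model plugged in as `tts_fn`, the per-category outputs could then be rated by human listeners, which is how the paper compares model sizes.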
They found that BASE-medium and BASE-large performed significantly better than BASE-small and existing TTS models such as Tortoise and VALL-E on these tasks, and received higher ratings from human listeners on speech quality and naturalness. The results suggest that model size and the amount of training data are the key factors enabling emergent abilities in TTS models.
What are the benefits and risks of BASE TTS?
The researchers also noted that BASE TTS is a streamable model, meaning that it can generate speech on the fly, without waiting for the whole sentence to be processed. This makes it more suitable for real-time applications, such as voice assistants or audiobooks. Moreover, they proposed a method to encode and transmit the speech metadata, such as emotion, prosody, and accent, in a separate low-bandwidth stream, which can enhance the expressiveness of the speech without affecting the audio quality.
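The streaming behavior can be pictured with a minimal sketch. This is a hypothetical interface, not the actual BASE TTS API: audio chunks are yielded as soon as each is ready, paired with a separate lightweight metadata frame of the kind the researchers describe (emotion, prosody), so playback can begin before the whole sentence is processed.

```python
from dataclasses import dataclass
from typing import Iterator, Tuple

@dataclass
class MetadataFrame:
    """Low-bandwidth side-channel data of the kind the paper describes."""
    emotion: str
    prosody: str

def synthesize(chunk: str) -> bytes:
    # Stand-in: a real system would return PCM audio for this chunk.
    return chunk.encode("utf-8")

def stream_tts(text: str,
               chunk_words: int = 3) -> Iterator[Tuple[bytes, MetadataFrame]]:
    """Yield (audio, metadata) pairs chunk by chunk instead of waiting
    for the whole sentence, so a client can start playback early."""
    words = text.split()
    for i in range(0, len(words), chunk_words):
        audio = synthesize(" ".join(words[i:i + chunk_words]))
        meta = MetadataFrame(emotion="neutral", prosody="declarative")
        yield audio, meta

# A client can begin playback as soon as the first pair arrives:
chunks = list(stream_tts("Shh we have to be quiet so we do not wake the baby"))
```

Keeping the metadata in its own stream, as the researchers propose, means expressiveness cues can be transmitted cheaply without touching the audio payload.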
The research team believes that their work is a breakthrough for TTS technology, as it shows that TTS models can escape the uncanny valley and produce natural and diverse speech for various scenarios. They also hope that their work will inspire further research on the emergent abilities of TTS models and how to leverage them for different applications.
However, they also acknowledged the potential risks of their technology, such as misuse or abuse by malicious actors, and decided not to release the model or the data publicly.
The paper, titled “Big Adaptive Streamable TTS with Emergent Abilities”, was presented at the International Conference on Acoustics, Speech, and Signal Processing (ICASSP) 2024. The authors are: Anirudh Raju, Anusha Prakash, Aswin Shanmugam Subramanian, Balaji Vasan Srinivasan, Chandra Sekhar Seelamantula, and Rohit Prabhavalkar.