A recent project undertaken by DeepMind with Google, as part of Google’s Euphonia project, demonstrates an early proof of concept of how text-to-speech technologies can synthesise a natural-sounding voice using minimal recorded speech data.
Losing one’s voice can be socially devastating. Today, the main option available to people who want to preserve their voice is message banking, wherein people with amyotrophic lateral sclerosis (ALS, commonly known as Lou Gehrig’s disease) can digitally record and store personally meaningful phrases using their natural inflection and intonation. But message banking lacks flexibility, resulting in a static dataset of phrases.
DeepMind has been collaborating with Google and people like ALS campaigner Tim Shaw to help develop technologies that can make it easier for people with speech difficulties to communicate. The challenges of this are two-fold. Firstly, the technology must be able to recognise the speech of people with non-standard pronunciation, something Google AI has been researching through Project Euphonia. Secondly, people should ideally be able to communicate using their original voice. Stephen Hawking, who also suffered from ALS, communicated with a famously unnatural-sounding text-to-speech synthesiser. Thus, the second challenge is customising text-to-speech technology to the user’s natural speaking voice.
With WaveNet and Tacotron, DeepMind has seen tremendous breakthroughs in the quality of text-to-speech systems. However, whilst it is possible to create natural-sounding voices that sound like specific people in certain contexts, developing synthetic voices requires many hours of studio recording time with a very specific script – a luxury that many people with ALS simply don’t have. Creating machine learning models that require less training data is an active area of research at DeepMind, and is crucial for use cases such as this, where a voice needs to be recreated from just a handful of audio recordings. DeepMind helped do this by harnessing the WaveNet work and the novel approaches demonstrated in a paper, Sample Efficient Adaptive Text-to-Speech (TTS).
Thanks to Tim’s time in the media spotlight, resulting in about thirty minutes of high-quality audio recordings, DeepMind's researchers were able to apply the methodologies from WaveNet and TTS to recreate his former voice.
Following a six-month effort, Google’s AI team visited Tim and his family to show him the results of their work. The meeting was captured for the new YouTube Originals learning series “The Age of A.I.”, hosted by Robert Downey Jr. Tim and his family were able to hear his old voice for the first time in years, as the model – trained on Tim’s NFL audio recordings – read out the letter he’d recently written to his younger self.
“I don’t remember that voice,” Tim remarked. His father responded, “We do.” Later, Tim recounted, “It has been so long since I’ve sounded like that, I feel like a new person. I felt like a missing part was put back in place. It’s amazing. I’m just thankful that there are people in this world that will push the envelope to help other people.”
How the technology works
WaveNet is a generative model trained on many hours of speech and text data from diverse speakers. It can then be fed arbitrary new text to be synthesised into a natural-sounding spoken sentence.
DeepMind has already shown that it’s possible to train a new voice with minutes, rather than hours, of voice recordings through a process called fine-tuning. This involves first training a large WaveNet model on up to thousands of speakers, which takes a few days, until it can produce the basics of natural-sounding speech. Then, the researchers take the small corpus of data for the target speaker and intelligently adapt the model, adjusting the weights to produce a single model that matches the target speaker. The concept of fine-tuning is similar to how people learn. For example, if you are attempting to learn calculus, you should first understand the foundations of basic algebra, and then apply these simpler concepts to help solve more complex equations.
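To make the idea of fine-tuning concrete, here is a deliberately tiny sketch in Python. It is not DeepMind's code and has nothing to do with audio: the "model" is a straight line with two weights, each "speaker" is a made-up dataset that shares the same slope but has its own offset, and every name and number in it (make_speaker, the offsets, the step counts, the learning rate) is an assumption chosen purely for illustration. It only shows the general pattern described above: train on plenty of data from many speakers first, then adjust the weights for a few steps using a handful of samples from the target speaker.

```python
import random


def mse(params, data):
    """Mean squared error of the line y = w * x + b on (x, y) pairs."""
    w, b = params
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


def train(params, data, steps, lr):
    """Plain full-batch gradient descent on the mean squared error."""
    w, b = params
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


rng = random.Random(0)  # fixed seed so the toy data is reproducible


def make_speaker(offset, n):
    """Hypothetical 'speaker': shared slope 2.0, speaker-specific offset."""
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    return [(x, 2.0 * x + offset) for x in xs]


# Stage 1: plenty of data from many different speakers.
many_speakers = []
for offset in (-1.0, -0.5, 0.0, 0.5, 1.0):
    many_speakers += make_speaker(offset, 200)

# Stage 2: only five samples from the target speaker.
target = make_speaker(0.3, 5)
held_out = make_speaker(0.3, 100)  # unseen target data, used for evaluation only

base = train((0.0, 0.0), many_speakers, steps=500, lr=0.1)  # long pre-training
fine_tuned = train(base, target, steps=20, lr=0.1)          # short adaptation
scratch = train((0.0, 0.0), target, steps=20, lr=0.1)       # same budget, no pre-training

print("base model on target speaker:", mse(base, held_out))
print("fine-tuned on target speaker:", mse(fine_tuned, held_out))
print("from scratch, same few steps:", mse(scratch, held_out))
```

In this toy setting the pre-trained model has already learned what the speakers have in common (the slope), so the short adaptation only has to correct what is specific to the target speaker (the offset). A model started from scratch with the same five samples and the same twenty steps has to learn everything at once and ends up further from the target.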
Later, the researchers migrated from WaveNet to WaveRNN, which is a more efficient text-to-speech model co-developed by Google AI and DeepMind. WaveNet requires a second distillation step to speed it up to serve requests in real-time, which makes fine-tuning more challenging. WaveRNN, on the other hand, does not require a second training step and can synthesise speech much faster than a WaveNet model that has not been distilled.
In addition to speeding up the models by switching to WaveRNN, DeepMind’s researchers collaborated with Google AI to improve the quality of the models. Google AI researchers demonstrated that a similar fine-tuning approach could be applied to the related Google Tacotron model, which DeepMind uses in conjunction with WaveRNN to synthesise realistic voices. By combining these technologies trained on audio clips of Tim Shaw from his NFL days, the researchers were able to generate an authentic-sounding voice that resembles how Tim sounded before his speech degraded. While the voice is not yet perfect, lacking the expressiveness, quirks, and controllability of a real voice, the combination of WaveRNN and Tacotron may help people like Tim preserve an important part of their identity, and one day the technology could be integrated into speech-generation devices.