IBM’s AI generates high-quality voices from 5 minutes of talking

Image Credit: IBM

Training powerful text-to-speech models requires substantial compute. A recent OpenAI analysis drives the point home: it found that since 2012, the amount of compute used in the largest AI training runs has grown by more than 300,000 times. In pursuit of less demanding models, researchers at IBM developed a new lightweight and modular method for speech synthesis. They say it can synthesize high-quality speech in real time by learning different aspects of a speaker's voice, making it possible to adapt to new speaking styles and voices with small amounts of data.

“Recent advances in deep learning are dramatically improving the development of Text-to-Speech (TTS) systems through more effective and efficient learning of voice and speaking styles of speakers and more natural generation of high-quality output speech,” wrote IBM researchers Zvi Kons, Slava Shechtman, and Alex Sorin in a blog post accompanying a preprint paper presented at Interspeech 2019. “Yet, to produce this high-quality speech, most TTS systems depend on large and complex neural network models that are difficult to train and do not allow real-time speech synthesis, even when leveraging GPUs. In order to address these challenges, our … team has developed a new method for neural speech synthesis based on a modular architecture.”

The IBM team's system consists of three interconnected components: a prosody feature predictor, an acoustic feature predictor, and a neural vocoder. The prosody predictor learns the duration, pitch, and energy of speech samples, with the goal of better capturing a speaker's style. The acoustic feature predictor creates representations of the speaker's voice from the training or adaptation data, while the neural vocoder generates speech samples from those acoustic features.
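To make the modular design concrete, here is a minimal sketch (not IBM's actual code) of how such a three-stage pipeline could be wired together in PyTorch: phoneme features flow through a prosody predictor, then an acoustic feature predictor, then a vocoder. Layer types, dimensions, and the dummy input are illustrative assumptions.

```python
# Sketch of a modular TTS pipeline: prosody -> acoustic features -> vocoder.
import torch
import torch.nn as nn

class ProsodyPredictor(nn.Module):
    """Predicts per-phoneme duration, pitch, and energy."""
    def __init__(self, in_dim=64, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 3)  # duration, pitch, energy

    def forward(self, phonemes):
        h, _ = self.rnn(phonemes)
        return self.out(h)

class AcousticPredictor(nn.Module):
    """Maps phonemes plus prosody to acoustic features (e.g., mel-spectrogram frames)."""
    def __init__(self, in_dim=64 + 3, hidden=128, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(in_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, phonemes, prosody):
        h, _ = self.rnn(torch.cat([phonemes, prosody], dim=-1))
        return self.out(h)

class NeuralVocoder(nn.Module):
    """Generates waveform samples from acoustic features (placeholder upsampler)."""
    def __init__(self, n_mels=80, hop=256):
        super().__init__()
        self.proj = nn.Linear(n_mels, hop)

    def forward(self, acoustic):
        return self.proj(acoustic).flatten(1)  # (batch, samples)

# Chain the three modules, mirroring the modular architecture described above.
phonemes = torch.randn(1, 20, 64)              # 20 frames of dummy phoneme features
prosody = ProsodyPredictor()(phonemes)
acoustic = AcousticPredictor()(phonemes, prosody)
audio = NeuralVocoder()(acoustic)
print(audio.shape)                             # torch.Size([1, 5120])
```

Keeping the stages separate is what allows each one to be retrained or swapped independently, rather than retraining a single monolithic network.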

All three components can be retrained together to adapt the synthesized voice to a target speaker using only a small amount of that speaker's data. In a test in which volunteers listened to and rated pairs of synthesized and natural voice samples, the team reports that the model maintained high quality and similarity to the original speaker even for voices trained on as little as five minutes of speech.
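A hedged sketch of what that adaptation-by-retraining step might look like: start from a pretrained model and fine-tune it on a few minutes of target-speaker data. The toy model, dummy data, optimizer, and loss below are illustrative assumptions, not IBM's actual training setup.

```python
# Fine-tune a pretrained stand-in model on a small amount of target-speaker data.
import torch
import torch.nn as nn

model = nn.GRU(80, 80, batch_first=True)       # stand-in for the pretrained TTS stack
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = nn.L1Loss()

# A few minutes of target-speaker mel-spectrogram frames (dummy tensors here).
adaptation_batches = [(torch.randn(8, 100, 80), torch.randn(8, 100, 80)) for _ in range(10)]

for inputs, targets in adaptation_batches:
    optimizer.zero_grad()
    predictions, _ = model(inputs)
    loss = loss_fn(predictions, targets)
    loss.backward()
    optimizer.step()
```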

The work served as the basis for IBM's new Watson Text to Speech service, where the new voices can be heard in the online demo. (Select "V3" voices from the dropdown menu.)
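For developers, here is a minimal sketch of synthesizing speech with one of those "V3" voices through the ibm-watson Python SDK. The API key, service URL, and voice name below are placeholder assumptions for illustration; substitute your own service credentials.

```python
# Synthesize a short sentence with a Watson TTS "V3" voice and save it to disk.
from ibm_watson import TextToSpeechV1
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

authenticator = IAMAuthenticator("YOUR_API_KEY")
tts = TextToSpeechV1(authenticator=authenticator)
tts.set_service_url("https://api.us-south.text-to-speech.watson.cloud.ibm.com")

audio = tts.synthesize(
    "Hello from Watson Text to Speech.",
    voice="en-US_AllisonV3Voice",
    accept="audio/wav",
).get_result().content

with open("sample.wav", "wb") as f:
    f.write(audio)
```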

The new research comes months after IBM scientists detailed techniques that cut AI speech recognition training time from a week to 11 hours. Separately, in May, an IBM team took the wraps off a novel system that achieves "industry-leading" results on broadcast news captioning tasks.
