AI Speech Synthesis and Voice Cloning: ElevenLabs and Suno Bark
TTS Tools, David Attenborough speaks German and AI Ariana sings
Voice As An Interface
Natural language as an interface is the most profound interface innovation since the smartphone touch-screen. Communicating with computers the same way we communicate with each other is the lowest-friction interface, and we use language most naturally and efficiently through speech. Hence, voice is the best and most natural interface.
Voice interfaces have two components: Speech recognition takes the human voice input and converts it to text; on the output side, speech synthesis or text-to-speech (TTS) converts text into audio waveform output, artificially generating human speech. Apple’s Siri and similar popular virtual assistants such as Amazon’s Alexa and Google Assistant first made voice interfaces mainstream.
Deep learning models for speech recognition and speech synthesis have advanced to a point where we have high-quality and low-cost solutions for both, widely used for various applications in assistive technology, gaming, customer service, entertainment and more. We will dive in further on speech synthesis.
Text to Speech (TTS)
Text-to-speech (TTS), or speech synthesis, the technology that converts written text into audio, has been around for several decades. In the early years, pre-deep-learning models produced understandable but robotic voices.
Deep learning brought about new generations of improved speech synthesis.1 In 2016, WaveNet introduced a deep-learning-based approach that models the raw audio waveform directly.2 Apple's Siri adopted deep learning for speech synthesis starting with iOS 10 in 2016, making Siri's voice smoother, more natural, and inflected like a human's.
Voice-generation technology has continued to improve to the point where voice cloning is possible, imitating others' voices with high fidelity. In January 2023, Microsoft announced a new text-to-speech AI model called VALL-E that can simulate a person's voice when given only a three-second audio sample.
TTS Tools
Thanks to powerful and inexpensive underlying deep-learning TTS technology, there is a wide selection of good TTS tools, including many free and open-source options.
Embedded TTS has become ubiquitous in our interfaces. It powers Alexa, Google Assistant, Windows Cortana, and other interfaces. You'll also find TTS embedded in many phone apps, including Siri, Samsung's Bixby, health apps, and more.
Content creators - whether making YouTube videos, marketing pitches, or audio books - have many use-cases for standalone TTS apps that directly turn text into audio outputs. Here are some standalone TTS options for such users to try:
Lovo.ai - Provides a number of human-level-quality voice-overs, with emotional voice-over tuning.
Synthesis.io - Provides both TTS and Text-to-Video (TTV) technology, using avatars for generating dynamic media presentations, and is available in a number of languages and supplied AI voices.
Murf.ai - TTS voice-overs with realistic AI voices, as well as voice cloning. Also provides a way to “Transform your voiceover from a home recording to a professional AI voice.”
Speechify - They pitch their TTS product mainly to readers: “Speechify users are students, working professionals, and people who like speed-listening. … Power through docs, articles, PDFs, email — anything you read — by listening with our leading text-to-speech reader.” They also have a voice-over ‘studio’ for content creators.
These and other similar tools have free tiers to try them out, additional features such as adding music and soundtracks, and other bells and whistles. The human voice-over business is being thoroughly disrupted.
Voice Cloning and Eleven Labs
A leader in high-quality voice cloning is ElevenLabs. At their free level, you can choose from a number of pre-made voices with quite convincing human inflections and intonations. Stepping up to the low-priced starter level, you can use "Instant Voice Cloning" to generate the cloned voice of anyone, including yourself.
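Beyond the web app, ElevenLabs exposes its synthesis through a REST API. The sketch below (standard library only) shows roughly how a request to the public v1 text-to-speech endpoint is shaped; the voice ID and API key are placeholders you supply yourself, and the endpoint and field names reflect the API as publicly documented at the time of writing, so treat them as assumptions to verify.

```python
import json
import os
import urllib.request

API_BASE = "https://api.elevenlabs.io/v1"

def build_tts_request(voice_id: str, text: str, api_key: str) -> urllib.request.Request:
    """Build a POST request asking the API to synthesize `text` in voice `voice_id`."""
    body = json.dumps({
        "text": text,
        # Voice settings trade voice consistency (stability) against
        # closeness to the original speaker (similarity_boost).
        "voice_settings": {"stability": 0.5, "similarity_boost": 0.75},
    }).encode("utf-8")
    return urllib.request.Request(
        url=f"{API_BASE}/text-to-speech/{voice_id}",
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )

if __name__ == "__main__":
    # Only sends a real request if you've exported an API key.
    api_key = os.environ.get("ELEVEN_API_KEY")
    if api_key:
        req = build_tts_request("YOUR_VOICE_ID", "Hello from a cloned voice.", api_key)
        with urllib.request.urlopen(req) as resp:
            with open("speech.mp3", "wb") as f:
                f.write(resp.read())  # the response body is the audio itself
```

The API returns audio bytes directly rather than a URL, so the snippet simply writes the response body to a file.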
The cloning itself is an impressive enough technology, but the fact that it is so convincing that it can be used to create ‘Deep Fake’ audio and video is what’s really crazy. This amazing demo shows what’s possible:
With great technology for mimicry comes dangerous and malicious fakery. Thanks to advances like the aforementioned VALL-E from Microsoft, it takes as little as three seconds of audio to clone someone's voice, and that much audio is something many people have on social media or in a voicemail greeting. Scammers are already using cloned AI voices to call relatives and claim they need money.
In one recent infamous case, the ploy was a fake kidnapping scam: a cloned voice was used to convince a mother that her daughter had been kidnapped. It's getting bad enough that one recent headline reads Why AI Voice Scams Mean You Should Probably Never Answer Your Phone. We might as well get ready for it: don't trust what you see or hear, even a supposed relative in trouble, without verifying.
Hopefully, we will find more positive and creative uses for cloned voices. You could listen to audiobooks read in the author's own voice without the author having had to sit and record them. One could also more easily create audiobooks in which each character's dialogue is spoken by a distinct voice.
ElevenLabs also introduced a multilingual speech synthesis model that can convey accents and intonations of a cloned voice across several languages. One can imagine how such multi-lingual capabilities could expand localized gaming and entertainment (get every character in every game in your language), make the dubbing of movies much better and more interesting (preserve the original actor’s accent while translating the language), and more.
How does it sound, though? AiBreakfast shared this ElevenLabs demo of David Attenborough cloned to speak German:
Suno Bark and the Sounds of Music
As these systems get better and better, we are moving beyond synthesizing spoken speech and into generating other sounds: singing, musical voices, and more.
Suno Bark is an open-source transformer-based text-to-audio model that is able to incorporate many non-verbal effects:
Bark can generate highly realistic, multilingual speech as well as other audio - including music, background noise and simple sound effects. The model can also produce nonverbal communications like laughing, sighing and crying.
Since text on a page cannot really convey how capable this tool is, try it yourself on Hugging Face. The results are impressive in their ability to generate musical lilts, appropriate verbal cadence, and non-verbal expressions.
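Because Bark is open source, you can also run it locally from its Python package. A minimal sketch, based on the usage shown in the project's README: bracketed cues such as [laughs] and ♪-delimited lyrics in the prompt steer the non-verbal and musical output. The first run downloads model weights, so the generation step here is guarded behind the main block.

```python
# A Bark-style prompt: bracketed cues and ♪ markers request
# non-verbal sounds and singing rather than plain speech.
PROMPT = (
    "Hello, my name is Suno. [laughs] "
    "And sometimes I like to sing: ♪ la la la, la la la ♪"
)

if __name__ == "__main__":
    # pip install git+https://github.com/suno-ai/bark.git
    from bark import SAMPLE_RATE, generate_audio, preload_models
    from scipy.io.wavfile import write as write_wav

    preload_models()                # downloads/caches model weights on first run
    audio = generate_audio(PROMPT)  # numpy array of samples at SAMPLE_RATE
    write_wav("bark_out.wav", SAMPLE_RATE, audio)
```

Expect generation to be slow without a GPU; the Hugging Face demo is the quicker way to get a feel for it.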
When it comes to AI-cloned voices making music, there are already some famous and infamous examples. Shared below is an AI-generated clone of Ariana Grande covering the Rihanna song "Diamonds." The possibilities are endless for remixing and reinventing both new and old music with AI-generated voices and human creativity. We just need the lawyers to sort out who owns what.
The future was yesterday and today is just plain wild
1. Tan, X., Qin, T., Soong, F., & Liu, T. (2021). A Survey on Neural Speech Synthesis. arXiv preprint arXiv:2106.15561.
2. van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. W., & Kavukcuoglu, K. (2016). WaveNet: A Generative Model for Raw Audio. arXiv preprint arXiv:1609.03499.