3 text-to-speech examples I’ve randomly encountered online
background photo courtesy of BandLab on Unsplash
Long before Bev Standing’s iconic text-to-speech voice spread all over TikTok and the rest of the internet, we had already heard computers talk. Most people in this day and age have experienced synthetic speech and its eerie non-human-ness. But what exactly is synthetic speech, and why do we keep using it?
Voice branding expert Phoebe Ohayon defines speech as the signal produced by modulating voice into meaningful patterns. Although many people use “speech” interchangeably with “voice”, speech is not always produced by humans. In fact, that’s exactly what synthetic speech refers to: the artificial production of human speech, a.k.a. machine-created speech. As highly communicative creatures, humans are pretty good at telling whether speech is natural or artificial. A lot of synthetic speech systems put emphasis on the wrong words or pause at the “wrong” time, among other tells that reveal their “unhuman” nature.
The wonkiness explained
Text-to-speech (TTS) is the process of creating “spoken” content from written text. It’s also referred to as “read aloud” technology. In plain words, it’s live output made from pre-recorded input. Traditional TTS voices were created in a recording studio. Voice actors were hired to train software on human speech and to try to capture all the possible sounds (not words) of a particular language. Those sounds were later “stitched together” into a vast combination of words and sentences, including ones that were never explicitly recorded. This video from Acapela Group does a great job of showing how the word “impressive” can be created by stitching together parts of the words “impossible”, “president”, and “detective”.
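To make the “stitching” idea a bit more concrete, here is a toy sketch in Python. It is only an illustration, not how Acapela Group or any real TTS engine works: real systems join snippets of recorded audio, while this sketch uses rough letter chunks as stand-ins for sounds. The way each word is cut up, and the names RECORDED_UNITS, build_index, and stitch, are my own assumptions.

```python
# Toy illustration of concatenative ("stitched") synthesis.
# Real engines join recorded audio units; plain letter chunks stand in for them here.

# Hypothetical inventory: each recorded word, cut into reusable chunks.
# The segmentation is an assumption made up for this example.
RECORDED_UNITS = {
    "impossible": ["im", "po", "ss", "i", "ble"],
    "president": ["pre", "s", "i", "dent"],
    "detective": ["de", "tect", "ive"],
}


def build_index(recordings):
    """Map each chunk to the first recorded word that contains it."""
    index = {}
    for word, units in recordings.items():
        for unit in units:
            index.setdefault(unit, word)
    return index


def stitch(target_units, index):
    """Pair every chunk of the target word with the recording it comes from."""
    missing = [unit for unit in target_units if unit not in index]
    if missing:
        raise KeyError(f"no recording covers: {missing}")
    return [(unit, index[unit]) for unit in target_units]


index = build_index(RECORDED_UNITS)

# "impressive" was never recorded, but its chunks exist in other recordings.
plan = stitch(["im", "pre", "ss", "ive"], index)
word = "".join(unit for unit, _ in plan)

for unit, source in plan:
    print(f"{unit!r} taken from {source!r}")
print(word)
```

The interesting part is the lookup: the system never needs a recording of the whole word, only a recording somewhere that contains each piece. The audible seams between those pieces are one plausible source of the wonkiness described above.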
However, not all TTS software is created equal, and some voices sound less natural than others. The speech might sound flat (lacking intonation), or punctuation might get ignored. So the question remains: if the technology sounds so bad, why do we keep relying on synthetic speech?
The authors of the 2005 book Wired for Speech summarized it best:
“Because of limitations of storage space (digital recordings are large), processing speed (finding and combining arbitrary utterances can be slow), bandwidth speed (sound files do not transmit gracefully over a 33 kilobyte phone line), dynamism of content (all of the Web’s content cannot be spoken and recorded in real time), and other technical constraints, much of the speech that is and will be produced by computers, the Web, telephone interfaces, and wireless devices will be ‘synthesized speech’[.]”
It’s much easier and more viable to create speech artificially than to have interfaces present “fully recorded words and phrases”, as Clifford Nass and Scott Brave state in their book. Recording everything in advance is expensive, in terms of both money and computing power, and hard to scale. Since then, the technology has advanced further, and neural TTS is all the rage now.
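For a rough feel of the storage and bandwidth point in the quote, here is a back-of-the-envelope sketch in Python. All the numbers are my own assumptions rather than figures from the book: uncompressed 16-bit mono audio at 16 kHz, a speaking rate of about 150 words per minute, and about 6 bytes of text per word.

```python
# Back-of-the-envelope comparison: recorded speech vs. the text it was read from.
# Every constant below is an assumption chosen for illustration.
SAMPLE_RATE_HZ = 16_000    # samples per second, mono
BYTES_PER_SAMPLE = 2       # 16-bit uncompressed audio
WORDS_PER_MINUTE = 150     # rough conversational speaking rate
AVG_BYTES_PER_WORD = 6     # about five letters plus a space


def audio_bytes(minutes):
    """Size of uncompressed recorded speech for the given duration."""
    return int(minutes * 60 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def text_bytes(minutes):
    """Size of the plain text that would be read aloud in that time."""
    return int(minutes * WORDS_PER_MINUTE * AVG_BYTES_PER_WORD)


one_minute_audio = audio_bytes(1)
one_minute_text = text_bytes(1)
ratio = one_minute_audio / one_minute_text

print(f"audio: {one_minute_audio:,} bytes per minute")
print(f"text:  {one_minute_text:,} bytes per minute")
print(f"audio is roughly {ratio:,.0f}x larger")
```

Under these assumptions, a minute of recorded speech takes about 1.9 MB while the text behind it takes under 1 KB. Compression narrows the gap a lot, but sending text and synthesizing the speech on the device is still far lighter.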
Examples of TTS and its modern usage
Personally, I’ve loved seeing this kind of speech technology evolve and improve over time, and become more prevalent in everyday life. As someone particularly fond of voice technology, I’ve found it super fun to follow the modern online trend of creating short videos with synthetic speech content. The examples below are a few of my personal favorite use cases for TTS that are not Instagram Reels or TikToks.
TTS to open a music video
BLOSWOM, a music artist from France, released a music video for his song “Rosiana” in which a TTS voice sets the scene and reveals why the character wakes up on the beach.
https://medium.com/media/daa7d3988ba6214b8436c680bbf2b309/href
TTS for comedic effect in a video essay
In its video commentary on the 2022 Andrew Dominik film “Blonde”, the Be Kind Rewind channel points out the many potential inaccuracies to look out for in this film adaptation of Marilyn Monroe’s life. Along the way, the video parodies the film’s use of a talking fetus.
https://medium.com/media/fc0e72aacc9e355e7e9acba4d3c6c98b/href
TTS to replace human commentary
This was an interesting find: a channel that uses a TTS voice to narrate movie recap commentary. While there are many reasons someone might choose not to record their own voice for a video (including speech impediments, insecurity about their accent, etc.), it was nice to see a video trying to normalize the use of TTS.
https://medium.com/media/41c580b72d166749aafc2715ca07594e/href
Got any favorite examples of synthetic speech in your life? Let me know by leaving a comment on this post! I’d love to hear more everyday examples.
Everyday Speech: Examples of TTS out in the wild was originally published in Chatbots Life on Medium, where people are continuing the conversation by highlighting and responding to this story.