Text to Speech

Robot Brand Promise

I’ve been bombarded lately by Facebook adverts for Text to Speech products that “sound human”. I’ve been asked what I think of these, given my background in IT, programming and Voice Acting.

The best advice I can give is to set your expectations wisely before using this technology.

First, understand what you want to achieve from the audio. If it’s simply a text message that sounds like it might be a human reading it, then great. But in my experience, many audio messages carry far more than just the textual meaning: the intent is to evoke an emotional reaction in the listener.

Meaning is more than just text

There’s a massive amount of information carried in the delivery. There is a myriad of ways to deliver the same text and convey very different meanings. If this weren’t the case, there would be one definitive version of every one of Shakespeare’s plays (excluding changes in aspects such as gender, period and setting). The text would always mean precisely what it said.

But language isn’t like that – certainly common use English isn’t! Just glance at any legal document for confirmation of this. When looking at text for the first time, an Actor will spend many hours researching the context, the character, the given circumstances and the word usage. There will be certain fixed points that will emerge from this, but a plethora of possibilities from the information not given by the author. This is where the Actor may use artistic skill to fill in the gaps in a way that respects the fixed points and remains credible and interesting to the intended audience. None of this would be necessary were the text as precise as programming syntax.

This is as true and relevant in an e-Learning text as in a corporate explainer or a Shakespearian play.

Understand AI text to speech limitations

While Artificial Intelligence (AI) can do incredible things, it’s nowhere near being able to cost-effectively take direction from the client or nuance from the text, and so carry the intended meaning. Certainly, if it were agreed how text should be delivered, even within certain parameters, an AI programme could perform it adequately. However, this would be a specific piece of programming for that particular need. That costs big money. Even big-budget feature films would think twice.

And that’s how these text to speech applications work. They’re set to analyse the whole text and, using what the AI has learned, produce the most likely vocalisation. There are few, if any, front-end settings for the user to change the meaning, intonation or delivery. You get what you get. It’s way better than the robotic-sounding text to speech. It has a human sound, but it’s missing the human brain, experience, culture and heart. A human-sounding voice, but not a human talking.

The Voice Actor's "front end"

Voice Actors understand the text, the direction given to them by the client, the context and the audience. The Voice Actor chooses a delivery appropriate to those requirements.

It’s not yet practical for programmes to have a front-end that takes all these nuances on board in anything but specific cases, and in those cases the programming costs outweigh the cost of using a skilled human.

Computers cannot intentionally generate an emotional response in the audience based on the requirements of the text, context and direction. Only humans can do this.

Want to test it? Have the programme tell a sarcastic joke. See what I mean?

Is it useful?

Now, before I’m accused of being a Luddite or of being biased through protecting my own profession, let me be clear. There certainly are applications for this technology. I can see many Interactive Voice Response (IVR) recordings using it, as these need little human emotion. Certain e-Learning situations could benefit if it is used in moderation. I say moderation because the human ear will quickly tire of the AI voice, which, although varied, has subtle repetitive patterns that become irritating after a few minutes. It simply doesn’t sound human enough to build trust with the human subconscious.


If the AI voice isn’t building trust, creating empathy or generating an emotional response in the listener, be very careful about using it in any client-facing situation. If the story sounds fake to the subconscious, the message, and with it your product, service or company, will be perceived as fake.

So, in summary, by all means use this technology where it makes sense, but set your expectations first and understand what your audio needs to achieve.

Only a human can generate an emotional response from another human.