Community Wishlist Survey 2022/Generate Audio for IPA/Design/Robot vs Human Voices

From Meta, a Wikimedia project coordination wiki

Hello!

Starting from the basis of the IPA audio renderer wish and the different text to speech engines, I wanted to understand, from a user-centric point of view, the background on preferences for "human" or "robot" voices. I’ve gathered information from different articles on this, and I’m sharing the key concepts here! Unfortunately, there isn’t a concise answer on if it’s better to use a human or a robot voice but I think this will give us an idea and we can arrive at a conclusion on which type of voice to think of. Sources are listed below.

What information do we have regarding statistics?

For data, we can look at a study carried out in 2019 on whether users prefer human or robotic voices in voice assistants. It was conducted with a panel of 249 testers. As the graph shows, preference for the human voice is much higher than for the synthetic ("robotic") one.[1]

Voice favorability: human vs. synthetic voices

An interesting aspect on human voice preference

Human voices are described as having emotion, while robotic voices lack emotional involvement. This is why some users may prefer human voices over synthetic ones.[2]

The point is that this preference mainly emerges with dynamic voice content, that is, when there is an interaction between the user and the voice, as in a conversation (e.g. Alexa or Siri). It also arises when the goal is to persuade the user: for example, to influence a customer emotionally into buying something, the voice must be able to carry that emotional weight.[3][4]

By contrast, other uses of voice (like an IPA audio renderer, where the voice plays for only a few seconds) involve static voice content: a one-way experience in which we are not having a "conversation" with the user.[5]

The uncanny valley

The "uncanny valley" concept was originally formulated for robots and human-like robots, but much of the hypothesis also applies to the aim of this page.

The concept was identified by Professor Masahiro Mori and refers to the emotional response we have to an object that resembles a human being. The hypothesis argues that when a robot resembles a person without achieving perfection, it produces a feeling of strangeness and rejection in the observer.[6][7][8]

Mori represented his theory graphically by tracing a line of affinity that rises, plummets, and rises again, forming the valley that gives the phenomenon its name.

Uncanny Valley
One might say that the prosthetic hand has achieved a degree of resemblance to the human form, perhaps on a par with false teeth. However, when we realize the hand, which at first sight looked real, is in fact artificial, we experience an eerie sensation. For example, we could be startled during a handshake by its limp boneless grip together with its texture and coldness. When this happens, we lose our sense of affinity, and the hand becomes uncanny. — Masahiro Mori

I think this hypothesis is really interesting for us. If we receive feedback from potential users that the voice of engine X sounds "robotic", the obvious fix would be a voice that sounds very close to human. But that may not be the best idea, because it could fail to produce a "positive" emotional response from the user.

A great example of this is Siri.[9] A UserTesting article on the user experience of voice interaction notes:

Apple has been able to walk a fine line: we can relate to Siri, but she stays just robotic enough that we don’t really think there’s a human woman trapped in our device. We are still talking to an interface, a comforting, yet flat and synthetic robotic interface. If an uncanny valley of audio does exist, Siri falls just far enough on the robotic edge to allow us to be friendly with her. — UX of voice interaction

My two cents on this

I don’t think the engine needs to resemble a human voice 100%, since we wouldn’t be looking for emotional engagement in this wish. It would be useful to test the different engines with users to gauge their emotional response, and to ask for input on the type of voice so that it could potentially be added as a requirement.

References