In pursuit of the perfect AI voice

From Engadget - April 9, 2018

Amazon's Alexa and Microsoft's Cortana debuted in 2014; Google Assistant followed in 2016. IT research firm Gartner predicts that many tasks now performed by touch in mobile apps will become voice-activated within the next several years. The voices of Siri, Alexa and other virtual assistants have become globally ubiquitous. Siri can speak 21 different languages and includes male and female settings. Cortana speaks eight languages, Google Assistant speaks four and Alexa speaks two.

But until fairly recently, voice -- and the ability to form words, sentences and complete thoughts -- was a uniquely human attribute. It's a complex mechanical task, and yet nearly every human is an expert at it. Human response to voice is deeply ingrained, beginning when children hear their mother's voice in the womb.

What constitutes a pleasant voice? A trustworthy voice? A helpful voice? How does human culture influence machines' voices, and how will machines, in turn, influence the humans they serve? We are in the infancy of developing a seamless facsimile of human interaction. But in creating it, developers will face ethical dilemmas. It's becoming increasingly clear that for a machine to seamlessly stand in for a human being, its users must surrender a part of their autonomy to teach it. And those users should understand what they stand to gain from such a surrender and, more importantly, what they stand to lose.

When I asked Danz to listen to three Siri voice samples from three different eras -- iOS 9 (2015), iOS 10 (2016) and iOS 11 (2017) -- she connected their differences to Apple's target audience.

"As the versions progress from iOS 9, the actual pitch of the voice becomes much higher and lighter," said Danz. "By raising the pitch, what people hear in iOS 11 is a more energized, optimistic-sounding voice. It is also a younger sound.

"The higher pitch is less about the woman's voice being commanding and more about creating a warmer, friendlier vocal presence that would appeal to many generations, especially millennials," continued Danz. "With advances in technology, it is becoming easier to adapt quickly to a changing marketplace. Even a few years ago, things we now take for granted in vocal production may not have been developed, used or adopted."

There is research to support Danz's conclusions: The book Wired for Speech: How Voice Activates and Advances the Human-Computer Relationship by Clifford Nass and Scott Brave explores the relationships among technology, gender and authority. When it was published in 2005, Nass was a professor of communications at Stanford University and Brave was a postdoctoral scholar at Stanford. Wired for Speech documents 10 years' worth of research into the psychological and design elements of voice interfaces and the preferences of users who interact with them.

According to their research, men like a male computer voice more than a female computer voice. Women, correspondingly, like a female voice more than a male one.

But regardless of this social identification, Nass and Brave found that both men and women are more likely to follow instructions from a male computer voice, even if a female computer voice relays the same information. This, the authors theorize, is due to learned social behaviors and assumptions.

Elsewhere, the book reports another, similar finding: A "female-voiced computer [is] seen as a better teacher of love and relationships and a worse teacher of technical subjects than a male-voiced computer." Although computers do not have genders, the mere representation of gender is enough to trigger stereotyped assumptions. According to Wired for Speech, a sales company might implement a male or female voice depending on the task.
