Date: 2018-04-09

Amazon's Alexa and Microsoft's Cortana debuted in 2014; Google Assistant followed in 2016. IT research firm Gartner predicts that many touch-required tasks on mobile apps will become voice activated within the next several years. The voices of Siri, Alexa and other virtual assistants have become globally ubiquitous. Siri can speak 21 different languages and includes male and female settings. Cortana speaks eight languages, Google Assistant speaks four, Alexa speaks two.

But until fairly recently, voice -- and the ability to form words, sentences and complete thoughts -- was a uniquely human attribute. It's a complex mechanical task, and yet nearly every human is an expert at it. Human response to voice is deeply ingrained, beginning when children hear their mother's voice in the womb.

What constitutes a pleasant voice? A trustworthy voice? A helpful voice? How does human culture influence machines' voices, and how will machines, in turn, influence the humans they serve? We are in the infancy of developing a seamless facsimile of human interaction. But in creating it, developers will face ethical dilemmas. It's becoming increasingly clear that for a machine to seamlessly stand in for a human being, its users must surrender a part of their autonomy to teach it. And those users should understand what they stand to gain from such a surrender and, more importantly, what they stand to lose.

Terri Danz is a vocal coach who was named by entertainment-industry publication Backstage as one of the top eight in the United States. Her clients include singers, news anchors and stand-up comedians wishing to improve their technique and range and to manage their nerves. Among her most high-profile clients are comedian Greg Fitzsimmons and actor Taylor Handley. Danz believes that the voices of current virtual personal assistants (VPAs) lack resonance -- the vocal quality most associated with warmth.

When I asked Danz to listen to three Siri voice samples from three different eras -- iOS 9 (2015), iOS 10 (2016) and iOS 11 (2017) -- she connected their differences to Apple's target audience.

"As the versions progress from iOS 9, the actual pitch of the voice becomes much higher and lighter," said Danz. "By raising the pitch, what people hear in iOS 11 is a more energized, optimistic-sounding voice. It is also a younger sound.

"The higher pitch is less about the woman's voice being commanding and more about creating a warmer, friendlier vocal presence that would appeal to many generations, especially millennials," continued Danz. "With advances in technology, it is becoming easier to adapt quickly to a changing marketplace. Even a few years ago, things we now take for granted in vocal production may not have been developed, used or adopted."

There is research to support Danz's conclusions: The book Wired for Speech: How Voice Activates and Advances the Human–Computer Relationship by Clifford Nass and Scott Brave explores the relationships among technology, gender and authority. When it was published in 2005, Nass was a professor of communications at Stanford University and Brave was a postdoctoral scholar at Stanford. Wired for Speech documents 10 years' worth of research into the psychological and design elements of voice interfaces and the preferences of users who interact with them.

According to their research, men like a male computer voice more than a female computer voice. Women, correspondingly, like a female voice more than a male one.

But regardless of this social identification, Nass and Brave found that both men and women are more likely to follow instructions from a male computer voice, even if a female computer voice relays the same information. This, the authors theorize, is due to learned social behaviors and assumptions.

Elsewhere, the book reports another, similar finding: A "female-voiced computer [is] seen as a better teacher of love and relationships and a worse teacher of technical subjects than a male-voiced computer." Although computers do not have genders, the mere representation of gender is enough to trigger stereotyped assumptions. According to Wired for Speech, a sales company might implement a male or female voice depending on the task.

"While a male voice would be a logical choice for [an] initial sales function, the complaint line might be 'staffed' by a female voice, because women are perceived as more emotionally responsive, people-oriented, understanding, cooperative and kind. However, if the call center has a rigid policy of 'no refunds, no returns,' the interface would benefit from a male voice as females are harshly evaluated when they adopt a position of dominance."

Rebecca Kleinberger, a research assistant and PhD candidate at the MIT Media Lab, added some scientific context to Nass and Brave's findings. Her primary academic interest is voice and what people can learn about themselves by listening to their voice.

"Unlike a piano note, which, when looking at a spectrogram, will be centered around a single main frequency peak, a human voice has a more complex spectrum," Kleinberger said. "Vocal sounds contain several peaks that are called formants, and the position of those formants roughly corresponds to the vowel pronounced. So the human voice might be seen more as playing a chord on the piano rather than a single note. Sometimes, these formants are going to have a musically harmonious relationship between themselves, like a musical chord, and sometimes, they have an inharmonious relationship and the chord sounds 'off' according to the rules of western harmony."

"Interestingly, in the lower frequencies, those formants have a more harmonious relationship than in the higher," Kleinberger continued. "Because of bone conduction, we each individually hear the lower part of our own voice better or louder than the higher parts. This seems to play a role in the fact that most of us dislike hearing our own voice recorded and also why generally we might prefer lower voices to higher voices."

It might also be why Siri's 2013 voice, according to communications-analytics company Quantified Communications, had a pitch that was 21 percent lower than the average woman's -- not only to reflect "masculine" qualities but also to sound acoustically pleasing.

What might we learn from all of this? Users want technology to assist them, not tell them what to do. And a fledgling technology company, eager to gain a foothold in a competitive marketplace, might prefer to play into cultural assumptions -- create a feminine voice that is, ironically, low in pitch -- rather than challenge deeply ingrained biases. It's more expedient to uphold the status quo than to attempt to change it.

Engadget reached out to several technology companies and asked how they determined the voice they use. Amazon was the only company that responded, stating in an email to Engadget, "To select Alexa's voice, we tested several voices and found that this voice was preferred by customers." In an article in Wired, writer David Pierce interviewed Apple executive Alex Acero, who is in charge of Siri's technology. According to the article, the company's designers and user-interface team sifted through hundreds of voices to find the right ones for Siri.

"This part skews more art than science," Pierce writes. "They're listening for some ineffable sense of helpfulness and camaraderie, spunky without being sharp, happy without being cartoonish."

The common retort to concerns about bias and subjectivity is that technology does not determine culture but is merely a reflection of the culture. But in an interview with the Australian Broadcasting Corporation, Miriam Sweeney, a feminist researcher and digital media scholar at the University of Alabama, discusses how digital assistants are often subject to verbal abuse and sexual solicitation. The VPA will respond to this abuse with a moderate, even apologetic tone, regardless of the user's treatment. And when VPAs have feminine voices, which are often programmed to flirt back or respond with sassy repartee, it renders that bad behavior acceptable.

No real human should be subject to this sort of treatment. If developers' quest is to create a relatable, digital stand-in, they may have to imbue their creations with a basic sense of dignity and respect.