Anyone who has given a public speech knows that voices change depending on the environment you're in.
In an auditorium, for instance, sweat collects. Muscles in the shoulders, neck and throat tighten. And much of the resulting physical pressure goes to the throat's vocal folds, which bear increased tension and vibrate at a faster rate. That's why so many people sound strained and high-pitched when speaking to a crowd. Combine this with irregular, quickened breathing, which can cause a voice to shake or crack, and even the most practiced orator can fall victim to nerves.
In the course of her research, Kleinberger has also observed that your voice -- the musicality, the tempo, the accent and especially the pitch -- changes depending on the person you're talking to. Kleinberger notes that when women are in a professional setting, they typically use a lower voice than when they speak to their friends.
These variables are ingrained in the human experience, because reproducing sound is an inverse process that begins with mimicry. There are many ways, for example, to shape one's mouth to make the "ma ma" sound, and positive reinforcement from parents and peers will shape people's vocal techniques from a young age.
A major part of VPAs' appeal is that they replicate human interaction -- they respond to their users with predetermined jokes, apparently offhand remarks and short, verbal affirmations. Yet unlike humans, they are unerringly consistent in the way they sound.
Wired for Speech co-author Scott Brave, who is now CTO of FullContact (which helps businesses analyze their customers for marketing and data purposes), expressed enthusiasm for a more "neurological" layer of insight -- insight he did not have when he and Nass were conducting their experiments. He also discussed what surprised him most while writing Wired for Speech.
"One of the studies I was involved in [years ago] was related to emotions in cars, and what was the 'right' emotion for a car to represent as a co-pilot," said Brave, who earned a PhD in human-computer interaction at Stanford. "It turns out that matching the user's emotions is more important that what the emotion is. It makes the user think, 'Hey, this entity is responding to me.'"
"If a person isn't feeling calm, is it always going to be the case that a calming [computer] voice will be the most effective?" asked Brave. "The best way to get someone to change his or her state is to first match that state emotionally and then bring that person to a place that's soothing."
Perhaps trying to pinpoint a perfect voice was the wrong question all along. The future is no longer about developing a single ideal voice that appeals to the widest audience possible. That's just a stopgap measure on the path toward the real goal: to create a voice that, like ours, changes in response to the human beings around it.
"Technologies have individuality of voice, but they lack prose of voice."
"Ideally, the machine acknowledges context to what is being said," said Brave. "Because the needs of a user get expressed over the course of a conversation. A user cannot always express what he wants in a few words.
"Some of that context is linguistic: What does a person mean when he says a particular word? Some of that is emotional, and some of that is historical," said Brave. "There are many types of context. And our current systems are aware of very few."
Kleinberger agreed with this sentiment.
"[When technologies speak to us currently], they're doing so in voices that are uncanny, still slightly robotic and non-contextual," said Kleinberger. "Technologies have individuality of voice, but they lack diversity and responsivity of the prosody, vocal posture and authenticity. An individual's prosody changes all the time and is very responsive to the context."
Today, technology can pick out a voice's subtleties to a specific, perhaps discomforting degree. Hormone levels, for example, can affect the texture of a person's voice.
"Our voice reveals a lot about our physical health and mental state," Kleinberger said. "Changes in tempo in sentences can be used as a marker of depression, breathiness in the voice can be an indicator of heart or lung disease, and acoustic information about the nonlinearity of air turbulences could even predict early stages of Parkinson's disease.
"Smart home devices are listening to us all the time, and soon, they might be able to detect those physical and mental conditions, and as the voice is also very dependent on our hormone levels, even one day detect if someone is pregnant before the mother knows it," Kleinberger continued.
An always-on AI can act as a fly on the wall. It can extract metadata as it listens to partners and family members talk to one another. It can detect the social dynamics among people solely from the acoustic information it gathers. What it cannot do, as of now, is explicitly act upon this information. It will not change its phrasing to match a person's unstated preference; it will not raise or lower its pitch depending upon who is requesting its help -- yet. Kleinberger believes we may be as few as five years from this.
Could a personal assistant someday listen to its users, detect stress, suss out power imbalances in relationships and match its voice, phrasing and tempo accordingly? If so, the "ideal" voice is specific to each person, and like a human's voice, it should adjust itself in real time throughout the day.
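Matching a user's voice in real time starts with measuring it, and the most basic measurement is the fundamental frequency, i.e. the pitch. As a minimal sketch of that first step -- assuming nothing about how any actual assistant works, with the function name and parameters invented here -- pitch can be estimated from raw audio with a classic autocorrelation method:

```python
import numpy as np

def estimate_pitch(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate fundamental frequency (Hz) via autocorrelation."""
    signal = signal - np.mean(signal)
    corr = np.correlate(signal, signal, mode="full")
    corr = corr[len(corr) // 2:]  # keep non-negative lags only
    # Search only lags corresponding to plausible human pitch (fmin..fmax).
    lag_min = int(sample_rate / fmax)
    lag_max = int(sample_rate / fmin)
    best_lag = lag_min + int(np.argmax(corr[lag_min:lag_max]))
    return sample_rate / best_lag

sr = 16_000
t = np.arange(sr) / sr  # one second of audio
voice = np.sin(2 * np.pi * 220.0 * t)  # stand-in for a ~220 Hz voice
print(estimate_pitch(voice, sr))  # ≈ 220 Hz
```

An assistant that tracked this number over a conversation could, in principle, shift its own synthesized pitch toward it -- the unconscious vocal convergence Kleinberger describes, done deliberately in software.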
If this is successfully implemented, it has enormous societal potential. Imagine an AI that adapts to its user's manner of speaking -- that raises its voice or reacts sharply in response to its user's tone, not just the content of her speech.
"Could Siri mimic the voice of the user to be more likable? Absolutely. We humans do that all the time unconsciously, adapting our vocal timbre to the people we talk to."
There is an uncanny, morally gray area that comes with this territory. The goal of many developers is to create a seamless illusion of sentience, yet if users are being monitored and judged beyond their control or consent, the technology can easily be read as insidious or manipulative. Kleinberger mentioned Microsoft's infamous Clippy as a cautionary example of how users want to be catered to but not intruded upon uninvited.
"There are many tangible benefits from collecting data from the voice, but I believe that creating a 'truly caring dialogue' between Siri and a user is not one of them," said Kleinberger. "But could Siri mimic the voice of the user to be more likable? Absolutely. We humans do that all the time unconsciously, adapting our vocal timbre to the people we talk to.
"It would be great," Kleinberger concluded, "as long as the whole process and the data are transparent for and controlled by the user."
On the issue of privacy, Apple is more discreet than competitors like Google or Facebook. Rather than pulling personal data off its servers to customize its assistant, Apple emphasizes the less intrusive power of on-device machine learning and AI.
But Apple's competitors already have that, and recently, Siri has fallen behind other VPAs with regard to its overall ability and diversity of features. It's become apparent that the more personal information a user surrenders, the more the VPA can learn from it, and the better the VPA can serve the user.
Human-to-human relationships, after all, require openness and transparency. Perhaps for humans to create that sort of dialogue with technology -- whether through voice or another avenue -- they must be similarly open and transparent to the technology. And a trusting relationship cuts both ways: Companies need to be more explicit and aboveboard about the type of data they collect from consumers and how that data is used.
How much of yourself are you willing to sacrifice in search of symbiotic perfection?