How Siri gained its voice

Mel Martin

The Verge has a terrific article about voice synthesis and speech recognition that gives some interesting insights into how Siri and other digital voice assistants work.

Although Apple has not officially acknowledged it, Siri is based on technology from Nuance, the company behind Dragon Dictate for Mac. Nuance also offers the free Dragon Dictation and Dragon Go! apps for iOS. Nuance licenses speech-recognition and voice-synthesis technology to many software companies, and has made some dramatic breakthroughs that are being used extensively in the medical field.

While Siri's voice isn't quite as good as the HAL 9000's in the movie 2001: A Space Odyssey, it is getting close. In iOS 7, you can choose a male or female voice, and Apple has added more languages. Most voice synthesis starts with a human reading sounds, which are then recorded by a computer. It's not a matter of reading every possible word, but of building a catalog of sounds, called phonemes, that can be used to construct new words. If you have used one of the Dragon Dictation products, you have seen the process in reverse. You read a story into the computer, but the computer is learning not just the words in that story, but the key components of speech that can be used to understand words not in the story. It's complex and requires intensive processing. With a product like Dragon Dictate, your computer does the processing. With Siri and other smartphone assistants, like Google Search, the computing is done not on your device but on powerful servers in the cloud.
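To make the phoneme-catalog idea concrete, here is a toy sketch in Python. It is not how Siri or Nuance's software actually works; the catalog, the lexicon, the phoneme symbols (loosely ARPAbet-style) and the function names are all assumptions made up for illustration, and plain strings stand in for recorded audio clips.

```python
# Hypothetical catalog: each phoneme maps to one recorded sound unit.
# Strings stand in for audio clips in this sketch.
PHONEME_CATALOG = {
    "S": "<s>", "IH": "<ih>", "R": "<r>", "IY": "<iy>",
    "HH": "<hh>", "EH": "<eh>", "L": "<l>", "OW": "<ow>",
}

# Hypothetical pronunciation lexicon: word -> sequence of phonemes.
LEXICON = {
    "siri": ["S", "IH", "R", "IY"],
    "hello": ["HH", "EH", "L", "OW"],
}


def synthesize_phonemes(phonemes):
    """Look up the recorded unit for each phoneme, in order."""
    return [PHONEME_CATALOG[p] for p in phonemes]


def synthesize_word(word):
    """Build a word's 'audio' by chaining its phoneme units."""
    return synthesize_phonemes(LEXICON[word.lower()])


# A word nobody recorded can still be assembled from existing units,
# as long as its phonemes are already in the catalog.
silly = synthesize_phonemes(["S", "IH", "L", "IY"])
```

The point of the sketch is the last line: "silly" was never read aloud, yet it can be built from sounds captured while recording other words. Real systems work with far larger inventories and smooth the joins between units.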

To keep speech from sounding robotic, computer voices now have inflection, rising at the end of sentences where appropriate and following a set of rules so the style and tone of speech match the context. It isn't perfect, but Siri sounds a lot better than the computer voices of 10 years ago, and Siri does sound more natural in iOS 7.
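A rule for inflection can be as simple as looking at how a sentence ends. The Python sketch below is a deliberately crude illustration of that kind of rule, not the rule set any real product uses; the function name and the two contour labels are assumptions.

```python
def pitch_contour(sentence):
    """Pick an end-of-sentence pitch contour from punctuation alone.

    Toy rule: questions rise, everything else falls. Real systems
    weigh much more context than the final character.
    """
    text = sentence.strip()
    if text.endswith("?"):
        return "rising"
    return "falling"


contours = [pitch_contour(s) for s in ("Is it raining?", "It is raining.")]
```

A production voice would layer many such rules, and would also consider word stress, phrase boundaries and the type of question being asked.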

The next few years are likely to show even more progress. Better recognition, more realistic voices and faster processing are coming quickly. I find Siri a bit half-baked at times, with server time-outs or bafflingly inaccurate recognition. Still, a feature like Siri was unthinkable on a phone just a few short years ago, and the best is yet to come.