Speech Recognition Through the Decades: How We Ended Up With Siri
By Melanie Pinola
Looking back on the development of speech recognition technology is like watching a child grow up, progressing from the baby-talk level of recognizing single syllables, to building a vocabulary of thousands of words, to answering questions with quick, witty replies, as Apple’s supersmart virtual assistant Siri does.
Listening to Siri, with its slightly snarky sense of humor, made us wonder how far speech recognition has come over the years. Here’s a look at the developments in past decades that have made it possible for people to control devices using only their voice.
1950s and 1960s: Baby Talk
The first speech recognition systems could understand only digits. (Given the complexity of human language, it makes sense that inventors and engineers first focused on numbers.) In 1952, Bell Laboratories designed the “Audrey” system, which recognized digits spoken by a single voice. Ten years later, at the 1962 World’s Fair, IBM demonstrated its “Shoebox” machine, which could understand 16 words spoken in English.
Labs in the United States, Japan, England, and the Soviet Union developed other hardware dedicated to recognizing spoken sounds, expanding speech recognition technology to support four vowels and nine consonants.
They may not sound like much, but these first efforts were an impressive start, especially when you consider how primitive computers themselves were at the time.
1970s: Speech Recognition Takes Off
Speech recognition technology made major strides in the 1970s, thanks to interest and funding from the U.S. Department of Defense. The DoD’s DARPA Speech Understanding Research (SUR) program, from 1971 to 1976, was one of the largest of its kind in the history of speech recognition, and among other things it was responsible for Carnegie Mellon’s “Harpy” speech-understanding system. Harpy could understand 1011 words, approximately the vocabulary of an average three-year-old.
Harpy was significant because it introduced a more efficient search approach, called beam search, to “prove the finite-state network of possible sentences,” according to Readings in Speech Recognition by Alex Waibel and Kai-Fu Lee. (The story of speech recognition is very much tied to advances in search methodology and technology, as Google’s entrance into speech recognition on mobile devices proved just a few years ago.)
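To make the idea of beam search concrete, here is a minimal sketch in Python. It is not Harpy’s actual implementation: the tiny network, the word scores, and the names NETWORK, OBSERVATIONS, and beam_search are all made up for illustration. The point is only that at each step the search keeps the few most promising partial sentences and drops the rest, instead of exploring every path through the network.

```python
import heapq
import math

# Hypothetical finite-state network of possible sentences:
# each state maps to the (word, next_state) pairs allowed from it.
NETWORK = {
    "start": [("call", "verb"), ("tell", "verb")],
    "verb": [("home", "end"), ("tom", "end")],
    "end": [],
}

# Hypothetical acoustic scores: for each time step, the probability
# that the sound heard at that step was a given word.
OBSERVATIONS = [
    {"call": 0.6, "tell": 0.4},
    {"home": 0.3, "tom": 0.7},
]


def beam_search(network, observations, start="start", beam_width=2):
    """Return the surviving hypotheses as (log_prob, state, words), best first."""
    beam = [(0.0, start, [])]
    for scores in observations:
        candidates = []
        for log_prob, state, words in beam:
            for word, next_state in network[state]:
                p = scores.get(word, 0.0)
                if p > 0.0:  # skip words the acoustic scores rule out
                    candidates.append(
                        (log_prob + math.log(p), next_state, words + [word])
                    )
        # Pruning step: keep only the beam_width best partial sentences.
        beam = heapq.nlargest(beam_width, candidates, key=lambda c: c[0])
    return beam


best_log_prob, final_state, best_words = beam_search(NETWORK, OBSERVATIONS)[0]
print(" ".join(best_words))
```

A wider beam explores more of the network at higher cost; a narrower one is faster but can prune away the sentence that would have scored best in the end.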
The ’70s also marked a few other important milestones in speech recognition technology, including the founding of the first commercial speech recognition company, Threshold Technology, as well as Bell Laboratories’ introduction of a system that could interpret multiple people’s voices.
1980s: Speech Recognition Turns Toward Prediction
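Over the next decade, thanks to new approaches to understanding what people say, speech recognition vocabulary jumped from a few hundred words to several thousand words, and had the potential to recognize an unlimited number of words. One major reason was a new statistical method known as the hidden Markov model. Rather than matching sounds against fixed word templates, a hidden Markov model estimates the probability that an unknown sequence of sounds corresponds to a given word.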
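For readers curious how a hidden Markov model makes its guess, the sketch below runs the standard Viterbi algorithm on a toy model in Python. Everything specific here is an assumption for illustration: the two hidden states (“consonant” and “vowel”), the two observable sound features (“quiet” and “loud”), and all the probabilities are invented, and real recognizers of the era used far larger models. The function simply finds the most probable sequence of hidden states behind a sequence of observed sounds.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (probability, path) for the most likely hidden-state sequence."""
    # best[s] holds the most probable path that ends in state s so far.
    best = {
        s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states
    }
    for obs in observations[1:]:
        new_best = {}
        for s in states:
            # Pick the best previous state to transition from into s.
            prob, path = max(
                ((best[prev][0] * trans_p[prev][s], best[prev][1])
                 for prev in states),
                key=lambda c: c[0],
            )
            new_best[s] = (prob * emit_p[s][obs], path + [s])
        best = new_best
    return max(best.values(), key=lambda c: c[0])


# Toy model with made-up numbers.
STATES = ["consonant", "vowel"]
START_P = {"consonant": 0.6, "vowel": 0.4}
TRANS_P = {
    "consonant": {"consonant": 0.3, "vowel": 0.7},
    "vowel": {"consonant": 0.6, "vowel": 0.4},
}
EMIT_P = {
    "consonant": {"quiet": 0.8, "loud": 0.2},
    "vowel": {"quiet": 0.1, "loud": 0.9},
}

probability, path = viterbi(
    ["quiet", "loud", "quiet"], STATES, START_P, TRANS_P, EMIT_P
)
print(path, probability)
```

The key shift from earlier template matching is visible in the return value: the model reports a probability alongside its answer, so competing interpretations can be ranked rather than simply accepted or rejected.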
Equipped with this expanded vocabulary, speech recognition started to work its way into commercial applications for business and specialized industry (for instance, medical use). It even entered the home, in the form of Worlds of Wonder’s Julie doll (1987), which children could train to respond to their voice. (“Finally, the doll that understands you.”)
However, whether speech recognition software at the time could recognize 1000 words, as the 1985 Kurzweil speech-to-text program did, or whether it could support a 5000-word vocabulary, as IBM’s system did, a significant hurdle remained: These programs took discrete dictation, so you had … to … pause … after … each … and … every … word.
1990s: Automatic Speech Recognition Comes to the Masses
In the ’90s, computers with faster processors finally arrived, and speech recognition software became viable for ordinary people.
In 1990, Dragon launched the first consumer speech recognition product, Dragon Dictate, for an incredible price of $9000. Seven years later, the much-improved Dragon NaturallySpeaking arrived. The application recognized continuous speech, so you could speak, well, naturally, at about 100 words per minute. However, you had to train the program for 45 minutes, and it was still expensive at $695.
The first voice portal, VAL from BellSouth, arrived in 1996. VAL was a dial-in interactive voice recognition system that was supposed to give you information based on what you said on the phone. It paved the way for all the inaccurate voice-activated menus that would plague callers for the next 15 years and beyond.
2000s: Speech Recognition Plateaus–Until Google Comes Along
By 2001, computer speech recognition had topped out at 80 percent accuracy, and, near the end of the decade, the technology’s progress seemed to be stalled. Recognition systems did well when the language universe was limited–but they were still “guessing,” with the assistance of statistical models, among similar-sounding words, and the known language universe continued to grow as the Internet grew.
Did you know speech recognition and voice commands were built into Windows Vista and Mac OS X? Many computer users weren’t aware that those features existed. Windows Speech Recognition and OS X’s voice commands were interesting, but not as accurate or as easy to use as a plain old keyboard and mouse.
Speech recognition technology development began to edge back into the forefront with one major event: the arrival of the Google Voice Search app for the iPhone. The impact of Google’s app is significant for two reasons. First, cell phones and other mobile devices are ideal vehicles for speech recognition, as the desire to replace their tiny on-screen keyboards serves as an incentive to develop better, alternative input methods. Second, Google had the ability to offload the processing for its app to its cloud data centers, harnessing all that computing power to perform the large-scale data analysis necessary to make matches between the user’s words and the enormous number of human-speech examples it gathered.
In short, the bottleneck with speech recognition has always been the availability of data, and the ability to process it efficiently. Google’s app adds the data from billions of search queries to its analysis, so it can better predict what you’re probably saying.
In 2010, Google added “personalized recognition” to Voice Search on Android phones, so that the software could record users’ voice searches and produce a more accurate speech model. The company also added Voice Search to its Chrome browser in mid-2011. Remember how we started with 10 to 100 words, and then graduated to a few thousand? Google’s English Voice Search system now incorporates 230 billion words from actual user queries.
And now along comes Siri. Like Google’s Voice Search, Siri relies on cloud-based processing. It draws on what it knows about you to generate a contextual reply, and it responds to your voice input with personality. (As my PCWorld colleague David Daw points out: “It’s not just fun but funny. When you ask Siri the meaning of life, it tells you ’42’ or ‘All evidence to date points to chocolate.’ If you tell it you want to hide a body, it helpfully volunteers nearby dumps and metal foundries.”)
Speech recognition has gone from utility to entertainment. The child seems all grown up.
The Future: Accurate, Ubiquitous Speech
The explosion of voice recognition apps indicates that speech recognition’s time has come, and that you can expect plenty more apps in the future. These apps will not only let you control your PC by voice or convert voice to text–they’ll also support multiple languages, offer assorted speaker voices for you to choose from, and integrate into every part of your mobile devices (that is, they’ll overcome Siri’s shortcomings).
As everyone starts becoming more comfortable speaking aloud to their mobile gadgets, speech recognition technology will likely spill over into other types of devices. It isn’t hard to imagine a near future when we’ll be commanding our coffee makers, talking to our printers, and telling the lights to turn themselves off.