Speech Recognition Systems Must Get Smarter, Professor Says
Those who loathe talking on the phone to automated speech recognition systems may take solace in the fact that scientists are working to make such systems more lifelike and less annoying to use.
"From consumer experience, people find these systems very frustrating," said James Allen, who is the chairman of computer science at University of Rochester, speaking before the SpeechTEK conference 2010, held in New York this week.
Most computerized speech recognition systems can understand what a human says up to 98 percent of the time, and yet people still chafe at using automated phone help-desk systems. The key to making these systems less frustrating is to give them a deeper understanding of language and to make them more interactive, Allen said.
By now, the customer service departments of most large organizations offer automated phone-based help systems. A user calls the help number and an artificial voice asks the caller a series of questions. Most of these systems are based on frameworks that are basically large decision trees. With such systems, "you don't find out what the person wants, you are following a script," he said.
The systems are actually a composite of a number of different technologies. One is speech recognition, the ability of a computer to understand, or successfully translate into text, what the speaker is saying.
The other technology, natural language processing (NLP), attempts to convert the speaker's message either into a command that the computer can execute or into a summary for a human operator.
Great strides have been made in both speech recognition and NLP over the past few decades, but they seem to have brought their users mostly frustration. "I only call the bank when I've got a problem and battle these systems. [I ask] what I can answer to get through to a person as fast as possible," Allen said.
Allen's academic research work has been in finding ways that "we can talk to a machine the same way we can talk to a person," he said.
Conversations between two people can be precise in ways computers have difficulty matching. Allen pointed to some early work he did as a graduate student, in which he recorded conversations at a train station information desk. In one interaction, a passenger walks up to the booth and says "8:50 to Windsor," and the attendant answers "Gate 10, 20 minutes late." While the attendant knew exactly what information the inquirer sought, computerized systems would find the passenger's first statement befuddling.
The way Allen sees it, two elements are missing from the modern systems: The ability to analyze what the speaker is saying and the ability to converse with the speaker to learn more about what the speaker intends to say.
"Lots of off-the-shelf NLP tends to be shallow. We don't have technology that gives you a meaning of the sentences," he said. Statistical processing tools and word definition service such as WordNet can help define a word but also the relations of a word, so a system will know that, for instance, a "subsidiary" is a part of a "company."
More two-way communication between users and computers is also needed. When talking about their needs, people may provide information in no particular order. It should be up to the computer to piece together this information and not burden the user with questions whose answers have already been provided.
"This is the future, this is really what you want systems to do, and can we build dialog systems that can support this range of complexity," he said.
To illustrate this idea, Allen and a team of researchers designed a program called Cardiac that could mimic the questions a nurse would ask a patient with heart disease. The program was created with funding from the U.S. National Institutes of Health. With this system, once a user supplies information, the system would not ask for it again, Allen said. The system would reason about what material was already provided and what was still needed.
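The idea of never asking for something the user has already said can be shown with a minimal slot-filling sketch. It is not how Cardiac works; the article gives no implementation details. The slot names, prompts, and regular expressions below are hypothetical, and the example borrows the train-station exchange from earlier rather than inventing medical questions.

```python
import re

# Hypothetical slots and prompts, chosen for illustration only.
PROMPTS = {
    "destination": "Where are you traveling to?",
    "time": "What time does your train leave?",
}

# Crude pattern matching stands in for real language understanding.
EXTRACTORS = {
    "destination": re.compile(r"\bto ([A-Z][a-z]+)"),
    "time": re.compile(r"\b(\d{1,2}:\d{2})\b"),
}


class DialogState:
    def __init__(self):
        self.slots = {}

    def absorb(self, utterance):
        """Fill any still-empty slot that this utterance happens to answer."""
        for name, pattern in EXTRACTORS.items():
            if name not in self.slots:
                match = pattern.search(utterance)
                if match:
                    self.slots[name] = match.group(1)

    def next_question(self):
        """Return the prompt for the first missing slot, or None if none remain."""
        for name, prompt in PROMPTS.items():
            if name not in self.slots:
                return prompt
        return None


# One terse utterance answers both questions, so nothing is left to ask.
state = DialogState()
state.absorb("8:50 to Windsor")
print(state.slots)
print(state.next_question())

# A partial answer leaves only the missing piece to be requested.
partial = DialogState()
partial.absorb("I need to get to Windsor")
print(partial.next_question())
```

The design point is that the system tracks what it knows, not where it is in a script, so the order in which the user volunteers information does not matter.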
Another program designed by Allen and his team, called Plow, can learn how to carry out common tasks on a computer. "This is a system that allows you to essentially use dialog to train your system how to do things for you," he said.
As an example, Allen demonstrated the program learning how to find nearby restaurants using a browser. The user would open a browser, navigate to a restaurant locator site, enter the type of restaurant sought and the location, and then cut and paste the results into a blank page. The user described each step as it was carried out.
In the process, Plow would record each step and respond audibly when it understood the step. Later, when the user wanted to look up another restaurant, the program would go through all the same moves, producing another list of restaurants automatically. The U.S. Defense Advanced Research Projects Agency funded the development of this program.
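The record-then-replay pattern in this demonstration can be sketched roughly as follows. This is not Plow's design, which the article does not describe; the `Step` and `Procedure` classes, the action names, and the example URL are all hypothetical. The sketch assumes the parts of a task that change between runs (restaurant type, location) are stored as named placeholders and filled in at replay time.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str      # e.g. "open", "type", "copy" (hypothetical action names)
    target: str      # the page or field the action applies to
    value: str = ""  # literal text, or a "{name}" placeholder


@dataclass
class Procedure:
    name: str
    steps: list = field(default_factory=list)

    def record(self, action, target, value=""):
        """Store one demonstrated step and return a spoken-style confirmation."""
        self.steps.append(Step(action, target, value))
        return f"OK, recorded: {action} {target}"

    def replay(self, **params):
        """Return the recorded steps with placeholders filled from params."""
        return [(s.action, s.target, s.value.format(**params)) for s in self.steps]


# Teaching phase: the user performs and describes each step once.
find_restaurants = Procedure("find_restaurants")
find_restaurants.record("open", "browser", "https://restaurant-locator.example")
find_restaurants.record("type", "cuisine box", "{cuisine}")
find_restaurants.record("type", "location box", "{location}")
print(find_restaurants.record("copy", "results list"))

# Later: the same moves are repeated with new inputs.
for step in find_restaurants.replay(cuisine="Thai", location="Rochester"):
    print(step)
```

A real system would have to work out from the user's spoken description which values are parameters; here that decision is simply written in by hand.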
More data is the key to more human-like language processing systems, agreed Microsoft chief scientist for speech Larry Heck, in another talk at the conference. "If you don't have the data, it doesn't matter how sophisticated your algorithms are," he said.
One place to find more data would be in search engine queries, he suggested. Search engine services get massive numbers of queries, all of which get linked to answers. "I view search as a close cousin to language processing technology," Heck said.
These days, people are trained to structure their queries as a set of keywords. If users were instead to type in full sentences describing what they need, the resulting data set could go a long way toward helping systems better understand what people are looking for.
Heck predicted that as more people use voice-activated search services from Microsoft and Google, they will become more accustomed to structuring their queries as full sentences, which over time could help NLP systems better anticipate user needs.