Cloud Drives Speech Recognition Forward for Microsoft
For years, using voice recognition technology on phones or other devices has been a novelty -- something people try once but never again, usually because it works so poorly. But recent developments, including harnessing the computational power of the cloud, have made it more usable and will make it even better in the near future, according to Microsoft.
Of all the services Microsoft hosts, speech recognition uses one of the largest cloud systems the company has, said Zig Serafin, general manager for speech at Microsoft. It includes the voice response systems used by the customer-service phone lines of large companies like Orbitz and American Airlines, as well as the technology that lets mobile Bing users search by voice and Ford Sync users ask for directions.
Microsoft got into the field when it acquired Tellme in 2007. Voice recognition had already been around for years, but it didn't work very well.
"Even just standing in a quiet room back in the day trying to use some of the embedded software on a mobile phone was just painful," said Will Stofega, an analyst at IDC.
But the technology has improved enough that 20 percent of all mobile searches handled by Microsoft now come in by voice, the company said.
Microsoft uses the cloud to collect information about how people use the service as a way to improve it. For instance, if a user speaks "Italian restaurant Seattle" into Bing on a Windows Phone 7 device, Microsoft knows whether the user then clicks on a result, presumably getting the answer they wanted. If, instead, the user speaks the search query a few more times, that indicates Microsoft probably didn't recognize the speech correctly. Microsoft also collects information about phone connectivity, in case a poor connection is partly to blame for bad results.
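Microsoft hasn't published how it processes these signals; as a purely illustrative sketch, the click-versus-retry heuristic described above might look something like this (all names and thresholds here are invented for illustration):

```python
from dataclasses import dataclass

@dataclass
class VoiceSession:
    """One voice-search session: the recognized query plus what the user did next."""
    query: str
    clicked_result: bool   # user tapped a result after the query
    retries: int           # times the user re-spoke a similar query
    connection_ok: bool    # rough signal of network quality during the session

def label_session(session: VoiceSession) -> str:
    """Heuristically label a session for a feedback loop.

    A click suggests the recognizer got it right; repeated attempts with no
    click suggest a misrecognition -- unless the connection itself was bad,
    in which case the network, not the recognizer, may be at fault.
    """
    if session.clicked_result:
        return "likely_correct"
    if not session.connection_ok:
        return "connectivity_suspect"
    if session.retries >= 2:
        return "likely_misrecognized"
    return "unknown"
```

Labels like these could then feed the kind of automated and expert review the article describes, flagging which recordings are worth a closer look.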
"That [data] becomes valuable to help improve the underlying science on the system," Serafin said.
Google, which also lets users search by voice and has other offerings that use voice recognition, similarly uses back-end processing to learn from the way that people use the services.
At Microsoft, because the same back-end system handles speech recognition in multiple products, the company is now processing about 11 billion speech requests in a year, it said. On its new Windows Phone 7 devices, users hold down the home button to launch the speech feature, which can be used to control many applications on the phones.
Microsoft sifts through that massive volume of data from a network operations center in Silicon Valley. "It's fascinating to see the number of requests coming in," Serafin said. "It's like walking into a mini version of NASA."
Some elements of the feedback loop are automated, so that the speech recognition engine itself can parse the data, he said. Other data is examined more closely by experts, who might then make changes to the system.
The ability to learn from a large volume of users is one factor that will enable Microsoft's vision for the next step in speech recognition, something the company calls conversational understanding. "It's the idea that we're taking advantage of the underlying research and development work with speech and connecting that with machine learning technologies and natural learning to better infer what a user is trying to do," Serafin said.
Conversational understanding will be able to draw from multiple applications, said Ilya Bukshteyn, senior director of marketing for Microsoft's speech business. For example, a user could speak into Bing: "Find somewhere for Zig and I to have dinner tomorrow night," Bukshteyn said. The phone would then automatically check Serafin's and Bukshteyn's calendars to discover that they'll be in San Francisco. The system would also know that the two have eaten sushi before. The phone would then query Bukshteyn, asking if he'll be eating in San Francisco and then if he'd like sushi.
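Microsoft has not described how such a system would be built; the multi-step inference in that scenario can be sketched roughly as follows, with every data structure and lookup invented for illustration:

```python
def plan_dinner(people, calendars, dining_history):
    """Turn a request like "dinner for Zig and me tomorrow night" into
    confirmable suggestions, as in the article's scenario.

    people          -- list of attendee names
    calendars       -- per-person dict with a "tomorrow_city" entry
    dining_history  -- per-person list of cuisines they have eaten
    """
    # Step 1: find where everyone will be, from their calendars.
    cities = {calendars[p]["tomorrow_city"] for p in people}
    if len(cities) != 1:
        return None  # attendees are in different cities; give up
    city = cities.pop()

    # Step 2: find a cuisine the whole group has eaten before.
    shared = set(dining_history[people[0]])
    for p in people[1:]:
        shared &= set(dining_history[p])
    cuisine = sorted(shared)[0] if shared else None

    # Step 3: rather than acting unilaterally, return questions
    # for the user to confirm, as the article describes.
    return {
        "confirm_city": f"Will you be eating in {city}?",
        "confirm_cuisine": f"Would you like {cuisine}?" if cuisine else None,
    }
```

The point of the sketch is the chain of inferences (calendar, then shared history, then a confirming question), not any particular implementation.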
In the meantime, Microsoft is hoping to stay ahead of Google, a prominent competitor in the space, Serafin said. Microsoft believes it is ahead of Google because it is already offering speech recognition, based on the same platform, to a wide array of users, including gamers, phone users and drivers.
That is one definite advantage, said Bern Elliot, an analyst with Gartner. "They have this incredible spread from on premise to in the cloud with Tellme," he said. "They have the ability to deliver speech into a whole bunch of markets."
In addition, Microsoft thinks that it has a leg up on doing the kind of processing that lets the system query users for more details or information.
Google recently bought a company called Phonetic Arts that might let it offer similar capabilities. Phonetic Arts is a speech synthesis company that can generate natural-sounding computer speech, Google said. In a statement about the acquisition, Google said that the company could help it deliver voice output, or spoken responses to people who are using voice recognition technologies.
Stofega said Microsoft also may have an advantage in the progress it has made with the user experience. On Windows Phone 7 handsets, users see a Tellme icon and a row of dots showing that the service is processing speech. "It's not a technology thing but from an experience perspective, it's cool," he said.
Microsoft hopes to use similar icons and branding across services so that users recognize that whether they are using the Kinect or Windows Phone 7, they can use speech in a similar way, it said.
Both Google and Microsoft also have to compete with Nuance, the leader in a small field of speech-recognition technology developers. Nuance has a reputation for having the best speech technology out there, Elliot said. Some rumors have suggested Apple might be interested in buying Nuance, which would add one more field in which Apple, Google and Microsoft compete.
While the companies have all made advances that have improved speech recognition, they still have work to do. "There are other core problems around background noise and other things that haven't really been solved," Stofega said.