This story was originally published Oct. 7, 2016, and updated May 10, 2017, with new information.
Windows has a feature it doesn’t like to talk about. While the OS lets you scrawl notes with a stylus, log in with your face (or secure the Web) via Windows Hello, and even order Cortana to set a reminder, what it’s not so eager for you to do, apparently, is use its speech recognition engine to issue commands or take voice dictation.
The reason for its silence may go back 10 years, to when Microsoft product manager Shanen Boettcher demonstrated voice dictation inside Windows Vista—and flubbed it. The technology kept a low profile after that, and today, few users know you can dictate a document within Windows.
If there were ever a time for Windows to try again, though, it would seem to be now, when advances in computers and artificial intelligence provide a much better foundation for the technology. And it has.
At its Build 2017 developer conference, Microsoft launched a new Video Indexer preview that not only transcribes the video, but also identifies the speaker, provides optional translations in up to nine languages, automatically generates subtitles, and guesses what objects or overlays are on the screen. It even performs basic sentiment analysis, determining whether the words used are positive or negative. And it’s all searchable via a Web portal: If you only want to view the text from a specific speaker, you can.
Video Indexer is an example of how Microsoft is applying artificial intelligence to daily tasks. For example, the company showed off a PowerPoint Translator function that will allow users to auto-configure a PowerPoint presentation in their native language. Video Indexer, though, goes far beyond that.
According to the product manager for Video Indexer, Milan Gada, the indexer can’t immediately identify every speaker in a video. But if a user identifies an “unknown” speaker with their name, the entire database will be updated with the correct information, he said. Video Indexer also quickly makes a video searchable, allowing consumers to skip right to the parts they’re most interested in.
That all raises the question: If Microsoft can deliver a solution like this for enterprise customers, why can’t it at least tap into the power of Cortana to deliver the same features for consumers?
Microsoft’s silence on speech dictation
“This is such a great question,” said Harry Shum, the executive vice president overseeing Microsoft’s speech-recognition research, as well as Cortana and Bing, when asked last year about dictation’s future within Microsoft Office. “There is really no reason why it is not playing a much more prominent role yet.”
We decided to give it another chance: We delved into Windows’ voice dictation features to see how they compared to more recent speech-based technologies.
Why speech recognition can’t be too perfect
Some of us still think about voice dictation in the same way Doonesbury lampooned the Apple Newton, turning “I am writing a test sentence” into “Siam fighting atomic sentry.” And you’d be forgiven for thinking so, too: Windows Speech Recognition is powered by the Microsoft Speech Recognizer 8.0, which has remained literally unchanged since Vista. Shum called it a “grandpa” technology.
What has changed, however, is the hardware: Listening for and interpreting speech requires far less processing power than a decade ago. The quality of integrated array mics within PCs like the Surface Book means that dedicated headsets aren’t necessarily required to achieve superior accuracy. Voice dictation for the masses is here, right?
When I tested Windows’ speech capabilities, however, I experienced firsthand the merciless perfection that’s required for the system to be usable. This story has 1,028 words in it, including subheadings. If you used voice dictation software to write it, a 95.0% accuracy rate would mean you’d have to correct more than fifty mistakes. That gets old fast.
In my tests, based on a methodology I developed for another speech recognition product I’m testing, Windows produced an accuracy rate of 93.6%. That’s pretty bad on paper, and somewhat behind the dedicated software I’m trying. Windows also had an odd habit of interjecting the word “comma” when I was dictating the punctuation mark. The speech community seems split on whether relatively minor mistakes like this are significant.
That, of course, was just the baseline. As anyone who’s used dictation software can tell you, the key to accuracy is training. Over time, a voice dictation program learns your accent, whether you pronounce the “a” in apricot like “bad” or “ape,” and how to filter out your unconscious verbal tics. I’ve seen Microsoft employees claim that, properly trained, Windows’ speech recognition was 99% accurate. Ten mistakes or so per 1,000 words isn’t bad at all.
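The accuracy figures quoted so far (95.0%, 93.6%, and 99%) turn into correction counts with simple arithmetic. Here is a minimal Python sketch, assuming every misrecognized word costs exactly one correction; the function name `expected_corrections` is purely illustrative and not part of any Microsoft tool:

```python
def expected_corrections(word_count: int, accuracy: float) -> float:
    """Estimate how many words need fixing at a given per-word accuracy (0 to 1).

    Assumption: each misrecognized word requires exactly one correction.
    """
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return word_count * (1.0 - accuracy)


if __name__ == "__main__":
    story_words = 1028  # length of this story, per the text above
    for accuracy in (0.95, 0.936, 0.99):
        fixes = expected_corrections(story_words, accuracy)
        print(f"{accuracy:.1%} accurate: roughly {fixes:.0f} corrections")
```

Even a few points of accuracy make a large difference: the gap between the untrained baseline and the claimed trained figure is the gap between dozens of fixes and about ten.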
Very few of us, though, probably want to spend the time training the software. Windows Speech Recognition requires up to 10 minutes to run through a few practice sentences, and it feels like a lifetime. Cortana and Siri don’t require any of the same setup time, as they’ve already been trained on millions of voice samples. There’s something to be said for instant gratification.
What makes Cortana (which you can use on your PC or phone) so much better than Windows’ own ancient voice dictation systems is her link to the massive computational power of the Microsoft cloud. Microsoft can crunch and correlate your voice input together with whatever other data Microsoft knows about you, generating the intelligence that is the soul of Cortana.
Microsoft talks up speech recognition
Given Cortana’s proven skills, you’d think speech would take center stage. But at Build 2016, executives said dictation capabilities won’t be added to Office. Last October, though, chief executive Satya Nadella’s keynote address at Microsoft’s Ignite conference painted speech recognition as a critical component of the company’s future.
Take Skype Translator, for example. Microsoft’s Star Trek-like universal translator depends upon three different strands of research, according to Nadella: speech recognition, speech synthesis, and machine translation.
“Even inside of Word or Outlook when you’re writing a document we now don’t have simple thesaurus-based spell correction,” Nadella said, adding that Office can now even compensate for dyslexia. “We have complete computational linguistic understanding of what you’re building. Or what you’re writing.”
But not what you’re saying, apparently.
During the same speech, Nadella bragged that Microsoft’s speech algorithms achieved a word error rate of 6.9 percent using the NIST Switchboard test. That sounds bad: that’s an accuracy of about 93.1 percent. But the Switchboard test uses sample rates of just 8kHz, about the quality of a telephone conversation in the year 2000. Windows Media Audio 10, the codec within OneNote, can capture audio at up to 48kHz, providing much more accurate samples.
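For readers curious how a word error rate is computed: it is commonly defined as the number of word substitutions, deletions, and insertions needed to turn the recognizer’s output into the reference transcript, divided by the number of words in the reference. The Python sketch below shows that common definition only; it is not NIST’s scoring software, it skips the text normalization real benchmarks apply, and the function name `word_error_rate` is an illustrative choice.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: a reference word was dropped
                curr[j - 1] + 1,     # insertion: an extra word was heard
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # The Doonesbury example quoted earlier in this story.
    wer = word_error_rate("I am writing a test sentence",
                          "Siam fighting atomic sentry")
    print(f"word error rate: {wer:.1%}")
```

Treating accuracy as one minus the word error rate, as the 93.1 percent figure does, is a convenient simplification. Because insertions count as errors, the rate can exceed 100 percent on badly garbled output.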
I think it’s pretty obvious that the pieces of the puzzle are there, technically. If there’s any obstacle, it might be organizational: Microsoft’s Office apps have been spun out into their own group, away from Cortana and Bing. Shum, however, said that intelligence is still part and parcel of Microsoft’s offerings. “Rest assured that we are infusing AI technology into all Microsoft products,” he said in October.
Microsoft representatives also said that users should expect more from Microsoft in the future.
“We see value in conversations across a range of devices and experiences,” Microsoft said in an October statement. “We’re just at the beginning of what we believe is possible and certainly see lots of opportunity to connect Cortana and conversations into a number of productivity scenarios. Today, Cortana integrates with Office 365 for glance-able information about upcoming meetings, along with flight and package tracking, and Bing is also providing intelligent insights directly in Office. We will continue to invest heavily here.”
If Microsoft truly believes in productivity, though, the future of speech recognition within your PC probably isn’t using Skype to book a hotel in Bangladesh. It’s writing about the experience—but with your voice rather than your fingers.