Say It With Me: Voice Is Ready for Prime Time
Talking is the best user interface. But is the technology there yet?
Language is natural to people and universal to all cultures. Language is a spoken medium. Written language is merely the symbolic representation of spoken language. It's an abstraction, but a necessary one.
In the future, we'll talk to our computers and they'll talk back. We know this is true because talking is the most natural way for human beings to communicate. The evolution of the human-machine interface always moves the workload of interaction from the person to the computer. The perfect UI would be a natural conversation, just like you have with other people.
So what are we waiting for? There are two primary technological hurdles to overcome, and one cultural one.
The first technological hurdle is the creation of software (supported by powerful hardware) that can understand spoken language. This is a particularly difficult problem because even the most basic conversation requires computers to know what humans know and think the way humans think.
The second is that content must be searchable. Text can be indexed, and we've grown addicted to the ability to search for and find the things we've written.
The cultural barrier to voice-based computer interaction is one of habit. We've grown used to typing on keyboards. Although speaking is natural, speaking to a computer feels a little weird at first. And people generally don't like learning a new way to do things.
Still, I've been curious about just how far it's possible to go with voice interaction. To test how much of my work I could accomplish with voice, I embarked on a two-week project to use voice for everything I possibly could. My aim was both to test the technology and to try to understand how difficult it is to make the transition to primarily voice-based input.
I'll tell you my conclusion at the end, but first let me tell you about the products I used.
A company called VoiceBase this week unveiled a Web-based service of the same name that indexes the spoken word for search. VoiceBase does this by generating a transcript that is time-coded to the actual recording.
To use VoiceBase, you simply upload sound files of meetings, conversations, phone calls, lectures or whatever, and then later you can search them as you would search using Google. You can also record directly with a smartphone, which is great for meetings and interviews.
When you search for a specific word, VoiceBase takes you to that place in the recording so you can listen to the word in context. In other words, not only is the input a recorded voice, but the output is, too. The transcript is also there for you to read if you want to.
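To make the idea of a time-coded transcript concrete, here is a minimal sketch of how searching one could work. This is not VoiceBase's actual implementation; the sample transcript and the function names `build_index` and `search` are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical time-coded transcript: (start time in seconds, word).
# A real service would get these pairs from its speech recognizer.
transcript = [
    (0.0, "welcome"), (0.6, "to"), (0.8, "the"), (1.0, "budget"),
    (1.5, "meeting"), (42.3, "the"), (42.5, "budget"), (43.0, "is"),
    (43.2, "approved"),
]

def build_index(timed_words):
    """Map each lowercased word to the list of times it was spoken."""
    index = defaultdict(list)
    for start, word in timed_words:
        index[word.lower()].append(start)
    return index

def search(index, query):
    """Return the playback positions (in seconds) where the query word occurs."""
    return index.get(query.lower(), [])

index = build_index(transcript)
print(search(index, "budget"))  # [1.0, 42.5]
```

Each hit is a position in the recording, so a player could jump straight to it and let you hear the word in context.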
VoiceBase is accessed through its Web site or smartphone client (iPhone and Android only; search the app stores for "VoiceBase"). The service is free for a year. After that, the company charges for storage at a price ranging from 85 cents to $1 per hour of audio, depending on how much you buy.
The easiest way to use VoiceBase is through the phone app. Launching the app shows you a cassette tape, which you can label. After recording, you press a button to upload the file to the server. The company notifies you when indexing has been completed. That's it. Later, you simply search the site for keywords, just like you would in a Google search.
VoiceBase's voice recognition capability is so-so. Because it isn't trainable, it stumbles when it encounters unclear talkers, environmental noise or accented speech.
It's easy to get into the habit of using VoiceBase. The biggest cultural hurdle is "other people." It's difficult enough to convince others in a meeting that you want to record what they say. It's harder when they learn the recording will be online and indexed for search. Still, if you can convince others to go along, VoiceBase is really useful.
Dial2Do is a low-cost service that lets you do common tasks with your voice. By calling the number they give you, or by using their Android or BlackBerry app, you can leave yourself reminders, listen to e-mail, send text or e-mail, listen to your calendar items or add an appointment, post on Twitter or similar networks and listen to the weather or news. You can even post expenses to an expense account service.
Dial2Do is very easy to use. A computer voice tells you what you can say.
Dial2Do's voice recognition capability is good, but not great. One funny blunder: I wanted to send a note to my Evernote account, but Dial2Do mistook "Evernote" for "send an e-mail." I said "no," but Dial2Do interpreted "no" as a person on my contacts list. I let out an involuntary cuss word, and Dial2Do obediently sent my expletive as the content of the e-mail.
This service focuses on ease of use, rather than power. So it struggles with words that aren't in the dictionary.
I tested it with both an iPhone and Google Voice, and it worked great on both. I have not tried the apps.
Dial2Do is easy to form new habits around. The first day I used it, I kept the Dial2Do Web page open as a cheat sheet. After that, I simply used my cell phone whenever I wanted a reminder or needed to send an e-mail. When sitting at my desk, I quickly got into the habit of using Google Voice through Gmail.
And now we come to the Mother of All Voice Applications, Dragon NaturallySpeaking 11 from Nuance. As a cheapskate, I bought the bottom-of-the-line Home version for about $89 on Amazon.com. I wish I'd bought the Premium version, which costs about $150 and lets you record with a digital voice recorder and later upload to Dragon for processing.
Let me be very clear about something: Version 11 is vastly superior to previous versions. You can still buy old versions and the prices are very low. Don't do it.
Dragon NaturallySpeaking takes dictation so accurately that it begins to approach Steve Jobs' favorite word: "Magical." For the first week of use, I was actually shocked when it correctly recognized obscure names, extremely technical terms, brand names with correct capitalization (for example, iPhone) and performed other unlikely feats. Since I started using it, I've written the first drafts of all of my columns and blog posts, including this column, using Dragon NaturallySpeaking.
The accuracy has an unexpected and very welcome side effect: It makes it easier to write. I had assumed that typing was automatic, requiring little brain power. But using Dragon has shown me that typing was consuming mental energy all along. Dictating frees that energy for thinking, which is what makes writing so much easier. I can also write faster using Dragon.
Dragon's dictation capabilities are augmented by powerful voice-based editing tools. I found it surprisingly easy to master the commands for editing, aided by the universal "what can I say" command that pops open a list of commands. Best of all, the editing process teaches Dragon so that in the future, less editing is necessary. It's constantly learning, constantly improving accuracy.
In addition to dictation, Dragon also lets you control your PC by voice. You can open folders, launch applications, surf the Web and click on links. It works best with Firefox, OK with Internet Explorer and hardly at all with Google Chrome.
Dragon's PC control capabilities aren't as powerful, polished or perfect as its ability to take dictation. I wasn't able to go completely voice-based but ended up combining voice commands with mouse clicks.
The downsides of using Dragon are overwhelmed by the upsides. Those downsides include a learning curve -- both for you and for Dragon -- a fair but nevertheless high price tag, and significant requirements for system resources. I personally wouldn't install Dragon on a five-year-old PC.
Training Dragon takes about 20 minutes initially. It shows text for you to read so it can learn how you talk and assess the quality of your microphone. Then it does something really interesting: You point and click your mouse at folders of documents you've written so Dragon can learn the words that you know. This, I suspect, is how it so masterfully handled the weird words I threw at it.
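The document-scanning step can be pictured with a small sketch. Nuance has not published how Dragon does this, so everything here is an assumption for illustration: the function `learn_vocabulary`, the tiny base word list and the sample documents are all hypothetical.

```python
import re
from collections import Counter

def learn_vocabulary(documents, known_words):
    """Count words in the user's documents that are not in the base vocabulary."""
    counts = Counter()
    for text in documents:
        for word in re.findall(r"[A-Za-z][A-Za-z0-9']*", text):
            if word.lower() not in known_words:
                counts[word] += 1  # keep original capitalization, e.g. "iPhone"
    return counts

# Hypothetical base vocabulary and user documents.
base = {"i", "wrote", "about", "the", "and", "on", "my"}
docs = ["I wrote about the iPhone and VoiceBase", "VoiceBase on my iPhone"]

new_words = learn_vocabulary(docs, base)
print(new_words.most_common())
```

Words that turn up often in your own writing but not in the stock dictionary are exactly the ones a recognizer would otherwise get wrong, which fits my suspicion about the weird words.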
Dragon NaturallySpeaking 11 has transformed the way I work. I recommend it to everybody.
What can I say about voice?
I have good news and bad news, plus a little advice about replacing text and typing with talking.
First the bad news: It's not feasible yet for most people to completely abandon keyboards, mice and text and interact entirely via the spoken word.
The good news is that my experience proved, to me at least, that embracing voice tools improves your life significantly, especially if you write a lot. Everything becomes faster and easier. Your health improves because you don't have to spend so much time sitting and looking at a screen. Most surprising of all: Voice makes using a computer a lot more fun.
Here's my advice: I strongly recommend that everyone use all three of the voice tools in this column. VoiceBase lets you capture conversations, meetings and lectures that might otherwise evaporate into the ether as soon as they're done. Dial2Do makes it much easier to capture notes and to communicate. And what can I say about Dragon NaturallySpeaking 11? It's the biggest user interface advance since the iPhone.
My last bit of advice is to invest in a super-high-quality microphone. You want a boom microphone that fits at the corner of your mouth. Dragon NaturallySpeaking comes with a headset and microphone, but it's not very comfortable to wear. Comfort is everything, because if you truly embrace voice, you're going to be wearing this thing for 10 hours a day.
The bottom line is that voice is finally ready for prime time. I've decided to continue my experiment indefinitely and to keep pushing the voice envelope as far as it will go. Voice makes using a computer faster, easier and a lot more fun.