For most of the history of computing, our primary way of inputting data has been via a keyboard, and we’ve received information back on a display monitor. It’s worked pretty well, especially for those of us who grew up touch typing and can “think with our fingers.” At least it did until computers started shrinking, reaching the ultra-compact dimensions of a cell phone.
One of the challenges of stuffing the power of a full-fledged computer into a tiny, pocket-sized package is that those input and output methods don’t work nearly as well in the small form factor. Smartphone vendors have tried to make text entry a little easier with physical keyboards, virtual keyboards, and technologies such as Swype. Phone displays have also gotten sharper, more vivid (Samsung Super AMOLED Plus), and larger (4.5 inches on the Samsung Infuse and upcoming Motorola Droid Bionic). However, handheld devices just aren’t capable of providing the same input/output experience as a full-sized desktop computer.
What phones are designed for is voice input and output. Thus, speech would seem to be the most logical way to interact with them. But — despite catching the imagination of a generation through Star Trek’s ship-wide talking computers — speech recognition has been problematic in real world computing, and it never really caught on. Will smartphones change that?
Why smartphones need speech input/output
Typing on a pint-sized keyboard and squinting to read the text on a screen that measures 10 square inches is a challenge even under the best circumstances, and smartphone computing is often done in less than optimal conditions. Much of the time when we’re interacting with our phone apps, we aren’t sitting comfortably at a desk. We may be standing up, walking, or even driving a car (such as when we use the navigation app on our phones). Thus, even if the keyboard and display were big enough for comfortable typing and viewing, it wouldn’t be ergonomic (or in some cases, safe) to use them.
The history of speech applications
Speech input seems like a great idea. Even if you’re a fast typist like me (90 wpm average) and a slow talker (born and raised in Texas), you probably speak much faster than you can type. In fact, humans usually talk at a rate of 150-200 words per minute. Court reporters train to take dictation on stenotype machines at 225 wpm, in order to be able to capture all speakers’ words verbatim in real time. That means most of us could at least double the speed of input by speaking instead of typing.
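As a rough illustration of that arithmetic, here is a minimal Python sketch that compares the speaking rates quoted above with two typing speeds. The 90 wpm figure is my own; the 40 wpm figure for a casual typist is an assumption added purely for illustration and does not come from any measurement cited here. The function and variable names are hypothetical.

```python
def speech_speedup(typing_wpm, speaking_wpm):
    """Return how many times faster speaking is than typing."""
    if typing_wpm <= 0:
        raise ValueError("typing_wpm must be positive")
    return speaking_wpm / typing_wpm


# Typical speaking rate quoted in the text (words per minute).
SPEAKING_RANGE_WPM = (150, 200)

# 90 wpm is the author's typing speed; 40 wpm is an assumed casual typist.
TYPISTS_WPM = {
    "casual typist (assumed)": 40,
    "fast typist (author)": 90,
}


def speedup_table():
    """Map each typist label to (low, high) speedup factors."""
    rows = {}
    for label, wpm in TYPISTS_WPM.items():
        low = speech_speedup(wpm, SPEAKING_RANGE_WPM[0])
        high = speech_speedup(wpm, SPEAKING_RANGE_WPM[1])
        rows[label] = (round(low, 2), round(high, 2))
    return rows


if __name__ == "__main__":
    for label, (low, high) in speedup_table().items():
        print(f"{label}: {low}x to {high}x faster by voice")
```

Under those assumptions, the casual typist gains roughly 3.75x to 5x, while a 90 wpm typist gains about 1.67x to 2.22x. So "at least double" holds comfortably for slower typists and only at the upper end of the speaking range for fast ones.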
Speech recognition technology has been around for a long time. IBM demonstrated a speech recognition device called the Shoebox at the Seattle World’s Fair in 1962. Speech recognition was built into Microsoft Office XP, Office 2003, and Microsoft Plus! for Windows XP. It’s been a fully integrated part of Windows since the release of Vista. It has also been available in Windows Mobile (as Microsoft Voice Command) since 2003.
Voice output is achieved via speech synthesis (also called text-to-speech) technology, which has been around even longer. Rudimentary machines were built in the 1700s, and Bell Labs created the VOCODER in the 1930s. Text-to-speech was first integrated into Windows 2000, in the form of Narrator, as an accessibility option for those with visual disabilities. It has been included in every Windows OS since. Apple first introduced MacinTalk in 1984. AmigaOS included a speech synthesis system, as well.
Today’s smartphones have built-in speech recognition and speech synthesis capabilities, and in general, they seem to work better than the speech applications for desktop computers.
The challenge of getting speech right
There are a number of reasons why we still haven’t seen widespread use of speech applications on the desktop. In many work environments, it’s just not practical. Today’s workplaces are often made up of cubicles or fully open spaces shared by multiple workers, rather than individual private offices. If all the workers are talking to their computers at the same time, and those computers are talking back, you end up with a cacophony of voices and an abundance of noise pollution.
Not only would it make for an unpleasant work environment, but all the extraneous sounds can also interfere with the ability of speech recognition software to understand what you’re saying. Even without background noises, the software often has difficulty accurately processing voice input. Anyone who has used voice dictation programs, such as Dragon NaturallySpeaking, knows that a computer has limitations in this area that the human ear typically doesn’t have.
If you happen to have a perfect Midwestern non-accent, you may be good to go, but if you’re from the South, or Boston, or the Bronx, or speak English as a second language, or otherwise have a non-standard accent or dialect, speech recognition software can get pretty confused. To get any degree of accuracy, you need to go through a long and tedious training process. Many people (I was one) tried it and gave up in frustration. I can type faster than I can dictate and then correct the mistakes that the software makes in transcribing what I say. It’s not as if I need to talk to my computer; there’s a perfectly good keyboard right there.
Speech on a phone: The perfect match
What makes speech a better match with a smartphone than with a computer? Talking to a phone is a very intuitive act, and you don’t have that big, easy-to-use keyboard sitting there for you to use instead. In addition, the kinds of tasks we do on phones are different. We generally only need to communicate short, simple commands (such as “call Tom at home”) instead of creating long documents. Those who have played with both voice command and dictation on a Windows computer know that the former has always worked much better. We’re also more likely to be using our phones in situations where we need to keep our hands free for other tasks (and in some cases, our eyes someplace other than on our displays), and speech input/output allows us to do that.
The first stumbling block to implementing speech recognition on early smartphones was the relatively low memory and processing power of the devices. However, today’s smartphones have system resources that top those of desktop computers from only a few years ago, so that’s less of an issue. In addition, with today’s fast 3G and 4G networks, some of the processing load can even be offloaded onto a remote server.
Voice-enabled smartphone apps
Voice dialing was, of course, one of the first smartphone applications to use speech recognition and synthesis. Most phones, including low cost “feature phones,” now include this feature. You speak a command (“call” or “dial”) and the name of a person in your contact list or the number itself. The phone uses speech recognition to identify the number and dials it. The phone may also use speech synthesis to repeat the name or number to you for confirmation, or if the contact has multiple numbers, to ask you which number to dial.
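The flow described above can be sketched in a few lines of Python. This is only an illustration of the logic, not how any particular phone implements it: it assumes the speech recognizer has already turned the audio into text, and the contact list, function name, and spoken prompts are all hypothetical.

```python
import re

# Hypothetical contact list; names and numbers are made up for illustration.
CONTACTS = {
    "tom": {"home": "555-0100", "mobile": "555-0101"},
    "alice": {"mobile": "555-0123"},
}

# A command is "call" or "dial" followed by a name or a number.
COMMAND_RE = re.compile(r"^(call|dial)\s+(.+)$", re.IGNORECASE)


def handle_voice_command(transcript):
    """Map recognized text to an (action, detail) pair.

    Actions: "dial" (detail is the number to dial), "ask" (detail is a
    prompt to speak back to the user), or "unknown".
    """
    match = COMMAND_RE.match(transcript.strip())
    if not match:
        return ("unknown", "Sorry, I didn't catch a call or dial command.")
    target = match.group(2).strip().lower()

    # Case 1: the user spoke the number itself.
    digits = re.sub(r"[\s\-()]", "", target)
    if digits.isdigit():
        return ("dial", digits)

    # Case 2: a contact name, optionally "<name> at <label>".
    name, _, label = target.partition(" at ")
    name, label = name.strip(), label.strip()
    numbers = CONTACTS.get(name)
    if numbers is None:
        return ("unknown", f"No contact named {name}.")
    if label in numbers:
        return ("dial", numbers[label])
    if len(numbers) == 1:
        return ("dial", next(iter(numbers.values())))

    # Several numbers and no usable label: ask which one to dial.
    choices = " or ".join(numbers)
    return ("ask", f"Which number for {name}: {choices}?")
```

For example, `handle_voice_command("call Tom at home")` resolves straight to Tom's home number, while `handle_voice_command("call tom")` returns an "ask" action whose prompt a speech synthesizer would read back to the user.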
Another popular feature is voice search. This is much handier than trying to type a long search term into the browser with a tiny keyboard.
Voice search makes it faster and easier to enter search terms in the smartphone browser
The Google Navigation app on Android phones allows you to speak your destination instead of typing it in, something that you can’t do with some standalone GPS units. By its very nature, the Navigation app is something that’s often used while driving, so this is not only a big convenience but a safety feature, as well.
The Google Navigation app on Android allows you to speak your destination
The Navigation app also uses speech synthesis to provide spoken turn-by-turn directions so that it’s not even necessary to look at the maps on the screen. The Navigation app is so good that many people (myself included) have traded in their standalone GPS units and now just use their phones to get directions.
Translation apps such as Google Translate allow you to speak words and have them translated into another language. Google Translate also has a conversation mode feature, which will speak the translated text aloud, so that two people who speak different languages can communicate by alternately speaking or typing their responses in their own language and having them spoken to the other person in that person’s language.
Translation apps use both voice input and voice output to enable two people who speak different languages to have a conversation
Other apps, such as SpeakNotes, let you dictate to your phone and have your words transcribed into editable text, which you can then share via email. And some phones’ keyboards, such as the one on my HTC Thunderbolt, have a microphone key that allows you to speak directly into an email message, text message, document, or other text field and have your speech transcribed into text.
Most phones also have general settings for voice, which allow you to select the input language, filter offensive words in voice searches, select a speech synthesizer engine, set the rate at which synthesized speech will be spoken, install additional languages, and so forth.
I’ve used Android phones in these examples, but the iPhone and Windows Phone 7 also incorporate similar voice input and output capabilities.
The future of voice integration
Futurists envision the day when smartphones will be even smarter than they are now and able to process your spoken words in much the same way a human can. That is, you’ll be able to talk to your phone just like the characters on Star Trek talked to the ship’s computer (e.g., “Send an email to Bob to tell him the project is a go” or “Find the closest sushi restaurant”). This is referred to as natural language processing.
Voice pattern recognition could even one day be routinely used as a security mechanism for smartphones to authenticate users. It seems clear that voice is the most natural way to communicate with a phone, and phones of the future are likely to have even more sophisticated support for vocal input and output. Of course, there are times when silence is required, so we will probably always have alternative methods of interacting with our phones, but it’s very likely that voice will become the primary input/output method.