Long ago the way we “talked” to computers was with punch cards. Literally holes punched in cards that represented data. They were created using a typewriter-like machine, then fed into a card reader. Not long after that someone got the brilliant idea to just connect the typewriter directly to the computer, and the keyboard was born.
That’s a gross over-simplification, of course, but it helps illustrate how far we’ve come.
Today we have computers that we carry in our pockets and keyboards that can dynamically change from one format to another, and even to an entirely different language. It’s quite amazing! Nevertheless, it’s still manual entry. Sure, we’ve got autocomplete and autocorrect to help us along, but we’re still keying things into the machine (and we’ve got a few hilarious websites devoted to when things are “autocorrected” terribly wrong).
When I talk to another person I use my voice. In turn, they use their ears to hear what I’m saying, and their brain to interpret what I meant. Seems pretty basic, doesn’t it? Applying that same metaphor to computers, however, has proven to be quite elusive.
People have been dictating letters, memoranda, and who knows what else into dictaphones and tape recorders for later transcription into written formats for decades. Anyone who has ever dictated a letter can tell you that it’s not as simple as speaking the words: you also have to dictate punctuation, emphasis, and layout. In short, you need to learn how to dictate effectively, and incorporate those keywords into your dictated audio.
This comma some say is what has held back computer hyphen based voice hyphen recognition throughout the years period
See what I did there? I dictated that last sentence. Now imagine that all of your voice interactions with your mobile devices have to comply with that format. It’s a pain! Unfortunately, in today’s voice recognition applications, that’s exactly what you must do to get things just right. Even so, recognizing free-form text remains a challenge, and can yield even funnier results than autocorrect.
What we needed was a predefined set of commands that our devices could listen for and handle accordingly. For example: “send text to (recipient) (message)” or “navigate to (address/city/business name)”.
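A command set like this is essentially pattern matching: the device only has to recognize a handful of fixed phrases and capture the variable slots. Here’s a minimal sketch of that idea in Python, using the two hypothetical commands above (the command names and slot names are my own labels for illustration, not any real platform’s API):

```python
import re

# Each entry pairs a command name with a pattern whose named groups
# capture the variable "slots" in the spoken phrase.
COMMANDS = [
    ("send_text", re.compile(r"^send text to (?P<recipient>\w+) (?P<message>.+)$", re.I)),
    ("navigate",  re.compile(r"^navigate to (?P<destination>.+)$", re.I)),
]

def parse_command(utterance: str):
    """Return (command_name, slots) for a matching utterance, else None."""
    for name, pattern in COMMANDS:
        match = pattern.match(utterance.strip())
        if match:
            return name, match.groupdict()
    return None

# A phrase that fits the grammar maps cleanly to a command and its slots...
print(parse_command("send text to Alice running ten minutes late"))
# ...while free-form speech that fits no pattern is simply rejected.
print(parse_command("could you maybe text Alice for me"))
```

This is exactly why the rigid syntax works so well and fails so hard: matching is cheap and unambiguous, but anything phrased even slightly differently falls through to `None`.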
If you know the syntax, and have a handy way to invoke the app (say a dedicated button, for example), this works really well. Ironically, this has worked essentially the same way since the Windows Mobile days, and even earlier.
The syntax is still difficult, and the processing power is still somewhat high.
Along came Siri…
Once you start adding in a certain amount of artificial intelligence and fuzzy logic, voice recognition starts to get really interesting. Unfortunately, the level of processing power needed to accomplish this is arguably beyond the reach of even today’s high-end smartphones and tablets. Even if it weren’t, once installed on a device, this type of app would need near continual updating to take advantage of improvements it “learns” from other people using the app.
What we need is a centralized system. We’d speak to our devices, and with a fast-enough connection and clean enough audio, the super-computers on the other end of the line would be able to return fairly decent results.
That’s what Siri (and all the apps that are attempting to do what Siri does) is trying to accomplish. Honestly, it’s doing a very good job.
Where does that leave us?
We can boil down voice interactions with our devices to dictated text and spoken commands. Most interactions will include portions of both, which compounds an already complex problem. Even so, we’ve come a long way, and progress is being made in leaps and bounds.
Voice interaction isn’t going to go away any time soon, and it’s going to get better, faster, more accurate, and more powerful.
Now all we need to do is figure out some way to make it work even when an Internet connection isn’t available (or isn’t fast enough).
End of line.
Image Credit: Paramount Pictures