December 27, 2000

By Karen Kenworthy

IN THIS ISSUE

I hope you had a wonderful Christmas holiday. I was able to spend the day with my parents, my brother Bill, and Bill's wonderful family. All together there were 15 of us, including Bill's oldest daughter Vanessa, and her husband and new baby. We missed my other brother, Kevin, his wife and five children. But I don't know where we would have put them if they'd come. :)

Speech Recognition

Just before Christmas, I received a visit from my buddy Bob. Bob is a man of many surprising talents and interests. Over the years Bob has been a lifeguard, furniture mover, and a member of the touring company of the choral group "Up With People." Today he's the IT manager of a Fortune 500 company, and a Master pistol shooter who competes regularly in the U.S. national championships. He's also a generous man, and a Star Trek fanatic.

So it came as no surprise when Bob arrived bearing a gift. And no surprise the gift was the latest Star Trek computer game. Bob's been raving about this game for weeks. According to Bob, this game has the most amazing user interface. I think he wants me to copy it, for the programs I write for him.

I'm eager to try the new game. But I'm afraid I'll be disappointed. Like others who grew up watching the Star Trek television show, my dream has always been a computer that listens to my voice, understands my questions and commands, and responds with a voice of its own. But despite the best efforts of a lot of very bright people, we're a long way from that goal.

Comprehending human speech, what computer scientists call Speech Recognition (or SR), is proving to be very difficult. Although we think of speech in terms of distinct words and sentences, real human speech is a continuous sound. Very few moments of silence punctuate what we say. Instead, the sound of each syllable and word blends almost seamlessly into the next.

To make sense of this continuous stream of sounds, a computer must first break it into a series of basic sound units, each usually smaller than a syllable. These building blocks of words are what linguists and computer scientists call phonemes. Next, the computer must group these phonemes into words.

To recognize words, computers rely on "grammars." These are lists of words, and the phonemes that form those words. Grammars help a lot. But there are still some problems that must be solved.
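To make the idea concrete, here's a minimal sketch in Python of a toy grammar and a function that groups a stream of phonemes into words. Everything in it is an assumption made for illustration: the phoneme symbols are informal ones I made up, and the names GRAMMAR and group_into_words are hypothetical. Real speech engines are far more elaborate.

```python
# Toy grammar: each word maps to the phonemes that form it.
# The phoneme symbols are informal and invented for this example.
GRAMMAR = {
    "warp": ("w", "or", "p"),
    "ten": ("t", "eh", "n"),
    "open": ("oh", "p", "eh", "n"),
    "file": ("f", "eye", "l"),
}


def group_into_words(phonemes, grammar=GRAMMAR):
    """Return a list of words that covers the whole phoneme stream, or None."""
    phonemes = tuple(phonemes)
    if not phonemes:
        return []  # nothing left to match, so we are done
    for word, sounds in grammar.items():
        n = len(sounds)
        if phonemes[:n] == sounds:
            # This word fits the start of the stream; try to match the rest.
            rest = group_into_words(phonemes[n:], grammar)
            if rest is not None:
                return [word] + rest
    return None  # no word in the grammar fits here


print(group_into_words(["w", "or", "p", "t", "eh", "n"]))
```

If no combination of words in the grammar accounts for every phoneme, the function returns None rather than guessing.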

We don't all pronounce words the same way, for example. Regional accents, differences in voices, even colds and other conditions, affect the exact sounds we make when speaking specific words. Grammars that contain alternate pronunciations for certain words help compensate for these differences. So does "training," the automatic creation of grammars based on the actual speech of a particular person. But problems still remain.
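Alternate pronunciations can be pictured as a grammar that stores more than one phoneme sequence per word. The sketch below is only an illustration under that assumption. The phoneme symbols are informal, and PRONUNCIATIONS, build_lookup and recognize are hypothetical names, not part of any real speech engine.

```python
# Each word lists every pronunciation the grammar should accept.
# The phoneme symbols are informal and invented for this example.
PRONUNCIATIONS = {
    "tomato": [
        ("t", "ah", "m", "ay", "t", "oh"),
        ("t", "ah", "m", "ah", "t", "oh"),
    ],
    "either": [
        ("ee", "th", "er"),
        ("eye", "th", "er"),
    ],
}


def build_lookup(pronunciations):
    """Map every accepted phoneme sequence back to the word it spells."""
    lookup = {}
    for word, variants in pronunciations.items():
        for sounds in variants:
            lookup[sounds] = word
    return lookup


LOOKUP = build_lookup(PRONUNCIATIONS)


def recognize(sounds):
    """Return the word for a phoneme sequence, or None if it is unknown."""
    return LOOKUP.get(tuple(sounds))


print(recognize(["ee", "th", "er"]), recognize(["eye", "th", "er"]))
```

"Training" could then be thought of as adding a speaker's own variants to a table like this one, although real systems do something much more sophisticated.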

Computers must also cope with homonyms -- words that sound alike but have different meanings, like "bear" and "bare." Now you might think grammars can't help with this job. But they can. That's because grammars often contain more than the phonetic spelling of words. Advanced grammars may also contain information about how each word may be used. For example, they may identify which words are nouns, verbs, adjectives, etc. This extra information enables computers to examine how a spoken word is used, and identify the proper spelling based on that usage.
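Here's a rough sketch of how usage information might settle the "bear" versus "bare" question. It assumes a deliberately crude rule: if the next word is a known noun, the sound is being used as an adjective; otherwise it's treated as a noun. The tables, the rule and the name choose_spelling are all hypothetical, and real grammars are much richer.

```python
# One sound, two spellings, keyed by how the word is used.
# "b-air" is an informal label for the shared sound, not a real phonetic code.
HOMOPHONES = {
    "b-air": {"noun": "bear", "adjective": "bare"},
}

# A tiny list of nouns the toy grammar knows about.
NOUNS = {"wall", "floor", "feet"}


def choose_spelling(sound, next_word):
    """Pick a spelling based on the word that follows the sound."""
    options = HOMOPHONES[sound]
    # Crude rule: a word placed directly before a noun is acting as an adjective.
    role = "adjective" if next_word in NOUNS else "noun"
    return options[role]


print(choose_spelling("b-air", "wall"))     # as in "the bare wall"
print(choose_spelling("b-air", "growled"))  # as in "the bear growled"
```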

Command And Control

The speech recognition software that works best today falls into a category programmers call "Command and Control." These applications understand a small number of words, such as the menu choices of a single program. Each word or short phrase causes the program to take a specific action. For example, stating "Warp 10" might cause your voice-controlled vehicle to accelerate to the speed of light, raised to the 10th power.

With a little ingenuity, command words can be chosen so that no two sound very much alike. Use a little more neuron grease, and a programmer can create specialized grammars that specify which command words can be used in various contexts.

Consider a program that allows you to use your voice to make selections from a Windows program's menus. Most programs have a small number of top-level menu choices, with names like File, Edit, View, and Help. These choices appear in a line, across the top of the program's main window.

Choose one of these, and additional choices appear beneath the top-level menu choice. For example, click on the word File in a program's top-level menu, and you'll see a sub-menu containing several new options such as Open, Save, Print, and Exit. Click a different top-level menu choice, and a different list will appear. Clicking the word Edit in the top-level menu might provide a sub-menu with choices such as Cut, Copy, and Paste.

Now let's replace your mouse with your mouth. With an appropriate grammar, a program that lets you navigate this menu structure by voice will first listen for words that appear in the program's top-level menu. It will ignore any other words it might hear. Even though the program's total vocabulary might include several dozen words, initially it only has to recognize a few.

Once the name of a top-level menu has been spoken and correctly recognized, the computer listens for your choice from the appropriate sub-menu. Only words found in that sub-menu need be considered. All others can be safely ignored. Once again, the computer only needs to concern itself with a small portion of its total vocabulary, increasing its chances of understanding what's being said.
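The two-step listening described above can be sketched as a small state machine. This is only an illustration, assuming the words have already been recognized and arrive as text. The MenuListener class and its method names are hypothetical, and only the File and Edit menus from the example are included.

```python
# Top-level menu names and the sub-menu choices beneath each one.
MENUS = {
    "file": ["open", "save", "print", "exit"],
    "edit": ["cut", "copy", "paste"],
}


class MenuListener:
    """Listens only for the words that make sense in the current context."""

    def __init__(self, menus):
        self.menus = menus
        self.current = None  # None means we are waiting for a top-level name

    def active_words(self):
        """Return the small set of words worth listening for right now."""
        if self.current is None:
            return set(self.menus)
        return set(self.menus[self.current])

    def hear(self, word):
        """Process one word; return (menu, choice) when a command is complete."""
        word = word.lower()
        if word not in self.active_words():
            return None  # not in the active grammar, so ignore it
        if self.current is None:
            self.current = word  # a top-level menu was named
            return None
        command = (self.current, word)
        self.current = None  # go back to listening for top-level names
        return command


listener = MenuListener(MENUS)
for spoken in ["paste", "Edit", "save", "copy"]:
    print(spoken, "->", listener.hear(spoken))
```

In this run "paste" is ignored at first because no menu has been chosen, and "save" is ignored because it doesn't belong to the Edit menu. Only "copy" completes a command.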

My little Power Toy program is one example of a command and control speech recognition application. It doesn't allow you to make choices from a menu. But it does let you select entries from a list of actions that can be performed by an animated character. For example, most of the animated characters that the Power Toy can display, called Agents, can perform an action called Greet. To cause the Agent to perform its greeting ritual, just say the word Greet.

One-word phrases are especially easy. But the Power Toy's grammar also contains a few simple phrases such as "What Time Is It?" If the program recognizes this phrase, the Agent will respond by speaking the current date and time.

Even though this question contains four words, the program only needs to listen for one word initially -- the word "What." Until that word has been heard, the program can ignore the other words in the phrase. Only after "What" has been heard does the program need to listen for the three words that remain.
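Here's one way that word-at-a-time listening might look in code. It's a minimal sketch, not how the Power Toy actually works: it assumes words arrive one at a time as text, and the PhraseListener class is a hypothetical name invented for this example.

```python
class PhraseListener:
    """Waits for the first word of a phrase, then for each word that follows."""

    def __init__(self, phrase):
        self.words = phrase.lower().split()
        self.position = 0  # index of the word we are listening for

    def hear(self, word):
        """Process one word; return True when the whole phrase has been heard."""
        word = word.lower()
        if word == self.words[self.position]:
            self.position += 1  # the expected word arrived, move along
        elif word == self.words[0]:
            self.position = 1   # the phrase started over
        else:
            self.position = 0   # anything else sends us back to the start
        if self.position == len(self.words):
            self.position = 0   # get ready for the next time
            return True
        return False


listener = PhraseListener("What Time Is It")
for spoken in "so what time is it".split():
    print(spoken, "->", listener.hear(spoken))
```

Until "what" is heard, every other word leaves the listener at the start of the phrase, which is the same as ignoring it.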

Dictation

Programs that fall into the second category of speech recognition applications, Dictation, are generally less successful. These sorts of applications let you say anything you like, such as "Star Date 2050 -- We entered Romulan space..." The statements you make are transcribed, or converted to computer text, then stored in a file or saved in some other way.

Dictation speech recognition also relies on a grammar. But as you can imagine, the full grammar of an entire language is quite large. It's so large it cannot be stored in the RAM of most computers, where it could be accessed quickly. Instead, it must be stored on disk, where access is much slower. As a result, full grammars cannot currently be used to perform real-time speech recognition.

But smaller grammars can be, and often are, used for dictation-style recognition. An application might use a grammar of medical terms to allow doctors to dictate patient information, for example. Or a grammar of legal terms could be used by a computer program that creates legal documents. Other grammars, with common business terms, or even simply reduced vocabularies, can also be found or created.

Unfortunately, even with the help of specialized grammars, computer dictation is still error-prone. Human languages are just too complex, and our patterns of speech too variable, to allow current-day computers to reliably do the job.

Fortunately, people are often fairly tolerant of mistakes made by dictation applications. In most cases, we assume that the work of the program will be reviewed by a human being before the job is done.

On the other hand, people expect more accuracy from the simpler command and control applications. This is especially true when there is no chance to undo whatever action such a program may perform. As voice-controlled computers take on ever more important tasks, such as opening cargo bay doors, this demand for accuracy will only increase.

While computer speech recognition has been slow to evolve, there's no doubt it will eventually live long and prosper. Many speech recognition applications already exist. And Microsoft recently disclosed that speech recognition features would be built into the next version of their office software suite, Microsoft Office 10. As the technology improves, other applications are sure to follow.

In the meantime, if you'd like to give computer speech recognition a try, beam down the free Winmag.com Power Toy from https://www.karenware.com/powertools/pttoy. There you'll also find the other free software needed to teach your computer to talk and listen.

And if you see my buddy Bob, strolling about the Alpha Quadrant or anywhere on the 'net, give him a shot with your phaser. But be sure to set it on "tickle" first. As for me, a simple wave and a "Hi!" will do just fine. :)