Speech Recognition vs. Language Processing

I have stressed that we are still waiting for natural language processing (NLP). One thing that might lead you to believe otherwise is that some companies run systems that enable you to hold a conversation with a machine. But that doesn’t involve NLP, i.e. syntactic and semantic analysis of sentences. It involves automatic speech recognition (ASR), which is very different.

ASR systems deal with words and phrases rather as the song “Rawhide” recommends for cattle: “Don’t try to understand ’em; just rope and throw and brand ’em.”

Labeling noise bursts is the goal, not linguistically based understanding.

Current ASR systems cannot reliably identify arbitrary sentences from continuous speech input, partly because such subtle acoustic contrasts are involved. The ear of an expert native speaker can detect the difference between wholly nitrate and holy night rate, but ASR systems will have trouble.

Fifty years ago Noam Chomsky and George Miller contrasted The good can decay many ways and The good candy came anyway, and pointed out that the final [s] sound is crucial for identifying the words earlier in the sentence (since *The good can decay many way is ungrammatical). The two actually sound slightly different in most people’s speech, but don’t bet on an ASR system getting them right.

Unrestricted speech-to-text ASR has to be risked, despite its shortcomings, in one important application: generating closed captions on live TV. A “respeaking” technique is generally used: someone follows the broadcast in a studio and carefully repeats all of the language into a microphone connected to an ASR system that produces caption text in real time. The results are bad enough that this week the British government’s Office of Communications (Ofcom), watching out for the needs of the hearing-impaired, made some proposals for lowering the error rate.

The errors that Ofcom complains of can be delightfully entertaining: Ed Miller Band for Ed Miliband (leader of the Labour Party); Arch bitch of Canterbury for Archbishop of Canterbury; mist and Fox patches for mist and fog patches (in a weather report); never been on a bus on heroin for never been on a bus on her own (in news about a missing girl); a moment’s violence for a moment’s silence (in coverage of the funeral of Queen Elizabeth’s mother).

In a spectacular recent ASR error, the Fox affiliate KDFW briefly identified the star of the TV comedy New Girl, Zooey Deschanel, as one of the Boston bombers, when Dzhokhar Tsarnaev was (understandably) not recognized by the system. (Screen shot here.)

Despite the threat of such disastrous mistakes, speech-recognition systems have been able to take off and become really useful in interactive voice-driven telephone systems. But what facilitated this was not powerful processors or cheap memory or improved algorithms (though they helped), and it certainly wasn’t NLP. It was in large measure the magic of a technique known as dialog design: Mapping the course of possible coherent conversations on a fixed topic, so that a network of guideposts and prompts can be laid out. (See this useful review for some discussion of applying dialog design in testing ASR systems.)
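The idea can be sketched as a tiny state machine; this is a toy illustration with invented prompts and state names, not any real system’s design. Each dialog state pairs a prompt with the small set of responses the recognizer must distinguish, and an unrecognized utterance simply leaves the caller at the same guidepost to be re-prompted.

```python
# Toy sketch of dialog design: each state carries a prompt and the few
# phrases the recognizer needs to tell apart, each mapped to a next state.
# All states, prompts, and phrases here are hypothetical.
DIALOG = {
    "start": {
        "prompt": "Are you calling from the phone you wish to ask about?",
        "expected": {"yes": "account_menu", "no": "get_number"},
    },
    "account_menu": {
        "prompt": "Do you want to pay a bill or transfer funds between accounts?",
        "expected": {"pay a bill": "billing", "transfer funds": "transfer"},
    },
}

def next_state(state, utterance):
    """Advance the dialog if the utterance contains one of the few
    phrases this state expects; otherwise stay put and re-prompt."""
    for phrase, target in DIALOG[state]["expected"].items():
        if phrase in utterance.lower():
            return target
    return state  # unrecognized input: remain in the same state
```

The point of the map is that at every node the recognition problem shrinks to telling two or three noise bursts apart, rather than transcribing open-ended speech.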

At a point where you have just been asked, “Are you calling from the phone you wish to ask about?” you are extremely likely to say either Yes or No, and it’s not too hard to differentiate those acoustically.

Even if you’ve been asked “What state do you live in?” there are only 50 distinct cooperative answers (setting aside responses like “Who’s asking?”).

Prompting a bank customer with “Do you want to pay a bill or transfer funds between accounts?” considerably improves the chances of getting a response with either “pay a bill” or “transfer funds” in it; and the two phrases sound very different.

In the latter case, no use is made by the system of the verb + object structure of the two phrases. Only the fact that the customer appears to have uttered one of them rather than the other is significant. What’s relevant about pay is not that it means “pay” but that it doesn’t sound like tran-. As I said, this isn’t about language processing; it’s about noise-burst classification.
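In that spirit, the classification step amounts to asking which of a handful of expected phrases the input sounds most like. The sketch below uses string similarity as a crude stand-in for acoustic similarity (a deliberate simplification: real systems compare acoustic models, not spellings), to show that no syntactic or semantic analysis is involved.

```python
from difflib import SequenceMatcher

def classify(heard, options):
    """Return whichever expected phrase is closest to what was heard.
    String similarity stands in here for acoustic similarity."""
    return max(options,
               key=lambda opt: SequenceMatcher(None, heard.lower(), opt).ratio())

# The system never parses the verb + object structure of the phrases;
# it only asks which of two noise-burst labels the input resembles more.
print(classify("pay a bell", ["pay a bill", "transfer funds"]))
print(classify("transfer fun", ["pay a bill", "transfer funds"]))
```

A garbled input like “pay a bell” still lands on “pay a bill” simply because it resembles that label more than the alternative, which is exactly the rope-and-brand behavior described above.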

Classifying noise bursts in a dialog context is way easier than recognizing continuous text. Mapping the space of possible interactions offers a way to make ASR useful even when the extent to which speech as such is being processed and understood (i.e., grammatically recognized and analyzed into its meaningful parts) is essentially zero. And that’s part of the reason why we are still waiting for NLP.

And there’s one other thing to discuss; that’s for my next post.
