UX Magazine

Defining and Informing the Complex Field of User Experience (UX)
Article No. 909 November 28, 2012

Talking to Machines and Being Heard: A speech recognition primer

Speech recognition presents an exciting and dynamic set of challenges and opportunities for UX designers. With the mass-market reception of consumer technologies such as Apple’s Siri and the near-omnipresence of speech in telephone applications, speech recognition is a computer–human interface many people interact with daily. The uses range from self-service telephone systems like banking applications, to mobile apps that allow users to speak commands and compose messages verbally.

In the future, we can expect to see many different applications integrate speech recognition in some form. The time is near when speech will be the most universal user interface.

An Overview of Speech Technology Jargon

Like many technologies, speech recognition has its own special lingo. Here are some of the basic terms:

Automatic Speech Recognition (ASR) and Text-To-Speech (TTS)

ASR is software that turns spoken words into written text, and TTS is software that does the opposite, turning text into synthesized speech. Both tend to run in connection with an Interactive Voice Response (IVR) platform, which automates interactions with telephone callers via a media server.

Speaker Independent and Speaker Dependent

The majority of speech recognition used in commercial applications (including most mobile applications and telephone IVR systems) is speaker independent, which means the speaker is not known to the system before the interaction begins. Two common speaker-independent applications are phone banking and flight arrival/departure systems. Speaker independence simply means that the software is designed to understand a wide range of users.

Speaker-dependent software, on the other hand, is used primarily for dictation with desktop computers and requires that a user “train” the system over time by speaking into it and correcting its mistakes. One popular brand of dictation software is Dragon NaturallySpeaking.

While early generations of mobile ASR used speaker-dependent software, this is becoming increasingly rare as speech recognition moves off of mobile devices and into the cloud. Today, most speech interface design opportunities come in the IVR and mobile spaces, which are built almost entirely around speaker-independent dialogues that lead a user through an interaction.

Short Utterances vs. Natural Language Understanding (NLU)

When designing a Voice User Interface (VUI), it’s important to consider the type of input, or utterances, expected from users. The simplest utterances are short commands: “Main menu,” “Checking account balance,” “Transfer to operator,” etc. This kind of speech application is often called directed dialog, as the system directs the user to give simple responses, e.g., “You can say ‘checking account,’ ‘savings account,’ or ‘main menu.’”

More complex applications make use of Natural Language Understanding (NLU), which tends to be substantially more difficult to design and test. These applications are marked by long, natural sentences in response to very open-ended prompts such as, “Please state the reason for your call.”

Not all ASR software provides the same capabilities and some may be limiting and constraining to the VUI. For example, ASR that does not support NLU capabilities will obviously not allow for a very open-ended dialog. This interplay between technology and VUI design choices illustrates an important point for the designer working with speech: building a great user experience with speech recognition requires a blend of technical knowledge of the science of speech technology with an understanding of the art of spoken human interface design.

The Human Factor: The Art of Creating a Great User Experience

There are many verbal communication factors to consider when building a VUI compared to a traditional GUI design. For instance, when users are presented with a web page, they are largely constrained and guided by the GUI design. They can click on links, hover over items, enter text with the keyboard, etc., making it relatively easy for the designer to anticipate the types of actions a user will attempt. With speech, however, users have spent their whole lives speaking naturally and may have very limited exposure to speech recognition applications. The highly contextual nature of conversational speech—where humans use intelligence and context to fill in the gaps of what’s actually spoken in a conversation—makes building good VUIs more difficult.

This must be kept in mind when creating a good spoken experience for users. An important consideration is the careful creation of audio prompts. Both the wording and the tone of an audio prompt will affect how users respond. Consider the following dialog:

SYSTEM: Did you want to speak with “John Boston” or “Don Austin?”

USER: Yes.

In this case, the user may be responding “Yes” without listening to the entire prompt, but the result is ambiguous. The designer of the dialog likely intended to get a response indicating which person the user was trying to speak with, but instead constructed a question where a yes/no response is grammatically accurate. A better way to construct the question is:

SYSTEM: We have a “John Boston” and a “Don Austin.” Which one would you like to speak with?

USER: Don Austin.

There are many more of these sorts of human factors considerations—more than will fit in this article. But some of the other significant considerations in developing a good speech-driven VUI include the need to understand “turn-taking” in conversation between the speech system and the user, the cognitive limits that constrain how many choices a user can keep in his head at once, and how speech patterns change when a user becomes frustrated.

The Technical Aspects: The Science of Designing Good Grammars

A designer must also keep in mind the technical aspects of speech technology. After prompt choices, the most important element in speech design is what is known as a grammar. Grammars are requirements of speaker-independent software, and they essentially provide a structured list of the words and phrases that can be recognized at any given time. Grammars constrain the word choices that can be understood by the system. Speech cannot be recognized by the ASR unless it is contained within the grammar.

Designing good grammars is a challenge. Recognition accuracy tends to decrease as the number of word choices in a grammar increases, because the more options the ASR has to choose from, the greater the chance that it will make a mistake. The ideal grammar is big enough to cover the range of things users are likely to say while being small enough to be accurate. Good grammars flow naturally from good application and prompt design, as illustrated above. Well-worded prompts tend to elicit specific responses from users, allowing for more compact grammars.
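To make the idea of a grammar concrete, here is a minimal Python sketch. It is only an illustration of the constraint a grammar imposes, not how any particular ASR engine works: real engines match audio against grammars written in dedicated grammar formats, while this sketch matches text that has already been transcribed. The phrase list and the names ACCOUNT_MENU_GRAMMAR and match_utterance are hypothetical.

```python
# Hypothetical illustration: a grammar as a mapping from the phrases the
# system will accept to the meaning the application acts on.
ACCOUNT_MENU_GRAMMAR = {
    "checking account": "CHECKING",
    "checking": "CHECKING",
    "savings account": "SAVINGS",
    "savings": "SAVINGS",
    "main menu": "MAIN_MENU",
}


def match_utterance(text, grammar):
    """Return the meaning of an in-grammar phrase, or None if out of grammar."""
    # Normalize case and whitespace so trivial variations still match.
    normalized = " ".join(text.lower().split())
    return grammar.get(normalized)
```

In this sketch, "Checking account" resolves to CHECKING, while "what's my balance" returns None because nothing in the grammar covers it. Adding more phrases widens coverage, but in a real recognizer each addition is also one more candidate the engine can confuse with the others.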

Tuning: The Key to Accuracy

An important part of a designer’s job when working with speech recognition is to be involved in the speech tuning process. This is a process where recorded audio from users of the application is captured and then transcribed manually. Using a tuning tool, these transcripts are compared to the recognition results from the ASR to understand how well the system is performing.

Tuning is an extremely important task because it reveals problems in the overall application and in the individual dialog sections that are known as “states.” It can show that users are getting lost in a complex interaction or that a specific grammar doesn’t include enough variations on a phrase. A designer must therefore understand the science of tuning (including how to gather statistically significant samples of data) and the design of VUIs to be able to find and fix any problems.
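The comparison at the heart of tuning can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of any real tuning tool: the record layout (state, human transcript, ASR result), the function name accuracy_by_state, and the sample data are all made up for the example.

```python
from collections import defaultdict


def accuracy_by_state(records):
    """Return the fraction of utterances recognized correctly in each state.

    Each record is a (state, human_transcript, asr_result) tuple.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for state, transcript, asr_result in records:
        totals[state] += 1
        # Count a hit only when the ASR result matches the human transcript.
        if transcript.strip().lower() == asr_result.strip().lower():
            correct[state] += 1
    return {state: correct[state] / totals[state] for state in totals}


# Made-up sample data, for illustration only.
sample = [
    ("main_menu", "checking account", "checking account"),
    ("main_menu", "savings", "savings"),
    ("main_menu", "my checking", "main menu"),
    ("confirm", "yes", "yes"),
]
report = accuracy_by_state(sample)
```

Breaking results out by state is what points the designer at a specific prompt or grammar. In the sample above, the miss in the main_menu state ("my checking" heard as "main menu") suggests a phrase variation the grammar does not cover. A sample this small proves nothing on its own, which is why sample size matters.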

Error Handling: Avoiding Frustrated Users

As a designer’s sophistication with ASR technologies increases, so does his ability to improve a user’s experience. ASR engines that decode the audio into probable words provide a speech application with a wealth of data such as confidence scores (a numeric representation of how likely it is that the answer returned by the ASR matches what the user said) and n-best lists (a list of alternative things the user may have said, for use when the top result from the ASR is incorrect). These tools allow a designer to perform complex error handling that often creates a seamless experience for the user, leading to fewer of the dreaded “I’m not sure what you said” re-prompts. Apple’s Siri product, for example, employs these tricks combined with metadata about a user’s location and search habits to provide an elegant experience that rarely needs to ask for clarification.
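One common way to use confidence scores and n-best lists is a simple three-way decision: accept a confident result, confirm a borderline one, and re-prompt only as a last resort. The Python sketch below assumes the n-best list arrives as (text, confidence) pairs with confidence between 0 and 1. The function names and the threshold values are hypothetical, chosen for illustration, and are not recommendations from any ASR vendor.

```python
def choose_action(n_best, accept_at=0.80, confirm_at=0.50):
    """Decide what to do with an n-best list of (text, confidence) pairs.

    The thresholds are illustrative assumptions, not recommended values.
    """
    if not n_best:
        return ("reprompt", None)
    ranked = sorted(n_best, key=lambda item: item[1], reverse=True)
    best_text, best_conf = ranked[0]
    if best_conf >= accept_at:
        return ("accept", best_text)
    if best_conf >= confirm_at:
        # Borderline result: ask "Did you say ...?" instead of starting over.
        return ("confirm", best_text)
    return ("reprompt", None)


def next_candidate(n_best, rejected):
    """After a rejected confirmation, return the next-best alternative."""
    ranked = sorted(n_best, key=lambda item: item[1], reverse=True)
    for text, _ in ranked:
        if text not in rejected:
            return text
    return None
```

With an n-best list of [("john boston", 0.55), ("don austin", 0.62)], choose_action returns ("confirm", "don austin"). If the user says no, next_candidate offers "john boston" rather than forcing the caller to repeat the name.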

Getting Started: How Long Will It Take?

The good news is that elegant speech applications are easier than ever to build and deploy thanks to a broad range of platform and toolset choices, mature standards, and established best practices. Budgeting the design and development of a speech application is tricky, but the complexity of a solution can usually be gauged roughly by looking at the number of interactions there are with the user in a standard use case and the complexity of each interaction. A yes/no question generally requires one turn (or response) with the user and is quite simple, while capturing a street address may require five turns with the user, some of which are quite complex.

In addition to the initial design and development time, developers must budget for tuning time. The industry generally recommends that about 40% of the initial design and development time be budgeted for tuning. Tuning is performed after the initial deployment and is usually done in multiple cycles, including a round that is performed just after deployment. Other tuning rounds follow after the application has been running for a while with live users. Over time, the need to tune decreases.
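As a quick worked example of the 40% guideline, the arithmetic can be written as a small Python helper. The function name tuning_budget and the 50-day project are hypothetical, and the result is a rough planning figure rather than a precise estimate.

```python
def tuning_budget(design_dev_days, tuning_fraction=0.40):
    """Estimate tuning time as a fraction of design and development time."""
    if design_dev_days < 0:
        raise ValueError("design_dev_days must be non-negative")
    return design_dev_days * tuning_fraction


# A hypothetical project with 50 days of design and development.
estimated_tuning_days = tuning_budget(50)
```

For the hypothetical 50-day project, the guideline works out to roughly 20 days of tuning, spread across the post-deployment cycles described above.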

A new designer should generally budget a couple of weeks of dedicated time for designing a relatively simple speech application such as a directed-dialog IVR that asks users only five or six questions, with about the same amount of time dedicated to post-deployment tuning.

“Computer, compute to the last digit the value of pi.”—Spock

Hollywood has imaginatively shown us how easy communication should be with machines, but that kind of simplicity only comes with good design. Since most computer applications have limited functions, a good speech interface needs to handle only a narrow slice of potential conversations that addresses the special input and output needs of that application. While speech technology can handle variations in a speaker’s accent, speed, volume, and tone, the UX designer is thoughtful about how to handle the many variations in situational and emotional contexts, and in word choice.

The art of speech user interface design is to craft clear questions and fully anticipate the range of potential responses at each stage of a structured conversation. Creating a good interface is easier than computing the last digit of pi; you merely need to focus on a limited application, obtain some assistance, and use tuning tools to improve it over time. This will get you going down the road to a great speech application.

Image of person on phone courtesy Shutterstock.

ABOUT THE AUTHOR(S)

Dave is a hands-on executive who has worked with start-ups and Fortune 500 companies, growing initiatives from scratch to create divisions and whole companies operating on four continents. He has a track record of successful strategic planning, business development, sales acceleration, marketing, and assembling high-performance teams. His strengths are in applying financial rigor and metrics to guide decision-making. He has an interest in transformative technologies and is the CEO of LumenVox.

Comments

You're dead on with respect to accuracy and frustration. We humans rely heavily on being able to communicate. Our survival as a species depends on it, and our success is a direct result of the ability we have to understand each other. We are hard-wired to be really upset when we cannot make ourselves understood. At the gut level, miscommunication is a threat, and so when the system doesn’t understand us, we lose trust in its ability to help us. This is a great step toward making this interaction methodology more acceptable.

However, I'm a bit skeptical about your statement "The time is near when speech will be the most universal user interface." There are cognitive factors to consider:
- It creates more cognitive load to work out what you want something on screen to do, say it, and then confirm that it has worked.
- Humans work better by recognition than by recall. Visual UIs aid recognition, while a voice UI basically requires good recall.
- It is essentially serial, as opposed to a visual UI, which is parallel. This is one of the biggest drawbacks of voice-based interaction with a computer, and one of the reasons why I think the iPhone’s visual voicemail was such a hit. In this respect, the computer would really need to get to the level of a human-human interaction, just “knowing” when to interrupt and when to get interrupted, in order to carry a serial interaction with almost parallel efficiency.