Talking to Machines and Being Heard

by Dave Rich
8 min read

There are many factors to consider when designing speech recognition applications.

Speech recognition presents an exciting and dynamic set of challenges and opportunities for UX designers. With the mass-market reception of consumer technologies such as Apple’s Siri and the near-omnipresence of speech in telephone applications, speech recognition is a computer–human interface many people interact with daily. The uses range from self-service telephone systems like banking applications, to mobile apps that allow users to speak commands and compose messages verbally.

In the future, we can expect to see many different applications integrate speech recognition in some form. The time is near when speech will be the most universal user interface.

An Overview of Speech Technology Jargon

Like many technologies, speech recognition has its own special lingo. Here are some of the basic terms:

Automatic Speech Recognition (ASR) and Text-To-Speech (TTS)

ASR is software that turns spoken words into written text, and TTS is software that does the opposite, turning text into synthesized speech. Both tend to run in connection with an Interactive Voice Response (IVR) platform, which automates interactions with telephone callers via a media server.
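
To make the two halves concrete, here is a minimal Python sketch of an ASR-plus-TTS round trip. It assumes the third-party SpeechRecognition and pyttsx3 packages and a hypothetical recording named caller.wav; it illustrates the concepts, not any particular IVR platform.

```python
# Minimal ASR + TTS round trip, assuming the third-party
# SpeechRecognition and pyttsx3 packages
# (pip install SpeechRecognition pyttsx3).
import speech_recognition as sr
import pyttsx3

recognizer = sr.Recognizer()
with sr.AudioFile("caller.wav") as source:  # hypothetical caller recording
    audio = recognizer.record(source)

# ASR: spoken words -> written text (here via Google's free web API)
text = recognizer.recognize_google(audio)
print("Caller said:", text)

# TTS: written text -> synthesized speech played back to the caller
engine = pyttsx3.init()
engine.say(f"You said: {text}")
engine.runAndWait()
```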

Speaker Independent and Speaker Dependent

The majority of speech recognition used in commercial applications (including most mobile applications and telephone IVR systems) is speaker independent, meaning the system has no prior knowledge of the individual speaker and is designed to understand a wide range of users. Two common speaker-independent applications are phone banking and flight arrival/departure systems.

Speaker dependent software, on the other hand, is used primarily for dictation with desktop computers and requires that a user “train” the system over time by speaking into it and correcting its mistakes. One popular brand of dictation software is Dragon Naturally Speaking.

While early generations of mobile ASR used speaker-dependent software, this is becoming increasingly rare as speech recognition moves off of mobile devices and into the cloud. Today, most speech interface design opportunities come in the IVR and mobile spaces, which are built almost entirely around speaker-independent dialogues that lead a user through an interaction.

Short Utterances vs. Natural Language Understanding (NLU)

When designing a Voice User Interface (VUI), it’s important to consider the type of input, or utterances, expected from users. The simplest utterances are short commands: “Main menu,” “Checking account balance,” “Transfer to operator,” etc. This kind of speech application is often called directed dialog, as the system directs the user to give simple responses, e.g., “You can say ‘checking account,’ ‘savings account,’ or ‘main menu.’”
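
As a sketch of how a directed-dialog state works, the Python snippet below accepts only the short utterances the prompt names and reprompts on anything else. The state and phrase names are illustrative assumptions, not taken from any specific IVR product.

```python
# A single directed-dialog state: the prompt names exactly what can
# be said, and only those utterances are accepted. All names here
# are illustrative; real IVR platforms define states in VoiceXML or
# vendor-specific tooling.
PROMPT = "You can say 'checking account', 'savings account', or 'main menu'."
ALLOWED = {"checking account", "savings account", "main menu"}

def handle_utterance(utterance: str) -> str:
    """Return the next dialog state for a recognized utterance."""
    normalized = utterance.strip().lower()
    if normalized in ALLOWED:
        return normalized  # advance to the matching state
    return "reprompt"      # play PROMPT again

print(handle_utterance("Checking account"))  # -> checking account
print(handle_utterance("Um, what?"))         # -> reprompt
```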

More complex applications make use of Natural Language Understanding (NLU), which tends to be substantially more difficult to design and test. These applications are marked by long, natural sentences in response to very open-ended prompts such as, “Please state the reason for your call.”
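
To show the shape of the problem, the snippet below is a deliberately crude stand-in for NLU that maps an open-ended utterance to an intent with keyword rules. Production NLU relies on trained statistical models; the intent names and keywords here are assumptions for illustration only.

```python
# A deliberately crude stand-in for NLU: map an open-ended utterance
# to an intent with keyword rules. Production NLU uses trained
# statistical models; these intents and keywords are illustrative
# assumptions.
INTENT_KEYWORDS = {
    "report_lost_card": {"lost", "stolen", "missing"},
    "check_balance":    {"balance", "how much"},
    "transfer_funds":   {"transfer", "send money"},
}

def classify_intent(utterance: str) -> str:
    text = utterance.lower()
    for intent, keywords in INTENT_KEYWORDS.items():
        if any(keyword in text for keyword in keywords):
            return intent
    return "unknown"  # route to an agent or reprompt

# Response to "Please state the reason for your call."
print(classify_intent("Hi, um, I think my card got stolen last night"))
# -> report_lost_card
```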

Not all ASR software provides the same capabilities, and some may constrain the VUI. For example, ASR that does not support NLU will not allow for a very open-ended dialog. This interplay between technology and VUI design choices illustrates an important point for the designer working with speech: building a great user experience with speech recognition requires blending technical knowledge of the science of speech technology with an understanding of the art of spoken human interface design.

The Human Factor: The Art of Creating a Great User Experience

Building a VUI involves many verbal communication factors that traditional GUI design does not. When users are presented with a web page, they are largely constrained and guided by the GUI design. They can click on links, hover over items, enter text with the keyboard, etc., making it relatively easy for the designer to anticipate the types of actions a user will attempt. With speech, however, users have spent their whole lives speaking naturally and may have very limited exposure to speech recognition applications. The highly contextual nature of conversational speech—where humans use intelligence and context to fill in the gaps of what’s actually spoken—makes building good VUIs more difficult.

This must be kept in mind when creating a good spoken experience for users. An important consideration is the careful creation of audio prompts. Both the wording and the tone of an audio prompt will affect how users respond. Consider the following dialog:

SYSTEM: Did you want to speak with “John Boston” or “Don Austin?”

USER: Yes.

In this case, the user may be responding “Yes” without listening to the entire prompt, but the result is ambiguous. The designer of the dialog likely intended to get a response indicating which person the user was trying to speak with, but instead constructed a question where a yes/no response is grammatically accurate. A better way to construct the question is:

SYSTEM: We have a “John Boston” and a “Don Austin.” Which one would you like to speak with?

USER: Don Austin.

There are many more of these sorts of human factors considerations—more than will fit in this article. But some of the other significant considerations in developing a good speech-driven VUI include the need to understand “turn-taking” in conversation between the speech system and the user, the cognitive limits that constrain how many choices a user can keep in his head at once, and how speech patterns change when a user becomes frustrated.

The Technical Aspects: The Science of Designing Good Grammars

A designer must also keep in mind the technical aspects of speech technology. After prompt choices, the most important element in speech design is what is known as a grammar. Grammars are required by speaker-independent software; they provide a structured list of the words and phrases that can be recognized at any given time. Grammars constrain the word choices the system can understand: speech cannot be recognized by the ASR unless it is contained within the grammar.

Designing good grammars is a challenge. Recognition accuracy tends to decrease as the number of word choices in a grammar increases, because the more options available to the ASR, the greater the chance of a mistake. The ideal grammar is big enough to cover the range of things users are likely to say while remaining small enough to stay accurate. Good grammars flow naturally from good application and prompt design, as illustrated above: well-worded prompts tend to elicit specific responses from users, allowing for more compact grammars.
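
The sketch below models a grammar as the closed set of phrases the recognizer may return in one dialog state, which is also why a compact grammar elicited by a well-worded prompt leaves less room for error. Real grammars are typically written in the W3C SRGS format; the phrases here are illustrative assumptions.

```python
# A toy model of a speaker-independent grammar: a closed set of
# phrases the ASR may recognize in one dialog state. Real systems
# express this in W3C SRGS (GRXML or ABNF); these phrases are
# illustrative assumptions.
ACCOUNT_GRAMMAR = {
    "checking", "checking account",
    "savings", "savings account",
    "main menu",
}

def in_grammar(utterance: str, grammar: set) -> bool:
    """Speech outside the active grammar cannot be recognized."""
    return utterance.strip().lower() in grammar

# A compact grammar accepts what the prompt elicits and nothing more:
print(in_grammar("Savings account", ACCOUNT_GRAMMAR))   # True
print(in_grammar("my savings thing", ACCOUNT_GRAMMAR))  # False -> reprompt
```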

Tuning: The Key to Accuracy

An important part of a designer’s job when working with speech recognition is to be involved in the speech tuning process, in which recorded audio from real users of the application is captured and transcribed manually. Using a tuning tool, these transcripts are compared to the recognition results from the ASR to understand how well the system is performing.

Tuning is an extremely important task because it reveals problems in the overall application and in the individual dialog sections known as “states.” It can show that users are getting lost in a complex interaction or that a specific grammar doesn’t include enough variations on a phrase. A designer must therefore understand both the science of tuning (including how to gather statistically significant samples of data) and the design of VUIs to be able to find and fix problems.
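
At its core, the tuning comparison can be sketched as follows: line up human transcripts against ASR results per dialog state and compute a per-state accuracy that flags where grammars or prompts need rework. The sample data below is an illustrative assumption, not real tuning output, and real tuning tools score semantic matches rather than exact strings.

```python
# Core tuning computation, sketched: compare human transcripts with
# ASR results per dialog state to find the states that need grammar
# or prompt rework. The sample data is an illustrative assumption.
from collections import defaultdict

# (state, human transcript, ASR result) from manually transcribed calls
samples = [
    ("account_type", "checking account", "checking account"),
    ("account_type", "savings account",  "savings account"),
    ("account_type", "my savings one",   "main menu"),  # misrecognition
    ("yes_no",       "yes",              "yes"),
    ("yes_no",       "no",               "no"),
]

correct = defaultdict(int)
total = defaultdict(int)
for state, transcript, asr_result in samples:
    total[state] += 1
    if transcript == asr_result:  # real tools also score semantic matches
        correct[state] += 1

for state in sorted(total):
    accuracy = correct[state] / total[state]
    print(f"{state}: {accuracy:.0%} accurate over {total[state]} utterances")
```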

Error Handling: Avoiding Frustrated Users

As a designer’s sophistication with ASR technologies increases, so does his ability to improve a user’s experience. ASR engines that decode audio into probable words provide a speech application with a wealth of data, such as confidence scores (a numeric representation of how likely the answer returned by the ASR matches what the user said) and n-best lists (a list of alternative things the user may have said in case the top result from the ASR is incorrect). These tools allow a designer to perform sophisticated error handling that creates a more seamless experience for the user, leading to fewer of the dreaded “I’m not sure what you said” reprompts. Apple’s Siri, for example, combines these techniques with metadata about a user’s location and search habits to provide an elegant experience that rarely needs to ask for clarification.
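
The snippet below sketches one common error-handling pattern built on those two pieces of data: accept high-confidence results silently, confirm middling ones, and offer the next n-best candidate before resorting to a reprompt. The thresholds and the result structure are assumptions for illustration; each ASR engine exposes this data in its own format.

```python
# One common error-handling pattern using confidence scores and an
# n-best list: accept, confirm, or offer an alternative instead of
# immediately reprompting. Thresholds and the result structure are
# illustrative assumptions.
ACCEPT = 0.80   # above this, proceed silently
CONFIRM = 0.45  # between CONFIRM and ACCEPT, ask "Did you say ...?"

# hypothetical n-best list from the ASR, best hypothesis first
n_best = [
    {"text": "Don Austin",  "confidence": 0.62},
    {"text": "John Boston", "confidence": 0.31},
]

def next_action(results):
    top = results[0]
    if top["confidence"] >= ACCEPT:
        return ("accept", top["text"])
    if top["confidence"] >= CONFIRM:
        return ("confirm", f"Did you say {top['text']}?")
    if len(results) > 1:
        # offer the runner-up rather than starting over
        return ("offer_alternative", f"Did you mean {results[1]['text']}?")
    return ("reprompt", "Sorry, who would you like to speak with?")

print(next_action(n_best))  # -> ('confirm', 'Did you say Don Austin?')
```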

Getting Started: How Long Will It Take?

The good news is that elegant speech applications are easier than ever to build and deploy thanks to a broad range of platform and toolset choices, mature standards, and established best practices. Budgeting the design and development of a speech application is tricky, but the complexity of a solution can usually be gauged roughly by looking at the number of interactions there are with the user in a standard use case and the complexity of each interaction. A yes/no question generally requires one turn (or response) with the user and is quite simple, while capturing a street address may require five turns with the user, some of which are quite complex.

In addition to the initial design and development time, developers must budget for tuning time. The industry generally recommends that about 40% of the initial design and development time be budgeted for tuning. Tuning is performed after the initial deployment and is usually done in multiple cycles, including a round that is performed just after deployment. Other tuning rounds follow after the application has been running for a while with live users. Over time, the need to tune decreases.

A new designer should generally budget a couple of weeks of dedicated time for designing a relatively simple speech application such as a directed-dialog IVR that asks users only five or six questions, with about the same amount of time dedicated to post-deployment tuning.

“Computer, compute to the last digit the value of pi.”—Spock

Hollywood has imaginatively shown us how easy communication with machines should be, but that kind of simplicity only comes with good design. Since most computer applications have limited functions, a good speech interface needs to handle only a narrow slice of potential conversations, one that addresses the special input and output needs of that application. While speech technology can handle variations in a speaker’s accent, speed, volume, and tone, the UX designer must think carefully about how to handle the many variations in situational and emotional context, and in word choice.

The art of speech user interface design is to craft clear questions and fully anticipate the range of potential responses at each stage of a structured conversation. Creating a good interface is easier than computing the last digit of pi; you merely need to focus on a limited application, obtain some assistance, and use tuning tools to improve it over time. This will get you going down the road to a great speech application.


Dave Rich

Dave is a hands-on executive with start-up and Fortune 500 experience, growing initiatives from scratch into divisions and whole companies operating on four continents. He has a track record of successful strategic planning, business development, sales acceleration, marketing, and assembling high-performance teams. His strengths are in applying financial rigor and metrics to guide decision-making. He has an interest in transformative technologies and is the CEO of LumenVox.

Specialties

Business, channel and sales development, strategic negotiations, product and program management, global planning and operations, financial modeling, offshoring, call center sales and customer service, process improvement, widget- and internet-based applications, start-ups, German language fluency, spoken French, inter-cultural communication.
