
Two Major Challenges with Speech-Recognition Technology

by Andrew Wagner
5 min read

An effective voice user interface requires solid error mitigation and the ability to actively demonstrate the capabilities of your design.

Speech is still a relatively new interface. Technology is finally starting to catch up to the dreams that we’ve had since the invention of the computer itself: dreams of having natural conversations with computers—something portrayed in countless science fiction movies, TV shows, and books.

But from a UX standpoint, what does “natural” even mean?

The most popular conception of a natural conversation with a computer is one in which the user can say anything they want and the computer will understand and respond appropriately. In commercials, Apple advertises Siri as a personal assistant that you can seemingly say anything to. This, however, is far from reality.

It’s a speech designer’s job to make it seem as if users can say anything, when that’s not actually the case. Designers must develop ways to shape and constrain a user’s interactions with their device. Users must be trained to communicate in a way the device can understand, without feeling like they’re tailoring the way they speak to it.

Users must also be made aware of what the device can do, both to prevent errors and to harness its complete power. Feature discovery and phrasing are the two biggest challenges in designing the user experience of speech-recognition technology.

Feature Discovery

This is by far the hardest part of speech interface design. Speech recognition is still very new, so systems simply cannot recognize and do everything. Even humans sometimes misunderstand or misinterpret what someone is saying. On top of that, people rarely read user manuals or research everything a device can do.

Designers need to find ways to educate users about what they can do as they are interacting with devices. With touch interfaces this can be achieved through well-named buttons and high-level categorization. Many speech interfaces do not have the luxury of these visual cues.

The most obvious way that people train one another is through explicit instruction. Teachers spend a lot of time lecturing their students. Parents explain to their kids that they should treat others the way they wish to be treated. This can be one way for devices to train users, but it is potentially time-consuming and frustrating for experienced users. As interface designers, we must find more ways to help users train themselves through self-discovery.

Another way that we teach one another is through leading by example. We don’t teach a newborn to speak their first words by telling them how to shape their mouth and where to place their tongue. We speak in front of them and they experiment on their own to mimic the sounds they hear.

Nobody teaches someone to use a revolving door: we see someone else use it and copy them. Are there opportunities for the device to lead the way in an interaction? Perhaps two virtual actors could carry out a conversation for the user to observe. This method could end up being verbose but, if done well, it could also be very successful, considering our brains are wired to learn this way.

Bottom line: if the user can’t figure out what they can do with the device, they may never unlock its power, negating all of the work designers put into it.

Phrasing

People have developed many ways to express ideas, and even many ways to express the same idea. Synonyms and ambiguities are incredibly challenging elements of speech recognition from a technical point of view, forcing developers to choose between accuracy and performance. If we can design a UX that reduces ambiguity and the number of ways to phrase an idea, the system can be tuned to perform much better. If the device uses consistent phrasing, the user will tend toward using the same phrasing in the future.
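
To make that trade-off concrete, here is a minimal, hypothetical sketch using Apple's Speech framework (my choice of example, not tooling the article discusses): supplying the recognizer with the handful of phrasings the interface actually supports biases recognition toward them.

    import Speech

    // Illustrative sketch only: bias the recognizer toward the phrasings the
    // interface supports so the search space shrinks and accuracy improves.
    let request = SFSpeechAudioBufferRecognitionRequest()
    request.contextualStrings = ["what time is it", "set a timer", "navigate home"]

    // Keeping recognition on-device and task-specific also helps latency,
    // the other side of the accuracy-versus-performance trade-off.
    request.requiresOnDeviceRecognition = true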

People frequently repeat what another person has said, with very slight variation, in order to clarify an idea. This can often be a mechanism for helping someone learn to express an idea better.

A mother teaching the difference between “can” and “may” might go like this:

“Mommy, can I have soda?”

“May you have soda?”

Designers standardizing terminology might go like this:

“When the user drags their finger quickly to the left the page should move to the next page on the right.”

“Ok, so a swipe left transitions to the right page?”

This means that if we have a device that can tell time, it can listen for the following phrases:

  • “What time is it?”
  • “What hour is it?”
  • “How late is it?”

The device can always reply “The time is five thirty-two,” cueing the user to say “time” instead of “hour” or “late.” Developers can then concentrate on making the “What time is it?” phrase work better.
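
As a rough illustration of that pattern (the code below is my own sketch, not the author's implementation), the device can accept every known phrasing but always answer in the canonical one:

    import Foundation

    // Minimal sketch: accept all three phrasings, but always answer with the
    // canonical word "time" so the user gravitates toward the phrase the
    // recognizer handles best.
    struct TimeIntent {
        static let phrases = ["what time is it", "what hour is it", "how late is it"]

        static func matches(_ utterance: String) -> Bool {
            let normalized = utterance.lowercased()
                .trimmingCharacters(in: .whitespacesAndNewlines)
                .trimmingCharacters(in: .punctuationCharacters)
            return phrases.contains(normalized)
        }

        static func response(at date: Date = Date()) -> String {
            let formatter = DateFormatter()
            formatter.dateFormat = "h:mm"   // "5:32"; a TTS engine would speak it
            return "The time is \(formatter.string(from: date))"
        }
    }

    // Any of the three phrasings receives the same canonical reply.
    if TimeIntent.matches("How late is it?") {
        print(TimeIntent.response())        // e.g. "The time is 5:32"
    }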

Lastly, another idea for training the user’s phrasing is to use non-verbal positive and/or negative feedback. People use body language to indicate if they understand what someone else is saying. They will often nod along if they understand or they may have a puzzled expression if they don’t.

It would be great if we could develop a similar system for speech recognition devices. A positive tone of voice could indicate that the device understands the user very well, providing a subtle hint that they should continue to use similar phrasing. We may also flash the screen or vibrate the device to signify a positive or negative response.
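
As a hedged sketch of what that could look like on a phone (the UIKit calls and the confidence threshold below are illustrative assumptions, not anything prescribed by the article):

    import UIKit

    // Illustrative sketch: signal whether an utterance was understood using
    // haptics and a brief flash instead of a spoken correction. The 0.6
    // threshold is an arbitrary example value.
    func signalRecognition(confidence: Float, in view: UIView) {
        let haptics = UINotificationFeedbackGenerator()
        if confidence > 0.6 {
            haptics.notificationOccurred(.success)   // subtle "I understood you"
        } else {
            haptics.notificationOccurred(.error)     // subtle "please rephrase"
            // Brief dim-and-restore flash for users glancing at the screen.
            UIView.animate(withDuration: 0.15, animations: { view.alpha = 0.4 }) { _ in
                UIView.animate(withDuration: 0.15) { view.alpha = 1.0 }
            }
        }
    }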

Constraining phrasing may be just an intermediate step until the technology improves, but training the user toward more predictable phrasing will always improve the experience.

The Even Bigger Problem

The key to solving these problems is feedback, and here is the true difficulty: how can the device provide unobtrusive and concise feedback that helps shape the new user without frustrating the experienced one? To make this even more difficult, speech interfaces are often used in circumstances where the user cannot be looking at the device, so we can’t rely on existing paradigms of visual feedback.

There is hope, however. People are expert auditory communicators: we all do it every day. There are many things we can learn by studying how we speak with one another. What other tools can we learn and utilize to make everyone an expert with speech interfaces?

 

Image of speaking mouth courtesy Shutterstock


Andrew Wagner
Andrew Wagner tries to bridge the gap between programmer and UX designer. He has worked primarily in programming roles but has always contributed to the UX conversation. He currently works as a developer and consultant for Chronos Interactive, a development shop focusing on both websites and mobile apps. He also develops his own apps as Learn Brigade, LLC. His current apps include:

  • Notecards - Study using virtual note cards, anywhere, anytime
  • Busy Bee Cafe - Contract app for a Cafe / Restaurant / Bar in Raleigh, North Carolina. It allows patrons to look up the current menus, events, articles, and updates.
Before starting Chronos Interactive, Andrew worked as an independent developer as Drewag, LLC and as the lead iOS developer at a startup called ShowMobile in Denver, CO. He also worked at Garmin developing speech recognition, where he brainstormed and implemented new types of speech interaction for Garmin's navigation devices.
