Whether it’s Siri reminding you when your next meeting is, your in-car GPS telling you to take the next left, or software reading online text aloud, voice systems have become a commonplace—though often frustrating—feature of our digital lives. Now more than ever, we need some top-notch voice UX. Three UX experts tell us what’s happening now, and what should happen in the future.
Voice systems have been with us for a long time—too long, it often feels. Are there new features—and better ways of dealing with errors—that might bring hope to both users and designers?
Aaron Gustafson: First off, even though they are directly connected, we should draw a distinction between speech recognition and speech synthesis. On the recognition end of things, the algorithms we’ve employed to analyze and process human speech have improved dramatically, bolstered by improvements in chip speed and processing power. Apple’s Siri, for instance, recognizes the start and end of a speech block, then quickly removes the background noise and breaks the speech up into tiny chunks of sound for analysis. That’s a lot of processing that needs to take place in a very short time. And it works relatively well with no training. Early speech recognition software, by comparison, required hours of setup in which you read from a script to teach the software to recognize your speech.
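The pipeline Aaron describes—find the boundaries of a speech block, then slice it into tiny chunks for analysis—can be illustrated with a toy endpoint detector. This is a simplified sketch, not Siri’s actual algorithm: real recognizers use spectral features and learned models, and the frame size and threshold here are arbitrary illustrative values.

```python
# Toy endpoint detection: find where speech starts and ends in a signal,
# then slice the speech region into short fixed-size frames for analysis.
# A real recognizer uses spectral features and trained models; this sketch
# uses a simple short-term-energy threshold (all values are illustrative).

def frame_energy(frame):
    """Mean squared amplitude of one frame."""
    return sum(s * s for s in frame) / len(frame)

def detect_and_frame(signal, frame_size=4, threshold=0.01):
    """Split the signal into frames and keep only the span between the
    first and last frame whose energy exceeds the threshold."""
    frames = [signal[i:i + frame_size]
              for i in range(0, len(signal) - frame_size + 1, frame_size)]
    energies = [frame_energy(f) for f in frames]
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return []  # nothing but silence
    start, end = voiced[0], voiced[-1]
    return frames[start:end + 1]

# Silence, a burst of "speech", then silence again.
signal = [0.0] * 8 + [0.5, -0.4, 0.6, -0.5] + [0.0] * 8
speech_frames = detect_and_frame(signal)
```

The point of the sketch is the shape of the work, not the numbers: boundary detection, noise rejection, and framing all have to happen before recognition proper can even begin.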
As processors get smaller and faster, we can dump more and more complex algorithms on top of them. Back in 2012, two college students—Nicole Newman and Cintia Kotsubo—created a real-time translator called BabelSushi. And they crowdsourced the translations to enable it to keep up with native-speaker slang to boot. That sort of Star Trek-caliber tech would never have been possible ten years ago; now it’s moving into the mainstream: both Google and Microsoft recently announced and launched (respectively) real-time speech translation.
On the other end of things, synthesized speech has come a long way. Even looking at the perennial voice-UX whipping boy—automated telephone response systems—you can see how far we’ve come. Software voices have, in large part, moved beyond the robotic and embraced more human-like nuance, including emphasis, stress, and intonation, which help us better understand them and reduce our stress when interacting with them (assuming they’ve been well programmed by humans, of course, but that’s where we come in).
Steve Portigal: Maybe it’s worth a moment to take stock. As users, frustration is frustration, granted. But as the people who help bring these experiences into the world, we are truly working in the realm of science fiction. When I was coming out of graduate school, the prediction was that we would never have speaker-independent voice recognition (where the system could understand anyone, without having to be trained on a particular user’s voice), and that with speaker-dependent systems we would, at best, be able to work with a small vocabulary. Synthesized speech was not expected to get much better than that ‘80s robot voice. Barriers have been broken and we can do amazing things right now given where we used to be. I once interviewed a couple where the husband was enthusiastically installing “smart home” technology, and it seemed that the voice recognition was much better at recognizing a man’s voice than a woman’s. Gender dynamics made manifest by technology! While it’s not so easy to be a Scottish Siri user these days, there’s still been remarkable progress.
Whitney Quesenbery: A “voice UI” can mean so many things, from a device that reads to us, to one that responds to our voice. The mobile context is obviously driving the use of voice and speech, for situations when it’s hard to type or hard to read the screen. Just think of using a navigation system without voice. Not just difficult, but dangerous.
Both speech recognition and output are good enough for general use—I regularly see people sending texts by talking to their phone, for example. The technology is finally getting there.
Some of the pioneers in using voice UIs are people with disabilities (and a lot of the early research focused on accessibility). There are now whole libraries of voices for screen readers and text-to-speech programs, including appropriate voices for different languages, so you can choose a voice that you like, not just settle for the default.
A screen reader is half of a voice UI, in that it uses speech output with keyboard input. A full voice control system adds voice input. These systems make it possible to use a computer or device without using your hands. They can be important for people with repetitive stress injuries who find using a mouse and keyboard difficult or tiring. There are programs from companies like Redstart Systems that add full control of the computer to speech-recognition engines such as Dragon NaturallySpeaking.
Our voices are imperfect—they modulate, they hem and haw, they express how we feel. Should these systems become more human? Do we want them to, for example, learn a user’s speech patterns or wait if we don’t respond?
Aaron Gustafson: There’s definitely the potential for synthetic, disembodied voices to take on an uncanny feel. It’s the voice equivalent of the CGI in movies like The Polar Express: seemingly realistic, with an eerie underpinning of hollow lifelessness. We need to be careful to design voices that don’t just tick the boxes of “human-ness” but that actually have some personality as well. Some of this comes in the form of cadence, appropriate (albeit fake) reflective pausing, and intonation, but also from how the system assembles its responses. A system that only speaks in “proper” English, for instance, is likely to be less believable than one that uses more conventional spoken English with the occasional dash of contextually appropriate slang.
At the same time, we also need to make our systems smart enough to translate what the user literally says to what they actually mean, accounting for slang, “improper” grammar, and non-traditional word use. It’s incredibly frustrating to repeatedly try to find the one word the speech-recognition programmer used to describe something when there are numerous alternatives that are equally applicable. It’s our job to reduce friction like this, so we can’t be lazy.
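Aaron’s complaint—hunting for the one word the programmer chose when many alternatives would do—often comes down to maintaining a many-to-one mapping from spoken variants to a canonical command. A minimal sketch of that idea follows; the vocabulary here is invented for illustration, and a production system would add fuzzy matching and learned language models rather than a hand-built table.

```python
# Map many spoken variants onto one canonical intent, so the user isn't
# forced to guess the single word the designer happened to pick.
# The vocabulary below is hypothetical, for illustration only.

SYNONYMS = {
    "cancel": {"cancel", "never mind", "forget it", "stop"},
    "repeat": {"repeat", "say that again", "what was that", "come again"},
    "confirm": {"confirm", "yes", "yeah", "sounds good", "do it"},
}

# Invert the table into a flat lookup: spoken phrase -> canonical intent.
LOOKUP = {phrase: intent
          for intent, phrases in SYNONYMS.items()
          for phrase in phrases}

def normalize(utterance):
    """Return the canonical intent for an utterance, or None if unknown."""
    return LOOKUP.get(utterance.strip().lower())
```

Every phrase added to the table is one less moment of the user repeating themselves at a machine that refuses to understand.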
Steve Portigal: I’ll invoke the Uncanny Valley here. Right now we expect them to be limited in how human-like they come across. That’s why if you overhear someone on the phone, you can tell by their volume and pacing that they are speaking with an interactive voice response system. That’s a learned response to an imperfect interaction. Whether or not designers want us to be talking in that way, that’s the established practice. So if these systems become more human-like (actually becoming human isn’t, of course, possible), then there’s the possibility of increasing awkwardness until we collectively establish a comfort with it. Parody videos, SNL sketches, and op-ed columns are all ways that we acknowledge as a culture that this new thing is weird, and we adjust to it. But the prospect of increasing the human-ness of these systems raises some interesting design possibilities. What if, instead of simply learning my voice patterns, it mirrored (gently!) some of those patterns? You’ve probably been on the phone with someone who is whispering due to a sore throat or because they’re speaking in a confidential context, and you can’t help but whisper back. That mirroring behavior is instinctive, and it helps us offer a sense of connection to our interlocutor.
Whitney Quesenbery: In Wired for Speech, Clifford Nass and Scott Brave wrote about the human response to speech and how we could use it to match the personality of the voice system to the role and context of the system. Much like we have different ringtones for family, friends, and strangers, I can imagine different voices for different apps, or even different voices depending on the activity. Reading an article might call for a different voice than navigating traffic. Or the voice might vary depending on the role the app was playing, such as a quietly helpful personal assistant.
One of my favorite telephone “personal assistants” (sadly, now defunct) had a very lifelike voice. Most of the time, it was simply pleasant and neutral. But when you did not pick up the phone, the system told your caller, “I’m sorry but she’s not where she said she would be” in a slightly aggrieved tone. It always made me laugh.
Like so much of design, the line between competent and delightful is in the details.
If voice systems do become human-ish, theoretically they should learn from and about us. That can improve efficiency, but it could also be creepy. Where, as designers, do we draw the line?
Aaron Gustafson: I think the best systems will learn from us and reflect our cadence and word choice just enough to reassure us they are listening. It’s an empathetic response that we, as humans, often provide to one another unconsciously, and it would go a long way toward making voice interactions with software more natural. I’m not sure I can foretell where that line between natural-enough and creepy is, but luckily it’s all software. We can always tune it to get better over time.
Steve Portigal: The creepy line is something that many designers should be considering well beyond voice-based interactions. As systems have more information about their users, they can provide an enhanced experience—but that line, oh, that line. We can point to what’s way over the creepy line, but the line itself continues to shift. We worked with a client to help them understand, for their particular design, where that line was. And while it’s going to change over time and for each category of product, the themes we uncovered—that this is a conversation that takes place in a relationship; the permissions you have advance and escalate over the course of that relationship, just like they do between people—are general, and powerful, guidelines for other situations.
Whitney Quesenbery: Why is this any creepier in a voice system than in any other system? Systems that learn about us and try to predict what we need to know or what we want always seem to walk the line between being amazing and creepy. Almost all predictive or “intelligent” systems learn by considering context, and context matters most when words are used metaphorically.
I’ve experimented with listening to speech output. Even with the fairly mechanical voice on the early Kindle, it took me just a few hours to get used to hearing it read. What I never really got used to, though, was the way it didn’t seem to understand semantic markup. The transition between chapters, for example, had no more weight or time added than going from one sentence to another. Where’s the deep breath as you turn the page and start a new section of a book?
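The missing “deep breath” Whitney describes is exactly what semantic markup could drive: a renderer that maps structural boundaries to pause lengths. The sketch below emits W3C SSML `<break>` tags; the durations are arbitrary choices for illustration, not values taken from the Kindle or any particular screen reader.

```python
# Give structural boundaries proportionate pauses by translating them into
# SSML <break> tags. The durations are illustrative; a real text-to-speech
# renderer would expose them as tunable settings.

PAUSE_MS = {
    "sentence": 300,    # brief pause between sentences
    "paragraph": 700,   # longer pause between paragraphs
    "chapter": 2000,    # the "deep breath" before a new chapter
}

def with_breaks(segments):
    """segments is a list of (kind, text) pairs; returns one SSML string
    with a pause after each segment sized to its structural level."""
    parts = []
    for kind, text in segments:
        parts.append(text)
        parts.append('<break time="%dms"/>' % PAUSE_MS[kind])
    return "".join(parts)

ssml = with_breaks([
    ("sentence", "The end of chapter one."),
    ("chapter", "Chapter Two."),
])
```

The structural information is already present in most e-book formats; the failure Whitney describes is a renderer that flattens it before the voice ever sees it.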
I’m always amazed at how fast people who use voice output regularly can listen. People who use screen readers can set the speech rate so fast that most of us can’t even make out the words. Maybe impatience is a powerful motivator. The first system with sped-up speech that I remember was a message system for people on Wall Street.
Like what these experts had to say? You can have them bring their brains to you. Aaron Gustafson, Steve Portigal, and Whitney Quesenbery are available for consulting and training through Rosenfeld Media.