“I hate you, Siri!”

Tween rage erupts from the backseat of my car. Even now—more than three years after my once-unusual first name became every iPhone owner’s personal assistant—my initial reaction is hurt surprise: “What did I do?” Seconds later, I realize with relief that it’s not about me. My 11-year-old daughter and her friends are talking to the iPhone.

We all get frustrated, but that’s a pretty strong reaction. While expectations may have been high at first, most iPhone users have settled into a less emotional relationship with their electronic assistants. A study previously conducted by Blink UX (PDF) described the process iPhone users went through in their first month of Siri usage: initial excitement, optimistic curiosity, frustrated exploration, and acceptance and habit formation.

The Problem of Mishearing

Though our skills as individuals may vary, adult humans are generally pretty good at adjusting to others in interaction, even when the ‘other’ is a disappointing smartphone assistant. Most of us also have some sense of what we can reasonably expect a machine to be able to do, based on our (perhaps rough) understanding of how the machine might be doing it. So, after a brief hopeful period, we adjusted the way we talk to Siri (or stopped using it entirely). These days, though we use our voices, we do not attempt or expect conversations. Rather, we produce “input” based on what we expect the system to be able to use, and then do what we can with what we get back.

It’s different for the kids in my backseat. Whether it’s due to their still-developing ability to take a flexible perspective on what a machine can and cannot do, or simply less experience with the technology, Siri’s inability to perform as a competent communication partner seems to cause them great difficulty and annoyance. And no one can express annoyance like an 11-year-old girl.

The frustration I’m hearing from the backseat is not just about being misheard or misunderstood by Siri. Sure, it’s going to get things wrong. Their frustration is mainly about the difficulty of getting things back on track when they do go wrong.

Here is a typical breakdown:

  1. Trina: How do you say, “I want the big cheeseburger” in Spanish?
  2. Siri: Here is what I found on the web.
  3. Trina: I mean, can you translate it for me?
  4. Siri: I’m sorry. I can’t find any translation services.
  5. Trina: How do I get translation services?
  6. Siri: I’m sorry. I can’t find any translation services.
  7. Trina: How do I get them?
  8. Siri: Let me think about that … Here is what I found on the web for “How do I get them?”
  9. Trina: Aaaaaah! Forget it!

At one time, I heard it estimated that Siri correctly interprets about 50% of the utterances spoken to it. This has gotten better—some reported a dramatic improvement in speech recognition accuracy with the release of iOS 8. But that’s not the real problem. Let’s keep in mind that human beings also do not accurately hear and understand every utterance directed to them. Mishearings and conceptual mismatches are common.

The difference is that humans have a robust system for recognizing and repairing these breakdowns.

This system is part of our common heritage and embedded in our languages, interactional norms, and social institutions.

Siri, at this point, does not take advantage of this system.

What if it understood only about half of the time, but made it easy to clear up misunderstandings? That would bring us a lot closer to having successful conversations with a digital assistant.

Conversational Repair

Research on language and social interaction has identified methods that are used across cultures and languages to deal with the very common speech errors and misunderstandings that occur when people talk with each other.

Here are a few of those patterns, as relevant for this discussion:

First position: self-repair

This is called “first position” repair because it takes place within the same utterance as the problem that’s being repaired (the first position). It’s when we realize that we haven’t got it quite right and correct ourselves before even finishing the sentence.

  • Person A: How do I get to the ai- Sea Tac airport?
  • Person B: Probably easiest to take the light rail.

  • Person A: How do I get to the ai- Sea Tac airport?
  • Siri: I didn’t find any airports matching “Plasticraft”

Human beings use auditory and contextual cues to determine which parts of the sentence to pay attention to and filter out the rest (“How do I get to [ ] Sea Tac airport?”) – usually without even realizing that we are doing it.

Siri, like speech recognition systems in general, does not deal well with this type of repair, presumably because it attempts to understand the entire utterance as spoken.

Most experienced iPhone users anticipate this problem and prevent it by stopping and starting over with a full, well-produced sentence. It is not obvious to the kids in my backseat, however, that Siri is going to have trouble with this sort of thing—after all, no one else ever does.
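To make the filtering step concrete, here is a minimal sketch of what dropping a false start might look like in code. It assumes a hypothetical transcription convention in which a cut-off word arrives with a trailing hyphen, as in the “ai-” of the example above; real recognizers may not mark cut-offs this way at all, and the function name is made up for illustration.

```python
import re

# Assumed convention (not any real recognizer's output format): a word the
# speaker abandoned mid-way is transcribed with a trailing hyphen, e.g. "ai-".
CUTOFF = re.compile(r"\b\w+-(?=\s)")


def drop_false_starts(utterance: str) -> str:
    """Remove cut-off word fragments and collapse the leftover whitespace."""
    cleaned = CUTOFF.sub("", utterance)
    return " ".join(cleaned.split())


print(drop_false_starts("How do I get to the ai- Sea Tac airport?"))
```

Ordinary hyphenated words such as “well-produced” are left alone, because the pattern only matches a hyphen that is followed by a space. People do far more than this, drawing on intonation and context, but even a crude filter shows that the abandoned fragment does not have to be treated as part of the request.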

Second position: other-initiated self-repair

  • Anna: I’m meeting with Mark tomorrow.
  • Mark: Who?
  • Anna: I’m meeting with Mike tomorrow. Sorry! (laughs)

This type of repair happens when, in response to an utterance, a person indicates that there is some source of trouble that is preventing his or her understanding. The initial speaker then corrects the problem in their next utterance.

Repair can be initiated in a number of ways. Some responses like “Huh?” or “What?” only indicate that there is a problem and give no clue as to what it is. In the example above, the question “Who?” suggests that the problem is with the name that has been provided.

Another way of initiating this type of repair is by repeating the problematic part of the initial utterance with a questioning intonation:

  • Anna: I’m meeting with Mark tomorrow.
  • Mark: Mark?
  • Anna: I’m meeting with Mike tomorrow. Sorry! (laughs)

Siri does not do well with these, either.

  • Trina: What’s in enchilada sauce?
  • Siri: Consulting my tea leaves. Here’s what I found on the web for “What’s an enchilada sauce”
  • Trina: Your tea leaves?
  • Siri: You’re certainly entitled to that opinion.

It’s hard to say how Siri interpreted “Your tea leaves?” but it clearly was not interpreted as the request for clarification intended.
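As a rough illustration of how an assistant might at least notice these repair initiations, the sketch below classifies a reply against the preceding turn. The word lists, the labels, and the function name are all assumptions made for this example; none of it describes how Siri or any other product works.

```python
import re

OPEN = {"huh", "what"}               # signal trouble, give no clue where
CATEGORY = {"who", "where", "when"}  # hint at the kind of item in trouble
# A listener echoing "my tea leaves" says "your tea leaves", so swap pronouns.
SWAP = {"your": "my", "you": "i", "my": "your", "i": "you"}


def tokens(text):
    """Lowercase word tokens, treating curly and straight apostrophes alike."""
    return re.findall(r"[a-z']+", text.lower().replace("\u2019", "'"))


def classify_repair(previous_turn, reply):
    """Label a reply as a repair initiation, or return None if it is not one."""
    if not reply.strip().endswith("?"):
        return None
    prev, rep = tokens(previous_turn), tokens(reply)
    if len(rep) == 1 and rep[0] in OPEN:
        return "open"
    if len(rep) == 1 and rep[0] in CATEGORY:
        return "category"
    echoed = [SWAP.get(word, word) for word in rep]
    n = len(echoed)
    # A partial repeat is a contiguous run of words from the previous turn.
    if n and any(prev[i:i + n] == echoed for i in range(len(prev) - n + 1)):
        return "partial_repeat"
    return None


print(classify_repair("Consulting my tea leaves.", "Your tea leaves?"))
```

Recognizing that “Your tea leaves?” echoes part of the previous turn is only the first half of the job. The assistant would still need to respond with a clarification of that phrase rather than with a new search or a canned quip.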

Third position: self-repair

Third position self-repair occurs when the initial speaker realizes, after hearing the other’s response, that her initial utterance was misunderstood, and attempts to clarify what she meant.

For instance, in the first example above, in which Trina seeks a translation from Siri, Siri makes reference to translation services. In response, Trina asks how she can get them.

  • Trina: How do I get translation services?
  • Siri: I’m sorry. I can’t find any translation services.
  • Trina: How do I get them?

Trina uses a common repair mechanism: a partial repeat of the preceding utterance with stress placed to clarify how the utterance is to be interpreted (i.e., the important part here is the getting). Siri does not recognize this as a repair attempt. Instead, it tries to make sense of the phrase in isolation and falls back on searching the web for it.

Trina is just like, whatevs. Never mind.

What’s Human?

In a way, these kids were set up. We all were. Among all the witticisms that Apple engineers programmed into Siri, the best was the practical joke of providing just enough smart and sassy answers that we believed—some of us for longer than others—that we could actually have conversations with our telephones (not just through them). We started talking to these chunks of glass and metal and expected to be understood. Hilarity, and frustration, ensued.

Google took a different approach with Google Now—emphasizing contextual awareness over chatty interaction. It’s the assistant that knows what you need before you know it yourself. The promised benefit is not that you can ask questions of your phone, but that you don’t have to.

Echo is Amazon’s recent offering in this space—currently available only in limited release. It is a speaker, not a phone, so its potential differentiators are the use cases afforded by its persistent location in the home, far-field voice recognition, and integration with streaming music services. Its dependency on a wake word (“Alexa”) makes natural conversation unlikely: Imagine an interaction in which you have to preface every comment with the name of the person you are talking to. Though a few witty responses have been programmed in (ask Alexa to “show me the money” if you’ve always wanted to hear Cuba Gooding Jr.’s role in Jerry Maguire played by your bank’s automated telephone system), chattiness does not seem to be a primary goal.

Microsoft, on the other hand, sought to imbue Cortana with the sass and humor that Apple has led us to expect from a digital personal assistant. In interviews, the company has indicated that building in “chit chat” is part of a strategy for eliciting natural language input from users, the idea being that people will be more likely to speak in a conversational manner to a digital assistant that has a personality. That natural language input will then form a rich source of data from which Cortana can learn and improve its responses.

But what about Siri? Plenty of sass, but it doesn’t seem like people are spending a lot of time chatting with her anymore. Is there reason to believe that canned responses—no matter how many—are going to elicit rich conversational input from users?

What if the effort these companies put towards giving digital assistants “personality” went instead towards giving them the ability to support evolved, instinctual human methods for making sense of one another? It’s a good bet that users would be more likely to speak informally and conversationally to a digital assistant that is capable of dealing with misunderstandings in a natural, non-frustrating way. This strategy is likely to stimulate greater amounts of more varied conversational input while also providing an improved experience for current users. Canned answers, on the other hand, are likely to elicit canned questions. (“Who’s your daddy?” “Where can I hide a dead body?”)

Doubtless, there are tough technical problems involved. Incredible advances have been made in the areas of natural language recognition and production, but there clearly is a great deal more to be done. One of the most valuable things we do as UX professionals is to help clients understand where their investments can make the biggest difference. There are other ways, besides improving the ability to smoothly fix miscommunication, that technologies like Siri could be improved to more closely replicate human interaction. For instance, these technologies need to get better at recognizing intonational cues, resolving pronominal references, handling nonstandard accents, and distinguishing the vocal streams of separate speakers (and that’s just a start).

But few improvements offer as much immediate and long-term benefit as making it easier to fix what’s gone wrong. The ability to smoothly recover from routine misunderstandings is a critical, if often overlooked, element of natural human interaction. A robust, intuitive process for conversational repair can help to:

  • reduce the cognitive load of voice interactions
  • ensure successful outcomes
  • make interactions feel more fluent and satisfying
  • elicit more natural language input, providing richer data for machine learning.

Even if most users have given up on conversations with Siri, the tech world is not giving up on natural interactions between people and computers. DARPA, for instance, recently launched a new program called Communicating with Computers (CwC) aimed specifically at making interactions with machines more like conversations between human beings.

As we move forward into a new era of natural voice interaction, let’s just keep in mind how much we stand to gain by designing machines that can use and understand the universal word: “Huh?”

 

Illustration of cheeseburger courtesy Shutterstock.


Comments

I suspect we are at the same stage as OCR was in the 80's when you knew there was a 10% error rate BUT you couldn't find it without reading the whole passage and then manually over type. This really changed with the use of techniques to focus on likely error areas and use of out of band techniques such that a dictionary could focus the reviewer on a dictionary detected suspect image and highlight the resultant text to make correction easy and avoid a lot of wasted effort. Maybe not 100% but much better, quicker, easier to get a higher quality output.
I saw something similar with early Lotus machine language translation using IM with a translation window. In terms of voice what I think we may be missing is the out of band mechanism to spot and rectify the "important" mistakes.... without losing the key "hands free" aspect of the UI. Otherwise even more intelligence is needed understanding and assisting the correction process itself.
Perhaps we need to incorporate camera/screened gestures to help the interpreter and user stay on the same page..... but not sure how to direct focus with cues. Fascinating area of study....keep up the good work as I never learned to type and need help !

You can tell Google Now "I meant X" and it will correct it.

A step in the right direction!

I hate you Siri. Just kidding, great article!

As an Android user I am constantly teasing my cult of Apple peers about their Siri interactions as I watch them become increasingly frustrated attempting to converse with "her" as if she was a real human. While I believe some form of humanized personal assistant AI technology will one day get there, it's got a ways to go for sure.

Personally, I have become trained to speak to Google Now in short, direct, robot-like queries. My results are accurate with that method almost 100% of the time and I get results faster without wasting time trying to have a conversation with my phone. Now that Google Now is available on the iPhone, I highly encourage iPhoners to give it a shot and just remember you are talking to a machine, not a living person.

Thanks for your comment, Stephen! Your approach makes a lot of sense given the current state of voice technology. In the longer term, I think there is still a lot to gain from making conversations with machines more natural, even for those of us who don’t see our smartphone as a potential new BFF or love interest.

Imagine, for example, how we might reduce the cognitive impact of interacting with machines in high-risk situations (such as on the road) by supporting the communication practices that humans find most intuitive and habitual.