Over the years, I’ve heard the same question again and again each time a new type of consumer technology starts trending and becoming part of popular culture: “So,” someone asks, “how is this going to change how you do user research?” I heard this question when mobile got its first WAP browser, and I heard it again when mobile web and apps started to pervade society. More recently, I’ve heard this question asked about wearables, and occasionally, about IoT research.

User research methods are stable

My answer is usually the same: while some of our terminology may have changed over time, user research methods have remained relatively stable over the span of my two-decade research career. We always need to adapt our approach to whatever our object of study is, but these tweaks don’t reflect changes in methods. As Amy Buckner Chowdry and Kerry Bodine described at UXPA 2015, adding a pillow to avoid participant arm fatigue in research with intensive wearable usage is a simple adaptation to usability studies.

To some extent, the same is true with AI-based technologies. In fact, studying the interface that surrounds the AI could be similar to typical user research around any hardware or software interface. But an examination of the AI itself still involves some important methodological considerations and adaptations.

Let’s think about some initial considerations around AI research and what user research methods – whether classic or new – would need to be employed.

For the purposes of this article, we’ll limit this discussion to current types of consumer-friendly AI bots that either work within existing screen-based technology or include their own dedicated interface.

Understanding current user behavior

While future AI bots may be far more sophisticated and intelligent than current products like Siri, Cortana, Amazon Alexa or the recently announced Google Assistant, these products do reflect the existing state of the market. Finding heavy users of these products and observing them, either in their own environments via ethnographic research or by asking them to demo how they use these products with a cognitive walkthrough, could be valuable. Conducting targeted focus groups with these heavy users may also lead to insights, not only into where these products succeed and where they fail, but also into what is missing when users want to use them for specific activities.

New AI interfaces

Near-term new bot-centered products will likely either exist within current screen-based technologies, be it computers, mobile devices, or wearables, or will represent alternative types of self-contained products, such as those created by Amazon and soon Google.

Regardless of where the front-end of the AI lives, making sure that users know how to work with the interface would likely involve pretty standard usability testing methods. For example, do participants know how to initiate usage of the interface, be it with spoken words or a physical switch or a tap? Do participants understand how to formulate communication with the AI (likely expected to be via natural language)? If text is shown on a screen, is it readable? Is speech synthesized output fully understandable, and in general, is the vocabulary appropriate for the audience? These questions can be easily integrated into a typical usability testing script for representative users (who may not be the power-AI users we would want for focus groups).

If there are multiple ways that users can communicate with a bot, such as by typing or by speaking, then testing each of these interfaces is important. This could be accomplished within each participant session, for example, by showing each participant both interface types (in alternating order) and rotating tasks for each interface type. Alternatively, testing both interfaces could be accomplished between participants, where a single set of tasks would be used for both interface types, but each participant would see only one of the interfaces over the course of their testing.
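The two designs described above can be sketched as a small session-planning script. This is only an illustration under stated assumptions: the interface names ("voice", "text"), the two task sets, and both function names are hypothetical placeholders, not part of any real study plan.

```python
from itertools import cycle

INTERFACES = ["voice", "text"]             # hypothetical interface types
TASK_SETS = ["task set A", "task set B"]   # hypothetical task sets


def within_subjects_plan(participant_ids):
    """Each participant sees both interfaces; order and task pairing rotate."""
    plan = {}
    for index, pid in enumerate(participant_ids):
        # Alternate which interface comes first: voice-first, text-first, ...
        first = index % 2
        order = [INTERFACES[first], INTERFACES[1 - first]]
        # Swap the task pairing every two participants so that each
        # interface is tested with each task set equally often.
        swap = (index // 2) % 2
        tasks = [TASK_SETS[swap], TASK_SETS[1 - swap]]
        plan[pid] = list(zip(order, tasks))
    return plan


def between_subjects_plan(participant_ids):
    """Each participant sees one interface only; everyone gets the same tasks."""
    return dict(zip(participant_ids, cycle(INTERFACES)))
```

With this rotation, recruiting participants in multiples of four keeps interface order and task pairing balanced in the within-participant plan. The between-participant plan simply alternates interfaces down the recruiting list.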

Evaluation of AI logic

While it shouldn’t be too hard or too unusual to evaluate an interface, evaluating the logic and accuracy of artificial intelligence will be a bit more difficult. At a basic level, I’ve been evaluating logic and accuracy in usability testing for years. In the past ten years, many a usability study has involved evaluation of search engine results, be it external search engines and the ability of users to get to specific client pages, or internal search engines and their ability to show the correct results quickly.

In even more recent years, many usability studies have included pages with faceted filtering and allowed an assessment of whether the best filters and filter options were showing up to produce the correct results.

So an examination of logical results itself is nothing new. But the sophistication of the logic behind the results of today’s (and the near-future’s) AI is going to involve extra considerations when doing usability testing. Realistic tasks will need to include both simple and sophisticated scenarios. Unlike a search engine, where there is a single query and a single set of results that can, perhaps, subsequently be filtered, a dialog with an AI bot could involve a back-and-forth discussion where each additional user input further refines the prior inputs. The tasks used during testing, therefore, might have to be more conceptual, or perhaps provide more room for open-ended approaches, as opposed to a concrete, clearly defined end-point. In fact, once participants understand the capabilities of the bot, there should very likely be some entirely open-ended tasks, where the participants are just asked to play with the bot and try to come up with their own real-world tasks.

Reporting on findings

Key findings on AI logic would need to include both whether the bot can properly understand the questions that people ask and whether the bot provides answers that users want. It would also be important to determine whether participants who may not use a bot-based product regularly (or perhaps even those who do) appreciate the value of the interaction over “old-school” methods (such as doing a simple web search). And get ready to show lots of video clips; screenshots will not likely be sufficient when trying to explain logical difficulties that participants encountered!
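One way to keep track of those two questions across sessions is to code each exchange as it is reviewed and tally the results per task. The sketch below is a hypothetical note-taking aid, not an established coding scheme: the `Exchange` fields and the `summarize` function are names invented for this illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Exchange:
    task: str
    understood: bool  # did the bot interpret the request correctly?
    satisfied: bool   # did the participant get an answer they wanted?


def summarize(exchanges):
    """Return per-task counts and rates of understanding and satisfaction."""
    grouped = defaultdict(list)
    for exchange in exchanges:
        grouped[exchange.task].append(exchange)
    summary = {}
    for task, items in grouped.items():
        n = len(items)
        summary[task] = {
            "n": n,
            "understood_rate": sum(e.understood for e in items) / n,
            "satisfied_rate": sum(e.satisfied for e in items) / n,
        }
    return summary
```

With the small samples typical of qualitative studies, these rates are best treated as pointers to the tasks whose video clips deserve a closer look, not as statistics.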

Can early-stage research be done?

The evaluation of interface and logic discussed above relies on a completed, or mostly completed, product. But would it be possible to test artificial intelligence before it is, in fact, intelligent?

While an interface, particularly a screen-based interface, can be mocked up pretty easily, mocking up intelligence can be much harder, if not impossible, to achieve. On the other hand, a Wizard of Oz approach, where there is “a man behind the curtain,” may be fairly simple.

In order to test out approaches to how the AI bot might behave, an additional researcher could observe the usability study and give responses that mimic expected AI output. In order to avoid a level of naturalness that would be clearly human when responses are spoken, the additional researcher could type responses and have a realistic speech synthesizer read those responses.
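A single turn of that setup could be wired together roughly as follows. This is a minimal sketch under stated assumptions: `run_wizard_turn` is a hypothetical helper, `get_wizard_reply` stands in for however the hidden researcher types a response, and `speak` stands in for whichever speech synthesizer the team chooses. No particular text-to-speech library is assumed.

```python
import time
from typing import Callable, List, Tuple

# Each log entry is (timestamp, speaker, text).
LogEntry = Tuple[float, str, str]


def run_wizard_turn(
    participant_utterance: str,
    get_wizard_reply: Callable[[str], str],
    speak: Callable[[str], None],
    log: List[LogEntry],
) -> str:
    """Relay one exchange: the participant asks, the hidden researcher
    types a reply, and the synthesizer reads that reply aloud."""
    log.append((time.time(), "participant", participant_utterance))
    reply = get_wizard_reply(participant_utterance).strip()
    log.append((time.time(), "bot", reply))
    speak(reply)
    return reply
```

In a live session, `get_wizard_reply` might simply wrap Python's built-in `input()`, and the timestamped log makes it easier to line up each simulated reply with the session video afterward.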

Does context matter?

Should research on AI be conducted in a usability lab or in the field? Do methodological rigor and consistency trump authenticity, or vice versa? It’s probably authenticity, at least to some degree, that matters most, particularly when the bot is intended to be used beyond the home. Context will help frame both the way that participants provide inputs to the bot and the realistic level of attention that they’d be able to give to responses. If a bot, for example, is intended to help guide users through purchasing products at a store, then if at all possible, the research should be done in a real store, not as an in-lab simulation.

If the bot is intended to be used at home while sitting on the couch (or when early-stage research involves extra logistical hurdles, such as simulating the bot), then an in-lab approach may be okay, but consider dragging in a couch and providing some soft mood lighting.

Accessibility matters too

A bot that lives within existing screen-based technology, within a website or app, should be easily capable of providing for accessible input and output (provided that the decision is made early on to code correctly).

A bot that lives within a stand-alone device and perhaps only provides for one method of input/output (such as speaking/listening) may involve a more difficult implementation of accessibility standards, but it will still be just as critical to provide accessible ways of interacting with the AI. Even if the stand-alone interface does provide accessible alternatives, it’s likely that these approaches won’t be evaluated as rigorously (if at all) as the most common approaches to communicating with the product.

An accessibility evaluation will need to check the interface code (if the bot is web-based) or involve actually using assistive technologies (for websites and apps or for physical stand-alone devices). Given the unique nature of AI communication, companies should consider not relying solely on expert testers. Usability testing with representative users who have specific kinds of impairments, using whatever assistive technologies they use regularly, would be a valuable addition.

Break the bot: User Research meets security testing

As Microsoft unfortunately discovered when they released their messaging bot, Tay, into the wild, bots that rely on crowd-sourced learning can be redirected towards evil. When Tay rolled out in March 2016, she was soon spitting out racist, anti-Semitic and misogynistic language. Lesson learned – now what?

Just as hackers can get a bounty for figuring out how to hack a website and then letting the company know first, AI companies should leverage the participation of such “learning-hackers.” Keep the bot in a password-protected sandbox and provide authentication credentials to a select set of people who have the ability to turn the bot to the dark side. As researchers virtually observe, let these participants do their best to corrupt the bot and see whether they succeed.

Researchers can then work with the AI team to further refine the logic so that whatever approaches were used can be accounted for. And then let these participants or others try again and again, continuing to pay for this participation as they continue to succeed, until such time as they have trouble making any appreciable evilness dent in the bot’s behavior. While the result will never be perfect, it would certainly be many steps ahead of the unfortunate Tay.

Are you ready?

I imagine that most bot research as of now is being done in-house, so as a UX consultant, I have not yet had the opportunity to run a true AI research study. While I don’t know if my first real AI study will take place this year, I’m looking forward to the opportunity. And likely, as time progresses, it’s not just going to be a single opportunity. As artificial intelligence fully pervades so much of what we do, I’m sure that there will be more and more of these kinds of studies, investigating AI implementations that we haven’t yet even begun to imagine.

Some day – but hopefully a long time from now – I wonder if we’ll have qualitative usability studies where both the participants and the moderator will themselves be AI entities. Yikes!

Comments

Cory, very interesting topic. Thank you.

I think, as AI/machine learning/bots become a greater part of the interaction landscape, it may be that "user research" becomes part of the bot's continuous "training" rather than a usability intervention/improvement cycle. A Design Thinking question that might spur more ideas on this:

    How might we enable bots to know when they are not helping a person, so everyday interactions become a part of their continuous training leading to anticipation of needs?

Perhaps bots will update responses based on analysis of previous negative responses. Or, perhaps designers can enable people with a standard emoji or response that explicitly tells the bot when it is on the wrong track, so the bot can learn from specific moments. In a human-to-human exchange, one human might state something like, "oops, not what I meant," or, "no, I was talking about..."

Context and accessibility are critical for successful bot interactions. I think it may also be important, in the near term, to avoid the uncanny valley by either infusing bot interactions with some human input or by having bots introduce themselves as bots. This might instill some willingness to "forgive" bots when they fail, encouraging people to help them improve.