Wednesday, July 3, 2013

Defense Abridged

I've been meaning to blog a reader's digest version of my PhD thesis defense since December.  Now that six months have passed, it's about time to follow through with that plan.  You can also watch my AI Summit talk from the GDC Vault (starts at 20:45), which is a condensed version of my defense, minus the study results.  The complete thesis document is available here:  Collective Artificial Intelligence: Simulated Role-Playing from Crowdsourced Data


PREFACE

I had a great committee -- my PhD advisor Deb Roy (@dkroy), along with Nick Montfort (@nickmofo) and Mark Riedl (@mark_riedl).  In an attempt to give my thesis some industry relevance, I invited Gabe Newell to join the committee as well.  To my surprise, he agreed, and participated in the proposal phase.  Gabe's feedback on the proposal strongly shaped my direction from the proposal to the defense.  Among other things, he said, "You will fail at creating a greater sense of player contingency."  My grand plan backfired -- one of my game industry heroes was predicting failure at exactly what I was intending to achieve!  That skepticism actually served me well, and pushed me to think through some big issues.

Prior to my proposal, I had focused on using data recorded from thousands of players to automate AI characters who could dynamically converse and interact with other AI characters.  But my ultimate goal for this research was to support AI characters dynamically interacting and conversing with human players, and to show how data-driven interaction can support a vastly more open-ended, player-driven experience.  So, catalyzed by Gabe's feedback, after the proposal I shifted 100% of my energy toward demonstrating how data from The Restaurant Game could support unscripted, face-to-face social interaction and dialogue between a human player and an NPC.


HERE WE GO.... 

 

My thesis looks at games as a storytelling medium. Every medium tells stories in its own way. As Rockstar's Dan Houser said to the New York Times, "Books tell you something, movies show you something, games let you do something." The videogame industry has made a lot of progress in letting the player do things physically -- players can run anywhere, drive anywhere, and shoot at whatever they want -- but it has made much less progress supporting open-ended social interaction and dialogue. In general, we're still stuck with the same pre-scripted, multiple-choice dialogue trees we've been seeing for 30 years, limiting players' ability to express themselves and guide the storytelling experience.


Multiple-choice social interaction in Mass Effect 3

There are two obstacles preventing us from creating more open-ended experiences.  The first is the content authoring bottleneck -- creating character behaviors is a technical, labor-intensive process, and authoring tools are relatively primitive.  The second obstacle may be the bigger issue: human imagination is a limited resource.  No matter how talented your designers and programmers are, each individual can only anticipate so many possibilities.  In order to support more open-ended interaction, we need to rethink the way we author character behavior and dialogue.  In particular, we need to move toward more data-driven approaches in order to scale up the interaction.

To explore this, I launched The Restaurant Game in 2007, which anonymously paired 16,000 people online to play the roles of customers and waitresses.  Players could say anything they wanted to each other (via typed text), and interact with the 3D environment via a point-and-click interface.  We recorded everything, and could extract a discrete action sequence from each gameplay session.  The question is, how can we exploit thousands of these action sequences to support open-ended interaction?  And what will that experience be like for the player?  Answering these questions could impact not only games, but also online education and training, and social robotics.

Open-ended natural language input in Facade.

There are a few examples of games that have ventured beyond multiple-choice dialogue -- notably Facade, which very much inspired my own work. Facade was released in 2005, as I was wrapping up work on FEAR. While I was focused on simulating action-packed combat, Facade was delivering drama by simulating social interaction in the mundane setting of a yuppie couple's apartment, and I was blown away. As Grace and Tripp start bickering, the player can type anything they want to try to defuse the situation, or stoke the fire. Facade can't understand everything the player types, but its designers elegantly worked around the hard language-understanding problems -- when input is not understood, Grace and Tripp just continue to argue, which brilliantly succeeds in making the player feel like an awkward third wheel. But this is not a general solution. How can we support interactions between two characters rather than three, where the player is face-to-face with an NPC, and everything s/he says matters and cannot be ignored? How can the player use language effectively to navigate the story space?

Tension between freedom and system comprehension, represented as a 2D space.

We can think of the interaction problem as a two-dimensional space, where the Y-axis indicates how much freedom the player has to say and do things, and the X-axis indicates how well the machine can understand what the player is saying or doing, and respond appropriately. Commercial games, like Mass Effect 3, sit close to the X-axis -- the machine can understand almost everything the player does, because the player's freedom is so constrained. Facade sits somewhere in between, giving much more freedom at the cost of system comprehension. The holy grail is the top-right corner, where the player has complete freedom and the machine understands everything. My research aims for something closer to that holy grail.


SO, WHAT DID I BUILD?


To get closer to that holy grail in a practical way, I've been experimenting with a hybrid interface, where the user can say anything they want (typed, or speech-to-text), and when an exact match for the input does not exist, the system dynamically generates dialogue options intended to be semantically similar and contextually relevant.

Hybrid user-interface in The Restaurant Game.
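To make the hybrid interface idea concrete, here is a minimal sketch in Java (the runtime system's language) of one way it could work: when the raw input has no exact match in the corpus, candidate lines are ranked by a blend of lexical similarity to the input and contextual relevance, and the top few become the generated options. The names, weights, and the crude token-overlap similarity are all mine, for illustration only -- not the thesis code.

```java
import java.util.*;

public class HybridInputSketch {
    // A corpus line paired with a precomputed relevance score for the current context.
    record Candidate(String line, double contextRelevance) {}

    // Crude lexical similarity: fraction of shared lowercase tokens.
    static double tokenOverlap(String a, String b) {
        Set<String> ta = new HashSet<>(Arrays.asList(a.toLowerCase().split("\\s+")));
        Set<String> tb = new HashSet<>(Arrays.asList(b.toLowerCase().split("\\s+")));
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        return shared.size() / (double) Math.max(ta.size(), tb.size());
    }

    // Return the top-k corpus lines, blending input similarity with context relevance.
    static List<String> suggest(String input, List<Candidate> corpus, int k) {
        return corpus.stream()
            .sorted(Comparator.comparingDouble(
                (Candidate c) -> 0.6 * tokenOverlap(input, c.line())
                               + 0.4 * c.contextRelevance()).reversed())
            .limit(k)
            .map(Candidate::line)
            .toList();
    }

    public static void main(String[] args) {
        List<Candidate> corpus = List.of(
            new Candidate("Can I see a menu?", 0.9),
            new Candidate("I'll have the steak, please.", 0.7),
            new Candidate("Check, please!", 0.2));
        // Inexact input still surfaces the semantically closest option first.
        System.out.println(suggest("could i get a menu", corpus, 2));
    }
}
```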

The underlying system that drives the behavior and dialogue of the NPC, in response to human interaction, relies on an approach I refer to as Collective Artificial Intelligence, which consists of three steps (a toy sketch of the data involved follows the list):
  1. Record thousands of people playing roles in some scenario.
  2. Mine the gameplay data for patterns of language and behavior.
  3. Replay fragments of recorded behavior at contextually appropriate moments at runtime.
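As a rough illustration of step 1's output -- the raw material the other two steps consume -- each logged session can be reduced to a discrete sequence of (actor, verb, object) actions. The type and field names below are my guesses for illustration, not the thesis schema.

```java
import java.util.List;

public class ActionSequenceSketch {
    // One discrete action extracted from the gameplay log.
    record Action(String actor, String verb, String object) {}

    // A full two-player session, reduced to its action sequence.
    record Session(String id, List<Action> actions) {}

    public static void main(String[] args) {
        Session s = new Session("game-0001", List.of(
            new Action("customer", "sit", "chair"),
            new Action("customer", "order", "steak"),
            new Action("waitress", "serve", "steak")));
        s.actions().forEach(a ->
            System.out.println(a.actor() + " " + a.verb() + " " + a.object()));
    }
}
```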

Below is a graph generated by plotting all action sequences observed in 5,000 gameplay sessions of The Restaurant Game.  Each node represents a unique action, and all games progressed from node "Start" at the top to node "End" at the bottom.  This image illustrates that human behavior is complex and nuanced, far beyond what we can encode by hand.  I spent a couple of years looking at various ways to automatically mine patterns in this data (n-grams, SVMs, HMMs, affinity propagation, PLWAP), and made some encouraging progress, but ultimately concluded that these approaches filter out the nuance of the interaction due to sparse data.  But the motivation for recording thousands of people in the first place was to capture the nuance!  So, in 2010 I changed direction, toward a human-machine collaborative approach, where humans are employed to interpret the meaning of patterns in the data.
Graph of action sequences observed in 5,000 two-player games.
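A graph like this can be assembled directly from the logged sequences: one node per unique action, one weighted edge per observed transition. A toy sketch, with hypothetical action labels:

```java
import java.util.*;

public class ActionGraphSketch {
    public static void main(String[] args) {
        // Two example sessions, each a sequence of unique action labels.
        List<List<String>> sessions = List.of(
            List.of("Start", "waitress:greet", "customer:sit", "customer:order:steak", "End"),
            List.of("Start", "customer:sit", "waitress:greet", "customer:order:pie", "End"));

        // Edge -> number of sessions in which that transition was observed.
        Map<String, Integer> edgeCounts = new HashMap<>();
        for (List<String> seq : sessions)
            for (int i = 0; i + 1 < seq.size(); i++)
                edgeCounts.merge(seq.get(i) + " -> " + seq.get(i + 1), 1, Integer::sum);

        edgeCounts.forEach((edge, n) -> System.out.println(edge + "  [" + n + "]"));
    }
}
```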

I created browser-based tools (Flex, ActionScript 3) and used oDesk to hire people from the Philippines, Pakistan, India, and the U.S. to annotate data, applying a narrative structure that represents a hierarchy of events, long-range dependencies indicating causal chains and references, and expressions of attitude.  This structure also represents modulation of affinity and tension, but these aspects have not yet been implemented.  

Narrative structure applied to gameplay transcripts.
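As a rough illustration of what one annotated transcript might look like as data, here is a sketch with field names that are guesses at the structure just described (event hierarchy, long-range dependencies, attitude tags) -- not the actual annotation schema.

```java
import java.util.List;

public class AnnotationSketch {
    // One atomic action or line of dialogue in a transcript.
    record Step(int index, String actor, String content) {}

    // Events group contiguous steps, and may nest into a hierarchy.
    record Event(String label, List<Step> steps, List<Event> subEvents) {}

    // A long-range dependency links a cause to a later effect,
    // e.g. ordering steak (step 12) causes serving steak (step 31).
    record Dependency(int causeIndex, int effectIndex, String kind) {}

    // An attitude tag marks a span of steps, e.g. "rude" or "upselling".
    record Attitude(String tag, int fromIndex, int toIndex) {}

    record AnnotatedTranscript(List<Event> events,
                               List<Dependency> dependencies,
                               List<Attitude> attitudes) {}

    public static void main(String[] args) {
        Step order = new Step(12, "customer", "I'll have the steak.");
        AnnotatedTranscript t = new AnnotatedTranscript(
            List.of(new Event("ORDER_FOOD", List.of(order), List.of())),
            List.of(new Dependency(12, 31, "causal")),
            List.of(new Attitude("rude", 5, 9)));
        System.out.println(t.dependencies());
    }
}
```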



It took a team of seven outsourced annotators a total of 415 person-hours to tag 1,000 transcripts with four types of meta-data -- about 1.5 weeks of elapsed time if each annotator worked 40 hours/week -- and it cost about $3,000.  As a final step, lines of dialogue are manually clustered by semantic similarity.  I did this step myself, and it took about two weeks to cluster 18,000 lines.


The annotated data serves as Collective Memory, driving the decisions of the runtime planning architecture (written in Java), which combines plan recognition with case-based planning.  At a high level, the agent recognizes discrete sequences of observations representing events, infers a hierarchy of events, and retrieves gameplay transcripts (a.k.a. cases) containing event hierarchies that are similar at an abstract level.  Retrieved cases are critiqued, leveraging meta-data to scrutinize a proposed next action for coherence.  For a simple example, if someone ordered steak, and the AI waitress is considering serving pie as her next action, a critic will reject this proposal because it violates a long-range dependency tagged by a human: ordering steak causes the waitress to serve steak, not pie.  All of the critics are domain-independent, with the exception of the Domain Critic, which accesses rules encoded in the Domain Knowledge Manager.

Runtime planning architecture for an agent.
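Here is a toy version of that long-range-dependency check, using the steak/pie example above. The type and method names are mine, not the thesis code; the point is just that human-tagged causal links give a critic something concrete to test a proposed action against.

```java
import java.util.*;

public class DependencyCriticSketch {
    // Cause action -> required effect action, as tagged by annotators.
    static final Map<String, String> CAUSES = Map.of(
        "customer:order:steak", "waitress:serve:steak");

    // Reject a proposal if some observed cause demands a different effect
    // of the same kind (e.g. serving pie when steak was ordered).
    static boolean accept(List<String> observed, String proposal) {
        for (String cause : observed) {
            String required = CAUSES.get(cause);
            if (required != null
                    && sameVerb(required, proposal)
                    && !required.equals(proposal))
                return false; // violates a tagged long-range dependency
        }
        return true;
    }

    // Actions are encoded "actor:verb:object"; compare the verbs.
    static boolean sameVerb(String a, String b) {
        return a.split(":")[1].equals(b.split(":")[1]);
    }

    public static void main(String[] args) {
        List<String> observed = List.of("customer:order:steak");
        System.out.println(accept(observed, "waitress:serve:pie"));   // false
        System.out.println(accept(observed, "waitress:serve:steak")); // true
    }
}
```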

DEMOS!


So, enough jibber jabber.  Let's see what this system actually does.  Below are three videos of a human customer interacting with an AI waitress.  The first video highlights how the system auto-completes the same input in different ways depending on context, and how the waitress can exploit player data to respond to some of the more unusual things the player does.


I think of these data-driven characters as improvisational actors who can take direction at a high level.  This video demonstrates directing the waitress to be rude, which has the effect of biasing her to retrieve gameplay transcripts with actions tagged as rude when possible.
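One simple way to realize such a bias -- a sketch with an arbitrary boost weight, not the actual retrieval scoring -- is to give transcripts carrying the requested attitude tag a bonus when ranking retrieved cases:

```java
import java.util.*;

public class AttitudeBiasSketch {
    // A retrieved transcript: its similarity score plus any attitude tags.
    record Case(String id, double similarity, Set<String> attitudeTags) {}

    // Rank cases, boosting those tagged with the directed attitude.
    static List<Case> rank(List<Case> retrieved, String directedAttitude) {
        return retrieved.stream()
            .sorted(Comparator.comparingDouble(
                (Case c) -> c.similarity()
                    + (c.attitudeTags().contains(directedAttitude) ? 0.5 : 0.0))
                .reversed())
            .toList();
    }

    public static void main(String[] args) {
        List<Case> cases = List.of(
            new Case("t042", 0.80, Set.of()),
            new Case("t117", 0.65, Set.of("rude")));
        // With direction "rude", the tagged transcript wins despite lower similarity.
        System.out.println(rank(cases, "rude").get(0).id()); // t117
    }
}
```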


The last video demonstrates a waitress directed to upsell.  This is accomplished through a combination of applying an upselling attitude tag, and adding a couple of domain-specific rules to the Domain Knowledge Manager which tell the waitress never to bring an entree until an appetizer has been ordered, and never to bring the bill until dessert has been ordered.
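Those two rules are simple enough to express as precondition checks. A sketch of how a hypothetical Domain Knowledge Manager might encode them (action labels and method names are illustrative):

```java
import java.util.Set;

public class DomainRulesSketch {
    // Reject serving an entree before an appetizer has been ordered,
    // and presenting the bill before dessert has been ordered.
    static boolean allows(Set<String> eventsSoFar, String proposal) {
        if (proposal.equals("waitress:serve:entree")
                && !eventsSoFar.contains("customer:order:appetizer"))
            return false;
        if (proposal.equals("waitress:serve:bill")
                && !eventsSoFar.contains("customer:order:dessert"))
            return false;
        return true;
    }

    public static void main(String[] args) {
        Set<String> events = Set.of("customer:order:entree");
        System.out.println(allows(events, "waitress:serve:entree")); // false: no appetizer yet
    }
}
```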


WELL, DID IT SUCCEED?


I ran both a quantitative and a qualitative study to evaluate whether the implemented system succeeded at supporting a more open-ended, player-driven experience.  Subjects in the quantitative study interacted with the system via speech-to-text, using Microsoft's speech recognizer.  This study looked at how often the subject was able to find a dialogue option with the same meaning as what they were trying to say, where the dialogue options were driven directly by recognized speech in one condition, and by the full Collective A.I. system in another condition (which can exploit context to generate relevant dialogue options, even when speech is misunderstood).  Results show that subjects were able to find a satisfactory dialogue option 29% more often with the full system.

Quantitative study results.

I also looked at the ranking of the selected dialogue option.  The figure below shows how exploiting context increases the likelihood that the desired dialogue option will appear higher in the list.
More quantitative study results.

Subjects in the qualitative study played three games in groups, followed by a focus group discussion.  Each subject played about 10 minutes of Facade, The Restaurant Game, and Skyrim -- interacting with NPCs in a tavern -- with Skyrim serving as a control, a reminder of the current state of the art in industry.  This study was risky, given that the other games are polished, released products, while The Restaurant Game is not really even a game -- more of a proof-of-concept tech demo.  None of the subjects had played Facade before, and most were captivated by it.  Encouragingly, though, the discussions revealed that players did find The Restaurant Game to be more player-directed and responsive to nuanced language, while dramatizing a restaurant narrative in cooperation with the player.  Unprompted, subjects described The Restaurant Game as a sandbox.  Some notable comments included:

“It felt like The Restaurant Game was trying to play along with the player. It just kind of rolls with it.”

“Façade led you, The Restaurant Game lets you lead it.”

“What I noticed about The Restaurant is that it was trying to do more than Façade in the sort of AI actual interpretation of colloquialisms.”

 

WHERE DO WE GO FROM HERE?


My thesis has only scratched the surface of what's possible with crowdsourced, data-driven interaction, and I see it as a starting point rather than an end.  As a former game programmer, I find it incredibly exciting that these characters can still surprise me by saying things I've never seen them say before, even after working with this data for years -- it hints at the possibilities for truly next-generation characters, driven by massive collections of content.

Over the course of the PhD, I dabbled in a couple of related side projects, reusing The Restaurant Game platform -- I collaborated with the Personal Robots Group on Mars Escape, a game to capture data about human-robot interaction, and with the GAMBIT Game Lab on Improviso, which collects data about playing roles on the set of a low-budget sci-fi movie.  But there is still much, much more to explore.


To that end, in case you haven't been following my recent Facebook and Twitter spam, I'm continuing to explore data-driven simulated role-playing through a new venture called Giant Otter Technologies.  Follow our latest developments at @GiantOtterTech.
