PREFACE
I had a great committee -- my PhD advisor Deb Roy (@dkroy), along
with Nick Montfort (@nickmofo), and Mark Riedl (@mark_riedl).
In an attempt to give my thesis some industry relevance, I invited Gabe
Newell to join the committee as well. To
my surprise, he agreed and participated in the proposal phase. Gabe's feedback on my
proposal greatly motivated my direction moving forward to the defense. In part he said, "You will fail at
creating a greater sense of player contingency." My grand plan backfired -- one of my game
industry heroes was predicting failure at exactly what I was intending to
achieve! This skepticism actually served
me well, and pushed me to think through some big issues.
Prior to my proposal, I had focused on using data recorded
from thousands of players to automate AI characters who could dynamically
converse and interact with other AI characters.
But my ultimate goal for this research was to support AI dynamically
interacting and conversing with human players, and to show how data-driven
interaction can support a vastly more open-ended, player-driven
experience. So, catalyzed by Gabe's
feedback, after the proposal I shifted 100% of my energy toward demonstrating
how data from The Restaurant Game could support unscripted face-to-face social
interaction and dialogue between a human player and an NPC.
HERE WE GO....
My thesis looks at games as a storytelling medium. Every medium affords its own ways of telling stories. As Rockstar's Dan Houser said to the New York Times, "Books tell you something, movies show you something, games let you do something." The videogame industry has made a lot of progress in letting the player do things physically -- players can run anywhere, drive anywhere, and shoot at whatever they want -- but it has made much less progress supporting open-ended social interaction and dialogue. In general, we're still stuck with the same pre-scripted, multiple-choice dialogue trees we've been seeing for 30 years, limiting players' ability to express themselves and guide the storytelling experience.
Multiple-choice social interaction in Mass Effect 3
There are two obstacles preventing us from creating more open-ended experiences. The first is the content authoring bottleneck -- creating character behaviors is a technical, labor-intensive process, and authoring tools are relatively primitive. The second obstacle may be the bigger issue: human imagination is a limited resource. No matter how talented your designers and programmers are, each individual can only anticipate so many possibilities. In order to support more open-ended interaction, we need to rethink the way we author character behavior and dialogue. In particular, we need to move toward more data-driven approaches in order to scale up the interaction. To explore this, I launched The Restaurant Game in 2007, which anonymously paired 16,000 people online to play the roles of customers and waitresses. Players could say anything they wanted to each other (via typed text), and interact with the 3D environment via a point-and-click interface. We recorded everything, and could extract a discrete action sequence from each gameplay session. The question is, how can we exploit thousands of these action sequences to support open-ended interaction? And what will that experience be like for the player? Answering these questions could not only impact games, but also have implications for online education and training, and social robotics.
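To make "discrete action sequence" concrete, here is a minimal sketch (in Python, purely for illustration) of collapsing raw log lines into typed actions. The log format, verbs, and object names below are invented, not the actual Restaurant Game log format:

```python
# Hypothetical sketch: turn raw gameplay log lines into a discrete action
# sequence. The log format and vocabulary here are invented for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class Action:
    actor: str      # "CUSTOMER" or "WAITRESS"
    verb: str       # e.g. "SITON", "PICKUP", "SAY"
    target: str     # an object name, or the text of an utterance

def extract_sequence(log_lines):
    """Parse raw lines like 'WAITRESS PICKUP Menu' into Action records."""
    sequence = []
    for line in log_lines:
        actor, verb, target = line.split(maxsplit=2)
        sequence.append(Action(actor, verb, target))
    return sequence

session = [
    "CUSTOMER SITON Chair",
    "WAITRESS PICKUP Menu",
    "WAITRESS GIVE Menu",
    "CUSTOMER SAY I'll have the steak, please.",
]
print(extract_sequence(session))
```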
Open-ended natural language input in Facade.
There are a few examples of games that have ventured beyond multiple-choice dialogue -- notably Facade, which very much inspired my own work. Facade was released in 2005, as I was wrapping up work on FEAR. While I was focused on simulating action-packed combat, Facade was delivering drama by simulating social interaction in the mundane setting of a yuppie couple's apartment, and I was blown away. As Grace and Tripp start bickering, the player can type anything they want to try to defuse the situation, or stoke the fire. Facade can't understand everything the player types, but its design elegantly sidesteps the hard language understanding problems -- when input is not understood, Grace and Tripp just continue to argue, which brilliantly succeeds in making the player feel like an awkward third wheel. But this is not a general solution. How can we support interactions between two characters, rather than three, where the player is face-to-face with an NPC, and everything s/he says matters and cannot be ignored? How can the player use language effectively to navigate the story space?
Tension between freedom and system comprehension, represented as a 2D space.
We can think of the interaction problem as a two-dimensional space, where the Y-axis indicates how much freedom the player has to say and do things, and the X-axis indicates how well the machine can understand what the player is saying or doing, and respond appropriately. Commercial games, like Mass Effect 3, sit close to the X-axis -- the machine can understand almost everything the player does, because the player's freedom is so constrained. Facade is somewhere in between, giving much more freedom at the cost of system comprehension. The holy grail is the top-right corner, where the player has complete freedom, and the machine understands everything. My research is aiming for something closer to that holy grail.
SO, WHAT DID I BUILD?
To get closer to that holy grail in a practical way, I've been experimenting with a hybrid
interface, where the user can say anything they want (typed, or
speech-to-text), and when an exact match for the input does not exist, the
system dynamically generates dialogue options intended to be semantically
similar and contextually relevant.
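Here is a toy sketch of that fallback idea, using crude word overlap as a stand-in for the real ranking. The actual system exploits recognized speech plus conversational context drawn from the collective data, not just lexical similarity, so treat everything below as illustrative:

```python
# Toy sketch of the hybrid-input fallback: pass exact matches straight
# through; otherwise rank known lines by word overlap and offer the top k.
# The real matcher is context-driven; this is only a lexical stand-in.
import re

def tokenize(text):
    return set(re.findall(r"[a-z']+", text.lower()))

def suggest_options(user_input, candidate_lines, k=3):
    # Exact match: no menu needed, the input speaks for itself.
    if user_input.strip().lower() in (c.lower() for c in candidate_lines):
        return [user_input]
    # Otherwise offer the k most lexically similar known lines.
    words = tokenize(user_input)
    scored = sorted(candidate_lines,
                    key=lambda c: len(words & tokenize(c)),
                    reverse=True)
    return scored[:k]

candidates = [
    "Can I see a menu?",
    "I'll have the steak, please.",
    "Could I get the check?",
    "Do you have any pie?",
]
print(suggest_options("gimme the steak", candidates))
```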
The underlying system that drives the behavior and dialogue
of the NPC, in response to human interaction, relies on an approach I refer to
as Collective Artificial Intelligence, which consists of three steps:
- Record thousands of people playing roles in some scenario.
- Mine gameplay data for patterns of language and behavior.
- Replay fragments of recorded behavior at appropriate times at runtime.
Below is a graph generated by plotting all action sequences
observed in 5,000 gameplay sessions of The Restaurant Game. Each node represents a unique action, and all
games progressed from node "Start" at the top to node "End"
at the bottom. This image illustrates
that human behavior is complex, and nuanced, and far beyond what we can encode
by hand. I spent a couple of years looking
at various ways to automatically mine patterns in this data (n-grams, SVMs,
HMMs, affinity propagation, PLWAP), and made some encouraging progress, but
ultimately concluded that, because the data is so sparse, these approaches
filter out the nuance of the interaction.
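To see why sparsity bites, consider a toy bigram model over action sequences, in the spirit of (though far simpler than) the models I tried. Any transition that never occurred in the data gets zero support, which is exactly how rare-but-real nuance gets filtered out:

```python
# Toy illustration of the sparsity problem: a bigram model over action
# sequences assigns zero mass to any transition it never observed.
from collections import Counter, defaultdict

def bigram_model(sessions):
    """Count observed action-to-action transitions across sessions."""
    counts = defaultdict(Counter)
    for actions in sessions:
        for prev, nxt in zip(actions, actions[1:]):
            counts[prev][nxt] += 1
    return counts

sessions = [
    ["Start", "GetMenu", "OrderSteak", "EatSteak", "PayBill", "End"],
    ["Start", "GetMenu", "OrderSteak", "EatSteak", "PayBill", "End"],
    ["Start", "GetMenu", "OrderPie", "EatPie", "PayBill", "End"],
]
model = bigram_model(sessions)

print(model["GetMenu"])               # Counter({'OrderSteak': 2, 'OrderPie': 1})
# A perfectly sensible transition that simply never occurred gets zero mass:
print(model["EatSteak"]["OrderPie"])  # 0 -- the nuance is filtered out
```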
But the motivation for recording thousands of people in the first place
was to capture the nuance! So, in 2010 I
changed direction, toward a human-machine collaborative approach, where humans
are employed to interpret the meaning of patterns in the data.
I created browser-based tools (Flex, ActionScript 3) and
used oDesk to hire people from the Philippines, Pakistan, India, and the U.S. to annotate data, applying a narrative structure that represents
a hierarchy of events, long-range dependencies indicating causal chains and
references, and expressions of attitude.
This structure also represents modulation of affinity and tension, but
these aspects have not yet been implemented.
Narrative structure applied to gameplay transcripts.
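For the curious, here is a rough sketch of what one annotated transcript might look like as data. The field names are my illustrative shorthand for this post, not the actual annotation schema:

```python
# Hypothetical shorthand for the annotation structure: an event hierarchy,
# long-range dependencies, and attitude tags over a raw action sequence.
from dataclasses import dataclass, field

@dataclass
class Event:
    label: str                   # e.g. "GET_SEATED", "ORDER_FOOD"
    action_ids: list             # indices into the raw action sequence
    children: list = field(default_factory=list)   # sub-events in the hierarchy

@dataclass
class Dependency:
    cause_id: int                # action that opens the dependency
    effect_id: int               # action that must eventually satisfy it
    kind: str                    # e.g. "causal", "reference"

@dataclass
class AttitudeTag:
    action_id: int
    attitude: str                # e.g. "polite", "rude", "upsell"

@dataclass
class AnnotatedTranscript:
    events: list
    dependencies: list
    attitudes: list

# Ordering a steak causally obligates serving a steak later on:
steak = Dependency(cause_id=12, effect_id=31, kind="causal")
print(steak)
```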
It took a team of seven outsourced annotators a total of 415
person-hours to tag 1,000 transcripts with four types of meta-data -- about
1.5 weeks of calendar time with all seven working 40 hours/week in parallel --
at a total cost of about $3,000. As a final step, lines of dialogue
were manually clustered by semantic meaning. I
did this step myself, and it took about two weeks to cluster 18,000 lines.
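The end product of that clustering is essentially a mapping from many surface variants to one canonical speech act, something like the following (the cluster names are invented for this example):

```python
# Sketch of semantic clustering's end product: many surface lines mapped to
# one canonical speech act. Cluster names are invented for illustration.
clusters = {
    "ORDER_STEAK": [
        "I'll have the steak, please.",
        "Can I get a steak?",
        "gimme the steak",
    ],
    "REQUEST_BILL": [
        "Check, please!",
        "Could I get the bill?",
    ],
}

# Invert the mapping to look up the speech act behind any observed line:
line_to_cluster = {
    line.lower(): act for act, lines in clusters.items() for line in lines
}
print(line_to_cluster["gimme the steak"])   # ORDER_STEAK
```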
The annotated data serves as Collective Memory, driving the
decisions of the runtime planning architecture (written in Java), which
combines plan recognition with case-based planning. At a high level, the agent recognizes
discrete sequences of observations representing events, infers a hierarchy of
events, and retrieves gameplay transcripts (aka cases) containing event
hierarchies that are similar at an abstract level. Retrieved cases are critiqued, leveraging
meta-data to scrutinize a proposed next-action for coherence. For a simple example, if someone ordered
steak, and the AI waitress is considering serving pie as her next action, a
critic will reject the proposal because it violates a long-range dependency
tagged by a human: ordering steak obligates the waitress to serve a steak,
rather than pie. All of the critics are
domain-independent, with the exception of the Domain Critic, which accesses
rules encoded in the Domain Knowledge Manager.
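Here is a drastically simplified sketch of that critique step (in Python, rather than the system's Java), hard-coding the steak/pie example. The real critics operate over annotated event hierarchies rather than flat strings, and this toy version is deliberately over-strict, blocking any other "Serve" action while a serve is owed:

```python
# Simplified sketch of a critic: reject a proposed next action if it would
# violate a human-tagged long-range dependency from the retrieved case.
def dependency_critic(history, proposal, dependencies):
    """Return False if the proposal conflicts with an unmet dependency."""
    for cause, required_effect in dependencies:
        if cause in history and required_effect not in history:
            # Something is still owed; serving anything else is treated as
            # incoherent in this toy version.
            if proposal.startswith("Serve") and proposal != required_effect:
                return False
    return True

dependencies = [("OrderSteak", "ServeSteak")]   # tagged by an annotator
history = ["GetMenu", "OrderSteak"]

print(dependency_critic(history, "ServePie", dependencies))    # False
print(dependency_critic(history, "ServeSteak", dependencies))  # True
```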
DEMOS!
So, enough jibber jabber.
Let's see what this system actually does. Below are three videos of a human customer
interacting with an AI waitress. The
first video highlights how the system auto-completes the same input in
different ways depending on context, and how the waitress can exploit player
data to respond to some of the more unusual things the player does.
The last video demonstrates a waitress directed to
upsell. This is accomplished through a
combination of applying an upselling attitude tag, and adding a couple
domain-specific rules to the Domain Knowledge Manager which tell the waitress
to never bring an entree until an appetizer has been ordered, and never bring
the bill until dessert has been ordered.
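As a sketch, those two rules might look something like the simple predicates below; the rule format is invented here (the actual Domain Knowledge Manager encoding isn't shown in this post), and the attitude-tag half of the upselling behavior is omitted:

```python
# Invented rule format for the two upselling rules described above; the
# actual Domain Knowledge Manager encoding may differ.
def entree_rule(history, proposal):
    """Never bring an entree until an appetizer has been ordered."""
    return not (proposal == "ServeEntree" and "OrderAppetizer" not in history)

def bill_rule(history, proposal):
    """Never bring the bill until dessert has been ordered."""
    return not (proposal == "ServeBill" and "OrderDessert" not in history)

DOMAIN_RULES = [entree_rule, bill_rule]

def domain_critic(history, proposal):
    """Accept a proposal only if every domain rule allows it."""
    return all(rule(history, proposal) for rule in DOMAIN_RULES)

print(domain_critic(["GetMenu", "OrderEntree"], "ServeEntree"))         # False
print(domain_critic(["OrderAppetizer", "OrderEntree"], "ServeEntree"))  # True
```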
WELL, DID IT SUCCEED?
I ran both a quantitative and qualitative study to evaluate
whether the implemented system succeeded at supporting a more open-ended,
player-driven experience. Subjects in the
quantitative study interacted with the system via speech-to-text, using
Microsoft's speech recognizer. This
study measured how often subjects were able to find a dialogue option with
the same meaning as what they were trying to say, where the dialogue
options were driven directly by recognized speech in one condition, and by the
full Collective A.I. system in another condition (which can exploit context to
generate relevant dialogue options, even when speech is misunderstood). Results show that subjects were able to find
a satisfactory dialogue option 29% more often with the full system.
I also looked at the ranking of the selected dialogue
option. The figure below shows how
exploiting context increases the likelihood that the desired dialogue option
will appear higher in the list.
Subjects in the qualitative study played three games in
groups, followed by a focus group discussion.
Each subject played about 10 minutes apiece of Facade, The Restaurant Game, and
Skyrim; the Skyrim session, interacting with NPCs in a tavern, served as a
control and a reminder of the current state of the art in industry.
This study was risky, given that the other games are polished released
products, while The Restaurant Game is not really even a game -- more of a proof-of-concept
tech demo. None of the subjects had
played Facade before, and most were captivated by it. Encouragingly, however, the
discussions revealed that players found The Restaurant Game to be more
player-directed and responsive to nuanced language, while dramatizing a restaurant narrative in cooperation with the player. Unprompted, subjects described The Restaurant
Game as a sandbox. Some notable comments
included:
“It felt like The Restaurant Game was trying to play
along with the player. It just kind of rolls with it.”
“Façade led you, The Restaurant Game lets you lead it.”
“What I noticed about The Restaurant is that it was
trying to do more than Façade in the sort of AI actual interpretation of
colloquialisms.”
WHERE DO WE GO FROM HERE?
My thesis has only scratched the surface of what's possible
with crowdsourced data-driven interaction, and I see it as a starting point
rather than an end. As a former game
programmer, the fact that these characters can still surprise me by saying
things I've never seen them say before, even after working with this data for
years, is incredibly exciting, and hints at the possibilities for truly
next-generation characters, driven by massive collections of content.
Over the course of the PhD, I dabbled in a couple related
side-projects, reusing The Restaurant Game platform -- I collaborated with the
Personal Robots Group on Mars Escape, a game to capture data about human-robot
interaction, and collaborated with the GAMBIT Game Lab on Improviso, which
collects data about playing roles on the set of a low-budget sci-fi movie. But there is still much, much more to
explore.
To that end, in case you haven't been following my recent Facebook and Twitter spam, I'm continuing to explore data-driven simulated role-playing through a new venture called Giant Otter Technologies. Follow our latest developments at @GiantOtterTech.