Let’s Play videos — which document game playthroughs — have surged in popularity in the past decade. Felix Kjellberg, better known by his online pseudonym PewDiePie, now reaches over 100 million subscribers on YouTube with his Let’s Play content. And a recent report estimates that the audience for Let’s Play videos and livestreams now rivals that of paid content from HBO, Netflix, ESPN, and Hulu combined.
But producing quality Let’s Play videos takes time, much of which is devoted to writing scripts. To ease the burden on creators, a team at the Georgia Institute of Technology and the University of Alberta recently investigated an AI system that can automatically generate commentary. They say their approach outperforms existing work and lays the groundwork for future studies.
“Let’s Plays of video games represent a relatively unexplored area for experimental AI in games … There are a number of reasons why Let’s Plays may be of interest to Game AI researchers,” explained the paper’s coauthors. “First, part of Let’s Play commentary focuses on explaining the game, which is relevant to game tutorial generation, gameplay commentary, and explainable AI in games broadly. Second, Let’s Plays focus on presenting engaging commentary. Thus if we can replicate Let’s Play commentary, we may be able to extend such work to improve NPC dialogue and system prompts. Finally, Let’s Plays are important cultural artifacts, as they are the primary way many people engage with video games.”
The system is built on a convolutional neural network (CNN), an AI architecture commonly applied to analyzing visual imagery. To build a commentary corpus, the researchers collected three 25-minute YouTube videos, one from each of three popular Minecraft Let’s Play channels, and extracted the associated transcripts. They then split the videos into frames at 1 frame per second and paired each frame with a sentence that had been converted into a vector (a numerical representation) the CNN could process. Of the 4,840 frames in total, 3,600 were used for training and the remaining 1,240 were reserved for testing.
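For readers who want a concrete picture of that data preparation, here is a minimal Python sketch of the frame-and-sentence pairing step. It is an illustration under stated assumptions, not the researchers' code: the article does not say how frames were aligned with transcript sentences or how sentences were vectorized, so the timestamp-based alignment, the bag-of-words encoding, and every function name below are hypothetical.

```python
from collections import Counter


def pair_frames_with_sentences(duration_s, segments):
    """Pair one frame per second with the transcript sentence spoken then.

    `segments` is a hypothetical list of (start_s, end_s, sentence) tuples.
    Seconds that fall outside every segment are skipped.
    """
    pairs = []
    for t in range(duration_s):  # 1 frame per second
        for start, end, sentence in segments:
            if start <= t < end:
                pairs.append((t, sentence))
                break
    return pairs


def build_vocab(sentences):
    """Map every distinct lowercase word to a fixed vector index."""
    words = sorted({w for s in sentences for w in s.lower().split()})
    return {w: i for i, w in enumerate(words)}


def bag_of_words(sentence, vocab):
    """Assumed encoding: a word-count vector over the vocabulary."""
    vec = [0] * len(vocab)
    for word, count in Counter(sentence.lower().split()).items():
        if word in vocab:
            vec[vocab[word]] = count
    return vec


def split_examples(examples, n_train):
    """Use the first n_train examples for training and the rest for testing."""
    return examples[:n_train], examples[n_train:]


# Toy data standing in for a real video and transcript.
segments = [(0, 3, "so we are gonna start"), (3, 5, "we need some wood")]
pairs = pair_frames_with_sentences(6, segments)
vocab = build_vocab([s for _, s in pairs])
examples = [(t, bag_of_words(s, vocab)) for t, s in pairs]
train, test = split_examples(examples, 4)
```

With the figures reported in the article, the same split would use `n_train=3600` on 4,840 frame-sentence pairs, leaving 1,240 for testing. In a real pipeline each frame's pixel data, rather than its timestamp, would be the input to the CNN.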
The researchers note that the generated commentary (“to a close you can see we might actually”; “it too so like I say we’re gonna start”) isn’t consistently coherent or precise, but they point out that it outperforms the baseline across three quantitative tests. More importantly, they say it demonstrates the difficulty of the task at hand, given the model’s lack of contextual knowledge.
“We anticipate future developments in this work to more closely engage with scholarship in these areas,” the researchers wrote. “[G]eneralizing to other types of games would itself present a unique challenge, since context and commentary are highly dependent on the rules and design of a particular game. Nonetheless … [we] hope to extend this project to other, popular games for Let’s Plays by abstracting lower-level details and focusing on higher-level themes shared across games.”
They leave expanding the data set and the number of games covered to future work.