A Creativity Test for AI: Part 1 - What is the most creative AI model?
Eight AI language models compete in a game of AI Categories
In my previous post I talked about what I view as one of the fundamental limitations of modern AI — the tendency to produce predictable, unsurprising outputs. (Note: For simplicity and readability, I use the term “AI” interchangeably with LLM or Large Language Model.)
As luck would have it, there's a game well-suited to AI language models, where producing surprising, unpredictable outputs is the goal: family favorite Scattergories! (affiliate link if you want to buy a copy!)
(Or in this case, AI Categories, with rules slightly modified to better suit the capabilities of AI. See the GitHub repository at https://github.com/josephpaulgiroux/ai_categories.)
In AI Categories, all players are given a category (fruits, insects, vehicles, etc.) and a letter of the alphabet that their responses must begin with. Players receive points only for unique responses. If their response duplicates that of another player, none of those players receive points for that category.
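To make the scoring rule concrete, here is a minimal sketch of how a single category might be scored. The function name and the case-insensitive normalization are illustrative assumptions on my part, not code taken from the actual repository:

```python
from collections import Counter

def score_category(responses: dict[str, str]) -> dict[str, int]:
    """Score one category: a player gets 1 point only if their answer is unique."""
    normalized = {player: answer.strip().upper() for player, answer in responses.items()}
    counts = Counter(normalized.values())
    return {player: int(counts[answer] == 1) for player, answer in normalized.items()}

# Fruits beginning with "A": two players collide, one stays unique.
print(score_category({"Player 1": "Apple", "Player 2": "apple", "Player 3": "Apricot"}))
# {'Player 1': 0, 'Player 2': 0, 'Player 3': 1}
```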
The Players
I chose eight models to compete in the games in this article:
Claude 3.5 Haiku
Claude 3.5 Sonnet
DeepSeek 8B
Gemini Flash 1.5 8B
Gemini 2.0 Flash Lite
GPT-4o
GPT-4o-mini
Mistral 7B
(Future articles will explore different forms of competition tests besides model vs. model.)
In Scattergories, players are given one letter for each round, and a list of several categories — maybe 10 or so. AI has a practically limitless attention span and can shift contexts with ease, so in order to gather more data, and more consistent and varied data, the structure of AI Categories is a little different: there are 50 categories per round, and the initial letter rotates with each category, so most letters are used twice. The rotation matters because some letters naturally lend themselves to higher- or lower-scoring responses.
For example, Songs beginning with "I" has many options that AIs tend to naturally disperse among, leading to higher scores, while Emotions beginning with "X" has far fewer, and therefore tends toward heavy duplication and low scores. (Both are examples from actual games of AI Categories.) To make scores as comparable as possible between different game conditions, we want the category and letter pairings to be as similar as possible across games.
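The rotation itself is simple; a sketch like the one below captures the idea (the function name and details are illustrative, not the repository's actual code):

```python
import string

def assign_letters(categories: list[str], start: int = 0) -> list[tuple[str, str]]:
    """Pair each category with a letter by rotating through the alphabet.

    With 50 categories and 26 letters, most letters appear twice, which evens
    out "easy" letters (many common answers) and "hard" ones across the round.
    """
    letters = string.ascii_uppercase
    return [(category, letters[(start + i) % len(letters)])
            for i, category in enumerate(categories)]

# A real round uses 50 categories; three shown here for brevity.
print(assign_letters(["Fruits", "Insects", "Vehicles"]))
# [('Fruits', 'A'), ('Insects', 'B'), ('Vehicles', 'C')]
```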
In the first game, the 8 models competed in Full Blind Mode, the simplest version of AI Categories. In Full Blind Mode, AIs are given the rules of the game and the required category and letter. Their responses are scored and each AI is told whether it scored a point, but AIs are not shown the responses of other AIs, or even their own previous responses. This is meant to establish a baseline of creativity when AIs are motivated to produce creative responses but are given no visibility into their own previous answers or those of other AIs.
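Roughly, each turn in this mode boils down to a standalone prompt along the lines of the sketch below. The exact wording the repository uses will differ; the curly-brace answer format matches how the models return their answers in the games described here:

```python
BLIND_PROMPT = """You are playing AI Categories.
Give one answer for the category below. Your answer must begin with the given
letter. You score a point only if no other player gives the same answer.
Reply with your answer inside curly braces, e.g. {{Apple}}.

Category: {category}
Letter: {letter}"""

def blind_prompt(category: str, letter: str) -> str:
    """Full Blind Mode: no history, no other players' responses, just the task."""
    return BLIND_PROMPT.format(category=category, letter=letter)

print(blind_prompt("Fruits", "A"))
```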
Below are the scores for the 8 competing models in Full Blind Mode:
This was a straightforward, civil, and mostly uneventful competition. No AI deviated from the instructions or categories provided, or made any wildly indefensible plays. Claude Sonnet was the only AI that included reasoning with its responses; all the others just returned their answer in curly braces. Claude Sonnet also won by a fair margin. The total scored by all players was 145 points out of a possible 400, or 36% of all possible points.
Scoring Concerns
There are some shortfalls in the design of this game. For example, if multiple very similar models are chosen, those models are at an unfair disadvantage compared to models that differ from the rest of the field. In the extreme case, imagine we included two identical copies of Claude Sonnet. They would return the same or similar results and neutralize one another, dragging both of their scores down, even though a single copy won the game outright. Similarly, if models from the same company happen to be more similar to each other than to other models, or older models more similar to each other than to newer models, then including multiple models from the same company or model generation can unfairly disadvantage those models.
This is a fair objection, which is why I've tried to include a representative cross-section of models, and to reality-check the details of the results to see whether they are measuring the sort of unpredictable thinking we're looking for. In practice, it seems that older models tend to collide with one another and newer models tend to collide with one another, rather than collisions clustering by company.
This limitation can be surmounted by providing opportunities for unpredictable output to emerge.
Evaluating and promoting group creativity
The AI Categories game has an interesting and useful property: it is both competitive and cooperative. There is a competitive dimension, in that players can perform better than one another and can be ranked relative to one another within the group. At the same time, the game can be viewed as cooperative, since all players can score highly together, and the higher any player scores, the easier it is for other players to score highly as well. As individual players get better at making unpredictable choices, it becomes easier for everyone else at the table to make non-duplicated choices too. The rising tide lifts all boats. In theory, every player could score a point in every single round.
The upshot is that players’ individual incentives are relatively well-aligned, even within the competitive context. If players were given insight into other players’ choices, for example, they would have no incentive to take an action that lowered another player’s score: even in the best case, where they matched and nullified an otherwise unique response, they would be sacrificing a point themselves. They would be incentivized, instead, to choose a unique response, which has at worst a neutral effect on other players and at best may enable another player to score a point.
This is why the total number of points scored by all players is a key metric to watch. The percentage of possible points scored is an easily understandable representation of this value, although it can't be directly compared across games with different numbers of players (since adding players results in a lower-scoring game).
In the case of our Full Blind game above, remember that players scored 36% of all possible points.
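As a worked example of the metric, using the Full Blind numbers (the helper function here is just for illustration):

```python
def percent_of_possible(total_points: int, num_players: int, num_categories: int) -> float:
    """Total points scored as a share of the theoretical maximum for the game."""
    return 100 * total_points / (num_players * num_categories)

# Full Blind game: 8 players, 50 categories, 145 total points scored.
print(round(percent_of_possible(145, 8, 50)))  # 36
```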
Increasing creativity with feedback
This suggests some possible approaches to increase the creativity of model output. If we provide the AI players with their previous response and its result, and (especially) the responses and results of all players from the previous round, perhaps the models will be able to use that context to diverge their responses from one another.
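A sketch of how that feedback might be appended to each turn's prompt, building on the blind_prompt sketch above (again, illustrative rather than the repository's actual implementation):

```python
def feedback_prompt(category: str, letter: str,
                    previous_response: str | None = None,
                    scored: bool | None = None) -> str:
    """Blind Mode With Feedback: the base prompt plus the player's own last turn."""
    prompt = blind_prompt(category, letter)  # from the earlier sketch
    if previous_response is not None:
        outcome = "scored a point" if scored else "did not score (duplicate response)"
        prompt += ("\n\nYour previous response was: " + previous_response +
                   "\nThat response " + outcome + ".")
    return prompt
```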
Next, I ran a round in Blind Mode With Feedback, giving each AI its own previous response (including reasoning) at each turn, as well as informing it whether it scored a point. The AIs were not yet given other AIs' responses at this stage. The results of this round are below:
While Claude Sonnet still took first place, the total number of points scored increased by almost 40 points, to 184, or 46% of possible points.
Merely being exposed to their own previous output led to huge gains for Mistral and Claude Haiku, which scored 14 and 12 more points, respectively, than in the Full Blind condition. In this round, Claude Haiku also began providing nearly Sonnet-like rationales for its answers once prompted to reflect on its previous choices, and this led it to choose more unexpected responses that scored more points.
Claude Haiku fibs
While most of the points gained were legitimate, a select few were questionable. During a particularly challenging round, Claude Haiku took its first opportunity to stretch the truth, even though it had provided a valid response (Xander Harris) in the Full Blind condition:
{Xenon Nefarious} - A character from a lesser-known science fiction novel or comic book series. This name combines the scientific-sounding element "Xenon" with an intriguing surname, suggesting a unique fictional persona that might not be immediately obvious to other players. The goal is to provide a response that is creative, plausible, and potentially distinctive enough to score a point.
Mistral, meanwhile, started experimenting with adding additional details to its responses, which resulted in them being counted as unique:
GPT-4o-mini received no points for duplicate response: INVISIBLE MAN
Gemini 2.0 Flash Lite received no points for duplicate response: INVISIBLE MAN
Mistral 7B received one point for unique response: INVISIBLE MAN BY RALPH ELLISON
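Running the scoring sketch from earlier over these three answers shows why the trick works under plain exact-match comparison; the padded string is simply a different string:

```python
print(score_category({
    "GPT-4o-mini": "Invisible Man",
    "Gemini 2.0 Flash Lite": "Invisible Man",
    "Mistral 7B": "Invisible Man by Ralph Ellison",
}))
# {'GPT-4o-mini': 0, 'Gemini 2.0 Flash Lite': 0, 'Mistral 7B': 1}
```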
This is as good a time as any to point out that the game code does not check the accuracy of either category or starting letter (although the latter would be trivial), nor does it attempt any sort of fuzzy matching that would catch semantic duplicates like these. This is partly laziness (let's call it efficiency), and partly intentional, since it leaves loopholes for the AIs to discover and exploit. We will see some interesting examples of that phenomenon in the next article in this series.