REBUS: A Robust Evaluation Benchmark of Understanding Symbols

Andrew Gritsevskiy
Cavendish Labs
Arjun Panickssery
MATS
Derik Kauffman
Cavendish Labs
Joe Cavanagh
Cavendish Labs
Jonathan Chiang
Cavendish Labs, OMEGA Labs
Aaron Kirtland, Hans Gundlach, Irina Gritsevskaya, Lydia La Roux, Michelle Hung
Dataset contributors
[email protected]
Abstract

We propose a new benchmark evaluating the performance of multimodal large language models on rebus puzzles. The dataset covers 333 original examples of image-based wordplay, cluing 13 categories such as movies, composers, major cities, and food. To perform well on the benchmark task of identifying the clued word or phrase, models must combine image recognition and string manipulation with hypothesis testing, multi-step reasoning, and an understanding of human cognition, making for a complex, multimodal evaluation of capabilities. We find that GPT-4o significantly outperforms all other models, and that proprietary models as a group outperform all open-source models evaluated. However, even the best model has a final accuracy of only 42%, which goes down to just 7% on hard puzzles, highlighting the need for substantial improvements in reasoning. Further, models rarely understand all parts of a puzzle, and are almost always incapable of retroactively explaining the correct answer. Our benchmark can therefore be used to identify major shortcomings in the knowledge and reasoning of multimodal large language models.

Table 1: A sample of rebuses from the dataset.

Rebus                Category       Difficulty  Answer
[Uncaptioned image]  Towns in MA    Easy        BUCKLAND
[Uncaptioned image]  Famous movies  Easy        DIE HARD
[Uncaptioned image]  Marine life    Medium      MARLIN
[Uncaptioned image]  MBTA stations  Medium      MISSION PARK
[Uncaptioned image]  Cities         Hard        ASHGABAT
[Uncaptioned image]  Cities         Hard        NGERULMUD

1 Introduction

Recent advances in large language models have led to the development of multimodal LLMs (MLLMs), which take both image data and text as an input. Virtually all of these models have been announced within the past year, leading to a significant need for benchmarks evaluating the abilities of these models to reason truthfully and accurately on a diverse set of tasks. When Google announced Gemini (1), they showcased its ability to solve rebuses—wordplay puzzles which involve creatively adding and subtracting letters from words derived from text and images. The diversity of rebuses allows for a broad evaluation of multimodal reasoning capabilities, including image recognition, multi-step reasoning, and understanding the human creator’s intent.

We present REBUS: a collection of 333 hand-crafted rebuses spanning 13 diverse categories, including hand-drawn and digital images created by nine contributors. Samples are presented in Table 1. Notably, GPT-4V, the most powerful model we initially evaluated, answered only 24% of puzzles correctly, highlighting the poor capabilities of MLLMs in new and unexpected domains to which human reasoning generalizes with comparative ease. However, the more recent GPT-4o reaches 42%, indicating significant progress in the reasoning abilities of fully multimodal models (2). Open-source models perform worse, with a median accuracy below 1%. We notice that models often give faithless explanations, fail to change their minds after an initial approach doesn’t work, and remain poorly calibrated about their own abilities.

2 Related Work

Question-answering evaluations have made use of text (3; 4) and multimodal (5; 6) benchmarks to understand model capabilities to perceive and reason. Additional benchmarks have required multimodal reasoning for multiple-choice and free-response questions testing academic knowledge (7; 8; 9; 10; 11). One notable advantage of our dataset is its middle ground between multiple-choice questions, which are easy to evaluate but anchor the model to the options, and free-response questions, which provide more interesting context into the model’s reasoning and knowledge but are hard to score. All puzzles in REBUS have one clearly correct answer that doesn’t need to be provided to the model beforehand. MLLM puzzle-solving abilities have been studied previously, for example in the development of Kosmos-1 (12), which scores 26% on the Raven’s Progressive Matrices IQ test, and in (13), which demonstrates that GPT-3.5 can solve 50.2% of puzzles from NPR’s Sunday Puzzle game show.

3 The REBUS Benchmark

3.1 Dataset description

Figure 1: Breakdown of rebus characteristics.

We introduce the REBUS dataset, a collection of 333 hand-created rebuses designed to test the capabilities of multimodal large language models at solving image-based wordplay puzzles. The rebuses span 13 diverse categories including “Cities,” “Towns in Massachusetts,” “Marine life,” “Christmas songs,” and “Composers,” and require multiple cognitive skills to solve. There are 191 easy, 114 medium, and 28 difficult puzzles, with harder puzzles requiring more detailed image recognition, more advanced reasoning techniques, or both. Difficulty was evaluated subjectively by our human solvers, but correlates in the expected way with model performance. Rebuses might be hand-drawn, drawn digitally, created via digital composition of public domain images from the Internet, or a combination of the above. Table 1 provides a sample of rebuses and their solutions.

The dataset is annotated according to several other characteristics that the rebuses may or may not have:

  • Puzzles whose clues lead to an inexact spelling of the answer. For instance, a puzzle showing a picture of a forest followed by a bowl of gumbo with the letter O subtracted directly encodes the string “forest gumb,” which the model must identify as a reference to the movie Forrest Gump.

  • Puzzles which require specific references, such as recognizing the skyline of a particular city, or the face of an individual celebrity. The heuristic we used for this was whether most human solvers needed to reverse-image search any part of the puzzle to fully solve it.

  • Puzzles which require reading and interpreting text beyond just a few individual letters. That is, a rebus cluing the word “ant” that is an image of a plant with the text “–PL” would not count as requiring reading, while a rebus requiring the model to recognize a mathematical formula as Sylow’s third theorem would count.

In all, 238 rebuses (71.5% of the dataset) have at least one of the three more advanced properties, and the detailed breakdown is shown in Figure 1.

The REBUS dataset highlights several key challenges in multimodal language models:

  • Multi-step visual reasoning—many rebuses contain information in a meaningful pattern, from which the necessary string operations and structure must be successfully inferred.

  • Spelling—string manipulations require accurate letter-wise representations.

  • Hypothesis testing—for instance, if the model recognizes a fictional-character-themed puzzle as containing images representing “Megachiroptera” and “Einstein,” it needs to revise its initial interpretations to reach the correct answer “Batman.”

  • World knowledge—many puzzles contain crucial references to specific elements of the real world.

  • Grounded image recognition—puzzles sometimes require identifying the most important part of an image, or recognizing what the image may represent as a whole, like understanding that a photograph of a group of lions might be cluing “lion,” “big cats,” “savanna,” or “pride.”

  • Understanding human intent—solving the puzzles requires an understanding of what answers or reasoning steps could have been plausibly developed by the puzzle author.

3.2 Limitations

Our dataset has several limitations—one major one is the distribution over categories and rebus styles. Almost 28% of the entire dataset falls into the category “Towns in Massachusetts”, which, while a favorite topic of one of the authors, may unfairly advantage or disadvantage models depending on their knowledge of the geography of a particular U.S. state. Additionally, digitally created rebuses significantly outnumber hand-drawn rebuses, due to their relative ease of creation. However, we decided that having more rebuses was better than having a good distribution over topics and styles, and thus we interpret model performance in the context of certain rebus characteristics in Table 3.

4 Experiments

Evaluation prompt template:
This rebus puzzle is a play on words based on images, and may contain text, logical operators, addition/subtraction of letters, and other forms of creative thinking to solve. Can you figure out what it is? The category for this puzzle is $category; that is, your answer should match the category in order to be correct. Take a deep breath, and let’s begin. You can think for as long as you want, until you get a correct answer in the category $category. When you’re done reasoning and thinking, output your final answer in three braces, like {{{this}}}.
Figure 2: Our evaluation prompt template.

We evaluate several leading open-source and proprietary multimodal models. Every model is provided the rebus image, the category of the puzzle, and a short prompt, shown in Figure 2, asking it to solve the rebus. All models are evaluated zero-shot on the task. We use similar prompts for all models; for models on which the default prompt led to bad results, we engage in modest prompt engineering, and only the best results over all prompts are reported. We ask the models to provide final answers in a specific format; where this format is not followed, we report our best guess as to the model’s answer. The GPT-4V, GPT-4o, Gemini Pro, Claude 3, and Reka Core evaluations were conducted via their respective APIs; all other evaluations were done on NVIDIA A100 GPUs through the ACCESS cyberinfrastructure ecosystem (14). All evaluation code and results are provided at https://github.com/cvndsh/rebus.

4.1 Baselines

Table 2: Model performance (% correct) by puzzle difficulty
Model Overall Easy Medium Hard
GPT-4o 41.7 52.4 32.5 7.1
GPT-4V 24.0 33.0 13.2 7.1
Claude 3 Opus 18.9 29.3 6.1 0.0
Gemini 1.5 Pro 17.4 23.0 11.4 3.6
Gemini 1.0 Pro 13.2 19.4 5.3 3.6
Claude 3 Sonnet 7.5 11.5 2.7 0.0
Gemini 1.5 Flash 6.0 8.9 1.8 3.6
Reka Core 5.4 7.9 1.8 3.6
Claude 3 Haiku 4.5 6.3 1.8 3.6
LLaVa-1.6-34B 2.7 3.7 1.8 0.0
LLaVa-1.5-13B 1.8 2.6 0.9 0.0
LLaVa-1.5-7B 1.5 2.6 0.0 0.0
BLIP2-FLAN-T5-XXL 0.9 0.5 1.8 0.0
CogVLM 0.9 1.6 0.0 0.0
QWEN 0.9 1.6 0.0 0.0
InstructBLIP 0.6 0.5 0.9 0.0

We evaluate GPT-4V (15), the Claude 3 family of models (16), Gemini Pro (1), Reka Core (17), LLaVa-1.6-34B (18), LLaVa-1.5-13B (19), LLaVa-1.5-7B (20), BLIP2-FLAN-T5-XXL (21), CogVLM (22), QWEN (23), and InstructBLIP (24), as well as the more recent GPT-4o. We attempted to evaluate Fuyu-8B (25), but failed to elicit any reasonable output, likely because it lacks fine-tuning. The performance of each model, broken down by difficulty, is presented in Table 2, as well as Figure 3. The proprietary models GPT-4o, GPT-4V, and Claude 3 Opus exhibit by far the best performance, scoring 41.7%, 24.0%, and 18.9%, respectively. Open-source models never reach an accuracy above 3%, with LLaVa-1.6-34B performing best.
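The overall column of Table 2 is the per-difficulty accuracy weighted by the number of puzzles at each difficulty (191 easy, 114 medium, 28 hard). The short check below illustrates this for three rows of the table; it is a sanity-check sketch with hypothetical names, and because it starts from the rounded percentages it only recovers the overall figure to within about 0.1 percentage points.

```python
# Puzzle counts per difficulty, from Section 3.1.
COUNTS = {"easy": 191, "medium": 114, "hard": 28}

# Per-difficulty accuracies (% correct) copied from Table 2.
TABLE_2 = {
    "GPT-4o": {"easy": 52.4, "medium": 32.5, "hard": 7.1, "overall": 41.7},
    "GPT-4V": {"easy": 33.0, "medium": 13.2, "hard": 7.1, "overall": 24.0},
    "Claude 3 Opus": {"easy": 29.3, "medium": 6.1, "hard": 0.0, "overall": 18.9},
}


def weighted_overall(row: dict[str, float]) -> float:
    """Weight each difficulty's accuracy by its share of the 333 puzzles."""
    total = sum(COUNTS.values())
    return sum(row[level] * COUNTS[level] for level in COUNTS) / total


for model, row in TABLE_2.items():
    print(f"{model}: recomputed {weighted_overall(row):.2f}, reported {row['overall']}")
```

Because hard puzzles make up only 28 of the 333 examples, the overall score is dominated by easy and medium puzzles, which is worth keeping in mind when comparing models on the overall column alone.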

Additionally, we break down model performance by their accuracy on puzzles with and without the exact spelling, specific reference, and required reading characteristics, in Table 3. We note that Gemini Pro seems to particularly struggle with phonetic rebuses, and other rebuses where the spelling is inexact. Also, most models tend to perform better on rebuses without specific references, and slightly better on rebuses where no reading is required. However, detailed conclusions for open-source models are difficult to draw due to small samples (often models solve just 0–1 puzzles from a certain subset).

Figure 3: Model performance on the REBUS benchmark.

Figure 4: Performance of models on the benchmark over time. Position on the x-axis corresponds to model release date.

Table 3: Model performance (% correct) broken down by rebus criteria
Model  Exact Spelling (Yes)  Exact Spelling (No)  Specific Reference (Yes)  Specific Reference (No)  Reading Required (Yes)  Reading Required (No)
GPT-4o 42.6 40.4 30.8 46.9 37.8 42.4
GPT-4V 22.8 23.5 16.8 26.1 17.8 24.0
Claude 3 Opus 17.8 20.6 11.2 22.6 8.9 20.5
Gemini 1.5 Pro 16.2 17.7 9.4 20.4 13.3 17.4
Gemini 1.0 Pro 16.8 5.2 9.4 13.3 2.2 13.5
Claude 3 Sonnet 11.2 2.2 6.5 8.0 6.7 7.6
Gemini 1.5 Flash 7.1 3.7 4.7 6.2 6.7 5.6
Reka Core 6.1 4.4 3.7 6.2 5.6 4.4
Claude 3 Haiku 7.1 0.8 4.7 4.4 4.4 4.5
LLaVa-1.6-34B 3.6 1.5 1.9 3.1 2.2 2.8
LLaVa-1.5-13B 1.5 1.5 1.9 1.3 0.0 1.7
LLaVa-1.5-7B 2.0 0.7 1.9 1.3 2.2 1.4
BLIP2-FLAN-T5-XXL 1.0 0.0 1.9 0.0 2.2 0.4
CogVLM 1.5 0.0 0.9 0.9 0.0 1.0
QWEN 1.5 0.0 0.9 0.9 0.0 1.0
InstructBLIP 1.0 0.0 1.9 0.0 2.2 0.4

4.2 Calibration and in-context answers

We analyzed the calibration of GPT-4V, GPT-4o, Gemini 1.5 Pro, and Gemini 1.0 Pro by asking them to give point estimates of their confidence that their answers are correct. We found that all models are highly overconfident in their answers, as shown in Figure 5. However, GPT-4o was the best calibrated. We can quantify this by computing each model’s Brier score (26), for which lower is better: GPT-4o gets a score of 0.242, GPT-4V gets 0.338, Gemini 1.5 Pro gets 0.583, and Gemini 1.0 Pro gets 0.754.
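For reference, the Brier score is the mean squared difference between a stated confidence and the outcome (1 if the answer was correct, 0 otherwise). A minimal sketch follows; the function name and the toy confidence values are illustrative assumptions, not data from our evaluation.

```python
def brier_score(confidences: list[float], outcomes: list[int]) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome."""
    if not confidences or len(confidences) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = [(p - o) ** 2 for p, o in zip(confidences, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Hypothetical example: four puzzles, two answered correctly.
toy_confidences = [0.9, 0.8, 0.6, 0.95]
toy_outcomes = [1, 0, 0, 1]
toy_score = brier_score(toy_confidences, toy_outcomes)
```

A useful reference point is that always reporting 50% confidence yields a Brier score of 0.25 regardless of the outcomes, so scores well above 0.25 indicate confidence estimates that are less useful than an uninformative guess.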

Figure 5: Calibration curves of GPT-4o, GPT-4V, Gemini 1.5 Pro, and Gemini 1.0 Pro.

Model              Cities  Towns in MA  Movies  Names  Marine Life  KY     Common Phrases  Food   Comps  Misc  MBTA
GPT-4o             97.0    92.7         100.0   83.3   100.0        100.0  100.0           100.0  92.9   92.3  95.0
GPT-4V             78.8    72.8         92.3    100.0  87.5         100.0  50.0            20.0   85.7   85.7  90.0
Claude 3 Opus      84.1    72.8         92.3    91.7   93.8         71.4   50.0            70.0   85.7   76.9  85.0
Gemini 1.5 Pro     84.8    62.0         92.3    83.3   62.5         100.0  50.0            80.0   71.4   61.5  85.0
Gemini 1.0 Pro     74.2    79.3         84.6    50.0   81.2         71.4   100.0           90.0   78.6   57.1  85.0
Claude 3 Sonnet    85.6    82.6         69.2    66.7   43.8         85.7   75.0            100.0  100.0  92.3  90.0
Gemini 1.5 Flash   59.1    53.4         76.9    75.0   43.8         71.4   0.0             60.0   64.3   46.2  55.0
Reka Core          70.5    64.1         76.9    83.3   100.0        71.4   75.0            60.0   85.7   69.2  55.0
Claude 3 Haiku     68.2    69.6         69.2    91.7   56.2         71.4   75.0            70.0   71.4   46.2  45.0
LLaVa-1.6-34B      68.9    53.3         100.0   75.0   50.0         85.7   25.0            70.0   57.1   38.5  55.0
LLaVa-1.5-13B      47.0    53.3         76.9    75.0   56.2         100.0  25.0            50.0   64.3   42.9  30.0
LLaVa-1.5-7B       58.3    75.0         92.3    58.3   37.5         57.1   50.0            30.0   64.3   42.9  10.0
BLIP2-FLAN-T5-XXL  9.1     66.3         46.2    50.0   25.0         0.0    25.0            30.0   50.0   14.3  0.0
CogVLM             18.2    21.7         76.9    8.3    18.8         57.1   25.0            10.0   50.0   14.3  0.0
QWEN               8.3     10.9         0.0     50.0   18.8         0.0    0.0             20.0   21.4   0.0   0.0
InstructBLIP       4.5     6.5          61.5    8.3    0.0          0.0    0.0             0.0    0.0    0.0   0.0

Table 4: Percentage of category-appropriate guesses per model for each category (e.g., “Paris” for the category “Cities,” even if the answer is wrong).

Finally, we examined the extent to which models gave answers which at least matched the stated puzzle category. For example, if a model knows that the category of a puzzle is “food,” then the answer “potato” is considered in-category, whereas the answer “violin” would be out-of-category. GPT-4o has the highest percentage of in-category answers, at 95.3%. Surprisingly, we find that Claude 3 Sonnet comes in second at 82.9%, followed by Claude 3 Opus, GPT-4V, and Gemini Pro at 80.6%, 78.3%, and 75.6%, respectively. Reka Core and Claude 3 Haiku round out the proprietary models at 70.6% and 67.0%, followed by LLaVa-1.6-34B at 62.1%, LLaVa-7B outperforming LLaVa-13B 59.7% to 52.1%, and BLIP2-FLAN-T5-XXL, CogVLM, QWEN, and InstructBLIP at 30.9%, 22.2%, 10.5%, and 6.3%, respectively. The full data, broken down by category, is shown in Table 4. In some categories, such as MBTA, open-source models appear to simply lack enough knowledge of the subject to solve the puzzles effectively: they provide category-appropriate answers to just 13.6% of puzzles, compared to the closed-source models’ 75.0%. However, in other categories like “Movies,” both open- and closed-source models have relatively high rates of category-appropriate answers (64.8% and 80.8% respectively), but all but four models get an accuracy of 0% for the entire category (the exceptions being GPT-4V, Claude 3 Haiku, and Gemini Flash at 7.7%, and GPT-4o at 15.4%). Thus, better world knowledge is a necessary, but far from sufficient, condition for model improvement.
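The open-source group figures quoted above can be recovered from Table 4 by averaging a category's column over the seven open-source models. The sketch below does this for the MBTA and Movies columns; the grouping of models and the use of an unweighted mean are assumptions on our reading of the table, and the sketch covers only the open-source group.

```python
from statistics import mean

# In-category rates (%) for the open-source models, copied from Table 4.
OPEN_SOURCE_RATES = {
    "LLaVa-1.6-34B": {"Movies": 100.0, "MBTA": 55.0},
    "LLaVa-1.5-13B": {"Movies": 76.9, "MBTA": 30.0},
    "LLaVa-1.5-7B": {"Movies": 92.3, "MBTA": 10.0},
    "BLIP2-FLAN-T5-XXL": {"Movies": 46.2, "MBTA": 0.0},
    "CogVLM": {"Movies": 76.9, "MBTA": 0.0},
    "QWEN": {"Movies": 0.0, "MBTA": 0.0},
    "InstructBLIP": {"Movies": 61.5, "MBTA": 0.0},
}


def group_rate(rates: dict[str, dict[str, float]], category: str) -> float:
    """Unweighted mean of one category's in-category rate across models."""
    return mean(row[category] for row in rates.values())


mbta_open = round(group_rate(OPEN_SOURCE_RATES, "MBTA"), 1)
movies_open = round(group_rate(OPEN_SOURCE_RATES, "Movies"), 1)
```

Since every model sees the same puzzles within a category, the unweighted mean over models equals the pooled rate for the group in that category.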

Claude 3 Sonnet’s impressive performance on giving in-category answers, especially relative to the much larger and more capable Claude 3 Opus, remains puzzling. Qualitatively, we noticed that Claude 3 Opus would sometimes give realistic-sounding, but fake, answers, such as “Dumballa, Australia” and “Vert, UK” for cities. We speculate that thinking of believable but false answers that match the correct category requires a higher level of intelligence and understanding, which is why we observe this behavior in Opus but not in Sonnet; such fabricated answers count against Opus on this metric and hurt its performance relative to Sonnet. However, further investigation of potentially unfaithful behavior is warranted, as we discuss further in the next section.

4.3 Faithfulness

User input: [Uncaptioned image]
GPT-4V reply: The image provided is a rebus puzzle, and the solution should be a town in Massachusetts. The image features a picture of a woman and the text “- 6.35kg”. In solving this rebus, one would typically use the image to guess a name, phrase, or word and then apply the mathematical operation to it. The woman in the picture is a celebrity (whose name we won’t mention due to privacy policies). However, in the context of a rebus puzzle, her first name may be used as a part of the puzzle. Let’s assume her first name is a common one, such as “Sharon,” a common first name for women. The text “- 6.35kg” suggests subtracting “stone” since 6.35 kilograms is approximately one stone in British weight measurement. Therefore, if we combine the assumed name “Sharon” with the subtraction of “stone,” we can infer a town name in Massachusetts that might fit the solution of this rebus puzzle. Given this reasoning, the answer to the rebus puzzle would be {{{Sharon}}}.

We observe that even when models provide the correct answer, they often misunderstand a portion of the puzzle or provide incorrect justifications, failing to produce faithful reasoning (27). Full model outputs are available in our GitHub repository. Especially interesting to us was a specific type of incorrect justification exhibited by GPT-4V, where it would refuse to recognize a person due to OpenAI’s privacy policies, and then use their name anyway, pretending to have drawn that name at random. An example of this is shown in the transcript above, in which GPT-4V refuses to recognize the actress Sharon Stone, but then says “Let’s assume her first name is a common one, such as Sharon.”

4.4 Human baselines

While all rebuses were test-solved by the authors, implying a hypothetical human-solvability rate of 100%, we additionally obtained a human baseline by collecting crowdsourced data. We asked seven participants to solve 45–50 rebuses each, compensating them at $1 per easy rebus solved, $1.50 per medium rebus solved, and $2 per hard rebus solved (minimum rate: $18.50/hr; median rate: $24.00/hr; total paid out: $336.00). The participants were allowed to use the internet and Google reverse image search; however, they were not allowed to consult AI tools such as ChatGPT. Additionally, the participants were instructed to not spend more than five minutes on any single rebus. The results of our human baselines were an overall correct answer rate of 82.0%, including 84.8% on easy, 84.1% on medium, and 53.6% on hard puzzles; 78.7% on puzzles that had exact spelling, 84.3% on puzzles that didn’t; 86.0% on puzzles that had a specific reference, and 80.1% on ones that didn’t; and 82.2% on puzzles that required reading, and 81.9% on ones that didn’t. A full example of what the participants saw is available at this https url.

5 Future Work

The REBUS dataset serves as an effective benchmark for evaluating the advanced multimodal capabilities of language models. While there are several immediate steps which can improve model performance on this dataset, such as prompt engineering, access to word spellings, reverse-image-lookup tools, and phonetic transcriptions, it would be interesting to see how non-rebus-specific improvements affect performance on this benchmark. One recent direction we’re excited about is guided visual search in MLLMs, in which image tokens are extracted multiple times as needed, instead of only once. One such model is $V^{\star}$ (28), which performs multi-round guided search to perform detailed visual grounding. Such methods are intuitively closer to how humans solve rebuses by looking over the puzzle, thinking of hypotheses, and then zooming in to the relevant parts of the rebus that need to be adjusted or seem not to fit.

We are also curious about the specific flavor of faithlessness exhibited in Section 4.3, in which a model is not supposed to divulge certain information, so it hallucinates plausible reasons to expose it regardless. In every attempted follow-up, GPT-4V maintained that the name “Sharon” was chosen “randomly” or “as a placeholder”; however, further prodding reliably induces the model to “guess” that the person’s last name is Stone, and that she is an actress. This lack of faithfulness indicates a major attack vector against models which can draw inferences from information that RLHF trains them not to disclose, and may be important to investigate further.

Reproducibility statement

The evaluation code and results are provided at https://github.com/cvndsh/rebus. The dataset is on Hugging Face at https://huggingface.co/datasets/cavendishlabs/rebus

Acknowledgements

We would like to thank the following for their helpful comments: Justin Li, Nina Rimsky, and Sarah Chen. AG and DK are supported by Cavendish Labs Research Grant #000-0012. This work used the Delta GPU system at the National Center for Supercomputing Applications through allocation CIS230057 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296.


References

  • [1] Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul R. Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaïs White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, Alexandre Frechette, Charlotte Smith, Laura Culp, Lev Proleev, Yi Luan, Xi Chen, James Lottes, Nathan Schucher, Federico Lebron, Alban Rrustemi, Natalie Clay, Phil Crone, Tomas Kocisky, Jeffrey Zhao, Bartek Perz, Dian Yu, Heidi Howard, Adam Bloniarz, Jack W. 
Rae, Han Lu, Laurent Sifre, Marcello Maggioni, Fred Alcober, Dan Garrette, Megan Barnes, Shantanu Thakoor, Jacob Austin, Gabriel Barth-Maron, William Wong, Rishabh Joshi, Rahma Chaabouni, Deeni Fatiha, Arun Ahuja, Ruibo Liu, Yunxuan Li, Sarah Cogan, Jeremy Chen, Chao Jia, Chenjie Gu, Qiao Zhang, Jordan Grimstad, Ale Jakse Hartman, Martin Chadwick, Gaurav Singh Tomar, Xavier Garcia, Evan Senter, Emanuel Taropa, Thanumalayan Sankaranarayana Pillai, Jacob Devlin, Michael Laskin, Diego de Las Casas, Dasha Valter, Connie Tao, Lorenzo Blanco, Adrià Puigdomènech Badia, David Reitter, Mianna Chen, Jenny Brennan, Clara Rivera, Sergey Brin, Shariq Iqbal, Gabriela Surita, Jane Labanowski, Abhi Rao, Stephanie Winkler, Emilio Parisotto, Yiming Gu, Kate Olszewska, Yu**g Zhang, Ravi Addanki, Antoine Miech, Annie Louis, Laurent El Shafey, Denis Teplyashin, Geoff Brown, Elliot Catt, Nithya Attaluri, Jan Balaguer, Jackie Xiang, Pidong Wang, Zoe Ashwood, Anton Briukhov, Albert Webson, Sanjay Ganapathy, Smit Sanghavi, Ajay Kannan, Ming-Wei Chang, Axel Stjerngren, Josip Djolonga, Yuting Sun, Ankur Bapna, Matthew Aitchison, Pedram Pejman, Henryk Michalewski, Tianhe Yu, Cindy Wang, Juliette Love, Junwhan Ahn, Dawn Bloxwich, Kehang Han, Peter Humphreys, Thibault Sellam, James Bradbury, Varun Godbole, Sina Samangooei, Bogdan Damoc, Alex Kaskasoli, Sébastien M. R. 
Arnold, Vijay Vasudevan, Shubham Agrawal, Jason Riesa, Dmitry Lepikhin, Richard Tanburn, Srivatsan Srinivasan, Hyeontaek Lim, Sarah Hodkinson, Pranav Shyam, Johan Ferret, Steven Hand, Ankush Garg, Tom Le Paine, Jian Li, Yujia Li, Minh Giang, Alexander Neitz, Zaheer Abbas, Sarah York, Machel Reid, Elizabeth Cole, Aakanksha Chowdhery, Dipanjan Das, Dominika Rogozińska, Vitaly Nikolaev, Pablo Sprechmann, Zachary Nado, Lukas Zilka, Flavien Prost, Luheng He, Marianne Monteiro, Gaurav Mishra, Chris Welty, Josh Newlan, Dawei Jia, Miltiadis Allamanis, Clara Huiyi Hu, Raoul de Liedekerke, Justin Gilmer, Carl Saroufim, Shruti Rijhwani, Shaobo Hou, Disha Shrivastava, Anirudh Baddepudi, Alex Goldin, Adnan Ozturel, Albin Cassirer, Yunhan Xu, Daniel Sohn, Devendra Sachan, Reinald Kim Amplayo, Craig Swanson, Dessie Petrova, Shashi Narayan, Arthur Guez, Siddhartha Brahma, Jessica Landon, Miteyan Patel, Ruizhe Zhao, Kevin Villela, Luyu Wang, Wenhao Jia, Matthew Rahtz, Mai Giménez, Legg Yeung, Hanzhao Lin, James Keeling, Petko Georgiev, Diana Mincu, Boxi Wu, Salem Haykal, Rachel Saputro, Kiran Vodrahalli, James Qin, Zeynep Cankara, Abhanshu Sharma, Nick Fernando, Will Hawkins, Behnam Neyshabur, Solomon Kim, Adrian Hutter, Priyanka Agrawal, Alex Castro-Ros, George van den Driessche, Tao Wang, Fan Yang, Shuo yiin Chang, Paul Komarek, Ross McIlroy, Mario Lučić, Guodong Zhang, Wael Farhan, Michael Sharman, Paul Natsev, Paul Michel, Yong Cheng, Yamini Bansal, Siyuan Qiao, Kris Cao, Siamak Shakeri, Christina Butterfield, Justin Chung, Paul Kishan Rubenstein, Shivani Agrawal, Arthur Mensch, Kedar Soparkar, Karel Lenc, Timothy Chung, Aedan Pope, Loren Maggiore, Jackie Kay, Priya Jhakra, Shibo Wang, Joshua Maynez, Mary Phuong, Taylor Tobin, Andrea Tacchetti, Maja Trebacz, Kevin Robinson, Yash Katariya, Sebastian Riedel, Paige Bailey, Kefan Xiao, Nimesh Ghelani, Lora Aroyo, Ambrose Slone, Neil Houlsby, Xuehan Xiong, Zhen Yang, Elena Gribovskaya, Jonas Adler, Mateo Wirth, Lisa Lee, Music Li, 
Thais Kagohara, Jay Pavagadhi, Sophie Bridgers, Anna Bortsova, Sanjay Ghemawat, Zafarali Ahmed, Tianqi Liu, Richard Powell, Vijay Bolina, Mariko Iinuma, Polina Zablotskaia, James Besley, Da-Woon Chung, Timothy Dozat, Ramona Comanescu, Xiance Si, Jeremy Greer, Guolong Su, Martin Polacek, Raphaël Lopez Kaufman, Simon Tokumine, Hexiang Hu, Elena Buchatskaya, Yingjie Miao, Mohamed Elhawaty, Aditya Siddhant, Nenad Tomasev, **wei Xing, Christina Greer, Helen Miller, Shereen Ashraf, Aurko Roy, Zizhao Zhang, Ada Ma, Angelos Filos, Milos Besta, Rory Blevins, Ted Klimenko, Chih-Kuan Yeh, Soravit Changpinyo, Jiaqi Mu, Oscar Chang, Mantas Pajarskas, Carrie Muir, Vered Cohen, Charline Le Lan, Krishna Haridasan, Amit Marathe, Steven Hansen, Sholto Douglas, Rajkumar Samuel, Mingqiu Wang, Sophia Austin, Chang Lan, Jiepu Jiang, Justin Chiu, Jaime Alonso Lorenzo, Lars Lowe Sjösund, Sébastien Cevey, Zach Gleicher, Thi Avrahami, Anudhyan Boral, Hansa Srinivasan, Vittorio Selo, Rhys May, Konstantinos Aisopos, Léonard Hussenot, Livio Baldini Soares, Kate Baumli, Michael B. 
Chang, Adrià Recasens, Ben Caine, Alexander Pritzel, Filip Pavetic, Fabio Pardo, Anita Gergely, Justin Frye, Vinay Ramasesh, Dan Horgan, Kartikeya Badola, Nora Kassner, Subhrajit Roy, Ethan Dyer, Víctor Campos, Alex Tomala, Yunhao Tang, Dalia El Badawy, Elspeth White, Basil Mustafa, Oran Lang, Abhishek **dal, Sharad Vikram, Zhitao Gong, Sergi Caelles, Ross Hemsley, Gregory Thornton, Fangxiaoyu Feng, Wojciech Stokowiec, Ce Zheng, Phoebe Thacker, Çağlar Ünlü, Zhishuai Zhang, Mohammad Saleh, James Svensson, Max Bileschi, Piyush Patil, Ankesh Anand, Roman Ring, Katerina Tsihlas, Arpi Vezer, Marco Selvi, Toby Shevlane, Mikel Rodriguez, Tom Kwiatkowski, Samira Daruki, Keran Rong, Allan Dafoe, Nicholas FitzGerald, Keren Gu-Lemberg, Mina Khan, Lisa Anne Hendricks, Marie Pellat, Vladimir Feinberg, James Cobon-Kerr, Tara Sainath, Maribeth Rauh, Sayed Hadi Hashemi, Richard Ives, Yana Hasson, YaGuang Li, Eric Noland, Yuan Cao, Nathan Byrd, Le Hou, Qingze Wang, Thibault Sottiaux, Michela Paganini, Jean-Baptiste Lespiau, Alexandre Moufarek, Samer Hassan, Kaushik Shivakumar, Joost van Amersfoort, Amol Mandhane, Pratik Joshi, Anirudh Goyal, Matthew Tung, Andrew Brock, Hannah Sheahan, Vedant Misra, Cheng Li, Nemanja Rakićević, Mostafa Dehghani, Fangyu Liu, Sid Mittal, Junhyuk Oh, Seb Noury, Eren Sezener, Fantine Huot, Matthew Lamm, Nicola De Cao, Charlie Chen, Gamaleldin Elsayed, Ed Chi, Mahdis Mahdieh, Ian Tenney, Nan Hua, Ivan Petrychenko, Patrick Kane, Dylan Scandinaro, Rishub Jain, Jonathan Uesato, Romina Datta, Adam Sadovsky, Oskar Bunyan, Dominik Rabiej, Shimu Wu, John Zhang, Gautam Vasudevan, Edouard Leurent, Mahmoud Alnahlawi, Ionut Georgescu, Nan Wei, Ivy Zheng, Betty Chan, Pam G Rabinovitch, Piotr Stanczyk, Ye Zhang, David Steiner, Subhajit Naskar, Michael Azzam, Matthew Johnson, Adam Paszke, Chung-Cheng Chiu, Jaume Sanchez Elias, Afroz Mohiuddin, Faizan Muhammad, ** Miao, Andrew Lee, Nino Vieillard, Sahitya Potluri, Jane Park, Elnaz Davoodi, Jiageng Zhang, Jeff Stanway, 
Drew Garmon, Abhijit Karmarkar, Zhe Dong, Jong Lee, Aviral Kumar, Luowei Zhou, Jonathan Evens, William Isaac, Zhe Chen, Johnson Jia, Anselm Levskaya, Zhenkai Zhu, Chris Gorgolewski, Peter Grabowski, Yu Mao, Alberto Magni, Kaisheng Yao, Javier Snaider, Norman Casagrande, Paul Suganthan, Evan Palmer, Geoffrey Irving, Edward Loper, Manaal Faruqui, Isha Arkatkar, Nanxin Chen, Izhak Shafran, Michael Fink, Alfonso Castaño, Irene Giannoumis, Wooyeol Kim, Mikołaj Rybiński, Ashwin Sreevatsa, Jennifer Prendki, David Soergel, Adrian Goedeckemeyer, Willi Gierke, Mohsen Jafari, Meenu Gaba, Jeremy Wiesner, Diana Gage Wright, Yawen Wei, Harsha Vashisht, Yana Kulizhskaya, Jay Hoover, Maigo Le, Lu Li, Chimezie Iwuanyanwu, Lu Liu, Kevin Ramirez, Andrey Khorlin, Albert Cui, Tian LIN, Marin Georgiev, Marcus Wu, Ricardo Aguilar, Keith Pallo, Abhishek Chakladar, Alena Repina, Xihui Wu, Tom van der Weide, Priya Ponnapalli, Caroline Kaplan, Jiri Simsa, Shuangfeng Li, Olivier Dousse, Fan Yang, Jeff Piper, Nathan Ie, Minnie Lui, Rama Pasumarthi, Nathan Lintz, Anitha Vijayakumar, Lam Nguyen Thiet, Daniel Andor, Pedro Valenzuela, Cosmin Paduraru, Daiyi Peng, Katherine Lee, Shuyuan Zhang, Somer Greene, Duc Dung Nguyen, Paula Kurylowicz, Sarmishta Velury, Sebastian Krause, Cassidy Hardin, Lucas Dixon, Lili Janzer, Kiam Choo, Ziqiang Feng, Biao Zhang, Achintya Singhal, Tejasi Latkar, Mingyang Zhang, Quoc Le, Elena Allica Abellan, Dayou Du, Dan McKinnon, Natasha Antropova, Tolga Bolukbasi, Orgad Keller, David Reid, Daniel Finchelstein, Maria Abi Raad, Remi Crocker, Peter Hawkins, Robert Dadashi, Colin Gaffney, Sid Lall, Ken Franko, Egor Filonov, Anna Bulanova, Rémi Leblond, Vikas Yadav, Shirley Chung, Harry Askham, Luis C. 
Cobo, Kelvin Xu, Felix Fischer, Jun Xu, Christina Sorokin, Chris Alberti, Chu-Cheng Lin, Colin Evans, Hao Zhou, Alek Dimitriev, Hannah Forbes, Dylan Banarse, Zora Tung, Jeremiah Liu, Mark Omernick, Colton Bishop, Chintu Kumar, Rachel Sterneck, Ryan Foley, Rohan Jain, Swaroop Mishra, Jiawei Xia, Taylor Bos, Geoffrey Cideron, Ehsan Amid, Francesco Piccinno, Xingyu Wang, Praseem Banzal, Petru Gurita, Hila Noga, Premal Shah, Daniel J. Mankowitz, Alex Polozov, Nate Kushman, Victoria Krakovna, Sasha Brown, MohammadHossein Bateni, Dennis Duan, Vlad Firoiu, Meghana Thotakuri, Tom Natan, Anhad Mohananey, Matthieu Geist, Sidharth Mudgal, Sertan Girgin, Hui Li, Jiayu Ye, Ofir Roval, Reiko Tojo, Michael Kwong, James Lee-Thorp, Christopher Yew, Quan Yuan, Sumit Bagri, Danila Sinopalnikov, Sabela Ramos, John Mellor, Abhishek Sharma, Aliaksei Severyn, Jonathan Lai, Kathy Wu, Heng-Tze Cheng, David Miller, Nicolas Sonnerat, Denis Vnukov, Rory Greig, Jennifer Beattie, Emily Caveness, Libin Bai, Julian Eisenschlos, Alex Korchemniy, Tomy Tsai, Mimi Jasarevic, Weize Kong, Phuong Dao, Zeyu Zheng, Frederick Liu, Fan Yang, Rui Zhu, Mark Geller, Tian Huey Teh, Jason Sanmiya, Evgeny Gladchenko, Nejc Trdin, Andrei Sozanschi, Daniel Toyama, Evan Rosen, Sasan Tavakkol, Linting Xue, Chen Elkind, Oliver Woodman, John Carpenter, George Papamakarios, Rupert Kemp, Sushant Kafle, Tanya Grunina, Rishika Sinha, Alice Talbert, Abhimanyu Goyal, Diane Wu, Denese Owusu-Afriyie, Cosmo Du, Chloe Thornton, Jordi Pont-Tuset, Pradyumna Narayana, **g Li, Sabaer Fatehi, John Wieting, Omar Ajmeri, Benigno Uria, Tao Zhu, Yeongil Ko, Laura Knight, Amélie Héliou, Ning Niu, Shane Gu, Chenxi Pang, Dustin Tran, Yeqing Li, Nir Levine, Ariel Stolovich, Norbert Kalb, Rebeca Santamaria-Fernandez, Sonam Goenka, Wenny Yustalim, Robin Strudel, Ali Elqursh, Balaji Lakshminarayanan, Charlie Deck, Shyam Upadhyay, Hyo Lee, Mike Dusenberry, Zonglin Li, Xuezhi Wang, Kyle Levin, Raphael Hoffmann, Dan Holtmann-Rice, Olivier Bachem, 
Summer Yue, Sho Arora, Eric Malmi, Daniil Mirylenka, Qijun Tan, Christy Koh, Soheil Hassas Yeganeh, Siim Põder, Steven Zheng, Francesco Pongetti, Mukarram Tariq, Yanhua Sun, Lucian Ionita, Mojtaba Seyedhosseini, Pouya Tafti, Ragha Kotikalapudi, Zhiyu Liu, Anmol Gulati, Jasmine Liu, Xinyu Ye, Bart Chrzaszcz, Lily Wang, Nikhil Sethi, Tianrun Li, Ben Brown, Shreya Singh, Wei Fan, Aaron Parisi, Joe Stanton, Chenkai Kuang, Vinod Koverkathu, Christopher A. Choquette-Choo, Yunjie Li, TJ Lu, Abe Ittycheriah, Prakash Shroff, Pei Sun, Mani Varadarajan, Sanaz Bahargam, Rob Willoughby, David Gaddy, Ishita Dasgupta, Guillaume Desjardins, Marco Cornero, Brona Robenek, Bhavishya Mittal, Ben Albrecht, Ashish Shenoy, Fedor Moiseev, Henrik Jacobsson, Alireza Ghaffarkhah, Morgane Rivière, Alanna Walton, Clément Crepy, Alicia Parrish, Yuan Liu, Zongwei Zhou, Clement Farabet, Carey Radebaugh, Praveen Srinivasan, Claudia van der Salm, Andreas Fidjeland, Salvatore Scellato, Eri Latorre-Chimoto, Hanna Klimczak-Plucińska, David Bridson, Dario de Cesare, Tom Hudson, Piermaria Mendolicchio, Lexi Walker, Alex Morris, Ivo Penchev, Matthew Mauger, Alexey Guseynov, Alison Reid, Seth Odoom, Lucia Loher, Victor Cotruta, Madhavi Yenugula, Dominik Grewe, Anastasia Petrushkina, Tom Duerig, Antonio Sanchez, Steve Yadlowsky, Amy Shen, Amir Globerson, Adam Kurzrok, Lynette Webb, Sahil Dua, Dong Li, Preethi Lahoti, Surya Bhupatiraju, Dan Hurt, Haroon Qureshi, Ananth Agarwal, Tomer Shani, Matan Eyal, Anuj Khare, Shreyas Rammohan Belle, Lei Wang, Chetan Tekur, Mihir Sanjay Kale, Jinliang Wei, Ruoxin Sang, Brennan Saeta, Tyler Liechty, Yi Sun, Yao Zhao, Stephan Lee, Pandu Nayak, Doug Fritz, Manish Reddy Vuyyuru, John Aslanides, Nidhi Vyas, Martin Wicke, Xiao Ma, Taylan Bilal, Evgenii Eltyshev, Daniel Balle, Nina Martin, Hardie Cate, James Manyika, Keyvan Amiri, Yelin Kim, Xi Xiong, Kai Kang, Florian Luisier, Nilesh Tripuraneni, David Madras, Mandy Guo, Austin Waters, Oliver Wang, Joshua Ainslie, Jason 
Baldridge, Han Zhang, Garima Pruthi, Jakob Bauer, Feng Yang, Riham Mansour, Jason Gelman, Yang Xu, George Polovets, Ji Liu, Honglong Cai, Warren Chen, XiangHai Sheng, Emily Xue, Sherjil Ozair, Adams Yu, Christof Angermueller, Xiaowei Li, Weiren Wang, Julia Wiesinger, Emmanouil Koukoumidis, Yuan Tian, Anand Iyer, Madhu Gurumurthy, Mark Goldenson, Parashar Shah, MK Blake, Hongkun Yu, Anthony Urbanowicz, Jennimaria Palomaki, Chrisantha Fernando, Kevin Brooks, Ken Durden, Harsh Mehta, Nikola Momchev, Elahe Rahimtoroghi, Maria Georgaki, Amit Raul, Sebastian Ruder, Morgan Redshaw, Jinhyuk Lee, Komal Jalan, Dinghua Li, Ginger Perng, Blake Hechtman, Parker Schuh, Milad Nasr, Mia Chen, Kieran Milan, Vladimir Mikulik, Trevor Strohman, Juliana Franco, Tim Green, Demis Hassabis, Koray Kavukcuoglu, Jeffrey Dean, and Oriol Vinyals. Gemini: A family of highly capable multimodal models, 2023.
  • [2] Aidan Clark, Alex Paino, Jacob Menick Liam Fedus, Luke Metz Clemens Winter, Lia Guy Sam Schoenholz, Daniel Levy Nitish Keskar Alex Carney, Ian Sohl, Qiming Yuan Reimar Leike Arka Dhar, Brydon Eastman, Mia Glaese Ben Sokolowsky Andrew Kondrich Felipe Petroski Such Henrique Ponde de Oliveira Pinto Jiayi Weng, Randall Lin, Youlong Cheng Nick Ryder Lauren Itow Barret Zoph, John Schulman Mianna Chen Adam Lerer, Adam P. Goucher, Adam Perelman, Akila Welihinda, Alec Radford, Alex Borzunov, Alex Carney, Alex Chow, Alex Renzin, Alex Tachard Passos, Alexi Christakis, Ali Kamali, Allison Moyer, Allison Tam, Amin Tootoonchian, Ananya Kumar, Andrej Karpathy, Andrey Mishchenko, Andrew Cann, Andrew Kondrich, Andrew Tulloch, Angela Jiang, Antoine Pelisse, Anuj Gosalia, Avi Nayak, Avital Oliver, Behrooz Ghorbani, Ben Leimberger, Ben Wang, Blake Samic, Brian Guarraci, Camillo Lugaresi, Chak Li, Charlotte Barette, Chelsea Voss, Chong Zhang, Chris Beaumont, Chris Hallacy, Chris Koch, Christian Gibson, Christopher Hesse, Colin Wei, Daniel Kappler, Daniel Levin, Daniel Levy, David Farhi, David Mely, David Sasaki, Dimitris Tsipras, Doug Li, Duc Phong Nguyen, Duncan Findlay, Edmund Wong, Ehsan Asdar, Elizabeth Proehl, Elizabeth Yang, Eric Peterson, Eric Sigler, Eugene Brevdo, Farzad Khorasani, Francis Zhang, Gene Oden, Geoff Salmon, Hadi Salman, Haiming Bao, Heather Schmidt, Hongyu Ren, Hyung Won Chung, Ian Kivlichan, Ian O’Connell, Ian Osband, Ilya Kostrikov, Ingmar Kanitscheider, Jacob Coxon, James Crooks, James Lennon, Jason Teplitz, Jason Wei, Jason Wolfe, Jay Chen, Jeff Harris, Jiayi Weng, Jie Tang, Joanne Jang, Jonathan Ward, Jonathan McKay, Jong Wook Kim, Josh Gross, Josh Kaplan, Joy Jiao, Joyce Lee, Juntang Zhuang, Kai Fricke, Kavin Karthik, Kenny Hsu, Kiel Howe, Kyle Luther, Larry Kai, Lauren Itow, Leo Chen, Lia Guy, Lien Mamitsuka, Lilian Weng, Long Ouyang, Louis Feuvrier, Lukas Kondraciuk, Lukasz Kaiser, Lyric Doshi, Mada Aflak, Maddie Simens, Madeleine Thompson, Marat 
Dukhan, Marvin Zhang, Mateusz Litwin, Max Johnson, Mayank Gupta, Mia Glaese, Michael Janner, Michael Petrov, Michael Wu, Michelle Fradin, Michelle Pokrass, Miguel Oom Temudo de Castro, Mikhail Pavlov, Minal Khan, Mo Bavarian, Natalia Gimelshein, Natalie Staudacher, Nick Stathas, Nik Tezak, Nithanth Kudige, Noel Bundick, Ofir Nachum, Oleg Boiko, Oleg Murk, Olivier Godement, Owen Campbell-Moore, Philip Pronin, Philippe Tillet, Rachel Lim, Rajan Troll, Rapha gontijo lopes, Raul Puri, Reah Miyara, Reimar Leike, Renaud Gaubert, Reza Zamani, Rob Honsby, Rohit Ramchandani, Rory Carmichael, Ruslan Nigmatullin, Ryan Cheu, Scott Gray, Sean Grove, Sean Metzger, Shantanu Jain, Shengjia Zhao, Sherwin Wu, Shuaiqi (Tony) Xia, Sonia Phene, Spencer Papay, Steve Coffey, Steve Lee, Stewart Hall, Suchir Balaji, Tal Broda, Tal Stramer, Tarun Gogineni, Ted Sanders, Thomas Cunninghman, Thomas Dimson, Thomas Raoux, Tianhao Zheng, Tina Kim, Todd Underwood, Tristan Heywood, Valerie Qi, Vinnie Monaco, Vlad Fomenko, Weiyi Zheng, Wenda Zhou, Wojciech Zaremba, Yash Patil, Yilei, Qian, Yongjik Kim, Youlong Cheng, Yuchen He, Yuchen Zhang, Yujia Jin, Yunxing Dai, Yury Malkov Prafulla Dhariwal Alexander Kirillov Alexis Conneau, James Betker Alexander Kirillov, James Betker, Yu Zhang Jamie Kiros, Rowan Zellers, Jiahui Yu James Betker, Alex Nichol, Heewoo Jun, Casey Chu, Gabriel Goh Gabriel Goh, Ishaan Gulrajani Ian Sohl, Qiming Yuan Alex Paino, Alex Nichol Arka Dhar, Mia Glaese Heewoo Jun, Alexis Conneau, Li Jing, Jamie Kiros Allan Jabri, James Betker Alexis Conneau, Tao Xu, Yu Zhang Tomer Kaftan Bogo Giertler, Tomer Kaftan Nacho Soto, Rocky Smith, Wayne Chang Alexander Kirillov, Luke Metz, Vlad Fomenko Jordan Sitkin Christine McLeavey Mark Chen Mianna Chen Aditya Ramesh, AJ Ostrow, Allan Jabri, Benjamin Zweig, Bogo Giertler, Bowen Cheng, Brandon Walkin, Brendan Quinn, Christine McLeavey, Constantin Koumouzelis, Edede Oiwoh, Felipe Petroski Such, Huiwen Chang, Ian Silber, Ishaan Gulrajani, David Carr, 
Haitang Hu, Jamie Kiros, Jenia Varavva, Jiahui Yu, Ji Lin, Johannes Heidecke, Liang Zhou, Madelaine Boyd, Mark Hudnall, Mengchao Zhong, Nick Turley, Noah Deutsch, Ola Okelola, Peter Bak, Peter Bakkum, Saachi Jain, Shirong Wu, Wesam Manassra, Yu Zhang Andrew Tulloch Amin Tootoochian, Miguel Castro Nik Tezak, Christopher Hesse Ian O’Connell Jason Teplitz Phil Tillet Reza Zamani Michael Petrov Rory Carmichael Christian Gibson Johannes Heidecke Saachi Jain Tejal Patwardhan Troy Peterson Alex Beutel, Andrea Vallone, Carroll Wainwright, Claudia Fischer, Evan Mays, Filippo Raso, Haoyu Wang, Jason Phang, Jieqi Yu, Joel Parish, Joshua Achiam, Jonathan Uesato, Joost Huizinga, Josh Snyder, Justyn Harriman, Katy Shi, Keren Gu-Lemberg, Kevin Liu, Lama Ahmad, Meghan Shah, Mehmet Yatbaz, Michael Lampe, Miles Wang, Molly Lin, Natalie Cone, Neil Chowdhury, Olivia Watkins, Peter Dolan, Rachel Dias, Rahul Arora, Sam Toizer, Sandhini Agarwal, Todor Markov Mianna Chen Aleksander Mądry, Barret Zoph, Bob McGrew, Brad Lightcap, Greg Brockman, Hannah Wong, Ilya Sutskever, Jakub Pachocki, Jan Leike, Jason Kwon, John Schulman, Jonathan Lachman, Krithika Muthukumar, Mark Chen, Miles Brundage, Mira Murati, Nick Ryder, Peter Deng, Peter Welinder, Sam Altman, Srinivas Narayanan, Tal Broda Alan Hayes, Ashley Pantuliano, Bright Kellogg, Fred von Lohmann, Heather Whitney, Tom Rubin Aidan Clark, Alex Baker-Whitcomb, Alexander Kirillov, Ben Sokolowsky, Cheng Lu, Coley Czarnecki, Eric Antonow, Eric Wallace, Gabriel Goh, Hendrik Kirchner, Jacob Menick, Jordan Sitkin, Kendra Rimbach, Leher Pathak, Liam Fedus, Lindsay McCallum, Maya Shetty, Mianna Chen, Nacho Soto, Natalie Summers, Niko Felix, Prafulla Dhariwal, Tejal Patwardhan, Tomer Kaftan, Tom Stasi, Troy Peterson, Veit Moeller, Wayne Chang, Yu Zhang, Yuchen He Alex Baker-Whitcomb, Bobby Spero, Chad Nelson, Colin Jarvis, Jessica Shieh, Joe Beutler, Joe Landers, Ricky Wang, Rohan Sahai, Romain Huet, Scott Ethersmith, Toki Sherbakov, Wayne Chang Alex 
Baker-Whitcomb, Andrew Galu, Angela Baek, Dev Valladares, Lindsey Held, Roy Chen, Ruby Chen, Taya Christianson, Thomas Degry, Veit Moeller Alex Baker-Whitcomb, Veit Moeller Andrew Codispoti, Brian Hsu, Channing Conger, Ikai Lan, Jos Kraaijeveld, Kai Hayashi, Kenny Nguyen, Lu Zhang, Natan LaFontaine, Pavel Belov, Peng Su, Vishal Kuo, Will Sheu Kevin Button, Paul McMillan, Shino Jomoto, Thomas Shadwell, Vinnie Monaco Andrew Braunstein, Denny Jin, Eric Kramer, Lauren Workman, Rob Donnelly, Romain Huet, and Shamez Hermani. Hello GPT-4o, 2024.
  • [3] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. CoRR, abs/2009.03300, 2020.
  • [4] Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. AGIEval: A human-centric benchmark for evaluating foundation models, 2023.
  • [5] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: visual question answering. CoRR, abs/1505.00468, 2015.
  • [6] Pan Lu, Swaroop Mishra, Tony Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering, 2022.
  • [7] Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants, 2023.
  • [8] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, Cong Wei, Botao Yu, Ruibin Yuan, Renliang Sun, Ming Yin, Boyuan Zheng, Zhenzhu Yang, Yibo Liu, Wenhao Huang, Huan Sun, Yu Su, and Wenhu Chen. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI, 2023.
  • [9] Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. MM-Vet: Evaluating large multimodal models for integrated capabilities, 2023.
  • [10] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. MMBench: Is your multi-modal model an all-around player?, 2023.
  • [11] Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, Jing Shao, and Wanli Ouyang. LAMM: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark, 2023.
  • [12] Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Barun Patra, Qiang Liu, Kriti Aggarwal, Zewen Chi, Johan Bjorck, Vishrav Chaudhary, Subhojit Som, Xia Song, and Furu Wei. Language is not all you need: Aligning perception with language models, 2023.
  • [13] Jingmiao Zhao and Carolyn Jane Anderson. Solving and generating NPR Sunday puzzles with large language models, 2023.
  • [14] Timothy J. Boerner, Stephen Deems, Thomas R. Furlani, Shelley L. Knuth, and John Towns. ACCESS: Advancing Innovation: NSF’s Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support. In Practice and Experience in Advanced Research Computing (PEARC ’23), page 4, New York, NY, USA, 2023. ACM. July 23–27, 2023, Portland, OR, USA.
  • [15] OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mo Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Berner, Lenny Bogdonoff, Oleg Boiko, Madelaine Boyd, Anna-Luisa Brakman, Greg Brockman, Tim Brooks, Miles Brundage, Kevin Button, Trevor Cai, Rosie Campbell, Andrew Cann, Brittany Carey, Chelsea Carlson, Rory Carmichael, Brooke Chan, Che Chang, Fotis Chantzis, Derek Chen, Sully Chen, Ruby Chen, Jason Chen, Mark Chen, Ben Chess, Chester Cho, Casey Chu, Hyung Won Chung, Dave Cummings, Jeremiah Currier, Yunxing Dai, Cory Decareaux, Thomas Degry, Noah Deutsch, Damien Deville, Arka Dhar, David Dohan, Steve Dowling, Sheila Dunning, Adrien Ecoffet, Atty Eleti, Tyna Eloundou, David Farhi, Liam Fedus, Niko Felix, Simón Posada Fishman, Juston Forte, Isabella Fulford, Leo Gao, Elie Georges, Christian Gibson, Vik Goel, Tarun Gogineni, Gabriel Goh, Rapha Gontijo-Lopes, Jonathan Gordon, Morgan Grafstein, Scott Gray, Ryan Greene, Joshua Gross, Shixiang Shane Gu, Yufei Guo, Chris Hallacy, Jesse Han, Jeff Harris, Yuchen He, Mike Heaton, Johannes Heidecke, Chris Hesse, Alan Hickey, Wade Hickey, Peter Hoeschele, Brandon Houghton, Kenny Hsu, Shengli Hu, Xin Hu, Joost Huizinga, Shantanu Jain, Shawn Jain, Joanne Jang, Angela Jiang, Roger Jiang, Haozhun Jin, Denny Jin, Shino Jomoto, Billie Jonn, Heewoo Jun, Tomer Kaftan, Łukasz Kaiser, Ali Kamali, Ingmar Kanitscheider, Nitish Shirish Keskar, Tabarak Khan, Logan Kilpatrick, Jong Wook Kim, Christina Kim, Yongjik Kim, Hendrik Kirchner, Jamie Kiros, Matt Knight, Daniel Kokotajlo, Łukasz Kondraciuk, Andrew Kondrich, Aris Konstantinidis, Kyle Kosic, Gretchen Krueger, Vishal Kuo, Michael Lampe, Ikai Lan, Teddy Lee, Jan Leike, Jade Leung, Daniel Levy, Chak Ming Li, Rachel Lim, Molly Lin, Stephanie 
Lin, Mateusz Litwin, Theresa Lopez, Ryan Lowe, Patricia Lue, Anna Makanju, Kim Malfacini, Sam Manning, Todor Markov, Yaniv Markovski, Bianca Martin, Katie Mayer, Andrew Mayne, Bob McGrew, Scott Mayer McKinney, Christine McLeavey, Paul McMillan, Jake McNeil, David Medina, Aalok Mehta, Jacob Menick, Luke Metz, Andrey Mishchenko, Pamela Mishkin, Vinnie Monaco, Evan Morikawa, Daniel Mossing, Tong Mu, Mira Murati, Oleg Murk, David Mély, Ashvin Nair, Reiichiro Nakano, Rajeev Nayak, Arvind Neelakantan, Richard Ngo, Hyeonwoo Noh, Long Ouyang, Cullen O’Keefe, Jakub Pachocki, Alex Paino, Joe Palermo, Ashley Pantuliano, Giambattista Parascandolo, Joel Parish, Emy Parparita, Alex Passos, Mikhail Pavlov, Andrew Peng, Adam Perelman, Filipe de Avila Belbute Peres, Michael Petrov, Henrique Ponde de Oliveira Pinto, Michael, Pokorny, Michelle Pokrass, Vitchyr Pong, Tolly Powell, Alethea Power, Boris Power, Elizabeth Proehl, Raul Puri, Alec Radford, Jack Rae, Aditya Ramesh, Cameron Raymond, Francis Real, Kendra Rimbach, Carl Ross, Bob Rotsted, Henri Roussez, Nick Ryder, Mario Saltarelli, Ted Sanders, Shibani Santurkar, Girish Sastry, Heather Schmidt, David Schnurr, John Schulman, Daniel Selsam, Kyla Sheppard, Toki Sherbakov, Jessica Shieh, Sarah Shoker, Pranav Shyam, Szymon Sidor, Eric Sigler, Maddie Simens, Jordan Sitkin, Katarina Slama, Ian Sohl, Benjamin Sokolowsky, Yang Song, Natalie Staudacher, Felipe Petroski Such, Natalie Summers, Ilya Sutskever, Jie Tang, Nikolas Tezak, Madeleine Thompson, Phil Tillet, Amin Tootoonchian, Elizabeth Tseng, Preston Tuggle, Nick Turley, Jerry Tworek, Juan Felipe Cerón Uribe, Andrea Vallone, Arun Vijayvergiya, Chelsea Voss, Carroll Wainwright, Justin Jay Wang, Alvin Wang, Ben Wang, Jonathan Ward, Jason Wei, CJ Weinmann, Akila Welihinda, Peter Welinder, Jiayi Weng, Lilian Weng, Matt Wiethoff, Dave Willner, Clemens Winter, Samuel Wolrich, Hannah Wong, Lauren Workman, Sherwin Wu, Jeff Wu, Michael Wu, Kai Xiao, Tao Xu, Sarah Yoo, Kevin Yu, Qiming 
Yuan, Wojciech Zaremba, Rowan Zellers, Chong Zhang, Marvin Zhang, Shengjia Zhao, Tianhao Zheng, Juntang Zhuang, William Zhuk, and Barret Zoph. GPT-4 technical report, 2023.
  • [16] Anthropic. The Claude 3 model family: Opus, Sonnet, Haiku, 2024.
  • [17] Reka Team, Aitor Ormazabal, Che Zheng, Cyprien de Masson d’Autume, Dani Yogatama, Deyu Fu, Donovan Ong, Eric Chen, Eugenie Lamprecht, Hai Pham, Isaac Ong, Kaloyan Aleksiev, Lei Li, Matthew Henderson, Max Bain, Mikel Artetxe, Nishant Relan, Piotr Padlewski, Qi Liu, Ren Chen, Samuel Phua, Yazheng Yang, Yi Tay, Yuqi Wang, Zhongkai Zhu, and Zhihui Xie. Reka Core, Flash, and Edge: A series of powerful multimodal language models, 2024.
  • [18] Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, January 2024.
  • [19] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023.
  • [20] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023.
  • [21] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
  • [22] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. CogVLM: Visual expert for pretrained language models, 2023.
  • [23] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
  • [24] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. InstructBLIP: Towards general-purpose vision-language models with instruction tuning, 2023.
  • [25] Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. Introducing our multimodal models, 2023.
  • [26] Glenn W Brier. Verification of forecasts expressed in terms of probability. Monthly weather review, 78(1):1–3, 1950.
  • [27] Antonia Creswell and Murray Shanahan. Faithful reasoning using large language models, 2022.
  • [28] Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal LLMs, 2023.

Appendix

5.1 Additional examples of faithlessness

User input [Uncaptioned image]
GPT-4o reply To solve this rebus puzzle, let’s break it down step by step: 1. The image is of an actress who is quite well-known, but for the purpose of the puzzle, we are focusing on extracting something about her name. 2. The clue given is “$-6.35\,\text{kg}$.” We need to figure out what this means in conjunction with the image. First, let’s figure out the name of the actress in the image. We know we can’t directly state it, but we know who she is. Her name can be interpreted as: A first name that sounds like “Sharon” A last name that sounds like “Stone” Combining these, we identify the name as “Sharon Stone.” Next, let’s decode the clue: “6.35 kg” is equivalent to approximately “14 pounds” (since $1\,\text{kg} \approx 2.2\,\text{pounds}$ and $6.35\,\text{kg} \times 2.2 \approx 14\,\text{pounds}$). So the rebus puzzle is telling us to take “Sharon Stone” and subtract (or remove) “14 pounds” from it. However, “14 pounds” phonetically sounds like and can be written as “fourteen pounds.” If we interpret it this way, we notice that: “Sharon Stone” minus something that sounds like “14 pounds” (which also phonetically is similar to “Fourteen pounds”) leads us to think of homophones or play on words. Finally, the phrase “Stone - 14 pounds” might give us: “Stone” minus “fourteen” (consider synonyms and homonyms, we find that a “STONE” could be around 14 pounds in old measurement units) So we remove “FOURTEEN” or a part of it from a part of “STONE.” Realistically simplifying it: If we think in terms of wordplay in Massachusetts town names.. The best interpretation/conclusion from available Massachusetts town names is: The final result could be “Stoneham”. The idea is that by removing about 14 pounds (not saying it phonetically as much as semi-literary subtraction). Thus, the name “Sharon Stone” subtracted cleverly leaves us with “STONE-HAM”. Thus matching a Massachusetts town. So, the puzzle likely referring rebus deduction leads to the town name: {{{Stoneham}}}
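The unit arithmetic in the reply above can be checked directly. The following is a minimal sketch added for illustration (the helper names are hypothetical and not part of the benchmark). It uses the exact international definitions of the pound (0.45359237 kg) and the stone (14 lb), and suggests that the clue is most naturally read as "minus one stone", a reading the reply mentions but never applies to the name.

```python
# Exact conversion constants (international avoirdupois definitions).
KG_PER_POUND = 0.45359237
POUNDS_PER_STONE = 14


def kg_to_pounds(kg: float) -> float:
    """Convert kilograms to pounds."""
    return kg / KG_PER_POUND


def kg_to_stone(kg: float) -> float:
    """Convert kilograms to stone (1 stone = 14 lb)."""
    return kg_to_pounds(kg) / POUNDS_PER_STONE


pounds = kg_to_pounds(6.35)
stone = kg_to_stone(6.35)
# 6.35 kg is 14 lb to two decimals, i.e. one stone to three decimals.
print(f"6.35 kg = {pounds:.2f} lb = {stone:.3f} st")
```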
User input [Uncaptioned image]
GPT-4o reply To solve this rebus puzzle, we need to identify key elements in the image and how they might combine to form the name of a town in Massachusetts. 1. The image is of a historical figure. I cannot identify him directly, but for the sake of the rebus, I’ll assume he might be Peter the Great (a famous historical figure). 2. The second element appears to be a picture of a ham, specifically a spiral-cut ham. Now, let’s combine these elements: Peter (assumed name of the historical figure based on the typical painting style and context) Ham Putting these together, we get “Peter” + “Ham” = “Peterham.” Actually, there’s a town in Massachusetts named “Peabody.” So, my earlier assumption might be slightly adjusted considering the play on pronunciation and historical references. Let’s not overanalyze; if we just focus on the “Peter” part phonetically: “Peter” (could sound like “P-ee-t”) “Ham” (quite straightforward) This combination leads more naturally to the town: {{{Peabody}}}.
User input [Uncaptioned image]
GPT-4o reply Let’s break down the images to solve the rebus puzzle and find a town in Massachusetts. 1. The left image depicts a historical scene that looks like a witch trial — most likely the famous Salem witch trials. 2. The right image shows someone who I cannot identify due to policies, but I recognize that the surname “Mueller” may be relevant. By combining these clues: “Salem” from the first image “Mueller” from the second image, focusing on the sound “Mueller” We get “Salem” + “Mueller” By focusing on the phonetic similarity, we can address: Salem The sound “er” The answer, considering the category of towns in Massachusetts, appears to be: {{{Salem}}}. Note: Salem is famous for its 1692 witch trials and is indeed a town in Massachusetts.

5.2 Examples of various failure modes

5.2.1 Correct answer, incorrect reasoning

Here, GPT-4o arrives at the correct answer, Harvard, mainly on the basis of the first image, even though the intended derivation is $(\text{HARVARD}-\text{ARD})(\text{BILLBOARD}-\text{BILL}-\text{BO})=\text{HARVARD}$: removing ARD from HARVARD leaves HARV, removing BILL and BO from BILLBOARD leaves ARD, and concatenating the two pieces gives HARVARD.
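The intended derivation is plain string manipulation and can be verified mechanically. The sketch below is illustrative only: `subtract` and `solve` are hypothetical helpers, and it assumes that "subtraction" means deleting the first occurrence of a substring and that the bracketed groups are concatenated left to right.

```python
def subtract(word: str, part: str) -> str:
    """Remove the first occurrence of `part` from `word`; fail loudly if absent."""
    if part not in word:
        raise ValueError(f"{part!r} does not occur in {word!r}")
    return word.replace(part, "", 1)


def solve(terms):
    """Each term is (base, [parts to remove]); the leftovers are concatenated in order."""
    pieces = []
    for base, removals in terms:
        for part in removals:
            base = subtract(base, part)
        pieces.append(base)
    return "".join(pieces)


# (HARVARD - ARD)(BILLBOARD - BILL - BO)
answer = solve([("HARVARD", ["ARD"]), ("BILLBOARD", ["BILL", "BO"])])
print(answer)  # HARVARD
```

The two groups reduce to HARV and ARD respectively, so a solver has to use every image, not just the first one, to justify the answer.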

User input [Uncaptioned image]
GPT-4o reply To solve this rebus puzzle, let’s break it down piece by piece: 1. The first image is the Harvard logo. 2. The second image is the ARD German TV network logo with the number 1. 3. The third image appears to be a billboard. 4. The fourth image is a photograph of Bill Clinton. 5. The final image is a photograph of a thinking person. Now let’s decipher the meaning of each image: Harvard: “Harvard” likely points to “Harvard.” ARD and 1: Pronounce “ARD1” together, it might sound like “Arden.” Billboard: Commonly referred to as a “billboard.” Bill Clinton: This can be simplified to “Bill.” Thinking person: A thinking person often represents “thought.” When we put together the parts: Harvard (Harvard) ARD1 (Ard1 / Arden) Billboard (Billboard → Bill) Bill Clinton (Bill) Thinking person (Thought) Combining these gives us “Harvard + ARD1 + Bill + Thought.” The puzzle seems to be suggesting the name of a town which incorporates these sounds. The correct town name in Massachusetts is “Harvard.” Therefore, the final answer is: {{{Harvard}}}
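Every reply in this appendix wraps its final answer in triple braces. A minimal sketch of answer-only scoring is given below to show why the reply above counts as correct despite its flawed reasoning. The function names, the regular expression, and the normalization are assumptions made for illustration, not the evaluation code used for the benchmark.

```python
import re
from typing import Optional

# Matches the {{{answer}}} convention seen in the replies.
ANSWER_PATTERN = re.compile(r"\{\{\{(.*?)\}\}\}", re.DOTALL)


def extract_answer(reply: str) -> Optional[str]:
    """Return the last triple-braced answer in the reply, or None if there is none."""
    matches = ANSWER_PATTERN.findall(reply)
    return matches[-1].strip() if matches else None


def normalize(text: str) -> str:
    """Lowercase and drop everything except letters and digits."""
    return re.sub(r"[^a-z0-9]", "", text.lower())


def is_correct(reply: str, gold: str) -> bool:
    """Answer-only check: the reasoning in the reply is ignored entirely."""
    guess = extract_answer(reply)
    return guess is not None and normalize(guess) == normalize(gold)


reply = "Therefore, the final answer is: {{{Harvard}}}"
print(extract_answer(reply))         # Harvard
print(is_correct(reply, "HARVARD"))  # True
```

Because a check of this kind looks only at the braced string, it cannot distinguish a faithful derivation from a lucky guess, which is why the reasoning has to be inspected separately.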

5.2.2 Incorrect reasoning, correct recognition

Here, GPT-4o correctly recognizes all the parts of the image, but fails to put them together into the correct answer.

User input [Uncaptioned image]
GPT-4o reply To decode the rebus puzzle, let’s break down the different elements into recognizable parts and expressions. 1. On the left, we have a pot of soup or stew, possibly indicative of the word “soup” or “stew.” 2. In the middle, the vertical list appears to be the numbers 1 (yi), 3 (san), 4 (si), and 5 (wu) in Chinese. Notably, the number 2 (er) is missing, which might indicate the word “two” or “second.” 3. On the right, we have the male gender symbol. Putting these parts together: “Soup” or “Stew”; “Two” or the second; Male symbol. A famous movie title that fits this description is “Two Mules for Sister Sara.” However, considering the elements “pot of soup/stew” and “male,” another famous movie comes to mind which fits better: “Men” and “Pot.” The rebus puzzle likely resolves to the famous movie “Men” + “Pot,” ultimately rendering the title: {{{Harry Potter}}}.

5.2.3 Incorrect recognition

Here, GPT-4o fails to recognize the elements labeled in the periodic table (Br and K).

User input [Uncaptioned image]
GPT-4o reply To solve this rebus puzzle, let’s break down the images and their possible meanings. 1. First Image: Periodic Table with Highlighted Elements The first part of the rebus shows the periodic table with two highlighted elements. Typically, these elements are the 8th element (Oxygen, symbol “O”) and the 18th element (Argon, symbol “Ar”), giving “O” and “Ar.” 2. Second Image: Symbol “@” This is the “@” symbol. 3. Third Image: A Kiss The image of a kiss sounds similar to the word “Kiss,” suggesting the word “Kiss.” 4. Fourth Image: Periodic Table with Highlighted Elements This is another periodic table with elements possibly highlighted again. Let’s assume they are the same as the first one: “O” and “Ar.” 5. Fifth Image: Lava This suggests the word “Lava.” Now, putting these together: “O” + “@” + “Kiss” + “O” + “Lava.” When you say it out loud, you get “Osaka Volcano,” but given the placement of “@” which usually denotes “at,” let’s refine it: “O” (O for Oxygen) + “@” + “Kiss” (sounds like “K”) + “ ” (space) + “O” (O for Oxygen) + “Lava.” This phonetic combination hints toward the word “Osaka.” The answer to the rebus puzzle, considering the category is Cities, is: {{{Osaka}}}.