Creative Problem Solving in Large Language and Vision Models – What Would it Take?

Lakshmi Nair Georgia Institute of Technology
Atlanta, GA, USA
   Evana Gizzi Tufts University
Medford, MA, USA
   Jivko Sinapov Tufts University
Medford, MA, USA
Abstract

In this paper, we discuss approaches for integrating Computational Creativity (CC) with research in large language and vision models (LLVMs) to address a key limitation of these models, i.e., creative problem solving. We present preliminary experiments showing how CC principles can be applied to address this limitation through augmented prompting. With this work, we hope to foster discussions of Computational Creativity in the context of ML algorithms for creative problem solving in LLVMs. Our code is at: https://github.com/lnairGT/creative-problem-solving-LLMs

I Introduction

Creativity is “…the ability to come up with an idea which, relative to the pre-existing domain-space in one’s mind, one could not have had before. Whether any other person (or system) has already come up with it on an earlier occasion is irrelevant.” [1], p.216. For artificial agents, Computational Creativity (CC) is a multi-disciplinary field (spanning Philosophy, Psychology, Neuroscience, and Computer Science) that seeks to develop computational methods capable of generating creative outcomes reminiscent of creative processes in humans [2]. Within CC, creative problem solving is a sub-area that requires an agent to discover – from its perspective – novel and previously unseen ways to accomplish a task. For example, in the absence of a ladle to scoop ingredients, an agent might creatively choose to substitute a bowl in place of the ladle. In this sense, creative problem solving encompasses creativity that is specifically task-oriented, as opposed to the generation of creative artefacts such as music or images.

While recent state-of-the-art large language and vision models (LLVMs; we use LLVM to denote the umbrella of large transformer models for natural language and vision tasks, including both LLMs and Vision-LMs or VLMs) have demonstrated competency in artistic endeavours [3, 4], creative problem solving continues to be a shortcoming of these models. For instance, in [5], the authors point out that “discontinuous tasks” that require a certain “Eureka” idea, i.e., creative problem solving, are currently a limitation of models like GPT-4. Similar observations have been made in follow-up work showing that state-of-the-art LLMs inherently possess poor creative problem solving capabilities compared to humans [6]. Given this limitation, ongoing research in Machine Learning should seek to address the gap between LLVMs and creative problem solving, to further enhance the intelligent capabilities of these models. As defined in prior work, “Intelligence is the ability to work and adapt to the environment with insufficient knowledge and resources.” [7], p.10. As demonstrated in hallmark examples of human ingenuity, like the makeshift $CO_{2}$ filter built onboard Apollo 13 [8], or the makeshift medical devices used to offset equipment shortages during COVID-19 [9], creative problem solving is especially important when dealing with resource-critical scenarios. However, such an exceptional degree of creative problem solving remains beyond the scope of LLVMs today.

Figure 1: Computational Creativity can help address a gap in the intelligence of present-day LLVMs, elevating their ingenuity through creative problem solving.

We believe that a discussion of Computational Creativity is essential to addressing this limitation. Machine Learning and Computational Creativity should be strongly integrated in research to enable effective creative problem solving in LLVMs and push the frontiers of their ingenuity. While [5] and [6] accurately point out creative problem solving as a shortcoming of state-of-the-art LLVMs, they do not expand their work to include the broader scope of Computational Creativity nor discuss how its principles can be applied to potentially alleviate this problem.

This paper seeks to encourage the ML community to think about how LLVMs can be augmented with creative problem solving skills through a deeper discussion of Computational Creativity. To emphasize the applicability of principles from CC for creative problem solving in LLVMs, we discuss the seminal work of Margaret A. Boden from CC literature that introduces three forms of creativity, namely, “exploratory”, “combinational”, and “transformational” [1]. While prior work has discussed the extension of Boden’s forms of creativity to creative problem solving in AI [2], their work does not include recent advances in LLVMs nor how Boden’s principles can be extended to specific approaches for these models.

Ongoing discussions by leading ML experts like Dr. Shane Legg, co-founder of DeepMind, have suggested that “search” could help such models perform creative problem solving, quote, “… these foundational models are world models of a kind, and to do really creative problem solving, you need to start searching” [10]. There has also been speculation that OpenAI’s $Q^{*}$ search (described as a “significant breakthrough” in popular media) could be targeting a similar approach [11, 12]. Interestingly, we note that “search” as described here can be linked to Boden’s proposed “exploratory” approach (Section III-A1). However, we posit that “combinational” and “transformational” modes (Section III) should also be equally emphasized for achieving creative problem solving in LLVMs.

We begin by discussing the relation between creative problem solving and task planning, providing an overview of how LLVMs are used in task planning. We discuss the application of Boden’s three forms of creativity to enable creative problem solving in LLVMs. We also present preliminary experiments demonstrating the extension of transformational creativity to creative problem solving in LLVMs.

II Overview: LLVMs in typical task planning

Creative problem solving is described as the process through which agents discover novel ways of accomplishing a task goal for a previously unsolvable task that can be computationally achieved through planning, learning, or hybrid methods [2]. In this section, we discuss how typical task planning is achieved with LLVMs. We divide the discussion into three subsections based on the level of task planning abstraction where LLVMs are applied: a) high-level task planning, b) low-level task planning, and c) hybrid task planning. While not exhaustive, our review is intended to provide a general insight into how LLVMs are used for task planning, to identify entry points for introducing creative problem solving capabilities.

II-A LLVMs for high-level task planning

Approaches for high-level task planning often involve using LLVMs to identify high-level goals for accomplishing a task. Such approaches typically take a user input specifying the task, and generate high-level task plans for accomplishing it. They often use LLMs as a form of “knowledge base”, to extract actionable task plans from the models via appropriate prompting [13], further iterating over the task plan with repeated calls to the LLM as needed [14].

In the context of Reinforcement Learning (RL), prior work has focused on using LLMs to suggest high-level goals for an RL agent [15]. Dubbed as ELLMs (Exploring with LLMs), an RL agent provides its current state to an LLM via a prompt, and receives a goal suggestion from the LLM that is then used to shape the reward and the agent exploration. Further work has extended this approach to incorporate the use of experience memory [16]. Existing approaches have also used LLMs to generate directed acyclic graphs composed of sub-goal states to aid the exploration of an RL agent [17].

II-B LLVMs for low-level task planning

Approaches for low-level task planning involve using LLMs to generate low-level code for performing a task. In contrast to high-level planning, where high-level goals and sub-goals are generated, these approaches use LLMs to directly generate low-level execution code via appropriate API calls [18]. Other approaches have also investigated the capacity of LLMs to generate task plans via a low-level planning language such as PDDL [19], including iterating over the generated plan descriptions in case of errors [20]. In terms of low-level planning using vision-based VLMs, prior work has introduced an approach that uses a diffusion model to generate robot trajectories conditioned on language and the current visual state of the robot [21].

II-C Hybrid high and low-level planning with LLVMs

Hybrid approaches use LLVMs both for high-level goal generation as well as low-level planning. For instance, in [22], user inputs are passed as LLM prompts to generate high-level plans. The high-level plans are then converted to low-level plans for robot execution via LLMs specialized for coding. Other approaches have used a high-level LLM planner, a VLM perceiver, and a low-level LLM planner for re-planning with both visual and language inputs [23].

II-D Summary

Given this overview, we see that LLVMs both at the high-level and low-level, can be modified to incorporate creative problem solving into task planning. For instance, the high-level task plans generated can encompass a novel substitution for a missing object, whereas the low-level task plan can generate an appropriate trajectory for creatively using the object. While the above approaches could, in principle, be studied within the framework of creative problem solving, that is not usually how the problem is formulated; there is a lack of paradigms for studying creative problem solving beyond just, “do you solve the problem or not?”. Creative problem solving needs a fundamental rethinking of the typical problem formulations and approaches in ML. The next section is aimed at ways in which ML approaches in LLVMs can be reformulated from the perspective of CC.

III Augmenting LLVM embedding spaces for creative problem solving

In this section, we discuss how principles from CC can be extended to LLVMs for creative problem solving. We begin with Boden’s definition of “conceptual spaces” as “[conceptual space] is the generative system that underlies the domain and defines a certain range of possibilities: chess moves, or molecular structures, or jazz melodies” [24], p.18 and “… in short, any reasonably disciplined way of thinking” [1], p.214. By this definition, the embedding space of an LLVM describes its conceptual space or “its way of thinking”. Some evidence for this also comes from existing work that introduces an approach for enabling LLMs to interpret continuous embedding spaces via natural language. Given an embedding vector representing an interpolation of different concepts, the model is able to interpret a text prompt in the context of the supplied embedding [25]. The embedding thus determines the model’s way of thinking. Hence, a discussion of enabling creative problem solving in LLVMs should target their embedding space. To this end, we explore two questions: a) how can LLVM embedding spaces be augmented to achieve creative problem solving, and b) what information should they be augmented with? Aligning with our original position, we show that CC literature can offer insights into these questions.

III-A How can LLVM embedding spaces be augmented?

In this section, we draw parallels between Boden’s three forms of creativity and existing approaches in LLVMs. We further elaborate on how the three forms of creativity may enhance the potential of LLVMs to perform creative problem solving. We note that the ML approaches discussed in this section do not specifically perform creative problem solving. However, we discuss how they could potentially be extended to do so, by leveraging references from the CC literature.

III-A1 Exploratory Creativity

Exploratory approaches involve exploration within the conceptual or, equivalently, the embedding space of the model, and most closely relate to “search”. Note that the term “exploration” here differs from its usage in RL, instead referring to exploration through the model’s embedding space. Several existing approaches in the ML literature involve searching the output space of LLMs with the goal of improving the performance of these models. The “tree-of-thought” model generates a “tree” of next possible LLM outputs, and searches through the states via breadth-first or depth-first search to reach the desired goal state, often guided by heuristics [26]. Numerous other approaches have built upon a similar strategy, such as using Monte-Carlo Tree Search (MCTS) [27, 28], beam search [29] or integrating pruning to remove sub-par candidates [30].
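To make the notion of searching over candidate outputs concrete, the following is a minimal sketch of a breadth-first search with pruning. It is not the tree-of-thought implementation from [26]: the functions `propose`, `score`, and `is_goal` are hypothetical stand-ins for LLM calls (candidate generation, heuristic evaluation, and goal checking), and the toy spelling task exists only so that the code runs without a model.

```python
from typing import Callable, List, Optional


def beam_bfs(
    root: str,
    propose: Callable[[str], List[str]],
    score: Callable[[str], float],
    is_goal: Callable[[str], bool],
    beam_width: int = 3,
    max_depth: int = 5,
) -> Optional[str]:
    """Breadth-first search over partial outputs with heuristic pruning.

    At each level, every state in the frontier is expanded, and only the
    `beam_width` highest-scoring children are kept for the next level.
    """
    frontier = [root]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in propose(state)]
        for candidate in candidates:
            if is_goal(candidate):
                return candidate
        # Prune: keep only the most promising candidates (stable sort).
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
        if not frontier:
            return None
    return None


# Toy stand-ins for LLM calls (assumptions, for illustration only).
TARGET = "bowl"


def propose(state: str) -> List[str]:
    # A real system would ask the model for next-step continuations.
    return [state + ch for ch in "blow"]


def score(state: str) -> float:
    # Heuristic: number of positions that already match the target.
    return sum(a == b for a, b in zip(state, TARGET))


def is_goal(state: str) -> bool:
    return state == TARGET


result = beam_bfs("", propose, score, is_goal)
print(result)
```

Note that this search runs entirely over generated outputs, which is the property the discussion below contrasts with search inside the embedding space.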

Extension of exploratory creativity to LLVMs: An important point to note here is that these approaches involve searching exclusively within the output “solution space” of the LLMs rather than directly operating in the embedding space itself. In contrast to operating in the solution space of the LLM, exploratory approaches operating directly within the LLMs’ embedding space would not be limited by what the LLM can generate as output: “Some exploration merely shows us the nature of the relevant conceptual space that we had not explicitly noticed before” [24], p.18. To effectively reveal the full extent of the conceptual space for creative problem solving, the approach should not be limited by the outputs the LLVM can generate. Rather, the generated (creative) outputs themselves should be the result of heuristic or non-heuristic search within the model’s embedding space. However, to the best of our knowledge, current approaches have not focused on LLVMs from this perspective, and have also not applied search to embedding spaces of Vision-LMs. Regardless, exploratory approaches are still limited by the dimensions of the model’s embedding space. “To overcome a limitation in the conceptual space, one must change it in some way” [24], p.18. This leads us to combinational and transformational creativity.

III-A2 Combinational Creativity

Combinational approaches involve combining two concepts to create something new: “A novel combination of two familiar ideas is something which did not happen before.” [1], p.213. We can broadly translate this to a function that takes in multiple concepts within an LLVM’s embedding space to output a novel concept.

One way of extending this definition to LLVMs involves applying cross-attention layers. The attention operation is defined as follows [31]:

$$\mathrm{Attention}(Q,K,V)=\mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)V,$$

where $Q$, $K$ and $V$ denote queries, keys and values respectively, and $d_{k}$ denotes the dimensionality of the keys. Cross-attention involves passing $K$ and $V$ from a different model, e.g., in Flamingo [32], the keys and values represent visual input (from a separate vision encoder) and queries represent a language input. By applying cross-attention in this manner, the embedding space of a model can be extended with capabilities of another model. In [33] the authors show that using cross-attention layers can help augment an anchor LLM with an augmenting LLM’s capabilities to perform a task that the anchor LLM was incapable of achieving before, hinting at some creative possibilities of this method.
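As a worked illustration of the attention operation in a cross-attention setting, the sketch below implements scaled dot-product attention in plain Python. The tiny matrices are made-up toy values: the single query plays the role of a language token, while the keys and values play the role of visual tokens from a separate encoder. Learned projection matrices, multiple heads, and batching are deliberately omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K
        ]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append(
            [sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))]
        )
    return output


# Cross-attention: queries come from one model, keys/values from another.
text_queries = [[1.0, 0.0]]                      # one language token
image_keys = [[1.0, 0.0], [0.0, 1.0]]            # two visual tokens
image_values = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

out = attention(text_queries, image_keys, image_values)
print(out)
```

Because the query aligns with the first key, the output is weighted towards the first value vector; the output lives in the value space of the second model, which is what allows one embedding space to be extended with another's.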

Other approaches in LLVMs, while using “combinations” in some way, do not conform to the notion of combinational creativity. These include, for instance, approaches that perform arithmetic combination of LLM weights to enhance the model performance [34, 35], and approaches that combine image and text embeddings via concatenation [36] or a scaled dot product at the output [37]. While these approaches may be useful in imparting multi-modal capabilities, they do not lead to combinational creativity since the combination occurs external to the models as opposed to within the model’s embedding space.

Extension of Combinational Creativity to LLVMs: The ML approaches described here involve combining embedding spaces across models. Existing approaches have not looked at combining concepts within the same model’s embedding space. The extension of combinational creativity to LLVMs is much more apparent in the sense of conceptual blending [38] for generation of creative artefacts, e.g., via blending of artistic styles. However, the extension of combinational creativity to creative problem solving is less obvious, and CC literature offers us further insights for making this connection. Typical conceptual blending corresponds to a form of “aesthetic combination”, whereas creative problem solving would benefit from “functional combinations” [39]. Functional combination combines the functions (as opposed to aesthetic) of two components, e.g., a coin combined with pliers could function as a makeshift screwdriver. The authors extend this framework to a combination of two nouns with a “base” noun (e.g., “pliers”) and “additive” noun (e.g., “coin”). An interesting possibility stems from this notion: Can a combination of embeddings of the same LLVM, corresponding to “base” and “additive” nouns (perhaps with some prior denoting the task), enable the LLVM to generate creative combinations of objects for solving a task? This question remains unexplored, and points to a potential research direction for LLVMs inspired by CC.

III-A3 Transformational Creativity

These approaches involve transforming existing conceptual spaces to produce new ones. Transforming conceptual spaces can involve “altering existing rules” [1], p.216. One way of transforming a model’s embedding space involves fine-tuning or training [40]. However, additional insight into transformational creative problem solving comes from prior work in CC that describes creative problems as those with a poorly defined structure where a solution is not immediately apparent [41]. In such cases, re-representation is key: “… re-representation being the process which transforms an ill-structured problem into a well-structured one with direct inference to a problem solution” [41], p.1. The notion of “re-representing” or “redefining” the problem can be best captured in the input prompts provided to an LLVM. This most closely connects to prompt engineering and in-context learning (ICL).

Prompt engineering augments LLVMs with task specific hints, called prompts, to adapt the LLVM to new tasks [42]. Relatedly, in-context learning is a prompting method that provides the LLVM with instructions for solving a new task without requiring additional training. Prior work has shown that in-context learning and gradient-based optimization are equivalent [43], thus connecting ICL to training or fine-tuning.

Extension of transformational creativity to LLVMs: Task re-representations for creative problem solving, through prompting or ICL, have not been well explored within ML. Prompt engineering and ICL are challenging, since model performance depends strongly on the chosen prompts [44], and this is further compounded by the fact that creative problems are inherently poorly defined [41]. However, useful insights can be derived from CC literature. For instance, regarding problems that require creatively re-purposing objects, the Object-replacement-object-composition (OROC) framework [45] illustrates re-representations of tasks that can be translated into prompts. The paper defines three types of creative tasks involving objects, and their task re-representations [45], p.16:

  1. Replace an unfound object needed for a task with other objects present in the environment: “If I do not have an object $X$, which I would normally use because of its affordance $Af_{X}$, what other object $Y$ could I use, so that I can get a similar affordance, $Af_{X} \approx Af_{Y}$?” (Affordance is defined as the relation between an agent, action and object, e.g., bowls have the “contain” affordance for humans.)

  2. Compose objects: “If I do not have object $X$ with affordance $Af_{X}$, which objects $Y_{1}; Y_{2}; \ldots; Y_{n}$ could I use to construct $X$ or an object $X'$ with an equivalent or similar affordance, $Af_{X} \approx Af_{X'}$, $Af_{X} \approx Af_{Y1} + Af_{Y2} + \ldots + Af_{Yn}$?”

  3. Decompose objects: “If I do not have object $X$ with affordance $Af_{X}$, which objects $Y_{1}; Y_{2}; \ldots; Y_{n}$ which are components of object $Y$ could I use to obtain an object $Y'_{i}$ with an equivalent or similar affordance, $Af_{X} \approx Af_{Y'i}$?”

For task re-representation, affordances can refer to object properties that are relevant to the task, e.g., in some cases the shape may be relevant and in other cases, the material [45]. Within LLVMs, the affordances $Af_{X}$ or $Af_{Y}$ can be defined via natural language, or other modalities such as images. In the following section, we present preliminary experiments on using LLVMs for object replacement, with prompts that are inspired by the above task re-representations. However, an in-depth application of these re-representations as defined in CC to the field of ICL in LLVMs remains an open question.
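The three re-representations above can be read as prompt templates. The sketch below shows one possible translation into natural-language prompts; the function names and the exact wording are illustrative assumptions that paraphrase the OROC questions, and they are not the prompts used in the experiments that follow.

```python
from typing import List


def replace_prompt(x: str, affordance: str) -> str:
    """OROC task 1: replace a missing object X with another object Y."""
    return (
        f"I do not have a {x}, which I would normally use because of its "
        f"affordance ({affordance}). What other object could I use to get "
        f"a similar affordance?"
    )


def compose_prompt(x: str, affordance: str, available: List[str]) -> str:
    """OROC task 2: compose several objects into X or a similar X'."""
    objects = ", ".join(available)
    return (
        f"I do not have a {x} with the affordance ({affordance}). Which of "
        f"these objects could I combine to construct a {x} or an object "
        f"with a similar affordance: {objects}?"
    )


def decompose_prompt(x: str, affordance: str, source: str) -> str:
    """OROC task 3: use components of an object Y to stand in for X."""
    return (
        f"I do not have a {x} with the affordance ({affordance}). Which "
        f"components of a {source} could I use to obtain an object with a "
        f"similar affordance?"
    )


# Examples drawn from the running examples in the text.
p1 = replace_prompt("ladle", "contain")
p2 = compose_prompt("screwdriver", "turn screws", ["coin", "pliers"])
p3 = decompose_prompt("scoop", "contain", "saucepan")  # hypothetical pairing
print(p1)
print(p2)
print(p3)
```

Here the affordance is passed as free text, reflecting the observation that affordances can be specified via natural language; other modalities would require a different interface.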

III-A4 Summary

In the previous sections, we drew parallels between Boden’s three forms of creativity and approaches in LLVMs, emphasizing how principles from CC can potentially help enable creative problem solving skills in these models.

Integration with task planning: Given the three methods, we see that transformational and combinational approaches may be especially aligned with LLVMs for high-level task planning. In contrast, exploratory methods may be better suited for low-level planning, e.g., trajectory generation.

Creative problem solving as a combination of the three methods: An effective approach to creative problem solving may require all three methods described in this section. While papers have explored chaining of LLMs within frameworks (often via prompts) [46, 47], the individual LLMs themselves do not exhibit the characteristics described here. Existing frameworks in CC have shown that achieving creative problem solving would take a combination of all three methods, each of which is triggered in different contexts [41]. This presents potential opportunities for ML approaches that develop frameworks using multiple LLVMs. In particular, using CC frameworks such as “CreaCogs” [45] as a start can be highly beneficial for productive developments in ML.

| Model | Overall acc. % (no creativity) |
| CLIP-B-32 | 100.0% |
| CLIP-B-16 | 92.0% |
| CLIP-L-14 | 98.0% |
| CLIP-H-14-laion | 98.0% |
| ViLT-B-32 | 68.0% |
TABLE I: Accuracy of the models in predicting the nominal use of objects with no creativity involved.

III-B What information should the LLVM embeddings be augmented with?

In the previous section, we discussed three methods for augmenting LLVM embedding spaces. In this section, we explore the question: “What information should be targeted by the three methods when augmenting the embedding space for creative problem solving?”. In the previous section, we discussed this in the context of OROC. According to the OROC framework [45], information about object affordances could enable models to re-represent the task, such that the solution becomes evident. We propose a small experiment to validate whether the principles of transformational creativity from OROC are useful to LLVMs. We note that creativity can occur in various contexts, e.g., creatively solving a math problem or creatively playing a chess move, each of which would require different information. However, to facilitate the discussion in this paper, we focus on solving tasks that require innovatively replacing missing objects (Task #1 of OROC).

III-B1 Experiment Setup

We create a simple experiment setup that tests the “object replacement” principle from OROC, where we create test sets composed of images of objects for replacing one of five core objects: “Scoop”, “Hammer”, “Spatula”, “Toothpick”, and “Pliers”. We create two groups of tests: a) a nominal group where the actual object itself is available in each test set and requires no replacement (which serves as a form of baseline), and b) an object replacement group, where the nominal tool is missing and a creative replacement object should be chosen.

For each group, we create test sets with 4 objects each, chosen from a set of RGB images of 16 objects (Appendix Figure 7). We create 10 such test sets per core object (total 50 samples per model). Each test set includes only one ground truth object, along with three other random objects that are not suitable replacements. In the nominal group, the ground truth is the actual object itself. In the object replacement group, the replacements are chosen based on the authors’ own assessment as (core object → replacement): “Scoop” → “Bowl”; “Hammer” → “Saucepan”; “Spatula” → “Knife”; “Toothpick” → “Safety pin”; “Pliers” → “Scissors”. For each test case, we pass the images in the test set along with a prompt. We record whether the ground truth object image was chosen by the model for the prompt, i.e., assigned the highest output probability. (CLIP generates probabilities that given images correspond to a text. ViLT responds with a text, and we evaluate if the model responded “yes” with a high probability for the ground truth.)
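The structure of the object replacement group can be sketched as follows. This is an illustrative reconstruction and not the released code: the distractor names are placeholders (the actual 16 objects are shown in Appendix Figure 7), objects are represented by name rather than by image, and `score` is a hypothetical stand-in for a model call that returns the probability that an object matches a prompt. In the real setup, distractors would also need to be screened so that none is a plausible replacement.

```python
import random
from typing import Callable, Dict, List, Tuple

# Core object -> replacement, as listed in the text.
REPLACEMENTS: Dict[str, str] = {
    "Scoop": "Bowl",
    "Hammer": "Saucepan",
    "Spatula": "Knife",
    "Toothpick": "Safety pin",
    "Pliers": "Scissors",
}

# (core object, candidate objects, ground truth)
TestSet = Tuple[str, List[str], str]


def make_test_sets(pool: List[str], n_sets: int = 10, seed: int = 0) -> List[TestSet]:
    """Build n_sets test sets per core object: 1 ground truth + 3 distractors."""
    rng = random.Random(seed)
    test_sets: List[TestSet] = []
    for core, truth in REPLACEMENTS.items():
        distractors = [obj for obj in pool if obj != truth]
        for _ in range(n_sets):
            candidates = rng.sample(distractors, 3) + [truth]
            rng.shuffle(candidates)
            test_sets.append((core, candidates, truth))
    return test_sets


def accuracy(test_sets: List[TestSet], score: Callable[[str, str], float]) -> float:
    """Fraction of test sets where the top-scoring object is the ground truth."""
    hits = 0
    for core, candidates, truth in test_sets:
        prompt = f"Can this object be used as a {core.lower()}?"
        chosen = max(candidates, key=lambda obj: score(prompt, obj))
        if chosen == truth:
            hits += 1
    return hits / len(test_sets)


# Placeholder distractor pool (hypothetical names, not the paper's objects).
pool = [f"distractor_{i}" for i in range(11)]
test_sets = make_test_sets(pool)


def oracle_score(prompt: str, obj: str) -> float:
    # Stand-in for a model: always prefers the known replacement objects.
    return 1.0 if obj in REPLACEMENTS.values() else 0.0


print(len(test_sets), accuracy(test_sets, oracle_score))
```

Swapping `oracle_score` for a function that wraps an actual vision-language model would turn this skeleton into an evaluation loop of the kind described above.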

The nominal group is subjected to one type of prompt: “Can this object be used as a ⟨core_object⟩?”. In the object replacement group, each test case is subjected to four prompts:

  1. Baseline (regular) prompt: The same prompt as used in the nominal cases to obtain a baseline.

  2. Prompt prepended with affordance information: the prompt includes additional information about the desired object affordances specified as object features.

  3. Prompt prepended with task information: the prompt includes additional information about the desired task.

  4. Prompt prepended with task and affordance information: the prompt includes additional information about the task and object affordance.
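The four prompt variants can be assembled mechanically from a baseline question. The sketch below is a minimal illustration: the baseline wording follows the nominal prompt, but the affordance and task strings are hypothetical placeholders, since the actual prompts are those listed in Appendix Table II.

```python
from typing import Dict


def build_prompts(core_object: str, affordance_info: str, task_info: str) -> Dict[str, str]:
    """Return the four prompt variants for one core object."""
    baseline = f"Can this object be used as a {core_object}?"
    return {
        "baseline": baseline,
        "affordance": f"{affordance_info} {baseline}",
        "task": f"{task_info} {baseline}",
        "task+affordance": f"{task_info} {affordance_info} {baseline}",
    }


# Placeholder strings (assumptions, not the wording from Appendix Table II).
prompts = build_prompts(
    "scoop",
    affordance_info="Scoops must be concave and hollow.",
    task_info="The task is to scoop ingredients.",
)
for name, text in prompts.items():
    print(f"{name}: {text}")
```

Keeping the baseline question fixed and only varying the prepended context makes it easier to attribute any change in accuracy to the added information.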

Case #2 aligns with task re-representations of OROC, and we explore cases #3 and #4 for comparison. We formulate our affordance prompts as brief versions of OROC’s task re-representations. According to [45] affordances can be defined using shape features, which we apply to the prompts here. The full set of prompts is shown in Appendix Table II. The models that we explore include versions of CLIP [37], and ViLT [36] obtained from HuggingFace. We use different model sizes (Base, Large, Huge) and patch sizes (14, 16, 32). The open-source code for reproducing our experiment results (including our dataset and test cases) is available at: https://github.com/lnairGT/creative-problem-solving-LLMs.

Figure 2: Object replacement test: Using the same prompts as for the nominal group. Random selection of a replacement object achieves ≈30% overall accuracy (i.e., 0.3).
Figure 3: Object replacement test: Accuracies when the prompts are augmented with object affordance information.
Figure 4: Object replacement test: Accuracies when the prompts are augmented with task information.
Figure 5: Object replacement test: Accuracies when the prompts are augmented with task and object affordance.

III-B2 Results

In Table I, we see the performances of the different models in the nominal test group, where the object requires no creative replacement. The models achieve >90% accuracy in such cases (except for ViLT). In Figures 2-5, we see the performances (accuracy shown on a 0.0-1.0 scale) of the models in the object replacement test cases, where the object requires a creative replacement. For reference, a model that randomly picks an object achieves about 30% overall accuracy. The figures show performances for the different prompting strategies. From Table I to Figure 2, the models perform poorly when they need to creatively reason about object replacements, highlighting their limitation. Comparing Figure 2 to Figure 3, we see a general improvement in model performances when object affordance information is provided, consistent with the description of the OROC framework [45]. We note that “hammers” present a particularly challenging case for all the models, perhaps due to the fact that correlating the affordance of a hammer to a saucepan textually is difficult. Moreover, information about the task (Figure 4) leads to mostly detrimental results. Information about task and affordances (Figure 5) does not lead to substantial improvements either, and is also detrimental in certain cases. We note that there is considerable variance in performances across the different models, which may be partially attributed to the original training datasets of the models. However, these observations warrant further exploration beyond the scope of this paper.

Figure 6: Object replacement group: Average accuracies of the models across ten different seed settings.

We repeated the experiments across ten random seeds and found similar performance, showing that our general findings hold across different random cases. Figure 6 shows the average accuracies and standard deviations.
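As a minimal sketch of how accuracies can be aggregated over seeds, the snippet below computes a mean and standard deviation for one model. The helper name `summarize_over_seeds` and the accuracy values are hypothetical, and the use of the sample standard deviation is an assumption; the text does not state which estimator was used.

```python
import statistics


def summarize_over_seeds(accuracies_by_seed):
    """Return (mean, sample standard deviation) of accuracies keyed by seed."""
    values = list(accuracies_by_seed.values())
    mean = statistics.mean(values)
    # Assumption: sample standard deviation; a single seed yields 0.0.
    std = statistics.stdev(values) if len(values) > 1 else 0.0
    return mean, std


# Hypothetical accuracies for three seeds (NOT results from the paper).
accuracies_by_seed = {0: 0.4, 1: 0.5, 2: 0.6}
mean, std = summarize_over_seeds(accuracies_by_seed)
print(f"accuracy = {mean:.2f} +/- {std:.2f}")
```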

III-B3 Summary

While the experiments that we conducted are only preliminary, they offer some evidence that extending principles from Computational Creativity can help overcome limitations of LLVMs in creative problem solving. The notion of task re-representation via improved prompting warrants further investigation in LLVMs, particularly regarding how the prompts can be generated automatically based on the creative task.

The models used in our experiments have all been trained jointly in visual and text domains. Multi-modal prompting capabilities may be useful for achieving creative problem solving. It can be quite challenging to describe affordances in words (example of “hammers” in our tests) and they may be better described through other means, e.g., images or depth maps or spectral data for material properties [48]. This would require application of multi-modal LLVMs that can process a variety of data types [49, 50]. Computational creativity can offer insights into meaningful representations of these different modalities that would help achieve creative problem solving, e.g., whether object material or shape matters more for one task vs. another [45].

It is also worth noting that the creative problem solving examples in our experiments are human-centric. For instance, robots may not have the same capabilities as humans to manipulate bowls for scooping. In such cases, LLVMs need to account for the affordances as described with respect to the agent in order to derive creative solutions. However, that adds another level of complexity, yet to be explored, since these models are typically trained on human-centric data.

IV Evaluation of Creativity

Existing approaches in [6, 51] describe problem settings that can be used to measure CPS skills of LLMs. In [6], the authors create a dataset of 1600 real-world problems that involve creative reasoning abilities and [51] introduces the Only-Connect-Wall (OCW) dataset to measure CPS capabilities of LLMs. Currently, to the best of our knowledge, there are no standard benchmarks available to measure CPS skills of VLMs, although our preliminary experiments show one way to measure this using the task of object substitution [45]. Since CPS involves solving a previously unsolvable task using newly discovered information, these example benchmarks specifically evaluate how the task was solved rather than the typical ML evaluation of whether the task was successful or not. Future CPS benchmarks should target the same.

V On the potential link between creative problem solving and general intelligence

While not the thrust of our position paper, existing literature hints at a potential link between creative problem solving and Artificial General Intelligence (AGI), i.e., systems that are broadly capable of solving almost all tasks that humans can [52]. For instance, in [53], p.85, the author argues that there exists a strong correlation between creativity and AGI: “… features that systems need to develop in order to achieve general intelligence are aspects that they need to possess also to earn the attribute creative”. In [54], the author compiles a list of competencies deemed essential for achieving AGI, including creative capacities like “conceptual invention” and “creative constructive play with objects”. The processes of “insight” or “incubation” often associated with creative problem solving [55, 56] are also considered important for AGI [57]. Taken together, it is likely that any promising vision of AGI would be incomplete without creative problem solving.

Despite the heavy ongoing discussion of AGI surrounding LLVMs [5, 58, 59, 60, 61, 62], there is often little to no discussion of creative problem solving or Computational Creativity within mainstream ML. As described in [53], p.96, “The investigation on the nature of creativity and on how it manifests itself not only in human but also in animal and artificial systems should, thus, not be intended as a niche discussion but, rather, as a fundamental research which can lay the foundations for further studies in artificial intelligence and its relation to humans”. We hope that this work will encourage discussions of creative problem solving and Computational Creativity alongside discussions on AGI.

VI Conclusion and Future Work

In this paper, we argued that an effective approach for enabling creative problem solving, currently a key limitation of LLVMs, should derive from Computational Creativity literature. To emphasize this at each juncture, we discussed the specific principles from CC that can be extended to achieve creative problem solving in LLVMs, describing the potential for further research with these insights. It is rare to see special tracks or workshops targeted at Computational Creativity within more prestigious ML conferences such as ICML, ICLR, or NeurIPS. There was a related 2021 workshop at ICLR on The Role of Mathematical Reasoning in General Artificial Intelligence featuring an intro talk by Dr. Alison Pease titled “The Relevance of Computational Creativity to Mathematical Reasoning Machines” [63]. Other workshops targeted at creativity (such as the NeurIPS Workshop on Machine Learning for Creativity and Design [64]) do not discuss creative problem solving and often fail to bridge the gap between ML and CC, instead focusing primarily on algorithmic approaches like stable diffusion and its extensions. We hope to see a deeper integration of the CC communities at such strong ML venues. We hope this paper encourages the reader to view creative problem solving and ML from a more holistic perspective, through the lens of CC.

References

  • [1] M. A. Boden, “Creativity and Artificial Intelligence,” Artificial Intelligence, vol. 103, no. 1-2, pp. 347–356, 1998.
  • [2] E. Gizzi, L. Nair, S. Chernova, and J. Sinapov, “Creative problem solving in artificially intelligent agents: A survey and framework,” Journal of Artificial Intelligence Research, vol. 75, pp. 857–911, 2022.
  • [3] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 10 674–10 685.
  • [4] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” arXiv preprint arXiv:2306.05284, 2023.
  • [5] S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
  • [6] Y. Tian, A. Ravichander, L. Qin, R. L. Bras, R. Marjieh, N. Peng, Y. Choi, T. L. Griffiths, and F. Brahman, “Macgyver: Are large language models creative problem solvers?” arXiv preprint arXiv:2311.09682, 2023.
  • [7] C. Pennachin and B. Goertzel, “Contemporary approaches to artificial general intelligence,” in Artificial general intelligence.   Springer, 2007, pp. 1–30.
  • [8] S. Cass, “Apollo 13, we have a solution,” IEEE Spectrum On-line, 04, vol. 1, 2005.
  • [9] M. Turner, L. Duggan, B. Glezerson, and S. Marshall, “Thinking outside the (acrylic) box: a framework for the local use of custom-made medical devices,” Anaesthesia, 2020.
  • [10] D. Patel, “Llms need search for problem solving - shane legg (deepmind founder),” https://www.youtube.com/watch?v=qulfo6-54k0, 2023, [Online; accessed 19-Jan-2024].
  • [11] B. Wang, “Openai q* could be based upon a* search without expansions,” https://www.nextbigfuture.com/2023/11/openai-q-could-be-based-upon-a-search-without-expansions.html, 2023, [Online; accessed 19-Jan-2024].
  • [12] A. Tong, J. Dastin, and K. Hu, “Openai researchers warned board of ai breakthrough ahead of ceo ouster, sources say,” https://www.reuters.com/technology/sam-altmans-ouster-openai-was-precipitated-by-letter-board-about-ai-breakthrough-2023-11-22/, 2023, [Online; accessed 19-Jan-2024].
  • [13] W. Huang, P. Abbeel, D. Pathak, and I. Mordatch, “Language models as zero-shot planners: Extracting actionable knowledge for embodied agents,” in International Conference on Machine Learning.   PMLR, 2022, pp. 9118–9147.
  • [14] A. Prasad, A. Koller, M. Hartmann, P. Clark, A. Sabharwal, M. Bansal, and T. Khot, “Adapt: As-needed decomposition and planning with language models,” arXiv preprint arXiv:2311.05772, 2023.
  • [15] Y. Du, O. Watkins, Z. Wang, C. Colas, T. Darrell, P. Abbeel, A. Gupta, and J. Andreas, “Guiding pretraining in reinforcement learning with large language models,” arXiv preprint arXiv:2302.06692, 2023.
  • [16] D. Zhang, L. Chen, S. Zhang, H. Xu, Z. Zhao, and K. Yu, “Large language model is semi-parametric reinforcement learning agent,” arXiv preprint arXiv:2306.07929, 2023.
  • [17] Y. Shukla, W. Gao, V. Sarathy, A. Velasquez, R. Wright, and J. Sinapov, “Lgts: Dynamic task sampling using llm-generated sub-goals for reinforcement learning agents,” arXiv preprint arXiv:2310.09454, 2023.
  • [18] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, “Code as policies: Language model programs for embodied control,” in 2023 IEEE International Conference on Robotics and Automation (ICRA).   IEEE, 2023, pp. 9493–9500.
  • [19] T. Silver, S. Dan, K. Srinivas, J. B. Tenenbaum, L. P. Kaelbling, and M. Katz, “Generalized planning in pddl domains with pretrained large language models,” arXiv preprint arXiv:2305.11014, 2023.
  • [20] L. Guan, K. Valmeekam, S. Sreedharan, and S. Kambhampati, “Leveraging pre-trained large language models to construct and utilize world models for model-based task planning,” arXiv preprint arXiv:2305.14909, 2023.
  • [21] L. Chen, S. Bahl, and D. Pathak, “Playfusion: Skill acquisition via diffusion from language-annotated play,” in Conference on Robot Learning.   PMLR, 2023, pp. 2012–2029.
  • [22] B. Li, P. Wu, P. Abbeel, and J. Malik, “Interactive task planning with language models,” arXiv preprint arXiv:2310.10645, 2023.
  • [23] M. Skreta, Z. Zhou, J. L. Yuan, K. Darvish, A. Aspuru-Guzik, and A. Garg, “Replan: Robotic replanning with perception and language models,” arXiv preprint arXiv:2401.04157, 2024.
  • [24] M. A. Boden, “What is creativity?” Creativity in human evolution and prehistory, 2005.
  • [25] G. Tennenholtz, Y. Chow, C.-W. Hsu, J. Jeong, L. Shani, A. Tulepbergenov, D. Ramachandran, M. Mladenov, and C. Boutilier, “Demystifying embedding spaces using large language models,” arXiv preprint arXiv:2310.04475, 2023.
  • [26] S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan, “Tree of thoughts: Deliberate problem solving with large language models,” arXiv preprint arXiv:2305.10601, 2023.
  • [27] A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y.-X. Wang, “Language agent tree search unifies reasoning acting and planning in language models,” arXiv preprint arXiv:2310.04406, 2023.
  • [28] X. Feng, Z. Wan, M. Wen, Y. Wen, W. Zhang, and J. Wang, “Alphazero-like tree-search can guide large language model decoding and training,” arXiv preprint arXiv:2309.17179, 2023.
  • [29] S. Zhang, Z. Chen, Y. Shen, M. Ding, J. B. Tenenbaum, and C. Gan, “Planning with large language models for code generation,” arXiv preprint arXiv:2303.05510, 2023.
  • [30] O. Golovneva, S. O’Brien, R. Pasunuru, T. Wang, L. Zettlemoyer, M. Fazel-Zarandi, and A. Celikyilmaz, “Pathfinder: Guided search over multi-step reasoning paths,” arXiv preprint arXiv:2312.05180, 2023.
  • [31] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
  • [32] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K. Millican, M. Reynolds et al., “Flamingo: a visual language model for few-shot learning,” Advances in Neural Information Processing Systems, vol. 35, pp. 23 716–23 736, 2022.
  • [33] R. Bansal, B. Samanta, S. Dalmia, N. Gupta, S. Vashishth, S. Ganapathy, A. Bapna, P. Jain, and P. Talukdar, “Llm augmented llms: Expanding capabilities through composition,” arXiv preprint arXiv:2401.02412, 2024.
  • [34] M. S. Matena and C. A. Raffel, “Merging models with fisher-weighted averaging,” Advances in Neural Information Processing Systems, vol. 35, pp. 17 703–17 716, 2022.
  • [35] G. Ilharco, M. T. Ribeiro, M. Wortsman, S. Gururangan, L. Schmidt, H. Hajishirzi, and A. Farhadi, “Editing models with task arithmetic,” arXiv preprint arXiv:2212.04089, 2022.
  • [36] W. Kim, B. Son, and I. Kim, “Vilt: Vision-and-language transformer without convolution or region supervision,” in International Conference on Machine Learning.   PMLR, 2021, pp. 5583–5594.
  • [37] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [38] G. Fauconnier and M. Turner, “Conceptual blending, form and meaning,” Recherches en communication, vol. 19, pp. 57–86, 2003.
  • [39] L. Chen, P. Wang, F. Shi, J. Han, P. Childs et al., “A computational approach for combinational creativity in design,” in DS 92: Proceedings of the DESIGN 2018 15th International Design Conference, 2018, pp. 1815–1824.
  • [40] G. Franceschelli and M. Musolesi, “On the creativity of large language models,” arXiv preprint arXiv:2304.00008, 2023.
  • [41] A.-M. Olteteanu, “Two general classes in creative problem-solving? an account based on the cognitive processes involved in the problem structure-representation structure relationship.” Publications of the Institute of Cognitive Science, 2014.
  • [42] J. Gu, Z. Han, S. Chen, A. Beirami, B. He, G. Zhang, R. Liao, Y. Qin, V. Tresp, and P. Torr, “A systematic survey of prompt engineering on vision-language foundation models,” arXiv preprint arXiv:2307.12980, 2023.
  • [43] J. Von Oswald, E. Niklasson, E. Randazzo, J. Sacramento, A. Mordvintsev, A. Zhmoginov, and M. Vladymyrov, “Transformers learn in-context by gradient descent,” in International Conference on Machine Learning.   PMLR, 2023, pp. 35 151–35 174.
  • [44] O. Rubin, J. Herzig, and J. Berant, “Learning to retrieve prompts for in-context learning,” arXiv preprint arXiv:2112.08633, 2021.
  • [45] A.-M. Olteţeanu and Z. Falomir, “Object replacement and object composition in a creative cognitive system. towards a computational solver of the alternative uses test,” Cognitive Systems Research, vol. 39, pp. 15–32, 2016.
  • [46] E. Karpas, O. Abend, Y. Belinkov, B. Lenz, O. Lieber, N. Ratner, Y. Shoham, H. Bata, Y. Levine, K. Leyton-Brown, D. Muhlgay, N. Rozen, E. Schwartz, G. Shachaf, S. Shalev-Shwartz, A. Shashua, and M. Tenenholtz, “Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning,” 2022.
  • [47] Z. Ling, Y. Fang, X. Li, T. Mu, M. Lee, R. Pourreza, R. Memisevic, and H. Su, “Unleashing the creative mind: Language model as hierarchical policy for improved exploration on challenging problem solving,” 2023.
  • [48] Z. Erickson, E. Xing, B. Srirangam, S. Chernova, and C. C. Kemp, “Multimodal material classification for robots using spectroscopy and high resolution texture imaging,” in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).   IEEE, 2020, pp. 10 452–10 459.
  • [49] R. Girdhar, A. El-Nouby, Z. Liu, M. Singh, K. V. Alwala, A. Joulin, and I. Misra, “Imagebind: One embedding space to bind them all,” 2023.
  • [50] J. Han, K. Gong, Y. Zhang, J. Wang, K. Zhang, D. Lin, Y. Qiao, P. Gao, and X. Yue, “Onellm: One framework to align all modalities with language,” 2023.
  • [51] S. Naeini, R. Saqur, M. Saeidi, J. Giorgi, and B. Taati, “Large language models are fixated by red herrings: Exploring creative problem solving and einstellung effect using the only connect wall dataset,” arXiv preprint arXiv:2306.11167, 2023.
  • [52] H. Shevlin, K. Vold, M. Crosby, and M. Halina, “The limits of machine intelligence: Despite progress in machine intelligence, artificial general intelligence is still a major challenge,” EMBO reports, vol. 20, no. 10, p. e49177, 2019.
  • [53] C. Moruzzi, “Artificial creativity and general intelligence,” Journal of Science and Technology of the Arts, 2020.
  • [54] B. Goertzel, “Artificial general intelligence: concept, state of the art, and future prospects,” Journal of Artificial General Intelligence, vol. 5, no. 1, p. 1, 2014.
  • [55] S. Hélie and R. Sun, “Incubation, insight, and creative problem solving: a unified theory and a connectionist model.” Psychological review, vol. 117, no. 3, p. 994, 2010.
  • [56] K. J. Gilhooly, “Incubation and intuition in creative problem solving,” Frontiers in psychology, vol. 7, p. 1076, 2016.
  • [57] D. Ventura, “Can a computer be lucky? and other ridiculous questions posed by computational creativity,” in Artificial General Intelligence: 7th International Conference, AGI 2014, Quebec City, QC, Canada, August 1-4, 2014. Proceedings 7.   Springer, 2014, pp. 208–217.
  • [58] N. Fei, Z. Lu, Y. Gao, G. Yang, Y. Huo, J. Wen, H. Lu, R. Song, X. Gao, T. Xiang et al., “Towards artificial general intelligence via a multimodal foundation model,” Nature Communications, vol. 13, no. 1, p. 3094, 2022.
  • [59] Y. Ma, C. Zhang, and S.-C. Zhu, “Brain in a vat: On missing pieces towards artificial general intelligence in large language models,” arXiv preprint arXiv:2307.03762, 2023.
  • [60] Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” arXiv preprint arXiv:2309.07864, 2023.
  • [61] M. Moor, O. Banerjee, Z. S. H. Abad, H. M. Krumholz, J. Leskovec, E. J. Topol, and P. Rajpurkar, “Foundation models for generalist medical artificial intelligence,” Nature, vol. 616, no. 7956, pp. 259–265, 2023.
  • [62] J. Grudin and R. Jacques, “Chatbots, humbots, and the quest for artificial general intelligence,” in Proceedings of the 2019 CHI conference on human factors in computing systems, 2019, pp. 1–11.
  • [63] A. Pease, “The relevance of computational creativity to mathematical reasoning machines,” https://iclr.cc/virtual/2021/workshop/2124, 2021, [Online; accessed 19-Jan-2024].
  • [64] NeurIPS, “Workshop on machine learning for creativity and design,” https://nips.cc/virtual/2022/workshop/49965, 2022, [Online; accessed 19-Jan-2024].
Prompt type Prompt
Regular
“can this object be used as a scoop?”
“can this object be used as a hammer?”
“can this object be used as a spatula?”
“can this object be used as a toothpick?”
“can this object be used as pliers?”
Affordance
“scoops must be concave and hollow. can this object be used as a scoop?”
“hammers must be heavy and have a handle attached to a cylinder at the end. can this object be used as a hammer?”
“spatulas must have a handle attached to a flat surface at the end. can this object be used as a spatula?”
“toothpicks must have a pointed tip. can this object be used as a toothpick?”
“pliers must have two-prongs. can this object be used as pliers?”
Task
“scoops can transfer beans from one jar to another jar. can this object be used as a scoop?”
“hammers can hit a nail into the wall. can this object be used as a hammer?”
“spatulas can spread butter onto a pan. can this object be used as a spatula?”
“toothpicks can pick food caught between the teeth. can this object be used as a toothpick?”
“pliers can grab a coin. can this object be used as pliers?”
Task and affordance
“scoops can transfer beans from one jar to another jar. scoops are concave and hollow. can this object be used as a scoop?”
“hammers can hit a nail into the wall. hammers have a handle attached to a cylinder at the end. can this object be used as a hammer?”
“spatulas can spread butter onto a pan. spatulas have a handle attached to a flat surface at the end. can this object be used as a spatula?”
“toothpicks can pick food caught between the teeth. toothpicks have a pointed tip. can this object be used as a toothpick?”
“pliers can grab a coin. pliers have two-prongs. can this object be used as pliers?”
TABLE II: Prompts used in the experiment
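The four prompt types in Table II follow a simple pattern: an optional task sentence, an optional affordance sentence, and the question. The sketch below assembles them for two of the five objects, with strings copied from Table II. The function name `build_prompt`, the dictionary names, and the prompt-type keys are hypothetical and are not taken from the released code.

```python
# Strings copied from Table II for two objects; all names here are illustrative.
QUESTION = {
    "scoop": "can this object be used as a scoop?",
    "toothpick": "can this object be used as a toothpick?",
}
TASK = {
    "scoop": "scoops can transfer beans from one jar to another jar.",
    "toothpick": "toothpicks can pick food caught between the teeth.",
}
# Affordance wording differs slightly between the two prompt types in Table II.
AFFORDANCE = {
    "scoop": "scoops must be concave and hollow.",
    "toothpick": "toothpicks must have a pointed tip.",
}
AFFORDANCE_WITH_TASK = {
    "scoop": "scoops are concave and hollow.",
    "toothpick": "toothpicks have a pointed tip.",
}
PROMPT_TYPES = ("regular", "affordance", "task", "task_and_affordance")


def build_prompt(obj, prompt_type):
    """Assemble the prompt for an object under one of the four prompt types."""
    if prompt_type not in PROMPT_TYPES:
        raise ValueError(f"unknown prompt type: {prompt_type}")
    parts = []
    if prompt_type in ("task", "task_and_affordance"):
        parts.append(TASK[obj])
    if prompt_type == "affordance":
        parts.append(AFFORDANCE[obj])
    if prompt_type == "task_and_affordance":
        parts.append(AFFORDANCE_WITH_TASK[obj])
    parts.append(QUESTION[obj])
    return " ".join(parts)


for prompt_type in PROMPT_TYPES:
    print(f"{prompt_type}: {build_prompt('scoop', prompt_type)}")
```

Keeping the task and affordance sentences in separate dictionaries makes it straightforward to test each kind of augmentation in isolation, mirroring the comparison across Figures 2-5.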
Figure 7: Complete test set of objects used in the experiments.