How should the advent of large language models affect the practice of science?

Marcel Binz (Max Planck Institute for Biological Cybernetics; Helmholtz Center for Computational Health, Munich), joint first author, [email protected]
Stephan Alaniz (University of Tübingen), joint first author
Adina Roskies (University of California Santa Barbara)
Balazs Aczel (Eötvös Loránd University, Budapest)
Carl T. Bergstrom (University of Washington)
Colin Allen (University of California Santa Barbara)
Daniel Schad (Health and Medical University, Potsdam)
Dirk Wulff (Max-Planck-Institute for Human Development; University of Basel)
Jevin D. West (University of Washington)
Qiong Zhang (Rutgers University, New Brunswick)
Richard M. Shiffrin (Indiana University, Bloomington)
Samuel J. Gershman (Harvard University)
Ven Popov (University of Zurich)
Emily M. Bender (University of Washington), perspective leader
Marco Marelli (University of Milano-Bicocca), perspective leader
Matthew M. Botvinick (Google DeepMind; University College London), perspective leader
Zeynep Akata (University of Tübingen), joint senior author
Eric Schulz (Max Planck Institute for Biological Cybernetics; Helmholtz Center for Computational Health, Munich), perspective leader, joint senior author

Abstract

Large language models (LLMs) are being increasingly incorporated into scientific workflows. However, we have yet to fully grasp the implications of this integration. How should the advent of large language models affect the practice of science? For this opinion piece, we have invited four diverse groups of scientists to reflect on this query, sharing their perspectives and engaging in debate. Schulz et al. make the argument that working with LLMs is not fundamentally different from working with human collaborators, while Bender et al. argue that LLMs are often misused and over-hyped, and that their limitations warrant a focus on more specialized, easily interpretable tools. Marelli et al. emphasize the importance of transparent attribution and responsible use of LLMs. Finally, Botvinick and Gershman advocate that humans should retain responsibility for determining the scientific roadmap. To facilitate the discussion, the four perspectives are complemented with a response from each group. By putting these different perspectives in conversation, we aim to bring attention to important considerations within the academic community regarding the adoption of LLMs and their impact on both current and future scientific practices.

Significance Statement:
Artificial intelligence (AI) is reshaping the way researchers conduct science. Large language models (LLMs) in particular have received attention for their apparent versatility in reading, analyzing, and writing text. Yet, at the same time, there are also considerable concerns that come with the use of this technology, which could potentially place our scientific integrity at risk. This raises the question: how should LLMs affect the practice of science? In this manuscript, we present four perspectives on this issue, each one from a different group of researchers. The perspectives demonstrate the extent of both issues and opportunities, ranging from encouraging the use of LLMs in science to a stern warning against using LLMs in most, if not all, of the proposed cases.

keywords: large language models | artificial intelligence | meta-science | automated science

Language models are statistical models of human language that can be used to predict the next token (e.g., a word or character) for a given text sequence. Even though these models have been around for decades [1, 2], they have recently experienced an unprecedented renaissance: by training enormous neural networks with billions of parameters on data sets with trillions of tokens, researchers have observed the emergence of models whose abilities can go beyond mere text generation and conversational skills [3].
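To make the idea of next-token prediction concrete, the following is a minimal sketch of a count-based bigram model in Python. It is illustrative only: modern LLMs use large neural networks rather than simple counts, and the toy corpus and all function names here are assumptions made up for this example.

```python
from collections import Counter, defaultdict


def train_bigram(tokens):
    """Count how often each token follows each preceding token."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return counts


def next_token_distribution(counts, prev):
    """Estimate P(next | prev) from counts; empty dict if prev was never followed by anything."""
    following = counts.get(prev, Counter())
    total = sum(following.values())
    return {tok: n / total for tok, n in following.items()} if total else {}


def predict_next(counts, prev):
    """Return the most probable next token, or None if there is no evidence."""
    dist = next_token_distribution(counts, prev)
    return max(dist, key=dist.get) if dist else None


# Toy corpus, invented for illustration only.
corpus = "the model predicts the next token and the next token follows the next word".split()
counts = train_bigram(corpus)

print(next_token_distribution(counts, "the"))  # probabilities for tokens seen after "the"
print(predict_next(counts, "the"))
```

Generating text then amounts to repeatedly calling a function like `predict_next` (or sampling from the distribution) and appending the result to the sequence.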

Modern large language models (LLMs) are, amongst other things, able to solve selected university-level math problems [4], support language translation [5], or answer questions in a bar exam with high accuracy [6], out of the box and without additional training. Given the range of these capabilities, it seems possible that these systems will have an enormous impact on our society, leaving their mark on the labor market [7], the education system [8], and many other parts of our daily lives.

We, as scientists, may therefore wonder: how will the advent of LLMs affect the practice of science? Finding answers to this question is urgent as LLMs are already starting to permeate the academic landscape [9, 10, 11, 12, 13, 14, 15, 16, 17]. For instance, in 2022, Meta AI released the first science-specific LLM (under the name Galactica), intended to support researchers in the process of knowledge discovery [18]. Even more recently, Terence Tao, a Fields Medal-winning mathematician, proclaimed [19] that “the 2023-level AI can already generate […] promising leads to a working mathematician […]. When integrated with tools such as formal proof verifiers, internet search, and symbolic math packages, I expect, say, 2026-level AI […] will be a trustworthy co-author […].”

Yet while there have been claims of immense potential for this technology for the advancement of science, there are also considerable concerns that need to be taken into account. For instance, the aforementioned Galactica model had to be taken offline after just three days because it was heavily criticized by researchers for fabricating information, such as “fake papers (sometimes attributing them to real authors), and […] wiki articles about the history of bears in space” [20]. Furthermore, even though LLMs often achieve state-of-the-art performance on existing benchmarks, it remains debated whether this reflects genuine understanding, or whether they are merely regurgitating the training set, thereby acting like stochastic parrots [21]. It has been, for instance, repeatedly demonstrated that even the most capable models at present fail at basic arithmetic problems such as multiplying two four-digit numbers [22]. Flaws like these are especially concerning if we intend to utilize LLMs for research purposes, and could endanger the integrity of science if we act carelessly.

The objective of the present article is to provide researchers with different opinions and a forum to voice and discuss their perspectives on if and how we should make use of LLMs in the context of science. To facilitate this discussion, we will first highlight a few applications where LLMs have the potential to positively impact science, followed by pointing out some of the issues that come with them.

Background: applications of LLMs in science

LLMs find their most obvious use case as a supporting tool for scientific writing. For example, as proofreaders of manuscript drafts, they can aid in rectifying grammatical errors, improving the writing style, and ensuring adherence to editorial guidelines. Beyond manuscript composition, LLMs could prove valuable for data acquisition and analysis in domains that were traditionally reliant on manual human labor [23, 24]. Researchers have even suggested using LLMs as potential substitutes for human participants, as proxies [25] or for pilot studies [26]. In computational fields, LLMs could speed up prototyping by writing code [27], while a human-in-the-loop would guide these processes, correct LLM-generated errors and ultimately decide which ideas warrant further pursuit. Moreover, researchers might experiment with employing LLMs at certain stages of research with progressively reduced supervision [28], potentially leading to increased automation in some aspects of scientific exploration and discovery.

While the potential influence of LLMs on the practice of science is immense, there are pressing issues that come with the use of LLMs in the context of science. For instance, when an LLM helps us to write text, who ensures that its output is not subject to plagiarism issues [29]? LLMs learn from web-sourced text data, acquiring inherent biases [30, 31, 32] and, in some cases, replicating excerpts from their training data [33]. When an LLM is used for data analysis, what happens when it hallucinates data? The content generated by LLMs can contain errors or fabricated information, presenting a potential threat to the integrity of scientific publishing [12]. When an LLM suggests an idea, who gets credit for it? The general consensus within the scientific community seems to indicate that LLMs are not eligible for (co-)authorship [34] as they cannot be held accountable for upholding scientific precision and integrity. Leading AI conferences such as ICML (https://icml.cc/Conferences/2023/llm-policy) and ACL (https://2023.aclweb.org/blog/ACL-2023-policy/), as well as journals such as Science (https://www.science.org/content/page/science-journals-editorial-policies), Nature (https://www.nature.com/nature-portfolio/editorial-policies/ai) and PNAS (https://www.pnas.org/author-center/editorial-and-journal-policies#authorship-and-contributions), have already adopted policies to limit the involvement of LLMs. However, it remains an open question how strong these regulations should be and if and how the usage of LLMs should be acknowledged.

These — and many other — issues raise the questions: How should the advent of LLMs affect the practice of science? Do LLMs actually improve our scientific output or are they rather hindering good scientific practice? To what extent should they be used given the ethical and legal issues that come with them? We believe these to be highly non-trivial questions without an obvious answer and have therefore invited four groups of researchers to provide their perspectives on them. These perspectives were selected to cover a broad spectrum of opinions in order to spark a constructive discussion. Each of the perspectives is accompanied by a response from each group. We conclude this article with a short general discussion in which we attempt to identify common themes.

Perspective – LLMs: more like a human collaborator than a software tool

Contributors: Eric Schulz, Daniel Schad, Marcel Binz, Stephan Alaniz, Ven Popov, and Zeynep Akata

Most researchers in our labs already frequently employ LLMs in their everyday work. They use them, amongst other things, to finetune and revise their drafts, as a supporting tool for programming, to suggest formulations for research items such as questionnaires or experimental instructions, and to summarise research papers. We have observed a significant increase in quality in all of these areas after the widespread adoption of these models. While our personal experience may be biased, there are several studies supporting the idea that LLMs can facilitate writing [35], coding [36], and knowledge extraction [37]. In the future, we expect these models to be even more deeply integrated into the scientific process, taking on roles similar to a collaborator with whom one can develop and discuss ideas.

Indeed, we believe that working with LLMs will not be fundamentally different from working with other collaborators, such as research assistants or doctoral students. LLMs are not perfect and have limitations and biases that could affect their performance and output. However, humans are also subject to some of the same flaws, such as errors, plagiarism, fabrication, or discrimination. If we take this perspective, it seems appropriate to view current LLMs less as traditional software tools and more as knowledgeable research assistants: they can do phenomenal work but we need to be aware that they can make mistakes.

Protecting the past

It is our chief responsibility to ensure the quality and integrity of our work. There are already rules and norms about scientific practice in place to ensure this, and many of them also apply to LLMs. For instance, we should always check the accuracy and validity of the information and data we obtain, no matter the source, as well as correctly cite the sources and methods we use. That means that we should not blindly trust or rely on LLMs, but rather use them as a complement to our own expertise and judgment. Furthermore, our work can only be criticized appropriately if all information about its methodology is transparently communicated. We should therefore acknowledge the contributions of LLMs to our research, just as we would do for any other tool. Ultimately, it is — and will remain — the authors’ responsibility to ensure that the appropriate scientific standards are followed, regardless of whether we use LLMs or not.

Ensuring that our research is reproducible is one of the cornerstones of modern science. However, as many LLMs are proprietary, working with them poses a threat to this ideal. Nobody guarantees that OpenAI, Google, or other providers will not make changes to their models (in the worst case, without informing the user). In fact, this happened to us during the revision process of one of our papers, where, at some point, we could not reproduce our initial results, likely due to changes on the provider side. How should we deal with such cases? We believe that the obvious solution to this problem is to rely on open-source models where one has full control over all aspects of the model. Following a recent call for action to the European Parliament [38], we therefore strongly advocate for the development of such models, such that they can become the primary tool for scientific inquiry.

Welcoming the future

Paper reviewing is another area where LLMs could improve our scientific pipeline. In a recent study, Liang and colleagues [39] demonstrated this potential by systematically comparing LLM-generated reviews to reviews written by human researchers. They found that “more than half (57.4%) of the users found GPT-4 generated feedback helpful or very helpful and 82.4% found it more beneficial than feedback from at least some human reviewers.” Not only does this result allow scientists, especially early career researchers, to receive high-quality, instantaneous feedback (similar to what one could get from a critical colleague with an unlimited amount of time) but it also has implications for the peer review process. Yet, the use of LLMs in the peer review process also presents one major legal obstacle: manuscripts under review are typically confidential, and hence should not be entered into proprietary LLMs. To prevent such breaches of confidentiality, the National Institutes of Health (NIH) and other institutions have rules in place that prohibit the use of LLMs for peer review (https://grants.nih.gov/grants/guide/notice-files/NOT-OD-23-149.html). Locally hosted, open-source models are again a solution to this issue, as they provide control over which information is shared with external sources and which is not.

We also would like to point out that LLMs are a moving target, constantly evolving and becoming more capable and autonomous. This may raise new challenges and questions for the scientific community in the future, such as how to evaluate, interpret, and communicate the results generated by LLMs, or how to ensure their transparency and accountability. We welcome these challenges as an opportunity to advance our understanding and methods of science. We also encourage researchers to collaborate with each other and with LLM developers to address these issues and ensure that LLMs improve at frequently criticised skills such as providing truthful sources or acknowledging ignorance.

Conclusion

In conclusion, LLMs are a valuable asset for science and should be embraced rather than feared or restricted. It becomes apparent that they are not infallible machines once we start thinking about them as knowledgeable research assistants instead of traditional software tools. Furthermore, since rules for good scientific practice are already in place, and since it is the authors’ obligation to take responsibility for adhering to these rules, there is no need for novel rules for the use of LLMs. We believe that strengthening the development of open-source alternatives should be one of our top priorities, as they “offer enhanced security, explainability, and robustness due to their transparency and the vast community oversight” [38]. Finally, being conscious of the current limitations of LLMs and embracing them will allow us to grow with the technology as LLM research finds remedies and develops complementary tools. We hope that by adopting this liberal perspective, we can foster a positive and fruitful relationship between humans and LLMs in science. Please note that the first draft of our perspective was written by an LLM (GPT-4) based on our meeting notes.

Perspective – Science is a social process that cannot be auto-completed

Contributors: Emily M. Bender, Carl T. Bergstrom, and Jevin D. West

When deciding whether to use an LLM, it is important to recognize that LLMs are simply models of word form distributions extracted from text — not models of the information that people might get from reading that text [40]. Originally, such systems were used to rank or classify text. In automatic transcription, for example, an acoustic model provides a set of possibilities and the language model helps determine the most likely next word [41]. Today, however, LLMs are vaunted for their ability to extrude synthetic text by repeatedly selecting a next likely token.

Trained on sufficiently large datasets and with sufficiently well-tuned architectures and training processes, LLMs appear to produce coherent text on just about any topic, including scientific ones. Moreover, we can’t help but make sense of their textual output because our linguistic processing capabilities are instinctual and reflexive [42].

Proponents argue that LLMs are useful in three domains: 1) navigating science, by searching and synthesizing published literature, 2) doing science, in the sense of designing or conducting experiments or generating data, and 3) communicating science, by drafting text for publication. While certain machine approaches may be useful in each, LLMs are unlikely to outperform alternative technologies. Furthermore, they have the potential to cause downstream harms to science if their use is widely embraced.

Navigating science

Natural language processing has proven useful in sorting through an ever-growing body of scientific literature. Information retrieval and extraction techniques, as implemented in academic search engines (e.g. ref. [43]), have helped researchers discover relevant prior work. Will LLMs supplant other NLP approaches? We doubt it. The inappropriateness of LLMs as text generators and synthesis machines was highlighted in Meta’s Galactica debacle. That system—taken off-line after three days in response to intense criticism for its abysmal performance—had been trained on scientific text and promoted as a tool to “summarize academic papers, solve math problems, generate Wiki articles, write scientific code, annotate molecules and proteins, and more.” [20]. But training an LLM on scientific papers doesn’t guarantee that it will output scientifically accurate information. As Meta discovered, the use of LLMs yields text ungrounded in any communicative intent or accountability for accuracy.

One might hope that LLMs could at least be used to summarize a set of papers. Extractive summarization systems [44] already do this; will LLMs perform better? Will people tend to over-rely on system output rather than using it as a starting point? What are the costs of false negatives, i.e., important points not included in the generated summary? How will errors generated by LLMs, which then become training data for future LLMs, get amplified?

Doing science

LLMs are just one of many technologies dubbed “artificial intelligence”, but their surprising capacity to perform what amounts to a fancy parlor trick has drawn outsized attention. That’s a mistake. LLMs may be adequate for specific linguistic tasks such as grammar checking, automatic transcription, and machine translation (including code generation), but they are unlikely to provide an effective basis for most tasks involved in hybrid human-machine science. Even where they do appear to be moderately effective, they are known to be brittle to input variation [45]. The future of machine-aided science will not be a massive, one-size-fits-all, universal application of LLMs, but rather an ensemble of bespoke and often lightweight models that have been designed explicitly to solve the specific tasks at hand — and, crucially, evaluated in terms of those specific tasks. Such approaches also have a major advantage where interpretability is concerned. If researchers want to understand output variation, let alone find ways to fine-tune the architecture to generate better results, they need to steer away from technologies as opaque as LLMs. But instead, the ongoing hype around LLMs is drawing funding and brainpower away from more promising, targeted approaches.

Not only are LLMs being explored as aides to researchers; numerous proposals suggest that they can stand in for test subjects [46], survey participants [47], or data annotators [48]. Such arguments derive from a failure to understand that LLMs model the output of sequential linguistic tokens, not concepts, meanings, or communicative intent. If we are looking to study the opinions or behavior of human beings, we need to work with actual people.

Communicating science

By design LLMs generate form without substance. The synthetic text that systems output constitutes neither ideas, nor data — and it certainly is not a reliable information source. This notion of generating statements that no one intended is anathema to the spirit of scientific inquiry. Automatically generating something that looks like a manuscript is very different from the iterative process of actually writing a manuscript. Yet the output can be difficult to distinguish, particularly in a cursory read or by inexpert readers. Some proponents argue that LLMs can relieve scientists of the drudgery of writing papers and free them up to get on with the serious business of “doing science” [49]. This false dichotomy between communication and investigation reflects a fundamental misunderstanding of the nature of science [50] that devalues the communicative aspects of science and ignores the role of writing in the process of formulating, organizing, and refining ideas.

Downstream, LLMs threaten the notion of scientific expertise, shift incentive structures [51], and undermine trust in the literature. When human authors review or acknowledge prior literature, we are assured that they are familiar with the field and striving to situate their results therein. If LLMs write our introductions, we lose these guarantees. Moreover, notions of systematic review are undercut by the randomness inherent in LLM output. Finally, when an LLM generates a literature review, the claims that it generates are not directly derived from the manuscripts it cites. Rather, the machine creates textual claims, and then predicts the citations that might be associated with them. Obviously this practice violates all norms of scholarly citation. At best, LLMs gesticulate towards the shoulders of giants.

Driven by quantitative metrics and the strong incentive to publish, researchers may opt to trade off quality for speed by letting LLMs do much of their writing. Widespread use of one or a few LLMs could undercut epistemic diversity in science. When asked to provide a hypothesis, experiment, or mode of explanation, LLMs may repeatedly offer similar solutions, instead of leveraging the parallel creativity of an entire science community.

Worse still, opportunistic or malicious actors could use LLMs to generate nonsense at scale with minimal cost. (This is not an argument against using LLMs appropriately, but we need to be prepared for such behavior.) Lazy authors could boost their publication counts by shotgunning machine-generated papers to low-quality journals. Predatory publishers could feign peer review using LLM output. Bad actors could overwhelm the manuscript submission system of a target journal (or even a target field) with a massive volume of fake papers. Or an investigator’s work could be targeted with a deluge of spurious machine-generated critiques on post-publication peer review platforms such as PubPeer.

Finally, LLMs may cause considerable collateral damage to science education. For example, as LLMs slash the cost of generating seemingly authoritative text, the web will be flooded with low-quality, mistake-ridden tutorials designed to capture advertising revenue. At present, search engines’ ability to discriminate is more or less the only line of defense. That’s worrisome.

Conclusion

In conclusion, LLMs are often mis-characterized, misused, and over-hyped, yet they will certainly impact the way we do science, from search to experimental design to writing. The norms that we establish now around their use will determine the consequences far into the future. We should proceed with caution — and evaluate at every step.

Perspective – LLMs in scientific practice: a matter of principles, not just regulations

Contributors: Marco Marelli, Adina Roskies, Balazs Aczel, Colin Allen, Dirk Wulff, Qiong Zhang, and Richard M. Shiffrin

A moderate perspective on the potential impact of LLMs on scientific practice holds that, while it is important to be mindful of the dangers, their application seems largely beneficial, insofar as they offer much-needed support in day-to-day research activity and may alleviate major obstacles to scientific advancement. This is evident when LLMs are applied as editing tools: they provide a writing aid that leaves researchers with more time for brainstorming ideas and analysis, may help mitigate disparities between different scientific communities, and remedy some of the disadvantages for researchers who are not native speakers of English [52]. Also, LLMs can access a broader range of literature than any individual researcher could, potentially offering valuable support for literature analysis and hypothesis generation [53], with a reach that goes beyond one’s research specialization.

However, although any new technology may be used for good or evil, some technologies afford opportunities for good or evil uses more than others. LLMs have disruptive potential that is ever more evident, and such disruption must be kept at bay if the goal is to prevent “evil drifts”. One perspective might hold that strict regulation is required, but regulation carries with it many costs that might be best avoided if it is kept moderate. A preferable approach may be to adopt clear principles guiding the way this technology should be used, principles that cannot just focus on efficiency and overall utility. Such principles include transparency, accountability, and fairness.

A matter of transparency

In science, transparency is of indispensable value. When LLMs are used as writing tools, researchers must acknowledge their reliance on them so that readers are on notice that the text is (at least partially) AI-generated. Authors should make explicit which LLMs were applied and how, as part of the method sections or in a separate dedicated statement. This could be achieved by relying on already existing solutions; for example, the CRediT taxonomy (https://credit.niso.org/) could also be used to code the nature of AI contribution, even if AI is not to be recognized as a coauthor. Ideally, in the spirit of open science, authors should publicly release their prompts along with the corresponding LLM responses as supplementary materials, and reference such archives in the manuscript. Importantly, transparency does not only pertain to the way we exploit LLMs, but also to the systems themselves. LLMs are not, strictly speaking, anything new. Models that are analogous to current LLMs in structure, spirit, and basic mechanisms have been part of the scientific debate for decades [54]. However, such older models were unambiguous about their architecture and training, if not openly released. Current LLMs are often not held to the same scientific standards as their ancestors, being widely applied even when their inner workings and training data remain undisclosed. This causes substantial issues in estimating the actual performance of such models (and, importantly, the possibility of data contamination [55]). As a scientific community valuing greater transparency, we should favour the systems that are taking some steps in that direction [56].
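The recommendation above to release prompts together with the corresponding LLM responses could be put into practice in many ways. The following Python sketch shows one possibility, assuming a simple one-record-per-line JSON file as the supplementary archive. The record fields, the class name `LLMInteraction`, and the helper functions are hypothetical choices for illustration, not an established standard.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class LLMInteraction:
    """One prompt/response pair, with enough context to report how the LLM was used."""
    model: str          # name of the LLM as reported by the authors
    model_version: str  # version or snapshot identifier, if known
    date: str           # date of the interaction (ISO format)
    purpose: str        # e.g. "proofreading", "code suggestion"
    prompt: str
    response: str


def save_interactions(path, interactions):
    """Write the interactions to a file, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for item in interactions:
            f.write(json.dumps(asdict(item), ensure_ascii=False) + "\n")


def load_interactions(path):
    """Read the interactions back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [LLMInteraction(**json.loads(line)) for line in f if line.strip()]
```

A file produced this way could be deposited as supplementary material and referenced in the methods section or in a dedicated statement on LLM use.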

A matter of accountability

It must be acknowledged that LLMs are instruments of human agency, and researchers should be held accountable for any scientific product they present to the community, irrespective of the extent to which this was obtained through the application of automatic systems. The Association for the Advancement of Artificial Intelligence has released clear guidelines in this respect: “Attribution of authorship carries with it accountability for the work, which cannot be effectively applied to AI systems […] Ultimately, all authors are responsible for the entire content of their papers, including text, figures, references, and appendices.” For example, LLMs are known to “hallucinate” and produce factually incorrect responses [57]. They can fabricate bibliographic citations, omit important references when summarizing literature, and potentially plagiarize text written by another researcher. The burden to verify that LLM-produced texts are accurate and that LLM-proofread texts are consistent with the original message remains with the individual authors. Similarly, LLMs can be particularly poor at logic and deductive reasoning [58, 59], so using them for analysis may lead to false conclusions. The onus is on the user to make sure that what LLMs produce is worth pursuing. Researchers must hence have strategies for assessing AI-related content; a good practice would be to have clear quality criteria and verification methods defined before using LLMs. Scientists should not underestimate the time and effort that such vetting will take, and should weigh the efficiency of LLM application against these costs.

A matter of fairness

AI in general and LLMs in particular have the potential to deeply affect us at a societal level. Science, like any human endeavour, is not immune to this. As a community we must make all possible efforts to guarantee that reliance on LLMs does not violate basic fairness principles. Indeed, current language models reflect mostly WEIRD (Western Educated Industrialized Rich Democratic) populations and cannot easily be prompted to represent non-WEIRD communities [60, 61]. This leads to biases in writing and annotation, potentially reinforcing distortions in citations and marginalization of already marginalized scientists. Moreover, it may have negative consequences in terms of equitable research, given that LLMs are also more accessible to WEIRD populations. More generally, LLMs will, for known or unknown reasons, favour some perspectives or sources over others [62]. These systematic patterns must be recognized and taken into account, to avoid unprincipled biases affecting the direction of research and possibly the relative success of careers.

Conclusion

The impact that LLMs are having on scientific practice cannot be overstated. Given the current trend, at the time you are reading these words such impact will likely be much larger than it is as we write this piece. Precisely how LLMs will influence the practice of science in the future cannot be entirely predicted, and countering such a revolution with strict, preconceived norms is a losing battle. Rather, establishing principles and shared values in the scientific community constitutes the ideal foundation when deciding how to manage these rapidly changing technologies. Most importantly, we need to train students and each other to build upon such principles in order to become appropriately skeptical towards these systems and their outputs.

Perspective – AI can help, but science is for people

Contributors: Matthew M. Botvinick and Samuel J. Gershman

Like many forms of technology, AI can substitute for human labor. With the advent of LLMs, the relevant kinds of labor begin to overlap with high-level human cognitive work, including the activities involved in science [15]. As LLMs improve, their ability to substitute for human scientific labor will be a major boon. However, we argue here that two core aspects of scientific work should be reserved to human scientists.

AI and scientific labor

Over time, the labor involved in scientific research has become progressively more onerous, sometimes now bordering on the intractable. Assimilating current knowledge has become more difficult in the face of increasingly voluminous literatures. Generating new questions, hypotheses and experimental tests has become more challenging, as the search problem entailed by each has become more complex. Drawing conclusions from experimental results has become harder as the size and complexity of datasets have exploded. And communicating and debating scientific conclusions has become more challenging for reasons including an overtaxing of peer review systems [63]. Given the increasing costs of scientific labor on these fronts, it’s no surprise that progress across multiple scientific fields appears to have slowed [64].

In the long run, AI may help us cope with the increasing demands of scientific work. Through the kinds of application detailed in the introductory essay above, AI may help us scale up, by making each step in the research cycle cheaper. In some cases, AI may eventually perform some forms of scientific labor better than human scientists, including the work of generating new hypotheses [65]. Even in present-day forms, AI may be useful on some fronts, as reviewed in the introduction. Of course, as widely discussed, current systems are too unreliable to deploy without caution and oversight (see accompanying commentaries), and only time will tell how feasible it may be to overcome current limitations.

However, in addition to addressing present-day shortcomings, it’s equally important to look into the future and consider what kind of AI tools we actually want for science in the long run. Given that AI can be applied to all phases of scientific work, one aim might be to build a full-fledged AI scientist, one that can do everything a human scientist now does: a full-spectrum replacement for human scientists. To us, this prospect is deeply unappealing. Why? Because there are particular aspects of science that we simply would not want to delegate to AI, even in a scenario where technical limitations presented no barrier. In particular, there are two core aspects of science that should be left to people. As we now explain, one of these is normative and the other epistemic.

The normative aspect of science

Any scientific discipline must continually ask, What problems shall we work on? How this question gets answered, both within individual labs and across whole research communities, is a complex affair, but it centers on judgments concerning the ‘interest’ and ‘significance’ of candidate problems, as well as their ‘timeliness,’ including their amenability to study under prevailing material and ethical constraints. Such judgments are informed by hard data; we obviously cannot reduce them to purely social constructions. However, at the same time, judgments of interestingness, significance and timeliness are inherently tied to culturally and historically grounded sensibilities and mores. This is not a corruption or impurity in scientific thought and procedure. Cultural sensibilities and patterns of thought are fundamental to scientific prioritization.

This point will be especially salient to students of the history of science, because the sensibilities and mores that inform science evolve over time. Just as scientific theory changes over the years, so do the ethical commitments and intellectual priorities that underlie science. This is evident in the fact that we no longer approach homosexuality as a disorder, or study genetics through the lens of eugenics. It shows in growing restrictions on animal experimentation. And it shows in the attention that Western climatologists now pay to regions historically neglected.

We argue that the normative aspect of science should not be ceded to AI systems, no matter how capable those systems become. People should stay in the driver’s seat, determining the direction of travel for science. Certainly, AI systems may be helpful partners in deliberation, especially as techniques for AI value alignment improve [66]. However, aligning a system to currently prevailing human views is different from allowing that system to govern the evolution of human views. In science, the ultimate driving force in that evolution should remain human. We are the moral agents in the room, and we shouldn’t forget it.

The epistemic aspect of science

Obviously, a central goal of basic science is understanding the natural world. If we are going to do science with AI tools, the question arises: ‘whose’ understanding matters? Would it be satisfactory to have AI systems that in some sense understand aspects of nature, but which don’t make that understanding accessible to people? From an engineering standpoint that might be fine. However, if it’s basic science we’re talking about, we shouldn’t let go of the core objective, which is not just practical but epistemic. We cannot cede understanding to artificial systems. We should insist on human understanding remaining a core goal of science.

Of course, it may be that because of limitations on human cognition, AI systems may someday be able to represent some aspects of nature that we cannot, just as existing AI systems master aspects of complex board games that elude even highly skilled human players [67]. Even in these cases, however, we should strive to extract as much human insight from AI systems as possible [68]. We shouldn’t lose track of what basic science is for.

Conclusion

AI promises to deliver great value in science, just as in many other domains. We believe its potential should be embraced. However, at the same time that we strive to break through the current limitations of AI to access its benefits, we should also think through our long-term goals in developing this technology. In the end, the two areas of science we’ve proposed to protect (one normative, the other epistemic) are two reflections of a more general bound on AI’s proper domain. We might call this the subjective limit. Unlike AI systems, people have a ‘point of view,’ which cannot be automated because it’s inherently subjective [69]. This point of view includes knowledge that is meaningful to us (the epistemic view) and values that are meaningful to us (the normative view). Machines might have their own knowledge or values, and these might be aligned with ours, but the alignment problem is fundamentally yoked to our subjective views. This principle applies in science, as in all human-centered activities.

Response by Eric Schulz, Daniel Schad, Marcel Binz, Stephan Alaniz, Ven Popov, and Zeynep Akata

We have argued that one should think of working with LLMs less as using a traditional software tool and more as working with a human collaborator and that this perspective allows us to better understand their shortcomings. This view actually resonates with many of the points raised in the other perspectives. For example, Marelli et al. write that “we should not blindly trust or rely on LLMs, but rather use them as a complement to our own expertise and judgment”, and Bender et al. argue that collaboration in science means iterating over outputs many times. Like working with a human collaborator, working with LLMs is an iterative process in which we constantly check for facts and logical consistency, revise arguments, and identify new connections. This process takes time and is more than just booting up an LLM and copy-pasting its outputs; as nicely put by Marelli et al., we “should not underestimate the time and effort that such vetting will take, and should weigh the efficiency of LLM application against these costs.”

We would like to stress that the notion that “LLMs are simply models of word form distributions extracted from text” oversimplifies both their capabilities and the additional engineering effort involved in modern LLMs. If one takes steps like reinforcement learning from human feedback [70] or instruction tuning [71] out of the equation, the outputs produced by such models are rather uninspiring (anyone who has ever worked with a plain LLM can attest to this). However, with those ingredients, LLMs do not just mimic language patterns; they can also synthesize concepts, critically evaluate their own outputs, and assist in problem-solving by processing vast amounts of data.

Bender et al. argue that “the future of machine-aided science will not be a massive, one-size-fits-all, universal application of LLMs, but rather an ensemble of bespoke and often lightweight models that have been designed explicitly to solve the specific tasks at hand […].” We believe LLMs are widely adopted precisely because they are a universal tool to accomplish many tasks. Not only does that remove the need to build specialized tools for each application, but it also eliminates the time it takes to learn them. Like human collaborators, who bring a diverse range of skills to a project, LLMs offer a breadth of knowledge that can be tailored to specific needs, e.g., as shown with the finetuning of coding LLMs [27]. There are, of course, applications that benefit from purposefully designed tools, but we believe that the percentage of such applications is modest once we take the time required to develop and learn such tools into account.

Finally, there is the question of how much autonomy we want to transfer to LLMs or other AI systems. Botvinick and Gershman advocated that people should retain control over certain aspects of the scientific pipeline, such as deciding which topics to work on. We do not think that such a constraint is necessary. For example, if in the future, an LLM (or any other AI system) decides to work on a topic that it deems interesting, and this LLM has proven itself to select topics in a very fruitful and productive manner, should we stop it? We do not think so, as long as ethical and legal guidelines are followed. Deciding on scientific topics is hard, and it is often not known a priori which research directions will be fruitful. Therefore, we should take any help we can get. Human researchers and AI systems bring complementary strengths to the table, and acknowledging this collaborative spirit enables us to get the best of both worlds.

Response by Emily M. Bender, Carl T. Bergstrom, and Jevin D. West

Science is a social process. It cannot be auto-completed. Its agents — real scientists — are as much the product of this process as the results recorded in papers.

LLM optimists envision a new world, where machines write, review, and even do much of the science. Even the less extreme narrative wherein LLMs simply aid researchers suffers from a misplaced and almost Taylorist [72] optimism regarding production efficiency. Science is not a factory, churning out widgets or statistical analyses wrapped in text. For a factory, producing one more car per day is progress. For science, the goals are to understand our world — not to produce more artifacts that look like scientific papers. If science were a paper factory, we too would indulge in LLM euphoria and might even claim a significant resulting improvement in quality coming out of our labs. But we cannot equate papers and progress. Papers are but messages that we send one another to coordinate our collective quest for scientific understanding.

We don’t, however, believe that any new mandates are required prohibiting the use of LLMs. All ill-advised use cases are already contrary to the norms of science: Using LLMs as stand-ins for human subjects or annotators amounts to fabricating data; using LLMs to write first drafts runs afoul of prohibitions against plagiarism, as it is impossible to discern the source of any string produced by an LLM; treating LLMs as co-authors contravenes norms around authorship, since LLMs are not the sort of thing that can be accountable for paper contents; using LLMs to produce peer reviews is tantamount to abrogating our responsibility to deeply evaluate the methods, reasoning and conclusions of our peers’ work.

When contemplating how LLMs will affect science, we should not underestimate the temptation to use them under deadline pressure or in response to publish-or-perish threats to job security. Nor should we underestimate the time needed to fact-check all LLM output—not only for the inevitable and frequent errors but also to assess whether citations are accurate. We note that there are no published user studies that quantify just how much effort this checking process is, nor how accurately researchers can carry it out, especially while working under pressure. Norms of plagiarism and the weight of reputation will hopefully counterbalance the unfettered use of this new technology.

To reason appropriately about when LLMs are suitable within science, it is critical to avoid anthropomorphizing them. These models aren’t research assistants. They are tools. They don’t make mistakes like junior (or senior!) researchers do: People can take responsibility for, and learn from, mistakes. Tools produce errors; thus people using the tools have a responsibility to understand their affordances and use them with care.

Similarly, understanding LLMs as tools positions us to ask: Is this the best tool for this task? Often, we expect, LLMs are not. Even setting aside the closed proprietary models, their attendant failures of transparency, and the stochastic nature of LLM output, we expect that bespoke models designed for specific tasks will be more efficient, performant, interpretable, and easier to fix when not functioning well.

Ultimately, science is a conversation and the interlocutors are the scientists. Synthetic text-extruding machines, designed only to produce plausible-sounding prose, are not fit participants in that conversation and should not be treated as such.

Response by Marco Marelli, Adina Roskies, Balazs Aczel, Colin Allen, Dirk Wulff, Qiong Zhang, and Richard M. Shiffrin

In our proposal concerning the application of LLMs in science, we aimed for a moderate perspective. In that spirit, we think that such systems can be profitably incorporated into scientific practice (in line with Schulz et al.), but we also recognize that there are causes for reservation (in line with Bender et al.).

We disagree with the view that LLMs should be considered collaborators or research assistants (Schulz et al.). One can instruct students or research assistants, correct their mistakes, and anticipate that they will learn from them. One may also question their reasons or their reasoning and get answers and expect accountability. Finally, one may also get insight into their values and their motivations, and trust or distrust them accordingly. LLMs are not introspective, lack metacognition, and have no values, at least not in the way humans do. Indeed, our inability to understand why they make the errors they do or when they will make them impairs our ability to understand their limits, especially on the edges of knowledge, where their training corpus is arguably less robust. Moreover, although LLMs build on the same foundations as previous language models (Schulz et al.), they are significantly more opaque and complex. As a result, maintaining the ever-important scientific value of transparency can be challenging and necessitates further development of practices and strategies to ensure its preservation.

Nevertheless, we disagree that such concerns should prevent scientific applications of LLMs. It is unrealistic to presume that LLMs won’t be used because of the risks involved, and banning them could do more harm than good: given the current trend, if prohibited, they would likely be used covertly, exacerbating the already-worrying transparency issues. Certainly, we need to pursue a critical and not starry-eyed understanding of LLMs and maintain a clear-eyed assessment of the potential risks of use. However, there are ways of employing them that can improve the quality of science, as long as the researcher is kept at the center of the process. LLMs are tools and, as such, must be carefully evaluated in their applications. This applies to any tool, including the existing alternatives discussed by Bender et al., which, although optimized for specific scientific purposes, are not immune from mistakes and whose degree of reliability always needs careful scrutiny. At the end of the day, the responsibility falls upon the shoulders of the researchers who use the tools. It is, hence, crucial to establish principles and values that guide our decisions—whether one applies LLMs or any other method.

Ultimately, we mostly concur with Botvinick and Gershman: the impact of LLMs on the future practice of science cannot be fully predicted, but science is a humanistic and human enterprise and must remain so, motivating curbs to LLM use. Our perspective highlighted the normative aspects in terms of core values that should guide their use today, while Botvinick and Gershman seek to identify the principles and values for the future, deciding what should remain exclusively human even when AI becomes fully capable of performing every step of scientific inquiry as well as upholding values such as accountability, transparency, and fairness. The two perspectives complement each other in stimulating discussions about what should guide the way we integrate AI into our scientific practices.

Response by Matthew M. Botvinick and Samuel J. Gershman

We see significant common ground across the other perspectives. We will focus here on one issue that gets to the heart of our perspective. Schulz et al. characterize LLMs as closer to collaborators than to tools. This raises critical issues of accountability, as pointed out by Bender et al. and Marelli et al. Some of these issues are currently being grappled with, while others will become more salient in the future as the technology advances. In particular, accountability is a fundamentally human concept: humans are the only currently existing agents that are accountable in the sense that they have ultimate control over their own actions and voluntarily submit to a system that regulates these actions. Extending this concept to artificial agents would entail a profound shift in our attitudes, essentially requiring us to acknowledge the personhood of such agents.

This shift, if it ever happens, will have ramifications far beyond science. Policymakers are already starting to wrestle with the question of how accountability should operate in a world where AI systems are increasingly autonomous, and the issues can get quite complex. The difficulties can be bounded, however, in domains where humans are able to draw clear boundaries around what role they will permit AI systems to play. In science, we believe these boundaries should be firm and restrictive, limiting key decisions — and thus accountability — to human scientists.

Ultimately, we are interested in the limit case where the limits imposed on AI are sociological, moral, and juristic, rather than technological. To regard LLMs as genuine collaborators rather than sophisticated tools, we would need to acknowledge attributes of personhood that go far beyond the mere practice of science. Our view is that AI, no matter how intelligent, should remain a tool, because ceding personhood to artificial agents would have undesirable consequences. It’s one thing for an AI scientist to tell us that there is a better way to fold proteins or design nuclear reactors, but it’s quite another thing for it to tell us that it would rather be studying some other problem. It would also be quite a shock to be told by an AI scientist that it’s solved an important problem but that it doesn’t feel like trying to explain it to a human. As we argued in our perspective, the choices of what to study and which explanations count are irreducibly human.

Conclusion

We have presented four different perspectives centering around the question “how should the advent of LLMs affect the practice of science?” Schulz et al. argued that “working with LLMs will not be fundamentally different from working with other collaborators, such as research assistants or doctoral students.” Bender et al. described a suite of problems with using LLMs in scientific activity and argued that many uses of LLMs are “contrary to the norms of science.” Marelli et al. called for “clear principles guiding the way this technology should be used”, including transparency, accountability, and fairness. Finally, Botvinick and Gershman advocated that “two core aspects of scientific work should be reserved to human scientists”, namely deciding on what problems to work on and that human understanding remains the goal of science.

Yet, even though there was substantial disagreement, there were also important common themes. In particular, all parties emphasized the social nature of science and the importance of protecting scientific integrity and standards. In modern times, these core values are more important than ever before, and we — as a community — will have to continuously reevaluate how to protect them.

References

[1] Bengio, Y., Ducharme, R. & Vincent, P. A neural probabilistic language model. Advances in neural information processing systems 13 (2000).
[2] Jurafsky, D. & Martin, J. H. Speech and language processing: an introduction to natural language processing, computational linguistics, and speech recognition (Pearson Prentice Hall, 2009).
[3] Brown, T. et al. Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020).
[4] Drori, I. et al. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. Proceedings of the National Academy of Sciences 119, e2123433119 (2022).
[5] Kocmi, T. & Federmann, C. Large language models are state-of-the-art evaluators of translation quality. In Nurminen, M. et al. (eds.) Proceedings of the 24th Annual Conference of the European Association for Machine Translation, 193–203 (European Association for Machine Translation, Tampere, Finland, 2023).
[6] Katz, D. M., Bommarito, M. J., Gao, S. & Arredondo, P. GPT-4 passes the bar exam. Available at SSRN 4389233 (2023).
[7] Eloundou, T., Manning, S., Mishkin, P. & Rock, D. GPTs are GPTs: An early look at the labor market impact potential of large language models. arXiv:2303.10130. Unpublished preprint (2023).
[8] Kasneci, E. et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences 103, 102274 (2023).
[9] Peres, R., Schreier, M., Schweidel, D. & Sorescu, A. On ChatGPT and beyond: How generative artificial intelligence may affect research, teaching, and practice. International Journal of Research in Marketing (2023).
[10] Lund, B. D. & Wang, T. Chatting about ChatGPT: how may AI and GPT impact academia and libraries? Library Hi Tech News 40, 26–29 (2023).
[11] Hill-Yardin, E. L., Hutchinson, M. R., Laycock, R. & Spencer, S. J. A chat (GPT) about the future of scientific publishing. Brain, Behavior, and Immunity 110, 152–154 (2023).
[12] Zheng, H. & Zhan, H. ChatGPT in scientific writing: a cautionary tale. The American Journal of Medicine (2023).
[13] Lund, B. D. et al. ChatGPT and a new academic reality: Artificial intelligence-written research papers and the ethics of the large language models in scholarly publishing. Journal of the Association for Information Science and Technology 74, 570–581 (2023).
[14] Transformer, G. G. P., Thunström, A. O. & Steingrimsson, S. Can GPT-3 write an academic paper on itself, with minimal human input? Unpublished (2022).
[15] Birhane, A., Kasirzadeh, A., Leslie, D. & Wachter, S. Science in the age of large language models. Nature Reviews Physics 1–4 (2023).
[16] Fecher, B., Hebing, M., Laufer, M., Pohle, J. & Sofsky, F. Friend or foe? Exploring the implications of large language models on the science system. arXiv:2306.09928. Unpublished preprint (2023).
[17] Stokel-Walker, C. & Van Noorden, R. What ChatGPT and generative AI mean for science, DOI: 10.1038/d41586-023-00340-6 (2023).
[18] Taylor, R. et al. Galactica: A large language model for science. arXiv:2211.09085. Unpublished preprint (2022).
[19] Embracing change and resetting expectations. https://unlocked.microsoft.com/ai-anthology/terence-tao/. Accessed: 2023-09-04.
[20] Heaven, W. D. Why Meta’s latest large language model survived only three days online (2022).
[21] Bender, E. M., Gebru, T., McMillan-Major, A. & Shmitchell, S. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM conference on fairness, accountability, and transparency, 610–623 (2021).
[22] Arkoudas, K. GPT-4 can’t reason. arXiv:2308.03762. Unpublished preprint (2023).
[23] Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120, e2305016120, DOI: 10.1073/pnas.2305016120 (2023). https://www.pnas.org/doi/pdf/10.1073/pnas.2305016120.
[24] Wulff, D. U. & Mata, R. Automated jingle–jangle detection: Using embeddings to tackle taxonomic incommensurability. PsyArXiv, DOI: 10.31234/osf.io/9h7aw (2023).
[25] Dillion, D., Tandon, N., Gu, Y. & Gray, K. Can AI language models replace human participants? Trends in Cognitive Sciences (2023).
[26] Hutson, M. Guinea pigbots. Science (New York, N.Y.) 381, 121–123, DOI: 10.1126/science.adj6791 (2023).
[27] Rozière, B. et al. Code Llama: Open foundation models for code. arXiv:2308.12950. Unpublished preprint (2023).
[28] Sanmarchi, F. et al. A step-by-step researcher’s guide to the use of an AI-based transformer in epidemiology: an exploratory analysis of ChatGPT using the STROBE checklist for observational studies. Journal of Public Health 1–36 (2023).
[29] Dehouche, N. Plagiarism in the age of massive generative pre-trained transformers (GPT-3). Ethics in Science and Environmental Politics 21, 17–23 (2021).
[30] Liang, P. P., Wu, C., Morency, L.-P. & Salakhutdinov, R. Towards understanding and mitigating social biases in language models. In International Conference on Machine Learning, 6565–6576 (PMLR, 2021).
[31] Coda-Forno, J. et al. Inducing anxiety in large language models increases exploration and bias. arXiv:2304.11111. Unpublished preprint (2023).
[32] Hutchinson, B. et al. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5491–5501, DOI: 10.18653/v1/2020.acl-main.487 (Association for Computational Linguistics, 2020).
[33] Carlini, N. et al. Extracting training data from large language models. In 30th USENIX Security Symposium (USENIX Security 21), 2633–2650 (2021).
[34] King, M. R. A place for large language models in scientific publishing, apart from credited authorship. Cellular and Molecular Bioengineering 1–4 (2023).
[35] Herbold, S., Hautli-Janisz, A., Heuer, U., Kikteva, Z. & Trautsch, A. AI, write an essay for me: A large-scale comparison of human-written versus ChatGPT-generated essays. arXiv:2304.14276. Unpublished preprint (2023).
[36] Poldrack, R. A., Lu, T. & Beguš, G. AI-assisted coding: Experiments with GPT-4. arXiv:2304.13187. Unpublished preprint (2023).
[37] Goyal, T., Li, J. J. & Durrett, G. News summarization and evaluation in the era of GPT-3. arXiv:2209.12356. Unpublished preprint (2022).
[38] Towards a transparent ai future: The call for less regulatory hurdles on open-source ai in europe. https://laion.ai/blog/transparent-ai/. Accessed: 2023-10-22.
[39] Liang, W. et al. Can large language models provide useful feedback on research papers? A large-scale empirical analysis. arXiv:2310.01783. Unpublished preprint (2023).
[40] Bender, E. M. & Koller, A. Climbing towards NLU: On meaning, form, and understanding in the age of data. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 5185–5198, DOI: 10.18653/v1/2020.acl-main.463 (Association for Computational Linguistics, Online, 2020).
[41] Wang, D., Wang, X. & Lv, S. An overview of end-to-end automatic speech recognition. Symmetry 11, DOI: 10.3390/sym11081018 (2019).
[42] Hartsuiker, R. J. & Moors, A. On the automaticity of language processing. In Schmid, H.-J. (ed.) Entrenchment and the Psychology of Language Learning: How We Reorganize and Adapt Linguistic Knowledge, DOI: 10.1037/15969-010 (American Psychological Association; De Gruyter Mouton, 2017).
[43] Kinney, R. M. et al. The semantic scholar open data platform. arXiv:2301.10140. Unpublished preprint (2023).
[44] Narayan, S., Cohen, S. B. & Lapata, M. Ranking sentences for extractive summarization with reinforcement learning. In Walker, M., Ji, H. & Stent, A. (eds.) Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 1747–1759, DOI: 10.18653/v1/N18-1158 (Association for Computational Linguistics, New Orleans, Louisiana, 2018).
[45] Hodel, D. & West, J. Response: Emergent analogical reasoning in large language models (2023). 2308.16118.
[46] Törnberg, P., Valeeva, D., Uitermark, J. & Bail, C. Simulating social media using large language models to evaluate alternative news feed algorithms (2023). 2310.05984.
[47] Argyle, L. P. et al. Out of one, many: Using language models to simulate human samples. Political Analysis 31, 337–351, DOI: 10.1017/pan.2023.2 (2023).
[48] Gilardi, F., Alizadeh, M. & Kubli, M. ChatGPT outperforms crowd workers for text-annotation tasks. Proceedings of the National Academy of Sciences 120, e2305016120, DOI: 10.1073/pnas.2305016120 (2023). https://www.pnas.org/doi/pdf/10.1073/pnas.2305016120.
[49] Conroy, G. How ChatGPT and other AI tools could disrupt scientific publishing. Nature 622, 234–236 (2023).
[50] Latour, B. & Woolgar, S. Laboratory life: The construction of scientific facts (Princeton university press, 2013).
[51] Partha, D. & David, P. A. Toward a new economics of science. Research Policy 23, 487–521 (1994).
[52] Amano, T. et al. The manifold costs of being a non-native English speaker in science. PLoS Biology 21, e3002184 (2023).
[53] Jumper, J. et al. Highly accurate protein structure prediction with AlphaFold. Nature 596, 583–589 (2021).
[54] Günther, F., Rinaldi, L. & Marelli, M. Vector-space models of semantic representation from a cognitive perspective: A discussion of common misconceptions. Perspectives on Psychological Science 14, 1006–1033 (2019).
[55] Golchin, S. & Surdeanu, M. Time travel in LLMs: Tracing data contamination in large language models. arXiv:2308.08493. Unpublished preprint (2023).
[56] Li, R. et al. StarCoder: may the source be with you! arXiv:2305.06161. Unpublished preprint (2023).
[57] Walters, W. H. & Wilder, E. I. Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports 13, 14045 (2023).
[58] Kocoń, J. et al. ChatGPT: Jack of all trades, master of none. Information Fusion 101861 (2023).
[59] Liu, H. et al. Evaluating the logical reasoning ability of ChatGPT and GPT-4. arXiv:2304.03439. Unpublished preprint (2023).
[60] Durmus, E. et al. Towards measuring the representation of subjective global opinions in language models. arXiv:2306.16388. Unpublished preprint (2023).
[61] Atari, M., Xue, M. J., Park, P. S., Blasi, D. & Henrich, J. Which humans? PsyArXiv. Unpublished preprint (2023).
[62] Santurkar, S. et al. Whose opinions do language models reflect? arXiv:2303.17548. Unpublished preprint (2023).
[63] Flaherty, C. The peer review crisis. https://www.insidehighered.com/news/2022/06/13/peer-review-crisis-creates-problems-journals-and-scholars. Accessed: 2023-10-30.
[64] Park, M., Leahey, E. & Funk, R. J. Papers and patents are becoming less disruptive over time. Nature 613, 138–144 (2023).
[65] Davies, A. et al. Advancing mathematics by guiding human intuition with AI. Nature 600, 70–74 (2021).
[66] Gabriel, I. & Ghazavi, V. The challenge of value alignment: From fairer algorithms to AI safety. In The Oxford Handbook of Digital Ethics, DOI: 10.1093/oxfordhb/9780198857815.013.18 (Oxford University Press).
[67] Silver, D. et al. Mastering the game of Go without human knowledge. Nature 550, 354–359 (2017).
[68] Lemos, P., Jeffrey, N., Cranmer, M., Ho, S. & Battaglia, P. Rediscovering orbital mechanics with machine learning. Machine Learning: Science and Technology 4, 045002 (2023).
[69] Botvinick, M. Have we lost our minds? https://medium.com/@matthew.botvinick/have-we-lost-our-minds-86d9125bd803. Accessed: 2023-10-30.
[70] Stiennon, N. et al. Learning to summarize with human feedback. Advances in Neural Information Processing Systems 33, 3008–3021 (2020).
[71] Longpre, S. et al. The flan collection: Designing data and methods for effective instruction tuning. In Proceedings of the 40th International Conference on Machine Learning, ICML’23 (JMLR.org, 2023).
[72] Taylor, F. W. The Principles of Scientific Management (Harper, 1913).

Acknowledgements

This work has been partially funded by the ERC (853489 - DEXIM; 101087053 - BraveNewWord), by the DFG (2064/1 – Project number 390727645), the BMBF (Tübingen AI Center, FKZ: 01IS18039A) and as part of the Excellence Strategy of the German Federal and State Governments.

Author contributions statement

Project administration: Marcel Binz, Stephan Alaniz
Project supervision: Zeynep Akata, Eric Schulz
Perspective leaders: Emily M. Bender, Marco Marelli, Matthew M. Botvinick, Eric Schulz
Perspectives/responses - original draft: Adina Roskies, Balazs Aczel, Carl T. Bergstrom, Emily M. Bender, Eric Schulz, Jevin D. West, Marco Marelli, Matthew M. Botvinick, Qiong Zhang
Perspectives/responses - review & editing: all authors
Introduction and conclusion - original draft: Marcel Binz, Stephan Alaniz
Introduction and conclusion - review & editing: Marcel Binz, Stephan Alaniz, Zeynep Akata, Eric Schulz