-
ChipNeMo: Domain-Adapted LLMs for Chip Design
Authors:
Mingjie Liu,
Teodor-Dumitru Ene,
Robert Kirby,
Chris Cheng,
Nathaniel Pinckney,
Rongjian Liang,
Jonah Alben,
Himyanshu Anand,
Sanmitra Banerjee,
Ismet Bayraktaroglu,
Bonita Bhaskaran,
Bryan Catanzaro,
Arjun Chaudhuri,
Sharon Clay,
Bill Dally,
Laura Dang,
Parikshit Deshpande,
Siddhanth Dhodhi,
Sameer Halepete,
Eric Hill,
Jiashang Hu,
Sumit Jain,
Ankit **dal,
Brucek Khailany,
George Kokai
, et al. (17 additional authors not shown)
Abstract:
ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models. We e…
▽ More
ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we instead adopt the following domain adaptation techniques: domain-adaptive tokenization, domain-adaptive continued pretraining, model alignment with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis. Our evaluations demonstrate that domain-adaptive pretraining of language models, can lead to superior performance in domain related downstream tasks compared to their base LLaMA2 counterparts, without degradations in generic capabilities. In particular, our largest model, ChipNeMo-70B, outperforms the highly capable GPT-4 on two of our use cases, namely engineering assistant chatbot and EDA scripts generation, while exhibiting competitive performance on bug summarization and analysis. These results underscore the potential of domain-specific customization for enhancing the effectiveness of large language models in specialized applications.
△ Less
Submitted 4 April, 2024; v1 submitted 31 October, 2023;
originally announced November 2023.
-
An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags
Authors:
Christian D. Newman,
Michael J. Decker,
Reem S. AlSuhaibani,
Anthony Peruma,
Satyajit Mohapatra,
Tejal Vishnoi,
Marcos Zampieri,
Mohamed W. Mkaouer,
Timothy J. Sheldon,
Emily Hill
Abstract:
This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine-learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE…
▽ More
This paper presents an ensemble part-of-speech tagging approach for source code identifiers. Ensemble tagging is a technique that uses machine-learning and the output from multiple part-of-speech taggers to annotate natural language text at a higher quality than the part-of-speech taggers are able to obtain independently. Our ensemble uses three state-of-the-art part-of-speech taggers: SWUM, POSSE, and Stanford. We study the quality of the ensemble's annotations on five different types of identifier names: function, class, attribute, parameter, and declaration statement at the level of both individual words and full identifier names. We also study and discuss the weaknesses of our tagger to promote the future amelioration of these problems through further research. Our results show that the ensemble achieves 75\% accuracy at the identifier level and 84-86\% accuracy at the word level. This is an increase of +17\% points at the identifier level from the closest independent part-of-speech tagger.
△ Less
Submitted 1 September, 2021;
originally announced September 2021.
-
Solving Heterogeneous General Equilibrium Economic Models with Deep Reinforcement Learning
Authors:
Edward Hill,
Marco Bardoscia,
Arthur Turrell
Abstract:
General equilibrium macroeconomic models are a core tool used by policymakers to understand a nation's economy. They represent the economy as a collection of forward-looking actors whose behaviours combine, possibly with stochastic effects, to determine global variables (such as prices) in a dynamic equilibrium. However, standard semi-analytical techniques for solving these models make it difficul…
▽ More
General equilibrium macroeconomic models are a core tool used by policymakers to understand a nation's economy. They represent the economy as a collection of forward-looking actors whose behaviours combine, possibly with stochastic effects, to determine global variables (such as prices) in a dynamic equilibrium. However, standard semi-analytical techniques for solving these models make it difficult to include the important effects of heterogeneous economic actors. The COVID-19 pandemic has further highlighted the importance of heterogeneity, for example in age and sector of employment, in macroeconomic outcomes and the need for models that can more easily incorporate it. We use techniques from reinforcement learning to solve such models incorporating heterogeneous agents in a way that is simple, extensible, and computationally efficient. We demonstrate the method's accuracy and stability on a toy problem for which there is a known analytical solution, its versatility by solving a general equilibrium problem that includes global stochasticity, and its flexibility by solving a combined macroeconomic and epidemiological model to explore the economic and health implications of a pandemic. The latter successfully captures plausible economic behaviours induced by differential health risks by age.
△ Less
Submitted 31 March, 2021;
originally announced March 2021.
-
On the Generation, Structure, and Semantics of Grammar Patterns in Source Code Identifiers
Authors:
Christian D. Newman,
Reem S. AlSuhaibani,
Michael J. Decker,
Anthony Peruma,
Dishant Kaushik,
Mohamed Wiem Mkaouer,
Emily Hill
Abstract:
Identifiers make up a majority of the text in code. They are one of the most basic mediums through which developers describe the code they create and understand the code that others create. Therefore, understanding the patterns latent in identifier naming practices and how accurately we are able to automatically model these patterns is vital if researchers are to support developers and automated a…
▽ More
Identifiers make up a majority of the text in code. They are one of the most basic mediums through which developers describe the code they create and understand the code that others create. Therefore, understanding the patterns latent in identifier naming practices and how accurately we are able to automatically model these patterns is vital if researchers are to support developers and automated analysis approaches in comprehending and creating identifiers correctly and optimally. This paper investigates identifiers by studying sequences of part-of-speech annotations, referred to as grammar patterns. This work advances our understanding of these patterns and our ability to model them by 1) establishing common naming patterns in different types of identifiers, such as class and attribute names; 2) analyzing how different patterns influence comprehension; and 3) studying the accuracy of state-of-the-art techniques for part-of-speech annotations, which are vital in automatically modeling identifier naming patterns, in order to establish their limits and paths toward improvement. To do this, we manually annotate a dataset of 1,335 identifiers from 20 open-source systems and use this dataset to study naming patterns, semantics, and tagger accuracy.
△ Less
Submitted 15 July, 2020;
originally announced July 2020.
-
Counting Value Sets: Algorithm and Complexity
Authors:
Qi Cheng,
Joshua E. Hill,
Daqing Wan
Abstract:
Let $p$ be a prime. Given a polynomial in $\F_{p^m}[x]$ of degree $d$ over the finite field $\F_{p^m}$, one can view it as a map from $\F_{p^m}$ to $\F_{p^m}$, and examine the image of this map, also known as the value set. In this paper, we present the first non-trivial algorithm and the first complexity result on computing the cardinality of this value set. We show an elementary connection betwe…
▽ More
Let $p$ be a prime. Given a polynomial in $\F_{p^m}[x]$ of degree $d$ over the finite field $\F_{p^m}$, one can view it as a map from $\F_{p^m}$ to $\F_{p^m}$, and examine the image of this map, also known as the value set. In this paper, we present the first non-trivial algorithm and the first complexity result on computing the cardinality of this value set. We show an elementary connection between this cardinality and the number of points on a family of varieties in affine space. We then apply Lauder and Wan's $p$-adic point-counting algorithm to count these points, resulting in a non-trivial algorithm for calculating the cardinality of the value set. The running time of our algorithm is $(pmd)^{O(d)}$. In particular, this is a polynomial time algorithm for fixed $d$ if $p$ is reasonably small. We also show that the problem is #P-hard when the polynomial is given in a sparse representation, $p=2$, and $m$ is allowed to vary, or when the polynomial is given as a straight-line program, $m=1$ and $p$ is allowed to vary. Additionally, we prove that it is NP-hard to decide whether a polynomial represented by a straight-line program has a root in a prime-order finite field, thus resolving an open problem proposed by Kaltofen and Koiran in \cite{Kaltofen03,KaltofenKo05}.
△ Less
Submitted 4 November, 2011;
originally announced November 2011.