-
Reducing the time-step errors in diffusion Monte Carlo
Authors:
Tyler A. Anderson,
Manolo C. Per,
C. J. Umrigar
Abstract:
We modify the reweighting factor of the projector used in diffusion Monte Carlo to reduce the time-step error of the total energy. Further, we present a reweighting scheme that has the desirable feature that it is exactly size-consistent, i.e., the energy of a system containing widely separated fragments is the same as the sum of the energies of the individual fragments. The practical utility of the latter improvement is that it reduces the time-step error of the binding energies of some weakly interacting systems.
Submitted 15 March, 2024; v1 submitted 3 December, 2023;
originally announced December 2023.
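The size-consistency property named in the abstract above can be written compactly. The symbols below ($E_A$, $E_B$, $E_{AB}$, $R_{AB}$, $E_{\text{bind}}$) are notation assumed here for illustration and are not taken from the paper:

```latex
% Assumed notation: E_A and E_B are the energies of fragments A and B,
% E_{AB} is the energy of the combined system, and R_{AB} is the
% fragment separation.
\lim_{R_{AB}\to\infty} E_{AB} = E_A + E_B
% A binding energy defined as the difference
%   E_{\text{bind}} = E_A + E_B - E_{AB}
% then vanishes exactly for non-interacting fragments.
```

Under this reading, a scheme that satisfies the equality at every time step leaves no spurious time-step contribution in $E_{\text{bind}}$ for widely separated fragments, which is consistent with the abstract's remark about weakly interacting systems.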
-
Large Language Models Based Automatic Synthesis of Software Specifications
Authors:
Shantanu Mandal,
Adhrik Chethan,
Vahid Janfaza,
S M Farabi Mahmud,
Todd A Anderson,
Javier Turek,
Jesmin Jahan Tithi,
Abdullah Muzahid
Abstract:
Software configurations play a crucial role in determining the behavior of software systems. In order to ensure safe and error-free operation, it is necessary to identify the correct configurations, along with their valid bounds and rules, which are commonly referred to as software specifications. As software systems grow in complexity and scale, the number of configurations and associated specifications required to ensure the correct operation can become large and prohibitively difficult to manipulate manually. Due to the fast pace of software development, it is often the case that correct software specifications are not thoroughly checked or validated within the software itself. Rather, they are frequently discussed and documented in a variety of external sources, including software manuals, code comments, and online discussion forums. Therefore, it is hard for the system administrator to know the correct specifications of configurations due to the lack of clarity, organization, and a centralized unified source to look at. To address this challenge, we propose SpecSyn, a framework that leverages a state-of-the-art large language model to automatically synthesize software specifications from natural language sources. Our approach formulates software specification synthesis as a sequence-to-sequence learning problem and investigates the extraction of specifications from large contextual texts. This is the first work that uses a large language model for end-to-end specification synthesis from natural language texts. Empirical results demonstrate that our system outperforms the prior state-of-the-art specification synthesis tool by 21% in terms of F1 score and can find specifications from single as well as multiple sentences.
Submitted 17 April, 2023;
originally announced April 2023.
-
Synthesizing Programs with Continuous Optimization
Authors:
Shantanu Mandal,
Todd A. Anderson,
Javier Turek,
Justin Gottschlich,
Abdullah Muzahid
Abstract:
Automatic software generation based on some specification is known as program synthesis. Most existing approaches formulate program synthesis as a search problem with discrete parameters. In this paper, we present a novel formulation of program synthesis as a continuous optimization problem and use a state-of-the-art evolutionary approach, known as Covariance Matrix Adaptation Evolution Strategy, to solve the problem. We then propose a mapping scheme to convert the continuous formulation into actual programs. We compare our system, called GENESYS, with several recent program synthesis techniques (in both discrete and continuous domains) and show that GENESYS synthesizes more programs within a fixed time budget than those existing schemes. For example, for programs of length 10, GENESYS synthesizes 28% more programs than those existing schemes within the same time budget.
Submitted 3 April, 2023; v1 submitted 1 November, 2022;
originally announced November 2022.
-
Accurate energies of transition metal atoms, ions, and monoxides using selected configuration interaction and density-based basis-set corrections
Authors:
Yuan Yao,
Emmanuel Giner,
Tyler A. Anderson,
Julien Toulouse,
C. J. Umrigar
Abstract:
The semistochastic heat-bath configuration interaction (SHCI) method is a selected configuration interaction plus perturbation theory method that has provided near-full configuration interaction (FCI) levels of accuracy for many systems with both single- and multi-reference character. However, obtaining accurate energies in the complete basis set limit is hindered by the slow convergence of the FCI energy with respect to basis size. Here we show that the recently developed basis-set correction method based on range-separated density-functional theory can be used to significantly speed up basis-set convergence in SHCI calculations. In particular, we study two such schemes that differ in the functional used, and apply them to transition metal atoms and monoxides to obtain total, ionization, and dissociation energies well converged to the complete-basis-set limit within chemical accuracy.
Submitted 16 November, 2021; v1 submitted 21 September, 2021;
originally announced September 2021.
-
Nonlocal pseudopotentials and time-step errors in diffusion Monte Carlo
Authors:
Tyler A. Anderson,
C. J. Umrigar
Abstract:
We present a version of the T-moves approach for treating nonlocal pseudopotentials in diffusion Monte Carlo which has much smaller time-step errors than the existing T-moves approaches, while at the same time preserving desirable features such as the upper-bound property for the energy. In addition, we modify the reweighting factor of the projector used in diffusion Monte Carlo to reduce the time-step error. The latter is applicable not only to pseudopotential calculations but to all-electron calculations as well.
Submitted 27 July, 2021; v1 submitted 1 April, 2021;
originally announced April 2021.
-
The Ground State Electronic Energy of Benzene
Authors:
Janus J. Eriksen,
Tyler A. Anderson,
J. Emiliano Deustua,
Khaldoon Ghanem,
Diptarka Hait,
Mark R. Hoffmann,
Seunghoon Lee,
Daniel S. Levine,
Ilias Magoulas,
Jun Shen,
Norman M. Tubman,
K. Birgitta Whaley,
Enhua Xu,
Yuan Yao,
Ning Zhang,
Ali Alavi,
Garnet Kin-Lic Chan,
Martin Head-Gordon,
Wenjian Liu,
Piotr Piecuch,
Sandeep Sharma,
Seiichiro L. Ten-no,
C. J. Umrigar,
Jürgen Gauss
Abstract:
We report on the findings of a blind challenge devoted to determining the frozen-core, full configuration interaction (FCI) ground state energy of the benzene molecule in a standard correlation-consistent basis set of double-$ζ$ quality. As a broad international endeavour, our suite of wave function-based correlation methods collectively represents a diverse view of the high-accuracy repertoire offered by modern electronic structure theory. In our assessment, the evaluated high-level methods are all found to qualitatively agree on a final correlation energy, with most methods yielding an estimate of the FCI value around $-863$ m$E_{\text{H}}$. However, we find the root-mean-square deviation of the energies from the studied methods to be considerable (1.3 m$E_{\text{H}}$), which in light of the acclaimed performance of each of the methods for smaller molecular systems clearly displays the challenges faced in extending reliable, near-exact correlation methods to larger systems. While the discrepancies exposed by our study thus emphasize the fact that the current state-of-the-art approaches leave room for improvement, we still expect the present assessment to provide a valuable community resource for benchmark and calibration purposes going forward.
Submitted 7 October, 2020; v1 submitted 6 August, 2020;
originally announced August 2020.
-
Learning Fitness Functions for Machine Programming
Authors:
Shantanu Mandal,
Todd A. Anderson,
Javier S. Turek,
Justin Gottschlich,
Shengtian Zhou,
Abdullah Muzahid
Abstract:
The problem of automatic software generation is known as Machine Programming. In this work, we propose a framework based on genetic algorithms to solve this problem. Although genetic algorithms have been used successfully for many problems, one criticism is that hand-crafting its fitness function, the test that aims to effectively guide its evolution, can be notably challenging. Our framework presents a novel approach to learn the fitness function using neural networks to predict values of ideal fitness functions. We also augment the evolutionary process with a minimally intrusive search heuristic. This heuristic improves the framework's ability to discover correct programs from ones that are approximately correct and does so with negligible computational overhead. We compare our approach with several state-of-the-art program synthesis methods and demonstrate that it finds more correct programs with fewer candidate program generations.
Submitted 23 January, 2021; v1 submitted 22 August, 2019;
originally announced August 2019.
-
HiFrames: High Performance Data Frames in a Scripting Language
Authors:
Ehsan Totoni,
Wajih Ul Hassan,
Todd A. Anderson,
Tatiana Shpeisman
Abstract:
Data frames in scripting languages are essential abstractions for processing structured data. However, existing data frame solutions are either not distributed (e.g., Pandas in Python) and therefore have limited scalability, or they are not tightly integrated with array computations (e.g., Spark SQL). This paper proposes a novel compiler-based approach where we integrate data frames into the High Performance Analytics Toolkit (HPAT) to build HiFrames. It provides expressive and flexible data frame APIs which are tightly integrated with array operations. HiFrames then automatically parallelizes and compiles relational operations along with other array computations in end-to-end data analytics programs, and generates efficient MPI/C++ code. We demonstrate that HiFrames is significantly faster than alternatives such as Spark SQL on clusters, without forcing the programmer to switch to embedded SQL for part of the program. HiFrames is 3.6x to 70x faster than Spark SQL for basic relational operations, and can be up to 20,000x faster for advanced analytics operations, such as weighted moving averages (WMA), that the map-reduce paradigm cannot handle effectively. HiFrames is also 5x faster than Spark SQL for TPCx-BB Q26 on 64 nodes of the Cori supercomputer.
Submitted 7 April, 2017;
originally announced April 2017.
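The weighted moving average (WMA) mentioned in the HiFrames abstract above is a sliding-window operation: each output depends on several neighbouring inputs, which is why it fits poorly into a per-record map step. The sketch below is a plain, sequential Python illustration of the operation only. It does not use the HiFrames or HPAT API, and the function name `weighted_moving_average` and the sample data are hypothetical.

```python
def weighted_moving_average(values, weights):
    """Return the weighted moving average of `values` over a sliding window.

    `weights` is applied oldest-to-newest within each window and is
    normalized by its sum, so the weights need not add up to 1.
    """
    window = len(weights)
    if window == 0 or window > len(values):
        return []
    total = float(sum(weights))
    return [
        sum(w * v for w, v in zip(weights, values[i:i + window])) / total
        for i in range(len(values) - window + 1)
    ]


# Hypothetical sample data: four observations, window of three,
# with the most recent observation weighted most heavily.
prices = [10.0, 11.0, 12.0, 13.0]
wma = weighted_moving_average(prices, [1, 2, 3])
print(wma)
```

Each output element reads `len(weights)` consecutive inputs, so a distributed implementation has to exchange values across partition boundaries, the kind of neighbour access the abstract says the map-reduce paradigm handles poorly.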
-
Characterisation of speech diversity using self-organising maps
Authors:
Tom A. F. Anderson,
David M. W. Powers
Abstract:
We report investigations into speaker classification of larger quantities of unlabelled speech data using small sets of manually phonemically annotated speech. The Kohonen speech typewriter is a semi-supervised method comprised of self-organising maps (SOMs) that achieves low phoneme error rates. A SOM is a 2D array of cells that learn vector representations of the data based on neighbourhoods. In this paper, we report a method to evaluate pronunciation using multilevel SOMs with /hVd/ single syllable utterances for the study of vowels, for Australian pronunciation.
Submitted 23 January, 2017;
originally announced February 2017.
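The abstract above describes a SOM as a 2D array of cells that learn vector representations of the data based on neighbourhoods. A minimal sketch of that idea follows, assuming a generic textbook-style SOM with a Gaussian neighbourhood and linearly decaying learning rate. It is not the authors' multilevel method, and the names `train_som` and `best_matching_unit`, the parameter values, and the toy data are all hypothetical.

```python
import math
import random


def best_matching_unit(weights, x):
    """Return the (row, col) of the cell whose weight vector is closest to x."""
    best, best_d = (0, 0), float("inf")
    for r, row in enumerate(weights):
        for c, w in enumerate(row):
            d = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
            if d < best_d:
                best, best_d = (r, c), d
    return best


def train_som(data, rows, cols, epochs=20, lr0=0.5, seed=0):
    """Train a rows x cols SOM on `data` (a list of equal-length vectors)."""
    rng = random.Random(seed)
    dim = len(data[0])
    sigma0 = max(rows, cols) / 2.0
    # One randomly initialised weight vector per grid cell.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(cols)]
               for _ in range(rows)]
    total_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in rng.sample(data, len(data)):  # shuffled pass over the data
            frac = step / total_steps
            lr = lr0 * (1.0 - frac)                  # learning rate decays
            sigma = sigma0 * (1.0 - frac) + 1e-3     # neighbourhood shrinks
            br, bc = best_matching_unit(weights, x)
            for r in range(rows):
                for c in range(cols):
                    # Cells near the winner on the grid move more toward x.
                    grid_d2 = (r - br) ** 2 + (c - bc) ** 2
                    h = math.exp(-grid_d2 / (2.0 * sigma * sigma))
                    w = weights[r][c]
                    for k in range(dim):
                        w[k] += lr * h * (x[k] - w[k])
            step += 1
    return weights


# Toy 2D data standing in for acoustic feature vectors (hypothetical).
data = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.9], [0.85, 0.95]]
som = train_som(data, rows=3, cols=3)
print(best_matching_unit(som, [0.1, 0.1]), best_matching_unit(som, [0.9, 0.9]))
```

After training, an input is assigned to a cell with `best_matching_unit`. In a pronunciation study, the position of that cell on the grid would serve as the learned representation of the utterance's features.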
-
Vocabulary and the Brain: Evidence from Neuroimaging Studies
Authors:
Tom A. F. Anderson,
C.-H. Ruan
Abstract:
In summary of the research findings presented in this paper, various brain regions are correlated with vocabulary and vocabulary acquisition. Semantic associations for vocabulary seem to be located near brain areas that vary according to the type of vocabulary, e.g. ventral temporal regions important for words for things that can be seen. Semantic processing is believed to be strongly associated with the angular gyrus (ANG). Phonological ability has been closely related to the anterior surfaces of the supramarginal gyrus (SMG). Pathways through the posterior SMG are thought to link the anterior SMG and the ANG. In vocabulary tasks, mediotemporal structures may be related to long-term memory processing, with left hippocampal and parahippocampal regions related to long-term and working memory, respectively. Precentral structures are associated with phonological retrieval. Furthermore, many more regions of the brain are of interest in vocabulary tasks, particularly in areas important for visual and auditory processing. Finally, differences between brain anatomies can be attributed to vocabulary demands of different languages.
Submitted 30 November, 2016;
originally announced November 2016.
-
HPAT: High Performance Analytics with Scripting Ease-of-Use
Authors:
Ehsan Totoni,
Todd A. Anderson,
Tatiana Shpeisman
Abstract:
Big data analytics requires high programmer productivity and high performance simultaneously on large-scale clusters. However, current big data analytics frameworks (e.g. Apache Spark) have prohibitive runtime overheads since they are library-based. We introduce a novel auto-parallelizing compiler approach that exploits the characteristics of the data analytics domain such as the map/reduce parallel pattern and is robust, unlike previous auto-parallelization methods. Using this approach, we build High Performance Analytics Toolkit (HPAT), which parallelizes high-level scripting (Julia) programs automatically, generates efficient MPI/C++ code, and provides resiliency. Furthermore, it provides automatic optimizations for scripting programs, such as fusion of array operations. Thus, HPAT is 369x to 2033x faster than Spark on the Cori supercomputer and 20x to 256x faster on Amazon AWS.
Submitted 10 April, 2017; v1 submitted 15 November, 2016;
originally announced November 2016.
-
An Approach to Learning Research with a Wireless Sensor Network in an Outdoor Setting
Authors:
Tom Adam Frederic Anderson,
Yean-Fu Wen
Abstract:
Automated collection of environmental data may be accomplished with wireless sensor networks (WSNs). In this paper, a general discussion of WSNs is given for the gathering of data for educational research. WSNs have the capability to enhance the scope of a researcher to include multiple streams of data: environmental, location, cyberdata, video, and RFID. The location of data stored in a database can allow reconstruction of the learning activity for the evaluation of significance at a later time. A brief overview of the technology forms the basis of an exploration of a setting used for outdoor learning.
Submitted 5 May, 2008;
originally announced May 2008.