SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems
Abstract
Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a “system caption”, or SysCap. To address the lack of datasets of paired natural language SysCaps and simulation runs, we use large language models (LLMs) to synthesize high-quality captions. Using our framework, we train multimodal text and timeseries regression models for two real-world simulators of complex energy systems. Our experiments demonstrate the feasibility of designing language interfaces for real-world surrogate models at comparable accuracy to standard baselines. We qualitatively and quantitatively show that SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible. We will release the generated SysCaps datasets and our code to support follow-on studies.
1 Introduction
Data-driven surrogates enable computational scientists to efficiently predict the results of expensive numerical simulations that run on supercomputers [30, 7]. Surrogates are particularly valuable for emulating simulations of complex energy systems (CES), which model dynamic interactions between humans, earth systems, and infrastructure. Examples of CES include buildings [52, 9, 5], electric vehicle fleets [53], and microgrids [13]. Advancing the science of CES contributes to reducing emissions and accelerating the adoption of clean energy, which is needed to address the impacts of climate change. Surrogate models are not only intended for expert use. Surrogates are also needed to inform highly consequential policy and investment decisions about complex systems made by non-experts in industry and governments [41], such as when planning to build and deploy a new renewable energy system [21].
These types of surrogates perform a fairly standard regression task, predicting simulation output quantities of interest from a) an input system configuration and b) a deployment scenario. For example, we might want to predict a) the amount of energy a particular building will consume given b) a timeseries of weather variables spanning an entire year. In this case, this involves performing long sequence timeseries regression, which classic data-driven regression such as gradient-boosted decision trees have difficulty with [5, 60].
In this work, we aim to explore the design and analysis of language interfaces for such surrogates. Intuitively, a language interface to a surrogate model makes it more accessible, particularly for non-experts, by simplifying how we inspect and alter a complex system’s configuration. Language interfaces are powerful—they ground interactions between humans and machines in the human’s preferred way [51]. The idea of using language to create interfaces for complex data or models is not new [23, 39], and interest has renewed due to the success of large language models (LLMs) and their demonstrated ability to generate high-quality synthetic natural “captions” [46, 12, 36, 22]. Our work defines a “system caption”, or SysCap, as text-based descriptions of knowledge about the system being simulated. In this work, we focus on black-box settings where the only available knowledge about the system is its configuration, found in simulator input files as lists of attributes.
However, it is unclear whether textual inputs, and particularly natural language, is suitable for real-world regression tasks. System attributes are sets of both discrete (categorical, binary, or string) and continuous (numeric) variables, i.e., tabular data. Previous work demonstrated inconclusive evidence when using pretrained language models to do tabular regression from text-encoded inputs, with and without modifications to the architecture [11, 27, 4, 57], motivating further study. Indeed, regression with text-encoded tabular inputs promises multiple advantages. First, it avoids cumbersome feature engineering (e.g., one-hot encodings) for tabular data. Second, it flexibly handles variable-length inputs, such as when certain system attributes are unknown at test time. Third, using text to represent tabular data increases generalizability—intuitively, attribute names and values have semantic information that can be exploited with the help of pretrained language embeddings [22, 57].
Our paper introduces a framework for training lightweight multimodal surrogates for CES with text (for system attributes) and timeseries inputs (for the deployment scenario) and makes contributions towards addressing the following technical challenges:
-
•
Given the lack of human-labeled natural language descriptions of complex systems, we describe a data collection pipeline that uses an LLM to generate high-quality natural language SysCaps from metadata files of CES. We observe that LLMs possess broad knowledge about CES, thus, minimal prompting is required to produce conversational SysCaps.
-
•
We introduce a simple and lightweight multimodal surrogate model architecture that a) fuses text embeddings obtained from pretrained language models (LMs) with b) timeseries encoded by a bidirectional sequence encoder to c) regress a timeseries output. We expect this to be insightful for future multimodal text and timeseries studies.
-
•
We develop an automatic evaluation strategy to assess caption quality–specifically, we estimate the rate at which ground truth attributes appear in the synthetic description with a multiclass attribute classifier.
Our experiments are based on two real-world CES simulators of building energy demand and wind farm wake. We rigorously evaluate accuracy and generalization beyond the capabilities of traditional regression approaches by quantifying robustness to variable-length inputs and paraphrasing (e.g., attribute synonyms). We also qualitatively showcase how text interfaces enable flexible handling of missing attributes and rapid design space exploration via the use of captions. As there are no standard benchmarks for comparing surrogate modeling performance for CES, we will open-source all data and code and contribute the generated SysCaps datasets to facilitate future work.
2 Related Work
Language interfaces for scientific machine learning: A growing body of work is similarly exploring creating language interfaces for advanced scientific machine learning (SciML) tasks, including protein representation learning [56], protein design [32], and activity prediction for drug discovery [47]. LLM-powered natural language interfaces are also being designed for complicated scientific workflows including synchrotron management [38], automated chemistry labs [6], and fluid dynamics workflows [29]. Our work adds to this growing body of literature by studying language interfaces for surrogate models of complex systems.
Large language models for regression: Another line of work asks whether LLMs can perform regression with both language inputs and outputs (numbers encoded as tokens), such as for tabular problems [11] or black-box optimization [48, 33]. These studies use templated text inputs, whereas our work additionally seeks to answer whether natural language is also viable as an interface for regression. Also, our framework directly predicts continuous outputs instead of tokens. Moreover, one study found mixed results compared to simple gradient-boosted tree baselines and difficulty with interpolation [11], raising questions about the effectiveness of this direction. LLMs are also expensive to evaluate, which makes them impractical for scientific applications where the surrogate is intended to be called many times. We only use LLMs to create synthetic training data and instead employ lightweight LMs such as DistilBERT [45] to encode SysCaps.
Text and timeseries multimodal models: Outside of limited prior work on timeseries forecasting with text covariate inputs for taxi demand [44] and financial data [14], most work on multimodal text and timeseries modeling is in audio generation. Here, LLMs have also recently been used to synthesize captions for various tasks [12, 37]. Models for text-guided audio and music generation [1, 26, 31] are pretrained with a contrastive objective that aligns embeddings of captions that describe the audio and music. This differs from our setting, where the SysCaps describes the complex system being simulated, and the timeseries inputs are covariates for the target variable (the simulator outputs).
Knowledge-enhanced PDE surrogates: Numerical simulation of partial differential equations (PDEs) is extremely computationally intensive, and thus a large body of SciML work is focused on develo** neural surrogates which are fast to evaluate. Closely related work has recently tried to encode knowledge about the PDE into a neural surrogate to facilitate generalization within and across families of PDEs. Methods that embed equation parameters (i.e., the system attributes) within the architecture to generalize to unseen parameters include CAPE [49] and those explored in Gupta & Brandstetter [19]. Other approaches embed structural knowledge about the PDE equation into the surrogate model architecture [40, 59] or in the loss [42]. Concurrent work has explored “PDE captions” [34, 58], which are a type of SysCaps for neural PDE surrogates where the system knowledge is PDE equations encoded as text.
3 Problem Statement
Our goal is to learn a surrogate that regresses the outputs of a simulator directly from its inputs. More formally, we are given a dataset of pairs of simulator inputs and outputs. The inputs are the system deployment scenario (a timeseries) and the tabular system configuration inputs . The outputs are a timeseries . For simplicity, we consider only univariate timeseries outputs in this work (), although the number of timesteps may be large (potentially thousands of steps), the timeseries inputs are multivariate, and the map** which approximates the simulator is highly nonlinear.
To summarize, we have a timeseries regression problem modeled as . By conditioning the surrogate on system knowledge , it can potentially generalize to new system configurations. However, learning transferable representations of variable-length, heterogeneous input features such as is notoriously difficult for deep neural networks, and is a key focus of tabular deep learning (see survey [3]). In our work, we develop and analyze a framework for learning multimodal surrogates where is encoded as a text (using templates as well as conversational natural language, Section 4).
Some simulators may have inputs that are not clearly distinguishable into what is and , for example, if a dynamical system simulation is configured to be in steady-state or assumes fixed exogenous conditions. In these cases, we allow to be a vector of real-valued scalars (a timeseries with ), or, simply an empty set (leaving only ).
Example: In many CES, the timeseries are exogenous inputs to the system such as weather timeseries consisting of temperature or wind speed. Attributes of a wind farm might include the number of turbines in the wind farm and turbine blade length.
4 Synthesizing System Captions (SysCaps) with LLMs
Our work is motivated by the idea that language interfaces for surrogates represent a path towards improving the accessibility of these models for expert and non-expert users, e.g., when using them for downstream system design tasks [51]. In this section, we describe two approaches for converting system attributes into text: key-value templates and natural language.
For the key-value approach, attributes are described as key-value pairs key:value and joined by a separator “” (SysCaps-kv). For example, if a simulation has attributes A=1.0 and B=blue, we create the string A:1.0|B:blue. Generating these strings is easy to do and incurs a negligible amount of extra computational overhead. In the natural language approach (SysCaps-nl, Figure 1), attributes are described in a conversational manner, which we believe is more flexible and expressive than key-value captions and thereby more accessible for non-experts. However, we do not have access to large quantities of natural language descriptions for each system and simulation. We avoid the time-consuming task of enlisting domain experts to create this data by instead prompting a powerful LLM to generate synthetic natural language descriptions given attributes. The details of the prompt are provided next. In our work, we use the open-source LLM llama-2-7b-chat [50].
Prompt design: We append a carefully written instruction template to a list of system attributes to help guide the LLM in generating a caption via prompting (see Figure 1). The system prompt is: You are a <CES> expert who provides <CES> descriptions <STYLE>. The user prompt is: Write a <CES> description based on the following attributes. Your answer should be <NUM> sentences. Please note that your response should NOT be a list of attributes and should be entirely based on the information provided. The last part is added to discourage the LLM from changing or omitting attributes. The tags <CES>, <STYLE>, <NUM> are filled in with the CES type (e.g., buildings), the style of the description (e.g, with an objective tone), and the number of sentences to use in the description (e.g., “4-6”), respectively.
Attribute subset selection: Simulations of real-world systems may have attributes that only weakly correlate with the output quantity of interest, or have a large number of attributes, which can be challenging for deep learning approaches. Since the length of a SysCap is proportional to the number of attributes, the computational burden incurred by text-based encodings of attributes can grow significantly in these cases. In these cases, reducing the number of attributes can be handled with classic feature selection methods such as recursive feature elimination (RFE) [20] or by recommendations from domain experts, as a pre-processing step.
5 Text and Timeseries Surrogate Model
We now describe a lightweight multimodal surrogate model for timeseries regression. The surrogate (Figure 2) is a composition of a multimodal encoder function and a top model , where for simplicity, the model parameters are shared across timesteps to predict each timeseries output . The training objective is to minimize the expected mean square error averaged over simulation timesteps,
(1) |
Although more sophisticated loss functions than Eq. 1 could be used that account for predictive uncertainty, we left this extension for future work to simplify our exposition and experiments.
Multimodal encoder : A text encoder extracts an embedding from a SysCap with a pretrained LM, then broadcasts and concatenates this text embedding with the timeseries inputs to create a multimodal feature vector for each simulation timestep. These features get processed by a bidirectional sequence encoder to produce a sequence of time-dependent multimodal features, , which are finally used to regress outputs.
Text encoder : To encode textual inputs we use pretrained models such as DistilBERT [45] and BERT [10]. We use each model’s default pretrained tokenizer. Tokenized sequences are bracketed by [CLS] and [EOS] tokens, and we use the final activation at the [CLS] token position to produce a text embedding . Following standard fine-tuning practices, all layers for BERT are fine-tuned while only the last layer of DistilBERT is fine-tuned.
Bidirectional sequence encoder : We broadcast the text embedding to create a sequence of length , , and concatenate each with the timeseries input , . This simplifies the task of learning timestep-specific correlations between system attributes and timeseries in the multimodal encoder . To efficiently embed long timeseries with thousands of timesteps, we explore both bidirectional LSTMs [25] and bidirectional SSMs [16] for . Our bidirectional SSM uses stacks of S4 [18] blocks without downpooling layers. We use the last layer’s hidden states as temporal features for the top model. If or for non-sequential surrogate models, we instead use an MLP with residual layers (ResNet MLP) to embed each per-timestep to get .
Top model : The multimodal encoder produces feature vectors . For simplicity, the output at each timestep is predicted from by a shared MLP with a single hidden layer.
6 Experiments
Caption length (13 attributes) | Accuracy (%) |
---|---|
Short | 88.90 |
Medium | 90.90 |
Long | 90.38 |
SysCaps length | NRMSE |
---|---|
Short | 0.57 0.02 |
Medium | 0.53 0.01 |
Long | 0.64 0.02 |
Buildings type Synonym W/ Synonym W/out Building Type FullServiceRestaurant FineDiningRestaurant 0.52 0.05 0.93 0.01 RetailStripmall Shop**Center 0.01 0.00 0.68 0.02 Warehouse StorageFacility 0.35 0.30 0.55 0.31 RetailStandalone ConvenienceStore 0.00 0.01 0.30 0.04 SmallOffice Co-WorkingSpace 0.03 0.01 0.02 0.02 PrimarySchool ElementarySchool 0.00 0.01 0.38 0.02 MediumOffice Workplace 0.08 0.02 0.03 0.04 SecondarySchool HighSchool -0.01 0.04 0.52 0.06 Outpatient MedicalClinic 0.02 0.01 0.56 0.09 QuickServiceRestaurant FastFoodRestaurant 0.10 0.07 0.83 0.01 LargeOffice OfficeTower 0.12 0.13 0.23 0.03 LargeHotel Five-Star Hotel 0.03 0.01 0.46 0.06 SmallHotel Motel 0.26 0.07 0.88 0.07 Hospital HealthcareFacility 0.03 0.04 0.62 0.12
Setup: Our main experiments focus on training building stock surrogate models for the building energy simulator EnergyPlus [8]. Given an annual hourly weather timeseries ( = 8,760) with 7 variables and a list of tabular building attributes, surrogates predict the building’s energy consumption at each hour of the year. Each building initially has 17 attributes; using RFE with a tuned LightGBM [28] model, we selected the 13 most important attributes. We use the commercial building split of the Buildings-900K dataset [15], which are building stock simulation runs for all commercial buildings in the United States. Since this dataset only has the energy timeseries, we extracted the building configuration and weather timeseries from the End-Use Load Profiles database [55] for each building. Our training set is comprised of 330K buildings, and we use 100 buildings for validation, and 6K held-out buildings for testing. We also reserved a held-out set of 10K buildings for RFE. We carefully tune the hyperparameters of all models (details in the Appendix).
We created three SysCaps datasets: a “medium” caption length dataset where <NUM> “4-6”, a “short” dataset using 2-3 sentences and a “long” dataset using 7-9 sentences. The SSMs in our experiments are trained with medium captions. Generating these datasets with llama-2-7b-chat used 1.5K GPU hours on a cluster with 16 NVIDIA A100-40GB GPUs.
Evaluating SysCaps quality: The LLM that generates natural language SysCaps may erroneously ignore or hallucinate attributes, which can negatively impact downstream performance. We evaluate the generated captions by estimating the fraction of attributes which the LLM successfully includes per caption. To compute this metric, we train a multi-class classifier to predict each categorical attribute in a SysCaps from its text embedding. The rate of missing or incorrect attributes is around 9-12% across the “short”, “medium”, and “long” caption types, with “short” captions having the highest error. This increases our confidence that our LLM-based approach for generating natural language SysCaps preserves sufficient information for surrogate modeling.
6.1 Accuracy On Held-Out Systems
Following Emami et al. [15], we use the normalized root mean square error (NRMSE) metric to compare model accuracy, averaged across 3 random seeds. In addition to comparing the ResNet MLP, LSTM, and SSM, we also trained a tuned LightGBM baseline. Does the sequential architecture matter? Yes—Figure 3 shows that the LSTM and SSM encoders outperform both the ResNet and our carefully tuned LightGBM baseline, and the SSM outperforms the LSTM. How do different system attribute encodings compare? First, we ablate the importance of encoding the system attributes by training an SSM baseline with these inputs removed (SSM/X); this model is unable to learn this task. Surprisingly, the SSM with text templates achieves comparable test accuracy on held-out systems to the SSM with one-hot inputs. We initially expected to see a non-negligible drop in regression accuracy even for text template inputs, because the text encoder compresses the caption into a single embedding vector; however, the DistilBERT encoder is sufficiently expressive to mitigate this. The SSM with natural language SysCaps has slightly worse accuracy than the LSTM model with key-value SysCaps, yet comfortably outperforms the non-sequential models, including LightGBM. We believe the performance gap between key-value templates and natural language SysCaps is mostly explained by the caption quality (Table 2). To check whether a more powerful text encoder than DistilBERT improves accuracy, we replace it with BERT. We saw only a minor difference in the building-hourly NRMSE (Fig. 3(a)), but a large improvement from 0.069 to 0.038 in stock-annual NRMSE (Fig. 3(b)), which is comparable to the SSM with one-hot inputs.
6.2 Caption Generalization
Length generalization: We assess how accuracy varies when surrogates are provided with natural language SysCaps having different lengths than seen during training. We evaluate zero-shot generalization to the short and long captions. The results (Table 2) show a small increase in error for shorter captions with a larger increase in error for longer captions, as might be expected. The error on long captions remains lower than the error achieved by our tuned LightGBM baseline.
Attribute synonyms: To quantitatively evaluate the extent to which natural language SysCaps surrogates gain a level of robustness to distribution shifts such as word order changes, synonyms, or writing style [24], we created captions for the held-out systems where the “building type” attribute is replaced by a synonym. We avoid biasing the choice of building type synonym by 5-shot prompting llama-2 to suggest the synonyms. For a control, we compare the synonym caption accuracy against accuracy on a caption with the building type attribute removed. Examples and results are shown in Table 3, where for 11/13 building type synonyms the increase in NRMSE is less than 12%, while the average increase in NRMSE for the control is 54%.
6.3 Robustness to Missing Information
Figures 4–6 show how a SysCaps model performs with missing attribute information in the natural language caption. We did not train the model on any captions with missing attributes. To create these figures, we progressively added more information to a caption template (Figure 4). Prediction accuracy improves as more information is given; notably, there is a large jump in accuracy once the building square footage is known, which is to be expected (Figure 6). This experiment reveals an intriguing connection between SysCaps with missing system information and traditional low-fidelity mathematical surrogate models. When attribute values are unknown, such as during the early stages of a design process, the SysCaps surrogate is naturally less accurate; similarly, a low-fidelity surrogates which uses a “coarse” mathematical approximation of a complex system sacrifices accuracy for computational efficiency.
6.4 Sensitivity Analysis From Natural Language
We visualize in Figure 6 a demonstration of using a SysCaps surrogate to conduct a sensitivity analysis on two system attributes, as might be performed for an early-stage design space exploration task. We use the caption template from Section 6.3 to create captions for each test building that enumerate all combinations of the number of stories and square footage attributes, totaling 160 configurations; the entire analysis requires simulating 960K buildings. We observe that the model has indeed learned physically plausible relationships between these two attributes. The model fails to predict the energy usage for buildings over 100K square feet—such buildings are in the “long tail” of the training data distribution.
6.5 Prompt Augmentation: Wind Farm Wake
Model | NRMSE |
---|---|
LightGBM | 0.196 |
one-hot | 0.212 0.009 |
SysCaps-kv | 0.054 0.024 |
SysCaps-nl | 0.036 0.001 |
This experiment uses the Wind Farm Wake Modeling Dataset [43], made with the FLORIS simulator, to train a surrogate to predict a wind farm’s power generation in steady-state atmospheric conditions. The difficulty of this task is in modeling losses due to wake effects, given only a coarse description of the wind farm layout. There are three numeric simulator inputs specifying atmospheric conditions, and five system attributes which include categorical variables indicating wind farm shape (there are four different layout types), number of turbines, and average turbine spacing (we do not use RFE). In this dataset, there are only 500 unique system configurations (split 3:1:1 for train, val, test), although each configuration is simulated under 500 distinct atmospheric conditions.
We explore generating multiple captions for each system configuration through prompt augmentation to increase diversity. Specifically, we replace the <STYLE> tag in the prompt with phrases encouraging different description styles, e.g., with an objective tone, with an objective tone (creative paraphrasing is acceptable), to a colleague, and to a classroom. The simulation is run assuming steady-state conditions (i.e., time-independent), so we tune hyperparameters for and train the non-sequential ResNet models. The ResNet baseline with one-hot encoded attribute inputs suffers from severe overfitting (Table 4), likely due to the small number (300) of training systems, whereas the SysCaps models generalize better to unseen systems. This suggests SysCaps can have a regularizing effect in small data settings. Notably, the prompt augmentation helps the natural language SysCaps model to achieve the lowest NRMSE.
7 Discussion
Our experiments demonstrate that, at comparable or minimal losses in accuracy with respect to standard feature engineering (e.g., one-hot encoding), real-world surrogate models can be augmented language interfaces. For a problem with only a small number of training systems available, the language inputs (with prompt augmentation) have a regularizing effect (Sec. 6.5) and actually outperformed standard feature engineering. We qualitatively and quantitatively showed that SysCaps handle missing attributes well. SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible with the standard encoding approach.
Limitations: Current BERT-style tokenizers struggle with numerical values [54]; for one example, they interpolate poorly to unseen numbers (Fig. 6). For another, because llama-2-7b-chat tends to add a comma to large numbers (e.g., ) when generating SysCaps, we found that our models failed to understand large numbers without commas (high error). Orthogonal research on improving number encodings for language model inputs [17, 57] can benefit our framework. Another potential concern is with creating SysCaps for simulators with a large (e.g., over 100) attributes. To get high quality captions, a more powerful LLM than llama-2-7b-chat may be needed.
Broader impacts: Improving the design of surrogate models for CES has the potential to accelerate the transition to cleaner energy sources. To avoid unfair outcomes from decisions made with CES surrogates, care should be taken when deciding what simulation runs to use as training data.
8 Conclusion
In this work, we introduced a learning framework for training multimodal text and timeseries surrogate models for simulations of complex energy systems such as buildings and wind farms, and described how we use LLMs to synthesize natural language descriptions of such systems, which we call SysCaps. Our findings underscore that language is a viable interface for real-world surrogate models with only minimal losses (and occasionally gains) in accuracy. Surrogates with natural language SysCaps are robust to missing system attributes and paraphrasing (e.g., synonyms of attributes).
A future extension of this work might explore how to use language to also interface with the timeseries simulator inputs, possibly through summary statistics. For example, to study how the complex system behaves when the average exogenous temperature is increased by five degrees. Moreover, an important question is how we might create surrogate foundation models that generalize not only across system configurations for a single simulator, but also generalize across different simulators.
Acknowledgments and Disclosure of Funding
This work was authored by the National Renewable Energy Laboratory (NREL), operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. This work was supported by the Laboratory Directed Research and Development (LDRD) Program at NREL. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes. The research was performed using computational resources sponsored by the Department of Energy’s Office of Energy Efficiency and Renewable Energy and located at the National Renewable Energy Laboratory.
References
- Agostinelli et al. [2023] Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
- Akiba et al. [2019] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp. 2623–2631, 2019.
- Badaro et al. [2023] Badaro, G., Saeed, M., and Papotti, P. Transformers for tabular data representation: A survey of models and applications. Transactions of the Association for Computational Linguistics, 11:227–249, 2023.
- Bellamy et al. [2023] Bellamy, D. R., Kumar, B., Wang, C., and Beam, A. Labrador: Exploring the limits of masked language modeling for laboratory data. arXiv preprint arXiv:2312.11502, 2023.
- Bhavsar et al. [2023] Bhavsar, S., Pitchumani, R., Reynolds, M., Merket, N., and Reyna, J. Machine learning surrogate of physics-based building-stock simulator for end-use load forecasting. pp. 113395, 2023. ISSN 03787788. doi: 10.1016/j.enbuild.2023.113395. URL https://linkinghub.elsevier.com/retrieve/pii/S0378778823006254.
- Bran et al. [2023] Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A., and Schwaller, P. Augmenting large language models with chemistry tools. In NeurIPS 2023 AI for Science Workshop, 2023.
- Carter et al. [2023] Carter, J., Feddema, J., Kothe, D., Neely, R., Pruet, J., Stevens, R., Balaprakash, P., Beckman, P., Foster, I., Iskra, K., et al. Advanced research directions on ai for science, energy, and security: Report on summer 2022 workshops. 2023.
- Crawley et al. [2001] Crawley, D. B., Lawrie, L. K., Winkelmann, F. C., Buhl, W. F., Huang, Y. J., Pedersen, C. O., Strand, R. K., Liesen, R. J., Fisher, D. E., Witte, M. J., et al. Energyplus: creating a new-generation building energy simulation program. Energy and buildings, 33(4):319–331, 2001.
- Dai et al. [2023] Dai, T.-Y., Niyogi, D., and Nagy, Z. Citytft: Temporal fusion transformer for urban building energy modeling. ArXiv preprint, abs/2312.02375, 2023. URL https://arxiv.longhoe.net/abs/2312.02375.
- Devlin et al. [2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Dinh et al. [2022] Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.-y., Papailiopoulos, D., and Lee, K. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
- Doh et al. [2023] Doh, S., Choi, K., Lee, J., and Nam, J. LP-MusicCaps: LLM-based pseudo music captioning. ArXiv preprint, abs/2307.16372, 2023. URL https://arxiv.longhoe.net/abs/2307.16372.
- Du & Li [2019] Du, Y. and Li, F. Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning. IEEE Transactions on Smart Grid, 11(2):1066–1076, 2019.
- Emami et al. [2023a] Emami, H., Dang, X.-H., Shah, Y., and Zerfos, P. Modality-aware transformer for time series forecasting. arXiv preprint arXiv:2310.01232, 2023a.
- Emami et al. [2023b] Emami, P., Sahu, A., and Graf, P. Buildingsbench: A large-scale dataset of 900k buildings and benchmark for short-term load forecasting. Advances in Neural Information Processing Systems, 2023b.
- Goel et al. [2022] Goel, K., Gu, A., Donahue, C., and Ré, C. It’s raw! audio generation with state-space models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 7616–7633. PMLR, 2022. URL https://proceedings.mlr.press/v162/goel22a.html.
- Golkar et al. [2023] Golkar, S., Pettee, M., Eickenberg, M., Bietti, A., Cranmer, M., Krawezik, G., Lanusse, F., McCabe, M., Ohana, R., Parker, L., et al. xval: A continuous number encoding for large language models. arXiv preprint arXiv:2310.02989, 2023.
- Gu et al. [2021] Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
- Gupta & Brandstetter [2022] Gupta, J. K. and Brandstetter, J. Towards multi-spatiotemporal-scale generalized pde modeling. arXiv preprint arXiv:2209.15616, 2022.
- Guyon et al. [2002] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using support vector machines. Machine learning, 46:389–422, 2002.
- Harrison-Atlas et al. [2024] Harrison-Atlas, D., Glaws, A., King, R. N., and Lantz, E. Artificial intelligence-aided wind plant optimization for nationwide evaluation of land use and economic benefits of wake steering. Nature Energy, pp. 1–15, 2024.
- Hegselmann et al. [2023] Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pp. 5549–5581. PMLR, 2023.
- Hendrix et al. [1978] Hendrix, G. G., Sacerdoti, E. D., Sagalowicz, D., and Slocum, J. Develo** a natural language interface to complex data. ACM Transactions on Database Systems (TODS), 3(2):105–147, 1978.
- Hendrycks et al. [2020] Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan, R., and Song, D. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020.
- Hochreiter & Schmidhuber [1997] Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
- Huang et al. [2022] Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., and Ellis, D. P. W. MuLan: A joint embedding of music audio and natural language, 2022. URL https://arxiv.longhoe.net/abs/2208.12415.
- Jablonka et al. [2024] Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A., and Smit, B. Leveraging large language models for predictive chemistry. Nature Machine Intelligence, pp. 1–9, 2024.
- Ke et al. [2017] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017.
- Kumar et al. [2023] Kumar, V., Gleyzer, L., Kahana, A., Shukla, K., and Karniadakis, G. E. Mycrunchgpt: A llm assisted framework for scientific machine learning. Journal of Machine Learning for Modeling and Computing, 4(4), 2023.
- Lavin et al. [2021] Lavin, A., Krakauer, D., Zenil, H., Gottschlich, J., Mattson, T., Brehmer, J., Anandkumar, A., Choudry, S., Rocki, K., Baydin, A. G., et al. Simulation intelligence: Towards a new generation of scientific methods. ArXiv preprint, abs/2112.03235, 2021. URL https://arxiv.longhoe.net/abs/2112.03235.
- Liu et al. [2023a] Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., and Plumbley, M. D. Audioldm: Text-to-audio generation with latent diffusion models. ArXiv preprint, abs/2301.12503, 2023a. URL https://arxiv.longhoe.net/abs/2301.12503.
- Liu et al. [2023b] Liu, S., Zhu, Y., Lu, J., Xu, Z., Nie, W., Gitter, A., Xiao, C., Tang, J., Guo, H., and Anandkumar, A. A text-guided protein design framework. arXiv preprint arXiv:2302.04611, 2023b.
- Liu et al. [2024] Liu, T., Astorga, N., Seedat, N., and van der Schaar, M. Large language models to enhance bayesian optimization. International Conference on Learning Representations, 2024.
- Lorsung et al. [2024] Lorsung, C., Li, Z., and Barati Farimani, A. Physics informed token transformer for solving partial differential equations. Machine Learning: Science and Technology, 2024.
- Loshchilov & Hutter [2017] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. International Conference on Learning Representations, 2017.
- Mei et al. [2023a] Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M. D., Zou, Y., and Wang, W. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395, 2023a.
- Mei et al. [2023b] Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M. D., Zou, Y., and Wang, W. WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. ArXiv preprint, abs/2303.17395, 2023b. URL https://arxiv.longhoe.net/abs/2303.17395.
- Potemkin et al. [2023] Potemkin, D., Soto, C., Li, R., Yager, K., and Tsai, E. Virtual scientific companion for synchrotron beamlines: A prototype. ArXiv preprint, abs/2312.17180, 2023. URL https://arxiv.longhoe.net/abs/2312.17180.
- Quamar et al. [2022] Quamar, A., Efthymiou, V., Lei, C., Özcan, F., et al. Natural language interfaces to data. 11(4):319–414, 2022.
- Rackauckas et al. [2020] Rackauckas, C., Ma, Y., Martensen, J., Warner, C., Zubov, K., Supekar, R., Skinner, D., Ramadhan, A., and Edelman, A. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020.
- Rackauckas & Abdelrehim [2024] Rackauckas, C. V. and Abdelrehim, A. Scientific machine learning (sciml) surrogates for industry, part 1: The guiding questions. https://doi.org/10.31219/osf.io/p95zn, 2024. Accessed: 2024-03-22.
- Raissi et al. [2019] Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
- Ramos et al. [2023] Ramos, D., Glaws, A., King, R., , and Harrison-Atlas, D. Flow redirection and induction in steady state (floris) wind plant power production data sets, 2023. URL https://data.openei.org/submissions/5884.
- Rodrigues et al. [2019] Rodrigues, F., Markou, I., and Pereira, F. C. Combining time-series and textual data for taxi demand prediction in event areas: A deep learning approach. Information Fusion, 49:120–129, 2019.
- Sanh et al. [2019] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv preprint, abs/1910.01108, 2019. URL https://arxiv.longhoe.net/abs/1910.01108.
- Schick & Schütze [2021] Schick, T. and Schütze, H. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6943–6951, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.555. URL https://aclanthology.org/2021.emnlp-main.555.
- Seidl et al. [2023] Seidl, P., Vall, A., Hochreiter, S., and Klambauer, G. Enhancing activity prediction models in drug discovery with the ability to understand human language. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp. 30458–30490. PMLR, 2023. URL https://proceedings.mlr.press/v202/seidl23a.html.
- Song et al. [2024] Song, X., Li, O., Lee, C., Peng, D., Perel, S., Chen, Y., et al. Omnipred: Language models as universal regressors. ArXiv preprint, abs/2402.14547, 2024. URL https://arxiv.longhoe.net/abs/2402.14547.
- Takamoto et al. [2023] Takamoto, M., Alesiani, F., and Niepert, M. Learning neural pde solvers with parameter-guided channel attention. In International Conference on Machine Learning, pp. 33448–33467. PMLR, 2023.
- Touvron et al. [2023] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL https://arxiv.longhoe.net/abs/2307.09288.
- Vaithilingam et al. [2024] Vaithilingam, P., Arawjo, I., and Glassman, E. L. Imagining a future of designing with ai: Dynamic grounding, constructive negotiation, and sustainable motivation. arXiv preprint arXiv:2402.07342, 2024.
- Vazquez-Canteli et al. [2019] Vazquez-Canteli, J., Demir, A. D., Brown, J., and Nagy, Z. Deep neural networks as surrogate models for urban energy simulations. In Journal of Physics: Conference Series, volume 1343, pp. 012002. IOP Publishing, 2019.
- Vepsäläinen et al. [2019] Vepsäläinen, J., Otto, K., Lajunen, A., and Tammi, K. Computationally efficient model for energy demand prediction of electric city bus in varying operating conditions. Energy, 169:433–443, 2019.
- Wallace et al. [2019] Wallace, E., Wang, Y., Li, S., Singh, S., and Gardner, M. Do NLP models know numbers? probing numeracy in embeddings. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 5307–5315, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1534. URL https://aclanthology.org/D19-1534.
- [55] Wilson, E. J., Parker, A., Fontanini, A., Present, E., Reyna, J. L., Adhikari, R., Bianchi, C., CaraDonna, C., Dahlhausen, M., Kim, J., et al. End-use load profiles for the us building stock: Methodology and results of model calibration, validation, and uncertainty quantification. Technical report, National Renewable Energy Lab (NREL). URL https://www.nrel.gov/docs/fy22osti/80889.pdf.
- Xu et al. [2023] Xu, M., Yuan, X., Miret, S., and Tang, J. Protst: multi-modality learning of protein sequences and biomedical texts. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
- Yan et al. [2024] Yan, J., Zheng, B., Xu, H., Zhu, Y., Chen, D., Sun, J., Wu, J., and Chen, J. Making pre-trained language models great on tabular prediction. arXiv preprint arXiv:2403.01841, 2024.
- Yang et al. [2024] Yang, L., Liu, S., and Osher, S. J. Fine-tune language models as multi-modal differential equation solvers. arXiv preprint arXiv:2308.05061v4, 2024.
- Ye et al. [2024] Ye, Z., Huang, X., Chen, L., Liu, H., Wang, Z., and Dong, B. Pdeformer: Towards a foundation model for one-dimensional partial differential equations. arXiv preprint arXiv:2402.12652, 2024.
- Zhang et al. [2021] Zhang, L., Plathottam, S., Reyna, J., Merket, N., Sayers, K., Yang, X., Reynolds, M., Parker, A., Wilson, E., Fontanini, A., Roberts, D., and Muehleisen, R. High-resolution hourly surrogate modeling framework for physics-based large-scale building stock modeling. 75:103292, 2021. ISSN 22106707. doi: 10.1016/j.scs.2021.103292. URL https://linkinghub.elsevier.com/retrieve/pii/S2210670721005680.
Appendix A Additional Experiment Details
We use the normalized root mean square error (NRMSE) to capture the accuracy of the surrogate model. NRMSE is also known as (CV)RMSE.
The building-hour NRMSE is where the NRMSE is normalized by building and by hour, where is the number of buildings in the building stock and is the number of hours in a year:
(2) |
The stock-annual NRMSE, is where the NRMSE is normalized by the annual stock energy consumption:
(3) |
We use the AdamW [35] optimizer with = 0.9, = 0.98, = 1e-9, and weight decay of 0.01 for all experiments. The early stop** patience is 50 for all experiments. All models are trained with a single NVIDIA A100-40GB GPU. The longest training runs take 1-2 days and the shortest 2-3 hours.
A.1 Hyperparameter sweeps
A.2 Buildings
Model | Hyperparameter | Grid search space | Best values |
LightGBM | Learning rate | From 0.01 to 0.1 | 0.066 |
Number of leaves | From 40 to 150 | 149 | |
Subsample | From 0.05 to 1.0 | 0.178 | |
Feature fraction | From 0.05 to 1.0 | 0.860 | |
Min number of data in one leaf | From 1 to 100 | 12 | |
ResNet (one-hot) | Hidden layers size | 256, 1024, 2048 | 1024 |
Number of layers | 2, 8 | 2 | |
Batch size | 128, 256, 512 | 512 | |
Learning rate | 0.0001, 0.0003, 0.001 | 0.0003 | |
ResNet (SysCaps) | Hidden layers size | 256, 1024, 2048 | 256 |
Number of layers | 2, 8 | 8 | |
Batch size | 128, 256, 512 | 256 | |
Learning rate | 0.0001, 0.0003, 0.001 | 0.0003 | |
Bidirectional LSTM (one-hot) | [Hidden layer size, Batch size] | [128, 64], [512, 32], [1024, 32] | [128, 64] |
Number of layers | 1, 3, 4, 6, 8 | 4 | |
MLP dimension | 256 | 256 | |
Learning rate | 0.00001, 0.0003, 0.001 | 0.001 | |
Bidirectional LSTM (SysCaps) | [Hidden layer size, Batch size] | [128, 64], [512, 32], [1024, 32] | [512, 32] |
Number of layers | 1, 3, 4, 6, 8 | 1 | |
MLP dimension | 256 | 256 | |
Learning rate | 0.00001, 0.0003, 0.001 | 0.0003 | |
Bidirectional S4 (one-hot) | [Hidden layer size, Num. layers] | [64,8] , [128,4] | [128,4] |
MLP dimension | 256 | 256 | |
Batch size | 32, 64 | 64 | |
Learning rate | 1e-5, 3e-4, 1e-3 | 3e-4 | |
Bidirectional S4 (SysCaps) | [Hidden layer size, Num. layers] | [64,8] , [128,4] | [128, 4] |
MLP dimension | 256 | 256 | |
Batch size | 32, 64 | 32 | |
Learning rate | 1e-5, 3e-4, 1e-3 | 3e-4 |
There are 13 attributes after RFE, which are one-hot encoded into a 336-dimensional feature vector, whereas the text embeddings are 768-dimensional. Along with the 7 weather variables, we concatenate cyclically encoded calendar features [15]. This creates a 103-dimensional input for the bidirectional sequence encoders.
LightGBM: As LightGBM does not support batch training out of the box, the entire training data needs to be loaded into the memory to train a LightGBM model. With the train dataset containing 340k buildings, each with 8759 hours and 347 features, we randomly extract 438 hours per building (which is about 5% of total hours) to limit memory usage. This results in hours in total for the train dataset, which consumes about 380 GB of memory when being loaded into a NumPy object. For the validation and test splits, we retain the full number of hours per building. The LightGBM model is tuned with Optuna [2] across 30 trials and achieves the best validation NRMSE of 0.667.
A.2.1 Wind farm
Model | Hyperparameter | Grid search space | Best values |
LightGBM | Learning rate | From 0.01 to 0.1 | 0.039 |
Number of leaves | From 40 to 120 | 108 | |
Subsample | From 0.6 to 1.0 | 0.963 | |
Feature fraction | From 0.6 to 1.0 | 0.997 | |
Min number of data in one leaf | From 20 to 100 | 96 | |
ResNet (one-hot) | Hidden layers size | [256,1024] | 256 |
Number of layers | [2,8] | 2 | |
Batch size | [128,256] | 128 | |
Learning rate | [1e-5, 3e-4, 1e-3] | 1e-5 | |
ResNet (SysCaps-kv) | Hidden layers size | [256,1024] | 1024 |
Number of layers | [2,8] | 8 | |
Batch size | [128,256] | 256 | |
Learning rate | [1e-5,3e-4,1e-3] | 3e-4 | |
ResNet (SysCaps-nl) | Hidden layers size | [256,1024] | 1024 |
Number of layers | [2,8] | 8 | |
Batch size | [128,256] | 256 | |
Learning rate | [1e-5,3e-4,1e-3] | 1e-5 |
LightGBM: The training, validation, and test split for the wind dataset gives us datasets of size 148,650, 49,600, and 49,250 respectively with 190 features after one-hot encoding. The training dataset is loaded into memory to train the LightGBM model. We use Optuna [2] to tune the hyperparameters and the best validation NMRSE achieved is 0.189.