SysCaps: Language Interfaces for Simulation Surrogates of Complex Systems

Patrick Emami, Zhaonan Li, Saumya Sinha, Truc Nguyen
National Renewable Energy Lab
{Patrick.Emami, Zhaonan.Li, Saumya.Sinha,Truc.Nguyen}@nrel.gov
Equal contribution.
Abstract

Data-driven simulation surrogates help computational scientists study complex systems. They can also help inform impactful policy decisions. We introduce a learning framework for surrogate modeling where language is used to interface with the underlying system being simulated. We call a language description of a system a “system caption”, or SysCap. To address the lack of datasets of paired natural language SysCaps and simulation runs, we use large language models (LLMs) to synthesize high-quality captions. Using our framework, we train multimodal text and timeseries regression models for two real-world simulators of complex energy systems. Our experiments demonstrate the feasibility of designing language interfaces for real-world surrogate models at comparable accuracy to standard baselines. We qualitatively and quantitatively show that SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible. We will release the generated SysCaps datasets and our code to support follow-on studies.

1 Introduction

Data-driven surrogates enable computational scientists to efficiently predict the results of expensive numerical simulations that run on supercomputers [30, 7]. Surrogates are particularly valuable for emulating simulations of complex energy systems (CES), which model dynamic interactions between humans, earth systems, and infrastructure. Examples of CES include buildings [52, 9, 5], electric vehicle fleets [53], and microgrids [13]. Advancing the science of CES contributes to reducing emissions and accelerating the adoption of clean energy, which is needed to address the impacts of climate change. Surrogate models are not only intended for expert use. Surrogates are also needed to inform highly consequential policy and investment decisions about complex systems made by non-experts in industry and governments [41], such as when planning to build and deploy a new renewable energy system [21].

These types of surrogates perform a fairly standard regression task, predicting simulation output quantities of interest from a) an input system configuration and b) a deployment scenario. For example, we might want to predict a) the amount of energy a particular building will consume given b) a timeseries of weather variables spanning an entire year. This case involves long sequence timeseries regression, which classic data-driven regression methods such as gradient-boosted decision trees have difficulty with [5, 60].

In this work, we aim to explore the design and analysis of language interfaces for such surrogates. Intuitively, a language interface to a surrogate model makes it more accessible, particularly for non-experts, by simplifying how we inspect and alter a complex system’s configuration. Language interfaces are powerful: they ground interactions between humans and machines in the human’s preferred way [51]. The idea of using language to create interfaces for complex data or models is not new [23, 39], and interest has renewed due to the success of large language models (LLMs) and their demonstrated ability to generate high-quality synthetic natural “captions” [46, 12, 36, 22]. Our work defines a “system caption”, or SysCap, as a text-based description of knowledge about the system being simulated. In this work, we focus on black-box settings where the only available knowledge about the system is its configuration, found in simulator input files as lists of attributes.

However, it is unclear whether textual inputs, and particularly natural language, are suitable for real-world regression tasks. System attributes are sets of both discrete (categorical, binary, or string) and continuous (numeric) variables, i.e., tabular data. Previous work demonstrated inconclusive evidence when using pretrained language models to do tabular regression from text-encoded inputs, with and without modifications to the architecture [11, 27, 4, 57], motivating further study. Indeed, regression with text-encoded tabular inputs promises multiple advantages. First, it avoids cumbersome feature engineering (e.g., one-hot encodings) for tabular data. Second, it flexibly handles variable-length inputs, such as when certain system attributes are unknown at test time. Third, using text to represent tabular data increases generalizability: intuitively, attribute names and values have semantic information that can be exploited with the help of pretrained language embeddings [22, 57].

Our paper introduces a framework for training lightweight multimodal surrogates for CES with text (for system attributes) and timeseries inputs (for the deployment scenario) and makes contributions towards addressing the following technical challenges:

  • Given the lack of human-labeled natural language descriptions of complex systems, we describe a data collection pipeline that uses an LLM to generate high-quality natural language SysCaps from metadata files of CES. We observe that LLMs possess broad knowledge about CES, thus, minimal prompting is required to produce conversational SysCaps.

  • We introduce a simple and lightweight multimodal surrogate model architecture that a) fuses text embeddings obtained from pretrained language models (LMs) with b) timeseries encoded by a bidirectional sequence encoder to c) regress a timeseries output. We expect this to be insightful for future multimodal text and timeseries studies.

  • We develop an automatic evaluation strategy to assess caption quality: specifically, we estimate the rate at which ground truth attributes appear in the synthetic description with a multiclass attribute classifier.

Our experiments are based on two real-world CES simulators of building energy demand and wind farm wake. We rigorously evaluate accuracy and generalization beyond the capabilities of traditional regression approaches by quantifying robustness to variable-length inputs and paraphrasing (e.g., attribute synonyms). We also qualitatively showcase how text interfaces enable flexible handling of missing attributes and rapid design space exploration via the use of captions. As there are no standard benchmarks for comparing surrogate modeling performance for CES, we will open-source all data and code and contribute the generated SysCaps datasets to facilitate future work.

2 Related Work

Language interfaces for scientific machine learning: A growing body of work is similarly exploring creating language interfaces for advanced scientific machine learning (SciML) tasks, including protein representation learning [56], protein design [32], and activity prediction for drug discovery [47]. LLM-powered natural language interfaces are also being designed for complicated scientific workflows including synchrotron management [38], automated chemistry labs [6], and fluid dynamics workflows [29]. Our work adds to this growing body of literature by studying language interfaces for surrogate models of complex systems.

Large language models for regression: Another line of work asks whether LLMs can perform regression with both language inputs and outputs (numbers encoded as tokens), such as for tabular problems [11] or black-box optimization [48, 33]. These studies use templated text inputs, whereas our work additionally seeks to answer whether natural language is also viable as an interface for regression. Also, our framework directly predicts continuous outputs instead of tokens. Moreover, one study found mixed results compared to simple gradient-boosted tree baselines and difficulty with interpolation [11], raising questions about the effectiveness of this direction. LLMs are also expensive to evaluate, which makes them impractical for scientific applications where the surrogate is intended to be called many times. We only use LLMs to create synthetic training data and instead employ lightweight LMs such as DistilBERT [45] to encode SysCaps.

Text and timeseries multimodal models: Outside of limited prior work on timeseries forecasting with text covariate inputs for taxi demand [44] and financial data [14], most work on multimodal text and timeseries modeling is in audio generation. Here, LLMs have also recently been used to synthesize captions for various tasks [12, 37]. Models for text-guided audio and music generation [1, 26, 31] are pretrained with a contrastive objective that aligns embeddings of captions that describe the audio and music. This differs from our setting, where the SysCaps describes the complex system being simulated, and the timeseries inputs are covariates for the target variable (the simulator outputs).

Knowledge-enhanced PDE surrogates: Numerical simulation of partial differential equations (PDEs) is extremely computationally intensive, and thus a large body of SciML work is focused on developing neural surrogates which are fast to evaluate. Closely related work has recently tried to encode knowledge about the PDE into a neural surrogate to facilitate generalization within and across families of PDEs. Methods that embed equation parameters (i.e., the system attributes) within the architecture to generalize to unseen parameters include CAPE [49] and those explored in Gupta & Brandstetter [19]. Other approaches embed structural knowledge about the PDE equation into the surrogate model architecture [40, 59] or in the loss [42]. Concurrent work has explored “PDE captions” [34, 58], which are a type of SysCaps for neural PDE surrogates where the system knowledge is PDE equations encoded as text.

3 Problem Statement

Our goal is to learn a surrogate $f:\mathcal{X}\times\mathcal{Z}\rightarrow\mathcal{Y}$ that regresses the outputs of a simulator $F$ directly from its inputs. More formally, we are given a dataset $D$ of pairs of simulator inputs and outputs. The inputs are the system deployment scenario $x_{1:T}\in\mathcal{X}$ (a timeseries) and the tabular system configuration inputs $z\in\mathcal{Z}$. The outputs are a timeseries $y_{1:T}\in\mathcal{Y}$. For simplicity, we consider only univariate timeseries outputs in this work ($y_{t}\in\mathbb{R}$), although the number of timesteps $T$ may be large (potentially thousands of steps), the timeseries inputs $\mathcal{X}$ are multivariate, and the mapping $f$ which approximates the simulator is highly nonlinear.

To summarize, we have a timeseries regression problem modeled as $y_{1:T}=f(x_{1:T},z)$. By conditioning the surrogate on system knowledge $z$, it can potentially generalize to new system configurations. However, learning transferable representations of variable-length, heterogeneous input features such as $z$ is notoriously difficult for deep neural networks, and is a key focus of tabular deep learning (see survey [3]). In our work, we develop and analyze a framework for learning multimodal surrogates where $z$ is encoded as text (using templates as well as conversational natural language, Section 4).

Some simulators may have inputs that are not clearly distinguishable into what is $\mathcal{X}$ and $\mathcal{Z}$, for example, if a dynamical system simulation is configured to be in steady-state or assumes fixed exogenous conditions. In these cases, we allow $\mathcal{X}$ to be a vector of real-valued scalars (a timeseries with $T=1$), or simply an empty set (leaving only $\mathcal{Z}$).

Example: In many CES, the timeseries $\mathcal{X}$ are exogenous inputs to the system such as weather timeseries consisting of temperature or wind speed. Attributes $\mathcal{Z}$ of a wind farm might include the number of turbines in the wind farm and turbine blade length.

4 Synthesizing System Captions (SysCaps) with LLMs

Figure 1: An example of generating a SysCap by strategically prompting an LLM with system attributes that are represented as key-value pairs. The resulting caption is able to capture the provided attributes in a natural conversational language.

Our work is motivated by the idea that language interfaces for surrogates represent a path towards improving the accessibility of these models for expert and non-expert users, e.g., when using them for downstream system design tasks [51]. In this section, we describe two approaches for converting system attributes into text: key-value templates and natural language.

For the key-value approach, attributes are described as key-value pairs key:value and joined by a separator “|” (SysCaps-kv). For example, if a simulation has attributes A=1.0 and B=blue, we create the string A:1.0|B:blue. Generating these strings is easy to do and incurs a negligible amount of extra computational overhead. In the natural language approach (SysCaps-nl, Figure 1), attributes are described in a conversational manner, which we believe is more flexible and expressive than key-value captions and thereby more accessible for non-experts. However, we do not have access to large quantities of natural language descriptions for each system and simulation. We avoid the time-consuming task of enlisting domain experts to create this data by instead prompting a powerful LLM to generate synthetic natural language descriptions given attributes. The details of the prompt are provided next. In our work, we use the open-source LLM llama-2-7b-chat [50].
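The key-value encoding can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `to_kv_caption` is hypothetical, and the attributes are the toy example from the text.

```python
def to_kv_caption(attributes: dict) -> str:
    """Encode attributes as 'key:value' pairs joined by the '|' separator."""
    return "|".join(f"{key}:{value}" for key, value in attributes.items())


# Toy attributes from the text: A=1.0 and B=blue.
caption = to_kv_caption({"A": 1.0, "B": "blue"})
print(caption)  # A:1.0|B:blue
```

Because Python dictionaries preserve insertion order, the attribute order in the caption follows the order in which attributes were read from the simulator input file.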

Prompt design: We append a carefully written instruction template to a list of system attributes to help guide the LLM in generating a caption via prompting (see Figure 1). The system prompt is: You are a <CES> expert who provides <CES> descriptions <STYLE>. The user prompt is: Write a <CES> description based on the following attributes. Your answer should be <NUM> sentences. Please note that your response should NOT be a list of attributes and should be entirely based on the information provided. The last part is added to discourage the LLM from changing or omitting attributes. The tags <CES>, <STYLE>, <NUM> are filled in with the CES type (e.g., buildings), the style of the description (e.g., with an objective tone), and the number of sentences to use in the description (e.g., “4-6”), respectively.
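As a rough sketch of how the tags might be filled in, the snippet below builds the system and user prompts from the templates quoted above. The helper names (`fill_tags`, `build_prompt`), the example attribute names and values, and the way the attribute list is joined to the instruction are assumptions made for illustration; the text does not specify the exact layout.

```python
SYSTEM_TEMPLATE = "You are a <CES> expert who provides <CES> descriptions <STYLE>."
USER_TEMPLATE = (
    "Write a <CES> description based on the following attributes. "
    "Your answer should be <NUM> sentences. "
    "Please note that your response should NOT be a list of attributes "
    "and should be entirely based on the information provided."
)


def fill_tags(template: str, ces: str, style: str, num: str) -> str:
    """Substitute the <CES>, <STYLE>, and <NUM> tags in a template."""
    return (
        template.replace("<CES>", ces)
        .replace("<STYLE>", style)
        .replace("<NUM>", num)
    )


def build_prompt(attributes: dict, ces: str, style: str, num: str) -> dict:
    """Return system and user prompts; the attribute layout is an assumption."""
    attribute_lines = "\n".join(f"{k}: {v}" for k, v in attributes.items())
    return {
        "system": fill_tags(SYSTEM_TEMPLATE, ces, style, num),
        "user": fill_tags(USER_TEMPLATE, ces, style, num) + "\n" + attribute_lines,
    }


# Hypothetical attributes, used only to exercise the helper.
prompt = build_prompt(
    {"building_type": "Warehouse", "sqft": 50000},
    ces="building",
    style="with an objective tone",
    num="4-6",
)
print(prompt["system"])
print(prompt["user"])
```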

Attribute subset selection: Simulations of real-world systems may have attributes that only weakly correlate with the output quantity of interest, or have a large number of attributes, which can be challenging for deep learning approaches. Since the length of a SysCap is proportional to the number of attributes, the computational burden incurred by text-based encodings of attributes can grow significantly in these cases. The number of attributes can be reduced in a pre-processing step, either with classic feature selection methods such as recursive feature elimination (RFE) [20] or by following recommendations from domain experts.
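To make the RFE idea concrete, here is a small NumPy sketch on synthetic data. It is an illustration under stated assumptions rather than the pipeline used in the experiments: features are ranked by the magnitude of standardized least-squares coefficients (a simple stand-in for a tree-based importance score), and the function name and toy data are hypothetical.

```python
import numpy as np


def recursive_feature_elimination(X, y, n_keep):
    """Refit and drop the weakest feature until only n_keep remain."""
    # Standardize so that coefficient magnitudes are comparable across features.
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    y = y - y.mean()
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_keep:
        coef, *_ = np.linalg.lstsq(X[:, remaining], y, rcond=None)
        weakest = int(np.argmin(np.abs(coef)))  # least important feature
        remaining.pop(weakest)
    return remaining


# Synthetic data: only features 0 and 3 influence the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + 0.01 * rng.normal(size=500)

selected = recursive_feature_elimination(X, y, n_keep=2)
print(selected)  # indices of the retained features
```

The same loop structure applies with any model that exposes a per-feature importance score; only the ranking step changes.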

5 Text and Timeseries Surrogate Model

Figure 2: Building blocks of our surrogate model, $f=h_{\theta}\circ g_{\psi}$, that includes a multimodal encoder, $g_{\psi}$, and a top model, $h_{\theta}$. The multimodal encoder, $g_{\psi}=g_{\psi}^{\mathsf{seq}}\circ g_{\psi}^{\mathsf{text}}$, is a composition of a text encoder, $g_{\psi}^{\mathsf{text}}$, and a bidirectional sequence encoder, $g_{\psi}^{\mathsf{seq}}$, for timeseries inputs.

We now describe a lightweight multimodal surrogate model for timeseries regression. The surrogate $f$ (Figure 2) is a composition of a multimodal encoder function $g_{\psi}:(\mathcal{Z},\mathcal{X})\rightarrow\{\mathbb{R}^{d}\}_{1:T}$ and a top model $h_{\theta}:\mathbb{R}^{d}\rightarrow\mathbb{R}$, where for simplicity, the model parameters $\theta$ are shared across timesteps to predict each timeseries output $y_{t}$. The training objective is to minimize the expected mean square error averaged over simulation timesteps,

$$\min_{\theta,\psi}\mathbb{E}_{(z,x_{1:T},y_{1:T})\sim D}\Biggl[\frac{1}{T}\sum_{t=1}^{T}\bigl([h_{\theta}(g_{\psi}(z,x_{1:T}))]_{t}-y_{t}\bigr)^{2}\Biggr]. \tag{1}$$

Although more sophisticated loss functions than Eq. 1 could be used that account for predictive uncertainty, we left this extension for future work to simplify our exposition and experiments.
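For readers who prefer code, the objective in Eq. 1 can be estimated on a mini-batch as follows. This is a minimal NumPy sketch; the function name and the toy arrays are illustrative and not taken from the paper's code.

```python
import numpy as np


def timestep_averaged_mse(y_pred, y_true):
    """Mini-batch estimate of Eq. 1 for arrays of shape (batch, T)."""
    per_series = np.mean((y_pred - y_true) ** 2, axis=1)  # (1/T) * sum over t
    return float(np.mean(per_series))  # average over sampled simulations


# Two toy simulations with T = 3 timesteps each.
y_true = np.array([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
y_pred = np.array([[1.0, 2.0, 5.0], [1.0, 1.0, 1.0]])

loss = timestep_averaged_mse(y_pred, y_true)
print(loss)  # (4/3 + 1) / 2 = 7/6
```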

Multimodal encoder $g_{\psi}$: A text encoder $g_{\psi}^{\mathsf{text}}$ extracts an embedding $\hat{z}$ from a SysCap $z$ with a pretrained LM, then broadcasts and concatenates this text embedding with the timeseries inputs to create a multimodal feature vector for each simulation timestep. These features get processed by a bidirectional sequence encoder $g_{\psi}^{\mathsf{seq}}$ to produce a sequence of time-dependent multimodal features, $e_{1:T}=g_{\psi}^{\mathsf{seq}}(g_{\psi}^{\mathsf{text}}(z),x_{1:T})$, which are finally used to regress outputs.

Text encoder $g_{\psi}^{\mathsf{text}}$: To encode textual inputs we use pretrained models such as DistilBERT [45] and BERT [10]. We use each model’s default pretrained tokenizer. Tokenized sequences are bracketed by [CLS] and [EOS] tokens, and we use the final activation at the [CLS] token position to produce a text embedding $\hat{z}\in\mathbb{R}^{d}$. Following standard fine-tuning practices, all layers for BERT are fine-tuned while only the last layer of DistilBERT is fine-tuned.

Bidirectional sequence encoder $g_{\psi}^{\mathsf{seq}}$: We broadcast the text embedding $\hat{z}$ to create a sequence of length $T$, $\hat{z}\rightarrow\{\hat{z}_{t}\}_{t=1}^{T}$, and concatenate each $\hat{z}_{t}$ with the timeseries input $x_{1:T}$, $\{\hat{z}_{t};x_{t}\}_{1}^{T}$. This simplifies the task of learning timestep-specific correlations between system attributes $\hat{z}$ and timeseries $x_{1:T}$ in the multimodal encoder $g_{\psi}$. To efficiently embed long timeseries with thousands of timesteps, we explore both bidirectional LSTMs [25] and bidirectional SSMs [16] for $g_{\psi}^{\mathsf{seq}}$. Our bidirectional SSM uses stacks of S4 [18] blocks without downpooling layers. We use the last layer’s hidden states as temporal features $e_{1:T}$ for the top model. If $T=1$ or for non-sequential surrogate models, we instead use an MLP with residual layers (ResNet MLP) to embed each $\{\hat{z}_{t};x_{t}\}$ per-timestep to get $e_{t}$.

Top model $h_{\theta}$: The multimodal encoder $g_{\psi}$ produces $T$ feature vectors $e_{1:T}$. For simplicity, the output $\hat{y}_{t}$ at each timestep is predicted from $e_{t}$ by a shared MLP with a single hidden layer.
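The data flow through the surrogate can be traced with a short NumPy sketch that tracks array shapes only. Several pieces are assumptions made for illustration: the text embedding is a random stand-in for the pretrained LM output, the sequence encoder is replaced by a per-timestep linear map with a tanh (closer to the non-sequential case than to the LSTM or SSM), all sizes are arbitrary, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)
T, n_vars, d_text, d, hidden = 24, 7, 16, 32, 64  # small illustrative sizes

z_hat = rng.normal(size=(d_text,))  # stand-in for the LM text embedding
x = rng.normal(size=(T, n_vars))    # timeseries inputs x_{1:T}

# Broadcast the text embedding to every timestep and concatenate with x_t.
z_seq = np.broadcast_to(z_hat, (T, d_text))
features = np.concatenate([z_seq, x], axis=1)  # shape (T, d_text + n_vars)

# Placeholder encoder (NOT an LSTM or SSM): per-timestep linear map + tanh.
W_enc = 0.1 * rng.normal(size=(d_text + n_vars, d))
e = np.tanh(features @ W_enc)  # shape (T, d), plays the role of e_{1:T}

# Top model: one-hidden-layer MLP whose weights are shared across timesteps.
W1, b1 = 0.1 * rng.normal(size=(d, hidden)), np.zeros(hidden)
W2, b2 = 0.1 * rng.normal(size=(hidden, 1)), np.zeros(1)
y_hat = (np.maximum(e @ W1 + b1, 0.0) @ W2 + b2).squeeze(-1)  # shape (T,)

print(features.shape, e.shape, y_hat.shape)
```

Sharing `W1` and `W2` across timesteps is what keeps the top model's parameter count independent of $T$.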

6 Experiments

Figure 3: Accuracy. Panels: (a) building-hourly, (b) stock-annual. Lower NRMSE is better. Building-hourly is the NRMSE normalized per building and per hour. Stock-annual sums over all buildings and hours before normalizing by the total consumption of the building stock. SSM/X not shown in (b) for clarity ($\sim$0.52 NRMSE).
Table 1: Caption quality. We estimate presence of each attribute in a SysCaps, measured by the average test accuracy of a multi-class classifier trained to predict each categorical attribute. Our metric suggests $\sim$9-12% of attributes are missing or incorrect per SysCaps, due to errors made by llama-2-7b-chat.

Caption length (13 attributes)    Accuracy (%)
Short                             88.90
Medium                            90.90
Long                              90.38
Table 2: SysCaps zero-shot length generalization. NRMSE is per-building-hourly. Results are for the SSM model trained with medium-length SysCaps and evaluated zero-shot on short and long captions.

SysCaps length    NRMSE
Short             0.57 ± 0.02
Medium            0.53 ± 0.01
Long              0.64 ± 0.02
Table 3: Regression with attribute synonyms. Pretrained language embeddings make our model robust to certain types of caption paraphrasing, such as the use of attribute synonyms. We show the difference in NRMSE between an unmodified caption and a modified caption (mean and std. across 3 random seeds), where one modified caption replaces the building type with a synonym (column 3) and the other removes the building type attribute as a baseline (column 4). A difference in means larger than the std. dev. is bolded, otherwise we use an underline.

Building type            Synonym                 W/ Synonym      W/out Building Type
FullServiceRestaurant    FineDiningRestaurant    0.52 ± 0.05     0.93 ± 0.01
RetailStripmall          ShoppingCenter          0.01 ± 0.00     0.68 ± 0.02
Warehouse                StorageFacility         0.35 ± 0.30     0.55 ± 0.31
RetailStandalone         ConvenienceStore        0.00 ± 0.01     0.30 ± 0.04
SmallOffice              Co-WorkingSpace         0.03 ± 0.01     0.02 ± 0.02
PrimarySchool            ElementarySchool        0.00 ± 0.01     0.38 ± 0.02
MediumOffice             Workplace               0.08 ± 0.02     0.03 ± 0.04
SecondarySchool          HighSchool              -0.01 ± 0.04    0.52 ± 0.06
Outpatient               MedicalClinic           0.02 ± 0.01     0.56 ± 0.09
QuickServiceRestaurant   FastFoodRestaurant      0.10 ± 0.07     0.83 ± 0.01
LargeOffice              OfficeTower             0.12 ± 0.13     0.23 ± 0.03
LargeHotel               Five-Star Hotel         0.03 ± 0.01     0.46 ± 0.06
SmallHotel               Motel                   0.26 ± 0.07     0.88 ± 0.07
Hospital                 HealthcareFacility      0.03 ± 0.04     0.62 ± 0.12

Figure 4: System captions unlock text-prompt-style surrogate modeling for complex systems. We show building stock daily load profiles aggregated for Warehouse building type, created with caption templates. In tabular regression, a critical missing attribute (building square footage) can have an outsized impact on accuracy. In column two, the accuracy has significantly improved (see Figure 6 for the quantitative analysis).
Figure 5: The stock annual NRMSE decreases as missing attributes are added to a caption template. Performance improves greatly when the square footage is known.
Figure 6: Sensitivity analysis with natural language SysCap templates. a) The model has learned a physically plausible relationship between building square footage and number of stories. b) Failure case: When tested on unseen values of sqft (blue crosses), the model severely underestimates the energy consumption.

Setup: Our main experiments focus on training building stock surrogate models for the building energy simulator EnergyPlus [8]. Given an annual hourly weather timeseries ($T = 8{,}760$) with 7 variables and a list of tabular building attributes, surrogates predict the building’s energy consumption at each hour of the year. Each building initially has 17 attributes; using RFE with a tuned LightGBM [28] model, we selected the 13 most important attributes. We use the commercial building split of the Buildings-900K dataset [15], which consists of building stock simulation runs for all commercial buildings in the United States. Since this dataset only has the energy timeseries, we extracted the building configuration and weather timeseries from the End-Use Load Profiles database [55] for each building. Our training set comprises 330K buildings, and we use 100 buildings for validation and 6K held-out buildings for testing. We also reserved a held-out set of 10K buildings for RFE. We carefully tune the hyperparameters of all models (details in the Appendix).

We created three SysCaps datasets: a “medium” caption length dataset where <NUM> $\coloneqq$ “4-6”, a “short” dataset using 2-3 sentences, and a “long” dataset using 7-9 sentences. The SSMs in our experiments are trained with medium captions. Generating these datasets with llama-2-7b-chat used $\sim$1.5K GPU hours on a cluster with 16 NVIDIA A100-40GB GPUs.

Evaluating SysCaps quality: The LLM that generates natural language SysCaps may erroneously ignore or hallucinate attributes, which can negatively impact downstream performance. We evaluate the generated captions by estimating the fraction of attributes which the LLM successfully includes per caption. To compute this metric, we train a multi-class classifier to predict each categorical attribute in a SysCaps from its text embedding. The rate of missing or incorrect attributes is around 9-12% across the “short”, “medium”, and “long” caption types, with “short” captions having the highest error. This increases our confidence that our LLM-based approach for generating natural language SysCaps preserves sufficient information for surrogate modeling.
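The quoted 9-12% range follows from reading the classifier's test accuracy in Table 1 as the rate at which attributes are recoverable from a caption, so the complement is the estimated rate of missing or incorrect attributes. A short Python check of that arithmetic, using only the values reported in Table 1:

```python
# Average attribute-classifier test accuracy (%) from Table 1.
accuracy_pct = {"Short": 88.90, "Medium": 90.90, "Long": 90.38}

# Estimated rate of missing or incorrect attributes per caption type (%).
error_pct = {name: round(100.0 - acc, 2) for name, acc in accuracy_pct.items()}

worst = max(error_pct, key=error_pct.get)
print(error_pct, worst)
```

The resulting rates are roughly 11.1% (short), 9.1% (medium), and 9.6% (long), with short captions showing the highest error.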

6.1 Accuracy On Held-Out Systems

Following Emami et al. [15], we use the normalized root mean square error (NRMSE) metric to compare model accuracy, averaged across 3 random seeds. In addition to comparing the ResNet MLP, LSTM, and SSM, we also trained a tuned LightGBM baseline.

Does the sequential architecture matter? Yes. Figure 3 shows that the LSTM and SSM encoders outperform both the ResNet and our carefully tuned LightGBM baseline, and the SSM outperforms the LSTM.

How do different system attribute encodings compare? First, we ablate the importance of encoding the system attributes by training an SSM baseline with these inputs removed (SSM/X); this model is unable to learn this task. Surprisingly, the SSM with text templates achieves comparable test accuracy on held-out systems to the SSM with one-hot inputs. We initially expected to see a non-negligible drop in regression accuracy even for text template inputs, because the text encoder compresses the caption into a single embedding vector; however, the DistilBERT encoder is sufficiently expressive to mitigate this. The SSM with natural language SysCaps has slightly worse accuracy than the LSTM model with key-value SysCaps, yet comfortably outperforms the non-sequential models, including LightGBM. We believe the performance gap between key-value templates and natural language SysCaps is mostly explained by the caption quality (Table 1). To check whether a more powerful text encoder than DistilBERT improves accuracy, we replace it with BERT. We saw only a minor difference in the building-hourly NRMSE (Fig. 3(a)), but a large improvement from 0.069 to 0.038 in stock-annual NRMSE (Fig. 3(b)), which is comparable to the SSM with one-hot inputs.
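The two NRMSE variants reported in Figure 3 can be sketched as below. This is one plausible reading of the figure caption, not the evaluation code used in the experiments: the exact normalization may differ, and the function names and toy arrays are hypothetical. Here the building-hourly variant divides the RMSE over all building-hour pairs by the mean true load, and the stock-annual variant compares total predicted and total true consumption.

```python
import numpy as np


def building_hourly_nrmse(y_pred, y_true):
    """RMSE over all (building, hour) pairs, normalized by the mean true load."""
    rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
    return float(rmse / np.mean(y_true))


def stock_annual_nrmse(y_pred, y_true):
    """Error of the stock-wide total, normalized by total true consumption."""
    return float(abs(y_pred.sum() - y_true.sum()) / y_true.sum())


# Toy data: 2 buildings x 2 hours (rows are buildings, columns are hours).
y_true = np.array([[1.0, 2.0], [3.0, 4.0]])
y_pred = np.array([[1.0, 3.0], [3.0, 5.0]])

print(building_hourly_nrmse(y_pred, y_true))  # sqrt(0.5) / 2.5
print(stock_annual_nrmse(y_pred, y_true))     # |12 - 10| / 10
```

Under this reading, per-hour errors of opposite sign can cancel in the stock-annual metric, which is one reason the two panels can rank models differently.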

6.2 Caption Generalization

Length generalization: We assess how accuracy varies when surrogates are provided with natural language SysCaps having different lengths than seen during training. We evaluate zero-shot generalization to the short and long captions. The results (Table 2) show a small increase in error for shorter captions with a larger increase in error for longer captions, as might be expected. The error on long captions remains lower than the error achieved by our tuned LightGBM baseline.

Attribute synonyms: To quantitatively evaluate the extent to which natural language SysCaps surrogates gain a level of robustness to distribution shifts such as word order changes, synonyms, or writing style [24], we created captions for the held-out systems where the “building type” attribute is replaced by a synonym. We avoid biasing the choice of building type synonym by 5-shot prompting llama-2 to suggest the synonyms. For a control, we compare the synonym caption accuracy against accuracy on a caption with the building type attribute removed. Examples and results are shown in Table 3, where for 11/13 building type synonyms the increase in NRMSE is less than 12%, while the average increase in NRMSE for the control is 54%.

6.3 Robustness to Missing Information

Figures 4 and 6 show how a SysCaps model performs with missing attribute information in the natural language caption. We did not train the model on any captions with missing attributes. To create these figures, we progressively added more information to a caption template (Figure 4). Prediction accuracy improves as more information is given; notably, there is a large jump in accuracy once the building square footage is known, which is to be expected (Figure 6). This experiment reveals an intriguing connection between SysCaps with missing system information and traditional low-fidelity mathematical surrogate models. When attribute values are unknown, such as during the early stages of a design process, the SysCaps surrogate is naturally less accurate; similarly, a low-fidelity surrogate that uses a “coarse” mathematical approximation of a complex system sacrifices accuracy for computational efficiency.

6.4 Sensitivity Analysis From Natural Language

We visualize in Figure 6 a demonstration of using a SysCaps surrogate to conduct a sensitivity analysis on two system attributes, as might be performed for an early-stage design space exploration task. We use the caption template from Section 6.3 to create captions for each test building that enumerate all combinations of the number of stories and square footage attributes, totaling 160 configurations; the entire analysis requires simulating 960K buildings. We observe that the model has indeed learned physically plausible relationships between these two attributes. The model fails to predict the energy usage for buildings over 100K square feet—such buildings are in the “long tail” of the training data distribution.
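
The enumeration step can be sketched as follows. The attribute grids and the caption template here are hypothetical placeholders (the grid is chosen only so that it yields 160 combinations); the actual template is the one referenced in Section 6.3.

```python
from itertools import product

# Hypothetical attribute grids: 10 story counts x 16 floor areas = 160 configurations.
STORIES = list(range(1, 11))
SQUARE_FEET = [5_000 * k for k in range(1, 17)]

# Hypothetical key-value style template, not the template used in the paper.
TEMPLATE = "Number of stories: {stories}. Square footage: {sqft:,}."


def enumerate_captions(stories=STORIES, square_feet=SQUARE_FEET):
    """Return one caption per (stories, square footage) combination."""
    return [
        TEMPLATE.format(stories=s, sqft=a)
        for s, a in product(stories, square_feet)
    ]


captions = enumerate_captions()
n_configs = len(captions)
# 960K surrogate evaluations over 160 configurations implies 6,000 test buildings.
n_buildings = 960_000 // n_configs
```

Each caption would then be paired with every test building's weather timeseries and passed to the surrogate, which is what makes the sweep cheap compared with rerunning the simulator.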

6.5 Prompt Augmentation: Wind Farm Wake

Table 4: Wind farm surrogate accuracy. The base architecture is ResNet. Average across 3 random seeds.

| Model | NRMSE |
|---|---|
| LightGBM | 0.196 ± 0.000 |
| one-hot | 0.212 ± 0.009 |
| SysCaps-kv | 0.054 ± 0.024 |
| SysCaps-nl | 0.036 ± 0.001 |

This experiment uses the Wind Farm Wake Modeling Dataset [43], made with the FLORIS simulator, to train a surrogate to predict a wind farm’s power generation in steady-state atmospheric conditions. The difficulty of this task is in modeling losses due to wake effects, given only a coarse description of the wind farm layout. There are three numeric simulator inputs $x$ specifying atmospheric conditions, and five system attributes, which include categorical variables indicating wind farm shape (there are four different layout types), number of turbines, and average turbine spacing (we do not use RFE). In this dataset, there are only 500 unique system configurations (split 3:1:1 for train, val, test), although each configuration is simulated under 500 distinct atmospheric conditions.
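
A minimal sketch of a 3:1:1 split at the level of system configurations, so that validation and test systems are never seen during training. The function name, the seed, and the use of integer configuration IDs are assumptions for illustration; the text does not describe how the split was implemented.

```python
import random


def split_configurations(config_ids, ratios=(3, 1, 1), seed=0):
    """Shuffle system configuration IDs and split them by the given ratios."""
    ids = list(config_ids)
    random.Random(seed).shuffle(ids)
    total = sum(ratios)
    n_train = len(ids) * ratios[0] // total
    n_val = len(ids) * ratios[1] // total
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]  # remainder goes to the test split
    return train, val, test


train_ids, val_ids, test_ids = split_configurations(range(500))
```

With 500 configurations this gives 300 training, 100 validation, and 100 test systems; all atmospheric conditions simulated for a configuration stay in that configuration's split.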

We explore generating multiple captions for each system configuration through prompt augmentation to increase diversity. Specifically, we replace the <STYLE> tag in the prompt with phrases encouraging different description styles, e.g., with an objective tone, with an objective tone (creative paraphrasing is acceptable), to a colleague, and to a classroom. The simulation is run assuming steady-state conditions (i.e., time-independent), so we tune hyperparameters for and train the non-sequential ResNet models. The ResNet baseline with one-hot encoded attribute inputs suffers from severe overfitting (Table 4), likely due to the small number (300) of training systems, whereas the SysCaps models generalize better to unseen systems. This suggests SysCaps can have a regularizing effect in small data settings. Notably, the prompt augmentation helps the natural language SysCaps model to achieve the lowest NRMSE.
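
The tag substitution can be sketched as below. The style phrases are the ones listed above; the prompt wording and the attribute string are hypothetical stand-ins, since the full prompt is not reproduced in this section.

```python
# Style phrases listed in the text above.
STYLES = [
    "with an objective tone",
    "with an objective tone (creative paraphrasing is acceptable)",
    "to a colleague",
    "to a classroom",
]

# Hypothetical prompt; only the <STYLE> tag is taken from the text.
PROMPT = "Describe the following wind farm <STYLE>.\nAttributes: {attributes}"


def augmented_prompts(attributes, styles=STYLES, prompt=PROMPT):
    """Return one LLM prompt per description style for a single system."""
    filled = prompt.format(attributes=attributes)
    return [filled.replace("<STYLE>", style) for style in styles]


# Made-up attribute values for one wind farm configuration.
prompts = augmented_prompts("layout=cluster, turbines=50, spacing=5 rotor diameters")
```

Each system configuration then yields one prompt, and hence one generated caption, per style.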

7 Discussion

Our experiments demonstrate that real-world surrogate models can be augmented with language interfaces at comparable accuracy, or with only minimal losses in accuracy, relative to standard feature engineering (e.g., one-hot encoding). For a problem with only a small number of training systems available, the language inputs (with prompt augmentation) have a regularizing effect (Sec. 6.5) and actually outperformed standard feature engineering. We qualitatively and quantitatively showed that SysCaps handle missing attributes well. SysCaps unlock text-prompt-style surrogate modeling and new generalization abilities beyond what was previously possible with the standard encoding approach.

Limitations: Current BERT-style tokenizers struggle with numerical values [54]. For example, they interpolate poorly to unseen numbers (Fig. 6). In addition, because llama-2-7b-chat tends to add a comma to large numbers (e.g., 200,000) when generating SysCaps, our models failed to understand large numbers written without commas, resulting in high error. Orthogonal research on improving number encodings for language model inputs [17, 57] can benefit our framework. Another potential concern is creating SysCaps for simulators with a large number of attributes (e.g., over 100). To get high-quality captions, a more powerful LLM than llama-2-7b-chat may be needed.
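
One possible workaround for the comma issue, which we did not evaluate, is to normalize user-written captions at inference time so that large integers match the comma-separated form seen during training. The sketch below is an assumption-laden illustration: the function name and the five-digit threshold are hypothetical choices.

```python
import re


def add_thousands_separators(caption):
    """Rewrite bare integers of five or more digits with comma separators."""
    def repl(match):
        return f"{int(match.group()):,}"

    # Match digit runs that are not preceded by a digit, period, or comma,
    # so decimals and already-separated numbers are left untouched.
    return re.sub(r"(?<![\d.,])\d{5,}(?!\d)", repl, caption)


normalized = add_thousands_separators("The office has 200000 square feet and 3 stories.")
```

Here `normalized` reads "The office has 200,000 square feet and 3 stories.", while a caption that already contains "200,000" is returned unchanged.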

Broader impacts: Improving the design of surrogate models for CES has the potential to accelerate the transition to cleaner energy sources. To avoid unfair outcomes from decisions made with CES surrogates, care should be taken when deciding what simulation runs to use as training data.

8 Conclusion

In this work, we introduced a learning framework for training multimodal text and timeseries surrogate models for simulations of complex energy systems such as buildings and wind farms, and described how we use LLMs to synthesize natural language descriptions of such systems, which we call SysCaps. Our findings underscore that language is a viable interface for real-world surrogate models with only minimal losses (and occasionally gains) in accuracy. Surrogates with natural language SysCaps are robust to missing system attributes and paraphrasing (e.g., synonyms of attributes).

A future extension of this work might explore how to use language to also interface with the timeseries simulator inputs, possibly through summary statistics. For example, one could study how the complex system behaves when the average exogenous temperature is increased by five degrees. Moreover, an important question is how we might create surrogate foundation models that generalize not only across system configurations for a single simulator, but also across different simulators.

Acknowledgments and Disclosure of Funding

This work was authored by the National Renewable Energy Laboratory (NREL), operated by Alliance for Sustainable Energy, LLC, for the U.S. Department of Energy (DOE) under Contract No. DE-AC36-08GO28308. This work was supported by the Laboratory Directed Research and Development (LDRD) Program at NREL. The views expressed in the article do not necessarily represent the views of the DOE or the U.S. Government. The U.S. Government retains and the publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work, or allow others to do so, for U.S. Government purposes. The research was performed using computational resources sponsored by the Department of Energy’s Office of Energy Efficiency and Renewable Energy and located at the National Renewable Energy Laboratory.

References

  • Agostinelli et al. [2023] Agostinelli, A., Denk, T. I., Borsos, Z., Engel, J., Verzetti, M., Caillon, A., Huang, Q., Jansen, A., Roberts, A., Tagliasacchi, M., et al. Musiclm: Generating music from text. arXiv preprint arXiv:2301.11325, 2023.
  • Akiba et al. [2019] Akiba, T., Sano, S., Yanase, T., Ohta, T., and Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pp.  2623–2631, 2019.
  • Badaro et al. [2023] Badaro, G., Saeed, M., and Papotti, P. Transformers for tabular data representation: A survey of models and applications. Transactions of the Association for Computational Linguistics, 11:227–249, 2023.
  • Bellamy et al. [2023] Bellamy, D. R., Kumar, B., Wang, C., and Beam, A. Labrador: Exploring the limits of masked language modeling for laboratory data. arXiv preprint arXiv:2312.11502, 2023.
  • Bhavsar et al. [2023] Bhavsar, S., Pitchumani, R., Reynolds, M., Merket, N., and Reyna, J. Machine learning surrogate of physics-based building-stock simulator for end-use load forecasting. pp.  113395, 2023. ISSN 03787788. doi: 10.1016/j.enbuild.2023.113395. URL https://linkinghub.elsevier.com/retrieve/pii/S0378778823006254.
  • Bran et al. [2023] Bran, A. M., Cox, S., Schilter, O., Baldassari, C., White, A., and Schwaller, P. Augmenting large language models with chemistry tools. In NeurIPS 2023 AI for Science Workshop, 2023.
  • Carter et al. [2023] Carter, J., Feddema, J., Kothe, D., Neely, R., Pruet, J., Stevens, R., Balaprakash, P., Beckman, P., Foster, I., Iskra, K., et al. Advanced research directions on ai for science, energy, and security: Report on summer 2022 workshops. 2023.
  • Crawley et al. [2001] Crawley, D. B., Lawrie, L. K., Winkelmann, F. C., Buhl, W. F., Huang, Y. J., Pedersen, C. O., Strand, R. K., Liesen, R. J., Fisher, D. E., Witte, M. J., et al. Energyplus: creating a new-generation building energy simulation program. Energy and buildings, 33(4):319–331, 2001.
  • Dai et al. [2023] Dai, T.-Y., Niyogi, D., and Nagy, Z. Citytft: Temporal fusion transformer for urban building energy modeling. ArXiv preprint, abs/2312.02375, 2023. URL https://arxiv.longhoe.net/abs/2312.02375.
  • Devlin et al. [2018] Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Dinh et al. [2022] Dinh, T., Zeng, Y., Zhang, R., Lin, Z., Gira, M., Rajput, S., Sohn, J.-y., Papailiopoulos, D., and Lee, K. Lift: Language-interfaced fine-tuning for non-language machine learning tasks. Advances in Neural Information Processing Systems, 35:11763–11784, 2022.
  • Doh et al. [2023] Doh, S., Choi, K., Lee, J., and Nam, J. LP-MusicCaps: LLM-based pseudo music captioning. ArXiv preprint, abs/2307.16372, 2023. URL https://arxiv.longhoe.net/abs/2307.16372.
  • Du & Li [2019] Du, Y. and Li, F. Intelligent multi-microgrid energy management based on deep neural network and model-free reinforcement learning. IEEE Transactions on Smart Grid, 11(2):1066–1076, 2019.
  • Emami et al. [2023a] Emami, H., Dang, X.-H., Shah, Y., and Zerfos, P. Modality-aware transformer for time series forecasting. arXiv preprint arXiv:2310.01232, 2023a.
  • Emami et al. [2023b] Emami, P., Sahu, A., and Graf, P. Buildingsbench: A large-scale dataset of 900k buildings and benchmark for short-term load forecasting. Advances in Neural Information Processing Systems, 2023b.
  • Goel et al. [2022] Goel, K., Gu, A., Donahue, C., and Ré, C. It’s raw! audio generation with state-space models. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., and Sabato, S. (eds.), International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  7616–7633. PMLR, 2022. URL https://proceedings.mlr.press/v162/goel22a.html.
  • Golkar et al. [2023] Golkar, S., Pettee, M., Eickenberg, M., Bietti, A., Cranmer, M., Krawezik, G., Lanusse, F., McCabe, M., Ohana, R., Parker, L., et al. xval: A continuous number encoding for large language models. arXiv preprint arXiv:2310.02989, 2023.
  • Gu et al. [2021] Gu, A., Goel, K., and Ré, C. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021.
  • Gupta & Brandstetter [2022] Gupta, J. K. and Brandstetter, J. Towards multi-spatiotemporal-scale generalized pde modeling. arXiv preprint arXiv:2209.15616, 2022.
  • Guyon et al. [2002] Guyon, I., Weston, J., Barnhill, S., and Vapnik, V. Gene selection for cancer classification using support vector machines. Machine learning, 46:389–422, 2002.
  • Harrison-Atlas et al. [2024] Harrison-Atlas, D., Glaws, A., King, R. N., and Lantz, E. Artificial intelligence-aided wind plant optimization for nationwide evaluation of land use and economic benefits of wake steering. Nature Energy, pp.  1–15, 2024.
  • Hegselmann et al. [2023] Hegselmann, S., Buendia, A., Lang, H., Agrawal, M., Jiang, X., and Sontag, D. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pp.  5549–5581. PMLR, 2023.
  • Hendrix et al. [1978] Hendrix, G. G., Sacerdoti, E. D., Sagalowicz, D., and Slocum, J. Developing a natural language interface to complex data. ACM Transactions on Database Systems (TODS), 3(2):105–147, 1978.
  • Hendrycks et al. [2020] Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan, R., and Song, D. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020.
  • Hochreiter & Schmidhuber [1997] Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Huang et al. [2022] Huang, Q., Jansen, A., Lee, J., Ganti, R., Li, J. Y., and Ellis, D. P. W. MuLan: A joint embedding of music audio and natural language, 2022. URL https://arxiv.longhoe.net/abs/2208.12415.
  • Jablonka et al. [2024] Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A., and Smit, B. Leveraging large language models for predictive chemistry. Nature Machine Intelligence, pp.  1–9, 2024.
  • Ke et al. [2017] Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30, 2017.
  • Kumar et al. [2023] Kumar, V., Gleyzer, L., Kahana, A., Shukla, K., and Karniadakis, G. E. Mycrunchgpt: A llm assisted framework for scientific machine learning. Journal of Machine Learning for Modeling and Computing, 4(4), 2023.
  • Lavin et al. [2021] Lavin, A., Krakauer, D., Zenil, H., Gottschlich, J., Mattson, T., Brehmer, J., Anandkumar, A., Choudry, S., Rocki, K., Baydin, A. G., et al. Simulation intelligence: Towards a new generation of scientific methods. ArXiv preprint, abs/2112.03235, 2021. URL https://arxiv.longhoe.net/abs/2112.03235.
  • Liu et al. [2023a] Liu, H., Chen, Z., Yuan, Y., Mei, X., Liu, X., Mandic, D., Wang, W., and Plumbley, M. D. Audioldm: Text-to-audio generation with latent diffusion models. ArXiv preprint, abs/2301.12503, 2023a. URL https://arxiv.longhoe.net/abs/2301.12503.
  • Liu et al. [2023b] Liu, S., Zhu, Y., Lu, J., Xu, Z., Nie, W., Gitter, A., Xiao, C., Tang, J., Guo, H., and Anandkumar, A. A text-guided protein design framework. arXiv preprint arXiv:2302.04611, 2023b.
  • Liu et al. [2024] Liu, T., Astorga, N., Seedat, N., and van der Schaar, M. Large language models to enhance bayesian optimization. International Conference on Learning Representations, 2024.
  • Lorsung et al. [2024] Lorsung, C., Li, Z., and Barati Farimani, A. Physics informed token transformer for solving partial differential equations. Machine Learning: Science and Technology, 2024.
  • Loshchilov & Hutter [2017] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. International Conference on Learning Representations, 2017.
  • Mei et al. [2023a] Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M. D., Zou, Y., and Wang, W. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395, 2023a.
  • Mei et al. [2023b] Mei, X., Meng, C., Liu, H., Kong, Q., Ko, T., Zhao, C., Plumbley, M. D., Zou, Y., and Wang, W. WavCaps: A ChatGPT-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. ArXiv preprint, abs/2303.17395, 2023b. URL https://arxiv.longhoe.net/abs/2303.17395.
  • Potemkin et al. [2023] Potemkin, D., Soto, C., Li, R., Yager, K., and Tsai, E. Virtual scientific companion for synchrotron beamlines: A prototype. ArXiv preprint, abs/2312.17180, 2023. URL https://arxiv.longhoe.net/abs/2312.17180.
  • Quamar et al. [2022] Quamar, A., Efthymiou, V., Lei, C., Özcan, F., et al. Natural language interfaces to data. 11(4):319–414, 2022.
  • Rackauckas et al. [2020] Rackauckas, C., Ma, Y., Martensen, J., Warner, C., Zubov, K., Supekar, R., Skinner, D., Ramadhan, A., and Edelman, A. Universal differential equations for scientific machine learning. arXiv preprint arXiv:2001.04385, 2020.
  • Rackauckas & Abdelrehim [2024] Rackauckas, C. V. and Abdelrehim, A. Scientific machine learning (sciml) surrogates for industry, part 1: The guiding questions. https://doi.org/10.31219/osf.io/p95zn, 2024. Accessed: 2024-03-22.
  • Raissi et al. [2019] Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. Journal of Computational physics, 378:686–707, 2019.
  • Ramos et al. [2023] Ramos, D., Glaws, A., King, R., and Harrison-Atlas, D. Flow redirection and induction in steady state (floris) wind plant power production data sets, 2023. URL https://data.openei.org/submissions/5884.
  • Rodrigues et al. [2019] Rodrigues, F., Markou, I., and Pereira, F. C. Combining time-series and textual data for taxi demand prediction in event areas: A deep learning approach. Information Fusion, 49:120–129, 2019.
  • Sanh et al. [2019] Sanh, V., Debut, L., Chaumond, J., and Wolf, T. Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. ArXiv preprint, abs/1910.01108, 2019. URL https://arxiv.longhoe.net/abs/1910.01108.
  • Schick & Schütze [2021] Schick, T. and Schütze, H. Generating datasets with pretrained language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.  6943–6951, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.emnlp-main.555. URL https://aclanthology.org/2021.emnlp-main.555.
  • Seidl et al. [2023] Seidl, P., Vall, A., Hochreiter, S., and Klambauer, G. Enhancing activity prediction models in drug discovery with the ability to understand human language. In Krause, A., Brunskill, E., Cho, K., Engelhardt, B., Sabato, S., and Scarlett, J. (eds.), Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pp.  30458–30490. PMLR, 2023. URL https://proceedings.mlr.press/v202/seidl23a.html.
  • Song et al. [2024] Song, X., Li, O., Lee, C., Peng, D., Perel, S., Chen, Y., et al. Omnipred: Language models as universal regressors. ArXiv preprint, abs/2402.14547, 2024. URL https://arxiv.longhoe.net/abs/2402.14547.
  • Takamoto et al. [2023] Takamoto, M., Alesiani, F., and Niepert, M. Learning neural pde solvers with parameter-guided channel attention. In International Conference on Machine Learning, pp.  33448–33467. PMLR, 2023.
  • Touvron et al. [2023] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. ArXiv preprint, abs/2307.09288, 2023. URL https://arxiv.longhoe.net/abs/2307.09288.
  • Vaithilingam et al. [2024] Vaithilingam, P., Arawjo, I., and Glassman, E. L. Imagining a future of designing with ai: Dynamic grounding, constructive negotiation, and sustainable motivation. arXiv preprint arXiv:2402.07342, 2024.
  • Vazquez-Canteli et al. [2019] Vazquez-Canteli, J., Demir, A. D., Brown, J., and Nagy, Z. Deep neural networks as surrogate models for urban energy simulations. In Journal of Physics: Conference Series, volume 1343, pp.  012002. IOP Publishing, 2019.
  • Vepsäläinen et al. [2019] Vepsäläinen, J., Otto, K., Lajunen, A., and Tammi, K. Computationally efficient model for energy demand prediction of electric city bus in varying operating conditions. Energy, 169:433–443, 2019.
  • Wallace et al. [2019] Wallace, E., Wang, Y., Li, S., Singh, S., and Gardner, M. Do NLP models know numbers? probing numeracy in embeddings. In Inui, K., Jiang, J., Ng, V., and Wan, X. (eds.), Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp.  5307–5315, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/D19-1534. URL https://aclanthology.org/D19-1534.
  • [55] Wilson, E. J., Parker, A., Fontanini, A., Present, E., Reyna, J. L., Adhikari, R., Bianchi, C., CaraDonna, C., Dahlhausen, M., Kim, J., et al. End-use load profiles for the us building stock: Methodology and results of model calibration, validation, and uncertainty quantification. Technical report, National Renewable Energy Lab (NREL). URL https://www.nrel.gov/docs/fy22osti/80889.pdf.
  • Xu et al. [2023] Xu, M., Yuan, X., Miret, S., and Tang, J. Protst: multi-modality learning of protein sequences and biomedical texts. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.
  • Yan et al. [2024] Yan, J., Zheng, B., Xu, H., Zhu, Y., Chen, D., Sun, J., Wu, J., and Chen, J. Making pre-trained language models great on tabular prediction. arXiv preprint arXiv:2403.01841, 2024.
  • Yang et al. [2024] Yang, L., Liu, S., and Osher, S. J. Fine-tune language models as multi-modal differential equation solvers. arXiv preprint arXiv:2308.05061v4, 2024.
  • Ye et al. [2024] Ye, Z., Huang, X., Chen, L., Liu, H., Wang, Z., and Dong, B. Pdeformer: Towards a foundation model for one-dimensional partial differential equations. arXiv preprint arXiv:2402.12652, 2024.
  • Zhang et al. [2021] Zhang, L., Plathottam, S., Reyna, J., Merket, N., Sayers, K., Yang, X., Reynolds, M., Parker, A., Wilson, E., Fontanini, A., Roberts, D., and Muehleisen, R. High-resolution hourly surrogate modeling framework for physics-based large-scale building stock modeling. 75:103292, 2021. ISSN 22106707. doi: 10.1016/j.scs.2021.103292. URL https://linkinghub.elsevier.com/retrieve/pii/S2210670721005680.

Appendix A Additional Experiment Details

We use the normalized root mean square error (NRMSE) to capture the accuracy of the surrogate model. NRMSE is also known as (CV)RMSE.

The building-hour NRMSE is computed over every building and every hour and normalized by the mean hourly target value, where $B$ is the number of buildings in the building stock and $T$ is the number of hours in a year:

$$:= \frac{1}{\frac{1}{BT}\sum y^{b}_{t}} \sqrt{\frac{1}{BT}\sum_{b=1,t=1}^{B,T}\left(y^{b}_{t}-\hat{y}^{b}_{t}\right)^{2}}. \qquad (2)$$

The stock-annual NRMSE is the error in the total annual energy consumption of the stock, normalized by the true annual stock energy consumption:

$$:= \frac{1}{\sum y^{b}_{t}} \sqrt{\left(\left(\sum_{b=1,t=1}^{B,T} y^{b}_{t}\right)-\left(\sum_{b=1,t=1}^{B,T} \hat{y}^{b}_{t}\right)\right)^{2}}. \qquad (3)$$
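
For concreteness, the two metrics in Eqs. (2) and (3) can be written as the following minimal sketch. The function names and the nested-list data layout (one list of hourly values per building) are assumptions for illustration, not the evaluation code used in our experiments.

```python
import math


def building_hour_nrmse(y_true, y_pred):
    """Eq. (2): RMSE over all buildings and hours, divided by the mean target value.

    y_true and y_pred are lists of per-building lists of hourly values.
    """
    n = sum(len(row) for row in y_true)  # B * T
    mean_true = sum(sum(row) for row in y_true) / n
    sq_err = sum(
        (yt - yp) ** 2
        for row_t, row_p in zip(y_true, y_pred)
        for yt, yp in zip(row_t, row_p)
    )
    return math.sqrt(sq_err / n) / mean_true


def stock_annual_nrmse(y_true, y_pred):
    """Eq. (3): absolute error of the stock-wide total, divided by the true total."""
    total_true = sum(sum(row) for row in y_true)
    total_pred = sum(sum(row) for row in y_pred)
    return math.sqrt((total_true - total_pred) ** 2) / total_true


# Toy stock with B = 2 buildings and T = 2 hours.
y = [[1.0, 2.0], [3.0, 4.0]]
y_hat = [[1.0, 2.0], [3.0, 2.0]]
bh = building_hour_nrmse(y, y_hat)
sa = stock_annual_nrmse(y, y_hat)
```

Because Eq. (3) compares aggregate totals, per-hour errors of opposite sign can cancel, so a model can score well on the stock-annual metric while scoring poorly on the building-hour metric.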

We use the AdamW [35] optimizer with $\beta_1$ = 0.9, $\beta_2$ = 0.98, $\epsilon$ = 1e-9, and weight decay of 0.01 for all experiments. The early stopping patience is 50 for all experiments. All models are trained with a single NVIDIA A100-40GB GPU. The longest training runs take 1-2 days and the shortest 2-3 hours.
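
The early stopping rule can be sketched as below. This is a generic patience-based criterion with hypothetical class and attribute names; whether patience is counted in epochs or in validation checks is not specified above, so the sketch counts validation checks.

```python
class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` checks."""

    def __init__(self, patience=50):
        self.patience = patience
        self.best = float("inf")
        self.bad_checks = 0

    def step(self, val_loss):
        """Record one validation result; return True when training should stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_checks = 0
        else:
            self.bad_checks += 1
        return self.bad_checks >= self.patience


# Small patience for the demo; the experiments above use a patience of 50.
stopper = EarlyStopping(patience=3)
losses = [1.0, 0.8, 0.9, 0.85, 0.81, 0.7]
stopped_at = next((i for i, loss in enumerate(losses) if stopper.step(loss)), None)
```

In this toy run the loss fails to improve on 0.8 for three consecutive checks, so training stops at index 4 and never sees the later value of 0.7.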

A.1 Hyperparameter sweeps

See Table 5 and Table 6.

A.2 Buildings

Table 5: Buildings: hyperparameters.

| Model | Hyperparameter | Grid search space | Best values |
|---|---|---|---|
| LightGBM | Learning rate | From 0.01 to 0.1 | 0.066 |
| | Number of leaves | From 40 to 150 | 149 |
| | Subsample | From 0.05 to 1.0 | 0.178 |
| | Feature fraction | From 0.05 to 1.0 | 0.860 |
| | Min number of data in one leaf | From 1 to 100 | 12 |
| ResNet (one-hot) | Hidden layers size | 256, 1024, 2048 | 1024 |
| | Number of layers | 2, 8 | 2 |
| | Batch size | 128, 256, 512 | 512 |
| | Learning rate | 0.0001, 0.0003, 0.001 | 0.0003 |
| ResNet (SysCaps) | Hidden layers size | 256, 1024, 2048 | 256 |
| | Number of layers | 2, 8 | 8 |
| | Batch size | 128, 256, 512 | 256 |
| | Learning rate | 0.0001, 0.0003, 0.001 | 0.0003 |
| Bidirectional LSTM (one-hot) | [Hidden layer size, Batch size] | [128, 64], [512, 32], [1024, 32] | [128, 64] |
| | Number of layers | 1, 3, 4, 6, 8 | 4 |
| | MLP dimension | 256 | 256 |
| | Learning rate | 0.00001, 0.0003, 0.001 | 0.001 |
| Bidirectional LSTM (SysCaps) | [Hidden layer size, Batch size] | [128, 64], [512, 32], [1024, 32] | [512, 32] |
| | Number of layers | 1, 3, 4, 6, 8 | 1 |
| | MLP dimension | 256 | 256 |
| | Learning rate | 0.00001, 0.0003, 0.001 | 0.0003 |
| Bidirectional S4 (one-hot) | [Hidden layer size, Num. layers] | [64, 8], [128, 4] | [128, 4] |
| | MLP dimension | 256 | 256 |
| | Batch size | 32, 64 | 64 |
| | Learning rate | 1e-5, 3e-4, 1e-3 | 3e-4 |
| Bidirectional S4 (SysCaps) | [Hidden layer size, Num. layers] | [64, 8], [128, 4] | [128, 4] |
| | MLP dimension | 256 | 256 |
| | Batch size | 32, 64 | 32 |
| | Learning rate | 1e-5, 3e-4, 1e-3 | 3e-4 |

There are 13 attributes after RFE, which are one-hot encoded into a 336-dimensional feature vector, whereas the text embeddings are 768-dimensional. Along with the 7 weather variables, we concatenate cyclically encoded calendar features [15]. This creates a 103-dimensional input for the bidirectional sequence encoders.
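
As an illustration of cyclical encoding, the sketch below maps a periodic calendar value to a sine and cosine pair. Which calendar features are encoded, and the exact encoding, follow [15] and are not restated here; the helper name and the hour-of-day example are assumptions.

```python
import math


def cyclical_encode(value, period):
    """Map a periodic value (e.g., hour of day) to a (sin, cos) pair on the unit circle."""
    angle = 2.0 * math.pi * value / period
    return math.sin(angle), math.cos(angle)


# Hour 23 and hour 0 are adjacent in time, so their encodings are close,
# whereas hour 12 lies on the opposite side of the circle from hour 0.
h0 = cyclical_encode(0, 24)
h12 = cyclical_encode(12, 24)
h23 = cyclical_encode(23, 24)
near = math.dist(h23, h0)
far = math.dist(h12, h0)
```

The point of the encoding is that the end of one cycle sits next to the start of the next, which a raw integer hour index would not capture.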

LightGBM: As LightGBM does not support batch training out of the box, the entire training data needs to be loaded into the memory to train a LightGBM model. With the train dataset containing 340k buildings, each with 8759 hours and 347 features, we randomly extract 438 hours per building (which is about 5% of total hours) to limit memory usage. This results in $340{,}000 \times 438$ hours in total for the train dataset, which consumes about 380 GB of memory when being loaded into a NumPy object. For the validation and test splits, we retain the full number of hours per building. The LightGBM model is tuned with Optuna [2] across 30 trials and achieves the best validation NRMSE of 0.667.
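
The memory figure can be checked with a short back-of-the-envelope calculation. The 8-byte (float64) storage per value is an assumption; the other numbers are taken from the paragraph above.

```python
n_buildings = 340_000
hours_per_building = 438   # about 5% of the 8,759 hours per building
n_features = 347
bytes_per_value = 8        # assumption: values stored as float64

n_rows = n_buildings * hours_per_building
n_bytes = n_rows * n_features * bytes_per_value
gib = n_bytes / 2**30
subsample_fraction = hours_per_building / 8759
```

Under this assumption the array has 148,920,000 rows and occupies roughly 385 GiB, which is consistent with the reported figure of about 380 GB.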

A.2.1 Wind farm

Table 6: Wind: hyperparameters.

| Model | Hyperparameter | Grid search space | Best values |
|---|---|---|---|
| LightGBM | Learning rate | From 0.01 to 0.1 | 0.039 |
| | Number of leaves | From 40 to 120 | 108 |
| | Subsample | From 0.6 to 1.0 | 0.963 |
| | Feature fraction | From 0.6 to 1.0 | 0.997 |
| | Min number of data in one leaf | From 20 to 100 | 96 |
| ResNet (one-hot) | Hidden layers size | [256, 1024] | 256 |
| | Number of layers | [2, 8] | 2 |
| | Batch size | [128, 256] | 128 |
| | Learning rate | [1e-5, 3e-4, 1e-3] | 1e-5 |
| ResNet (SysCaps-kv) | Hidden layers size | [256, 1024] | 1024 |
| | Number of layers | [2, 8] | 8 |
| | Batch size | [128, 256] | 256 |
| | Learning rate | [1e-5, 3e-4, 1e-3] | 3e-4 |
| ResNet (SysCaps-nl) | Hidden layers size | [256, 1024] | 1024 |
| | Number of layers | [2, 8] | 8 |
| | Batch size | [128, 256] | 256 |
| | Learning rate | [1e-5, 3e-4, 1e-3] | 1e-5 |

LightGBM: The training, validation, and test splits for the wind dataset contain 148,650, 49,600, and 49,250 samples, respectively, with 190 features after one-hot encoding. The training dataset is loaded into memory to train the LightGBM model. We use Optuna [2] to tune the hyperparameters, and the best validation NRMSE achieved is 0.189.

Appendix B Additional SysCaps examples

Figure 7: Building SysCap.

Figure 8: Wind farm SysCap.

Figure 9: SysCap building with an incorrectly described attribute (weekday closing time, highlighted in blue) in the natural language caption.

Figure 10: Wind farm SysCap with a logical error in the natural language caption, where it says “total installed capacity” when it is just the capacity of a single turbine (highlighted in blue).