
Does fine-tuning GPT-3 with the OpenAI API leak personally-identifiable information?

Authors: Albert Yu Sun, Eliott Zemour, Arushi Saxena, Udith Vaidyanathan, Eric Lin, Christian Lau, Vaikkunth Mugunthan

Abstract: Machine learning practitioners often fine-tune generative pre-trained models like GPT-3 to improve model performance at specific tasks. Previous works, however, suggest that fine-tuned machine learning models memorize and emit sensitive information from the original fine-tuning dataset. Companies such as OpenAI offer fine-tuning services for their models, but no prior work has conducted a memorization attack on any closed-source models. In this work, we simulate a privacy attack on GPT-3 using OpenAI's fine-tuning API. Our objective is to determine if personally identifiable information (PII) can be extracted from this model. We (1) explore the use of naive prompting methods on a GPT-3 fine-tuned classification model, and (2) design a practical word generation task called Autocomplete to investigate the extent of PII memorization in fine-tuned GPT-3 within a real-world context. Our findings reveal that fine-tuning GPT-3 for both tasks led to the model memorizing and disclosing critical personally identifiable information (PII) obtained from the underlying fine-tuning dataset. To encourage further research, we have made our code and datasets publicly available on GitHub at: https://github.com/albertsun1/gpt3-pii-attacks

Submitted 15 April, 2024; v1 submitted 30 July, 2023; originally announced July 2023.

arXiv:2307.16090

Rapid Flood Inundation Forecast Using Fourier Neural Operator

Authors: Alexander Y. Sun, Zhi Li, Wonhyun Lee, Qixing Huang, Bridget R. Scanlon, Clint Dawson

Abstract: Flood inundation forecast provides critical information for emergency planning before and during flood events. Real-time flood inundation forecast tools are still lacking. High-resolution hydrodynamic modeling has become more accessible in recent years; however, predicting flood extents at the street and building levels in real time is still computationally demanding. Here we present a hybrid process-based and data-driven machine learning (ML) approach for flood extent and inundation depth prediction. We used the Fourier neural operator (FNO), a highly efficient ML method, for surrogate modeling. The FNO model is demonstrated over an urban area in Houston (Texas, U.S.) by training on simulated water depths (at 15-min intervals) from six historical storm events and then testing on two holdout events. Results show that FNO outperforms the baseline U-Net model. It maintains high predictability at all lead times tested (up to 3 hrs) and performs well when applied to new sites, suggesting strong generalization skill.

Submitted 29 July, 2023; originally announced July 2023.

Comments: Artificial Intelligence (AI) and Humanitarian Assistance and Disaster Recovery (HADR) workshop, ICCV 2023 in Paris, France

arXiv:2305.03101

Hybrid Transducer and Attention based Encoder-Decoder Modeling for Speech-to-Text Tasks

Authors: Yun Tang, Anna Y. Sun, Hirofumi Inaguma, Xinyue Chen, Ning Dong, Xutai Ma, Paden D. Tomasello, Juan Pino

Abstract: Transducer and Attention based Encoder-Decoder (AED) are two widely used frameworks for speech-to-text tasks. They are designed for different purposes, and each has its own benefits and drawbacks for speech-to-text tasks. To leverage the strengths of both modeling methods, we propose a solution that combines Transducer and Attention based Encoder-Decoder (TAED) for speech-to-text tasks. The new method leverages AED's strength in non-monotonic sequence-to-sequence learning while retaining Transducer's streaming property. In the proposed framework, Transducer and AED share the same speech encoder. The predictor in Transducer is replaced by the decoder in the AED model, and the outputs of the decoder are conditioned on the speech inputs instead of outputs from an unconditioned language model. The proposed solution ensures that the model is optimized by covering all possible read/write scenarios and creates a matched environment for streaming applications. We evaluate the proposed approach on the MuST-C dataset, and the findings demonstrate that TAED performs significantly better than Transducer for offline automatic speech recognition (ASR) and speech-to-text translation (ST) tasks. In the streaming case, TAED outperforms Transducer in the ASR task and one ST direction, while comparable results are achieved in another translation direction.

Submitted 4 May, 2023; originally announced May 2023.

Comments: ACL 2023 main conference

arXiv:2304.14364

CONSCENDI: A Contrastive and Scenario-Guided Distillation Approach to Guardrail Models for Virtual Assistants

Authors: Albert Yu Sun, Varun Nair, Elliot Schumacher, Anitha Kannan

Abstract: A wave of new task-based virtual assistants has been fueled by increasingly powerful large language models (LLMs), such as GPT-4 (OpenAI, 2023). A major challenge in deploying LLM-based virtual conversational assistants in real-world settings is ensuring they operate within what is admissible for the task. To overcome this challenge, the designers of these virtual assistants rely on an independent guardrail system that verifies the virtual assistant's output aligns with the constraints required for the task. However, commonly used prompt-based guardrails can be difficult to engineer correctly and comprehensively. To address these challenges, we propose CONSCENDI. We use CONSCENDI to exhaustively generate training data with two key LLM-powered components: scenario-augmented generation and contrastive training examples. When generating conversational data, we generate a set of rule-breaking scenarios, which enumerate a diverse set of high-level ways a rule can be violated. This scenario-guided approach produces a diverse training set and provides chatbot designers greater control. To generate contrastive examples, we prompt the LLM to alter conversations with violations into acceptable conversations to enable fine-grained distinctions. We then use this data, generated by CONSCENDI, to train a smaller model. We find that CONSCENDI results in guardrail models that improve over baselines in multiple dialogue domains.

Submitted 3 April, 2024; v1 submitted 27 April, 2023; originally announced April 2023.

Comments: To appear in NAACL 2024

arXiv:2104.04764

Applications of physics-informed scientific machine learning in subsurface science: A survey

Authors: Alexander Y. Sun, Hongkyu Yoon, Chung-Yan Shih, Zhi Zhong

Abstract: Geosystems are geological formations altered by human activities such as fossil energy exploration, waste disposal, geologic carbon sequestration, and renewable energy generation. Geosystems also represent a critical link in the global water-energy nexus, providing both the source and buffering mechanisms for enabling societal adaptation to climate variability and change. The responsible use and exploration of geosystems are thus critical to geosystem governance, which in turn depends on efficient monitoring, risk assessment, and decision support tools for practical implementation. Fast advances in machine learning (ML) algorithms and novel sensing technologies in recent years have presented new opportunities for the subsurface research community to improve the efficacy and transparency of geosystem governance. Although recent studies have shown the great promise of scientific ML (SciML) models, questions remain on how to best leverage ML in the management of geosystems, which are typified by multiscality, high-dimensionality, and data resolution inhomogeneity. This survey provides a systematic review of the recent development and applications of domain-aware SciML in geosystem research, with an emphasis on how the accuracy, interpretability, scalability, defensibility, and generalization skill of ML approaches can be improved to better serve the geoscientific community.

Submitted 13 April, 2021; v1 submitted 10 April, 2021; originally announced April 2021.

Comments: 20 pages, 2 figures, 1 table

arXiv:1902.01933

doi 10.1029/2018WR023333

Combining Physically-Based Modeling and Deep Learning for Fusing GRACE Satellite Data: Can We Learn from Mismatch?

Authors: Alexander Y. Sun, Bridget R. Scanlon, Zizhan Zhang, David Walling, Soumendra N. Bhanja, Abhijit Mukherjee, Zhi Zhong

Abstract: Global hydrological and land surface models are increasingly used for tracking terrestrial total water storage (TWS) dynamics, but the utility of existing models is hampered by conceptual and/or data uncertainties related to various underrepresented and unrepresented processes, such as groundwater storage. The gravity recovery and climate experiment (GRACE) satellite mission provided a valuable independent data source for tracking TWS at regional and continental scales. Strong interest exists in fusing GRACE data into global hydrological models to improve their predictive performance. Here we develop and apply deep convolutional neural network (CNN) models to learn the spatiotemporal patterns of mismatch between TWS anomalies (TWSA) derived from GRACE and those simulated by NOAH, a widely used land surface model. Once trained, our CNN models can be used to correct the NOAH simulated TWSA without requiring GRACE data, potentially filling the data gap between GRACE and its follow-on mission, GRACE-FO. Our methodology is demonstrated over India, which has experienced significant groundwater depletion in recent decades that is nevertheless not being captured by the NOAH model. Results show that the CNN models significantly improve the match with GRACE TWSA, achieving a country-average correlation coefficient of 0.94 and Nash-Sutcliffe efficiency of 0.87, or 14% and 52% improvement respectively over the original NOAH TWSA. At the local scale, the learned mismatch pattern correlates well with the observed in situ groundwater storage anomaly data for most parts of India, suggesting that deep learning models effectively compensate for the missing groundwater component in NOAH for this study region.

Submitted 31 January, 2019; originally announced February 2019.

Journal ref: Water Resources Research, 2019

arXiv:1810.12856

doi 10.1029/2018GL080404

Discovering state-parameter mappings in subsurface models using generative adversarial networks

Authors: Alexander Y. Sun

Abstract: A fundamental problem in geophysical modeling is related to the identification and approximation of causal structures among physical processes. However, resolving the bidirectional mappings between physical parameters and model state variables (i.e., solving the forward and inverse problems) is challenging, especially when parameter dimensionality is high. Deep learning has opened a new door toward knowledge representation and complex pattern identification. In particular, the recently introduced generative adversarial networks (GANs) hold strong promise in learning cross-domain mappings for image translation. This study presents a state-parameter identification GAN (SPID-GAN) for simultaneously learning bidirectional mappings between a high-dimensional parameter space and the corresponding model state space. SPID-GAN is demonstrated using a series of representative problems from subsurface flow modeling. Results show that SPID-GAN achieves satisfactory performance in identifying the bidirectional state-parameter mappings, providing a new deep-learning-based knowledge representation paradigm for a wide array of complex geophysical problems.

Submitted 30 October, 2018; originally announced October 2018.

arXiv:1710.04253

Ultrasensitive biosensor based on Nd:YAG waveguide laser: Tumor cell and Dextrose solution

Authors: Guanhua Li, Huiyuan Li, Rumei Gong, Yang Tan, Javier Rodríguez Vázquez de Aldana, Yuping Sun, Feng Chen

Abstract: This work demonstrates the Nd:YAG waveguide laser as an efficient platform for bio-sensing. The waveguide was fabricated in the Nd:YAG crystal by combining ultrafast laser writing and ion irradiation. Because the laser oscillation in the Nd:YAG waveguide is ultra-sensitive to the external environment of the waveguide, even a weak disturbance induces a large variation in the output power of the laser. Exploiting this feature, the Nd:YAG waveguide coated with graphene and WSe2 layers is used as the substrate for a microfluidic channel. When the microflow crosses the Nd:YAG waveguide, the laser oscillation in the waveguide is disturbed, which induces fluctuations in the laser output. Through analysis of these fluctuations, the concentration of the dextrose solution and the size of the tumor cell are distinguished.

Submitted 9 October, 2017; originally announced October 2017.

Showing 1–8 of 8 results for author: Sun, A Y