Search | arXiv e-print repository

A Language Model-Guided Framework for Mining Time Series with Distributional Shifts

Authors: Haibei Zhu, Yousef El-Laham, Elizabeth Fons, Svitlana Vyetrenko

Abstract: Effective utilization of time series data is often constrained by the scarcity of data quantity that reflects complex dynamics, especially under the condition of distributional shifts. Existing datasets may not encompass the full range of statistical properties required for robust and comprehensive analysis. And privacy concerns can further limit their accessibility in domains such as finance and healthcare. This paper presents an approach that utilizes large language models and data source interfaces to explore and collect time series datasets. While obtained from external sources, the collected data share critical statistical properties with primary time series datasets, making it possible to model and adapt to various scenarios. This method enlarges the data quantity when the original data is limited or lacks essential properties. It suggests that collected datasets can effectively supplement existing datasets, especially involving changes in data distribution. We demonstrate the effectiveness of the collected datasets through practical examples and show how time series forecasting foundation models fine-tuned on these datasets achieve comparable performance to those models without fine-tuning.

Submitted 7 June, 2024; originally announced June 2024.

arXiv:2401.00081 [pdf, other]

Synthetic Data Applications in Finance

Authors: Vamsi K. Potluru, Daniel Borrajo, Andrea Coletta, Niccolò Dalmasso, Yousef El-Laham, Elizabeth Fons, Mohsen Ghassemi, Sriram Gopalakrishnan, Vikesh Gosai, Eleonora Kreačić, Ganapathy Mani, Saheed Obitayo, Deepak Paramanand, Natraj Raman, Mikhail Solonin, Srijan Sood, Svitlana Vyetrenko, Haibei Zhu, Manuela Veloso, Tucker Balch

Abstract: Synthetic data has made tremendous strides in various commercial settings including finance, healthcare, and virtual reality. We present a broad overview of prototypical applications of synthetic data in the financial sector and in particular provide richer details for a few select ones. These cover a wide variety of data modalities including tabular, time-series, event-series, and unstructured arising from both markets and retail financial applications. Since finance is a highly regulated industry, synthetic data is a potential approach for dealing with issues related to privacy, fairness, and explainability. Various metrics are utilized in evaluating the quality and effectiveness of our approaches in these applications. We conclude with open directions in synthetic data in the context of the financial domain.

Submitted 20 March, 2024; v1 submitted 29 December, 2023; originally announced January 2024.

Comments: 50 pages, journal submission; updated 6 privacy levels

arXiv:2312.13152 [pdf, other]

Neural Stochastic Differential Equations with Change Points: A Generative Adversarial Approach

Authors: Zhongchang Sun, Yousef El-Laham, Svitlana Vyetrenko

Abstract: Stochastic differential equations (SDEs) have been widely used to model real world random phenomena. Existing works mainly focus on the case where the time series is modeled by a single SDE, which might be restrictive for modeling time series with distributional shift. In this work, we propose a change point detection algorithm for time series modeled as neural SDEs. Given a time series dataset, the proposed method jointly learns the unknown change points and the parameters of distinct neural SDE models corresponding to each change point. Specifically, the SDEs are learned under the framework of generative adversarial networks (GANs) and the change points are detected based on the output of the GAN discriminator in a forward pass. At each step of the proposed algorithm, the change points and the SDE model parameters are updated in an alternating fashion. Numerical results on both synthetic and real datasets are provided to validate the performance of our algorithm in comparison to classical change point detection benchmarks, standard GAN-based neural SDEs, and other state-of-the-art deep generative models for time series data.

Submitted 22 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: accepted paper to be published in the proceedings of ICASSP 2024

arXiv:2312.13141 [pdf, other]

Augment on Manifold: Mixup Regularization with UMAP

Authors: Yousef El-Laham, Elizabeth Fons, Dillon Daudert, Svitlana Vyetrenko

Abstract: Data augmentation techniques play an important role in enhancing the performance of deep learning models. Despite their proven benefits in computer vision tasks, their application in other domains remains limited. This paper proposes a Mixup regularization scheme, referred to as UMAP Mixup, designed for "on-manifold" automated data augmentation for deep learning predictive models. The proposed approach ensures that the Mixup operations result in synthesized samples that lie on the data manifold of the features and labels by utilizing a dimensionality reduction technique known as uniform manifold approximation and projection. Evaluations across diverse regression tasks show that UMAP Mixup is competitive with or outperforms other Mixup variants, showing promise as an effective tool for enhancing the generalization performance of deep learning models.

Submitted 22 January, 2024; v1 submitted 20 December, 2023; originally announced December 2023.

Comments: accepted paper to be published in the proceedings of ICASSP 2024

arXiv:2307.00868 [pdf, other]

MADS: Modulated Auto-Decoding SIREN for time series imputation

Authors: Tom Bamford, Elizabeth Fons, Yousef El-Laham, Svitlana Vyetrenko

Abstract: Time series imputation remains a significant challenge across many fields due to the potentially significant variability in the type of data being modelled. Whilst traditional imputation methods often impose strong assumptions on the underlying data generation process, limiting their applicability, researchers have recently begun to investigate the potential of deep learning for this task, inspired by the strong performance shown by these models in both classification and regression problems across a range of applications. In this work we propose MADS, a novel auto-decoding framework for time series imputation, built upon implicit neural representations. Our method leverages the capabilities of SIRENs for high fidelity reconstruction of signals and irregular data, and combines it with a hypernetwork architecture which allows us to generalise by learning a prior over the space of time series. We evaluate our model on two real-world datasets, and show that it outperforms state-of-the-art methods for time series imputation. On the human activity dataset, it improves imputation performance by at least 40%, while on the air quality dataset it is shown to be competitive across all metrics. When evaluated on synthetic data, our model results in the best average rank across different dataset configurations over all baselines.

Submitted 3 July, 2023; originally announced July 2023.

Comments: 8 pages (inc. refs), 1 figure

arXiv:2306.07235 [pdf, ps, other]

Deep Gaussian Mixture Ensembles

Authors: Yousef El-Laham, Niccolò Dalmasso, Elizabeth Fons, Svitlana Vyetrenko

Abstract: This work introduces a novel probabilistic deep learning technique called deep Gaussian mixture ensembles (DGMEs), which enables accurate quantification of both epistemic and aleatoric uncertainty. By assuming the data generating process follows that of a Gaussian mixture, DGMEs are capable of approximating complex probability distributions, such as heavy-tailed or multimodal distributions. Our contributions include the derivation of an expectation-maximization (EM) algorithm used for learning the model parameters, which results in an upper-bound on the log-likelihood of training data over that of standard deep ensembles. Additionally, the proposed EM training procedure allows for learning of mixture weights, which is not commonly done in ensembles. Our experimental results demonstrate that DGMEs outperform state-of-the-art uncertainty quantifying deep learning models in handling complex predictive densities.

Submitted 12 June, 2023; originally announced June 2023.

Comments: Accepted at Uncertainty in Artificial Intelligence (UAI) 2023 Conference, 7 figures, 11 tables

arXiv:2211.11513 [pdf, other]

DSLOB: A Synthetic Limit Order Book Dataset for Benchmarking Forecasting Algorithms under Distributional Shift

Authors: Defu Cao, Yousef El-Laham, Loc Trinh, Svitlana Vyetrenko, Yan Liu

Abstract: In electronic trading markets, limit order books (LOBs) provide information about pending buy/sell orders at various price levels for a given security. Recently, there has been a growing interest in using LOB data for resolving downstream machine learning tasks (e.g., forecasting). However, dealing with out-of-distribution (OOD) LOB data is challenging since distributional shifts are unlabeled in current publicly available LOB datasets. Therefore, it is critical to build a synthetic LOB dataset with labeled OOD samples serving as a testbed for developing models that generalize well to unseen scenarios. In this work, we utilize a multi-agent market simulator to build a synthetic LOB dataset, named DSLOB, with and without market stress scenarios, which allows for the design of controlled distributional shift benchmarking. Using the proposed synthetic dataset, we provide a holistic analysis on the forecasting performance of three different state-of-the-art forecasting methods. Our results reflect the need for increased researcher efforts to develop algorithms with robustness to distributional shifts in high-frequency time series data.

Submitted 17 November, 2022; originally announced November 2022.

Comments: 11 pages, 5 figures, already accepted by NeurIPS 2022 Distribution Shifts Workshop

arXiv:2209.11306 [pdf, other]

doi 10.1145/3533271.3561772

StyleTime: Style Transfer for Synthetic Time Series Generation

Authors: Yousef El-Laham, Svitlana Vyetrenko

Abstract: Neural style transfer is a powerful computer vision technique that can incorporate the artistic "style" of one image to the "content" of another. The underlying theory behind the approach relies on the assumption that the style of an image is represented by the Gram matrix of its features, which is typically extracted from pre-trained convolutional neural networks (e.g., VGG-19). This idea does not straightforwardly extend to time series stylization since notions of style for two-dimensional images are not analogous to notions of style for one-dimensional time series. In this work, a novel formulation of time series style transfer is proposed for the purpose of synthetic data generation and enhancement. We introduce the concept of stylized features for time series, which is directly related to the time series realism properties, and propose a novel stylization algorithm, called StyleTime, that uses explicit feature extraction techniques to combine the underlying content (trend) of one time series with the style (distributional properties) of another. Further, we discuss evaluation metrics, and compare our work to existing state-of-the-art time series generation and augmentation schemes. To validate the effectiveness of our methods, we use stylized synthetic data as a means for data augmentation to improve the performance of recurrent neural network models on several forecasting tasks.

Submitted 22 September, 2022; originally announced September 2022.

arXiv:2208.05836 [pdf, other]

HyperTime: Implicit Neural Representation for Time Series

Authors: Elizabeth Fons, Alejandro Sztrajman, Yousef El-Laham, Alexandros Iosifidis, Svitlana Vyetrenko

Abstract: Implicit neural representations (INRs) have recently emerged as a powerful tool that provides an accurate and resolution-independent encoding of data. Their robustness as general approximators has been shown in a wide variety of data sources, with applications on image, sound, and 3D scene representation. However, little attention has been given to leveraging these architectures for the representation and analysis of time series data. In this paper, we analyze the representation of time series using INRs, comparing different activation functions in terms of reconstruction accuracy and training convergence speed. We show how these networks can be leveraged for the imputation of time series, with applications on both univariate and multivariate data. Finally, we propose a hypernetwork architecture that leverages INRs to learn a compressed latent representation of an entire time series dataset. We introduce an FFT-based loss to guide training so that all frequencies are preserved in the time series. We show that this network can be used to encode time series as INRs, and their embeddings can be interpolated to generate new time series from existing ones. We evaluate our generative method by using it for data augmentation, and show that it is competitive against current state-of-the-art approaches for augmentation of time series.

Submitted 11 August, 2022; originally announced August 2022.

arXiv:2202.11633 [pdf, other]

doi 10.1109/JPROC.2022.3154399

Fusion of Probability Density Functions

Authors: Günther Koliander, Yousef El-Laham, Petar M. Djurić, Franz Hlawatsch

Abstract: Fusing probabilistic information is a fundamental task in signal and data processing with relevance to many fields of technology and science. In this work, we investigate the fusion of multiple probability density functions (pdfs) of a continuous random variable or vector. Although the case of continuous random variables and the problem of pdf fusion frequently arise in multisensor signal processing, statistical inference, and machine learning, a universally accepted method for pdf fusion does not exist. The diversity of approaches, perspectives, and solutions related to pdf fusion motivates a unified presentation of the theory and methodology of the field. We discuss three different approaches to fusing pdfs. In the axiomatic approach, the fusion rule is defined indirectly by a set of properties (axioms). In the optimization approach, it is the result of minimizing an objective function that involves an information-theoretic divergence or a distance measure. In the supra-Bayesian approach, the fusion center interprets the pdfs to be fused as random observations. Our work is partly a survey, reviewing in a structured and coherent fashion many of the concepts and methods that have been developed in the literature. In addition, we present new results for each of the three approaches.
Our original contributions include new fusion rules, axioms, and axiomatic and optimization-based characterizations; a new formulation of supra-Bayesian fusion in terms of finite-dimensional parametrizations; and a study of supra-Bayesian fusion of posterior pdfs for linear Gaussian models.

Submitted 23 February, 2022; originally announced February 2022.

MSC Class: 60-02

Journal ref: Proceedings of the IEEE, 110(4):404--453, April 2022

Showing 1–10 of 10 results for author: El-Laham, Y