-
Towards a Flexible Embedding Learning Framework
Authors:
Chin-Chia Michael Yeh,
Dhruv Gelda,
Zhongfang Zhuang,
Yan Zheng,
Liang Gou,
Wei Zhang
Abstract:
Representation learning is a fundamental building block for analyzing entities in a database. While the existing embedding learning methods are effective in various data mining problems, their applicability is often limited because these methods have pre-determined assumptions on the type of semantics captured by the learned embeddings, and the assumptions may not align well with specific downstream tasks. In this work, we propose an embedding learning framework that 1) uses an input format that is agnostic to input data type, 2) is flexible in terms of the relationships that can be embedded into the learned representations, and 3) provides an intuitive pathway to incorporate domain knowledge into the embedding learning process. Our proposed framework utilizes a set of entity-relation-matrices as the input, which quantify the affinities among different entities in the database. Moreover, a sampling mechanism is carefully designed to establish a direct connection between the input and the information captured by the output embeddings. To complete the representation learning toolbox, we also outline a simple yet effective post-processing technique to properly visualize the learned embeddings. Our empirical results demonstrate that the proposed framework, in conjunction with a set of relevant entity-relation-matrices, outperforms the existing state-of-the-art approaches in various data mining tasks.
Submitted 23 September, 2020;
originally announced September 2020.
-
Multi-stream RNN for Merchant Transaction Prediction
Authors:
Zhongfang Zhuang,
Chin-Chia Michael Yeh,
Liang Wang,
Wei Zhang,
Junpeng Wang
Abstract:
Recently, digital payment systems have significantly changed people's lifestyles. New challenges have surfaced in monitoring and guaranteeing the integrity of payment processing systems. One important task is to predict the future transaction statistics of each merchant. These predictions can thus be used to steer other tasks, ranging from fraud detection to recommendation. This problem is challenging as we need to predict not only multivariate time series but also multiple steps into the future. In this work, we propose a multi-stream RNN model for multi-step merchant transaction predictions tailored to these requirements. The proposed multi-stream RNN summarizes transaction data at different granularities and makes predictions for multiple steps in the future. Our extensive experimental results have demonstrated that the proposed model is capable of outperforming existing state-of-the-art methods.
Submitted 24 July, 2020;
originally announced August 2020.
-
Multi-future Merchant Transaction Prediction
Authors:
Chin-Chia Michael Yeh,
Zhongfang Zhuang,
Wei Zhang,
Liang Wang
Abstract:
The multivariate time series generated from merchant transaction history can provide critical insights for payment processing companies. The capability of predicting merchants' future is crucial for fraud detection and recommendation systems. Conventionally, this problem is formulated to predict one multivariate time series under the multi-horizon setting. However, real-world applications often require more than one future trend prediction considering the uncertainties, where more than one multivariate time series needs to be predicted. This problem is called multi-future prediction. In this work, we combine the two research directions and propose to study this new problem: multi-future, multi-horizon and multivariate time series prediction. This problem is crucial as it has broad use cases in the financial industry to reduce the risk while improving user experience by providing alternative futures. This problem is also challenging as now we not only need to capture the patterns and insights from the past but also train a model that has a strong inference capability to project multiple possible outcomes. To solve this problem, we propose a new model using convolutional neural networks and a simple yet effective encoder-decoder structure to learn the time series pattern from multiple perspectives. We use experiments on real-world merchant transaction data to demonstrate the effectiveness of our proposed model. We also provide extensive discussions on different model design choices in our experimental section.
Submitted 10 July, 2020;
originally announced July 2020.
-
Constrained Non-Affine Alignment of Embeddings
Authors:
Yuwei Wang,
Yan Zheng,
Yanqing Peng,
Chin-Chia Michael Yeh,
Zhongfang Zhuang,
Das Mahashweta,
Bendre Mangesh,
Feifei Li,
Wei Zhang,
Jeff M. Phillips
Abstract:
Embeddings are one of the fundamental building blocks for data analysis tasks. Embeddings are already essential tools for large language models and image analysis, and their use is being extended to many other research domains. The generation of these distributed representations is often a data- and computation-expensive process; yet the holistic analysis and adjustment of them after they have been created is still a developing area. In this paper, we first propose a very general quantitative measure for the presence of features in the embedding data based on whether they can be learned. We then devise a method to remove or alleviate undesired features in the embedding while retaining the essential structure of the data. We use a Domain Adversarial Network (DAN) to generate a non-affine transformation, but we add constraints to ensure the essential structure of the embedding is preserved. Our empirical results demonstrate that the proposed algorithm significantly outperforms the state-of-the-art unsupervised algorithm on several data sets, including novel applications from the industry.
Submitted 19 November, 2021; v1 submitted 13 October, 2019;
originally announced October 2019.
-
Time Series Classification to Improve Poultry Welfare
Authors:
Alireza Abdoli,
Amy C. Murillo,
Chin-Chia M. Yeh,
Alec C. Gerry,
Eamonn J. Keogh
Abstract:
Poultry farms are an important contributor to the human food chain. Worldwide, humankind keeps an enormous number of domesticated birds (e.g. chickens) for their eggs and their meat, providing rich sources of low-fat protein. However, around the world, there have been growing concerns about the quality of life for the livestock in poultry farms; and increasingly vocal demands for improved standards of animal welfare. Recent advances in sensing technologies and machine learning allow the possibility of automatically assessing the health of some individual birds, and employing the lessons learned to improve the welfare for all birds. This task superficially appears to be easy, given the dramatic progress in recent years in classifying human behaviors, and given that human behaviors are presumably more complex. However, as we shall demonstrate, classifying chicken behaviors poses several unique challenges, chief among which is creating a generalizable dictionary of behaviors from sparse and noisy data. In this work we introduce a novel time series dictionary learning algorithm that can robustly learn from weakly labeled data sources.
Submitted 7 November, 2018;
originally announced November 2018.
-
Towards a Near Universal Time Series Data Mining Tool: Introducing the Matrix Profile
Authors:
Chin-Chia Michael Yeh
Abstract:
The last decade has seen a flurry of research on all-pairs-similarity-search (or, self-join) for text, DNA, and a handful of other datatypes, and these systems have been applied to many diverse data mining problems. Surprisingly, however, little progress has been made on addressing this problem for time series subsequences. In this thesis, we have introduced a near universal time series data mining tool called matrix profile which solves the all-pairs-similarity-search problem and caches the output in an easy-to-access fashion. The proposed algorithm is not only parameter-free, exact and scalable, but also applicable for both single and multidimensional time series. By building time series data mining methods on top of matrix profile, many time series data mining tasks (e.g., motif discovery, discord discovery, shapelet discovery, semantic segmentation, and clustering) can be efficiently solved. Because the same matrix profile can be shared by a diverse set of time series data mining methods, matrix profile is a versatile, computed-once-use-many-times data structure. We demonstrate the utility of matrix profile for many time series data mining problems, including motif discovery, discord discovery, weakly labeled time series classification, and representation learning on domains as diverse as seismology, entomology, music processing, bioinformatics, human activity monitoring, electrical power-demand monitoring, and medicine. We hope the matrix profile is not the end but the beginning of many more time series data mining projects.
Submitted 11 July, 2020; v1 submitted 5 November, 2018;
originally announced November 2018.
-
Representation Learning by Reconstructing Neighborhoods
Authors:
Chin-Chia Michael Yeh,
Yan Zhu,
Evangelos E. Papalexakis,
Abdullah Mueen,
Eamonn Keogh
Abstract:
Since its introduction, unsupervised representation learning has attracted a lot of attention from the research community, as it is demonstrated to be highly effective and easy-to-apply in tasks such as dimension reduction, clustering, visualization, information retrieval, and semi-supervised learning. In this work, we propose a novel unsupervised representation learning framework called neighbor-encoder, in which domain knowledge can be easily incorporated into the learning process without modifying the general encoder-decoder architecture of the classic autoencoder. In contrast to the autoencoder, which reconstructs the input data itself, neighbor-encoder reconstructs the input data's neighbors. As the proposed representation learning problem is essentially a neighbor reconstruction problem, domain knowledge can be easily incorporated in the form of an appropriate definition of similarity between objects. Based on that observation, our framework can leverage any off-the-shelf similarity search algorithms or side information to find the neighbor of an input object. Applications of other algorithms (e.g., association rule mining) in our framework are also possible, given that the appropriate definition of neighbor can vary in different contexts. We have demonstrated the effectiveness of our framework in many diverse domains, including images, text, and time series, and for various data mining tasks including classification, clustering, and visualization. Experimental results show that neighbor-encoder not only outperforms autoencoder in most of the scenarios we consider, but also achieves the state-of-the-art performance on text document clustering.
Submitted 6 November, 2018; v1 submitted 5 November, 2018;
originally announced November 2018.
-
The UCR Time Series Archive
Authors:
Hoang Anh Dau,
Anthony Bagnall,
Kaveh Kamgar,
Chin-Chia Michael Yeh,
Yan Zhu,
Shaghayegh Gharghabi,
Chotirat Ann Ratanamahatana,
Eamonn Keogh
Abstract:
The UCR Time Series Archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new data expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a large fraction may be mis-attributing the reasons for their improvement. Moreover, they may have been able to achieve the same improvement with a much simpler modification, requiring just a single line of code.
Submitted 8 September, 2019; v1 submitted 17 October, 2018;
originally announced October 2018.