-
Enhancing Supervised Visualization through Autoencoder and Random Forest Proximities for Out-of-Sample Extension
Authors:
Shuang Ni,
Adrien Aumon,
Guy Wolf,
Kevin R. Moon,
Jake S. Rhodes
Abstract:
The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a set of fixed, latent points, but are not capable of generalizing to an unseen test set. In this paper, we provide an out-of-sample extension method for the random forest-based supervised dimensionality reduction met…
▽ More
The value of supervised dimensionality reduction lies in its ability to uncover meaningful connections between data features and labels. Common dimensionality reduction methods embed a set of fixed, latent points, but are not capable of generalizing to an unseen test set. In this paper, we provide an out-of-sample extension method for the random forest-based supervised dimensionality reduction method, RF-PHATE, combining information learned from the random forest model with the function-learning capabilities of autoencoders. Through quantitative assessment of various autoencoder architectures, we identify that networks that reconstruct random forest proximities are more robust for the embedding extension problem. Furthermore, by leveraging proximity-based prototypes, we achieve a 40% reduction in training time without compromising extension quality. Our method does not require label information for out-of-sample points, thus serving as a semi-supervised method, and can achieve consistent quality using only 10% of the training data.
△ Less
Submitted 6 June, 2024;
originally announced June 2024.
-
Symmetry Discovery Beyond Affine Transformations
Authors:
Ben Shaw,
Abram Magner,
Kevin R. Moon
Abstract:
Symmetry detection has been shown to improve various machine learning tasks. In the context of continuous symmetry detection, current state of the art experiments are limited to the detection of affine transformations. Under the manifold assumption, we outline a framework for discovering continuous symmetry in data beyond the affine transformation group. We also provide a similar framework for dis…
▽ More
Symmetry detection has been shown to improve various machine learning tasks. In the context of continuous symmetry detection, current state of the art experiments are limited to the detection of affine transformations. Under the manifold assumption, we outline a framework for discovering continuous symmetry in data beyond the affine transformation group. We also provide a similar framework for discovering discrete symmetry. We experimentally compare our method to an existing method known as LieGAN and show that our method is competitive at detecting affine symmetries for large sample sizes and superior than LieGAN for small sample sizes. We also show our method is able to detect continuous symmetries beyond the affine group and is generally more computationally efficient than LieGAN.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
Noisy Data Visualization using Functional Data Analysis
Authors:
Haozhe Chen,
Andres Felipe Duque Correa,
Guy Wolf,
Kevin R. Moon
Abstract:
Data visualization via dimensionality reduction is an important tool in exploratory data analysis. However, when the data are noisy, many existing methods fail to capture the underlying structure of the data. The method called Empirical Intrinsic Geometry (EIG) was previously proposed for performing dimensionality reduction on high dimensional dynamical processes while theoretically eliminating al…
▽ More
Data visualization via dimensionality reduction is an important tool in exploratory data analysis. However, when the data are noisy, many existing methods fail to capture the underlying structure of the data. The method called Empirical Intrinsic Geometry (EIG) was previously proposed for performing dimensionality reduction on high dimensional dynamical processes while theoretically eliminating all noise. However, implementing EIG in practice requires the construction of high-dimensional histograms, which suffer from the curse of dimensionality. Here we propose a new data visualization method called Functional Information Geometry (FIG) for dynamical processes that adapts the EIG framework while using approaches from functional data analysis to mitigate the curse of dimensionality. We experimentally demonstrate that the resulting method outperforms a variant of EIG designed for visualization in terms of capturing the true structure, hyperparameter robustness, and computational speed. We then use our method to visualize EEG brain measurements of sleep activity.
△ Less
Submitted 5 June, 2024;
originally announced June 2024.
-
HyperCLOVA X Technical Report
Authors:
Kang Min Yoo,
Jaegeun Han,
Sookyo In,
Heewon Jeon,
Jisu Jeong,
Jaewook Kang,
Hyunwook Kim,
Kyung-Min Kim,
Munhyong Kim,
Sungju Kim,
Donghyun Kwak,
Hanock Kwak,
Se Jung Kwon,
Bado Lee,
Dongsoo Lee,
Gichang Lee,
Jooho Lee,
Baeseong Park,
Seong** Shin,
Joonsang Yu,
Seolki Baek,
Sumin Byeon,
Eungsup Cho,
Dooseok Choe,
Jeesung Han
, et al. (371 additional authors not shown)
Abstract:
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment t…
▽ More
We introduce HyperCLOVA X, a family of large language models (LLMs) tailored to the Korean language and culture, along with competitive capabilities in English, math, and coding. HyperCLOVA X was trained on a balanced mix of Korean, English, and code data, followed by instruction-tuning with high-quality human-annotated datasets while abiding by strict safety guidelines reflecting our commitment to responsible AI. The model is evaluated across various benchmarks, including comprehensive reasoning, knowledge, commonsense, factuality, coding, math, chatting, instruction-following, and harmlessness, in both Korean and English. HyperCLOVA X exhibits strong reasoning capabilities in Korean backed by a deep understanding of the language and cultural nuances. Further analysis of the inherent bilingual nature and its extension to multilingualism highlights the model's cross-lingual proficiency and strong generalization ability to untargeted languages, including machine translation between several language pairs and cross-lingual inference tasks. We believe that HyperCLOVA X can provide helpful guidance for regions or countries in develo** their sovereign LLMs.
△ Less
Submitted 13 April, 2024; v1 submitted 2 April, 2024;
originally announced April 2024.
-
Document Author Classification Using Parsed Language Structure
Authors:
Todd K Moon,
Jacob H. Gunther
Abstract:
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of \emph{The Federalist Papers}. Such methods may be useful in more modern times to detect fake or AI authorship. Progres…
▽ More
Over the years there has been ongoing interest in detecting authorship of a text based on statistical properties of the text, such as by using occurrence rates of noncontextual words. In previous work, these techniques have been used, for example, to determine authorship of all of \emph{The Federalist Papers}. Such methods may be useful in more modern times to detect fake or AI authorship. Progress in statistical natural language parsers introduces the possibility of using grammatical structure to detect authorship. In this paper we explore a new possibility for detecting authorship using grammatical structural information extracted using a statistical natural language parser. This paper provides a proof of concept, testing author classification based on grammatical structure on a set of "proof texts," The Federalist Papers and Sanditon which have been as test cases in previous authorship detection studies. Several features extracted from the statistical natural language parser were explored: all subtrees of some depth from any level; rooted subtrees of some depth, part of speech, and part of speech by level in the parse tree. It was found to be helpful to project the features into a lower dimensional space. Statistical experiments on these documents demonstrate that information from a statistical parser can, in fact, assist in distinguishing authors.
△ Less
Submitted 19 March, 2024;
originally announced March 2024.
-
Incorporating Taylor Series and Recursive Structure in Neural Networks for Time Series Prediction
Authors:
Jarrod Mau,
Kevin Moon
Abstract:
Time series analysis is relevant in various disciplines such as physics, biology, chemistry, and finance. In this paper, we present a novel neural network architecture that integrates elements from ResNet structures, while introducing the innovative incorporation of the Taylor series framework. This approach demonstrates notable enhancements in test accuracy across many of the baseline datasets in…
▽ More
Time series analysis is relevant in various disciplines such as physics, biology, chemistry, and finance. In this paper, we present a novel neural network architecture that integrates elements from ResNet structures, while introducing the innovative incorporation of the Taylor series framework. This approach demonstrates notable enhancements in test accuracy across many of the baseline datasets investigated. Furthermore, we extend our method to incorporate a recursive step, which leads to even further improvements in test accuracy. Our findings underscore the potential of our proposed model to significantly advance time series analysis methodologies, offering promising avenues for future research and application.
△ Less
Submitted 9 February, 2024;
originally announced February 2024.
-
Exploring higher-order neural network node interactions with total correlation
Authors:
Thomas Kerby,
Teresa White,
Kevin Moon
Abstract:
In domains such as ecological systems, collaborations, and the human brain the variables interact in complex ways. Yet accurately characterizing higher-order variable interactions (HOIs) is a difficult problem that is further exacerbated when the HOIs change across the data. To solve this problem we propose a new method called Local Correlation Explanation (CorEx) to capture HOIs at a local scale…
▽ More
In domains such as ecological systems, collaborations, and the human brain the variables interact in complex ways. Yet accurately characterizing higher-order variable interactions (HOIs) is a difficult problem that is further exacerbated when the HOIs change across the data. To solve this problem we propose a new method called Local Correlation Explanation (CorEx) to capture HOIs at a local scale by first clustering data points based on their proximity on the data manifold. We then use a multivariate version of the mutual information called the total correlation, to construct a latent factor representation of the data within each cluster to learn the local HOIs. We use Local CorEx to explore HOIs in synthetic and real world data to extract hidden insights about the data structure. Lastly, we demonstrate Local CorEx's suitability to explore and interpret the inner workings of trained neural networks.
△ Less
Submitted 6 February, 2024;
originally announced February 2024.
-
Local Background Estimation for Improved Gas Plume Identification in Hyperspectral Images
Authors:
Scout Jarman,
Zigfried Hampel-Arias,
Adra Carr,
Kevin R. Moon
Abstract:
Deep learning identification models have shown promise for identifying gas plumes in Longwave IR hyperspectral images of urban scenes, particularly when a large library of gases are being considered. Because many gases have similar spectral signatures, it is important to properly estimate the signal from a detected plume. Typically, a scene's global mean spectrum and covariance matrix are estimate…
▽ More
Deep learning identification models have shown promise for identifying gas plumes in Longwave IR hyperspectral images of urban scenes, particularly when a large library of gases are being considered. Because many gases have similar spectral signatures, it is important to properly estimate the signal from a detected plume. Typically, a scene's global mean spectrum and covariance matrix are estimated to whiten the plume's signal, which removes the background's signature from the gas signature. However, urban scenes can have many different background materials that are spatially and spectrally heterogeneous. This can lead to poor identification performance when the global background estimate is not representative of a given local background material. We use image segmentation, along with an iterative background estimation algorithm, to create local estimates for the various background materials that reside underneath a gas plume. Our method outperforms global background estimation on a set of simulated and real gas plumes. This method shows promise in increasing deep learning identification confidence, while being simple and easy to tune when considering diverse plumes.
△ Less
Submitted 23 January, 2024;
originally announced January 2024.
-
In-situ crack and keyhole pore detection in laser directed energy deposition through acoustic signal and deep learning
Authors:
Lequn Chen,
Xiling Yao,
Chaolin Tan,
Weiyang He,
**long Su,
Fei Weng,
Youxiang Chew,
Nicholas Poh Huat Ng,
Seung Ki Moon
Abstract:
Cracks and keyhole pores are detrimental defects in alloys produced by laser directed energy deposition (LDED). Laser-material interaction sound may hold information about underlying complex physical events such as crack propagation and pores formation. However, due to the noisy environment and intricate signal content, acoustic-based monitoring in LDED has received little attention. This paper pr…
▽ More
Cracks and keyhole pores are detrimental defects in alloys produced by laser directed energy deposition (LDED). Laser-material interaction sound may hold information about underlying complex physical events such as crack propagation and pores formation. However, due to the noisy environment and intricate signal content, acoustic-based monitoring in LDED has received little attention. This paper proposes a novel acoustic-based in-situ defect detection strategy in LDED. The key contribution of this study is to develop an in-situ acoustic signal denoising, feature extraction, and sound classification pipeline that incorporates convolutional neural networks (CNN) for online defect prediction. Microscope images are used to identify locations of the cracks and keyhole pores within a part. The defect locations are spatiotemporally registered with acoustic signal. Various acoustic features corresponding to defect-free regions, cracks, and keyhole pores are extracted and analysed in time-domain, frequency-domain, and time-frequency representations. The CNN model is trained to predict defect occurrences using the Mel-Frequency Cepstral Coefficients (MFCCs) of the lasermaterial interaction sound. The CNN model is compared to various classic machine learning models trained on the denoised acoustic dataset and raw acoustic dataset. The validation results shows that the CNN model trained on the denoised dataset outperforms others with the highest overall accuracy (89%), keyhole pore prediction accuracy (93%), and AUC-ROC score (98%). Furthermore, the trained CNN model can be deployed into an in-house developed software platform for online quality monitoring. The proposed strategy is the first study to use acoustic signals with deep learning for insitu defect detection in LDED process.
△ Less
Submitted 10 April, 2023;
originally announced April 2023.
-
Computational Complexity of Minimal Trap Spaces in Boolean Networks
Authors:
Kyungduk Moon,
Kangbok Lee,
Loïc Paulevé
Abstract:
A Boolean network (BN) is a discrete dynamical system defined by a Boolean function that maps to the domain itself. A trap space of a BN is a generalization of a fixed point, which is defined as the sub-hypercubes closed by the function of the BN. A trap space is minimal if it does not contain any smaller trap space. Minimal trap spaces have applications for the analysis of attractors of BNs with…
▽ More
A Boolean network (BN) is a discrete dynamical system defined by a Boolean function that maps to the domain itself. A trap space of a BN is a generalization of a fixed point, which is defined as the sub-hypercubes closed by the function of the BN. A trap space is minimal if it does not contain any smaller trap space. Minimal trap spaces have applications for the analysis of attractors of BNs with various update modes. This paper establishes the computational complexity results of three decision problems related to minimal trap spaces: the decision of the trap space property of a sub-hypercube, the decision of its minimality, and the decision of the membership of a given configuration to a minimal trap space. Under several cases on Boolean function representations, we investigate the computational complexity of each problem. In the general case, we demonstrate that the trap space property is coNP-complete, and the minimality and the membership properties are $Π_2^{\text P}$-complete. The complexities drop by one level in the polynomial hierarchy whenever the local functions of the BN are either unate, or are represented using truth-tables, binary decision diagrams, or double DNFs (Petri net encoding): the trap space property can be decided in a polynomial time, whereas deciding the minimality and the membership are coNP- complete. When the BN is given as its functional graph, all these problems are in P.
△ Less
Submitted 14 March, 2023; v1 submitted 24 December, 2022;
originally announced December 2022.
-
Manifold Alignment with Label Information
Authors:
Andres F. Duque,
Myriam Lizotte,
Guy Wolf,
Kevin R. Moon
Abstract:
Multi-domain data is becoming increasingly common and presents both challenges and opportunities in the data science community. The integration of distinct data-views can be used for exploratory data analysis, and benefit downstream analysis including machine learning related tasks. With this in mind, we present a novel manifold alignment method called MALI (Manifold alignment with label informati…
▽ More
Multi-domain data is becoming increasingly common and presents both challenges and opportunities in the data science community. The integration of distinct data-views can be used for exploratory data analysis, and benefit downstream analysis including machine learning related tasks. With this in mind, we present a novel manifold alignment method called MALI (Manifold alignment with label information) that learns a correspondence between two distinct domains. MALI can be considered as belonging to a middle ground between the more commonly addressed semi-supervised manifold alignment problem with some known correspondences between the two domains, and the purely unsupervised case, where no known correspondences are provided. To do this, MALI learns the manifold structure in both domains via a diffusion process and then leverages discrete class labels to guide the alignment. By aligning two distinct domains, MALI recovers a pairing and a common representation that reveals related samples in both domains. Additionally, MALI can be used for the transfer learning problem known as domain adaptation. We show that MALI outperforms the current state-of-the-art manifold alignment methods across multiple datasets.
△ Less
Submitted 30 October, 2022; v1 submitted 23 October, 2022;
originally announced October 2022.
-
3D Graph Contrastive Learning for Molecular Property Prediction
Authors:
Kisung Moon,
Sunyoung Kwon
Abstract:
Self-supervised learning (SSL) is a method that learns the data representation by utilizing supervision inherent in the data. This learning method is in the spotlight in the drug field, lacking annotated data due to time-consuming and expensive experiments. SSL using enormous unlabeled data has shown excellent performance for molecular property prediction, but a few issues exist. (1) Existing SSL…
▽ More
Self-supervised learning (SSL) is a method that learns the data representation by utilizing supervision inherent in the data. This learning method is in the spotlight in the drug field, lacking annotated data due to time-consuming and expensive experiments. SSL using enormous unlabeled data has shown excellent performance for molecular property prediction, but a few issues exist. (1) Existing SSL models are large-scale; there is a limitation to implementing SSL where the computing resource is insufficient. (2) In most cases, they do not utilize 3D structural information for molecular representation learning. The activity of a drug is closely related to the structure of the drug molecule. Nevertheless, most current models do not use 3D information or use it partially. (3) Previous models that apply contrastive learning to molecules use the augmentation of permuting atoms and bonds. Therefore, molecules having different characteristics can be in the same positive samples. We propose a novel contrastive learning framework, small-scale 3D Graph Contrastive Learning (3DGCL) for molecular property prediction, to solve the above problems. 3DGCL learns the molecular representation by reflecting the molecule's structure through the pre-training process that does not change the semantics of the drug. Using only 1,128 samples for pre-train data and 1 million model parameters, we achieved the state-of-the-art or comparable performance in four regression benchmark datasets. Extensive experiments demonstrate that 3D structural information based on chemical knowledge is essential to molecular representation learning for property prediction.
△ Less
Submitted 18 August, 2022; v1 submitted 31 May, 2022;
originally announced August 2022.
-
Diffusion Transport Alignment
Authors:
Andres F. Duque,
Guy Wolf,
Kevin R. Moon
Abstract:
The integration of multimodal data presents a challenge in cases when the study of a given phenomena by different instruments or conditions generates distinct but related domains. Many existing data integration methods assume a known one-to-one correspondence between domains of the entire dataset, which may be unrealistic. Furthermore, existing manifold alignment methods are not suited for cases w…
▽ More
The integration of multimodal data presents a challenge in cases when the study of a given phenomena by different instruments or conditions generates distinct but related domains. Many existing data integration methods assume a known one-to-one correspondence between domains of the entire dataset, which may be unrealistic. Furthermore, existing manifold alignment methods are not suited for cases where the data contains domain-specific regions, i.e., there is not a counterpart for a certain portion of the data in the other domain. We propose Diffusion Transport Alignment (DTA), a semi-supervised manifold alignment method that exploits prior correspondence knowledge between only a few points to align the domains. By building a diffusion process, DTA finds a transportation plan between data measured from two heterogeneous domains with different feature spaces, which by assumption, share a similar geometrical structure coming from the same underlying data generating process. DTA can also compute a partial alignment in a data-driven fashion, resulting in accurate alignments when some data are measured in only one domain. We empirically demonstrate that DTA outperforms other methods in aligning multimodal data in this semisupervised setting. We also empirically show that the alignment obtained by DTA can improve the performance of machine learning tasks, such as domain adaptation, inter-domain feature map**, and exploratory data analysis, while outperforming competing methods.
△ Less
Submitted 15 June, 2022;
originally announced June 2022.
-
Geometry- and Accuracy-Preserving Random Forest Proximities
Authors:
Jake S. Rhodes,
Adele Cutler,
Kevin R. Moon
Abstract:
Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including…
▽ More
Random forests are considered one of the best out-of-the-box classification and regression algorithms due to their high level of predictive performance with relatively little tuning. Pairwise proximities can be computed from a trained random forest and measure the similarity between data points relative to the supervised task. Random forest proximities have been used in many applications including the identification of variable importance, data imputation, outlier detection, and data visualization. However, existing definitions of random forest proximities do not accurately reflect the data geometry learned by the random forest. In this paper, we introduce a novel definition of random forest proximities called Random Forest-Geometry- and Accuracy-Preserving proximities (RF-GAP). We prove that the proximity-weighted sum (regression) or majority vote (classification) using RF-GAP exactly matches the out-of-bag random forest prediction, thus capturing the data geometry learned by the random forest. We empirically show that this improved geometric representation outperforms traditional random forest proximities in tasks such as data imputation and provides outlier detection and visualization results consistent with the learned data geometry.
△ Less
Submitted 28 February, 2023; v1 submitted 29 January, 2022;
originally announced January 2022.
-
GPS-Denied Navigation Using SAR Images and Neural Networks
Authors:
Teresa White,
Jesse Wheeler,
Colton Lindstrom,
Randall Christensen,
Kevin R. Moon
Abstract:
Unmanned aerial vehicles (UAV) often rely on GPS for navigation. GPS signals, however, are very low in power and easily jammed or otherwise disrupted. This paper presents a method for determining the navigation errors present at the beginning of a GPS-denied period utilizing data from a synthetic aperture radar (SAR) system. This is accomplished by comparing an online-generated SAR image with a re…
▽ More
Unmanned aerial vehicles (UAV) often rely on GPS for navigation. GPS signals, however, are very low in power and easily jammed or otherwise disrupted. This paper presents a method for determining the navigation errors present at the beginning of a GPS-denied period utilizing data from a synthetic aperture radar (SAR) system. This is accomplished by comparing an online-generated SAR image with a reference image obtained a priori. The distortions relative to the reference image are learned and exploited with a convolutional neural network to recover the initial navigational errors, which can be used to recover the true flight trajectory throughout the synthetic aperture. The proposed neural network approach is able to learn to predict the initial errors on both simulated and real SAR image data.
△ Less
Submitted 22 October, 2020;
originally announced October 2020.
-
Extendable and invertible manifold learning with geometry regularized autoencoders
Authors:
Andrés F. Duque,
Sacha Morin,
Guy Wolf,
Kevin R. Moon
Abstract:
A fundamental task in data exploration is to extract simplified low dimensional representations that capture intrinsic geometry in data, especially for faithfully visualizing data in two or three dimensions. Common approaches to this task use kernel methods for manifold learning. However, these methods typically only provide an embedding of fixed input data and cannot extend to new data points. Au…
▽ More
A fundamental task in data exploration is to extract simplified low dimensional representations that capture intrinsic geometry in data, especially for faithfully visualizing data in two or three dimensions. Common approaches to this task use kernel methods for manifold learning. However, these methods typically only provide an embedding of fixed input data and cannot extend to new data points. Autoencoders have also recently become popular for representation learning. But while they naturally compute feature extractors that are both extendable to new data and invertible (i.e., reconstructing original features from latent representation), they have limited capabilities to follow global intrinsic geometry compared to kernel-based manifold learning. We present a new method for integrating both approaches by incorporating a geometric regularization term in the bottleneck of the autoencoder. Our regularization, based on the diffusion potential distances from the recently-proposed PHATE visualization method, encourages the learned latent representation to follow intrinsic data geometry, similar to manifold learning algorithms, while still enabling faithful extension to new data and reconstruction of data in the original feature space from latent coordinates. We compare our approach with leading kernel methods and autoencoder models for manifold learning to provide qualitative and quantitative evidence of our advantages in preserving intrinsic structure, out of sample extension, and reconstruction. Our method is easily implemented for big-data applications, whereas other methods are limited in this regard.
△ Less
Submitted 22 November, 2020; v1 submitted 14 July, 2020;
originally announced July 2020.
-
Supervised Visualization for Data Exploration
Authors:
Jake S. Rhodes,
Adele Cutler,
Guy Wolf,
Kevin R. Moon
Abstract:
Dimensionality reduction is often used as an initial step in data exploration, either as preprocessing for classification or regression or for visualization. Most dimensionality reduction techniques to date are unsupervised; they do not take class labels into account (e.g., PCA, MDS, t-SNE, Isomap). Such methods require large amounts of data and are often sensitive to noise that may obfuscate impo…
▽ More
Dimensionality reduction is often used as an initial step in data exploration, either as preprocessing for classification or regression or for visualization. Most dimensionality reduction techniques to date are unsupervised; they do not take class labels into account (e.g., PCA, MDS, t-SNE, Isomap). Such methods require large amounts of data and are often sensitive to noise that may obfuscate important patterns in the data. Various attempts at supervised dimensionality reduction methods that take into account auxiliary annotations (e.g., class labels) have been successfully implemented with goals of increased classification accuracy or improved data visualization. Many of these supervised techniques incorporate labels in the loss function in the form of similarity or dissimilarity matrices, thereby creating over-emphasized separation between class clusters, which does not realistically represent the local and global relationships in the data. In addition, these approaches are often sensitive to parameter tuning, which may be difficult to configure without an explicit quantitative notion of visual superiority. In this paper, we describe a novel supervised visualization technique based on random forest proximities and diffusion-based dimensionality reduction. We show, both qualitatively and quantitatively, the advantages of our approach in retaining local and global structures in data, while emphasizing important variables in the low-dimensional embedding. Importantly, our approach is robust to noise and parameter tuning, thus making it simple to use while producing reliable visualizations for data exploration.
△ Less
Submitted 15 June, 2020;
originally announced June 2020.
-
Machines Getting with the Program: Understanding Intent Arguments of Non-Canonical Directives
Authors:
Won Ik Cho,
Young Ki Moon,
Sangwhan Moon,
Seok Min Kim,
Nam Soo Kim
Abstract:
Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective. Along with these requirements, agents are expected to extrapolate intent from the user's dialogue even when subjected to non-canonical forms of speech. This depends on the agent's comprehension of parap…
▽ More
Modern dialog managers face the challenge of having to fulfill human-level conversational skills as part of common user expectations, including but not limited to discourse with no clear objective. Along with these requirements, agents are expected to extrapolate intent from the user's dialogue even when subjected to non-canonical forms of speech. This depends on the agent's comprehension of paraphrased forms of such utterances. Especially in low-resource languages, the lack of data is a bottleneck that prevents advancements of the comprehension performance for these types of agents. In this regard, here we demonstrate the necessity of extracting the intent argument of non-canonical directives in a natural language format, which may yield more accurate parsing, and suggest guidelines for building a parallel corpus for this purpose. Following the guidelines, we construct a Korean corpus of 50K instances of question/command-intent pairs, including the labels for classification of the utterance type. We also propose a method for mitigating class imbalance, demonstrating the potential applications of the corpus generation method and its multilingual extensibility.
△ Less
Submitted 7 October, 2020; v1 submitted 1 December, 2019;
originally announced December 2019.
-
Coarse Graining of Data via Inhomogeneous Diffusion Condensation
Authors:
Nathan Brugnone,
Alex Gonopolskiy,
Mark W. Moyle,
Manik Kuchroo,
David van Dijk,
Kevin R. Moon,
Daniel Colon-Ramos,
Guy Wolf,
Matthew J. Hirn,
Smita Krishnaswamy
Abstract:
Big data often has emergent structure that exists at multiple levels of abstraction, which are useful for characterizing complex interactions and dynamics of the observations. Here, we consider multiple levels of abstraction via a multiresolution geometry of data points at different granularities. To construct this geometry we define a time-inhomogeneous diffusion process that effectively condense…
▽ More
Big data often has emergent structure that exists at multiple levels of abstraction, which are useful for characterizing complex interactions and dynamics of the observations. Here, we consider multiple levels of abstraction via a multiresolution geometry of data points at different granularities. To construct this geometry we define a time-inhomogeneous diffusion process that effectively condenses data points together to uncover nested grou**s at larger and larger granularities. This inhomogeneous process creates a deep cascade of intrinsic low pass filters on the data affinity graph that are applied in sequence to gradually eliminate local variability while adjusting the learned data geometry to increasingly coarser resolutions. We provide visualizations to exhibit our method as a continuously-hierarchical clustering with directions of eliminated variation highlighted at each step. The utility of our algorithm is demonstrated via neuronal data condensation, where the constructed multiresolution data geometry uncovers the organization, grou**, and connectivity between neurons.
△ Less
Submitted 9 March, 2020; v1 submitted 9 July, 2019;
originally announced July 2019.
-
Visualizing High Dimensional Dynamical Processes
Authors:
Andrés F. Duque,
Guy Wolf,
Kevin R. Moon
Abstract:
Manifold learning techniques for dynamical systems and time series have shown their utility for a broad spectrum of applications in recent years. While these methods are effective at learning a low-dimensional representation, they are often insufficient for visualizing the global and local structure of the data. In this paper, we present DIG (Dynamical Information Geometry), a visualization method…
▽ More
Manifold learning techniques for dynamical systems and time series have shown their utility for a broad spectrum of applications in recent years. While these methods are effective at learning a low-dimensional representation, they are often insufficient for visualizing the global and local structure of the data. In this paper, we present DIG (Dynamical Information Geometry), a visualization method for multivariate time series data that extracts an information geometry from a diffusion framework. Specifically, we implement a novel group of distances in the context of diffusion operators, which may be useful to reveal structure in the data that may not be accessible by the commonly used diffusion distances. Finally, we present a case study applying our visualization tool to EEG data to visualize sleep stages.
△ Less
Submitted 25 June, 2019;
originally announced June 2019.
-
Compressed Diffusion
Authors:
Scott Gigante,
Jay S. Stanley III,
Ngan Vu,
David van Dijk,
Kevin Moon,
Guy Wolf,
Smita Krishnaswamy
Abstract:
Diffusion maps are a commonly used kernel-based method for manifold learning, which can reveal intrinsic structures in data and embed them in low dimensions. However, as with most kernel methods, its implementation requires a heavy computational load, reaching up to cubic complexity in the number of data points. This limits its usability in modern data analysis. Here, we present a new approach to…
▽ More
Diffusion maps are a commonly used kernel-based method for manifold learning, which can reveal intrinsic structures in data and embed them in low dimensions. However, as with most kernel methods, its implementation requires a heavy computational load, reaching up to cubic complexity in the number of data points. This limits its usability in modern data analysis. Here, we present a new approach to computing the diffusion geometry, and related embeddings, from a compressed diffusion process between data regions rather than data points. Our construction is based on an adaptation of the previously proposed measure-based Gaussian correlation (MGC) kernel that robustly captures the local geometry around data points. We use this MGC kernel to efficiently compress diffusion relations from pointwise to data region resolution. Finally, a spectral embedding of the data regions provides coordinates that are used to interpolate and approximate the pointwise diffusion map embedding of data. We analyze theoretical connections between our construction and the original diffusion geometry of diffusion maps, and demonstrate the utility of our method in analyzing big datasets, where it outperforms competing approaches.
△ Less
Submitted 10 June, 2019; v1 submitted 31 January, 2019;
originally announced February 2019.
-
Extracting Arguments from Korean Question and Command: An Annotated Corpus for Structured Paraphrasing
Authors:
Won Ik Cho,
Young Ki Moon,
Woo Hyun Kang,
Nam Soo Kim
Abstract:
Intention identification is a core issue in dialog management. However, due to the non-canonicality of the spoken language, it is difficult to extract the content automatically from the conversation-style utterances. This is much more challenging for languages like Korean and Japanese since the agglutination between morphemes make it difficult for the machines to parse the sentence and understand…
▽ More
Intention identification is a core issue in dialog management. However, due to the non-canonicality of the spoken language, it is difficult to extract the content automatically from the conversation-style utterances. This is much more challenging for languages like Korean and Japanese since the agglutination between morphemes make it difficult for the machines to parse the sentence and understand the intention. To suggest a guideline for this problem, and to merge the issue flexibly with the neural paraphrasing systems introduced recently, we propose a structured annotation scheme for Korean question/commands and the resulting corpus which are widely applicable to the field of argument mining. The scheme and dataset are expected to help machines understand the intention of natural language and grasp the core meaning of conversation-style instructions.
△ Less
Submitted 9 July, 2019; v1 submitted 10 October, 2018;
originally announced October 2018.
-
Convergence Rates for Empirical Estimation of Binary Classification Bounds
Authors:
Salimeh Yasaei Sekeh,
Morteza Noshad,
Kevin R. Moon,
Alfred O. Hero
Abstract:
Bounding the best achievable error probability for binary classification problems is relevant to many applications including machine learning, signal processing, and information theory. Many bounds on the Bayes binary classification error rate depend on information divergences between the pair of class distributions. Recently, the Henze-Penrose (HP) divergence has been proposed for bounding classi…
▽ More
Bounding the best achievable error probability for binary classification problems is relevant to many applications including machine learning, signal processing, and information theory. Many bounds on the Bayes binary classification error rate depend on information divergences between the pair of class distributions. Recently, the Henze-Penrose (HP) divergence has been proposed for bounding classification error probability. We consider the problem of empirically estimating the HP-divergence from random samples. We derive a bound on the convergence rate for the Friedman-Rafsky (FR) estimator of the HP-divergence, which is related to a multivariate runs statistic for testing between two distributions. The FR estimator is derived from a multicolored Euclidean minimal spanning tree (MST) that spans the merged samples. We obtain a concentration inequality for the Friedman-Rafsky estimator of the Henze-Penrose divergence. We validate our results experimentally and illustrate their application to real datasets.
△ Less
Submitted 1 October, 2018;
originally announced October 2018.
-
Modeling Global Dynamics from Local Snapshots with Deep Generative Neural Networks
Authors:
Scott Gigante,
David van Dijk,
Kevin Moon,
Alexander Strzalkowski,
Guy Wolf,
Smita Krishnaswamy
Abstract:
Complex high dimensional stochastic dynamic systems arise in many applications in the natural sciences and especially biology. However, while these systems are difficult to describe analytically, "snapshot" measurements that sample the output of the system are often available. In order to model the dynamics of such systems given snapshot data, or local transitions, we present a deep neural network…
▽ More
Complex high dimensional stochastic dynamic systems arise in many applications in the natural sciences and especially biology. However, while these systems are difficult to describe analytically, "snapshot" measurements that sample the output of the system are often available. In order to model the dynamics of such systems given snapshot data, or local transitions, we present a deep neural network framework we call Dynamics Modeling Network or DyMoN. DyMoN is a neural network framework trained as a deep generative Markov model whose next state is a probability distribution based on the current state. DyMoN is trained using samples of current and next-state pairs, and thus does not require longitudinal measurements. We show the advantage of DyMoN over shallow models such as Kalman filters and hidden Markov models, and other deep models such as recurrent neural networks in its ability to embody the dynamics (which can be studied via perturbation of the neural network) and generate longitudinal hypothetical trajectories. We perform three case studies in which we apply DyMoN to different types of biological systems and extract features of the dynamics in each case by examining the learned model.
△ Less
Submitted 10 June, 2019; v1 submitted 9 February, 2018;
originally announced February 2018.
-
Ensemble Estimation of Distributional Functionals via $k$-Nearest Neighbors
Authors:
Kevin R. Moon,
Kumar Sricharan,
Alfred O. Hero III
Abstract:
The problem of accurate nonparametric estimation of distributional functionals (integral functionals of one or more probability distributions) has received recent interest due to their wide applicability in signal processing, information theory, machine learning, and statistics. In particular, $k$-nearest neighbor (nn) based methods have received a lot of attention due to their adaptive nature and…
▽ More
The problem of accurate nonparametric estimation of distributional functionals (integral functionals of one or more probability distributions) has received recent interest due to their wide applicability in signal processing, information theory, machine learning, and statistics. In particular, $k$-nearest neighbor (nn) based methods have received a lot of attention due to their adaptive nature and their relatively low computational complexity. We derive the mean squared error (MSE) convergence rates of leave-one-out $k$-nn plug-in density estimators of a large class of distributional functionals without boundary correction. We then apply the theory of optimally weighted ensemble estimation to obtain weighted ensemble estimators that achieve the parametric MSE rate under assumptions that are competitive with the state of the art. The asymptotic distributions of these estimators, which are unknown for all other $k$-nn based distributional functional estimators, are also presented which enables us to perform hypothesis testing.
△ Less
Submitted 10 July, 2017;
originally announced July 2017.
-
Direct Ensemble Estimation of Density Functionals
Authors:
Alan Wisler,
Kevin Moon,
Visar Berisha
Abstract:
Estimating density functionals of analog sources is an important problem in statistical signal processing and information theory. Traditionally, estimating these quantities requires either making parametric assumptions about the underlying distributions or using non-parametric density estimation followed by integration. In this paper we introduce a direct nonparametric approach which bypasses the…
▽ More
Estimating density functionals of analog sources is an important problem in statistical signal processing and information theory. Traditionally, estimating these quantities requires either making parametric assumptions about the underlying distributions or using non-parametric density estimation followed by integration. In this paper we introduce a direct nonparametric approach which bypasses the need for density estimation by using the error rates of k-NN classifiers asdata-driven basis functions that can be combined to estimate a range of density functionals. However, this method is subject to a non-trivial bias that dramatically slows the rate of convergence in higher dimensions. To overcome this limitation, we develop an ensemble method for estimating the value of the basis function which, under some minor constraints on the smoothness of the underlying distributions, achieves the parametric rate of convergence regardless of data dimension.
△ Less
Submitted 17 May, 2017;
originally announced May 2017.
-
Direct Estimation of Information Divergence Using Nearest Neighbor Ratios
Authors:
Morteza Noshad,
Kevin R. Moon,
Salimeh Yasaei Sekeh,
Alfred O. Hero III
Abstract:
We propose a direct estimation method for Rényi and f-divergence measures based on a new graph theoretical interpretation. Suppose that we are given two sample sets $X$ and $Y$, respectively with $N$ and $M$ samples, where $η:=M/N$ is a constant value. Considering the $k$-nearest neighbor ($k$-NN) graph of $Y$ in the joint data set $(X,Y)$, we show that the average powered ratio of the number of…
▽ More
We propose a direct estimation method for Rényi and f-divergence measures based on a new graph theoretical interpretation. Suppose that we are given two sample sets $X$ and $Y$, respectively with $N$ and $M$ samples, where $η:=M/N$ is a constant value. Considering the $k$-nearest neighbor ($k$-NN) graph of $Y$ in the joint data set $(X,Y)$, we show that the average powered ratio of the number of $X$ points to the number of $Y$ points among all $k$-NN points is proportional to Rényi divergence of $X$ and $Y$ densities. A similar method can also be used to estimate f-divergence measures. We derive bias and variance rates, and show that for the class of $γ$-Hölder smooth functions, the estimator achieves the MSE rate of $O(N^{-2γ/(γ+d)})$. Furthermore, by using a weighted ensemble estimation technique, for density functions with continuous and bounded derivatives of up to the order $d$, and some extra conditions at the support set boundary, we derive an ensemble estimator that achieves the parametric MSE rate of $O(1/N)$. Our estimators are more computationally tractable than other competing estimators, which makes them appealing in many practical applications.
△ Less
Submitted 20 November, 2017; v1 submitted 16 February, 2017;
originally announced February 2017.
-
Ensemble Estimation of Generalized Mutual Information with Applications to Genomics
Authors:
Kevin R. Moon,
Kumar Sricharan,
Alfred O. Hero III
Abstract:
Mutual information is a measure of the dependence between random variables that has been used successfully in myriad applications in many fields. Generalized mutual information measures that go beyond classical Shannon mutual information have also received much interest in these applications. We derive the mean squared error convergence rates of kernel density-based plug-in estimators of general m…
▽ More
Mutual information is a measure of the dependence between random variables that has been used successfully in myriad applications in many fields. Generalized mutual information measures that go beyond classical Shannon mutual information have also received much interest in these applications. We derive the mean squared error convergence rates of kernel density-based plug-in estimators of general mutual information measures between two multidimensional random variables $\mathbf{X}$ and $\mathbf{Y}$ for two cases: 1) $\mathbf{X}$ and $\mathbf{Y}$ are continuous; 2) $\mathbf{X}$ and $\mathbf{Y}$ may have any mixture of discrete and continuous components. Using the derived rates, we propose an ensemble estimator of these information measures called GENIE by taking a weighted sum of the plug-in estimators with varied bandwidths. The resulting ensemble estimators achieve the $1/N$ parametric mean squared error convergence rate when the conditional densities of the continuous variables are sufficiently smooth. To the best of our knowledge, this is the first nonparametric mutual information estimator known to achieve the parametric convergence rate for the mixture case, which frequently arises in applications (e.g. variable selection in classification). The estimator is simple to implement and it uses the solution to an offline convex optimization problem and simple plug-in estimators. A central limit theorem is also derived for the ensemble estimators and minimax rates are derived for the continuous case. We demonstrate the ensemble estimator for the mixed case on simulated data and apply the proposed estimator to analyze gene relationships in single cell data.
△ Less
Submitted 29 July, 2021; v1 submitted 27 January, 2017;
originally announced January 2017.
-
Information Theoretic Structure Learning with Confidence
Authors:
Kevin R. Moon,
Morteza Noshad,
Salimeh Yasaei Sekeh,
Alfred O. Hero III
Abstract:
Information theoretic measures (e.g. the Kullback Liebler divergence and Shannon mutual information) have been used for exploring possibly nonlinear multivariate dependencies in high dimension. If these dependencies are assumed to follow a Markov factor graph model, this exploration process is called structure discovery. For discrete-valued samples, estimates of the information divergence over the…
▽ More
Information theoretic measures (e.g. the Kullback Liebler divergence and Shannon mutual information) have been used for exploring possibly nonlinear multivariate dependencies in high dimension. If these dependencies are assumed to follow a Markov factor graph model, this exploration process is called structure discovery. For discrete-valued samples, estimates of the information divergence over the parametric class of multinomial models lead to structure discovery methods whose mean squared error achieves parametric convergence rates as the sample size grows. However, a naive application of this method to continuous nonparametric multivariate models converges much more slowly. In this paper we introduce a new method for nonparametric structure discovery that uses weighted ensemble divergence estimators that achieve parametric convergence rates and obey an asymptotic central limit theorem that facilitates hypothesis testing and other types of statistical validation.
△ Less
Submitted 13 September, 2016;
originally announced September 2016.
-
ConfidentCare: A Clinical Decision Support System for Personalized Breast Cancer Screening
Authors:
Ahmed M. Alaa,
Kyeong H. Moon,
William Hsu,
Mihaela van der Schaar
Abstract:
Breast cancer screening policies attempt to achieve timely diagnosis by the regular screening of apparently healthy women. Various clinical decisions are needed to manage the screening process; those include: selecting the screening tests for a woman to take, interpreting the test outcomes, and deciding whether or not a woman should be referred to a diagnostic test. Such decisions are currently gu…
▽ More
Breast cancer screening policies attempt to achieve timely diagnosis by the regular screening of apparently healthy women. Various clinical decisions are needed to manage the screening process; those include: selecting the screening tests for a woman to take, interpreting the test outcomes, and deciding whether or not a woman should be referred to a diagnostic test. Such decisions are currently guided by clinical practice guidelines (CPGs), which represent a one-size-fits-all approach that are designed to work well on average for a population, without guaranteeing that it will work well uniformly over that population. Since the risks and benefits of screening are functions of each patients features, personalized screening policies that are tailored to the features of individuals are needed in order to ensure that the right tests are recommended to the right woman. In order to address this issue, we present ConfidentCare: a computer-aided clinical decision support system that learns a personalized screening policy from the electronic health record (EHR) data. ConfidentCare operates by recognizing clusters of similar patients, and learning the best screening policy to adopt for each cluster. A cluster of patients is a set of patients with similar features (e.g. age, breast density, family history, etc.), and the screening policy is a set of guidelines on what actions to recommend for a woman given her features and screening test scores. ConfidentCare algorithm ensures that the policy adopted for every cluster of patients satisfies a predefined accuracy requirement with a high level of confidence. We show that our algorithm outperforms the current CPGs in terms of cost-efficiency and false positive rates.
△ Less
Submitted 31 January, 2016;
originally announced February 2016.
-
Ensemble Estimation of Information Divergence
Authors:
Kevin R. Moon,
Kumar Sricharan,
Kristjan Greenewald,
Alfred O. Hero III
Abstract:
Recent work has focused on the problem of nonparametric estimation of information divergence functionals. Many existing approaches are restrictive in their assumptions on the density support set or require difficult calculations at the support boundary which must be known a priori. The MSE convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounde…
▽ More
Recent work has focused on the problem of nonparametric estimation of information divergence functionals. Many existing approaches are restrictive in their assumptions on the density support set or require difficult calculations at the support boundary which must be known a priori. The MSE convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. The asymptotic distribution of this estimator and some guidelines for tuning parameter selection are provided. Based on the theory, an empirical estimator of Rényi-$α$ divergence is proposed that outperforms the standard kernel density plug-in estimator, especially in high dimension. The estimator is shown to be robust to the choice of tuning parameters. As an illustration, we use the estimator to estimate bounds on the Bayes error rate of a classification problem.
△ Less
Submitted 4 June, 2018; v1 submitted 25 January, 2016;
originally announced January 2016.
-
The intrinsic value of HFO features as a biomarker of epileptic activity
Authors:
Stephen V. Gliske,
Kevin R. Moon,
William C. Stacey,
Alfred O. Hero III
Abstract:
High frequency oscillations (HFOs) are a promising biomarker of epileptic brain tissue and activity. HFOs additionally serve as a prototypical example of challenges in the analysis of discrete events in high-temporal resolution, intracranial EEG data. Two primary challenges are 1) dimensionality reduction, and 2) assessing feasibility of classification. Dimensionality reduction assumes that the da…
▽ More
High frequency oscillations (HFOs) are a promising biomarker of epileptic brain tissue and activity. HFOs additionally serve as a prototypical example of challenges in the analysis of discrete events in high-temporal resolution, intracranial EEG data. Two primary challenges are 1) dimensionality reduction, and 2) assessing feasibility of classification. Dimensionality reduction assumes that the data lie on a manifold with dimension less than that of the feature space. However, previous HFO analyses have assumed a linear manifold, global across time, space (i.e. recording electrode/channel), and individual patients. Instead, we assess both a) whether linear methods are appropriate and b) the consistency of the manifold across time, space, and patients. We also estimate bounds on the Bayes classification error to quantify the distinction between two classes of HFOs (those occurring during seizures and those occurring due to other processes). This analysis provides the foundation for future clinical use of HFO features and buides the analysis for other discrete events, such as individual action potentials or multi-unit activity.
△ Less
Submitted 12 October, 2015;
originally announced October 2015.
-
Meta learning of bounds on the Bayes classifier error
Authors:
Kevin R. Moon,
Veronique Delouille,
Alfred O. Hero III
Abstract:
Meta learning uses information from base learners (e.g. classifiers or estimators) as well as information about the learning problem to improve upon the performance of a single base learner. For example, the Bayes error rate of a given feature space, if known, can be used to aid in choosing a classifier, as well as in feature selection and model selection for the base classifiers and the meta clas…
▽ More
Meta learning uses information from base learners (e.g. classifiers or estimators) as well as information about the learning problem to improve upon the performance of a single base learner. For example, the Bayes error rate of a given feature space, if known, can be used to aid in choosing a classifier, as well as in feature selection and model selection for the base classifiers and the meta classifier. Recent work in the field of f-divergence functional estimation has led to the development of simple and rapidly converging estimators that can be used to estimate various bounds on the Bayes error. We estimate multiple bounds on the Bayes error using an estimator that applies meta learning to slowly converging plug-in estimators to obtain the parametric convergence rate. We compare the estimated bounds empirically on simulated data and then estimate the tighter bounds on features extracted from an image patch analysis of sunspot continuum and magnetogram images.
△ Less
Submitted 3 July, 2015; v1 submitted 27 April, 2015;
originally announced April 2015.
-
Image patch analysis of sunspots and active regions. II. Clustering via matrix factorization
Authors:
Kevin R. Moon,
Veronique Delouille,
Jimmy J. Li,
Ruben De Visscher,
Fraser Watson,
Alfred O. Hero III
Abstract:
Separating active regions that are quiet from potentially eruptive ones is a key issue in Space Weather applications. Traditional classification schemes such as Mount Wilson and McIntosh have been effective in relating an active region large scale magnetic configuration to its ability to produce eruptive events. However, their qualitative nature prevents systematic studies of an active region's ev…
▽ More
Separating active regions that are quiet from potentially eruptive ones is a key issue in Space Weather applications. Traditional classification schemes such as Mount Wilson and McIntosh have been effective in relating an active region large scale magnetic configuration to its ability to produce eruptive events. However, their qualitative nature prevents systematic studies of an active region's evolution for example. We introduce a new clustering of active regions that is based on the local geometry observed in Line of Sight magnetogram and continuum images. We use a reduced-dimension representation of an active region that is obtained by factoring the corresponding data matrix comprised of local image patches. Two factorizations can be compared via the definition of appropriate metrics on the resulting factors. The distances obtained from these metrics are then used to cluster the active regions. We find that these metrics result in natural clusterings of active regions. The clusterings are related to large scale descriptors of an active region such as its size, its local magnetic field distribution, and its complexity as measured by the Mount Wilson classification scheme. We also find that including data focused on the neutral line of an active region can result in an increased correspondence between our clustering results and other active region descriptors such as the Mount Wilson classifications and the $R$ value. We provide some recommendations for which metrics, matrix factorization techniques, and regions of interest to use to study active regions.
△ Less
Submitted 10 December, 2015; v1 submitted 10 April, 2015;
originally announced April 2015.
-
Image patch analysis of sunspots and active regions. I. Intrinsic dimension and correlation analysis
Authors:
Kevin R. Moon,
Jimmy J. Li,
Veronique Delouille,
Ruben De Visscher,
Fraser Watson,
Alfred O. Hero III
Abstract:
The flare-productivity of an active region is observed to be related to its spatial complexity. Mount Wilson or McIntosh sunspot classifications measure such complexity but in a categorical way, and may therefore not use all the information present in the observations. Moreover, such categorical schemes hinder a systematic study of an active region's evolution for example. We propose fine-scale qu…
▽ More
The flare-productivity of an active region is observed to be related to its spatial complexity. Mount Wilson or McIntosh sunspot classifications measure such complexity but in a categorical way, and may therefore not use all the information present in the observations. Moreover, such categorical schemes hinder a systematic study of an active region's evolution for example. We propose fine-scale quantitative descriptors for an active region's complexity and relate them to the Mount Wilson classification. We analyze the local correlation structure within continuum and magnetogram data, as well as the cross-correlation between continuum and magnetogram data. We compute the intrinsic dimension, partial correlation, and canonical correlation analysis (CCA) of image patches of continuum and magnetogram active region images taken from the SOHO-MDI instrument. We use masks of sunspots derived from continuum as well as larger masks of magnetic active regions derived from the magnetogram to analyze separately the core part of an active region from its surrounding part. We find the relationship between complexity of an active region as measured by Mount Wilson and the intrinsic dimension of its image patches. Partial correlation patterns exhibit approximately a third-order Markov structure. CCA reveals different patterns of correlation between continuum and magnetogram within the sunspots and in the region surrounding the sunspots. These results also pave the way for patch-based dictionary learning with a view towards automatic clustering of active regions.
△ Less
Submitted 14 December, 2015; v1 submitted 13 March, 2015;
originally announced March 2015.
-
Multivariate f-Divergence Estimation With Confidence
Authors:
Kevin R. Moon,
Alfred O. Hero III
Abstract:
The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several nonparametric divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recen…
▽ More
The problem of f-divergence estimation is important in the fields of machine learning, information theory, and statistics. While several nonparametric divergence estimators exist, relatively few have known convergence properties. In particular, even for those estimators whose MSE convergence rates are known, the asymptotic distributions are unknown. We establish the asymptotic normality of a recently proposed ensemble estimator of f-divergence between two distributions from a finite number of samples. This estimator has MSE convergence rate of O(1/T), is simple to implement, and performs well in high dimensions. This theory enables us to perform divergence-based inference tasks such as testing equality of pairs of distributions based on empirical samples. We experimentally validate our theoretical results and, as an illustration, use them to empirically bound the best achievable classification error.
△ Less
Submitted 7 November, 2014;
originally announced November 2014.
-
Image patch analysis and clustering of sunspots: a dimensionality reduction approach
Authors:
Kevin R. Moon,
Jimmy J. Li,
Veronique Delouille,
Fraser Watson,
Alfred O. Hero III
Abstract:
Sunspots, as seen in white light or continuum images, are associated with regions of high magnetic activity on the Sun, visible on magnetogram images. Their complexity is correlated with explosive solar activity and so classifying these active regions is useful for predicting future solar activity. Current classification of sunspot groups is visually based and suffers from bias. Supervised learnin…
▽ More
Sunspots, as seen in white light or continuum images, are associated with regions of high magnetic activity on the Sun, visible on magnetogram images. Their complexity is correlated with explosive solar activity and so classifying these active regions is useful for predicting future solar activity. Current classification of sunspot groups is visually based and suffers from bias. Supervised learning methods can reduce human bias but fail to optimally capitalize on the information present in sunspot images. This paper uses two image modalities (continuum and magnetogram) to characterize the spatial and modal interactions of sunspot and magnetic active region images and presents a new approach to cluster the images. Specifically, in the framework of image patch analysis, we estimate the number of intrinsic parameters required to describe the spatial and modal dependencies, the correlation between the two modalities and the corresponding spatial patterns, and examine the phenomena at different scales within the images. To do this, we use linear and nonlinear intrinsic dimension estimators, canonical correlation analysis, and multiresolution analysis of intrinsic dimension.
△ Less
Submitted 24 June, 2014;
originally announced June 2014.
-
Ensemble estimation of multivariate f-divergence
Authors:
Kevin R. Moon,
Alfred O. Hero III
Abstract:
f-divergence estimation is an important problem in the fields of information theory, machine learning, and statistics. While several divergence estimators exist, relatively few of their convergence rates are known. We derive the MSE convergence rate for a density plug-in estimator of f-divergence. Then by applying the theory of optimally weighted ensemble estimation, we derive a divergence estimat…
▽ More
f-divergence estimation is an important problem in the fields of information theory, machine learning, and statistics. While several divergence estimators exist, relatively few of their convergence rates are known. We derive the MSE convergence rate for a density plug-in estimator of f-divergence. Then by applying the theory of optimally weighted ensemble estimation, we derive a divergence estimator with a convergence rate of O(1/T) that is simple to implement and performs well in high dimensions. We validate our theoretical results with experiments.
△ Less
Submitted 9 June, 2014; v1 submitted 24 April, 2014;
originally announced April 2014.
-
Training-Free Non-Intrusive Load Monitoring of Electric Vehicle Charging with Low Sampling Rate
Authors:
Zhilin Zhang,
Jae Hyun Son,
Ying Li,
Mark Trayer,
Zhouyue Pi,
Dong Yoon Hwang,
Joong Ki Moon
Abstract:
Non-intrusive load monitoring (NILM) is an important topic in smart-grid and smart-home. Many energy disaggregation algorithms have been proposed to detect various individual appliances from one aggregated signal observation. However, few works studied the energy disaggregation of plug-in electric vehicle (EV) charging in the residential environment since EVs charging at home has emerged only rece…
▽ More
Non-intrusive load monitoring (NILM) is an important topic in smart-grid and smart-home. Many energy disaggregation algorithms have been proposed to detect various individual appliances from one aggregated signal observation. However, few works studied the energy disaggregation of plug-in electric vehicle (EV) charging in the residential environment since EVs charging at home has emerged only recently. Recent studies showed that EV charging has a large impact on smart-grid especially in summer. Therefore, EV charging monitoring has become a more important and urgent missing piece in energy disaggregation. In this paper, we present a novel method to disaggregate EV charging signals from aggregated real power signals. The proposed method can effectively mitigate interference coming from air-conditioner (AC), enabling accurate EV charging detection and energy estimation under the presence of AC power signals. Besides, the proposed algorithm requires no training, demands a light computational load, delivers high estimation accuracy, and works well for data recorded at the low sampling rate 1/60 Hz. When the algorithm is tested on real-world data recorded from 11 houses over about a whole year (total 125 months worth of data), the averaged error in estimating energy consumption of EV charging is 15.7 kwh/month (while the true averaged energy consumption of EV charging is 208.5 kwh/month), and the averaged normalized mean square error in disaggregating EV charging load signals is 0.19.
△ Less
Submitted 6 August, 2014; v1 submitted 20 April, 2014;
originally announced April 2014.