-
ChaosMining: A Benchmark to Evaluate Post-Hoc Local Attribution Methods in Low SNR Environments
Authors:
Ge Shi,
Ziwen Kan,
Jason Smucny,
Ian Davidson
Abstract:
In this study, we examine the efficacy of post-hoc local attribution methods in identifying features with predictive power from irrelevant ones in domains characterized by a low signal-to-noise ratio (SNR), a common scenario in real-world machine learning applications. We developed synthetic datasets encompassing symbolic functional, image, and audio data, incorporating a benchmark on the (Model × Attribution × Noise Condition) triplet. By rigorously testing various classic models trained from scratch, we gained valuable insights into the performance of these attribution methods in multiple conditions. Based on these findings, we introduce a novel extension to the notable recursive feature elimination (RFE) algorithm, enhancing its applicability for neural networks. Our experiments highlight its strengths in prediction and feature selection, alongside limitations in scalability. Further details and additional minor findings are included in the appendix, with extensive discussions. The codes and resources are available at https://github.com/geshijoker/ChaosMining/.
Submitted 17 June, 2024;
originally announced June 2024.
-
Identification and Uses of Deep Learning Backbones via Pattern Mining
Authors:
Michael Livanos,
Ian Davidson
Abstract:
Deep learning is extensively used in many areas of data mining as a black-box method with impressive results. However, understanding the core mechanism of how deep learning makes predictions is a relatively understudied problem. Here we explore the notion of identifying a backbone of deep learning for a given group of instances. A group here can be instances of the same class or even misclassified instances of the same class. We view each instance for a given group as activating a subset of neurons and attempt to find a subgraph of neurons associated with a given concept/group. We formulate this problem as a set-cover-style problem, show that it is intractable, and present a highly constrained integer linear programming (ILP) formulation. As an alternative, we explore a coverage-based heuristic approach related to pattern mining, and show it converges to a Pareto equilibrium point of the ILP formulation. Experimentally we explore these backbones to identify mistakes and improve performance, explanation, and visualization. We demonstrate application-based results using several challenging data sets, including Bird Audio Detection (BAD) Challenge and Labeled Faces in the Wild (LFW), as well as the classic MNIST data.
Submitted 27 March, 2024;
originally announced March 2024.
-
Cooperative Knowledge Distillation: A Learner Agnostic Approach
Authors:
Michael Livanos,
Ian Davidson,
Stephen Wong
Abstract:
Knowledge distillation is a simple but powerful way to transfer knowledge from a teacher model to a student model. Existing work suffers from at least one of the following key limitations in terms of direction and scope of transfer which restrict its use: all knowledge is transferred from teacher to student regardless of whether or not that knowledge is useful, the student is the only one learning in this exchange, and typically distillation transfers knowledge only from a single teacher to a single student. We formulate a novel form of knowledge distillation in which many models can act as both students and teachers, which we call cooperative distillation. The models cooperate as follows: a model (the student) identifies specific deficiencies in its performance and searches for another model (the teacher) that encodes learned knowledge into instructional virtual instances via counterfactual instance generation. Because different models may have different strengths and weaknesses, all models can act as either students or teachers (cooperation) when appropriate and only distill knowledge in areas specific to their strengths (focus). Since counterfactuals as a paradigm are not tied to any specific algorithm, we can use this method to distill knowledge between learners of different architectures, algorithms, and even feature spaces. We demonstrate that our approach not only outperforms baselines such as transfer learning, self-supervised learning, and multiple knowledge distillation algorithms on several datasets, but it can also be used in settings where the aforementioned techniques cannot.
Submitted 2 February, 2024;
originally announced February 2024.
-
Double Clad Antiresonant Hollow Core Fiber and Its Comparison with other Fibres for Multiphoton Micro-Endoscopy
Authors:
Marzanna Szwaj,
Ian A Davidson,
Peter B Johnson,
Greg Jasion,
Yongmin Jung,
Seyed Reza Sandoghchi,
Krzysztof P Herdzik,
Konstantinos N Bourdakos,
Natalie V Wheeler,
Hans Christian Mulvad,
David J Richardson,
Francesco Poletti,
Sumeet Mahajan
Abstract:
In this work, we study a new hollow-core (air-filled) double-clad anti-resonant fiber (DC-ARF) as a potent candidate for multiphoton micro-endoscopy. We compare the fiber characteristics with a single-clad anti-resonant fiber (SC-ARF) and a solid core fiber (SCF). While the DC-ARF and the SC-ARF enable low-loss (<0.2 dB m⁻¹), close to dispersion-free excitation pulse delivery (<10% pulse width increase at 900 nm per 1 m fiber) without any induced non-linearities, the SCF resulted in spectral broadening and pulse-stretching (>2000% pulse width increase at 900 nm per 1 m fiber). An ideal optical fiber endoscope needs to be several meters long and should enable both excitation and collection through the fiber. Therefore, we performed multiphoton imaging on endoscopy-compatible 1 m and 3 m lengths of fiber in the back-scattered geometry, wherein the signals were collected either directly (non-descanned detection) or through the fiber (descanned detection). Second harmonic images were collected from barium titanate crystals as well as from biological samples (rat tail tendon). In non-descanned detection conditions, the ARFs outperformed the SCF by up to 10 times in terms of signal-to-noise ratio of images. Significantly, only the DC-ARF, due to its high numerical aperture (0.45) and wide collection bandwidth (>1 µm), could provide images in the descanned detection configuration desirable for endoscopy. Thus, our systematic characterization and comparison of different optical fibres under different image collection configurations confirms and establishes the utility of DC-ARFs for high-performing label-free multiphoton imaging based micro-endoscopy.
Submitted 6 November, 2023;
originally announced November 2023.
-
Sentiment Analysis in Digital Spaces: An Overview of Reviews
Authors:
Laura E. M. Ayravainen,
Joanne Hinds,
Brittany I. Davidson
Abstract:
Sentiment analysis (SA) is commonly applied to digital textual data, revealing insight into opinions and feelings. Many systematic reviews have summarized existing work, but often overlook discussions of validity and scientific practices. Here, we present an overview of reviews, synthesizing 38 systematic reviews, containing 2,275 primary studies. We devise a bespoke quality assessment framework designed to assess the rigor and quality of systematic review methodologies and reporting standards. Our findings show diverse applications and methods, limited reporting rigor, and challenges over time. We discuss how future research and practitioners can address these issues and highlight their importance across numerous applications.
Submitted 30 October, 2023;
originally announced October 2023.
-
Mode attraction, rejection and control in nonlinear multimode optics
Authors:
Kunhao Ji,
Ian Davidson,
Jayantha Sahu,
David. J. Richardson,
Stefan Wabnitz,
Massimiliano Guasoni
Abstract:
Novel fundamental notions helping in the interpretation of the complex dynamics of nonlinear systems are essential to our understanding and ability to exploit them. In this work we predict and demonstrate experimentally a fundamental property of Kerr-nonlinear media, which we name mode rejection and which takes place when two intense counter-propagating beams interact in a multimode waveguide. In stark contrast to mode attraction phenomena, mode rejection leads to the selective suppression of a spatial mode in the forward beam, which is controlled via the counter-propagating backward beam. Starting from this observation we generalise the ideas of attraction and rejection in nonlinear multimode systems of arbitrary dimension, which paves the way towards a more general idea of all-optical mode control. These ideas represent universal tools to explore novel dynamics and applications in a variety of optical and non-optical nonlinear systems. Coherent beam combination in polarization-maintaining multicore fibres is demonstrated as an example.
Submitted 20 October, 2023;
originally announced October 2023.
-
On-target delivery of intense ultrafast laser pulses through hollow-core anti-resonant fibers
Authors:
Athanasios Lekosiotis,
Federico Belli,
Christian Brahms,
Mohammed Sabbah,
Hesham Sakr,
Ian A. Davidson,
Francesco Poletti,
John C. Travers
Abstract:
We report the flexible on-target delivery of 800 nm wavelength, 5 GW peak power, 40 fs duration laser pulses through an evacuated and tightly coiled 10 m long hollow-core nested anti-resonant fiber by positively chirping the input pulses to compensate for the anomalous dispersion of the fiber. Near-transform-limited output pulses with high beam quality and a guided peak intensity of 3 PW/cm² were achieved by suppressing plasma effects in the residual gas by pre-pumping the fiber after evacuation. This appears to cause a long-term removal of molecules from the fiber core. Identifying the fluence at the fiber core-wall interface as the damage origin, we scaled the coupled energy to 2.1 mJ using a short piece of larger-core fiber to obtain 20 GW at the fiber output. This scheme can pave the way towards the integration of anti-resonant fibers in mJ-level nonlinear optical experiments and laser-source development.
Submitted 7 September, 2023; v1 submitted 26 May, 2023;
originally announced May 2023.
-
The Politics of Language Choice: How the Russian-Ukrainian War Influences Ukrainians' Language Use on Twitter
Authors:
Daniel Racek,
Brittany I. Davidson,
Paul W. Thurner,
Xiao Xiang Zhu,
Göran Kauermann
Abstract:
The use of language is innately political and often a vehicle of cultural identity as well as the basis for nation building. Here, we examine language choice and tweeting activity of Ukrainian citizens based on more than 4 million geo-tagged tweets from over 62,000 users before and during the Russian-Ukrainian War, from January 2020 to October 2022. Using statistical models, we disentangle sample effects, arising from the in- and outflux of users on Twitter, from behavioural effects, arising from behavioural changes of the users. We observe a steady shift from the Russian language towards the Ukrainian language already before the war, which drastically speeds up with its outbreak. We attribute these shifts in large part to users' behavioural changes. Notably, we find that more than half of the Russian-tweeting users shift towards Ukrainian as a result of the war.
Submitted 6 June, 2023; v1 submitted 4 May, 2023;
originally announced May 2023.
-
COSMOS2020: Identification of High-z Protocluster Candidates in COSMOS
Authors:
Malte Brinch,
Thomas R. Greve,
John R. Weaver,
Gabriel Brammer,
Olivier Ilbert,
Marko Shuntov,
Shuowen Jin,
Daizhong Liu,
Clara Giménez-Arteaga,
Caitlin M. Casey,
Iary Davidson,
Seiji Fujimoto,
Anton M. Koekemoer,
Vasily Kokorev,
Georgios Magdis,
H. J. McCracken,
Conor J. R. McPartland,
Bahram Mobasher,
David B. Sanders,
Sune Toft,
Francesco Valentino,
Giovanni Zamorani,
Jorge Zavala
Abstract:
We conduct a systematic search for protocluster candidates at $z \geq 6$ in the COSMOS field using the recently released COSMOS2020 source catalog. We select galaxies using a number of selection criteria to obtain a sample of galaxies that have a high probability of being inside a given redshift bin. We then apply overdensity analysis to the bins using two density estimators, a Weighted Adaptive Kernel Estimator and a Weighted Voronoi Tessellation Estimator. We have found 15 significant ($>4σ$) candidate galaxy overdensities across the redshift range $6\le z\le7.7$. The majority of the galaxies appear to be on the galaxy main sequence at their respective epochs. We use multiple stellar-mass-to-halo-mass conversion methods to obtain a range of dark matter halo mass estimates for the overdensities in the range of $\sim10^{11-13}\,M_{\rm \odot}$, at the respective redshifts of the overdensities. The number and the masses of the halos associated with our protocluster candidates are consistent with what is expected from the area of a COSMOS-like survey in a standard $Λ$CDM cosmology. Through comparison with simulation, we expect that all the overdensities at $z\simeq6$ will evolve into Virgo-/Coma-like clusters at present (i.e., with masses $\sim 10^{14}-10^{15}\,M_{\rm \odot}$). Compared to other overdensities identified at $z \geq 6$ via narrow-band selection techniques, the overdensities presented appear to have $\sim10\times$ higher stellar masses and star-formation rates. We compare the evolution in the total star-formation rate and stellar mass content of the protocluster candidates across the redshift range $6\le z\le7.7$ and find agreement with the total average star-formation rate from simulations.
Submitted 31 October, 2022;
originally announced October 2022.
-
Scalable Spectral Clustering with Group Fairness Constraints
Authors:
Ji Wang,
Ding Lu,
Ian Davidson,
Zhaojun Bai
Abstract:
There are synergies of research interests and industrial efforts in modeling fairness and correcting algorithmic bias in machine learning. In this paper, we present a scalable algorithm for spectral clustering (SC) with group fairness constraints. Group fairness is also known as statistical parity, where in each cluster, each protected group is represented with the same proportion as in the entire dataset. While the FairSC algorithm (Kleindessner et al., 2019) is able to find a fairer clustering, it is compromised by high costs because its computational kernels explicitly compute nullspaces and square roots of dense matrices. We present a new formulation of the underlying spectral computation by incorporating nullspace projection and Hotelling's deflation such that the resulting algorithm, called s-FairSC, only involves sparse matrix-vector products and is able to fully exploit the sparsity of the fair SC model. The experimental results on the modified stochastic block model demonstrate that s-FairSC is comparable with FairSC in recovering fair clustering. Meanwhile, it is sped up by a factor of 12 for moderate model sizes. s-FairSC is further demonstrated to be scalable in the sense that the computational costs of s-FairSC only increase marginally compared to the SC without fairness constraints.
Submitted 14 April, 2023; v1 submitted 28 October, 2022;
originally announced October 2022.
-
Towards Auditing Unsupervised Learning Algorithms and Human Processes For Fairness
Authors:
Ian Davidson,
S. S. Ravi
Abstract:
Existing work on fairness typically focuses on making known machine learning algorithms fairer. Fair variants of classification, clustering, outlier detection and other styles of algorithms exist. However, an understudied area is the topic of auditing an algorithm's output to determine fairness. Existing work has explored the two group classification problem for binary protected status variables using standard definitions of statistical parity. Here we build upon the area of auditing by exploring the multi-group setting under more complex definitions of fairness.
Submitted 20 September, 2022;
originally announced September 2022.
-
Explainable Clustering via Exemplars: Complexity and Efficient Approximation Algorithms
Authors:
Ian Davidson,
Michael Livanos,
Antoine Gourru,
Peter Walker,
Julien Velcin,
S. S. Ravi
Abstract:
Explainable AI (XAI) is an important developing area but remains relatively understudied for clustering. We propose an explainable-by-design clustering approach that not only finds clusters but also exemplars to explain each cluster. The use of exemplars for understanding is supported by the exemplar-based school of concept definition in psychology. We show that finding a small set of exemplars to explain even a single cluster is computationally intractable; hence, the overall problem is challenging. We develop an approximation algorithm that provides provable performance guarantees with respect to clustering quality as well as the number of exemplars used. This basic algorithm explains all the instances in every cluster whilst another approximation algorithm uses a bounded number of exemplars to allow simpler explanations and provably covers a large fraction of all the instances. Experimental results show that our work is useful in domains involving difficult to understand deep embeddings of images and text.
Submitted 20 September, 2022;
originally announced September 2022.
-
How darknet market users learned to worry more and love PGP: Analysis of security advice on darknet marketplaces
Authors:
Andrew C. Dwyer,
Joseph Hallett,
Claudia Peersman,
Matthew Edwards,
Brittany I. Davidson,
Awais Rashid
Abstract:
Darknet marketplaces, accessible through Tor, are where users can buy illicit goods and learn to hide from law enforcement. We surveyed the advice on these markets and found valid security advice mixed up with paranoid threat models and a reliance on privacy tools dismissed as unusable by the mainstream.
Submitted 16 March, 2022;
originally announced March 2022.
-
On the role of technology in human-dog relationships: a future of nightmares or dreams?
Authors:
Dirk van der Linden,
Brittany I. Davidson,
Orit Hirsch-Matsioulas,
Anna Zamansky
Abstract:
Digital technologies that help people take care of their dogs are becoming more widespread. Yet, little research explores what the role of technology in the human-dog relationship should be. We conducted a qualitative study incorporating quantitative and thematic analysis of 155 UK dog owners reflecting on their daily routines and technology's role in them, disentangling the what-where-why of interspecies routines and activities, technological desires, and rationales for technological support across common human-dog activities. We found that increasingly entangled daily routines lead to close multi-species households where dog owners conceptualize technology as having a role to support them in giving care to their dogs. When confronted with the role of technology across various activities, only chores like cleaning up after their dogs lead to largely positive considerations, while activities that benefit themselves like walking together lead to largely negative considerations. For other activities, whether playing, training, or feeding, attitudes remain diverse. In general, across all activities both a nightmare scenario of technology taking the human's role and in doing so disentangling the human-dog bond, as well as a dream scenario of technology augmenting human abilities arise. We argue that the current trajectory of digital technology for pets is increasingly focused on enabling remote interactions, an example of the nightmare scenario in our thematic analysis. It is important to redirect this trajectory to one of technology predominantly supporting us in becoming better and more informed caregivers.
Submitted 24 September, 2022; v1 submitted 4 February, 2022;
originally announced February 2022.
-
Deep Fair Discriminative Clustering
Authors:
Hongjing Zhang,
Ian Davidson
Abstract:
Deep clustering has the potential to learn a strong representation and hence better clustering performance compared to traditional clustering methods such as $k$-means and spectral clustering. However, this strong representation learning ability may make the clustering unfair by discovering surrogates for protected information which we empirically show in our experiments. In this work, we study a general notion of group-level fairness for both binary and multi-state protected status variables (PSVs). We begin by formulating the group-level fairness problem as an integer linear programming formulation whose totally unimodular constraint matrix means it can be efficiently solved via linear programming. We then show how to inject this solver into a discriminative deep clustering backbone and hence propose a refinement learning algorithm to combine the clustering goal with the fairness objective to learn fair clusters adaptively. Experimental results on real-world datasets demonstrate that our model consistently outperforms state-of-the-art fair clustering algorithms. Our framework shows promising results for novel clustering tasks including flexible fairness constraints, multi-state PSVs and predictive clustering.
Submitted 28 May, 2021;
originally announced May 2021.
-
Deep Descriptive Clustering
Authors:
Hongjing Zhang,
Ian Davidson
Abstract:
Recent work on explainable clustering allows describing clusters when the features are interpretable. However, much modern machine learning focuses on complex data such as images, text, and graphs where deep learning is used but the raw features of data are not interpretable. This paper explores a novel setting for performing clustering on complex data while simultaneously generating explanations using interpretable tags. We propose deep descriptive clustering that performs sub-symbolic representation learning on complex data while generating explanations based on symbolic data. We form good clusters by maximizing the mutual information between the empirical distribution on the inputs and the induced clustering labels for clustering objectives. We generate explanations by solving an integer linear program that generates concise and orthogonal descriptions for each cluster. Finally, we allow the explanation to inform better clustering by proposing a novel pairwise loss with self-generated constraints to maximize the clustering and explanation module's consistency. Experimental results on public data demonstrate that our model outperforms competitive baselines in clustering performance while offering high-quality cluster-level explanations.
Submitted 24 May, 2021;
originally announced May 2021.
-
A Framework for Deep Constrained Clustering
Authors:
Hongjing Zhang,
Tianyang Zhan,
Sugato Basu,
Ian Davidson
Abstract:
The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering but have several limitations. A fundamental strength of deep learning is its flexibility, and here we explore a deep learning framework for constrained clustering and in particular explore how it can extend the field of constrained clustering. We show that our framework can handle not only standard together/apart constraints (without the well-documented negative effects reported earlier) generated from labeled side information but also more complex constraints generated from new types of side information such as continuous values and high-level domain knowledge. Furthermore, we propose an efficient training paradigm that is generally applicable to these four types of constraints. We validate the effectiveness of our approach by empirical results on both image and text datasets. We also study the robustness of our framework when learning with noisy constraints and show how different components of our framework contribute to the final performance. Our source code is available at https://github.com/blueocean92/deep_constrained_clustering.
Submitted 7 January, 2021;
originally announced January 2021.
-
Towards Fair Deep Anomaly Detection
Authors:
Hongjing Zhang,
Ian Davidson
Abstract:
Anomaly detection aims to find instances that are considered unusual and is a fundamental problem of data science. Recently, deep anomaly detection methods were shown to achieve superior results, particularly in complex data such as images. Our work focuses on deep one-class classification for anomaly detection, which learns a mapping only from the normal samples. However, the non-linear transformation performed by deep learning can potentially find patterns associated with social bias. The challenge with adding fairness to deep anomaly detection is to make anomaly predictions that are simultaneously fair and correct. In this paper, we propose a new architecture for the fair anomaly detection approach (Deep Fair SVDD) and train it using an adversarial network to de-correlate the relationships between the sensitive attributes and the learned representations. This differs from how fairness is typically added, namely as a regularizer or a constraint. Further, we propose two effective fairness measures and empirically demonstrate that existing deep anomaly detection methods are unfair. We show that our proposed approach can largely remove the unfairness with minimal loss in anomaly detection performance. Lastly, we conduct an in-depth analysis to show the strengths and limitations of our proposed model, including parameter analysis, feature visualization, and run-time analysis.
△ Less
Submitted 29 December, 2020;
originally announced December 2020.
-
Block Model Guided Unsupervised Feature Selection
Authors:
Zilong Bai,
Hoa Nguyen,
Ian Davidson
Abstract:
Feature selection is a core area of data mining with a recent innovation of graph-driven unsupervised feature selection for linked data. In this setting we have a dataset $\mathbf{Y}$ consisting of $n$ instances each with $m$ features and a corresponding $n$ node graph (whose adjacency matrix is $\mathbf{A}$) with an edge indicating that the two instances are similar. Existing efforts for unsupervised feature selection on attributed networks have explored either directly regenerating the links by solving for $f$ such that $f(\mathbf{y}_i,\mathbf{y}_j) \approx \mathbf{A}_{i,j}$ or finding community structure in $\mathbf{A}$ and using the features in $\mathbf{Y}$ to predict these communities. However, graph-driven unsupervised feature selection remains an understudied area with respect to exploring more complex guidance. Here we take the novel approach of first building a block model on the graph and then using the block model for feature selection. That is, we discover $\mathbf{F}\mathbf{M}\mathbf{F}^T \approx \mathbf{A}$ and then find a subset of features $\mathcal{S}$ that induces another graph to preserve both $\mathbf{F}$ and $\mathbf{M}$. We call our approach Block Model Guided Unsupervised Feature Selection (BMGUFS). Experimental results show that our method outperforms the state of the art on several real-world public datasets in finding high-quality features for clustering.
Submitted 5 July, 2020;
originally announced July 2020.
-
Efficient Algorithms for Generating Provably Near-Optimal Cluster Descriptors for Explainability
Authors:
Prathyush Sambaturu,
Aparna Gupta,
Ian Davidson,
S. S. Ravi,
Anil Vullikanti,
Andrew Warren
Abstract:
Improving the explainability of the results from machine learning methods has become an important research goal. Here, we study the problem of making clusters more interpretable by extending a recent approach of [Davidson et al., NeurIPS 2018] for constructing succinct representations for clusters. Given a set of objects $S$, a partition $π$ of $S$ (into clusters), and a universe $T$ of tags such that each element in $S$ is associated with a subset of tags, the goal is to find a representative set of tags for each cluster such that those sets are pairwise-disjoint and the total size of all the representatives is minimized. Since this problem is NP-hard in general, we develop approximation algorithms with provable performance guarantees for the problem. We also show applications to explain clusters from datasets, including clusters of genomic sequences that represent different threat levels.
Submitted 6 February, 2020;
originally announced February 2020.
-
A Graph-Based Approach for Active Learning in Regression
Authors:
Hongjing Zhang,
S. S. Ravi,
Ian Davidson
Abstract:
Active learning aims to reduce labeling efforts by selectively asking humans to annotate the most important data points from an unlabeled pool and is an example of human-machine interaction. Though active learning has been extensively researched for classification and ranking problems, it is relatively understudied for regression problems. Most existing active learning for regression methods use the regression function learned at each active learning iteration to select the next informative point to query. This introduces several challenges such as handling noisy labels, parameter uncertainty and overcoming initially biased training data. Instead, we propose a feature-focused approach that formulates both sequential and batch-mode active regression as a novel bipartite graph optimization problem. We conduct experiments on both noise-free and noisy settings. Our experimental results on benchmark data sets demonstrate the effectiveness of our proposed approach.
Submitted 29 January, 2020;
originally announced January 2020.
-
Coverage-based Outlier Explanation
Authors:
Yue Wu,
Leman Akoglu,
Ian Davidson
Abstract:
Outlier detection is a core task in data mining with a plethora of algorithms that have enjoyed wide-scale usage. Existing algorithms are primarily focused on detection, that is, the identification of outliers in a given dataset. In this paper we explore the relatively under-studied outlier explanation problem. Our goal is, given a dataset that is already divided into outliers and normal instances, to explain what characterizes the outliers. We explore the novel direction of a semantic explanation that a domain expert or policy maker is able to understand. We formulate this as an optimization problem to find explanations that are both interpretable and pure. Through experiments on real-world data sets, we quantitatively show that our method can efficiently generate better explanations compared with rule-based learners.
Submitted 6 November, 2019;
originally announced November 2019.
-
A Framework for Deep Constrained Clustering -- Algorithms and Advances
Authors:
Hongjing Zhang,
Sugato Basu,
Ian Davidson
Abstract:
The area of constrained clustering has been extensively explored by researchers and used by practitioners. Constrained clustering formulations exist for popular algorithms such as k-means, mixture models, and spectral clustering but have several limitations. A fundamental strength of deep learning is its flexibility, and here we explore a deep learning framework for constrained clustering and in particular explore how it can extend the field of constrained clustering. We show that our framework can not only handle standard together/apart constraints (without the well documented negative effects reported earlier) generated from labeled side information but more complex constraints generated from new types of side information such as continuous values and high-level domain knowledge.
Submitted 19 December, 2019; v1 submitted 28 January, 2019;
originally announced January 2019.
-
Towards Fair Deep Clustering With Multi-State Protected Variables
Authors:
Bokun Wang,
Ian Davidson
Abstract:
Fair clustering under the disparate impact doctrine requires that the population of each protected group be approximately equal in every cluster. Previous work investigated a difficult-to-scale pre-processing step for $k$-center and $k$-median style algorithms for the special case of this problem when the number of protected groups is two. In this work, we consider a more general and practical setting where there can be many protected groups. To this end, we propose Deep Fair Clustering, which learns a discriminative but fair cluster assignment function. The experimental results on three public datasets with different types of protected attributes show that our approach can steadily improve the degree of fairness with only a minor loss in clustering quality.
Submitted 28 January, 2019;
originally announced January 2019.
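The requirement stated in this abstract, that each protected group be roughly equally represented in every cluster, can be checked with a few lines of Python. The following is only an illustrative sketch: the function names `group_shares_by_cluster` and `max_share_gap` are hypothetical, and the gap measure is an assumption chosen for illustration, not the fairness measure used in the paper.

```python
from collections import Counter


def group_shares_by_cluster(cluster_ids, group_ids):
    """Return {cluster: {group: share}} for a clustering.

    share is the fraction of the cluster's members that belong to the group.
    """
    pair_counts = Counter(zip(cluster_ids, group_ids))
    cluster_sizes = Counter(cluster_ids)
    groups = sorted(set(group_ids))
    return {
        c: {g: pair_counts[(c, g)] / cluster_sizes[c] for g in groups}
        for c in sorted(cluster_sizes)
    }


def max_share_gap(shares):
    """Largest difference in one group's share between any two clusters.

    0.0 means every cluster has the same group composition;
    1.0 means some group makes up all of one cluster and none of another.
    This measure is an illustrative assumption, not taken from the paper.
    """
    groups = next(iter(shares.values())).keys()
    return max(
        max(s[g] for s in shares.values()) - min(s[g] for s in shares.values())
        for g in groups
    )


# Three protected groups spread evenly over two clusters.
even = group_shares_by_cluster([0, 0, 0, 1, 1, 1], ["a", "b", "c", "a", "b", "c"])
# Two groups fully separated by the clustering.
split = group_shares_by_cluster([0, 0, 1, 1], ["a", "a", "b", "b"])
```

Because the shares are computed per group, the same check works for any number of protected groups, which is the multi-state setting the abstract describes.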
-
On The Equivalence of Tries and Dendrograms - Efficient Hierarchical Clustering of Traffic Data
Authors:
Chia-Tung Kuo,
Ian Davidson
Abstract:
The widespread use of GPS-enabled devices generates voluminous and continuous amounts of traffic data, but analyzing such data for interpretable and actionable insights poses challenges. A hierarchical clustering of the trips has many uses such as discovering shortest paths, common routes and often traversed areas. However, hierarchical clustering typically has time complexity of $O(n^2 \log n)$ where $n$ is the number of instances, and is difficult to scale to large data sets associated with GPS data. Furthermore, incremental hierarchical clustering is still a developing area. Prefix trees (also called tries) can be efficiently constructed and updated in linear time (in $n$). We show how a specially constructed trie can compactly store the trips and further show this trie is equivalent to a dendrogram that would have been built by classic agglomerative hierarchical algorithms using a specific distance metric. This allows creating hierarchical clusterings of GPS trip data and updating this hierarchy in linear time. We demonstrate the usefulness of our proposed approach on a real world data set of half a million taxis' GPS traces, well beyond the capabilities of agglomerative clustering methods. Our work is not limited to trip data and can be used with other data with a string representation.
Submitted 12 October, 2018;
originally announced October 2018.
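To make the trie idea concrete, here is a minimal Python sketch of a prefix tree over trips. It assumes each trip has already been encoded as a sequence of symbols (for example, road-segment identifiers). The class and method names (`TripTrie`, `clusters_at_depth`) are hypothetical, and this is a generic counting trie, not the specially constructed trie or the distance metric from the paper.

```python
class TrieNode:
    def __init__(self):
        self.children = {}  # symbol -> TrieNode
        self.count = 0      # number of inserted trips passing through this node


class TripTrie:
    """Generic prefix tree over trips encoded as symbol sequences."""

    def __init__(self):
        self.root = TrieNode()

    def insert(self, trip):
        """Add one trip; cost is linear in the trip's length."""
        node = self.root
        node.count += 1
        for symbol in trip:
            node = node.children.setdefault(symbol, TrieNode())
            node.count += 1

    def clusters_at_depth(self, depth):
        """Map each prefix of the given length to the number of trips sharing it.

        Trips shorter than depth have no prefix of that length and are omitted.
        """
        result = {}
        stack = [(self.root, ())]
        while stack:
            node, prefix = stack.pop()
            if len(prefix) == depth:
                result[prefix] = node.count
                continue
            for symbol, child in node.children.items():
                stack.append((child, prefix + (symbol,)))
        return result


trie = TripTrie()
for trip in ["abc", "abd", "axy", "b"]:
    trie.insert(trip)
```

Reading the trie level by level gives groupings of increasing granularity: at depth 1 the three trips starting with `a` form one group, and at depth 2 they split into the `ab` and `ax` groups. New trips can be added with `insert` without rebuilding anything, which is the incremental-update property the abstract highlights.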
-
Probabilistic Formulations of Regression with Mixed Guidance
Authors:
Aubrey Gress,
Ian Davidson
Abstract:
Regression problems assume every instance is annotated (labeled) with a real value, a form of annotation we call \emph{strong guidance}. In order for these annotations to be accurate, they must be the result of a precise experiment or measurement. However, in some cases additional \emph{weak guidance} might be given by imprecise measurements, a domain expert or even crowd sourcing. Current formulations of regression are unable to use both types of guidance. We propose a regression framework that can also incorporate weak guidance based on relative orderings, bounds, neighboring and similarity relations. Consider learning to predict ages from portrait images: these new types of guidance allow weaker annotations, such as stating that a person is in their 20s or that two people are similar in age. These types of annotations can be easier to generate than strong guidance. We introduce a probabilistic formulation for these forms of weak guidance and show that the resulting optimization problems are convex. Our experimental results show the benefits of these formulations on several data sets.
Submitted 1 April, 2018;
originally announced April 2018.
-
Transfer Regression via Pairwise Similarity Regularization
Authors:
Aubrey Gress,
Ian Davidson
Abstract:
Transfer learning methods address the situation where little labeled training data from the "target" problem exists, but much training data from a related "source" domain is available. However, the overwhelming majority of transfer learning methods are designed for simple settings where the source and target predictive functions are almost identical, limiting the applicability of transfer learning methods to real world data. We propose a novel, weaker, property of the source domain that can be transferred even when the source and target predictive functions diverge. Our method assumes the source and target functions share a Pairwise Similarity property, where if the source function makes similar predictions on a pair of instances, then so will the target function. We propose Pairwise Similarity Regularization Transfer, a flexible graph-based regularization framework which can incorporate this modeling assumption into standard supervised learning algorithms. We show how users can encode domain knowledge into our regularizer in the form of spatial continuity, pairwise "similarity constraints" and how our method can be scaled to large data sets using the Nystrom approximation. Finally, we present positive and negative results on real and synthetic data sets and discuss when our Pairwise Similarity transfer assumption seems to hold in practice.
Submitted 23 December, 2017;
originally announced December 2017.
-
Dense Transformer Networks
Authors:
Jun Li,
Yongjun Chen,
Lei Cai,
Ian Davidson,
Shuiwang Ji
Abstract:
The key idea of current deep learning methods for dense prediction is to apply a model on a regular patch centered on each pixel to make pixel-wise predictions. These methods are limited in the sense that the patches are determined by network architecture instead of learned from data. In this work, we propose the dense transformer networks, which can learn the shapes and sizes of patches from data. The dense transformer networks employ an encoder-decoder architecture, and a pair of dense transformer modules are inserted into each of the encoder and decoder paths. The novelty of this work is that we provide technical solutions for learning the shapes and sizes of patches from data and efficiently restoring the spatial correspondence required for dense prediction. The proposed dense transformer modules are differentiable, thus the entire network can be trained. We apply the proposed networks on natural and biological image segmentation tasks and show superior performance is achieved in comparison to baseline methods.
Submitted 7 June, 2017; v1 submitted 24 May, 2017;
originally announced May 2017.
-
The VIMOS Public Extragalactic Redshift Survey (VIPERS). The matter density and baryon fraction from the galaxy power spectrum at redshift $0.6<z<1.1$
Authors:
S. Rota,
B. R. Granett,
J. Bel,
L. Guzzo,
J. A. Peacock,
M. J. Wilson,
A. Pezzotta,
S. de la Torre,
B. Garilli,
M. Bolzonella,
M. Scodeggio,
U. Abbas,
C. Adami,
D. Bottini,
A. Cappi,
O. Cucciati,
I. Davidson,
P. Franzetti,
A. Fritz,
A. Iovino,
J. Krywult,
V. Le Brun,
O. Le Fèvre,
D. Mascagni,
K. Małek
, et al. (15 additional authors not shown)
Abstract:
We use the final catalogue of the VIMOS Public Extragalactic Redshift Survey (VIPERS) to measure the power spectrum of the galaxy distribution at high redshift, presenting results that extend beyond $z=1$ for the first time. We apply an FFT technique to four independent sub-volumes comprising a total of $51,728$ galaxies at $0.6<z<1.1$ (out of the nearly $90,000$ included in the whole survey). We concentrate here on the shape of the direction-averaged power spectrum in redshift space, explaining the level of modelling of redshift-space anisotropies and the anisotropic survey window function that are needed to deduce this in a robust fashion. We then use covariance matrices derived from a large ensemble of mock datasets in order to fit the spectral data. The results are well matched by a standard $Λ$CDM model, with density parameter $Ω_M h =\smash{0.227^{+0.063}_{-0.050}}$ and baryon fraction $\smash{f_B=Ω_B/Ω_M=0.220^{+0.058}_{-0.072}}$. These inferences from the high-$z$ galaxy distribution are consistent with results from local galaxy surveys, and also with the Cosmic Microwave Background. Thus the $Λ$CDM model gives a good match to cosmic structure at all redshifts so far accessible to observational study.
Submitted 15 June, 2017; v1 submitted 21 November, 2016;
originally announced November 2016.
-
Rank Restricted Semidefinite Matrices and Image Closedness
Authors:
Ian Davidson,
Henry Wolkowicz
Abstract:
We study the closure of the projection of the (nonconvex) cone of rank restricted positive semidefinite matrices onto subsets of the matrix entries. This defines the feasible sets for semidefinite completion problems with restrictions on the ranks. Applications include conditions for low-rank completions using the nuclear norm heuristic.
Submitted 31 October, 2016;
originally announced October 2016.
-
Some Advances in Role Discovery in Graphs
Authors:
Sean Gilpin,
Chia-Tung Kuo,
Tina Eliassi-Rad,
Ian Davidson
Abstract:
Role discovery in graphs is an emerging area that allows analysis of complex graphs in an intuitive way. In contrast to other graph problems such as community discovery, which finds groups of highly connected nodes, the role discovery problem finds groups of nodes that share similar graph topological structure. However, existing work so far has two severe limitations that prevent its use in some domains. Firstly, it is completely unsupervised, which is undesirable for a number of reasons. Secondly, most work is limited to a single relational graph. We address both these limitations in an intuitive and easy to implement alternating least squares framework. Our framework allows convex constraints to be placed on the role discovery problem, which can provide useful supervision. In particular we explore supervision to enforce i) sparsity, ii) diversity and iii) alternativeness. We then show how to lift this work for multi-relational graphs. A natural representation of a multi-relational graph is an order 3 tensor (rather than a matrix), and a Tucker decomposition allows us to find complex interactions between collections of entities (E-groups) and the roles they play for a combination of relations (R-groups). Existing Tucker decomposition methods in tensor toolboxes are not suited for our purpose, so we create our own algorithm that we demonstrate is pragmatically useful.
Submitted 8 September, 2016;
originally announced September 2016.
-
Stochastic Coordinate Coding and Its Application for Drosophila Gene Expression Pattern Annotation
Authors:
Binbin Lin,
Qingyang Li,
Qian Sun,
Ming-Jun Lai,
Ian Davidson,
Wei Fan,
Jieping Ye
Abstract:
\textit{Drosophila melanogaster} has been established as a model organism for investigating the fundamental principles of developmental gene interactions. The gene expression patterns of \textit{Drosophila melanogaster} can be documented as digital images, which are annotated with anatomical ontology terms to facilitate pattern discovery and comparison. The automated annotation of gene expression pattern images has received increasing attention due to the recent expansion of the image database. The effectiveness of gene expression pattern annotation relies on the quality of feature representation. Previous studies have demonstrated that sparse coding is effective for extracting features from gene expression images. However, solving sparse coding remains a computationally challenging problem, especially when dealing with large-scale data sets and learning large size dictionaries. In this paper, we propose a novel algorithm to solve the sparse coding problem, called Stochastic Coordinate Coding (SCC). The proposed algorithm alternately updates the sparse codes via just a few steps of coordinate descent and updates the dictionary via second order stochastic gradient descent. The computational cost is further reduced by focusing on the non-zero components of the sparse codes and the corresponding columns of the dictionary only in the updating procedure. Thus, the proposed algorithm significantly improves the efficiency and the scalability, making sparse coding applicable for large-scale data sets and large dictionary sizes. Our experiments on Drosophila gene expression data sets demonstrate the efficiency and the effectiveness of the proposed algorithm.
Submitted 9 December, 2014; v1 submitted 30 July, 2014;
originally announced July 2014.
-
Minimum Message Length Clustering Using Gibbs Sampling
Authors:
Ian Davidson
Abstract:
The K-Means and EM algorithms are popular in clustering and mixture modeling, due to their simplicity and ease of implementation. However, they have several significant limitations. Both converge to a local optimum of their respective objective functions (ignoring the uncertainty in the model space), require the a priori specification of the number of classes/clusters, and are inconsistent. In this work we overcome these limitations by using the Minimum Message Length (MML) principle and a variation to the K-Means/EM observation assignment and parameter calculation scheme. We maintain the simplicity of these approaches while constructing a Bayesian mixture modeling tool that samples/searches the model space using a Markov Chain Monte Carlo (MCMC) sampler known as a Gibbs sampler. Gibbs sampling allows us to visit each model according to its posterior probability. Therefore, if the model space is multi-modal we will visit all models and not get stuck in local optima. We call our approach multiple chains at equilibrium (MCE) MML sampling.
Submitted 16 January, 2013;
originally announced January 2013.
-
A Reconstruction Error Formulation for Semi-Supervised Multi-task and Multi-view Learning
Authors:
Buyue Qian,
Xiang Wang,
Ian Davidson
Abstract:
A significant challenge to make learning techniques more suitable for general purpose use is to move beyond i) complete supervision, ii) low dimensional data, iii) a single task and single view per instance. Solving these challenges allows working with "Big Data" problems that are typically high dimensional with multiple (but possibly incomplete) labelings and views. While other work has addressed each of these problems separately, in this paper we show how to address them together, namely semi-supervised dimension reduction for multi-task and multi-view learning (SSDR-MML), which performs optimization for dimension reduction and label inference in semi-supervised setting. The proposed framework is designed to handle both multi-task and multi-view learning settings, and can be easily adapted to many useful applications. Information obtained from all tasks and views is combined via reconstruction errors in a linear fashion that can be efficiently solved using an alternating optimization scheme. Our formulation has a number of advantages. We explicitly model the information combining mechanism as a data structure (a weight/nearest-neighbor matrix) which allows investigating fundamental questions in multi-task and multi-view learning. We address one such question by presenting a general measure to quantify the success of simultaneous learning of multiple tasks or from multiple views. We show that our SSDR-MML approach can outperform many state-of-the-art baseline methods and demonstrate the effectiveness of connecting dimension reduction and learning.
Submitted 3 February, 2012;
originally announced February 2012.
-
On Constrained Spectral Clustering and Its Applications
Authors:
Xiang Wang,
Buyue Qian,
Ian Davidson
Abstract:
Constrained clustering has been well-studied for algorithms such as $K$-means and hierarchical clustering. However, how to satisfy many constraints in these algorithmic settings has been shown to be intractable. One alternative to encode many constraints is to use spectral clustering, which remains a developing area. In this paper, we propose a flexible framework for constrained spectral clustering. In contrast to some previous efforts that implicitly encode Must-Link and Cannot-Link constraints by modifying the graph Laplacian or constraining the underlying eigenspace, we present a more natural and principled formulation, which explicitly encodes the constraints as part of a constrained optimization problem. Our method offers several practical advantages: it can encode the degree of belief in Must-Link and Cannot-Link constraints; it guarantees to lower-bound how well the given constraints are satisfied using a user-specified threshold; it can be solved deterministically in polynomial time through generalized eigendecomposition. Furthermore, by inheriting the objective function from spectral clustering and encoding the constraints explicitly, much of the existing analysis of unconstrained spectral clustering techniques remains valid for our formulation. We validate the effectiveness of our approach by empirical results on both artificial and real datasets. We also demonstrate an innovative use of encoding a large number of constraints: transfer learning via constraints.
Submitted 21 September, 2012; v1 submitted 25 January, 2012;
originally announced January 2012.
-
The LSST Data Mining Research Agenda
Authors:
K. D. Borne,
J. Becla,
I. Davidson,
A. Szalay,
J. A. Tyson
Abstract:
We describe features of the LSST science database that are amenable to scientific data mining, object classification, outlier identification, anomaly detection, image quality assurance, and survey science validation. The data mining research agenda includes: scalability (at petabytes scales) of existing machine learning and data mining algorithms; development of grid-enabled parallel data mining algorithms; designing a robust system for brokering classifications from the LSST event pipeline (which may produce 10,000 or more event alerts per night); multi-resolution methods for exploration of petascale databases; indexing of multi-attribute multi-dimensional astronomical databases (beyond spatial indexing) for rapid querying of petabyte databases; and more.
Submitted 2 November, 2008;
originally announced November 2008.