-
Clustering on the Edge: Learning Structure in Graphs
Authors:
Matt Barnes,
Artur Dubrawski
Abstract:
With the recent popularity of graphical clustering methods, there has been an increased focus on the information between samples. We show how learning cluster structure using edge features naturally and simultaneously determines the most likely number of clusters and addresses data scale issues. These results are particularly useful in instances where (a) there are a large number of clusters and (b) we have some labeled edges. Applications in this domain include image segmentation, community discovery and entity resolution. Our model is an extension of the planted partition model and our solution uses results of correlation clustering, which achieves a partition O(log(n))-close to the log-likelihood of the true clustering.
Submitted 5 May, 2016;
originally announced May 2016.
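
To illustrate how clustering from edge labels can settle the number of clusters without it being given as an input, here is a minimal sketch of the well-known randomized pivot heuristic for correlation clustering. It is not the authors' model or solver; the function name `pivot_correlation_clustering` and the `similar` callable are hypothetical names introduced only for this example.

```python
import random


def pivot_correlation_clustering(nodes, similar, seed=0):
    """Cluster nodes from pairwise same/different edge labels (pivot heuristic).

    `similar(u, v)` is a hypothetical callable that returns True when the
    edge (u, v) is labeled "same cluster". The number of clusters is never
    passed in; it falls out of the edge labels.
    """
    rng = random.Random(seed)
    remaining = list(nodes)
    rng.shuffle(remaining)
    clusters = []
    while remaining:
        # Pick a pivot and group it with every remaining node it is similar to.
        pivot = remaining[0]
        cluster = [pivot] + [v for v in remaining[1:] if similar(pivot, v)]
        clusters.append(sorted(cluster))
        members = set(cluster)
        remaining = [v for v in remaining if v not in members]
    return clusters


# Toy edge labels (assumed for illustration): positive edges only inside a group.
groups = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
clusters = pivot_correlation_clustering(groups, lambda u, v: groups[u] == groups[v])
print(sorted(clusters))
```

With perfectly consistent edge labels, as in this toy input, the heuristic recovers the two groups regardless of the pivot order; with noisy labels the result depends on the random pivots.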
-
Batched Lazy Decision Trees
Authors:
Mathieu Guillame-Bert,
Artur Dubrawski
Abstract:
We introduce a batched lazy algorithm for supervised classification using decision trees. It avoids unnecessary visits to irrelevant nodes when it is used to make predictions with either eagerly or lazily trained decision trees. A set of experiments demonstrates that the proposed algorithm can outperform both the conventional and lazy decision tree algorithms in terms of computation time as well as memory consumption, without compromising accuracy.
Submitted 8 March, 2016;
originally announced March 2016.
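
The idea of skipping irrelevant nodes can be sketched for the simpler case of an already-trained tree: a batch of queries is routed down together, and any subtree that no query reaches is never visited. This is only an assumed illustration, not the paper's algorithm; the `Node` class, `predict_batch`, and the toy tree are hypothetical.

```python
class Node:
    """A binary decision-tree node; `label` is set only on leaves."""

    def __init__(self, feature=0, threshold=0.0, left=None, right=None, label=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.label = label


def predict_batch(root, rows):
    """Route a whole batch down the tree, visiting each needed node once.

    Returns the predictions and the number of nodes actually visited.
    """
    predictions = [None] * len(rows)
    visited = 0
    stack = [(root, list(range(len(rows))))]
    while stack:
        node, idx = stack.pop()
        if not idx:
            # No query reaches this subtree, so it is skipped entirely.
            continue
        visited += 1
        if node.label is not None:
            for i in idx:
                predictions[i] = node.label
            continue
        go_left = [i for i in idx if rows[i][node.feature] <= node.threshold]
        go_right = [i for i in idx if rows[i][node.feature] > node.threshold]
        stack.append((node.left, go_left))
        stack.append((node.right, go_right))
    return predictions, visited


# Toy five-node tree (assumed for illustration).
tree = Node(0, 5.0,
            left=Node(label="low"),
            right=Node(1, 2.0, left=Node(label="mid"), right=Node(label="high")))
preds, visited = predict_batch(tree, [[1.0, 0.0], [3.0, 9.0], [4.0, 1.0]])
print(preds, visited)
```

In this toy batch every query falls in the left branch, so only the root and one leaf are visited and the right subtree is never touched.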
-
Do Public Events Affect Sex Trafficking Activity?
Authors:
Kyle Miller,
Emily Kennedy,
Artur Dubrawski
Abstract:
For several years the pervasive belief that the Super Bowl is the single biggest day for human trafficking in the United States each year has been perpetuated in popular press despite a lack of evidentiary support. The practice of relying on hearsay and popular belief for decision-making may result in misappropriation of resources in anti-trafficking efforts. We propose a data-driven approach to analyzing sex trafficking, especially as it is carried on during, and perhaps in response to, large public events such as the Super Bowl. We examine 33 public events, chosen for attendance numbers comparable to the Super Bowl from a diversity of types, and use the volume of escort advertisements posted online as an accessible and reasonable proxy measure for the actual levels of activity of sex workers as well as trafficking victims. Our analysis puts the impact of local public events on sex advertisement activity into perspective. We find that many of the events we considered are not correlated with a statistically significant impact on sex-worker advertising, though some are. Additionally, we demonstrate how our method can uncover evidence of other events, not included in our initial list, that are correlated with more significant increases in ad activity. Reliance on quantitative evidence accessible through data-driven analysis can inform wise resource allocation, guide good policies, and foster the most meaningful impact.
Submitted 16 February, 2016;
originally announced February 2016.
-
Canonical Autocorrelation Analysis
Authors:
Maria De-Arteaga,
Artur Dubrawski,
Peter Huggins
Abstract:
We present an extension of sparse Canonical Correlation Analysis (CCA) designed for finding multiple-to-multiple linear correlations within a single set of variables. Unlike CCA, which finds correlations between two sets of data where the rows are matched exactly but the columns represent separate sets of variables, the method proposed here, Canonical Autocorrelation Analysis (CAA), finds multivariate correlations within just one set of variables. This can be useful when we look for hidden parsimonious structures in data, each involving only a small subset of all features. In addition, the discovered correlations are highly interpretable as they are formed by pairs of sparse linear combinations of the original features. We show how CAA can be of use as a tool for anomaly detection when the expected structure of correlations is not followed by anomalous data. We illustrate the utility of CAA in two application domains where single-class and unsupervised learning of correlation structures are particularly relevant: breast cancer diagnosis and radiation threat detection. When applied to the Wisconsin Breast Cancer data, single-class CAA is competitive with supervised methods used in the literature. On the radiation threat detection task, unsupervised CAA performs significantly better than an unsupervised alternative prevalent in the domain, while providing valuable additional insights for threat analysis.
Submitted 19 November, 2015;
originally announced November 2015.
-
Lass-0: sparse non-convex regression by local search
Authors:
William Herlands,
Maria De-Arteaga,
Daniel Neill,
Artur Dubrawski
Abstract:
We compute approximate solutions to L0 regularized linear regression using L1 regularization, also known as the Lasso, as an initialization step. Our algorithm, the Lass-0 ("Lass-zero"), uses a computationally efficient stepwise search to determine a locally optimal L0 solution given any L1 regularization solution. We present theoretical results of consistency under orthogonality and appropriate handling of redundant features. Empirically, we use synthetic data to demonstrate that Lass-0 solutions are closer to the true sparse support than L1 regularization models. Additionally, in real-world data Lass-0 finds more parsimonious solutions than L1 regularization while maintaining similar predictive accuracy.
Submitted 17 February, 2016; v1 submitted 13 November, 2015;
originally announced November 2015.
-
An Entity Resolution approach to isolate instances of Human Trafficking online
Authors:
Chirag Nagpal,
Kyle Miller,
Benedikt Boecking,
Artur Dubrawski
Abstract:
Human trafficking is a challenging law enforcement problem, and a large amount of such activity manifests itself on various online forums. Given the large, heterogeneous and noisy structure of this data, building models to predict instances of trafficking is an even more convoluted task. In this paper we propose an entity resolution pipeline using a notion of proxy labels, in order to extract clusters from this data with prior history of human trafficking activity. We apply this pipeline to 5M records from backpage.com and report on the performance of this approach, challenges in terms of scalability, and some significant domain-specific characteristics of our resolved entities.
Submitted 18 June, 2017; v1 submitted 22 September, 2015;
originally announced September 2015.
-
Performance Bounds for Pairwise Entity Resolution
Authors:
Matt Barnes,
Kyle Miller,
Artur Dubrawski
Abstract:
One significant challenge to scaling entity resolution algorithms to massive datasets is understanding how performance changes after moving beyond the realm of small, manually labeled reference datasets. Unlike traditional machine learning tasks, when an entity resolution algorithm performs well on small hold-out datasets, there is no guarantee this performance holds on larger hold-out datasets. We prove simple bounding properties between the performance of a match function on a small validation set and the performance of a pairwise entity resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization of pairwise entity resolution algorithms for large datasets, using a small set of labeled data.
Submitted 10 September, 2015;
originally announced September 2015.
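
For readers unfamiliar with the setting, a pairwise entity resolution algorithm can be sketched as a match function applied to record pairs, followed by a transitive closure over the matched pairs. The sketch below is a generic illustration under that assumption, not the algorithm or the bounds from the paper; `resolve_entities`, `same_name`, and the toy records are hypothetical.

```python
from itertools import combinations


def resolve_entities(records, match):
    """Pairwise entity resolution: match pairs, then take the transitive closure."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Every pair is scored, so the work grows with n * (n - 1) / 2 pairs.
    for i, j in combinations(range(len(records)), 2):
        if match(records[i], records[j]):
            parent[find(i)] = find(j)

    entities = {}
    for i in range(len(records)):
        entities.setdefault(find(i), []).append(records[i])
    return sorted(entities.values())


def same_name(a, b):
    """Hypothetical match function: equal after lowercasing and dropping periods."""
    def normalize(s):
        return s.lower().replace(".", "")
    return normalize(a) == normalize(b)


records = ["Jon Smith", "jon smith", "J. Doe", "j doe", "Ann Lee"]
entities = resolve_entities(records, same_name)
print(entities)
```

Because the number of candidate pairs grows quadratically with the number of records, even a match function with a small per-pair error rate can produce many spurious merges on a large dataset, which is one intuition for why results on a small validation set need not carry over.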