-
The Trace Criterion for Kernel Bandwidth Selection for Support Vector Data Description
Authors:
Arin Chaudhuri,
Carol Sadek,
Deovrat Kakde,
Wenhao Hu,
Hansi Jiang,
Seunghyun Kong,
Yuewei Liao,
Sergiy Peredriy,
Haoyu Wang
Abstract:
Support vector data description (SVDD) is a popular anomaly detection technique. The SVDD classifier partitions the whole data space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, for which the Gaussian kernel is a common choic…
▽ More
Support vector data description (SVDD) is a popular anomaly detection technique. The SVDD classifier partitions the whole data space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, for which the Gaussian kernel is a common choice. The Gaussian kernel has a bandwidth parameter, and it is important to set the value of this parameter correctly for good results. A small bandwidth leads to overfitting such that the resulting SVDD classifier overestimates the number of anomalies, whereas a large bandwidth leads to underfitting and an inability to detect many anomalies. In this paper, we present a new unsupervised method for selecting the Gaussian kernel bandwidth. Our method exploits a low-rank representation of the kernel matrix to suggest a kernel bandwidth value. Our new technique is competitive with the current state of the art for low-dimensional data and performs extremely well for many classes of high-dimensional data. Because the mathematical formulation of SVDD is identical with the mathematical formulation of one-class support vector machines (OCSVM) when the Gaussian kernel is used, our method is equally applicable to Gaussian kernel bandwidth tuning for OCSVM.
△ Less
Submitted 5 February, 2020; v1 submitted 15 November, 2018;
originally announced November 2018.
-
The Mean and Median Criterion for Automatic Kernel Bandwidth Selection for Support Vector Data Description
Authors:
Arin Chaudhuri,
Deovrat Kakde,
Carol Sadek,
Laura Gonzalez,
Seunghyun Kong
Abstract:
Support vector data description (SVDD) is a popular technique for detecting anomalies. The SVDD classifier partitions the whole space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, and the Gaussian kernel is a common choice for…
▽ More
Support vector data description (SVDD) is a popular technique for detecting anomalies. The SVDD classifier partitions the whole space into an inlier region, which consists of the region near the training data, and an outlier region, which consists of points away from the training data. The computation of the SVDD classifier requires a kernel function, and the Gaussian kernel is a common choice for the kernel function. The Gaussian kernel has a bandwidth parameter, whose value is important for good results. A small bandwidth leads to overfitting, and the resulting SVDD classifier overestimates the number of anomalies. A large bandwidth leads to underfitting, and the classifier fails to detect many anomalies. In this paper we present a new automatic, unsupervised method for selecting the Gaussian kernel bandwidth. The selected value can be computed quickly, and it is competitive with existing bandwidth selection methods.
△ Less
Submitted 21 August, 2017; v1 submitted 16 August, 2017;
originally announced August 2017.
-
A Case Study in Text Mining: Interpreting Twitter Data From World Cup Tweets
Authors:
Daniel Godfrey,
Caley Johns,
Carl Meyer,
Shaina Race,
Carol Sadek
Abstract:
Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic…
▽ More
Cluster analysis is a field of data analysis that extracts underlying patterns in data. One application of cluster analysis is in text-mining, the analysis of large collections of text to find similarities between documents. We used a collection of about 30,000 tweets extracted from Twitter just before the World Cup started. A common problem with real world text data is the presence of linguistic noise. In our case it would be extraneous tweets that are unrelated to dominant themes. To combat this problem, we created an algorithm that combined the DBSCAN algorithm and a consensus matrix. This way we are left with the tweets that are related to those dominant themes. We then used cluster analysis to find those topics that the tweets describe. We clustered the tweets using k-means, a commonly used clustering algorithm, and Non-Negative Matrix Factorization (NMF) and compared the results. The two algorithms gave similar results, but NMF proved to be faster and provided more easily interpreted results. We explored our results using two visualization tools, Gephi and Wordle.
△ Less
Submitted 21 August, 2014;
originally announced August 2014.