-
Explicit Formulae to Interchangeably use Hyperplanes and Hyperballs using Inversive Geometry
Authors:
Erik Thordsen,
Erich Schubert
Abstract:
Many algorithms require discriminative boundaries, such as separating hyperplanes or hyperballs, or are specifically designed to work on spherical data. By applying inversive geometry, we show that the two discriminative boundaries can be used interchangeably, and that general Euclidean data can be transformed into spherical data, whenever a change in point distances is acceptable. We provide explicit formulae to embed general Euclidean data into spherical data and to unembed it back. We further show a duality between hyperspherical caps, i.e., the volume created by a separating hyperplane on spherical data, and hyperballs and provide explicit formulae to map between the two. We further provide equations to translate inner products and Euclidean distances between the two spaces, to avoid explicit embedding and unembedding. We also provide a method to enforce projections of the general Euclidean space onto hemi-hyperspheres and propose an intrinsic dimensionality based method to obtain "all-purpose" parameters. To show the usefulness of the cap-ball-duality, we discuss example applications in machine learning and vector similarity search.
Submitted 28 May, 2024;
originally announced May 2024.
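As a rough illustration of how inversion turns Euclidean data into spherical data, the sketch below lifts a point to the hyperplane x_{d+1} = 1 and inverts it in the unit sphere. This is textbook sphere inversion under assumed conventions (unit sphere at the origin, hypothetical function names `invert`, `embed`, `unembed`); it is not the explicit parameterized formulae of the paper.

```python
import math

def invert(p, center, radius=1.0):
    """Invert point p in the sphere with the given center and radius."""
    diff = [a - c for a, c in zip(p, center)]
    sq = sum(d * d for d in diff)
    if sq == 0.0:
        raise ValueError("the center of inversion has no image")
    scale = radius * radius / sq
    return [c + scale * d for c, d in zip(center, diff)]

def embed(x):
    """Lift x in R^d to the hyperplane x_{d+1} = 1, then invert in the unit sphere."""
    return invert(list(x) + [1.0], [0.0] * (len(x) + 1))

def unembed(y):
    """Inversion is an involution: invert again and drop the lifted coordinate."""
    return invert(y, [0.0] * len(y))[:-1]

point = [3.0, -4.0]
image = embed(point)
# All images lie on the sphere with center (0, ..., 0, 1/2) and radius 1/2.
sphere_center = [0.0] * len(point) + [0.5]
print(math.dist(image, sphere_center))
print(unembed(image))
```

Distances between points change under this map, which matches the abstract's caveat that the transformation applies only where a change in point distances is acceptable.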
-
Medoid Silhouette clustering with automatic cluster number selection
Authors:
Lars Lenssen,
Erich Schubert
Abstract:
The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, provide two fast versions for the direct optimization, and discuss the use to choose the optimal number of clusters. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvements FasterPAM. One of the versions guarantees equal results to the original variant and provides a run speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k$=100, we observed a 10464$\times$ speedup compared to the original PAMMEDSIL algorithm. Additionally, we provide a variant to choose the optimal number of clusters directly.
Submitted 7 September, 2023;
originally announced September 2023.
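The abstract does not spell out the medoid-based Silhouette, so the following is only a minimal sketch under a commonly used definition, taken here as an assumption: each point scores 1 - d1/d2, where d1 and d2 are its distances to the nearest and second-nearest medoid, with 0/0 treated as 0. The function name `medoid_silhouette` and the toy data are hypothetical, and this is the naive evaluation, not the accelerated optimization the paper describes.

```python
def medoid_silhouette(dist, medoids):
    """Average of 1 - d1/d2 over all points (assumed definition, needs k >= 2)."""
    total = 0.0
    for row in dist:
        nearest = sorted(row[m] for m in medoids)
        d1, d2 = nearest[0], nearest[1]
        # Convention assumed here: 0/0 := 0, so the point scores 1.
        total += 1.0 if d2 == 0.0 else 1.0 - d1 / d2
    return total / len(dist)

# Toy data: two well separated groups on a line.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]

good = medoid_silhouette(dist, [1, 4])  # one medoid per group
poor = medoid_silhouette(dist, [0, 1])  # both medoids in the same group
print(good, poor)
```

Because the score only needs the two nearest medoids of each point, it can be evaluated in O(Nk) time for N points, which is what makes direct optimization plausible at all.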
-
Sparse Partitioning Around Medoids
Authors:
Lars Lenssen,
Erich Schubert
Abstract:
Partitioning Around Medoids (PAM, k-Medoids) is a popular clustering technique to use with arbitrary distance functions or similarities, where each cluster is represented by its most central object, called the medoid or the discrete median. In operations research, this family of problems is also known as facility location problem (FLP). FastPAM recently introduced a speedup for large k to make it applicable for larger problems, but the method still has a runtime quadratic in N. In this chapter, we discuss a sparse and asymmetric variant of this problem, to be used for example on graph data such as road networks. By exploiting sparsity, we can avoid the quadratic runtime and memory requirements, and make this method scalable to even larger problems, as long as we are able to build a small enough graph of sufficient connectivity to perform local optimization. Furthermore, we consider asymmetric cases, where the set of medoids is not identical to the set of points to be covered (or in the interpretation of facility location, where the possible facility locations are not identical to the consumer locations). Because of sparsity, it may be impossible to cover all points with just k medoids for too small k, which would render the problem unsolvable, and this breaks common heuristics for finding a good starting condition. We, hence, consider determining k as a part of the optimization problem and propose to first construct a greedy initial solution with a larger k, then to optimize the problem by alternating between PAM-style "swap" operations where the result is improved by replacing medoids with better alternatives and "remove" operations to reduce the number of k until neither allows further improving the result quality. We demonstrate the usefulness of this method on a problem from electrical engineering, with the input graph derived from cartographic data.
Submitted 5 September, 2023;
originally announced September 2023.
-
Data Aggregation for Hierarchical Clustering
Authors:
Erich Schubert,
Andreas Lang
Abstract:
Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method, because it can be used with many distances, similarities, and various linkage strategies. It is often used when the number of clusters the data set forms is unknown and some sort of hierarchy in the data is plausible. Most algorithms for HAC operate on a full distance matrix, and therefore require quadratic memory. The standard algorithm also has cubic runtime to produce a full hierarchy. Both memory and runtime are especially problematic in the context of embedded or otherwise very resource-constrained systems. In this section, we present how data aggregation with BETULA, a numerically stable version of the well known BIRCH data aggregation algorithm, can be used to make HAC viable on systems with constrained resources with only small losses on clustering quality, and hence allow exploratory data analysis of very large data sets.
Submitted 5 September, 2023;
originally announced September 2023.
-
LOSDD: Leave-Out Support Vector Data Description for Outlier Detection
Authors:
Daniel Boiar,
Thomas Liebig,
Erich Schubert
Abstract:
Support Vector Machines have been successfully used for one-class classification (OCSVM, SVDD) when trained on clean data, but they work much worse on dirty data: outliers present in the training data tend to become support vectors, and are hence considered "normal". In this article, we improve the effectiveness to detect outliers in dirty training data with a leave-out strategy: by temporarily omitting one candidate at a time, this point can be judged using the remaining data only. We show that this is more effective at scoring the outlierness of points than using the slack term of existing SVM-based approaches. Identified outliers can then be removed from the data, such that outliers hidden by other outliers can be identified, to reduce the problem of masking. Naively, this approach would require training N individual SVMs (and training $O(N^2)$ SVMs when iteratively removing the worst outliers one at a time), which is prohibitively expensive. We will discuss that only support vectors need to be considered in each step and that by reusing SVM parameters and weights, this incremental retraining can be accelerated substantially. By removing candidates in batches, we can further improve the processing time, although it obviously remains more costly than training a single SVM.
Submitted 27 December, 2022;
originally announced December 2022.
-
Stop using the elbow criterion for k-means and how to choose the number of clusters instead
Authors:
Erich Schubert
Abstract:
A major challenge when using k-means clustering often is how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in literature for a long time, and we want to draw attention to some of these easy to use options, that often perform better. This letter is a call to stop using the elbow method altogether, because it severely lacks theoretic support, and we want to encourage educators to discuss the problems of the method -- if introducing it in class at all -- and teach alternatives instead, while researchers and reviewers should reject conclusions drawn from the elbow method.
Submitted 23 December, 2022;
originally announced December 2022.
-
Algebra of N-event synchronization
Authors:
Ernesto Gomez,
Keith E. Schubert,
Khalil Dajani
Abstract:
We have previously defined synchronization (Gomez, E. and K. Schubert 2011) as a relation between the times at which a pair of events can happen, and introduced an algebra that covers all possible relations for such pairs. In this work we introduce the synchronization matrix, to make it easier to calculate the properties and results of $N$-event synchronizations, such as are commonly encountered in parallel execution of multiple processes. The synchronization matrix leads to the definition of N-event synchronization algebras as specific extensions to the original algebra. We derive general properties of such synchronization, and we are able to analyze effects of synchronization on the phase space of parallel execution introduced in (Gomez E, Kai R, Schubert KE 2017).
Submitted 1 November, 2022;
originally announced November 2022.
-
Clustering by Direct Optimization of the Medoid Silhouette
Authors:
Lars Lenssen,
Erich Schubert
Abstract:
The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its properties, and provide two fast versions for the direct optimization. We combine ideas from the original Silhouette with the well-known PAM algorithm and its latest improvements FasterPAM. One of the versions guarantees equal results to the original variant and provides a run speedup of $O(k^2)$. In experiments on real data with 30000 samples and $k$=100, we observed a 10464$\times$ speedup compared to the original PAMMEDSIL algorithm.
Submitted 26 September, 2022;
originally announced September 2022.
-
On Projections to Linear Subspaces
Authors:
Erik Thordsen,
Erich Schubert
Abstract:
The merit of projecting data onto linear subspaces is well known from, e.g., dimension reduction. One key aspect of subspace projections, the maximum preservation of variance (principal component analysis), has been thoroughly researched and the effect of random linear projections on measures such as intrinsic dimensionality still is an ongoing effort. In this paper, we investigate the less explored depths of linear projections onto explicit subspaces of varying dimensionality and the expectations of variance that ensue. The result is a new family of bounds for Euclidean distances and inner products. We showcase the quality of these bounds as well as investigate the intimate relation to intrinsic dimensionality estimation.
Submitted 26 September, 2022;
originally announced September 2022.
-
EmbAssi: Embedding Assignment Costs for Similarity Search in Large Graph Databases
Authors:
Franka Bause,
Erich Schubert,
Nils M. Kriege
Abstract:
The graph edit distance is an intuitive measure to quantify the dissimilarity of graphs, but its computation is NP-hard and challenging in practice. We introduce methods for answering nearest neighbor and range queries regarding this distance efficiently for large databases with up to millions of graphs. We build on the filter-verification paradigm, where lower and upper bounds are used to reduce the number of exact computations of the graph edit distance. Highly effective bounds for this involve solving a linear assignment problem for each graph in the database, which is prohibitive in massive datasets. Index-based approaches typically provide only weak bounds, leading to high computational costs for verification. In this work, we derive novel lower bounds for efficient filtering from restricted assignment problems, where the cost function is a tree metric. This special case allows embedding the costs of optimal assignments isometrically into $\ell_1$ space, rendering efficient indexing possible. We propose several lower bounds of the graph edit distance obtained from tree metrics reflecting the edit costs, which are combined for effective filtering. Our method, termed EmbAssi, can be integrated into existing filter-verification pipelines as a fast and effective pre-filtering step. Empirically, we show that for many real-world graphs our lower bounds are already close to the exact graph edit distance, while our index construction and search scale to very large databases.
Submitted 19 July, 2022; v1 submitted 15 November, 2021;
originally announced November 2021.
-
Metric Indexing for Graph Similarity Search
Authors:
Franka Bause,
David B. Blumenthal,
Erich Schubert,
Nils M. Kriege
Abstract:
Finding the graphs that are most similar to a query graph in a large database is a common task with various applications. A widely-used similarity measure is the graph edit distance, which provides an intuitive notion of similarity and naturally supports graphs with vertex and edge attributes. Since its computation is NP-hard, techniques for accelerating similarity search have been studied extensively. However, index-based approaches for this are almost exclusively designed for graphs with categorical vertex and edge labels and uniform edit costs. We propose a filter-verification framework for similarity search, which supports non-uniform edit costs for graphs with arbitrary attributes. We employ an expensive lower bound obtained by solving an optimal assignment problem. This filter distance satisfies the triangle inequality, making it suitable for acceleration by metric indexing. In subsequent stages, assignment-based upper bounds are used to avoid further exact distance computations. Our extensive experimental evaluation shows that a significant runtime advantage over both a linear scan and state-of-the-art methods is achieved.
Submitted 4 October, 2021;
originally announced October 2021.
-
Developments in Mathematical Algorithms and Computational Tools for Proton CT and Particle Therapy Treatment Planning
Authors:
Yair Censor,
Keith E. Schubert,
Reinhard W. Schulte
Abstract:
We summarize recent results and ongoing activities in mathematical algorithms and computer science methods related to proton computed tomography (pCT) and intensity-modulated particle therapy (IMPT) treatment planning. Proton therapy necessitates a high level of delivery accuracy to exploit the selective targeting imparted by the Bragg peak. For this purpose, pCT utilizes the proton beam itself to create images. The technique works by sending a low-intensity beam of protons through the patient and measuring the position, direction, and energy loss of each exiting proton. The pCT technique allows reconstruction of the volumetric distribution of the relative stopping power (RSP) of the patient tissues for use in treatment planning and pre-treatment range verification. We have investigated new ways to make the reconstruction both efficient and accurate. Better accuracy of RSP also enables more robust inverse approaches to IMPT. For IMPT, we developed a framework for performing intensity-modulation of the proton pencil beams. We expect that these developments will lead to additional project work in the years to come, which requires a regular exchange between experts in the fields of mathematics, computer science, and medical physics. We have initiated such an exchange by organizing annual workshops on pCT and IMPT algorithm and technology developments. This report is, admittedly, tilted toward our interdisciplinary work and methods. We offer a comprehensive overview of results, problems, and challenges in pCT and IMPT, with the aim of encouraging other scientists to tackle such issues and of strengthening interdisciplinary collaboration by bringing cutting-edge know-how from medicine, computer science, physics, and mathematics to bear on the medical physics problems at hand.
Submitted 21 August, 2021;
originally announced August 2021.
-
MESS: Manifold Embedding Motivated Super Sampling
Authors:
Erik Thordsen,
Erich Schubert
Abstract:
Many approaches in the field of machine learning and data analysis rely on the assumption that the observed data lies on lower-dimensional manifolds. This assumption has been verified empirically for many real data sets. To make use of this manifold assumption one generally requires the manifold to be locally sampled to a certain density such that features of the manifold can be observed. However, for increasing intrinsic dimensionality of a data set the required data density introduces the need for very large data sets, resulting in one of the many faces of the curse of dimensionality. To combat the increased requirement for local data density we propose a framework to generate virtual data points that are faithful to an approximate embedding function underlying the manifold observable in the data.
Submitted 14 July, 2021;
originally announced July 2021.
-
Accelerating Spherical k-Means
Authors:
Erich Schubert,
Andreas Lang,
Gloria Feher
Abstract:
Spherical k-means is a widely used clustering algorithm for sparse and high-dimensional data such as document vectors. While several improvements and accelerations have been introduced for the original k-means algorithm, not all easily translate to the spherical variant: Many acceleration techniques, such as the algorithms of Elkan and Hamerly, rely on the triangle inequality of Euclidean distances. However, spherical k-means uses Cosine similarities instead of distances for computational efficiency. In this paper, we incorporate the Elkan and Hamerly accelerations to the spherical k-means algorithm working directly with the Cosines instead of Euclidean distances to obtain a substantial speedup and evaluate these spherical accelerations on real data.
Submitted 8 July, 2021;
originally announced July 2021.
-
A Triangle Inequality for Cosine Similarity
Authors:
Erich Schubert
Abstract:
Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, Cosine similarity has become a popular alternative choice to the standard Euclidean metric, in particular in the context of textual data and neural network embeddings. Unfortunately, Cosine similarity is not a metric and does not satisfy the standard triangle inequality. Instead, many search techniques for Cosine rely on approximation techniques such as locality sensitive hashing. In this paper, we derive a triangle inequality for Cosine similarity that is suitable for efficient similarity search with many standard search structures (such as the VP-tree, Cover-tree, and M-tree); show that this bound is tight and discuss fast approximations for it. We hope that this spurs new research on accelerating exact similarity search for cosine similarity, and possibly other similarity measures beyond the existing work for distance metrics.
Submitted 8 July, 2021;
originally announced July 2021.
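A transitive bound of the kind described can be sketched from the fact that the angle between vectors is a metric: applying cos(a ± b) = cos a cos b ∓ sin a sin b to the angular triangle inequality bounds sim(x, y) from sim(x, z) and sim(z, y). The code below is an illustration under that derivation, with hypothetical function names; whether it coincides with the exact form and the fast approximations given in the paper is an assumption.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity of two nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def cosine_bounds(sim_xz, sim_zy):
    """Lower and upper bound on sim(x, y), given similarities to a pivot z."""
    # Clamp at zero to guard against rounding pushing a similarity past 1.
    root = math.sqrt(max(0.0, 1.0 - sim_xz * sim_xz) *
                     max(0.0, 1.0 - sim_zy * sim_zy))
    prod = sim_xz * sim_zy
    return prod - root, prod + root

def bounds_hold(trials=1000, dim=5, seed=0, tol=1e-9):
    """Empirically check the bounds on random Gaussian vectors."""
    rng = random.Random(seed)
    for _ in range(trials):
        x, y, z = ([rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(3))
        lower, upper = cosine_bounds(cosine(x, z), cosine(z, y))
        if not (lower - tol <= cosine(x, y) <= upper + tol):
            return False
    return True

print(bounds_hold())
```

In an index such as a VP-tree, a candidate can be discarded without computing sim(x, y) whenever the upper bound already falls below the current similarity threshold.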
-
Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms
Authors:
Erich Schubert,
Peter J. Rousseeuw
Abstract:
Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains and applications. A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second ("SWAP") phase of the algorithm, but still find the same results as the original PAM algorithm. If we relax the choice of swaps performed (while retaining comparable quality), we can further accelerate the algorithm by eagerly performing additional swaps in each iteration. With the substantially faster SWAP, we can now explore faster initialization strategies, because (i) the classic ("BUILD") initialization now becomes the bottleneck, and (ii) our swap is fast enough to compensate for worse starting conditions. We also show how the CLARA and CLARANS algorithms benefit from the proposed modifications. While we do not study the parallelization of our approach in this work, it can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important. In experiments on real data with k=100 and k=200, we observed a 458x and a 1191x speedup, respectively, compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets, and in particular to higher k.
Submitted 1 June, 2021; v1 submitted 12 August, 2020;
originally announced August 2020.
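The medoid described in the abstract, the object with the smallest total dissimilarity to the others in its cluster, can be computed directly from a dissimilarity matrix. The sketch below shows only this definition with a hypothetical helper `medoid` and toy data; it is the naive quadratic computation per cluster and does not reflect the accelerated SWAP phase of the paper.

```python
def medoid(dist, members):
    """Index in `members` with the smallest summed dissimilarity to the cluster."""
    return min(members, key=lambda c: sum(dist[c][j] for j in members))

# Any dissimilarity works; absolute difference is used only to keep the toy small.
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(a - b) for b in points] for a in points]

print(medoid(dist, [0, 1, 2]))  # index of the central object of the first group
print(medoid(dist, [3, 4, 5]))  # index of the central object of the second group
```

Since only pairwise dissimilarities are consulted, the same function applies unchanged to non-Euclidean measures, which is the property the abstract highlights.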
-
BETULA: Numerically Stable CF-Trees for BIRCH Clustering
Authors:
Andreas Lang,
Erich Schubert
Abstract:
BIRCH clustering is a widely known approach for clustering, that has influenced much subsequent research and commercial products. The key contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Because of the data compression, this method is very scalable. The idea has been adopted for example for k-means, data stream, and density-based clustering.
Clustering features used by BIRCH are simple summary statistics that can easily be updated with new data: the number of points, the linear sums, and the sum of squared values. Unfortunately, how the sum of squares is then used in BIRCH is prone to catastrophic cancellation.
We introduce a replacement cluster feature that does not have this numeric problem, that is not much more expensive to maintain, and which makes many computations simpler and hence more efficient. These cluster features can also easily be used in other work derived from BIRCH, such as algorithms for streaming data. In the experiments, we demonstrate the numerical problem and compare the performance of the original algorithm to that of the improved cluster features.
Submitted 23 June, 2020;
originally announced June 2020.
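The cancellation problem can be reproduced in a few lines. The sketch below contrasts a variance computed from the classic summary (count, linear sum, sum of squares) with one maintained as a running mean plus a sum of squared deviations, updated in the style of Welford's algorithm. Using the latter as a stand-in for the paper's replacement cluster feature is an assumption, and both function names are hypothetical.

```python
def variance_from_sums(values):
    """Classic summary: count, linear sum, sum of squares (prone to cancellation)."""
    n = len(values)
    linear_sum = sum(values)
    square_sum = sum(v * v for v in values)
    mean = linear_sum / n
    return square_sum / n - mean * mean

def variance_from_deviations(values):
    """Running mean and sum of squared deviations, Welford-style update."""
    n, mean, sq_dev = 0, 0.0, 0.0
    for v in values:
        n += 1
        delta = v - mean
        mean += delta / n
        sq_dev += delta * (v - mean)
    return sq_dev / n

# Small spread around a large offset: the true population variance is 1.25.
data = [1e9 + i for i in range(4)]
print(variance_from_sums(data))
print(variance_from_deviations(data))
```

With values near 1e9 the squares are near 1e18, where double precision can no longer resolve differences of a few units, so the subtraction in the first function loses the signal that the second one preserves.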
-
ABID: Angle Based Intrinsic Dimensionality
Authors:
Erik Thordsen,
Erich Schubert
Abstract:
The intrinsic dimensionality refers to the ``true'' dimensionality of the data, as opposed to the dimensionality of the data representation. For example, when attributes are highly correlated, the intrinsic dimensionality can be much lower than the number of variables. Local intrinsic dimensionality refers to the observation that this property can vary for different parts of the data set; and intrinsic dimensionality can serve as a proxy for the local difficulty of the data set.
Most popular methods for estimating the local intrinsic dimensionality are based on distances, and the rate at which the distances to the nearest neighbors increase, a concept known as ``expansion dimension''. In this paper we introduce an orthogonal concept, which does not use any distances: we use the distribution of angles between neighbor points. We derive the theoretical distribution of angles and use this to construct an estimator for intrinsic dimensionality.
Experimentally, we verify that this measure behaves similarly, but complementarily, to existing measures of intrinsic dimensionality. By introducing a new idea of intrinsic dimensionality to the research community, we hope to contribute to a better understanding of intrinsic dimensionality and to spur new research in this direction.
Submitted 23 June, 2020;
originally announced June 2020.
-
ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"
Authors:
Erich Schubert,
Arthur Zimek
Abstract:
This paper documents the release of the ELKI data mining framework, version 0.7.5.
ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can provide major performance gains. ELKI is designed to be easy to extend for researchers and students in this domain, and welcomes contributions of additional methods. ELKI aims at providing a large collection of highly parameterizable algorithms, in order to allow easy and fair evaluation and benchmarking of algorithms.
We first outline the motivation for this release and the plans for the future, and then give a brief overview of the new functionality in this version. We also include an appendix presenting an overview of the overall implemented functionality.
Submitted 10 February, 2019;
originally announced February 2019.
-
Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms
Authors:
Erich Schubert,
Peter J. Rousseeuw
Abstract:
Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean (as used in k-means) is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object with the smallest dissimilarity to all others in the cluster. This notion of centrality can be used with any (dis-)similarity, and thus is of high relevance to many domains such as biology that require the use of Jaccard, Gower, or more complex distances.
A key issue with PAM is its high run time cost. We propose modifications to the PAM algorithm that achieve an O(k)-fold speedup in the second (SWAP) phase of the algorithm while still finding the same results as the original PAM algorithm. If we slightly relax the choice of swaps performed (at comparable quality), we can further accelerate the algorithm by performing up to k swaps in each iteration. With the substantially faster SWAP, we can now also explore alternative strategies for choosing the initial medoids. We also show how the CLARA and CLARANS algorithms benefit from these modifications. The modifications can easily be combined with earlier approaches to use PAM and CLARA on big data (some of which use PAM as a subroutine, hence can immediately benefit from these improvements), where the performance with high k becomes increasingly important.
In experiments on real data with k=100, we observed a 200-fold speedup compared to the original PAM SWAP algorithm, making PAM applicable to larger data sets as long as we can afford to compute a distance matrix, and in particular to higher k (at k=2, the new SWAP was only 1.5 times faster, as the speedup is expected to increase with k).
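The medoid definition in the abstract above (the object with the smallest dissimilarity to all others in the cluster) can be illustrated with a minimal sketch. It assumes a precomputed dissimilarity matrix; the function name `medoid_index` and the toy data are hypothetical, and this shows only the medoid computation, not the authors' accelerated SWAP procedure.

```python
import numpy as np

def medoid_index(dist, members):
    """Return the data-set index of the medoid of one cluster.

    dist    -- full pairwise dissimilarity matrix (n x n); any dissimilarity works
    members -- indices of the objects that belong to the cluster
    """
    members = np.asarray(members)
    # Restrict the matrix to the cluster and sum each object's
    # dissimilarity to all other cluster members.
    sub = dist[np.ix_(members, members)]
    return int(members[np.argmin(sub.sum(axis=1))])

# Toy 1-D example with one outlier (hypothetical data).
points = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
dist = np.abs(points - points.T)  # pairwise absolute differences

print(medoid_index(dist, [0, 1, 2, 3, 4]))
```

Unlike the mean of these values (3.2), the medoid is always an actual object of the data set, which is why it remains usable when only dissimilarities are available.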
Submitted 29 October, 2019; v1 submitted 12 October, 2018;
originally announced October 2018.
-
An Improved Method of Total Variation Superiorization Applied to Reconstruction in Proton Computed Tomography
Authors:
Blake Schultze,
Yair Censor,
Paniz Karbasi,
Keith E. Schubert,
Reinhard W. Schulte
Abstract:
Previous work showed that total variation superiorization (TVS) improves reconstructed image quality in proton computed tomography (pCT). The structure of the TVS algorithm has evolved since then and this work investigated if this new algorithmic structure provides additional benefits to pCT image quality. Structural and parametric changes introduced to the original TVS algorithm included: (1) inclusion or exclusion of TV reduction requirement, (2) a variable number, $N$, of TV perturbation steps per feasibility-seeking iteration, and (3) introduction of a perturbation kernel $0<α<1$. The structural change of excluding the TV reduction requirement check tended to have a beneficial effect for $3\le N\le 6$ and allows full parallelization of the TVS algorithm. Repeated perturbations per feasibility-seeking iterations reduced total variation (TV) and material dependent standard deviations for $3\le N\le 6$. The perturbation kernel $α$, equivalent to $α=0.5$ in the original TVS algorithm, reduced TV and standard deviations as $α$ was increased beyond $α=0.5$, but negatively impacted reconstructed relative stopping power (RSP) values for $α>0.75$. The reductions in TV and standard deviations allowed feasibility-seeking with a larger relaxation parameter $λ$ than previously used, without the corresponding increases in standard deviations experienced with the original TVS algorithm. This work demonstrates that the modifications related to the evolution of the original TVS algorithm provide benefits in terms of both pCT image quality and computational efficiency for appropriately chosen parameter values.
Submitted 17 January, 2019; v1 submitted 3 March, 2018;
originally announced March 2018.
-
A Highly Accelerated Parallel Multi-GPU based Reconstruction Algorithm for Generating Accurate Relative Stopping Powers
Authors:
Paniz Karbasi,
Ritchie Cai,
Blake Schultze,
Hanh Nguyen,
Jones Reed,
Patrick Hall,
Valentina Giacometti,
Vladimir Bashkirov,
Robert Johnson,
Nick Karonis,
Jeffrey Olafsen,
Caesar Ordonez,
Keith E. Schubert,
Reinhard W. Schulte
Abstract:
Low-dose Proton Computed Tomography (pCT) is an evolving imaging modality that is used in proton therapy planning and addresses the range uncertainty problem. The goal of pCT is generating a 3D map of Relative Stopping Power (RSP) measurements with high accuracy within clinically required time frames. Generating accurate RSP values within the shortest amount of time is considered a key goal when developing pCT software. Existing pCT software has met and even exceeded this time goal, but requires clusters with hundreds of processors.
This paper describes a novel reconstruction technique using two Graphics Processing Unit (GPU) cores, such as are available on a single Nvidia P100. The proposed reconstruction technique is tested on both simulated and experimental datasets and on two different systems, namely Nvidia K40 and P100 GPUs from IBM and Cray. The experimental results demonstrate that our proposed reconstruction method meets both the timing and accuracy requirements, with the benefits of reasonable cost and efficient use of power.
Submitted 3 February, 2018;
originally announced February 2018.
-
Semantic Word Clouds with Background Corpus Normalization and t-distributed Stochastic Neighbor Embedding
Authors:
Erich Schubert,
Andreas Spitz,
Michael Weiler,
Johanna Geiß,
Michael Gertz
Abstract:
Many word clouds provide no semantics to the word placement, but use a random layout optimized solely for aesthetic purposes. We propose a novel approach to model word significance and word affinity within a document, and in comparison to a large background corpus. We demonstrate its usefulness for generating more meaningful word clouds as a visual summary of a given document. We then select keywords based on their significance and construct the word cloud based on the derived affinity. Based on a modified t-distributed stochastic neighbor embedding (t-SNE), we generate a semantic word placement. For words that cooccur significantly, we include edges, and cluster the words according to their cooccurrence. For this we designed a scalable and memory-efficient sketch-based approach usable on commodity hardware to aggregate the corpus statistics needed for normalization, and for identifying keywords as well as significant cooccurrences. We empirically validate our approach using a large Wikipedia corpus.
Submitted 11 August, 2017;
originally announced August 2017.
-
Performance of Hull-Detection Algorithms For Proton Computed Tomography Reconstruction
Authors:
Blake Schultze,
Micah Witt,
Yair Censor,
Reinhard Schulte,
Keith Evan Schubert
Abstract:
Proton computed tomography (pCT) is a novel imaging modality developed for patients receiving proton radiation therapy. The purpose of this work was to investigate hull-detection algorithms used for preconditioning of the large and sparse linear system of equations that needs to be solved for pCT image reconstruction. The hull-detection algorithms investigated here included silhouette/space carving (SC), modified silhouette/space carving (MSC), and space modeling (SM). Each was compared to the cone-beam version of filtered backprojection (FBP) used for hull-detection. Data for testing these algorithms included simulated data sets of a digital head phantom and an experimental data set of a pediatric head phantom obtained with a pCT scanner prototype at Loma Linda University Medical Center. SC was the fastest algorithm, exceeding the speed of FBP by more than 100 times. FBP was most sensitive to the presence of noise. Ongoing work will focus on optimizing threshold parameters in order to define a fast and efficient method for hull-detection in pCT image reconstruction.
Submitted 7 February, 2014;
originally announced February 2014.