Showing 1–24 of 24 results for author: Schubert, E

Searching in archive cs.
  1. arXiv:2405.18401  [pdf, other]

    cs.LG cs.CG stat.ML

    Explicit Formulae to Interchangeably use Hyperplanes and Hyperballs using Inversive Geometry

    Authors: Erik Thordsen, Erich Schubert

    Abstract: Many algorithms require discriminative boundaries, such as separating hyperplanes or hyperballs, or are specifically designed to work on spherical data. By applying inversive geometry, we show that the two discriminative boundaries can be used interchangeably, and that general Euclidean data can be transformed into spherical data, whenever a change in point distances is acceptable. We provide expl…

    Submitted 28 May, 2024; originally announced May 2024.

    Comments: to be submitted to TMLR (submission pending)

  2. Medoid Silhouette clustering with automatic cluster number selection

    Authors: Lars Lenssen, Erich Schubert

    Abstract: The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its…

    Submitted 7 September, 2023; originally announced September 2023.

    Comments: arXiv admin note: substantial text overlap with arXiv:2209.12553

  3. Sparse Partitioning Around Medoids

    Authors: Lars Lenssen, Erich Schubert

    Abstract: Partitioning Around Medoids (PAM, k-Medoids) is a popular clustering technique to use with arbitrary distance functions or similarities, where each cluster is represented by its most central object, called the medoid or the discrete median. In operations research, this family of problems is also known as facility location problem (FLP). FastPAM recently introduced a speedup for large k to make it…

    Submitted 5 September, 2023; originally announced September 2023.

  4. arXiv:2309.02552  [pdf, other]

    stat.ML cs.DB cs.LG

    Data Aggregation for Hierarchical Clustering

    Authors: Erich Schubert, Andreas Lang

    Abstract: Hierarchical Agglomerative Clustering (HAC) is likely the earliest and most flexible clustering method, because it can be used with many distances, similarities, and various linkage strategies. It is often used when the number of clusters the data set forms is unknown and some sort of hierarchy in the data is plausible. Most algorithms for HAC operate on a full distance matrix, and therefore requi…

    Submitted 5 September, 2023; originally announced September 2023.

  5. arXiv:2212.13626  [pdf, other]

    cs.LG stat.ML

    LOSDD: Leave-Out Support Vector Data Description for Outlier Detection

    Authors: Daniel Boiar, Thomas Liebig, Erich Schubert

    Abstract: Support Vector Machines have been successfully used for one-class classification (OCSVM, SVDD) when trained on clean data, but they work much worse on dirty data: outliers present in the training data tend to become support vectors, and are hence considered "normal". In this article, we improve the effectiveness to detect outliers in dirty training data with a leave-out strategy: by temporarily om…

    Submitted 27 December, 2022; originally announced December 2022.

  6. Stop using the elbow criterion for k-means and how to choose the number of clusters instead

    Authors: Erich Schubert

    Abstract: A major challenge when using k-means clustering often is how to choose the parameter k, the number of clusters. In this letter, we want to point out that it is very easy to draw poor conclusions from a common heuristic, the "elbow method". Better alternatives have been known in literature for a long time, and we want to draw attention to some of these easy to use options, that often perform better…

    Submitted 23 December, 2022; originally announced December 2022.

  7. arXiv:2211.00596  [pdf, other]

    cs.DC cs.DM

    Algebra of N-event synchronization

    Authors: Ernesto Gomez, Keith E. Schubert, Khalil Dajani

    Abstract: We have previously defined synchronization (Gomez, E. and K. Schubert 2011) as a relation between the times at which a pair of events can happen, and introduced an algebra that covers all possible relations for such pairs. In this work we introduce the synchronization matrix, to make it easier to calculate the properties and results of $N$ event synchronizations, such as are commonly encountered i…

    Submitted 1 November, 2022; originally announced November 2022.

    Comments: 9 pages, 2 figures

    ACM Class: B.4.3; D.3.1; D.3.2; D.3.3

  8. Clustering by Direct Optimization of the Medoid Silhouette

    Authors: Lars Lenssen, Erich Schubert

    Abstract: The evaluation of clustering results is difficult, highly dependent on the evaluated data set and the perspective of the beholder. There are many different clustering quality measures, which try to provide a general measure to validate clustering results. A very popular measure is the Silhouette. We discuss the efficient medoid-based variant of the Silhouette, perform a theoretical analysis of its…

    Submitted 26 September, 2022; originally announced September 2022.

  9. On Projections to Linear Subspaces

    Authors: Erik Thordsen, Erich Schubert

    Abstract: The merit of projecting data onto linear subspaces is well known from, e.g., dimension reduction. One key aspect of subspace projections, the maximum preservation of variance (principal component analysis), has been thoroughly researched and the effect of random linear projections on measures such as intrinsic dimensionality still is an ongoing effort. In this paper, we investigate the less explor…

    Submitted 26 September, 2022; originally announced September 2022.

  10. EmbAssi: Embedding Assignment Costs for Similarity Search in Large Graph Databases

    Authors: Franka Bause, Erich Schubert, Nils M. Kriege

    Abstract: The graph edit distance is an intuitive measure to quantify the dissimilarity of graphs, but its computation is NP-hard and challenging in practice. We introduce methods for answering nearest neighbor and range queries regarding this distance efficiently for large databases with up to millions of graphs. We build on the filter-verification paradigm, where lower and upper bounds are used to reduce…

    Submitted 19 July, 2022; v1 submitted 15 November, 2021; originally announced November 2021.

    Comments: Data Min Knowl Disc (2022)

  11. arXiv:2110.01283  [pdf, other]

    cs.DB

    Metric Indexing for Graph Similarity Search

    Authors: Franka Bause, David B. Blumenthal, Erich Schubert, Nils M. Kriege

    Abstract: Finding the graphs that are most similar to a query graph in a large database is a common task with various applications. A widely-used similarity measure is the graph edit distance, which provides an intuitive notion of similarity and naturally supports graphs with vertex and edge attributes. Since its computation is NP-hard, techniques for accelerating similarity search have been studied extensi…

    Submitted 4 October, 2021; originally announced October 2021.

    Comments: SISAP 2021

  12. arXiv:2108.09459  [pdf, other]

    physics.med-ph cs.CE math.OC

    Developments in Mathematical Algorithms and Computational Tools for Proton CT and Particle Therapy Treatment Planning

    Authors: Yair Censor, Keith E. Schubert, Reinhard W. Schulte

    Abstract: We summarize recent results and ongoing activities in mathematical algorithms and computer science methods related to proton computed tomography (pCT) and intensity-modulated particle therapy (IMPT) treatment planning. Proton therapy necessitates a high level of delivery accuracy to exploit the selective targeting imparted by the Bragg peak. For this purpose, pCT utilizes the proton beam itself to…

    Submitted 21 August, 2021; originally announced August 2021.

    Comments: 13 pages. Accepted for publication in IEEE Transactions on Radiation and Plasma Medical Sciences. August 16, 2021

  13. MESS: Manifold Embedding Motivated Super Sampling

    Authors: Erik Thordsen, Erich Schubert

    Abstract: Many approaches in the field of machine learning and data analysis rely on the assumption that the observed data lies on lower-dimensional manifolds. This assumption has been verified empirically for many real data sets. To make use of this manifold assumption one generally requires the manifold to be locally sampled to a certain density such that features of the manifold can be observed. However,…

    Submitted 14 July, 2021; originally announced July 2021.

  14. Accelerating Spherical k-Means

    Authors: Erich Schubert, Andreas Lang, Gloria Feher

    Abstract: Spherical k-means is a widely used clustering algorithm for sparse and high-dimensional data such as document vectors. While several improvements and accelerations have been introduced for the original k-means algorithm, not all easily translate to the spherical variant: Many acceleration techniques, such as the algorithms of Elkan and Hamerly, rely on the triangle inequality of Euclidean distance…

    Submitted 8 July, 2021; originally announced July 2021.

  15. A Triangle Inequality for Cosine Similarity

    Authors: Erich Schubert

    Abstract: Similarity search is a fundamental problem for many data analysis techniques. Many efficient search techniques rely on the triangle inequality of metrics, which allows pruning parts of the search space based on transitive bounds on distances. Recently, Cosine similarity has become a popular alternative choice to the standard Euclidean metric, in particular in the context of textual data and neural…

    Submitted 8 July, 2021; originally announced July 2021.

  16. arXiv:2008.05171  [pdf, other]

    cs.LG cs.AI stat.ML

    Fast and Eager k-Medoids Clustering: O(k) Runtime Improvement of the PAM, CLARA, and CLARANS Algorithms

    Authors: Erich Schubert, Peter J. Rousseeuw

    Abstract: Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids clustering. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not exist for arbitrary dissimilarities. PAM uses the medoid instead, t…

    Submitted 1 June, 2021; v1 submitted 12 August, 2020; originally announced August 2020.

    Journal ref: Information Systems 2021, 101804

  17. BETULA: Numerically Stable CF-Trees for BIRCH Clustering

    Authors: Andreas Lang, Erich Schubert

    Abstract: BIRCH clustering is a widely known approach for clustering, that has influenced much subsequent research and commercial products. The key contribution of BIRCH is the Clustering Feature tree (CF-Tree), which is a compressed representation of the input data. As new data arrives, the tree is eventually rebuilt to increase the compression. Afterward, the leaves of the tree are used for clustering. Be…

    Submitted 23 June, 2020; originally announced June 2020.

  18. ABID: Angle Based Intrinsic Dimensionality

    Authors: Erik Thordsen, Erich Schubert

    Abstract: The intrinsic dimensionality refers to the "true" dimensionality of the data, as opposed to the dimensionality of the data representation. For example, when attributes are highly correlated, the intrinsic dimensionality can be much lower than the number of variables. Local intrinsic dimensionality refers to the observation that this property can vary for different parts of the data set; and intr…

    Submitted 23 June, 2020; originally announced June 2020.

  19. arXiv:1902.03616  [pdf, other]

    cs.LG stat.ML

    ELKI: A large open-source library for data analysis - ELKI Release 0.7.5 "Heidelberg"

    Authors: Erich Schubert, Arthur Zimek

    Abstract: This paper documents the release of the ELKI data mining framework, version 0.7.5. ELKI is an open source (AGPLv3) data mining software written in Java. The focus of ELKI is research in algorithms, with an emphasis on unsupervised methods in cluster analysis and outlier detection. In order to achieve high performance and scalability, ELKI offers data index structures such as the R*-tree that can…

    Submitted 10 February, 2019; originally announced February 2019.

  20. Faster k-Medoids Clustering: Improving the PAM, CLARA, and CLARANS Algorithms

    Authors: Erich Schubert, Peter J. Rousseeuw

    Abstract: Clustering non-Euclidean data is difficult, and one of the most used algorithms besides hierarchical clustering is the popular algorithm Partitioning Around Medoids (PAM), also simply referred to as k-medoids. In Euclidean geometry the mean, as used in k-means, is a good estimator for the cluster center, but this does not hold for arbitrary dissimilarities. PAM uses the medoid instead, the object wi…

    Submitted 29 October, 2019; v1 submitted 12 October, 2018; originally announced October 2018.

    Journal ref: Similarity Search and Applications, SISAP 2019

  21. arXiv:1803.01112  [pdf, other]

    physics.med-ph cs.CY math.OC

    An Improved Method of Total Variation Superiorization Applied to Reconstruction in Proton Computed Tomography

    Authors: Blake Schultze, Yair Censor, Paniz Karbasi, Keith E. Schubert, Reinhard W. Schulte

    Abstract: Previous work showed that total variation superiorization (TVS) improves reconstructed image quality in proton computed tomography (pCT). The structure of the TVS algorithm has evolved since then and this work investigated if this new algorithmic structure provides additional benefits to pCT image quality. Structural and parametric changes introduced to the original TVS algorithm included: (1) inc…

    Submitted 17 January, 2019; v1 submitted 3 March, 2018; originally announced March 2018.

  22. arXiv:1802.01070  [pdf, other]

    physics.med-ph cs.DC

    A Highly Accelerated Parallel Multi-GPU based Reconstruction Algorithm for Generating Accurate Relative Stopping Powers

    Authors: Paniz Karbasi, Ritchie Cai, Blake Schultze, Hanh Nguyen, Jones Reed, Patrick Hall, Valentina Giacometti, Vladimir Bashkirov, Robert Johnson, Nick Karonis, Jeffrey Olafsen, Caesar Ordonez, Keith E. Schubert, Reinhard W. Schulte

    Abstract: Low-dose Proton Computed Tomography (pCT) is an evolving imaging modality that is used in proton therapy planning which addresses the range uncertainty problem. The goal of pCT is generating a 3D map of Relative Stopping Power (RSP) measurements with high accuracy within clinically required time frames. Generating accurate RSP values within the shortest amount of time is considered a key goal when…

    Submitted 3 February, 2018; originally announced February 2018.

    Comments: IEEE NSS/MIC 2017

  23. arXiv:1708.03569  [pdf, other]

    cs.IR cs.CL

    Semantic Word Clouds with Background Corpus Normalization and t-distributed Stochastic Neighbor Embedding

    Authors: Erich Schubert, Andreas Spitz, Michael Weiler, Johanna Geiß, Michael Gertz

    Abstract: Many word clouds provide no semantics to the word placement, but use a random layout optimized solely for aesthetic purposes. We propose a novel approach to model word significance and word affinity within a document, and in comparison to a large background corpus. We demonstrate its usefulness for generating more meaningful word clouds as a visual summary of a given document. We then select keywo…

    Submitted 11 August, 2017; originally announced August 2017.

  24. arXiv:1402.1720  [pdf, other]

    cs.CV physics.med-ph

    Performance of Hull-Detection Algorithms For Proton Computed Tomography Reconstruction

    Authors: Blake Schultze, Micah Witt, Yair Censor, Reinhard Schulte, Keith Evan Schubert

    Abstract: Proton computed tomography (pCT) is a novel imaging modality developed for patients receiving proton radiation therapy. The purpose of this work was to investigate hull-detection algorithms used for preconditioning of the large and sparse linear system of equations that needs to be solved for pCT image reconstruction. The hull-detection algorithms investigated here included silhouette/space carvin…

    Submitted 7 February, 2014; originally announced February 2014.

    Comments: Contemporary Mathematics, accepted for publication