-
pylspack: Parallel algorithms and data structures for sketching, column subset selection, regression and leverage scores
Authors:
Aleksandros Sobczyk,
Efstratios Gallopoulos
Abstract:
We present parallel algorithms and data structures for three fundamental operations in Numerical Linear Algebra: (i) Gaussian and CountSketch random projections and their combination, (ii) computation of the Gram matrix and (iii) computation of the squared row norms of the product of two matrices, with a special focus on "tall-and-skinny" matrices, which arise in many applications. We provide a detailed analysis of the ubiquitous CountSketch transform and its combination with Gaussian random projections, accounting for memory requirements, computational complexity and workload balancing. We also demonstrate how these results can be applied to column subset selection, least squares regression and leverage scores computation. These tools have been implemented in pylspack, a publicly available Python package (https://github.com/IBM/pylspack) whose core is written in C++ and parallelized with OpenMP, and which is compatible with standard matrix data structures of SciPy and NumPy. Extensive numerical experiments indicate that the proposed algorithms scale well and significantly outperform existing libraries for tall-and-skinny matrices.
Submitted 4 August, 2022; v1 submitted 5 March, 2022;
originally announced March 2022.
-
Estimating leverage scores via rank revealing methods and randomization
Authors:
Aleksandros Sobczyk,
Efstratios Gallopoulos
Abstract:
We study algorithms for estimating the statistical leverage scores of rectangular dense or sparse matrices of arbitrary rank. Our approach is based on combining rank revealing methods with compositions of dense and sparse randomized dimensionality reduction transforms. We first develop a set of fast novel algorithms for rank estimation, column subset selection and least squares preconditioning. We then describe the design and implementation of leverage score estimators based on these primitives. These estimators are also effective for rank deficient input, which is frequently the case in data analytics applications. We provide detailed complexity analyses for all algorithms as well as meaningful approximation bounds and comparisons with the state-of-the-art. We conduct extensive numerical experiments to evaluate our algorithms and to illustrate their properties and performance using synthetic and real world data sets.
Submitted 23 May, 2021;
originally announced May 2021.
-
EigenRec: Generalizing PureSVD for Effective and Efficient Top-N Recommendations
Authors:
Athanasios N. Nikolakopoulos,
Vassilis Kalantzis,
Efstratios Gallopoulos,
John D. Garofalakis
Abstract:
We introduce EigenRec, a versatile and efficient Latent-Factor framework for Top-N Recommendations that includes the well-known PureSVD algorithm as a special case. EigenRec builds a low-dimensional model of an inter-item proximity matrix that combines a similarity component with a scaling operator designed to control the influence of prior item popularity on the final model. Viewing PureSVD within our framework provides intuition about its inner workings, exposes its inherent limitations, and paves the way towards improving its recommendation performance. A comprehensive set of experiments on the MovieLens and Yahoo datasets, based on widely applied performance metrics, indicates that EigenRec outperforms several state-of-the-art algorithms in terms of Standard and Long-Tail recommendation accuracy, exhibiting low susceptibility to sparsity, even in its most extreme manifestations, the Cold-Start problems. At the same time, EigenRec has an attractive computational profile and can be applied readily in large-scale recommendation settings.
Submitted 5 December, 2017; v1 submitted 18 November, 2015;
originally announced November 2015.
-
Asynchronous iterative computations with Web information retrieval structures: The PageRank case
Authors:
Giorgos Kollias,
Efstratios Gallopoulos,
Daniel B. Szyld
Abstract:
There are several ideas being used today for Web information retrieval, and specifically in Web search engines. The PageRank algorithm is one of those that introduce a content-neutral ranking function over Web pages. This ranking is applied to the set of pages returned by the Google search engine in response to a search query. PageRank is based in part on two simple common-sense concepts: (i) A page is important if many important pages include links to it. (ii) A page containing many links has reduced impact on the importance of the pages it links to. In this paper we focus on asynchronous iterative schemes to compute PageRank over large sets of Web pages. The elimination of the synchronizing phases is expected to be advantageous on heterogeneous platforms. The motivation for a possible move to such large-scale distributed platforms lies in the size of the matrices representing Web structure. In orders of magnitude: $10^{10}$ pages with $10^{11}$ nonzero elements and $10^{12}$ bytes just to store a small percentage of the Web (the part already crawled); distributed memory machines are necessary for such computations. The present research is part of our general objective to explore the potential of asynchronous computational models as an underlying framework for very large scale computations over the Grid. The area of "internet algorithmics" appears to offer many occasions for computations of unprecedented dimensionality that would be good candidates for this framework.
Submitted 11 June, 2006;
originally announced June 2006.
-
Exploring term-document matrices from matrix models in text mining
Authors:
Ioannis Antonellis,
Efstratios Gallopoulos
Abstract:
We explore a matrix-space model that is a natural extension of the vector space model for Information Retrieval. Each document can be represented by a matrix that is based on document extracts (e.g. sentences, paragraphs, sections). We focus on the performance of this model for the specific case in which documents are originally represented as term-by-sentence matrices. We use the singular value decomposition to approximate the term-by-sentence matrices and assemble these results to form the pseudo-"term-document" matrix that forms the basis of a text mining method that is an alternative to traditional VSM and LSI. We investigate the singular values of this matrix and provide experimental evidence suggesting that the method can be particularly effective in terms of accuracy for text collections with multi-topic documents, such as web pages with news.
Submitted 21 February, 2006;
originally announced February 2006.