-
ALCM: Autonomous LLM-Augmented Causal Discovery Framework
Authors:
Elahe Khatibi,
Mahyar Abbasian,
Zhongqi Yang,
Iman Azimi,
Amir M. Rahmani
Abstract:
To perform effective causal inference in high-dimensional datasets, initiating the process with causal discovery is imperative, wherein a causal graph is generated based on observational data. However, obtaining a complete and accurate causal graph poses a formidable challenge, recognized as an NP-hard problem. Recently, the advent of Large Language Models (LLMs) has ushered in a new era, indicati…
▽ More
To perform effective causal inference in high-dimensional datasets, initiating the process with causal discovery is imperative, wherein a causal graph is generated based on observational data. However, obtaining a complete and accurate causal graph poses a formidable challenge, recognized as an NP-hard problem. Recently, the advent of Large Language Models (LLMs) has ushered in a new era, indicating their emergent capabilities and widespread applicability in facilitating causal reasoning across diverse domains, such as medicine, finance, and science. The expansive knowledge base of LLMs holds the potential to elevate the field of causal reasoning by offering interpretability, making inferences, generalizability, and uncovering novel causal structures. In this paper, we introduce a new framework, named Autonomous LLM-Augmented Causal Discovery Framework (ALCM), to synergize data-driven causal discovery algorithms and LLMs, automating the generation of a more resilient, accurate, and explicable causal graph. The ALCM consists of three integral components: causal structure learning, causal wrapper, and LLM-driven causal refiner. These components autonomously collaborate within a dynamic environment to address causal discovery questions and deliver plausible causal graphs. We evaluate the ALCM framework by implementing two demonstrations on seven well-known datasets. Experimental results demonstrate that ALCM outperforms existing LLM methods and conventional data-driven causal reasoning mechanisms. This study not only shows the effectiveness of the ALCM but also underscores new research directions in leveraging the causal reasoning capabilities of LLMs.
△ Less
Submitted 2 May, 2024;
originally announced May 2024.
-
Impact of Physical Activity on Quality of Life During Pregnancy: A Causal ML Approach
Authors:
Kianoosh Kazemi,
Iina Ryhtä,
Iman Azimi,
Hannakaisa Niela-Vilen,
Anna Axelin,
Amir M. Rahmani,
Pasi Liljeberg
Abstract:
The concept of Quality of Life (QoL) refers to a holistic measurement of an individual's well-being, incorporating psychological and social aspects. Pregnant women, especially those with obesity and stress, often experience lower QoL. Physical activity (PA) has shown the potential to enhance the QoL. However, pregnant women who are overweight and obese rarely meet the recommended level of PA. Stud…
▽ More
The concept of Quality of Life (QoL) refers to a holistic measurement of an individual's well-being, incorporating psychological and social aspects. Pregnant women, especially those with obesity and stress, often experience lower QoL. Physical activity (PA) has shown the potential to enhance the QoL. However, pregnant women who are overweight and obese rarely meet the recommended level of PA. Studies have investigated the relationship between PA and QoL during pregnancy using correlation-based approaches. These methods aim to discover spurious correlations between variables rather than causal relationships. Besides, the existing methods mainly rely on physical activity parameters and neglect the use of different factors such as maternal (medical) history and context data, leading to biased estimates. Furthermore, the estimations lack an understanding of mediators and counterfactual scenarios that might affect them. In this paper, we investigate the causal relationship between being physically active (treatment variable) and the QoL (outcome) during pregnancy and postpartum. To estimate the causal effect, we develop a Causal Machine Learning method, integrating causal discovery and causal inference components. The data for our investigation is derived from a long-term wearable-based health monitoring study focusing on overweight and obese pregnant women. The machine learning (meta-learner) estimation technique is used to estimate the causal effect. Our result shows that performing adequate physical activity during pregnancy and postpartum improves the QoL by units of 7.3 and 3.4 on average in physical health and psychological domains, respectively. In the final step, four refutation analysis techniques are employed to validate our estimation.
△ Less
Submitted 25 February, 2024;
originally announced February 2024.
-
Robust Projection based Anomaly Extraction (RPE) in Univariate Time-Series
Authors:
Mostafa Rahmani,
Anoop Deoras,
Laurent Callot
Abstract:
This paper presents a novel, closed-form, and data/computation efficient online anomaly detection algorithm for time-series data. The proposed method, dubbed RPE, is a window-based method and in sharp contrast to the existing window-based methods, it is robust to the presence of anomalies in its window and it can distinguish the anomalies in time-stamp level. RPE leverages the linear structure of…
▽ More
This paper presents a novel, closed-form, and data/computation efficient online anomaly detection algorithm for time-series data. The proposed method, dubbed RPE, is a window-based method and in sharp contrast to the existing window-based methods, it is robust to the presence of anomalies in its window and it can distinguish the anomalies in time-stamp level. RPE leverages the linear structure of the trajectory matrix of the time-series and employs a robust projection step which makes the algorithm able to handle the presence of multiple arbitrarily large anomalies in its window. A closed-form/non-iterative algorithm for the robust projection step is provided and it is proved that it can identify the corrupted time-stamps. RPE is a great candidate for the applications where a large training data is not available which is the common scenario in the area of time-series. An extensive set of numerical experiments show that RPE can outperform the existing approaches with a notable margin.
△ Less
Submitted 31 May, 2022;
originally announced May 2022.
-
Provable Clustering of a Union of Linear Manifolds Using Optimal Directions
Authors:
Mostafa Rahmani
Abstract:
This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit)…
▽ More
This paper focuses on the Matrix Factorization based Clustering (MFC) method which is one of the few closed form algorithms for the subspace clustering problem. Despite being simple, closed-form, and computation-efficient, MFC can outperform the other sophisticated subspace clustering methods in many challenging scenarios. We reveal the connection between MFC and the Innovation Pursuit (iPursuit) algorithm which was shown to be able to outperform the other spectral clustering based methods with a notable margin especially when the span of clusters are close. A novel theoretical study is presented which sheds light on the key performance factors of both algorithms (MFC/iPursuit) and it is shown that both algorithms can be robust to notable intersections between the span of clusters. Importantly, in contrast to the theoretical guarantees of other algorithms which emphasized on the distance between the subspaces as the key performance factor and without making the innovation assumption, it is shown that the performance of MFC/iPursuit mainly depends on the distance between the innovative components of the clusters.
△ Less
Submitted 7 January, 2022;
originally announced January 2022.
-
Non-Local Feature Aggregation on Graphs via Latent Fixed Data Structures
Authors:
Mostafa Rahmani,
Rasoul Shafipour,
** Li
Abstract:
In contrast to image/text data whose order can be used to perform non-local feature aggregation in a straightforward way using the pooling layers, graphs lack the tensor representation and mostly the element-wise max/mean function is utilized to aggregate the locally extracted feature vectors. In this paper, we present a novel approach for global feature aggregation in Graph Neural Networks (GNNs)…
▽ More
In contrast to image/text data whose order can be used to perform non-local feature aggregation in a straightforward way using the pooling layers, graphs lack the tensor representation and mostly the element-wise max/mean function is utilized to aggregate the locally extracted feature vectors. In this paper, we present a novel approach for global feature aggregation in Graph Neural Networks (GNNs) which utilizes a Latent Fixed Data Structure (LFDS) to aggregate the extracted feature vectors. The locally extracted feature vectors are sorted/distributed on the LFDS and a latent neural network (CNN/GNN) is utilized to perform feature aggregation on the LFDS. The proposed approach is used to design several novel global feature aggregation methods based on the choice of the LFDS. We introduce multiple LFDSs including loop, 3D tensor (image), sequence, data driven graphs and an algorithm which sorts/distributes the extracted local feature vectors on the LFDS. While the computational complexity of the proposed methods are linear with the order of input graphs, they achieve competitive or better results.
△ Less
Submitted 16 August, 2021;
originally announced August 2021.
-
Closed-Form, Provable, and Robust PCA via Leverage Statistics and Innovation Search
Authors:
Mostafa Rahmani,
** Li
Abstract:
The idea of Innovation Search, which was initially proposed for data clustering, was recently used for outlier detection. In the application of Innovation Search for outlier detection, the directions of innovation were utilized to measure the innovation of the data points. We study the Innovation Values computed by the Innovation Search algorithm under a quadratic cost function and it is proved th…
▽ More
The idea of Innovation Search, which was initially proposed for data clustering, was recently used for outlier detection. In the application of Innovation Search for outlier detection, the directions of innovation were utilized to measure the innovation of the data points. We study the Innovation Values computed by the Innovation Search algorithm under a quadratic cost function and it is proved that Innovation Values with the new cost function are equivalent to Leverage Scores. This interesting connection is utilized to establish several theoretical guarantees for a Leverage Score based robust PCA method and to design a new robust PCA method. The theoretical results include performance guarantees with different models for the distribution of outliers and the distribution of inliers. In addition, we demonstrate the robustness of the algorithms against the presence of noise. The numerical and theoretical studies indicate that while the presented approach is fast and closed-form, it can outperform most of the existing algorithms.
△ Less
Submitted 23 June, 2021;
originally announced June 2021.
-
Ranking and Rejecting of Pre-Trained Deep Neural Networks in Transfer Learning based on Separation Index
Authors:
Mostafa Kalhor,
Ahmad Kalhor,
Mehdi Rahmani
Abstract:
Automated ranking of pre-trained Deep Neural Networks (DNNs) reduces the required time for selecting optimal pre-trained DNN and boost the classification performance in transfer learning. In this paper, we introduce a novel algorithm to rank pre-trained DNNs by applying a straightforward distance-based complexity measure named Separation Index (SI) to the target dataset. For this purpose, at first…
▽ More
Automated ranking of pre-trained Deep Neural Networks (DNNs) reduces the required time for selecting optimal pre-trained DNN and boost the classification performance in transfer learning. In this paper, we introduce a novel algorithm to rank pre-trained DNNs by applying a straightforward distance-based complexity measure named Separation Index (SI) to the target dataset. For this purpose, at first, a background about the SI is given and then the automated ranking algorithm is explained. In this algorithm, the SI is computed for the target dataset which passes from the feature extracting parts of pre-trained DNNs. Then, by descending sort of the computed SIs, the pre-trained DNNs are ranked, easily. In this ranking method, the best DNN makes maximum SI on the target dataset and a few pre-trained DNNs may be rejected in the case of their sufficiently low computed SIs. The efficiency of the proposed algorithm is evaluated by using three challenging datasets including Linnaeus 5, Breast Cancer Images, and COVID-CT. For the two first case studies, the results of the proposed algorithm exactly match with the ranking of the trained DNNs by the accuracy on the target dataset. For the third case study, despite using different preprocessing on the target data, the ranking of the algorithm has a high correlation with the ranking resulted from classification accuracy.
△ Less
Submitted 26 December, 2020;
originally announced December 2020.
-
The Causality Inference of Public Interest in Restaurants and Bars on COVID-19 Daily Cases in the US: A Google Trends Analysis
Authors:
Milad Asgari Mehrabadi,
Nikil Dutt,
Amir M. Rahmani
Abstract:
The COVID-19 coronavirus pandemic has affected virtually every region of the globe. At the time of conducting this study, the number of daily cases in the United States is more than any other country, and the trend is increasing in most of its states. Google trends provide public interest in various topics during different periods. Analyzing these trends using data mining methods might provide use…
▽ More
The COVID-19 coronavirus pandemic has affected virtually every region of the globe. At the time of conducting this study, the number of daily cases in the United States is more than any other country, and the trend is increasing in most of its states. Google trends provide public interest in various topics during different periods. Analyzing these trends using data mining methods might provide useful insights and observations regarding the COVID-19 outbreak. The objective of this study was to consider the predictive ability of different search terms (i.e., bars and restaurants) with regards to the increase of daily cases in the US. We considered the causation of two different search query trends, namely restaurant and bars, on daily positive cases in top-10 states/territories of the United States with the highest and lowest daily new positive cases. In addition, to measure the linear relation of different trends, we used Pearson correlation. Our results showed for states/territories with higher numbers of daily cases, the historical trends in search queries related to bars and restaurants, which mainly happened after re-opening, significantly affect the daily new cases, on average. California, for example, had most searches for restaurants on June 7th, 2020, which affected the number of new cases within two weeks after the peak with the P-value of .004 for Granger's causality test. Although a limited number of search queries were considered, Google search trends for restaurants and bars showed a significant effect on daily new cases for regions with higher numbers of daily new cases in the United States. We showed that such influential search trends could be used as additional information for prediction tasks in new cases of each region. This prediction can help healthcare leaders manage and control the impact of COVID-19 outbreaks on society and be prepared for the outcomes.
△ Less
Submitted 26 July, 2020;
originally announced July 2020.
-
Outlier Detection and Data Clustering via Innovation Search
Authors:
Mostafa Rahmani,
** Li
Abstract:
The idea of Innovation Search was proposed as a data clustering method in which the directions of innovation were utilized to compute the adjacency matrix and it was shown that Innovation Pursuit can notably outperform the self representation based subspace clustering methods. In this paper, we present a new discovery that the directions of innovation can be used to design a provable and strong ro…
▽ More
The idea of Innovation Search was proposed as a data clustering method in which the directions of innovation were utilized to compute the adjacency matrix and it was shown that Innovation Pursuit can notably outperform the self representation based subspace clustering methods. In this paper, we present a new discovery that the directions of innovation can be used to design a provable and strong robust (to outlier) PCA method. The proposed approach, dubbed iSearch, uses the direction search optimization problem to compute an optimal direction corresponding to each data point. iSearch utilizes the directions of innovation to measure the innovation of the data points and it identifies the outliers as the most innovative data points. Analytical performance guarantees are derived for the proposed robust PCA method under different models for the distribution of the outliers including randomly distributed outliers, clustered outliers, and linearly dependent outliers. In addition, we study the problem of outlier detection in a union of subspaces and it is shown that iSearch provably recovers the span of the inliers when the inliers lie in a union of subspaces. Moreover, we present theoretical studies which show that the proposed measure of innovation remains stable in the presence of noise and the performance of iSearch is robust to noisy data. In the challenging scenarios in which the outliers are close to each other or they are close to the span of the inliers, iSearch is shown to remarkably outperform most of the existing methods. The presented method shows that the directions of innovation are useful representation of the data which can be used to perform both data clustering and outlier detection.
△ Less
Submitted 30 December, 2019;
originally announced December 2019.
-
Graph Analysis and Graph Pooling in the Spatial Domain
Authors:
Mostafa Rahmani,
** Li
Abstract:
The spatial convolution layer which is widely used in the Graph Neural Networks (GNNs) aggregates the feature vector of each node with the feature vectors of its neighboring nodes. The GNN is not aware of the locations of the nodes in the global structure of the graph and when the local structures corresponding to different nodes are similar to each other, the convolution layer maps all those node…
▽ More
The spatial convolution layer which is widely used in the Graph Neural Networks (GNNs) aggregates the feature vector of each node with the feature vectors of its neighboring nodes. The GNN is not aware of the locations of the nodes in the global structure of the graph and when the local structures corresponding to different nodes are similar to each other, the convolution layer maps all those nodes to similar or same feature vectors in the continuous feature space. Therefore, the GNN cannot distinguish two graphs if their difference is not in their local structures. In addition, when the nodes are not labeled/attributed the convolution layers can fail to distinguish even different local structures. In this paper, we propose an effective solution to address this problem of the GNNs. The proposed approach leverages a spatial representation of the graph which makes the neural network aware of the differences between the nodes and also their locations in the graph. The spatial representation which is equivalent to a point-cloud representation of the graph is obtained by a graph embedding method. Using the proposed approach, the local feature extractor of the GNN distinguishes similar local structures in different locations of the graph and the GNN infers the topological structure of the graph from the spatial distribution of the locally extracted feature vectors. Moreover, the spatial representation is utilized to simplify the graph down-sampling problem. A new graph pooling method is proposed and it is shown that the proposed pooling method achieves competitive or better results in comparison with the state-of-the-art methods.
△ Less
Submitted 3 October, 2019;
originally announced October 2019.
-
Scalable and Robust Community Detection with Randomized Sketching
Authors:
Mostafa Rahmani,
Andre Beckus,
Adel Karimian,
George Atia
Abstract:
This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the…
▽ More
This article explores and analyzes the unsupervised clustering of large partially observed graphs. We propose a scalable and provable randomized framework for clustering graphs generated from the stochastic block model. The clustering is first applied to a sub-matrix of the graph's adjacency matrix associated with a reduced graph sketch constructed using random sampling. Then, the clusters of the full graph are inferred based on the clusters extracted from the sketch using a correlation-based retrieval step. Uniform random node sampling is shown to improve the computational complexity over clustering of the full graph when the cluster sizes are balanced. A new random degree-based node sampling algorithm is presented which significantly improves upon the performance of the clustering algorithm even when clusters are unbalanced. This framework improves the phase transitions for matrix-decomposition-based clustering with regard to computational complexity and minimum cluster size, which are shown to be nearly dimension-free in the low inter-cluster connectivity regime. A third sampling technique is shown to improve balance by randomly sampling nodes based on spatial distribution. We provide analysis and numerical results using a convex clustering algorithm based on matrix completion.
△ Less
Submitted 3 December, 2022; v1 submitted 25 May, 2018;
originally announced May 2018.
-
Data Dropout in Arbitrary Basis for Deep Network Regularization
Authors:
Mostafa Rahmani,
George Atia
Abstract:
An important problem in training deep networks with high capacity is to ensure that the trained network works well when presented with new inputs outside the training dataset. Dropout is an effective regularization technique to boost the network generalization in which a random subset of the elements of the given data and the extracted features are set to zero during the training process. In this…
▽ More
An important problem in training deep networks with high capacity is to ensure that the trained network works well when presented with new inputs outside the training dataset. Dropout is an effective regularization technique to boost the network generalization in which a random subset of the elements of the given data and the extracted features are set to zero during the training process. In this paper, a new randomized regularization technique in which we withhold a random part of the data without necessarily turning off the neurons/data-elements is proposed. In the proposed method, of which the conventional dropout is shown to be a special case, random data dropout is performed in an arbitrary basis, hence the designation Generalized Dropout. We also present a framework whereby the proposed technique can be applied efficiently to convolutional neural networks. The presented numerical experiments demonstrate that the proposed technique yields notable performance gain. Generalized Dropout provides new insight into the idea of dropout, shows that we can achieve different performance gains by using different bases matrices, and opens up a new research question as of how to choose optimal bases matrices that achieve maximal performance gain.
△ Less
Submitted 4 December, 2017; v1 submitted 3 December, 2017;
originally announced December 2017.
-
Subspace Clustering via Optimal Direction Search
Authors:
Mostafa Rahmani,
George Atia
Abstract:
This letter presents a new spectral-clustering-based approach to the subspace clustering problem. Underpinning the proposed method is a convex program for optimal direction search, which for each data point d finds an optimal direction in the span of the data that has minimum projection on the other data points and non-vanishing projection on d. The obtained directions are subsequently leveraged t…
▽ More
This letter presents a new spectral-clustering-based approach to the subspace clustering problem. Underpinning the proposed method is a convex program for optimal direction search, which for each data point d finds an optimal direction in the span of the data that has minimum projection on the other data points and non-vanishing projection on d. The obtained directions are subsequently leveraged to identify a neighborhood set for each data point. An alternating direction method of multipliers framework is provided to efficiently solve for the optimal directions. The proposed method is shown to notably outperform the existing subspace clustering methods, particularly for unwieldy scenarios involving high levels of noise and close subspaces, and yields the state-of-the-art results for the problem of face clustering using subspace segmentation.
△ Less
Submitted 26 November, 2017; v1 submitted 12 June, 2017;
originally announced June 2017.
-
Spatial Random Sampling: A Structure-Preserving Data Sketching Tool
Authors:
Mostafa Rahmani,
George Atia
Abstract:
Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, thi…
▽ More
Random column sampling is not guaranteed to yield data sketches that preserve the underlying structures of the data and may not sample sufficiently from less-populated data clusters. Also, adaptive sampling can often provide accurate low rank approximations, yet may fall short of producing descriptive data sketches, especially when the cluster centers are linearly dependent. Motivated by that, this paper introduces a novel randomized column sampling tool dubbed Spatial Random Sampling (SRS), in which data points are sampled based on their proximity to randomly sampled points on the unit sphere. The most compelling feature of SRS is that the corresponding probability of sampling from a given data cluster is proportional to the surface area the cluster occupies on the unit sphere, independently from the size of the cluster population. Although it is fully randomized, SRS is shown to provide descriptive and balanced data representations. The proposed idea addresses a pressing need in data science and holds potential to inspire many novel approaches for analysis of big data.
△ Less
Submitted 12 July, 2017; v1 submitted 9 May, 2017;
originally announced May 2017.
-
Low Rank Matrix Recovery with Simultaneous Presence of Outliers and Sparse Corruption
Authors:
Mostafa Rahmani,
George Atia
Abstract:
We study a data model in which the data matrix D can be expressed as D = L + S + C, where L is a low rank matrix, S an element-wise sparse matrix and C a matrix whose non-zero columns are outlying data points. To date, robust PCA algorithms have solely considered models with either S or C, but not both. As such, existing algorithms cannot account for simultaneous element-wise and column-wise corru…
▽ More
We study a data model in which the data matrix D can be expressed as D = L + S + C, where L is a low rank matrix, S an element-wise sparse matrix and C a matrix whose non-zero columns are outlying data points. To date, robust PCA algorithms have solely considered models with either S or C, but not both. As such, existing algorithms cannot account for simultaneous element-wise and column-wise corruptions. In this paper, a new robust PCA algorithm that is robust to simultaneous types of corruption is proposed. Our approach hinges on the sparse approximation of a sparsely corrupted column so that the sparse expansion of a column with respect to the other data points is used to distinguish a sparsely corrupted inlier column from an outlying data point. We also develop a randomized design which provides a scalable implementation of the proposed approach. The core idea of sparse approximation is analyzed analytically where we show that the underlying ell_1-norm minimization can obtain the representation of an inlier in presence of sparse corruptions.
△ Less
Submitted 6 February, 2017;
originally announced February 2017.
-
Robust and Scalable Column/Row Sampling from Corrupted Big Data
Authors:
Mostafa Rahmani,
George Atia
Abstract:
Conventional sampling techniques fall short of drawing descriptive sketches of the data when the data is grossly corrupted as such corruptions break the low rank structure required for them to perform satisfactorily. In this paper, we present new sampling algorithms which can locate the informative columns in presence of severe data corruptions. In addition, we develop new scalable randomized desi…
▽ More
Conventional sampling techniques fall short of drawing descriptive sketches of the data when the data is grossly corrupted as such corruptions break the low rank structure required for them to perform satisfactorily. In this paper, we present new sampling algorithms which can locate the informative columns in presence of severe data corruptions. In addition, we develop new scalable randomized designs of the proposed algorithms. The proposed approach is simultaneously robust to sparse corruption and outliers and substantially outperforms the state-of-the-art robust sampling algorithms as demonstrated by experiments conducted using both real and synthetic data.
△ Less
Submitted 18 November, 2016;
originally announced November 2016.
-
Coherence Pursuit: Fast, Simple, and Robust Principal Component Analysis
Authors:
Mostafa Rahmani,
George Atia
Abstract:
This paper presents a remarkably simple, yet powerful, algorithm termed Coherence Pursuit (CoP) to robust Principal Component Analysis (PCA). As inliers lie in a low dimensional subspace and are mostly correlated, an inlier is likely to have strong mutual coherence with a large number of data points. By contrast, outliers either do not admit low dimensional structures or form small clusters. In ei…
▽ More
This paper presents a remarkably simple, yet powerful, algorithm termed Coherence Pursuit (CoP) to robust Principal Component Analysis (PCA). As inliers lie in a low dimensional subspace and are mostly correlated, an inlier is likely to have strong mutual coherence with a large number of data points. By contrast, outliers either do not admit low dimensional structures or form small clusters. In either case, an outlier is unlikely to bear strong resemblance to a large number of data points. Given that, CoP sets an outlier apart from an inlier by comparing their coherence with the rest of the data points. The mutual coherences are computed by forming the Gram matrix of the normalized data points. Subsequently, the sought subspace is recovered from the span of the subset of the data points that exhibit strong coherence with the rest of the data. As CoP only involves one simple matrix multiplication, it is significantly faster than the state-of-the-art robust PCA algorithms. We derive analytical performance guarantees for CoP under different models for the distributions of inliers and outliers in both noise-free and noisy settings. CoP is the first robust PCA algorithm that is simultaneously non-iterative, provably robust to both unstructured and structured outliers, and can tolerate a large number of unstructured outliers.
△ Less
Submitted 12 July, 2017; v1 submitted 15 September, 2016;
originally announced September 2016.
-
Innovation Pursuit: A New Approach to Subspace Clustering
Authors:
Mostafa Rahmani,
George Atia
Abstract:
In subspace clustering, a group of data points belonging to a union of subspaces are assigned membership to their respective subspaces. This paper presents a new approach dubbed Innovation Pursuit (iPursuit) to the problem of subspace clustering using a new geometrical idea whereby subspaces are identified based on their relative novelties. We present two frameworks in which the idea of innovation…
▽ More
In subspace clustering, a group of data points belonging to a union of subspaces are assigned membership to their respective subspaces. This paper presents a new approach dubbed Innovation Pursuit (iPursuit) to the problem of subspace clustering using a new geometrical idea whereby subspaces are identified based on their relative novelties. We present two frameworks in which the idea of innovation pursuit is used to distinguish the subspaces. Underlying the first framework is an iterative method that finds the subspaces consecutively by solving a series of simple linear optimization problems, each searching for a direction of innovation in the span of the data potentially orthogonal to all subspaces except for the one to be identified in one step of the algorithm. A detailed mathematical analysis is provided establishing sufficient conditions for iPursuit to correctly cluster the data. The proposed approach can provably yield exact clustering even when the subspaces have significant intersections. It is shown that the complexity of the iterative approach scales only linearly in the number of data points and subspaces, and quadratically in the dimension of the subspaces. The second framework integrates iPursuit with spectral clustering to yield a new variant of spectral-clustering-based algorithms. The numerical simulations with both real and synthetic data demonstrate that iPursuit can often outperform the state-of-the-art subspace clustering algorithms, more so for subspaces with significant intersections, and that it significantly improves the state-of-the-art result for subspace-segmentation-based face clustering.
△ Less
Submitted 26 November, 2017; v1 submitted 2 December, 2015;
originally announced December 2015.
-
Randomized Robust Subspace Recovery for High Dimensional Data Matrices
Authors:
Mostafa Rahmani,
George Atia
Abstract:
This paper explores and analyzes two randomized designs for robust Principal Component Analysis (PCA) employing low-dimensional data sketching. In one design, a data sketch is constructed using random column sampling followed by low dimensional embedding, while in the other, sketching is based on random column and row sampling. Both designs are shown to bring about substantial savings in complexit…
▽ More
This paper explores and analyzes two randomized designs for robust Principal Component Analysis (PCA) employing low-dimensional data sketching. In one design, a data sketch is constructed using random column sampling followed by low dimensional embedding, while in the other, sketching is based on random column and row sampling. Both designs are shown to bring about substantial savings in complexity and memory requirements for robust subspace learning over conventional approaches that use the full scale data. A characterization of the sample and computational complexity of both designs is derived in the context of two distinct outlier models, namely, sparse and independent outlier models. The proposed randomized approach can provably recover the correct subspace with computational and sample complexity that are almost independent of the size of the data. The results of the mathematical analysis are confirmed through numerical simulations using both synthetic and real data.
△ Less
Submitted 8 April, 2016; v1 submitted 21 May, 2015;
originally announced May 2015.
-
High Dimensional Low Rank plus Sparse Matrix Decomposition
Authors:
Mostafa Rahmani,
George Atia
Abstract:
This paper is concerned with the problem of low rank plus sparse matrix decomposition for big data. Conventional algorithms for matrix decomposition use the entire data to extract the low-rank and sparse components, and are based on optimization problems with complexity that scales with the dimension of the data, which limits their scalability. Furthermore, existing randomized approaches mostly re…
▽ More
This paper is concerned with the problem of low rank plus sparse matrix decomposition for big data. Conventional algorithms for matrix decomposition use the entire data to extract the low-rank and sparse components, and are based on optimization problems with complexity that scales with the dimension of the data, which limits their scalability. Furthermore, existing randomized approaches mostly rely on uniform random sampling, which is quite inefficient for many real world data matrices that exhibit additional structures (e.g. clustering). In this paper, a scalable subspace-pursuit approach that transforms the decomposition problem to a subspace learning problem is proposed. The decomposition is carried out using a small data sketch formed from sampled columns/rows. Even when the data is sampled uniformly at random, it is shown that the sufficient number of sampled columns/rows is roughly O(rμ), where μis the coherency parameter and r the rank of the low rank component. In addition, adaptive sampling algorithms are proposed to address the problem of column/row sampling from structured data. We provide an analysis of the proposed method with adaptive sampling and show that adaptive sampling makes the required number of sampled columns/rows invariant to the distribution of the data. The proposed approach is amenable to online implementation and an online scheme is proposed.
△ Less
Submitted 16 March, 2017; v1 submitted 31 January, 2015;
originally announced February 2015.
-
A Subspace Method for Array Covariance Matrix Estimation
Authors:
Mostafa Rahmani,
George Atia
Abstract:
This paper introduces a subspace method for the estimation of an array covariance matrix. It is shown that when the received signals are uncorrelated, the true array covariance matrices lie in a specific subspace whose dimension is typically much smaller than the dimension of the full space. Based on this idea, a subspace based covariance matrix estimator is proposed. The estimator is obtained as…
▽ More
This paper introduces a subspace method for the estimation of an array covariance matrix. It is shown that when the received signals are uncorrelated, the true array covariance matrices lie in a specific subspace whose dimension is typically much smaller than the dimension of the full space. Based on this idea, a subspace based covariance matrix estimator is proposed. The estimator is obtained as a solution to a semi-definite convex optimization problem. While the optimization problem has no closed-form solution, a nearly optimal closed-form solution is proposed making it easy to implement. In comparison to the conventional approaches, the proposed method yields higher estimation accuracy because it eliminates the estimation error which does not lie in the subspace of the true covariance matrices. The numerical examples indicate that the proposed covariance matrix estimator can significantly improve the estimation quality of the covariance matrix.
△ Less
Submitted 19 October, 2014;
originally announced November 2014.