-
Benign overfitting and adaptive nonparametric regression
Authors:
Julien Chhor,
Suzanne Sigalla,
Alexandre B. Tsybakov
Abstract:
In the nonparametric regression setting, we construct an estimator which is a continuous function interpolating the data points with high probability, while attaining minimax optimal rates under mean squared risk on the scale of Hölder classes adaptively to the unknown smoothness.
Submitted 27 June, 2022;
originally announced June 2022.
-
Assigning Topics to Documents by Successive Projections
Authors:
Olga Klopp,
Maxim Panov,
Suzanne Sigalla,
Alexandre Tsybakov
Abstract:
Topic models provide a useful tool to organize and understand the structure of large corpora of text documents, in particular, to discover hidden thematic structure. Clustering documents from big unstructured corpora into topics is an important task in various areas, such as image analysis, e-commerce, social networks, population genetics. A common approach to topic modeling is to associate each topic with a probability distribution on the dictionary of words and to consider each document as a mixture of topics. Since the number of topics is typically substantially smaller than the size of the corpus and of the dictionary, the methods of topic modeling can lead to a dramatic dimension reduction. In this paper, we study the problem of estimating topics distribution for each document in the given corpus, that is, we focus on the clustering aspect of the problem. We introduce an algorithm that we call Successive Projection Overlapping Clustering (SPOC) inspired by the Successive Projection Algorithm for separable matrix factorization. This algorithm is simple to implement and computationally fast. We establish theoretical guarantees on the performance of the SPOC algorithm, in particular, near matching minimax upper and lower bounds on its estimation risk. We also propose a new method that estimates the number of topics. We complement our theoretical results with a numerical study on synthetic and semi-synthetic data to analyze the performance of this new algorithm in practice. One of the conclusions is that the error of the algorithm grows at most logarithmically with the size of the dictionary, in contrast to what one observes for Latent Dirichlet Allocation.
Submitted 8 July, 2021;
originally announced July 2021.
-
Improved clustering algorithms for the Bipartite Stochastic Block Model
Authors:
Mohamed Ndaoud,
Suzanne Sigalla,
Alexandre B. Tsybakov
Abstract:
We establish sufficient conditions of exact and almost full recovery of the node partition in the Bipartite Stochastic Block Model (BSBM) using polynomial-time algorithms. First, we improve upon the known conditions of almost full recovery by spectral clustering algorithms in BSBM. Next, we propose a new computationally simple and fast procedure achieving exact recovery under milder conditions than the state of the art. Namely, if the vertex sets $V_1$ and $V_2$ in BSBM have sizes $n_1$ and $n_2$, we show that the condition $p = \Omega\left(\max\left(\sqrt{\frac{\log{n_1}}{n_1n_2}},\frac{\log{n_1}}{n_2}\right)\right)$ on the edge intensity $p$ is sufficient for exact recovery within $V_1$. This condition exhibits an elbow at $n_{2} \asymp n_1\log{n_1}$ between the low-dimensional and high-dimensional regimes. The suggested procedure is a variant of Lloyd's iterations initialized with a well-chosen spectral estimator leading to what we expect to be the optimal condition for exact recovery in BSBM. The optimality conjecture is supported by showing that, for a supervised oracle procedure, such a condition is necessary to achieve exact recovery. The key elements of the proof techniques are different from classical community detection tools on random graphs. Numerical studies confirm our theory, and show that the suggested algorithm is both very fast and achieves almost the same performance as the supervised oracle. Finally, using the connection between planted satisfiability problems and the BSBM, we improve upon the sufficient number of clauses to completely recover the planted assignment.
Submitted 22 April, 2021; v1 submitted 18 November, 2019;
originally announced November 2019.