Showing 1–11 of 11 results for author: Mentch, L

  1. arXiv:2103.16700  [pdf, other]

    stat.ML cs.LG

    Trees, Forests, Chickens, and Eggs: When and Why to Prune Trees in a Random Forest

    Authors: Siyu Zhou, Lucas Mentch

    Abstract: Due to their long-standing reputation as excellent off-the-shelf predictors, random forests continue to remain a go-to model of choice for applied statisticians and data scientists. Despite their widespread use, however, until recently, little was known about their inner workings and about which aspects of the procedure were driving their success. Very recently, two competing hypotheses have emerged…

    Submitted 30 March, 2021; originally announced March 2021.

  2. arXiv:2102.12328  [pdf, ps, other]

    stat.OT cs.LG

    Bridging Breiman's Brook: From Algorithmic Modeling to Statistical Learning

    Authors: Lucas Mentch, Giles Hooker

    Abstract: In 2001, Leo Breiman wrote of a divide between "data modeling" and "algorithmic modeling" cultures. Twenty years later this division feels far more ephemeral, both in terms of assigning individuals to camps, and in terms of intellectual boundaries. We argue that this is largely due to the "data modelers" incorporating algorithmic methods into their toolbox, particularly driven by recent developmen…

    Submitted 22 February, 2021; originally announced February 2021.

    Comments: In response to the Journal of Observational Studies reprinting Leo Breiman's paper "Statistical Modeling: The Two Cultures" on its 20th anniversary

  3. arXiv:2004.14500  [pdf, other]

    cs.CL cs.LG stat.ML

    Posterior Calibrated Training on Sentence Classification Tasks

    Authors: Taehee Jung, Dongyeop Kang, Hua Cheng, Lucas Mentch, Thomas Schaaf

    Abstract: Most classification models work by first predicting a posterior probability distribution over all classes and then selecting that class with the largest estimated probability. In many settings, however, the quality of the posterior probability itself (e.g., 65% chance of having diabetes) gives more reliable information than the final predicted class alone. When these methods are shown to be poorly calibr…

    Submitted 1 May, 2020; v1 submitted 29 April, 2020; originally announced April 2020.

    Comments: Accepted at ACL 2020

  4. arXiv:2003.03629  [pdf, other]

    stat.ML cs.LG

    Getting Better from Worse: Augmented Bagging and a Cautionary Tale of Variable Importance

    Authors: Lucas Mentch, Siyu Zhou

    Abstract: As the size, complexity, and availability of data continue to grow, scientists are increasingly relying upon black-box learning algorithms that can often provide accurate predictions with minimal a priori model specifications. Tools like random forests have an established track record of off-the-shelf success and even offer various strategies for analyzing the underlying relationships among varia…

    Submitted 9 November, 2020; v1 submitted 7 March, 2020; originally announced March 2020.

  5. arXiv:1912.01089  [pdf, other]

    stat.ML cs.LG stat.CO stat.ME

    $V$-statistics and Variance Estimation

    Authors: Zhengze Zhou, Lucas Mentch, Giles Hooker

    Abstract: This paper develops a general framework for analyzing asymptotics of $V$-statistics. Previous literature on limiting distribution mainly focuses on the cases when $n \to \infty$ with fixed kernel size $k$. Under some regularity conditions, we demonstrate asymptotic normality when $k$ grows with $n$ by utilizing existing results for $U$-statistics. The key in our approach lies in a mathematical red…

    Submitted 6 May, 2020; v1 submitted 2 December, 2019; originally announced December 2019.

    Comments: This version supersedes the previous technical report titled "Asymptotic Normality and Variance Estimation For Supervised Ensembles". Extensive simulations are added and we also provide a more detailed discussion on the bias phenomenon in variance estimation

  6. arXiv:1911.00190  [pdf, other]

    stat.ML cs.LG

    Randomization as Regularization: A Degrees of Freedom Explanation for Random Forest Success

    Authors: Lucas Mentch, Siyu Zhou

    Abstract: Random forests remain among the most popular off-the-shelf supervised machine learning tools with a well-established track record of predictive accuracy in both regression and classification settings. Despite their empirical success as well as a bevy of recent work investigating their statistical properties, a full and satisfying explanation for their success has yet to be put forth. Here we aim t…

    Submitted 14 September, 2020; v1 submitted 31 October, 2019; originally announced November 2019.

    Comments: To Appear in the Journal of Machine Learning Research (JMLR)

  7. arXiv:1908.11723  [pdf, other]

    cs.CL

    Earlier Isn't Always Better: Sub-aspect Analysis on Corpus and System Biases in Summarization

    Authors: Taehee Jung, Dongyeop Kang, Lucas Mentch, Eduard Hovy

    Abstract: Despite the recent developments on neural summarization systems, the underlying logic behind the improvements from the systems and its corpus-dependency remains largely unexplored. Position of sentences in the original text, for example, is a well known bias for news summarization. Following in the spirit of the claim that summarization is a combination of sub-functions, we define three sub-aspect…

    Submitted 30 August, 2019; originally announced August 2019.

    Comments: EMNLP 2019

  8. arXiv:1908.09967  [pdf, other]

    stat.ML cs.LG stat.ME

    Locally Optimized Random Forests

    Authors: Tim Coleman, Kimberly Kaufeld, Mary Frances Dorn, Lucas Mentch

    Abstract: Standard supervised learning procedures are validated against a test set that is assumed to have come from the same distribution as the training data. However, in many problems, the test data may have come from a different distribution. We consider the case of having many labeled observations from one distribution, $P_1$, and making predictions at unlabeled points that come from $P_2$. We combine…

    Submitted 26 August, 2019; originally announced August 2019.

    Comments: 23 pages, 7 figures

  9. arXiv:1905.10651  [pdf, other]

    stat.ML cs.LG math.ST

    Asymptotic Distributions and Rates of Convergence for Random Forests via Generalized U-statistics

    Authors: Wei Peng, Tim Coleman, Lucas Mentch

    Abstract: Random forests remain among the most popular off-the-shelf supervised learning algorithms. Despite their well-documented empirical success, however, until recently, few theoretical results were available to describe their performance and behavior. In this work we push beyond recent work on consistency and asymptotic normality by establishing rates of convergence for random forests and other superv…

    Submitted 16 November, 2021; v1 submitted 25 May, 2019; originally announced May 2019.

    Comments: 76 pages, 7 figures

  10. arXiv:1905.03151  [pdf, other]

    stat.ME cs.LG stat.ML

    Unrestricted Permutation forces Extrapolation: Variable Importance Requires at least One More Model, or There Is No Free Variable Importance

    Authors: Giles Hooker, Lucas Mentch, Siyu Zhou

    Abstract: This paper reviews and advocates against the use of permute-and-predict (PaP) methods for interpreting black box functions. Methods such as the variable importance measures proposed for random forests, partial dependence plots, and individual conditional expectation plots remain popular because they are both model-agnostic and depend only on the pre-trained model output, making them computationall…

    Submitted 7 October, 2021; v1 submitted 1 May, 2019; originally announced May 2019.

    MSC Class: 62G08; ACM Class: I.5.1

  11. arXiv:1606.09281  [pdf, other]

    cs.CV

    Multiphase Segmentation For Simultaneously Homogeneous and Textural Images

    Authors: Duy Hoang Thai, Lucas Mentch

    Abstract: Segmentation remains an important problem in image processing. For homogeneous (piecewise smooth) images, a number of important models have been developed and refined over the past several decades. However, these models often fail when applied to the substantially larger class of natural images that simultaneously contain regions of both texture and homogeneity. This work introduces a bi-level con…

    Submitted 29 June, 2016; originally announced June 2016.