Cautionary note on "Semiparametric modeling of grouped current duration data with preferential reporting"
Authors:
Alexander C. McLain,
Rajeshwari Sundaram,
Marie Thoma,
Germaine M. Buck Louis
Abstract:
This report clarifies a few points about the article "Semiparametric modeling of grouped current duration data with preferential reporting" by McLain, Sundaram, Thoma and Louis in Statistics in Medicine (McLain et al., 2014, hereafter MSTL) regarding the use of its methods under right censoring. Simulation studies have shown that bias can occur when right censoring is present. Current duration data normally do not have censored values, but censoring can be induced at a value, say τ, after which the data values are thought to be unreliable. As noted in MSTL, some right-censored data require an assumption on the parametric form of the data beyond τ. While this assumption was given in MSTL, its implications were not sufficiently explored. Here we present simulations that evaluate the methods of MSTL under type I censoring, give some settings under which the method works well even in the presence of censoring, state when the model is correctly specified, and discuss the reasons for the bias.
Submitted 2 January, 2018;
originally announced January 2018.
Testing for Feature Relevance: The HARVEST Algorithm
Authors:
Herbert Weisberg,
Victor Pontes,
Mathis Thoma
Abstract:
Feature selection with high-dimensional data and a very small proportion of relevant features poses a severe challenge to standard statistical methods. We have developed a new approach (HARVEST) that is straightforward to apply, albeit somewhat computer-intensive. This algorithm can be used to pre-screen a large number of features to identify those that are potentially useful. The basic idea is to evaluate each feature in the context of many random subsets of other features. HARVEST is predicated on the assumption that an irrelevant feature can add no real predictive value, regardless of which other features are included in the subset. Motivated by this idea, we have derived a simple statistical test for feature relevance. Empirical analyses and simulations produced so far indicate that the HARVEST algorithm is highly effective in predictive analytics, both in science and business.
Submitted 27 February, 2018; v1 submitted 30 September, 2017;
originally announced October 2017.