-
Recent Advances in Text Analysis
Authors:
Zheng Tracy Ke,
Pengsheng Ji,
Jiashun Jin,
Wanshan Li
Abstract:
Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADStat, a dataset on statistical publications that we collected and cleaned.
The application of Topic-SCORE and other methods on MADStat leads to interesting findings. For example, $11$ representative topics in statistics are identified. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of $11$ topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another.
The results on MADStat provide a data-driven picture of the statistical research in $1975$--$2015$, from a text analysis perspective.
Submitted 7 February, 2024; v1 submitted 1 January, 2024;
originally announced January 2024.
-
Spectral Algorithms for Community Detection in Directed Networks
Authors:
Zhe Wang,
Yingbin Liang,
Pengsheng Ji
Abstract:
Community detection in large social networks is affected by degree heterogeneity of nodes. The D-SCORE algorithm for directed networks was introduced to reduce this effect by taking the element-wise ratios of the singular vectors of the adjacency matrix before clustering. Meaningful results were obtained for the statistician citation network, but a rigorous analysis of its performance was missing. First, this paper establishes theoretical guarantees for this algorithm and its variants for the directed degree-corrected block model (Directed-DCBM). Second, this paper provides significant improvements for the original D-SCORE algorithms by attaching the nodes outside of the community cores using the information of the original network instead of the singular vectors.
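The ratio step described in this abstract can be illustrated with a minimal sketch. It is a simplified reading of the idea (element-wise ratios of singular vectors, then clustering), not the authors' implementation: the function names `score_ratios`, `kmeans`, and `toy_example`, the optional `clip` truncation, the toy block matrix, and the assumption that the leading singular vectors have no zero entries are all illustrative choices that do not come from the listing.

```python
import numpy as np

def score_ratios(A, K, clip=None):
    """Element-wise ratios of the 2nd..K-th singular vectors to the 1st.

    Assumes every entry of the leading left/right singular vector is nonzero
    (a simplification made for this sketch).
    """
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    left, right = U[:, :K], Vt[:K, :].T

    def ratios(X):
        R = X[:, 1:] / X[:, [0]]          # dividing cancels the degree factor
        return np.clip(R, -clip, clip) if clip is not None else R

    return np.hstack([ratios(left), ratios(right)])

def kmeans(X, K, n_iter=50):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[np.argmax(d)])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels

def toy_example():
    """Noise-free two-community directed block matrix with uneven degrees."""
    n, K = 40, 2
    Pi = np.zeros((n, K))
    Pi[:20, 0] = 1.0                       # nodes 0..19 in community 0
    Pi[20:, 1] = 1.0                       # nodes 20..39 in community 1
    B = np.array([[0.9, 0.1], [0.1, 0.9]])
    theta = np.linspace(0.3, 1.0, n)       # hypothetical out-degree parameters
    delta = np.linspace(1.0, 0.4, n)       # hypothetical in-degree parameters
    Omega = theta[:, None] * (Pi @ B @ Pi.T) * delta[None, :]
    return kmeans(score_ratios(Omega, K), K)
```

On this noise-free toy matrix the ratios depend only on community membership, so the degree parameters drop out before clustering. On a real, sparse adjacency matrix the leading singular vectors can have zero entries, which this sketch does not handle.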
Submitted 9 August, 2020;
originally announced August 2020.
-
Patient Risk Assessment and Warning Symptom Detection Using Deep Attention-Based Neural Networks
Authors:
Ivan Girardi,
Pengfei Ji,
An-phi Nguyen,
Nora Hollenstein,
Adam Ivankay,
Lorenz Kuhn,
Chiara Marchiori,
Ce Zhang
Abstract:
We present an operational component of a real-world patient triage system. Given a specific patient presentation, the system is able to assess the level of medical urgency and issue the most appropriate recommendation in terms of best point of care and time to treat. We use an attention-based convolutional neural network architecture trained on 600,000 doctor notes in German. We compare two approaches, one that uses the full text of the medical notes and one that uses only a selected list of medical entities extracted from the text. These approaches achieve 79% and 66% precision, respectively, but at a confidence threshold of 0.6, precision increases to 85% and 75%, respectively. In addition, a method to detect warning symptoms is implemented to render the classification task transparent from a medical perspective. The method is based on the learning of attention scores and a method of automatic validation using the same data.
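The effect of a confidence threshold on reported precision can be sketched as follows. This assumes that "precision at a threshold" means the fraction of correct predictions among those whose confidence is at least the threshold, which the abstract does not spell out. The function name `precision_at_threshold` and the toy numbers are hypothetical and are not taken from the paper.

```python
def precision_at_threshold(confidences, correct, threshold=0.0):
    """Return (precision, coverage) over predictions with confidence >= threshold.

    confidences: model confidence per prediction (floats in [0, 1])
    correct:     whether each prediction was right (booleans)
    Returns (None, 0.0) when no prediction passes the threshold.
    """
    kept = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    if not kept:
        return None, 0.0
    return sum(kept) / len(kept), len(kept) / len(confidences)

# Toy, made-up values purely for illustration.
toy_conf = [0.9, 0.8, 0.55, 0.4, 0.7]
toy_correct = [True, True, False, False, False]
overall = precision_at_threshold(toy_conf, toy_correct)
filtered = precision_at_threshold(toy_conf, toy_correct, threshold=0.6)
```

Raising the threshold trades coverage for precision: fewer cases receive an automatic recommendation, but those that do are more often correct.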
Submitted 27 September, 2018;
originally announced September 2018.
-
Coauthorship and Citation Networks for Statisticians
Authors:
Pengsheng Ji,
Jiashun Jin
Abstract:
We have collected and cleaned two network data sets: Coauthorship and Citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from $2003$ to the first half of $2012$. We analyze the data sets from many different perspectives, focusing on (a) centrality, (b) community structures, and (c) productivity, patterns and trends.
For (a), we have identified the most prolific/collaborative/highly cited authors. We have also identified a handful of "hot" papers, suggesting "Variable Selection" as one of the "hot" areas.
For (b), we have identified about $15$ meaningful communities or research groups, including large-size ones such as "Spatial Statistics", "Large-Scale Multiple Testing", "Variable Selection" as well as small-size ones such as "Dimensional Reduction", "Objective Bayes", "Quantile Regression", and "Theoretical Machine Learning".
For (c), we find that over the 10-year period, both the average number of papers per author and the fraction of self citations have been decreasing, but the proportion of distant citations has been increasing. These suggest that the statistics community has become increasingly collaborative, competitive, and globalized.
Our findings shed light on research habits, trends, and topological patterns of statisticians. The data sets provide a fertile ground for future research on or related to social networks of statisticians.
Submitted 2 July, 2015; v1 submitted 10 October, 2014;
originally announced October 2014.
-
Rate optimal multiple testing procedure in high-dimensional regression
Authors:
Pengsheng Ji,
Zhigen Zhao
Abstract:
In high-dimensional regression analysis, when the number of predictors is much larger than the sample size, an important question is how to select the variables that are relevant to the response variable of interest. Variable selection and multiple testing are both tools to address this issue. However, there is little discussion of the connection between these two areas. When the signal strength is strong enough that selection consistency is achievable, it seems unnecessary to control the false discovery rate. In this paper, we consider the regime where the signals are both rare and weak, so that selection consistency is not achievable, and propose a method which controls the false discovery rate asymptotically. It is theoretically shown that the false non-discovery rate of the proposed method converges to zero at the optimal rate. Numerical results are provided to demonstrate the advantage of the proposed method.
Submitted 6 January, 2023; v1 submitted 10 April, 2014;
originally announced April 2014.