doi 10.1146/annurev-statistics-040522-022138

Recent Advances in Text Analysis

Authors: Zheng Tracy Ke, Pengsheng Ji, Jiashun Jin, Wanshan Li

Abstract: Text analysis is an interesting research area in data science and has various applications, such as in artificial intelligence, biomedical research, and engineering. We review popular methods for text analysis, ranging from topic modeling to the recent neural language models. In particular, we review Topic-SCORE, a statistical approach to topic modeling, and discuss how to use it to analyze MADStat - a dataset on statistical publications that we collected and cleaned. The application of Topic-SCORE and other methods on MADStat leads to interesting findings. For example, $11$ representative topics in statistics are identified. For each journal, the evolution of topic weights over time can be visualized, and these results are used to analyze the trends in statistical research. In particular, we propose a new statistical model for ranking the citation impacts of $11$ topics, and we also build a cross-topic citation graph to illustrate how research results on different topics spread to one another. The results on MADStat provide a data-driven picture of the statistical research in $1975$--$2015$, from a text analysis perspective.

Submitted 7 February, 2024; v1 submitted 1 January, 2024; originally announced January 2024.

Journal ref: Annual Review of Statistics and Its Application 2024 11:1

arXiv:2008.03820 [pdf, other]

Spectral Algorithms for Community Detection in Directed Networks

Authors: Zhe Wang, Yingbin Liang, Pengsheng Ji

Abstract: Community detection in large social networks is affected by degree heterogeneity of nodes. The D-SCORE algorithm for directed networks was introduced to reduce this effect by taking the element-wise ratios of the singular vectors of the adjacency matrix before clustering. Meaningful results were obtained for the statistician citation network, but rigorous analysis on its performance was missing. First, this paper establishes theoretical guarantee for this algorithm and its variants for the directed degree-corrected block model (Directed-DCBM). Second, this paper provides significant improvements for the original D-SCORE algorithms by attaching the nodes outside of the community cores using the information of the original network instead of the singular vectors.

Submitted 9 August, 2020; originally announced August 2020.

Comments: Journal of Machine Learning Research 2020, to appear

Journal ref: Journal of Machine Learning Research 2020, (153):1-45

arXiv:1809.10804 [pdf, other]

Patient Risk Assessment and Warning Symptom Detection Using Deep Attention-Based Neural Networks

Authors: Ivan Girardi, Pengfei Ji, An-phi Nguyen, Nora Hollenstein, Adam Ivankay, Lorenz Kuhn, Chiara Marchiori, Ce Zhang

Abstract: We present an operational component of a real-world patient triage system. Given a specific patient presentation, the system is able to assess the level of medical urgency and issue the most appropriate recommendation in terms of best point of care and time to treat. We use an attention-based convolutional neural network architecture trained on 600,000 doctor notes in German. We compare two approaches, one that uses the full text of the medical notes and one that uses only a selected list of medical entities extracted from the text. These approaches achieve 79% and 66% precision, respectively, but on a confidence threshold of 0.6, precision increases to 85% and 75%, respectively. In addition, a method to detect warning symptoms is implemented to render the classification task transparent from a medical perspective. The method is based on the learning of attention scores and a method of automatic validation using the same data.

Submitted 27 September, 2018; originally announced September 2018.

Comments: 10 pages, 2 figures, EMNLP workshop LOUHI 2018

arXiv:1410.2840 [pdf, other]

doi 10.1214/15-AOAS896

Coauthorship and Citation Networks for Statisticians

Authors: Pengsheng Ji, Jiashun Jin

Abstract: We have collected and cleaned two network data sets: Coauthorship and Citation networks for statisticians. The data sets are based on all research papers published in four of the top journals in statistics from $2003$ to the first half of $2012$. We analyze the data sets from many different perspectives, focusing on (a) centrality, (b) community structures, and (c) productivity, patterns and trends. For (a), we have identified the most prolific/collaborative/highly cited authors. We have also identified a handful of "hot" papers, suggesting "Variable Selection" as one of the "hot" areas. For (b), we have identified about $15$ meaningful communities or research groups, including large-size ones such as "Spatial Statistics", "Large-Scale Multiple Testing", "Variable Selection" as well as small-size ones such as "Dimensional Reduction", "Objective Bayes", "Quantile Regression", and "Theoretical Machine Learning". For (c), we find that over the 10-year period, both the average number of papers per author and the fraction of self citations have been decreasing, but the proportion of distant citations has been increasing. These suggest that the statistics community has become increasingly more collaborative, competitive, and globalized. Our findings shed light on research habits, trends, and topological patterns of statisticians. The data sets provide a fertile ground for future researches on or related to social networks of statisticians.

Submitted 2 July, 2015; v1 submitted 10 October, 2014; originally announced October 2014.

MSC Class: 91C20; 62H30; 62P25

Journal ref: Annals of Applied Statistics 2016, 10(4): 1779-1812

arXiv:1404.2961 [pdf, other]

Rate optimal multiple testing procedure in high-dimensional regression

Authors: Pengsheng Ji, Zhigen Zhao

Abstract: In the high dimensional regression analysis when the number of predictors is much larger than the sample size, an important question is to select the important variable which are relevant to the response variable of interest. Variable selection and the multiple testing are both tools to address this issue. However, there is little discussion on the connection of these two areas. When the signal strength is strong enough such that the selection consistency is achievable, it seems to be unnecessary to control the false discovery rate. In this paper, we consider the regime where the signals are both rare and weak such that the selection consistency is not achievable and propose a method which controls the false discovery rate asymptotically. It is theoretically shown that the false non-discovery rate of the proposed method converges to zero at the optimal rate. Numerical results are provided to demonstrate the advantage of the proposed method.

Submitted 6 January, 2023; v1 submitted 10 April, 2014; originally announced April 2014.

Comments: 26 pages

Showing 1–5 of 5 results for author: Ji, P