Revealing the value of Repository Centrality in lifespan prediction of Open Source Software Projects

Runzhi He¹,², Hengzhi Ye¹,², Minghui Zhou¹,³
¹ School of Computer Science, Peking University, China
Key Laboratory of High Confidence Software Technologies, Ministry of Education, China

Abstract

Background: Open Source Software (OSS) is the building block of modern software. However, the prevalence of project deprecation in the open source world weakens the integrity of downstream systems and the broader ecosystem. This calls for efforts to monitor and predict project deprecation, empowering stakeholders to take proactive measures. Challenge: Existing techniques mainly rely on static features observed at a single point in time, which limits their predictive power. Goal: We propose a novel metric derived from the user-repository network, use it to fit project deprecation predictors, and demonstrate its real-life implications. Method: We establish a comprehensive dataset containing 103,354 non-fork GitHub OSS projects spanning from 2011 to 2023. We propose repository centrality, a family of HITS (Hyperlink-Induced Topic Search) weights that captures shifts in the popularity of a repository in the repository-user star network. With this metric, we use gradient boosting and deep learning to fit survival analysis models that predict a project’s lifespan or its survival hazard. Results: Our study reveals a correlation between the HITS centrality metrics and repository deprecation risk. A drop in the HITS weights of a repository indicates a decline in its centrality and prevalence, and is associated with an increase in its deprecation risk and a decrease in its expected lifespan. Our predictive models, powered by repository centrality and other repository features, achieve satisfactory accuracy on the test set, with repository centrality being the most significant feature. Implications: This research offers a novel perspective on the effect of prevalence on the deprecation of OSS repositories. Our approach to predicting repository deprecation helps detect the health status of a project so that action can be taken in advance, fostering a more resilient OSS ecosystem.

² These authors contributed equally to this work.
³ Minghui Zhou is the corresponding author.

I Introduction

The prosperous world of modern software does not emerge from nothing. Modern software relies on hundreds, and often thousands, of other pieces of software to function, develop, and maintain. Software engineers refer to the software that others depend on as dependencies. Thanks to the power of open source software (OSS) dependencies, developers can easily reuse code from various code hosting platforms, such as GitHub. Synopsys conducted an analysis of 1700 codebases from 17 different application domains in their annual Open Source Security and Risk Analysis Report (OSSRAR) [1]. They found that 76% of the analyzed codebases are open source, and 96% of them use third-party open-source components.

Open source software, although extremely useful, is often unreliable in the sense that there is no way to know if and when support for the project could cease, i.e., become deprecated. Over the years, many influential open source projects have declined, including GitHub’s well-known Atom text editor [2], Adobe’s HTML editor Brackets [3], and the JavaScript library faker.js [4], whose deprecation caused chaos. Besides, OSSRAR [1] reveals that over 91% of the analyzed dependencies have not shown any sign of maintenance in two years. Numerous factors contribute to the deaths of open source projects, including maintainers’ loss of interest or lack of time, the emergence of more attractive alternatives, and the uncovering of serious security vulnerabilities [5, 6].

The impact of repository deprecation extends beyond individual software projects, potentially triggering a domino effect that weakens the foundation of the interconnected software supply chain at scale. Software systems relying on deprecated repositories are susceptible to security vulnerabilities and compatibility issues, which could instigate cascading failures. Moreover, the abrupt deprecation of a repository (e.g. the left-pad incident in 2016 [7]) may introduce substantial disruptions in the development lifecycle, forcing developers into “emergency mode” to dedicate significant time and effort to identifying and integrating alternatives.

In light of the potential hazards, the capacity to anticipate repository deprecation emerges as a vital necessity. By accurately predicting the obsolescence of a code repository, developers can undertake proactive measures to counteract the ensuing risks. This might encompass the identification of substitute repositories or packages, strategizing for code migration, or even contributing to the preservation of the repository to avert its deprecation.

To tackle the problem, extensive effort [8, 9, 10, 11] has been devoted to investigating the factors contributing to repository deprecation. However, little attention has been paid to how repository-related features vary over continuous periods. Furthermore, there is room for exploring characteristics beyond those directly obtainable to predict repository deprecation, such as centrality features within the network constructed by the repositories. This gap in the existing body of knowledge underscores the need for a more comprehensive and nuanced understanding of these overlooked aspects.

Therefore, our work endeavors to analyze the popularity dynamics of an OSS repository over time, seeking to predict the prospective lifespan and risk of deprecation of a repository from the current time point. This temporal perspective allows us to capture trends and patterns that could provide more precise predictions of deprecation, enhancing the resilience of the open source ecosystem.

We propose the following research questions (RQs):

RQ1: Can we model the popularity dynamics of an OSS repository on a social coding platform?

Results: We adopt the HITS algorithm to model the “star” network of GitHub repositories as a bipartite graph (where users are hubs and repositories are authorities), and propose the repository centrality metric to reflect the popularity of a repository. The metric is a combination of the HITS weight, rank-normalized HITS weight, and z-score normalized HITS weight, providing a nuanced understanding of a repository’s position within the open-source ecosystem.

RQ2: How effective is the proposed repository centrality metric in predicting repository deprecation?

Results: Based on the proposed repository centrality metric, we leverage gradient boosting and deep learning techniques to fit survival analysis models, and to forecast the lifespan and survival hazard of a GitHub repository. The prediction accuracy is high and the proposed metric proves to be effective.

The contributions of this paper are:

• A novel comprehensive metric “repository centrality” to capture the shift of project popularity.

• A high-performance project decline prediction model that utilizes only open data and has been validated with real-world examples.

• A powerful language model to identify real repository deprecation from READMEs and descriptions.

• A large-scale and comprehensive dataset with 51,677 validated GitHub project deprecations.

We provide a replication package at https://figshare.com/s/981beaa8fc4a93c9c7e0.

II Background and Related Work

II-A Deprecation of OSS

Deprecation is a prominent problem due to its disruptive nature; it can cause disarray in software development processes, and when developers fail to realize that one of their dependencies is deprecated, it has the potential to introduce security bugs via the software supply chain [1]. Consequently, researchers have tried to understand its implications. To this end, studies by Robbes et al. [5] and Sawant et al. [6] have laid the groundwork for understanding the nature, causes, and implications of deprecation in OSS. These studies suggest that deprecation can stem from several factors, including the loss of interest by maintainers, the emergence of more efficient alternatives, or the discovery of serious security vulnerabilities. Subsequent research, such as the study by Kula et al. [12], indicates that a significant portion of libraries in package ecosystems like npm and RubyGems become deprecated over time, and additionally documents the prevalence of deprecation amongst OSS projects.

To remedy deprecation disruptions, recent research has attempted to identify unmaintained projects using advanced machine learning techniques [13, 14], and has employed centrality measures to forecast the trend of package deprecation [15]. However, these studies focus on static indicators. Our research builds upon these foundational studies by incorporating dynamic factors to offer more nuanced insight and improved predictive accuracy. We believe this research is valuable as it will help identify deprecation before it happens, allowing developers to plan and implement countermeasures, thereby reducing and perhaps entirely avoiding the risks associated with deprecation in OSS.

II-B Network Analysis

Communities can grow exponentially, quickly becoming an intricate web of relationships. To make such complexity tractable, studies have used network models to unravel the patterns of these communities [16]. Similarly, in the software ecosystem, network analysis has improved the understanding of software development attributes such as dependency management and bug proliferation. For instance, one study leveraged network centrality measures to predict the likelihood of future software changes [17], and another employed network metrics to forecast post-release defects [18]. However, we observe a noticeable gap in the current literature concerning project deprecation. First, dynamic factors in the ecosystem are rarely considered. Second, network centrality features within the repository network are rarely utilized. Accordingly, we believe that network analysis will provide a novel perspective on the deprecation of OSS repositories.

Figure 1: Research Framework

III Data Collection

Figure 1 presents the methodology of this study. In this section we explain how we construct the comprehensive dataset that underpins this study and may serve similar studies, as shown on the left side of Figure 1. To make the notion operational, we define project deprecation in Section III-A. We then describe how we select projects and construct the dataset in Sections III-B and III-C respectively.

III-A Definition of Repository Deprecation

There is no de facto definition of repository deprecation. Rather than defining the concept on all repositories in the open source world, we first limit the scope of our research to GitHub. GitHub is a code hosting platform of global prominence and extensive use, with a vast array of projects and contributors, ensuring our dataset’s scale and diversity. Besides, GitHub’s rich development history and “archive repository” feature highlight repository deprecations, ensuring the number and correctness of positive samples.

We follow the state of practice of GitHub project maintainers and define a repository to be deprecated if it meets any of the following criteria (shown in Figure 2):

• Repository is archived. GitHub provides a feature for archiving repositories [19], which is a clear indicator of the discontinuation of maintenance. When a repository is archived, it transitions into a read-only state, prohibiting any further updates or deletions. The process of archiving is straightforward for maintainers, involving simple navigation through “Settings → Danger Zone → Archive Repository” on GitHub. This feature has gained considerable traction, and a considerable number of projects now opt for this approach when deprecating their repositories.

• Repository has deprecation-indicating keywords in its README or description. In a notable number of cases, projects opt to announce their deprecation in the repository description or the README file, rather than resorting to the archive feature. (Part of the reason is that the “archive repository” feature was not available before 2017 [20].) An example is the EntityFramework [21], which communicates its discontinuation by stating “this library is no longer supported since 2015” at the outset of its README file. This method of signaling deprecation has also been noted by Coelho et al. [22] and utilized as a criterion for identifying discontinued repositories.

These two methods, shown in Figure 2(a) and 2(b), collectively form the basis of our operational definition of deprecation, enabling us to systematically identify and categorize deprecated repositories.

III-B Project Selection

Given the vast number of project repositories on GitHub, it is imperative to curate a reasonably sized, clean experimental dataset. This necessitates rigorous selection and aggregation processes for our data sources.

Our initial task involves the construction of a comprehensive dataset of deprecated GitHub repositories, as no existing dataset aptly fits the context of our study. Historically, research in this area has been limited to a narrow scope, focusing primarily on a few top-tier GitHub projects. For instance, the work of Coelho et al. [22] was confined to failed projects within the top 5000 GitHub projects as ranked by star count. Such a limited sample is both insufficient for generating robust results from a regression model and vulnerable to validity threats, since it does not consider less popular repositories. However, researchers limited the scale of their datasets for a reason: to mitigate the threat of false positives, former work involves a considerable amount of manual labeling, which translates into hundreds of hours of labor. To address this limitation, we: 1) utilize the ground truth of archived projects on GitHub, which are guaranteed not to be false positives; 2) label and cross-validate 1,200 samples that declare their deprecation in natural language, and leverage recent advances in language models to train a sentence-transformer-based classifier to label the remaining 20,358.

Archived Repositories. We first use the GitHub GraphQL API [23] to gather all archived GitHub repositories that boast a star count exceeding 32. This threshold is dictated by the constraints of the GitHub GraphQL search API and serves as a filtering criterion to eliminate trivial projects. This star-count-based approach is a common filtering method in software engineering studies, once applied by Xiao et al. [24].

Repositories with deprecation-indicating keywords. Owing to a combination of historical factors and maintainers’ preferences, not all GitHub projects signal their end of life by archiving the repository. We employ the GitHub GraphQL API to search all non-fork GitHub repositories whose descriptions or README files contain the indicating keywords cataloged by Coelho et al. [22]. As researchers have found, string matching introduces a non-negligible number of false positives (e.g. “versions from 1.0.14 to 1.1.4 are deprecated” or “(another project) seems unmaintained”). To cope with the unprecedented size of the dataset, we use a combination of manual and machine-learning labeling to separate real deprecations from the candidates.

To streamline labeling and make results easily comparable, we first build a dedicated Web UI (Figure 3) for binary classification tasks with Label Studio [25]. Two authors independently examine 1,200 randomly sampled repositories together with their commits, issues and pull requests, and decide whether each repository has been deprecated or the matched text is a false positive. The first round of labeling ended with an agreement of 95% and a Cohen’s kappa of 0.814, which suggests strong agreement. The two authors then discussed the inconsistencies and reached a consensus.
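For readers unfamiliar with the agreement statistic reported above, the following is a minimal sketch of how Cohen’s kappa is computed from two annotators’ labels. The function name and the toy labels are hypothetical and are not the annotations collected in this study.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Undefined (division by zero) when both annotators use one single label.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Toy labels (1 = deprecated, 0 = false positive); purely illustrative.
annotator_1 = [1, 1, 0, 0]
annotator_2 = [1, 1, 0, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects the raw agreement rate for the agreement expected by chance, which is why it can be noticeably lower than the raw percentage when one class dominates.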

Next, we use the few-shot text classification framework SetFit [26] to label the remaining 20,358 samples. SetFit is a prompt-free fine-tuning framework for sentence transformers [27], known for its efficiency and accuracy with a small number of labelled samples, which makes it a good fit for this scenario. We fine-tune our binary classifier from the pretrained paraphrase-mpnet-base-v2 [28] sentence transformer model on 600 random labelled samples, and use the other 600 as the test set. Our model reaches an accuracy of 0.96, a recall of 0.96 and a precision of 0.90.

After aggregating and deduplicating the two sets, we end up with 51,677 non-fork deprecated GitHub projects. Deprecated projects alone, however, are not sufficient: the diversity and representativeness of the dataset are key to the soundness of the forecast models. We therefore use the GitHub GraphQL API again to randomly select 51,677 “known good” repositories that have more than 32 stars and are not identified as deprecated.

III-C Dataset Construction

Our goal is to forecast the lifespan of GitHub repositories, necessitating the use of time-series data. Hence, the acquisition of historical project statistics, such as the number of stars and commits, is integral to our research. The native GitHub API, however, proves inadequate for this extensive task. Consequently, we turn to the GHArchive dataset [29], a more suitable alternative for our purposes.

Unlike other public datasets, which either lack updates or only update annually (e.g. GHTorrent [30]), GHArchive stands out due to its dynamic nature. It generates fresh data dumps every hour, storing responses from the GitHub TimelineEvent API in a universally compatible JSON format. This dataset provides a comprehensive record of public GitHub activity dating back to 2011, encapsulating various metrics such as pushes, stars, pull requests, issues, and comments.

For the purpose of the analysis, we compute the monthly statistics of the projects in our dataset, spanning from 2011 to 2023. To efficiently aggregate these historical metrics from GHArchive dumps, we employ the scalable OLAP database, ClickHouse. Our implementation facilitates rapid retrieval of monthly statistics, even for large-scale projects with over 10,000 issues, achieving sub-second retrieval times.
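The monthly aggregation described above can be illustrated at toy scale. The sketch below is a plain-Python stand-in, not the ClickHouse pipeline used in this study; it assumes newline-delimited JSON events with `type`, `repo.name` and `created_at` fields, as found in GHArchive dumps from 2015 onward (earlier dumps use a different layout). The function name and sample events are hypothetical.

```python
import json
from collections import Counter


def monthly_event_counts(json_lines):
    """Count events per (repository, month, event type)."""
    counts = Counter()
    for line in json_lines:
        event = json.loads(line)
        # created_at is an ISO 8601 timestamp; its first 7 chars are "YYYY-MM".
        month = event["created_at"][:7]
        counts[(event["repo"]["name"], month, event["type"])] += 1
    return counts


# Hypothetical sample events; a WatchEvent is how GitHub records a star.
sample = [
    '{"type": "WatchEvent", "repo": {"name": "octo/demo"}, "created_at": "2020-03-01T10:00:00Z"}',
    '{"type": "WatchEvent", "repo": {"name": "octo/demo"}, "created_at": "2020-03-15T08:30:00Z"}',
    '{"type": "PushEvent", "repo": {"name": "octo/demo"}, "created_at": "2020-04-02T12:00:00Z"}',
]
counts = monthly_event_counts(sample)
```

An OLAP database performs the same grouping declaratively and at a much larger scale; the sketch only shows the shape of the resulting monthly statistics.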

Eventually we assemble a comprehensive dataset comprising 103,307 non-fork projects, spanning from 2011 to 2023. Table I shows the distribution of stars and lifespan of the 51,677 deprecated projects.

TABLE I: Distribution of star/lifespan of our dataset

	Mean	25%	50%	75%
Stars	391.75	48	86	219
Lifespan(Days)	1786	978	1673	2243

IV RQ1: modelling the popularity dynamics of an OSS repository

In this study we propose that a metric modeling the popularity of a repository within its social network on an OSS platform is essential for forecasting the deprecation of that repository. Although existing metrics (e.g. commits, stars and release intervals) have been extensively explored and applied to evaluate repositories’ activity and state of maintenance [31, 32], they suffer from the following limitations: 1) After years of active development, repositories may stabilize and enter a stage with fewer contributions but still consistent maintenance; 2) Repositories may continue to receive stars after they have been declared deprecated; 3) Even extraordinarily popular projects may experience a release stall, for example, react (a top GitHub project in Table II) has not published a new version for 18 months. These limitations call for a more precise and comprehensive metric.

Inspired by the interconnected nature and rich historical data of open source communities, we begin with the vast network of repositories and users on GitHub [33], and propose repository centrality: a family of metrics comprising the HITS weight and its normalizations on the star bipartite graph of users and repositories. Below we elaborate on how we define and construct the user-repository network and repository centrality, and conduct a preliminary analysis of its effectiveness in capturing the complex relationships between repositories and users, as shown in the middle of Figure 1.

IV-A Defining Repository Centrality

Popularity and attention on an OSS repository are important indicators of its maintenance status [34], and stars are a natural proxy for both. However, just like links between web pages, stars are not created equal: stars from more experienced developers matter more than those from newcomers [35]. For instance, stars given by Linus Torvalds are more representative of the preferences of the developer community than those given by novice developers. Since a user can star many repositories and a repository can be starred by many users, the “star” relationship between users and repositories forms a bipartite graph (Figure 4). Node ranking has been studied extensively in the field of search engines, and the Hyperlink-Induced Topic Search (HITS) algorithm proposed by Kleinberg [36, 37] is well known and fits our context. HITS assigns scores to nodes based on the link structure of the graph: nodes with more links to more important neighbors are scored higher.

Figure 4: The Bipartite Graph of Users and Repositories

IV-B Calculation and Normalization

The HITS algorithm can be formalized as follows:

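Auth(p)=\sum_{q\in p_{to}}Hub(q)

Hub(p)=\sum_{q\in p_{from}}Auth(q)

where $p_{to}$ denotes the set of nodes that link to $p$ and $p_{from}$ the set of nodes that $p$ links to.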

The weight of a hub is the aggregation of the weights of its authority neighbors, and vice versa. Initially, the weights of all hubs and authorities are set to 1; the algorithm then iteratively aggregates and normalizes the weights, as formalized in Algorithm 1.

Algorithm 1 Calculation of Repository Centrality

Initialization: Auth[v]=1, Hub[v]=1, \forall v\in V

repeat
    for each v\in V do
        Auth[v]=0
        for each u\in incoming\_neighbor(v) do
            Auth[v]=Auth[v]+Hub[u]
        end for
    end for
    for each v\in V do
        Hub[v]=0
        for each w\in outcoming\_neighbor(v) do
            Hub[v]=Hub[v]+Auth[w]
        end for
    end for
    for each v\in V do
        Auth[v]=\frac{Auth[v]}{\sqrt{\sum_{u\in V}Auth[u]^{2}}}
        Hub[v]=\frac{Hub[v]}{\sqrt{\sum_{u\in V}Hub[u]^{2}}}
    end for
until convergence

Output: Auth[v], Hub[v], \forall v\in V

The algorithm is relatively simple, yet applying it to a gigantic graph of 200 million nodes and 200 million edges is challenging, not to mention that we need to repeat the process for each month in the past decade. To meet this challenge, we spent hundreds of developer-hours implementing an efficient and scalable HITS-computation framework. As illustrated in Figure 1, our framework reads data from the distributed SQL-compliant database TiDB and dispatches the workload to an array of Spark nodes. This framework allows us to calculate the monthly HITS weight for every GitHub repository over a decade within a computational budget of 1000 core-hours.
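At toy scale, the iteration in Algorithm 1 can be sketched in a few lines of single-machine Python. This is an illustrative sketch only and not the distributed TiDB/Spark framework described above; the function name, the convergence tolerance, the iteration cap and the example edges are all assumptions made for illustration.

```python
import math


def hits_bipartite(star_edges, max_iter=100, tol=1e-12):
    """HITS on a user -> repository star graph.

    star_edges: iterable of (user, repo) pairs; users act as hubs and
    repositories as authorities. Returns (authority, hub) dictionaries.
    """
    edges = list(dict.fromkeys(star_edges))  # drop duplicate stars
    users = {user for user, _ in edges}
    repos = {repo for _, repo in edges}
    hub = dict.fromkeys(users, 1.0)
    auth = dict.fromkeys(repos, 1.0)
    for _ in range(max_iter):
        # Authority update: sum of hub weights of the users starring the repo.
        new_auth = dict.fromkeys(repos, 0.0)
        for user, repo in edges:
            new_auth[repo] += hub[user]
        # Hub update: sum of authority weights of the repos the user starred.
        new_hub = dict.fromkeys(users, 0.0)
        for user, repo in edges:
            new_hub[user] += new_auth[repo]
        # L2 normalization, as in the last loop of Algorithm 1.
        auth_norm = math.sqrt(sum(a * a for a in new_auth.values())) or 1.0
        hub_norm = math.sqrt(sum(h * h for h in new_hub.values())) or 1.0
        new_auth = {r: a / auth_norm for r, a in new_auth.items()}
        new_hub = {u: h / hub_norm for u, h in new_hub.items()}
        change = sum(abs(new_auth[r] - auth[r]) for r in repos)
        auth, hub = new_auth, new_hub
        if change < tol:  # assumed convergence criterion
            break
    return auth, hub


# Hypothetical star graph: r1 is starred by every user.
toy_edges = [("alice", "r1"), ("alice", "r2"), ("bob", "r1"),
             ("carol", "r1"), ("carol", "r3")]
authority, hubs = hits_bipartite(toy_edges)
```

In this toy graph `r1` receives the highest authority weight, while `r2` and `r3` tie because their only stargazers occupy symmetric positions in the graph.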

On the other hand, HITS weights, like many metrics in software engineering [38, 39], follow a long-tail distribution. The rapid growth of open-source communities has led to a substantial increase in the number of nodes and edges, resulting in significant variance in HITS weights over time. Because a direct comparison of HITS weights across time periods may therefore not be meaningful, we apply two normalization methods to stabilize and flatten the distribution of HITS weights:

• Rank normalization: Following the approach of Mujahid et al. [15], we normalize the HITS weight as its percentile at a given time. The process can be formalized as follows:

w_{\%}=\frac{rank(w_{i})}{|i|}

• Z-score normalization: A common linear normalization method that adjusts the values’ mean to 0 and standard deviation to 1. Given the long-tail distribution of HITS, we apply a logarithmic transformation to the HITS weight before calculating the z-score:

w_{z}=\frac{\ln w_{i}-\mu}{\sigma}

We define a repository’s centrality to be the combination of its HITS weight in the star bipartite graph, rank-normalized HITS weight, and z-score normalized HITS weight. This composite measure provides a more nuanced understanding of a repository’s position within the open-source ecosystem.
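The two normalizations can be sketched as follows for the weights of a single month. The sketch makes several assumptions that the text does not fix: ranks are ascending (the smallest weight has rank 1), ties are broken arbitrarily, $\mu$ and $\sigma$ are the mean and population standard deviation of the log-transformed weights, and all weights are strictly positive. Function names and example weights are hypothetical.

```python
import math
from statistics import mean, pstdev


def rank_normalize(weights):
    """Map each repository's weight to rank / N (ascending rank, assumed)."""
    ordered = sorted(weights, key=weights.get)
    total = len(ordered)
    return {repo: (position + 1) / total
            for position, repo in enumerate(ordered)}


def zscore_log_normalize(weights):
    """Z-score of ln(weight); population standard deviation is assumed."""
    logs = {repo: math.log(w) for repo, w in weights.items()}
    mu = mean(logs.values())
    sigma = pstdev(logs.values())
    return {repo: (value - mu) / sigma for repo, value in logs.items()}


# Hypothetical HITS weights for one month.
month_weights = {"repo_a": 0.001, "repo_b": 0.01, "repo_c": 0.1}
w_pct = rank_normalize(month_weights)
w_z = zscore_log_normalize(month_weights)
```

Because the example weights are evenly spaced on a logarithmic scale, the middle repository lands at a z-score of approximately zero, which illustrates why the log transform is applied before standardizing a long-tailed quantity.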

IV-C Preliminary Analysis

To substantiate our initial hypothesis regarding the correlation between a project’s popularity and its lifespan, we perform a preliminary analysis of the collected repositories with the HITS weight calculated from the following three aspects.

IV-C1 Top-Repositories

Many have found that the number of stars is not an honest representation of a project’s quality and impact. An intuitive example is that, more often than not, top repositories in GitHub rankings are tutorials, examples and resource collections rather than libraries or frameworks. On the GitHub stars leaderboard of September 2023 (Table II), we find repositories that are not strongly associated with software development, for example 996.ICU, a protest against overwork within the IT industry. In fact, nine of the top ten repositories ranked by the number of stars are not software projects (libraries or applications), the only exception being react, a web framework. From a software developer’s perspective, the top repositories ranked by HITS are more familiar and influential: the top ten include seven well-known software projects such as tensorflow, vue and vscode. This intuitive observation supports the HITS weight’s capability of capturing repository impact, which in turn influences repository lifespan [40].

TABLE II: Top 10 Repositories ranked by HITS and Stars

Rank	Repository by HITS	Repository by Stars
1	sindresorhus/awesome	freeCodeCamp
2	facebook/react*	free-programming-books
3	vuejs/vue*	sindresorhus/awesome
4	tensorflow*	996.ICU
5	airbnb/javascript*	coding-interview-university
6	You-Dont-Know-JS	public-apis
7	react-native*	developer-roadmap
8	oh-my-zsh*	system-design-primer
9	developer-roadmap	build-your-own-x
10	Microsoft/vscode*	facebook/react*

* Software projects (applications or libraries).

IV-C2 Representative Projects

How does HITS work in practice? How is its capability of forecasting deprecation compared with other metrics? To answer the questions, firstly we choose the once-popular code editor Brackets, which was deprecated in favor of the more feature-complete VSCode in 2021, as a showcase. Secondly we explore the prediction power of HITS’s delta $\Delta$ HITS ( $\Delta HITS=HITS_{t}-HITS_{t-1month}$ ) within three randomly sampled projects: Project [41], [42] and [43].

Figure 5 displays the activity statistics of the Brackets project since 2015. The figure shows the number of new events created each month for stars, issues, PRs, commits, comments, and tags. It is evident that the development of Brackets has been gradually stagnating since 2015. Yet, the project continued to receive a high and stable number of stars each month, with a notable surge in 2021. The HITS weight however, as a reliable indicator of the project’s impact, has been on a steady decline since 2015. This case demonstrates the HITS weight to be a more promising representation of projects’ deprecation trends, and less prone to noise compared with other indicators.

Figure 6 illustrates the relationship between $\Delta$ HITS and time for the three randomly selected projects. It is clear that for Project [41], represented in Figure 6(a), there was no negative $\Delta$ HITS during the observation period. Indeed, this project has not been deprecated and is still under active maintenance. However, for Project [42] and [43], represented in Figures 6(b) and 6(c) respectively, deprecation occurred during the observation period. In each case, a negative peak in $\Delta$ HITS in the month preceding deprecation serves as a harbinger of this event.

These cases suggest that the HITS weight, as a predictor of repository deprecation, is highly sensitive and has the potential to detect a trend towards deprecation.
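The $\Delta$ HITS signal used in these examples is a simple first difference of the monthly series. A minimal sketch follows; the function names and the monthly values are hypothetical and are not data from the projects discussed above.

```python
def delta_hits(monthly_hits):
    """Month-over-month change: HITS_t - HITS_{t-1} for each t >= 1."""
    return [curr - prev for prev, curr in zip(monthly_hits, monthly_hits[1:])]


def sharpest_drop(monthly_hits):
    """Return (month index, delta) of the most negative change, or None."""
    deltas = delta_hits(monthly_hits)
    if not deltas:
        return None
    index = min(range(len(deltas)), key=deltas.__getitem__)
    if deltas[index] >= 0:
        return None  # the series never declined
    return index + 1, deltas[index]  # +1: delta i belongs to month i + 1


# Hypothetical monthly HITS weights for one repository.
series = [0.5, 0.5, 0.75, 0.25, 0.125]
drop = sharpest_drop(series)
```

A repository whose series never declines yields `None`, mirroring the case of the actively maintained project with no negative $\Delta$ HITS.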

IV-C3 Correlation Between HITS metric and Other Metrics

Is HITS truly unique? Does it capture trends that the metrics developers already rely on, such as the number of stars and commits, fail to reveal? To answer these questions, we follow the methodology of previous work [15] and use Spearman’s rank correlation test [44] to measure the correlation between HITS and other metrics. Spearman’s rank correlation coefficient ( $\rho$ ) is calculated from the relative order of the values and ranges from -1 to 1, with -1 indicating a perfect negative correlation and 1 a perfect positive correlation. Spearman’s $\rho$ is well suited to this context because it is insensitive to the shape of the data distribution.
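To make the rank-based nature of Spearman’s $\rho$ concrete, the sketch below computes it as the Pearson correlation of the ranks, assigning average ranks to ties. It is a standard-library illustration with hypothetical data and function names, not the analysis code behind Figure 7.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman_rho(x, y):
    """Pearson correlation of the ranks of x and y (undefined if constant)."""
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical series: stars grow exponentially while HITS grows linearly.
hits_series = [0.1, 0.2, 0.3, 0.4]
star_series = [10, 100, 1000, 10000]
rho = spearman_rho(hits_series, star_series)
```

The two series are related non-linearly but monotonically, so their ranks agree perfectly; this is the property that makes the coefficient robust to long-tailed metrics.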

Figure 7(a) presents the correlation matrix between metrics. We observe that Spearman’s $\rho$ between HITS and other metrics across all samples is less than 0.4, suggesting negligible to weak correlations, as interpreted by Prion and Haerling [45]. The sole strong correlation identified is between the number of pull requests and the number of commits. Further analysis of the distribution of correlations among repositories is provided in Figure 7(b), from which we can draw the following conclusions: 1) In most repositories, HITS is positively correlated with the number of commits, issues, PRs, stars, and tags, which is reasonable; 2) The 75th percentiles (the upper edge of the boxes in the figure) are all below 0.4, indicating that HITS exhibits weak correlations with the metrics in the majority of repositories. The only exception is the moderate correlation between HITS and the number of stars. The results indicate that HITS may capture unique characteristics of the repositories that can not be simply modelled by existing common metrics.

V RQ2: Effectiveness of repository centrality

To estimate the predictive power of repository centrality, we turn to an approach that is often employed to estimate survival hazard: survival analysis. The life cycle of an OSS project repository begins at its creation, with various activity metrics being recorded until its deprecation, and can thus be compared to that of a human. The activity data of the repository (e.g., stars and commits) are similar to the vital signs of a human (e.g., blood pressure and glucose), and the survival time of the repository is similar to a human’s life expectancy. Survival analysis, a prominent method in medical research, can be aptly applied here. As a form of regression that models the hazard rate of an event over time and is adept at handling censored data, it matches our data and prediction target well. Note that studies in software engineering have widely adopted survival analysis as well [46, 47].

In particular, we fit survival analysis models with the repository centrality metric and other historical metrics being predictors, to predict the survival time (AFT) or hazard rate (DRSA) of a repository from the current observation point, as shown on the right of Figure 1.

V-A Selection of Controlling Features

Our goal is to train and fit suitable models based on HITS weight and various controlling features, to predict the expected lifespan of a repository from the current observation point.

To provide comprehensive baselines against which repository centrality can be compared as a predictor, we extract repository features (controlling features) from three perspectives:

• Development: The number of new commits serves as a robust indicator of a project’s development and maintenance activity. A near-deprecation project tends to exhibit a steep decline in the frequency of bug fixes and features, which means a stall in commits and releases.

• Collaboration: Sustained work within a core group of contributors is a significant contributor to the success and longevity of open-source projects [48]. On code hosting platforms like GitHub, developers collaborate by opening issues and pull requests and by discussing in comments. These activities are good proxies of a project’s collaboration status.

• Community Attention: The number of stars is a widely used popularity metric in software engineering studies. As a complement to the number of stars and the core contribution of this work, we introduce repository centrality (HITS weight, rank-normalized HITS, and z-score normalized HITS) as encodings of a project’s community attention.

The features we ultimately select as controlling predictors are listed in Table III.

TABLE III: Selected Features

Feature	Explanation
Commits	Created Commits in a Month
Comments	Created Issue / PR Comments in a Month
Issues	Created Issues in a Month
PRs	Created PRs in a Month
Stars	New Stars in a Month
Tags	Created Tags in a Month
Weight	HITS Weight of the Project Repository
Weight_%	Percentile Normalized HITS
Weight_z	Z-Score Normalized HITS
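To make the two normalized variants in Table III concrete, the sketch below shows one common way to derive percentile and z-score values from the raw HITS weights of all repositories in a given month. The exact definitions are an assumption made for this illustration (percentile as the fraction of weights not exceeding a value, z-score with the population standard deviation), and the function names are hypothetical.

```python
from bisect import bisect_right
from statistics import mean, pstdev


def percentile_normalize(weights):
    """Map each weight to the fraction of weights less than or equal to it."""
    ordered = sorted(weights)
    n = len(weights)
    return [bisect_right(ordered, w) / n for w in weights]


def z_score_normalize(weights):
    """Center by the mean and scale by the population standard deviation."""
    mu = mean(weights)
    sigma = pstdev(weights)
    if sigma == 0:
        # All weights are equal, so no repository deviates from the mean.
        return [0.0 for _ in weights]
    return [(w - mu) / sigma for w in weights]


# Hypothetical raw HITS weights of four repositories in the same month.
raw = [0.1, 0.4, 0.2, 0.3]
weight_pct = percentile_normalize(raw)
weight_z = z_score_normalize(raw)
```

Both variants express a repository’s standing relative to its peers in the same month, which makes values comparable across months even when the scale of the raw weights drifts.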

V-B Accelerated Failure Time (AFT)

The Accelerated Failure Time (AFT) model [49] is one of the most commonly used models in survival analysis. It assumes a linear relationship between the logarithm of survival time and the predictor variables, expressed as: $\ln Y=\left<w,x\right>+\sigma\cdot Z$ , where $x$ is a vector representing the features, $w$ is the coefficient vector, $Y$ is the output label (survival time), $Z$ is a random noise term following a known probability distribution, and $\sigma$ is the scaling factor of $Z$ . In this study, $x$ represents the array of features of GitHub projects, and the label to predict, $Y$ , is the survival time of the projects.
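The following minimal sketch shows how the AFT formulation handles both deprecated and still-alive (right-censored) repositories when the noise $Z$ is assumed to be standard normal. The normal-noise choice and the function names are assumptions made for illustration only. For a deprecated repository the density of $Y$ is used, whereas for a censored one only the probability of surviving beyond the observed time contributes.

```python
import math


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def aft_negative_log_likelihood(w, x, y, observed, sigma=1.0):
    """Per-sample loss for ln Y = <w, x> + sigma * Z with Z ~ N(0, 1).

    y is the observed time; observed is False for right-censored samples,
    i.e. repositories that were still alive at the end of the observation.
    """
    z = (math.log(y) - sum(wi * xi for wi, xi in zip(w, x))) / sigma
    if observed:
        # Density of Y at y: phi(z) / (sigma * y).
        return -math.log(normal_pdf(z) / (sigma * y))
    # Survival probability P(Y > y) = 1 - Phi(z).
    return -math.log(1.0 - normal_cdf(z))


def predicted_median_time(w, x):
    """Median of Y under symmetric zero-median noise: exp(<w, x>)."""
    return math.exp(sum(wi * xi for wi, xi in zip(w, x)))


# Hypothetical one-feature example where <w, x> equals ln y exactly.
nll_observed = aft_negative_log_likelihood([1.0], [2.0], math.exp(2.0), True)
nll_censored = aft_negative_log_likelihood([1.0], [2.0], math.exp(2.0), False)
```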

To evaluate the practical utility of the HITS weight, we train an AFT model using the state-of-the-art XGBoost [50] gradient boosting framework based on the selected features, including HITS weights. XGBoost has demonstrated top-tier performance in numerous prediction tasks and exhibits impressive scalability. It is also easily interpretable with the reported F-Score and automatic feature selection.

We set aside 20% of the repositories as the test set for the model, with nloglik chosen as the loss function. To accelerate the training process, we limited the iteration rounds to 50 and set the number of early-stopping rounds to 10. Other parameters were set to XGBoost defaults.
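A training setup along these lines can be sketched with the XGBoost Python package as follows. This is a sketch under stated assumptions rather than our exact training script: we assume that "nloglik" corresponds to XGBoost’s `aft-nloglik` metric with the `survival:aft` objective, that censoring is encoded through lower and upper label bounds, and that a held-out set is used for early stopping. The function and variable names are hypothetical.

```python
import math

# Assumption: objective and metric names of XGBoost's AFT implementation;
# every other parameter is left at its XGBoost default.
AFT_PARAMS = {
    "objective": "survival:aft",
    "eval_metric": "aft-nloglik",
}


def aft_label_bounds(durations, deprecated):
    """Encode survival labels as intervals.

    A deprecated repository has an exact lifespan (lower == upper).
    A repository still alive is right-censored: its lifespan is at least
    the observed duration, so the upper bound is +infinity.
    """
    lower = list(durations)
    upper = [d if dead else math.inf for d, dead in zip(durations, deprecated)]
    return lower, upper


def train_aft(train, held_out, rounds=50, patience=10):
    """Fit the model; train and held_out are (features, durations, deprecated)."""
    # Imported lazily so the helpers above work without these packages.
    import numpy as np
    import xgboost as xgb

    def to_dmatrix(features, durations, deprecated):
        lower, upper = aft_label_bounds(durations, deprecated)
        dmat = xgb.DMatrix(np.asarray(features, dtype=float))
        dmat.set_float_info("label_lower_bound", np.asarray(lower, dtype=float))
        dmat.set_float_info("label_upper_bound", np.asarray(upper, dtype=float))
        return dmat

    dtrain = to_dmatrix(*train)
    dheld = to_dmatrix(*held_out)
    return xgb.train(
        AFT_PARAMS,
        dtrain,
        num_boost_round=rounds,
        evals=[(dheld, "held_out")],
        early_stopping_rounds=patience,
    )


# Hypothetical labels: one deprecated after 30 days, one alive at day 400.
lower_bounds, upper_bounds = aft_label_bounds([30.0, 400.0], [True, False])
```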

This model converged after 50 iterations, and the C-Index on the test set was 0.810. This score indicates that the model has a strong discriminatory power [51].
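For readers unfamiliar with the C-Index, the sketch below computes Harrell’s concordance index in its basic form: among all comparable pairs of repositories, it measures the fraction for which the repository predicted to live longer actually did. A pair is comparable when the repository with the shorter observed time was actually deprecated. The data are hypothetical, and this quadratic-time version is meant only as an illustration.

```python
def concordance_index(times, predicted, observed):
    """Harrell's C-Index for predicted survival times.

    times: observed durations; predicted: predicted survival times;
    observed: True if deprecation was observed, False if censored.
    """
    concordant = 0.0
    comparable = 0
    for i in range(len(times)):
        if not observed[i]:
            # A censored repository cannot be the earlier event in a pair.
            continue
        for j in range(len(times)):
            if times[i] < times[j]:
                comparable += 1
                if predicted[i] < predicted[j]:
                    concordant += 1.0
                elif predicted[i] == predicted[j]:
                    concordant += 0.5  # ties in prediction count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical lifespans (days) and model predictions.
times = [10, 20, 30, 40]
predicted = [12, 18, 35, 33]
c_all_observed = concordance_index(times, predicted, [True, True, True, True])
c_with_censoring = concordance_index(times, predicted, [True, True, False, True])
```

A value of 0.5 corresponds to random ordering and 1.0 to a perfect ordering of the comparable pairs.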

Thanks to the interpretable nature of gradient boosting models, we can derive quantitative insights into how much the HITS weights contribute to the predictions. First, we choose the F-Score calculated by the XGBoost framework as the direct measurement of the features’ contributions. In addition, following the choice of numerous AI researchers [52], we pick SHAP (SHapley Additive exPlanations) [53] as the generic metric of feature importance. By borrowing the concepts of “cooperator” and “payoff” from game theory, SHAP calculates the average contribution of each feature to the change in model output by considering all possible feature combinations.
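The idea behind SHAP can be illustrated with an exact, brute-force Shapley computation on a toy model. The sketch below is not the algorithm used by the SHAP library for tree models, which relies on far more efficient methods. It simply averages each feature’s marginal contribution over all orders in which features can be switched from a baseline value to their actual value. The toy model, feature names, and values are hypothetical.

```python
from itertools import permutations


def shapley_values(model, instance, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    model: callable taking a dict of feature values and returning a number.
    instance: feature values to explain; baseline: reference feature values.
    """
    names = list(instance)
    totals = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in ordering:
            # Switch one feature from its baseline to its actual value.
            current[name] = instance[name]
            output = model(current)
            totals[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in totals.items()}


def toy_model(features):
    """Hypothetical predictor with an interaction between two features."""
    return (2.0 * features["weight"] + 0.5 * features["stars"]
            + features["weight"] * features["stars"])


instance = {"weight": 1.0, "stars": 2.0}
baseline = {"weight": 0.0, "stars": 0.0}
phi = shapley_values(toy_model, instance, baseline)
```

The contributions sum to the difference between the model output for the instance and for the baseline, which is the property that makes SHAP values additive explanations.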

Figure 8(a) presents the F-scores of each feature in the XGBoost model, showing that the HITS weight has the highest importance, followed by the number of stars. Percentile-normalized HITS and Z-score normalized HITS rank third and fourth, respectively, suggesting that decreasing popularity significantly influences developers’ decisions to deprecate repositories, while the frequency of maintenance and collaboration has less impact. Figure 8(b) shows that the HITS weight and Z-Score normalized HITS have the highest SHAP values, suggesting that they have the strongest effect on the predictions. The red dots on the right side and the blue cluster on the left side indicate a positive correlation between HITS weight and prediction values. In other words, projects with higher HITS weights have a greater chance of surviving.

TABLE IV: Performance of Ablation Models (AFT)

Model	C-Index	Predicted Lifespan (Days, Mean)
Baseline	0.748	7877
Baseline-stars	0.718	7925
Full-weight	0.755	7663
Full-weight_%	0.808	7926
Full-weight_z	0.808	7726
Full-comments	0.805	7772
Full-commits	0.801	7685
Full-issues	0.805	7569
Full-prs	0.804	7374
Full-stars	0.809	7877
Full-tags	0.809	7914
Full	0.810	7925

Table IV presents the results of the ablation study. The inclusion of HITS has a noticeable effect on the performance of the predictor. Compared to the baseline model, the full model shows a marked improvement in C-Index, from 0.748 to 0.810. Removing the HITS weight feature from the full model (Full-weight in Table IV) results in a drop in the C-Index to 0.755, while no comparable drop is observed after removing any other feature. This suggests that HITS weights play a unique role in lifespan prediction for GitHub projects. Given that HITS weights do not nullify the overall evolution of GitHub projects, they represent an important avenue for future researchers exploring the evolution of project deprecation.

V-C Deep Recurrent Survival Analysis (DRSA)

We also employ the Deep Recurrent Survival Analysis (DRSA) model proposed by Ren et al. [54] to predict repository deprecation with the memory of historical shifts of the features. As a survival analysis model based on recurrent neural networks (RNNs), DRSA is good at capturing trends of metrics over time, which makes it well suited to this task.

We randomly select 80% of the repositories in our dataset as the training set, with the remaining 20% serving as the test set. For each repository, we extract each month’s features from the historical data over a continuous 10-month period and then feed the sequential data into the DRSA model for training. We set batch size to 64, learning rate to 0.015 and iteration rounds up to 1000 to conduct the training process.
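The data preparation described above can be sketched as follows. The sketch assumes that the split is made at the repository level with a fixed random seed and that each model input is a block of 10 consecutive monthly feature rows. The seed, the helper names, and the choice of the window start are hypothetical, as they are not specified in the text.

```python
import random


def split_repositories(repo_ids, test_fraction=0.2, seed=0):
    """Randomly split repository identifiers into training and test sets."""
    ids = sorted(repo_ids)
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_fraction)
    return ids[n_test:], ids[:n_test]


def window_at(monthly_rows, start, length=10):
    """Return `length` consecutive monthly feature rows beginning at `start`.

    Returns None when the repository history is too short for a full window.
    """
    if start < 0 or start + length > len(monthly_rows):
        return None
    return monthly_rows[start:start + length]


# Hypothetical repository identifiers and a 14-month feature history with
# two features per month.
train_ids, test_ids = split_repositories([f"repo-{i}" for i in range(10)])
history = [[month, month * 2] for month in range(14)]
sequence = window_at(history, start=3)
```

Splitting by repository rather than by month keeps all observations of a repository on one side of the split, which avoids leaking a repository’s own history into the test set.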

After training, we evaluate the DRSA model on the test set and get satisfactory results. We randomly select two repositories, Stopwatch [55] and mr4c [56], from the test set to illustrate the ability of DRSA models trained on monthly HITS weight and other metrics to predict repository deprecation. As shown in Figure 9, the horizontal coordinate indicates the number of months since our last observation, while the vertical coordinate indicates the hazard rate of repository deprecation in each month. We assume that a hazard rate greater than 50% indicates a high probability of repository deprecation in the corresponding month, which is a wake-up call that deprecation may occur. Meanwhile, if the curve shows a peak of more than 50% in the observation interval covered by the horizontal coordinate, we consider the month corresponding to the peak as the month in which the DRSA model predicts the occurrence of deprecation.

In Figure 9(a), we use the feature data of Stopwatch from January 2017 to October 2017 as the test input. The hazard rate curve reaches its highest peak (73%) in the ninth month ahead, so we predict that the project will be deprecated within the next nine months, i.e., by July 2018. The project was in fact archived on 5 July 2018. The predicted date is within one month of the actual deprecation date and is earlier than it, so we consider the prediction reasonably accurate. In Figure 9(b), we use the data of mr4c from October 2015 to July 2016 as the test input. The hazard curve does not exceed 50% in any of the next 10 months, which suggests that the project will not be deprecated in the short term. In fact, mr4c was archived in December 2021, well after 2016, which matches the prediction.

Similar to the ablation study results of the AFT model, the inclusion of HITS strongly contributes to the performance of the DRSA model: the full model improves the C-Index by 9.3% over the base model, and removing the HITS weight related features from the full model lowers the C-Index by 6.9%, whereas removing other features causes a smaller drop (less than 4%). These results further support that HITS weights are useful in lifespan prediction for GitHub projects.
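The decision rule applied to the hazard curves can be sketched as follows. The hazard values below are hypothetical and only mimic the shape described for Stopwatch (a peak of 73% in the ninth month). They are not the actual model output. The survival curve assumes the usual discrete-time definition, in which the probability of surviving through a month is the product of one minus the hazard of every month so far.

```python
def predict_deprecation_month(hazards, threshold=0.5):
    """Return the 1-based month of the highest hazard if it exceeds threshold.

    Returns None when no month exceeds the threshold, i.e. no deprecation
    is predicted within the forecast horizon.
    """
    peak = max(range(len(hazards)), key=lambda m: hazards[m])
    if hazards[peak] > threshold:
        return peak + 1
    return None


def survival_curve(hazards):
    """Probability of surviving through each month given monthly hazards."""
    alive = 1.0
    curve = []
    for hazard in hazards:
        alive *= 1.0 - hazard
        curve.append(alive)
    return curve


def add_months(year, month, offset):
    """Shift a (year, month) pair forward by `offset` calendar months."""
    index = year * 12 + (month - 1) + offset
    return index // 12, index % 12 + 1


# Hypothetical 10-month hazard curve following a last observation in 2017-10.
hazards = [0.05, 0.08, 0.10, 0.12, 0.20, 0.31, 0.44, 0.58, 0.73, 0.40]
month_ahead = predict_deprecation_month(hazards)
predicted_date = add_months(2017, 10, month_ahead)
no_alarm = predict_deprecation_month([0.02, 0.05, 0.04, 0.03])
```

With these hypothetical values, the rule points to the ninth month after October 2017, that is, July 2018.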

VI Discussion

In this section, we discuss how our approach may help with related research and practice, and what factors may potentially threaten the validity of this study.

VI-A Implication to Research and Practice

Novel Use of Centrality Indicator. In contrast to traditional metrics, our study introduces a centrality measure based on the HITS algorithm, providing a more nuanced depiction of repository popularity. This approach encourages a higher-dimensional perspective on the issue of repository deprecation trends. Our work aligns with the increasing recognition of the importance of network-related indicators in capturing the evolving nature of OSS development ecosystems. Moreover, it reveals a novel angle for analyzing complex systems through network analysis, echoing recent advances in network analysis applications to software engineering research.

Scalable Approach. Our approach leverages graph computation acceleration tools, enabling efficient calculation of HITS weights and potential scalability to larger datasets, given sufficient computational resources. This aligns with the broad trend towards the use of scalable methodologies in software engineering research to tackle large-scale problems. While our study focuses on GitHub, the proposed methodology could be applicable to other open-source platforms, offering a versatile framework for predicting repository deprecation risks. Furthermore, our predictive model, based on HITS weight and survival analysis model, performs well in forecasting deprecation risks, providing a foundation for future research in this area.

VI-B Threats to Validity

Internal Validity. The first threat pertains to the use of the HITS algorithm, which favors older, more established projects due to its recursive nature [36]. This could potentially introduce a bias towards these projects.

The second threat concerns the survival analysis fitting process. While we have considered a set of features, there might be other influential factors not included in our model, and some of the included features may overlap. Furthermore, the performance of the survival analysis model could be sensitive to the choice of hyperparameters during the training process, which could affect the predictive power of the model [57].

In addition, our bipartite graph model only considers the star relationship between users and repositories. However, users and repositories might be connected through other relationships, such as forks or pull requests, which could yield a more complex network structure [58, 59]. Moreover, our dataset does not include star cancellations, which are not provided by the GitHub Timeline API, so there may be discrepancies between the network we constructed and the real one.

External Validity. Since our research is primarily based on data from GitHub, which has its unique user base and project characteristics, there are potential threats to validity. Though similar, these characteristics may not be representative of other open-source platforms, potentially limiting the applicability of our findings.

Besides, our study does not take into consideration the influence that project type may have on a project’s popularity and deprecation risk. It thus implicitly assumes that the relationship between centrality and deprecation risk is the same across project types, which might not hold true in all cases. At a similar degree of centrality, different types of projects (e.g., commercial, volunteer, interest-driven) may have different deprecation risks, which potentially affects the validity of our findings [60].

VII Conclusion

OSS repositories face the risk of deprecation for various reasons [22], leading to instability in the software ecosystem. Therefore, predicting repository deprecation is crucial for ensuring the sustainability of OSS projects.

In this paper, we address this pressing need in the software engineering domain, a gap that existing metrics have failed to bridge adequately. Traditional metrics, while informative, do not fully capture the network centrality of OSS project popularity and its implication for repository deprecation.

To bridge this gap, we propose and validate a centrality metric based on the HITS algorithm, which captures the repository popularity according to the connections between users and repositories in a bipartite network. Our method provides a more comprehensive understanding of OSS project popularity and its correlation with repository deprecation. Our preliminary analysis and subsequent survival model analysis demonstrate the effectiveness of the HITS weight in predicting repository deprecation, with our models showing good performance.

Looking forward, there is potential for further refinement and validation of our metric across various open-source platforms. Moreover, the integration of our metric into predictive models could be explored to enhance the accuracy of deprecation forecasts. Such advancements would not only provide developers and maintainers with a more nuanced understanding of project popularity and repository deprecation but also equip them with a reliable tool for strategic decision-making, thereby fostering a more sustainable open-source software ecosystem.

References

[1] (2023) Synopsys: Open source security and risk analysis report. [Online]. Available: https://www.synopsys.com/software-integrity/resources/analyst-reports/open-source-security-risk-analysis.html
[2] Github project: atom/atom. [Online]. Available: https://github.com/atom/atom
[3] Github project: adobe/brackets. [Online]. Available: https://github.com/adobe/brackets
[4] Github project: Marak/faker.js. [Online]. Available: https://github.com/Marak/faker.js
[5] R. Robbes, M. Lungu, and D. Röthlisberger, “How do developers react to api deprecation? the case of a smalltalk ecosystem,” in Proceedings of the ACM SIGSOFT 20th International Symposium on the Foundations of Software Engineering, 2012, pp. 1–11.
[6] A. A. Sawant, R. Robbes, and A. Bacchelli, “On the reaction to deprecation of 25,357 clients of 4+ 1 popular java apis,” in 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2016, pp. 400–410.
[7] (2016) kik, left-pad, and npm. [Online]. Available: https://blog.npmjs.org/post/141577284765/kik-left-pad-and-npm
[8] J. Khondhu, A. Capiluppi, and K.-J. Stol, “Is it all lost? a study of inactive open source projects,” in Open Source Software: Quality Verification: 9th IFIP WG 2.13 International Conference, OSS 2013, Koper-Capodistria, Slovenia, June 25-28, 2013. Proceedings 9. Springer, 2013, pp. 61–79.
[9] I. Samoladas, L. Angelis, and I. Stamelos, “Survival analysis on the duration of open source projects,” Information and Software Technology, vol. 52, no. 9, pp. 902–922, 2010.
[10] X. Li, S. Moreschini, F. Pecorelli, and D. Taibi, “Ossara: abandonment risk assessment for embedded open source components,” IEEE Software, vol. 39, no. 4, pp. 48–53, 2022.
[11] M. Valiev, B. Vasilescu, and J. Herbsleb, “Ecosystem-level determinants of sustained activity in open-source projects: A case study of the pypi ecosystem,” in Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2018, pp. 644–655.
[12] R. G. Kula, A. Ouni, D. M. German, and K. Inoue, “On the impact of micro-packages: An empirical study of the npm javascript ecosystem,” arXiv preprint arXiv:1709.04638, 2017.
[13] J. Coelho, M. T. Valente, L. L. Silva, and E. Shihab, “Identifying unmaintained projects in github,” in Proceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, 2018, pp. 1–10.
[14] J. Coelho, M. T. Valente, L. Milen, and L. L. Silva, “Is this github project maintained? measuring the level of maintenance activity of open-source projects,” Information and Software Technology, vol. 122, p. 106274, 2020.
[15] S. Mujahid, D. E. Costa, R. Abdalkareem, E. Shihab, M. A. Saied, and B. Adams, “Toward using package centrality trend to identify packages in decline,” IEEE Transactions on Engineering Management, vol. 69, no. 6, pp. 3618–3632, 2021.
[16] A. Clauset, M. E. Newman, and C. Moore, “Finding community structure in very large networks,” Physical review E, vol. 70, no. 6, p. 066111, 2004.
[17] M. Pinzger, N. Nagappan, and B. Murphy, “Can developer-module networks predict failures?” in Proceedings of the 16th ACM SIGSOFT International Symposium on Foundations of software engineering, 2008, pp. 2–12.
[18] N. Nagappan, A. Zeller, T. Zimmermann, K. Herzig, and B. Murphy, “Change bursts as defect predictors,” in 2010 IEEE 21st international symposium on software reliability engineering. IEEE, 2010, pp. 309–318.
[19] Github. “archiving repositories - github docs”. [Online]. Available: https://docs.github.com/en/repositories/archiving-a-github-repository/archiving-repositories.
[20] Github. “archiving repositories”. [Online]. Available: https://github.blog/2017-11-08-archiving-repositories/
[21] (2015) Github project: Zzprojects. “add-on feature for entity framework”. [Online]. Available: https://github.com/zzzprojects/EntityFramework.Extended.
[22] J. Coelho and M. T. Valente, “Why modern open source projects fail,” in Proceedings of the 2017 11th Joint meeting on foundations of software engineering, 2017, pp. 186–196.
[23] Github graphql api. [Online]. Available: https://docs.github.com/en/graphql
[24] W. Xiao, H. He, W. Xu, X. Tan, J. Dong, and M. Zhou, “Recommending good first issues in github oss projects,” in Proceedings of the 44th International Conference on Software Engineering, 2022, pp. 1830–1842.
[25] (2024) Open source data labelling platform. [Online]. Available: https://labelstud.io/
[26] L. Tunstall, N. Reimers, U. E. S. Jo, L. Bates, D. Korat, M. Wasserblat, and O. Pereg, “Efficient few-shot learning without prompts,” CoRR, vol. abs/2209.11055, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2209.11055
[27] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, K. Inui, J. Jiang, V. Ng, and X. Wan, Eds. Association for Computational Linguistics, 2019, pp. 3980–3990. [Online]. Available: https://doi.org/10.18653/v1/D19-1410
[28] (2022) sentence-transformers/paraphrase-mpnet-base-v2. [Online]. Available: https://huggingface.co/sentence-transformers/paraphrase-mpnet-base-v2
[29] Gharchive. “a project to record the public github timeline, archive it, and make it easily accessible for further analysis”. [Online]. Available: https://www.gharchive.org/
[30] G. Gousios, “The ghtorrent dataset and tool suite,” in 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 2013, pp. 233–236.
[31] A. Mockus, R. T. Fielding, and J. D. Herbsleb, “Two case studies of open source software development: Apache and mozilla,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 11, no. 3, pp. 309–346, 2002.
[32] W. Xiao, H. He, W. Xu, Y. Zhang, and M. Zhou, “How early participation determines long-term sustained activity in github projects?” in Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ser. ESEC/FSE 2023. New York, NY, USA: Association for Computing Machinery, 2023, p. 29–41. [Online]. Available: https://doi.org/10.1145/3611643.3616349
[33] K. Blincoe, J. Sheoran, S. Goggins, E. Petakovic, and D. Damian, “Understanding the popular users: Following, affiliation influence and leadership on github,” Information and Software Technology, vol. 70, pp. 30–39, 2016.
[34] K. Crowston, K. Wei, J. Howison, and A. Wiggins, “Free/libre open-source software development: What we know and what we do not know,” ACM Computing Surveys (CSUR), vol. 44, no. 2, pp. 1–35, 2008.
[35] E. Kalliamvakou, G. Gousios, K. Blincoe, L. Singer, D. M. German, and D. Damian, “The promises and perils of mining github,” in Proceedings of the 11th working conference on mining software repositories, 2014, pp. 92–101.
[36] J. M. Kleinberg, “Authoritative sources in a hyperlinked environment,” Journal of the ACM (JACM), vol. 46, no. 5, pp. 604–632, 1999.
[37] M. R. Prajapati, “A survey paper on hyperlink-induced topic search (hits) algorithms for web mining,” Int J Eng, vol. 1, no. 2, p. 8, 2012.
[38] M. Goeminne and T. Mens, “Evidence for the pareto principle in open source software activity,” in the Joint Proceedings of the 1st International workshop on Model Driven Software Maintenance and 5th International Workshop on Software Quality and Maintainability. Citeseer, 2011, pp. 74–82.
[39] Y. Zhang, M. Zhou, A. Mockus, and Z. Jin, “Companies’ participation in OSS development-an empirical study of openstack,” IEEE Trans. Software Eng., vol. 47, no. 10, pp. 2242–2259, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2946156
[40] A. Ait, J. L. C. Izquierdo, and J. Cabot, “An empirical study on the survival rate of github projects,” in Proceedings of the 19th International Conference on Mining Software Repositories, 2022, pp. 365–375.
[41] Github project: 0age/homework. [Online]. Available: https://github.com/0age/HomeWork
[42] Github project: 0mniscient/discord-themes. [Online]. Available: https://github.com/0mniscient/Discord-Themes
[43] Github project: 00-evan/shattered-pixel-dungeon-gdx. [Online]. Available: https://github.com/00-Evan/shattered-pixel-dungeon-gdx
[44] M. G. Kendall, “A new measure of rank correlation,” Biometrika, vol. 30, no. 1/2, pp. 81–93, 1938.
[45] S. Prion and K. A. Haerling, “Making sense of methods and measurement: Spearman-rho ranked-order correlation coefficient,” Clinical Simulation in Nursing, vol. 10, no. 10, pp. 535–536, 2014.
[46] I. Samoladas, L. Angelis, and I. Stamelos, “Survival analysis on the duration of open source projects,” Information and Software Technology, vol. 52, no. 9, pp. 902–922, 2010. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950584910000790
[47] M. Zhou, A. Mockus, X. Ma, L. Zhang, and H. Mei, “Inflow and retention in oss communities with commercial involvement: A case study of three hybrid projects,” ACM Trans. Softw. Eng. Methodol., vol. 25, no. 2, apr 2016. [Online]. Available: https://doi.org/10.1145/2876443
[48] M. Joblin and S. Apel, “How do successful and failed projects differ? a socio-technical analysis,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 4, pp. 1–24, 2022.
[49] D. R. Cox, “Regression models and life-tables,” Journal of the Royal Statistical Society: Series B (Methodological), vol. 34, no. 2, pp. 187–202, 1972.
[50] A. Barnwal, H. Cho, and T. Hocking, “Survival regression with accelerated failure time model in xgboost,” J. Comput. Graph. Stat., vol. 31, no. 4, pp. 1292–1302, 2022. [Online]. Available: https://doi.org/10.1080/10618600.2022.2067548
[51] Y. Li, J. Ju, X. Liu, T. Gao, Z. Wang, Q. Ni, C. Ma, Z. Zhao, Y. Ren, and M. Sun, “Nomograms for predicting long-term overall survival and cancer-specific survival in patients with major salivary gland cancer: a population-based study,” Oncotarget, vol. 8, no. 15, p. 24469, 2017.
[52] V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci, “Deep neural networks and tabular data: A survey,” IEEE Transactions on Neural Networks and Learning Systems, 2022.
[53] S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” Advances in neural information processing systems, vol. 30, 2017.
[54] K. Ren, J. Qin, L. Zheng, Z. Yang, W. Zhang, L. Qiu, and Y. Yu, “Deep recurrent survival analysis,” in Proceedings of the AAAI conference on artificial intelligence, vol. 33, no. 01, 2019, pp. 4798–4805.
[55] Github project: Swifteducation/stopwatch. [Online]. Available: https://github.com/SwiftEducation/Stopwatch
[56] Github project: google/mr4c. [Online]. Available: https://github.com/google/mr4c
[57] F. E. Harrell et al., Regression modeling strategies: with applications to linear models, logistic regression, and survival analysis. Springer, 2001, vol. 608.
[58] G. Gousios, M. Pinzger, and A. v. Deursen, “An exploratory study of the pull-based software development model,” in Proceedings of the 36th international conference on software engineering, 2014, pp. 345–355.
[59] A. S. Badashian, A. Esteki, A. Gholipour, A. Hindle, and E. Stroulia, “Involvement, contribution and influence in github and stack overflow,” 2014.
[60] A. Capiluppi and M. Michlmayr, “From the cathedral to the bazaar: An empirical study of the lifecycle of volunteer community projects,” in IFIP International Conference on Open Source Systems. Springer, 2007, pp. 31–44.