GotFunding: A grant recommendation system based on scientific articles

Tong Zeng^1,2, Daniel E. Acuna^2 (corresponding author: [email protected])

(¹School of Information Management, Nan**g University, Nan**g 210023, China
²School of Information Studies, Syracuse University, Syracuse, NY 13244, USA)

Abstract

Obtaining funding is an important part of becoming a successful scientist. Junior faculty spend a great deal of time finding the agencies and programs that best match their research profile. But what are the factors that influence the best publication–grant matching? Some universities might employ pre-award personnel to understand these factors, but not all institutions can afford to hire them. Historical records of publications funded by grants can help us understand the matching process and also help us develop recommendation systems to automate it. In this work, we present GotFunding (Grant recOmmendaTion based on past FUNDING), a recommendation system trained on National Institutes of Health’s (NIH) grant–publication records. Our system achieves high performance (NDCG@1 = 0.945) by casting the problem as learning to rank. By analyzing the features that make predictions effective, we show that the ranking gives the most weight to 1) the year difference between publication and grant, 2) the amount of information provided in the publication, and 3) the relevance of the publication to the grant. We discuss future improvements of the system and an online tool for scientists to try.

1 Introduction

The ability of scientists to fund themselves plays an important role in their careers, sometimes propelling their productivity (Jacob and Lefgren, 2011). Scientists thus spend an enormous amount of time finding the right opportunities, writing proposals, and waiting for funding decisions (Herbert et al., 2013). Past researchers have estimated that the opportunity costs of searching for and preparing a grant might not be worth it (Gross and Bergstrom, 2019). Some solutions to this problem include less stringent criteria for junior faculty (Van den Besselaar and Sandstrom, 2015), awarding grants with a lottery (Gross and Bergstrom, 2019), or peer-funding mechanisms (Bollen et al., 2014). Here we explore yet another alternative that instead uses machine learning to suggest the best-matching grant based on a scientist’s publications. We show that we can cast the problem as a recommendation system trained on historical grant–publication data. Our work attempts to improve funding success, which plays a crucial role in today’s scientific careers.

Finding the right grant is important and several factors are involved. Scientists usually need to juggle multiple criteria including funding agencies (e.g., NSF or NIH), career stages (e.g., junior-oriented or senior/leader-oriented), award amounts (e.g., small NSF grant vs large DARPA grant), funding lengths (e.g., 1-year EAGER NSF grant or 5-year CAREER NSF grant), and call relevance (e.g., a particular program within NSF or institute in NIH) (Li and Marrongelle, 2012). Thousands of grant opportunities might be available at any given time, offering hundreds of millions of dollars combined (Boroush, 2016). These opportunities also have ramifications far beyond the recipient’s career (Lane, 2009). It is therefore hard to navigate these funding opportunities, but there should be ways to improve the process.

Several researchers have proposed ways to improve the grant review process. Bollen et al. (2014) proposed that funding agencies could distribute funding equally during a first round and that, in subsequent rounds, scientists could send a portion of this funding to other researchers whom they think deserve it. More recently, Gross and Bergstrom (2019) proposed a mechanism where grants that pass a certain (low) decision threshold go through a lottery. In simulation, the authors showed that science itself benefits more because scientists spend more time doing actual research than preparing grants. These methods, however, do not consider that scientists might not be applying to the best-matching funding opportunities. The present study addresses this gap.

While submitting a grant is time consuming and has a low probability of success (e.g., see Gross and Bergstrom, 2019; Bollen et al., 2014), these low probabilities might be related to a mismatch between the grant submitted and the agency that receives it (Crow, 2020). Rather than changing the preparation and review process, we could instead improve the quality of the matching between scientists and opportunities. Recommendation systems are a natural way of improving how scientists find relevant information such as publications (e.g., Achakulvisut et al., 2016). A similar approach could be applied to grant recommendation. Some systems exist (e.g., Elsevier’s Mendeley Funding; Mendeley, 2020), but they are closed source and difficult to evaluate. Thus, the granting process can be improved by using recommendation systems to increase submission accuracy.

In this publication, we propose to use historical data of past publication–grant relationships from NIH. We cast the problem as a learning-to-rank recommendation system and show that it can achieve high performance on validation (NDCG@1 = 0.945). We further explore the factors that maximize the quality of the match, suggesting that successful scientists match publications to temporally relevant grants and achieve high publication–grant match relevance. We describe potential improvements in the future.

2 Materials and Methods

2.1 Recommendation as Ranking

Suppose a user is associated with a set of publications $P$, where each publication contains a year and some description, such as the title and abstract. These publications could be submitted by the user, be based on the user’s browsing history, or come from the user’s publication history. Funding agencies also issue announcements, messages, or notifications that state new grants and call for proposals. We denote these as funding opportunities, $G=\{g_{1},g_{2},...,g_{k}\}$, where each funding opportunity $g_{i}$ contains information such as funding description, year, and agency. Our grant recommendation solution takes $P$ as input, produces a subset of $G$ as output $R$ (retrieval stage), scores each item in $R$ by its relevance to the publications, and returns the funding candidates ranked by relevance (ranking stage). The overall framework is shown in Figure 1.
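The two stages can be sketched as a composition of a retrieval function and a scoring function. This is a minimal illustration only: it handles a single publication rather than the full set $P$, and the names `recommend`, `retrieve_any_overlap`, and `overlap` are hypothetical stand-ins (a keyword-overlap heuristic), not the components used in GotFunding.

```python
def recommend(publication, grants, retrieve, score, top_n=5):
    """Two-stage recommendation: retrieve candidates R from G, then rank them."""
    candidates = retrieve(publication, grants)  # retrieval stage: R is a subset of G
    ranked = sorted(candidates, key=lambda g: score(publication, g), reverse=True)
    return ranked[:top_n]  # ranking stage: most relevant first


# Hypothetical toy components: shared-keyword count serves both stages.
def overlap(publication, grant):
    return len(set(publication.split()) & set(grant.split()))


def retrieve_any_overlap(publication, grants):
    return [g for g in grants if overlap(publication, g) > 0]


grants = ["cancer cell biology", "heart disease imaging", "cancer imaging methods"]
result = recommend("deep learning for cancer imaging", grants,
                   retrieve_any_overlap, overlap)
print(result[0])
```

In the system described here, the scoring function is the learned ranking model rather than a fixed heuristic.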

Figure 1: The framework of our grant recommendation solution. The orange arrows denote the training pipeline and the green arrows represent the prediction pipeline.

Since mature solutions already exist for the retrieval stage, in this exploratory research we focus on learning an effective ranking function. In the ranking stage, we need a function that assigns a matching score to each retrieved grant candidate. The ranking order based on these scores indicates the relevance between the grant candidates and the publication. Learning such a ranking function is an important task in machine learning, called learning to rank. Depending on how the loss function is optimized, learning to rank can be categorized into pointwise, pairwise, and listwise approaches (Cao et al., 2007; Li, 2011; Burges, 2010). In the pointwise approach, the loss function takes only one document into account and is optimized to predict the relevance score directly. The pairwise approach feeds a pair of documents into the loss function and minimizes the number of pairs ranked incorrectly relative to the ground truth. The listwise approach looks at the candidate list directly and tries to find the optimal ordering. In practice, the pairwise approach is more accurate than the pointwise approach, and the listwise approach is much more complex than the other two. In this paper, we use the pairwise approach. Specifically, we use the LambdaRank algorithm (Burges et al., 2006) as implemented in lightGBM (Ke et al., 2017).
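To make the pairwise idea concrete, the sketch below computes a RankNet-style logistic loss over all pairs of candidates whose labels differ. It is an illustration under stated assumptions, not the lightGBM implementation: LambdaRank additionally weights each pair by the change in NDCG that swapping the two items would cause, which is omitted here, and the function name `pairwise_logistic_loss` is hypothetical.

```python
import math


def pairwise_logistic_loss(scores, labels):
    """Average logistic loss over all pairs whose labels differ.

    scores: model scores, where a higher score means "rank earlier".
    labels: graded relevance, where a higher label means more relevant.
    """
    total, n_pairs = 0.0, 0
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:  # item i should outrank item j
                margin = scores[i] - scores[j]
                total += math.log1p(math.exp(-margin))  # small when margin is large
                n_pairs += 1
    return total / n_pairs if n_pairs else 0.0


labels = [4, 3, 2, 1, 0]
good_scores = [2.0, 1.5, 1.0, 0.5, 0.0]    # agrees with the labels
bad_scores = list(reversed(good_scores))   # fully inverted ordering
print(pairwise_logistic_loss(good_scores, labels) < pairwise_logistic_loss(bad_scores, labels))
```

The loss depends only on score differences within a list, which is why the training data must be grouped by publication.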

2.2 Datasets

2.2.1 Federal RePORTER

Federal RePORTER is an open and automated data infrastructure that collects data on federally funded research projects and their outcomes (e.g., publications and patents). Federal RePORTER includes approximately 1.15 million projects from 2000 to 2019, involving 18 agencies. Among all the agencies, the NIH accounts for 77.3% of all the projects and has the biggest funding pool (see https://federalreporter.nih.gov/Home/FAQ#faqs-panel7 for the distribution of projects over agencies). In this publication, we focus only on NIH publication–grant relationships. Each of the NIH projects contains a list of the publications acknowledging the grant. Most of these publications are from PubMed, which we now describe.

2.2.2 PubMed

PubMed is a search engine and publication repository developed and maintained by the United States National Library of Medicine (NLM) at the NIH. It mainly focuses on the fields of biomedical and health science. It provides access to over 30 million publications from MEDLINE (an NLM journal citation database), life science journals, and online books. We use these publications in our recommendation system. We downloaded the 2019 baseline and the subsequent daily updates in December 2019.

2.2.3 Statistics of the datasets

We performed data filtering and cleaning, such as removing duplicates and removing projects and publications without links in the Federal RePORTER table. We also removed sub-grants. Further, we removed outliers such as grants that yielded more than 10 publications and publications that were funded by more than 3 grants. In the end, we have 67,396 grants and 235,419 publications.

2.2.4 Training data for learning to rank

The recommendation system learns from training data that starts with a list of publications. We create an artificial ranking using the following scheme. Rank 1 is a grant that actually funded the publication. Rank 2 is the nearest-neighbor grant. Ranks 3, 4, and 5 are the grants at the first, second, and third distance quantiles relative to the publication. The distance measure is based on cosine similarity in tf-idf vector space. This initial data is therefore a list of ordered lists, one for each publication, containing five grants each. Using these lists, we then extract features that can be used to learn the ranking.
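The list construction can be sketched as follows. Several details are assumptions made for illustration: the tf-idf weighting (a smoothed idf), the reading of "quantile" as the 25%, 50%, and 75% positions of the non-funding grants sorted by cosine distance, and the helper names `tfidf_vectors`, `cosine`, and `build_ranked_list`. The sketch also assumes enough candidate grants for the four picks to be distinct.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one sparse tf-idf dict per tokenized document (smoothed idf, an assumption)."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    return [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def build_ranked_list(pub_vec, grant_vecs, funded_id):
    """Order five grant ids from rank 1 to rank 5 for one publication."""
    others = sorted(
        (gid for gid in grant_vecs if gid != funded_id),
        key=lambda gid: 1.0 - cosine(pub_vec, grant_vecs[gid]),  # ascending distance
    )
    nearest = others[0]  # rank 2: nearest neighbor
    # Assumption: ranks 3-5 sit at the 25%, 50%, and 75% positions of the sorted list.
    quantile_picks = [others[int(q * (len(others) - 1))] for q in (0.25, 0.5, 0.75)]
    return [funded_id, nearest] + quantile_picks


# Toy grants whose similarity to the publication decreases with k.
pub_vec = {"a": 1.0}
grant_vecs = {f"g{k}": {"a": 1.0, "b": float(k)} for k in range(10)}
print(build_ranked_list(pub_vec, grant_vecs, funded_id="g5"))
```

A ranking library that expects larger labels for more relevant items would need these ranks converted, for example rank 1 to label 4 and rank 5 to label 0; that mapping is also an assumption.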

2.3 Learning Features Extraction

We concatenate the publication title and abstract as the publication description. We use four fields as grant descriptions: 1) the funding agency information (e.g., full name and description), 2) the grant’s title, 3) the grant’s abstract, and 4) the union of 1 through 3. For each grant–publication pair, we extract the statistical and semantic features described in the following sections.

2.3.1 Statistical features

For each grant–publication pair, we extract 31 statistical features (see Table 1). These are standard features used in information retrieval for web search, most of them described in Qin and Liu (2013). Features related only to the publication are labelled P and features related to publication–grant pairs are labelled P-G. The notation is defined as follows:

1. A publication description consists of unique terms $q=\{q_{1},q_{2},...,q_{m}\}$ . We define the length of the publication description $\left|q\right|$ as the number of tokens it contains, with $m\leq\left|q\right|$ . Similarly, we represent a grant description as $d=\{d_{1},d_{2},...,d_{n}\}$ , where the length $\left|d\right|$ is the number of tokens $d$ contains, with $n\leq\left|d\right|$ . We denote the corpus $D$ as the collection of all the grant descriptions and $\left|D\right|$ as the total number of grants in the corpus.

2. We use $c(q_{i},q)$ to denote the number of times a publication token $q_{i}$ appears in a publication $q$ . Similarly, we use $c(q_{i},d)$ to denote the number of times a publication token $q_{i}$ appears in the grant $d$ , and $c(q_{i},D)$ to denote the number of occurrences of $q_{i}$ in the corpus $D$ .

3. The term frequencies of a publication are denoted as $tf(q)$ , the document frequency $df(q_{i})$ is the number of grants containing term $q_{i}$ , and the inverse document frequency of a publication term $q_{i}$ is denoted as $idf(q_{i})$ .

4. The $LMIR$ features are based on a set of smoothing methods for estimating the language model. The formal definition of these features is provided in Zhai and Lafferty (2001). For the Jelinek-Mercer smoothing method, we use parameter $\lambda=0.1$ . For smoothing using Dirichlet priors, we set the parameter $\mu=2000$ . For the Absolute Discount smoothing, we use parameter $\delta=0.7$ .
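The three smoothing methods can be sketched as below, using the standard forms from Zhai and Lafferty (2001) and the parameter values stated above. This is an illustrative sketch, not the authors’ code: the function name `lmir_features`, the choice to sum log-probabilities over publication tokens, and the decision to skip tokens that never occur in the grant corpus are assumptions. The grant description is assumed to be non-empty.

```python
import math
from collections import Counter


def lmir_features(pub_tokens, grant_tokens, corpus, lam=0.1, mu=2000.0, delta=0.7):
    """Log-likelihood of the publication tokens under three smoothed grant models."""
    d = Counter(grant_tokens)                                # c(w, d)
    coll = Counter(tok for doc in corpus for tok in doc)     # c(w, D)
    d_len, coll_len, d_unique = len(grant_tokens), sum(coll.values()), len(d)
    out = {"jelinek_mercer": 0.0, "dirichlet": 0.0, "absolute_discount": 0.0}
    for w in pub_tokens:
        p_coll = coll[w] / coll_len                          # collection model p(w|D)
        if p_coll == 0.0:
            continue  # assumption: ignore tokens unseen in the whole grant corpus
        probs = {
            "jelinek_mercer": (1 - lam) * d[w] / d_len + lam * p_coll,
            "dirichlet": (d[w] + mu * p_coll) / (d_len + mu),
            "absolute_discount": max(d[w] - delta, 0.0) / d_len
            + delta * d_unique / d_len * p_coll,
        }
        for name, p in probs.items():
            out[name] += math.log(p)
    return out


grant = ["a", "a", "b"]
corpus = [grant, ["b", "c"]]
features = lmir_features(["a"], grant, corpus)
print(features)
```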

Feature #	Feature	Class
1	$\sum_{q_{i}}c(q_{i},d)$	P-G
2	$\sum_{q_{i}}log(c(q_{i},d)+1)$	P-G
3	$\frac{\sum_{q_{i}}c(q_{i},d)}{\left|d\right|}$	P-G
4	$\left|d\right|$	P
5	$sum(tf(q))$	P-G
6	$min(tf(q))$	P-G
7	$max(tf(q))$	P-G
8	$mean(tf(q))$	P-G
9	$var(tf(q))$	P-G
10	$\frac{sum(tf(q))}{\left|d\right|}$	P-G
11	$\frac{min(tf(q))}{\left|d\right|}$	P-G
12	$\frac{max(tf(q))}{\left|d\right|}$	P-G
13	$\frac{mean(tf(q))}{\left|d\right|}$	P-G
14	$\frac{var(tf(q))}{\left|d\right|}$	P-G
15	$\sum_{q_{i}}\log(\frac{\left|D\right|}{c(q_{i},D)+1}+1)$	P
16	$\sum_{q_{i}}idf(q_{i})$	P
17	$\sum_{q_{i}}log(idf(q_{i})+1)$	P
18	$sum(\text{c-idf}(q))$	P-G
19	$min(\text{c-idf}(q))$	P-G
20	$max(\text{c-idf}(q))$	P-G
21	$mean(\text{c-idf}(q))$	P-G
22	$var(\text{c-idf}(q))$	P-G
23	$sum(\text{weighted-c-idf}(q))$	P-G
24	$min(\text{weighted-c-idf}(q))$	P-G
25	$max(\text{weighted-c-idf}(q))$	P-G
26	$mean(\text{weighted-c-idf}(q))$	P-G
27	$var(\text{weighted-c-idf}(q))$	P-G
28	$BM25(q,d)$	P-G
29	$LMIR.AbsoluteDiscount$	P-G
30	$LMIR.Dirichlet$	P-G
31	$LMIR.Jelinek-Mercer$	P-G
Note: $tf(q)=\{c(q_{1},q),c(q_{2},q),\cdots,c(q_{m},q)\}$
$idf(q_{i})=\log(\frac{\left|D\right|}{df(q_{i})+1})$ , where $df(q_{i})$ is the number of grants containing term $q_{i}$
$\text{c-idf}(q)=\{c(q_{k},d)\cdot idf(q_{k})\}_{k=1,\dots,m}$
$\text{weighted-c-idf}(q)=\{\frac{c(q_{k},d)}{\left|d\right|}\cdot idf(q_{k})\}_{k=1,\dots,m}$
$BM25(q,d)=\sum_{q_{i}\in q}\frac{idf(q_{i})\cdot c(q_{i},d)\cdot(k_{1}+1)}{c(q_{i},d)+k_{1}\cdot(1-b+b\cdot\frac{\left|d\right|}{avgdoclen})}$ , where $k_{1}=1.5$ and $b=0.75$ , and $avgdoclen$ refers to the average document length of the entire corpus.
P: scientist’s publication; G: grant.

Table 1: Features of the system.
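As an illustration of how the notes under Table 1 translate into computation, the sketch below implements $idf$ and $BM25$ (feature 28) as defined there, with $k_{1}=1.5$ and $b=0.75$. The function names and the toy grant corpus are hypothetical; tokenization is assumed to have been done already.

```python
import math
from collections import Counter


def idf(term, grants):
    """idf(q_i) = log(|D| / (df(q_i) + 1)), with df counted over grant descriptions."""
    df = sum(1 for g in grants if term in g)
    return math.log(len(grants) / (df + 1))


def bm25(pub_tokens, grant_tokens, grants, k1=1.5, b=0.75):
    """BM25 between a publication and one grant, summed over unique publication terms."""
    counts = Counter(grant_tokens)                       # c(q_i, d)
    avg_len = sum(len(g) for g in grants) / len(grants)  # avgdoclen
    norm = k1 * (1 - b + b * len(grant_tokens) / avg_len)
    return sum(
        idf(t, grants) * counts[t] * (k1 + 1) / (counts[t] + norm)
        for t in set(pub_tokens)
    )


# Hypothetical toy corpus of tokenized grant descriptions.
grants = [["cancer", "cell"], ["heart", "cell"], ["brain", "cell"], ["cancer", "gene"]]
pub = ["brain", "cell"]
print(bm25(pub, grants[2], grants))
```

Note that with this $idf$ definition a term that appears in all but one grant has zero weight, and a term that appears in every grant has negative weight, so very common terms contribute little or nothing to the score.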

2.3.2 Semantic features

In order to capture the semantics of the grant description and publication, we make use of distributed word representations. Inspired by the idea “you shall know a word by the company it keeps” proposed by Firth (1957), a set of techniques represents each word as a multi-dimensional vector of continuous real numbers, where each dimension captures a facet of the word’s meaning and the real number represents the strength of that meaning. Thus, semantically similar words are located close to each other in the vector space. The fastText (Bojanowski et al., 2016) word vectors are one of the popular pre-trained semantic word representations. By using character-level information, fastText achieves good performance and is able to process words that do not exist in the training corpora. We obtained a copy of the fastText vectors trained on large-scale Common Crawl (web pages) and Wikipedia data (Grave et al., 2018). Each vector has 300 dimensions.

We represent the description of a grant and of a publication as vectors by averaging the fastText vectors of the words they contain. We then use the cosine similarity between the grant and publication vectors as the semantic feature.
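A minimal sketch of this feature follows. The three-dimensional `toy_vectors` table is a hypothetical stand-in for the 300-dimensional fastText vectors, and the helper names are illustrative. Unlike fastText, which can build vectors for unseen words from character n-grams, this toy lookup simply skips words that are missing from the table.

```python
import math


def mean_vector(tokens, vectors):
    """Average the vectors of the tokens found in the lookup table."""
    known = [vectors[t] for t in tokens if t in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def semantic_similarity(pub_tokens, grant_tokens, vectors):
    """Cosine similarity between the averaged publication and grant vectors."""
    p, g = mean_vector(pub_tokens, vectors), mean_vector(grant_tokens, vectors)
    return cosine(p, g) if p is not None and g is not None else 0.0


# Hypothetical toy embedding table.
toy_vectors = {
    "tumor": [1.0, 0.0, 0.0],
    "cancer": [0.9, 0.1, 0.0],
    "galaxy": [0.0, 0.0, 1.0],
}
print(semantic_similarity(["tumor"], ["cancer"], toy_vectors))
print(semantic_similarity(["tumor"], ["galaxy"], toy_vectors))
```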

3 Experiments and Results

We first report the performance and then attempt to interpret which features the model considers important during matching.

3.1 Evaluation Metric

We use Normalized Discounted Cumulative Gain (NDCG) as our evaluation metric. NDCG is designed for non-binary relevance labels and is usually evaluated over the top $k$ search results. NDCG@k is defined as

$$NDCG@k=\frac{\sum_{i=1}^{k}\frac{2^{\mathit{rel}_{i}}-1}{\log_{2}(i+1)}}{\sum_{i=1}^{k}\frac{2^{\mathit{ideal}_{i}}-1}{\log_{2}(i+1)}}, \qquad (1)$$

where $k$ is a particular rank position, $\mathit{rel_{i}}$ is the relevance label of the item that the model places at position $i$ , and $\mathit{ideal_{i}}$ is the relevance label at position $i$ in the ideal (ground-truth) ordering. The value of NDCG ranges from 0 to 1; higher is better.
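Equation (1) can be computed directly, as in the sketch below. It assumes that larger relevance labels mean more relevant items; the example labels 4 down to 0 are one possible encoding of the five-level ranking and are an assumption, as are the function names.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items (positions are 1-based)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances_in_predicted_order, k):
    """NDCG@k: DCG of the predicted order divided by DCG of the ideal order."""
    ideal = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg if ideal_dcg else 0.0


perfect = [4, 3, 2, 1, 0]   # funding grant ranked first
swapped = [3, 4, 2, 1, 0]   # nearest neighbor wrongly placed above the funding grant
print(ndcg_at_k(perfect, 1), ndcg_at_k(swapped, 1))
```

With these labels, placing the second-best grant first gives NDCG@1 = 7/15, which shows how strongly the exponential gain penalizes mistakes at the top position.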

3.2 Ranking Function Performance

Because users mainly pay attention to the top items on the recommended list, we want the model to rank the top items with as little error as possible. In our experiment, we optimize NDCG over the top 1 and top 5 items in the ranked list. As shown in Table 2, the validation NDCG@1 is 0.945 (the highest possible score is 1) and the validation NDCG@5 is very close to it. Both scores are relatively high, indicating that GotFunding ranks the grant candidates well.

Metric	Performance
Training(NDCG@1)	0.975
Training(NDCG@5)	0.954
Validation(NDCG@1)	0.945
Validation(NDCG@5)	0.943

Table 2: The ranking function performance

3.3 Feature Importance

One of the goals of our research is to understand which factors are most effective in matching publications to grants. Using GotFunding, we can begin to understand these factors by interpreting feature importance in the LambdaRank gradient boosting algorithm. lightGBM provides the cumulative total gain of the splits that use each feature, which is commonly used as feature importance (Ke et al., 2017).

3.3.1 Feature importance analysis

We list the 20 most important features in Figure 2. The most important feature is the temporal similarity between the publication and the grant. Its importance is almost twice that of the second most important feature. The second most important feature is the amount of information in the publication, represented by the size of the publication document. The third and fourth features are related to the relevance between publication and grant. The top 10 features account for almost 50% of the overall importance. Taken together, these top features represent a combination of high relevance with high coverage.

Because our features are separated into four groups, we wanted to understand the importance of each group. These groups are funding agency information, the grant title, the grant abstract, and their union. To do this, we computed the total feature importance of all the features belonging to a group. Our analysis ranks the groups as follows: grant abstract (total importance: 1063), the combination of funding agency information, grant title, and grant abstract (total importance: 645), grant title (total importance: 440), and grant agency information (total importance: 427). These results suggest that the abstract contributed the most.
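The group-level aggregation amounts to summing per-feature importances by the field each feature was computed on. The sketch below shows one way to do this; the feature names of the form `<field>:<feature>` and the gain values are hypothetical and are not the values behind the totals reported above.

```python
from collections import defaultdict


def group_totals(gains, group_of):
    """Sum per-feature importance within each group returned by group_of(name)."""
    totals = defaultdict(float)
    for feature, gain in gains.items():
        totals[group_of(feature)] += gain
    return dict(totals)


# Hypothetical per-feature gains keyed by "<field>:<feature>".
gains = {
    "abstract:bm25": 30.0,
    "abstract:lmir_jm": 12.0,
    "title:bm25": 8.0,
    "agency:bm25": 5.0,
}
totals = group_totals(gains, lambda name: name.split(":")[0])
print(sorted(totals.items(), key=lambda item: item[1], reverse=True))
```

The same helper works for the statistical versus semantic split by passing a different `group_of` function.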

We also compared the importance of the semantic and the statistical features. We again do this by computing the total feature importance of the features that belong to each category. Our results show that the total importance of the statistical features (2791) is much larger than that of the semantic features (209).

4 Discussion and Conclusion

In our work, we aim to improve how scientists find relevant grants based on their research profile. We propose to solve this problem by building a recommendation system that learns from historical publication–grant relationships. Our results show that we can achieve a high validation performance of NDCG@1 = 0.945. Further, by analyzing what the recommendation system learned, we estimate that the most successful links between a publication and a grant occur when they are temporally relevant, the publication has large amounts of information (e.g., a long document), and there is good relevance between the publication and grant. We now discuss some limitations and future work.

One limitation of our work is that we only look at grants that were funded in the past. Funding mechanisms and publication topics might change over time. This means that there is no guarantee that a correct prediction will actually yield a successful match for a future grant. Recommendation systems, however, benefit from large amounts of data, and unless we are able to interview scientists about their opinions on publication-to-grant matching, it is hard to build a recommendation system otherwise. Finally, even if our recommendations are off by topic, they can still serve as a narrowing step during the initial stages of a search for matches.

Another limitation of our work is that we use publications that were already funded by grants. However, our recommendation is trying to solve the opposite problem, in which a scientist wants to use a publication to initiate funding. Research is still unclear on whether funding changes the direction of research, but even if it does, our recommendation could be useful to discover people who have worked on similar problems. We hope to obtain new data in the future about grants that were not funded because they did not meet the criteria of a certain problem. Thus, with data that is available but potentially harder to obtain, some of these issues could be solved.

Our recommendation system is one of the first to offer scientists the ability to match their research to past grants. We think this research direction will especially benefit those who are starting their careers and might not have the human capital to help them find relevant funding opportunities.

References

Achakulvisut, T., Acuna, D. E., Ruangrong, T., and Kording, K. (2016). Science concierge: A fast content-based recommendation system for scientific publications. PLoS ONE, 11(7).
Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
Bollen, J., Crandall, D., Junk, D., Ding, Y., and Börner, K. (2014). From funding agencies to scientific agency. EMBO Reports, 15(2):131–133.
Boroush, M. (2016). US R&D increased by more than $20 billion in both 2013 and 2014, with similar increase estimated for 2015. National Center for Science and Engineering Statistics, Info-Brief NSF, pages 16–316.
Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview.
Burges, C. J. C., Ragno, R., and Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS 2006, pages 193–200, Cambridge, MA, USA. MIT Press.
Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pages 129–136, New York, NY, USA. Association for Computing Machinery.
Crow, J. M. (2020). What to do when your grant is rejected. Nature, 578(7795):477–479.
Firth, J. R. (1957). A synopsis of linguistic theory 1930-55. 1952-59:1–32.
Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
Gross, K. and Bergstrom, C. T. (2019). Contest models highlight inherent inefficiencies of scientific funding competitions. PLoS Biology, 17(1).
Herbert, D. L., Barnett, A. G., Clarke, P., and Graves, N. (2013). On the time spent preparing grant proposals: an observational study of Australian researchers. BMJ Open, 3(5):e002800.
Jacob, B. A. and Lefgren, L. (2011). The impact of research grant funding on scientific productivity. Journal of Public Economics, 95(9-10):1168–1177.
Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pages 3149–3157, Red Hook, NY, USA. Curran Associates Inc.
Lane, J. (2009). Assessing the impact of science funding. Science, 324(5932):1273–1275.
Li, H. (2011). A short introduction to learning to rank. IEICE Transactions on Information and Systems, 94(10):1854–1862.
Li, P. and Marrongelle, K. (2012). Having success with NSF: a practical guide. John Wiley & Sons.
Mendeley, F. (2020). Mendeley funding.
Qin, T. and Liu, T. (2013). Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597.
Van den Besselaar, P. and Sandstrom, U. (2015). Early career grants, performance, and careers: A study on predictive validity of grant decisions. Journal of Informetrics, 9(4):826–838.
Zhai, C. and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2001, pages 334–342, New York, NY, USA. Association for Computing Machinery.