GotFunding: A grant recommendation system based on scientific articles

Tong Zeng1,2, Daniel E. Acuna2
Corresponding author: [email protected]
1School of Information Management, Nan**g University, Nan**g 210023, China
2School of Information Studies, Syracuse University, Syracuse, NY 13244, USA
Abstract

Obtaining funding is an important part of becoming a successful scientist. Junior faculty spend a great deal of time finding the agencies and programs that best match their research profile. But what factors determine a good publication–grant match? Some universities employ pre-award personnel to understand these factors, but not all institutions can afford to hire them. Historical records of publications funded by grants can help us understand the matching process and develop recommendation systems to automate it. In this work, we present GotFunding (Grant recOmmendaTion based on past FUNDING), a recommendation system trained on grant–publication records from the National Institutes of Health (NIH). By casting the problem as learning to rank, our system achieves high performance (NDCG@1 = 0.945). By analyzing the features that make predictions effective, we find that the ranking relies most on 1) the year difference between publication and grant, 2) the amount of information provided in the publication, and 3) the relevance of the publication to the grant. We discuss future improvements of the system and an online tool for scientists to try.

1 Introduction

The ability of scientists to fund themselves plays an important role in their careers, sometimes propelling their productivity (Jacob and Lefgren, 2011). Scientists thus spend an enormous amount of time finding the right opportunities, writing proposals, and waiting for funding decisions (Herbert et al., 2013). Past researchers have estimated that the opportunity costs of searching for and preparing a grant might not be worth it (Gross and Bergstrom, 2019). Some solutions to this problem include less stringent criteria for junior faculty (Van den Besselaar and Sandstrom, 2015), awarding grants with a lottery (Gross and Bergstrom, 2019), or peer-funding mechanisms (Bollen et al., 2014). Here we explore yet another alternative, which uses machine learning to suggest the best-matching grants based on a scientist's publications. We show that we can cast the problem as a recommendation system trained on historical grant–publication data. Our work attempts to improve funding success, which plays a crucial role in today's scientific careers.

Finding the right grant is important, and several factors are involved. Scientists usually need to juggle multiple criteria, including funding agencies (e.g., NSF or NIH), career stages (e.g., junior-oriented or senior/leader-oriented), award amounts (e.g., a small NSF grant vs. a large DARPA grant), funding lengths (e.g., a 1-year EAGER NSF grant or a 5-year CAREER NSF grant), and call relevance (e.g., a particular program within NSF or institute within NIH) (Li and Marrongelle, 2012). Thousands of grant opportunities might be available at any given time, offering hundreds of millions of dollars combined (Boroush, 2016). These opportunities also have ramifications far beyond the recipient's career (Lane, 2009). Navigating these funding opportunities is therefore hard, but there should be ways to improve the process.

Several researchers have proposed ways to improve the grant review process. Bollen et al. (2014) proposed that funding agencies could distribute funding equally during a first round and that, in subsequent rounds, scientists could send a portion of this funding to other researchers who they think deserve it. More recently, Gross and Bergstrom (2019) proposed a mechanism in which grants that pass a certain (low) decision threshold go through a lottery. In simulation, the authors showed that scientists themselves benefit more because they spend more time doing actual research than preparing grants. These methods, however, do not consider that scientists may not be applying to the best-matching funding opportunities. The present study addresses this gap.

While submitting a grant is time consuming and has a low probability of success (e.g., see Gross and Bergstrom, 2019; Bollen et al., 2014), these low probabilities might be related to a mismatch between the grant submitted and the agency that receives it (Crow, 2020). Rather than changing the preparation and review process, another way of improving the granting process is to improve the quality of the matching between scientists and opportunities. Recommendation systems are a natural way of improving how scientists find relevant information such as publications (e.g., Achakulvisut et al., 2016). A similar process could be applied to grant recommendation. Some systems exist (e.g., Elsevier's Mendeley Funding; Mendeley, 2020), but they are closed source and difficult to evaluate. Thus, the granting process can be improved by increasing submission accuracy with recommendation systems.

In this publication, we propose to use historical data on past publication–grant relationships from NIH. We cast the problem as a learning-to-rank recommendation system and show that it can achieve high performance on validation (NDCG@1 = 0.945). We further explore the factors that maximize the quality of the match, suggesting that successful scientists match publications to temporally relevant grants and achieve high publication–grant match relevance. We describe potential improvements for the future.

2 Materials and Methods

2.1 Recommendation as Ranking

Suppose a user has an associated set of publications $P$, where each publication contains a year and some description, such as the title and abstract. These publications could be submitted by the user, inferred from the user's browsing history, or taken from the user's publication history. In addition, funding agencies issue announcements, messages, and notifications that state new grants and call for proposals. We denote these funding opportunities as $G = \{g_1, g_2, \ldots, g_k\}$, where each funding opportunity $g_i$ contains information such as funding description, year, and agency. Our grant recommendation solution takes $P$ as input, produces a subset $R$ of $G$ as output (retrieval stage), scores each item in $R$ by the relevance between the publication and the opportunity, and returns the funding candidates ranked by relevance (ranking stage). The overall framework is shown in Figure 1.

Figure 1: The framework of our grant recommendation solution. The orange arrows denote the training pipeline and the green arrows represent the prediction pipeline.
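The two stages described above can be sketched in a few lines. This is only an illustration of the retrieve-then-rank structure: the whitespace tokenizer, the token-overlap retrieval rule, and the Jaccard score (`tokenize`, `retrieve`, `rank`, `overlap_score`) are hypothetical stand-ins, not the components used in GotFunding, where the ranking score comes from a learned model.

```python
from typing import Callable, Dict, List, Set


def tokenize(text: str) -> Set[str]:
    """Lowercase whitespace tokenizer (a deliberately crude stand-in)."""
    return set(text.lower().split())


def retrieve(publication: str, grants: Dict[str, str], min_overlap: int = 1) -> List[str]:
    """Retrieval stage: keep grants sharing at least `min_overlap` tokens."""
    pub_tokens = tokenize(publication)
    return [gid for gid, desc in grants.items()
            if len(pub_tokens & tokenize(desc)) >= min_overlap]


def rank(publication: str, candidates: List[str], grants: Dict[str, str],
         score: Callable[[str, str], float]) -> List[str]:
    """Ranking stage: order the retrieved candidates by descending score."""
    return sorted(candidates, key=lambda gid: score(publication, grants[gid]), reverse=True)


def overlap_score(publication: str, grant: str) -> float:
    """Placeholder relevance score: Jaccard overlap of the token sets."""
    a, b = tokenize(publication), tokenize(grant)
    return len(a & b) / len(a | b) if a | b else 0.0


# Toy funding opportunities (invented for illustration).
grants = {
    "g1": "neural mechanisms of memory in mice",
    "g2": "memory consolidation during sleep",
    "g3": "soil chemistry of wetlands",
}
publication = "sleep and memory consolidation"
candidates = retrieve(publication, grants)          # subset R of G
recommended = rank(publication, candidates, grants, overlap_score)
```

In the actual system, `overlap_score` would be replaced by the learned ranking function discussed next.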

Because mature solutions already exist for the retrieval stage, in this exploratory research we focus on learning an effective ranking function. In the ranking stage, we need a function that assigns a matching score to each retrieved grant candidate. The ranking order based on these scores indicates the relevance of the grant candidates to the publication. Learning such a ranking function is an important task in machine learning called learning to rank. Depending on how the loss function is optimized, learning-to-rank methods can be categorized into pointwise, pairwise, and listwise approaches (Cao et al., 2007; Li, 2011; Burges, 2010). In the pointwise approach, the loss function takes only one document into account and is optimized to predict the relevance score directly. The pairwise approach feeds a pair of documents into the loss function and minimizes the number of pairs that are ranked incorrectly relative to the ground truth. The listwise approach looks at the candidate list directly and tries to find the optimal ordering. In practice, the pairwise approach tends to be more accurate than the pointwise approach, and the listwise approach is considerably more complex than either. In this paper, we use the pairwise approach. Specifically, we use the LambdaRank algorithm (Burges et al., 2006) as implemented in LightGBM (Ke et al., 2017).
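To make the pairwise idea concrete, the following is a minimal sketch of how LambdaRank-style "lambdas" can be computed for one publication's candidate list, following the general description in Burges (2010): each mis-ordered pair contributes a logistic term weighted by the change in NDCG that swapping the pair would cause. It is not LightGBM's implementation. The function name `lambdarank_lambdas`, the sign convention (positive means "push the score up"), and the assumption that labels are relevance grades where higher is better are all choices made for this illustration.

```python
import math
from typing import List


def gain(label: int) -> float:
    """Exponential gain used by NDCG."""
    return 2.0 ** label - 1.0


def discount(position: int) -> float:
    """Logarithmic position discount; `position` is 0-based."""
    return 1.0 / math.log2(position + 2)


def lambdarank_lambdas(scores: List[float], labels: List[int], sigma: float = 1.0) -> List[float]:
    """Return, for each candidate, the net push on its score (positive = move up)."""
    n = len(scores)
    order = sorted(range(n), key=lambda i: scores[i], reverse=True)
    position = {doc: p for p, doc in enumerate(order)}
    ideal_dcg = sum(gain(l) * discount(p)
                    for p, l in enumerate(sorted(labels, reverse=True)))
    lambdas = [0.0] * n
    if ideal_dcg == 0.0:
        return lambdas  # no relevant candidate, nothing to learn from
    for i in range(n):
        for j in range(n):
            if labels[i] <= labels[j]:
                continue  # only consider pairs where i should rank above j
            # Absolute change in NDCG if i and j swapped positions.
            delta_ndcg = abs((gain(labels[i]) - gain(labels[j]))
                             * (discount(position[i]) - discount(position[j]))) / ideal_dcg
            # Pairwise logistic term: large when the pair is currently mis-ordered.
            rho = 1.0 / (1.0 + math.exp(sigma * (scores[i] - scores[j])))
            push = sigma * rho * delta_ndcg
            lambdas[i] += push
            lambdas[j] -= push
    return lambdas


# Toy list: the relevant grant (label 1) is currently scored below an irrelevant one.
lambdas = lambdarank_lambdas(scores=[0.1, 0.9], labels=[1, 0])
```

In a gradient-boosting setting, these per-candidate values play the role of gradients that each new tree is fit to; the pushes within one list always sum to zero.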

2.2 Datasets

2.2.1 Federal RePORTER

Federal RePORTER is an open and automated data infrastructure that collects data on federally funded research projects and their outcomes (e.g., publications and patents). Federal RePORTER includes approximately 1.15 million projects from 2000 to 2019, involving 18 agencies. Among all the agencies, the NIH accounts for 77.3% of all projects and has the biggest funding pool (see https://federalreporter.nih.gov/Home/FAQ#faqs-panel7 for the distribution of projects over agencies). In this publication, we focus only on NIH publication–grant relationships. Each NIH project contains a list of the publications acknowledging the grant. Most of these publications are from PubMed, which we now describe.

2.2.2 PubMed

PubMed is a search engine and publication repository developed and maintained by the United States National Library of Medicine (NLM) at the NIH. It mainly focuses on the fields of biomedical and health science. It provides access to over 30 million publications from MEDLINE (an NLM journal citation database), life science journals, and online books. We use these publications in our recommendation system. We downloaded the 2019 baseline and the subsequent daily updates in December 2019.

2.2.3 Statistics of the datasets

We performed data filtering and cleaning, such as removing duplicates and removing projects and publications without links in the Federal RePORTER table. We also removed sub-grants. Further, we removed outliers such as grants that yielded more than 10 publications and publications that were funded by more than 3 grants. In the end, we have 67,396 grants and 235,419 publications.

2.2.4 Training data for learning to rank

The recommendation system learns from training data organized as one ranked list per publication. We create an artificial ranking using the following scheme. Rank 1 is a grant that actually funded the publication. Rank 2 is the nearest-neighbor grant. Ranks 3, 4, and 5 are the grants at the first, second, and third distance quantiles relative to the publication. The distance measure is based on cosine similarity in the tf-idf vector space. The initial data is therefore a list of ordered lists, one for each publication, containing five grants each. Using these lists, we then extract features that can be used to learn the ranking.
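The ranking scheme above can be sketched as follows. Several details are assumptions made for this illustration, not taken from the paper: the smoothed idf weighting in `tfidf_vectors`, the whitespace tokenizer, and, most importantly, the reading of "first, second, and third distance quantile" as the 25%, 50%, and 75% positions of the non-funding grants sorted by cosine distance. The toy texts and identifiers are invented.

```python
import math
from collections import Counter
from typing import Dict, List

Vector = Dict[str, float]


def tfidf_vectors(docs: Dict[str, str]) -> Dict[str, Vector]:
    """Sparse tf-idf vectors with a smoothed idf (an assumed weighting)."""
    tokens = {key: text.lower().split() for key, text in docs.items()}
    n_docs = len(tokens)
    df = Counter(term for toks in tokens.values() for term in set(toks))
    return {
        key: {term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
              for term, count in Counter(toks).items()}
        for key, toks in tokens.items()
    }


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def artificial_ranking(pub: str, funding_grant: str, grant_ids: List[str],
                       vectors: Dict[str, Vector]) -> List[str]:
    """Return five grant ids ordered from rank 1 to rank 5."""
    others = [g for g in grant_ids if g != funding_grant]
    if len(others) < 4:
        raise ValueError("need at least four non-funding grants")
    # Most similar (smallest cosine distance) first.
    others.sort(key=lambda g: cosine(vectors[pub], vectors[g]), reverse=True)
    last = len(others) - 1
    # Assumption: nearest neighbor, then the quartile positions of the sorted list.
    picks = [0] + [math.ceil(q * last) for q in (0.25, 0.5, 0.75)]
    return [funding_grant] + [others[i] for i in picks]


docs = {
    "pub": "sleep and memory consolidation in mice",
    "g_funded": "memory consolidation during sleep",
    "g1": "sleep and memory in mice",
    "g2": "memory in aging adults",
    "g3": "sleep apnea screening",
    "g4": "soil chemistry of wetlands",
    "g5": "protein folding dynamics",
    "g6": "mice models of diabetes",
}
vectors = tfidf_vectors(docs)
grant_ids = [key for key in docs if key != "pub"]
ranking = artificial_ranking("pub", "g_funded", grant_ids, vectors)
```

By construction, the grants in positions 2 through 5 are ordered from most to least similar to the publication, which gives the learner graded negatives rather than only random ones.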

2.3 Learning Features Extraction

We concatenate the publication title and abstract to form the publication description. For the grant description, we consider four alternatives: 1) the funding agency information (e.g., full name and description), 2) the grant's title, 3) the grant's abstract, and 4) the union of 1 through 3. For each grant-publication pair, we extract the statistical and semantic features described in the next sections.

2.3.1 Statistical features

For each grant-publication pair, we extract 31 statistical features (see Table 1). These are standard features used in information retrieval for web search, most of them described in Qin and Liu (2013). Features related only to the publication are labelled P, and features related to publication–grant pairs are labelled P-G. The notation is defined as follows:

1. A publication description consists of unique terms $q = \{q_1, q_2, \ldots, q_m\}$. We define the length of the publication description, $|q|$, as the number of tokens it contains, with $m \leq |q|$. Similarly, we represent a grant description as $d = \{d_1, d_2, \ldots, d_n\}$, where the length $|d|$ is the number of tokens $d$ contains, with $n \leq |d|$. We denote by $D$ the corpus, i.e., the collection of all grant descriptions, and by $|D|$ the total number of grants in the corpus.

2. We use $c(q_i, q)$ to denote the number of times a publication token $q_i$ appears in a publication $q$. Similarly, we use $c(q_i, d)$ to denote the number of times a publication token $q_i$ appears in the grant $d$, and $c(q_i, D)$ to denote the number of occurrences of $q_i$ in the corpus $D$.

3. The term frequencies of a publication are denoted as $tf(q)$, the document frequency $df(q_i)$ is the number of grants containing term $q_i$, and the inverse document frequency of a publication term $q_i$ is denoted as $idf(q_i)$.

4. The $LMIR$ features are based on a set of smoothing methods for estimating the language model. The formal definition of these features is provided in Zhai and Lafferty (2001). For the Jelinek-Mercer smoothing method, we use parameter $\lambda = 0.1$. For smoothing using Dirichlet priors, we set the parameter $\mu = 2000$. For the Absolute Discount smoothing, we use parameter $\delta = 0.7$.
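As an illustration of item 4, the sketch below computes the log query likelihood of a publication under a grant's language model with the three smoothing methods, using the standard formulas from Zhai and Lafferty (2001) and the parameter values stated above. The function name `lmir_scores`, the pre-tokenized list inputs, and the decision to skip publication terms that never occur in the grant corpus are assumptions of this sketch; the paper does not specify how such terms are handled.

```python
import math
from collections import Counter
from typing import Dict, List


def lmir_scores(query: List[str], doc: List[str], corpus: List[List[str]],
                lam: float = 0.1, mu: float = 2000.0, delta: float = 0.7) -> Dict[str, float]:
    """Log query likelihood of `query` under three smoothed models of `doc`."""
    collection = Counter(token for d in corpus for token in d)
    collection_len = sum(collection.values())
    if not doc or collection_len == 0:
        raise ValueError("document and corpus must contain at least one token")
    counts = Counter(doc)
    doc_len, unique_terms = len(doc), len(counts)
    scores = {"jelinek_mercer": 0.0, "dirichlet": 0.0, "absolute_discount": 0.0}
    for term in query:
        p_coll = collection[term] / collection_len  # collection language model
        if p_coll == 0.0:
            continue  # assumption: ignore terms that never occur in the corpus
        c = counts[term]
        # Jelinek-Mercer: linear interpolation with the collection model.
        p_jm = (1.0 - lam) * c / doc_len + lam * p_coll
        # Dirichlet prior: pseudo-counts proportional to the collection model.
        p_dir = (c + mu * p_coll) / (doc_len + mu)
        # Absolute discount: subtract delta from seen counts, redistribute the mass.
        p_abs = max(c - delta, 0.0) / doc_len + delta * unique_terms / doc_len * p_coll
        scores["jelinek_mercer"] += math.log(p_jm)
        scores["dirichlet"] += math.log(p_dir)
        scores["absolute_discount"] += math.log(p_abs)
    return scores


# Toy corpus of two tokenized grant descriptions (invented for illustration).
corpus = [["memory", "sleep"], ["soil", "chemistry"]]
matching = lmir_scores(["memory"], corpus[0], corpus)
unrelated = lmir_scores(["memory"], corpus[1], corpus)
```

A grant that contains the publication's terms receives a higher (less negative) score under all three smoothing methods than one that does not.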

| Feature # | Feature | Class |
|---|---|---|
| 1 | $\sum_{q_i} c(q_i, d)$ | P-G |
| 2 | $\sum_{q_i} \log(c(q_i, d) + 1)$ | P-G |
| 3 | $\frac{\sum_{q_i} c(q_i, d)}{\lvert d \rvert}$ | P-G |
| 4 | $\lvert d \rvert$ | P |
| 5 | $sum(tf(q))$ | P-G |
| 6 | $min(tf(q))$ | P-G |
| 7 | $max(tf(q))$ | P-G |
| 8 | $mean(tf(q))$ | P-G |
| 9 | $var(tf(q))$ | P-G |
| 10 | $\frac{sum(tf(q))}{\lvert d \rvert}$ | P-G |
| 11 | $\frac{min(tf(q))}{\lvert d \rvert}$ | P-G |
| 12 | $\frac{max(tf(q))}{\lvert d \rvert}$ | P-G |
| 13 | $\frac{mean(tf(q))}{\lvert d \rvert}$ | P-G |
| 14 | $\frac{var(tf(q))}{\lvert d \rvert}$ | P-G |
| 15 | $\sum_{q_i} \log\left(\frac{\lvert D \rvert}{c(q_i, D) + 1} + 1\right)$ | P |
| 16 | $\sum_{q_i} idf(q_i)$ | P |
| 17 | $\sum_{q_i} \log(idf(q_i) + 1)$ | P |
| 18 | $sum(\text{c-idf}(q))$ | P-G |
| 19 | $min(\text{c-idf}(q))$ | P-G |
| 20 | $max(\text{c-idf}(q))$ | P-G |
| 21 | $mean(\text{c-idf}(q))$ | P-G |
| 22 | $var(\text{c-idf}(q))$ | P-G |
| 23 | $sum(\text{weighted-c-idf}(q))$ | P-G |
| 24 | $min(\text{weighted-c-idf}(q))$ | P-G |
| 25 | $max(\text{weighted-c-idf}(q))$ | P-G |
| 26 | $mean(\text{weighted-c-idf}(q))$ | P-G |
| 27 | $var(\text{weighted-c-idf}(q))$ | P-G |
| 28 | $BM25(q, d)$ | P-G |
| 29 | $LMIR.AbsoluteDiscount$ | P-G |
| 30 | $LMIR.Dirichlet$ | P-G |
| 31 | $LMIR.Jelinek\text{-}Mercer$ | P-G |

Note: $tf(q) = \{c(q_1, q), c(q_2, q), \cdots, c(q_m, q)\}$
$idf(q_i) = \log\left(\frac{|D|}{df(q_i) + 1}\right)$, where $df(q_i)$ is the number of grants containing term $q_i$
$\text{c-idf}(q) = \{c(q_k, d) \cdot idf(q_k)\}_{k=1,\dots,m}$
$\text{weighted-c-idf}(q) = \left\{\frac{c(q_k, d)}{|d|} \cdot idf(q_k)\right\}_{k=1,\dots,m}$
$BM25(q, d) = \sum_{q_i \in q} \frac{idf(q_i) \cdot c(q_i, d) \cdot (k_1 + 1)}{c(q_i, d) + k_1 \cdot \left(1 - b + b \cdot \frac{|d|}{avgdoclen}\right)}$, where $k_1 = 1.5$ and $b = 0.75$, and $avgdoclen$ refers to the average document length over the entire corpus.
P: scientist's publication; G: grant.
Table 1: Features of the system.
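As a worked illustration of feature 28, the sketch below evaluates the BM25 formula from the note to Table 1 with $k_1 = 1.5$, $b = 0.75$, and $idf(q_i) = \log(|D| / (df(q_i) + 1))$. The function names and the pre-tokenized list inputs are assumptions of this sketch; the toy corpus is invented. Note that with this idf definition a term that appears in nearly every grant gets a zero or negative weight.

```python
import math
from collections import Counter
from typing import List


def idf(term: str, corpus: List[List[str]]) -> float:
    """idf(q_i) = log(|D| / (df(q_i) + 1)), as defined in the note to Table 1."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))


def bm25(query_terms: List[str], doc: List[str], corpus: List[List[str]],
         k1: float = 1.5, b: float = 0.75) -> float:
    """BM25 score of a grant description `doc` for the publication terms."""
    avg_doc_len = sum(len(d) for d in corpus) / len(corpus)
    counts = Counter(doc)
    score = 0.0
    for term in set(query_terms):  # q is the set of unique publication terms
        c = counts[term]
        if c == 0:
            continue  # terms absent from the grant contribute nothing
        length_norm = 1.0 - b + b * len(doc) / avg_doc_len
        score += idf(term, corpus) * c * (k1 + 1.0) / (c + k1 * length_norm)
    return score


# Toy corpus of three tokenized grant descriptions (invented for illustration).
corpus = [
    ["memory", "sleep", "mice"],
    ["soil", "chemistry", "wetlands"],
    ["protein", "folding", "dynamics"],
]
publication_terms = ["memory", "sleep"]
scores = [bm25(publication_terms, doc, corpus) for doc in corpus]
```

In this toy case every grant has the average length, so the length normalization equals 1 and each matching term contributes exactly its idf.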

2.3.2 Semantic features

To capture the semantics of the grant description and the publication, we make use of distributed word representations. Inspired by the idea that “you shall know a word by the company it keeps” (Firth, 1957), a family of techniques represents each word as a multi-dimensional vector of continuous real numbers, where each dimension captures a facet of the word's meaning and the real number represents the strength of that facet. Thus, semantically similar words are located close to each other in the vector space. fastText (Bojanowski et al., 2016) provides one of the popular sets of pre-trained word vectors. By using character-level information, fastText achieves good performance and is able to process words that do not appear in the training corpora. We obtained a copy of the fastText vectors trained on large-scale Common Crawl (web pages) and Wikipedia data (Grave et al., 2018). Each vector has 300 dimensions.

We represent the description of a grant and of a publication as vectors by averaging the fastText vectors of the words they contain. We then use the cosine similarity between the grant and publication vectors as the semantic feature.
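The semantic feature can be sketched as below. To keep the example self-contained, it uses a tiny hand-made 3-dimensional lookup table (`toy_embeddings`) instead of the 300-dimensional fastText vectors, and it simply skips words missing from the table; both are assumptions of this sketch (fastText itself can build vectors for unseen words from character n-grams).

```python
import math
from typing import Dict, List, Sequence


def average_vector(tokens: Sequence[str], embeddings: Dict[str, List[float]],
                   dim: int) -> List[float]:
    """Average the word vectors of the tokens found in `embeddings`."""
    total = [0.0] * dim
    found = 0
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # assumption: ignore words without a vector
        total = [t + v for t, v in zip(total, vec)]
        found += 1
    return [t / found for t in total] if found else total


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two dense vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Hypothetical 3-dimensional embeddings standing in for fastText vectors.
toy_embeddings = {
    "memory": [1.0, 0.0, 0.0],
    "sleep": [0.0, 1.0, 0.0],
    "soil": [0.0, 0.0, 1.0],
}
publication_vec = average_vector(["memory", "sleep"], toy_embeddings, dim=3)
related_grant_vec = average_vector(["memory", "research"], toy_embeddings, dim=3)
unrelated_grant_vec = average_vector(["soil"], toy_embeddings, dim=3)

related_similarity = cosine_similarity(publication_vec, related_grant_vec)
unrelated_similarity = cosine_similarity(publication_vec, unrelated_grant_vec)
```

Averaging discards word order, so the resulting feature measures topical closeness rather than phrasing.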

3 Experiments and Results

We first report the performance and then attempt to interpret which features the model considers important during matching.

3.1 Evaluation Metric

We use Normalized Discounted Cumulative Gain (NDCG) as our evaluation metric. NDCG is designed for non-binary relevance labels and is usually evaluated over the top $k$ search results. NDCG@k is defined as

$$\mathrm{NDCG@}k = \frac{\sum_{i=1}^{k} \frac{2^{\mathit{rel}_i} - 1}{\log_2(i+1)}}{\sum_{i=1}^{k} \frac{2^{\mathit{ideal}_i} - 1}{\log_2(i+1)}}, \qquad (1)$$

where $k$ is a particular rank position, $\mathit{rel}_i$ is the relevance at position $i$ in the predicted order, and $\mathit{ideal}_i$ is the relevance at position $i$ in the ideal order (ground truth). The value of NDCG ranges from 0 to 1; the higher the better.
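Equation (1) can be checked with a short sketch. The input is the list of true relevance labels arranged in the order the model ranked the candidates; the ideal order is obtained by sorting those labels in decreasing order. The function names are chosen for this illustration, and returning 0 when no candidate is relevant is a convention assumed here.

```python
import math
from typing import List


def dcg_at_k(relevances: List[int], k: int) -> float:
    """Discounted cumulative gain over the first k positions (positions are 1-based)."""
    return sum((2 ** rel - 1) / math.log2(i + 1)
               for i, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(relevances_in_predicted_order: List[int], k: int) -> float:
    """NDCG@k as in Equation (1)."""
    ideal_order = sorted(relevances_in_predicted_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    if ideal_dcg == 0.0:
        return 0.0  # assumed convention when no candidate is relevant
    return dcg_at_k(relevances_in_predicted_order, k) / ideal_dcg


perfect = ndcg_at_k([3, 2, 1, 0], k=4)   # already in the ideal order
swapped = ndcg_at_k([2, 3], k=2)         # top two candidates are swapped
top_miss = ndcg_at_k([0, 1], k=1)        # relevant item missed at position 1
```

Because of the logarithmic discount, errors near the top of the list cost more than errors further down, which is why NDCG@1 is the strictest of the reported cut-offs.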

3.2 Ranking Function Performance

Because users mainly pay attention to the top items of the recommended list, we want the top items ranked by our model to contain as few errors as possible. In our experiment, we optimize the NDCG over the top 1 and top 5 items of the ranked list. As shown in Table 2, the validation NDCG@1 is 0.945 (the highest possible score is 1), and the NDCG@5 is very close to the NDCG@1. Both scores are relatively high, indicating that GotFunding can rank the grant candidates well.

| Metric | Performance |
|---|---|
| Training (NDCG@1) | 0.975 |
| Training (NDCG@5) | 0.954 |
| Validation (NDCG@1) | 0.945 |
| Validation (NDCG@5) | 0.943 |

Table 2: The ranking function performance.

3.3 Feature Importance

One of the goals of our research is to understand which factors are most effective at matching publications to grants. Using GotFunding, we can begin to understand these factors by interpreting feature importance in the LambdaRank gradient boosting model. LightGBM provides the cumulative total gain of the splits that use each feature, which is commonly used as a measure of feature importance (Ke et al., 2017).

3.3.1 Feature importance analysis

We list the 20 most important features in Figure 2. The most important feature is the temporal similarity between the publication and the grant. Its importance is almost twice that of the second most important feature. The second feature is the amount of information in the publication, represented by the size of the publication document. The third and fourth features are related to the relevance between publication and grant. The top 10 features account for almost 50% of the overall importance. Taken together, these top three features represent a combination of high relevance with high coverage.

Figure 2: Top 20 feature importance. The APP_[1,2,3,4] in the feature name denotes the four approaches used for the grant description. The Feature_#[1-31] corresponds to Table 1. The top three features are year difference between publication and grant, information content of publication, and relevance between publication and grant.

Because our features are separated into four groups, we wanted to understand the importance of each of these groups. The groups are funding agency information, the grant title, the grant abstract, and their union. To do this, we computed the total feature importance of all the features belonging to a group. Our analysis ranks the groups by importance as follows: grant abstract (total importance: 1063), the combination of funding agency information, grant title, and grant abstract (total importance: 645), grant title (total importance: 440), and funding agency information (total importance: 427). These results suggest that the abstract contributed the most.

We also analyzed the relative importance of the semantic and the statistical features. We again do this by computing the total feature importance of the features that belong to each of these two categories. Our results show that the statistical features (total importance: 2791) are more important than the semantic features (total importance: 209).
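The totals reported in this section are easier to compare as shares. The short sketch below normalizes each breakdown separately, using only the totals stated in the text; the dictionary keys and the helper name `importance_shares` are labels chosen for this illustration.

```python
from typing import Dict


def importance_shares(totals: Dict[str, float]) -> Dict[str, float]:
    """Convert total importances into fractions that sum to 1."""
    overall = sum(totals.values())
    return {name: value / overall for name, value in totals.items()}


# Totals as reported in the text for the four grant-description groups.
group_totals = {
    "grant abstract": 1063,
    "union of all fields": 645,
    "grant title": 440,
    "funding agency information": 427,
}
# Totals as reported in the text for the two feature categories.
category_totals = {"statistical": 2791, "semantic": 209}

group_shares = importance_shares(group_totals)
category_shares = importance_shares(category_totals)
```

Under this normalization, the grant abstract accounts for roughly 41% of the importance assigned to the four groups, and the statistical features for roughly 93% of the statistical-plus-semantic total.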

4 Discussion and Conclusion

In our work, we aim to improve how scientists find relevant grants based on their research profile. We propose to solve this problem by building a recommendation system that learns from historical publication–grant relationships. Our results show that we can achieve a high performance of NDCG@1 = 0.945 on validation (0.975 on training). Further, by analyzing what the recommendation system learned, we estimate that the most successful links between a publication and a grant occur when they are temporally relevant, the publication contains a large amount of information (e.g., a long document), and there is good relevance between the publication and the grant. We now discuss some limitations and future work.

One limitation of our work is that we only look at grants that were funded in the past. Funding mechanisms and publication topics might both change over time. This means that there is no guarantee that a correct prediction will actually yield a successful match for a future grant. Recommendation systems, however, benefit from large amounts of data, and unless we are able to interview scientists and ask their opinion on publication-to-grant matching, it is hard to build a recommendation system otherwise. Finally, even if our recommendations are off by topic, they can still serve as a narrowing step during the initial stages of match searches.

Another limitation of our work is that we use publications that were already funded by grants. Our recommendation system, however, is trying to solve the opposite problem, in which a scientist wants to use a publication to initiate funding. Research is still unclear on whether funding changes the direction of research, but even if it does, our recommendations could be useful for discovering people who have worked on similar problems. We hope to obtain new data in the future about grants that were not funded because they did not meet the criteria of a certain problem. Thus, with data that is available but potentially harder to obtain, some of these issues could be solved.

Our recommendation system is one of the first to offer scientists the ability to match their research to past grants. We think this research direction will especially benefit those who are starting their careers and might not have the human capital to help in finding relevant funding opportunities.

References

  • Achakulvisut, T., Acuna, D. E., Ruangrong, T., and Kording, K. (2016). Science concierge: A fast content-based recommendation system for scientific publications. PLoS ONE, 11(7).
  • Bojanowski, P., Grave, E., Joulin, A., and Mikolov, T. (2016). Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.
  • Bollen, J., Crandall, D., Junk, D., Ding, Y., and Börner, K. (2014). From funding agencies to scientific agency. EMBO Reports, 15(2):131–133.
  • Boroush, M. (2016). U.S. R&D increased by more than $20 billion in both 2013 and 2014, with similar increase estimated for 2015. National Center for Science and Engineering Statistics, Info-Brief NSF, pages 16–316.
  • Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview.
  • Burges, C. J. C., Ragno, R., and Le, Q. V. (2006). Learning to rank with nonsmooth cost functions. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS 2006, pages 193–200, Cambridge, MA, USA. MIT Press.
  • Cao, Z., Qin, T., Liu, T.-Y., Tsai, M.-F., and Li, H. (2007). Learning to rank: From pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, ICML 2007, pages 129–136, New York, NY, USA. Association for Computing Machinery.
  • Crow, J. M. (2020). What to do when your grant is rejected. Nature, 578(7795):477–479.
  • Firth, J. R. (1957). A synopsis of linguistic theory 1930-55. 1952-59:1–32.
  • Grave, E., Bojanowski, P., Gupta, P., Joulin, A., and Mikolov, T. (2018). Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).
  • Gross, K. and Bergstrom, C. T. (2019). Contest models highlight inherent inefficiencies of scientific funding competitions. PLoS Biology, 17(1).
  • Herbert, D. L., Barnett, A. G., Clarke, P., and Graves, N. (2013). On the time spent preparing grant proposals: An observational study of Australian researchers. BMJ Open, 3(5):e002800.
  • Jacob, B. A. and Lefgren, L. (2011). The impact of research grant funding on scientific productivity. Journal of Public Economics, 95(9-10):1168–1177.
  • Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., and Liu, T.-Y. (2017). LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS 2017, pages 3149–3157, Red Hook, NY, USA. Curran Associates Inc.
  • Lane, J. (2009). Assessing the impact of science funding. Science, 324(5932):1273–1275.
  • Li, H. (2011). A short introduction to learning to rank. IEICE Transactions on Information and Systems, 94(10):1854–1862.
  • Li, P. and Marrongelle, K. (2012). Having success with NSF: A practical guide. John Wiley & Sons.
  • Mendeley, F. (2020). Mendeley funding.
  • Qin, T. and Liu, T. (2013). Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597.
  • Van den Besselaar, P. and Sandstrom, U. (2015). Early career grants, performance, and careers: A study on predictive validity of grant decisions. Journal of Informetrics, 9(4):826–838.
  • Zhai, C. and Lafferty, J. (2001). A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2001, pages 334–342, New York, NY, USA. Association for Computing Machinery.