In our database, for any grant we report "Related Projects" that are renewals of ongoing projects from individual investigators. We wanted to improve this by identifying "Related Projects" more generally in which we group different grants based on the text in the abstract.
We analyzed the similarities of 2013 NIH grant abstracts with lengths of at least 100 words (disregarding the public health relevance statement). For each abstract in the resulting collection of over 50,000, we calculated the frequency of each word. The similarity of different abstracts was evaluated in pairwise manner by comparing word frequencies. We then used cluster analysis to identify abstracts with the most similar word content, finding that there were over 500 of cases (~1%) of abstract pairs with highly similar text. Note that for this analysis we removed any grants that had identical project numbers or absolutely identical text, therefore reducing the number of false positives.
We show the results in the interactive plot above. Each orange circle represents a cluster of grants with similar abstract text, where each grant within the cluster is represented by a grey circle with a size proportional to the monetary value of the award. Clicking on the orange text will take you to our search tool that reveals detailed information for all grants within the cluster. Some of this information is available in the plot itself, where clicking on any orange circle reveals the title of the grants, and further clicking on the title displays the associated abstract.
Many of the text similarities can be easily explained. For example, the largest cluster with 14 similar project summaries belongs to the National Children's Study Project. This multi-institution study follows the health and development of over 100,000 children across the country. Similarly, the second largest cluster belongs to the National Network of Libraries of Medicine, where each abstract differs mostly by the names of the institutions that are funded by the given project number. As the cluster size decreases and the award types become more common, it becomes increasingly more difficult to account for text similarities between the abstracts. For example, R21 and DP2 awards have different titles and over 10x difference in monetary value, yet there are pairs of project summaries that are nearly identical. Similarly, one can find many cases of R01 awards with different titles & study section that funds them. Likewise, there were 13 awards in 2013 where a NIH R01 had highly similar text to a VA I01 grant. As a final note, it is perhaps not surprising that there are a handful of examples of related F* and R* grants, where a PI and his/her trainee have nearly identical proposals.
While it should not be the case that the details of the full research proposals are in fact the same, it would be beneficial if the project summaries were written to appropriately reflect the contents of the grant. Soon the Grantome developers hope to incorporate this new algorithm into the search results, such that the "Related Projects" will not just report continuations of individual research projects, but also different projects of related content.
If you are in need of an idea for a fundable project, perhaps one of the above abstracts might serve as the perfect blueprint? Surely the granting agencies will not mind yet another piece of text with related content.