The goal of this proposed project is to develop a collaboration capacity framework and to evaluate the collaboration capacity of science teams at the macro, meso, and micro levels using GenBank metadata and related data sources. The framework defines scientific and technical (S&T) human capital, cyberinfrastructure, and science policy as the enablers of collaboration capacity; their impact can be measured with data production and data-to-knowledge metrics such as team size and the ratio of data to publications. GenBank metadata, the primary data source for this project, offer longitudinal coverage (1984-2018) and trace the full research lifecycle from data production to publication to patent application, creating an unprecedented opportunity to study the biomedical research enterprise. This project will derive analysis-ready datasets from GenBank metadata and combine them with statistics from NSF and NIH. The datasets will be used to develop computational models and to test hypotheses about the correlations among collaboration capacity, team size, and the connectedness of nodes, as well as the properties of disruptive nodes and their impact on productivity and innovation. The project will also triangulate the datasets with events in science policy (e.g., mandates on data sharing), public health (e.g., outbreaks and prevalent chronic diseases), and funding in order to analyze collaboration capacity and its policy implications. The data source and theoretical approach compensate for the limitations of the publication-centric data sources used in past research on collaboration networks. Because the primary data source comes from basic biomedical research, this study is positioned at the cutting edge and can yield more holistic insights into the impact of federal investment and policy on collaboration capacity.
Our future research will use this rich longitudinal data collection to mine collaboration more deeply across the data production and data-to-knowledge lifecycle, particularly in relation to the specific genes, diseases, and treatments that are central to basic and clinical biomedical research.
The trace data (metadata) about molecular sequences in GenBank contain rich information about collaboration networks from data to knowledge production. Our prior analysis shows the feasibility of using such trace data to study collaboration networks in GenBank related to outbreaks (e.g., SARS, West Nile virus) and to species important to basic biomedical research (e.g., Mus musculus). This project will investigate research collaboration networks related to other prevalent diseases.
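As a rough illustration of the metrics named above (team size, ratio of data to publications, and connectedness of nodes), the following Python sketch computes them from a handful of made-up submission records. It is not the project's pipeline: the record layout, the field names (`accession`, `authors`, `pubmed`), the author names, and all function names are hypothetical, and real GenBank metadata would first need to be parsed and have author names disambiguated.

```python
from collections import Counter
from itertools import combinations

# Hypothetical, hand-made records standing in for parsed GenBank metadata.
# "pubmed" is None when a sequence record has no linked publication.
records = [
    {"accession": "X0001", "authors": ["Lee A", "Chen B", "Diaz C"], "pubmed": "111"},
    {"accession": "X0002", "authors": ["Lee A", "Chen B"], "pubmed": "111"},
    {"accession": "X0003", "authors": ["Diaz C", "Kim D"], "pubmed": None},
    {"accession": "X0004", "authors": ["Lee A", "Kim D"], "pubmed": "222"},
]


def team_sizes(recs):
    """Number of distinct authors listed on each sequence record."""
    return {r["accession"]: len(set(r["authors"])) for r in recs}


def data_to_publication_ratio(recs):
    """Sequence records per distinct linked publication (None if no publications)."""
    pubs = {r["pubmed"] for r in recs if r["pubmed"]}
    return len(recs) / len(pubs) if pubs else None


def coauthor_edges(recs):
    """Count how often each unordered author pair appears on the same record."""
    edges = Counter()
    for r in recs:
        for a, b in combinations(sorted(set(r["authors"])), 2):
            edges[(a, b)] += 1
    return edges


def degree(edges):
    """Number of distinct collaborators per author, a simple connectedness measure."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)
    return {author: len(others) for author, others in neighbors.items()}


if __name__ == "__main__":
    edges = coauthor_edges(records)
    print("team sizes:", team_sizes(records))
    print("records per publication:", data_to_publication_ratio(records))
    print("collaborator counts:", degree(edges))
```

Degree is used here only as the simplest stand-in for connectedness; other network measures could be computed from the same edge list.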