In this proposal, we bring together a unified team with a strong track record of developing secure and scalable software systems to support flagship scientific efforts, such as the All of Us Research Program, the Genomic Data Commons (GDC), and the Human Cell Atlas (HCA). Our group will leverage these experiences, and the software developed for them, to create an ecosystem of applications that will both serve the needs of the AnVIL and interoperate with other NIH data resources. We will accomplish this through the following Aims:
Aim 1 (Software Engineering): Leverage existing software capabilities to create tools for storing, sharing, and analyzing AnVIL datasets at unlimited scale. During the past five years, our groups have created a suite of modular and open source software capabilities that address key needs in genomic data science. We will leverage these existing capabilities and extend them in novel directions to address AnVIL-specific scientific goals relating to human genetics and functional genomics.
Aim 2 (Data Engineering): Curate data and metadata resources so that they are easily accessible. The AnVIL will not only be a suite of software services, but also a vast repository of genotypic and phenotypic information. For this resource to be usable by the community, it must be organized, curated, and made accessible. We will accomplish this by processing genomic datasets using a consistent set of best-practices pipelines, and mapping phenotypes to a common data model.
Aim 3 (Operations): Stand up and support a data environment for the AnVIL community, and integrate it with other NIH resources as part of a federated NIH-wide genomic data commons. The modular components of Aim 1 are critical building blocks, but they alone are not enough to meet the needs of the AnVIL; they must also be stood up as services and integrated into a coherent entity, which we call a "data environment."
We propose to create an AnVIL data environment that will enable researchers to access datasets in a secure, compliant, and facile manner.
The guiding principle of these efforts is that progress in genomic science will happen most rapidly if there is a diversity of solutions created by a plurality of groups. Towards that end, our approach to engineering the software components of Aim 1, curating the datasets of Aim 2, and operating the software services of Aim 3 is to catalyze an ecosystem of activity around the AnVIL. Our proposal focuses not only on creating and operating software services ourselves, but also on incorporating third-party solutions. We propose to accomplish this by architecting the AnVIL data environment according to the following principles: (i) modularity, (ii) openness, (iii) community engagement, (iv) standardization, and (v) interoperability.
The AnVIL Data Ecosystem
Project Narrative
In this proposal, we bring together a unified team with a strong track record of developing secure and scalable software systems to support flagship scientific efforts, such as the All of Us Research Program, the Genomic Data Commons (GDC), and the Human Cell Atlas (HCA). Our group will leverage these experiences, and the software developed for them, to create an ecosystem of cloud-based applications that will enable the NHGRI to store, share and analyze datasets at unlimited scale. Importantly, this architecture will interoperate with other key NIH data environments as part of a federated genomic data commons.