During the last decade, next-generation sequencing (NGS) applications have expanded to include measurement of dynamic outcomes underlying genomic function in development and disease. Measurements of functional elements that act at the protein and RNA levels, and of regulatory elements that control gene activity, are at the core of studies undertaken by large consortia and individual labs alike. These measurements introduce levels of variability that create data-analytic challenges in distinguishing unwanted or uninteresting sources of variability from biologically relevant signals. Furthermore, new technologies and improved data-analytic ideas are creating a need for new mapping algorithms that can be deployed on increasingly large datasets. While existing tools have provided effective ways to process and analyze data in functional genomics studies, new technologies, more complex biological questions, and the availability of increasingly complete datasets are posing new challenges. Single-cell RNA-seq and single-cell ATAC-seq technologies in particular have introduced complexities that current tools are not optimized to address. Our team has extensive experience developing computational tools and statistical methodology for functional genomics, disseminated as open-source software. Many of our methods have become standards among users of high-throughput technologies and are commonly included in standard pipelines. Combined, these software packages receive hundreds of thousands of downloads each year, and the papers describing the methods have been cited tens of thousands of times. Furthermore, Dr. Irizarry (PI) is a leader in the Bioconductor project, one of the most widely used open-source projects for the analysis of high-throughput genomics data, which has greatly facilitated the development and dissemination of our and others' state-of-the-art statistical methodologies.
We have identified three specific computational challenges that urgently require new or improved solutions and that can greatly benefit from our expertise. Namely, we propose to develop: fast and accurate read mapping specialized for count-focused sequencing data; a unified statistical approach for normalization and downstream analysis; and computational tools that integrate scATAC-seq data with scRNA-seq data and use public data to facilitate annotation and functional interpretation. We plan to disseminate our tools as open-source software and to provide a user-friendly suite of packages that functional genomics researchers can use to extract knowledge from their single-cell RNA-seq or ATAC-seq data.

Public Health Relevance

Advances in the technologies used to decipher genomic function in development and disease are giving rise to new computational and statistical challenges. Our team has extensive experience developing computational tools and statistical methodology in this area. We will leverage our expertise to develop solutions to some of the most urgent needs of the research community and disseminate these as open-source software.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG011139-01
Application #: 9979396
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sen, Shurjo Kumar
Project Start: 2020-09-22
Project End: 2025-06-30
Budget Start: 2020-09-22
Budget End: 2021-06-30
Support Year: 1
Fiscal Year: 2020
Total Cost:
Indirect Cost:
Name: Dana-Farber Cancer Institute
Department:
Type:
DUNS #: 076580745
City: Boston
State: MA
Country: United States
Zip Code: 02215