The exploration and interpretation of large, complex datasets is vital to discovery in genomics. However, researchers now confront a fundamental limitation: unprecedented experiments are possible thanks to modern DNA sequencing technologies, yet existing genome arithmetic techniques for comparing and dissecting the resulting datasets are incapable of keeping pace with the inexorable growth in dataset size and complexity. Genome arithmetic (GA) is a powerful and widely used set of techniques for exploring relationships among sets of genome features (e.g., a gene, sequence alignment, ChIP-seq peak, or anything else that can be described with chromosome coordinates). GA is used for a broad spectrum of analyses, including the detection of intersecting/overlapping features (e.g., sequence alignments and exons), the description of feature coverage among datasets, and the merging, subtraction, and complementation of feature datasets. GA functionality is used by all genome browsers and data visualization tools, and by analysis software such as GATK and SAMTOOLS. Owing to its power and flexibility, our BEDTOOLS software is extremely popular and is used in a broad range of complex genomic analyses. However, while GA is central to genomic analysis and discovery, the core algorithms employed by all existing tools are inherently incapable of keeping pace with the scale and diversity of modern genomic datasets. If the field remains restricted to these approaches, the present analytic bottleneck will become increasingly acute. Therefore, the overall objective of this proposal is to provide the genomics community with innovative new algorithms and software that keep pace with modern genomics experiments and facilitate future discoveries.
The Specific Aims are to: (1) Create an ecosystem and software that allow researchers to easily integrate diverse genome annotations and datasets into their research. We will develop new tools that make collecting the datasets germane to a given experiment easy and reproducible. (2) Dramatically expand the utility, flexibility, and performance of BEDTOOLS. We will devise and implement new algorithms for scalable and flexible analysis of large-scale genome datasets. (3) Develop a workbench for visualizing and quantifying the biological significance of relationships among genomic datasets. We will leverage the technologies from Aims 1 and 2 to develop a comprehensive statistical and visualization workbench for the R statistical package that will allow researchers to detect biological relationships among genome datasets. The proposed research will devise entirely new, scalable approaches for genome arithmetic. This will provide the community with powerful new techniques for exploring and interpreting genomics experiments and give tool developers robust approaches for software development and improvement.
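
For readers unfamiliar with genome arithmetic, the merging, intersection, and subtraction operations named in the abstract can be illustrated with a minimal Python sketch. It is not the BEDTOOLS implementation and not one of the algorithms proposed here. It is a deliberately naive illustration that assumes features are (chromosome, start, end) tuples with 0-based, half-open coordinates. The function names and the example features are hypothetical, and treating book-ended features as mergeable is a choice made for this sketch.

```python
from typing import List, Tuple

# Assumed representation: (chromosome, start, end), 0-based and half-open.
Interval = Tuple[str, int, int]


def merge(features: List[Interval]) -> List[Interval]:
    """Collapse overlapping or book-ended features on the same chromosome."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(features):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the shared segment of every overlapping pair (naive all-pairs scan)."""
    out: List[Interval] = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            lo, hi = max(start_a, start_b), min(end_a, end_b)
            if lo < hi:  # non-empty overlap
                out.append((chrom_a, lo, hi))
    return sorted(out)


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Remove from each feature in `a` the portions covered by any feature in `b`."""
    out: List[Interval] = []
    b_merged = merge(b)  # sorted and non-overlapping
    for chrom, start, end in a:
        cursor = start
        for chrom_b, start_b, end_b in b_merged:
            if chrom_b != chrom or end_b <= cursor or start_b >= end:
                continue  # no overlap with the part of the feature still remaining
            if start_b > cursor:
                out.append((chrom, cursor, start_b))
            cursor = max(cursor, end_b)
        if cursor < end:
            out.append((chrom, cursor, end))
    return out


# Hypothetical example features (not taken from any real dataset).
exons = [("chr1", 100, 200), ("chr1", 150, 300), ("chr2", 50, 80)]
peaks = [("chr1", 180, 250), ("chr2", 10, 40)]

merged_exons = merge(exons)
shared = intersect(exons, peaks)
exons_without_peaks = subtract(merged_exons, peaks)
```

The all-pairs scan in `intersect` grows with the product of the two dataset sizes, which gives a sense of why naive approaches struggle as datasets grow.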

Public Health Relevance

New discoveries in genomics depend in large part upon the exploration of many large experimental datasets and diverse genome annotations. Unfortunately, researchers face a fundamental analysis constraint because existing 'genome arithmetic' analysis techniques cannot scale to the size and complexity of modern genomics experiments. We therefore propose to devise innovative analysis techniques that will provide the genomics research community with scalable, reliable tools for exploring and interpreting tomorrow's genomics experiments.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 2R01HG006693-05
Application #: 9026895
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sofia, Heidi J
Project Start: 2012-04-19
Project End: 2019-04-30
Budget Start: 2016-05-12
Budget End: 2017-04-30
Support Year: 5
Fiscal Year: 2016
Total Cost:
Indirect Cost:
Name: University of Utah
Department: Genetics
Type: Schools of Medicine
DUNS #: 009095365
City: Salt Lake City
State: UT
Country: United States
Zip Code: 84112
Belyeu, Jonathan R; Nicholas, Thomas J; Pedersen, Brent S et al. (2018) SV-plaudit: A cloud-based framework for manually curating thousands of structural variants. Gigascience 7:
Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya et al. (2018) GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods 15:123-126
Pedersen, Brent S; Quinlan, Aaron R (2018) Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34:867-868
Ostrander, Betsy E P; Butterfield, Russell J; Pedersen, Brent S et al. (2018) Whole-genome analysis for effective clinical diagnosis and gene discovery in early infantile epileptic encephalopathy. NPJ Genom Med 3:22
Pedersen, Brent S; Collins, Ryan L; Talkowski, Michael E et al. (2017) Indexcov: fast coverage quality control for whole-genome sequencing. Gigascience 6:1-6
Layer, Ryan M; Quinlan, Aaron R (2017) A parallel algorithm for N-way interval set intersection. Proc IEEE Inst Electr Electron Eng 105:542-551
Pedersen, Brent S; Quinlan, Aaron R (2017) cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33:1867-1869
Pedersen, Brent S; Quinlan, Aaron R (2017) Who's Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy. Am J Hum Genet 100:406-413
Eilbeck, Karen; Quinlan, Aaron; Yandell, Mark (2017) Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet 18:599-612
Pedersen, Brent S; Layer, Ryan M; Quinlan, Aaron R (2016) Vcfanno: fast, flexible annotation of genetic variants. Genome Biol 17:118

Showing the most recent 10 out of 23 publications