New algorithms and tools for large-scale genomic analyses

Quinlan, Aaron

Abstract

The exploration and interpretation of large, complex datasets is vital to discovery in genomics. However, researchers now confront a fundamental limitation: unprecedented experiments are possible thanks to modern DNA sequencing technologies, yet existing "genome arithmetic" techniques for comparing and dissecting the resulting datasets cannot keep pace with the inexorable growth in dataset size and complexity. Genome arithmetic (GA) represents a powerful and widely used set of techniques that allow one to explore relationships among sets of genome features (e.g., a gene, sequence alignment, ChIP-seq peak, or anything that can be described with chromosome coordinates). GA is used for a broad spectrum of analyses, including the detection of intersecting/overlapping features (e.g., sequence alignments and exons), the description of feature coverage among datasets, and the merging, subtraction, and complementation of feature datasets. GA functionality is used by all genome browsers and data visualization tools, and by analysis software such as GATK and SAMTOOLS. Owing to their power and flexibility, existing GA tools (i.e., Galaxy, the UCSC Genome Browser, and our own BEDTOOLS) are extremely popular and are used in a broad range of complex genomic analyses. However, while GA is central to genomic analysis and discovery, the core algorithms employed by all existing tools are inherently incapable of scaling to the size and diversity of modern genomic datasets. If the field remains restricted to these approaches, the present analytic bottleneck will become increasingly acute. Therefore, the overall objective of this proposal is to provide the genomics community with innovative new algorithms and software that keep pace with modern genomics experiments and facilitate future discoveries.
The Specific Aims are to: (1) Devise efficient new algorithms for large-scale genome arithmetic analyses. We will develop innovative GA algorithms that scale to modern genomics experiments and are capable of integrating many diverse genomic datasets. We will devise novel algorithms and adapt proven, scalable approaches from the field of computational geometry. (2) Develop software and libraries that facilitate innovative analyses and new tool development. We will release our algorithms to the community as open-source software libraries and tools that will foster new tool development and provide innovative approaches for exploring large-scale datasets. (3) Extend our tools to scalable computing frameworks in order to enable future genomic discovery. We will adapt our software to parallel computing environments and thereby enable continued discovery on increasingly massive and complex datasets. The proposed research will devise entirely new, scalable approaches for genome arithmetic. This will provide the community with powerful new techniques for exploring and interpreting genomics experiments and provide tool developers with robust approaches for software development and improvement.
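To make the genome arithmetic operations named in the abstract concrete (merging, intersection, and subtraction of feature sets), the following is a minimal Python sketch. It is an illustration only and not the implementation used by BEDTOOLS or any other tool mentioned here. The function names, the `(chrom, start, end)` tuple layout, the half-open coordinate convention, and the choice to merge directly adjacent intervals are all assumptions made for this example.

```python
from typing import List, Tuple

# Assumed layout: (chromosome, start, end) with 0-based, half-open coordinates.
Interval = Tuple[str, int, int]


def merge(features: List[Interval]) -> List[Interval]:
    """Collapse overlapping or directly adjacent features into single intervals."""
    merged: List[Interval] = []
    for chrom, start, end in sorted(features):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            prev = merged[-1]
            merged[-1] = (chrom, prev[1], max(prev[2], end))
        else:
            merged.append((chrom, start, end))
    return merged


def intersect(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the regions covered by both feature sets (two-pointer sweep)."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    i = j = 0
    while i < len(a) and j < len(b):
        ca, sa, ea = a[i]
        cb, sb, eb = b[j]
        if ca != cb:
            # Advance whichever list is on the earlier chromosome.
            if ca < cb:
                i += 1
            else:
                j += 1
            continue
        lo, hi = max(sa, sb), min(ea, eb)
        if lo < hi:
            out.append((ca, lo, hi))
        # Advance the interval that ends first.
        if ea <= eb:
            i += 1
        else:
            j += 1
    return out


def subtract(a: List[Interval], b: List[Interval]) -> List[Interval]:
    """Return the parts of `a` not covered by `b` (simple quadratic scan)."""
    a, b = merge(a), merge(b)
    out: List[Interval] = []
    for chrom, start, end in a:
        cursor = start
        for cb, sb, eb in b:
            if cb != chrom or eb <= cursor or sb >= end:
                continue  # no overlap with the remaining part of this feature
            if sb > cursor:
                out.append((chrom, cursor, sb))
            cursor = max(cursor, eb)
        if cursor < end:
            out.append((chrom, cursor, end))
    return out


# Hypothetical example data: exons and ChIP-seq peaks.
exons = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 400, 500), ("chr2", 50, 80)]
peaks = [("chr1", 250, 450), ("chr2", 10, 60)]

merged_exons = merge(exons)
exon_peak_overlap = intersect(exons, peaks)
exons_without_peaks = subtract(exons, peaks)
```

The intersection uses a linear sweep over sorted, merged inputs, while the subtraction uses a quadratic scan for brevity. Neither is meant to suggest how the proposed algorithms work; the sketch only shows what the operations compute.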

Public Health Relevance

Discovery in genomics depends upon the exploration of many large experimental datasets and diverse genome annotations. Unfortunately, researchers are facing a fundamental analysis constraint caused by the fact that existing "genome arithmetic" analysis techniques are incapable of scaling to the size and complexity of modern genomics experiments. We therefore propose to devise innovative analysis techniques that will provide the genomics research community with scalable, reliable tools for exploring and interpreting tomorrow's genomics experiments.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 1R01HG006693-01
Application #: 8273206
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Bonazzi, Vivien

Project Start: 2012-04-19
Project End: 2016-03-31
Budget Start: 2012-04-19
Budget End: 2013-03-31
Support Year: 1
Fiscal Year: 2012
Total Cost: $437,112
Indirect Cost: $148,883

Institution

Name: University of Virginia
Department: Public Health & Prev Medicine
Type: Schools of Medicine
DUNS #: 065391526

City: Charlottesville
State: VA
Country: United States
Zip Code: 22904

Related projects


NIH 2018 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Utah
NIH 2017 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Utah	$440,999
NIH 2016 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Utah
NIH 2015 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Utah
NIH 2014 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Virginia
NIH 2013 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Virginia	$367,929
NIH 2012 R01 HG	New algorithms and tools for large-scale genomic analyses Quinlan, Aaron R. / University of Virginia	$437,112

Publications

Belyeu, Jonathan R; Nicholas, Thomas J; Pedersen, Brent S et al. (2018) SV-plaudit: A cloud-based framework for manually curating thousands of structural variants. Gigascience 7:

Layer, Ryan M; Pedersen, Brent S; DiSera, Tonya et al. (2018) GIGGLE: a search engine for large-scale integrated genome analysis. Nat Methods 15:123-126

Pedersen, Brent S; Quinlan, Aaron R (2018) Mosdepth: quick coverage calculation for genomes and exomes. Bioinformatics 34:867-868

Ostrander, Betsy E P; Butterfield, Russell J; Pedersen, Brent S et al. (2018) Whole-genome analysis for effective clinical diagnosis and gene discovery in early infantile epileptic encephalopathy. NPJ Genom Med 3:22

Pedersen, Brent S; Collins, Ryan L; Talkowski, Michael E et al. (2017) Indexcov: fast coverage quality control for whole-genome sequencing. Gigascience 6:1-6

Layer, Ryan M; Quinlan, Aaron R (2017) A parallel algorithm for N-way interval set intersection. Proc IEEE Inst Electr Electron Eng 105:542-551

Pedersen, Brent S; Quinlan, Aaron R (2017) cyvcf2: fast, flexible variant analysis with Python. Bioinformatics 33:1867-1869

Pedersen, Brent S; Quinlan, Aaron R (2017) Who's Who? Detecting and Resolving Sample Anomalies in Human DNA Sequencing Studies with Peddy. Am J Hum Genet 100:406-413

Eilbeck, Karen; Quinlan, Aaron; Yandell, Mark (2017) Settling the score: variant prioritization and Mendelian disease. Nat Rev Genet 18:599-612

Pedersen, Brent S; Layer, Ryan M; Quinlan, Aaron R (2016) Vcfanno: fast, flexible annotation of genetic variants. Genome Biol 17:118

Showing the most recent 10 out of 23 publications
