Modular Software for Sequence Data Quality Checking, Alignment & Variant-Calling

Marth, Gabor; Rocha Abecasis, Goncalo; Stewart, Donald

Abstract

Rapid technological advances mean that sequencing technologies can now be deployed on a genomic scale to further our understanding of human traits and diseases. Massive-throughput genotyping technologies have provided a tool for large-scale assessment of the contribution of common genetic variants (particularly SNPs) to complex traits. It is expected that next-generation sequencing technologies will substantially broaden the scope of genetic variants to include rarer SNP variants, as well as short insertion and deletion polymorphisms, copy number polymorphisms, and other large structural variants. Extracting the full benefits of these new sequencing technologies will require new analytical tools and data processing pipelines, both because (a) approaches and implementations designed to handle the more modest amounts of data generated by earlier technologies cannot always handle the orders-of-magnitude more voluminous high-throughput data, or provide only cumbersome ways of doing so; and because (b) the nature of the data and the quality control issues raised by new sequencing technologies are often substantially different from those of existing technologies. We propose to build on our extensive knowledge and understanding of next-generation sequencing technologies and of the analysis of large genetic association studies to construct a data processing pipeline that can be deployed by the NCBI and by scientists wishing to process and analyze large amounts of next-generation sequence data. This data processing pipeline will facilitate (1) quality assessment and validation of short-read sequence datasets; (2) mapping of sequencing reads to the genome; and (3) variant calling with accurate quality scores; it will also (4) include tools for data export and visualization. All component software tools of our modular pipeline will support standard data formats. The pipeline will be extensively tested and documented to ensure that it is ready for widespread production-level deployment.
We believe that the proposed data analysis pipeline will enhance the value of a variety of planned and future sequencing experiments, including cancer sequencing, and we are committed to delivering these tools in a timely, standards-compliant manner.

Public Health Relevance

Project narrative: We are developing a computer software pipeline to analyze and discover genetic differences between individual human genome sequences. This pipeline will be installed at the NIH and used to analyze data submitted by large genome sequencing projects. The methods we are developing will enhance the study of human genetic variability, contribute to gene mapping and, ultimately, the understanding of heritable human diseases.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: High Impact Research and Research Infrastructure Programs (RC2)
Project #: 1RC2HG005552-01
Application #: 7855575
Study Section: Special Emphasis Panel (ZHG1-HGR-N (O1))
Program Officer: Brooks, Lisa

Project Start: 2009-09-22
Project End: 2011-07-31
Budget Start: 2009-09-22
Budget End: 2010-07-31
Support Year: 1
Fiscal Year: 2009
Total Cost: $1,092,563
Indirect Cost

Institution

Name: Boston College
Department: Biology
Type: Schools of Arts and Sciences
DUNS #: 045896339

City: Chestnut Hill
State: MA
Country: United States
Zip Code: 02467

Related projects


NIH 2012 RC2 HG	Modular Software for Sequence Data Quality Checking, Alignment & Variant-Calling	Marth, Gabor T.; Rocha Abecasis, Goncalo; Stewart, Donald A. / Boston College	$199,907
NIH 2010 RC2 HG	Modular Software for Sequence Data Quality Checking, Alignment & Variant-Calling	Marth, Gabor T.; Rocha Abecasis, Goncalo; Stewart, Donald A. / Boston College	$1,157,137
NIH 2009 RC2 HG	Modular Software for Sequence Data Quality Checking, Alignment & Variant-Calling	Marth, Gabor T.; Rocha Abecasis, Goncalo; Stewart, Donald A. / Boston College	$1,092,563

Publications

Chiang, Charleston W K; Marcus, Joseph H; Sidore, Carlo et al. (2018) Genomic history of the Sardinian population. Nat Genet 50:1426-1434

Steri, Maristella; Orrù, Valeria; Idda, M Laura et al. (2017) Overexpression of the Cytokine BAFF and Autoimmunity Risk. N Engl J Med 376:1615-1626

van den Berg, Marten E; Warren, Helen R; Cabrera, Claudia P et al. (2017) Discovery of novel heart rate-associated loci using the Exome Chip. Hum Mol Genet 26:2346-2363

Danjou, Fabrice; Zoledziewska, Magdalena; Sidore, Carlo et al. (2015) Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat Genet 47:1264-1271

Lo, Yancy; Kang, Hyun M; Nelson, Matthew R et al. (2015) Comparing variant calling algorithms for target-exon sequencing in a large sample. BMC Bioinformatics 16:75

Zoledziewska, Magdalena; Sidore, Carlo; Chiang, Charleston W K et al. (2015) Height-reducing variants and selection for short stature in Sardinia. Nat Genet 47:1352-1356

1000 Genomes Project Consortium; Auton, Adam; Brooks, Lisa D et al. (2015) A global reference for human genetic variation. Nature 526:68-74

Ding, Jun; Sidore, Carlo; Butler, Thomas J et al. (2015) Assessing Mitochondrial DNA Variation and Copy Number in Lymphocytes of ~2,000 Sardinians Using Tailored Sequencing Analysis Tools. PLoS Genet 11:e1005306

Day, Felix R (see original citation for additional authors) (2015) Large-scale genomic analyses link reproductive aging to hypothalamic signaling, breast cancer susceptibility and BRCA1-mediated DNA repair. Nat Genet 47:1294-1303

Sidore, Carlo; Busonero, Fabio; Maschio, Andrea et al. (2015) Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers. Nat Genet 47:1272-1281

Showing the most recent 10 out of 23 publications
