Rapid technological advances mean that sequencing technologies can now be deployed on a genomic scale to further our understanding of human traits and diseases. Massive-throughput genotyping technologies have provided a tool for large-scale assessment of the contribution of common genetic variants (particularly SNPs) to complex traits. It is expected that next-generation sequencing technologies will substantially broaden the scope of genetic variants that can be assessed to include rarer SNP variants, as well as short insertion and deletion polymorphisms, copy number polymorphisms and other large structural variants. Extracting the full benefits of these new sequencing technologies will require new analytical tools and data processing pipelines, both because (a) approaches and implementations designed to handle the more modest amounts of data generated by earlier technologies often cannot handle the orders of magnitude more voluminous high-throughput data, or provide only cumbersome ways of doing so; and because (b) the nature of the data and the quality control issues raised by new sequencing technologies are often substantially different from those of existing technologies. We propose to build on our extensive knowledge and understanding of next-generation sequencing technologies and of the analysis of large genetic association studies to construct a data processing pipeline that can be deployed by the NCBI and by scientists wishing to process and analyze large amounts of next-generation sequence data. This data processing pipeline will facilitate (1) quality assessment and validation of short read sequence datasets; (2) mapping of sequencing reads to the genome; (3) variant calling with accurate quality scores; and (4) data export and visualization. All component software tools of our modular pipeline will support standard data formats. The pipeline will be extensively tested and documented to ensure it is ready for widespread production-level deployment.
We believe that the proposed data analysis pipeline will enhance the value of a variety of planned and future sequencing experiments, including cancer sequencing, and we are committed to delivering these tools in a timely and standards-compliant manner.

Public Health Relevance

Project narrative: We are developing a computer software pipeline to analyze and discover genetic differences between individual human genome sequences. This pipeline will be installed at the NIH and used to analyze data submitted by large genome sequencing projects. The methods we are developing will enhance the study of human genetic variability, contribute to gene mapping and, ultimately, the understanding of heritable human diseases.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
High Impact Research and Research Infrastructure Programs (RC2)
Study Section: Special Emphasis Panel (ZHG1-HGR-N (O1))
Program Officer: Brooks, Lisa
Boston College
Schools of Arts and Sciences
Chestnut Hill
United States
Chiang, Charleston W K; Marcus, Joseph H; Sidore, Carlo et al. (2018) Genomic history of the Sardinian population. Nat Genet 50:1426-1434
Steri, Maristella; Orrù, Valeria; Idda, M Laura et al. (2017) Overexpression of the Cytokine BAFF and Autoimmunity Risk. N Engl J Med 376:1615-1626
van den Berg, Marten E; Warren, Helen R; Cabrera, Claudia P et al. (2017) Discovery of novel heart rate-associated loci using the Exome Chip. Hum Mol Genet 26:2346-2363
Danjou, Fabrice; Zoledziewska, Magdalena; Sidore, Carlo et al. (2015) Genome-wide association analyses based on whole-genome sequencing in Sardinia provide insights into regulation of hemoglobin levels. Nat Genet 47:1264-1271
Lo, Yancy; Kang, Hyun M; Nelson, Matthew R et al. (2015) Comparing variant calling algorithms for target-exon sequencing in a large sample. BMC Bioinformatics 16:75
Zoledziewska, Magdalena; Sidore, Carlo; Chiang, Charleston W K et al. (2015) Height-reducing variants and selection for short stature in Sardinia. Nat Genet 47:1352-1356
1000 Genomes Project Consortium; Auton, Adam; Brooks, Lisa D et al. (2015) A global reference for human genetic variation. Nature 526:68-74
Ding, Jun; Sidore, Carlo; Butler, Thomas J et al. (2015) Assessing Mitochondrial DNA Variation and Copy Number in Lymphocytes of ~2,000 Sardinians Using Tailored Sequencing Analysis Tools. PLoS Genet 11:e1005306
Day, Felix R (see original citation for additional authors) (2015) Large-scale genomic analyses link reproductive aging to hypothalamic signaling, breast cancer susceptibility and BRCA1-mediated DNA repair. Nat Genet 47:1294-1303
Sidore, Carlo; Busonero, Fabio; Maschio, Andrea et al. (2015) Genome sequencing elucidates Sardinian genetic architecture and augments association analyses for lipid and blood inflammatory markers. Nat Genet 47:1272-1281

Showing the most recent 10 out of 23 publications