High-throughput sequencing (HTS) platforms are revolutionizing genomics and health research. The incredible throughput of new sequencing instruments has enabled sequencing of genomes, exomes, methylomes, and transcriptomes in both research and clinical settings. As the cost of DNA sequencing has plummeted, two important trends have become apparent. First, the cost of analysis, in terms of computing resources and personnel, will soon surpass the cost of data generation. This will increase the pressing demand for analytical algorithms that run faster, with fewer CPU/memory resources, while processing ever-growing data sets. Second, the advent of HTS technologies has put low-cost, high-throughput sequencing into the hands of small research labs and clinical investigators, groups that are not accustomed to dealing with this type and scale of data. These developments will undoubtedly yield an unprecedented number of new discoveries, clinical insights, and medical breakthroughs in the coming years, provided the outstanding issues of HTS data analysis (short read lengths, inherent errors, and the sheer number of sequence reads) can be conclusively resolved. Until now, most HTS has taken place in large genome centers with teams of bioinformaticians and substantial computing infrastructures. There is an urgent need to make their analysis tools and next-generation pipelines available to the wider research community as packages that are easy to install and use. We have spent several years developing a computational framework and innovative tools for HTS data analysis, with a particular focus on the discovery and interpretation of genetic variants. Our goal in this proposal is to make these tools available to the wider community, both individually and as part of a complete informatics solution from alignment to detection to interpretation.
The solution we describe is flexible and powerful enough to be adopted by experienced laboratories, while at the same time providing high-quality, push-button analysis of sequence data for those with little bioinformatics expertise. The framework will run in the cloud or on a single CPU, enabling researchers, educators, and clinicians to speed the transition from sequencing technology adoption to biological knowledge and clinical application.

Public Health Relevance

The promise of personalized medicine will only be realized when each individual's genetic code can be read and analyzed in the clinical setting. Unfortunately, the associated technologies will generate massive amounts of data that are difficult to analyze and interpret. The software described in this proposal will enable widespread and easy analysis and interpretation of genetic data, accelerating the overall understanding of genetic information and its application to human health.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZHG1-HGR-M (O3))
Program Officer: Sofia, Heidi J
Washington University
Schools of Medicine
Saint Louis
United States
Ellrott, Kyle; Bailey, Matthew H; Saksena, Gordon et al. (2018) Scalable Open Science Approach for Mutation Calling of Tumor Exomes Using Multiple Genomic Pipelines. Cell Syst 6:271-281.e7
Cao, Yanan; Zhou, Weiwei; Li, Lin et al. (2018) Pan-cancer analysis of somatic mutations across 21 neuroendocrine tumor types. Cell Res 28:601-604
Sengupta, Sohini; Sun, Sam Q; Huang, Kuan-Lin et al. (2018) Integrative omics analyses broaden treatment targets in human cancer. Genome Med 10:60
Huang, Kuan-Lin; Li, Shunqiang; Mertins, Philipp et al. (2017) Proteogenomic integration reveals therapeutic targets in breast cancer xenografts. Nat Commun 8:14864
Mashl, R Jay; Scott, Adam D; Huang, Kuan-Lin et al. (2017) GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 27:1450-1459
Wyczalkowski, Matthew A; Wylie, Kristine M; Cao, Song et al. (2017) BreakPoint Surveyor: a pipeline for structural variant visualization. Bioinformatics 33:3121-3122
Jones, K B; Barrott, J J; Xie, M et al. (2016) The impact of chromosomal translocation locus and fusion oncogene coding sequence in synovial sarcomagenesis. Oncogene 35:5021-32
Niu, Beifang; Scott, Adam D; Sengupta, Sohini et al. (2016) Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat Genet 48:827-37
Ye, Kai; Wang, Jiayin; Jayasinghe, Reyka et al. (2016) Systematic discovery of complex insertions and deletions in human cancers. Nat Med 22:97-104
Manda, K R; Tripathi, P; Hsi, A C et al. (2016) NFATc1 promotes prostate tumorigenesis and overcomes PTEN loss-induced senescence. Oncogene 35:3282-92

Showing the most recent 10 out of 30 publications