High-throughput sequencing (HTS) platforms are revolutionizing genomics and health research. The incredible throughput of new sequencing instruments has enabled sequencing of genomes, exomes, methylomes, and transcriptomes in both research and clinical settings. As the cost of DNA sequencing has plummeted, two important trends have become apparent. First, the cost of analysis, in terms of computing resources and personnel, will soon surpass the cost of data generation. This will increase the pressing demand for analytical algorithms that run faster, with fewer CPU/memory resources, while processing ever-growing data sets. Second, the advent of HTS technologies has put low-cost, high-throughput sequencing into the hands of small research labs and clinical investigators, groups that are not accustomed to dealing with this type and scale of data. These developments will undoubtedly yield an unprecedented number of new discoveries, clinical insights, and medical breakthroughs in the coming years, provided the outstanding issues of HTS data analysis (short read lengths, inherent errors, and the sheer number of sequence reads) can be conclusively resolved. Until now, most HTS has taken place in large genome centers with teams of bioinformaticians and substantial computing infrastructures. There is an urgent need to make their analysis tools and next-generation pipelines available to the wider research community as easy-to-install, easy-to-use packages. We have spent several years developing a computational framework and innovative tools for HTS data analysis, with a particular focus on the discovery and interpretation of genetic variants. Our goal in this proposal is to make these tools available to the wider community, both individually and as part of a complete informatics solution from alignment to detection to interpretation.
The solution we describe is flexible and powerful enough to be adopted by experienced laboratories, while at the same time providing high-quality, push-button analysis of sequence data for those with little bioinformatics expertise. The framework will run in the cloud or on a single CPU, enabling researchers, educators, and clinicians to speed the transition from sequencing technology adoption to biological knowledge and clinical application.

Public Health Relevance

The promise of personalized medicine will only be realized when each individual's genetic code can be read and analyzed in the clinical setting. Unfortunately, the associated technologies will generate massive amounts of data that are difficult to analyze and interpret. The software described in this proposal will enable widespread and easy analysis and interpretation of genetic data, accelerating the overall understanding of genetic information and its application to human health.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Research Project--Cooperative Agreements (U01)
Study Section: Special Emphasis Panel (ZHG1-HGR-M (O3))
Program Officer: Sofia, Heidi J
Washington University
Schools of Medicine
Saint Louis
United States
Huang, Kuan-Lin; Li, Shunqiang; Mertins, Philipp et al. (2017) Proteogenomic integration reveals therapeutic targets in breast cancer xenografts. Nat Commun 8:14864
Mashl, R Jay; Scott, Adam D; Huang, Kuan-Lin et al. (2017) GenomeVIP: a cloud platform for genomic variant discovery and interpretation. Genome Res 27:1450-1459
Manda, K R; Tripathi, P; Hsi, A C et al. (2016) NFATc1 promotes prostate tumorigenesis and overcomes PTEN loss-induced senescence. Oncogene 35:3282-92
Jones, K B; Barrott, J J; Xie, M et al. (2016) The impact of chromosomal translocation locus and fusion oncogene coding sequence in synovial sarcomagenesis. Oncogene 35:5021-32
Niu, Beifang; Scott, Adam D; Sengupta, Sohini et al. (2016) Protein-structure-guided discovery of functional mutations across 19 cancer types. Nat Genet 48:827-37
Ye, Kai; Wang, Jiayin; Jayasinghe, Reyka et al. (2016) Systematic discovery of complex insertions and deletions in human cancers. Nat Med 22:97-104
Griffith, Malachi; Griffith, Obi L; Smith, Scott M et al. (2015) Genome Modeling System: A Knowledge Management Platform for Genomics. PLoS Comput Biol 11:e1004274
Leiserson, Mark D M; Vandin, Fabio; Wu, Hsin-Ta et al. (2015) Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet 47:106-14
Lu, Charles; Xie, Mingchao; Wendl, Michael C et al. (2015) Patterns and functional implications of rare germline variants across 12 cancer types. Nat Commun 6:10086
Ding, Li; Wendl, Michael C; McMichael, Joshua F et al. (2014) Expanding the computational toolbox for mining cancer genomes. Nat Rev Genet 15:556-70

Showing the most recent 10 out of 26 publications