Next-generation DNA sequencing (NGS) technologies hold great promise as tools for building a new understanding of health and disease. In cancer research, deep sequencing provides more sensitive ways to detect the germline and somatic mutations that cause different types of cancer, and to identify new mutations within small subpopulations of tumor cells that can be prognostic indicators of tumor growth or drug resistance. Completing the transition from proof-of-principle applications to practical applications, however, requires that many basic and clinical research groups be able to use NGS effectively. Ongoing technical developments and intense vendor competition among NGS platform and service providers are commoditizing data collection costs, making systems more accessible. However, the single greatest impediment to the adoption of NGS technology is the lack of systems that provide easy access to the immense bioinformatics and IT infrastructure needed to work with the data. In the case of variant analysis, such systems will need to process very large datasets and accurately predict common, rare, and de novo variants. Genetic variation must be presented in an annotation-rich biological context to determine its clinical utility, frequency, and putative biological impact. Software systems used for this work must integrate data from many samples with resources ranging from core analysis algorithms to application-specific datasets to annotations, all woven into computational systems with interactive user interfaces (UIs). Such end-to-end systems do not currently exist. In this project, Geospiza will create integrated methods for robust detection and rich contextualization of genetic variants.
Using variation analysis in cancer genomics as a model system, we will conduct research to improve assay sensitivity by deeply characterizing data from existing and emerging NGS platforms, quality value (QV) recalibration tools, and alignment algorithms to understand the systematic artifacts that create errors in the data. To improve how researchers understand a variant's biological context, function, and potential clinical utility, we will develop methods to combine assay results from many samples with de novo NGS datasets for assays like RNA-Seq, existing data such as those in GEO and SRA, and information resources from dbSNP, cancer genome databases, and ENCODE. Finally, we will develop the scalable computing infrastructure and novel UIs needed to organize and process the data and to explore and annotate the results. Through this work and follow-on product development, we will produce integrated, sensitive assay systems that harness NGS to identify very low (1:1000) levels of change between DNA sequences, detecting cancerous mutations and emerging drug resistance. Our tools and infrastructure can later be applied in assays designed to follow viral epidemics and understand autoimmune disorders.
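
To illustrate why characterizing sequencing errors matters for detecting changes at the 1:1000 level, the following is a minimal sketch rather than the project's actual method. It assumes a simple binomial model in which reads are independent, the per-base error rate is uniform, and errors are split evenly among the three alternative bases. The function names (`prob_at_least`, `min_reads_for_call`, `detection_power`) and the chosen depth, error rates, and false-positive level are hypothetical, picked only for illustration.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p)."""
    below = sum(comb(n, i) * p**i * (1.0 - p) ** (n - i) for i in range(k))
    return max(0.0, 1.0 - below)


def min_reads_for_call(depth: int, error_rate: float, alpha: float) -> int:
    """Smallest supporting-read count that errors alone reach with probability < alpha.

    Assumes alpha is well above floating-point noise (for example 1e-6).
    """
    k = 0
    while prob_at_least(k, depth, error_rate) >= alpha:
        k += 1
    return k


def detection_power(depth: int, variant_fraction: float,
                    error_rate: float, alpha: float) -> float:
    """Probability that a true variant reaches the calling threshold.

    Simplification: reads supporting the variant are modeled with the
    variant fraction alone, ignoring the small contribution from errors.
    """
    threshold = min_reads_for_call(depth, error_rate, alpha)
    return prob_at_least(threshold, depth, variant_fraction)


if __name__ == "__main__":
    depth = 10_000            # hypothetical read depth at one position
    variant_fraction = 1e-3   # the 1:1000 level named in the text
    alpha = 1e-6              # hypothetical per-site false-positive level
    # Per-base error rates of 1e-3 and 1e-4 correspond to Phred Q30 and Q40;
    # dividing by 3 gives the rate of errors to one specific alternative base.
    for per_base_error in (1e-3, 1e-4):
        err = per_base_error / 3
        k = min_reads_for_call(depth, err, alpha)
        power = detection_power(depth, variant_fraction, err, alpha)
        print(f"error={per_base_error:g} threshold={k} power={power:.3f}")
```

Under these assumptions, a lower effective error rate reduces the number of supporting reads needed to make a call and so raises the chance of detecting a true 1:1000 variant at a fixed depth, which is one way to see the value of QV recalibration and artifact characterization.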

Public Health Relevance

The SBIR project "Software Systems for Detecting Rare Mutations" will deliver new software technologies to further advance the applications of deep DNA sequencing in personalized medicine by improving methods for detecting rare mutations that define cancer types and determine how a cancer cell may grow and respond to, or resist, treatment. In addition to improving cancer research and diagnostics, the software developed will have general use for any application where DNA sequencing is used to understand the genetic basis of human health, disease, and response to drug therapies.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Study Section
Special Emphasis Panel (ZRG1-IMST-J (15))
Program Officer
Brooks, Lisa
Geospiza, Inc.
United States
Chhangawala, Sagar; Rudy, Gabe; Mason, Christopher E et al. (2015) The impact of read length on quantification of differentially expressed genes and splice junction detection. Genome Biol 16:131
SEQC/MAQC-III Consortium (2014) A comprehensive assessment of RNA-seq accuracy, reproducibility and information content by the Sequencing Quality Control Consortium. Nat Biotechnol 32:903-14
Mason, Christopher E; Porter, Sandra G; Smith, Todd M (2014) Characterizing multi-omic data in systems biology. Adv Exp Med Biol 799:15-38
Chiron, David; Martin, Peter; Di Liberto, Maurizio et al. (2013) Induction of prolonged early G1 arrest by CDK4/CDK6 inhibition reprograms lymphoma cells for durable PI3Kδ inhibition through PIK3IP1. Cell Cycle 12:1892-900
Li, Sheng; Garrett-Bakelman, Francine E; Akalin, Altuna et al. (2013) An optimized algorithm for detecting and annotating regional differential methylation. BMC Bioinformatics 14 Suppl 5:S10
Ricarte-Filho, Julio C; Li, Sheng; Garcia-Rendueles, Maria E R et al. (2013) Identification of kinase fusion oncogenes in post-Chernobyl radiation-induced thyroid cancers. J Clin Invest 123:4935-44
Rosenfeld, Jeffrey A; Mason, Christopher E; Smith, Todd M (2012) Limitations of the human reference genome for personalized genomics. PLoS One 7:e40294
Laborde, Rebecca R; Wang, Vivian W; Smith, Todd M et al. (2012) Transcriptional profiling by sequencing of oropharyngeal cancer. Mayo Clin Proc 87:226-32