Cancers were selected for the NIH's The Cancer Genome Atlas (TCGA) project because of their poor prognosis and overall public health impact. Select tissue samples have been profiled for gene and miRNA expression, promoter methylation, DNA sequence and mutation analysis, and copy number variation (CNV), with total expenditures of $275 million [13]. The CNV information, derived from the raw array-based comparative genomic hybridization (aCGH) and SNP-array data, has been used successfully in specific application areas, such as identifying significant recurrent aberrations in each tumor type through population-wide, tumor-specific analysis. However, the full potential of this data has not yet been exploited. The two major obstacles have been the methods used for the initial data processing, which have somewhat limited the data's utility, and the lack of a comprehensive, integrated data access and analytical platform for copy number analysis. We have demonstrated that the copy number data can be successfully re-processed to more closely reflect the underlying genomic events, which, in turn, would open several high-impact avenues for further research. Examples of such new research areas include identification of CNVs predictive of survival, genomic stratification of like tumors by phenotype, and correlation of copy number and gene expression information. A product that enables the research community to take advantage of this substantial national investment in a much broader way is highly significant both as an advancement in cancer research and as a viable business opportunity.
Hypothesis: We hypothesize that using the BioDiscovery Nexus pre-processing and calling algorithms, with optimized statistical parameters confirmed by a clinical laboratory, combined with sample review by scientists trained in copy number analysis, will yield a database of structural variants that is substantially more concordant with the underlying tumor genomes than currently available data. Delivering such curated data, integrated with powerful, easy-to-use analysis tools, will have great scientific benefit.
Preliminary data: We have performed a proof-of-principle study by processing glioblastoma multiforme (GBM) level-1 (raw) data through our pipeline, and have demonstrated that the resulting copy number profiles better reflect the true genomic profiles of the samples, showing the correct ploidy and breakpoints as compared to the expected profiles for these samples.
Specific Aims: This project involves establishing the statistical and review methods for performing high-quality copy number analysis on existing TCGA level-1 data, creating a resultant data product, and delivering it through an integrated analytical platform.
In SPECIFIC AIM 1, we will optimize the statistical parameters of our commercially developed Hidden Markov Model (HMM) algorithm for the TCGA data set and apply them to the data set along with baseline ploidy correction.
In SPECIFIC AIM 2, we will develop quality control methods and metrics to identify samples that should be excluded from the data set or that require manual review and analysis.
In SPECIFIC AIM 3, we will create a database of the re-processed and curated TCGA copy number data with associated clinical annotations.
In SPECIFIC AIM 4, we will integrate this data product with a scientifically-accepted analytical platform for accessing the data and performing downstream analyses. Phase II would involve extending the curating methodology and analytical platform to incorporate the other dimensions of data described.
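As an illustration of the kind of HMM-based calling with baseline correction referred to in the aims, the following is a minimal sketch only. It is not the BioDiscovery Nexus algorithm: the three-state model, the state means, the noise level, the transition probability, and the function names (`recenter`, `viterbi_calls`) are all hypothetical choices made for this example, and median centering is used as a crude stand-in for baseline ploidy correction.

```python
import math
from statistics import median

# Hypothetical toy model: state names, means, and noise are illustrative
# assumptions, not parameters of any production calling algorithm.
STATES = ("loss", "neutral", "gain")
STATE_MEANS = {"loss": -0.5, "neutral": 0.0, "gain": 0.4}
SIGMA = 0.2        # assumed probe noise, in log2-ratio units
STAY_PROB = 0.98   # assumed probability of staying in the same state


def recenter(log_ratios):
    """Crude baseline correction: shift so the median log2 ratio is zero."""
    shift = median(log_ratios)
    return [x - shift for x in log_ratios]


def log_emission(x, state):
    """Log density of a Gaussian emission centered on the state's mean."""
    mu = STATE_MEANS[state]
    return -0.5 * ((x - mu) / SIGMA) ** 2 - math.log(SIGMA * math.sqrt(2 * math.pi))


def log_transition(prev, cur):
    """Log probability of moving from state `prev` to state `cur`."""
    if prev == cur:
        return math.log(STAY_PROB)
    return math.log((1 - STAY_PROB) / (len(STATES) - 1))


def viterbi_calls(log_ratios):
    """Return the most likely state for each probe under the toy HMM."""
    if not log_ratios:
        return []
    # Uniform prior over the starting state.
    score = {s: -math.log(len(STATES)) + log_emission(log_ratios[0], s) for s in STATES}
    backpointers = []
    for x in log_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            new_score[cur] = (
                score[best_prev] + log_transition(best_prev, cur) + log_emission(x, cur)
            )
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Example: a profile whose baseline is offset by +0.1, with a 5-probe loss.
raw_profile = [0.1] * 10 + [-0.4] * 5 + [0.1] * 10
calls = viterbi_calls(recenter(raw_profile))
```

In this sketch, the high `STAY_PROB` discourages isolated single-probe calls, which is one reason an HMM is a common choice for segmenting noisy array data. A real pipeline would estimate its parameters from the data and would need a more careful ploidy model than median centering.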

Public Health Relevance

DNA copy number changes are well-established drivers of carcinogenesis and disease progression, markers for prognosis, and indicators of potential cancer targets. Our product, which provides substantially improved DNA copy number information along with an integrated data access and analytical platform for TCGA data, will have a large impact on public health: in basic and translational research, in drug discovery, and in possible clinical applications. Our product is unique in important ways: we will re-process and curate structural variation from the raw data, and we will integrate the resultant data product with a proven data access and downstream analysis platform.

National Institutes of Health (NIH)
National Cancer Institute (NCI)
Small Business Innovation Research Grants (SBIR) - Phase I (R43)
Study Section
Special Emphasis Panel (ZRG1-IMST-J (15))
Program Officer
Lou, Xing-Jian
Biodiscovery, Inc.
El Segundo
United States