Biological systems employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations. Advances in high-throughput technologies over the past several years have enabled the generation of comprehensive data sets measuring multiple aspects of biological regulation (such as genetics, epigenetics, transcriptomics, metabolomics, glycomics, and proteomics). Many databases, such as The Cancer Genome Atlas (TCGA) and the Lung Genome Research Consortium (LGRC) database, have been created to deposit diverse types of omics data and to share them publicly. However, data errors, including sample swapping, mislabeling, and improper data entry, are inevitable during large-scale data generation and data management. Our preliminary results indicate that sample labeling errors occur frequently in every database we examined. Data quality control (QC) is therefore critical for all public databases: data errors need to be identified and corrected before data are released for analysis and data mining. Analyzing error-infested data wastes public resources. More importantly, wrong data can lead to wrong scientific conclusions, and sample errors can have a large impact on statistical power. To make full use of genetic, genomic, and other omics data in public databases, it is critical to properly match the different types of data pertaining to the same sample or individual before applying integrative analyses. There is an urgent need for methods that can identify data labeling errors in large databases and properly connect diverse types of omics data pertaining to the same individual. In response to the Big Data to Knowledge (BD2K) initiative, we will develop computational methods to address the topic area Data Wrangling.
Here we propose to develop a sample mapping procedure called MODMatcher (Multi-Omics Data Matcher) to simultaneously QC multiple types of omics data (Aim 1), and to develop a suite of predictive models based on multi-omics data to identify inconsistencies between clinical data and omics data (Aim 2). Our proposed methods will be used to clean data and to identify and correct data annotation and metadata attribute errors in large databases, all of which fall within the scope of Data Wrangling.
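
As a rough illustration of what cross-omics sample matching involves, the sketch below flags samples whose labeled partner in a second data type is not their most similar profile. It is not the MODMatcher procedure, whose details are not given here. It assumes two matrices whose rows are samples in labeled order and whose columns are already-paired features (for example, a gene's expression and a measurement expected to correlate with it); the function names `best_matches` and `flag_mismatches` and the simulated data are hypothetical.

```python
import numpy as np

def zscore_features(x):
    """Standardize each feature (column) across samples."""
    return (x - x.mean(axis=0)) / x.std(axis=0)

def best_matches(omics_a, omics_b):
    """For each sample (row) of omics_a, find the most similar row of omics_b
    by Pearson correlation across the paired features."""
    a = zscore_features(omics_a)
    b = zscore_features(omics_b)
    n_a = a.shape[0]
    # np.corrcoef treats rows as variables and stacks a on top of b;
    # the upper-right block holds the a-versus-b correlations.
    sim = np.corrcoef(a, b)[:n_a, n_a:]
    return sim.argmax(axis=1), sim

def flag_mismatches(omics_a, omics_b):
    """Return (labeled_index, best_matching_index) pairs that disagree."""
    best, _ = best_matches(omics_a, omics_b)
    return [(i, int(j)) for i, j in enumerate(best) if i != int(j)]

# Simulated example (assumption): omics_b is a noisy copy of omics_a,
# with the labels of samples 1 and 4 swapped on purpose.
rng = np.random.default_rng(0)
omics_a = rng.normal(size=(6, 50))
omics_b = omics_a + 0.1 * rng.normal(size=(6, 50))
omics_b[[1, 4]] = omics_b[[4, 1]]

mismatches = flag_mismatches(omics_a, omics_b)
print(mismatches)
```

In this toy setting the two deliberately swapped samples should be the only ones flagged. Real omics data would need a principled way to pair features across data types and a significance threshold for accepting a match, neither of which this sketch attempts.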

Public Health Relevance

Sample labeling errors frequently occur in biomedical research databases with diverse types of omics data. We will develop methods to identify and correct data errors in public databases by simultaneously analyzing multiple types of omics data, work that falls within the scope of 'Data Wrangling'.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01HG008451-02
Application #: 9069019
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Sofia, Heidi J
Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2016-06-01
Budget End: 2017-05-31
Support Year: 2
Fiscal Year: 2016
Total Cost:
Indirect Cost:
Name: Icahn School of Medicine at Mount Sinai
Department: Genetics
Type: Schools of Medicine
DUNS #: 078861598
City: New York
State: NY
Country: United States
Zip Code: 10029
Hitzel, Juliane; Lee, Eunjee; Zhang, Yi et al. (2018) Oxidized phospholipids regulate amino acid metabolism through MTHFD2 to facilitate nucleotide release in endothelial cells. Nat Commun 9:2292
Lin, Luan; Chen, Quan; Hirsch, Jeanne P et al. (2018) Temporal genetic association and temporal genetic causality methods for dissecting complex networks. Nat Commun 9:3980
Lee, Eunjee; Collazo-Lorduy, Ana; Castillo-Martin, Mireia et al. (2018) Identification of microR-106b as a prognostic biomarker of p53-like bladder cancers by ActMiR. Oncogene 37:5858-5872
Peters, Lauren A; Perrigoue, Jacqueline; Mortha, Arthur et al. (2017) A functional genomics predictive network model identifies regulators of inflammatory bowel disease. Nat Genet 49:1437-1449
Lee, Eunjee; Pain, Margaret; Wang, Huaien et al. (2017) Sensitivity to BUB1B Inhibition Defines an Alternative Classification of Glioblastoma. Cancer Res 77:5518-5529
Pollak, Julia; Rai, Karan G; Funk, Cory C et al. (2017) Ion channel expression patterns in glioblastoma stem cells with functional and therapeutic implications for malignancy. PLoS One 12:e0172884
Degli Esposti, Davide; Aushev, Vasily N; Lee, Eunjee et al. (2017) miR-500a-5p regulates oxidative stress response genes in breast cancer and predicts cancer survival. Sci Rep 7:15966
Yoo, Seungyeul; Wang, Wenhui; Wang, Qin et al. (2017) A pilot systematic genomic comparison of recurrence risks of hepatitis B virus-associated hepatocellular carcinoma with low- and high-degree liver fibrosis. BMC Med 15:214
Gong, Yixuan; Wang, Li; Chippada-Venkata, Uma et al. (2016) Constructing Bayesian networks by integrating gene expression and copy number data identifies NLGN4Y as a novel regulator of prostate cancer progression. Oncotarget 7:68688-68707
Katsyv, Igor; Wang, Minghui; Song, Won Min et al. (2016) EPRS is a critical regulator of cell proliferation and estrogen signaling in ER+ breast cancer. Oncotarget 7:69592-69605

Showing the most recent 10 out of 23 publications