Developing methods for curating multi-omics data

Zhu, Jun

Abstract

Biological systems employ multiple levels of regulation that enable them to respond to genetic, epigenetic, genomic, and environmental perturbations. Advances in high-throughput technologies over the past several years have enabled the generation of comprehensive data sets measuring multiple aspects of biological regulation (such as genetics, epigenetics, transcriptomics, metabolomics, glycomics, proteomics, etc.). Many databases, such as The Cancer Genome Atlas (TCGA) database and the Lung Genome Research Consortium (LGRC) database, have been created for depositing diverse types of omics data and for sharing them publicly. However, data errors, including sample swapping, mislabeling, and improper data entry, are inevitable during large-scale data generation and data management. Our preliminary results indicate that sample labeling errors occur frequently in every database we examined. Data quality control (QC) is critical for all public databases. Data errors need to be identified and corrected before data are released for data analysis and data mining. Analyzing error-infested data wastes public resources. Importantly, wrong data could lead to wrong scientific conclusions, and sample errors could have a large impact on statistical power. To maximally utilize genetic, genomic, and other omics data in public databases, it is critical to properly match the different types of data pertaining to the same sample or individual before applying integrative analyses. There is an urgent need for methods that can identify data labeling errors in large databases and properly connect diverse types of omics data pertaining to the same individual. In response to the Big Data to Knowledge (BD2K) initiative, we will develop computational methods to address the topic area Data Wrangling.
Here we propose to develop a sample mapping procedure called MODMatcher (Multi-Omics Data Matcher) to simultaneously QC multiple types of omics data (Aim 1), and to develop a suite of predictive models based on multi-omics data to identify inconsistencies between clinical data and omics data (Aim 2). Our proposed methods will be used to clean data and to identify and correct data annotation and metadata attribute errors in large databases, all of which fall within the scope of Data Wrangling.
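The idea of matching two omics data types on the same samples can be illustrated with a toy example. The sketch below is not MODMatcher itself; it only assumes that paired features (for example, a gene's expression and a cis-associated measurement in a second data type) are correlated across samples, and it flags samples whose best match in the second data type carries a different label. The function names `zscore_rows` and `match_samples`, the simulated data, and the noise level are all hypothetical choices made for illustration.

```python
import numpy as np


def zscore_rows(m):
    """Standardize each feature (row) across samples (columns)."""
    return (m - m.mean(axis=1, keepdims=True)) / m.std(axis=1, keepdims=True)


def match_samples(omics_a, omics_b):
    """For each sample (column) of omics_a, find the most similar column of omics_b.

    Rows are paired features assumed to be correlated between the two
    data types; this pairing is an assumption of the sketch.
    """
    za = zscore_rows(omics_a)
    zb = zscore_rows(omics_b)
    # similarity[i, j]: average product of the standardized profiles of
    # sample i in data type A and sample j in data type B
    similarity = za.T @ zb / za.shape[0]
    return similarity.argmax(axis=1), similarity


# Simulated data: a shared biological signal plus independent noise per data type.
rng = np.random.default_rng(0)
n_features, n_samples = 200, 8
signal = rng.normal(size=(n_features, n_samples))
omics_a = signal + 0.3 * rng.normal(size=signal.shape)
omics_b = signal + 0.3 * rng.normal(size=signal.shape)

# Simulate a labeling error: swap samples 2 and 5 in data type B.
omics_b[:, [2, 5]] = omics_b[:, [5, 2]]

best, _ = match_samples(omics_a, omics_b)
# A sample is flagged when its best match in B does not carry its own label.
flagged = [i for i in range(n_samples) if best[i] != i]
print("Samples with a suspected label mismatch:", flagged)
```

In this toy setting, a flagged pair whose best matches point at each other suggests a swap that could be corrected by relabeling. Real data would need a principled choice of paired features and a significance threshold for calling a match.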

Public Health Relevance

Sample labeling errors occur frequently in biomedical research databases that hold diverse types of omics data. We will develop methods to identify and correct data errors in public databases by simultaneously analyzing multiple types of omics data, work that falls within the scope of 'Data Wrangling'.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project--Cooperative Agreements (U01)
Project #: 5U01HG008451-02
Application #: 9069019
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Sofia, Heidi J

Project Start: 2015-06-01
Project End: 2018-05-31
Budget Start: 2016-06-01
Budget End: 2017-05-31
Support Year: 2
Fiscal Year: 2016
Total Cost
Indirect Cost

Institution

Name: Icahn School of Medicine at Mount Sinai
Department: Genetics
Type: Schools of Medicine
DUNS #: 078861598

City: New York
State: NY
Country: United States
Zip Code: 10029

Related projects


NIH 2017 U01 HG	Developing methods for curating multi-omics data Zhu, Jun / Icahn School of Medicine at Mount Sinai	$406,020
NIH 2016 U01 HG	Developing methods for curating multi-omics data Zhu, Jun / Icahn School of Medicine at Mount Sinai
NIH 2016 U01 HG	Developing methods for curating multi-omics data Zhu, Jun / Icahn School of Medicine at Mount Sinai	$282,305
NIH 2015 U01 HG	Developing methods for curating multi-omics data Zhu, Jun / Icahn School of Medicine at Mount Sinai

Publications

Lee, Eunjee; Collazo-Lorduy, Ana; Castillo-Martin, Mireia et al. (2018) Identification of microR-106b as a prognostic biomarker of p53-like bladder cancers by ActMiR. Oncogene 37:5858-5872

Hitzel, Juliane; Lee, Eunjee; Zhang, Yi et al. (2018) Oxidized phospholipids regulate amino acid metabolism through MTHFD2 to facilitate nucleotide release in endothelial cells. Nat Commun 9:2292

Lin, Luan; Chen, Quan; Hirsch, Jeanne P et al. (2018) Temporal genetic association and temporal genetic causality methods for dissecting complex networks. Nat Commun 9:3980

Peters, Lauren A; Perrigoue, Jacqueline; Mortha, Arthur et al. (2017) A functional genomics predictive network model identifies regulators of inflammatory bowel disease. Nat Genet 49:1437-1449

Lee, Eunjee; Pain, Margaret; Wang, Huaien et al. (2017) Sensitivity to BUB1B Inhibition Defines an Alternative Classification of Glioblastoma. Cancer Res 77:5518-5529

Pollak, Julia; Rai, Karan G; Funk, Cory C et al. (2017) Ion channel expression patterns in glioblastoma stem cells with functional and therapeutic implications for malignancy. PLoS One 12:e0172884

Degli Esposti, Davide; Aushev, Vasily N; Lee, Eunjee et al. (2017) miR-500a-5p regulates oxidative stress response genes in breast cancer and predicts cancer survival. Sci Rep 7:15966

Yoo, Seungyeul; Wang, Wenhui; Wang, Qin et al. (2017) A pilot systematic genomic comparison of recurrence risks of hepatitis B virus-associated hepatocellular carcinoma with low- and high-degree liver fibrosis. BMC Med 15:214

Gong, Yixuan; Wang, Li; Chippada-Venkata, Uma et al. (2016) Constructing Bayesian networks by integrating gene expression and copy number data identifies NLGN4Y as a novel regulator of prostate cancer progression. Oncotarget 7:68688-68707

Katsyv, Igor; Wang, Minghui; Song, Won Min et al. (2016) EPRS is a critical regulator of cell proliferation and estrogen signaling in ER+ breast cancer. Oncotarget 7:69592-69605

Showing the most recent 10 out of 23 publications
