Interpretable and extendable deep learning model for biological sequence analysis and prediction

Xu, Dong

Abstract

Bioinformatics and computational biology have become the core of biomedical research. The PI Dr. Dong Xu's work in this area focuses on the development of novel computational algorithms, software, and information systems, as well as on broad applications of these tools and other informatics resources to diverse biological and medical problems. He works on many research problems in protein structure prediction, post-translational modification prediction, high-throughput biological data analyses, in silico studies of plants, microbes and cancers, biological information systems, and mobile app development for healthcare. He has published more than 300 papers, with about 12,000 citations and an H-index of 55. In this project, the PI proposes to develop deep-learning algorithms, tools, and web resources for analyses and predictions of biological sequences, including DNA, RNA, and protein sequences. The availability of these data provides emerging opportunities for precision medicine and other areas, while deep learning, as a cutting-edge technology in machine learning, presents a powerful new method for analyses and predictions of biological sequences. With rapidly accumulating sequence data and the fast development of deep-learning methods, there is an urgent need to systematically investigate how best to apply deep learning in sequence analyses and predictions. For this purpose, the PI will develop cutting-edge deep-learning methods with the following goals for the next five years: (1) Develop a series of novel deep-learning methods and models that specifically target biological sequence analyses and predictions in: (a) general unsupervised representations of DNA/RNA, protein, and SNP/mutation sequences that capture both local and global features for various applications; (b) methods to make deep-learning models interpretable for understanding biological mechanisms and generating hypotheses; (c) "rule learning", which abstracts the underlying "rules" by combining unsupervised learning of large unlabeled data and supervised learning of small labeled data so that it can classify new unlabeled data. (2) Apply the proposed deep-learning models to DNA/RNA sequence annotation, genotype-phenotype analyses, cancer mutation analyses, protein function/structure prediction, protein localization prediction, and protein post-translational modification prediction. The PI will exploit particular properties associated with each of these problems to improve the deep-learning models. He will develop a set of related prediction and analysis tools, which will improve on state-of-the-art performance and shed some light on related biological mechanisms. (3) Make the data, models, and tools freely accessible to the research community. The system will be designed to be modular and open-source, available through GitHub. Its components will be available like integrated circuit modules, which are universal and ready to plug in for different applications. The PI will develop a web resource for biological sequence representations, analyses, and predictions, as well as tutorials to help biologists with no computational knowledge apply deep learning to their specific research problems.

Public Health Relevance

Biological sequences, including DNA, RNA, and protein sequences, represent the largest sources of growing big data in current biology and medicine, and they provide tremendous opportunities for precision medicine, synthetic biology, and other areas. Deep learning, as an emerging machine-learning method, has great potential for utilizing these data in biomedical research. This project will develop and apply cutting-edge deep-learning methods to deliver various sequence-based computational tools for gaining new knowledge, accelerating drug development, and improving personalized diagnosis and treatment.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Institute of General Medical Sciences (NIGMS)
Type: Unknown (R35)
Project #: 1R35GM126985-01
Application #: 9485584
Study Section: Special Emphasis Panel (ZRG1)
Program Officer: Lyster, Peter

Project Start: 2018-05-01
Project End: 2023-04-30
Budget Start: 2018-05-01
Budget End: 2019-04-30
Support Year: 1
Fiscal Year: 2018

Institution

Name: University of Missouri-Columbia
Department: Biostatistics & Other Math Sci
Type: Biomed Engr/Col Engr/Engr Sta
DUNS #: 153890272

City: Columbia
State: MO
Country: United States
Zip Code: 65211

Related projects


NIH 2020 R35 GM	Interpretable and extendable deep learning model for biological sequence analysis and prediction Xu, Dong / University of Missouri-Columbia
NIH 2019 R35 GM	Interpretable and extendable deep learning model for biological sequence analysis and prediction Xu, Dong / University of Missouri-Columbia
NIH 2018 R35 GM	Interpretable and extendable deep learning model for biological sequence analysis and prediction Xu, Dong / University of Missouri-Columbia
