Mining Structured Data with Applications in Chemistry and Biology

Baldi, Pierre

Abstract

This project will develop novel data analysis algorithms that will enable scientists to discover new knowledge from large data sets of variable-size structured data. It will specifically focus on small molecules in organic chemistry that can be represented in various ways, such as the two-dimensional graph of covalent bonds. The project will develop new methods to explore chemical space and predict the physical, chemical, and biological properties of small organic molecules.

The research will directly address these problems by using annotated datasets of compounds in combination with machine learning approaches. Specifically, this project will develop new fingerprint representations of small molecules, for instance, by indexing the paths and trees contained in the molecular graphs, or by building histograms of three-dimensional distances between labeled pairs of atoms. Kernel methods, currently one of the leading approaches in machine learning, will be used to measure similarity between fingerprints and to develop predictive algorithms for classification and regression tasks. The algorithms will be quantitatively evaluated in terms of their ability both to describe observed data and to predict new data. Selected applications to the prediction of critical temperatures, toxicity, mutagenicity, and anti-cancer and other biological activity of small molecules will serve as testbeds for validating the techniques developed. Algorithms and data developed during the project will be made publicly available on the Web for research and scientific use. Educational activities included in this project will foster in computer science students an understanding of the increasingly important role of computer science and data mining in data-driven sciences such as chemistry.
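
As a rough illustration of the path-indexing idea described above, the following minimal sketch enumerates the labeled paths in a toy molecular graph and compares two fingerprints with a Tanimoto similarity. It is not the project's implementation: the function names `path_fingerprint` and `tanimoto`, the `max_len` parameter, the choice of Tanimoto as the similarity measure, and the hydrogen-free toy molecules are all assumptions made for this example.

```python
def path_fingerprint(atoms, bonds, max_len=3):
    """Return the set of labeled simple paths with at most max_len bonds.

    atoms: list of element symbols, indexed by atom number.
    bonds: list of (i, j) index pairs; bond orders are ignored in this sketch.
    """
    adj = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        adj[i].append(j)
        adj[j].append(i)

    features = set()

    def walk(path):
        forward = "-".join(atoms[i] for i in path)
        backward = "-".join(atoms[i] for i in reversed(path))
        # Store one canonical orientation so C-O and O-C count as the same path.
        features.add(min(forward, backward))
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple paths only, no revisiting atoms
                walk(path + [nxt])

    for start in adj:
        walk([start])
    return features


def tanimoto(fp_a, fp_b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)


# Heavy-atom skeletons only (hypothetical toy inputs).
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = path_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])
similarity = tanimoto(ethanol, dimethyl_ether)
```

A matrix of such pairwise similarities over a dataset could then serve as a precomputed kernel for a classifier or regressor, which is one plausible way to connect fingerprints to the prediction tasks mentioned above.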

New informatics methods for structured data will greatly benefit chemistry. The penetration of computational, artificial intelligence, and informatics methods in chemistry has been slower than in biology, because of the single-investigator nature of chemical research and the dominance of genome sequencing and other high-throughput projects in biology. Data on millions of compounds, however, are becoming readily available. By developing efficient fingerprints, kernels, and other machine learning methods for graphs and molecular structures, this project will address some of the most prominent open problems in the field and help accelerate the penetration of modern computational methods in chemistry.

The project has the potential for significant benefit to society. Small molecules have numerous applications in biology, pharmacology, and bioengineering. They can be used to probe and study biological pathways and systems and to develop new drugs. The algorithms developed by this project will provide basic building blocks and important steps towards understanding and predicting molecular properties from molecular structures. They will allow scientists to screen large data sets of compounds rapidly, while searching for compounds that satisfy particular structural or functional constraints. This will produce cost savings, accelerate the development of new drugs, and promote the understanding of chemical space.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0513376
Program Officer: Sylvia J. Spengler

Budget Start: 2005-07-15
Budget End: 2009-06-30
Fiscal Year: 2005
Total Cost: $311,291

Institution: University of California Irvine, Irvine, CA, United States
