This project will develop novel data analysis algorithms that will enable scientists to discover new knowledge from large data sets of variable-size structured data. It will specifically focus on small molecules in organic chemistry that can be represented in various ways, such as the two-dimensional graph of covalent bonds. The project will develop new methods to explore chemical space and predict the physical, chemical, and biological properties of small organic molecules.

The research will directly address these problems by using annotated datasets of compounds in combination with machine learning approaches. Specifically, this project will develop new fingerprint representations of small molecules, for instance, by indexing the paths and trees contained in the molecular graphs, or by building histograms of three-dimensional distances between labeled pairs of atoms. Kernel methods, currently one of the leading approaches in machine learning, will be used to measure similarity between fingerprints and to develop predictive algorithms for classification and regression tasks. The algorithms will be quantitatively evaluated in terms of their ability both to describe observed data and to predict new data. Selected applications to the prediction of critical temperatures, toxicity, mutagenicity, anti-cancer activity, and other biological activity of small molecules will serve as testbeds for validating the techniques developed. Algorithms and data developed during the project will be made publicly available on the Web for research and scientific use. Educational activities included in this project will foster in computer science students an understanding of the increasingly important role of computer science and data mining in data-driven sciences such as chemistry.
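To make the idea of path-based fingerprints and kernel similarity concrete, the following is a minimal sketch and not the project's actual implementation. It assumes a hypothetical toy molecule format (a list of atom labels plus a list of bonds as index pairs), ignores bond orders and hydrogens, and uses the MinMax (count-based Tanimoto) similarity as one illustrative kernel choice; the abstract does not specify which kernels are used. The names `path_fingerprint` and `minmax_similarity` are invented for this example.

```python
from collections import Counter


def path_fingerprint(atoms, bonds, max_len=3):
    """Count labeled simple paths with at most max_len bonds.

    atoms: list of atom labels, e.g. ["C", "C", "O"] (hypothetical format).
    bonds: list of (i, j) index pairs for covalent bonds (bond order ignored).
    """
    adj = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        adj[a].append(b)
        adj[b].append(a)

    counts = Counter()

    def extend(path):
        labels = tuple(atoms[i] for i in path)
        # Use one canonical orientation so a path and its reverse share a key.
        # Paths with at least one bond are visited once from each end, so
        # their counts are uniformly doubled; single atoms are counted once.
        counts[min(labels, labels[::-1])] += 1
        if len(path) - 1 == max_len:
            return
        for nxt in adj[path[-1]]:
            if nxt not in path:  # simple paths only: no repeated atoms
                extend(path + [nxt])

    for start in adj:
        extend([start])
    return counts


def minmax_similarity(fp_a, fp_b):
    """MinMax (count-based Tanimoto) similarity between two fingerprints."""
    keys = set(fp_a) | set(fp_b)
    num = sum(min(fp_a[k], fp_b[k]) for k in keys)
    den = sum(max(fp_a[k], fp_b[k]) for k in keys)
    return num / den if den else 1.0


# Heavy-atom graphs of two isomers (assumed toy inputs).
ethanol = path_fingerprint(["C", "C", "O"], [(0, 1), (1, 2)])
dimethyl_ether = path_fingerprint(["C", "O", "C"], [(0, 1), (1, 2)])
similarity = minmax_similarity(ethanol, dimethyl_ether)
```

A similarity function of this kind can be plugged into a kernel-based classifier or regressor as a precomputed similarity matrix. The three-dimensional distance histograms mentioned above could be compared in the same way, since they are also count vectors.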

New informatics methods for structured data will greatly benefit chemistry. The penetration of computational, artificial intelligence, and informatics methods in chemistry has been slower than in biology, because of the single-investigator nature of chemical research and the dominance of genome sequencing and other high-throughput projects in biology. Data on millions of compounds, however, are becoming readily available. By developing efficient fingerprints, kernels, and other machine learning methods for graphs and molecular structures, this project will address some of the most outstanding problems in the field and help accelerate the penetration of modern computational methods in chemistry.

The project has the potential for significant benefit to society. Small molecules have numerous applications in biology, pharmacology, and bioengineering. They can be used to probe and study biological pathways and systems and to develop new drugs. The algorithms developed by this project will provide basic building blocks and important steps towards understanding and predicting molecular properties from molecular structures. They will allow scientists to screen large data sets of compounds rapidly, while searching for compounds that satisfy particular structural or functional constraints. This will produce cost savings, accelerate the development of new drugs, and promote the understanding of chemical space.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0513376
Program Officer: Sylvia J. Spengler
Budget Start: 2005-07-15
Budget End: 2009-06-30
Fiscal Year: 2005
Total Cost: $311,291
Name: University of California Irvine
City: Irvine
State: CA
Country: United States
Zip Code: 92697