This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).

The objective of this proposal is to develop an integrated research and education program for advancing the underlying theoretical and computational principles of data mining in the emergent chemical genomics databases. The core technical innovations that this research aims to advance are: (i) developing effective kernel-based representations and structure pattern extraction and selection methods to capture the intrinsic characteristics of irregular and discrete spaces such as the chemical space, (ii) designing methods for adaptive and scalable similarity search in large databases of complex data and methods for accurate classification models, and (iii) deriving application-oriented validation. A key strength of this work is the application of the theoretical and computational advances to real-world problems, namely chemical toxicity prediction based on microarray gene expression profiles and high-throughput chemical screening. Collaborators in academia, industry, and government agencies will evaluate the new algorithms. The data mining knowledge gained will be applicable beyond the chemical domain; examples of such applications include social network analysis and sensor network analysis. The PI will work closely with the Center of Excellence in Chemical Methodologies and Library Development at the University of Kansas (KU CMLD) to evaluate research prototypes.

Intellectual Merit: This research addresses the fundamental problem of learning functional dependencies between arbitrary input and output domains. In particular, this research: (1) focuses on a complex input domain, the space of all chemicals; (2) aims to derive a uniform representation of the domain by working on innovative tools for graphs and geometric structures that are associated with the domain; and (3) will provide practical tools to search through the domain and will design new algorithms that uncover real connections between the input domain and an equally complex output domain (a space of biological entities). The data mining knowledge gained from this project will provide the research community with better techniques for searching, mining, and analyzing domains of complex data and for uncovering the real connections between such domains. The proposed research is a timely effort to integrate and advance knowledge in three communities: cheminformatics, data mining, and machine learning.

Broader Impact: Accurate data mining tools for chemical structure-activity relationship discovery will simplify and accelerate drug discovery and hence improve human health. Better prediction tools for chemical activity, including toxicity, will lead to better strategies for environmental monitoring and preservation. Deep understanding of chemical structure-activity relationships should enable rational material design in research on renewable and clean energy. The research program is strongly linked to the educational goals of this proposal, which are, among others, (i) to enrich the curriculum for undergraduate and graduate education in new interdisciplinary training programs and (ii) to encourage K-12 and undergraduate students to pursue careers in Science, Technology, Engineering, and Mathematics (STEM).

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0845951
Program Officer: Vasant G. Honavar
Project Start:
Project End:
Budget Start: 2009-07-01
Budget End: 2014-06-30
Support Year:
Fiscal Year: 2008
Total Cost: $499,972
Indirect Cost:
Name: University of Kansas
Department:
Type:
DUNS #:
City: Lawrence
State: KS
Country: United States
Zip Code: 66045