This award funds the development of effective and efficient algorithms that analyze large chemical compound databases and identify the compounds most likely to display the desired drug-like behavior. These virtual screening algorithms are based on a substructure-based classification framework that uses (i) highly efficient frequent subgraph discovery algorithms that mine the chemical compounds to discover all the substructures (topological or geometric) that are critical for the classification task, (ii) feature selection and generation algorithms that combine multiple criteria to identify and synthesize a set of substructure-based features that simplify the representation of the original compounds while retaining and exposing their key features, and (iii) kernel-based approaches that take into account the relationships between these substructures at different levels of granularity and complexity. The research is integrated with an educational plan that introduces undergraduate and graduate students to the computational and data analysis aspects of virtual screening, machine learning, and data mining through courses, summer institutes, and research opportunities.
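As a rough illustration of the three-stage framework described above (substructure discovery, feature selection, kernel computation), the following minimal sketch may help. It is not the project's implementation. It assumes each compound has already been reduced to a set of substructure labels, so the frequent subgraph mining step is replaced by simple support counting over those labels, and it substitutes the standard Tanimoto similarity for the multi-granularity kernels the project proposes. All compound names, labels, and function names are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: each compound is represented by the set of
# substructure labels it contains. A real system would obtain these by
# mining molecular graphs; here they are simply assumed to be given.
compounds = {
    "c1": {"benzene", "amide", "hydroxyl"},
    "c2": {"benzene", "amide", "nitro"},
    "c3": {"benzene", "hydroxyl"},
    "c4": {"thiol", "nitro"},
}


def frequent_substructures(compounds, min_support):
    """Return the substructures that occur in at least min_support compounds."""
    counts = Counter(s for subs in compounds.values() for s in subs)
    return {s for s, n in counts.items() if n >= min_support}


def project(compounds, features):
    """Restrict each compound to the selected substructure features."""
    return {name: subs & features for name, subs in compounds.items()}


def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Stage 1 and 2 (simplified): keep substructures seen in at least 2 compounds.
features = frequent_substructures(compounds, min_support=2)
projected = project(compounds, features)

# Stage 3 (simplified): pairwise kernel values between all compounds.
kernel = {
    (x, y): tanimoto(projected[x], projected[y])
    for x in projected
    for y in projected
}
```

A kernel matrix of this kind could then be passed to any kernel-based classifier that accepts precomputed similarities. The support threshold of 2 is arbitrary and chosen only to make the toy data interesting.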
The successful completion of this project will advance the drug development process in two ways: by developing computationally efficient and accurate classification algorithms that can replace or supplement biological-assay-based high-throughput screening (HTS) techniques, and by producing a publicly available, general-purpose chemical compound classification software toolkit containing high-quality implementations of the algorithms developed. Combining existing HTS-based approaches with these virtual screening methods will allow a move away from purely random testing toward more meaningful, directed, iterative, rapid-feedback searches of subsets and focused libraries.