The recent development of various government- and university-funded screening centers has provided the academic research community with access to state-of-the-art high-throughput and high-content screening facilities. As a result, chemical genetics, which uses small organic molecules to alter the function of proteins, has emerged as an important experimental technique for studying and understanding complex biological systems. However, the methods used to develop small-molecule modulators (chemical probes) of specific protein functions and to analyze the phenotypes they induce have not kept pace with advances in experimental screening technologies. Developing probes for novel protein targets remains a laborious process, and experimental approaches to identify the proteins responsible for the phenotypes induced by small molecules require a large amount of time and capital expenditure. There is a critical need to develop new methods for probe development and target identification and to make them publicly available to the research community. The lack of such tools is an important problem because it impedes the identification of chemical probes for various proteins and reduces our ability to analyze experimental results effectively in order to elucidate the molecular mechanisms underlying biological processes.

Intellectual Merit: This project will develop novel algorithms in the areas of cheminformatics, bioinformatics, and machine learning to analyze the publicly available information associated with proteins and the molecules that modulate their functions (the target-ligand activity matrix). These algorithms will be used to develop new classes of computational methods and tools to aid in the development of chemical probes and the analysis of the phenotypes elicited by small molecules. The key hypothesis underlying this research is that the target-ligand activity matrix contains a wealth of information that, if properly analyzed, can provide insights connecting the structure of the chemical compounds (chemical space) to the structure of the proteins and their functions (biological space). Novel methods will be developed to: (i) better analyze screening results and identify high-affinity and selective hits, (ii) build models that can predict the compounds that are active against a novel protein target and select a set of compounds, enriched in actives, to be included in a high-throughput screen, (iii) virtually generate a set of core molecules (scaffolds) for a given protein target that can differ significantly from those currently available in the various libraries and have a high probability of being active against the target, and (iv) identify the proteins being targeted by compounds in phenotypic assays. In addition, the research will be facilitated by creating a database that integrates a large portion of the publicly available target-ligand binding data along with information about the targets and the compounds involved. The successful completion of this research will transform the field of chemical genetics by establishing a new methodology in which the increasing amount of target-ligand activity information is used systematically to guide the discovery of new probes and the analysis of phenotypic assays.
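As an illustration only, the idea of mining a sparse target-ligand activity matrix can be sketched in a few lines of Python. This is a minimal neighbor-based sketch and not the project's actual algorithms: the target and compound names, the binary activity encoding, the Jaccard similarity between targets, and the functions `target_similarity` and `predict_activity` are all assumptions introduced for this example.

```python
from typing import Dict, Optional

# Hypothetical toy data: activity[target][ligand] = 1 (active) or 0 (inactive).
# Target-ligand pairs that were never assayed are simply absent.
Activity = Dict[str, Dict[str, int]]


def target_similarity(activity: Activity, t1: str, t2: str) -> float:
    """Jaccard overlap of the sets of ligands that are active against each target."""
    a = {lig for lig, v in activity[t1].items() if v == 1}
    b = {lig for lig, v in activity[t2].items() if v == 1}
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def predict_activity(activity: Activity, target: str, ligand: str) -> Optional[float]:
    """Similarity-weighted average of the ligand's outcomes on other targets.

    Returns None when no other target with nonzero similarity has tested the ligand.
    """
    num = 0.0
    den = 0.0
    for other, row in activity.items():
        if other == target or ligand not in row:
            continue
        w = target_similarity(activity, target, other)
        num += w * row[ligand]
        den += w
    return num / den if den > 0 else None


# Assumed example matrix with three targets and four compounds.
activity = {
    "kinase_A": {"cmpd1": 1, "cmpd2": 1, "cmpd3": 0},
    "kinase_B": {"cmpd1": 1, "cmpd2": 1, "cmpd4": 1},
    "protease_C": {"cmpd1": 0, "cmpd3": 1, "cmpd4": 0},
}

# kinase_A was never tested against cmpd4; borrow evidence from similar targets.
score = predict_activity(activity, "kinase_A", "cmpd4")
unknown = predict_activity(activity, "kinase_A", "cmpd9")
```

In this toy matrix, `kinase_B` shares active compounds with `kinase_A` while `protease_C` does not, so only the `kinase_B` outcome for `cmpd4` carries weight in the estimate. A compound that no other target has tested yields `None` rather than a guessed score.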

Broader Impact: The ability to discover chemical probes for a wide range of novel protein targets will make it possible to identify drugs for pharmaceutically relevant proteins, positively impacting the rate of drug discovery. In addition, it will greatly increase the set of proteins that can be selectively modulated via small organic molecules, expand the range of biological processes that can be investigated via chemical genetics approaches, and allow researchers to use chemical genetics techniques to gain insights into the mechanisms of action associated with certain phenotypes. This will provide a better understanding of the dynamics of these processes and will supplement existing approaches based on molecular genetics. To further aid the broad dissemination of the results and enhance scientific understanding, the computational methods developed will be made freely available as stand-alone tools or web-based services for researchers working in the area of chemical genomics. Finally, the project integrates the research with an educational plan that focuses on interdisciplinary undergraduate, graduate, and post-graduate education in the areas of Computer Science, Medicinal Chemistry, and Chemical Genetics.

Key Words: supervised learning; semi-supervised learning; cheminformatics; structural bioinformatics; data mining; graph algorithms

Project Report

Proteins perform several processes within a cell by interacting with other proteins, DNA, RNA, and small molecules. Within the multi-stage, billion-dollar drug discovery pipeline, the first step is to identify small organic molecules that alter the function of proteins. The experimental process of assessing whether a molecule interacts with a protein is called chemical genetics and provides insight into protein function. As part of this project, computational approaches were developed to discover potential drug leads. Specific contributions include methods that seek to identify and characterize interactions between proteins and small molecules. Although these approaches were developed for biological datasets, they are also suitable for analyzing social network datasets and recommender systems (e.g., a user-movie dataset instead of a protein-molecule dataset), which have dyadic (two-way) relationships. Additionally, methods were developed to analyze the sequential and repeating patterns commonly observed in protein and DNA sequences. The utility of this approach was demonstrated on several types of sequence data, including protein sequence benchmarks, text datasets, and continuous electrocardiogram (heart measurement) data. Empirical results demonstrated the effectiveness of this approach in discovering repeated patterns that are meaningful to humans. Methods were also developed to analyze datasets with incomplete or missing information. This was essential because biological experiments can fail to provide a complete and accurate representation of protein interactions. These methods could also combine different, heterogeneous datasets and were shown to be useful in estimating the function of a given protein. A software package, svmPRAT, was released to the bioinformatics and biology community and is freely available at www.cs.gmu.edu/~mlbio/svmPRAT.
This toolkit helps develop predictive models that annotate protein sequences, i.e., analyze an input protein sequence and identify interesting aspects of the protein. Examples include identifying specific points on the protein that are involved in binding with other proteins. This general-purpose toolkit has been incorporated into several other protein analysis software packages and web services. The results of this research were broadly disseminated via publications, source code, and conference presentations. The award also supported five graduate students and three undergraduate students at various stages of their academic pathways. Students who graduated have found positions in the software and data analytics industry. In terms of educational impact, the PI incorporated the research ideas and methods developed in this project into advanced graduate-level classes. Specifically, assignments were set up in which students analyzed protein datasets. This gave students first-hand exposure to real-world applications within the classroom setting.
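To make the notion of repeated patterns in sequence data concrete, the following Python sketch counts fixed-length substrings (k-mers) that occur more than once in a sequence. It is a deliberately simple stand-in and not the method developed in the project; the function name `repeated_kmers`, its parameters, and the example DNA string are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict


def repeated_kmers(sequence: str, k: int, min_count: int = 2) -> Dict[str, int]:
    """Return every length-k substring that occurs at least min_count times.

    Overlapping occurrences are counted, and a sequence shorter than k
    yields an empty result.
    """
    counts = Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Assumed toy DNA sequence; the same function works on protein or text strings.
repeats = repeated_kmers("ACGTACGTTTACG", 3)
```

For this toy input, the repeated 3-mers are `ACG` (three occurrences), `CGT` (two), and `TAC` (two). Exact k-mer counting only finds verbatim repeats, so approximate or variable-length patterns of the kind seen in real protein or electrocardiogram data would need a more flexible approach.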

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 0905117
Program Officer: Sylvia J. Spengler
Project Start:
Project End:
Budget Start: 2009-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2009
Total Cost: $339,537
Indirect Cost:
Name: George Mason University
Department:
Type:
DUNS #:
City: Fairfax
State: VA
Country: United States
Zip Code: 22030