The recent development of government- and university-funded screening centers has given the academic research community access to state-of-the-art high-throughput and high-content screening facilities. As a result, chemical genetics, which uses small organic molecules to alter the function of proteins, has emerged as an important experimental technique for studying and understanding complex biological systems. However, the methods used to develop small-molecule modulators (chemical probes) of specific protein functions, and to analyze the phenotypes they induce, have not kept pace with advances in experimental screening technologies. Developing probes for novel protein targets remains a laborious process, and experimental approaches for identifying the proteins responsible for the phenotypes induced by small molecules require large investments of time and capital. There is a critical need to develop new methods for probe development and target identification and to make them publicly available to the research community. The lack of such tools impedes the identification of chemical probes for many proteins and limits our ability to analyze experimental results effectively in order to elucidate the molecular mechanisms underlying biological processes.
Intellectual Merit: This project will develop novel algorithms in the areas of cheminformatics, bioinformatics, and machine learning to analyze the publicly available information associated with proteins and the molecules that modulate their functions (the target-ligand activity matrix). These algorithms will be used to develop new classes of computational methods and tools to aid in the development of chemical probes and the analysis of the phenotypes elicited by small molecules. The key hypothesis underlying this research is that the target-ligand activity matrix contains a wealth of information that, if properly analyzed, can provide insights connecting the structure of the chemical compounds (chemical space) to the structure of the proteins and their functions (biological space). Novel methods will be developed to: (i) better analyze screening results and identify high-affinity, selective hits; (ii) build models that predict which compounds are active against a novel protein target and select a set of compounds, enriched in actives, to be included in a high-throughput screen; (iii) virtually generate, for a given protein target, a set of core molecules (scaffolds) that can differ significantly from those currently available in the various libraries and that have a high probability of being active against the target; and (iv) identify the proteins being targeted by compounds in phenotypic assays. In addition, the research will be facilitated by creating a database that integrates a large portion of the publicly available target-ligand binding data along with information about the targets and compounds involved. The successful completion of this research will transform the field of chemical genetics by establishing a new methodology in which the increasing amount of target-ligand activity information is used systematically to guide the discovery of new probes and the analysis of phenotypic assays.
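To make the notion of a target-ligand activity matrix concrete, the following minimal Python sketch shows one possible way to represent such a matrix and to flag potent, selective hits for a target, in the spirit of aim (i). It is an illustration only, not the project's method: the compound and target names, the pIC50-like activity values, the thresholds, and the function `selective_hits` are all hypothetical assumptions introduced here.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical sparse target-ligand activity matrix:
# compound -> target -> activity (pIC50-like, higher = more potent).
# None marks a compound/target pair that has not been measured.
ActivityMatrix = Dict[str, Dict[str, Optional[float]]]


def selective_hits(
    matrix: ActivityMatrix,
    target: str,
    min_activity: float = 6.0,  # assumed potency cutoff
    min_margin: float = 1.0,    # assumed gap over the best off-target
) -> List[Tuple[str, float, float]]:
    """Return (compound, on-target activity, selectivity margin) tuples.

    A compound is reported when its activity on `target` is at least
    `min_activity` and exceeds its strongest measured off-target
    activity by at least `min_margin`. Compounds with no measured
    off-target data are skipped, since selectivity cannot be assessed.
    """
    hits = []
    for compound, row in matrix.items():
        on_target = row.get(target)
        if on_target is None or on_target < min_activity:
            continue
        off_target = [v for t, v in row.items() if t != target and v is not None]
        if not off_target:
            continue
        margin = on_target - max(off_target)
        if margin >= min_margin:
            hits.append((compound, on_target, margin))
    # Most potent first; ties broken by compound name for determinism.
    hits.sort(key=lambda h: (-h[1], h[0]))
    return hits


# Made-up example data, for illustration only.
matrix: ActivityMatrix = {
    "cmpd_A": {"kinase_1": 7.5, "kinase_2": 5.0, "protease_1": None},
    "cmpd_B": {"kinase_1": 7.0, "kinase_2": 6.8, "protease_1": 4.0},
    "cmpd_C": {"kinase_1": 5.2, "kinase_2": 8.1, "protease_1": 4.5},
    "cmpd_D": {"kinase_1": 6.4, "kinase_2": None, "protease_1": None},
}

for compound, activity, margin in selective_hits(matrix, "kinase_1"):
    print(f"{compound}: activity={activity:.1f}, margin={margin:.1f}")
```

In this toy data, `cmpd_B` is potent against `kinase_1` but nearly as potent against `kinase_2`, so it is not reported as selective, and `cmpd_D` is skipped because it has no off-target measurements. Real screening data would be far sparser and noisier, which is part of what motivates the more sophisticated analysis methods proposed here.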
Broader Impact: The ability to discover chemical probes for a wide range of novel protein targets will make it possible to identify drugs for pharmaceutically relevant proteins, positively impacting the rate of drug discovery. In addition, it will greatly increase the set of proteins that can be selectively modulated by small organic molecules, expand the range of biological processes that can be investigated with chemical genetics approaches, and allow researchers to use chemical genetics techniques to gain insight into the mechanisms of action associated with certain phenotypes. This will provide a better understanding of the dynamics of these processes and will supplement existing approaches based on molecular genetics. To further aid in the broad dissemination of the results and enhance scientific understanding, the computational methods developed will be made freely available via stand-alone or web-based services to aid researchers working in the area of chemical genomics. Finally, the project integrates the research with an educational plan that focuses on interdisciplinary undergraduate, graduate, and post-graduate education in the areas of Computer Science, Medicinal Chemistry, and Chemical Genetics.
Key Words: supervised learning; semi-supervised learning; cheminformatics; structural bioinformatics; data mining; graph algorithms