The broad aim of this proposal is to facilitate structural and functional genomics of cancer.
The specific aims are to develop and apply computational tools for (i) identifying and annotating cancer-related protein sequences; (ii) prioritizing target proteins for the structural genomics of cancer; and (iii) maximizing structural information about cancer-related proteins.
The first aim will be achieved by collecting cancer-related protein sequences from The Cancer Genome Anatomy Project at NCI and by identifying additional such sequences in databases of metabolic and signaling pathways and in primary sequence databases. Proteins that occur in the same pathway as cancer proteins, have regulatory patterns similar to theirs, interact with them, or share expression features with them will also be considered cancer-related proteins. Queryable and up-to-date annotations of cancer-related proteins will be obtained by sensitive comparisons to all known protein sequences and structures. The annotations will include comparative protein structure models for all cancer-related proteins with assigned folds.
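The association criteria above can be illustrated with a minimal sketch in Python. It is not the proposal's implementation: the function name, the data layout, and the identifiers `P1` to `P6` are all hypothetical, and only two of the criteria (shared pathway and direct interaction) are shown.

```python
def expand_cancer_related(seeds, pathways, interactions):
    """Return the seed proteins plus proteins associated with any seed.

    seeds        -- iterable of protein identifiers known to be cancer proteins
    pathways     -- dict mapping a pathway name to a list of member proteins
    interactions -- iterable of (protein_a, protein_b) pairs

    All three inputs are hypothetical stand-ins for database queries.
    """
    seeds = set(seeds)
    related = set(seeds)

    # Criterion 1: every member of a pathway that contains a seed protein.
    for members in pathways.values():
        if seeds & set(members):
            related.update(members)

    # Criterion 2: every direct interaction partner of a seed protein.
    for a, b in interactions:
        if a in seeds:
            related.add(b)
        if b in seeds:
            related.add(a)

    return related


# Toy data with placeholder identifiers (not real proteins).
seeds = {"P1"}
pathways = {"pathway_A": ["P1", "P2"], "pathway_B": ["P3", "P4"]}
interactions = [("P1", "P5"), ("P3", "P6")]
print(sorted(expand_cancer_related(seeds, pathways, interactions)))
```

The expansion is deliberately one step deep, so proteins are added only when they are linked directly to a seed; whether associations should be followed transitively is a design choice the text does not specify.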
The second aim is to identify and prioritize target protein domains for the AECOM/Brookhaven/Rockefeller Structural Genomics Research Consortium (SGRC), which will focus on developing high-throughput technology for structure determination of the cancer-related proteins by X-ray crystallography and NMR spectroscopy. The target domains will correspond primarily to the yeast homologs of the cancer-related proteins without known structure. The target list will be dynamically updated to maximize information from structure determinations.
The third aim is to analyze and use the structures determined by SGRC for comparative structure modeling and comparative analysis of as many cancer-related proteins as possible. The annotation, modeling, and analysis tools will build on the MAGPIE system for automated genome annotation and on the MODELLER pipeline for large-scale comparative modeling. The annotations will be defined in the computer language Prolog through logical rules and relational facts, including rules to capture computed alignment data, domain definitions, and user preferences about properties of target domains. The ability to refer at the same time to the sequence, structure, and function of cancer-related proteins, organized in sequence and structure families, will allow cancer researchers to address questions that are currently not easily answered. This project will significantly increase the amount of protein structure information available to cancer biologists. The set of cancer-related proteins, their annotations, family membership, and structural models will be accessible efficiently over the web.
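The idea of combining relational facts with a selection rule can be sketched as follows. This is an illustration in Python rather than Prolog, under stated assumptions: the `Domain` record, its fields, and the use of family size as the ranking criterion are all hypothetical and are not taken from the proposal.

```python
from typing import NamedTuple, Optional


class Domain(NamedTuple):
    """One relational fact about a protein domain (hypothetical schema)."""
    name: str
    protein: str
    yeast_homolog: Optional[str]   # None if no yeast homolog is known
    has_known_structure: bool
    family_size: int               # assumed proxy for modeling leverage


def is_target(domain, cancer_related):
    """Rule: a domain is a target if its protein is cancer-related,
    it has a yeast homolog, and no structure is known for it."""
    return (
        domain.protein in cancer_related
        and domain.yeast_homolog is not None
        and not domain.has_known_structure
    )


def prioritize(domains, cancer_related):
    """Select target domains and rank them, largest family first.

    Ranking by family size is an assumption made for this sketch: a
    structure from a large family could serve as a template for more
    comparative models.
    """
    targets = [d for d in domains if is_target(d, cancer_related)]
    return sorted(targets, key=lambda d: d.family_size, reverse=True)


# Toy facts with placeholder identifiers (not real proteins).
facts = [
    Domain("d1", "P1", "Y1", False, 12),
    Domain("d2", "P2", None, False, 40),   # excluded: no yeast homolog
    Domain("d3", "P3", "Y3", True, 25),    # excluded: structure known
    Domain("d4", "P4", "Y4", False, 30),
    Domain("d5", "P9", "Y9", False, 50),   # excluded: not cancer-related
]
for d in prioritize(facts, {"P1", "P2", "P3", "P4"}):
    print(d.name, d.family_size)
```

Because the rule is re-evaluated over the current facts on each call, marking a domain as solved and calling `prioritize` again yields an updated target list.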