The difference between the number of proteins with known sequence and those with well-studied function (the sequence-function gap) is growing daily. One well-defined, coarse-grained aspect of function is a protein's native subcellular localization, which has a central role in the Gene Ontology (GO) hierarchy. Many detailed and high-throughput experiments annotate localization. Where experiments do not reach, homology-based and de novo prediction methods succeed. Here, we propose the development of a comprehensive system that combines experimental resources with data mining techniques and novel prediction methods, with the objective of annotating localization for entirely sequenced eukaryotes at unprecedented detail and accuracy. Firstly, we propose to gather all available data and all relevant methods to build a comprehensive localization atlas for human and Arabidopsis. Secondly, we plan to develop novel methods tailored specifically to capture proteins for which we are left with no reliable annotations after completing the first step. We expect that these methods will focus on predicting the particular type of membrane into which an integral membrane protein is inserted, and the native localization for minor eukaryotic compartments (ER, Golgi, lysosome). Thirdly, we propose to implement specific improvements over today's motif-based methods for secreted and nuclear proteins, and to extend de novo predictions for the major compartments. An important objective will be to maintain high levels of performance for splice variants and for sequence fragments. Overall, the project will require the analysis of existing biological databases, the development of novel methods, and the combination of existing ones; it will generate novel information available through internet servers, standalone programs, and databases.
The annotations generated by our system will aid the design of detailed and high-throughput experimental studies. In particular, localization may become increasingly relevant as one essential feature used to infer networks of interactions. The ultimate goal of our project is the generation of an atlas that maps all proteins in a cell. Eventually, this atlas will constitute a 4D map; it will localize proteins in their 3D cellular environments and resolve the coarse-grained dynamics of the system, e.g. "expression on ribosomes, bind importin, transport into nucleus, bind DNA, bind exportin, export out of nucleus; next cell cycle". The components proposed here constitute one crucial building block toward such a 4D map of a cell.