Knowledge of protein function serves as a cornerstone of biomedical research and is fundamental to understanding biological systems, the mechanisms of disease and, ultimately, human health. Decades of biomedical research have accumulated a great wealth of such knowledge, available in the form of the biomedical literature. An important task of biomedical informatics is to acquire and represent the knowledge contained in the free text of the literature and transform it into languages that computational agents can understand, so that the knowledge can be stored, retrieved and used for knowledge discovery. Currently, all protein annotations are assigned manually, which, unfortunately, is extremely labor-intensive and cannot keep pace with the growth of information. Indeed, with the completion of the genome sequences of several model organisms, manual annotation of proteins has already become a major bottleneck between the large number of proteins and the exploding amount of information in the biomedical literature. In this application, we propose to develop methods to facilitate automatic annotation of protein functions based on the functional information buried in the biomedical literature. The proposed methods adapt and extend state-of-the-art probabilistic semantic analysis, information retrieval and machine learning methodologies, which serve as principled approaches to modeling uncertainties in natural language text. The project will develop algorithmic building blocks for a future automatic annotation system such that, when given a brief description of a protein (e.g., a protein name and symbol), it will be capable of retrieving relevant literature articles about the protein, extracting biological concepts from the articles and mapping the concepts to a controlled vocabulary.
We envision that achieving these goals will yield advances with broad impact, facilitating not only automatic protein annotation but also biomedical literature indexing, one of the important areas of biomedical informatics. Efficient knowledge acquisition and management will enhance biomedical research on the mechanisms of disease and on drug discovery.
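As a rough illustration only, the three stages named above (retrieving articles about a protein, extracting concepts, and mapping them to a controlled vocabulary) might be sketched as follows. This is a toy sketch, not the proposed method: plain keyword matching stands in for the probabilistic semantic analysis and machine learning the project describes, and the article texts, the term identifiers (T1 to T3) and all function names are hypothetical.

```python
import re
from collections import Counter

# Hypothetical toy data. A real system would query a literature database
# and use a full controlled vocabulary instead of these three made-up terms.
ARTICLES = {
    "a1": "BRCA1 is involved in DNA repair and cell cycle control.",
    "a2": "Insulin regulates glucose metabolism in muscle cells.",
    "a3": "Loss of BRCA1 impairs DNA repair after damage.",
}
VOCABULARY = {
    "T1": "DNA repair",
    "T2": "glucose metabolism",
    "T3": "cell cycle",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(protein_name, articles):
    """Stage 1: return ids of articles that mention the protein name."""
    query = protein_name.lower()
    return [aid for aid, text in sorted(articles.items())
            if query in tokenize(text)]


def extract_and_map(article_ids, articles, vocabulary):
    """Stages 2 and 3: count vocabulary terms whose words all occur in an article."""
    counts = Counter()
    for aid in article_ids:
        tokens = set(tokenize(articles[aid]))
        for term_id, label in vocabulary.items():
            if set(tokenize(label)) <= tokens:
                counts[term_id] += 1
    # Terms supported by more articles come first.
    return counts.most_common()


def annotate(protein_name):
    """Run the whole toy pipeline for one protein name."""
    hits = retrieve(protein_name, ARTICLES)
    return extract_and_map(hits, ARTICLES, VOCABULARY)


if __name__ == "__main__":
    print(annotate("BRCA1"))
```

In this sketch each candidate term is ranked by the number of retrieved articles that support it. A probabilistic approach of the kind the project proposes would replace that raw count with a score that reflects uncertainty in the text.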