Knowledge in molecular biology consists of assertions about the relationships among molecular entities, qualified by context that describes when and where those assertions apply. The vast majority of this knowledge resides in the primary research literature, and only a small fraction is currently accessible through well-structured databases. This is a pilot project to develop automated knowledge extraction technology, using the regulation of gene expression in hematopoiesis as a test domain. Knowledge acquisition will be accomplished through a multi-stage process: parsing the document and sentence structure; recognizing the names of known biological entities; matching sentences to verb-based templates to capture assertions (e.g., "A binds B", "A contains B", or "A regulates B"); and matching preposition templates to capture the context in which these assertions apply. A multi-disciplinary approach will be used, drawing on experts in bioinformatics, databases, information science, and computational linguistics. Four unique aspects of this project are the definition of a multi-dimensional description of molecular biological context; the use of preposition templates and hierarchical document structure to capture and make inferences about context; the development of domain-specific parsing techniques; and the use of probabilistic representations, explicitly represented in XML, throughout text processing, parsing, knowledge acquisition, and information integration.
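As a rough illustration of the template-matching stages described above, the following minimal Python sketch pairs a verb template with preposition templates on a single tokenized sentence. It is not the project's implementation: the entity lexicon, the template tables, the context dimension names (`location`, `stage`), and the `Assertion` and `extract` names are all hypothetical, and real sentence parsing, document structure, and probabilistic scoring are omitted.

```python
import re
from dataclasses import dataclass, field

# Hypothetical lexicon of known entity names (illustrative only).
ENTITIES = {"GATA1", "EPOR", "FOG1"}

# Hypothetical verb templates: surface verb -> relation label.
VERB_TEMPLATES = {"binds": "binds", "contains": "contains", "regulates": "regulates"}

# Hypothetical preposition templates: preposition -> context dimension.
PREP_TEMPLATES = {"in": "location", "during": "stage"}


@dataclass
class Assertion:
    subject: str
    relation: str
    object: str
    context: dict = field(default_factory=dict)


def extract(sentence):
    """Return the first 'A <verb> B' assertion found, with any context, or None."""
    tokens = re.findall(r"[A-Za-z0-9]+", sentence)
    for i in range(1, len(tokens) - 1):
        relation = VERB_TEMPLATES.get(tokens[i].lower())
        # A verb template fires only when flanked by two known entities.
        if relation and tokens[i - 1] in ENTITIES and tokens[i + 1] in ENTITIES:
            context = {}
            rest = tokens[i + 2:]
            j = 0
            while j < len(rest):
                dimension = PREP_TEMPLATES.get(rest[j].lower())
                if dimension is None:
                    j += 1
                    continue
                # Collect the words that follow, up to the next known preposition.
                k = j + 1
                words = []
                while k < len(rest) and rest[k].lower() not in PREP_TEMPLATES:
                    words.append(rest[k])
                    k += 1
                if words:
                    context[dimension] = " ".join(words)
                j = k
            return Assertion(tokens[i - 1], relation, tokens[i + 1], context)
    return None
```

For example, calling `extract("GATA1 regulates EPOR in erythroid progenitors during differentiation.")` yields an assertion whose relation is `regulates` and whose context maps `location` to `erythroid progenitors` and `stage` to `differentiation`. A sentence with no matching template returns `None`.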