It is becoming increasingly difficult for biologists to keep pace with the information being published within their own fields, let alone in biology as a whole. Rapid access to specific, current biomedical information, and the ability to quickly gain an overview of current knowledge in a given field, are becoming harder to achieve at the same time as they become more important. Traditional methods of keeping up with advances are therefore becoming inadequate. Here we propose to continue developing our Medstract Project, which applies recent advances in the computational analysis of text to organize and structure the biological literature. The Medstract project will reduce the time required for biomedical researchers to find information of interest and should facilitate the development of new research insights. This project is the result of a unique collaboration between a computational linguistics lab at Brandeis University and a molecular biology lab at Tufts University School of Medicine. Previously, we developed an extensive set of tools for analyzing and processing biomedical text. We have used these tools to build databases of biomedical acronyms, inhibitors, regulators, and interactors from Medline abstracts and have made these available on the web. These resources are currently used by hundreds of investigators every day. In addition, we have generated and made available gold-standard markup files for several biological terms and relations, for use as testing standards by other groups developing knowledge extraction engines for the biomedical domain. Here we propose to extend and enhance our current Medstract databases, as well as to generate new databases using the tools that we have developed. New databases will include protein modifications, domains and motifs, and tissue and cellular localization information. In addition, we will use the bio-relation databases as the foundation for constructing a system allowing point-to-point regulatory pathway identification.
We will enhance the robustness of these databases by utilizing algorithms that we have developed for rerendering the semantic ontologies for the biomedical lexicon. Furthermore, by applying coreference resolution algorithms to the text, we will improve the precision and recall of knowledge extraction for populating the databases.
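As a rough illustration of the kind of acronym-meaning extraction mentioned above, the following minimal sketch pairs a parenthesized acronym with the words that precede it when their initials match. It is not the Medstract implementation: the function name extract_acronym_pairs and the initial-letter matching heuristic are assumptions made only for this example, and a real system would need to handle hyphenated terms, skipped words, and nested parentheses.

```python
import re


def extract_acronym_pairs(text):
    """Return (acronym, long_form) pairs for 'long form (ACRONYM)' patterns.

    Hypothetical helper for illustration only; the heuristic simply checks
    that the initials of the preceding words spell the acronym.
    """
    pairs = []
    # Candidate acronyms: 2-10 uppercase letters or digits inside parentheses.
    for match in re.finditer(r"\(([A-Z][A-Z0-9]{1,9})\)", text):
        acronym = match.group(1)
        # Tokenize everything before the opening parenthesis.
        preceding = re.findall(r"[A-Za-z0-9-]+", text[:match.start()])
        # Take as many preceding words as the acronym has characters.
        candidate = preceding[-len(acronym):]
        if len(candidate) < len(acronym):
            continue
        initials = "".join(word[0].upper() for word in candidate)
        if initials == acronym:
            pairs.append((acronym, " ".join(candidate)))
    return pairs


sample = (
    "The epidermal growth factor receptor (EGFR) is inhibited by gefitinib, "
    "whereas tumor necrosis factor (TNF) is not."
)
for acronym, long_form in extract_acronym_pairs(sample):
    print(acronym, "=", long_form)
```

Pairs whose initials do not line up are simply skipped, which favors precision over recall in this sketch.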

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Research Project (R01)
Project #: 5R01LM006649-05
Application #: 6896406
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Ye, Jane
Project Start: 2004-06-01
Project End: 2007-05-31
Budget Start: 2005-06-01
Budget End: 2006-05-31
Support Year: 5
Fiscal Year: 2005
Total Cost: $403,171
Indirect Cost:

Name: Brandeis University
Department: Biostatistics & Other Math Sci
Type: Schools of Arts and Sciences
DUNS #: 616845814
City: Waltham
State: MA
Country: United States
Zip Code: 02454

Pustejovsky, J; Castano, J; Zhang, J et al. (2002) Robust relational parsing over biomedical literature: extracting inhibit relations. Pac Symp Biocomput :362-73
Pustejovsky, J; Castano, J; Cochran, B et al. (2001) Automatic extraction of acronym-meaning pairs from MEDLINE databases. Medinfo 10:371-5