? This proposal is for mentored training in the molecular biosciences of an established computer scientist. The training plan includes basic and advanced course work in modern biology, interactions with biological research groups, attendance at seminars and conferences, and laboratory training. Mentoring on the culture and practices of biomedical research will be provided by the sponsor. The training institution has a longstanding tradition of interdisciplinary research and specific expertise in cutting edge proteomics methods. The candidate will be fully committed to a combination of training and research. The research plan is based on the critical need to organize and summarize the knowledge in the vast biomedical literature. Curated databases are expensive to create and maintain; do not estimate confidence of assertions; and do not allow for divergence of opinions. Information extraction (IE) methods can be used to partially overcome these limitations by automatically extracting certain types of information from biomedical text. ? ? In most genres of scientific publication, the most important results in a paper are illustrated in non-textual forms, such as images and graphs. The broad thesis underlying our proposed research is that one can provide better access to the information in online scientific publications by extracting information jointly from figure images and their accompanying captions. With the exception of certain previous work by the Murphy group, previous biomedical IE systems have not attempted to extract information from image data, only text. ? ? This proposal addresses these issues in the specific context of fluorescence microscope images depicting the subcellular localization of proteins. This goal is consonant with a major focus of current biomedical research: the identification of expressed genes and the description of the proteins they encode. Motivated by recent large-scale projects which major focus of current biomedical research is the identification of expressed genes and the description (or annotation) of the proteins they encode, the Murphy group has developed automated systems for recognizing subcellular structures in 2D and 3D images. Automated image analysis techniques have also been applied to images harvested from online biomedical journal articles. This system will be extended to create a robust, comprehensive toolset for extracting, verifying and querying biologically relevant information from the text and images found in online journals. Based on this toolkit, a set of tools will be developed for aiding researchers to identify and locate information found in online journals. Upon completion of the proposed training, the candidate will be well placed to take a leadership position in machine learning applications to the range of experimental methods used in biomedical research. ? ? ? ?

National Institute of Health (NIH)
National Institute on Drug Abuse (NIDA)
Mentored Quantitative Research Career Development Award (K25)
Project #
Application #
Study Section
Human Development Research Subcommittee (NIDA)
Program Officer
Colvis, Christine
Project Start
Project End
Budget Start
Budget End
Support Year
Fiscal Year
Total Cost
Indirect Cost
Carnegie-Mellon University
Schools of Arts and Sciences
United States
Zip Code
Kou, Zhenzhen; Cohen, William W; Murphy, Robert F (2007) A stacked graphical model for associating sub-images with sub-captions. Pac Symp Biocomput :257-68
Cohen, William W; Minkov, Einat (2006) A graph-search framework for associating gene identifiers with documents. BMC Bioinformatics 7:440
Kou, Zhenzhen; Cohen, William W; Murphy, Robert F (2005) High-recall protein entity recognition using a dictionary. Bioinformatics 21 Suppl 1:i266-73