Our long-term goal is to assist biomedical scientists by routinely extracting and codifying new knowledge from large biomedical databases by computer. As large collections of data become more readily accessible, the opportunities for discovering new information increase. We propose here to work toward this goal by extending our prior research on machine learning in two important directions: (1) codification of disparate pieces of knowledge into a coherent model (model building), and (2) discovery of new information in medical databases (data mining). Machine learning programs find classification rules (or decision trees or networks) that separate members of a target class from other individuals. Such programs have emphasized predictive accuracy, with some attention to tradeoffs between accuracy and cost of errors or between accuracy and simplicity. We propose a framework in which these, and other, tradeoffs are explicit and the criteria by which tradeoffs are made are available for modification. We also include semantic considerations among the criteria to control the internal coherence of models. "Data mining" is a recently coined term for using computers to explore large databases, with a goal of discovering new relationships but usually with no specific target defined at the outset. In addition to accuracy, simplicity, coherence, and cost, a program that purports to discover new relationships must be able to assess novelty. We propose to measure the extent to which proposed relationships are novel by comparing them against existing knowledge in the domain of discourse, and to look for unusual rules (and other relations) that would be very interesting if true. The computer program we are primarily building on, RL, is a knowledge-based learning program that learns classification rules from a collection of data.
RL has been demonstrated to be flexible enough to allow guidance from prior knowledge, and powerful enough to learn publishable information for scientists working in several different domains. Both parts of the research will require extending the RL system in new ways, detailed in the research plan, that are consistent with the overall design philosophy of the present system. We will primarily work with data already collected on pneumonia patients, with which we have considerable experience. We will test the generality of the criteria used to evaluate models and discoveries with a Bayesian network learning system, K2.

National Institutes of Health (NIH)
National Library of Medicine (NLM)
Research Project (R01)
Study Section: Biomedical Library and Informatics Review Committee (BLR)
Program Officer: Florance, Valerie
University of Pittsburgh
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States
Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi et al. (2004) Automatic annotation of protein motif function with Gene Ontology terms. BMC Bioinformatics 5:122
Chapman, W W; Bridewell, W; Hanbury, P et al. (2001) A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 34:301-10