Our long-term goal is to assist biomedical scientists by routinely extracting and codifying new knowledge from large biomedical databases by computer. As large collections of data become more readily accessible, the opportunities for discovering new information increase. We propose here to work toward this goal by extending our prior research on machine learning in two important directions: (1) codification of disparate pieces of knowledge into a coherent model (model building), and (2) discovery of new information in medical databases (data mining). Machine learning programs find classification rules (or decision trees or networks) that separate members of a target class from other individuals. These programs have emphasized predictive accuracy, with some attention to tradeoffs between accuracy and cost of errors or between accuracy and simplicity. We propose a framework in which these and other tradeoffs are explicit and the criteria by which tradeoffs are made are available for modification. We also include semantic considerations among the criteria to control the internal coherence of models. "Data mining" is a recently coined term for using computers to explore large databases, with a goal of discovering new relationships but usually with no specific target defined at the outset. In addition to accuracy, simplicity, coherence, and cost, a program that purports to discover new relationships must be able to assess novelty. We propose to measure the extent to which proposed relationships are novel by comparing them against existing knowledge in the domain of discourse, and to look for unusual rules (and other relations) that would be very interesting if true. The computer program we are primarily building on, RL, is a knowledge-based learning program that learns classification rules from a collection of data.
RL has been demonstrated to be flexible enough to allow guidance from prior knowledge, and powerful enough to learn publishable information for scientists working in several different domains. Both parts of the research will require extending the RL system in new ways, detailed in the research plan, that are consistent with the overall design philosophy of the present system. We will primarily work with data already collected on pneumonia patients, with which we have considerable experience. We will test the generality of the criteria used to evaluate models and discoveries with a Bayesian Net learning system, K2.

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Research Project (R01)
Project #
5R01LM006759-02
Application #
6185231
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Florance, Valerie
Project Start
1999-05-01
Project End
2002-04-30
Budget Start
2000-05-01
Budget End
2001-04-30
Support Year
2
Fiscal Year
2000
Total Cost
$213,046
Indirect Cost
Name
University of Pittsburgh
Department
Biostatistics & Other Math Sci
Type
Schools of Arts and Sciences
DUNS #
053785812
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213
Lu, Xinghua; Zhai, Chengxiang; Gopalakrishnan, Vanathi et al. (2004) Automatic annotation of protein motif function with Gene Ontology terms. BMC Bioinformatics 5:122
Chapman, W W; Bridewell, W; Hanbury, P et al. (2001) A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform 34:301-10