In this collaborative project with Yiming Yang at the Mayo Clinic, we use term strength, a measure I defined and currently use in the Bayesian retrieval system for the Entrez neighbors, to set thresholds for term removal. This identifies a large number of terms as relatively useless. Removing them greatly simplifies the problem of categorizing text by the terms that appear in it. For the linear least squares fitting method developed and used by Dr. Yang, we find a time savings of 70 to 90%, which comes from removing 80% or more of the terms. Dr. Yang has also developed what she terms an expert network method of text classification. It finds the nearest neighbors of a text and uses their classifications to predict the best classification for that text. The term removal methods provide significant time and space savings for this approach as well.
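
The two ideas in this abstract, threshold-based term removal and nearest-neighbor classification, can be illustrated with a minimal sketch. The abstract does not state how term strength is computed, so the code assumes one common formulation: the estimated probability that a term occurs in one document of a related pair given that it occurs in the other. The function names, the Jaccard similarity, the similarity-weighted vote, and the toy data are all hypothetical choices made for illustration; this is not the Entrez system or Dr. Yang's expert network implementation.

```python
from collections import Counter


def term_strength(related_pairs):
    """Estimate P(term in other doc | term in this doc) over related pairs.

    Assumed definition, not taken from the abstract. Each pair is two sets
    of terms; relatedness is treated as symmetric.
    """
    seen = Counter()    # times a term occurs in one side of a pair
    shared = Counter()  # times it also occurs in the other side
    for x, y in related_pairs:
        for a, b in ((x, y), (y, x)):
            for term in a:
                seen[term] += 1
                if term in b:
                    shared[term] += 1
    return {t: shared[t] / seen[t] for t in seen}


def prune_vocabulary(strengths, threshold):
    """Keep only terms whose estimated strength reaches the threshold."""
    return {t for t, s in strengths.items() if s >= threshold}


def knn_classify(query, labeled_docs, vocabulary, k=3):
    """Predict a label from the k most similar documents.

    Similarity is Jaccard overlap on the pruned vocabulary (an illustrative
    choice); each neighbor votes with a weight equal to its similarity.
    """
    q = set(query) & vocabulary

    def jaccard(a, b):
        union = a | b
        return len(a & b) / len(union) if union else 0.0

    scored = sorted(
        ((jaccard(q, set(doc) & vocabulary), label) for doc, label in labeled_docs),
        key=lambda pair: pair[0],
        reverse=True,
    )
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0] if votes else None


# Toy data, invented for this example.
pairs = [
    ({"heart", "attack", "study"}, {"heart", "disease", "report"}),
    ({"gene", "expression", "study"}, {"gene", "mutation", "patient"}),
]
strengths = term_strength(pairs)
vocabulary = prune_vocabulary(strengths, threshold=0.5)
training = [({"heart", "failure"}, "cardiology"), ({"gene", "therapy"}, "genetics")]
predicted = knn_classify({"heart", "study"}, training, vocabulary, k=1)
```

Because pruning shrinks every document to the retained vocabulary before any similarity is computed, both the comparison cost and the stored representation scale with the smaller term set, which is the kind of time and space saving the abstract reports.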

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
Intramural Research (Z01)
Project #
1Z01LM000062-02
Application #
2578635
Study Section
Special Emphasis Panel (CBB)
Support Year
2
Fiscal Year
1996
Name
National Library of Medicine
Country
United States