The long-term objective of our research group is to facilitate automatic or semi-automatic classification and retrieval of natural language texts, in support of reducing the cost and improving the quality of computerized medical information. This proposal further develops and applies a novel approach, the Linear Least Squares Fit (LLSF) mapping, to document indexing and document retrieval in the MEDLINE database. LLSF mapping is a statistical method developed by the PI for learning human knowledge about matching queries, documents, and canonical concepts. The goal is to improve the quality (recall and precision) of automatic document indexing and retrieval beyond what can be achieved by surface-based matching, which uses no human knowledge, or by thesaurus-based matching, which depends on manually developed synonyms. This project applies LLSF to MEDLINE, the world's largest and most frequently used on-line database, to evaluate the effectiveness of this method and to explore its practical potential on large-scale databases.
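The core of the LLSF mapping can be illustrated as a linear least-squares problem: given training documents represented as term vectors (columns of A) and their human-assigned concept vectors (columns of B), find the matrix W minimizing ||WA - B|| in the Frobenius norm, then apply W to map new documents to concept scores. The following is a minimal sketch of that idea with made-up toy matrices, not the project's actual data or implementation.

```python
import numpy as np

# Toy illustration of the LLSF idea: learn a linear mapping W that sends
# document term vectors to concept (e.g., MeSH heading) vectors by solving
# the least-squares problem  min_W ||W A - B||_F  on training pairs.
# All matrices below are made-up toy examples, not MEDLINE data.

# A: terms x documents (each column is one document's term vector)
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 0.0]])

# B: concepts x documents (each column marks the concepts a human
# indexer assigned to that document)
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])

# min_W ||W A - B||_F  is equivalent to  min_X ||A^T X - B^T||_F
# with W = X^T, which np.linalg.lstsq solves directly.
X, *_ = np.linalg.lstsq(A.T, B.T, rcond=None)
W = X.T  # concepts x terms

# Map a new (unseen) document's term vector to concept scores;
# high-scoring concepts become indexing candidates.
new_doc = np.array([1.0, 1.0, 0.0])
scores = W @ new_doc
```

Because the mapping is learned from human indexing decisions, a term can activate a concept it never literally matches, which is the source of the recall gains over surface-based matching.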
The specific aims and methods are: 1. To collect the data needed for training and evaluating the LLSF method. A collaboration with another research institute is planned for utilizing and refining a large collection of MEDLINE retrieval data. A sample of MEDLINE searches at the Mayo Clinic will be used to obtain additional tasks. 2. To develop automatic noise reduction techniques that improve both the accuracy of the LLSF mapping and the efficiency of the computation. A multi-step noise reduction in the LLSF training process will be investigated, including statistical term weighting to remove non-informative terms, a truncated singular value decomposition (SVD) to reduce noise at the semantic-structure level, and truncation of insignificant elements in the LLSF solution matrix to reduce noise at the level of the term-to-concept mapping. 3. To scale up the training capacity so that LLSF can accommodate the large size of the MEDLINE data. A split-merge approach decomposes a large training sample into tractable subsets, computes an LLSF mapping function for each subset, and then merges the local mapping functions into a global one. 4. To improve computational efficiency by employing algorithms optimized for sparse matrices and for noise reduction. Potential solutions include the block Lanczos truncated-SVD algorithm, which can reduce the cubic time complexity of standard SVD (on dense matrices) to quadratic; a QR decomposition, which solves the LLSF without SVD; a sparse-matrix algorithm, which has sped up matrix multiplication and cosine computation by one to four orders of magnitude; and parallel computing. 5. To evaluate the effectiveness of LLSF on large MEDLINE document sets and to compare it with the performance of alternative indexing/retrieval systems.
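The truncated-SVD step in aim 2 can be sketched as follows: keeping only the k largest singular triplets of a matrix yields its best rank-k approximation (Eckart-Young), discarding the low-variance directions treated as noise. This is a generic illustration on a random toy matrix, under assumed shapes; it is not the project's block Lanczos implementation, which avoids forming the full dense SVD.

```python
import numpy as np

# Truncated-SVD noise reduction on a toy terms-x-documents matrix.
# Only the k largest singular triplets are retained; the discarded
# directions are treated as noise at the semantic-structure level.
rng = np.random.default_rng(0)
A = rng.random((6, 5))  # toy terms x documents matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2                                     # retained rank (assumed)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# By Eckart-Young, A_k is the best rank-k approximation of A in the
# Frobenius norm, and the error equals the norm of the dropped spectrum.
err = np.linalg.norm(A - A_k)
```

The same retained factors also make the subsequent least-squares solve cheaper, since the mapping can be computed in the reduced k-dimensional space rather than over the full term vocabulary.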

Agency
National Institutes of Health (NIH)
Institute
National Library of Medicine (NLM)
Type
First Independent Research Support & Transition (FIRST) Awards (R29)
Project #
5R29LM005714-04
Application #
2685564
Study Section
Biomedical Library and Informatics Review Committee (BLR)
Program Officer
Bean, Carol A
Project Start
1995-04-01
Project End
2000-03-31
Budget Start
1998-04-01
Budget End
1999-03-31
Support Year
4
Fiscal Year
1998
Total Cost
Indirect Cost
Name
Carnegie-Mellon University
Department
Biostatistics & Other Math Sci
Type
Other Domestic Higher Education
DUNS #
052184116
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213
Yang, Y; Chute, C G (1995) Sampling strategies in a statistical approach to clinical classification. Proc Annu Symp Comput Appl Med Care :32-6