1) Electronic Textbook and PubMed Central Indexing. Current processing of the electronic textbook material involves a number of steps designed to identify the most meaningful phrases in the text for use as reference points. The first task is to identify grammatically reasonable phrases. We use a version of the Brill transformation-based tagger, rewritten in C++, for part-of-speech tagging, and the resulting tags form the basis for determining which phrases are grammatically reasonable. A substantial postprocessing step then removes phrases that involve inappropriate references to context (e.g., different cells, final mutation). After finding grammatically reasonable phrases, we attempt to eliminate those that are too common or generic to be useful (e.g., significant result, short time). The next step is to compare each phrase with previously rated phrases that have been collected over the life of the project. The final stage is to estimate the importance of a phrase in the textbook passage where it is found. This estimate is based on the frequency of the phrase in the passage and the size of the passage, compared with the frequency of the phrase throughout the book and the overall size of the book. To improve the estimate, we attempt to take into account not only the phrase itself but also any phrase that represents the same concept. For this purpose we use both the UMLS Metathesaurus and stemming, and we combine the two approaches into a consistent picture of the concept as it occurs in the text. The result of this processing is a scored list of (phrase, book section) pairs for each textbook. These pairs are used to guide the response to general searches in the books: when a user types in a phrase that is on our curated list, the first results given are the highly rated book sections for that phrase. We are now applying a similar indexing scheme to the text of articles in PubMed Central. This allows us to give a list of highly rated phrases for each article as an enhanced reference point for searchers.
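The final scoring stage in item 1 can be illustrated with a small Python sketch. The exact formula is not given here, so the rate ratio below is only an assumption chosen for simplicity. The function names, the `concept_of` mapping (a stand-in for a UMLS Metathesaurus or stemming lookup), and all counts are hypothetical.

```python
from collections import Counter


def merge_by_concept(phrase_counts, concept_of):
    """Sum the counts of phrases that map to the same concept key.

    `concept_of` is a hypothetical dict standing in for a UMLS Metathesaurus
    or stemming lookup; unmapped phrases are treated as their own concept.
    """
    merged = Counter()
    for phrase, count in phrase_counts.items():
        merged[concept_of.get(phrase, phrase)] += count
    return merged


def passage_importance(count_in_passage, passage_size, count_in_book, book_size):
    """Ratio of a concept's rate in the passage to its rate in the whole book.

    Assumed scoring rule, not the project's actual formula. Values above 1
    mean the concept is more concentrated in the passage than in the book.
    """
    if passage_size <= 0 or book_size <= 0:
        raise ValueError("passage_size and book_size must be positive")
    if count_in_book == 0:
        return 0.0
    passage_rate = count_in_passage / passage_size
    book_rate = count_in_book / book_size
    return passage_rate / book_rate


# Toy numbers, invented for illustration only.
concept_of = {
    "heart attack": "myocardial infarction",
    "heart attacks": "myocardial infarction",
}
passage_counts = merge_by_concept(
    {"heart attack": 3, "heart attacks": 2, "short time": 1}, concept_of
)
book_counts = merge_by_concept(
    {"heart attack": 30, "heart attacks": 20, "short time": 400}, concept_of
)

# Sizes are word counts of the passage and of the book (assumed units).
specific = passage_importance(
    passage_counts["myocardial infarction"], 2_000,
    book_counts["myocardial infarction"], 500_000,
)
generic = passage_importance(
    passage_counts["short time"], 2_000, book_counts["short time"], 500_000
)
```

In this toy setting the merged concept is far more concentrated in the passage than in the book, while the generic phrase is not, which is the kind of contrast a scored list of (phrase, book section) pairs would capture.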
2) A significant fraction of queries in PubMed are multiterm queries, and PubMed generally handles them as a Boolean conjunction of the terms. However, analysis of PubMed queries indicates that many such queries are meaningful phrases rather than simply collections of terms. We have examined whether it makes a difference, in terms of retrieval quality, if such queries are interpreted as a phrase or as a conjunction of query terms and, if it does, what the optimal way of searching with such queries is. To address the question, we developed an automated retrieval evaluation method, based on machine learning techniques, that enables us to evaluate and compare various retrieval outcomes. We show that the classes of records that contain all the search terms, but not the phrase, differ qualitatively from the class of records containing the phrase. We also show that the difference is systematic and depends on the proximity of the query terms to each other within the record. Based on these results, one can establish the best retrieval order for the records. Our findings are consistent with studies in proximity searching. The important insight here for indexing is that in some cases where the words of a phrase occur in text, but not as the phrase, the phrase may still be an appropriate concept to use in indexing the text.
3) We have studied how good phrases can be recognized by their characteristics, such as frequency, tendency to be repeated in the documents where they occur, and other numerical properties. These features allow one to predict which phrases are of high quality. We have found such predictions to be useful in studying the different kinds of terms that may appear in text and how an ontology might be extracted from text.
4) We have found stochastic gradient descent (SGD) with regularization by early stopping to be a very efficient method for training a Support Vector Machine (SVM) for MeSH term assignment. We have discovered that early stopping can be implemented by stopping after a constant number of iterations, and that the results are as good as those obtained by stopping based on held-out data and as good as those from more conventional methods of training an SVM on large data sets. The SGD approach is much faster and makes it practical to train classifiers for all 27,000 MeSH terms. Results are superior to previously published methods. The approach could be the basis of indexing suggestions for PubMed records or of an automatic concept assignment system similar to MeSH.
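The training scheme in item 4 can be sketched as follows: a linear SVM (hinge loss) is trained by SGD with no explicit penalty term, and the fixed number of passes over the data acts as the only regularizer. This is a minimal illustration under stated assumptions, not the project's implementation. The function names, the constant learning rate, the dense feature lists, and the toy data are all hypothetical; a real system would use sparse text features and train one such binary classifier per MeSH term.

```python
import random


def dot(weights, features):
    """Inner product of two equal-length lists of floats."""
    return sum(w * x for w, x in zip(weights, features))


def train_linear_svm_sgd(examples, labels, n_epochs=5, learning_rate=0.1, seed=0):
    """Train a linear classifier on the hinge loss with plain SGD.

    Labels must be +1 or -1. Training stops after a constant number of
    epochs (`n_epochs`), which is the early-stopping rule; no held-out
    data is consulted. The learning rate is an assumed constant.
    """
    rng = random.Random(seed)
    weights = [0.0] * len(examples[0])
    bias = 0.0
    order = list(range(len(examples)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the examples in a fresh random order
        for i in order:
            x, y = examples[i], labels[i]
            margin = y * (dot(weights, x) + bias)
            if margin < 1.0:
                # Subgradient step on the hinge loss max(0, 1 - margin).
                weights = [w + learning_rate * y * xi for w, xi in zip(weights, x)]
                bias += learning_rate * y
    return weights, bias


def predict(weights, bias, features):
    """Return +1 if the term is predicted to apply, otherwise -1."""
    return 1 if dot(weights, features) + bias >= 0 else -1


# Toy, linearly separable data standing in for document feature vectors.
examples = [[3.0, 1.0], [2.0, -1.0], [-2.0, 1.0], [-3.0, -1.0]]
labels = [1, 1, -1, -1]

weights, bias = train_linear_svm_sgd(examples, labels, n_epochs=5)
predictions = [predict(weights, bias, x) for x in examples]
```

Because the stopping point is a constant, no held-out set has to be scored during training, which keeps the cost of each classifier low when the same procedure is repeated for every term.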

Support Year: 6
Fiscal Year: 2015
Name: National Library of Medicine
Yeganova, Lana; Kim, Won; Kim, Sun et al. (2014) Retro: concept-based clustering of biomedical topical sets. Bioinformatics 30:3240-8
Wilbur, W John; Kim, Won (2014) Stochastic Gradient Descent and the Prediction of MeSH for PubMed Records. AMIA Annu Symp Proc 2014:1198-207
Névéol, Aurélie; Wilbur, W John; Lu, Zhiyong (2012) Improving links between literature and biological data with text mining: a case study with GEO, PDB and MEDLINE. Database (Oxford) 2012:bas026
Sayers, Eric W; Barrett, Tanya; Benson, Dennis A et al. (2012) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 40:D13-25
Yeganova, Lana; Kim, Won; Comeau, Donald C et al. (2012) Finding biomedical categories in Medline®. J Biomed Semantics 3 Suppl 3:S3
Islamaj Dogan, Rezarta; Lu, Zhiyong (2010) Click-words: Learning to Predict Document Keywords from a User Perspective. Bioinformatics :