A system of C++ programs has been developed to find closely related documents in MEDLINE and to perform machine learning on sets of documents. The system has a number of distinctive features:
1) It is built from a set of C++ classes and is highly modular, so alterations to the system are relatively simple to make.
2) The system currently processes PubMed data by extracting it from the Sybase repositories through a C++ interface to Sybase. However, changing the interface portion of the system would allow it to be applied to any large database consisting of discrete textual records.
3) Data processed by the system are stored in compressed file structures. These structures are updatable, so new data can be added continually as they become available.
4) Documents are compared with each other using a Bayesian form of analysis.
5) The code has been multithreaded, and memory-mapping capabilities have been added to speed up processing.
6) Most recently, the code has been updated to work in a 64-bit environment.

The system is now being used not only to process all of MEDLINE for our research purposes, but also by other groups inside and outside the NLM to produce related documents for arbitrary pieces of text. It is currently proving useful in testing different retrieval parameters and methods on the PubMedHealth records.

We have recently developed a software system called DStor that allows us to store all of PubMed in a manner that is easily updatable and allows fast access. DStor is now being used to maintain and update five different versions of the PubMed data twice a week. It has greatly improved our access to PubMed data in various useful forms, and we anticipate that its use will continue to grow.

In addition, we have developed software to maintain and update a list of strings in which each string is associated with a fixed-length vector of integers.
We currently maintain a list of all multi-word phrases that contain no stop words or punctuation. Each phrase is associated with a vector of six integers holding counts of different types, computed over all PubMed records that have abstracts. We also maintain a list of all one- and two-word phrases and MeSH terms in various forms (with and without stars and subheadings). Each entry carries two counts: the document frequency and the total frequency, which counts every occurrence in each document over all of PubMed.
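
As a rough illustration of the string-to-count-vector idea described above, the following C++ sketch keeps two counts per phrase, the document frequency and the total frequency, for one- and two-word phrases. It is not the project's actual code: the names `CountTable`, `AddDocument`, `kDocFreq`, and `kTotalFreq` are hypothetical, an in-memory `std::map` stands in for the compressed, updatable file structures, and the input is assumed to be already tokenized with stop words and punctuation removed.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical names throughout; not the project's actual classes.
// Each phrase maps to a fixed-length vector of N integer counts.
template <std::size_t N>
using CountTable = std::map<std::string, std::array<std::int64_t, N>>;

// Slot layout assumed for this sketch.
constexpr std::size_t kDocFreq = 0;    // number of documents containing the phrase
constexpr std::size_t kTotalFreq = 1;  // occurrences summed over all documents

// Add one tokenized document to the table, counting every one-word phrase
// and every two-word phrase (adjacent token pairs joined by a space).
inline void AddDocument(const std::vector<std::string>& tokens,
                        CountTable<2>& table) {
    std::set<std::string> seen;  // phrases already credited with this document
    auto record = [&](const std::string& phrase) {
        auto& counts = table[phrase];  // zero-initialized on first use
        ++counts[kTotalFreq];
        if (seen.insert(phrase).second) {
            ++counts[kDocFreq];  // first occurrence within this document
        }
    };
    for (std::size_t i = 0; i < tokens.size(); ++i) {
        record(tokens[i]);
        if (i + 1 < tokens.size()) {
            record(tokens[i] + " " + tokens[i + 1]);
        }
    }
}
```

Calling `AddDocument` once per record updates the table incrementally, which mirrors the requirement that new data can be added as it becomes available. The same alias with a different length, for example `CountTable<6>`, could hold the six counts kept for multi-word phrases, although the meaning of those six slots is not specified here.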
Kim, Sun; Lu, Zhiyong; Wilbur, W John (2015) Identifying named entities from PubMed for enriching semantic categories. BMC Bioinformatics 16:57
Kwon, Dongseop; Kim, Sun; Shin, Soo-Yong et al. (2014) Assisting manual literature curation for protein-protein interactions using BioQRator. Database (Oxford) 2014:
Arighi, Cecilia N; Carterette, Ben; Cohen, K Bretonnel et al. (2013) An overview of the BioCreative 2012 Workshop Track III: interactive text mining task. Database (Oxford) 2013:bas056