Most top-level searches of the scientific literature query structured fields such as author, subject, or affiliation. A free-text search of abstracts or full-text entries would be more flexible, allowing queries with any combination of words, including ranges of names and identifiers. Unfortunately, free-text searches usually yield incomplete and often erroneous results because the naming of biologically important molecules (genes, proteins, substrates) is not standardized. Unless a query issued to a retrieval service (e.g., PubMed) covers all possible aliases of a given protein or gene, the results may be insufficient or simply wrong. The system proposed here translates the problem of looking up literature pertaining to a certain protein to the sequence level. By correlating existing identifiers, names, and synonyms of proteins with their sequences, this lookup increases the accuracy and coverage of the results. Our system will also uniquely address a particular challenge: structural and functional genomics projects increasingly bring up proteins about which nothing is known. If someone publishes new experimental results that actually name such a protein, this important knowledge will likely be lost to the genomics investigator, because PubMed alerts must be triggered by keywords and names. Our system could fill this gap: users will be able to deposit sequences corresponding to proteins of unknown function or name. If experimental information is published for the same or a related sequence, the original investigator will be notified.

Public Health Relevance

The experimental and computational data appearing daily in publications are critical to the advancement of biological research. However, the sheer quantity of new data and the high frequency with which they are published turn bench scientists into research librarians, sifting through the flood of information in search of relevant and reliable data. Furthermore, as biological research is increasingly driven by the study of proteins and genes that mostly lack annotations, or even identifiers, there is a need to access the literature using sequence data alone. By automating the search for and discovery of relevant information as it becomes available, the proposed system promises to save time and to increase the coverage of relevant and reliable data retrieved by a given search, delivered in an intuitive and "easy to consume" format.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Small Business Innovation Research Grants (SBIR) - Phase I (R43)
Project #: 1R43LM010156-01
Application #: 7748592
Study Section: Biomedical Computing and Health Informatics Study Section (BCHI)
Program Officer: Ye, Jane
Project Start: 2009-09-15
Project End: 2011-03-14
Budget Start: 2009-09-15
Budget End: 2011-03-14
Support Year: 1
Fiscal Year: 2009
Total Cost: $97,198
Indirect Cost:
Name: Biosof, LLC
Department:
Type:
DUNS #: 828255617
City: New York
State: NY
Country: United States
Zip Code: 10025