The "Big Idea" driving this project is that human memory is small, the body of scientific knowledge is vast, and breakthroughs are possible if software can do a better job of connecting researchers with knowledge in text and databases. The first step is to stop looking for words in data (as a search engine does) and instead try to find facts. A fact is something claimed explicitly in text or a database. A fact may not be true, but we want to develop software that finds those facts. What does a fact look like? Consider the following sentence from MEDLINE: "Recently, we have found that Htt is an antiapoptotic protein in striatal cells and acts by preventing caspase-3 activity." It contains the fact that gene id 6532 (in the Entrez Gene database) regulates gene id 836. Software can extract such facts from a sentence like this, but the current state of the art is not doing a great job of it. The reason is simple: current systems are focused on not making mistakes, which means that they miss a lot of opportunities to find facts. The best reported performance is around 40% of the facts being found, which we think severely compromises the usefulness of text mining technologies in bioinformatics. This is where we are trying a different approach: we are focused on finding all the facts. We call this "total recall", which we demonstrated was possible in Phase I, but total recall comes with a price: we make lots of mistakes. The key innovation is that we keep score of how confident we are in any given fact, which gives us an important point of leverage in sifting good facts from bad. Our Phase II proposal focuses on developing techniques to reason over such fact-heavy analysis by exploring soft clustering approaches, structured classification and effective user interface design.
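The idea of keeping a confidence score for every candidate fact can be illustrated with a minimal sketch. This is not the project's software: the `Fact` record, the `sift` helper, the confidence values and all gene ids other than 6532 and 836 are assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """One extracted claim: a regulator gene acting on a target gene."""
    regulator_gene_id: int   # Entrez Gene id
    target_gene_id: int      # Entrez Gene id
    confidence: float        # hypothetical score in [0, 1]


def sift(facts, min_confidence):
    """Keep facts scoring at least min_confidence, most confident first."""
    kept = [f for f in facts if f.confidence >= min_confidence]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)


# The first candidate uses the gene ids from the MEDLINE example above;
# the other ids and every score are made up for illustration.
candidates = [
    Fact(6532, 836, 0.91),
    Fact(6532, 9999991, 0.12),
    Fact(9999992, 836, 0.45),
]

everything = sift(candidates, 0.0)   # total recall: nothing is discarded
confident = sift(candidates, 0.5)    # stricter view of the same candidates

print(len(everything), len(confident))
```

Because no candidate is thrown away at extraction time, the threshold can be chosen later by whoever is reading the results, rather than being fixed inside the extractor.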
We have partnered with Harvard, Columbia and Pfizer to keep our research effort focused on problems that actually matter for genomics experiments and early-phase drug discovery. In addition, we fit into the NIH's data sharing policy by making our software free (with source code) to organizations that make their data free too. We, like many others, believe that many great scientific discoveries lie implicit, just below the surface of the research literature. All that is required is for the right researcher to see the right sentence or database entry to form a novel hypothesis and cure a disease. Total recall approaches to fact extraction make that outcome all the more likely. The dominant paradigm in text mining is to treat the text like a database. But researchers would be better served by a more "search"-like approach to extracting and correlating facts in text and databases. We are committed to making all the facts available to scientists, that is, total recall, which is not currently available.

Agency: National Institute of Health (NIH)
Institute: National Center for Research Resources (NCRR)
Type: Small Business Innovation Research Grants (SBIR) - Phase II (R44)
Project #: 2R44RR020259-03
Application #: 7327948
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Brazhnik, Olga
Project Start: 2004-08-09
Project End: 2009-06-30
Budget Start: 2007-07-25
Budget End: 2008-06-30
Support Year: 3
Fiscal Year: 2007
Total Cost: $374,415
Indirect Cost:

Name: Alias-I
Department:
Type:
DUNS #: 124340956
City: New York
State: NY
Country: United States
Zip Code: 11211

Smith, Larry; Tanabe, Lorraine K; Ando, Rie Johnson nee et al. (2008) Overview of BioCreative II gene mention recognition. Genome Biol 9 Suppl 2:S2