The "Big Idea" driving this project is that human memory is small, the body of scientific knowledge is vast, and breakthroughs are possible if software can do a better job of connecting researchers with knowledge in text and databases. The first step is to stop looking for words in data (as a search engine does) and instead try to find facts in data. A fact is something claimed explicitly in text or a database. A fact may not be true, but we want to develop software that finds those facts. What does a fact look like? Consider the following sentence from MEDLINE: "Recently, we have found that Htt is an antiapoptotic protein in striatal cells and acts by preventing caspase-3 activity." It contains the fact that gene ID 6532 (in the Entrez Gene database) regulates gene ID 836. Software can extract such facts from a sentence like this, but the current state of the art is not doing a great job of it. The reason is simple: current systems are focused on not making mistakes, which means that they miss many opportunities to find facts. The best reported performance is around 40% of the facts being found, which we think severely compromises the usefulness of text mining technologies in bioinformatics. This is where we are trying a different approach: we are focused on finding all the facts. We call this "total recall". We demonstrated in Phase I that it is possible, but total recall comes with a price: we make lots of mistakes. The key innovation is that we keep score of how confident we are in any given fact, which gives us an important point of leverage in sifting good facts from bad. Our Phase II proposal focuses on developing techniques to reason over such fact-heavy analysis by exploring soft clustering approaches, structured classification, and effective user interface design.
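As an illustration only, the idea of keeping every candidate fact together with a confidence score can be sketched as follows. This is a minimal sketch, not the project's software: the `Fact` record, the `rank_facts` helper, and the confidence values are all hypothetical, and only the two gene IDs come from the example sentence above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fact:
    """A candidate fact extracted from text (hypothetical record layout)."""
    subject_gene_id: int
    relation: str
    object_gene_id: int
    confidence: float  # assumed to lie in [0, 1]; higher means more trusted


def rank_facts(facts, min_confidence=0.0):
    """Keep facts at or above the threshold, most confident first."""
    kept = [f for f in facts if f.confidence >= min_confidence]
    return sorted(kept, key=lambda f: f.confidence, reverse=True)


# Gene IDs are taken from the example sentence; the scores are made up.
candidates = [
    Fact(836, "regulates", 6532, 0.20),  # spurious reversed reading
    Fact(6532, "regulates", 836, 0.90),  # the reading stated in the text
]

# Total recall: nothing is discarded, only ordered by confidence.
all_facts = rank_facts(candidates)

# A user who wants fewer mistakes can raise the threshold afterwards.
confident_facts = rank_facts(candidates, min_confidence=0.5)
```

The design point is that filtering happens after extraction, so the threshold stays a choice for the researcher and is not built into the extractor.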
We have partnered with Harvard, Columbia, and Pfizer to keep our research effort focused on problems that actually matter for genomics experiments and early-phase drug discovery. In addition, we fit into the NIH's data sharing policy by making our software free (with source code) to organizations that make their data free too. We believe, as do many others, that many great scientific discoveries lie implicit, just below the surface of the research literature. All that is required is for the right researcher to see the right sentence or database entry to form a novel hypothesis and cure a disease. Total recall approaches to fact extraction make that outcome all the more likely. The dominant paradigm in text mining is to treat the text like a database, but researchers would be better served by a more search-like approach to extracting and correlating facts in text and databases. We are committed to making all the facts, that is, total recall, available to scientists, which is not currently the case.
Smith, Larry; Tanabe, Lorraine K; Ando, Rie Johnson nee et al. (2008) Overview of BioCreative II gene mention recognition. Genome Biol 9 Suppl 2:S2