This Small Business Innovation Research (SBIR) Phase II project aims to establish that a units-based approach to retrieving quantitative data from scientific and technical documents is a powerful alternative to keyword- and document-based search models. Keyword approaches to data extraction and contextualization are limited by poor semantic contextualization and by the wide variety of numeric and unit formats in which quantities are written. The proposed approach to reliable numeric data extraction begins with quantity-intelligent indexing, which recognizes many numeric formats and converts quantities to standardized base-unit tokens, significantly enhancing search recall over keyword approaches. The resulting number-unit pairs will anchor the index, enabling efficient scientific exploratory search with high semantic precision without relying heavily on sophisticated, externally imposed semantic ontologies. Research will focus on a proprietary search-time data-scoring algorithm that uses context-sensitive numeric spectra to score otherwise ambiguous results probabilistically. This approach is expected to improve both the precision and the recall of contextual numeric data extraction. In turn, the resulting search engine will enable instant visualization and analysis of collective technology landscapes and trends, guiding researchers in any area of technology represented by the indexed documents.
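As an illustration of the indexing step, the sketch below normalizes quantities found in free text to base-unit tokens so that equivalent values written in different units collide in the index. This is a minimal sketch in Python; the conversion table, regular expression, and token format are illustrative assumptions, not the project's proprietary indexer, which handles far more numeric and unit formats.

    import re

    # Hypothetical conversion table: unit -> (base unit, multiplier to base).
    # The real indexer would cover many more units, prefixes, and spellings.
    BASE_UNITS = {
        "km": ("m", 1e3), "cm": ("m", 1e-2), "mm": ("m", 1e-3), "m": ("m", 1.0),
        "mg": ("kg", 1e-6), "g": ("kg", 1e-3), "kg": ("kg", 1.0),
        "mhz": ("hz", 1e6), "ghz": ("hz", 1e9), "hz": ("hz", 1.0),
    }

    # Matches forms like "5 km", "3.2e-4 kg", and "1,500 MHz".
    QUANTITY_RE = re.compile(r"([-+]?[\d,]*\.?\d+(?:[eE][-+]?\d+)?)\s*([A-Za-z]+)")

    def base_unit_tokens(text):
        """Convert each recognized quantity into a standardized base-unit token,
        e.g. '5 km' -> '5000_m', so '5 km' and '5000 m' index identically."""
        tokens = []
        for number, unit in QUANTITY_RE.findall(text):
            unit = unit.lower()
            if unit in BASE_UNITS:
                base, factor = BASE_UNITS[unit]
                value = float(number.replace(",", "")) * factor
                tokens.append(f"{value:g}_{base}")
        return tokens

    print(base_unit_tokens("a 5 km track, a 5,000 m run, and a 2.4 GHz radio"))
    # -> ['5000_m', '5000_m', '2.4e+09_hz']

Because "5 km" and "5,000 m" both normalize to the token 5000_m, a search over a length range can match either surface form even when the documents share no keywords.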

The broader impact of this project will be to enable reliable and efficient extraction of numeric data from diverse sources such as the scientific literature and patent databases. These unstructured document sets contain a wealth of latent quantitative data which, if properly extracted and aggregated, can enable powerful modes of data exploration. The unit-based index and data-scoring algorithm are customized for an exploratory search model that will allow non-expert users to rapidly aggregate thousands of relevant data points from simple keyword inputs, without laboriously opening and parsing individual documents. Researchers and students may thus explore data sets that were previously inaccessible or known only to experts in a field. This will also contribute to knowledge discovery within large unstructured databases, since patterns and correlations between seemingly disparate variables can be visualized immediately. The platform will provide the capability to efficiently generate technology landscapes, anticipate emerging trends, and recognize competitive technical outliers. If successful, this capability will be valuable for high-tech industrial innovation, both for engineers involved in R&D and for business development executives and intellectual asset managers concerned with asset allocation, new technology ventures, prior art, and patent infringement within a technical parameter space.

Project Report

" was focused on extracting signals from an unstructured database when little external contextualization is available. Dictionary-heavy approaches, such as search engines over medical records, are pre-programmed with causal relationships such as medicines and their side effects, and therefore it is not difficult to extract these relationships when they appear again in new documents. However, often the relationships between topics are unknown to the user, or the documents discuss fields where causal relationships have not yet been determined or are rapidly changing. In these circumstances, when effectively the user does not know what they are looking for, there must still be a methodology for extracting what is important. This was the focus of our efforts, and we discovered techniques to uncover these unknown relationships. This exercise in extracting signal without expertise in the field requires a proxy for defining importance in a particular application. We invented techniques for using time series data to be that proxy. We analyzed social media chatter, for which existing technologies had focused on two primary methods of discovering importance. The first, most prevalent method, relies on simple entity counting, and selects those phrases that appeared the most in the corpus. This approach often populates a word cloud. Another approach, often called sentiment analysis, involves analyzing the emotional tone of phrases by comparing against a list of emotional phrases, and elevating those entities that are highly associated with an extreme of emotion. Although this approach has significant natural language challenges, we have found that it does not consistently provide signals that correlate (even in-sample) with underlying business metrics. During the course of this effort, we developed a novel approach of prioritizing signal by training over a time series of structured data to be used as an intelligent filter for extracting those language patterns that were most highly correlated with the training metric. This structured, time series data, served as the Bayesian prior for contextualizing which conversation patterns were most likely important. Our approach required a flexible architecture that was capable of leveraging the real-time nature of unstructured feedback, capable of scaling to large datasets, and compatible with compute intensive algorithms. We also developed a platform and methodology for scoring the accuracy of our extracted signals. Much of the work accomplished during this SBIR was centered on social media chatter, including Facebook and Twitter, and was trained over financial transaction data, such as sales or churn. However, our technology of training over unstructured data by extracting correlations in language clusters with a structured time series is a generic approach with applications ranging from healthcare to defense/intelligence applications.

Project Start:
Project End:
Budget Start: 2010-08-15
Budget End: 2015-01-31
Support Year:
Fiscal Year: 2010
Total Cost: $900,112
Indirect Cost:
Name: Quantifind Inc.
Department:
Type:
DUNS #:
City: Palo Alto
State: CA
Country: United States
Zip Code: 94306