New experimental methods for collecting high-throughput data are revolutionizing biology and clinical studies and are now routinely used for pre-clinical drug discovery. Several large public and proprietary databases collect these types of data; however, this data is largely unstructured and difficult to use. The proposed effort is a search engine for genomic data aimed at pharmaceutical companies, biotechnology companies, academic institutes, and medical centers, allowing them to use large volumes of condition-specific data from public repositories and to integrate it with proprietary, in-house data. The framework automatically downloads, parses, and annotates data from different repositories and is complemented by an easy-to-use web interface. The team created a large collection of automatically annotated data, and the platform offers search capabilities within and across species as well as additional advanced analysis options. These capabilities can provide new validation for current experimental results as well as new research directions. In particular, drug discovery is usually conducted on lower mammals before it is applied to humans, so the ability to easily perform cross-species comparisons can reduce drug development cost and time.

The proposed technology addresses the classic problem of dealing with heterogeneity in unstructured data and integrating massive amounts of data from different sources into a seamless framework. The unique challenge here is a function of the domain (biological and pre-clinical data). The team's goal is to create a system that manages heterogeneity in more than a single aspect and provides vertical integration that allows the data to be searchable and comparable on many levels. This integration is made possible through computational text mining and machine learning methods that derive high-quality information from free text in order to automatically categorize and annotate the large volumes of data. This work also provides a holistic approach for the incorporation of new analysis tools for genomic data, offering standard services and benchmarks that can significantly shorten development time and increase usage. The ability to easily query large volumes of genomic data can facilitate basic research of cell processes by academic researchers and the discovery of new drugs or repurposing of old drugs by pharmaceutical companies. In addition, large medical centers are starting to collect genomics and genetics data for individual patients, aiming to provide personalized medicine tailored specifically to each individual. The ability to compare the results of an individual patient to a large collection of patients and their clinical records is key to finding better-suited treatments for that individual, leading to reduced hospitalization time and fewer complications. Lastly, the software and methods created here are intended to be reusable for any science moving from individual lab practices to a shared, global collaboratory system. If successfully deployed, this technology has the potential to make a significant impact across a wide span of the health care industry.

Project Report

New experimental methods for collecting high-throughput data are revolutionizing biology and clinical studies and are now routinely used for pre-clinical drug discovery. These include sequencing methods, array-based methods, and methods for determining interactions between proteins. Several large public and proprietary databases collect these types of data; however, this data is largely unstructured and difficult to use. In this project we developed ExpressionBlast, a search engine for genomic data aimed at scientific researchers, medical centers, and biotechnology companies. ExpressionBlast facilitates the use of large public databases and also provides a means of integrating public data with proprietary, in-house data. The ExpressionBlast framework automatically downloads, parses, and annotates data from different repositories and is complemented by an easy-to-use web interface that provides search capabilities within and across species and includes advanced analysis options. ExpressionBlast is available as a web service, and initial studies demonstrated its effectiveness for the analysis of high-throughput genomics data, including data on aging and human diseases.

Intellectual merit: ExpressionBlast addresses the classic problem of dealing with heterogeneity in unstructured data and integrating massive amounts of data from different sources into a seamless framework. The unique challenge here is a function of the domain (biological and pre-clinical data). We created a system that manages heterogeneity and provides vertical integration that allows the data to be searchable and comparable on many levels. This integration is made possible through computational text mining and machine learning methods that derive high-quality information from free text in order to automatically categorize and annotate the large volumes of data. This work also provides a holistic approach for the incorporation of new analysis tools for genomic data, offering standard services and benchmarks that can significantly shorten development time and increase usage.

Broader impact: ExpressionBlast is positioned as a basic analysis tool for every researcher using high-throughput biological data when studying a variety of conditions and diseases. The ability to easily query large volumes of genomic data can facilitate basic research of cell processes by academic researchers and the discovery of new drugs or repurposing of old drugs by pharmaceutical companies. In addition, large medical centers are starting to collect genomics and genetics data for individual patients, aiming to provide personalized medicine tailored specifically to each individual. The ability to compare the results of an individual patient to a large collection of patients and their clinical records is key to finding better-suited treatment for that individual, leading to reduced hospitalization time and fewer complications. Lastly, the software and methods we created are intended to be general and reusable for future systems that require searching in unstructured datasets.

Agency
National Science Foundation (NSF)
Institute
Division of Industrial Innovation and Partnerships (IIP)
Type
Standard Grant (Standard)
Application #
1242525
Program Officer
Rathindra DasGupta
Project Start
Project End
Budget Start
2012-07-01
Budget End
2013-12-31
Support Year
Fiscal Year
2012
Total Cost
$50,000
Indirect Cost
Name
Carnegie-Mellon University
Department
Type
DUNS #
City
Pittsburgh
State
PA
Country
United States
Zip Code
15213