The exponential growth of the web, recent technological progress in molecular biology, the launch of massive-scale digital library projects, and the ability to exchange information at our fingertips have all contributed to the creation of an unprecedented quantity of textual data in digital form. Plain or semi-structured text is still the most versatile format in which to exchange information, and there is so much of this data that it is likely that the large majority of it will never be read by anyone unless the way in which we access information drastically improves.

The major limiting factor in handling large textual datasets is typically related to space rather than time. When the amount of data is too large to be stored in main memory, computer scientists have to resort to algorithms capable of dealing with compressed representations of the data (called 'sketches' or 'indexes'). For textual data, the construction of the sketch typically involves keeping statistics on substrings or related associations or rules.

The first set of objectives of this project is centered around a new sketch based on a novel family of gapped patterns. We are applying the new index to three selected problems: databases; data compression; and computational biology. In the second set of objectives we are extending the pattern discovery problem to two-dimensional matrices. The discovery problem associated with two-dimensional patterns has a wide spectrum of applications including the analysis of gene expression data, recommender systems and collaborative filtering, identification of web communities, load balancing, and discovery of association rules.

The educational goal of the proposal is to establish the algorithmic and fundamental software development components of an interdisciplinary bioinformatics curriculum. Funds from this proposal are being used to enhance these activities through the development of new courses in computational genomics for in-depth training on individualized research topics. Since UCR is a minority-serving institution, this plan will also have an impact on the education of under-represented students.

Project Report

This six-year grant (2005-2011) funded by NSF enabled a wide range of scientific discoveries and technical innovations at the interface between computer science and molecular biology. For instance, in collaboration with Prof. K. Le Roch and her team at UC Riverside, we made significant progress in understanding gene regulation in the human malaria parasite. This might lead to new anti-malaria drugs or new strategies to eradicate one of the most deadly infectious diseases known to humankind. In another collaboration with Prof. T. Close (UC Riverside), we helped develop a more accurate map of the barley genome, which could lead to the selection of traits that improve the plant's ability to withstand stresses such as drought or pest infestation. Barley ranks fourth among cereals in terms of total production and area of cultivation worldwide. Finally, in collaboration with NYU's Langone Medical Center, we used a combination of tumor analysis and microarray chip technology to identify potential long-term melanoma survivors within the group of patients whose disease has metastasized beyond the skin to other organs. We determined that patients with evidence of a stronger immune response are those likely to survive longer with the disease. This information could point the way toward new targeted therapies for some patients, as well as spare others the toxic side effects of drugs unlikely to help them.

Other technical innovations include: a new computational idea to functionally annotate proteins; a more accurate technique to compute genetic and physical maps of genomes; a method to hide messages in DNA; novel techniques to visualize, mine and interpret time series; and improvements in data compression of images and text, in particular related to error correction. These collaborations produced almost three dozen peer-reviewed scientific publications and half a dozen open-source software tools for the life science community.

The grant also partially funded six PhD students, who were trained in the domain of computational biology. Two of them joined the biotech industry, and four are employed in academia as either postdocs or assistant professors. A few undergraduate students were also trained, and they all decided to pursue a PhD.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0447773
Program Officer
Sylvia J. Spengler
Project Start
Project End
Budget Start
2005-08-01
Budget End
2011-07-31
Support Year
Fiscal Year
2004
Total Cost
$422,277
Indirect Cost
Name
University of California Riverside
Department
Type
DUNS #
City
Riverside
State
CA
Country
United States
Zip Code
92521