Science is becoming increasingly data intensive. Experimental data that accumulates within and across institutions over time is a valuable source of information. However, little has been done to query such data repositories in an integrated manner, mainly because results from different replicates of an experiment often show a large degree of inconsistency and variance. A database of this kind poses unique challenges in its data types, its query types, and the accuracy of its distributions. The broad goal of the project is to solve some of the major query processing challenges in such a database.
Specific techniques used to achieve this goal include: (1) coupling top-k query answering with computation of the score distribution of the top-k tuples, and the novel use of this framework in a number of contexts; (2) measuring the accuracy of probability distributions, which affects the user's perception of query results, and devising new predicates for decision making; (3) proposing novel semantics and efficient query processing for join queries on uncertain data; and (4) designing a suite of algorithms for approximate substring matching and windowed subsequence matching for online monitoring queries.
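To make technique (1) concrete, the following is a minimal sketch of what a score distribution of top-k tuples can look like. It is an illustration only, not the project's algorithm: it assumes each tuple exists independently with a given probability, it enumerates every possible world by brute force (exponential in the number of tuples), and the function name `topk_score_distribution` and the sample `readings` are hypothetical.

```python
from collections import defaultdict
from itertools import product


def topk_score_distribution(tuples, k):
    """Return {total top-k score: probability} for uncertain tuples.

    `tuples` is a list of (score, existence_probability) pairs.
    Assumption for this sketch: tuples exist independently of each other.
    """
    dist = defaultdict(float)
    # Each possible world fixes, for every tuple, whether it exists.
    for world in product([False, True], repeat=len(tuples)):
        world_prob = 1.0
        present_scores = []
        for (score, prob), exists in zip(tuples, world):
            world_prob *= prob if exists else 1.0 - prob
            if exists:
                present_scores.append(score)
        if world_prob == 0.0:
            continue  # impossible world, contributes nothing
        # Total score of the k highest-scoring tuples in this world.
        total = sum(sorted(present_scores, reverse=True)[:k])
        dist[total] += world_prob
    return dict(dist)


# Hypothetical replicate measurements: (score, probability the reading is valid).
readings = [(9.0, 0.5), (7.0, 0.8), (4.0, 1.0)]
dist = topk_score_distribution(readings, k=2)
```

For these three hypothetical readings, the sketch assigns probability of about 0.4 to a total top-2 score of 16, 0.1 to 13, 0.4 to 11, and 0.1 to 4. Returning the whole distribution rather than a single ranked answer shows how much the top-k result depends on which uncertain readings turn out to be valid.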
The ability to effectively and accurately share, query, and monitor diverse scientific experimental data on a large scale will greatly benefit the science community. The extra data analysis capability provided to scientists can even change the way they conduct their research. The education plan includes teaching computer science students to manage scientific data and teaching science majors relevant database techniques, attracting middle school and college students to computer science, and inspiring students from underrepresented groups.