Cloud-based data processing platforms are an increasingly attractive alternative for large-scale data processing, and their use for various processing tasks on large-scale unstructured and structured data is under active investigation. At the same time, growing interest in many communities in more automatic sharing and exchange of data on the Web using Semantic Web techniques has produced a rapid surge in the availability of very large, real-world Semantic Web datasets. Such data are semi-structured and have more complex processing requirements than relational data, owing to the fine-grained modeling of data and the need for inferencing during processing. Consequently, existing optimization techniques for cloud data processing platforms, which often adapt relational optimization techniques, do not address the needs of such workloads. Further, such techniques do not adequately account for the nuances of cloud runtime platforms such as Hadoop, e.g., dataflow length as a cost metric and the absence of a priori indexes and statistics.
This project contributes insight into query optimization requirements for Semantic Web data processing on MapReduce platforms. Its contributions include a novel Nested TripleGroup data model and Algebra (NTGA); algebraic and dynamic cost-based query optimization techniques; inter- and intra-work sharing techniques; data representation formats; and system architecture considerations for integrating Semantic Web optimization techniques into frameworks such as Apache Pig. The impact of this project will cut across the growing range of communities that are aggressively adopting Semantic Web tenets, including scientific, business, government, and other general-purpose communities.