This collaborative project is focused on GIS vector-based spatial data overlay processing through cloud computing. Vector-based processing is considerably more complicated than raster data processing because raster data is based on regular grid-based fixed-size pixels, while vector features have irregular geometric shapes represented by a list of large number of vertices. GIS data files can be very large-scale huge and as such, computational requirements are significant. Cloud platforms such as Azure are appropriate for large-scale computing and storage capabilities. These capabilities combined with accessibility on-demand and other features have made cloud processing very desirable for GIS applications. This exploratory research will attempt to discover distributed algorithms and test their scalable implementations for GIS overlay processing on the Azure platform.
Background and Goals: GIS polygon-based spatial data overlay processing through cloud computing is much more complex and challenging than raster data processing because raster data is based on regular grid-based fixed-size pixels, while vector (polygonal) features have irregular geometric shapes represented by a list of large number of vertices. The GIS data files can be huge and their overlay processing is computationally intensive. The emerging Cloud platforms such as Azure, with their potential for large scale computing and storage capabilities, easy accessibility by common users and scientists, availability on demand, easy maintenance, sustainability, and portability, has the promise to be the platform of choice for such GIS applications. We had proposed to discover parallel and distributed algorithms and their scalable implementations for GIS overlay processing on Azure platform. Meager amount of work had been done, before our work, on large volume of polygonal geospatial data processing through parallel/distributed computing (as opposed to for raster data processing), and none on cloud platforms. The existing parallel approaches mostly developed in the 1990s were not scalable and/or limited to small set of polygons on the traditional cluster and other platforms. Better algorithms and data structures were needed. Research and Education Activities: This EAGER project has been extremely successful and synergistic. It has led to development of key parallel data structures and algorithms, efficient systems over cloud and clusters, and helped initiate collaboration with GIS and spatial data mining experts. It resulted in a PhD dissertation, training of two other PhD students and an undergraduate minority student, more than a dozen publications and presentations, and two patent applications. An application of the our "Crayons" system was selected among 113,000 contestants to be presented at Microsoft’s Imagine Cup US finals. We demonstrated how the Crayons system could be employed to find the safe rescue shelters in the event of a natural disaster such as a hurricane. Summary of our Outcomes: We have undertaken parallelization of two key tree-based data structures, namely R-tree and heap, and have employed parallel R-tree in polygon overlay system and have parallelized interest interval discovery problem. These data structure parallelization are hard because of the underlying tree topology and the fine-grained computation leading to frequent access to such data structures severely stifling parallel efficiency. Therefore, the current best parallelization of R-tree on GPU was limited to about 20-fold speedup. We summarize our current results as follows. Parallel Polygon Overlay System: We engineered the first-ever parallel system that can perform spatial overlay over polygonal spatial data. We have implemented this system named "Crayons" over the Azure cloud platform, Map-Reduce, and a Linux cluster. Our current GIS system employing the parallel R-tree can process about 200K polygons within a few seconds with 70-fold speedup on a 12-node Linux cluster that previously took tens of minutes. Thus, it has the potential for bringing a practical overlay tool to the Geo Scientists. Parallel R-Tree: Our new parallel algorithms for construction and search has yielded the first demonstrated 200-fold speedup in R-tree construction on a graphics processing unit (GPU). This would be useful for large-scale range querying and can serve as template for parallelizing other tree-based structures, and other generalizations of spatio-temporal datasets. Parallel Priority Queue: Our parallel heap data structure supports large batches of extracting highest priority items and inserting newly produced items with 30-fold speedups. This has potential application to agent based modeling, shortest path and evacuation route planning. CloudMPI Programming Environment: Most recently, we developed cloudMPI, a framework for Windows Azure cloud platform that allows application developers to write MPI-like applications without sacrificing the performance for large grained applications. Broader Impact: The R-Tree and Priority queue data structures and GIS overlay computation are directly useful for disaster modeling and prediction, evacuation routing, spatial data mining, and agent-based GIS modeling and simulation. Crayons and related systems can potentially accelerate processes of disaster modeling and prediction, and therefore, can assist government agencies such as FEMA. The project website is at www.cs.gsu.edu/~dimos/?q=content/research-projects. The code for Crayons is maintained at http://gpcoverlay.codeplex.com under public license.