The High Performance Database Research Center at Florida International University is leveraging the Hadoop framework, which implements Google's computational paradigm MapReduce and provides distributed file system services, for serving geospatial imagery and to execute spatial queries with heterogeneous predicates. This work is laying the foundation for high-performance geospatial querying. For instance, queries such as "the percentage of Florida state's land-mass that has vegetation" can be computed using basic image processing (map operation) at each image tile, followed by a simple summation (reduce operation) across tiles that comprise the aerial imagery of the Florida land-mass. A potentially infinite number of such semantic queries can thus be computed using the MapReduce paradigm and a large-scale raster imagery dataset. This exploratory work is providing a bridge between geospatial Web services and the MapReduce platform which has demonstrated success in other data-intensive applications.

This work is expected to produce a major impact on the field of geospatial data management and especially decision support based on geospatial data, by enabling decision support queries which were not previously practical. This will provide a foundation to enable critical decision support applications in fields such as disaster mitigation and environmental protection.This work is also providing a uniquely comprehensive collection of geospatial data to a broad research community.

Project Report

MapReduce at Florida International University NSF Program Director: Xiaoyang Wang PI: Naphtali Rishe Co-PIs: Vagelis Hristidis and Raju Rangaswami http://cake.fiu.edu/MapReduce Researchers at the NSF Industry-University Research Center CAKE at Florida International University have leveraged their Geospatial Data Server TerraFly project to deploy data and algorithms on the CluE infrastructure and to develop new algorithms with applications in geographic information retrieval, urban improvement, and disaster mitigation. TerraFly users visualize and query aerial imagery and data layers. Users virtually "fly" over imagery via a web browser, without any software to install or plug in. Tools include user-friendly geospatial querying, data drill-down, interfaces with real-time data suppliers, demographic analysis, annotation, route dissemination via autopilots, customizable applications, production of aerial atlases, application programming interface (API) for web sites. The TerraFly project has been featured on TV news programs (including FOX TV News), worldwide press, covered by the New York Times, USA Today, NPR, and Science and Nature journals. The 40TB TerraFly data collection includes, among others, 1-meter aerial photography of almost the entire United States and 3-inch to 1-foot full-color recent imagery of major urban areas. TerraFly vector collection includes 400 million geolocated objects, 50 billion data fields, 40 million polylines, 120 million polygons, including: all US and Canada roads, the US Census demographic and socioeconomic datasets, 110 million parcels with property lines and ownership data, 15 million records of businesses with company stats and management roles and contacts, 2 million physicians with expertise detail, various public place databases (including the USGS GNIS and NGA GNS), Wikipedia, extensive global environmental data (including daily feeds from NASA and NOAA satellites and the USGS water gauges), and hundreds of other datasets. In the present project, we used MapReduce to execute and benchmark massive data computations in the GIS domain. The specific problems that FIU’s team has addressed are: Scalability in geo-textual search algorithms is an important concern. We tackled the problem of clustering spatial objects in MapReduce towards enhancing the scalability of spatial searches with text constraints. We implemented a clustering algorithm inspired in X-means clustering, using a distance metric that takes into account spatial and non-spatial similarities at the same time. After clustering, independent spatio-textual indexes are created concurrently, one per cluster, in MapReduce. Our results show that better query processing performance can be attained under certain conditions. The clustering techniques have been refined to achieve higher clustering quality for better scalability. Non-Negative Matrix Factorization is a popular technique in data mining and machine learning. It has wide applications in environmentrics, image processing, text analysis, and bioinformatics. We have investigated the usage of MapReduce to scale up the Non-Negative Matrix Factorization algorithms to handle large scale problems that are in terabyte or petabyte scale. The resulting algorithms are applicable to the 40TB Aerial/Satellite imagery we have in TerraFly to extract useful information for environment monitoring or urban planning. We have designed and implemented the multiplicative update rule of the algorithm in MapReduce. Our results show fair scalability of our implementation. Shortest Paths problem has a long research history. Although exact algorithms have been developed, they are not of practical use for large scale real road networks due to their high computational complexity (cubic on the number of vertices in the graph representation). Parallel algorithms have also been developed to cope up with the problem. However, parallel solutions are mostly tied to specific parallel computational models. We have investigated and elaborated efficient MapReduce algorithms that can handle graphs with several millions of vertices. Direct application of such algorithms will be various street direction services in TerraFly. Street directions help both the local community in trip planning and emergency management systems in, for example, finding routes to shelters nearby an affected area. Selected Publications: Ariel Cary, Zhengguo Sun, Vagelis Hristidis, Naphtali Rishe. ``Experiences on Processing Spatial Data with MapReduce.'' in Springer Lecture Notes in Computer Science, Volume 5566/2009: Scientific and Statistical Database Management. (Proceedings of the 21st International Conference on Scientific and Statistical Database Management. New Orleans, Louisiana, USA. June 1-5, 2009.) pp. 302-319. Ariel Cary, Yaacov Yesha, Malek Adjouadi, Naphtali Rishe. ``Leveraging Cloud Computing in Geodatabase Management''. Proceedings of the 2010 IEEE Conference on Granular Computing (GrC-2010). Silicon Valley, August 14-16, 2010. Ariel Cary, Ouri Wolfson, Naphtali Rishe. ``Efficient and Scalable Method for Processing Top-k Spatial Boolean Queries.'' Proceedings of 22nd International Conference on Scientific and Statistical Database Management. Published as: Lecture Notes in Computer Science Volume 6187/2010: Scientific and Statistical Database Management. Springer Berlin / Heidelberg, 2010. Pages 87-95. Contact: Naphtali David Rishe This material is based in part upon work supported by the National Science Foundation under Grant Number IIS-0837716. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Project Start
Project End
Budget Start
2008-06-15
Budget End
2011-01-31
Support Year
Fiscal Year
2008
Total Cost
$232,000
Indirect Cost
Name
Florida International University
Department
Type
DUNS #
City
Miami
State
FL
Country
United States
Zip Code
33199