Numerous organizations, including government agencies, are sitting on mountains of spreadsheet data. Such data are increasingly common on the web, but their contents remain out of reach of search engines because direct links to the contents of individual cells are rare. Spreadsheet data thus resemble legacy databases, especially since many of their underlying schemas are no longer accessible. The goal of this research is to discover the schema according to which a spreadsheet is constructed. The focus is on the spatio-textual spreadsheet, that is, a spreadsheet in which the values of the spatial attributes are specified textually. Such spreadsheets support spatial searches whose output is visual and whose utility is enhanced by the ability to handle spatial synonyms. This is done, in part, by devising methods to automatically discover the spatial attributes of the spreadsheet and to distinguish between several instances of them, which arise due to the presence of a containment hierarchy. In particular, use is made of spatial coherence: spatial data in the same column are usually of the same spatial type, while spatial data in the same spreadsheet row usually exhibit a containment relationship. Moreover, adjacent or nearby rows exhibit spreadsheet coherence in that they are usually similar. The broad impact of this research is to make spreadsheet data a first-class citizen on the web, with the same chances of being discovered and accessed as data found in other documents. Reports describing results of this and related research will be available at www.cs.umd.edu/~hjs/spreadsheets.html.
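The column half of the spatial coherence observation can be illustrated with a small sketch. It assumes a toy gazetteer (`TOY_GAZETTEER`) and a hypothetical helper `column_spatial_type`, neither of which comes from the project. It only shows how ambiguous names such as "Washington" or "Georgia" might be resolved by a majority vote over the other cells in the same column.

```python
from collections import Counter
from typing import Dict, List, Optional

# Hypothetical toy gazetteer: lower-cased place name -> possible spatial types.
TOY_GAZETTEER: Dict[str, List[str]] = {
    "washington": ["state", "city"],
    "oregon": ["state"],
    "georgia": ["state", "country"],
    "maryland": ["state"],
    "college park": ["city"],
    "portland": ["city"],
    "atlanta": ["city"],
}


def column_spatial_type(cells: List[str], min_support: float = 0.5) -> Optional[str]:
    """Return the spatial type supported by most cells in a column.

    Returns None when no type is supported by at least `min_support`
    of the cells, i.e. the column does not look spatial.
    """
    if not cells:
        return None
    votes: Counter = Counter()
    for cell in cells:
        # Every candidate type of an ambiguous name gets one vote.
        for place_type in TOY_GAZETTEER.get(cell.strip().lower(), []):
            votes[place_type] += 1
    if not votes:
        return None
    place_type, count = votes.most_common(1)[0]
    return place_type if count / len(cells) >= min_support else None
```

In a column that also holds "Oregon" and "Maryland", the ambiguous "Washington" is read as a state. Next to "Portland" and "Atlanta" it is read as a city.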

Project Report

Numerous organizations, including government agencies, are sitting on mountains of spreadsheet data. Such data are increasingly common on the web, but their contents remain out of reach of search engines because direct links to the contents of individual cells are rare. Spreadsheet data thus resemble legacy databases, especially since many of their underlying schemas are no longer accessible. The goal of this research was to discover the schema according to which a spreadsheet is constructed. The focus was on the spatio-textual spreadsheet, that is, a spreadsheet in which the values of the spatial attributes are specified textually. Such spreadsheets support spatial searches whose output is visual and whose utility is enhanced by the ability to handle spatial synonyms. This was done, in part, by devising methods to automatically discover the spatial attributes of the spreadsheet and to distinguish between several instances of them, which arise due to the presence of a containment hierarchy. In particular, use was made of spatial coherence: spatial data in the same column are usually of the same spatial type, while spatial data in the same spreadsheet row usually exhibit a containment relationship. Moreover, adjacent or nearby rows exhibit spreadsheet coherence in that they are usually similar. Analogous principles apply to other forms of data tables, such as HTML tables found on the Web, so the data extraction techniques were validated on a collection containing both spreadsheets and HTML tables.

The broad impact of this research was to make spreadsheet data a first-class citizen on the web, with the same chances of being discovered and accessed by search engines as data found in other documents. This is especially true for mandated government data collections.

Even more intellectually challenging was the discovery of the overall structure of the tables, in other words, identifying their schemas, which are not stored explicitly as table metadata. We addressed this lack of structure by devising a new method for leveraging the principles of table construction in order to extract table schemas. We discovered the schema by which a table was constructed by harnessing the similarities and differences of nearby table rows through a novel set of features and a feature processing scheme. The schemas of these data tables were determined using a classification technique based on conditional random fields in combination with a novel feature encoding method called logarithmic binning, which was specifically designed for the data table extraction task.

Some specific outcomes of our research were:

1. Developed a way to perform similarity search over a vast collection of spreadsheets that contain geographical locations. We developed a geographic search system named GeoXLS. GeoXLS enables users to submit a set of locations as a query object Q and to find spreadsheets containing locations similar to those in Q. Search results come from a vast collection of over 100,000 spreadsheets obtained from the Web. GeoXLS allows users to answer queries such as "I know the locations of n entities of type X. What sets of data contain points similar to my query points?"

2. Developed a machine learning-based method for classifying rows of a data table or spreadsheet by their function, thereby allowing extraction of the table's schema information. This process improves on prior methods for extracting table data by supporting more complex table structures.

3. Developed an algorithm for geotagging tables and lists by using a Bayesian classifier to identify categories that can describe the toponyms. The categories serve as common threads for interpreting the toponyms in a consistent manner. Such data frequently occur in table columns.

4. Developed techniques for identifying tables that correspond to itineraries. This requires distinguishing itineraries from documents that merely contain geographic content, and it makes use of temporal content. Note that itineraries differ from the related concepts of routes and trajectories in that the precise paths between the stopping points are of less importance than the locations of the stopping points and their order. In conjunction with this work, we developed techniques for automatically generating visualizations of itineraries found in HTML tables and spreadsheets on the web.

5. Developed a classification scheme, called equal-area, for choropleth maps of a spatially varying property. Our goal was to assign the ranges of values of a spatially varying property to colors so that the total area of the regions associated with each color is roughly equal, thereby rendering a more symmetric and visually appealing visualization. We explored a number of algorithms for doing so. The final algorithm is a modified approach that tries to balance two goals simultaneously: equal area for each color and an equal number of regions for each color. The result worked well for properties corresponding both to absolute data and to area-normalized data such as densities.
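The equal-area goal in outcome 5 can be sketched as follows. This is a minimal illustration of the plain equal-area idea only, assuming regions are given as hypothetical `(name, value, area)` tuples. It is not the project's final algorithm, which also balances the number of regions per color. The function name `equal_area_classes` is invented for this example.

```python
from typing import List, Sequence, Tuple

Region = Tuple[str, float, float]  # (name, property value, area); assumed layout


def equal_area_classes(regions: Sequence[Region], k: int) -> List[List[str]]:
    """Group regions into k value-ordered classes of roughly equal total area.

    Regions are sorted by property value. Each region goes to the class
    that contains the midpoint of its span along the cumulative-area axis,
    so each class covers a contiguous range of values.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    total_area = sum(area for _, _, area in regions)
    if total_area <= 0:
        raise ValueError("total area must be positive")
    classes: List[List[str]] = [[] for _ in range(k)]
    cumulative = 0.0
    for name, _value, area in sorted(regions, key=lambda r: r[1]):
        midpoint = cumulative + area / 2.0
        # Map the midpoint's share of the total area to a class index.
        index = min(int(midpoint * k / total_area), k - 1)
        classes[index].append(name)
        cumulative += area
    return classes
```

Each class would then be mapped to one color. With a few very large regions the classes cannot be balanced exactly, which is one reason a modified approach that also considers region counts may be preferable.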

Budget Start: 2010-09-01
Budget End: 2014-08-31
Fiscal Year: 2010
Total Cost: $508,000
Name: University of Maryland College Park
City: College Park
State: MD
Country: United States
Zip Code: 20742