The tremendous growth of information in a data-intensive world and a new wave of big data are creating a promising future for global ultra-large-scale data sharing, in which widely scattered massive data will be pooled and shared globally. A distributed data-intensive information system is a critical component for realizing this future. Such a system will allow users to search for similar data efficiently and effectively. However, the unprecedented amount of data, together with the large-scale environment and the autonomous nature of participants, poses significant efficiency and effectiveness challenges to the development of such a system. This research will provide collaborative research opportunities for faculty, graduate and undergraduate students, as well as K-12 students in South Carolina.
A growing need persists for an efficient and effective information searching system, and meeting it is one of the more formidable challenges facing data-intensive computing. This proposal addresses this need through the development of a distributed information system supporting efficient and effective data searching. The system aims to achieve both high efficiency and high effectiveness. Efficiency refers to the speed and overhead of sorting and searching data, while effectiveness refers to the ability to find all matching data in the system with few false positives and false negatives. The system translates data items to IDs, maps the data items to nodes in a distributed system, and enables similarity searching in a distributed manner. First, previous data translation methods that rely on a multi-dimensional space to hash a data item to one index achieve high efficiency but suffer from low effectiveness due to the curse of dimensionality in data dimension reduction. Previous exact mapping methods that hash each keyword of a data item for data search are highly effective but inefficient. By eliminating the need for a multi-dimensional space, this system is both highly efficient and highly effective. Second, unlike some previous systems that rely on a centralized or hierarchical structure for data searching, this system builds a distributed hash table (DHT) structure, which provides highly efficient data searching in a distributed manner. Unlike most traditional DHT-based data sharing systems, which provide only exact matching services, this system offers similarity searching.
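The three steps described above (translating data items to IDs, mapping them to nodes, and searching by similarity) can be illustrated with a minimal sketch. The proposal does not specify its translation method, so the sketch substitutes MinHash over keyword sets as an assumed stand-in, and a toy consistent-hashing ring stands in for the DHT overlay. All names here (`minhash_ids`, `Ring`, the node names, the 32-bit ID width) are hypothetical and chosen only for illustration.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 32-bit hash (illustrative; real DHTs usually use much wider IDs)."""
    return int.from_bytes(hashlib.sha1(value.encode("utf-8")).digest()[:4], "big")


def minhash_ids(keywords, num_hashes=4):
    """Translate a non-empty keyword set into a few IDs (assumed MinHash scheme).

    Sets that share many keywords are likely to share at least one ID.
    """
    return [min(h(f"{seed}:{kw}") for kw in keywords) for seed in range(num_hashes)]


class Ring:
    """Toy consistent-hashing ring standing in for a DHT overlay."""

    def __init__(self, node_names):
        self.points = sorted((h(name), name) for name in node_names)
        self.store = {name: {} for name in node_names}

    def owner(self, item_id):
        # The first node at or after item_id on the ring, wrapping around.
        keys = [point for point, _ in self.points]
        index = bisect.bisect_left(keys, item_id) % len(self.points)
        return self.points[index][1]

    def publish(self, name, keywords):
        # Store the item under each of its IDs at the node owning that ID.
        for item_id in minhash_ids(keywords):
            self.store[self.owner(item_id)].setdefault(item_id, set()).add(name)

    def search(self, keywords):
        # Look up each query ID at its owner node and merge the candidates.
        hits = set()
        for item_id in minhash_ids(keywords):
            hits |= self.store[self.owner(item_id)].get(item_id, set())
        return hits


ring = Ring(["node-a", "node-b", "node-c"])
ring.publish("doc1", {"distributed", "hash", "table", "search"})
print(ring.search({"distributed", "hash", "table", "search"}))
```

A query with an identical keyword set always reaches the published item, while a query with an overlapping set reaches it only with a probability that grows with the overlap. That probabilistic behavior is the trade-off between false negatives and lookup cost that the notions of efficiency and effectiveness above are meant to capture; increasing `num_hashes` raises recall at the price of more lookups per query.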