All-pairs similarity comparison is a core operation in many data-intensive mining and search applications, such as near-duplicate detection among web pages, spam detection, advertisement click analysis, grouping of similar news and fresh content, and recommendation of similar product purchases and search queries. Conducting similarity search on large datasets is time-consuming, and it becomes more challenging when the data are updated continuously. It is important to develop high-performance algorithms and software to meet the increasing speed demands of the many consumer and business applications that rely on similarity computation.

This project studies efficient and cost-effective parallel algorithms for similarity comparison when data are updated periodically or dynamically. Techniques for partitioning data and balancing computation on a cluster of machines are developed to optimize input/output operations, communication, and computing resource usage. Because data are often updated continuously, leveraging previously computed results to handle updated data can eliminate a large number of unnecessary operations and speed up the entire computation process by an order of magnitude. The project develops efficient software that runs on a cluster of machines. It starts with incremental duplicate detection for web data analysis and search, and continues with similarity comparison in several other applications. The performance of the developed software is evaluated in those applications.
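
The idea of reusing previously computed results can be illustrated with a minimal single-machine sketch. It is not the project's software: the function names (`cosine`, `all_pairs`, `incremental_update`), the choice of cosine similarity over sparse term-weight vectors, and the threshold parameter are all assumptions made for illustration, and the cluster partitioning and load balancing described above are left out.

```python
from itertools import combinations
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm = sqrt(sum(w * w for w in a.values())) * sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def all_pairs(docs, threshold):
    """Baseline: compare every pair of documents from scratch."""
    results = {}
    for i, j in combinations(sorted(docs), 2):
        s = cosine(docs[i], docs[j])
        if s >= threshold:
            results[(i, j)] = s
    return results


def incremental_update(docs, results, changed, threshold):
    """Refresh `results` after the documents in `changed` were added,
    modified, or deleted. `docs` must already reflect the update.
    Pairs of unchanged documents are reused rather than recomputed."""
    changed = set(changed)
    # Keep only the pairs in which neither document changed.
    kept = {p: s for p, s in results.items()
            if p[0] not in changed and p[1] not in changed}
    for c in changed:
        if c not in docs:  # deleted document: nothing left to compare
            continue
        for other in docs:
            if other == c:
                continue
            pair = (min(c, other), max(c, other))
            if pair in kept:  # already handled from the other side
                continue
            s = cosine(docs[pair[0]], docs[pair[1]])
            if s >= threshold:
                kept[pair] = s
    return kept
```

With n documents of which k changed, this sketch performs on the order of k·n comparisons instead of the n²/2 needed by a full recomputation, which is where the savings come from when updates touch only a small fraction of the data.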

This research has the potential to produce fully optimized solutions with significantly reduced cost and increased speed for a variety of big data applications that perform similarity analysis. The developed software will be made available so that application developers and data engineers can conduct large-scale computation without having to manage the complexity of parallelism. The project web site (www.cs.ucsb.edu/projects/psc/) is used for dissemination of results. The educational plan includes research mentoring, improvement of undergraduate and graduate instruction, and outreach activities such as working with high school students.

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1118106
Program Officer
Maria Zemankova
Project Start
Project End
Budget Start
2011-08-15
Budget End
2014-07-31
Support Year
Fiscal Year
2011
Total Cost
$515,732
Indirect Cost
Name
University of California Santa Barbara
Department
Type
DUNS #
City
Santa Barbara
State
CA
Country
United States
Zip Code
93106