Pattern matching is a fundamental research field with applications in domains such as biological sequence alignment, web search engines and network intrusion detection. Given a pattern P and a text string T, the central problem is to find occurrences of P in T. When data becomes massive, we cannot assume that text can be stored in RAM. Pattern matching problems must now be considered with more appropriate models like external memory model, cache-oblivious model, streaming models, MapReduce paradigm and multi-core models. In many cases, a blend of models or newer, more appropriate models need to be developed keeping the practical aspects of the application in sight.
The focus of this project is to develop efficient search algorithms and indexes when a data set resides on disks, on network storage, or is accessible only as an online stream. The data must be efficiently searchable even though it may be in compressed format. The project considers traditional pattern matching problem, as well as variants such as (i) approximate matching -- where the pattern may not exactly match a substring in T, (ii) online matching -- where the pattern(s) are known in advance and text comes as a stream, and (iii) string retrieval -- where instead of finding all the occurrences, the focus is on retrieving high ranking documents which contain one or more occurences of the query pattern. Issues of I/O efficiency and space utilization are central to this project. This involves developing suitable massive data set models, deriving optimal theoretical bounds and implementing practical tools. Methodologies include combinatorial and randomized methods in pattern matching, succinct data structures, top-k query processing and I/O efficient indexes.
The project will build new, solid theoretical foundations in pattern matching, with direct applications to fields like databases and information retrieval. It will significantly drive forward current state of the art in web search engine technology (by impacting the way inverted indexes are used) and genome sequence alignment tools (e.g., BLAST). Tools and software developed during this project will be widely distributed to the research community. Some components will be incorporated into undergraduate and graduate algorithms course curricula as implementation projects.
The goal of the project was to develop pattern matching techniques (algorithms and data structures) for massive data sets (or big data). Several theoretical models like external memory, top-k querying and compressed indexing are popular in theory as well as database communities for big data. Newer programming paradigms like Map-Reduce are useful for multi-node implementations. We achieved several new results in these aspects of massive data. In terms of pattern matching, the goal was to achieve, better results for known problems as well as advancing the state-of-the-art by solving newer problems. One of central problems we considered was string retrieval with relevance. Following summarizes our outcomes in different facets of this project: A. External Memory Model -- (i) Top-k document retrieval: We obtained O(n log* n) space index with optimal O(P/B + log_B n + occ/B) I/Os. [ESA13] (ii) top-k colored range query: We obtained O(n alpha (n)) space index with O(alpha^3 (n) + occ/B) query I/Os [PODS14] and (iii) Shared constraint range reporting: We obtain O(n) space index with O(loglog N + occ/B) I/Os [ICDT15]. These three problems emerge in various top-k pattern matching scenario. B. Compression Model -- We obtained several theoretical results developing succinct indexes for (i) property suffix trees [I&C13](ii) wildcards matching [JDA13](iii) dictionary matching [TCS13] (iv) proximity based top-k retrieval [ISAAC14](v) top-k retrieval based on frequency [DCC13] (vi) discriminating words[SPIRE14]. The work on proximity based is the first positive result in this space enabling succinct index for a problem believed to be hard until now. We gave best improved state of the art result for frequency based top-k allowing 2|CSA| space with O(logklog^{1+e}) per item retrieval time. Finally, we gave succinct index for discriminating words -- a problem motivated by information retrieval. C. Top-k retrieval -- Apart from the problems covered in A. and B. above: we solved (i) top-k color queries for tree paths [SPIRE13](ii) implemented top-k join queries [IDEAS13] (iii) range frequency queries [MFCS13] (iv) Ranked document selection - where instead of top-k the query asked for only kth ranked document [SWAT14]. On implementation side, we demonstrated the first "inverted index" for string documents [SIGIR11] which achieves top-k retrieval with phrase searching capabilities. We achieved per doc retrieval time of 10 micro-secs. D. Mapreduce -- We implemented one of the first mapreduce based string index [ICPADS14] based on suffix tree. E. Other -- apart from these we obtained several results on (i) position restricted matching [SPIRE13,EDBTW13] (ii) practical Approximate matching [EDBTW13] (iii) joins on uncertain strings [SIGMOD14] (iv) top-k retrieval for uncertain strings [SSDBM11] (v) Range LCP queries [SPIRE13] (vi) work on circular patterns [ISAAC11, CPM12,CPM13] providing a complete range of results. Broader Impacts: The results were disseminated in top conferences and journals in computer science. Survey/plenary talks based on top-k retrieval were delivered in Dagstuhl13, Ianfest13, CIKM12. Several student were supported (including a women) and trained in theoretical as well as practical aspects of string matching and big data.