Multi-pattern matching is the problem of finding whether and where a set of patterns appear in a data stream. An example of such a problem is scanning Internet traffic for malware signature patterns. Important considerations in multi-pattern matching are the amount of memory required to store the patterns and the speed with which the stream can be scanned through. Storage space can be reduced either by compressing the patterns or by approximately 'sketching' them. This project aims to advance this field by seeking ways to enable multi-pattern search with (1) minimum possible space, (2) fastest possible time, and (3) high accuracy. The project combines theoretical advances with practical implementations. Theoretical approaches involve sketching, streaming, hashing and succinct data structures. Practical goals include developing faster hardware-based implementations and cache-aware implementations.
Multi-pattern matching is central to applications such as computer virus detection, network intrusion detection and information extraction. It resides at the cross-section of several fields of computer science, including theory, networking, databases and bioinformatics. Recent development of tools in pattern matching such as succinct data structures and streaming algorithms make this an exciting time to advance the state-of-the-art in this field. Several research opportunities for undergraduate students will evolve from this project. Results will be disseminated through publications in conferences and journals.