In this project, the PIs propose to construct and develop a shared infrastructure to support the collection and maintenance of realistic, large-scale spam data sets, referred to as SPAM Commons.
Spam is a problem in many important communications media, such as email and the web. Phishing, a sub-problem of spam and a form of online pretexting, caused an estimated $3.2B in damages in 2007. The broad impact of effective spam filtering methods can be estimated in the billions of dollars across these media.
Spam has also invaded other media, with concrete attack examples in social networks, the blogosphere, Internet telephony (VoIP), instant messaging, and click fraud.
Unfortunately, spam research has been hampered by the lack of published real-world data sets, owing to concerns about privacy and company intellectual property. This project team develops a shared infrastructure to support the collection and maintenance of realistic, large-scale spam data sets, called the Spam Processing, Archiving, and Monitoring Community Facility (SPAM Commons).
The main goals of SPAM Commons are: (1) to facilitate remedial research that will stem the waste and losses caused by spam, and (2) to enable revolutionary research that aims to stop certain kinds of spam attacks altogether.
SPAM Commons is divided into a Public Partition and a Protected Partition.
The Public Partition is a direct analog of standard corpora for speech and image recognition research, consisting of a systematic and regular collection of both spam and legitimate data in the various communications media, starting from email and web spam, and expanding into other communications media as spam becomes a serious threat in each area and data become available.
The Protected Partition consists of a combined data and processing facility that makes private data or near-real-time spam data available for experimental evaluation of spam defense mechanisms in a protected testbed. Access to such protected data will enable new research on spam that evolves in real time and on real-world data sets, research that is infeasible today.
The intellectual challenges of the SPAM Commons project extend beyond the new research in the spam areas mentioned above that the availability of data sets will enable. The construction of both partitions of SPAM Commons involves significant intellectual challenges of its own. First, the isolation of the Protected Partition partially addresses the concerns of privacy, which remains a general research problem. Second, useful spam and legitimate data sets require automated distinction of spam from legitimate documents with certainty, which remains an open research question in email, web, and other media. Third, the adversarial and mutual evolution of spam producers and defenders requires continuous collection of fresh data for further study. Finally, the collection and streaming of near-real-time spam data represent research resources currently unavailable to spam researchers. Advances in these areas will spur the growth and evolution of SPAM Commons, which in turn will enable new research on the evolving and growing spam problem.
The impact of SPAM Commons data sets on experimental spam research may be similar to the impact of large corpora in disciplines such as speech and image recognition and natural language processing, which achieved a level of reproducibility and comparability of scientific results once the use of such corpora became a standard requirement. The proposed data repository will be supported and used by nine university partners (Clayton State, Emory, Georgia Tech, NC A&T, Northwestern, Texas A&M, UC Davis, U. Georgia, UNC Charlotte) and several industry partners (IBM, PureWire, Secure Computing).