Online communication media such as email, web sites, newsgroups, online forums, and chat rooms have become an integral part of everyday life. Unfortunately, online channels are also being misused to distribute unsolicited and inappropriate information (e.g., extremist propaganda, spam, and online gambling). The anonymous nature of these channels makes them an ideal means of communication for criminal groups and extremist organizations. Additionally, the evolution of the Internet into a major international communication medium has added a multilingual dimension.

Authorship analysis has been used to analyze long, precise English texts such as the plays of Shakespeare (authorship identification) or students' class papers (plagiarism detection). Few past studies have addressed the multilingual issues of short online communications. The language-specific stylistic characteristics and the informal nature of online communications present unique research challenges. This exploratory project aims to develop a comprehensive framework and associated text mining techniques for multilingual online stylometric feature extraction and authorship classification, initially focusing on two languages, English and Arabic. The linguistic differences between these two languages will allow the evaluation of common stylistic representations and the exploration of other language-specific problems. The goal is to develop scalable online authorship analysis techniques that can be used to analyze hundreds to thousands of anonymous authors (a common scenario for web communications). Novel feature (subset) selection techniques will help reduce the high dimensionality of online writing features.
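To make the idea of stylometric feature extraction concrete, the following is a minimal sketch in Python and not the project's actual method. The function name `extract_features`, the chosen features, and the tiny `FUNCTION_WORDS` list are all assumptions introduced for illustration; a real system would use far more features and language-specific resources. The regular-expression tokenizer is deliberately crude and does not handle Arabic diacritics or morphology.

```python
import re
from collections import Counter

# Hypothetical, tiny English function-word list (illustration only).
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "a", "is", "that")


def extract_features(text: str) -> dict[str, float]:
    """Map one message to a small vector of writing-style features."""
    # Crude tokenizer: runs of Unicode word characters, lowercased.
    words = re.findall(r"\w+", text.lower())
    n_words = len(words)
    n_chars = len(text)
    counts = Counter(words)

    def per_word(x: float) -> float:
        # Avoid division by zero on empty messages.
        return x / n_words if n_words else 0.0

    def per_char(x: float) -> float:
        return x / n_chars if n_chars else 0.0

    features = {
        # Lexical features.
        "avg_word_length": per_word(sum(len(w) for w in words)),
        "type_token_ratio": per_word(len(counts)),
        # Character-level features.
        "digit_ratio": per_char(sum(c.isdigit() for c in text)),
        "punct_ratio": per_char(
            sum(not c.isalnum() and not c.isspace() for c in text)
        ),
    }
    # Function-word frequencies, relative to message length in words.
    for fw in FUNCTION_WORDS:
        features[f"fw_{fw}"] = per_word(counts[fw])
    return features


if __name__ == "__main__":
    for name, value in extract_features("The cat and the dog.").items():
        print(f"{name}: {value:.3f}")
```

Vectors like these, computed per message, could then be passed to a feature selection step and a standard classifier to attribute messages to candidate authors.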

The primary intellectual contributions of this research are expected to be: (a) development and evaluation of new text mining techniques that may be suitable for identity tracing in cyberspace; (b) creation of new representations of people's identities using online "Writeprints" (i.e., the representation of people's key online writing style features); and (c) evaluation of the effectiveness of different multilingual stylistic features and classification techniques for improving identification scalability and robustness.

The anticipated broader impacts of this research include: building a foundation for further cyber trust research; improving intelligence and law enforcement agencies' abilities to detect, prevent, and respond to cyber crimes and terrorist events via the Internet; and providing a large-scale research corpus and feature extraction resources for information scientists, political and social scientists, and terrorism researchers. The project web site (http://ai.arizona.edu/authorship) will be used for broad dissemination of project results.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0646942
Program Officer: Maria Zemankova
Budget Start: 2006-09-01
Budget End: 2008-02-29
Fiscal Year: 2006
Total Cost: $75,000

Name: University of Arizona
City: Tucson
State: AZ
Country: United States
Zip Code: 85721