SGER: Multilingual Online Stylometric Authorship Identification: An Exploratory Study

Chen, Hsinchun

Abstract

Online communication media such as email, web sites, newsgroups, online forums, and chat rooms have become ubiquitous in everyday life. Unfortunately, online channels are also misused to distribute unsolicited and inappropriate information (e.g., extremist propaganda, spam, and online gambling). The anonymous nature of these channels makes them an ideal means of communication for criminal groups and extremist organizations. Additionally, the evolution of the Internet into a major international communication medium has added a multilingual dimension.

Authorship analysis has been used to analyze long, precise English texts such as the plays of Shakespeare (authorship identification) or students' class papers (plagiarism detection). Few past studies have addressed the multilingual issues of short online communications. The language-specific stylistic characteristics and the informal nature of online communications present unique research challenges. This exploratory project aims to develop a comprehensive framework and associated text mining techniques for multilingual online stylometric feature extraction and authorship classification, initially focusing on two languages, English and Arabic. The linguistic differences between these two languages will allow the project to evaluate common stylistic representations and to explore language-specific problems. The goal is to develop scalable online authorship analysis techniques that can analyze hundreds to thousands of anonymous authors (a common scenario for web communications). Novel feature (subset) selection techniques will help reduce the high dimensionality of online writing features.
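As a rough illustration of what stylometric feature extraction and authorship classification can look like, the following Python sketch maps each message to a small vector of style markers and attributes an anonymous message to the author with the nearest average vector. It is not the project's method: the feature set, the short English function-word list, the function names, and the nearest-centroid rule are all assumptions chosen for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical, tiny English function-word list, for illustration only.
FUNCTION_WORDS = ("the", "of", "and", "to", "in", "i", "that", "is")


def stylometric_features(text):
    """Map a message to a small vector of style markers (assumed feature set)."""
    # [^\W\d_]+ matches runs of Unicode letters, so Arabic letters match too.
    words = re.findall(r"[^\W\d_]+", text.lower())
    n_words = max(len(words), 1)
    n_chars = max(len(text), 1)
    counts = Counter(words)
    features = [
        sum(len(w) for w in words) / n_words,          # average word length
        len(counts) / n_words,                         # type-token ratio
        sum(ch in ".,;:!?" for ch in text) / n_chars,  # punctuation rate
        sum(ch.isupper() for ch in text) / n_chars,    # uppercase rate
    ]
    # Relative frequency of each function word.
    features += [counts[w] / n_words for w in FUNCTION_WORDS]
    return features


def build_profiles(samples):
    """Average the feature vectors of each author's known messages."""
    profiles = {}
    for author, texts in samples.items():
        vectors = [stylometric_features(t) for t in texts]
        profiles[author] = [sum(col) / len(col) for col in zip(*vectors)]
    return profiles


def attribute(message, profiles):
    """Return the author whose profile is nearest (Euclidean) to the message."""
    vec = stylometric_features(message)
    return min(profiles, key=lambda author: math.dist(vec, profiles[author]))


if __name__ == "__main__":
    samples = {
        "a": ["HEY YOU!!! GO NOW!!!", "NO WAY!!! SO BAD!!!"],
        "b": [
            "the committee considered the proposal carefully.",
            "the evaluation of the framework is ongoing.",
        ],
    }
    profiles = build_profiles(samples)
    print(attribute("OH NO!!! GET OUT!!!", profiles))
```

The features here are unscaled, so average word length tends to dominate the distance. A setting with hundreds to thousands of authors and a high-dimensional feature space would need normalization and feature selection, which this sketch leaves out.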

The primary intellectual contributions of this research are expected to be: (a) development and evaluation of new text mining techniques that may be suitable for identity tracing in cyberspace; (b) creation of new representations of people's identities using online "Writeprints" (i.e., the representation of people's key online writing style features); and (c) evaluation of the effectiveness of different multilingual stylistic features and classification techniques for improving identification scalability and robustness.

The anticipated broader impacts of this research include: building a foundation for further cyber trust research; improving intelligence and law enforcement agencies' abilities to detect, prevent, and respond to cyber crimes and terrorist events via the Internet; and providing a large-scale research corpus and feature extraction resources for information scientists, political and social scientists, and terrorism researchers. The project web site (http://ai.arizona.edu/authorship) will be used for broad dissemination of project results.

Funding Agency

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0646942
Program Officer: Maria Zemankova

Project Start
Project End
Budget Start: 2006-09-01
Budget End: 2008-02-29
Support Year
Fiscal Year: 2006
Total Cost: $75,000
Indirect Cost

Institution

University of Arizona, Tucson, AZ, United States