Documents do not speak only of their contents; to a trained eye, they can also say much about their author. The field of "authorship attribution" in humanities scholarship has attended to this for centuries, asking how, and how accurately, the author of a document can be identified. Recent developments in corpus linguistics have shown that these determinations can be made automatically by "non-traditional" methods, essentially statistical investigations of the words, phrases, layout, and other features of the document.

Unfortunately, the current state of the art is a confused collection of proposed methods, with little guidance about which methods work, why they work, and under what conditions they work best. We are addressing this by developing a modular software framework, based on a theoretical model proposed by Juola [23], whose functional components can be easily swapped and cross-combined.

By applying a rigorous testing method to the resulting set of (novel) combinations, the project is establishing accuracy benchmarks for various techniques (under the various testing conditions), finding new combinations resulting in improved techniques, and creating "best practices."
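The idea of swappable components tested in cross-combination can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the project's actual framework: the three-stage split (normalizer, feature extractor, distance), the function names, the nearest-neighbor attribution rule, the leave-one-out scoring, and the toy corpus are all hypothetical.

```python
from collections import Counter
from itertools import product
import math

# Stage 1 (hypothetical): text normalizers.
def identity(text):
    return text

def lowercase(text):
    return text.lower()

# Stage 2 (hypothetical): feature extractors that turn text into events.
def word_events(text):
    return text.split()

def char_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

def profile(events):
    """Convert a list of events into a relative-frequency profile."""
    counts = Counter(events)
    total = sum(counts.values()) or 1
    return {event: count / total for event, count in counts.items()}

# Stage 3 (hypothetical): distances between two frequency profiles.
def manhattan(p, q):
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in set(p) | set(q))

def cosine_distance(p, q):
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return 1.0 - dot / norm if norm else 1.0

def attribute(unknown, known, normalize, extract, distance):
    """Return the author label of the nearest known document."""
    target = profile(extract(normalize(unknown)))
    nearest = min(
        known,
        key=lambda item: distance(target, profile(extract(normalize(item[1])))),
    )
    return nearest[0]

def leave_one_out_accuracy(corpus, normalize, extract, distance):
    """Hold out each document in turn and attribute it using the rest."""
    hits = 0
    for i, (author, text) in enumerate(corpus):
        rest = corpus[:i] + corpus[i + 1:]
        if attribute(text, rest, normalize, extract, distance) == author:
            hits += 1
    return hits / len(corpus)

def benchmark(corpus, normalizers, extractors, distances):
    """Score every cross-combination of the supplied components."""
    return {
        (n.__name__, e.__name__, d.__name__):
            leave_one_out_accuracy(corpus, n, e, d)
        for n, e, d in product(normalizers, extractors, distances)
    }

# Made-up toy corpus of (author label, text) pairs, for illustration only.
TOY_CORPUS = [
    ("A", "the cat sat upon the mat and the cat slept"),
    ("A", "the dog sat upon the rug and the dog slept"),
    ("B", "whilst rain fell we walked onward whilst wind blew"),
    ("B", "whilst snow fell we wandered onward whilst stars shone"),
]
```

Calling `benchmark(TOY_CORPUS, [identity, lowercase], [word_events, char_bigrams], [manhattan, cosine_distance])` returns one accuracy score for each of the eight combinations. Because each stage is an ordinary function, adding a component means adding one entry to a list, and the comparison table grows automatically.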

Agency: National Science Foundation (NSF)
Institute: Division of Advanced CyberInfrastructure (ACI)
Type: Standard Grant (Standard)
Application #: 0721667
Program Officer: Kevin L. Thompson
Project Start:
Project End:
Budget Start: 2007-08-15
Budget End: 2011-07-31
Support Year:
Fiscal Year: 2007
Total Cost: $212,000
Indirect Cost:
Name: Duquesne University
Department:
Type:
DUNS #:
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15282