Recent developments in machine learning and corpus linguistics have shown that authorship can be determined automatically using statistical methods; the NSF-funded JGAAP (Java Graphical Authorship Attribution Program) system has been part of these developments. JGAAP has helped support the emerging authorship attribution community and has produced a useful tool for a wide variety of scholarly specialties.
Although JGAAP incorporates thousands of possible methods, many more have been proposed in the literature but not rigorously tested. Comparative testing on a large scale will require the development of new methods and test corpora. In addition, several key problems must be addressed to meet the needs of the community, such as the open class problem, the adversarial problem, and the coauthorship problem. Finally, we will examine applications of JGAAP and similar systems to key areas in linguistic profiling, such as determining gender, education, native language, psychological profile, medical condition, age (of document or writer), or even attempted deception. Again, by applying a rigorous testing method to these new problems and corpora, the project can establish accuracy benchmarks for various techniques (under the various testing conditions), find new combinations that yield improved techniques, and establish recommendations for 'best practices.'
Improved authorship attribution will be immediately useful both to scholars and in broader social contexts, such as law enforcement and forensics, where there are direct demands for this kind of security technology. The historical/social analysis will also improve exchange among the related disciplines of digital humanities, sociology, history, and computer science, providing the basis for a better understanding of traditional humanities issues. Profiling work can help medical and psychological practitioners by providing a non-invasive method to detect certain aspects of a person's mind. The software developed (and the planned development/distribution process) will help improve the effectiveness of both digital humanities scholarship and computer science, especially through the establishment of software review standards and processes. In particular, by providing direct evidence of the conditions and expected error rates involved in various techniques, the information gained will help authorship attribution meet the Daubert criteria for expert evidence, allowing authorship attribution to be used in a formal legal setting. Finally, the funding of this research will help support the unique interdisciplinary Duquesne University Computational Mathematics program, broadening access to technological education for an audience not typically reached by it.
The JGAAP Improvement project set out to make authorship attribution a practical and useful technology, not only for computational specialists, but for the general public. To that end, we proposed both to make technical improvements in accuracy and data size and to create a useful platform for authorship studies generally.
* We have helped establish a new baseline for authorship attribution accuracy, having correctly analyzed more documents than any other team in the PAN/CLEF 2012 authorship competition.
* Our work has been accepted as admissible in US Federal Court, both Immigration Court and the Southern District of New York.
* We have provided the base for several related technologies, notably Drexel University's Anonymouth and JStylo programs.
* We have partnered with Drexel University to create a stylometric authentication program for DARPA's Active Authentication project. This project aims to replace the password by continuously monitoring the behavior of a computer user.
* We have established new technologies for psychological testing using only writing, creating new possibilities for telemedicine. We have, for example, been able to identify bipolar disorder with near-perfect accuracy.
* We have participated in and supported PAN/CLEF in the establishment of a long-running TREC-style competitive evaluation for authorship attribution and profiling.
* We have published a proposed standard protocol for addressing authorship questions in an effort to improve standardization and admissibility.
* We have expanded the pool of languages studied for authorship attribution purposes and provided strong evidence that "best practices" are cross-linguistic and transfer between languages.
* We have created a tool (JGAAP) available for the general public to use to resolve authorship questions; this tool, in turn, has been used by many third parties, most notably in the analysis of Edgar Allan Poe's early work as published in the New Yorker.
* We have been involved in a number of high-profile attribution cases, including the refutation of Newsweek's suggestion that Dorian Nakamoto was the author of the Bitcoin protocol, and our identification of J. K. Rowling (the author of the Harry Potter series) as the true author of Robert Galbraith's The Cuckoo's Calling.