Documents do not speak only of their contents; to a trained eye, they also reveal much about their author. Humanities scholars have studied ``authorship attribution'' for centuries, asking how, and how accurately, the author of a document can be identified. Recent developments in corpus linguistics have shown that such determinations can be made automatically by ``non-traditional'' methods, essentially statistical analyses of the words, phrases, layout, and other features of the document.
Unfortunately, the current state of the art is a confused collection of proposed methods, with little guidance about which methods work, why they work, and under what conditions they work best. We are addressing this by developing a software framework for authorship attribution, based on a theoretical model proposed by Juola[23], whose modular design permits functional components to be swapped easily and tested in cross-combination.
By applying a rigorous testing method to the resulting set of (novel) combinations, the project is establishing accuracy benchmarks for the various techniques under various testing conditions, finding new combinations that yield improved techniques, and creating ``best practices.''
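
The idea of swappable components tested in cross-combination can be illustrated with a minimal sketch. It is not the framework's actual code. It assumes, purely for illustration, that an attribution pipeline splits into three stages (text canonicalization, event extraction, and a distance measure used for nearest-author classification). Every function name, the stage breakdown, and the toy texts are hypothetical and were invented for this example.

```python
from collections import Counter
from itertools import product

# Stage 1 (hypothetical): canonicalizers normalize the raw text.
def identity(text):
    return text

def lowercase(text):
    return text.lower()

# Stage 2 (hypothetical): event extractors turn text into a list of events.
def words(text):
    return text.split()

def char_bigrams(text):
    return [text[i:i + 2] for i in range(len(text) - 1)]

# Stage 3 (hypothetical): distances between relative-frequency profiles.
def manhattan(p, q):
    keys = set(p) | set(q)
    return sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

def jaccard_distance(p, q):
    a, b = set(p), set(q)
    return 1.0 - len(a & b) / len(a | b) if (a | b) else 0.0

def profile(text, canon, events):
    """Relative frequency of each event in the canonicalized text."""
    counts = Counter(events(canon(text)))
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}

def attribute(unknown, known, canon, events, distance):
    """Return the known author whose profile is nearest to the unknown text."""
    target = profile(unknown, canon, events)
    return min(
        known,
        key=lambda author: distance(target, profile(known[author], canon, events)),
    )

def benchmark(train, test, canons, extractors, distances):
    """Accuracy of every (canonicalizer, extractor, distance) combination."""
    results = {}
    for canon, events, distance in product(canons, extractors, distances):
        correct = sum(
            attribute(text, train, canon, events, distance) == author
            for text, author in test
        )
        name = (canon.__name__, events.__name__, distance.__name__)
        results[name] = correct / len(test)
    return results

# Toy data, invented for this sketch only.
train = {
    "A": "the cat sat on the mat the cat ran",
    "B": "stocks rose sharply as markets rallied today",
}
test = [
    ("the cat sat on the hat", "A"),
    ("markets rallied as stocks rose", "B"),
]

results = benchmark(
    train, test,
    canons=[identity, lowercase],
    extractors=[words, char_bigrams],
    distances=[manhattan, jaccard_distance],
)
for combo, accuracy in sorted(results.items()):
    print(combo, accuracy)
```

Because each stage is an ordinary function with a fixed interface, adding one new component to any stage automatically adds a whole row of new combinations to the benchmark, which is the property the modular design is meant to provide.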