This project studies three aspects of human linguistic communication: the language used in the communication (for example, whether formal or informal), the topology of the social network of the communicators (for example, whether the speaker is embedded in a single tight-knit group), and the roles the communicators occupy in an organization (for example, whether the speaker is an upper-level manager or an administrative assistant). In the past, computer scientists and sociologists have analyzed these aspects in isolation, while sociolinguists and linguistic anthropologists have elaborated qualitative joint models. The ever-increasing flow of electronic communication offers new opportunities to analyze and quantitatively model these aspects of communication.

The project uses the Enron email corpus as a testbed for the development of computational joint models of these three aspects of communication, focusing on linguistic features (such as topic, genre, and speech act) and topological abstractions (such as subgroup analysis) that can be reliably and automatically analyzed in electronic communication. The work is being evaluated via concrete prediction tasks, such as the prediction of a person's organizational role based on the topology and linguistic content of their communication, and the prediction of how likely two people are to communicate in the future based on a limited sample of their communications.

The work is expected to have various potential applications, both for the general public, in the form of improved human-computer interaction for email clients and software that accesses email, and for the law enforcement and intelligence communities, in the development of automated techniques for discovering leadership and predicting communicative behavior.

Project Report

Humans are social beings: we establish and maintain relationships with other human beings. The collection of all such relationships is what we call a "social network" (which today may be supported by an application such as Facebook). While some animals also have social networks, we are distinguished by the fact that we primarily use language to create, maintain, and describe our social relationships. The goal of this project was to contribute to our understanding of how we use language in relation to our social network.

The starting point for our investigation was the Enron email database, a subset of the emails sent in the Enron corporation prior to its bankruptcy. In such a corporate setting, there are several types of relations between people. First, because Enron was a corporation, there is a well-defined hierarchy that includes all employees. Second, there is the email network, defined simply as "who sent email to whom". Finally, there is the actual social network: not all social relations are reflected in the email network (there can be non-email communication, for example). We wanted to investigate the relation between these three networks.

To do so, we first produced a new version of the set of emails, which includes the threads of email conversation (and not just individual emails), as well as a much more complete hierarchy network than was previously available. We then chose the problem of predicting the hierarchical relations from the communication. Previous work has used the language in emails and the email network to predict hierarchy. We were instead interested in using the actual social network to predict the hierarchy. Since we don't know the actual social network, we made use of a simple observation: if person A sends an email to person B and mentions person C, then person A must know person C, and person B either knows or comes to know of person C through this email. Thus, even if there is no email between person A and person C in our collection of emails, we know that there is a social relation between them. We call this new network the "mention network", and we assume that the email network together with the mention network is closer to the actual social network than either of them alone. We showed that the mention network is indeed a powerful predictor of the corporate hierarchy, more reliable than the email network on its own, and more reliable than just the language in the emails.

Our mention network is only a simple approximation of the actual social network. Can we instead extract a better approximation of the actual social network from language? A social relation can be written about explicitly ("John and Sandy are married"); the detection of such relations in text is a well-known problem in Natural Language Processing (NLP). In addition, a social relation can be suggested by what we call a "social event": if John and Sandy have dinner, then they know each other. While we don't know if they are married or friends or colleagues, we do know that there is some sort of social relation between them. In fact, any social relation must be based on social events. We developed a system that detects social events in text and assembles them into a social network. This system uses different types of lexical and syntactic information in conjunction with a "tree kernel", which enables a machine learning system to efficiently access a vast space of hypotheses about what exactly characterizes a social event.

We have applied this system to two domains so far: news reports and literature, specifically "Alice in Wonderland". Literature, even fantastic literature such as "Alice", is an imagining of possible human behavior, and thus of social networks. We can analyze literature by analyzing its social networks, including, for example, their evolution over the course of the novel.

In future work, we will apply our tool to other domains, such as collections of historical sources. We believe that our system could become an important tool for data-driven research in the social sciences and humanities. In summary, our work has validated the claim that the actual social network is more useful in predicting aspects of social situations (such as the hierarchy) than simpler approaches. We have developed an initial version of a tool that can extract social networks from text, which we believe will lead to much more accurate representations of actual social networks than the simpler methods currently being used.
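
The mention-network observation in the report (A writes to B and mentions C, so A and B are each linked to C) can be illustrated with a minimal sketch. This is not the project's implementation: the `Email` record, the `build_networks` function, the use of directed, count-weighted edges, and the premise that mentioned names have already been resolved to people are all assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple

Edge = Tuple[str, str]


class Email(NamedTuple):
    """Hypothetical record for one message; not the corpus's actual schema."""
    sender: str
    recipients: Tuple[str, ...]
    mentioned: Tuple[str, ...]  # assumed: names in the body already resolved to people


def build_networks(emails: Iterable[Email]) -> Tuple[Dict[Edge, int], Dict[Edge, int]]:
    """Return (email_net, mention_net) as dicts mapping directed edges to counts."""
    email_net: Dict[Edge, int] = defaultdict(int)
    mention_net: Dict[Edge, int] = defaultdict(int)
    for msg in emails:
        # Email network: "who sent email to whom".
        for rcpt in msg.recipients:
            email_net[(msg.sender, rcpt)] += 1
        # Mention network: the sender knows each mentioned person, and each
        # recipient knows or comes to know of that person through the email.
        for person in msg.mentioned:
            for knower in (msg.sender, *msg.recipients):
                if knower != person:  # skip self-links
                    mention_net[(knower, person)] += 1
    return dict(email_net), dict(mention_net)


# Toy example: alice writes to bob and mentions carol.
emails = [Email(sender="alice", recipients=("bob",), mentioned=("carol",))]
email_net, mention_net = build_networks(emails)
print(email_net)
print(mention_net)
```

In this toy example, alice and carol exchange no email, yet both alice and bob gain an edge to carol in the mention network, which is the kind of relation the email network alone would miss.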

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Application #
0713548
Program Officer
Tatiana D. Korelsky
Project Start
Project End
Budget Start
2007-09-01
Budget End
2012-08-31
Support Year
Fiscal Year
2007
Total Cost
$509,679
Indirect Cost
Name
Columbia University
Department
Type
DUNS #
City
New York
State
NY
Country
United States
Zip Code
10027