Twitter is an electronic medium that allows a large user populace to communicate with each other simultaneously. Inherent to Twitter is an asymmetrical relationship between friends and followers, thereby providing an interesting social network-like structure among the users of Twitter. Twitter messages, called tweets, are restricted to 140 characters and thus are usually very focused. Twitter is becoming the medium of choice for keeping abreast of rapidly breaking news. This project explores the use of Twitter to build a news processing system from Twitter tweets. The result is analogous to a distributed news wire service. The difference is that the identities of the contributors/reporters are not known in advance and there may be many of them. The tweets are not sent according to a schedule. The tweets occur as news is happening, are noisy, and usually arrive at a high throughput rate.
The goal of this exploratory research project is to find effective methods for making Twitter a useful news gathering mechanism. Challenges addressed in this project include: removing the noise; determining tweet clusters of interest, bearing in mind that the methods must be online; and determining the relevant location associated with the tweets.
The broad impact of this research is to make it easier to disseminate late-breaking news and to enhance the distributed news gathering and reporting process. A Web site (www.cs.umd.edu/~hjs/hjscat.html) reports the results of this and related research.
Twitter is an electronic medium that allows a large user populace to communicate with each other simultaneously. Inherent to Twitter is an asymmetrical relationship between friends and followers, thereby providing an interesting social network-like structure among its users. Twitter messages, called tweets, are restricted to 140 characters and thus are usually very focused. In this project our goal was to investigate the use of Twitter to build a news processing system from Twitter tweets. The idea was to capture tweets that correspond to late-breaking news. The result is analogous to a distributed news wire service. The difference is that the identities of the contributors/reporters are not necessarily known in advance and there may be many of them. The tweets are not sent according to a schedule. Instead, the tweets occur as news is happening, are noisy, and usually arrive at a high throughput rate.

Some of the issues that we investigated involved removing the noise, which meant determining tweet clusters of interest bearing in mind that the methods must be online, and determining the relevant location associated with the tweets. The latter is quite difficult, as our goal is to associate a tweet with the location that the tweet is about rather than the location of the tweeter, which is quite easy to determine given the GPS capabilities of smartphones, the tweeting device of choice. The former is also quite difficult, as we must identify tweeters who tweet newsworthy tweets; this is analogous to removing the noise tweets.

Our primary focus was twofold. The first part involved the disambiguation of entities (e.g., people, geographic locations, organizations, etc.) in tweets. As tweets are limited to 140 characters in length, there is very little information in a tweet to aid the disambiguation. Typically, when Twitter users refer to entities with which they are familiar, they include very little contextual information, which makes the disambiguation process all the more difficult. For example, for Twitter users who are residents of London, UK, it is well understood that ``David Cameron'' refers to the Prime Minister of the UK, while a reference to ``Buckingham Palace'' corresponds to a landmark in London; neither requires any additional information for the purpose of disambiguation. However, without these additional elaborations it would be almost impossible for a disambiguation algorithm to work properly. Most importantly, note that the Twitter user sending the tweet would find the inclusion of these additional elaborations rather redundant, and probably silly, if they were required to qualify ``David Cameron'' with the phrase ``Prime Minister of the UK''. In this regard, we say that ``David Cameron'' and ``Buckingham Palace'' are part of the local lexicon (i.e., common knowledge) of all Twitter users who are from London, UK. Our approach to resolving ambiguities in tweets from a user at location s is based on computing the local lexicon of s, which is informally defined as a set of concepts that is strongly associated with s and whose elements are, furthermore, recognized without ambiguity by Twitter users from s. The local lexicon includes, but is not limited to, people, landmarks, organizations, and even historical events. The key to our work was to investigate the use of Wikipedia to help form that local lexicon. The second part of our focus was on determining the trustworthy and noteworthy news tweeters, which we termed seeders.
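The following is a minimal, hypothetical sketch of the local-lexicon idea. The small hand-built lexicon stands in for one mined from Wikipedia, and the names used here (LOCAL_LEXICON, disambiguate) are illustrative assumptions rather than part of TwitterStand's actual implementation.

```python
# Sketch: resolve entity mentions in a tweet using the tweeter's local lexicon.
# LOCAL_LEXICON is a hand-built stand-in for a lexicon mined from Wikipedia.

# Map a location to the concepts its residents use without qualification.
LOCAL_LEXICON = {
    "London, UK": {
        "david cameron": "David Cameron (Prime Minister of the UK)",
        "buckingham palace": "Buckingham Palace (landmark in London)",
    },
    "College Park, MD": {
        "testudo": "Testudo (University of Maryland mascot statue)",
    },
}

def disambiguate(tweet_text, tweeter_location):
    """Return (mention, resolved concept) pairs found via the tweeter's local lexicon."""
    lexicon = LOCAL_LEXICON.get(tweeter_location, {})
    text = tweet_text.lower()
    return [(mention, concept) for mention, concept in lexicon.items() if mention in text]

if __name__ == "__main__":
    tweet = "Huge crowd outside Buckingham Palace waiting for David Cameron"
    for mention, concept in disambiguate(tweet, "London, UK"):
        print(f"{mention!r} -> {concept}")
```

In this toy setting the unqualified mentions resolve correctly only because the lexicon is keyed to the tweeter's location; the research problem is building such lexicons automatically from Wikipedia rather than by hand.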
In order to evaluate a user's contribution to TwitterStand, we defined what we term the important traits of a user and provided a mechanism to quantify and monitor that behavior. There are three areas that we quantified in order to evaluate a Twitter user: the number of clusters to which a user contributes; how many other users are also tweeting about the topics a user tweets about; and the timing of a user's tweets in terms of whether their tweets are the earliest ones in the cluster. In particular, we defined two different metrics to quantify when a user tweets: an overall score (RawTiming) and an average score (AvgTiming). RawTiming serves to provide an indication of how large clusters are and how quickly users tweet about a topic. RawTiming corresponds to the sum, over the clusters to which a user contributes, of the cluster's size minus the rank of the user's tweet within it. This means that if a user is the first to tweet about a cluster with 300 tweets, then the user's RawTiming contribution for that cluster is 299. AvgTiming corresponds to the same sum taken over all of the user's tweets regardless of how many tweets are in each cluster, but averaged over the total number of tweets in these clusters. Thus we see that AvgTiming helps compare one Twitter user to another, much as the Earned Run Average (ERA) does for a pitcher in baseball. Overall, our work showed the challenges in analyzing tweets, both in terms of making sense of them and in terms of whether it is even worthwhile to expend the effort to try to understand them.
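The sketch below illustrates one way to compute the two timing scores under the reading given above. The data layout (a list of (cluster size, rank) pairs per user), the function names, and the exact normalization used for AvgTiming are assumptions based on the prose, not TwitterStand's actual implementation; the 300-tweet example from the text is reproduced as the first entry.

```python
# Sketch of the RawTiming and AvgTiming user scores as described in the text.

def raw_timing(contributions):
    """Sum of (cluster size - rank of the user's tweet) over the user's clusters.

    contributions: list of (cluster_size, rank) pairs, with rank = 1 for the
    earliest tweet in a cluster. Being first in a 300-tweet cluster earns
    300 - 1 = 299 for that cluster.
    """
    return sum(size - rank for size, rank in contributions)

def avg_timing(contributions):
    """RawTiming normalized by the total number of tweets in the user's clusters,
    so that users with different volumes can be compared (akin to a pitcher's ERA)."""
    total_tweets = sum(size for size, _ in contributions)
    return raw_timing(contributions) / total_tweets if total_tweets else 0.0

if __name__ == "__main__":
    # A user who was 1st in a 300-tweet cluster and 10th in a 50-tweet cluster.
    user = [(300, 1), (50, 10)]
    print(raw_timing(user))   # 299 + 40 = 339
    print(avg_timing(user))   # 339 / 350 ~= 0.97
```

Under this reading, a high AvgTiming indicates a user who consistently tweets early in clusters that go on to become large, independent of how many clusters the user touches.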