Real-world social events provide a convenient and intuitive way to organize social media content for individuals. Current approaches to event detection from social media have two limitations: they assume that the events to be monitored (and their social media signatures) are known a priori, and they focus largely on text data, failing to take advantage of other forms of media such as images.
Against this background, this project explores a novel approach to discovering spontaneous, a priori unspecified social events through joint nonparametric Bayesian modeling of multi-modal data (including text and images), and to using the events thus discovered to foster new social links. The resulting tools for event discovery will be tested in an application involving the discovery of wild animal disease outbreaks from Twitter text messages and images posted by individuals.
The project brings together an interdisciplinary team of researchers with expertise in image analysis, text mining, and machine learning to advance the state of the art in detecting spontaneous, a priori unspecified events from social media data as they emerge. It is expected to yield new scalable nonparametric Bayesian approaches to joint modeling of image and text data, and more generally of multi-modal social media data. The resulting tools could potentially transform the way in which people use social media data by empowering them to discover and participate in real-world events even as they emerge.
Social media is a vast source of data in the form of text, images, videos and more that can be leveraged for tasks such as event detection, surveillance, monitoring and prediction, where near-real-time data covering large areas and time spans is necessary but physical and financial constraints make special-purpose, engineering-based solutions limited or impractical. The benefits of this rich and timely information source are tempered by the many challenges of using the data effectively, including (1) how to efficiently process the enormous daily volume of unstructured, diverse text, images and videos whose content is complex, ambiguous and often of poor quality, (2) how to use data produced by social media users who are not controlled, directed or motivated as citizen scientists are, and (3) how to use data whose spatial and temporal distribution is highly non-uniform, possibly erroneous or incomplete, and dependent on many factors. In this project we developed new ways of using social media as an information source. In particular, user posts to microblogging services such as Twitter may contain useful information for a given task even though the user had a completely different purpose for the post. For example, a tweet that is primarily about a person may also contain secondary information about a place, event or activity that is useful for the task. To explore the many issues that must be addressed to exploit social media data, e.g., extracting useful text and image feature representations from the raw data and designing machine learning models that use these features, we selected several tasks as testbeds. One task was to estimate air pollution from posts, developing machine learning models that can accurately estimate the Air Quality Index.
A second task was to analyze social media users who post bullying-related tweets, in order to make inferences about the participants’ roles and the temporal dynamics of posts, and to predict when regret leads a user to delete a post. A third task was to monitor wildlife roadkill in order to study methods for estimating an underlying spatiotemporal signal, in this case wildlife encounters, from biased, missing and scarce data. In all three cases our methods were shown to estimate the target phenomenon accurately. The broader impact of these results is the demonstration that social media can in fact be exploited to help build new tools for important societal problems.