The emerging field of sentiment analysis employs algorithmic methods to identify and summarize opinions expressed in text. Both machine learning and ad hoc approaches underlie contemporary sentiment analysis systems, but progress on improving precision and recall has been slowed by the expense and complexity of obtaining sufficiently broad, general sentiment training and validation data.

Recent work has established that fundamental economic variables can be successfully forecast by applying sentiment analysis methods to news-oriented text streams. This project turns that relation on its head, using such forecasting approaches to improve both the precision and recall of general entity-oriented sentiment analysis methods. Specifically, the project provides a three-pronged research effort into entity-level sentiment analysis, focusing on improved assessment and algorithms, with applications to the social sciences and forecasting: (1) developing a complete entity-level, text- and language-independent sentiment evaluation environment, both to further the development of the Lydia system and for release to the international sentiment analysis community; (2) building on this environment to develop improved sentiment-detection methods for English news, foreign-language news streams, social media such as blogs and Twitter, and historical text corpora; and (3) applying improved sentiment analysis to a variety of challenges in the social sciences.

This research promises to substantially improve both the precision and recall of sentiment detection methods by focusing on the weakest link: rigorous yet domain-, source-, and language-independent assessment of sentiment. Beyond improvements in natural language processing (NLP), this entails addressing other issues in opinion mining, including article clustering and duplicate detection, entity-domain context, and combining opinions from large numbers of distinct sources.

The sentiment analysis methods and data developed under this research project are expected to have a broad impact, as the results will be directly applicable in a wide range of social sciences, including sociology, economics, political science, and media and communication studies. The techniques will serve as both an educational and scholarly resource in these fields, empowering students and researchers to conduct their own primary studies on historical trends and social forces. Results will be disseminated to the community through the project website (www.textmap.org/III).

Project Report

The major activities for this project revolved around a new approach to natural language processing and sentiment analysis that naturally generalizes to all the world's major languages. Word embeddings assign each word in a language a unique point in (say) 50-dimensional space; two words have similar meanings/roles if they lie close to each other in that space. Recently re-introduced techniques in unsupervised feature learning make this possible by acquiring common features for a specific language's vocabulary from unlabeled text. These features, also known as distributed word representations (embeddings), have been used by us and other groups to build a unified NLP architecture that solves multiple tasks: part-of-speech (POS) tagging, named entity recognition (NER), semantic role labeling, and chunking.

We have built word embeddings for one hundred of the world's most frequently spoken languages (Al-Rfou et al., 2013), using neural networks (auto-encoders) trained on each language's Wikipedia in an unsupervised setting, and shown that they capture surprisingly subtle features of language usage such as sentiment, plurality, and even nation of origin (Chen et al., 2013). We have made these word embeddings freely available to the research community, with well over 1,000 downloads to date, and employ them in our work on sentiment analysis. Further, in work presented at KDD 2014 (Perozzi, Al-Rfou, and Skiena, 2014), we developed DeepWalk, an extension of the ideas behind word embeddings to identifying features in graphs. We have quantitatively demonstrated the utility of our word embeddings by using them as the sole features for training a part-of-speech tagger for a subset of these languages, and found its performance competitive with near-state-of-the-art methods in English, Danish, and Swedish. In particular, these word embeddings point to a way to build sentiment analysis systems for all the world's languages in an elegant, consistent, non-ad hoc fashion, by training on the Wikipedia edition of each language.

Our work (Chen and Skiena, 2014), presented at ACL 2014, produced high-quality sentiment lexicons for 136 major languages by integrating a variety of linguistic resources into an immense knowledge graph. Our lexicons have a polarity agreement of 95.7% with published lexicons, while achieving an overall coverage of 45.2%. Further, we demonstrated the performance of our lexicons in an extrinsic analysis of 2,000 distinct historical figures in Wikipedia articles from 30 languages. Despite cultural differences and the intended neutrality of Wikipedia, our lexicons show an average sentiment correlation of 0.28 across all language pairs. This paper (and the release of our lexicons) marked the successful completion of our major goal of sentiment detection systems for foreign-language streams.
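To illustrate the geometry behind this approach, here is a minimal sketch of finding a word's nearest neighbors in embedding space via cosine similarity. The toy vocabulary and randomly initialized vectors are hypothetical stand-ins; a real system would substitute the released trained embeddings for the random matrix.

```python
import numpy as np

# Hypothetical toy setup: in practice, the vocabulary and 50-dimensional
# vectors would come from embeddings trained on a language's Wikipedia.
vocab = ["good", "great", "bad", "terrible", "paris"]
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((len(vocab), 50))  # one row per word

def nearest_neighbors(word, k=3):
    """Return the k words whose vectors lie closest to `word` by cosine similarity."""
    v = embeddings[vocab.index(word)]
    # Normalize so that dot products equal cosine similarities.
    norms = np.linalg.norm(embeddings, axis=1) * np.linalg.norm(v)
    sims = embeddings @ v / norms
    order = np.argsort(-sims)  # sort by decreasing similarity
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:k]

print(nearest_neighbors("good"))
```

With trained rather than random vectors, the neighbors of "good" would be words of similar meaning or role, which is exactly the property that lets embeddings carry sentiment signal across languages.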
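The lexicon construction itself integrates many linguistic resources into a large knowledge graph; the following is only a minimal sketch of the underlying idea, propagating seed polarities over a toy synonym/translation graph. The word nodes, edges, and iteration count here are all hypothetical, chosen for illustration.

```python
from collections import defaultdict

# Hypothetical toy knowledge graph: nodes are words (possibly from different
# languages); edges link synonyms/translations. A few English seed words
# carry known polarity, which propagation spreads along the edges.
edges = [
    ("good", "bueno"), ("good", "bon"), ("bad", "malo"),
    ("bad", "mauvais"), ("bueno", "excelente"), ("mauvais", "affreux"),
]
seeds = {"good": 1.0, "bad": -1.0}

graph = defaultdict(set)
for u, v in edges:
    graph[u].add(v)
    graph[v].add(u)

# Simple iterative propagation: each non-seed word takes the average
# polarity of its neighbors; seed words stay fixed.
polarity = {w: seeds.get(w, 0.0) for w in graph}
for _ in range(20):
    updated = {}
    for w in graph:
        if w in seeds:
            updated[w] = seeds[w]
        else:
            neighbor_scores = [polarity[n] for n in graph[w]]
            updated[w] = sum(neighbor_scores) / len(neighbor_scores)
    polarity = updated

lexicon = {w: ("positive" if s > 0 else "negative")
           for w, s in polarity.items() if s != 0}
print(lexicon)
```

Thresholding the propagated scores yields a polarity lexicon for every language reachable in the graph, which is the intuition behind covering 136 languages from a modest set of seed resources.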

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 1017181
Program Officer: Maria Zemankova
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2014-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $423,164
Indirect Cost:
Name: State University New York Stony Brook
Department:
Type:
DUNS #:
City: Stony Brook
State: NY
Country: United States
Zip Code: 11794