This proposal addresses fundamental issues in unstructured data arising from text-heavy documents, where the underlying data are characterized by high volume, wide variety, and high velocity of change. Automating information extraction is critical in the information age and has high utility in online surveys and in threat detection and prevention. The integrated program of research and education will have significant impact in many fields, including machine learning and data mining, natural language processing, opinion surveys, business forecasting and service, health research, and social and political science. This will stimulate interdisciplinary research and collaboration with scientists from disparate fields. The proposed project requires extensive algorithm and software development for target applications. In particular, advanced computational tools will be developed using MapReduce on parallel and distributed computing platforms such as OpenMP, MPI, and Hadoop, and software documentation will be disseminated along with the technology transfer.

Unstructured data pose great challenges because text documents need to be embedded and integrated with numerical input for statistical modeling, which requires overparameterized modeling to achieve accurate prediction and unbiased inference for high-dimensional data. The proposed research aims to develop new statistical methods and tools for sentiment analysis and text summarization that utilize word relations through graphs, and for personalized prediction in recommender systems. For document summarization, it borrows information across all available documents, both tagged and untagged, leading to higher tagging accuracy. This will enhance information storage, sorting, processing, and filtering. Moreover, the project develops a novel approach for accurate personalized prediction that utilizes the heterogeneity among users, which impacts everyday life through personalization in areas such as service, recommendation, and advertising. More importantly, the proposed statistical methodology and scalable computational algorithms will be useful for other types of unstructured data. Finally, many of the advanced optimization techniques and computing procedures to be developed will also be applicable to other types of "big data" problems.

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Standard Grant (Standard)
Program Officer: Yong Zeng
University of Illinois Urbana-Champaign
United States