The goal of this project is to help users ask complicated questions over unstructured text and receive concrete answers, rather than pointers to more data, as is currently the practice with general search engines. The research examines how to efficiently execute complex queries over multiple web sources and how to estimate the quality of the returned results. It also examines how to execute queries over subjective sentiment data by analyzing the economic context in which the stated opinions are evaluated. The experimental research is strongly linked to the educational goals of the project, which include, among others, modernizing the undergraduate database curriculum to cover the latest trends in database and information systems and introducing a graduate course on the transformative role of search technologies in business and society. The results of this project will give users the power and tools to quickly process the vast amount of data available on the web and get back answers (not more documents or data to process). The systems built as part of this project, together with the related data sets, will be available on the project website (http://text-centric-db.stern.nyu.edu/), allowing everyone to evaluate complex queries over web data and build novel applications on top of this infrastructure.
You might have bought something on eBay and left a short feedback posting summarizing your interaction with the seller, such as 'Lightning fast delivery! Sloppy packaging, though.' Similarly, you might have visited Amazon and written a review for the latest digital camera that you bought, such as 'The picture quality is fantastic, but the shutter speed lags badly.' While reading an online review, you may also have come across identity-descriptive social information disclosed by reviewers about themselves, such as their real name, geographic location, hobbies, or nickname. Or, while searching for a used product in electronic second-hand markets such as those hosted by Amazon, you might have come across a seller's description such as 'Brand new device with original packaging! Factory authorized dealer! Full manufacturer's warranty.' What is the economic value of these comments? The comment about 'lightning fast delivery' can enhance a seller's reputation and thus allow the seller to increase the price of the listed items by a few cents without losing any sales. On the other hand, the feedback about 'sloppy packaging' can have the opposite effect on a seller's pricing power. Our research studies the economic value of user-generated content in such online settings. Based on this research, we can now quantify how economically important different product features are (for products ranging from cameras to hotels), and we built a product search engine that ranks products by their 'value for money' for a user. Going beyond this 'voluntarily generated' wisdom of the crowds, the project also examined the use of crowds for labor. Specifically, we examined in depth the emerging practice of micro-outsourcing (e.g., via Amazon Mechanical Turk), where we pay workers to perform micro-tasks (e.g., 'Is there a human in this picture?'). Unfortunately, the results are inherently noisy.
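To make the 'value for money' ranking idea concrete, here is a minimal sketch: it assumes each product carries per-feature sentiment scores mined from reviews, combines them with hypothetical importance weights (standing in for the economic weights the research estimates), and ranks by perceived quality per dollar. All names, weights, and the scoring formula are illustrative, not the project's actual model.

```python
# Hypothetical "value for money" ranker; the feature names, weights,
# and scoring formula are illustrative assumptions.
def value_for_money(products, feature_weights):
    """Rank products by weighted feature sentiment per dollar."""
    def score(p):
        # Weighted sum of per-feature sentiment (e.g., in -1..1),
        # divided by price: more perceived quality per dollar ranks higher.
        quality = sum(feature_weights.get(f, 0.0) * s
                      for f, s in p["feature_sentiment"].items())
        return quality / p["price"]
    return sorted(products, key=score, reverse=True)

cameras = [
    {"name": "A", "price": 300.0,
     "feature_sentiment": {"picture quality": 0.9, "shutter speed": -0.4}},
    {"name": "B", "price": 180.0,
     "feature_sentiment": {"picture quality": 0.6, "shutter speed": 0.2}},
]
weights = {"picture quality": 2.0, "shutter speed": 1.0}
ranked = value_for_money(cameras, weights)
# Camera B offers similar weighted quality at a lower price, so it ranks first.
```

In a real system the weights would come from the estimated economic impact of each feature on price, rather than being hand-set.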
We therefore developed techniques for making the best use of this vastly useful but noisy resource, especially when a machine learning model is trained to 'learn' from the workers' submissions. In particular, we developed algorithms that allocate labeling effort across items when the resulting labels will be used to train machine learning models.
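One simple way to sketch the idea of handling noisy crowd labels and allocating further labeling effort is repeated labeling with majority voting: aggregate each item's labels by majority vote, and send the next worker to the item whose current vote is least certain. This is only a minimal illustration under those assumptions; the function names and the margin-based allocation rule are ours, not the project's published algorithms.

```python
from collections import Counter

# Illustrative sketch: each item has been labeled by several workers.
def majority_label(votes):
    """Aggregate one item's noisy labels by majority vote."""
    (label, _), = Counter(votes).most_common(1)
    return label

def next_item_to_label(all_votes):
    """Allocate the next labeling task to the item whose current vote
    is least certain (smallest margin between the top two counts)."""
    def margin(votes):
        counts = [c for _, c in Counter(votes).most_common()]
        return counts[0] - (counts[1] if len(counts) > 1 else 0)
    return min(all_votes, key=lambda item: margin(all_votes[item]))

votes = {
    "img1": ["human", "human", "no-human"],  # margin 1: still uncertain
    "img2": ["human", "human", "human"],     # margin 3: confident
}
# img1 has the smaller margin, so it receives the next worker's label.
```

More refined schemes additionally estimate each worker's accuracy and weight votes accordingly, which matters when labels feed a machine learning model.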