This project investigates the nature of crowd-based human analytics at various scales, specifically how the concentrated efforts of a few contributors differ from the summed micro-contributions of many. Automated approaches handle enormous volumes of data well, but they lack the flexibility and sensitivity of human perception when making decisions or observations, particularly for challenges that center on visual analytics. Networks of humans, as an alternative, can scale up human perception by distributing micro-tasks across massively parallel contributors, but interpretation of data varies from one individual to the next. Wide variability in how much individuals participate in crowd-based computation produces a non-uniform representation of the crowd, a discrepancy significant enough to call into question how well the term "crowd" in crowdsourcing reflects the data's actual origins. The research will examine data generated at the extreme ends of the participation curve and quantify the quality of data produced by a broad sampling of the crowd versus the concentrated voice of a few "super users."

As one measure of comparison, the researchers will observe how characteristically different samplings of human-generated analysis alter the outcome when used as training data in a machine learning framework. The investigation will draw on data from a crowdsourcing effort in which more than 10,000 volunteer participants generated over 2 million annotations on ultra-high-resolution satellite imagery in search of tombs across Mongolia. Image tiles were distributed at random to participants, who tagged anomalies of interest; crowd consensus on points of interest then supplied a field survey team with locations to ground-truth in Mongolia. Participation ranged widely: roughly 20 percent of the data came from the most active 1 percent of participants, while at the other extreme another 20 percent came from the 80 percent of participants who were least active. Although crowd consensus provided one metric for the quality of anomaly identifications, ground-truth observations showed that confirmed finds tended to correspond with identifications made by the more highly engaged participants. This study will explore the nature of data generated by experts versus crowds of non-experts, starting from these discrepancies in participation levels.
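
As a rough illustration of the kind of stratified comparison described above, the sketch below groups annotations by contributor activity and derives separate consensus sets from the most active contributors and from the long tail. It is a minimal sketch only: the field names (contributor_id, tile_id), the 1 percent activity cutoff, and the vote threshold are assumptions for illustration, not part of the project's actual pipeline.

    from collections import Counter, defaultdict

    def split_by_activity(annotations, top_fraction=0.01):
        # Split annotations into those from the most active contributors
        # (top `top_fraction` by annotation count) and the remaining long tail.
        # Each annotation is assumed to be a dict with 'contributor_id' and 'tile_id'.
        counts = Counter(a['contributor_id'] for a in annotations)
        ranked = [cid for cid, _ in counts.most_common()]
        n_top = max(1, int(len(ranked) * top_fraction))
        top_ids = set(ranked[:n_top])
        top = [a for a in annotations if a['contributor_id'] in top_ids]
        tail = [a for a in annotations if a['contributor_id'] not in top_ids]
        return top, tail

    def consensus_tiles(annotations, min_votes=3):
        # Return tile ids tagged by at least `min_votes` distinct contributors.
        voters = defaultdict(set)
        for a in annotations:
            voters[a['tile_id']].add(a['contributor_id'])
        return {tile for tile, v in voters.items() if len(v) >= min_votes}

Under these assumptions, the two consensus sets could serve as alternative label sources for the machine learning comparison the project describes, with ground-truth field observations used to score each.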

Crowd-based human analytics has been welcomed as a potential solution to some of the world's largest data challenges. Examples of crowdsourcing have shown that the power of distributed micro-tasking can take on challenges as overwhelming as categorizing galaxies or as complicated as folding proteins. However, this approach depends on recruiting human help, often at whatever level of participation an individual is willing to contribute. The variation in contributions, and thus in impact, between individuals can be staggering, with participation typically distributed along a long-tail curve. That fundamental aspect of a recruited crowd should be recognized and understood when extracting knowledge from the data it generates. This project will contribute to that understanding by determining how the distributed inputs of a crowd differ from the concentrated efforts of an individual. Insight into how crowd dynamics affect results will inform how participation is pooled and retained and could thereby have a transformative impact on the development of crowdsourcing as a concept for analytics.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Application #: 1219138
Program Officer: William Bainbridge
Budget Start: 2012-08-15
Budget End: 2016-07-31
Fiscal Year: 2012
Total Cost: $497,250
Name: University of California San Diego
City: La Jolla
State: CA
Country: United States
Zip Code: 92093