The project will develop new statistical models for network growth and change and apply them to study the evolution of the Wikipedia. The research builds on latent factor models for social networks and on recent advances in variable selection and cluster analysis in high dimensions. Using the text of Wikipedia entries and the current connectivity structure among them, the research will estimate where new entries will appear and characterize the local graph structures in different regions of the hyperlinked data set. Although the models are tuned to the Wikipedia, the methodology has general relevance to the study of complex networks.
The Wikipedia is a unique mirror of human knowledge. It has grown quickly, and this growth continues. From the standpoint of understanding how humans organize information, it is important to identify the "holes" in the Wikipedia, where new entries will arise. Similarly, one wants to know whether information on, say, Henry VIII is organized in the same way as information on Homotopy Theory. Both kinds of questions can be analyzed statistically, using publicly available version control data that has been archived to help detect Wikipedia vandalism. The research has a direct impact on the study of the structure of human knowledge and an indirect impact on the study of change in complex networks.