This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).

The Virtual Research Data Center (VirtualRDC) at Cornell University has been a successful research support tool for users of many of the Census Bureau's large-scale confidential data products including, but not limited to, those that are accessible via the Census Research Data Center network. Over 200 computational users and 600 download users have benefited from the VirtualRDC resources. Their scientific publications cite the NSF grants that supported the development of the VirtualRDC. The proposed activity seeks to keep this support network flourishing. In addition, most social science researchers face substantial hurdles when they wish to harness the power of large-scale computational clusters, in particular when using new, very large synthetic data sets with their unprecedented detail on people, jobs, and firms. The proposed activity seeks to extend the VirtualRDC model to support tera-scale social science computing via the NSF-sponsored TeraGrid resources. The most widespread statistical software packages used by social scientists, i.e., SAS, Stata, and SPSS, are not available on the TeraGrid itself or on any of the servers at the borders of the TeraGrid with fast connections to it. When the problem is viewed through the lens of the typical data-driven research process (extract, edit, and transform data; transfer data to a computational location; and perform analysis), social science researchers are typically constrained in at least one of these steps when approaching the high-performance computing clusters on the TeraGrid. For most data preparation, and for much analysis, the lack of standard statistical analysis and data preparation software packages is a serious impediment. However, the typical social scientist's workstation or university-provided computational infrastructure does not have the resources to handle these very large data sets.
Furthermore, the social science workstation and the university-provided infrastructure do not have sufficiently fast data connectivity to transfer any large prepared data files to the TeraGrid for processing there. This project aims to remedy bottlenecks in the first and second steps, with a focused expansion of resources at a critical location, resulting in a highly useful gateway to the TeraGrid for the social sciences. The project builds a social science TeraGrid gateway that allows researchers (i) to perform the data preparation step using the software packages they are comfortable with, speeding up the data preparation phase, and (ii) to do so on servers that have a fast connection to the TeraGrid, thus greatly speeding up the data-transfer process. The third bottleneck, the absence of social statistics packages on the TeraGrid, is not addressed by this proposal, since it would require resources, in particular licensing resources, an order of magnitude larger than our proposed budget. This step is left to future proposals.

Broader impacts: Tera-scale social science data are underutilized. Initially, serious confidentiality issues prevented most researchers from accessing these data. Significant research effort on projects that solve most of these confidentiality issues, in combination with an expansion of the restricted-access model via Census Research Data Centers, has begun to address this underutilization. Now that an increasing number of previously confidential data sources are finding their way into the public domain, the quantity of social science public-use data is once again expanding dramatically. This project proposes a method of unlocking those recently released data sources to allow much broader access by the research community. Research strategies such as very large scale resampling and synthesis, which were previously proposed but not technically feasible, will be implemented. The expected explosion of use will lead to new results in a multitude of social sciences. The knowledge gained from running the Social Science TeraGrid Gateway will be leveraged and applied to future proposals that address the third identified bottleneck, the absence of familiar software for social scientists on large-scale computing resources. The PIs on this proposal are actively involved with other research teams that are moving forward with the development of such proposals. The long-term goal is for the tools assembled for the research community through this proposal to become the building blocks for bigger and more transparent mechanisms for granting social scientists easy access to large-scale computational facilities.

Agency: National Science Foundation (NSF)
Institute: Division of Social and Economic Sciences (SES)
Type: Standard Grant (Standard)
Application #: 0922005
Program Officer: Nancy A. Lutz
Project Start:
Project End:
Budget Start: 2009-07-01
Budget End: 2013-06-30
Support Year:
Fiscal Year: 2009
Total Cost: $393,523
Indirect Cost:

Name: Cornell University
Department:
Type:
DUNS #:
City: Ithaca
State: NY
Country: United States
Zip Code: 14850