Collecting representative Internet measurement data has remained a challenging and often elusive goal for the networking community. Obstacles include the Internet's scale and scope, technical challenges in capturing, filtering and sampling high data rates, difficulty obtaining measurements across a decentralized network with radically distributed ownership, cost of building and operating instrumentation, and political hurdles. Even (or especially) with all these obstacles, the demand for and importance of representative Internet data sets is increasing -- which is good news for rigorous scientific Internet research. The primary driver of this demand is the growing awareness that various critical and increasingly interdependent infrastructures are vulnerable to threats, and that a primary limiting factor in the escalating arms race is our still surprisingly primitive approach to sharing cyberinfrastructure data.
CAIDA has developed an Internet Measurement Data Catalog -- IMDC -- an index of information (metadata) about data sets and their availability under various usage policies. The catalog addresses a significant challenge in network science: reducing the cost of searching for data by organizing metadata about accessible Internet data sets into a single repository.
This SDCI project will support integration of lessons learned thus far from IMDC development, implementation, and usage to better support the cybersecurity research and cyberinfrastructure development communities. The project's three primary deployment goals are to: (1) reduce the burden on those contributing data via a streamlined interface and tools for easier indexing, annotation and navigation of relevant data; (2) convert from use of a proprietary database backend (Oracle) to a completely open source solution; and (3) expand the catalog's relevance to the cybersecurity and other research communities via workshops, new emphasis on security-relevant data sets, and creation of public web forums to discuss data-sharing issues.
Although the focus for this project is utility for the cybersecurity and cyberinfrastructure research community, our proposed design objectives and outreach plans explicitly target a range of science and engineering communities. In particular, we believe the proposed software development can support and promote NSF's newly announced Data Sharing Policy, which as of January 2011 requires all proposals to include a plan for how researchers intend to share their data with other researchers.
The intellectual merit of the proposed software development activities is a range of measurable benefits to cyberinfrastructure research: maximizing the re-use of existing Internet data; decreasing the time spent collecting redundant data; reducing the effort needed to start a new study; promoting validation and reproducibility of analyses and results; enabling longitudinal and cross-disciplinary studies of the Internet; and opening up new cross-domain areas of transformative networking research.
The broader impacts of this project are diverse. The success of the catalog and related workshops will facilitate wide dissemination of Internet measurement data to researchers and security experts across academic, commercial, and government sectors. By including education-oriented data collections in the catalog, this project creates an immediate link between research and education, and improves access to Internet research for underrepresented groups in computer science and engineering. Most importantly, the software created through this project will help other disciplines and sectors to develop their own catalog instances to support the type of data management plans now articulated as essential to NSF.
To keep up with pervasive cybersecurity threats to various critical and often interdependent infrastructures, researchers urgently need to collect and share more cyberinfrastructure data. Yet despite the increasing demand for and importance of Internet data sets, procuring representative Internet measurement data remains a challenging and often elusive goal for the networking community. Obstacles include the Internet’s scale and scope, technical challenges in capturing, filtering, sampling, and storing high rates and volumes of data, difficulty conducting measurements across a decentralized network with radically distributed ownership, cost of building and operating instrumentation, and political and legal hurdles. To maximize the re-use of existing Internet data, CAIDA previously developed an Internet Measurement Data Catalog (IMDC, or DatCat), an index of information (metadata) about data sets and their availability under various usage policies.
Over the course of this project we streamlined and improved the IMDC and expanded its underlying software capabilities. Most importantly, we reduced the overhead of metadata entry to encourage contributions from the researchers who collect and curate data and who must volunteer their time to index metadata. Reducing this burden is crucial to the success of the catalog. We refined the search capabilities and improved the output of search results. We also converted the database backend from proprietary software (Oracle) to a completely open source solution. To engage the community, we included a focus on DatCat at one of our workshops, presented DatCat at various relevant meetings, and created a public web forum for discussion of specific and broader data-sharing issues.
By the end of the project, we saw the beginning of organic use of the IMDC beyond our direct efforts to seed the catalog with our own data sets and those of close collaborators. As a sign of growing community acceptance, the 2015 Internet Measurement Conference Call for Papers announced an award for the paper that contributes a novel dataset to the community, with the requirement that the dataset be made publicly available through DatCat or CRAWDAD. We also explored the possibility of using the DatCat framework for the DHS-funded PREDICT project (www.predict.org/), and presented the updates to DatCat funded by this project to the PREDICT community.
Intellectual Merit. IMDC supports a wide range of measurable benefits to cyberinfrastructure research: it simplifies the process of searching for data by organizing metadata about accessible Internet data sets into a single repository; decreases the time and effort that might be spent collecting redundant data; lowers the barrier to starting a new study; promotes validation and reproducibility of analyses and results; enables longitudinal and cross-disciplinary studies of the Internet; and facilitates new cross-domain areas of transformative networking research.
Broader Impact. The catalog opens up a wealth of Internet data and statistics to anyone interested in bolstering their expertise in Internet science and technology, including groups underrepresented in computer science and engineering. By indexing education-oriented data sets in DatCat, we make current, relevant Internet data sets available for use in classrooms and thus efficiently link research and education. The success of our catalog and related extensive outreach efforts will facilitate wider dissemination of Internet measurement data to researchers and security experts across academic, commercial, and government sectors.