As a consequence of the 2009 White House 60-Day Review, it became clear that the academic community working on cyber-security is in dire need of real data. The PREDICT portal is an effort to catalog and house the increasingly available research data, and per NSF's request, academia's needs for data have been documented. On the other hand, agencies, companies, and organizations that do have data are seeking research innovations in the "arms race" against cyber-attacks. The "Cyber-security Data for Experimentation" workshop will bring together people from companies/organizations, academia, and government agencies to discuss (1) models of engagement that will allow the research community to conduct experiments with real-world data sets, (2) how to share research results, and (3) how funding agencies can facilitate the process. A guiding principle of the workshop is that the resulting models should be feasible, with built-in incentives for all parties involved and guarantees against violations of privacy or other regulations. The workshop will be organized around panel discussions of three topics: (1) Data availability and use conditions, (2) Research that can greatly benefit from available data to make much-needed progress in cyber-defense, and (3) Engagement models and related IP issues. This will be a one-day workshop in the DC metro area, held on August 27th, 2010. The workshop will be broadcast to the community at large and will be open to questions and suggestions from listeners.
The "60-Day Review" pointed out that academic community working on cybersecurity is in dire need of real data. The PREDICT portal is an effort to catalog and house the increasingly available research data, and per NSF’s request, the needs of the academia for data have been documented. On the other hand, agencies, companies, and organizations who do have data are seeking research in- novations in the "arms race" against cyber-attacks. To bridge the two sides, we are organizing a workshop on "Real-world Cybersecurity Data Research" that brought together people from companies/organizations, academia, and government agencies. The goal of the workshop is to discuss (1) models of engagement that allowed the research community to conduct experiments with real-world data sets, (2) how to share research results, and (3) how funding agencies can facilitate the process. This workshop is sponsored by NSF and is in collaboration with DHS, ONR, and the Treasury (we are currently seeking more collaborators). The workshop has attracted companies that are committed to participate and eager to share their data or to engage the research community in other ways concerning real-world data for cybersecurity research. A guiding principle of the workshop is that the resulting models should be feasible with the built-in incentives to all parties involved, and the guarantees against violations of privacy or other regulations. The workshop was organized around panel discussions of three topics: (1) Data availability and use conditions, (2) Research that can greatly benefit from available data to make much needed progresses in cyber-defense, and (3) Engagement models and related IP issues. This was a one-day workshop, in the DC metro area, held on August 27th, 2010. We asked attendees from research and industry to prepare respective material. We asked academics to write one paragraph describing a research project or idea that they are currently working on where real-world cybersecurity data could help them answer their questions better. We asked industry to list one question they would like answered, or one problem that they would like solved where the use of their data could be brought to bear in solving the problem. The workshop had a keynote speech on why data is so important for academics. The remainder of the day consisted of three panels: (1) a panel from industry, discussing the availability of various data, as well as how it can be analyzed; (2) a panel from academics describing the data that is needed; (3) a panel from various participants discussing the mechanics and policies of data sharing. Roger Dingledine shared the following insights from the workshop, which best summarized the outcomes and insights: 1. Researchers already have data, it’s just not the data they think they want. Either they need to clean / understand / better analyze what they already have, or they need to figure out where they can gather the data themselves (universities sure have lots of users), or they can turn to organizations like PREDICT (or Tor) that are aggregating data sets for the purpose of making them available for researchers. 2. Data preservation questions. When a student moves on, it’s common that nobody knows how to continue using what gets left behind. The industry side similarly finds internships too short to be a reliable investment: interns disappear right about the time they start to provide value. 3. Standardized data sets vs specialized data sets. 
On the one hand, we want standardized public open data sets (think traces for voice recognition), so everybody is doing research from the same starting point. 4. Existing databases need labelling – for example, if you’re looking at a traffic flow and you want to identify what protocols were in use, you need somebody to annotate it with the ground truth (what protocols actually were in use) or you’ll never know if your algorithms are producing useful answers. Metadata is critical, but it’s expensive to build and maintain. Corporations tend to regard the labelling work as their competitive advantage and keep it to themselves. 5. "Building individual relationships is the only way to get your data." That was a quote from several people on the academia side, and nobody on the industry side disputed it. If you have to spend a few years interacting with the person who controls the data before you have a chance of seeing it, that sure doesn’t seem to scale well. 6. "The goal of research is not to write papers. It’s to solve problems." That’s a quote from the industry side. They regard research that’s intellectually interesting as often unrelated to the real-world problems that their company needs to solve. The incentives in academia are structured around publication, and we shouldn’t confuse that with actually solving problems. 7. Internal Review Boards (IRBs) are uniformly not equipped to evaluate this sort of research.
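As an illustration of point 4, the following is a minimal sketch, in Python, of why ground-truth annotations matter: without a label recording which protocol was actually in use, there is nothing to score a classification algorithm against. The flow records, field names, and the naive port-based classifier are hypothetical and invented purely for illustration; they are not from the workshop or any real data set.

```python
# Minimal sketch (hypothetical data and field names): evaluating a protocol
# classifier is only possible when the traffic flows carry ground-truth labels.

# Each flow record carries an analyst-provided "protocol" annotation (the ground truth).
labeled_flows = [
    {"dst_port": 443, "avg_pkt_size": 900, "protocol": "HTTPS"},
    {"dst_port": 80,  "avg_pkt_size": 700, "protocol": "HTTP"},
    {"dst_port": 53,  "avg_pkt_size": 90,  "protocol": "DNS"},
    {"dst_port": 443, "avg_pkt_size": 120, "protocol": "DNS"},  # tunneled DNS-like case
]

def guess_protocol(flow):
    """A deliberately naive port-based classifier, standing in for a real algorithm."""
    port_map = {443: "HTTPS", 80: "HTTP", 53: "DNS"}
    return port_map.get(flow["dst_port"], "UNKNOWN")

# Without the "protocol" annotations there would be nothing to compare against,
# and the classifier's usefulness could not be measured at all.
correct = sum(guess_protocol(f) == f["protocol"] for f in labeled_flows)
accuracy = correct / len(labeled_flows)
print(f"accuracy against ground truth: {accuracy:.2f}")  # 0.75 on this toy data
```

The point is not the toy classifier but the final comparison: accuracy can only be computed because each flow carries an analyst-provided label; on unlabelled data the same code would have nothing to check against.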