Monitoring web traffic is critical to several players in the Internet ecosystem, including Internet Service Providers, regulators, network administrators, and researchers. However, because of the increasing use of traffic encryption and growing privacy concerns, such monitoring may soon need to be conducted using only limited information (anonymized TCP/IP headers). This project considers a learning-based classification approach to web traffic monitoring that can work with such limited information and still identify the type of web page being downloaded. Through a proof of concept, the project argues that this approach could have significant impact in several application domains, including profiling of user content preferences, profiling of user navigation behavior, and identifying the use of video streaming and mobile devices.

The project also identifies several fundamental issues that could prevent the approach from delivering this impact in practice. The proposed research will evaluate and address these risks by: (i) designing robust statistical techniques for web page boundary detection; (ii) conducting an extensive study of web traffic and page diversity across client platforms, client locations, and time; (iii) identifying stable and robust traffic features and using them to study classification performance for several contemporary labeling schemes; and (iv) incorporating and evaluating the proposed approach within several real-world application domains.
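To make item (i) concrete, the sketch below shows one simple baseline for web page boundary detection: splitting a client's packet timestamps wherever the idle gap exceeds a threshold. This is an illustrative assumption, not the statistical technique the project proposes. The function name `split_on_idle_gaps`, the 1-second threshold, and the sample timestamps are all hypothetical.

```python
from typing import List, Sequence


def split_on_idle_gaps(
    timestamps: Sequence[float], idle_threshold: float = 1.0
) -> List[List[float]]:
    """Group sorted packet timestamps (seconds) into bursts.

    A new burst starts whenever the gap since the previous packet
    exceeds idle_threshold. Each burst is a candidate page download.
    The threshold value is an assumption chosen for illustration.
    """
    bursts: List[List[float]] = []
    for ts in timestamps:
        if bursts and ts - bursts[-1][-1] <= idle_threshold:
            # Gap is small: packet belongs to the current burst.
            bursts[-1].append(ts)
        else:
            # First packet, or an idle gap: start a new burst.
            bursts.append([ts])
    return bursts


# Hypothetical packet arrival times for one anonymized client.
packet_times = [0.00, 0.05, 0.30, 0.90, 4.10, 4.15, 4.60, 9.75]
pages = split_on_idle_gaps(packet_times, idle_threshold=1.0)
for index, burst in enumerate(pages):
    print(f"candidate page {index}: {len(burst)} packets, starts at {burst[0]:.2f}s")
```

A fixed threshold like this is easy to compute from anonymized headers alone, but it would likely merge pages loaded in quick succession and split pages that fetch content lazily, which is one reason a more robust statistical approach may be needed.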

Broader Impacts: This project is expected to have a transformative impact on several domains. The first impact is on the current debate over privacy-violating monitoring techniques. While most arguments are either for or against allowing deep packet monitoring, this project shifts the focus to a different paradigm: balancing monitoring and privacy goals at the same time. The proposed classification approach offers network managers, regulators, ISPs, and researchers an alternative that lets them understand client preferences and application prevalence without relying on slow and privacy-threatening techniques. Second, the project will train undergraduate and graduate students in experimentation, measurement, and scientific analysis of big data, skills that are invaluable to the many federal, commercial, and academic institutions that mine large datasets for information. Third, through the involvement of minorities, the project will help broaden the diversity of the computer science workforce. Finally, through outreach demos for middle- and high-school students on web browsing, a topic familiar to most of them, the project will help increase community engagement with science and technology.

Agency: National Science Foundation (NSF)
Institute: Division of Computer and Network Systems (CNS)
Type: Standard Grant (Standard)
Application #: 1526268
Program Officer: Darleen Fisher
Budget Start: 2015-10-01
Budget End: 2020-09-30
Fiscal Year: 2015
Total Cost: $508,000
Name: University of North Carolina Chapel Hill
City: Chapel Hill
State: NC
Country: United States
Zip Code: 27599