Internet fraud costs consumers and businesses billions of dollars each year. Through creative combinations of spam and social engineering, attackers regularly lure end users into visiting phishing sites, malware-hosting sites, and scam sites. One popular defense mechanism against Web-based attacks is blacklisting, but today's blacklists suffer from three fundamental deficiencies. First, most of them employ a combination of Web crawling and human intervention to infer malicious sites. This adds an inherent delay in adding entries and causes many malicious sites to be missed. Second, blacklists are mostly based on exact URL strings, and hence are unable to adapt to the simple URL changes that attackers use today to evade detection. Third, as blacklist entries grow, matching them against URLs in real time could create performance bottlenecks. To overcome these deficiencies, this project is developing novel mechanisms to aid in the construction, maintenance, and matching of blacklists in real time. Specifically, it is developing a scalable architecture that can discover new malicious websites by passively observing network traffic for the onset of techniques commonly exploited by miscreants, such as redirects and fast flux. The architecture also leverages common attacker tendencies to find novel, automated ways of discovering new malicious URLs from existing blacklisted URLs. The final thrust of the project is developing high-speed approximate matching algorithms for effective in-network blacklisting, matching URLs embedded in packets against potentially millions of blacklist entries. If successful, this project will make the Web safer for millions of Internet users.
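To make the second and third deficiencies concrete, the sketch below shows how even a light canonicalization step before lookup can blunt trivial URL mutations that defeat exact-string blacklists. The normalization rules and names here are illustrative assumptions, not the approximate-matching algorithms this project is developing.

```python
# Illustrative sketch only: canonicalize URLs before lookup so that trivial
# mutations (case changes, the default port, a trailing slash) no longer
# defeat an exact-match blacklist. Rules below are assumptions for exposition.
from urllib.parse import urlsplit

def canonicalize(url: str) -> str:
    # Lowercase the host, drop the default port, and normalize the path so
    # that trivially mutated URLs map to the same lookup key.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = (parts.hostname or "").lower().rstrip(".")
    path = parts.path.rstrip("/") or "/"
    return host + path

def is_blacklisted(url: str, blacklist) -> bool:
    return canonicalize(url) in blacklist

blacklist = {canonicalize(u) for u in ["http://rogue.example.com/login/"]}
print(is_blacklisted("HTTP://ROGUE.EXAMPLE.COM:80/login", blacklist))  # True
```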

Project Report

Phishing attacks are extremely common today and are increasing by the day. One popular way to address this problem is to add security features to the Internet browser that warn users when they access phishing sites, using a technique known as blacklisting. Blacklisting matches a given URL against a blacklist, a list of URLs known to host malicious content. A major problem with blacklists is incompleteness. Modern cyber-criminals are quite savvy; they employ many simple techniques to evade blacklists. For example, attackers do not use a URL for more than a certain period of time. They also use many variants of the same URL, so that it takes longer for blacklists to catch up with all the new variants. Despite these apparent weaknesses, the inherent simplicity of the blacklisting approach makes it easy for browsers and many other applications to adopt it and provide some level of protection to users. The key focus of this project is improving the efficacy of blacklisting, as it is a key practical defense against phishing attacks. The most significant outcome of this project is the design of a system called PhishNet that substantially increases the resilience and efficiency of blacklists. PhishNet is based on two key observations. First, a simple examination of common blacklists suggested that malicious URLs often occur in groups that are close to each other either syntactically (e.g., www1.rogue.com, www2.rogue.com) or semantically (e.g., two URLs whose hostnames resolve to the same IP address). PhishNet exploits this observation to systematically discover new malicious URLs in and around the original blacklist entries and add them to the blacklist, which significantly increases its resilience to evasion (a simplified illustration of this idea appears below). Second, it pioneered a new approach that departs from the current practice of exact-match blacklist lookup toward an approximate match that is aware of several of the legal mutations that often exist within these URLs. Many of the key ideas in the design of PhishNet were communicated to leading researchers through talks at premier international conferences (e.g., INFOCOM), as well as to industry practitioners (e.g., Google).

Another key outcome of this research is an extensive study of phishing attacks observed at the edge router of a campus network. This work was motivated by the fact that few existing studies focus on understanding the temporal characteristics of phishing or malware accesses in an edge network, such as a campus or enterprise network comprising a few tens of thousands of users. For example, no existing study clearly indicates what fraction of URL accesses in a given campus or edge network is to phishing or malware-hosting sites (together referred to as malicious sites). Answers to such questions can play a significant role in informing future defenses against phishing and malware attacks. We have conducted a first-of-its-kind study involving data collected from the edge router of a campus network that comprises more than 50,000 users. A key requirement for our study is the ability to identify whether a given access is to a malicious site, for which we leveraged existing blacklisting tools such as the Google Safe Browsing (GSB) back-end server.
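As a rough illustration of the first PhishNet observation above, the sketch below generates syntactic neighbors of a blacklisted URL (varying a trailing digit in the hostname) and includes a semantic check based on shared IP addresses. The mutation rule, helper names, and verification step are simplifying assumptions for exposition, not PhishNet's actual heuristics, which are described in the project's publications.

```python
# A minimal sketch, under assumed mutation rules: generate candidate URLs
# "around" a blacklisted entry and optionally group candidates by resolved IP.
import re
import socket
from urllib.parse import urlsplit

def syntactic_neighbors(url: str, width: int = 3):
    """Yield URLs whose hostname differs only in a trailing digit (www1 -> www2)."""
    parts = urlsplit(url)
    match = re.match(r"^([a-z]+)(\d+)(\..+)$", parts.hostname or "")
    if not match:
        return
    prefix, digit, rest = match.groups()
    for i in range(1, width + 1):
        candidate_host = f"{prefix}{int(digit) + i}{rest}"
        yield parts._replace(netloc=candidate_host).geturl()

def same_ip(url_a: str, url_b: str) -> bool:
    """Semantic check: do the two hostnames resolve to the same IP address?"""
    try:
        a = socket.gethostbyname(urlsplit(url_a).hostname or "")
        b = socket.gethostbyname(urlsplit(url_b).hostname or "")
        return a == b
    except OSError:
        return False

seed = "http://www1.rogue.example.com/login"
for candidate in syntactic_neighbors(seed):
    print(candidate)  # e.g., http://www2.rogue.example.com/login
```

In practice, candidates generated this way would still need to be verified (for example, by fetching the page or consulting other signals) before being added to a blacklist.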
We obtained a specialized high-speed capture device (an Endace 10 Gbps monitoring card) to collect packets traversing the gateway router to the outside world, from which we filtered HTTP traffic (port 80) and extracted the URLs from HTTP requests. Our system, called PhishLive, monitors the HTTP traffic going through the campus gateway and, in real time, captures malicious URLs flagged by the Google Safe Browsing (GSB) database in HTTP requests and redirect responses. It analyzes the statistical characteristics of the dataset offline, including the distribution of attacks over time, the geolocation of attacking IP addresses, the clustering of attacking hostnames, and malicious redirect chains. Using PhishLive, we analyzed over 1 billion URLs and made several important observations. For example, the percentage of malicious URLs is significantly higher at night than during the day. Also, most domain names associated with these URLs are present for no more than a day. Many more observations from this measurement study have been shared with the research community through publications and talks at international venues.
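The following sketch illustrates, in simplified form, the URL-extraction and lookup step of a PhishLive-style monitor: reconstruct the URL from an outbound HTTP request's request line and Host header, then check it against a local set of known-malicious URLs. The actual system captured traffic with an Endace card and queried the Google Safe Browsing back end; the parsing and the in-memory blacklist here are assumptions made for illustration.

```python
# Simplified sketch: given the payload of an outbound HTTP request (port 80),
# rebuild the requested URL and look it up in a local blacklist. The real
# PhishLive pipeline used hardware capture and the GSB back end instead.

def url_from_http_request(payload: str):
    lines = payload.split("\r\n")
    try:
        method, path, _version = lines[0].split(" ", 2)
    except ValueError:
        return None  # not a well-formed request line
    host = next((l.split(":", 1)[1].strip() for l in lines[1:]
                 if l.lower().startswith("host:")), None)
    return f"http://{host}{path}" if host else None

def check_request(payload: str, blacklist) -> bool:
    url = url_from_http_request(payload)
    return url is not None and url in blacklist

request = "GET /login HTTP/1.1\r\nHost: rogue.example.com\r\n\r\n"
print(check_request(request, {"http://rogue.example.com/login"}))  # True
```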

Agency
National Science Foundation (NSF)
Institute
Division of Computer and Network Systems (CNS)
Type
Standard Grant (Standard)
Application #
1017915
Program Officer
Sylvia J. Spengler
Project Start
Project End
Budget Start
2010-08-01
Budget End
2013-07-31
Support Year
Fiscal Year
2010
Total Cost
$249,907
Indirect Cost
Name
Purdue University
Department
Type
DUNS #
City
West Lafayette
State
IN
Country
United States
Zip Code
47907