This Small Business Innovation Research (SBIR) Phase II project will develop software to automatically detect a broad spectrum of websites that are fraudulent or otherwise harmful to consumers. Much work has been done on software capable of detecting websites that host malware or engage in phishing. However, software does not yet exist that can detect a broader array of harmful websites, including those selling counterfeits, selling illegal drugs, and hosting weight-loss scams, to name just a few. The challenge lies in selecting the features of fraudulent sites that, in isolation or in combination, are good indicators of a site's harmfulness. Using these features, a machine learning classifier can be trained on data from known harmful websites. Unknown websites can then be run through the classifier to evaluate their potential for harm. Additional challenges involve gathering sufficient data to properly train the classifier, making the classifier general enough to detect a range of harmful sites while still maintaining accuracy, and updating the classifier in real time so that it can improve with ongoing human feedback and additional data.
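The train-then-score workflow described above, including incremental updates from human feedback, can be illustrated with a minimal sketch. It assumes a plain logistic regression trained one example at a time by stochastic gradient descent; the abstract does not name the project's actual model, and the three feature names and the toy data below are hypothetical, chosen only for illustration.

```python
import math

# Hypothetical binary features (assumptions for illustration only;
# the project's real feature set is not disclosed in the abstract).
FEATURES = ["domain_registered_recently", "price_far_below_market", "no_contact_address"]


class OnlineLogisticClassifier:
    """Logistic regression updated one labeled example at a time."""

    def __init__(self, n_features, learning_rate=0.5):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict_proba(self, x):
        """Return the estimated probability that a site is harmful."""
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, x, label):
        """Take one gradient step; label is 1 for harmful, 0 for benign."""
        error = self.predict_proba(x) - label
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error


# Toy labeled data standing in for "known harmful websites" and benign ones.
labeled_sites = [
    ([1, 1, 1], 1),
    ([1, 1, 0], 1),
    ([0, 0, 0], 0),
    ([0, 0, 1], 0),
]

clf = OnlineLogisticClassifier(len(FEATURES))
for _ in range(200):  # several passes over the small training set
    for x, y in labeled_sites:
        clf.update(x, y)

# Score an unknown site to evaluate its potential for harm.
unknown_site = [1, 1, 0]
score = clf.predict_proba(unknown_site)

# Later, a human reviewer confirms a new harmful site; the same update
# call folds that feedback into the model without retraining from scratch.
feedback_site = [0, 1, 1]
before_feedback = clf.predict_proba(feedback_site)
clf.update(feedback_site, 1)
after_feedback = clf.predict_proba(feedback_site)
```

Because each update touches only one example, the same call serves both initial training and ongoing feedback, which is one simple way to meet the real-time update requirement; a production system would likely need regularization and far more data.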

The principal impact of this project is the protection of consumers from online fraud. Today, consumers lack reliable resources to evaluate unfamiliar websites. Most use familiar sites like Amazon or take a gamble on Google search results. These gambles frequently result in fraud. It is believed that there are now over 250 million websites and that $100 billion is lost yearly to online fraud. While the statistics cover many types of fraud, examples of risky sites include online counterfeiters, pharmacies, and retailers. The software developed in this project will greatly improve transparency around websites and protect millions from fraud. The technical achievements in this project involve the use of a vector space model to convert non-discrete features of fraudulent sites into useful data that can be fed into a machine learning classifier. Additionally, this technology will include innovative feature choices, access to high-quality data, and the creation of a general classifier capable of improving itself in real time and detecting a broad array of heretofore undetectable fraudulent sites.
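As an illustration of how a vector space model can turn a non-discrete feature such as page text into classifier input, the sketch below maps text to term-count vectors and compares them by cosine similarity. The abstract does not specify which features are vectorized or how they are weighted; the sample texts and function names here are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_vocabulary(documents):
    """Map every distinct term in the corpus to a vector index."""
    terms = sorted({tok for doc in documents for tok in tokenize(doc)})
    return {term: i for i, term in enumerate(terms)}


def vectorize(text, vocabulary):
    """Return a term-count vector ordered by the vocabulary's indices."""
    counts = Counter(tokenize(text))
    return [counts.get(term, 0) for term in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical page snippets, invented for this example.
site_a = "cheap replica designer handbags free shipping"
site_b = "replica designer watches cheap free shipping"
site_c = "local bakery fresh bread baked daily"

vocabulary = build_vocabulary([site_a, site_b, site_c])
vec_a = vectorize(site_a, vocabulary)
vec_b = vectorize(site_b, vocabulary)
vec_c = vectorize(site_c, vocabulary)

sim_ab = cosine_similarity(vec_a, vec_b)  # two counterfeit-style pages
sim_ac = cosine_similarity(vec_a, vec_c)  # counterfeit-style vs. unrelated page
```

The resulting fixed-length vectors can be passed directly to a classifier as numeric features; weighting schemes such as TF-IDF are a common refinement but are not assumed to be what the project uses.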

Project Report

For its Phase II SBIR, SiteJabber developed software to automate the detection of websites selling fraudulent goods and services to consumers. Initial types of sites covered by the software include those selling counterfeit goods and suspect pharmaceuticals. Americans generally lack a reliable resource to evaluate unfamiliar websites and online businesses. Most use a familiar site like Amazon or take a gamble on whatever comes up in a Google search. These gambles too frequently result in fraud. There are now over 250 million websites, and the Washington Post has estimated that $100 billion is lost every year to online fraud. While consumer reviews of online businesses and websites (such as those that appear on www.sitejabber.com) can be a very helpful guide for consumers, new fraudulent sites appear constantly, often before consumers can report them. For this reason, SiteJabber developed machine learning software capable of predicting whether websites are engaged in certain fraudulent activities. While previous technologies developed by academics and security companies are capable of detecting websites that host malware or engage in phishing, websites selling fraudulent products such as counterfeits and suspect pharmaceuticals have generally evaded automated detection. To create its software, SiteJabber analyzed public and proprietary data on tens of thousands of websites and utilized a vector space model and machine learning technology. Challenges included the vectorization of certain valuable, non-discrete datasets and training the software to avoid false positives that would improperly implicate otherwise honest online businesses. The result is an important, novel mechanism for consumers to stay safe online and for companies to prevent their intellectual property from being stolen. Using the software developed, SiteJabber has so far uncovered over 1,500 websites selling counterfeit goods and suspect pharmaceuticals to unsuspecting consumers.
SiteJabber expects the software to eventually uncover tens or even hundreds of thousands of fraudulent online businesses and to expand beyond counterfeits to other classes of fraudulent sites. SiteJabber plans to publish this information publicly on SiteJabber.com, a free service for consumers to check the reputations of online businesses and websites. As a result, consumers should be made better aware of these often dangerous websites, and U.S. companies should be better protected from counterfeiters and other intellectual property thieves.
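The report's concern with false positives can be made concrete with a small sketch of one common safeguard: choosing the decision threshold so that the false positive rate on a held-out labeled set stays under a target. This is an assumption about how such a constraint might be enforced, not a description of SiteJabber's method; the scores, labels, and function names below are hypothetical.

```python
def false_positive_rate(scores, labels, threshold):
    """Fraction of benign sites (label 0) flagged at the given threshold."""
    benign_scores = [s for s, y in zip(scores, labels) if y == 0]
    if not benign_scores:
        return 0.0
    flagged = sum(1 for s in benign_scores if s >= threshold)
    return flagged / len(benign_scores)


def pick_threshold(scores, labels, max_fpr):
    """Return the lowest observed score usable as a threshold within max_fpr.

    The false positive rate never increases as the threshold rises, so the
    first candidate (in ascending order) that satisfies the limit is the
    lowest one. If no candidate qualifies, return infinity (flag nothing).
    """
    for candidate in sorted(set(scores)):
        if false_positive_rate(scores, labels, candidate) <= max_fpr:
            return candidate
    return float("inf")


# Hypothetical classifier scores on a held-out set; 1 = fraudulent, 0 = honest.
scores = [0.95, 0.90, 0.70, 0.60, 0.40, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 0]

strict_threshold = pick_threshold(scores, labels, max_fpr=0.0)
lenient_threshold = pick_threshold(scores, labels, max_fpr=0.2)
```

A stricter limit raises the threshold and protects honest businesses at the cost of missing some fraudulent sites, which is the trade-off the report alludes to.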

Project Start
Project End
Budget Start
2011-09-01
Budget End
2014-02-28
Support Year
Fiscal Year
2011
Total Cost
$600,000
Indirect Cost
Name
Ggl Projects, Inc.
Department
Type
DUNS #
City
San Francisco
State
CA
Country
United States
Zip Code
94110