This award will enhance national security, prosperity and health by studying ways to automatically identify illicit commercial enterprises that operate primarily via online advertising. While many legitimate enterprises use online platforms, such as advertisement services, job recruitment ads, and review boards, illicit businesses also make use of these services, and it may be difficult to distinguish the two. Illicit businesses using these platforms are often associated with human trafficking activity. This project develops methods to analyze large amounts of online data from multiple sources to create an interpretable risk score that facilitates detection of illicit businesses. In partnership with the Global Emancipation Network, a data analytics non-profit dedicated to countering human trafficking, the project will fuse data from business-specific operations with data from publicly available licensing documents and court records to better detect suspicious activity and guide resource-constrained interdiction efforts. The results will modernize anti-trafficking efforts to keep pace with the complex strategies used by traffickers. The award will provide support to educate graduate students to meet the emerging needs of illicit support network research to inform policy.

Using a large existing database of scraped data from the deep and open web, this research will build risk scores for automatically detecting illicit businesses. Risk scores are linear classification models that only require users to add, subtract and multiply a few small numbers in order to make a prediction. As such, these models are easy to apply and understand. Information in ads from illicit businesses has distinguishing features, such as data obfuscation, non-random misspellings, high occurrences of out-of-vocabulary and unusual words, and frequent use of Unicode characters, all of which make natural language processing difficult. The risk score learning problem is formulated as a nonlinear mixed-integer optimization problem. The analytical framework leverages and extends state-of-the-art techniques from optimization and statistical learning and will produce a scalable branch-and-cut procedure to solve the learning problem over large training sets. It will employ semi-supervised learning methods to use both labeled and unlabeled data to generate better risk scores. The performance evaluation of the risk scores will be informed by real data from legitimate and illicit massage businesses. The research results will be generalizable to different data platforms, and the methods developed in this work are expected to be translatable to detection of human trafficking in other sectors.
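To make the idea of a risk score concrete, the following is a minimal sketch of how such a model could be applied once it has been learned. Everything specific in it is an assumption made for illustration and is not taken from the project: the feature names, the integer point values, the intercept, the 0.5 out-of-vocabulary threshold, and the logistic mapping from score to probability are all hypothetical. The sketch covers only the scoring step, not the mixed-integer learning problem or the branch-and-cut procedure.

```python
import math

# Hypothetical integer points; a real risk score would be learned from data.
POINTS = {
    "has_non_ascii": 2,        # ad text contains non-ASCII (Unicode) characters
    "high_oov_rate": 3,        # more than half of the tokens are out of vocabulary
    "has_license_record": -2,  # a matching public licensing record was found
}
INTERCEPT = -3  # hypothetical offset


def extract_features(ad_text, vocabulary, has_license_record):
    """Turn one ad into binary features (illustrative definitions only)."""
    tokens = [t.strip(".,!?").lower() for t in ad_text.split()]
    tokens = [t for t in tokens if t]
    oov = sum(1 for t in tokens if t not in vocabulary)
    oov_rate = oov / len(tokens) if tokens else 0.0
    return {
        "has_non_ascii": int(any(ord(ch) > 127 for ch in ad_text)),
        "high_oov_rate": int(oov_rate > 0.5),
        "has_license_record": int(has_license_record),
    }


def risk_score(features):
    """Add up the points of the active features; small-integer arithmetic only."""
    return INTERCEPT + sum(POINTS[name] * value for name, value in features.items())


def predicted_risk(score):
    """Map a score to a probability with a logistic link (an assumed choice)."""
    return 1.0 / (1.0 + math.exp(-score))
```

Under these assumed points, an ad with obfuscated spelling, a Unicode symbol, and no licensing record scores -3 + 2 + 3 = 2, while a plainly worded ad with a licensing record scores -3 - 2 = -5. Because the score is a short sum of small integers, an analyst could reproduce it by hand, which is the interpretability property the abstract emphasizes.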

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2019-09-01
Budget End: 2022-08-31
Fiscal Year: 2019
Total Cost: $514,726
Name: North Carolina State University Raleigh
City: Raleigh
State: NC
Country: United States
Zip Code: 27695