Hackers often target the information systems that underpin critical services in domains ranging from finance to healthcare. The estimated cost of defending against and responding to hacking incidents currently runs at hundreds of billions of dollars annually. To reduce these costs, many organizations have aimed to develop timely, relevant, actionable, and shareable Cyber Threat Intelligence (CTI) about security and privacy threats to support cybersecurity decision-making. However, existing methods tend to react to known threats rather than proactively detect emerging ones. One promising approach to proactive exploit detection is mining large, international, and rapidly evolving online hacker community platforms to detect emerging threats and key actors. To this end, this project aims to develop advanced, proactive CTI capabilities by (1) collecting large, dynamic datasets of hacker forum posts and (2) analyzing them with a novel graph-based method for modeling text content to extract emerging threats, particularly malware.
To achieve these goals, the project will develop a novel CTI framework that collects data from hacker forums containing millions of records and identifies emerging threats in that data. To collect large-scale, dynamic datasets, the team will develop advanced obfuscated crawling mechanisms that bypass countermeasures against automated collection while requiring minimal human involvement. The collected data will be segmented into time spells and analyzed by a novel computational algorithm, the Diachronic Graph Convolutional Autoencoder (D-GCAE). D-GCAE is rooted in methods drawn from the diachronic linguistics, network science, text mining, and deep learning communities. In this project, D-GCAE will extract graph embeddings at each time spell, align the embeddings across spells, and analyze semantic shifts in hacker terminology to identify potential emerging threats. These tools will be evaluated both against state-of-the-art benchmarks proposed in computer science and related domains and through analysis of their outputs by leading CTI sharing organizations. The datasets and tools will also be disseminated for use by cybersecurity researchers, practitioners, and educators.
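As a rough illustration of the align-then-compare step described above, the sketch below rotates one time spell's term embeddings onto the next and scores each term by cosine distance. It is not D-GCAE: the abstract does not say how embeddings are aligned, so orthogonal Procrustes (a common choice in diachronic embedding work) is assumed here, random vectors stand in for learned graph embeddings, and the names `procrustes_rotation`, `semantic_shift`, `spell_a`, and `spell_b` are hypothetical.

```python
import numpy as np


def procrustes_rotation(source, target):
    """Return the orthogonal matrix R minimizing ||source @ R - target||_F."""
    u, _, vt = np.linalg.svd(source.T @ target)
    return u @ vt


def semantic_shift(source, target):
    """Cosine distance per term after rotating `source` onto `target`."""
    aligned = source @ procrustes_rotation(source, target)
    cosine = np.sum(aligned * target, axis=1) / (
        np.linalg.norm(aligned, axis=1) * np.linalg.norm(target, axis=1)
    )
    return 1.0 - cosine


# Synthetic stand-in for two time spells (assumption, not project data):
# 200 shared terms embedded in 8 dimensions, one row per term.
rng = np.random.default_rng(0)
spell_a = rng.normal(size=(200, 8))
hidden_rotation, _ = np.linalg.qr(rng.normal(size=(8, 8)))
spell_b = spell_a @ hidden_rotation  # same geometry expressed in a new basis
spell_b[17] = -spell_b[17]           # term 17 moves to the opposite position

shifts = semantic_shift(spell_a, spell_b)
top_term = int(np.argmax(shifts))    # index of the term that moved the most
```

In this toy setup, term 17 should receive the largest shift score while the remaining terms score near zero, because alignment removes the arbitrary rotation between the two embedding spaces before the comparison is made. In practice, terms would be ranked by shift score and the top candidates reviewed by an analyst.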
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.