Large amounts of epidemiological data are being generated and collected from a variety of sources to understand the impact and propagation of COVID-19. Similarly, large numbers of news articles about the pandemic are generated and disseminated to keep the population informed. The appropriateness of the actions taken by individuals, corporations, and governments often depends on the quality of data and news. Thus, ensuring the quality of data and news is important. However, malicious actors can alter the attributes of data records, insert spurious records, or suppress records, causing any analysis to be inadequate and misinformation to be propagated. This project addresses the critical problem of defining and identifying spurious data and news concerning COVID-19, and tracking the source of misinformation. The project’s novelty lies in the development of an approach and associated toolset that adapts and combines Machine Learning technologies to detect spurious data and misinformation and presents the results in a manner that is easy for end users to understand and interpret. The approach detects discrepancies in COVID-19 data and traces the flagged discrepancies back to the data sources. The results obtained from the news sources and those obtained from the medical data analysis are compared to determine correlations between the quality of news and the degree and type of data manipulation performed in a given region. The project significantly enhances the ability to perform accurate scientific analysis and to detect and explain news manipulation with respect to COVID-19. The scientific principles developed in the project are expected to be useful outside the medical domain. The PI and the students identified for this project are minorities. The project will be carried out in the Computer Science Department at Colorado State University, which is a BRAID affiliate.

COVID-19 data discrepancies are related to (1) single records, where some field is modified, (2) sequences of records over time forming a temporal dimension, where spurious records have been inserted or records have been suppressed, and (3) sequences of records across regions forming a spatial dimension, where there is a pattern of manipulation or information disclosure across regions. The approach determines the appropriate combination of autoencoders, Long Short-Term Memory (LSTM) networks, Temporal Convolutional Networks (TCNs), and Convolutional Neural Networks (CNNs) that can work with data obtained from medical sources and news containing both spatial and temporal dimensions. The tools help the investigators’ collaborators at the University of Colorado Anschutz Medical Center and the Centers for Disease Control and Prevention to perform data integrity checking of medical records and to provide explanations of integrity violations. The tools also handle different types of data and news alterations pertaining to COVID-19.
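
To make the discrepancy categories above concrete, the following is a minimal sketch of a plain statistical baseline for one region's daily series. It is not the project's method: the abstract describes neural models (autoencoders, LSTMs, TCNs, CNNs), whereas this sketch uses a rolling median and median absolute deviation. The function name, the (date, count) record layout, and the window and threshold values are all assumptions made for illustration.

```python
from datetime import date, timedelta
from statistics import median


def find_discrepancies(records, window=7, threshold=5.0):
    """Flag gaps, duplicates, and outlying counts in one region's daily series.

    `records` is a list of (date, count) pairs. The layout and the default
    parameters are hypothetical, chosen only for this illustration.
    """
    records = sorted(records)
    counts = [count for _, count in records]

    # Repeated dates may indicate inserted (spurious) records.
    seen, duplicates = set(), []
    for day, _ in records:
        if day in seen:
            duplicates.append(day)
        seen.add(day)

    # Dates absent from the covered range may indicate suppressed records.
    first, last = records[0][0], records[-1][0]
    all_days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    missing = [day for day in all_days if day not in seen]

    # Counts far from the median of nearby records may indicate a modified field.
    outliers = []
    for i, (day, count) in enumerate(records):
        neighbours = counts[max(0, i - window):i] + counts[i + 1:i + 1 + window]
        if len(neighbours) < 3:
            continue  # too little context to judge this record
        med = median(neighbours)
        mad = median(abs(c - med) for c in neighbours) or 1.0
        if abs(count - med) / mad > threshold:
            outliers.append(day)

    return {"missing": missing, "duplicates": duplicates, "outliers": outliers}
```

A check of this kind covers only the single-record and temporal cases for one region. The spatial case would require comparing series across regions, which this sketch does not attempt.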

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start:
Project End:
Budget Start: 2020-05-01
Budget End: 2022-04-30
Support Year:
Fiscal Year: 2020
Total Cost: $199,748
Indirect Cost:

Name: Colorado State University-Fort Collins
Department:
Type:
DUNS #:
City: Fort Collins
State: CO
Country: United States
Zip Code: 80523