This COVID-19 RAPID project will help mitigate the negative impacts of COVID-19 on public health, society, and the economy by creating high-quality databases from highly distributed data about medical and governmental services related to COVID-19. The project will develop software tools that help create "auxiliary" databases with high-quality data, to support better decisions, avoid fraud, and yield high-quality analysis sooner in the urgent and rapidly evolving situation created by the coronavirus pandemic. The techniques that will be used to achieve high quality include:

(1) linking "background data" to the data sets to enable quality checking and fraud detection, for example, ensuring that each hospital listed in the medical resource database is annotated with an accurate phone number so that a volunteer can contact the hospital and verify the data, and

(2) creating new "join keys" to enable easy integration of data in the auxiliary database with other data. The project will work closely with other related COVID-19 RAPID efforts that are working on various aspects of data and information collection from the Web.
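The join-key idea in (2) can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that a key is derived from a normalized institution name, a state code, and a five-digit ZIP code; the function names `normalize_name` and `make_join_key` and the choice of fields are hypothetical and are not taken from the project.

```python
import hashlib
import re
import unicodedata


def normalize_name(name: str) -> str:
    """Lowercase, strip accents and punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"['\u2019]", "", text.lower())   # drop apostrophes: "mary's" -> "marys"
    text = re.sub(r"[^a-z0-9]+", " ", text)         # other punctuation becomes a space
    return " ".join(text.split())


def make_join_key(name: str, state: str, zip_code: str) -> str:
    """Build a deterministic key from fields assumed to identify an institution."""
    canonical = "|".join([
        normalize_name(name),
        state.strip().upper(),
        zip_code.strip()[:5],                       # ignore any ZIP+4 suffix
    ])
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()[:16]


# Two spellings of the same (made-up) institution map to the same key.
key_a = make_join_key("St. Mary's Hospital", "mi", "48109-1234")
key_b = make_join_key("ST MARYS  HOSPITAL", "MI", "48109")
```

Because the key depends only on the normalized fields, two data sets that spell an institution differently can still be joined on it. A real system would likely need fuzzier matching than this exact-normalization approach.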

The project will focus on creating two high-quality databases using these strategies:

(1) A unified medical institution auxiliary database, which will be a database of all known US medical institutions, and

(2) A unified government office auxiliary database, which will be a database of all known government offices in the United States (city halls, courts, licensing offices, etc.) at any level of government.

Both of these data sets are crucial for ensuring that citizens receive a base level of medical aid and government assistance. They would be beneficial not only during this particular pandemic but also as essential general-purpose resources in the future.

The proposed infrastructure for creating auxiliary data sets will include a rich schema of background information, used for quality checking, and a set of join keys for data integration. While a huge array of medical institution data sets is available online, many of them are misaligned because they lack standard names and/or data integration keys: different projects make different local decisions when choosing these values, and those decisions may not be universally compatible. As a result, the background information is less rich, and integration with data from other institutions or analysis pipelines is much more difficult. The strategies used to create this infrastructure would include:

(1) synthesis of preliminary auxiliary data sets, which includes generating common candidate attributes for all objects in the input set, for example, creating a helipad field for hospitals based on examining all hospital data in Wikidata,

(2) identification of inputs with missing values, and filling in those values with a combination of Web extraction tasks and crowdsourcing tasks, and

(3) flagging values that are suspected of being incorrect by, for example, automatically creating a set of machine-learned predictors for each column in the auxiliary data. The system could then run each predictor and identify outlier values.
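Strategy (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: a one-feature least-squares line stands in for the "machine-learned predictor", residuals are scored with a median-absolute-deviation rule, and the column names, the threshold of 3.5, and the sample rows are all made up rather than taken from the project.

```python
from statistics import mean, median


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:                       # constant feature: predict the mean
        return 0.0, y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar


def flag_suspect_values(rows, target, feature, threshold=3.5):
    """Return indices of rows whose target value is far from its prediction."""
    xs = [row[feature] for row in rows]
    ys = [row[target] for row in rows]
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    center = median(residuals)
    # Median absolute deviation, scaled to be comparable to a standard deviation.
    scale = 1.4826 * median(abs(r - center) for r in residuals)
    if scale == 0:                     # residuals (nearly) identical: nothing to flag
        return []
    return [i for i, r in enumerate(residuals)
            if abs(r - center) / scale > threshold]


# Made-up rows for illustration; the third "staff" value looks like an entry error.
hospitals = [
    {"beds": 100, "staff": 200}, {"beds": 200, "staff": 400},
    {"beds": 300, "staff": 6000}, {"beds": 400, "staff": 800},
    {"beds": 500, "staff": 1000}, {"beds": 600, "staff": 1200},
    {"beds": 700, "staff": 1400},
]
suspects = flag_suspect_values(hospitals, target="staff", feature="beds")
```

In a fuller system, one such predictor would be trained per column, probably using several feature columns and a stronger model. Flagged rows would then be candidates for review (for instance by a volunteer) rather than being corrected automatically.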

This RAPID award is made by the Convergence Accelerator program in the Office of Integrative Activities using funds from the Coronavirus Aid, Relief, and Economic Security (CARES) Act, and is associated with the Convergence Accelerator Track A: Open Knowledge Network.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Budget Start: 2020-05-01
Budget End: 2021-04-30
Fiscal Year: 2020
Total Cost: $164,811
Name: Regents of the University of Michigan - Ann Arbor
City: Ann Arbor
State: MI
Country: United States
Zip Code: 48109