This project uses natural language processing and machine learning to investigate methods for breaking the human metadata generation bottleneck that has plagued projects providing access to educational resources on the Internet. Breaking the metadata generation bottleneck is necessary if access to National SMETE Digital Library (NSDL) resources is to scale appropriately to the Internet. Comporting fully with emerging international standards for educational metadata, the project demonstrates the feasibility of automatically generating metadata for the NSDL through the processing of full-text collections from the Eisenhower National Clearinghouse on Science and Mathematics. The metadata generated enhances the GEM metadata repository, a nationally recognized finding tool for educational resources, and provides the technical means for the automatic generation of educational metadata from text-based resources. There are five research goals for this project: (1) develop a sublanguage and discourse model for science and mathematics educational materials; (2) extend an automatic metatagger to these materials, using machine learning, the GEM metatag set, extended metatag sets, and heuristics based on the sublanguage and discourse model; (3) extend a sophisticated information extraction technology that can simultaneously extract event-specific relational information as well as domain-independent concepts and relationships; (4) identify appropriate controlled vocabularies and thesauri for science and mathematics educational materials, and incorporate them into the registry used by the automatic metatagger; and (5) evaluate automatic vs. manual metatagging, in both quantitative and qualitative terms. An innovative array of experimental methods is used to achieve these goals.

The project includes a qualitative analysis to understand the role of human inconsistency within the manual process and a quantitative analysis of the results using the metrics of precision and recall. This project is designed to apply natural language processing and machine learning to the task of automatic metatagging so that metatagging can scale to the needs of the NSDL and provide access to a far greater number of educational resources.
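
As an illustration of the quantitative comparison, the following minimal sketch computes precision and recall for the metatags assigned automatically to a single resource, measured against the manually assigned metatags. It is not the project's evaluation code. The function name `precision_recall` and the sample tags are hypothetical, and treating the manual tags as the reference set is itself an assumption, since the project also examines inconsistency among human taggers.

```python
def precision_recall(automatic, manual):
    """Compare automatically assigned metatags with manually assigned ones.

    precision = agreed tags / all automatic tags
    recall    = agreed tags / all manual tags
    Assumption: the manual tags serve as the reference set.
    """
    auto_set, manual_set = set(automatic), set(manual)
    agreed = auto_set & manual_set
    precision = len(agreed) / len(auto_set) if auto_set else 0.0
    recall = len(agreed) / len(manual_set) if manual_set else 0.0
    return precision, recall


# Hypothetical metatags for one resource (illustrative values only).
manual_tags = {"subject:algebra", "grade:9", "format:lesson plan", "audience:teacher"}
automatic_tags = {"subject:algebra", "grade:9", "format:worksheet"}

precision, recall = precision_recall(automatic_tags, manual_tags)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this example two of the three automatic tags agree with the manual tags, so precision is 2/3, and two of the four manual tags are recovered, so recall is 1/2.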

Agency: National Science Foundation (NSF)
Institute: Division of Undergraduate Education (DUE)
Type: Standard Grant (Standard)
Application #: 0085837
Program Officer: Jane Prey
Project Start:
Project End:
Budget Start: 2000-09-15
Budget End: 2002-02-28
Support Year:
Fiscal Year: 2000
Total Cost: $366,293
Indirect Cost:
Name: Syracuse University
Department:
Type:
DUNS #:
City: Syracuse
State: NY
Country: United States
Zip Code: 13244