This project addresses current limitations in automatic information extraction technology. Specific objectives are to: (1) use bootstrapping techniques to greatly increase both the number of entity and relation types that can be extracted and the rate at which new extractors can be created; (2) improve the performance of supervised training for entity and relation extractors by using bootstrapping to add training features and by applying new supervised learning techniques, including new perceptron and discriminative training techniques; and (3) address metadata issues of provenance, confidence, and temporal extent of facts, focusing particularly on the construction of a model of the expected lifetime of facts based on a longitudinal corpus of Web data.
The outcome of the project will be scientific understanding and technology for automatic information extraction from free text, making it possible to convert large document collections into formal databases suitable for automated processing. This will represent a significant enhancement in the utility and societal benefit of digital libraries and the World Wide Web. Project results will be disseminated in the form of publications and publicly available code for information extraction and learning of extractors.