To build domain-specific, web-scale video digital libraries, it is critical to identify and extract information of interest (termed information segments) efficiently and automatically. For instance, by collecting only so-called academic videos and their information segments from the Web, one can build a next-generation digital library similar to CiteSeer or Google Scholar, but one that archives and indexes only academic videos (instead of academic papers). Toward this goal, we conduct a preliminary study on identifying and extracting latent information segments from domain-specific videos on the Web. The key emphasis is on how to unearth diverse metadata and associated data from video contents and from the web pages from which the videos are downloaded. Techniques from machine learning (e.g., LDA), data extraction and integration (e.g., wrapper/mediator), natural language processing (e.g., named entity recognition and extraction), and multimedia processing (e.g., near-duplicate detection) are evaluated, applied, and extended appropriately. The scalability of such techniques over large volumes of video data is also being explored.

Agency: National Science Foundation (NSF)
Institute: Division of Undergraduate Education (DUE)
Type: Standard Grant (Standard)
Application #: 0937891
Program Officer: Herbert H. Richtol
Project Start:
Project End:
Budget Start: 2009-12-01
Budget End: 2012-11-30
Support Year:
Fiscal Year: 2009
Total Cost: $150,000
Indirect Cost:

Name: Pennsylvania State University
Department:
Type:
DUNS #:
City: University Park
State: PA
Country: United States
Zip Code: 16802