The Web has made humans smarter, providing ready access to vast amounts of knowledge and facts. The Semantic Web has the capacity to similarly enhance computer programs and devices by giving them access to enormous volumes of data, facts and knowledge. This project is exploring the feasibility of automatically extracting new knowledge directly from data found in spreadsheets, database relations, and document tables and representing it as highly interoperable linked open data (LOD) in the Semantic Web language RDF. The extraction is guided by probabilistic graphical models that use statistical information mined from current LOD knowledge resources. To demonstrate the potential payoff of the research, the system is used to extract knowledge from tables collected from medical journals and tables from web sites like data.gov.
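As a rough illustration of the target representation only, the sketch below turns a small table into RDF triples in N-Triples syntax. It is a minimal sketch under stated assumptions, not the project's method: the `http://example.org/` namespace, the function names, and the toy table are hypothetical, and the choice of subject column and entity columns is supplied by hand here, whereas the project infers such interpretations with probabilistic graphical models.

```python
from urllib.parse import quote

# Hypothetical namespace (assumption); the project's vocabulary is not given in the text.
BASE = "http://example.org/"


def iri(kind, label):
    """Build a hypothetical IRI for an entity or property from a text label."""
    return "<" + BASE + kind + "/" + quote(label.strip().replace(" ", "_")) + ">"


def literal(value):
    """Escape a cell value as a plain N-Triples string literal (simplified escaping)."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def table_to_ntriples(header, rows, subject_col, entity_cols=()):
    """Emit one triple per non-empty, non-subject cell.

    subject_col: header name of the column whose cells name each row's subject.
    entity_cols: header names whose cells are treated as entities (IRIs)
                 rather than literals. These choices are hand-supplied
                 assumptions; the project aims to infer them automatically.
    """
    s_idx = header.index(subject_col)
    triples = []
    for row in rows:
        subject = iri("resource", row[s_idx])
        for i, cell in enumerate(row):
            if i == s_idx or not cell.strip():
                continue
            predicate = iri("property", header[i])
            if header[i] in entity_cols:
                obj = iri("resource", cell)
            else:
                obj = literal(cell.strip())
            triples.append(f"{subject} {predicate} {obj} .")
    return triples


# Toy table with placeholder values (not real data).
header = ["City", "State", "Population"]
rows = [["Alpha City", "State A", "1000"], ["Beta Town", "State B", "2000"]]
for line in table_to_ntriples(header, rows, "City", entity_cols=("State",)):
    print(line)
```

The hard part, which this sketch skips, is deciding what each column and cell means; that is where statistics mined from existing LOD resources come in.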
While the W3C Semantic Web languages RDF and OWL are used to represent the knowledge, the results are applicable to other semantic data frameworks such as Microdata (Search Consortium), Freebase (Google), Probase (Microsoft) and the Open Graph (Facebook). The open-source prototype software allows other researchers to experiment with automatically producing semantically enriched data from tables in their own domains.
If successful, such extraction systems are expected to become part of a new online knowledge ecology, both consuming existing LOD knowledge to understand the intended meaning implicit in a table and producing new facts and knowledge that will become part of the Web. The result would be a dramatic increase in the breadth and depth of public semantic data, which can make "big data" analytics more effective.