In 1953, Claude Shannon, the founder of information theory, pointed out that there is no theory by which the information embodied in structure can be quantified; this remains true today. The need for such a theory has become pressing in recent years with the proliferation of structured data sets arising from diverse applications. We have yet to answer fundamental questions such as: What are the limits on the storage and processing of structural information? What bounds govern the extraction of information from large biological databases? Lack of answers to such questions threatens to pose severe impediments to further advances in the science and engineering of complex systems. The main goal of this work is to develop measures and algorithms that appraise the amount of organization and structure embodied in artifacts and natural objects. We propose to make headway toward an information theory of data structures.
Data are increasingly available in many forms (e.g., sequences, expressions, interactions, structures) and in exponentially growing volumes. Most such data are multidimensional and context-dependent, which calls for novel theory and efficient algorithms to extract meaningful information from non-conventional data structures. In compressing such a data structure, one must account for two types of information: the information conveyed by the structure itself, and the information conveyed by the data labels embedded in that structure. The specific goals of this project are: (i) to characterize the total amount of information conveyed by a data structure, and how it decomposes into the two types of information above, and (ii) to design efficient compression algorithms whose performance approaches the characterization obtained in (i).
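To make the two-part decomposition concrete, note that by the chain rule of entropy the total information in a labeled structure splits as H(structure, labels) = H(structure) + H(labels | structure). The following is a minimal illustrative sketch, not the project's algorithm: it separates a labeled binary tree into a shape bit string (structural information) and a label stream, and compresses each part independently with an off-the-shelf compressor. All names in the sketch are hypothetical.

```python
# Illustrative sketch only: separating structural information (tree shape)
# from label information when compressing a labeled binary tree.
import zlib
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    label: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def encode(node: Optional[Node], shape_bits: list, labels: list) -> None:
    """Pre-order traversal: '1' marks a present node, '0' a missing child."""
    if node is None:
        shape_bits.append("0")
        return
    shape_bits.append("1")
    labels.append(node.label)
    encode(node.left, shape_bits, labels)
    encode(node.right, shape_bits, labels)

def compress(root: Node) -> tuple:
    shape_bits, labels = [], []
    encode(root, shape_bits, labels)
    # Structural part: the shape bit string (2n + 1 bits for n nodes).
    structure = zlib.compress("".join(shape_bits).encode())
    # Label part: node labels listed in the same pre-order.
    label_data = zlib.compress("\n".join(labels).encode())
    return structure, label_data

if __name__ == "__main__":
    tree = Node("root", Node("a", Node("b")), Node("c"))
    s, l = compress(tree)
    print(len(s), "bytes for structure,", len(l), "bytes for labels")
```

The sketch uses a generic compressor for both parts; the project's aim, by contrast, is to determine how many bits each part fundamentally requires and to design algorithms that meet those bounds.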