Distributed information storage architectures that are resilient to drive failures, and that are organized so they can heal efficiently, are a key component of the emerging area of big data systems. Moreover, the transition by companies and individuals to storing data in the cloud is enabling gigantic economies of scale, and distributed storage architectures are among the most important components of such large-scale computing services. From the standpoint of the coding they utilize, however, the architectures used by modern distributed storage systems remain archaic. More efficient distributed storage architectures would deliver better services to customers, including faster access, higher reliability, and more responsive processing, at a lower aggregate long-term cost. To reap these long-term benefits, new science must be developed to understand and determine the fundamental limits of these systems. These fundamental limits can be expressed in terms of the capacity regions of networks under network coding, which are themselves as yet only implicitly characterized in all but the most symmetric and simplified cases.
This project will study the capacity regions of networks and distributed storage systems, and the classes of codes that achieve them. Automated methods for computing the rate regions of small networks will be developed, harnessing the detection of problem symmetries, specialized polyhedral computation, and the enumeration of classes of matroids associated with optimal codes. Larger networks and storage systems will also be studied via the introduction of network embedding operations that preserve the sufficiency of classes of codes to achieve the entire capacity region.
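To give a flavor of the matroid enumeration underlying this approach, the following is a minimal, illustrative Python sketch that lists all matroids on a tiny ground set by brute force over rank functions satisfying the matroid rank axioms (normalization and bounds, monotonicity, submodularity). The function names and the subset encoding are illustrative assumptions, not the project's actual tooling; real enumeration at useful scales must additionally exploit symmetry and filter for the representability properties connected to codes.

```python
# Illustrative sketch: brute-force enumeration of matroids on a small ground
# set via the rank axioms. Names and encoding are hypothetical, not the
# project's software.
from itertools import combinations, product

def subsets(ground):
    """All subsets of `ground`, as frozensets (including the empty set)."""
    return [frozenset(c) for r in range(len(ground) + 1)
            for c in combinations(ground, r)]

def is_matroid_rank(rank, subs):
    """Check the matroid rank axioms for a candidate rank function."""
    for a in subs:
        if not 0 <= rank[a] <= len(a):                          # bounds
            return False
        for b in subs:
            if a <= b and rank[a] > rank[b]:                    # monotonicity
                return False
            if rank[a | b] + rank[a & b] > rank[a] + rank[b]:   # submodularity
                return False
    return True

def count_matroids(ground):
    subs = subsets(ground)
    nonempty = [a for a in subs if a]
    count = 0
    # Try every assignment of a rank in 0..|A| to each nonempty subset A;
    # the empty set always has rank 0.
    for values in product(*(range(len(a) + 1) for a in nonempty)):
        rank = {frozenset(): 0, **dict(zip(nonempty, values))}
        if is_matroid_rank(rank, subs):
            count += 1
    return count

print(count_matroids({1, 2, 3}))  # 16 labeled matroids on a 3-element set
```

Even this naive search makes the combinatorial explosion plain: the number of candidate rank functions grows doubly exponentially in the ground-set size, which is why symmetry detection and specialized polyhedral computation are central to pushing the enumeration to network sizes of interest.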