The demand for data storage continues to increase at a rapid pace, posing significant challenges to current data centers and spurring strong interest in the development of new storage technologies. DNA, the molecule that carries the genetic information of all living organisms, has become a promising medium for long-term archival data storage due to its longevity and very high information density. This new approach to data storage presents unique challenges. Unlike typical hard drives, where data bits are stored in a well-ordered linear fashion, DNA storage requires synthesizing a large number of DNA molecules that are then mixed out of order in a liquid solution. This makes the process of reliably reading the data after storage significantly more expensive and computationally complex. The goal of this project is to understand the fundamental limitations and capabilities of DNA as a storage medium. In particular, this research seeks to characterize basic tradeoffs between cost, information density, reading and writing speeds, and reliability, with the aim of developing new coding strategies that can unlock the full potential of this innovative approach to data storage.
The project will investigate the fundamental limits of DNA storage systems by focusing on three main objectives. The first objective is to develop an information-theoretic framework to formally analyze these systems. DNA storage systems will be modeled via the abstraction of a shuffling channel, which captures the fact that, in DNA-based storage, many blocks of data are shuffled out of order. The capacity of these channels will be characterized under different noise models, and the properties of optimal coding schemes will be studied. A particular question of interest is how to design capacity-optimal indexing strategies that allow the data to be reordered correctly. Since the cost of synthesizing long DNA strands is the main obstacle to practical DNA storage, the second research objective will deal with systems that store data on many very short DNA strands, each of which is too short to encode any meaningful information. For that reason, new strategies to encode information in the concentrations of different DNA molecules in the solution will be proposed, and their fundamental capabilities established. The third research objective will focus on the computational challenges associated with the joint processing of a large set of DNA sequences. In particular, basic tradeoffs between storage capacity and computational requirements will be established, and recent algorithmic advances in large-scale sequence alignment will be leveraged to develop computationally efficient decoding algorithms.
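As an illustration of the shuffling-channel abstraction and of index-based reordering, the following minimal Python sketch may be helpful. It is not the project's coding scheme: the function names (`encode`, `shuffling_channel`, `decode`), the use of bit strings in place of nucleotide sequences, and the noise-free channel are all assumptions made for the example.

```python
import random

def encode(bits, payload_len):
    """Split a bit string into blocks and prefix each block with a binary index.

    Hypothetical scheme for illustration: a fixed-width index per strand.
    """
    num_blocks = max(1, -(-len(bits) // payload_len))  # ceiling division
    index_len = max(1, (num_blocks - 1).bit_length())  # bits needed to label every block
    padded = bits.ljust(num_blocks * payload_len, "0")
    strands = [
        format(i, f"0{index_len}b") + padded[i * payload_len:(i + 1) * payload_len]
        for i in range(num_blocks)
    ]
    return strands, index_len

def shuffling_channel(strands, rng):
    """Return the strands in random order (assumed noise-free: no errors, no losses)."""
    shuffled = list(strands)
    rng.shuffle(shuffled)
    return shuffled

def decode(strands, index_len, original_len):
    """Sort strands by their index prefix and concatenate the payloads."""
    ordered = sorted(strands, key=lambda s: int(s[:index_len], 2))
    return "".join(s[index_len:] for s in ordered)[:original_len]

message = "1011001110001111"
strands, index_len = encode(message, payload_len=4)
received = shuffling_channel(strands, random.Random(0))
recovered = decode(received, index_len, len(message))
```

In this toy setting, each strand spends `index_len` of its `index_len + payload_len` symbols on ordering information, which hints at why short strands are costly under naive indexing and why the design of capacity-optimal indexing is a question of interest.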
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.