The rapid accumulation of information stored as computer files, such as images and videos, requires a great deal of computer and internet data storage. Current methods of maintaining computer data can cost a lot of money and energy (especially for cooling). In addition, the materials on which the files are written are not very stable: they degrade with time, so data may eventually be lost within a few decades. To solve this problem, a new technology for writing and reading digital data in molecular strings will be developed, based on DNA, the molecule from which the genetic code is also made. All living cells rely on DNA molecules to store the instructions that run cells and tissues, and these molecules are more stable than magnetic tape or paper. If successful, this DNA-based storage of computer data would readily retain all of the world's current electronic data.

To develop a DNA-based data storage technology, a coding scheme that can reliably write and read back data in segments of DNA is proposed. One approach will use combinatorial molecular barcodes for addressing and random access, and also to generate the data blocks. Synthesis schemes to write long segments of DNA will be employed, in which millions of such segments will be generated in parallel. Longer segments allow large files to be divided into fewer segments and thus require shorter index and random access barcodes. Nanopore DNA sequencers, which generate long sequences, will be used to read this data. Sophisticated mathematical coding techniques will be used to robustly reconstruct a written message after accounting for errors specific to the write and read platforms, with a special emphasis on nanopore technology. The coding techniques are tailored to the higher error rates of nanopore sequencing, which is the most promising platform for scalable sequencing. A DNA storage simulator will also be developed that allows researchers to model specific application needs, using predefined or custom error models for DNA write and read platforms and a variety of coding models for addresses and data. This will allow the trade-offs between cost, robustness and efficiency to be simulated at data scales from gigabytes to exabytes. Several cohorts of students will be trained at the interface of genetics, biochemistry, electrical engineering and coding theory in the course of the proposed work.
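
The relationship between segment length and index length can be illustrated with a small sketch. The code below is not the project's coding scheme: it assumes a naive mapping of two bits per base, a plain base-4 counter as the index (rather than combinatorial barcodes), no error correction, and a substitution-only error channel, which ignores the insertions and deletions that also occur in nanopore reads. All function names are hypothetical.

```python
import math
import random

BASES = "ACGT"  # assumed mapping: A=00, C=01, G=10, T=11


def bytes_to_dna(data: bytes) -> str:
    """Map each byte to four bases, two bits per base, high bits first."""
    return "".join(BASES[(b >> s) & 0b11] for b in data for s in (6, 4, 2, 0))


def dna_to_bytes(seq: str) -> bytes:
    """Inverse of bytes_to_dna; expects a length that is a multiple of 4."""
    out = bytearray()
    for i in range(0, len(seq), 4):
        value = 0
        for base in seq[i:i + 4]:
            value = (value << 2) | BASES.index(base)
        out.append(value)
    return bytes(out)


def index_length(file_bytes: int, payload_bytes: int) -> int:
    """Number of index bases needed to give every segment a unique address."""
    segments = math.ceil(file_bytes / payload_bytes)
    length = 1
    while 4 ** length < segments:
        length += 1
    return length


def encode_index(i: int, length: int) -> str:
    """Write integer i as a fixed-length base-4 string over ACGT."""
    return "".join(BASES[(i >> (2 * k)) & 0b11] for k in reversed(range(length)))


def decode_index(seq: str) -> int:
    value = 0
    for base in seq:
        value = (value << 2) | BASES.index(base)
    return value


def encode_file(data: bytes, payload_bytes: int):
    """Split data into indexed strands: index bases followed by payload bases."""
    idx_len = index_length(len(data), payload_bytes)
    count = math.ceil(len(data) / payload_bytes)
    strands = [
        encode_index(i, idx_len)
        + bytes_to_dna(data[i * payload_bytes:(i + 1) * payload_bytes])
        for i in range(count)
    ]
    return strands, idx_len


def decode_file(strands, idx_len: int) -> bytes:
    """Reorder strands by their index and concatenate the payloads."""
    ordered = sorted(strands, key=lambda s: decode_index(s[:idx_len]))
    return b"".join(dna_to_bytes(s[idx_len:]) for s in ordered)


def substitute(seq: str, rate: float, rng: random.Random) -> str:
    """Toy read channel: each base is replaced by a different one with probability rate."""
    return "".join(
        rng.choice(BASES.replace(b, "")) if rng.random() < rate else b for b in seq
    )


if __name__ == "__main__":
    # Index cost for a 1 GB file at two hypothetical payload sizes.
    for payload in (25, 2500):
        print(payload, "bytes per segment ->", index_length(10**9, payload), "index bases")
```

Under these assumptions, a 1 GB file split into 25-byte payloads needs a longer index per segment than the same file split into 2,500-byte payloads, and the shorter index is amortized over a much larger payload. Because strands in a DNA pool are unordered, `decode_file` relies on the index alone to restore the original order.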

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

National Science Foundation (NSF)
Division of Molecular and Cellular Biosciences (MCB)
Program Officer
Arcady Mushegian
Stanford University
United States