The problem of reconstructing text strings from their fragments or masses of fragments is of focal importance in computational biology as current DNA and protein sequencing platforms are unable to read the content of long strings that denote protein/gene sequences. One prototypical example of string reconstruction arises in DNA assembly: there, one creates multiple copies of the same long string and cuts the copies to read out short overlapping substrings that can be put together by matching their prefixes and suffixes. The reconstructed string may not be a perfect replica of the original string due to errors in the fragmentation and matching processes. Furthermore, for many strings unique reconstruction is inherently impossible. This represents a major issue for next generation sequencing technologies used in fundamental biological research, since in this setting it is impossible to ensure unambiguous results. The successful code designs pursued in this project can resolve reliability and content retrieval issues impeding implementations of emerging molecular computing and storage paradigms.

This project is concerned with developing novel coding methods for unique reconstruction of strings or pools of strings based on their constituent substrings, subsequences and substring compositions. The techniques employed represent a combination of new graph-theoretic, combinatorial optimization and information theory approaches. In particular, the project will investigate the use of balanced partial de Bruijn strings for substring-based reconstruction, Catalan-like paths for multiset composition reconstruction as well as coded multi-trace reconstruction methods involving specialized modifications of deletion-correcting codes and superposition codes. The coding schemes will be tested on DNA-based and synthetic polymer-based data storage platforms under the development at the University of Illinois.

This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.

Project Start
Project End
Budget Start
2020-10-01
Budget End
2023-09-30
Support Year
Fiscal Year
2020
Total Cost
$231,463
Indirect Cost
Name
University of Illinois Urbana-Champaign
Department
Type
DUNS #
City
Champaign
State
IL
Country
United States
Zip Code
61820