Next generation sequencing-by-synthesis platforms enable fast and affordable DNA sequencing. However, the read-lengths they achieve are still shorter than those provided by costly Sanger sequencing, and their accuracy is insufficient for most medical studies. To determine the order of nucleotides in a DNA fragment, sequencing-by-synthesis relies on enzymatic synthesis of the complementary strand, with the fragment serving as the template. The synthesis is enabled by a sequential addition of free nucleotides; extension of the complementary strand with the Watson-Crick complement of the first unpaired base of the DNA fragment is detected optically. However, the signal generated by sequencing a single DNA molecule is weak, and thus its detection requires complex and expensive hardware. Ensemble-based systems provide an efficient alternative: they amplify the signal by sequencing a large number of identical copies of the DNA fragment in parallel. To fully reap the benefits of having multiple signal sources, extension of the complementary strands should progress at the same rate (so that the signals add in phase). However, synthesis of strands in an ensemble falls out of sync because of occasional failure of nucleotide incorporation in some strands and premature extension of others. These so-called phasing effects, probabilistic in nature, limit the achievable accuracy and read-lengths of sequencing-by-synthesis. The goal of the proposed project is to develop practical algorithms for optimal base-calling in sequencing-by-synthesis systems, improving their effective read-lengths and accuracy. To this end, we rely on concepts and tools from signal processing and information theory. We address two broadly employed systems: Illumina's four-color platform and Roche's (454 Life Sciences) pyrosequencing platform. If successful, as we expect based on preliminary results, our research will have immediate impact on various applications that require high-performance DNA sequencing.
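
The loss of synchrony described above can be illustrated with a toy model. The following is a minimal sketch, not the project's base-calling algorithm. It assumes a cycle-based chemistry in which, during each cycle, a strand fails to extend with probability `p_lag`, extends by two bases with probability `p_lead`, and otherwise extends by exactly one base. The function names and the default probabilities are hypothetical and chosen only for illustration.

```python
def phase_distribution(n_cycles, p_lag=0.01, p_lead=0.005):
    """Return a list `dist` where dist[k] is the fraction of strands in the
    ensemble that have incorporated k bases after `n_cycles` cycles.

    p_lag and p_lead are assumed, illustrative per-cycle probabilities.
    """
    p_ok = 1.0 - p_lag - p_lead
    dist = [1.0]  # before the first cycle, every strand is at position 0
    for _ in range(n_cycles):
        nxt = [0.0] * (len(dist) + 2)
        for pos, frac in enumerate(dist):
            nxt[pos] += frac * p_lag        # failed incorporation: strand lags
            nxt[pos + 1] += frac * p_ok     # correct single-base extension
            nxt[pos + 2] += frac * p_lead   # premature extension: strand leads
        dist = nxt
    return dist


def in_phase_fraction(n_cycles, p_lag=0.01, p_lead=0.005):
    """Fraction of strands that have incorporated exactly n_cycles bases."""
    return phase_distribution(n_cycles, p_lag, p_lead)[n_cycles]


if __name__ == "__main__":
    # Print how the in-phase fraction decays as the read gets longer.
    for n in (1, 25, 50, 100):
        print(n, round(in_phase_fraction(n), 3))
```

Under these assumptions the in-phase fraction shrinks as the number of cycles grows, so the signal from later cycles is increasingly mixed with contributions from neighboring bases. This is the effect that limits read-length and accuracy in the abstract's description.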

Public Health Relevance

The performance of next generation DNA sequencing is fundamentally limited by the stochastic nature of the underlying biochemical process. Drawing on concepts from signal processing and information theory, we propose to design practical algorithms that may significantly improve the accuracy and effective read-lengths of next generation DNA sequencing systems. If successful, as we expect based on preliminary results, our research will have immediate impact on various applications that require high-performance DNA sequencing.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Exploratory/Developmental Grants (R21)
Project #: 5R21HG006171-02
Application #: 8288688
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Bonazzi, Vivien
Project Start: 2011-09-01
Project End: 2014-07-31
Budget Start: 2012-08-01
Budget End: 2014-07-31
Support Year: 2
Fiscal Year: 2012
Total Cost: $177,883
Indirect Cost: $52,883

Name: University of Texas Austin
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 170230239
City: Austin
State: TX
Country: United States
Zip Code: 78712

Das, Shreepriya; Vikalo, Haris (2013) Base calling for high-throughput short-read sequencing: dynamic programming solutions. BMC Bioinformatics 14:129