Next-generation sequencing-by-synthesis platforms enable fast and affordable DNA sequencing. However, the read-lengths they achieve are still shorter than those provided by costly Sanger sequencing, and their accuracy is insufficient for most medical studies. To determine the order of nucleotides in a DNA fragment, sequencing-by-synthesis relies on enzymatic synthesis of the complementary strand on the fragment. The synthesis is enabled by sequential addition of free nucleotides; extension of the complementary strand with the Watson-Crick complement of the first unpaired base of the DNA fragment is detected optically. However, the signal generated by sequencing a single DNA molecule is weak, and thus its detection requires complex and expensive hardware. Ensemble-based systems provide an efficient alternative: they amplify the signal by sequencing a large number of identical copies of the DNA fragment in parallel. To fully reap the benefits of having multiple signal sources, extension of the complementary strands should progress at the same rate (so that the signals add in phase). However, synthesis of strands in an ensemble falls out of sync due to occasional failure of nucleotide incorporation in some strands and premature extension of others. These so-called phasing effects, probabilistic in nature, limit the achievable accuracy and read-lengths of sequencing-by-synthesis. The goal of the proposed project is to develop practical algorithms for optimal base-calling in sequencing-by-synthesis systems, improving their effective read-lengths and accuracy. To this end, we rely on concepts and tools from signal processing and information theory. We address two broadly employed systems: Illumina's four-color platform and Roche's (454 Life Sciences) pyrosequencing platform. If successful, as we expect based on preliminary results, our research will have an immediate impact on various applications that require high-performance DNA sequencing.
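
The phasing effects described above can be illustrated with a small calculation. The sketch below is a toy model, not the model used in the proposed project: it assumes a cycle-based platform in which, in every cycle, each strand independently fails to extend with probability p_lag, extends by two bases with probability p_lead, and otherwise extends by exactly one base. The function name, parameter names, and default probabilities are hypothetical and chosen only for illustration.

```python
def in_phase_fraction(num_cycles, p_lag=0.01, p_lead=0.005):
    """Return, for each cycle, the expected fraction of strands still in phase.

    Toy model (an assumption made for illustration): per cycle, a strand
    incorporates 0 bases with probability p_lag, 2 bases with probability
    p_lead, and 1 base otherwise, independently of all other strands.
    """
    p_ok = 1.0 - p_lag - p_lead
    # dist[k] = probability that a strand has incorporated k bases so far.
    dist = [1.0]
    fractions = []
    for cycle in range(1, num_cycles + 1):
        new = [0.0] * (len(dist) + 2)
        for k, prob in enumerate(dist):
            new[k] += prob * p_lag        # failed incorporation: strand lags
            new[k + 1] += prob * p_ok     # normal single-base extension
            new[k + 2] += prob * p_lead   # premature extension: strand leads
        dist = new
        # A strand is in phase if it has incorporated exactly `cycle` bases.
        fractions.append(dist[cycle])
    return fractions


if __name__ == "__main__":
    fractions = in_phase_fraction(100)
    for cycle in (1, 25, 50, 100):
        print(f"cycle {cycle:3d}: in-phase fraction = {fractions[cycle - 1]:.3f}")
```

Under these assumed rates the in-phase fraction shrinks as the cycle count grows, so the signal contributed by the correct base weakens relative to the contributions of lagging and leading strands. This is the sense in which phasing limits read-length and accuracy.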

Public Health Relevance

The performance of next-generation DNA sequencing is fundamentally limited by the stochastic nature of the underlying biochemical process. Drawing on concepts from signal processing and information theory, we propose to design practical algorithms that may significantly improve the accuracy and effective read-lengths of next-generation DNA sequencing systems. If successful, as we expect based on preliminary results, our research will have an immediate impact on various applications that require high-performance DNA sequencing.

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Bonazzi, Vivien
University of Texas at Austin
Engineering (All Types)
Schools of Engineering
United States
Das, Shreepriya; Vikalo, Haris (2013) Base calling for high-throughput short-read sequencing: dynamic programming solutions. BMC Bioinformatics 14:129
Shen, Xiaohu; Vikalo, Haris (2012) ParticleCall: a particle filter for base calling in next-generation sequencing systems. BMC Bioinformatics 13:160
Das, Shreepriya; Vikalo, Haris (2012) OnlineCall: fast online parameter estimation and base calling for Illumina's next-generation sequencing. Bioinformatics 28:1677-83