Raw Sequencing Data Processing and Base Calling

Domnisoru, Cristian

Abstract

The long-term objective of this application is to develop software that processes raw data obtained from DNA capillary electrophoresis sequencing machines (data processing) and identifies the DNA bases with higher overall accuracy than existing techniques (base calling).

The specific aims are to: collect a large number of data files (approximately 50,000 files will be used); create a database that includes the correct base calls associated with each data file; develop a methodology for comparing the results of two base callers that incorporates the confidence values associated with each call; develop novel algorithms for processing the raw data; incorporate a model of the peak amplitudes into base calling; improve the current base spacing model; and, finally, test the base caller against the proposed database.

The proposed methodology is based on a novel signal processing approach applied to the raw data. A highly adaptive filter will be applied to the raw data, adapting to the varying levels of noise and to the variation in peak width. The order in which the traditional steps of DNA sequencing raw data processing are performed will be changed to allow better color separation between the channels. Features of the data itself will be identified and used to predict the base calls. For instance, a peak amplitude model will be created to allow better prediction of the base calls. This model will also be used to indicate whether or not an individual base follows it, and thus the probability of an insertion/deletion error. An automatic algorithm will be developed to detect and remove stutter peaks from the raw data. Combined with an improved cross-talk removal procedure, this will allow better sensitivity in identifying heterozygotes in the processed sequences. The calculated confidence values will follow the current standard introduced by phred and will be calibrated so that, for data with reduced levels of noise, they match the actual accuracy rate over the testing database. The software and the testing database will be free of charge for academic and publicly funded sequencing projects.
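To make the calibration goal concrete: a phred-style confidence value conventionally expresses the estimated error probability P of a call as Q = -10 log10(P), so Q20 corresponds to a 1% error rate and Q30 to 0.1%. The sketch below is illustrative only and is not taken from the project. The function names and the (predicted quality, correct or not) input layout are assumptions, and the per-bin comparison is just one simple way to check predicted confidence values against a database of known correct calls.

```python
import math
from collections import defaultdict


def phred_quality(p_error: float) -> float:
    """Phred-scaled quality for an error probability: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p_error)


def error_probability(q: float) -> float:
    """Inverse mapping: error probability implied by a phred-scaled quality."""
    return 10.0 ** (-q / 10.0)


def observed_quality_by_bin(calls):
    """Group calls by predicted quality and compute the empirical quality.

    `calls` is an iterable of (predicted_q, is_correct) pairs (a hypothetical
    layout). Bins with no observed errors are skipped, because their empirical
    quality is unbounded.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for predicted_q, is_correct in calls:
        totals[predicted_q] += 1
        if not is_correct:
            errors[predicted_q] += 1
    return {
        q: phred_quality(errors[q] / totals[q])
        for q in totals
        if errors[q] > 0
    }
```

For well-calibrated confidence values, the empirical quality in each bin should be close to the predicted one. For example, 100 calls predicted at Q20 with exactly one error give an observed quality of about 20.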

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG002929-03
Application #: 7252524
Study Section: Special Emphasis Panel (ZRG1-BST-D (02))
Program Officer: Felsenfeld, Adam

Project Start: 2005-06-01
Project End: 2008-05-31
Budget Start: 2007-06-01
Budget End: 2008-05-31
Support Year: 3
Fiscal Year: 2007
Total Cost: $161,534
Indirect Cost

Institution

Name: University of St. Thomas
Department
Type: Other Domestic Higher Education
DUNS #: 606870090

City: St. Paul
State: MN
Country: United States
Zip Code: 55105

Related projects


NIH 2007 R01 HG: Raw Sequencing Data Processing and Base Calling. Domnisoru, Cristian / University of St. Thomas. $161,534
NIH 2006 R01 HG: Raw Sequencing Data Processing and Base Calling. Domnisoru, Cristian / University of St. Thomas. $166,999
NIH 2005 R01 HG: Raw Sequencing Data Processing and Base Calling. Domnisoru, Cristian / University of St. Thomas. $171,653
