The long-term objective of this project is to develop software that processes raw data obtained from DNA capillary electrophoresis sequencing machines (data processing) and identifies the DNA bases (base calling) with higher overall accuracy than existing techniques.
The specific aims are to: collect a large number of data files (approximately 50,000 files will be used); create a database that includes the correct base calls associated with each data file; develop a methodology for comparing the results of two basecallers that incorporates the confidence value associated with each call; develop novel algorithms for processing the raw data; incorporate a model for the peak amplitudes into base calling; improve the current base spacing model; and, finally, test the basecaller against the proposed database.

The proposed methodology is based on a novel signal processing approach applied to the raw data. A highly adaptive filter will be applied to the raw data; it will adapt to the varying levels of noise and to the variation in peak width. The order in which the traditional processing steps for raw DNA sequencing data are performed will be changed to allow better color separation between the channels. Features of the data itself will be identified and used to predict the base calls. For instance, a peak amplitude model will be created to improve prediction of the base calls. This model will also be used to indicate whether an individual base follows the model, and thus to indicate the probability of an insertion/deletion error. An automatic algorithm will be developed to detect and remove stutter peaks from the raw data. Combined with an improved cross-talk removal procedure, this will allow better sensitivity in identifying heterozygotes in the processed sequences. The calculated confidence values will follow the current standard introduced by phred and will be calibrated so that, for data with reduced levels of noise, they match the actual accuracy rate over the testing database. The software and the testing database will be free of charge for academic and publicly funded sequencing projects.
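To make the confidence-value calibration concrete, the sketch below shows the standard phred relation between a quality value Q and an error probability P, Q = -10 log10(P), together with one simple way to check predicted quality values against a database of known-correct calls. The binning check and the names `phred_quality`, `error_probability`, and `observed_quality_by_bin` are illustrative assumptions, not the project's actual method.

```python
import math
from collections import defaultdict


def phred_quality(p_error):
    """Convert an error probability to a phred-style quality value."""
    return -10.0 * math.log10(p_error)


def error_probability(q):
    """Convert a phred-style quality value back to an error probability."""
    return 10.0 ** (-q / 10.0)


def observed_quality_by_bin(calls):
    """Group calls by predicted quality and measure the observed error rate.

    `calls` is an iterable of (predicted_q, is_correct) pairs, where
    is_correct comes from comparing each call with the known-correct
    sequence (a hypothetical input format chosen for this sketch).
    Returns {predicted_q: (n_calls, n_errors, observed_q)}; observed_q
    is None when a bin contains no errors, since the observed quality
    is then unbounded.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for q, is_correct in calls:
        totals[q] += 1
        if not is_correct:
            errors[q] += 1

    table = {}
    for q in sorted(totals):
        n, e = totals[q], errors[q]
        observed = phred_quality(e / n) if e > 0 else None
        table[q] = (n, e, observed)
    return table
```

In a well-calibrated basecaller, the observed quality in each bin would be close to the predicted quality for that bin, for example roughly 1 error per 100 calls among calls assigned Q20.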

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
5R01HG002929-03
Application #
7252524
Study Section
Special Emphasis Panel (ZRG1-BST-D (02))
Program Officer
Felsenfeld, Adam
Project Start
2005-06-01
Project End
2008-05-31
Budget Start
2007-06-01
Budget End
2008-05-31
Support Year
3
Fiscal Year
2007
Total Cost
$161,534
Indirect Cost
Name
University of St. Thomas
Department
Type
Other Domestic Higher Education
DUNS #
606870090
City
St Paul
State
MN
Country
United States
Zip Code
55105