The investigator will develop a software tool for assembling DNA fragments generated in megabase-scale shotgun sequencing projects. The software will be tested first on computer-generated DNA fragments derived from megabase DNA sequences and then on real DNA fragments from large-scale sequencing projects. The software will be freely distributed to nonprofit organizations. The investigator will assist with integrating the software into sequencing environments at genome centers. The objective of this project will be achieved by making two major improvements to a DNA sequence assembly program developed previously. The first improvement is to develop a strategy for solving the problems caused by repetitive sequences. In this strategy, all the fragments from a repetitive sequence are identified, and the uncertainties in the assembly of these fragments are resolved using additional information on the fragments that flank copies of the repetitive sequence. The second improvement is to increase the capacity of the assembly program by developing a parallel version of the program in the PVM parallel programming environment on a local network of computers. The investigator will parallelize the two most time-consuming parts of the sequential program: the detection of overlaps among fragments and the construction of fragment alignments for contigs. The parallel sequence assembly program will be able to use the computing power of many computers to assemble tens of thousands of DNA fragments into sequences with low error rates. The investigator will also improve the multiple sequence alignment program by addressing reading frame shifts in comparisons of protein, cDNA, and genomic DNA sequences.
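The overlap-detection step named in the abstract can be illustrated with a minimal sketch. It is not the investigator's program: it assumes exact suffix-prefix matches (a real assembler must tolerate sequencing errors and consider reverse complements), and it uses a Python thread pool only to show how fragment pairs might be divided among workers, as a stand-in for PVM tasks on networked machines. The names `suffix_prefix_overlap`, `overlaps_for_pair`, `detect_overlaps`, `min_len`, and `workers` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def suffix_prefix_overlap(a, b, min_len=3):
    """Return the length of the longest exact suffix of `a` that is also
    a prefix of `b`, or 0 if no such overlap of at least `min_len` exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def overlaps_for_pair(pair, fragments, min_len):
    """Check one unordered pair of fragment indices in both orientations."""
    i, j = pair
    hits = []
    for x, y in ((i, j), (j, i)):
        n = suffix_prefix_overlap(fragments[x], fragments[y], min_len)
        if n > 0:
            hits.append((x, y, n))
    return hits


def detect_overlaps(fragments, min_len=3, workers=4):
    """Distribute all fragment pairs over a pool of workers and collect
    (left_index, right_index, overlap_length) tuples in sorted order."""
    pairs = list(combinations(range(len(fragments)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(
            lambda p: overlaps_for_pair(p, fragments, min_len), pairs
        )
        return sorted(hit for hits in results for hit in hits)


if __name__ == "__main__":
    # Toy fragments chosen so that each one overlaps the next by 4 bases.
    reads = ["ACGTTGCA", "TGCAGGTC", "GGTCAATG"]
    print(detect_overlaps(reads))
```

Each pair is independent of every other pair, which is what makes this step a natural candidate for parallelization. In CPython a thread pool does not provide true CPU parallelism, so the sketch shows only the partitioning of work, not the speedup a distributed version would aim for.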

Agency
National Institutes of Health (NIH)
Institute
National Human Genome Research Institute (NHGRI)
Type
Research Project (R01)
Project #
7R01HG001502-04
Application #
6209660
Study Section
Special Emphasis Panel (ZRG2-GNM (03))
Project Start
1996-08-01
Project End
2001-03-22
Budget Start
1999-08-28
Budget End
2001-03-22
Support Year
4
Fiscal Year
1998
Total Cost
Indirect Cost
Name
Keck Graduate Institute of Applied Life Sciences
Department
Type
DUNS #
011116907
City
Claremont
State
CA
Country
United States
Zip Code
91711
Huang, Xiaoqiu; Brutlag, Douglas L (2007) Dynamic use of multiple parameter sets in sequence alignment. Nucleic Acids Res 35:678-86
Wang, Jianmin; Huang, Xiaoqiu (2005) A method for finding single-nucleotide polymorphisms with allele frequencies in sequences of deep coverage. BMC Bioinformatics 6:220
Ye, Liang; Huang, Xiaoqiu (2005) MAP2: multiple alignment of syntenic genomic sequences. Nucleic Acids Res 33:162-70
Huang, Xiaoqiu; Ye, Liang; Chou, Hui-Hsien et al. (2004) Efficient combination of multiple word models for improved sequence comparison. Bioinformatics 20:2529-33
Lin, Yaw-Ling; Huang, Xiaoqiu; Jiang, Tao et al. (2003) MAVG: locating non-overlapping maximum average segments in a given sequence. Bioinformatics 19:151-2
Huang, Xiaoqiu; Chao, Kun-Mao (2003) A generalized global alignment algorithm. Bioinformatics 19:228-33
Huang, Xiaoqiu; Wang, Jianmin; Aluru, Srinivas et al. (2003) PCAP: a whole-genome assembly program. Genome Res 13:2164-70
Huang, X; Madan, A (1999) CAP3: A DNA sequence assembly program. Genome Res 9:868-77
Huang, X; Adams, M D; Zhou, H et al. (1997) A tool for analyzing and annotating genomic sequences. Genomics 46:37-45