The biological interpretation of data generated by large-scale sequencing projects is plagued by a high rate of false-positive results concerning the location of exons and coding regions and/or the statistical significance of similarities found with existing databases. This leads to unmanageably large program outputs whose size and noise-to-signal ratio obscure the truly relevant findings. The problem can be attributed to large fluctuations in the local information content of both natural and database sequences. After classifying the redundancies (repeats) into three categories, we developed two programs (XNU and Xblast) that are now routinely used in an information-enhancement step prior to the analysis of large bodies of sequence data. Following this step, the output of gene identification and sequence comparison programs becomes biologically and statistically interpretable without further processing. The power of this approach was illustrated by the analysis of large human genomic contigs (90 kb from the HLA class III region on chromosome 6, 67 kb from the Xp22.3 region) as well as by the analysis of large EST data sets.
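
The idea of an information-enhancement step can be illustrated with a minimal sketch that masks windows of low local information content before a sequence is passed to a search program. This is not the algorithm used by XNU or Xblast, which the abstract does not describe; it is a simple Shannon-entropy filter substituted for illustration. The function names, the window size, the entropy threshold, and the mask character are all assumptions.

```python
import math
from collections import Counter


def shannon_entropy(segment):
    """Shannon entropy (bits per residue) of the residue composition of a segment."""
    counts = Counter(segment)
    n = len(segment)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mask_low_complexity(seq, window=12, min_entropy=2.0, mask_char="X"):
    """Replace every residue that falls in a low-entropy window with mask_char.

    Illustrative only: window, min_entropy and mask_char are assumed values,
    not parameters taken from XNU or Xblast.
    """
    masked = list(seq)
    # Slide a fixed-size window along the sequence; sequences shorter than
    # the window are returned unchanged.
    for start in range(0, max(len(seq) - window + 1, 0)):
        if shannon_entropy(seq[start:start + window]) < min_entropy:
            for i in range(start, start + window):
                masked[i] = mask_char
    return "".join(masked)


if __name__ == "__main__":
    # Hypothetical protein fragment with a 20-residue glutamine run in the middle.
    example = "MKTAYIAKQRQISFVKSHFSRQ" + "Q" * 20 + "LEERLGLIEVQAPILSRVGDGT"
    print(mask_low_complexity(example))
```

Masking rather than deleting the residues keeps the sequence length and coordinates intact, so any hits reported by a downstream comparison program still map back to positions in the original sequence.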

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000010-01
Application #: 3845101
Support Year: 1
Fiscal Year: 1992
Name: National Library of Medicine
Country: United States