The biological interpretation of data generated by large-scale sequencing projects is plagued by unmanageably large program outputs, whose size and noise-to-signal ratio obscure the truly relevant findings. This problem has several causes: the increasing size of the databases, their quality (redundancy, experimental and clerical errors), and intrinsic properties of biological sequences such as repeated elements and local compositional bias. To alleviate these problems, we have developed a set of programs that are now routinely used in an information enhancement step prior to the analysis of large bodies of sequence data. Following this step, the output of gene identification and sequence comparison programs becomes biologically and statistically interpretable without further processing. In addition, we have developed a suite of independent modules that can be used in sequence to automatically analyze large bodies of experimental data. The power of these tools has been demonstrated in collaborations with experimental groups generating large numbers of sequences (see project Z01-LM-00011-01-BRB). In the meantime, we are also developing new sequence analysis methods that can be applied both to the quality control of sequences already in the databases and to the interpretation of newly determined ones.

Agency: National Institutes of Health (NIH)
Institute: National Library of Medicine (NLM)
Type: Intramural Research (Z01)
Project #: 1Z01LM000010-03
Application #: 3759297
Support Year: 3
Fiscal Year: 1994

Name: National Library of Medicine
Country: United States