A genome (the DNA in a cell) can be represented by a sequence of letters called "bases." A large genome can consist of billions of bases, but chemical techniques allow scientists to read only a few hundred bases at a time. The whole genome shotgun (WGS) assembly technique creates a draft of the sequence of a whole genome by selecting such short fragments at random from the genome, determining the sequence of each fragment, and then computationally re-assembling millions of these fragments. Two fragments are said to "overlap" if, based on a comparison of their sequences, it is plausible that they come from the same part of the genome. The goal of this project is to produce an extremely robust set of overlaps by combining sophisticated error-correction techniques with "localizing" fragments, i.e., validating an overlap by ensuring that both fragments come from the same vicinity of the genome. Two issues complicate the determination of which pairs of fragments overlap. First, most genomes contain many "repeat regions," i.e., two or more almost identical copies of long stretches of sequence. Thus, two fragments that do not actually overlap may look as if they do. Second, fragment reads contain many base errors: bases can be misread or missed entirely. These errors, combined with the fact that copies of a repeat region usually differ slightly, make it very difficult to distinguish a spurious overlap from a true overlap in which one or both fragments contain read errors. Thus, unless extreme care is taken, it is easy to accept a spurious overlap and thereby mistakenly connect distant parts of the genome. Preliminary results in collaboration with Celera Genomics, the Baylor College of Medicine, and The Institute for Genomic Research (TIGR) have demonstrated that the investigator's current techniques can already produce more assembled sequence at higher quality. The goal is to improve these techniques and make them widely available.
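The difficulty described above can be made concrete with a minimal sketch. It is not the investigator's method: the function name `best_overlap`, its parameters, and the toy fragments are hypothetical, and the sketch handles only misread bases (substitutions), not bases that were missed entirely. It looks for the longest suffix of one fragment that matches a prefix of another within a fixed mismatch budget.

```python
def best_overlap(a, b, min_len=8, max_mismatches=1):
    """Longest suffix of `a` matching a prefix of `b` within a mismatch budget.

    Hypothetical illustration only: substitutions are tolerated, but
    insertions and deletions (missed bases) are not handled.
    Returns (overlap_length, mismatches), or None if no overlap qualifies.
    """
    # Try the longest candidate overlap first, then shrink.
    for length in range(min(len(a), len(b)), min_len - 1, -1):
        suffix = a[-length:]
        prefix = b[:length]
        mismatches = sum(x != y for x, y in zip(suffix, prefix))
        if mismatches <= max_mismatches:
            return length, mismatches
    return None


# Toy fragments (assumed data): the last 8 bases of frag_a equal the
# first 8 bases of frag_b.
frag_a = "ACGTTGCAAGGCTTAC"
frag_b = "AGGCTTACGGATCCAT"
# Same as frag_b but with one differing base inside the overlap. The
# difference could be a read error, or frag_b_variant could come from a
# slightly different copy of a repeat elsewhere in the genome.
frag_b_variant = "AGGCTAACGGATCCAT"

print(best_overlap(frag_a, frag_b))                            # (8, 0)
print(best_overlap(frag_a, frag_b_variant, max_mismatches=0))  # None
print(best_overlap(frag_a, frag_b_variant, max_mismatches=1))  # (8, 1)
```

Exact matching rejects the variant fragment, which loses a true overlap if the difference is a read error. Tolerant matching accepts it, which creates a spurious overlap if the fragment actually comes from another repeat copy. Sequence comparison alone cannot tell the two cases apart, which is why additional evidence such as error correction and localization is needed.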
The determination and interpretation of genetic information is one of the great challenges of the twenty-first century. The genome, i.e., all the DNA in a cell, is the molecular basis of diversity and the cornerstone of genetic information. Draft genomes have been obtained for human, mouse, and some insects, fish, plants, and bacteria. This is a start, but a full understanding of biological processes cannot be had by studying the genomes of only a handful of species. The federal government is spending about $100 million per year generating sequence data. Sequencing a genome proceeds in two stages. In the first stage, millions of small pieces are sampled from the genome. The second stage is called "assembly": these pieces are re-assembled on a computer like a giant jigsaw puzzle. The puzzle is complicated by two facts. First, many of the puzzle pieces have small errors that make them fit poorly against pieces that they SHOULD fit with. Second, many pieces that should NOT go together actually fit together quite well. This makes it extremely difficult to assemble a genome correctly. There are two ways to reduce these ambiguities. The first is to generate more pieces. However, each new piece costs about $2, and one would need to generate millions of new pieces to have a significant effect on assembly quality. The investigators take the second route: they attempt to squeeze as much information as possible out of the existing pieces. This route is substantially cheaper, and there is still much room for improvement over existing techniques. The investigators are using sophisticated mathematics to discern with extreme precision which pairs of pieces fit together and which do not. Preliminary results of the investigators, obtained in collaboration with several large sequencing centers, have demonstrated that using their techniques to "pre-process" the pieces can produce more of the genome, with fewer errors.
This project aims to extend these ideas further and to make the resulting techniques freely accessible to all investigators. The impact on the federal genome (biotechnology) projects is potentially great.