This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

In whole genome shotgun assembly, experimentally obtained random pieces of the genome, called sequence reads (typically about 500 bases long), are put back together, thereby reconstructing the sequence of an organism's genome. Because of the scale of the data (70 million reads at full coverage for a 3 gigabase mammalian genome) and the repetitive nature of the genome, this problem is extremely difficult. Even smaller genomes with 3- to 5-fold coverage can pose serious computational challenges in terms of memory and processing power.

The Whitehead Institute/MIT Center for Genome Research is part of an NIH-funded consortium to sequence the mouse genome. By the end of 2001, this consortium will have produced about 35 million sequence reads, about half of which are from the Whitehead. The Center has also developed a software system (Arachne) for the assembly of genomes, and tested it on the existing data (about 17 million reads). The memory demand scales linearly with the number of reads, which is proportional to the product of the size of the genome and the degree of coverage. As more reads become available, they expect to continue with incremental assemblies using the additional data.

Importance of the problem. Sequence for the mouse genome will facilitate the discovery of features in the human genome. It will also facilitate research about the mouse. Together, these two purposes make the mouse sequence of fundamental importance to the biological and biomedical communities.

Computational requirements.
An assembly of 17 million reads required about 5 days and used up to 29 GB of memory on a Compaq ES40 667 MHz machine. Prior experience suggests that the problem scales approximately linearly, so we anticipate that 35 million reads will require a running time of 10 days (which should be reduced to perhaps 5 days because the processors will be faster), and memory usage of about 60 GB.

In addition to the mouse genome, the Center has an active program in sequencing and assembling other organisms. Currently the Center produces about 45 million lanes (or reads) of sequence a year. Organisms recently sequenced or currently in the sequencing pipeline include Methanosarcina, Neurospora, Tetraodon, and Ciona. The proposed resource will increase the rapidity with which they can assemble and release these genomes to the community. The computational requirements for assembling these other genomes are lower than those for the mouse. For Tetraodon and Ciona (whose genomes are substantially larger than those of Methanosarcina and Neurospora), they expect running times of two to three days and memory usage of 10 to 15 GB. They will want to repeat each assembly many times, each time experimenting with the algorithms. In general, these experiments lead to code improvements which apply to all genomes.

PSC will make this Whitehead software into a service for other groups doing linkage analysis or whole genome assembly. Besides making the software and computer time available, the Research Resource at PSC will develop a biomedical training workshop focused on these codes, to make the techniques more widely known throughout the genomic community.
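
The resource estimates under "Computational requirements" follow from simple proportional scaling, which can be sketched as below. This is only an illustrative back-of-the-envelope calculation, not part of Arachne: the function names are hypothetical, and the code assumes strictly linear scaling in the number of reads and a uniform read length of 500 bases, as the text suggests.

```python
# Illustrative sketch only; function names are hypothetical, not from Arachne.
# Assumes strictly linear scaling and a uniform read length of 500 bases.

def fold_coverage(num_reads, genome_bases, read_length=500):
    """Fold coverage implied by a given number of reads."""
    return num_reads * read_length / genome_bases


def scale_linearly(baseline_value, baseline_reads, target_reads):
    """Extrapolate a resource figure in proportion to the read count."""
    return baseline_value * target_reads / baseline_reads


# Baseline from the text: 17 million reads took about 5 days and 29 GB.
baseline_reads = 17e6
target_reads = 35e6

memory_gb = scale_linearly(29, baseline_reads, target_reads)
run_days = scale_linearly(5, baseline_reads, target_reads)

# Coverage implied by 70 million reads on a 3 gigabase genome.
full_coverage = fold_coverage(70e6, 3e9)

print(f"Estimated memory: {memory_gb:.1f} GB")
print(f"Estimated running time on the same hardware: {run_days:.1f} days")
print(f"Implied full coverage: {full_coverage:.1f}-fold")
```

The extrapolated values round to the 60 GB and 10 days quoted above. The running-time figure applies to the same hardware, so it does not include the anticipated speedup from faster processors.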