We propose to create a new system, the Terabase Search Engine (TSE), that will make it possible for biomedical researchers to search all human DNA sequences that have been sequenced and deposited in public archives. The vast and growing resource of human DNA sequences provides a wealth of opportunities for scientific discovery and for validation of results, but the size of the data sets has already far exceeded the ability of most researchers to use them. For more than two decades, geneticists have relied on DNA sequence databases for a wide range of scientific endeavors, including the discovery of new genes and new mutations, the investigation of evolutionary changes within and between species, and the study of the forces affecting chromosomal structure and change and of many other molecular and evolutionary processes. The ability to search all known genes and genomes using BLAST and similar programs has long been assumed, and sequence search engines throughout the world provide this ability. However, the raw data pouring out of next-generation sequencing (NGS) projects has exceeded our ability to provide rapid access to it. A single NGS instrument can generate six billion reads encompassing 600 billion bases in a single run, and this capacity is still growing. Traditional alignment programs like BLAST cannot sort through this data in a reasonable amount of time. Newer programs such as Bowtie (developed by our group) align NGS reads to the genome far faster, but the data sets, now in excess of 1 trillion reads, are too large for most computers to store, and even the fastest alignment programs available today could not search all of this data in a reasonable amount of time. A new approach is required in order to serve up these huge and hugely valuable DNA sequences to the research community. The Terabase Search Engine will be a new, highly efficient system for searching trillions of bases in real time.
Using a hierarchical search strategy with extensive pre-processing to speed up response time, the TSE will allow a scientist to align any sequence, human or non-human, to all publicly available human sequence reads. Reads that match the human genome will be indexed and stored on very high-speed disks for rapid retrieval. Reads that match microbial sequences will be captured and stored separately for use in microbiome and infectious disease research. The system will be made available through a user-friendly web interface, and a local database will store each user's results for further analysis on the TSE site or for download to a local site. This system will make it possible, for the first time, for any scientist to align a sequence to the complete set of human DNA sequences and to retrieve everything that matches, without the need to write special-purpose programs or to use complex cloud-based software interfaces. All of the software for this project will be developed under an open-source model that will permit others to use, modify, share, and redistribute the code without restriction.
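As a rough illustration of how pre-processing can speed up a two-stage search, the following minimal Python sketch indexes every k-mer of a small set of reads ahead of time, uses that index to find candidate placements for a query, and then verifies each candidate by counting mismatches. It is not the TSE design, which is not specified here: the seed length `K`, the functions `build_index` and `search`, and the mismatch threshold are all hypothetical names and values chosen for this example, and a production system would use a compressed index rather than an in-memory dictionary.

```python
from collections import defaultdict

K = 4  # hypothetical seed length; a real system would use a much larger k


def build_index(reads, k=K):
    """Pre-processing: map each k-mer to the (read id, offset) pairs where it occurs."""
    index = defaultdict(list)
    for rid, read in enumerate(reads):
        for i in range(len(read) - k + 1):
            index[read[i:i + k]].append((rid, i))
    return index


def search(query, reads, index, k=K, max_mismatches=1):
    """Two-stage search: seed lookup for candidates, then mismatch verification."""
    # Stage 1: every shared k-mer implies a placement of the query start in a read.
    candidates = set()
    for q in range(len(query) - k + 1):
        for rid, r in index.get(query[q:q + k], ()):
            candidates.add((rid, r - q))

    # Stage 2: keep placements that lie fully inside the read and have few mismatches.
    hits = []
    for rid, start in sorted(candidates):
        read = reads[rid]
        if start < 0 or start + len(query) > len(read):
            continue
        window = read[start:start + len(query)]
        mismatches = sum(a != b for a, b in zip(query, window))
        if mismatches <= max_mismatches:
            hits.append((rid, start, mismatches))
    return hits


if __name__ == "__main__":
    # Toy reads invented for this example only.
    reads = ["ACGTACGTTAGC", "TTTTGGGGCCCC", "GGACGTACCTTA"]
    index = build_index(reads)
    for rid, start, mismatches in search("ACGTACGT", reads, index):
        print(rid, start, mismatches)
```

The index is built once and reused for every query, so the cost of each search depends on the number of candidate placements rather than on the total number of reads.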
The 1000 Genomes Project and many other human sequencing projects are generating extremely valuable data about human genetics and human disease. The data is already so voluminous that very few scientists are able to download or search it, a problem that will only get worse. This project will make the raw data searchable for the first time, allowing scientists to discover DNA sequences that might be responsible for human traits and diseases that are not yet understood.