With the introduction of next-generation sequencing (NGS) technologies, we are facing an exponential increase in the amount of genomic sequence data. These methods have already begun to revolutionize genome research through low-cost, high-throughput genome sequencing. NGS technologies promise to impact a broad range of genetic applications. These include, but are not limited to, large-scale sequencing studies, polymorphism detection, small RNA analysis, metagenomics, comparative genomics, discovery of epigenetic variation (histone modification and methylation patterns), characterization of tumor DNA sequences, identification of mutant genes in disease pathways, and transcriptome profiling. Low-cost sequencing will affect the whole health care system, because advances in pharmacogenomics are expected to make sequencing of personal genomes a part of preventive and personalized medicine. The data throughput of these technologies is enormous: for example, each run of the Illumina Genome Analyzer produces up to 1 billion reads and more than 100 gigabases (Gb) of sequence data. Because these methods cost less, large genome centers have started to upgrade their sequencing capabilities and can now generate 500 gigabases of data per day when 40 instruments are used. Such large amounts of data overwhelm existing computational resources, and urgent action is needed to translate this rich new source of genomic information into medical benefit. The success of all medical and genetic applications of next-generation sequencing critically depends on computational technologies that can process and analyze these enormous amounts of sequence data quickly and energy-efficiently. The goal of this proposal is to develop such technologies by combining the benefits of enhanced software algorithms and specialized hardware accelerators.
Our proposed research aims to accelerate next-generation sequence analysis 1000-fold or more by combining our knowledge of genomic sequence analysis, algorithm development, and computer architecture/engineering. Our plan for processing unprecedented amounts of sequence data has three major components. First, we will develop and improve sophisticated software algorithms and tools that handle the large numbers of sequence reads generated by all major NGS platforms without sacrificing sensitivity, while correcting for the sequencing biases associated with each platform. Our algorithms will also map reads in the duplicated regions of the genome and report the underlying sequence variation, a capability that no other read-mapping tool currently offers and that is especially important for characterizing segmental duplications and structural variation. Second, we will boost the performance and efficiency of our algorithms (100- to 1000-fold) by accelerating the inherently parallel computations of the sequence search problem on massively parallel hardware engines available today, namely graphics processing units (GPUs). Finally, we will design specialized hardware architectures that speed up sequence analysis by orders of magnitude beyond that while reducing its energy consumption 100-fold or more. Our research will broadly impact large-scale genome studies such as the 1000 Genomes Project, the Cancer Genome Atlas Project, and the ENCODE Project, not only by letting them reach conclusions much faster but also by reducing their energy consumption and the cost of maintaining computation clusters for data analysis.
Our research, if successful, can eliminate the dependence of sequence analysis on large, power-hungry computing clusters and data centers, making sequence analysis significantly cheaper and more energy-efficient and bringing it within reach of the mainstream without the need to build large computational infrastructures. Together with further advances in sequencing technologies, research resulting from this proposal can help personal genomics become a reality: the advancement and application of pharmacogenomics will start the era of personalized medicine. Through ultrafast, energy-efficient, and cost-efficient sequence analysis, this study can pave the way to an unlimited number of new discoveries by making it feasible to analyze terabases of sequence data that existing computational processing power cannot currently handle.
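
As a rough illustration of why the sequence search problem described above is inherently parallel, the following minimal Python sketch maps a read by looking up short seeds in a hash index of the reference and then verifying each candidate location by counting mismatches. It is a toy example under stated assumptions, not the algorithms proposed here: the function names (`build_index`, `hamming`, `map_read`), the seed length, the mismatch threshold, and the reference string are all hypothetical, and only substitutions (no insertions or deletions) are considered.

```python
from collections import defaultdict


def build_index(reference: str, k: int) -> dict:
    """Hash every k-mer of the reference to the list of positions where it occurs."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def map_read(read: str, reference: str, index: dict, k: int, max_mismatches: int) -> list:
    """Return every reference position where the read aligns within max_mismatches.

    All locations are reported (not just the best one), so copies of a
    duplicated region show up as separate hits.
    """
    hits = set()
    # Use non-overlapping k-mers of the read as seeds.
    for offset in range(0, len(read) - k + 1, k):
        seed = read[offset:offset + k]
        for pos in index.get(seed, ()):
            start = pos - offset  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference):
                continue
            window = reference[start:start + len(read)]
            if hamming(read, window) <= max_mismatches:
                hits.add(start)
    return sorted(hits)


# Hypothetical reference containing two near-identical copies of a segment.
reference = "TTACGTGCAAGG" + "CCCC" + "TTACGTGCATGG"
index = build_index(reference, k=4)
print(map_read("ACGTGCAA", reference, index, k=4, max_mismatches=1))  # [2, 18]
```

Each call to `map_read` depends only on the read and the shared, read-only index, so millions of reads can in principle be processed independently. That independence is the property that makes this kind of workload a natural fit for massively parallel hardware.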

Public Health Relevance

Next-generation sequencing (NGS) technologies promise an era of preventive and personalized medicine through low-cost, high-throughput genome sequencing. The success of all medical and genetic applications of next-generation sequencing critically depends on computational technologies that can process and analyze the enormous amounts of sequence data quickly and energy-efficiently, without requiring large infrastructures to be built. The goal of this proposal is to develop such technologies by combining the benefits of enhanced software algorithms and specialized hardware accelerators.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG006004-02
Application #: 8286157
Study Section: Biodata Management and Analysis Study Section (BDMA)
Program Officer: Bonazzi, Vivien
Project Start: 2011-06-20
Project End: 2015-04-30
Budget Start: 2012-05-01
Budget End: 2013-04-30
Support Year: 2
Fiscal Year: 2012
Total Cost: $346,949
Indirect Cost: $45,564

Name: Carnegie-Mellon University
Department: Engineering (All Types)
Type: Schools of Engineering
DUNS #: 052184116
City: Pittsburgh
State: PA
Country: United States
Zip Code: 15213

Publications

Kim, Jeremie S; Senol Cali, Damla; Xin, Hongyi et al. (2018) GRIM-Filter: Fast seed location filtering in DNA read mapping using processing-in-memory technologies. BMC Genomics 19:89
Xin, Hongyi; Greth, John; Emmons, John et al. (2015) Shifted Hamming distance: a fast and accurate SIMD-friendly filter to accelerate alignment verification in read mapping. Bioinformatics 31:1553-60
Lee, Donghyuk; Hormozdiari, Farhad; Xin, Hongyi et al. (2015) Fast and accurate mapping of Complete Genomics reads. Methods 79-80:3-10
Hach, Faraz; Sarrafi, Iman; Hormozdiari, Farhad et al. (2014) mrsFAST-Ultra: a compact, SNP-aware mapper for high performance sequencing applications. Nucleic Acids Res 42:W494-500
Xin, Hongyi; Lee, Donghyuk; Hormozdiari, Farhad et al. (2013) Accelerating read mapping with FastHASH. BMC Genomics 14 Suppl 1:S13
Hach, Faraz; Numanagic, Ibrahim; Alkan, Can et al. (2012) SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics 28:3051-7
Hach, Faraz; Hormozdiari, Fereydoun; Alkan, Can et al. (2010) mrsFAST: a cache-oblivious algorithm for short-read mapping. Nat Methods 7:576-7
Alkan, Can; Kidd, Jeffrey M; Marques-Bonet, Tomas et al. (2009) Personalized copy number and segmental duplication maps using next-generation sequencing. Nat Genet 41:1061-7