Genomics, which studies the structure, function and evolution of DNA and RNA sequences of organisms, now has significant impact on every aspect of life sciences, such as agriculture, environment, medicine and biology. The rapid advance of sequencing technologies is one of the most important reasons behind the evolution of genomics research. Next-generation sequencing (NGS), which has significantly lowered the cost for sequencing DNA and RNA, has remarkably increased the application of genomics in every aspect of life sciences. More recently, we have seen the emergence of third-generation long-read Single-Molecule Sequencing (SMS) technologies from companies like PacBio and Oxford Nanopore. Unlike short (100-500 bp) NGS reads, the SMS reads have the distinguishing characteristics of long read length (2,000-50,000 bp), unbiased sequencing, a different type and frequency of random errors, and detection of additional modifications to the DNA bases, called epigenetic modification information. These characteristics make SMS reads useful in many genomics investigations, such as de novo genome assemblies (where there is no guiding framework available), methylation detection, gene isoform detection (small sequence changes that identify different alleles of a gene) and structural variation detection (large rearrangements in the organization of the genes). This project will develop efficient algorithms and tools to improve the effectiveness, usefulness and applicability domain of SMS reads. The successful completion of this project will significantly transform genomics research. The new tools will enable biologists to perform genomics studies, such as de novo assembly and global methylation detection, on large genomes using SMS. The tools will significantly lower the cost of analysis and increase the utility of the data for biologists so that they can advance their research. All algorithms, tools and demonstrations resulting from this project will be made publicly available to educators, researchers and students through our project website and GitHub. This project will be useful to train computer science students, including women and minority students, on bioinformatics problems and algorithm design.
Although SMS is now widely used in the genomics studies of small bacterial and archaeal genomes, the computational cost and high data volume currently prevent its use in the study of mid-to-large size genomes. The overall goal of this project is to develop fast algorithms and tools to investigate remedies for problems in three SMS applications: pairwise and reference alignment, error correction, and base modification detection. First, we will develop a tool for pairwise and reference genome alignments of SMS reads at least 5X faster than those currently available by designing and integrating fast k-mer matching, linear positional chaining and SIMD (Single-Instruction-Multiple-Data) based banded Smith-Waterman-Gotoh algorithms. Then, we will develop a linear space and linear time algorithm for reads alignment graph (RAG) based method, as well as a multiple reads alignment graph (MRAG) based method to efficiently correct processing for Oxford Nanopore technology data output. Furthermore, we will design an optimized and parallelized Spark pipeline for base modification detection using SMS reads, as well as a two-step classification method for effectively detecting base modification in SMS reads using neural networks. This research will substantially advance the state-of-the-art algorithms and tools for SMS reads. Project pages will be linked from https://people.cs.clemson.edu/~luofeng/research.html .
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.