Advances in DNA sequencing technologies have paved the way for a revolution in the biological and medical sciences. By sequencing the human genome, one can learn about the genetic basis of many diseases and use this information to develop treatments and preventative care. By sequencing the genomes of viruses and bacteria, one can obtain key insights into the mechanisms of infectious diseases. But the acquisition, processing, and analysis of large amounts of genomic data raise several fundamental questions: (i) how much sequencing data must be collected to reliably learn the genome of a species? (ii) how much can genomic sequencing data be compressed while maintaining its usefulness? (iii) how do sequencing errors impact the ability to perform biologically valid inference? The goal of this project is to develop a framework that establishes the informational limits of genomic data science problems, that is, what genomic data can and cannot reveal. This framework will guide the development of computationally efficient algorithms that process genomic data in an information-optimal way. The project will also mentor students from underrepresented groups and provide them with research opportunities in the field of genomics. The research efforts will directly shape the contents of an undergraduate course on data science, which, in turn, will produce materials (lecture notes, educational videos, data assignments, open-source code) that will be used to disseminate reliable information about genomic technologies to the community.
A distinctive aspect of genomic data science is that it operates mainly on sequence data. The proposed research will be organized along three main thrusts, each focused on a key data science task that arises when dealing with genomic sequence data: (1) aligning pairs of sequences, (2) reconstructing sequences from noisy fragments, and (3) clustering sequences based on appropriate metrics. Since the pairwise alignment of large numbers of noisy sequences is often a bottleneck in genomic data science, the first thrust will study how low-dimensional representations of these sequences, or sketches, can be optimally used to compute alignments; a minimal illustration of the sketching idea appears after this paragraph. A source-coding framework will be leveraged to study the tradeoff between sketch size and the distortion incurred in alignment computation. The second thrust, on sequence reconstruction, will tackle the fact that computational complexity obstacles such as NP-hardness often fail to capture the difficulty of real-world problem instances. A notion of instance-based informational hardness will be introduced to enable the development of efficient algorithms with instance-specific theoretical guarantees. Finally, the third thrust will study the problem of clustering sequences in the context of metagenomic sequencing, where the goal is to determine which sequences come from the same microbial genome. Information-theoretic metrics for clustering metagenomic sequencing data will be introduced and used in algorithms that seek to resolve microbial communities at the maximum resolution allowed by the data.
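To make the sketching idea concrete, the example below is a minimal illustrative sketch, not the project's proposed scheme: it uses MinHash over k-mers, a standard technique popularized by tools such as Mash, to estimate the similarity of two DNA sequences from small sketches instead of a full alignment. The parameters k and s and all function names are illustrative assumptions.

    # A minimal MinHash sketch (illustrative, not the project's method):
    # hash every k-mer of a sequence and keep the s smallest hash values.
    # Comparing two such sketches estimates the Jaccard similarity of the
    # underlying k-mer sets, a cheap proxy for alignment-based similarity.
    import hashlib

    def kmer_set(seq, k=8):
        # All distinct substrings of length k.
        return {seq[i:i + k] for i in range(len(seq) - k + 1)}

    def minhash_sketch(seq, k=8, s=64):
        # Keep the s numerically smallest 64-bit hashes of the k-mers.
        hashes = sorted(
            int.from_bytes(hashlib.blake2b(m.encode(), digest_size=8).digest(), "big")
            for m in kmer_set(seq, k)
        )
        return hashes[:s]

    def sketch_similarity(sk_a, sk_b, s=64):
        # Bottom-s estimator: fraction of the s smallest hashes of the
        # union that appear in both sketches.
        union = sorted(set(sk_a) | set(sk_b))[:s]
        shared = len(set(sk_a) & set(sk_b) & set(union))
        return shared / len(union)

    seq_a = "ACGTGACCTGAAACGTTGCAGGATTACCAGT" * 3
    seq_b = seq_a[:60] + "TTGGCCAATTGGCCAACCGGTTAA"  # shared prefix, divergent tail
    print(sketch_similarity(minhash_sketch(seq_a), minhash_sketch(seq_b)))

Sketches with provable size-distortion tradeoffs, as proposed in the first thrust, would take the place of the plain MinHash used here.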
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.