Optical mapping is a laboratory technique for constructing ordered, high-resolution optical maps from stained molecules of DNA. This type of data has grown in popularity as its commercial production has improved in quality, cost, and throughput. For example, BioNano Genomics released a new generation of optical mapping technology called the Irys System in 2015, which has been used to uncover diploid variation in the human genome. However, the raw optical mapping data, called Rmaps, are not useful on their own and must first be assembled into a genome-wide optical map, a computational process that has very few nonproprietary solutions. The optical map assembly problem, which aims to stitch together the Rmap data into a genome-wide optical map, has some similarities to genome assembly, which aims to build contiguous sequences corresponding to the genome of interest from short sequence reads. Despite this similarity, significant differences have prevented the direct application of genome assemblers to optical map assembly. The main objective of this proposed work is to build a scalable and efficient optical map assembler through the exploration and adaptation of genome assembly algorithms and methods. This research objective will be complemented by an education plan that aims to broaden participation in computer science by creating research opportunities for female graduate students as well as high school seniors. The plan is to redevelop the existing bioinformatics graduate course so that it (1) can be cross-listed with the Department of Biology, and thus increase female enrolment, and (2) is project based, and thus fosters supportive relationships between female students. The team will conduct comprehensive surveys of graduate students in the redeveloped course, as well as in other graduate courses in computer science, to determine whether the changes have an impact. We plan to disseminate our findings to other institutions.

Even with high coverage and a range of insert sizes, genome assembly and structural variation detection remain unreliable computational processes when based on short read data alone, because of repetitive regions in the genome. Optical maps, which are ordered, genome-wide, high-resolution restriction maps that specify the positions of occurrence of one or more short nucleotide sequences, are one type of complementary data. The raw optical mapping data identified by the image processing is an ordered sequence of fragment lengths. The (unassembled) optical mapping data produced by the system are referred to as Rmaps and are analogous to sequence reads in standard genome sequencing. Optical mapping has gained popularity due to (1) the ability to automate the generation of the data, and (2) its use in several large sequencing projects, including those of goat, budgerigar, and Amborella trichopoda. With this popularity comes a growing need for means to analyze optical mapping data. The goal of this proposed work is to build an optical map assembler that is efficient for genomes of various sizes, accepting a set of Rmaps as input and returning an assembled genome-wide optical map. We have divided this larger goal into the following intermediate research objectives, which we plan to tackle in succession: (1) develop an Rmap error correction method; (2) create a robust alignment algorithm; and (3) create a succinct graph representation of a genome-wide optical map. Our algorithmic contributions are not limited to optical mapping data but can be extended to other types of data, e.g., PacBio and genetic linkage maps. We will actively engage in knowledge transfer through these research objectives and our outreach and education plan. These efforts will redevelop a bioinformatics course in order to attract a greater number of female students and foster collaborative relationships, and will continue an ongoing high school outreach program. In addition to these activities, we will survey the students in the redeveloped course and in other graduate classes to see whether the changes had an impact, and mentor a senior high school student researcher during the summers.
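
As an illustration of the data described above, the following is a minimal sketch of how an idealized, error-free Rmap could be derived in silico from a known sequence: locate every occurrence of a short recognition sequence and report the ordered fragment lengths between consecutive sites. The function names, the toy sequence, the choice of GAATTC as the recognition sequence, and the convention of cutting at the start of each site are assumptions made for this example only. It is not the method proposed in this work, and it models none of the sizing errors or missing and spurious cut sites found in real optical mapping data, where fragments are typically thousands of bases long.

```python
def site_positions(sequence, site):
    """Return the 0-based start position of every occurrence of `site`."""
    positions = []
    start = sequence.find(site)
    while start != -1:
        positions.append(start)
        # Resume the search one base further along so overlapping sites are found.
        start = sequence.find(site, start + 1)
    return positions


def fragment_lengths(sequence, site):
    """Return the ordered fragment lengths of an idealized restriction digest.

    Assumption: each cut falls at the first base of the recognition site.
    Zero-length fragments (a site at position 0) are dropped.
    """
    cuts = site_positions(sequence, site)
    boundaries = [0] + cuts + [len(sequence)]
    return [end - begin
            for begin, end in zip(boundaries, boundaries[1:])
            if end > begin]


if __name__ == "__main__":
    toy_sequence = "AAGAATTCCCCGAATTCTT"  # hypothetical 19-base example
    print(fragment_lengths(toy_sequence, "GAATTC"))  # prints [2, 9, 8]
```

The fragment lengths always sum to the length of the input sequence, which gives a quick sanity check on any such digest.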

Agency
National Science Foundation (NSF)
Institute
Division of Information and Intelligent Systems (IIS)
Type
Standard Grant (Standard)
Application #
1618814
Program Officer
Sylvia Spengler
Project Start
Project End
Budget Start
2016-10-01
Budget End
2021-09-30
Support Year
Fiscal Year
2016
Total Cost
$397,461
Indirect Cost
Name
University of Florida
Department
Type
DUNS #
City
Gainesville
State
FL
Country
United States
Zip Code
32611