The Virginia Bioinformatics Institute at Virginia Tech is awarded a grant to develop GenoTHREAT, a software application that screens DNA sequences ordered from gene synthesis companies for potentially harmful sequences. GenoTHREAT implements the screening algorithm recommended by the Department of Health and Human Services in a document entitled "Screening Framework Guidance for Providers of Synthetic Double-Stranded DNA." It is important to characterize the relationship between the computational cost of the sequence screening algorithm, the rate of false positives (innocuous sequences that are incorrectly red-flagged), and the rate of false negatives (sequences of concern that the screening algorithm fails to detect). The screen is applied to a database composed of a mixture of publicly available sets of synthetic DNA sequences and annotated test cases designed for this project. This large-scale analysis will lead to the determination of optimal screen parameters that represent an acceptable compromise between the security concerns of the government and the operational constraints of gene synthesis companies.

The rapid progress in synthetic biology, demonstrated by the recent publication of the first synthetic cell by Craig Venter and his team, has raised biosecurity concerns among the public, its elected officials, and various administrations. This grant facilitates the adoption of the DNA screening algorithm recommended by the federal government to detect, in the order books of gene synthesis companies, sequences of concern that require further investigation. GenoTHREAT, the software developed with the grant, will be made broadly available (www.genothreat.org) to allow gene synthesis companies, users of synthetic DNA, and managers of bioinformatics resources for synthetic biology to implement the biosecurity screen recommended by the government. To provide another layer of biosecurity protection, GenoTHREAT will also be capable of screening DNA sequencing data. A sustainability plan for the maintenance and future development of GenoTHREAT that does not involve federal funding is being developed with industrial partners.

Project Report

Summary

This research project has its roots in the work of a team of five undergraduate students involved in a summer research project. The students formed a team enrolled in iGEM, a student competition in synthetic biology. The team was composed of three computer science students and two biology students, and it was supervised by a graduate student. The goal of the team was to implement the guidance published by the federal government to help gene synthesis companies detect, in their order books, DNA sequences that may present a biosecurity risk.

Since we received the NSF funding at the end of the summer research experience, the educational activities provided by the project included:

- Preparing the students to present their work at the iGEM Jamboree, an undergraduate research conference with more than 1,000 participants. The day after the Jamboree (11/9/10), the students gave a two-hour seminar to 30 representatives of 15 government agencies at FBI Headquarters. The audience was composed of the people who were involved in writing the guidance, assessing its effect, and evaluating the need for further regulation. The one-hour presentation was followed by an hour-long Q&A session. The students answered the questions on their own with depth and diplomacy.
- Involving the students in the preparation and submission of a manuscript. All the students are co-authors of the paper resulting from their work. They were involved in the write-up and the preparation of the figure, and they were kept informed of the interactions with the journals and reviewers.

We released the software as open source under the Apache License, Version 2.0. The source code was deposited on SourceForge.

Intellectual Merit

Our work confirmed that we were the first to implement this screening algorithm. We did not find any evidence that anyone in the government or in industry had another implementation of the screening algorithm.
Our work showed that implementing the guidance was possible but not trivial. The document published in the Federal Register provided most of the information necessary to implement the screen. The computational cost of running the screen was acceptable but not lightweight: it takes about one minute of computing time on a dedicated business-class server to analyze 1 kilobase of DNA.

One of the motivations for proposing the Best Match method was to overcome the need to develop and maintain a database of curated sequences representing potential threats. However, in order to detect sequences of concern, it is still necessary to develop a database of keywords used to associate GenBank records with select agents. Developing a database of keywords is easier and faster than developing a database of sequences, but the need for a database has not been completely eliminated, contrary to what the government claimed.

We also discovered major weaknesses of the guidance. The result of the screen is very dependent on the keywords used to analyze the BLAST results. Different sets of keywords result in different screening outcomes with very different false positive and false negative rates. The lack of an official test suite makes it difficult to tune the screening algorithm.

These results were published in Nature Biotechnology in March 2011. They were also presented orally at ISMB (the premier bioinformatics conference) in the track that highlights the most significant publications of the last 12 months.

Broader Impacts

The grant allowed us to train five undergraduate students and one graduate student in the presentation of high-impact scientific results with policy implications to different audiences (peers, government) and in different formats (oral, poster, article).
In addition to presenting our results to the government, we presented them to four major gene synthesis companies (GeneArt, DNA2.0, IDT, Life Technologies) and two large biotechnology companies (Dow AgroSciences, DuPont) to get feedback from prospective users. These presentations took the form of day-long on-site visits. The picture that emerged from these meetings is complex.

- Companies do not want a software application that automatically flags sequences in different categories. They want human operators to make the decision. The Best Match algorithm makes it easy to automatically categorize sequences into innocuous sequences and sequences of concern. However, the number of sequence alignments generated by the algorithm makes it difficult to understand why a sequence is flagged and how to interpret the screen result.
- Companies want the means to be in compliance. They are demanding more regulation and certified solutions. The voluntary nature of the guidance and the lack of a specific screening objective provide no incentive to implement the guidance.
- Biosecurity is just one aspect of the broader issue of compliance. Companies operate in a complex landscape of regulations and policies that they have to comply with. They need an integrated solution that can properly identify export control, labeling, or safety issues in addition to security.
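
The keyword sensitivity noted under Intellectual Merit can be illustrated with a small sketch. It is not GenoTHREAT code: it assumes a simplified rule in which a sequence is flagged when the description of its best BLAST match contains any keyword, and every name (ScreenCase, is_flagged, error_rates), every test description, and both keyword sets are hypothetical and made up for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Sequence, Tuple


@dataclass(frozen=True)
class ScreenCase:
    """One annotated test sequence, reduced to what the keyword step needs."""
    best_match_description: str  # made-up description of the top BLAST hit
    is_concern: bool             # ground-truth annotation of the test case


def is_flagged(description: str, keywords: Iterable[str]) -> bool:
    """Assumed rule: flag when the best-match description contains a keyword."""
    text = description.lower()
    return any(keyword.lower() in text for keyword in keywords)


def error_rates(cases: Sequence[ScreenCase],
                keywords: Iterable[str]) -> Tuple[float, float]:
    """Return (false positive rate, false negative rate) for one keyword set.

    Assumes the test set holds at least one innocuous case and one case of
    concern; otherwise the divisions below would fail.
    """
    keywords = list(keywords)
    innocuous = [c for c in cases if not c.is_concern]
    concern = [c for c in cases if c.is_concern]
    false_pos = sum(is_flagged(c.best_match_description, keywords) for c in innocuous)
    false_neg = sum(not is_flagged(c.best_match_description, keywords) for c in concern)
    return false_pos / len(innocuous), false_neg / len(concern)


# Hypothetical annotated test cases (illustrative only, not project data).
CASES = [
    ScreenCase("Bacillus anthracis protective antigen gene", True),
    ScreenCase("Yersinia pestis plasmid sequence", True),
    ScreenCase("Bacillus subtilis ribosomal protein gene", False),
    ScreenCase("Escherichia coli K-12 lacZ gene", False),
]

# Two hypothetical keyword sets: one narrow, one broad.
NARROW = {"anthracis"}
BROAD = {"bacillus", "yersinia"}

print(error_rates(CASES, NARROW))  # (0.0, 0.5): misses the Yersinia pestis case
print(error_rates(CASES, BROAD))   # (0.5, 0.0): flags the harmless Bacillus subtilis case
```

On this toy test set the narrow keyword set yields no false positives but misses half of the sequences of concern, while the broad set catches every sequence of concern at the price of flagging an innocuous one. Without an agreed test suite, there is no common basis for choosing between such keyword sets.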

Agency: National Science Foundation (NSF)
Institute: Division of Biological Infrastructure (DBI)
Type: Standard Grant (Standard)
Application #: 1060776
Program Officer: Julie Dickerson
Project Start:
Project End:
Budget Start: 2010-09-01
Budget End: 2011-08-31
Support Year:
Fiscal Year: 2010
Total Cost: $99,819
Indirect Cost:
City: Blacksburg
State: VA
Country: United States
Zip Code: 24061