Fast and sensitive sequence homology searches are fundamental tools in molecular biology. Our understanding of the human genome sequence depends in part on comparative sequence analysis of more experimentally accessible model organisms, and indeed on sequence comparisons across the tree of life. This proposal describes a plan to support two software packages for sequence homology search and alignment, HMMER and Infernal. HMMER is for protein and DNA sequence comparison, and it underlies many protein domain family databases and many genome sequence annotation procedures. Infernal is for RNA secondary structure/sequence comparison, and it is the foundation of various RNA structure/sequence analysis tools including the Rfam database of RNA families. Recent developments (including a new collaboration with the EMBL European Bioinformatics Institute to provide HMMER web servers, an upcoming HMMER4 release with new memory-efficient algorithms, and an expansion of the development teams to multiple universities and sites) suggest that beyond their current niches in genome analysis, both software packages are in a position to increase the scope and importance of their applications.
To improve the foundation of software engineering in these packages, the proposal has three specific aims for improving speed, scaling, and support. The first aim focuses on speed improvements, especially in parallelization, both on typical desktop computers and on high performance computing resources. A measurable and important milestone of this aim is to make sequence homology searches run at interactive speeds (less than 1 second response time), the speed of a Google search, which could radically change the way biologists interact with sequence data.
The second aim focuses on scaling improvements. Biological sequence data are growing exponentially, and we will make sure that the software can handle (and help biologists visualize) very large numbers of significant homologs, up to millions and more.
The third aim focuses on improving support for the software, especially in improving our ability to engage a wider community of academic and industry developers who contribute to our codebases, and who use parts of our codebases in their own work.

Public Health Relevance

Interpreting the human genome sequence, or any other genome sequence, depends in part on recognizing evolutionarily related genes across the tree of life, especially in experimentally accessible model organisms. Computational tools for fast and sensitive sequence comparison are fundamental, and the exponentially growing scale of biological sequence data makes it essential that these computational tools are well engineered and highly efficient. This proposal describes a plan to support engineering of two widely used software packages: HMMER, for protein and DNA sequence comparisons, and Infernal, for RNA sequence/structure comparison.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG009116-04
Application #: 9736760
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sen, Shurjo Kumar
Project Start: 2016-09-16
Project End: 2020-06-30
Budget Start: 2019-07-01
Budget End: 2020-06-30
Support Year: 4
Fiscal Year: 2019
Total Cost:
Indirect Cost:
Name: Harvard University
Department: Microbiology/Immun/Virology
Type: Schools of Arts and Sciences
DUNS #: 082359691
City: Cambridge
State: MA
Country: United States
Zip Code: 02138
Potter, Simon C; Luciani, Aurélien; Eddy, Sean R et al. (2018) HMMER web server: 2018 update. Nucleic Acids Res 46:W200-W204
Nawrocki, Eric P; Jones, Thomas A; Eddy, Sean R (2018) Group I introns are widespread in archaea. Nucleic Acids Res 46:7970-7976
Kalvari, Ioanna; Argasinska, Joanna; Quinones-Olvera, Natalia et al. (2018) Rfam 13.0: shifting to a genome-centric resource for non-coding RNA families. Nucleic Acids Res 46:D335-D342