The question of how species are related is one of long-standing interest. Only when species relationships are understood is it possible to determine how and why species have changed over time (i.e., by comparing species in light of how closely they are related). Surprisingly, despite extensive study, the relationships among some groups of species remain controversial. For example, in the last six years, large efforts to understand how birds are related by sequencing their genomes have produced conflicting results. This research project seeks to understand why, despite sequencing large amounts of genomic data for many species, scientists continue to struggle to understand some species relationships. In this project, subsets of genome sequence data will be examined to determine which types of data provide accurate information about species relationships, and which data confound the understanding of relationships and instead indicate other evolutionary patterns. Using this knowledge, software will be developed that allows biologists to process genomic data rapidly and to understand accurately the relationships among the species they work on, and therefore the ways in which these groups evolved. As this research highlights the need for current and future scientists to be able to work with large and complex datasets, the project will (1) provide computational training for hundreds of biologists, and (2) support the development of computational and research skills for diverse undergraduates as part of an inclusive community, with the objective of supporting their future research careers. Research-based courses for undergraduates will reduce the hurdles faced by students who have limited resources to engage in unpaid independent research, or who may be excluded from paid opportunities due to a lack of experience.
The goal of this research is to move phylogenetics (i.e., the understanding of species relationships) from using more data to using data with optimal information about species relationships. First, subsets of the genome that support well-established species relationships will be identified. These subsets will be used to evaluate support for alternative hypotheses among species with more controversial relationships, and to identify the correct relationships. Second, machine learning methods will be used to determine, based on the characteristics of genomic subsets, which subsets result in more accurate phylogenies. Finally, these approaches to data filtering will be automated in freely available, easy-to-use, open-source software, to facilitate their use in research projects relying on accurate estimates of species relationships. Computational training will be provided through semi-annual workshops for researchers; training for others to lead similar workshops around the world will also be provided. Undergraduates’ computational and research skills will be developed through courses using real genomic datasets for novel research. Together, the research and education components of this project will support a greater number and diversity of researchers who have the skills and software to effectively conduct research in evolutionary biology and in other areas of research that rely on computational skills to analyze large datasets. Regular updates on this research will be made available at https://schwartzlaburi.github.io/. This project is jointly funded by the Cyber-infrastructure for Biological Research program and the Established Program to Stimulate Competitive Research (EPSCoR).
This award reflects NSF's statutory mission and has been deemed worthy of support through evaluation using the Foundation's intellectual merit and broader impacts review criteria.