Quantifying the immense diversity of microbial life poses significant new challenges. Statistical methods currently available to estimate microbial diversity are, at best, unproven, because they are not compatible with the highly skewed species abundance curves that are characteristic of most microbial communities. The objectives of this research are to 1) develop novel statistical approaches to estimate biodiversity, 2) test these methods using existing data sets, and 3) make these methods available to the community in the form of freely accessible, easy-to-use, sophisticated statistical software. The software tools will implement a wide variety of established and novel methods for estimating species richness, including parametric and nonparametric procedures and interactive graphical displays. These tools will be tested by applying them to existing data sets, and they will be used to analyze emerging global patterns of microbial biodiversity. The intellectual merit of the project is that these tools have the potential to transform microbial biodiversity research by providing reliable biodiversity metrics with meaningful standard errors. The broader impacts of the project include student training and the creation of tools that will serve a growing cadre of biostatisticians, bioinformaticians, and microbial ecologists.

Project Report

Assessment of biodiversity is a critical part of our stewardship of the natural resources of our planet. Quantitative tracking of biodiversity allows us, for example, to model large-scale biogeochemical processes such as the activity of microbial organisms in the oceans, to estimate extinction rates, and to identify at-risk rare species. On a different scale, there is currently intense interest in the microbial communities that inhabit the human body, called the "human microbiome," and it is thought that the biodiversity of these communities plays a crucial role in human health.

To investigate biodiversity in these different scenarios, researchers collect samples of organisms and sort them into species (or other groups). They then count how many species appear only once in the sample (the rarest items), how many appear twice, three times, and so on. Statisticians then use these count data to estimate how many species there are in the total population: both the species that were observed in the sample and those that escaped the sample collection. The problem of exactly how to use the data to estimate the total number of species, or the "species richness," is difficult and has been studied by statisticians since at least 1943. Various solutions have been proposed, depending on what assumptions one is willing to make about the population, and some of these have proved to be practical and reliable in a broad range of research situations.

However, to use the solutions, researchers require fast, flexible, and easy-to-use software, and this has been generally lacking in the past. In this project we developed a new software package, called CatchAll, which puts together all of the most reliable solutions to species richness estimation in one place. It is fast (running in at most a few seconds), flexible (it can be used on any computer system), and easy to use (in the simplest case the user needs only to supply the input count data, hit "go," and collect the output files containing the results). We have now provided CatchAll, along with a comprehensive user manual and other files, on a website (www.northeastern.edu/catchall) for free download by the scientific community. CatchAll is up to date (now in version 3.0), featuring the latest analysis methods (some of which are only now being published in the research literature), and it is in the process of being incorporated into the main software "pipelines" used by biological researchers, for even greater flexibility and ease of use.

Biodiversity assessment is not the only application of CatchAll. Fundamentally the same statistical problem occurs when one wishes to estimate the size of a population from a sample of individuals who are observed or registered repeatedly: this is called "capture-recapture" data. In this setting there are many important applications beyond biodiversity: estimating the sizes of different kinds of elusive human populations such as drug addicts, sufferers from rare diseases, or persons engaging in crime or illicit activity; or the numbers of certain animals such as sheep or cattle with prion diseases (scrapie or mad cow); or the numbers of rare phenomena of various types such as rare coins or even astronomical events. The applications are essentially unlimited, and of considerable social and economic importance. The researchers on this project, that is, the CatchAll team, are very pleased that our software is rapidly becoming the standard in its field of application.
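
As an illustration of the counting step described in this report (how many species appear once, twice, three times, and so on), the following minimal Python sketch builds those frequency counts from a list of species labels and applies the bias-corrected Chao1 estimator, a well-known nonparametric lower bound on species richness. This is not CatchAll's code or interface; the function names and the toy sample are hypothetical and chosen only for illustration, and Chao1 stands in here for the broader family of estimators the report refers to.

```python
from collections import Counter


def frequency_counts(labels):
    """Return {j: number of species observed exactly j times}."""
    abundance = Counter(labels)            # species -> times observed
    return dict(Counter(abundance.values()))


def chao1(freq):
    """Bias-corrected Chao1 estimate of total species richness.

    freq maps j -> number of species seen exactly j times.
    Uses only the singleton (f1) and doubleton (f2) counts.
    """
    observed = sum(freq.values())          # species seen at least once
    f1 = freq.get(1, 0)
    f2 = freq.get(2, 0)
    return observed + f1 * (f1 - 1) / (2 * (f2 + 1))


# Hypothetical toy sample: each entry is the species label of one organism.
sample = ["a", "a", "a", "b", "b", "c", "d", "e", "e", "f"]
freq = frequency_counts(sample)
estimate = chao1(freq)
print(freq, estimate)
```

In this toy sample six species are observed, three of them only once and two of them twice, so the estimate adds one unseen species to the six observed. Real microbial data sets are far more skewed, which is why the report stresses comparing several parametric and nonparametric estimators rather than relying on a single formula.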

Agency
National Science Foundation (NSF)
Institute
Division of Environmental Biology (DEB)
Application #
0816638
Program Officer
Alan James Tessier
Project Start
Project End
Budget Start
2008-09-01
Budget End
2011-08-31
Support Year
Fiscal Year
2008
Total Cost
$299,446
Indirect Cost
Name
Cornell Univ - State: Awds Made Prior May 2010
Department
Type
DUNS #
City
Ithaca
State
NY
Country
United States
Zip Code
14850