Bioconductor: an open computing resource for genomics

Morgan, Martin

Abstract

The Bioconductor project provides an open resource for the development and distribution of innovative, reliable software for computational biology and bioinformatics. The range of available software is broad and rapidly growing, as are both the user community and the developer community. The project maintains a web portal for delivering software and documentation to end users, as well as an active mailing list. Additional services for developers include a software archive, a mailing list, and assistance and advice on program development and design. We propose an active development strategy designed to meet new challenges while simultaneously providing user and developer support for existing tools and methods. In particular, we emphasize a design strategy that accommodates the imperfect yet evolving nature of biological knowledge and the relatively rapid development of new experimental technologies. Software solutions must be able to adapt rapidly and to address new problems as they arise.
CRITIQUE 1: The Bioconductor project began in 2001. In 2002 it was awarded a BISTI grant for three years (2003-2006). During this time the project has expanded and provided support for a worldwide community of researchers. This is a proposal for continued development of Bioconductor, a set of statistical programs specifically tailored to the computational biology community. Bioconductor is composed of over 130 R packages that have been contributed by a large number of developers. The software packages range from state-of-the-art statistical methods typically used in microarray analysis to annotation tools, plotting functions, GUIs, and sequence alignment and data management packages. Contributions to and usage of Bioconductor are growing rapidly, and the applicants are requesting support to continue its development as well as general logistical support for software distribution and quality assurance.
The proposal includes a research component for Bioconductor which will involve the development of analysis techniques. This will include optimization of the R statistical analyses, statistical processing of Affymetrix data, analysis of SNP data, improved standards, data storage, retrieval from NCBI, sequence management, machine learning, web services and distributed computing.
SCIENTIFIC MERIT: The applicants address many issues that are crucial to the success of a large open source project with multiple contributors. Examples of training, scientific publication, documentation and resource development run throughout the proposal. Many tangible examples were given of the usage of the system by the scientific community.
EXPERIMENTAL DESIGN: This is a description of their management workflow for the project, which does a good job of demonstrating the technical excellence brought to the project by this group.
1) Build annotation packages every three months. Integrate changes in annotation source data structure into annotation package building code.
2) Maintain project website, mailing lists, source control archive. Organize web resources for short courses and conferences.
3) Improve existing software.
4) Sustain automated nightly builds. Work with developers whose packages fail to pass QA.
5) Resolve cross-platform issues.
6) Review new submissions. Answer questions on the mailing lists.
7) Use software engineering best practices. Develop unit testing strategies. Design appropriate classes and methods for new data types. Refactor existing code for better interoperability and extensibility.
8) Develop and organize training materials and documentation.
Extensive detail on testing, build procedures, interoperability, quality assurance and project management is given elsewhere in the document. They have clearly dealt with many issues necessary for a project of this size. They state that one of the biggest cost items is supporting this package on multiple platforms. They point out that many contributors focus on a single platform, so much of their work is tracking down cross-platform bugs. This is time well spent, given that the platforms used are in sync with the needs of the greater bioinformatics community.
ORIGINALITY: While a high degree of originality is not a particularly critical element of an open source software development project, there are certainly areas in the proposal that are unique. Most importantly, it is safe to say that there is not another project which has this blend of statistical analysis systems specifically tailored to an important bioinformatics research area that can be deployed on a number of different computer environments.
INVESTIGATOR AND CO-INVESTIGATORS: Dr. Gentleman is the founder and leader of the Bioconductor project. Dr. Gentleman was an Associate Professor in the Department of Biostatistics, Harvard School of Public Health, and the Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute. In 2004 he became Program Head, Computational Biology, at the Fred Hutchinson Cancer Research Center in Seattle. He has on the order of ten publications relating to Bioconductor or related statistical analysis. He implemented the original versions of the R programming language jointly with another co-founder. He is PI or Investigator on a number of research grants, at least two of which are directly related to this work. He and other members of the proposal have taught a number of courses and given lectures on Bioconductor; the number of these courses certainly indicates significant dedication to the project. A review of the PI and Co-PI activities related to this project is shown in Table 3 on page 42 of the application. The roles and time allocations assigned to each participant appear to be reasonable. Dr. Gentleman will serve as project leader, managing the programmers, coordinating the project, and investigating new computational methods and approaches.
Dr. Vincent Carey, as co-Principal Investigator, has 20% time allocated for the project. In 2005 he became Associate Professor of Medicine (Biostatistics). Dr. Carey is a senior member of the Bioconductor development core. He will improve interoperability to allow Bioconductor reuse of external modules in Java, Perl and other languages, as well as strengthen interfaces between high-throughput experimental workflows and machine learning tools, and ontology capture. An administrative assistant will assist Dr. Carey with administrative requirements, including call coordination, manuscript preparation and distribution, scheduling and budget management.
Dr. Rafael Irizarry, as co-PI, will spend 30% effort on the project. Dr. Irizarry has four years of experience developing methods for microarray data analysis and serving, in the Department of Biostatistics, as faculty liaison to the Johns Hopkins Medical Institution's Microarray Core. He will supervise all efforts to support preprocessing on all platforms and support for microarray-related consortiums such as the ERCC, GEO, and ArrayExpress.
Programmers will be responsible for the project website, managing email lists, maintaining training materials, upgrading software, refactoring and other code enhancements, managing the svn archive, and Bioconductor releases. They will handle checking all submitted packages, developing unit tests, simplifying downloads, nightly build procedures, cross-platform issues, and data technologies, as well as integrating resources found in other languages (e.g., large C libraries of routines for string handling, machine learning and so on). Programmers have familiarity with R packages and systems for database management and for parallel and distributed computing. They will be responsible for managing the annotation data, including package building and liaising with organism-specific and other data providers.
SIGNIFICANCE: Given the scope of the proposal, and the size of the Bioconductor project in general, the request for the above resources is appropriate. There is an excellent mix of grounded project management along with development of newer state-of-the-art techniques that will benefit many members of the bioinformatics community. There is a high probability that funding this project will help to maintain and advance this important community resource.
ENVIRONMENT: The computer infrastructure, the local departments of the PI and Co-PIs, and the work with the larger scientific community are all excellent environments to support this project.
IN SUMMARY: This is a terrific resource. It is a well-managed large open source project with very well-crafted QA testing, documentation and training. Continuation of this work is a three-year project. Beyond that period, a statement of long-term goals is needed. The PI should articulate the strategic goals, as well as their research motivation, and translate them into an action plan. They should also use that context to describe how they would go about choosing packages that are put into the Bioconductor system; Table 3 only listed the names of the packages made by the applicants, and it could have gone further to give the reader more information for choosing packages. A simple example would have been a statement in the document such as: "Given our assessment of the microarray state of the art, we ultimately aim to overlay annotation data, ontological information, and other forms of meta data onto a statistical framework for expression data." The resulting research plan would then justify a five-year project, but it was not strong enough in this application. It should be noted that many of the beneficiaries of this system are not just the users who download it. In many cases a centralized informatics service downloads the system and then performs analyses for other members of the campus or the wider web community. While that type of "success measure" is hard to assess, more effort in this area in subsequent proposals would be helpful.

Funding Agency

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Biotechnology Resource Grants (P41)
Project #: 5P41HG004059-05
Application #: 7910730
Study Section: Special Emphasis Panel (ZHG1-HGR-N (M2))
Program Officer: Bonazzi, Vivien

Project Start: 2006-09-28
Project End: 2011-09-25
Budget Start: 2010-08-01
Budget End: 2011-09-25
Support Year: 5
Fiscal Year: 2010
Total Cost: $1,093,220

Institution

Name: Fred Hutchinson Cancer Research Center
DUNS #: 078200995

City: Seattle
State: WA
Country: United States
Zip Code: 98109

Related projects


NIH 2010 P41 HG	Bioconductor: an open computing resource for genomics Morgan, Martin / Fred Hutchinson Cancer Research Center	$1,093,220
NIH 2009 P41 HG	Bioconductor: an open computing resource for genomics Morgan, Martin / Fred Hutchinson Cancer Research Center	$829,379
NIH 2009 P41 HG	Bioconductor: an open computing resource for genomics Morgan, Martin / Fred Hutchinson Cancer Research Center	$250,001
NIH 2008 P41 HG	Bioconductor: an open computing resource for genomics Gentleman, Robert C. / Fred Hutchinson Cancer Research Center	$805,222
NIH 2007 P41 HG	Bioconductor: an open computing resource for genomics Gentleman, Robert C. / Fred Hutchinson Cancer Research Center	$796,910
NIH 2006 P41 HG	Bioconductor: an open computing resource for genomics Gentleman, Robert C. / Fred Hutchinson Cancer Research Center	$796,807

Publications

Pasolli, Edoardo; Schiffer, Lucas; Manghi, Paolo et al. (2017) Accessible, curated metagenomic data through ExperimentHub. Nat Methods 14:1023-1024

Huber, Wolfgang; Carey, Vincent J; Gentleman, Robert et al. (2015) Orchestrating high-throughput genomic analysis with Bioconductor. Nat Methods 12:115-21

Lawrence, Michael; Huber, Wolfgang; Pages, Herve et al. (2013) Software for computing and annotating genomic ranges. PLoS Comput Biol 9:e1003118

Gurtowski, James; Schatz, Michael C; Langmead, Ben (2012) Genotyping in the cloud with Crossbow. Curr Protoc Bioinformatics Chapter 15:Unit15.3

Le Meur, Nolwenn; Gentleman, Robert (2012) Analyzing biological data using R: methods for graphs and networks. Methods Mol Biol 804:343-73

Frazee, Alyssa C; Langmead, Ben; Leek, Jeffrey T (2011) ReCount: a multi-experiment resource of analysis-ready RNA-seq gene count datasets. BMC Bioinformatics 12:449

Bravo, Héctor Corrada; Irizarry, Rafael A (2010) Model-based quality assessment and base-calling for second-generation sequencing data. Biometrics 66:665-74

Leek, Jeffrey T; Scharpf, Robert B; Bravo, Héctor Corrada et al. (2010) Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11:733-9

Cao, Yi; Yao, Zizhen; Sarkar, Deepayan et al. (2010) Genome-wide MyoD binding in skeletal muscle cells: a potential for broad cellular reprogramming. Dev Cell 18:662-74

Carvalho, Benilton S; Irizarry, Rafael A (2010) A framework for oligonucleotide microarray preprocessing. Bioinformatics 26:2363-7

Showing the most recent 10 out of 27 publications
