The field of bioinformatics and computational biology is experiencing a data revolution unlike any other scientific computing field. Experimental techniques for acquiring data have increased in throughput, improved in accuracy, and fallen in cost. The resulting volume of data has strained the scalability of existing software tools. To understand the complexities and challenges of designing algorithms for data-intensive biocomputing, this project is developing new approaches for two major problems in protein bioinformatics: i) identification of protein families and homology clusters; and ii) peptide identification from large-scale mass spectrometry data. The former requires large-scale graph analysis and the latter requires large-scale database search. The project is investigating a multi-faceted approach that involves designing space-efficient algorithms for massively parallel machines, developing algorithmic heuristics for reducing the time to solution, evaluating the MapReduce paradigm as an alternate computing model, and deploying multicore architectures for fine-grain parallelism. Project outcomes will include new algorithms and open-source software libraries for large-scale protein bioinformatics, including a more generic library for data-intensive biocomputing. The project is addressing a critical need for scalable methods in protein bioinformatics and, in doing so, will bring state-of-the-art computing models and concepts from both software and hardware into mainstream biocomputing. Broader impacts include creating interdisciplinary research opportunities for undergraduate and graduate students, and new interdisciplinary curricula at the high school, undergraduate, and graduate levels. Education materials will be disseminated through a partnership program with Shodor Education Foundation, Inc.
Project homepage: www.eecs.wsu.edu/~ananth/DataIntensive-Biocomputing/
Intellectual merit: The primary contribution of this project was to characterize, design, and develop efficient solutions for some of the most data-intensive problems in biocomputing, viz. protein family characterization (involving large-scale graph analytics), mass spectra-based peptide identification (involving large-scale database searches), and phylogenetic tree reconstruction (involving tree optimization techniques). This was achieved through a multi-faceted approach that combined novel algorithm design, exploration of a wide variety of parallel architectures and programming models, and testing the viability of emerging multicore platforms. Some of the key technical products resulting from our research include: i) a comprehensive suite of scalable algorithmic techniques and software for clustering protein sequences to identify protein families on different parallel architectures; ii) a suite of parallel implementations for generating optimal protein sequence homology graphs at scale (these graphs are required as input to clustering and are not readily available); iii) network-on-chip (NoC) architecture designs and implementations for sequence alignment and phylogenetic kernels (to the best of our knowledge, our efforts represent the first exploration of the emerging NoC architectural platform for biological computing); iv) cloud-enabled solutions for peptide identification from mass spectral data (proteomics); and v) new metrics and empirical models for comparing different schemes of data distribution on parallel machines with the goal of reducing communication.
At a higher level, the project allowed us to answer the following questions: What challenges are involved in designing large-scale graph analytics, and what kinds of algorithmic techniques and heuristics are best suited from both a qualitative and a performance scaling perspective? What advantages and limitations do the architectural choices available in modern-day supercomputers (viz. distributed memory, shared memory multicore) impose on the design of such graph analytic applications? What similar challenges and solutions apply to database search-oriented applications in biocomputing? Can more loosely coupled parallel models such as cloud computing also be used for such applications? Can emerging hardware architectures that contain on-chip networks be used for performing the calculations involved in various data-intensive biological applications? What performance gains can be estimated from such platforms, and what performance vs. energy cost tradeoffs are involved in such custom designs? With data becoming too big to fit in any one compute node’s memory, what are some of the more effective ways to partition and distribute input data on parallel machines so as to reduce the data movement overheads introduced during computation?
As with any research project, several open questions remain, and the project has allowed us to identify new lines of inquiry for the future. Still, we expect that the contributions made in this project have laid the foundation needed to answer some of the most challenging and timely questions in modern-day data-intensive biocomputing.
Broader impact: The project has led to the support, training, and advising of numerous students who represent the next-generation scientific workforce in the interdisciplinary research area of computational life sciences. More specifically, over the course of the project’s 4 years since 2009, the project has supported (in full or in part) the training and advising of 6 PhD students (incl. 1 female), 3 MS students (incl. 1 female), and 8 undergraduate researchers (incl. 1 female). Six of the eight undergraduate researchers also contributed to scholarly activities by authoring one or more papers in peer-reviewed conferences and journals. All software products developed during the course of the project have been released to the public through open-source outlets. The project also led to several new curricular materials in the area of data-intensive biocomputing for adoption in undergraduate and graduate classrooms, including a Blue Waters Petascale Undergraduate curriculum. The PI also worked closely with Shodor Education Foundation and the Supercomputing Education program in designing and delivering several of these curricular materials through undergraduate faculty training workshops. Through outreach programs at Washington State University, the PI also worked with high schools in rural Washington that serve an underrepresented Hispanic population. All the PIs took an active role in various professional societies and served on a number of conference program committees. The project also led to the creation of new workshops, special sessions, and symposia in conferences, and special issues in journals, collectively aimed at fostering a new research community in the area of high performance computational biology and big data life sciences.