Sequencing of transcribed RNA molecules (RNA-seq) is an invaluable tool for studying cell transcriptomes at high resolution and depth. STAR is a popular RNA-seq analysis suite that combines high accuracy and ultra-fast mapping speed with a rich collection of built-in features and tools. STAR is used by hundreds of researchers, including several major consortia and institutions. We propose to significantly enhance and expand STAR capabilities in the following important areas.

1. Develop novel algorithms and tools integrated directly into STAR. RNA-seq analyses require combining multiple tools into "processing pipelines", which is a demanding task owing to bottlenecks and compatibility issues.
We aim to overcome these impediments by integrating novel tools directly into the STAR software: (i) mapping of RNA-seq reads to personal genomes, utilizing genotype information to produce more accurate allele-aware alignments and thus increasing the precision of personal genomics analyses; (ii) mapping of long RNA reads from emerging sequencing technologies such as PacBio and Oxford Nanopore.

2. Increase accuracy and speed of the core mapping algorithm. New applications, such as personal genomics, require significant improvements in mapping accuracy. We will enhance the STAR mapping algorithm with (i) spliced seed extension through mismatches/indels; and (ii) limited local alignment of the read ends. The tremendous increase in sequencing throughput has put significant emphasis on the efficiency of computational algorithms. To keep up with the increasing sequencing throughput, we will speed up the STAR algorithm with (i) vectorization of query-text comparisons using SIMD/SSE instructions; (ii) dynamic programming for seed stitching. The improvements in accuracy and speed will be validated on both simulated and real RNA-seq data. Mapping accuracy depends strongly on choosing the best mapping parameters for a particular dataset. We will devise automated parameter optimization procedures to eliminate guesswork in parameter selection.

3. Enhance user-friendliness, user support/education, and software maintenance. User-friendliness is crucial to making bioinformatics software useful to the broadest audience.
We aim to significantly enhance users' experience by developing STAR web user interfaces for both pre-run data input and post-run exploration of results. To enable STAR analysis in the cloud, we will create STAR virtual machines on the popular Amazon and Google cloud computing services, and develop Hadoop-based tools for distributed processing of large datasets. We will also expand user support and education, and continue to implement user-requested features and debug user-reported issues.
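As an illustration of the "dynamic programming for seed stitching" item in area 2, the following is a minimal Python sketch of chaining exact-match seeds into the best-scoring collinear set. It is not STAR's implementation: the `Seed` type, the `stitch_seeds` function, and the scoring scheme (matched bases minus a penalty per unaligned read base between seeds, with genomic gaps left free as potential introns) are all assumptions made for this example.

```python
from typing import List, NamedTuple, Tuple


class Seed(NamedTuple):
    """Hypothetical exact match between a read and the genome."""
    read_start: int    # 0-based start of the match in the read
    genome_start: int  # 0-based start of the match in the genome
    length: int        # number of matching bases


def stitch_seeds(seeds: List[Seed], gap_penalty: int = 1) -> Tuple[int, List[Seed]]:
    """Return (score, chain) for the best collinear chain of seeds.

    Toy scoring (an assumption, not STAR's scoring): each seed adds its
    length; each read base left unaligned between consecutive seeds costs
    gap_penalty; genomic gaps are free (treated as potential introns).
    """
    if not seeds:
        return 0, []

    # Sorting by read position guarantees every valid predecessor of a
    # seed appears earlier in the list.
    ordered = sorted(seeds, key=lambda s: (s.read_start, s.genome_start))
    best = [s.length for s in ordered]  # best score of a chain ending at each seed
    prev = [-1] * len(ordered)          # back-pointers for chain recovery

    for j, cur in enumerate(ordered):
        for i in range(j):
            left = ordered[i]
            read_gap = cur.read_start - (left.read_start + left.length)
            genome_gap = cur.genome_start - (left.genome_start + left.length)
            if read_gap < 0 or genome_gap < 0:
                continue  # seeds overlap or are not collinear
            candidate = best[i] - gap_penalty * read_gap + cur.length
            if candidate > best[j]:
                best[j] = candidate
                prev[j] = i

    # Trace back from the highest-scoring chain end.
    end = max(range(len(ordered)), key=best.__getitem__)
    chain = []
    k = end
    while k != -1:
        chain.append(ordered[k])
        k = prev[k]
    chain.reverse()
    return best[end], chain


# Example: three collinear seeds spanning a genomic gap, plus one stray seed.
seeds = [Seed(0, 100, 20), Seed(20, 500, 30), Seed(10, 50, 15), Seed(52, 530, 10)]
score, chain = stitch_seeds(seeds)
```

The quadratic loop over seed pairs is chosen for clarity; a production aligner would bound the number of seeds per read or use a more efficient chaining scheme.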

Public Health Relevance

Sequencing of transcribed RNA molecules (RNA-seq) provides invaluable insight into gene expression and function, which directly affect clinically important processes such as development, disease susceptibility, and therapy/drug responses. The goal of this project is to significantly enhance the capabilities of our RNA-seq analysis suite STAR, turning it into a one-stop solution for the majority of RNA-seq analyses. These enhancements, in conjunction with continued user support and software maintenance, will benefit hundreds of medical researchers using RNA-seq to develop better diagnostics and treatments for major diseases.

Agency: National Institutes of Health (NIH)
Institute: National Human Genome Research Institute (NHGRI)
Type: Research Project (R01)
Project #: 5R01HG009318-04
Application #: 9932464
Study Section: Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer: Sen, Shurjo Kumar
Project Start: 2017-08-18
Project End: 2022-05-31
Budget Start: 2020-06-01
Budget End: 2021-05-31
Support Year: 4
Fiscal Year: 2020
Total Cost:
Indirect Cost:

Name: Cold Spring Harbor Laboratory
Department:
Type:
DUNS #: 065968786
City: Cold Spring Harbor
State: NY
Country: United States
Zip Code: 11724