Recent advances in DNA sequencing technology have not been matched by improved analytic techniques for quickly and accurately interpreting patient genome data, whether to inform diagnosis, prognosis and treatment decisions in the clinic or to identify candidate biomarkers of disease in research laboratories. Development of automated techniques to facilitate interpretation of this data will benefit patient care and improve public health by promoting widespread clinical use of cost-efficient sequencing and by making it feasible to sequence a broader range of patients, including those with complex disease, and to identify patients who have an elevated risk of developing future disease. Our long-term goal is to commoditize sequence interpretation using high-throughput computational techniques in the same way that next-generation DNA sequencing technology has commoditized genome data production. The present project will result in commercial software that automates genome sequence interpretation. Specifically, we will develop (1) software that automatically collects and organizes a comprehensive set of genetic information by systematically reading millions of scientific articles and scanning dozens of genetic variant databases; (2) software that uses this information to prioritize patient data into clinical categories based on the likelihood of disease; and (3) software that automatically identifies candidate biomarkers of disease from multi-sample cohort data. To do this we will use a variety of innovative data processing techniques. First, we will systematically mutate the reference genome in silico to produce a comprehensive database of every possible mutation at every position of every gene, and we will use this database to query every word of every published article and every publicly available variant database to identify disease-gene-variant associations.
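For illustration only, the in silico mutation and literature query step might be sketched as follows in Python. This is a minimal sketch under assumptions that are not part of the proposal: it covers single-nucleotide substitutions only, it names variants with HGVS-style coding notation (for example c.2T>C), and it uses naive substring matching. The function names (enumerate_snvs, search_terms, find_mentions), the gene symbol GENE1 and the three-base sequence are hypothetical.

```python
from typing import Iterator, List, Tuple

BASES = "ACGT"


def enumerate_snvs(cds: str) -> Iterator[Tuple[int, str, str]]:
    """Yield (1-based position, reference base, alternate base) for every
    possible single-nucleotide substitution in a coding sequence."""
    for pos, ref in enumerate(cds.upper(), start=1):
        if ref not in BASES:
            continue  # skip ambiguous bases such as N
        for alt in BASES:
            if alt != ref:
                yield pos, ref, alt


def search_terms(gene: str, pos: int, ref: str, alt: str) -> List[str]:
    """Build text queries for one variant: an HGVS-style form and a looser
    form without the 'c.' prefix (both are illustrative assumptions)."""
    return [f"{gene} c.{pos}{ref}>{alt}", f"{gene} {pos}{ref}>{alt}"]


def find_mentions(text: str, gene: str, cds: str) -> List[str]:
    """Return HGVS-style names of enumerated variants whose search terms
    occur in the text (case-insensitive substring match)."""
    lowered = text.lower()
    hits = []
    for pos, ref, alt in enumerate_snvs(cds):
        terms = search_terms(gene, pos, ref, alt)
        if any(term.lower() in lowered for term in terms):
            hits.append(f"c.{pos}{ref}>{alt}")
    return hits


if __name__ == "__main__":
    # Toy input: a made-up gene symbol, sequence and sentence.
    sentence = "We report GENE1 c.2T>C in an affected patient."
    print(find_mentions(sentence, "GENE1", "ATG"))
```

A sequence of length n yields 3n substitutions, so the toy sequence above produces nine candidate variants. A production system would also need to cover insertions, deletions, protein-level names and the many informal spellings that authors use for the same variant.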
We will compare the results of this automated process to results obtained using more expensive and time-consuming manual methods. We hypothesize that the automated process can achieve 95% concordance with manual curation while identifying 33% more variants and 3-fold more references for each variant. These results will be organized into clinically meaningful categories and presented in an interactive graphical interface that displays the evidence for each association. We will then use this information to prioritize patient data based on similarity to known disease-causing variants and on the strength of evidence for their pathogenicity, in order to increase analytic sensitivity and specificity and thereby improve the speed and reliability of sequencing in the clinic. Our automated results will then be compared to conventional methods of data annotation and filtration for >1,100 patient samples from 4 diseases. Finally, we will use the same prioritization strategy to comprehensively compare variant data among all patients within a disease cohort to automatically identify the variants most likely to lead to disease, and we will compare our automated results to conventional methods for >600 samples from 10 diseases. Growth in the $3.6B genome sequencing market is driven by improvements in informatics techniques, and automated solutions such as those proposed here have significant commercial potential.
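As a hedged illustration of how the comparison between automated and manual results could be scored, the Python sketch below computes two quantities for a pair of variant sets. The definitions are assumptions made for this example, since the abstract does not define them: concordance is taken as the fraction of manually curated variants that the automated process also finds, and the additional yield is the number of variants found only by the automated process, expressed relative to the size of the manual set. The function name compare_variant_sets and the variant labels are hypothetical toy values, not project data.

```python
from typing import Dict, Iterable


def compare_variant_sets(automated: Iterable[str],
                         manual: Iterable[str]) -> Dict[str, float]:
    """Score an automated variant list against a manually curated one.

    concordance      = |automated AND manual| / |manual|
    additional_yield = |automated NOT IN manual| / |manual|
    (both definitions are assumptions for illustration)
    """
    auto_set, manual_set = set(automated), set(manual)
    if not manual_set:
        raise ValueError("manual set must not be empty")
    shared = auto_set & manual_set
    return {
        "concordance": len(shared) / len(manual_set),
        "additional_yield": len(auto_set - manual_set) / len(manual_set),
    }


if __name__ == "__main__":
    # Toy labels only; these are not real variants or results.
    manual = ["var1", "var2", "var3"]
    automated = ["var1", "var2", "var3", "var4"]
    print(compare_variant_sets(automated, manual))
```

Under these definitions, the hypothesis stated above would correspond to a concordance of at least 0.95 and an additional yield of about 0.33.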
Successful completion of the proposed project will contribute to the public health mission of the NIH by making the interpretation of genome sequencing data more accurate and cost-effective in clinical and research laboratories, thereby promoting more widespread adoption of genome sequencing. The community of users that can benefit from this research includes geneticists, oncologists, pathologists, researchers and patients.