This award is funded under the American Recovery and Reinvestment Act of 2009 (Public Law 111-5).

Intellectual Merit. High-throughput sequencing allows human genetics to access rare variation. "Next Generation" sequencing is transforming human genetics: several disruptive technologies are coming of age and now enable resequencing throughput of megabases per dollar. Specifically, targeted regions of the genome can now be sequenced in thousands of individuals, using pools of individuals. The complete spectrum of common and rare alleles thus revealed is a key resource for understanding the origins, genomics, and heritable traits of our species. Naïve tests of association between a heritable trait and a common variant are inappropriate for the analysis of rare gene variants, since the contribution of each such rare variant to the trait is often statistically undetectable. The hope of finding an associated gene therefore lies in accumulating association signal across multiple functional variants. The problem of multiple-variant association is complicated by background correlations between nearby variants. This proposal tackles two challenges:

1. Initial task: Recovery of individual identity of mutation carriers from pooled sequencing data

2. Main task: Using individual-level mutation data for scoring of association to multiple variants in a locus

Proposed solution: Bayesian scoring, decomposable by individual and by variant. This proposal involves the design of overlapping pools for recovering mutation carrier identity. Each individual will be sequenced in a unique combination of pools. A mutation observed in exactly such a set of pools will be inferred to be carried by the corresponding individual, addressing the initial task. The proposal tackles the main task by Bayesian scoring of genomic intervals containing functional variants. Comparative genomics is used to guide a prior distribution for whether a sequenced variant is likely to be functional. The association score is further decomposed into contributions from each sample and each site, with Markovian dependency between such contributions along the genome. A dynamic program is proposed for optimizing the causal locus boundaries.
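The overlapping-pool idea for the initial task can be illustrated with a minimal sketch. It is not the project's actual design: the function names `design_pools` and `decode_carrier`, the choice of a fixed number of pools per individual, and the assumption that each rare variant has a single carrier and is detected without error are all hypothetical simplifications introduced here for illustration.

```python
from itertools import combinations


def design_pools(individuals, n_pools, pools_per_individual):
    """Assign each individual a distinct combination of pool indices.

    Assumption (not from the proposal): every individual is placed in the
    same number of pools, and individual IDs are unique.
    """
    signatures = combinations(range(n_pools), pools_per_individual)
    # zip stops at the shorter input, so a shortfall of signatures shows up
    # as a design smaller than the list of individuals.
    design = {person: frozenset(sig) for person, sig in zip(individuals, signatures)}
    if len(design) < len(individuals):
        raise ValueError("not enough distinct pool combinations for all individuals")
    return design


def decode_carrier(design, observed_pools):
    """Return the individuals whose pool signature equals the observed pools.

    Assumption: the variant has one carrier and is seen in all of that
    carrier's pools and in no others.
    """
    observed = frozenset(observed_pools)
    return [person for person, sig in design.items() if sig == observed]


if __name__ == "__main__":
    design = design_pools(["A", "B", "C", "D", "E", "F"], n_pools=4, pools_per_individual=2)
    # A variant seen only in pools 1 and 3 points to a single individual.
    print(decode_carrier(design, [1, 3]))
```

With 4 pools and 2 pools per individual there are 6 distinct combinations, so 6 individuals can be told apart using 4 sequencing runs. When a variant has several carriers, the observed pools form the union of their signatures and exact matching of this kind can become ambiguous, which is one reason a real design would need more care than this sketch.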

Broad Impact. The outcomes of the project would facilitate new paradigms in genetic research, alongside the recently launched high throughput experimental technologies. Specifically, projected impacts include:

- Software tools and tailored interfaces to be disseminated to the research community.
- Education for undergraduates through project courses implementing the proposed research tasks, and for K-12 students through curriculum development and delivery to high-school diversity students.
- Allowing a generation to have widespread access to their individual DNA.

Agency: National Science Foundation (NSF)
Institute: Division of Information and Intelligent Systems (IIS)
Type: Standard Grant (Standard)
Application #: 0845677
Program Officer: Vijayalakshmi Atluri
Project Start:
Project End:
Budget Start: 2009-06-01
Budget End: 2014-05-31
Support Year:
Fiscal Year: 2008
Total Cost: $399,999
Indirect Cost:

Name: Columbia University
Department:
Type:
DUNS #:
City: New York
State: NY
Country: United States
Zip Code: 10027