Genome-wide association studies aim to link genetic markers to phenotypes. The markers assessed are usually single nucleotide polymorphisms (SNPs) that are first identified by large-scale surveys of human genome variation, and later typed in individuals using microarrays. The advent of low-cost high-throughput sequencing allows for the direct sequencing of individual genomes, and opens up the possibility of finding other genomic features, such as insertions or deletions, associated with phenotypes. The complete sequencing of individual genomes also allows for the examination of rare variants and their potential contributions. However, the sequences obtained by high-throughput sequencing are short, and it is not currently possible to assemble complete whole genomes directly from them. Individual variant detection is therefore based on first mapping the sequenced reads to a reference genome that serves as a scaffold. There is inevitably a loss of information in such a mapping: some reads do not map due to errors, or because the reference sequence is incomplete. Even when reads can be mapped, the identification of variants is complicated by errors. We propose a novel approach to association mapping by high-throughput sequencing, via the direct comparison of short subsequences (k-mers) extracted from the reads. Such an approach has the advantage of avoiding the need for genome assembly or mapping, and therefore utilizes information that completely represents the underlying genome. The challenge of such an approach is twofold. First, many tests need to be performed, possibly reducing the power to detect association. Second, even if associations are found, they are identified only via short subsequences from the genome; the determination of the underlying genome sequences responsible for the differences is still required. In this proposal we address both of these problems.
First, we will demonstrate that association mapping directly from sequenced reads is feasible with sufficient depth of sequencing, to an extent that is already cost effective for short genomes, and that will soon be affordable for larger genomes. Our approach will be based on the large-scale application of multiple hypothesis testing theory. Second, we explain how distinguishing k-mers can be used to extract reads that can then be locally assembled to reveal genomic regions associated with phenotype, thereby avoiding the need for a global genome assembly. A key to our methods is the notion of statistical sequence assembly. This is a formulation of genome assembly based on evaluating the likelihood of proposed assemblies according to a probabilistic model of the random fragmentation used to create libraries, and the subsequent random sequencing of them. Our proposal therefore offers a novel approach to association mapping that circumvents inherent limitations of current approaches by directly assessing differences between cases and controls from sequence data. While we focus on the case of binary traits in this exploratory/developmental grant, our approach should be applicable to quantitative traits as well.
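
The k-mer comparison described above can be illustrated with a minimal sketch. It is not the method of the proposal: the abstract does not specify a test statistic or a multiple-testing procedure, so this example assumes an exact two-sided binomial test on pooled case and control counts and assumes Benjamini-Hochberg control of the false discovery rate. All function names are hypothetical, reads are plain strings, and reverse complements, sequencing errors, and per-individual variation are ignored.

```python
from collections import Counter
from math import comb


def count_kmers(reads, k):
    """Count every length-k substring across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def binomial_two_sided_p(x, n, p):
    """Exact two-sided binomial p-value: total mass of outcomes no more likely than x."""
    pmf = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    threshold = pmf[x] * (1 + 1e-9)  # small tolerance for floating-point ties
    return min(1.0, sum(q for q in pmf if q <= threshold))


def benjamini_hochberg(pvalues, alpha):
    """Return the indices rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= alpha * rank / m:
            cutoff = rank  # keep the largest rank that passes its threshold
    return set(order[:cutoff])


def associated_kmers(case_reads, control_reads, k, alpha=0.05):
    """List k-mers whose case/control counts differ significantly (assumed test)."""
    case = count_kmers(case_reads, k)
    control = count_kmers(control_reads, k)
    n_case, n_control = sum(case.values()), sum(control.values())
    # Under the null, a k-mer occurrence falls in the case pool with this probability.
    p_null = n_case / (n_case + n_control)
    kmers = sorted(set(case) | set(control))
    pvals = [
        binomial_two_sided_p(case[km], case[km] + control[km], p_null)
        for km in kmers
    ]
    rejected = benjamini_hochberg(pvals, alpha)
    return [kmers[i] for i in sorted(rejected)]


if __name__ == "__main__":
    cases = ["ACGTT"] * 20      # toy reads carrying one allele
    controls = ["ACGTA"] * 20   # toy reads carrying the other allele
    print(associated_kmers(cases, controls, k=4))
```

In this toy input the shared k-mer `ACGT` is equally common in both pools and is not reported, while the two k-mers that span the differing base are. With real data the number of distinct k-mers, and hence the number of tests, is very large, which is the power concern raised in the abstract.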

Public Health Relevance

We propose a novel approach to association mapping based on a direct comparison of k-mer counts obtained from reads. This avoids the need for whole genome assembly or read mapping. Instead, genomic regions associated with phenotypes are identified by local assembly of reads containing significantly different k-mers.
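
As a rough illustration of the read-extraction step, the sketch below selects the reads that contain at least one distinguishing k-mer, so that only this subset would be handed to a local assembler. The function name and the plain-string read representation are assumptions made for the example; the assembly step itself is not shown, and reverse complements are ignored.

```python
def reads_with_kmers(reads, kmers, k):
    """Return the reads that contain at least one of the given k-mers."""
    wanted = set(kmers)
    selected = []
    for read in reads:
        # Slide a window of length k over the read and check set membership.
        if any(read[i:i + k] in wanted for i in range(len(read) - k + 1)):
            selected.append(read)
    return selected


if __name__ == "__main__":
    reads = ["AAACGTT", "GGGGGGG", "CGTA"]
    print(reads_with_kmers(reads, ["CGTT", "CGTA"], k=4))
```

Reads shorter than k yield no windows and are never selected.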

National Institutes of Health (NIH)
National Human Genome Research Institute (NHGRI)
Exploratory/Developmental Grants (R21)
Study Section
Genomics, Computational Biology and Technology Study Section (GCAT)
Program Officer
Brooks, Lisa
University of California Berkeley
Biostatistics & Other Math Sci
Schools of Arts and Sciences
United States