Recent genetic evidence shows that the human species is a smaller family than previously thought: arbitrary groups of people include many pairs of hidden relatives who, unknown to them, share a recent ancestor a few generations back. The investigators develop novel computational methods to reveal these remote family ties from large-scale datasets that contain billions of snippets of genetic information. This research effort will use this information to compile a genealogy of thousands of otherwise-unrelated individuals.
Individuals with a common ancestor have a chance of sharing one or more long fragments of DNA that are identical-by-descent (IBD) over several megabases. High-throughput genetic data from commercial arrays of 300,000-1,000,000 Single Nucleotide Polymorphisms (SNPs) can therefore detect IBD with certainty. The computational challenge in large-scale data, with tens of thousands of individuals typed on these arrays, is performing the quadratic number of pairwise comparisons needed to search for IBD. The investigators develop a per-locus hashing algorithm that detects identical haplotypes across all O(n^2) sample pairs but operates in linear time. They are using this methodology to map hidden relatedness across publicly available samples, creating a useful tool for population-based linkage analysis in unrelated individuals, as well as for inferences on the population genetics of recent generations.
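The idea behind per-locus hashing can be illustrated with a minimal Python sketch. It is not the investigators' implementation: the function name identical_window_pairs, the fixed-size windows, and the encoding of phased haplotypes as equal-length strings of '0' and '1' are all assumptions made for illustration. Each window of markers serves as a hash key, so samples that carry the same haplotype in that window fall into the same bucket without any explicit pairwise comparison.

```python
from collections import defaultdict
from itertools import combinations


def identical_window_pairs(haplotypes, window_size):
    """Find sample pairs whose haplotypes are identical within a window.

    haplotypes: dict mapping a sample id to a string of '0'/'1' alleles
                (hypothetical encoding; all strings have equal length).
    Returns a dict mapping (id_a, id_b) to the list of matching window indices.
    """
    n_sites = len(next(iter(haplotypes.values())))
    matches = defaultdict(list)
    for w, start in enumerate(range(0, n_sites, window_size)):
        # One pass over the samples: the window's allele string is the hash key.
        buckets = defaultdict(list)
        for sample_id, hap in haplotypes.items():
            buckets[hap[start:start + window_size]].append(sample_id)
        # Only samples that share a bucket are reported as candidate pairs.
        for ids in buckets.values():
            for a, b in combinations(sorted(ids), 2):
                matches[(a, b)].append(w)
    return dict(matches)


haplotypes = {
    "A": "0101100110",
    "B": "0101100111",
    "C": "1110000110",
}
result = identical_window_pairs(haplotypes, window_size=5)
# result == {("A", "B"): [0], ("A", "C"): [1]}
```

The bucketing pass touches each sample once per window, so its cost grows linearly with the number of samples; only the step that reports matching pairs depends on how many samples share a bucket. A practical IBD detector would additionally extend matches across adjacent windows and tolerate genotyping errors, which this sketch leaves out.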