This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

Characterizing the insertion site preference of Alu elements over large-scale DNA sequence data is an important problem in primate-specific informatics. Two aspects make this problem challenging and interesting: 1) Without any prior knowledge, can we discover the general patterns that may exist and draw biological insight from them? 2) How can we obtain compact yet essential discriminative patterns given a search space of up to 4^200, or roughly 10^120? This research proposes an integrated divide-and-conquer and aggregation-based algorithm for this task. Compared with the existing state-of-the-art biological study, our results on over 8400 pre-Alu insertion sequences provide a further refined analysis of the characteristic patterns involved in the mechanism of Alu insertion. Most importantly for biology, we obtain a 200-nt predictive profile around the Alu insertion site that not only contains the widely accepted signal consensus but also suggests a longer pattern, (T)7AA[AG]AATAA. The biological significance is that this pattern gives more insight into the sequence variations favored for binding and cleavage by the L1 ORF2 endonuclease, which is involved in initiating the insertion process. A whole-genome search for the distribution of the discovered pattern will be conducted accordingly. The resulting genome-wide locations of the pattern will be compared with gene distributions on the human genome to identify which genes might be especially susceptible to Alu insertion mutations.
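The planned whole-genome search for the pattern can be illustrated with a minimal Python sketch. It is not the project's algorithm, only a plain regular-expression scan of both strands. It assumes that "(T)7" denotes seven consecutive T's, so the pattern reads TTTTTTTAA[AG]AATAA (15 nt); the function names `find_motif`, `reverse_complement`, and `search_space_log10` are hypothetical and introduced here for illustration. The last helper simply checks the arithmetic behind the quoted search-space size for a 200-nt window over a 4-letter alphabet.

```python
import math
import re

# Assumption: "(T)7" means seven consecutive T's, giving a 15-nt motif.
MOTIF = "T" * 7 + "AA[AG]AATAA"

# A lookahead wrapper lets finditer report overlapping occurrences as well.
MOTIF_RE = re.compile(f"(?=({MOTIF}))")

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_motif(seq):
    """Return sorted (start, strand) hits for the motif on both strands.

    `start` is the 0-based forward-strand index of the leftmost base
    covered by the hit, whichever strand the motif lies on.
    """
    seq = seq.upper()
    n = len(seq)
    hits = [(m.start(), "+") for m in MOTIF_RE.finditer(seq)]
    for m in MOTIF_RE.finditer(reverse_complement(seq)):
        # Convert a reverse-strand offset back to forward coordinates.
        hits.append((n - m.start() - len(m.group(1)), "-"))
    return sorted(hits)


def search_space_log10(length=200, alphabet=4):
    """log10 of the number of distinct sequences of the given length."""
    return length * math.log10(alphabet)


if __name__ == "__main__":
    example = "ACG" + "TTTTTTTAAGAATAA" + "CC"
    print(find_motif(example))
    print(round(search_space_log10(), 1))
```

For a real genome, the same scan would be applied chromosome by chromosome, and the hit coordinates could then be intersected with gene annotations; that step is left out here because the abstract does not specify the data sources.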