This subproject is one of many research subprojects utilizing the resources provided by a Center grant funded by NIH/NCRR. The subproject and investigator (PI) may have received primary funding from another NIH source, and thus could be represented in other CRISP entries. The institution listed is for the Center, which is not necessarily the institution for the investigator.

Machine learning is the subfield of computer science concerned with programs that improve their performance on a task with experience and that find patterns in data. We continued work on GAMI, an approach to motif inference that uses a genetic algorithm search. We use this computational approach to search for regions in non-coding DNA sequences that may affect gene function, under the hypothesis that motifs conserved across evolutionary time are more likely to be functional. GAMI identifies the motifs that are most strongly represented in the data so that they can be studied to assess functionality. We have worked with several genes, including ABCC7, the cystic fibrosis transmembrane conductance regulator (CFTR), finding many highly conserved patterns that merit additional study. We assessed the ability of our scoring metric to capture highly conserved regions and demonstrated that it outperforms the metric typically used for motif inference. We ascertained that motifs identified by GAMI correlate with known functional regions cataloged in the TRANSFAC database. We demonstrated that GAMI is an effective tool for searching large datasets of divergent species. The system has been validated on small problems, finding known transcription factor binding sites (TFBSs) referenced in other published work and finding the best motifs identified by exhaustive search. We have also compared the CFTR motifs found by GAMI to the full human genome and to known TFBSs. Some of these motifs represent known TFBSs for other genes, while others may represent novel discoveries.
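
To make the idea of scoring a candidate motif by its conservation across species concrete, the following is a minimal sketch of a fitness function that a genetic algorithm search could use. It is not GAMI's actual scoring metric, which this summary does not specify. The function names (`best_match_score`, `conservation_score`) and the choice of averaging the best per-sequence match are assumptions made purely for illustration.

```python
from typing import List


def best_match_score(motif: str, sequence: str) -> int:
    """Return the largest number of matching positions between the motif
    and any same-length window of the sequence (0 if the sequence is
    shorter than the motif)."""
    k = len(motif)
    if k == 0 or len(sequence) < k:
        return 0
    return max(
        sum(a == b for a, b in zip(motif, sequence[i:i + k]))
        for i in range(len(sequence) - k + 1)
    )


def conservation_score(motif: str, sequences: List[str]) -> float:
    """Hypothetical fitness: the average best-match fraction over all
    sequences. A value of 1.0 means the motif occurs exactly in every
    sequence; lower values mean weaker conservation."""
    if not motif or not sequences:
        return 0.0
    total = sum(best_match_score(motif, s) for s in sequences)
    return total / (len(motif) * len(sequences))


if __name__ == "__main__":
    # Toy sequences standing in for orthologous non-coding regions.
    toy_sequences = ["TTACGTTT", "GGACGTAA", "CCACGAAC"]
    print(conservation_score("ACGT", toy_sequences))
```

In a genetic algorithm, a population of candidate motifs would be ranked by a score of this kind, with the higher-scoring candidates selected for mutation and recombination in the next generation.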