Computational analysis of proteins is an essential shortcut around random experimentation. Multiple sequence alignments (MSAs) reveal the evolutionary history of a protein family, govern predictions of 3D structures and functions, and guide experimental design. The accuracy of these alignments is critical for the accuracy of conclusions drawn from their analysis. With the funding from the previous round of the grant, we significantly advanced the power of sequence similarity search and improved the accuracy of MSAs. Using these techniques, we aided biological discoveries in dozens of collaborations with experimentalists, analyzed medically important protein families and implemented a number of public web servers. For the next funding period we propose to: 1) Build on our advances to perfect homology search and multiple sequence alignment. Sequence profile search will be improved by sounder statistics and by averaging scores over predicted homologs of found hits. Sequence alignment will be corrected in regions that interact less closely with the rest of the protein and in segments that require large adjustments. 2) Maintain, improve and integrate our protein sequence analysis servers. During the first funding period of the grant, in addition to improving our sequence search and alignment web servers, we developed three new servers (for predicting a number of characteristics of a protein sequence, finding literature about a protein, and visualizing relationships between proteins as networks) and compiled a searchable database of clinical mutations. We will integrate these servers into a single "one-stop" sequence analysis resource, augmented with other information such as expression patterns, protein interactions, human polymorphism and known diseases. 3) Develop an Atlas of clinical mutations in proteins, freely available for browsing and download without login requirements. Each of the 25,000 known mutations will have a dedicated web page with the mutation's characteristics and predictions about its negative effects.

Public Health Relevance

Accurate protein sequence analysis is an essential step in planning experiments. Despite recent progress, computational methods are not precise enough to predict properties of biological molecules, explain molecular mechanisms of diseases and design drugs. We will improve the accuracy of sequence analysis methods and apply them to develop an Atlas of clinical mutations: a free, publicly accessible, interactive online database with hypotheses about how each of the 25,000 known mutations affects a protein and causes disease.

National Institutes of Health (NIH)
Research Project (R01)
Study Section: Macromolecular Structure and Function D Study Section (MSFD)
Program Officer: Wehrle, Janna P
University of Texas Sw Medical Center Dallas
Schools of Medicine
United States
Semeiks, Jeremy; Borek, Dominika; Otwinowski, Zbyszek et al. (2014) Comparative genome sequencing reveals chemotype-specific gene clusters in the toxigenic black mold Stachybotrys. BMC Genomics 15:590
Liao, Yuxing; Pei, Jimin; Cheng, Hua et al. (2014) An ancient autoproteolytic domain found in GAIN, ZU5 and Nucleoporin98. J Mol Biol 426:3935-45
Pei, Jimin; Grishin, Nick V (2014) PROMALS3D: multiple protein sequence alignment enhanced with evolutionary and three-dimensional structural information. Methods Mol Biol 1079:263-71
Salomon, Dor; Kinch, Lisa N; Trudgian, David C et al. (2014) Marker for type VI secretion system effectors. Proc Natl Acad Sci U S A 111:9271-6
Calder, Thomas; Kinch, Lisa N; Fernandez, Jessie et al. (2014) Vibrio type III effector VPA1380 is related to the cysteine protease domain of large bacterial toxins. PLoS One 9:e104387
Chen, Baoyu; Brinkmann, Klaus; Chen, Zhucheng et al. (2014) The WAVE regulatory complex links diverse receptors to the actin cytoskeleton. Cell 156:195-207
Shoji-Kawata, Sanae; Sumpter, Rhea; Leveno, Matthew et al. (2013) Identification of a candidate therapeutic autophagy-inducing peptide. Nature 494:201-6
Li, Wenlin; Cong, Qian; Kinch, Lisa N et al. (2013) Seq2Ref: a web server to facilitate functional interpretation. BMC Bioinformatics 14:30
Ji, Renkai; Cong, Qian; Li, Wenlin et al. (2013) M2SG: mapping human disease-related genetic variants to protein sequences and genomic loci. Bioinformatics 29:2953-4
Pei, Jimin; Grishin, Nick V (2013) A new family of predicted Kruppel-like factor genes and pseudogenes in placental mammals. PLoS One 8:e81109

Showing the most recent 10 out of 20 publications