A genome space is a moduli space of genomes: each point in the space corresponds to a genome, and two genomes are expected to be closely related when their corresponding points lie close together. The investigators and their collaborators have obtained a new geometric representation of DNA sequences, the natural vector, and have shown that the correspondence between DNA sequences and natural vectors is one-to-one. They perform phylogenetic and clustering analysis of genome sequences in this space. Unlike most existing methods, the proposed genome space requires neither sequence alignment nor an evolutionary model, and thus avoids repeated computation. In a pilot study of 27,643 genome sequences, the natural vector method computed all pairwise distances in a couple of hours, whereas classical multiple alignment methods would take four years. Given the exponential growth of the known genome database, the natural vector method is the only known feasible approach to clustering the whole genome space. With the natural vectors constructed, the investigators use a stochastic classification model based on a permanental process to perform classification and clustering; the model also gives the probability that each virus genome belongs to a given cluster. For example, the investigators performed clustering analysis on 59 Influenza A H1N1 swine flu genomes and 113 human rhinovirus (HRV) genomes based on their whole genome sequences. The analysis showed that the new outbreak of Influenza A H1N1 swine flu virus was most closely related to Eurasian and North American swine flu viruses, and the 113 HRV genomes clustered cleanly into 5 classes: HRV-A, HRV-B, HRV-C, HEV-B, and HEV-C. The proposed method produced the clustering result in only 18 seconds, while the commonly used multiple alignment method took more than 19 hours. Both methods yield the same clustering result.

The proposed activity has four goals. The first is to collect all available genome sequences for each type of virus, compute their natural vectors, and set up and maintain a "natural vector bank" for viruses. The second is to determine how many dimensions the natural vector needs in order to classify or cluster the genomes accurately. The third is to cluster the virus genomes based on their natural vectors. The final goal is to classify or identify any given new virus from its genome sequence and to predict its functions or behavior pattern.
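
The abstract does not define the natural vector itself. As a minimal sketch, assuming the 12-dimensional form commonly described in the natural vector literature (for each of A, C, G, and T: the count, the mean position, and the second central moment of the positions divided by the count times the sequence length), an alignment-free comparison might look like the following. The function names `natural_vector` and `euclidean` are hypothetical, the truncation at the second moment is an assumption, and the input is assumed to contain only the letters A, C, G, and T.

```python
from math import sqrt

NUCLEOTIDES = "ACGT"


def natural_vector(seq):
    """Return a 12-component vector: 4 counts, 4 mean positions, 4 second moments.

    Assumption: this follows the commonly described 12-dimensional natural
    vector; the project itself may use more (higher-moment) components.
    """
    seq = seq.upper()
    n = len(seq)  # assumes the sequence contains only A, C, G, T
    counts, means, moments = [], [], []
    for base in NUCLEOTIDES:
        # 1-based positions at which this nucleotide occurs
        positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
        n_k = len(positions)
        if n_k == 0:
            # A nucleotide that never occurs contributes zeros
            counts.append(0.0)
            means.append(0.0)
            moments.append(0.0)
            continue
        mu_k = sum(positions) / n_k
        # Second central moment, normalized by n_k * n
        d2_k = sum((p - mu_k) ** 2 for p in positions) / (n_k * n)
        counts.append(float(n_k))
        means.append(mu_k)
        moments.append(d2_k)
    return counts + means + moments


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Each genome is reduced to its vector once, in a single pass over the sequence, and every pairwise comparison afterwards is a fixed-size distance computation. This is why the cost does not grow the way repeated alignment does.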

In this project the investigators construct a novel, fast, and accurate geometric representation of DNA sequences called the natural vector. With this method, biologists can compare all genomes globally and simultaneously, which cannot be achieved by any other method. The method is fast and convenient once a genome sequence is known, which is vital to homeland security. To predict the characteristics of a new virus, for example one released by a terrorist group, one can compute its natural vector and compare it with the natural vectors of known viruses; the properties of the viruses located nearby then suggest the likely properties of the new one. Quickly and accurately identifying a new virus and predicting its functions will help authorities take precautions and manufacture a vaccine before the virus spreads through the general public and reaches a pandemic state.
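
The lookup described above, comparing a new virus against known ones and inspecting its neighbors, can be sketched as a nearest-neighbor search over precomputed vectors. This is only an illustration: `nearest_known`, `toy_bank`, and the virus labels are hypothetical, and the two-component vectors are made-up toy values rather than real natural vectors.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def nearest_known(query, bank, k=3):
    """Return the k (label, distance) pairs in `bank` closest to `query`."""
    scored = [(label, dist(query, vec)) for label, vec in bank.items()]
    return sorted(scored, key=lambda pair: pair[1])[:k]


# Toy 2-D vectors for illustration only; real natural vectors have more components.
toy_bank = {
    "virus_a": [0.0, 0.0],
    "virus_b": [3.0, 4.0],
    "virus_c": [10.0, 0.0],
}

# The closest known entries suggest which properties the query may share.
print(nearest_known([1.0, 0.0], toy_bank, k=2))
```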

National Science Foundation (NSF)
Division of Mathematical Sciences (DMS)
Program Officer
Leland M. Jameson
University of Illinois at Chicago
United States