When applying likelihood theory to high-frequency time series data, scientists and analysts face both computational complexity, from accommodating several million observations, and informational complexity, from the many diverse dynamic mechanisms involved. The investigator proposes to establish Kolmogorov's algorithmic statistics as a unified foundation that bridges the gaps caused by these two complexities and makes systematic, effective discovery of characteristic dynamics possible. The proposal develops this foundation by resolving critical questions: What are the models of a data set's individuality and typicality, why are they crucial, and how can scientists and analysts apply them for discovery and detection? A new vehicle for this development is the Hierarchical Factor Segmentation (HFS) algorithm. This distinct approach transforms an observed time series into counting processes corresponding to different events of interest, and then applies coding schemes that achieve lossy data compression as a way to find the governing state-space trajectory. This is accomplished without estimating the point processes' time-varying intensity functions, without unrealistic prior knowledge of the number of changes, and without assumptions about the regime-generating mechanisms. Using the computed state-space trajectory, the investigator is able to modify or replace prevailing statistical thinking, such as likelihood theory, and existing popular methodologies, such as those based on statistical correlation and association, with approaches built on the connectivity and concurrence of the decoded states. Real-world applications in finance, biology, and national security will illuminate the merit and potential of this new statistical thinking and computing for discovering dynamics of great interest in the sciences and in society.
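As a rough, hedged illustration of this pipeline (not the published HFS algorithm, whose coding scheme is more elaborate), the Python sketch below converts raw event times into a counting process over fixed bins and then uses a simple two-part, description-length criterion to decide whether a single rate change shortens the total code length. The bin width, the Poisson code-length approximation, the fixed model-cost penalty, and the single-change-point search are all simplifying assumptions introduced here for illustration.

```python
import numpy as np

def counting_process(event_times, bin_width):
    """Convert raw event times into per-bin event counts."""
    t_max = event_times.max()
    edges = np.arange(0.0, t_max + bin_width, bin_width)
    counts, _ = np.histogram(event_times, bins=edges)
    return counts

def poisson_code_length(counts):
    """Approximate code length (in nats) for the counts under one constant rate:
    the negative Poisson log-likelihood, dropping terms constant in the rate."""
    rate = max(counts.mean(), 1e-12)
    return rate * len(counts) - counts.sum() * np.log(rate)

def best_single_split(counts, penalty=2.0):
    """Two-part coding: accept one change point only if it shortens the
    total description length (`penalty` nats charged for the extra model)."""
    best_k, best_len = None, poisson_code_length(counts)
    for k in range(1, len(counts)):
        two_part = (poisson_code_length(counts[:k])
                    + poisson_code_length(counts[k:]) + penalty)
        if two_part < best_len:
            best_k, best_len = k, two_part
    return best_k  # None means "no regime change detected"

# Toy usage: the event rate jumps from 1 to 5 events per time unit at t = 500.
rng = np.random.default_rng(0)
events = np.concatenate([np.cumsum(rng.exponential(1.0, 500)),
                         500 + np.cumsum(rng.exponential(0.2, 500))])
counts = counting_process(events, bin_width=5.0)
print("estimated change bin:", best_single_split(counts))
```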

Currently, there are many situations in which data are sampled and recorded on a time scale of milliseconds, or even nanoseconds. Such high-frequency data are found not only in the sciences, but also in economics, finance, and national security. However, because of their enormous length and complexity, these data cannot be handled well by existing statistical methodologies. In fact, prevailing statistical thinking is inadequate for resolving the issues underlying these kinds of data. New statistical thinking is urgently needed to bridge the gap between computing and conception and to produce coherent, realistic mechanisms for data analysis. The investigator proposes algorithmic statistics as the new foundation on which scientists and analysts can extract key characteristics, such as individuality and typicality, from high-frequency data. An algorithmic statistic is a computer algorithm that takes a multidimensional time series consisting of millions of time points as input and efficiently computes realistic and sufficient analytic results as output. Accordingly, classic concepts such as correlation and association would be modified or replaced based on the connectivity and concurrence of decoded significant regimes. In particular, the resulting models of individuality and typicality are tremendously useful for regulatory and detection purposes. The proposal also targets the detection of abnormality or extremism, for example in trading or in physiological and behavioral processes, embedded within long and noisy high-frequency data.

Project Report

Project Outcome Report of NSF 1007219

Physicist and Nobel laureate Murray Gell-Mann was quoted in Coveney and Highfield’s 1994 book "Frontiers of Complexity: The Search for Order in a Chaotic World" (p. 8) as saying that we must get away from the idea that serious work is restricted to "beating to death a well-defined problem in a narrow discipline, while broadly integrative thinking is relegated to cocktail parties." This vision is more pertinent now than ever. As information technology advances, researchers are able to create many new systems, and to revisit even more old ones, in human society and in nature. These systems receive tremendous new research attention because of brand-new technologies for collecting large amounts of data. Data of various formats, whether high-dimensional point clouds, high-frequency time series, or networks of many types and sizes, are collected and explored to shed new light on the systems of interest. Integrative thinking is desperately needed now because all of these data formats contain complex dependence structures among and between the sampled subjects and the measured features. Such structural dependence consists of two closely coupled systemic characteristics: deterministic structures and inherent randomness. This concept is well known in statistical physics, particularly in the field of complex systems, and it prescribes scientific endeavors in complex systems as formulating and discovering patterns of deterministic structure and mechanisms of inherent randomness. These two systemic characteristics clearly defy the majority of statistical principles, concepts, models, and bootstrapping approaches, which rest on unrealistic independence and stationarity assumptions. Statistical investigations are therefore commonly conducted while ignoring systemic sensitivity, which is why statistical results for complex systems are likely unrealistic or even illusory.

The primary achievement of this NSF grant is an integrative foundation for discovering systemic characteristics that combines principles from three perspectives: 1) statistical physics: I quantify the deterministic structure through the lowest-energy macrostate with respect to a suitably chosen Hamiltonian; 2) computational geometry: I develop a new computing paradigm, called Data Mechanics, which builds two highly coupled ultrametric trees on the row and column axes of a data matrix to bring out its multiscale block patterns as the optimal resolution for the macrostate; 3) information theory: I take the multiscale block patterns as the minimal sufficient statistics of Kolmogorov’s algorithmic statistics, and construct algorithms for capturing the mechanisms of inherent randomness within the resulting blocks. The computed deterministic structures and inherent randomness are together called a coupling geometry on the data matrix. Such a coupling geometry embodies a matrix version of Kolmogorov complexity. Kolmogorov’s two-part coding scheme then becomes the integrative principle for mimicking, or bootstrapping, a matrix while retaining its systemic characteristics. This is why a coupling geometry plays a foundational role in studying a complex system at a single phase, where the phase is approximated by a network, a high-dimensional data cloud, or a time series. For the study of an entire system, a partial coupling geometry computation is developed to compute the causal and predictive associations linking two adjacent phases.
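To make the coupled-tree idea concrete, here is a minimal sketch that uses SciPy’s standard hierarchical clustering as a stand-in for the Data Mechanics computation (which iterates the row and column clusterings against each other and optimizes an energy, rather than running a single Ward pass as below): one tree is built on the rows and one on the columns of a data matrix, and the matrix is reordered by the two leaf orders so that its multiscale block patterns line up.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, leaves_list

def coupled_tree_reorder(X):
    """Build one ultrametric (hierarchical-clustering) tree on the rows and
    one on the columns, then reorder the matrix by the two leaf orders so
    that block patterns become visible along the diagonal."""
    row_tree = linkage(X, method="ward")     # tree on row vectors
    col_tree = linkage(X.T, method="ward")   # tree on column vectors
    row_order, col_order = leaves_list(row_tree), leaves_list(col_tree)
    return X[np.ix_(row_order, col_order)], row_order, col_order

# Toy matrix with a planted 2x2 block structure, scrambled and then recovered.
rng = np.random.default_rng(1)
X = np.kron(np.eye(2), np.ones((20, 15))) + 0.3 * rng.standard_normal((40, 30))
X = X[rng.permutation(40)][:, rng.permutation(30)]   # scramble rows/columns
reordered, rows, cols = coupled_tree_reorder(X)
top, bottom = reordered[:20], reordered[20:]
print("quadrant means:", [round(q.mean(), 2) for q in
                          (top[:, :15], top[:, 15:], bottom[:, :15], bottom[:, 15:])])
```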
Through partial coupling geometries, an evolution of phases can be sequentially linked and studied to build systemic understanding. Network bootstrapping ensembles are derived as technical devices that carry structural information at various scales and then give rise to an energy distribution profile. This energy distribution profile becomes the basis for inferring and testing hypotheses about systemic structure. A mutual energy derived from partial coupling geometries then plays the role of mutual information in information theory for evaluating the degree of association between two adjacent phases. These computational developments fall under one unified theme: Learning-From-Data. Under this theme, the complexity of a dependence structure is not only computable but visible, and this visualization surely enhances researchers’ understanding of the system as a whole. I envision coupling geometry computing as a foundation of data science that is critical in any scientific computing. The real-world systems investigated under this grant include: the NCAA football league, with a focus on nonlinear ranking hierarchy and systemic robustness; rhesus macaque society, with a focus on nonlinear ranking hierarchy and behavioral network interactions; a winemaking system, with a focus on how the effects of water stress on grapes propagate to the bottled wine as the ending phase; ecological and biogeographic systems, with a focus on mutualism and phylogenetic effects; the stock market, with a focus on networking among and along many-dimensional high-frequency time series; and, last but not least, the system of English words explored through Lewis Carroll’s word game Doublets, or Word Ladder. The take-home message is that, via Data Mechanics and its coupling geometry, systemic knowledge is computable and visible.
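The style of inference enabled by an energy distribution profile can be sketched as follows. The "energy" used here (negative sum of squared block means) and the plain row/column permutation ensemble are stand-ins invented for illustration, not the grant’s network-bootstrapping construction; the point is only that an observed energy is compared against an ensemble-generated energy distribution to yield an empirical p-value for systemic block structure.

```python
import numpy as np

def block_energy(X, row_labels, col_labels):
    """A simple Hamiltonian-like energy: negative sum of squared block means.
    Lower energy = stronger, more coherent block structure."""
    energy = 0.0
    for r in np.unique(row_labels):
        for c in np.unique(col_labels):
            block = X[np.ix_(row_labels == r, col_labels == c)]
            energy -= block.size * block.mean() ** 2
    return energy

def energy_profile_test(X, row_labels, col_labels, n_boot=500, seed=0):
    """Compare the observed energy with an ensemble of energies from
    row/column-shuffled matrices; return the empirical p-value."""
    rng = np.random.default_rng(seed)
    observed = block_energy(X, row_labels, col_labels)
    ensemble = []
    for _ in range(n_boot):
        Xs = X[rng.permutation(X.shape[0])][:, rng.permutation(X.shape[1])]
        ensemble.append(block_energy(Xs, row_labels, col_labels))
    pval = (np.sum(np.array(ensemble) <= observed) + 1) / (n_boot + 1)
    return observed, pval

# Toy usage on a planted 2x2 block matrix with known block memberships.
rng = np.random.default_rng(1)
X = np.kron(np.eye(2), np.ones((20, 15))) + 0.3 * rng.standard_normal((40, 30))
row_labels, col_labels = np.repeat([0, 1], 20), np.repeat([0, 1], 15)
obs, pval = energy_profile_test(X, row_labels, col_labels)
print(f"observed energy {obs:.1f}, empirical p-value {pval:.3f}")
```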

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1007219
Program Officer: Gabor Szekely
Budget Start: 2010-07-15
Budget End: 2014-12-31
Fiscal Year: 2010
Total Cost: $348,765
Name: University of California Davis
City: Davis
State: CA
Country: United States
Zip Code: 95618