Yukich, Joseph

Lehigh University, Bethlehem, PA, United States

Many questions arising in stochastic geometry and applied probability, as well as questions arising in networks, spatial statistics, and statistical mechanics, may be understood in terms of the behavior of large random geometric structures, where `large' means that the randomness involves a growing number of random variables. `Geometric' means that the problems depend heavily on the geometry of the underlying space. Problems involving these complex structures involve understanding the behavior of sums of spatially dependent terms having short range interactions, but complicated long range dependence. Problems of interest in discrete stochastic geometry involve functionals of convex hulls of i.i.d. samples, asymptotic quantization error, the limit behavior of maximal points, and the limit behavior of generalized tessellations in Euclidean space. Problems of interest involving spatial data include dimension estimation of non-linear data clouds embedded in a high dimensional Euclidean space, estimation of entropy, estimation of surface and volume integrals, as well as establishing minimal cost networks for data transmission and energy scaling laws. In each case, one seeks to quantify the `mean' or average behavior of functionals arising in these problems. A chief goal is to show that sums of spatially dependent terms behave as though they were sums of independent identically distributed random variables. One thus wants to show that such sums satisfy laws of large numbers, that they have asymptotically a normal distribution, and that the random point measures defined by these sums satisfy functional central limit theorems, that is to say show their scaling behavior is understood in terms of Brownian sheets.

This project aims to solve problems in geometric probability which are of interest to researchers in both industry and academia. Examples include the following: (i) given an unknown object or body (such as an infarction in the human body or an underground deposit of oil) how can we use effectively use random probes of the object to find reliable estimators of its surface area and volume? (ii) given a huge amount of spatial data, how do we use only the interpoint distances of the data to determine intrinsic properties of the data, including its intrinsic dimension? (iii) given a network such as the world wide web, how does one best find ways to efficiently transmit and route information through it, minimizing cost and travel time? Similarly, given a communication network, how does one optimally place transmitters to maximize coverage? (iv) given any complex network, including airline and other transportation networks, how does one efficiently route vehicles to maximize revenue? The goal of this project is to develop theoretical tools to solve these and related problems and to develop efficient algorithms of use in industry.

This three year project lay at the interface of theoretical and applied probability. The goals of the project were to understand and quantify the behavior of certain random phenomena. In many cases, this meant showing that these random phenomena all have similar behavior, ie after re-scaling, that either they follow a Gaussian distribution (bell curve) or a law of large numbers. Some of the random phenomena include the number of Pareto extreme points in a random sample and the number of vertices of the convex hull of a random sample. It has long been expected that these random phenomena follow a Gaussian distribution and my joint work with Calka showed this to be the case. Pareto extreme points are useful in economic theory and they represent data points which are optimal in more than one way (ie they simultaneously maximize several parameters or interest). In another line of inquiry, my ressearch addressed the following problem: Given an object which is invisible to the human eye (e.g. a tumor inside the body or an oil deposit underground) how does one estimate its volume and shape? In this project we showed that if one randomly shoots rays at the object and if one can collect information about which rays hit the object and which do not, then using only this information it is possible (using Voronoi diagrams) to construct efficient statistical estimators of the volume and shape of the object. More precisely, if one knows whether a given randomly shot ray hits the object and if one knows the location of the `hit', then when the number of rays is large, it is possible to know with high accuracy the exact surface area and volume of the unknown object. This may lead to accurate estimators of tumor growth in the human body and it could also be useful in collecting geological information, such as size and volume of underground oil deposits. In another piece of work, I made some progress on understanding high dimensional data sets (big data). One of the fundamental problems with big data is that it is in fact `too big'. Too much data overwhelms the user. Oftentimes, however, large data sets are generated by a relatively few number of variables, that is to say there are only a few key parameters describing the data set. Understanding how many key parameters generate the data sets is a fundamental problem in applied statistics (this is known as the instrinsic dimension of the data). Some of my work with Quiroz and Penrose uses graph theoretic methods to find efficient estimators of the intrinsic dimension of high dimensional data sets. Work with Penrose develops a dimension estimator initially proposed by Bickel + Levina and we show that this dimension estimator is useful in a large number of situations involving data which could even live on a manifold or data which could be corrupted by noise. Solving the above problems involved first developing the proper mathematical tools, including developing limit theorems for functionals of point processes. Some of the limit theorems are extensions of the classical central limit theorem and classical law of large numbers to the setting of spatially dependent random variables. These general results appear to have their own merit and may lead to further applications in fileds outside probability theory.

- Agency
- National Science Foundation (NSF)
- Institute
- Division of Mathematical Sciences (DMS)
- Application #
- 1106619
- Program Officer
- Tomek Bartoszynski

- Project Start
- Project End
- Budget Start
- 2011-08-15
- Budget End
- 2014-07-31
- Support Year
- Fiscal Year
- 2011
- Total Cost
- $195,000
- Indirect Cost

- Name
- Lehigh University
- Department
- Type
- DUNS #

- City
- Bethlehem
- State
- PA
- Country
- United States
- Zip Code
- 18015