The central theme of the statistical sciences may be described as making inferences about the probability distribution of a random variable based on a sample drawn from that distribution. Over the central region of the data range, the higher data frequency often enables more reliable estimation of the distribution. However, regardless of how large a sample may be, there are always regions in the range of the random variable where few or no sample observations are available, for example, the tail region. Ironically, it is in the tail regions that statistical inferences are often most important. The investigator studies the problem of estimating a parametric tail with power decay in, and only in, the extremely distant tail of a discrete probability distribution. The investigator proposes to develop a consistent estimator of the tail probability distribution via a perspective offered by Turing's formula. Such an estimator, if established, would represent a previously unknown methodology that could shed light on an array of statistical problems involving tails of discrete probability distributions across a range of research disciplines.
The proposed project is motivated by, in addition to its theoretical merits, many practically important problems. The central focus of the project is to provide a methodology to quantify the likelihood of an extremely rare event, so rare that it may not have been previously observed. For example, in finance, the assessment of value at risk may involve quantifying an extremely unlikely event that would cause a portfolio to lose a large amount of value over a short time period; stress-test scenarios for the financial industry may also benefit from this methodology. In insurance, it may be of interest to assess the likelihood of a natural or personal disaster of extreme magnitude. In environmental biology, it may be of interest to assess biodiversity in a population while accounting for very rare species that are not represented in a sample. In homeland security, it may be of interest to assess the likelihood of a terrorist attack of a previously unobserved or unanticipated type. The proposed project provides an opportunity to enhance the ability to find solutions to all of the above-mentioned problems and beyond.
A key issue in statistical inference is to assess the probability of an unlikely event. Such an event could be that ``the next observed bird in the wild will be of a previously unseen or unknown species'', that ``on the next day the Dow Jones index will drop below a level far beyond the record'', that ``the next terrorist attack will use a tactic previously unseen'', or that ``the next tsunami will be of a magnitude much greater than any on record''. Such a problem may be formulated as one of estimating the (right) tail probability of an unknown distribution. The difficulty is that in the neighborhood of the tail of a distribution there are very few (or simply no) observations available that could support or validate any reasonable inference. The common wisdom is: if you don't see it, you cannot tell. The problem is also perpetual in the sense that no matter how large the sample size is and how far the data range reaches, there always exists a tail beyond the data range. Given the scarcity of data in the tail, the universally accepted model is to specify a parametric form of the distribution in the tail beyond a threshold (a nuisance parameter) and leave the form of the distribution to the left of the threshold unspecified. This is a long-standing problem in the statistics literature. Some versions of the problem have been studied since 1975, when Hill first proposed a solution in the continuous case. Hill's method was a breakthrough but left open the choice of a predetermined constant: if the constant is chosen correctly, all is well; if not, all is ill. There is no known way in the existing literature to guarantee a correct choice of that constant, and this unresolved issue leaves the problem fundamentally unsolved. The PI of this project proposed to study a discrete version of the above problem and to derive a procedure that estimates the parameters of the underlying distribution while bypassing any estimation of the nuisance threshold.
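To make the role of the predetermined constant concrete, the following is a minimal sketch of the textbook form of Hill's estimator for a positive, heavy-tailed continuous sample. It is an illustration only, not the procedure developed in this project. The function name `hill_estimator` and the argument `k` are illustrative choices; the text above speaks only of "a predetermined constant", and identifying it with `k`, the number of largest observations used, is an assumption based on the standard presentation of the estimator.

```python
import math


def hill_estimator(sample, k):
    """Textbook Hill estimate of the tail index (1/alpha for a power-law tail).

    `k` is the number of largest observations used; it must be fixed in
    advance by the analyst (assumed here to be the "predetermined constant").
    """
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < len(sample)")
    xs = sorted(sample, reverse=True)  # descending order statistics
    threshold = xs[k]                  # the (k+1)-th largest observation
    if threshold <= 0:
        raise ValueError("the k+1 largest observations must be positive")
    # Average log-excess of the k largest observations over the threshold.
    return sum(math.log(x / threshold) for x in xs[:k]) / k


# Toy data chosen so the arithmetic is easy to follow: 1, e, e^2, e^3, e^4.
toy_sample = [math.exp(i) for i in range(5)]
estimate_k2 = hill_estimator(toy_sample, 2)
estimate_k4 = hill_estimator(toy_sample, 4)
```

On this toy sample the two choices of `k` give different estimates from the same data, which is the sensitivity described above.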
The proposed approach takes a fundamentally new perspective and utilizes newly acquired results associated with Turing's formula. By the end of the study, the PI had established the following: A nonparametric statistical procedure called AMLE (asymptotic maximum likelihood estimator) is proposed to estimate the parameters of an assumed probability distribution with power decay in the tail. A proof is secured to show that the proposed estimator is consistent, i.e., the estimator produces a satisfactory description of the probability law when a sufficiently large sample is available. As a result of the study, there now exists a new statistical tool to assess the nature of extreme random behavior in regions with sparse (or no) observed data.
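For readers unfamiliar with Turing's formula, a minimal sketch of its basic form follows: the proportion of observations whose category appears exactly once in the sample is used as an estimate of the total probability of all categories not seen in the sample. This is only the classical starting point, not the AMLE procedure itself, whose details are not given here. The function name `turing_formula` and the toy data are illustrative.

```python
from collections import Counter


def turing_formula(sample):
    """Turing's formula: N1 / n, where N1 is the number of categories
    observed exactly once and n is the sample size. It estimates the total
    probability carried by categories absent from the sample."""
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


# Toy sample of species labels: "b" and "d" each appear exactly once.
species = ["a", "a", "b", "c", "c", "c", "d"]
unseen_mass = turing_formula(species)  # 2 singletons out of 7 observations
```

In the bird example mentioned earlier, this quantity would be read as a rough estimate of the chance that the next observed bird belongs to a species not yet in the sample.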