Data depth has provided a systematic nonparametric multivariate framework and given rise to a powerful multivariate analysis tool set. However, its potential in spacings and classification has yet to be fully explored. Motivated by several real applications, the investigator plans to: 1) develop nonparametric classification procedures based on DD (Depth-vs-Depth) plots. These procedures are referred to as DD-classifiers, and they are to be compared with support vector machine procedures; 2) use the multivariate spacings derived from data depth to: (2a) construct tolerance envelopes for functional or time series data and (2b) develop a class of multivariate goodness-of-fit tests.
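To make the DD-plot idea concrete, the following is a minimal sketch and not the investigator's implementation. It assumes Mahalanobis depth as the depth function (one common choice among several) and uses the simplest separating rule, the 45-degree line in the DD plot, which assigns a point to the class in which it is deeper. The function names `mahalanobis_depth`, `dd_coordinates`, and `classify_by_diagonal` and the simulated data are hypothetical, introduced only for illustration.

```python
import numpy as np


def mahalanobis_depth(points, sample):
    """Mahalanobis depth of each row of `points` with respect to `sample`.

    Assumption: depth = 1 / (1 + squared Mahalanobis distance to the sample
    mean). Any other depth function could be substituted here.
    """
    points = np.atleast_2d(points)
    mu = sample.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(sample, rowvar=False))
    diff = points - mu
    d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)
    return 1.0 / (1.0 + d2)


def dd_coordinates(points, sample_a, sample_b):
    """Map d-dimensional points to 2-D DD-plot coordinates (depth in A, depth in B)."""
    return np.column_stack(
        [mahalanobis_depth(points, sample_a), mahalanobis_depth(points, sample_b)]
    )


def classify_by_diagonal(points, sample_a, sample_b):
    """Assign label 0 (class A) or 1 (class B) using the 45-degree line.

    This is only the simplest separating rule in the DD plot; a fitted
    separating curve would replace this comparison.
    """
    dd = dd_coordinates(points, sample_a, sample_b)
    return np.where(dd[:, 0] >= dd[:, 1], 0, 1)


# Simulated 5-dimensional training samples (illustrative data only).
rng = np.random.default_rng(0)
sample_a = rng.normal(loc=0.0, size=(200, 5))
sample_b = rng.normal(loc=3.0, size=(200, 5))

# One test point at the center of each class.
test_points = np.array([[0.0] * 5, [3.0] * 5])
dd = dd_coordinates(test_points, sample_a, sample_b)
labels = classify_by_diagonal(test_points, sample_a, sample_b)
```

Whatever the dimension of the data, `dd` has exactly two columns, which is why the classification outcome can be displayed in a two-dimensional plot.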
Classification is an important task in all scientific domains, such as identifying new species in archaeological investigations or distinguishing disease types in medical studies. Applying the notion of data depth, the investigator proposes to develop effective classification procedures, which can automatically yield the best separating power for classification purposes and compete well with the highly calibrated existing classification procedures. The classification outcomes can be easily visualized in a two-dimensional plot regardless of the dimension of the data. The investigator also introduces multivariate spacings for the analysis of multi-dimensional data. These multivariate spacings should have a wide range of uses. In particular, the investigator applies these spacings to develop both tolerance envelopes for tracking multivariate data and a class of multivariate goodness-of-fit tests. She plans to apply the proposed tolerance envelopes to monitoring aircraft landing patterns and ensuring landing safety. She also plans to apply the proposed classification procedures to disease identification. These applications are motivated by the investigator's ongoing collaborative research projects with the Federal Aviation Administration and the Department of Psychiatry of the Robert Wood Johnson Medical School. The proposed projects involve real databases and are ideally suited for engaging students and postdocs.
This project has developed a novel classification methodology called the "DD-classifier". The DD-classifier has been developed and implemented with full theoretical justification and complete software code for applications, which are detailed in a research article published in the Journal of the American Statistical Association. The DD-classifier is shown to be the best nonparametric classification procedure for multivariate data to date. It achieves the optimal error rate and outperforms even highly calibrated existing classification procedures, including the nearest neighbor approach. More specifically, the DD-classifier has the potential to automatically yield the best separating plane (or surface) for classification purposes. The classification outcomes can be easily visualized in a two-dimensional plot regardless of the dimension of the data. The approach is completely data driven and inherits the nonparametric nature of data depth; that is, it carries out statistical analysis without assuming models. This property gives the DD-classifier broad applicability in real-life settings. Part of this project also focused on applying spacings derived from data depth to develop tolerance envelopes for tracking multivariate data. To this end, a new notion of data depth, "Antipodal Reflection Depth (ARD)", has been developed for the project. Aside from possessing the usual desirable properties of a data depth, ARD is shown to be effective for detecting outliers in multivariate or functional settings. Furthermore, the results of this project have helped develop methods for combining inferences from different studies or sources. Cases in point are: combining inferences from clinical trial data and expert opinions; combining inferences from discrete analyses where rare events may render the existing meta-analysis approach useless; and combining inferences from heterogeneous studies.
Four articles reporting these new methods have been published or are currently in press. These research outcomes are particularly useful and timely in light of the information explosion in this Big Data era. These outcomes further highlight the importance of statistics. With the support of this NSF grant, the PI has made substantial progress on her proposed research and developed new statistical methodology for solving important real-life problems. The PI has also supported four graduate students, two of whom have completed their PhD degrees. She has also been involved in organizing statistical events and educational activities, including short courses and conferences. She has given more than 50 invited lectures or presentations at national and international meetings to disseminate her research outcomes.