Classification has broad applications in many fields, including the biological sciences, medicine, engineering, finance, and the social sciences. The aim of classification is to accurately predict class labels for new observations based on labeled training data. For example, an email service provider needs to decide whether an incoming email is spam. Among classification problems, binary classification is the most basic and the most important for theoretical, methodological, and algorithmic development. An important question in binary classification is how to control a prioritized type of error, either the type I error (the chance of misclassifying a class 0 data point as class 1) or the type II error (the chance of misclassifying a class 1 data point as class 0). The Neyman-Pearson (NP) classification paradigm is a theoretical framework that aims to control the type I (or type II) error with a theoretical guarantee. Yet implementing the NP paradigm with practical classification algorithms remains a great challenge. In this research, the PIs will tackle this challenge by developing new statistical theory, methods, algorithms, and a novel evaluation metric under the NP paradigm. Results from this project will have broad potential applications, such as reducing false positive rates in disease diagnosis and improving the accuracy of predicting social events from social media data. The PIs will supervise graduate and undergraduate students from diverse backgrounds in the proposed project, and the project outcomes will be taught in graduate-level seminar courses. To aid statistical and interdisciplinary research, the PIs will distribute the methods developed in this project as open-source software packages.
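
To make the two error types concrete, the following minimal sketch estimates both from labeled data, using the convention above that class 0 misclassified as class 1 is a type I error and class 1 misclassified as class 0 is a type II error. The function name `empirical_errors` and the toy labels are illustrative assumptions, not part of the project.

```python
from typing import Sequence, Tuple


def empirical_errors(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Estimate (type I error, type II error) from true and predicted 0/1 labels.

    Hypothetical helper for illustration only.
    """
    n0 = sum(1 for y in y_true if y == 0)  # number of class 0 observations
    n1 = sum(1 for y in y_true if y == 1)  # number of class 1 observations
    if n0 == 0 or n1 == 0:
        raise ValueError("need at least one observation from each class")
    # Type I: class 0 observations predicted as class 1.
    false_pos = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    # Type II: class 1 observations predicted as class 0.
    false_neg = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    return false_pos / n0, false_neg / n1


# Toy example: four class 0 points (one misclassified), two class 1 points (one misclassified).
type1, type2 = empirical_errors([0, 0, 0, 0, 1, 1], [1, 0, 0, 0, 0, 1])
print(type1, type2)
```

In the spam example, if spam is labeled class 1, a legitimate email sent to the spam folder counts toward the type I error.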

The PIs will develop new statistical theory, methods, algorithms, and applications to control asymmetric classification errors under the Neyman-Pearson (NP) paradigm. The NP paradigm addresses cases where users insist on a specific bound on the type I error while keeping the type II error to a minimum. Although the NP paradigm has a century-long history in hypothesis testing, until recently it received little attention in classification, and its theory and methodologies remain incomplete. With the following four aims, the PIs will develop a general NP classification framework and show how it can be applied in the biomedical and social sciences. Under Aim I, the PIs will develop new NP classification theory and methods by exploring feature dependency and interactions for different data structures and sample sizes. Under Aim II, the PIs will design an umbrella algorithm to adapt popular classification methods to the NP paradigm. Under Aim III, the PIs will construct "NP-ROC" curves, an NP version of Receiver Operating Characteristic (ROC) curves that serves as a new evaluation metric based on the NP classification theory and methodologies. Under Aim IV, the PIs will apply the novel NP classification methodologies developed in Aims I-III to large-scale biomedical and social applications.
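
As an illustration of what bounding the type I error can look like in practice, the sketch below picks a score threshold from held-out class 0 scores using an order statistic. It is one standard construction, shown under stated assumptions, and is not claimed to be the PIs' umbrella algorithm. It assumes an i.i.d. held-out class 0 sample with continuous scores from any scoring classifier, a target type I error bound `alpha`, and a tolerance `delta` on the probability that the bound is violated. The names `violation_bound`, `np_threshold`, and `classify` are hypothetical.

```python
from math import comb
from typing import Sequence


def violation_bound(k: int, n: int, alpha: float) -> float:
    """Bound on P(type I error > alpha) when thresholding at the k-th
    smallest of n held-out class 0 scores (assumes i.i.d. continuous scores)."""
    return sum(comb(n, j) * (1 - alpha) ** j * alpha ** (n - j) for j in range(k, n + 1))


def np_threshold(class0_scores: Sequence[float], alpha: float, delta: float) -> float:
    """Return the smallest order statistic whose violation bound is at most delta.

    A smaller threshold labels more points as class 1, which lowers the type II
    error, so the smallest admissible order statistic is preferred.
    """
    scores = sorted(class0_scores)
    n = len(scores)
    for k in range(1, n + 1):
        if violation_bound(k, n, alpha) <= delta:
            return scores[k - 1]
    raise ValueError("too few class 0 scores for this alpha and delta")


def classify(score: float, threshold: float) -> int:
    """Predict class 1 only when the score strictly exceeds the threshold."""
    return 1 if score > threshold else 0
```

Under these assumptions, a threshold exists only if (1 - alpha)^n <= delta, where n is the number of held-out class 0 scores; with `alpha = delta = 0.05` that requires at least 59 scores.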

Agency: National Science Foundation (NSF)
Institute: Division of Mathematical Sciences (DMS)
Type: Standard Grant (Standard)
Application #: 1613338
Program Officer: Gabor Szekely
Project Start:
Project End:
Budget Start: 2016-08-15
Budget End: 2019-10-31
Support Year:
Fiscal Year: 2016
Total Cost: $120,000
Indirect Cost:
Name: University of Southern California
Department:
Type:
DUNS #:
City: Los Angeles
State: CA
Country: United States
Zip Code: 90089