Project Type:

Project

Project Sponsors:

  • National Science Foundation - NSF

Project Award:

  • $239,998

Project Timeline:

2014-09-01 – 2017-08-31



Lead Principal Investigator:



RUI: Classification, regression, and density estimation with missing variables


This project develops statistical theory and methods for nonparametric classification and curve estimation in the presence of missing or incomplete data. Many data sets have missing values; examples include data from biomedical studies, remote sensing, and the social sciences. There are a number of classical approaches for handling missing data. Many existing methods first impute the missing values and then apply a standard statistical technique to carry out inference. However, a study of the theoretical validity of such techniques can become intractable because the imputed data no longer satisfy the independence assumption; this is particularly true for distribution-free statistical methods. The new results of the Principal Investigator (PI) will answer a number of fundamental questions in statistical classification and pattern recognition, with applications to biomedical research, remote sensing, and the social sciences. The new results will also solve many important theoretical problems at the intersection of machine learning and statistical classification.

A long-standing problem in classification with missing covariates involves the situation where missing covariates can appear both in the data and in the new unclassified observation. This is fundamentally different from the simpler problem where missing covariates appear in the data only. In the latter case, standard methods based on Horvitz-Thompson inverse weighting can be used to construct asymptotically optimal classifiers. One part of the PI's research project focuses on the more challenging case, in which the new observation may also have missing covariates. The PI will develop new asymptotically optimal local-averaging-type classifiers, such as kernel and partitioning rules. Another part of this project continues and refines the PI's previous efforts on combined classification and estimation, based on recently obtained results in the literature. The PI is currently developing new methods to combine several individual classifiers in such a way that the asymptotic error of the resulting classifier will be no worse than that of the best individual classifier. The PI will also develop methods to combine several regression function estimators in an optimal way. Tools from empirical process theory will be used to establish the large-sample optimality of the resulting classifiers and estimators. The third part of the project focuses on the weak convergence of various norms of kernel density estimates in the presence of missing data. The PI will study weighted bootstrap approximations of these statistics. Such results will allow one to construct valid confidence bands for the unknown density in the presence of missing values. The main tools here are the strong approximation theorems that allow one to replace the weighted bootstrapped empirical processes by a sequence of Brownian bridges.
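
To make the simpler case concrete, the following is a minimal sketch of a Horvitz-Thompson-weighted kernel classifier for the setting where covariates are missing in the data only and the new observation is fully observed. It is an illustration under stated assumptions, not the PI's method: the function names, the Gaussian kernel, the two-class 0/1 labels, and the assumption that the selection probabilities are known (in practice they would usually have to be estimated) are all choices made here for the example.

```python
import math


def gaussian_kernel(u):
    """Unnormalised Gaussian kernel evaluated at a squared scaled distance u."""
    return math.exp(-0.5 * u)


def ht_kernel_classify(x_new, X, y, delta, p_obs, h):
    """Classify a fully observed point x_new into class 0 or 1.

    X     : list of covariate vectors (rows with delta == 0 are never read)
    y     : list of 0/1 class labels
    delta : list of 0/1 flags, 1 if the covariate vector is fully observed
    p_obs : list of selection probabilities P(delta_i = 1 | observed data),
            assumed known here (hypothetical input for this sketch)
    h     : kernel bandwidth, h > 0
    """
    vote = 0.0
    for xi, yi, di, pi in zip(X, y, delta, p_obs):
        if di == 0:
            continue  # incomplete cases contribute nothing to the vote
        sq_dist = sum((a - b) ** 2 for a, b in zip(x_new, xi))
        # Inverse-probability weight: complete cases that were unlikely to be
        # observed count more, compensating for similar cases that were lost.
        weight = gaussian_kernel(sq_dist / h ** 2) / pi
        vote += weight * (2 * yi - 1)  # +weight for class 1, -weight for class 0
    return 1 if vote > 0 else 0


# Toy data: the second coordinate of the last case is missing.
X = [[0.0, 0.1], [0.2, -0.1], [1.0, 0.9], [1.1, None]]
y = [0, 0, 1, 1]
delta = [1, 1, 1, 0]
p_obs = [0.9, 0.9, 0.5, 0.5]  # assumed known selection probabilities
print(ht_kernel_classify([0.8, 0.8], X, y, delta, p_obs, h=0.5))
```

Dividing each kernel weight by its selection probability is what distinguishes this rule from a complete-case kernel classifier. The sketch does not cover the harder situation described above, where the new observation itself has missing covariates.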





