This is an Open Access publication published under CSC-OpenAccess Policy.
A Two-Step Self-Evaluation Algorithm On Imputation Approaches For Missing Categorical Data
Lukun Zheng
Pages - 1 - 12     |    Revised - 31-01-2018     |    Published - 30-04-2018
Volume - 7   Issue - 1    |    Publication Date - April 2018
KEYWORDS
Categorical Variable, Imputation Methods, Missing Value, Re-Imputation Accuracy Rate.
ABSTRACT
Missing data are often encountered in data sets and are a common problem for researchers in many fields. Observations may have missing values for many reasons; for instance, some respondents may not report some of the items. Missing data make statistical analyses more difficult, especially when a large fraction of the data is missing. Many methods have been developed for dealing with missing data, whether numeric or categorical. The performance of an imputation method is key in deciding which method to use. Performance is usually evaluated by how well the method supports inference about target parameters under a statistical model. One important parameter is the expected imputation accuracy rate, which, however, relies heavily on assumptions about the missing data type and the imputation method. For instance, it may require that the data be missing completely at random. The goal of the current study was to develop a two-step algorithm to evaluate the performance of imputation methods for missing categorical data. The evaluation is based on the re-imputation accuracy rate (RIAR) introduced in the current work. A simulation study based on real data is conducted to demonstrate how the evaluation algorithm works.
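The abstract does not spell out how the RIAR is computed, so the following is only a minimal illustrative sketch of the general idea of scoring an imputation method for categorical data by its accuracy on values that are known but deliberately hidden. It is not the paper's two-step algorithm. The function names `mode_impute` and `masked_accuracy`, the `mask_fraction` parameter, the mode-based imputer, and the toy data are all assumptions made for illustration.

```python
import random
from collections import Counter


def mode_impute(values):
    """Hypothetical imputer: replace None with the most frequent observed category."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def masked_accuracy(values, impute, mask_fraction=0.2, seed=0):
    """Hide a random share of the observed entries, impute them,
    and return the fraction of hidden entries that were recovered.

    This is an assumed, generic accuracy check, not the RIAR as
    defined in the paper.
    """
    rng = random.Random(seed)
    observed_idx = [i for i, v in enumerate(values) if v is not None]
    n_mask = max(1, int(len(observed_idx) * mask_fraction))
    masked_idx = rng.sample(observed_idx, n_mask)

    # Copy the data and hide the selected known values.
    masked = list(values)
    for i in masked_idx:
        masked[i] = None

    imputed = impute(masked)
    hits = sum(imputed[i] == values[i] for i in masked_idx)
    return hits / n_mask


# Toy categorical variable with two genuinely missing entries (None).
data = ["a", "a", "b", None, "a", "c", "a", None, "b", "a"]
rate = masked_accuracy(data, mode_impute)
print(f"accuracy on hidden known values: {rate:.2f}")
```

Any function that maps a list with `None` entries to a completed list can be passed as `impute`, which allows several imputation methods to be compared on the same hidden entries.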
Dr. Lukun Zheng
Tennessee Technological University - United States of America
lzheng@tntech.edu