Classification Scoring for Cleaning Inconsistent Survey Data
M. Rita Thissen
Pages - 1 - 14 | Revised - 31-01-2017 | Published - 28-02-2017
Volume - 7 Issue - 1 | Publication Date - February 2017
Data Cleaning, Classification Models, Data Editing, Classification Scoring, Survey Data, Data Integrity.
Data engineers are often asked to detect and resolve inconsistencies within data sets. For some data sources with problems, there is no option to ask for corrections or updates, and the processing steps must do their best with the values in hand. Such circumstances arise in processing survey data, in constructing knowledge bases or data warehouses [1] and in using some public or open data sets.

The goal of data cleaning, sometimes called data editing or integrity checking, is to improve the accuracy of each data record and by extension the quality of the data set as a whole. Generally, this is accomplished through deterministic processes that recode specific data points according to static rules based entirely on data from within the individual record. This traditional method works well for many purposes. However, when high levels of inconsistency exist within an individual respondent's data, classification scoring may provide better results.

Classification scoring is a two-stage process that makes use of information from more than the individual data record. In the first stage, population data is used to define a model, and in the second stage the model is applied to the individual record. The author and colleagues turned to a classification scoring method to resolve inconsistencies in a key value from a recent health survey. Drawing training records from a pool of about 11,000 survey respondents, we defined a model and used it to classify the vital status of the survey subject; in proxy surveys, the subject of the study may be a different person from the respondent. The scoring model was tested on the next several months' receipts and then applied on a flow basis, during the remainder of data collection, to the scanned and interpreted forms for a total of 18,841 unique survey subjects. Classification results were confirmed through external means to further validate the approach. This paper provides methodology and algorithmic details and suggests when this type of cleaning process may be useful.
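
The two-stage idea can be illustrated with a minimal sketch. This is not the model described in the paper: it assumes a simple additive score built from smoothed log-ratio weights, and the field names (`deceased`, `proxy_respondent`, `date_of_death_given`), the toy training records, and the threshold are all hypothetical, chosen only to show how a population-level model is defined first and then applied to one inconsistent record.

```python
import math

def train_scoring_model(records, label_key, indicator_keys):
    """Stage 1: derive one weight per indicator from labeled population records.

    Each weight is the log of (rate among positives / rate among negatives),
    with add-one smoothing so that empty cells never produce log(0).
    """
    pos = [r for r in records if r[label_key]]
    neg = [r for r in records if not r[label_key]]
    weights = {}
    for key in indicator_keys:
        p_pos = (sum(1 for r in pos if r.get(key)) + 1) / (len(pos) + 2)
        p_neg = (sum(1 for r in neg if r.get(key)) + 1) / (len(neg) + 2)
        weights[key] = math.log(p_pos / p_neg)
    return weights

def score_record(record, weights):
    """Stage 2: sum the weights of the indicators present in one record."""
    return sum(w for key, w in weights.items() if record.get(key))

def classify(record, weights, threshold=0.5):
    """Classify as positive when the score exceeds the (assumed) threshold."""
    return score_record(record, weights) > threshold

# Hypothetical toy training data; real training would use far more records.
training = [
    {"deceased": True,  "proxy_respondent": True,  "date_of_death_given": True},
    {"deceased": True,  "proxy_respondent": True,  "date_of_death_given": False},
    {"deceased": False, "proxy_respondent": False, "date_of_death_given": False},
    {"deceased": False, "proxy_respondent": True,  "date_of_death_given": False},
]
weights = train_scoring_model(
    training, "deceased", ["proxy_respondent", "date_of_death_given"]
)

# A record whose vital status is unknown or inconsistent is scored by the model.
unclear = {"proxy_respondent": True, "date_of_death_given": True}
print(classify(unclear, weights))
```

The point of the design is that the weights come from the population rather than from the record being cleaned, which is what distinguishes this approach from static per-record recoding rules.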
A. Jonsson and G. Svingby. "The use of scoring rubrics: Reliability, validity and educational consequences." Educational Research Review, vol. 2(2), pp. 130-144, 2007.
A.L. Brennaman. "Examination of osteoarthritis for age-at-death estimation in a modern population." Doctoral dissertation, Boston University, 2014.
D. Shukla and R. Singhai. "Some imputation methods to treat missing values in knowledge discovery in data warehouse." International Journal of Data Engineering, vol. 1(2), pp. 1-13, 2010.
D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: Wiley, 2004, pp. 11-15.
E. Leahey. "Overseeing research practice: the case of data editing." Science, Technology & Human Values, vol. 33(5), pp. 605-630, 2008.
E. Rahm and H.H. Do. "Data cleaning: Problems and current approaches." IEEE Data Engineering Bulletin, vol. 23(4), pp. 3-13, 2000.
F. Kreuter, S. McCulloch, S. Presser, and R. Tourangeau. "The effects of asking filter questions in interleafed versus grouped format." Sociological Methods & Research, vol. 40(1), pp. 88-104, 2011.
H.J. Kim, L.H. Cox, A.F. Karr, J.P. Reiter, and Q. Wang. "Simultaneous edit-imputation for continuous microdata." Journal of the American Statistical Association, vol. 110(511), pp. 987-999, 2015.
J. Han, J. Pei, and M. Kamber. Data Mining: Concepts and Techniques. Amsterdam, Netherlands: Elsevier, 2011, p. 331.
M. Farfel, L. DiGrande, R. Brackbill, A. Prann, J. Cone, S. Friedman, D.J. Walker, G. Pezeshki, P. Thomas, S. Galea, and D. Williamson. "An overview of 9/11 experiences and respiratory and mental health conditions among World Trade Center Health Registry enrollees." Journal of Urban Health, vol. 85(6), pp. 880-909, 2008.
P.M. Fayers and D. Machin. Quality of Life: The Assessment, Analysis and Interpretation of Patient-Reported Outcomes. Hoboken, NJ: Wiley, 2013, p. 1958.
P.N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Boston: Addison-Wesley, 2005, pp. 145-205.
P.P. Biemer and L.E. Lyberg. Introduction to Survey Quality. Hoboken, NJ: Wiley, 2003, p. 41.
R. Brackbill, D. Walker, S. Campolucci, J. Sapp, M. Dolan, J. Murphy, L. Thalji, and P. Pulliam. World Trade Center Health Registry. New York: New York City Department of Health and Mental Hygiene, 2006.
S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. "Robust and efficient fuzzy match for online data cleaning," in Proc. 2003 ACM SIGMOD International Conference on Management of Data, pp. 313-324, 2003.
S. Tuffery. Data Mining et statistique décisionnelle: L'intelligence des données. Paris: Éditions Technip, 2012, p. 30.
V. Raman and J.M. Hellerstein. "Potter's wheel: An interactive data cleaning system." The VLDB Journal, vol. 1, pp. 381-390, 2001.
W.E. Winkler and B.-C. Chen. "Extending the Fellegi-Holt model of statistical data editing," in ASA Proceedings of the Section on Survey Research Methods, 2001, pp. 1-21.
W.E. Winkler. "State of statistical data editing and current research problems (No. 29)." U.S. Census Bureau, Working Paper, 1999.
Mrs. M. Rita Thissen
RTI International - United States of America
