
This is an Open Access publication published under CSC-OpenAccess Policy.
Classification Scoring for Cleaning Inconsistent Survey Data
M. Rita Thissen
Pages: 1-14   |   Revised: 31-01-2017   |   Published: 28-02-2017
Volume: 7   |   Issue: 1   |   Publication Date: February 2017
Keywords: Data Cleaning, Classification Models, Data Editing, Classification Scoring, Survey Data, Data Integrity.
Data engineers are often asked to detect and resolve inconsistencies within data sets. For some problematic data sources there is no way to request corrections or updates, and processing must make the best of the values at hand. Such circumstances arise in processing survey data, in constructing knowledge bases or data warehouses [1] and in using some public or open data sets.

The goal of data cleaning, sometimes called data editing or integrity checking, is to improve the accuracy of each data record and by extension the quality of the data set as a whole. Generally, this is accomplished through deterministic processes that recode specific data points according to static rules based entirely on data from within the individual record. This traditional method works well for many purposes. However, when high levels of inconsistency exist within an individual respondent's data, classification scoring may provide better results.

Classification scoring is a two-stage process that draws on information beyond the individual data record. In the first stage, population data is used to define a model; in the second stage, the model is applied to the individual record. The author and colleagues turned to a classification scoring method to resolve inconsistencies in a key value from a recent health survey. Drawing records from a pool of about 11,000 survey respondents for use in training, we defined a model and used it to classify the vital status of the survey subject, since in a proxy survey the subject of the study may be a different person from the respondent. The scoring model was tested on the next several months' receipts and then applied on a flow basis to the scanned and interpreted forms for the remainder of data collection, covering a total of 18,841 unique survey subjects. Classification results were confirmed through external means to further validate the approach. This paper provides methodology and algorithmic details and suggests when this type of cleaning process may be useful.
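The two-stage process described above can be sketched in miniature. The code below is an illustrative assumption, not the paper's actual algorithm: it fits a simple smoothed log-odds (naive Bayes style) scoring model from a labeled training pool (stage 1), then scores an inconsistent incoming record and assigns the higher-scoring class (stage 2). All field names and records are hypothetical stand-ins for survey data.

```python
from collections import defaultdict
import math

def fit_scores(training_records, label_field, feature_fields):
    """Stage 1: count feature values per class in the training pool."""
    counts = defaultdict(lambda: defaultdict(int))   # counts[class][(field, value)]
    class_counts = defaultdict(int)
    for rec in training_records:
        cls = rec[label_field]
        class_counts[cls] += 1
        for f in feature_fields:
            counts[cls][(f, rec.get(f))] += 1
    return counts, class_counts

def classify(record, counts, class_counts, feature_fields):
    """Stage 2: score the record under each class; return the best class."""
    best_cls, best_score = None, float("-inf")
    total = sum(class_counts.values())
    for cls, n in class_counts.items():
        score = math.log(n / total)  # class prior
        for f in feature_fields:
            seen = counts[cls][(f, record.get(f))]
            score += math.log((seen + 1) / (n + 2))  # add-one smoothed likelihood
        if score > best_score:
            best_cls, best_score = cls, score
    return best_cls

# Hypothetical training pool: proxy-survey records whose vital status
# was externally confirmed, with indicator fields from the form.
training = [
    {"status": "alive", "answered_health_q": "yes", "date_of_death": "missing"},
    {"status": "alive", "answered_health_q": "yes", "date_of_death": "missing"},
    {"status": "deceased", "answered_health_q": "no", "date_of_death": "given"},
    {"status": "deceased", "answered_health_q": "no", "date_of_death": "given"},
]
features = ["answered_health_q", "date_of_death"]
counts, class_counts = fit_scores(training, "status", features)

# An inconsistent incoming record is resolved by the model rather than
# by a static per-record rule.
inconsistent = {"answered_health_q": "no", "date_of_death": "given"}
print(classify(inconsistent, counts, class_counts, features))  # → deceased
```

Unlike a deterministic recode rule, the score combines evidence from several fields at once, which is what makes this style of cleaning attractive when many fields within one record disagree.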
1 V. Raman and J.M. Hellerstein. "Potter's wheel: An interactive data cleaning system." The VLDB Journal, vol. 1, pp. 381-390, 2001.
2 P.P. Biemer and L.E. Lyberg. Introduction to Survey Quality. Hoboken, NJ: Wiley, 2003, p. 41.
3 F. Kreuter, S. McCulloch, S. Presser, and R. Tourangeau. "The effects of asking filter questions in interleafed versus grouped format." Sociological Methods & Research, vol. 40(1), pp. 88-104, 2011.
4 W.E. Winkler. "State of statistical data editing and current research problems (No. 29)." U.S. Census Bureau, Working Paper. 1999.
5 W.E. Winkler and B.-C. Chen. "Extending the Fellegi-Holt model of statistical data editing," in ASA Proceedings of the Section on Survey Research Methods, 2001, pp. 1-21.
6 E. Rahm and H.H. Do. "Data cleaning: Problems and current approaches." IEEE Data Engineering Bulletin, vol. 23(4), pp. 3-13, 2000.
7 D. Shukla and R. Singhai. "Some imputation methods to treat missing values in knowledge discovery in data warehouse." International Journal of Data Engineering, vol. 1(2), pp. 1-13, 2010.
8 S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. "Robust and efficient fuzzy match for online data cleaning," in Proc. 2003 ACM SIGMOD International Conference on Management of Data, pp. 313-324, 2003.
9 D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: Wiley, 2004, pp. 11-15.
10 J. Han, J. Pei, and M. Kamber. Data Mining: Concepts and Techniques. Amsterdam, Netherlands: Elsevier, 2011, p. 331.
11 P.N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Boston: Addison-Wesley, 2005, pp. 145-205.
12 R. Brackbill, D. Walker, S. Campolucci, J. Sapp, M. Dolan, J. Murphy, L. Thalji, and P. Pulliam. World Trade Center Health Registry. New York: New York City Department of Health and Mental Hygiene, 2006.
13 M. Farfel, L. DiGrande, R. Brackbill, A. Prann, J. Cone, S. Friedman, D.J. Walker, G. Pezeshki, P. Thomas, S. Galea, and D. Williamson. "An overview of 9/11 experiences and respiratory and mental health conditions among World Trade Center Health Registry enrollees." Journal of Urban Health, vol. 85(6), pp. 880-909, 2008.
14 H.J. Kim, L.H. Cox, A.F. Karr, J.P. Reiter, and Q. Wang. "Simultaneous edit-imputation for continuous microdata." Journal of the American Statistical Association, vol. 110(511), pp. 987-999, 2015.
15 A. Jonsson and G. Svingby. "The use of scoring rubrics: Reliability, validity and educational consequences." Educational Research Review, vol. 2(2), pp. 130-144, 2007.
16 P.M. Fayers and D. Machin. Quality of Life: The Assessment, Analysis and Interpretation of Patient-Reported Outcomes. Hoboken, NJ: Wiley, 2013, p. 1958.
17 A.L. Brennaman. "Examination of osteoarthritis for age-at-death estimation in a modern population." Doctoral dissertation, Boston University, 2014.
18 E. Leahey. "Overseeing research practice: the case of data editing." Science, Technology & Human Values, vol. 33(5), pp. 605-630, 2008.
19 S. Tuffery. Data Mining et statistique décisionnelle: L'intelligence des données. Paris: Éditions Technip, 2012, p. 30.
Mrs. M. Rita Thissen
RTI International - United States of America