Classification Scoring for Cleaning Inconsistent Survey Data

M. Rita Thissen

Creative Commons Attribution NonCommercial 4.0 International License

Pages - 1 - 14 | Revised - 31-01-2017 | Published - 28-02-2017

Published in International Journal of Data Engineering (IJDE)

Volume - 7 | Issue - 1 | Publication Date - February 2017

KEYWORDS

Data Cleaning, Classification Models, Data Editing, Classification Scoring, Survey Data, Data Integrity.

ABSTRACT

Data engineers are often asked to detect and resolve inconsistencies within data sets. For some problematic data sources, there is no option to request corrections or updates, and processing must make the best of the values in hand. Such circumstances arise in processing survey data, in constructing knowledge bases or data warehouses [1], and in using some public or open data sets.

The goal of data cleaning, sometimes called data editing or integrity checking, is to improve the accuracy of each data record and by extension the quality of the data set as a whole. Generally, this is accomplished through deterministic processes that recode specific data points according to static rules based entirely on data from within the individual record. This traditional method works well for many purposes. However, when high levels of inconsistency exist within an individual respondent's data, classification scoring may provide better results.
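The traditional deterministic approach can be illustrated with a minimal sketch. The field names (`smokes`, `cigs_per_day`) and both rules are hypothetical and are not taken from the paper; the point is only that each rule is static and reads nothing outside the record being cleaned.

```python
from typing import Any, Dict


def recode_record(record: Dict[str, Any]) -> Dict[str, Any]:
    """Apply static edit rules using only values inside this one record."""
    cleaned = dict(record)  # work on a copy so the raw record is preserved

    # Hypothetical rule 1: a "no" filter answer contradicts a positive
    # follow-up count, so trust the more detailed follow-up answer.
    if cleaned.get("smokes") == "no" and (cleaned.get("cigs_per_day") or 0) > 0:
        cleaned["smokes"] = "yes"

    # Hypothetical rule 2: a "no" filter answer with a missing follow-up
    # implies a count of zero.
    if cleaned.get("smokes") == "no" and cleaned.get("cigs_per_day") is None:
        cleaned["cigs_per_day"] = 0

    return cleaned


inconsistent = recode_record({"smokes": "no", "cigs_per_day": 10})
skipped = recode_record({"smokes": "no", "cigs_per_day": None})
```

Rules of this kind work when a contradiction involves only one or two fields. When many fields in the same record disagree, the order and choice of rules starts to determine the outcome, which is the situation in which the abstract suggests classification scoring instead.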

Classification scoring is a two-stage process that draws on information beyond the individual data record. In the first stage, population data is used to define a model; in the second stage, the model is applied to the individual record. The author and colleagues turned to a classification scoring method to resolve inconsistencies in a key value from a recent health survey. Drawing training records from a pool of about 11,000 survey respondents, we defined a model and used it to classify the vital status of the survey subject, because in a proxy survey the subject of the study may be a different person from the respondent. The scoring model was tested on the next several months' receipts and then applied on a flow basis, for the remainder of data collection, to the scanned and interpreted forms for a total of 18,841 unique survey subjects. Classification results were confirmed through external means to further validate the approach. This paper provides methodology and algorithmic details and suggests when this type of cleaning process may be useful.
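The two stages can be sketched as follows. This is an illustration under stated assumptions, not the model from the paper: the indicator names (`death_date_given`, `proxy_respondent`), the toy training data, the smoothed log-ratio weighting, and the zero threshold are all hypothetical choices made to keep the example small.

```python
import math
from typing import Dict, List, Tuple

Record = Dict[str, bool]


def fit_weights(training: List[Tuple[Record, bool]],
                indicators: List[str]) -> Dict[str, float]:
    """Stage 1: derive one weight per indicator from labeled population data."""
    pos = [rec for rec, label in training if label]
    neg = [rec for rec, label in training if not label]
    weights = {}
    for name in indicators:
        # Laplace-smoothed rate at which the indicator is present in each class.
        p_pos = (sum(rec.get(name, False) for rec in pos) + 1) / (len(pos) + 2)
        p_neg = (sum(rec.get(name, False) for rec in neg) + 1) / (len(neg) + 2)
        weights[name] = math.log(p_pos / p_neg)
    return weights


def score(record: Record, weights: Dict[str, float]) -> float:
    """Stage 2: sum the weights of the indicators present in one record."""
    return sum(w for name, w in weights.items() if record.get(name, False))


def classify(record: Record, weights: Dict[str, float],
             threshold: float = 0.0) -> bool:
    """Assign the positive class when the score exceeds the threshold."""
    return score(record, weights) > threshold


# Hypothetical training data: label True means the subject is deceased.
training = [
    ({"death_date_given": True, "proxy_respondent": True}, True),
    ({"death_date_given": True, "proxy_respondent": True}, True),
    ({"death_date_given": False, "proxy_respondent": True}, True),
    ({"death_date_given": False, "proxy_respondent": False}, False),
    ({"death_date_given": False, "proxy_respondent": True}, False),
    ({"death_date_given": False, "proxy_respondent": False}, False),
]
weights = fit_weights(training, ["death_date_given", "proxy_respondent"])
flagged = classify({"death_date_given": True, "proxy_respondent": True}, weights)
unflagged = classify({}, weights)
```

Because the weights come from the population rather than from the record itself, several weak and partly contradictory signals in one record can still add up to a clear classification. In practice the threshold would be chosen on held-out records, in the way the abstract describes testing the model on later receipts before applying it to the remaining forms.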

ABSTRACTING & INDEXING

1	Google Scholar

2	Academia

3	CiteSeerX

4	BibSonomy

5	Doc Player

6	Scribd

7	SlideShare

8	PdfSR

REFERENCES

A. Jonsson and G. Svingby. "The use of scoring rubrics: Reliability, validity and educational consequences." Educational Research Review, vol. 2(2), pp. 130-144, 2007.

A.L. Brennaman. "Examination of osteoarthritis for age-at-death estimation in a modern population." Doctoral dissertation, Boston University, 2014.

D. Shukla and R. Singhai. "Some imputation methods to treat missing values in knowledge discovery in data warehouse." International Journal of Data Engineering, vol. 1(2), pp. 1-13, 2010.

D.B. Rubin. Multiple Imputation for Nonresponse in Surveys. Hoboken, NJ: Wiley, 2004, pp. 11-15.

E. Leahey. "Overseeing research practice: the case of data editing." Science, Technology & Human Values, vol. 33(5), pp. 605-630, 2008.

E. Rahm and H.H. Do. "Data cleaning: Problems and current approaches." IEEE Data Engineering Bulletin, vol. 23(4), pp. 3-13, 2000.

F. Kreuter, S. McCulloch, S. Presser, and R. Tourangeau. "The effects of asking filter questions in interleafed versus grouped format." Sociological Methods & Research, vol. 40(1), pp. 88-104, 2011.

H.J. Kim, L.H. Cox, A.F. Karr, J.P. Reiter, and Q. Wang. "Simultaneous edit-imputation for continuous microdata." Journal of the American Statistical Association, vol. 110(511), pp. 987-999, 2015.

J. Han, J. Pei, and M. Kamber. Data Mining: Concepts and Techniques. Amsterdam, Netherlands: Elsevier, 2011, p. 331.

M. Farfel, L. DiGrande, R. Brackbill, A. Prann, J. Cone, S. Friedman, D.J. Walker, G. Pezeshki, P. Thomas, S. Galea, and D. Williamson. "An overview of 9/11 experiences and respiratory and mental health conditions among World Trade Center Health Registry enrollees." Journal of Urban Health, vol. 85(6), pp. 880-909, 2008.

P.M. Fayers and D. Machin. Quality of Life: The Assessment, Analysis and Interpretation of Patient-Reported Outcomes. Hoboken, NJ: Wiley, 2013, p. 1958.

P.N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Boston: Addison-Wesley, 2005, pp. 145-205.

P.P. Biemer and L.E. Lyberg. Introduction to Survey Quality. Hoboken, NJ: Wiley, 2003, p. 41.

R. Brackbill, D. Walker, S. Campolucci, J. Sapp, M. Dolan, J. Murphy, L. Thalji, and P. Pulliam. World Trade Center Health Registry. New York: New York City Department of Health and Mental Hygiene, 2006.

S. Chaudhuri, K. Ganjam, V. Ganti, and R. Motwani. "Robust and efficient fuzzy match for online data cleaning," in Proc. 2003 ACM SIGMOD International Conference on Management of Data, pp. 313-324, 2003.

S. Tuffery. Data Mining et statistique décisionnelle: L'intelligence des données. Paris: Éditions Technip, 2012, p. 30.

V. Raman and J.M. Hellerstein. "Potter's wheel: An interactive data cleaning system." The VLDB Journal, vol. 1, pp. 381-390, 2001.

W.E. Winkler and B.-C. Chen. "Extending the Fellegi-Holt model of statistical data editing," in ASA Proceedings of the Section on Survey Research Methods, 2001, pp. 1-21.

W.E. Winkler. "State of statistical data editing and current research problems (No. 29)." U.S. Census Bureau, Working Paper. 1999.

MANUSCRIPT AUTHORS

Mrs. M. Rita Thissen

RTI International - United States of America

rthissen@rti.org
