Data defines the model by dint of genetic programming, producing the best decile table.


Data Cleaning is Not Completed Until the “Noise” is Eliminated
Bruce Ratner, Ph.D.

Data cleaning (aka data cleansing or scrubbing) is step 0, the first task, of any data analysis or statistical modeling: it ensures the quality and soundness of the data. After a good data scrub that detects and removes errors and inconsistencies in the data, the resultant analysis/model can be stamped: “Results with Confidence.” Otherwise, if the analysis/model is performed without concern for the caliber of the data, then the stamp should read: “Results are Wanting.” Although the list of steps for cleaning “dirty” data is as varied as the analysts doing the “dirty” work, there are the Ten Basics.

Ten Basics of Data Cleaning

    1. Check frequencies of continuous and categorical variables for unreasonable distributions.
    2. Check frequencies of continuous and categorical variables for detection of unexpected values. For continuous variables, look into data “clumps” and “gaps.”
    3. Check for improbable values (e.g., a boy named Sue), and impossible values (e.g., age is 120 years young, and x/0).
    4. Check the type for numeric variables: Decimal, integer, and date.
    5. Check the meanings of misinformative values, e.g., “NA”, the blank “ ”, the number “0”, the letter “O”, the dash “—”, and the dot “.”.
    6. Check for out-of-range data: Values “far out” from the “fences” of the data. [1]
    7. Check for outliers: Values “outside” the fences of the data. [1]
    8. Check for missing values, and the meanings of their coded values, e.g., the varied strings of 9s, the number “0”, the letter “O”, the dash “—”, and the dot “.”.
    9. Check the logic of data, e.g., response rates cannot be 110%, and weigh contradictory values, along with conflict resolution rules, e.g., duplicate records of BR’s DOB: 12/22/56 and 12/22/65.
    10. Last but not least, check for typos.
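
Several of the checks above lend themselves to a few lines of code. The following is a minimal sketch, not a definitive implementation, of checks 5 and 8 (coded values) and checks 6 and 7 (fences). The names SUSPECT_CODES, suspect_code_counts, tukey_fences, and classify_by_fences, and the sample values, are hypothetical. The fences are assumed to follow the common reading of Tukey [1]: inner fences at 1.5 and outer fences at 3 interquartile ranges beyond the quartiles, with quartiles standing in for Tukey's hinges.

```python
from collections import Counter
from statistics import quantiles

# Hypothetical codes that often stand in for "missing"; adjust per data source.
SUSPECT_CODES = {"NA", "", "0", "O", "-", "\u2014", "."}


def suspect_code_counts(raw_values):
    """Count raw string values that match a suspect code or are a run of 9s."""
    counts = Counter()
    for raw in raw_values:
        value = raw.strip()  # a blank such as " " becomes ""
        is_nines = len(value) >= 2 and set(value) == {"9"}
        if value in SUSPECT_CODES or is_nines:
            counts[value] += 1
    return counts


def tukey_fences(values):
    """Return (inner_low, inner_high, outer_low, outer_high)."""
    # Assumption: quartiles approximate Tukey's hinges.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr, q1 - 3.0 * iqr, q3 + 3.0 * iqr


def classify_by_fences(values):
    """Split values into 'outside' (beyond inner fences) and 'far out' (beyond outer fences)."""
    inner_low, inner_high, outer_low, outer_high = tukey_fences(values)
    outside, far_out = [], []
    for v in values:
        if v < outer_low or v > outer_high:
            far_out.append(v)
        elif v < inner_low or v > inner_high:
            outside.append(v)
    return outside, far_out


# Made-up raw field with coded values mixed in.
raw_ages = ["23", "NA", " ", "0", "O", "999", ".", "41", "NA"]
print(suspect_code_counts(raw_ages))

# Made-up numeric field with one "outside" and one "far out" value.
ages = [10, 12, 12, 13, 14, 15, 15, 16, 18, 30, 95]
print(classify_by_fences(ages))
```

A flagged value is only a candidate for review: a “0” may be a legitimate measurement, and a value outside the fences may be correct. The meaning of each code still has to be checked against the data source.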

Data Cleaning is Not Completed Yet

After the ten basic and analyst-specific checks are done, data cleaning is not completed until the noise in the data is eliminated. Noise is the idiosyncrasies of the data: the particulars, the “nooks and crannies” that are not part of the sought-after essence (e.g., the predominant pattern) of the data with regard to the objective of the analysis/model. Ergo, the data particulars are lonely, not-really-belonging pieces of information that happen to be both in the population from which the data was drawn and in the data itself (what an example of a double-chance occurrence!). Paradoxically, as the analyst includes more and more of the prickly particulars in the analysis/model, the fit of the analysis/model to the data at hand becomes better and better, yet the analysis/model validation becomes worse and worse. Noise must be eliminated from the data.
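
The paradox can be seen in a toy example. The sketch below is an illustration only and is not the GenIQ procedure. It assumes a made-up trend y = 2x with a fixed +1/-1 pattern of particulars in the training sample and, by construction, the opposite pattern in the validation sample. It compares a straight line, which captures only the predominant pattern, with a lookup table that memorizes every particular. All names (fit_line, fit_memorizer, mse) are hypothetical.

```python
def mse(predict, xs, ys):
    """Mean squared error of a prediction function on (xs, ys)."""
    return sum((predict(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_line(xs, ys):
    """Ordinary least-squares line: captures only the predominant trend."""
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return lambda x: y_bar + slope * (x - x_bar)


def fit_memorizer(xs, ys):
    """Lookup table: reproduces every particular of the training data."""
    table = dict(zip(xs, ys))
    return lambda x: table[x]


xs = list(range(10))
# Contrived data: trend 2x plus a fixed +1/-1 pattern in training,
# and the opposite pattern in the validation sample.
train_ys = [2 * x + (1 if x % 2 == 0 else -1) for x in xs]
valid_ys = [2 * x - (1 if x % 2 == 0 else -1) for x in xs]

line = fit_line(xs, train_ys)
memo = fit_memorizer(xs, train_ys)

for name, model in (("line", line), ("memorizer", memo)):
    train_err = mse(model, xs, train_ys)
    valid_err = mse(model, xs, valid_ys)
    print(f"{name}: train MSE = {train_err:.3f}, validation MSE = {valid_err:.3f}")
```

In this contrived setting the memorizer fits the training sample perfectly yet validates worse than the line, because the particulars it absorbed do not recur in the validation sample.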

The purpose of this article is to provide a procedure for eliminating noise from data. The GenIQ Model© is used to 1) identify the idiosyncrasies, and 2) delete the actual records that define the idiosyncrasies of the data. Now, the analysis/model can be built with “cleaned” data that reliably represents the sought-after essence of the data, yielding a well-conducted analysis and a well-fitted model.

Do not be diffident: make your request by email for a PowerPoint presentation of GenIQ as a new, unique procedure for eliminating noise from data. When you do, you will be different, having the know-how to eliminate noise from your data for efficient, effective data cleaning.


Hope to hear from you!

BR

[1] Tukey, J.W., Exploratory Data Analysis, Addison-Wesley, Reading, MA, 1977.



For more information about this article, call Bruce Ratner at 516.791.3544 or 1 800 DM STAT-1; or e-mail at br@dmstat1.com.