Data Cleaning, Coding, Labeling and Creating Dummy/Secondary Variables

Data Cleaning

Data cleaning is a prerequisite for correct data analysis. It is a mistake to begin analysis without first sanitizing the data entered in a master chart. Most researchers enter their data into Excel or another spreadsheet themselves, without assistance, and often under pressure because of limited time or inadequate planning. They fail to foresee the problems that arise when such raw, unrefined data are imported into standard statistical software such as SPSS or Stata.

Statistical packages generally expect data for analysis to be in numeric or coded form, with no symbols, signs, units, free-text characters, or names embedded in the values.

However, an inexperienced researcher may record findings subjectively, as free text (character or string form), which statistical packages cannot analyze directly.
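As an illustration of turning free-text entries into numeric codes, here is a minimal Python sketch. The variable (gender), the codebook values, and the function name `code_gender` are assumptions made up for this example; a real study would use its own codebook, and the same step can be done with the recoding commands of SPSS or Stata.

```python
# Hypothetical codebook: numeric code -> label (illustrative only).
GENDER_LABELS = {1: "Male", 2: "Female"}

# Free-text spellings a researcher might type, mapped to the codes above.
GENDER_CODES = {"male": 1, "m": 1, "female": 2, "f": 2}


def code_gender(raw):
    """Return the numeric code for a free-text entry, or None if unrecognized."""
    if raw is None:
        return None
    # Ignore surrounding spaces and letter case before looking up the code.
    key = str(raw).strip().lower()
    return GENDER_CODES.get(key)


entries = ["Male", " female", "F", "m", "unknown", ""]
coded = [code_gender(e) for e in entries]
print(coded)  # [1, 2, 2, 1, None, None]
```

Unrecognized or blank entries are returned as None rather than guessed, so they can be reviewed against the original records.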

Moreover, several kinds of entry error are very likely: decimal errors (for example, 0.70 wrongly entered as 70); logically incompatible combinations (for example, gender male and pregnancy status yes); extreme outliers (for example, number of teeth recorded as 300 instead of 30); ambiguous use of zero, which may be a real value or a stand-in for missing data (for example, 0 degrees may be the actual recorded temperature of a vaccine refrigerator, or it may have been wrongly entered for a missing temperature record); and inconsistent mixing of letters and symbols in codes (for example, the same grade of outcome recorded as 4+, ++++, or IV).

Statistical programs offer a range of commands for sanitizing entered data, and these are handy for eliminating such unforced errors. Once the data are cleaned, each variable needs to be coded (if categorical) and labeled so that all variable names are logical and easily interpretable. In addition, new variables often need to be computed from existing ones before analysis. For example, the age of study subjects may be entered in years, but for analysis this continuous variable can be transformed into a new categorical variable with three age categories.
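The checks and the age recoding described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (`ratio`, `teeth`, `gender`, `pregnant`), the plausible ranges, and the age cut-offs (under 18, 18 to 59, 60 and over) are hypothetical and would need to be replaced by the study's own definitions.

```python
# All field names, ranges, and cut-offs below are illustrative assumptions.
GRADE_CODES = {"4+": 4, "++++": 4, "IV": 4, "4": 4}


def find_problems(row):
    """Return a list of problems found in one record (a dict)."""
    problems = []
    ratio = row.get("ratio")
    if ratio is not None and not 0 <= ratio <= 1:
        problems.append("ratio outside 0-1 (possible decimal error)")
    teeth = row.get("teeth")
    if teeth is not None and not 0 <= teeth <= 32:
        problems.append("implausible number of teeth")
    if row.get("gender") == "male" and row.get("pregnant") == "yes":
        problems.append("male recorded as pregnant")
    return problems


def code_grade(raw):
    """Map inconsistent grade spellings to one numeric code (None if unknown)."""
    return GRADE_CODES.get(str(raw).strip().upper())


def age_group(age):
    """Recode age in years into three categories: 1, 2, or 3."""
    if age is None:  # missing stays missing; it is never replaced by 0
        return None
    if age < 18:
        return 1
    if age < 60:
        return 2
    return 3


row = {"ratio": 70, "teeth": 300, "gender": "male", "pregnant": "yes"}
print(find_problems(row))                             # three problems listed
print([code_grade(g) for g in ["4+", "++++", "IV"]])  # [4, 4, 4]
print([age_group(a) for a in [12, 35, 71]])           # [1, 2, 3]
```

Note that no automatic rule can tell a genuine 0 from a 0 typed in place of a missing value. In this sketch missing values are kept as None, and any doubtful zero has to be checked against the source record.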