Soumit Kar
5 min read · Mar 15, 2021


In real-world data, values are often missing for reasons such as corrupt data, a failure to load the information, or incomplete extraction. Handling missing values is one of the greatest challenges analysts face, because the choice of how to handle them directly affects how robust the resulting data models are.

There are three standard mechanisms for understanding why data are missing.

Missing At Random:

Missing data are missing at random (MAR) when the probability of missing data on a variable is related to some other measured variable in the model, but not to the value of the variable with missing values itself.

In other words, whether a value is missing depends only on the information available in the other predictors.

For example, when men and women are asked “have you ever taken parental leave?”, men tend to skip the question at a different rate than women.

For example, only younger people have missing values for IQ. In that case the probability of missing data on IQ is related to age.

Missing Not At Random:

Data are missing not at random (MNAR) when the probability of a missing value depends on information that has not been recorded, and that information also predicts the missing values themselves.

For example, in a survey, cheaters are less likely to respond when asked if they have ever cheated.

For example, IQ data are MNAR when only the people with low IQ values have missing observations for this variable. A problem with the MNAR mechanism is that it is impossible to verify that scores are MNAR without knowing the missing values.

Missing Completely At Random:

Data are missing completely at random (MCAR) when the probability of a missing value in a variable is the same for all samples.

For example, in a survey, values may be randomly missed while being entered into the computer, or a respondent may simply not answer a question.
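The three mechanisms can be made concrete with a small simulation. The sketch below is illustrative only: the synthetic `age` and `iq` columns, the thresholds, and the missingness rates are all assumptions chosen to mirror the IQ examples above, not data from any real survey.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000

# Synthetic data (assumed values, for illustration only).
df = pd.DataFrame({
    "age": rng.integers(18, 70, size=n),
    "iq": rng.normal(100, 15, size=n),
})

# MCAR: every row has the same 20% chance of losing its IQ value.
mcar_mask = rng.random(n) < 0.2

# MAR: missingness depends on another observed column (age), not on IQ itself.
mar_mask = (df["age"] < 30) & (rng.random(n) < 0.5)

# MNAR: missingness depends on the value that goes missing (low IQ).
mnar_mask = (df["iq"] < 90) & (rng.random(n) < 0.5)

# Series.mask replaces values with NaN wherever the condition is True.
iq_mcar = df["iq"].mask(mcar_mask)
iq_mar = df["iq"].mask(mar_mask)
iq_mnar = df["iq"].mask(mnar_mask)

# Under MNAR the observed mean is biased upward, because low scores
# are the ones that disappear.
print("true mean:", df["iq"].mean())
print("observed mean under MNAR:", iq_mnar.mean())
```

In the MAR case the gaps can be explained by `age`, which is still in the table. In the MNAR case nothing that remains observed reveals why the values are gone, which is why MNAR cannot be verified from the data alone.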

1. Deleting Rows/Columns

Here, we either delete a row if it has a null value for a particular feature, or delete a column if more than 70–75% of its values are missing. This method is advised only when there are enough samples in the data set.
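A minimal sketch of both deletions with pandas follows. The toy DataFrame and its column names (`age`, `cabin`, `fare`) are hypothetical, and the 70% cutoff is simply the lower end of the range mentioned above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: "cabin" is missing in 4 of 5 rows (80%).
df = pd.DataFrame({
    "age": [22, np.nan, 35, 41, 29],
    "cabin": [np.nan, np.nan, np.nan, "C85", np.nan],
    "fare": [7.25, 71.28, 8.05, np.nan, 13.0],
})

# Fraction of missing values per column.
missing_share = df.isna().mean()

# Drop columns where more than 70% of the values are missing.
threshold = 0.70
cols_to_drop = missing_share[missing_share > threshold].index
df_cols = df.drop(columns=cols_to_drop)

# Then drop every remaining row that still contains a null value.
df_clean = df_cols.dropna(axis=0, how="any")

print(df_clean)
```

Dropping the sparse column first matters here: calling `dropna` on the rows of the original frame would discard every row, since each one has at least one null.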

Pros:

🔻 Removing columns that are mostly missing (around 75%) tends to give more accurate models.

Cons:

🔻 Loss of information and data.

🔻 Poor model performance on a small dataset.

2. Filling Null Values:

Sometimes rather than dropping NA values, you’d rather replace them with a valid value. This value might be a single number like zero, or it might be some sort of imputation or interpolation from the good values.

Create a dataset.

Fill missing values with 0, with the previous value, or with the next value.
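The three fills can be sketched with pandas as below. The single-column DataFrame is a made-up example, not the dataset used in the original post.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with two gaps.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0, np.nan, 5.0]})

# Replace every missing value with a constant.
filled_zero = df.fillna(0)

# Forward fill: carry the previous valid value forward.
filled_prev = df.ffill()

# Backward fill: pull the next valid value backward.
filled_next = df.bfill()

print(filled_zero["a"].tolist())
print(filled_prev["a"].tolist())
print(filled_next["a"].tolist())
```

Note that a forward fill cannot fill a gap in the first row, and a backward fill cannot fill a gap in the last row, because there is no neighbouring value to copy.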

Pros:

🔻 Avoids data loss.

Cons:

🔻 Does not work in every situation.

3. Mean/Median/Mode Imputation:

Mean/median imputation assumes that the data are missing completely at random (MCAR).

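Here, the standard deviation of the Age column is 14.526497332334042.

After median imputation, the standard deviation of the Age_median column is 13.019696550973201, so the imputation has reduced the spread of the variable.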
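The effect on the standard deviation can be reproduced in miniature. This is a sketch on made-up values: the column names `Age` and `Age_median` follow the text above, while the numbers and the `Port` column are hypothetical, so the printed figures will not match the ones quoted.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; not the dataset behind the figures above.
df = pd.DataFrame({
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0],
    "Port": ["S", "C", None, "S", "Q", "S", None, "C"],
})

# Numeric column: fill gaps with the median of the observed values.
median_age = df["Age"].median()  # NaN values are skipped by default
df["Age_median"] = df["Age"].fillna(median_age)

# Categorical column: fill gaps with the most frequent category (mode).
mode_port = df["Port"].mode()[0]
df["Port_mode"] = df["Port"].fillna(mode_port)

# Compare the spread before and after imputation.
std_before = df["Age"].std()
std_after = df["Age_median"].std()
print("std before:", std_before)
print("std after: ", std_after)
```

Every imputed row sits at the centre of the distribution, so the spread shrinks. In a modelling pipeline the median would normally be computed on the training split only and then reused for the test split.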

Pros:

🔻 A better approach when the dataset is small.

🔻 A fast way to obtain a complete dataset.

Cons:

🔻 Changes or distorts the original variance.

🔻 Impacts correlation.

4. Random Sample Imputation:

Random sample imputation consists of taking random observations from the non-missing values of a variable and using them to fill its missing values.

It assumes that the data are missing completely at random (MCAR).
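One way to implement this with pandas is sketched below. The toy `Age` values, the new column name `Age_random`, and the fixed `random_state` are assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})

missing = df["Age"].isna()

# Draw one observed value per gap (with replacement; seed fixed so the
# result is reproducible).
sampled = df["Age"].dropna().sample(
    n=int(missing.sum()), replace=True, random_state=0
)

# Re-label the draws with the index of the rows that need filling,
# so that the assignment below aligns correctly.
sampled.index = df.index[missing]

df["Age_random"] = df["Age"]
df.loc[missing, "Age_random"] = sampled

print(df)
```

Because the fill values come from the observed distribution itself, the variance is usually distorted less than with mean or median imputation, at the cost of adding randomness to the result.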

Comparison

Pros:

🔻 Less distortion in variance.

Cons:

🔻 Does not work in every situation.

5. Capturing NaN Values With New Features

It works well if the data are not missing completely at random. Create a new feature that captures the importance of the missing values by placing 1 where a value is missing and 0 everywhere else. Here, row number 5 is missing and is imputed with the median, which is 2.
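A minimal sketch of the indicator feature follows. The values are toy numbers chosen so that row 5 is missing and the median is 2, echoing the description above. They are not the original dataset, and the column names `x` and `x_nan` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data: the value at index 5 is missing.
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 2.0, 1.0, np.nan, 2.0, 3.0]})

# Build the indicator BEFORE filling, otherwise the information is lost:
# 1 where the value was missing, 0 everywhere else.
df["x_nan"] = np.where(df["x"].isna(), 1, 0)

# Then impute the original column with the median of the observed values.
df["x"] = df["x"].fillna(df["x"].median())

print(df)
```

The model then sees both a complete `x` column and a flag that says which rows were imputed.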

Pros:

🔻 Capture the importance of missing values.

Cons:

🔻 Creates additional features (curse of dimensionality).

6. End Of Distribution Imputation:

If there is a suspicion that the values are not missing at random, then capturing that information is important. In this scenario, one would want to replace missing data with values that are at the tails of the distribution of the variable.

Assumptions: the data are not missing at random; the data are skewed at the tail end.
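The text does not say which tail value to use. A common choice, assumed in this sketch, is the mean plus three standard deviations of the observed values. The toy `Age` data and the column name `Age_end_distribution` are likewise hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data.
df = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0, 2.0, 27.0]})

# Assumed rule: a value far out in the right tail, mean + 3 * std,
# computed on the observed values only.
extreme = df["Age"].mean() + 3 * df["Age"].std()

# Fill the gaps with that extreme value so imputed rows stand apart.
df["Age_end_distribution"] = df["Age"].fillna(extreme)

print("fill value:", extreme)
print(df)
```

The imputed rows land well outside the observed range, which lets a model treat them as a separate group but also stretches the variance of the column.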

Pros:

🔻 Capture the importance of missing values.

Cons:

🔻 Changes covariance/variance; may create biased data.
