On handling missing data — classes of missing data
Unlike most datasets found on Kaggle and similar sources, datasets in the wild are rarely complete. In many cases, one result of exploratory data analysis is the discovery of missing values in rows and columns. When working with such datasets, the solution usually falls into one of two categories: imputation or deletion.
Deletion is straightforward, as the name implies, although you still have to decide whether to delete rows, columns, or both, usually based on the proportion of missing values relative to the available data. For imputation, there are numerous techniques that can be applied to a dataset, and data science libraries like scikit-learn and Pandas come with methods for handling this.
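As a rough illustration of the two options, here is a minimal sketch using Pandas. The column names, the values, and the 30% threshold are all made up for the example. scikit-learn offers a comparable tool in `sklearn.impute.SimpleImputer`, which is not shown here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51, 38],
    "income": [40_000, np.nan, 85_000, np.nan, 60_000],
    "gender": ["F", "M", None, "F", "M"],
})

# Proportion of missing values per column, often the first thing to check.
missing_share = df.isna().mean()

# Deletion (rows): drop every row that has at least one missing value.
rows_dropped = df.dropna(axis=0)

# Deletion (columns): keep only columns with at most 30% missing values.
# The 30% cut-off is an arbitrary choice for this example.
cols_dropped = df.loc[:, missing_share <= 0.30]

# Imputation: fill missing incomes with the median of the observed incomes.
income_imputed = df["income"].fillna(df["income"].median())

print(missing_share)
print(rows_dropped)
print(cols_dropped.columns.tolist())
print(income_imputed.tolist())
```

Which of these is appropriate depends on why the values are missing in the first place.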
However, that is not the purpose of this short piece. Rather, it is to talk about the different categories of missing data, the implications they have for the task at hand, and how they affect the choice of handling method.
To aid understanding, let’s use a case study. A data scientist wants to predict the kind of house a user will buy given a number of factors, which can include age, race, gender, income level, and occupation. He has spent some time gathering data on people living in a community and would like to use that data to make predictions. However, he notices some missing values in the columns for gender and income level.
Considering the target in question (predicting the kind of house a user will purchase), we can make a reasonable guess as to which of the five aforementioned factors will contribute the least to the prediction. The gender of a person tops the list (this is under my assumption that the kind of house a user buys does not depend on their gender). If, in addition, the gender values are missing for no systematic reason (that is, no particular group of people was more likely to leave the field blank), the rows where gender is missing can be categorized under the class Missing Completely At Random (MCAR). MCAR refers to a situation where the fact that a value is missing has no relationship with the target variable, with the other features in the dataset, or with the missing value itself.
Having categorized the gender variable as MCAR, the next column with missing values is the income level. Unlike gender, there is definitely a relationship between income level and the kind of house a user will purchase. When some of the income values are missing, however, it can be due to either of two factors:
First factor: let’s assume people in a particular occupation declined to submit their income levels, probably because they are high-income earners and do not want to disclose such details. In such a case, there is a relationship between the propensity of a value (income level) being missing and another feature (occupation) in the dataset, i.e., while people in other occupations provided their income levels, people in one particular occupation declined to answer. The key point is that the missingness can be explained by a feature we did observe. This category of “missingness” is termed Missing At Random (MAR).
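One simple way to look for this pattern is to compare the share of missing incomes across occupations. The sketch below uses invented survey data, and the occupation labels are hypothetical. A large gap between groups suggests that missingness depends on occupation, although a check like this cannot by itself rule out that it also depends on the unseen income values.

```python
import numpy as np
import pandas as pd

# Hypothetical survey data; occupations and incomes are invented.
df = pd.DataFrame({
    "occupation": ["doctor", "doctor", "doctor",
                   "teacher", "teacher", "teacher",
                   "clerk", "clerk"],
    "income": [np.nan, np.nan, 150_000,
               45_000, 50_000, 48_000,
               30_000, 32_000],
})

# Share of missing income values within each occupation, highest first.
missing_rate = (
    df["income"].isna()
    .groupby(df["occupation"])
    .mean()
    .sort_values(ascending=False)
)

print(missing_rate)
```

In this toy data only one occupation has missing incomes, which is the situation described above.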
Second factor: there may be no relationship between the missing values and any other feature in the dataset; rather, the propensity of missingness has to do with the variable itself. Using the missing income values as an example, some respondents just didn’t want to share their income levels, as they probably felt it was too private to share. In such a situation, where the propensity of a value being missing is related to the value itself, we categorize the missing value as Missing Not At Random (MNAR).
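To see why the distinction matters, here is a small simulation sketch. Everything in it is an assumption made for illustration: the incomes are drawn from a lognormal distribution, the MCAR case hides 30% of values uniformly, and the MNAR case hides values with a probability that grows with the income rank. It is not a model of any real survey.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated "true" incomes; the distribution is an arbitrary choice.
income = rng.lognormal(mean=10.5, sigma=0.5, size=10_000)

# MCAR: every value has the same 30% chance of being missing.
mcar_mask = rng.random(income.size) < 0.30

# MNAR: the higher the income, the more likely it is withheld.
# The probability rises linearly with income rank from 0 to 0.6,
# so about 30% of values are missing overall in this case too.
ranks = income.argsort().argsort() / (income.size - 1)
mnar_mask = rng.random(income.size) < 0.60 * ranks

true_mean = income.mean()
mcar_mean = income[~mcar_mask].mean()  # mean of values still observed under MCAR
mnar_mean = income[~mnar_mask].mean()  # mean of values still observed under MNAR

print(f"true mean:          {true_mean:,.0f}")
print(f"observed under MCAR: {mcar_mean:,.0f}")
print(f"observed under MNAR: {mnar_mean:,.0f}")
```

Under these assumptions the observed mean stays close to the true mean in the MCAR case, while in the MNAR case it is pulled downward, because the values that went missing were disproportionately the high ones.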
In summary:
- Missing Completely At Random means there is no relationship between the missingness of a value and the value itself, the other features, or the target feature in the dataset. In such a case, deletion or any imputation technique can be used to solve the challenge. However, it must be mentioned that this rarely happens in the wild, so if your data appears to be MCAR, it’s recommended that you investigate further.
- Missing At Random occurs when the propensity of a value being missing is related to other features in the dataset (in our example, income levels were missing for certain occupations). In cases like this, careful imputation that takes those related features into account can mitigate the challenge.
- Missing Not At Random occurs when the propensity of a value being missing is related to the value itself (in our example, because income levels can be considered private information, some people declined to provide answers). In a case like this, it might be hard to build a good model, as the affected feature is usually very important, and unless the proportion of missing values is negligible, the task may not be able to progress much further.
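For the Missing At Random case, one possible form of "careful imputation" is to fill each missing income using only respondents with the same occupation, rather than the whole dataset. The sketch below uses invented data and is only one of many reasonable strategies. It also keeps a flag column, `income_was_missing` (a hypothetical name), so the model can still see which values were imputed.

```python
import numpy as np
import pandas as pd

# Hypothetical data; occupations and incomes are invented.
df = pd.DataFrame({
    "occupation": ["doctor", "doctor", "doctor",
                   "teacher", "teacher", "teacher"],
    "income": [150_000, np.nan, 170_000,
               45_000, np.nan, 55_000],
})

# Record which incomes were originally missing before filling anything.
df["income_was_missing"] = df["income"].isna()

# Naive approach: fill with the median of all observed incomes.
global_fill = df["income"].fillna(df["income"].median())

# Group-aware approach: fill with the median income of the same occupation.
group_median = df.groupby("occupation")["income"].transform("median")
group_fill = df["income"].fillna(group_median)

print(global_fill.tolist())
print(group_fill.tolist())
```

The global median assigns the same value to a doctor and a teacher, while the group-aware version uses the typical income of each occupation. Note that if every income in an occupation is missing, its group median is also missing, so that group would need a different strategy.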
I hope that with these distinctions and categorizations, you can make better-informed decisions about how to handle missing data the next time you meet it.