Dealing with NA/Null Data in Pandas
Null and missing data can be a real issue when trying to perform any EDA on your data. It can lead to statistical errors and can cause your machine learning models to predict incorrectly. Ideally you want to get your hands on clean data, but that is not always the case, so you need to know how to deal with data that isn't clean and data that is missing values.
NA values occur when no data value is stored for the variable in an observation. NA or missing values can arise in a multitude of ways: sometimes the information simply is not available, and sometimes they are caused by improper data collection or mistakes made in data entry.
Now there are many ways of dealing with this kind of missing data. In this blog, I will be focusing specifically on Pandas, showing you how to first discover missing/incorrect data and then the different methods of getting rid of it.
The first thing to do before you actually start performing any machine learning or even statistical analysis on your data is to take a look at the data you are dealing with. You can do this by using:
.describe():
This gives you a statistical summary of the data. The first thing you should notice is the count for each column: the counts should all be the same, which would confirm that each column has an equal number of entries.
I personally like to transpose this output, which switches the positions of columns and rows; in my opinion it makes the summary look a little cleaner and easier to read.
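As a minimal sketch of this step, on a small made-up DataFrame (the column names and values here are purely hypothetical):

```python
import pandas as pd
import numpy as np

# A toy DataFrame with some deliberately missing values
df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 28],
    "salary": [50000, np.nan, np.nan, 72000, 61000],
    "score":  [88.0, 92.5, 79.0, np.nan, 85.5],
})

# .T transposes the summary so each column of data becomes a row
print(df.describe().T)
```

The `count` column of the transposed summary shows 4 for `age`, 3 for `salary`, and 4 for `score`, immediately revealing that the columns do not all have the same number of non-missing entries.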
From this, you can see that the count, which is just the number of non-missing entries, varies for each column. This is a sign that there may be missing values in some columns; otherwise they would all have the exact same count.
To check this we use:
.isna():
Used on its own, this returns a boolean value for every single entry, which is why we chain .sum() after it: summing counts the entries that were True, i.e. the missing ones. This tells us exactly how many missing values each column has.
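Continuing with a small hypothetical DataFrame, the chain looks like this:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "age":    [25, 32, np.nan, 41, 28],
    "salary": [50000, np.nan, np.nan, 72000, 61000],
})

# .isna() marks each cell True where it is missing;
# .sum() then counts the True values per column
print(df.isna().sum())
# age       1
# salary    2
```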
Now comes dealing with these missing values.
1) .fillna()
As the name suggests, this fills the NA value with something; in the parentheses, you can pass some options.
.fillna(method='ffill'): ffill stands for forward fill, and this will fill the NA value with the most recent non-missing value before it. (Note that in recent versions of Pandas the method argument is deprecated; you can call .ffill() directly instead.)
.fillna(method='bfill'): backward fill, which replaces the NA value with the closest non-missing value after it. (Likewise, .bfill() is the modern equivalent.)
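A minimal sketch of both fills on a made-up Series, using the direct .ffill()/.bfill() methods:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, np.nan, 4.0, np.nan])

# forward fill: carry the last known value forward
print(s.ffill())   # 1.0, 1.0, 1.0, 4.0, 4.0

# backward fill: pull the next known value backward;
# the trailing NaN stays NaN because nothing comes after it
print(s.bfill())   # 1.0, 4.0, 4.0, 4.0, NaN
```

Note that neither fill invents new information; each simply copies a neighbouring observation, which is most sensible for ordered data such as time series.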
2) .dropna()
This is pretty straightforward: it removes those NA values completely, but in the process it removes the whole row, so you may be throwing away data that other columns did have. It should only be used when there is a very small number of NA values and dropping them won't have a meaningful effect on the size of your data.
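A quick sketch on a hypothetical DataFrame, showing how much data a naive drop can cost:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [10.0, np.nan, 30.0, 40.0],
})

# drops every row containing at least one NA -- rows 1 and 2 here,
# even though each of them has a valid value in the other column
cleaned = df.dropna()
print(cleaned)                 # only rows 0 and 3 survive
print(len(df), len(cleaned))   # 4 2
```

Here two NA cells cost us half of the rows, which is exactly why this method suits only small amounts of missing data.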
3) .interpolate()
It can take options in the parentheses. What it does is fill in missing values using a variety of statistical methods. By default, it uses linear interpolation on every column in the DataFrame, but this can be changed to something like quadratic, and the right method depends on the pattern of the data. Using linear interpolation means missing values are computed as if they lay on a straight line between the known values on either side.
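A minimal sketch of the default linear interpolation on a made-up Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1.0, np.nan, 3.0, np.nan, np.nan, 9.0])

# linear interpolation draws a straight line between the known
# neighbours and reads the missing values off that line
print(s.interpolate())   # 1.0, 2.0, 3.0, 5.0, 7.0, 9.0
```

The single gap between 1 and 3 becomes 2, and the two gaps between 3 and 9 become 5 and 7, evenly spaced along the line. (Non-linear methods such as `method="quadratic"` require SciPy to be installed.)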
4) Replacing with the mean, median, or mode
We still use .fillna(), but in the parentheses we pass the mean, median, or mode of the column we want to fill, depending on our need. You would replace .mean() with .median() or .mode() or any other statistic as appropriate.
The statistic you choose very much depends on the data; for example, if your data contains many outliers, it may be better to use the median to replace the missing values.
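A short sketch of this approach on a hypothetical salary column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"salary": [50000.0, np.nan, 72000.0, 61000.0, np.nan]})

# fill the missing salaries with the column mean
# (here the mean of the three known values is 61000)
df["salary"] = df["salary"].fillna(df["salary"].mean())

# swap .mean() for .median(), or .mode()[0] since .mode()
# returns a Series, depending on the shape of the data
print(df["salary"])
```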
Even after all of this, sometimes the best method is doing nothing: sometimes so much of the data is missing that the whole data set is not worth using, and it may be wiser to find a better data set or even to use a different method of collecting the data.
The important thing to take from this is to approach the data in a logical way and to take a deep look at it before deciding to simply drop the missing values, replace them with the mean, or use any other method. You should understand why you chose that method, what implications it has for the data set as a whole, and what further consequences it may have when applying statistical analysis and machine learning models.