I spoke about the Groupby method which is one of the most important tools for a data scientist while using Pandas.
Now although pivot tables and cross tabs are similar to Groupby they do differ slightly. I will be explaining how you actually implement each of them and how they differ from Groupby.
Let’s start with Pivot Tables, the way Pivot Tables works is that they aggregate data between two columns unique intersecting values. For example, if we are aggregating between two columns Age and Sex than the unique intersecting Value could be salary between both.
for pivot tables, you must have
- two grouping columns
- the aggregation column(which would be salary in the example above)
- the aggregation function(mean, median, etc..)
df.pivot_table(index=’grouping column 1',columns = ‘grouping column 2,values= ‘unique intersecting value’,aggfunc=’how you want to calculate the values’)
the example mentioned earlier would be written like so:
df.pivot_table(index=’race’, columns=’sex’, values=’salary’, aggfunc=’mean’)
you can do the same thing with the groupby function but it will be more difficult to read and compare between entries:
This is the output if we use Groupby method instead
now cross-tab really shines when it comes to counting occurrences in columns.
We know that we can use .count_values for counting frequencies in a series, basically counting the frequency in a single column. Now when it comes to counting the frequency in a data frame with multiple columns Cross tab is the easiest way to do that, although it can also be done by the groupby method and the pivot table method, the CrossTab was specifically designed for just this.
I will show you an example where we want to count the number of people receiving treatment and those not receiving treatment in each country and compare between each method:
the groupby method is easy to apply but it leaves the data looking very long and difficult to read and interpret compared to the pivot table and cross tabs
df.pivot_table(index=’country’, columns=’treatment’, aggfunc=’size’, fill_value=0)
Pivot table gives us the shape we are looking for, notice we didn’t have to assign value because we are using aggfunc=’size’. also, in this case, fill_value=0 doesn’t do anything but in the cases where there are no combinations for some values it can be used and will just fill the value as 0 instead of showing an NA or Null value which can interfere with further analysis
something to note is that the crosstab is a function and not a method hence we are using “pd.” at the start instead of df like previous two. The crosstab function is built specifically for this kind of thing in mind, by default it counts the occurrences between the given columns
A big reason for using crosstab is that it can return relative frequencies if you add the normalize condition
pd.crosstab(index=mh[‘country’], columns=mh[‘treatment’], normalize=’index’)
It returns the same dataset but you can see the values are now normalized so we can see the percentage values for each.
overall even though crosstab is helpful in certain cases most of the time you will rather want to use the groupby or pivot table method.
These are some of the tools you will need whilst doing EDA on your data using Pandas. it is so important to understand the data before you start doing any machine learning because you gain new insight and understanding by performing EDA on the data and can prevent mistakes and save time in the end since some machine learning methods are very time-consuming.