Groupby Method In Pandas

Khalid Gharib
3 min read · Aug 7, 2020

Previously, I spoke about how to deal with missing data so that we can analyze the data and perform machine learning on it.

Now, during the EDA (Exploratory Data Analysis) phase, we can take advantage of many tools. One of them is reorganizing the data in pandas, which lets us see the data in a new light and possibly gain new insights and knowledge from it.

Groupby:

The groupby method in pandas handles the task of grouping data. On its own, groupby only splits the rows into groups; to get a result back, you follow it with an aggregation method.

It is important to first understand the basic layout:

<Basic Layout>
df.groupby('<grouping column>').agg(<new_column_name>=('<agg column>', '<agg func>'))

And this is what it would look like in a real example:

<Example>
emp refers to a specific data frame (df) holding employee data
emp.groupby('dept').agg(avg_salary=('salary', 'mean'))
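To make the example concrete, here is a minimal runnable sketch. The rows in `emp` are made up for illustration; only the `dept` and `salary` column names come from the example above.

```python
import pandas as pd

# Hypothetical employee data, invented for this illustration
emp = pd.DataFrame({
    "dept": ["sales", "sales", "it", "it", "hr"],
    "salary": [40000, 50000, 60000, 80000, 45000],
})

# Group the rows by department, then average the salary within each group
avg_by_dept = emp.groupby("dept").agg(avg_salary=("salary", "mean"))
print(avg_by_dept)
```

The result has one row per department, with the department names as the index and a single avg_salary column.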

You can use many kinds of aggregation, such as mean, median, min, max or count, and you can also pass your own function.
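As a rough sketch of mixing aggregations, the snippet below combines two built-in aggregation names with a small custom function. The data and the output column names (median_salary, headcount, salary_range) are hypothetical and chosen only for this illustration.

```python
import pandas as pd

# Hypothetical employee data, invented for this illustration
emp = pd.DataFrame({
    "dept": ["sales", "sales", "it", "it", "hr"],
    "salary": [40000, 50000, 60000, 80000, 45000],
})

summary = emp.groupby("dept").agg(
    median_salary=("salary", "median"),   # built-in aggregation by name
    headcount=("salary", "count"),        # number of non-missing salaries
    # custom aggregation: spread between highest and lowest salary
    salary_range=("salary", lambda s: s.max() - s.min()),
)
print(summary)
```

Any function that takes the values of one group and returns a single value can be used in place of the lambda.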

You can also group by more than one column, like so:

sf_emp.groupby(['year', 'organization group'])\
    .agg(avg_salary=('salaries', 'mean'),
         min_salary=('salaries', 'min'),
         max_salary=('salaries', 'max'),
         avg_overtime=('overtime', 'mean'))

In this case, we are grouping by year and organization group, and we get 4 new columns, each applying its own aggregation to a specific column. This gives us the following output:

For each year, the rows are organized by organization group, and the avg_salary, min_salary, max_salary and avg_overtime for each organization group have been calculated. In most cases resetting the index will make the result look a lot cleaner and easier to read; this can be done by adding '.reset_index()' at the end of the chain, after .agg(). This will change the data frame to look like so (it does this by moving the group keys out of the hierarchical row index into ordinary columns, so all column headings sit on the same level):
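The effect of resetting the index can be sketched on a small, made-up data frame. The rows below are invented for illustration; only the column names follow the sf_emp example.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
sf_emp = pd.DataFrame({
    "year": [2019, 2019, 2019, 2020, 2020],
    "organization group": ["Culture", "Culture", "Public Works",
                           "Culture", "Public Works"],
    "salaries": [50000, 70000, 65000, 72000, 68000],
    "overtime": [1000, 3000, 0, 500, 2500],
})

grouped = sf_emp.groupby(["year", "organization group"]).agg(
    avg_salary=("salaries", "mean"),
    avg_overtime=("overtime", "mean"),
)
# Before the reset, the two group keys form a two-level row index
print(grouped.index.names)

# After the reset, the group keys are ordinary columns next to the aggregates
flat = grouped.reset_index()
print(flat.columns.tolist())
```

After the reset, the rows are numbered from 0 and the year and organization group can be selected like any other column.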

If we only want to know how many times each value appears in a single column, we don't need the groupby method; the value_counts() method will do the same thing. value_counts() is also an important tool to use at the start of your EDA process.

emp['organization group'].value_counts()
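As a quick check that the two approaches agree, the sketch below counts the rows per group both ways. The data is made up for illustration, and using `.size()` on the groupby is my own choice of equivalent, not something taken from the example above.

```python
import pandas as pd

# Hypothetical data, invented for this illustration
emp = pd.DataFrame({
    "organization group": ["Culture", "Public Works", "Culture",
                           "Culture", "Public Works", "Health"],
})

# Counts per unique value, sorted from most to least frequent
counts = emp["organization group"].value_counts()

# Same counts through groupby, sorted by group name instead
sizes = emp.groupby("organization group").size()

print(counts)
print(sizes)
```

The numbers are the same; the main visible difference is the ordering of the rows.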

This is just a short intro to the groupby method, which is a very powerful tool for every data scientist to have. It's simple and straightforward, but when you use the groupby method alongside different aggregations you can discover a lot of insight into the data and get a better understanding of it, which will help you ask the right questions and answer them.
