Tidy Data (Part 2)

Khalid Gharib
2 min readJul 23, 2020

I mentioned in a previous blog:

  • 3 Things constitute tidy data: (1)Each variable forms a column, (2)Each observation forms a row, (3) Each type of observational unit forms a table
  • What messy Data tends to look like
  • How we can use the melt() method to clean the data by stacking them one-by-one on top of each other

Pandas has a method of inverting the melted data back to its original form. it is called pivoting and uses the pivot() method which contains 3 parameters:

  • index — the column that will stay vertical. This column will be set as the index
  • columns — The column which will be transposed and whose unique values will be made into column names
  • values — The column which will be tiled across as the new values

Data set after we used the melt method which will look something like the code below and the output.

df_pivot = df.pivot(index=’airline’, columns=’origin_airport’, values=’avg_arrival_delay’)

you can also use reset_index() and rename_axis() to make the layout look cleaner and without the different hierarchy, as you can pivot by more than one column which will start to look messy if not reset.

setting both Reset_index() and rename_axis() to None will output the above.

Pivot_table method

this is different to pivot method as this works when there exists only 1 unique combination for each intersection. This is where the Pivot table shines, you can verify the number of occurrences of each airline-airport combination.

df.pivot_table(index=’airline’, columns=’origin_airport’, aggfunc=’size’)

Pivot_table is also able to aggregate all the values at the intersection such as Min, Max, Mean, Median, etc.

if you use both these tools correctly you can ensure you have Tidy Data which is extremely important for any future works and machine learning you will apply, as well as essential to extracting information and meaning from the data in an efficient and effective way.

--

--