Combining Data — Python & Pandas

Khalid Gharib
5 min readOct 25, 2020

--

Most of the time when you are doing data analysis you won't be dealing with just one data frame but multiple ones or at least multiple datasets created from the same source. pandas has some nifty tools to combine DataFrames in a wide variety of ways.

The most common way is by using the .merge

.merge

.merge will combine/join two data frames on a column that is shared between both

df = pd.merge(left_df, right_df, on='shared_column_name')

In the example below we have the date, and adjusted closing price and volume for Amazon and Apple stocks, they can both be represented in a single df to save space and make it both visually better and easier to apply analysis on.

merged_stocks = pd.merge(aapl, amzn, on='date')

As you can see it automatically renamed the columns. In this case, x refers to aapl the first argument in the merge, and y refers to amzn. It is always better to make column names clear and simple so I opt to use .rename and change column names.

merged_stocks.rename(columns={'adjusted_close_x':'aapl_adjusted_close',
'volume_x':'aapl_volume',
'adjusted_close_y':'amzn_adjusted_close',
'volume_y':'amzn_volume'},
inplace=True)

When merging you should understand the types of joins you can do.

I showed you how we can merge two data frames along a shared column in both , if you don't specify the type of join it will use an Inner-Join by default. You can choose which type of join by adding another argument how=’left’ or whichever join you want to apply.

We can also use any of the other join types above

  • Left Merge/Left Join: This will keep every row in the left data frame, it will return NaN where there are missing values of the ‘on’ variable in the right data frame.
  • Right Merge/Right Join: This will keep every row in the Right data frame, it will return NaN where there are missing values of the ‘on’ variable in the left data frame.
  • Outer Merge/Full Outer Join: it will return all the rows from both the left and right data frames and matches up rows where it is possible, and for those that do it will return NaN values.

An example image of a left join. It is forced to return every row in the left data frame and thus these rows don't have any join value with the right data frame on ‘platform’ and ‘device’

These are the results from a left merge. It will keep every row in the left data frame, due to it keeping every row in the left data frame, there will be cases where the left and right don't have any corresponding join value and are thus left with NaN values

Now, this is really helpful when we can join data frames horizontally and they share similar columns between them.

What if we need to stack data frames on top of one another?

.concat

This function provides many different arguments which can be confusing at times.

firstly we can stack data frames one on top of another, I will be using the same amzn and aapl data frames to show this

pd.concat([amzn, aapl])

Now you may have noticed that the index restarts at the new data frame we just stacked below the first one. You can make it continuous by using ignore_index=True.

pd.concat([amzn, aapl], ignore_index=True)

Now to be honest this data frame is very difficult to interpret, which one is Amazon stocks and which ones are Apple stocks?

we can include another argument in the Concat function, keys.

pd.concat([amzn, aapl], keys=['amzn', 'aapl'])

Now that is a lot more helpful. We can also just add a new column to the original data frames instead which makes it clear which stock we are referring to.

amzn['symbol'] = 'amzn'
aapl['symbol'] = 'aapl'
pd.concat([amzn, aapl], ignore_index=True)

Another cool argument you can use is changing the direction of the concat by changing the axis

pd.concat([amzn2, aapl], axis=1)

If you enjoyed reading this, follow me, and stay up to date with my weekly blogs!

--

--