One of the most common jobs you will have to do during the data analysis process is selecting subsets of the data. This basically means extracting specific values from the rows or columns of our DataFrame or Series.
You can select from both columns and rows.
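As a quick sketch of what this looks like in pandas, here are the most common selection patterns. The DataFrame, its column names and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data, invented for illustration only
df = pd.DataFrame(
    {
        "name": ["Ann", "Ben", "Cara"],
        "age": [28, 35, 41],
        "city": ["London", "Leeds", "London"],
    },
    index=["a", "b", "c"],
)

ages = df["age"]                # a single column comes back as a Series
subset = df[["name", "city"]]   # a list of columns comes back as a DataFrame
row_b = df.loc["b"]             # one row selected by its index label
first_two = df.iloc[:2]         # rows selected by integer position

# Rows and columns together: a boolean mask for rows, labels for columns
londoners = df.loc[df["city"] == "London", ["name", "age"]]
```

A rough rule of thumb: `.loc` works with labels and boolean masks, while `.iloc` works with integer positions.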
Most of the time when you are doing data analysis, you won't be dealing with just one DataFrame but with several, or at least with multiple datasets created from the same source. pandas has some nifty tools for combining DataFrames in a wide variety of ways.
The most common way is to use .merge.
.merge combines (joins) two DataFrames on a column that is shared between both:
df = pd.merge(left_df, right_df, on='shared_column_name')
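To make that line concrete, here is a minimal sketch with two tiny frames. The frames and their values are hypothetical, and the `how="outer"` variant is included only to show how the default behaviour differs.

```python
import pandas as pd

# Hypothetical frames; the values are invented for illustration
left_df = pd.DataFrame(
    {"shared_column_name": [1, 2, 3], "left_value": ["a", "b", "c"]}
)
right_df = pd.DataFrame(
    {"shared_column_name": [2, 3, 4], "right_value": ["x", "y", "z"]}
)

# The default is an inner join: only keys present in both frames are kept
inner = pd.merge(left_df, right_df, on="shared_column_name")

# how="outer" keeps every key from either frame and fills the gaps with NaN
outer = pd.merge(left_df, right_df, on="shared_column_name", how="outer")
```

Here `inner` keeps only the keys 2 and 3, while `outer` has one row for each of the four keys.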
In the example below we have the date, adjusted closing price and volume for Amazon and Apple stocks; they can both be represented in a…
Most machine learning models have hyperparameters, which you can tune to change the way the learning occurs. The hyperparameters differ from model to model, and different datasets call for different hyperparameter settings.
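One common way to tune hyperparameters is a grid search with cross-validation. The sketch below uses scikit-learn's `GridSearchCV`. The choice of model (k-nearest neighbours), the iris dataset and the candidate values are all assumptions made for the sake of the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Candidate values are an arbitrary choice for this sketch
param_grid = {
    "n_neighbors": [1, 3, 5, 7],
    "weights": ["uniform", "distance"],
}

# Try every combination in the grid with 5-fold cross-validation
search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5)
search.fit(X, y)

best_params = search.best_params_  # combination with the best mean CV score
best_score = search.best_score_    # that mean cross-validated accuracy
```

A different model would expose different hyperparameters, so the keys of `param_grid` have to match the estimator you pass in.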
In data science, and in computing in general, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. It helps us get from the raw data in our dataset to a format where the data is ready and prepped for machine learning.
An example is reading in our data and using transformers to impute missing values and standardize the input data.
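A minimal sketch of that idea with scikit-learn's `Pipeline` might look like this. The feature matrix is made up, and mean imputation is just one possible strategy, picked here as an assumption.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Tiny made-up feature matrix with one missing value
X = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0]])

prep = Pipeline(
    steps=[
        ("impute", SimpleImputer(strategy="mean")),  # fill NaN with the column mean
        ("scale", StandardScaler()),                 # rescale columns to mean 0, std 1
    ]
)

# Each step's output is fed to the next step as its input
X_ready = prep.fit_transform(X)
```

A model can be appended as the final step, so that a single call to `fit` runs the preprocessing and the training in order.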
This will be a continuation of my first blog which you can find here:
The last thing I mentioned was the IN/NOT IN syntax which allows you to select certain items or exclude certain ones when selecting data in your query.
We will continue with more conditional clauses as well as other syntax, and even apply aggregations to columns. I will be using the same SQL schema as in the previous blog.
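As a quick refresher on IN/NOT IN, here is a small sketch run through Python's built-in `sqlite3` module. The `customers` table and its rows are hypothetical and are not the schema from the previous blog.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, town TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("Ann", "London"), ("Ben", "Leeds"), ("Cara", "York"), ("Dev", "Bath")],
)

# IN keeps rows whose town matches any value in the list
in_rows = conn.execute(
    "SELECT name FROM customers WHERE town IN ('London', 'York') ORDER BY name"
).fetchall()

# NOT IN keeps rows whose town matches none of the values
not_in_rows = conn.execute(
    "SELECT name FROM customers WHERE town NOT IN ('London', 'York') ORDER BY name"
).fetchall()

conn.close()
```

Here `in_rows` holds Ann and Cara, and `not_in_rows` holds Ben and Dev.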
In this blog, I wanted to give an overview of SQL with examples so that you become more familiar with SQL syntax and with what you can accomplish with it.
SQL stands for Structured Query Language, and it lets you access and manipulate databases. Basically, SQL allows you to run queries against a database to extract the information you require. These queries can be as simple as extracting all the customers in a database who live in a specific town, or as advanced as applying multiple conditions and aggregations to extract the data you want.
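To give a feel for both ends of that range, here is a sketch using Python's built-in `sqlite3` module. The `orders` table, its columns and its rows are invented for illustration.

```python
import sqlite3

# In-memory database with a hypothetical table, for illustration only
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, town TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [
        ("Ann", "London", 20.0),
        ("Ann", "London", 35.0),
        ("Ben", "Leeds", 15.0),
        ("Cara", "London", 50.0),
    ],
)

# Simple query: every customer who lives in a specific town
london = conn.execute(
    "SELECT DISTINCT customer FROM orders WHERE town = ? ORDER BY customer",
    ("London",),
).fetchall()

# More advanced: several conditions plus an aggregation per customer
big_spenders = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE town = 'London' AND amount > 10
    GROUP BY customer
    HAVING SUM(amount) > 50
    ORDER BY total DESC
    """
).fetchall()

conn.close()
```

The first query returns Ann and Cara. The second returns only Ann, whose London orders add up to 55, because Cara's total of 50 does not pass the `HAVING` filter.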
One of the most important and commonly used tools for Machine Learning is a library called Scikit-learn. I wanted to go over the layout of the Scikit-learn library and explain how to use it.
An estimator in this case refers to any object that learns from the data. In Scikit-learn we have numerous types of estimators that have already been built for us. Keep in mind that not all estimators are machine learning models, but they all learn from the data. The main types of estimators are:
Recently I have been taking part in a Data Science Internship with a medical company called Medwise. It has been very eye-opening and has introduced me to some of the ways Data Science is being applied within the medical industry.
During my internship I was introduced to the Entity Linking Problem. But what exactly is this problem, and how can we use data science tools to solve it?
Entity linking is the task of extracting query mentions from documents and then linking them to their corresponding entities in a knowledge base. This is very helpful as certain words can have…
I’m sure a lot of people have seen all the different boot camps available both online and in-person and wondered if that was the right step for them to take to learn Data Science.
Having completed the Flatiron 15-week Intensive in London, I feel I can shed some light and maybe help anyone who is trying to make that decision.
I think there are two things you need to keep in mind while making your decision to join a boot camp:
Firstly, why are you joining? What are your expectations? I think the number one issue with boot camps…
In this blog, I wanted to go over tips and tricks that I couldn’t write stand-alone blogs for but that are very helpful for anyone who wants to advance their pandas skills, as well as things that can improve your presentations and simplify the EDA process.
This is something I don’t see often, and it makes a really big difference when you are presenting a table where specific values are important to take note of.
Let’s say you have a df that you have built a pivot table from:
pivot_df = df.pivot_table(index='dept', columns='race',
You can highlight the…