Scikit-learn infrastructure

Khalid Gharib
Sep 20, 2020


One of the most important and commonly used tools for Machine Learning is a library called Scikit-learn. I wanted to go over the layout of the Scikit-learn library and explain how to use it.

The scikit-learn Estimator

An estimator in this case refers to any object that learns from data. Scikit-learn has numerous types of estimators already built for us. Keep in mind that not all estimators are machine learning models, but they all learn from the data. The main types of estimators are:

  • Regressors: supervised learning with a continuous target
  • Classifiers: supervised learning with a categorical target
  • Clusterers: used in unsupervised learning
  • Transformers: transform the input/output data
  • Meta-estimators: wrap or combine other estimators
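
To make these categories concrete, here is a minimal sketch that instantiates one estimator of each type. The specific classes (LinearRegression, LogisticRegression, KMeans, StandardScaler, Pipeline) are my own picks for illustration; each category contains many more.

```python
from sklearn.base import BaseEstimator
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# One illustrative estimator per category (chosen for this example only)
examples = {
    "regressor": LinearRegression(),
    "classifier": LogisticRegression(),
    "clusterer": KMeans(n_clusters=2),
    "transformer": StandardScaler(),
    # A meta-estimator is built from other estimators: here a scaler followed by a regressor
    "meta-estimator": Pipeline([("scale", StandardScaler()), ("model", LinearRegression())]),
}

for kind, est in examples.items():
    # Every estimator inherits from BaseEstimator and exposes a fit method
    print(kind, type(est).__name__, isinstance(est, BaseEstimator), hasattr(est, "fit"))
```

Whatever the category, every one of these objects has a fit method, which is what makes it an estimator.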

To understand the Scikit-learn API, I found this image very helpful for visualizing the infrastructure.

credit to dunder-data for this picture

It goes from House → Room → Objects

The House is the whole Scikit-learn library, each room is a module (linear_model, cluster, etc.), and each room contains objects, which can be estimators or just helper functions. Most of the objects in Scikit-learn are helper functions that fulfill a single task, such as f1_score in the metrics module, which, as it sounds, only calculates the F1 score.
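
As a quick sketch of what a helper function looks like in practice, here is f1_score called on a tiny set of labels I made up for the example. There is nothing to instantiate or fit: you import the function and call it.

```python
from sklearn.metrics import f1_score

# Made-up true and predicted labels for a binary problem
y_true = [0, 1, 1, 0]
y_pred = [0, 1, 0, 0]

# precision = 1/1 and recall = 1/2, so F1 = 2 * 1 * 0.5 / 1.5
score = f1_score(y_true, y_pred)
print(score)
```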

There is a three-step process for applying each estimator. I abbreviate it to IIF:

  1. Import: You import the estimator from the Scikit-learn library.

from sklearn.linear_model import LinearRegression
# from the linear_model module (the room) of sklearn (the house), we import the LinearRegression estimator

  2. Instantiate: Once you have imported it, you need to instantiate it. I like to think of this as activating the estimator so that it can be used.

lr = LinearRegression()
# instantiating the LinearRegression estimator

  3. Fit: We now have a single linear regression instance that is ready to learn from data. For it to learn, we must pass both the input and output data to its fit method.

lr.fit(X, y)
# fitting our LinearRegression instance to the data so that it learns from it
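
Putting the three steps together, a minimal runnable sketch might look like the following. The data is made up for illustration (it follows y = 3x + 2 exactly), and the predict call at the end is an extra step beyond IIF to show the fitted model in use.

```python
import numpy as np
from sklearn.linear_model import LinearRegression  # 1. Import

# Made-up data that follows y = 3x + 2
X = np.array([[1.0], [2.0], [3.0], [4.0]])  # one input column, four rows
y = np.array([5.0, 8.0, 11.0, 14.0])        # one target value per row

lr = LinearRegression()  # 2. Instantiate
lr.fit(X, y)             # 3. Fit

# Once fitted, the instance can predict for unseen input
prediction = lr.predict(np.array([[5.0]]))
print(prediction)
```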

Some things to note when preparing data for Scikit-learn estimators:

  • Missing values are a big no-no for most estimators
  • Input data must be numeric; if you are using string data, it must be encoded as numeric
  • Target variable must be numeric for regression, but it can be either numeric or string for classification
  • Input data must be two-dimensional
  • Target variable must be one-dimensional
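
For the first two points, one possible approach (a sketch, not the only way) is to fill missing values with SimpleImputer and turn string columns into numeric indicator columns with OneHotEncoder. The DataFrame below, including the Area column, is hypothetical data invented for this example.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical housing data with one missing value and one string column
df = pd.DataFrame({
    "NoBedrooms": [2.0, 3.0, np.nan, 4.0],
    "Area": ["north", "south", "north", "south"],
})

# Fill the missing bedroom count with the column median
bedrooms = SimpleImputer(strategy="median").fit_transform(df[["NoBedrooms"]])

# Encode the string column as 0/1 indicator columns (one per category)
area = OneHotEncoder().fit_transform(df[["Area"]]).toarray()

# Combine into a single numeric, two-dimensional input array
X = np.hstack([bedrooms, area])
print(X.shape)
```

Note that SimpleImputer and OneHotEncoder are themselves estimators (transformers), so they follow the same import, instantiate, fit pattern.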

What the last two mean is that when you assign your X and y data, it should look something like this:

# example using housing data
X = df[['NoBedrooms']] # could be any number of features
y = df['HousePrice']

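You can see above that our y (target variable) is one-dimensional and our X (input data) is two-dimensional. The double brackets in df[['NoBedrooms']] return a DataFrame, which is two-dimensional, while the single brackets in df['HousePrice'] return a Series, which is one-dimensional.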
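
If you want to check the dimensions yourself, a small sketch like this can help. The DataFrame values are made up, and the reshape at the end is an additional option for when your input starts out as a one-dimensional NumPy array.

```python
import pandas as pd

# Made-up housing data using the same column names as above
df = pd.DataFrame({"NoBedrooms": [2, 3, 4], "HousePrice": [150000, 200000, 260000]})

X = df[["NoBedrooms"]]  # double brackets return a DataFrame
y = df["HousePrice"]    # single brackets return a Series

print(X.ndim, X.shape)
print(y.ndim, y.shape)

# A one-dimensional array can be turned into a single-column 2D array with reshape
X_from_array = df["NoBedrooms"].to_numpy().reshape(-1, 1)
print(X_from_array.shape)
```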

Once you have fit the model, you can inspect what it learned. In the case of linear regression, you can access both the intercept and the coefficients like so:

lr.intercept_
lr.coef_

Depending on the estimator, you will have different attributes for accessing what the model learned. By convention, attributes learned during fit end with an underscore.
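
As a rough illustration of how these attributes differ between estimators, the sketch below fits a LinearRegression and a KMeans clusterer on the same made-up input. KMeans is my own choice of second estimator, and n_init and random_state are set only to keep the example repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Made-up data: y follows 2x + 1, and X forms two obvious groups
X = np.array([[1.0], [2.0], [10.0], [11.0]])
y = np.array([3.0, 5.0, 21.0, 23.0])

lr = LinearRegression().fit(X, y)
print(lr.intercept_, lr.coef_)  # coef_ holds one coefficient per input column

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)  # one row per cluster
print(km.labels_)           # cluster index assigned to each row of X
```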

I wanted to share this as I found Scikit-learn very intimidating and felt very lost and confused the first time I used it. I hope this has made Scikit-learn clearer and easier for you and hope that it helps you with your future endeavours!
