Data Science Pipelines
In Data Science, and in computing in general, a pipeline is a set of data processing elements connected in series, where the output of one element is the input of the next one. It helps us get from the raw data in our dataset to a format where the data is prepped and ready for Machine Learning.
An example is reading in our data and using transformers to impute missing values and standardize the input data.
You can do each step individually, and it would look like this:
Imputation
from sklearn.impute import SimpleImputer
si = SimpleImputer(strategy='mean')
si.fit_transform(X)
This will fill all the missing values with the mean of their column. You can also use other values for `strategy`, such as 'median', 'most_frequent' and 'constant', and there are other types of imputation as well.
Like any transformer in scikit-learn, you must import it, instantiate it, and then fit and transform it on your features.
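To make the imputation step concrete, here is a minimal sketch on a made-up feature matrix. The array `X_demo` and the names `si_demo` and `X_filled` are assumptions for illustration only and are not part of the original example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy feature matrix with one missing value per column
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan]])

si_demo = SimpleImputer(strategy='mean')
X_filled = si_demo.fit_transform(X_demo)

# The column means learned during fit (2.0 and 15.0 for this toy data)
print(si_demo.statistics_)
# The same matrix with each missing value replaced by its column mean
print(X_filled)
```

The `statistics_` attribute holds the value the imputer learned for each column, which is what gets written into the gaps.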
Feature Scaling
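Feature scaling is one of the most common transformations to make on continuous data. It rescales each feature so that they all share a similar scale. For example, standard scaling (standardization) transforms each feature to have a mean of 0 and a standard deviation of 1. Squeezing every feature into the range 0 to 1 is a different technique, min-max scaling, which scikit-learn provides as MinMaxScaler.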
from sklearn.preprocessing import StandardScaler
ss = StandardScaler()
ss.fit_transform(X)
These are the same steps as with the SimpleImputer: import, instantiate, and then fit and transform. Fitting means the transformer learns from the data (for StandardScaler, the mean and standard deviation of each feature), and transforming is the actual conversion of the data from its original values to the scaled values, which for StandardScaler have a mean of 0 and a standard deviation of 1.
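As a quick check of what StandardScaler does, here is a small sketch on invented numbers. `X_demo`, `ss_demo` and `X_scaled` are hypothetical names used only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 200.0],
                   [3.0, 300.0]])

ss_demo = StandardScaler()
X_scaled = ss_demo.fit_transform(X_demo)

# Per-feature mean learned during fit
print(ss_demo.mean_)
# After scaling, each column has mean ~0 and standard deviation ~1
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
# Scaled values are not confined to [0, 1]; some are negative
print(X_scaled.min())
```

Both columns end up with identical scaled values here, because they differ only by a constant factor before scaling.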
While the above workflow works, it’s a bit clumsy, as we must keep track of each transformed dataset. If we had many more transformations, we would be creating many variables and would start to mix them up and lose track. scikit-learn provides us with a nicer workflow through the `Pipeline` meta-estimator. A pipeline applies one transformation after the other without the need for intermediate variables. You just need to create a list of the steps it takes.
Pipeline
from sklearn.pipeline import Pipeline
si = SimpleImputer(strategy='mean')
ss = StandardScaler()
steps = [('si', si), ('ss', ss)]
We import Pipeline and instantiate the transformers we will use. We then make a list of tuples, each holding a step name and a transformer, that states the order of the steps our pipeline will take. In the case above, we first apply the SimpleImputer to our data, and then its output is fed to the StandardScaler.
We can now create the pipeline from these steps.
pipe = Pipeline(steps)
We have instantiated our pipeline, and it will follow the steps we specified. You can add more steps and even include a machine learning estimator, but the estimator has to be the last step.
pipe.fit_transform(X)
Then it is as simple as calling fit_transform on the pipeline with our features, and all our transformations are applied to the feature data.
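To show that the pipeline really does the same work as the step-by-step version, here is a minimal sketch that runs both on the same made-up data and compares the results. The data and the `_demo`, `X_manual` and `X_piped` names are assumptions for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with missing values
X_demo = np.array([[1.0, 10.0],
                   [np.nan, 20.0],
                   [3.0, np.nan],
                   [5.0, 40.0]])

# Manual workflow: one intermediate variable per transformation
X_imputed = SimpleImputer(strategy='mean').fit_transform(X_demo)
X_manual = StandardScaler().fit_transform(X_imputed)

# Pipeline workflow: the same two steps chained together
pipe_demo = Pipeline([('si', SimpleImputer(strategy='mean')),
                      ('ss', StandardScaler())])
X_piped = pipe_demo.fit_transform(X_demo)

# Both routes should give the same transformed array
print(np.allclose(X_manual, X_piped))
```

The fitted transformers stay available afterwards through `pipe_demo.named_steps`, for example `pipe_demo.named_steps['si']`.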
Pipeline with Machine Learning estimator
from sklearn.neighbors import KNeighborsRegressor
si = SimpleImputer(strategy='mean')
ss = StandardScaler()
knr = KNeighborsRegressor()
steps = [('si', si), ('ss', ss), ('knr', knr)]
pipe_ml = Pipeline(steps)
In the block of code above I have also added a K-Nearest Neighbours regressor to our pipeline to show what it would look like. It is as simple as adding another step to the pipeline. The pipeline applies each step in the order you specified, and you can then call .fit, .predict or .score as you normally would on a machine learning estimator, or even pass the whole pipeline to cross_val_score.
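Here is a rough sketch of how that could look end to end. The synthetic data (`X_demo`, `y_demo`), the random seed and the choice of `cv=5` are all assumptions made up for this illustration, so the scores themselves mean nothing beyond showing the calls.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=60)
# Knock out roughly 10% of the feature values to simulate missing data
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

pipe_demo = Pipeline([('si', SimpleImputer(strategy='mean')),
                      ('ss', StandardScaler()),
                      ('knr', KNeighborsRegressor())])

# fit runs fit_transform on each transformer, then fits the final estimator
pipe_demo.fit(X_demo, y_demo)

# predict and score push the data through the fitted transformers first
preds = pipe_demo.predict(X_demo[:5])
r2 = pipe_demo.score(X_demo, y_demo)

# cross_val_score refits the whole pipeline on each training fold
cv_scores = cross_val_score(pipe_demo, X_demo, y_demo, cv=5)
print(preds, r2, cv_scores)
```

Passing the whole pipeline to cross_val_score means the imputer and scaler are fitted only on each training fold, so nothing from the held-out fold leaks into the preprocessing.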
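You can do a lot more with Pipelines, and you can even tweak each individual step with parameters specific to that transformer or machine learning estimator. Parameters of a step are addressed by the step name and the parameter name joined with a double underscore, for example `knr__n_neighbors`.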
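As one possible illustration of tweaking individual steps, the sketch below sets a parameter directly with set_params and then searches over a few values with GridSearchCV. The synthetic data, the parameter grid and `cv=3` are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical synthetic regression data with some missing values
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 3))
y_demo = 2.0 * X_demo[:, 0] + rng.normal(scale=0.1, size=60)
X_demo[rng.random(X_demo.shape) < 0.1] = np.nan

pipe_demo = Pipeline([('si', SimpleImputer()),
                      ('ss', StandardScaler()),
                      ('knr', KNeighborsRegressor())])

# Set one parameter of one step: <step name>__<parameter name>
pipe_demo.set_params(knr__n_neighbors=3)

# Search over parameters of several steps at once (illustrative grid)
param_grid = {'si__strategy': ['mean', 'median'],
              'knr__n_neighbors': [3, 5, 7]}
grid = GridSearchCV(pipe_demo, param_grid, cv=3)
grid.fit(X_demo, y_demo)

# The best combination found on this toy data
print(grid.best_params_)
```

GridSearchCV works on clones of the pipeline, so `pipe_demo` itself keeps the value set with set_params.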