How to release a model across your organization in one morning (and stop spending excessive amounts of time on a model that will never go into production)

What is this guide about?

This guide is a walkthrough for delivering a complete Machine Learning project with the Provision.io platform in a very short amount of time.

The guide details each step of a Machine Learning project, from data to model usage across the organization, and shows how to accomplish this within the platform.

What's in this guide?

Starting point

This guide assumes that:

  • You want to create a weekly sales forecast
  • You have a free version of the Provision.io platform
  • Your IT team (or data team) has historical sales data in a database and has given you access (if not, CSV files are provided below for the sake of this guide)
  • An objective has been defined by the Line of Business (LoB)
Steps

The steps of our guide are laid out in the table below:

Steps of a Machine Learning Project

Step | Name | Goal | LoB | IT | Datalab | Output | Time spent
1 | Data acquisition | Access the historical data for training the Machine Learning model | No | Yes | Yes | Dataset | 5 min
2 | Feature engineering | Prepare the dataset | No | No | Yes | Holdout and validation strategy (fold) | 20 min
3 | Define the problem | Define a metric that reflects the LoB process | Yes | No | Yes | An agreement | As much as possible
4 | Experiment | Train models | No | No | Yes | ~100 models | 25 min
5 | Evaluate | Select the fittest model | Yes | No | Yes | A selection of 3 to 4 models that go into production | As much as possible
6 | Deploy | Share the model across the organization | No | No | Yes | Webapp for humans, API for machines | 5 min
7 | Schedule | Schedule predictions | Yes | Yes | Yes | Predictions delivered each Monday at 9:00 am in the CRM software | 20 min
8 | Monitor | Track the model | Yes | No | Yes | Dashboard | As much as possible
TOTAL | | | | | | | 1h15

For each step, the guide explains what to expect and how we accomplish it on the Provision.io platform.

Results

At the end of this guide:

  • The LoB will receive a weekly sales forecast each Monday at 09:00 am
  • The LoB will receive a simulator for testing hypotheses against the model
  • The application team will get an API for calling the model from their own applications
  • The IT team will get a dashboard to monitor the model's Quality of Service

You can open a free account to practice the following steps. When your account is ready, create a project to host the assets.

Create a new project

Data acquisition

The first step of any project is getting historical data in order to train our algorithm. As the name implies, Machine Learning is all about reading historical data and letting a computer model learn to predict a target, at least for supervised use cases.

The data should have been loaded into a database by the IT team, which has also generated credentials for you. Once you have created and selected your project:

  1. Go to the data section (sidebar on the left)
  2. Create a new connector and provide the credentials
  3. Create a new datasource from the database and table of past sales
  4. Import it as a dataset

Create a new connector

Create a new datasource

Import dataset

If available, you can also import recent sales as a holdout dataset in order to validate your model and check its stability.

You now have two datasets.

Data acquisition is done; you can now start modeling.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min

If you don’t have database credentials, you can use the following files. Just import the file instead of using a datasource when importing the dataset.
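If you go the CSV route, you can also reproduce the trainset/holdout split locally before importing the files. A minimal sketch with pandas, assuming a sales.csv file with a Date column (the file and column names are hypothetical; adapt them to the files you actually use):

import pandas as pd

# Hypothetical file and column names; adapt them to the CSV files provided with this guide.
sales = pd.read_csv("sales.csv", parse_dates=["Date"])

# Keep the most recent year of sales aside as a holdout and train on everything before it.
last_year = sales["Date"].dt.year.max()
trainset = sales[sales["Date"].dt.year < last_year]
holdout = sales[sales["Date"].dt.year == last_year]

trainset.to_csv("trainset.csv", index=False)
holdout.to_csv("holdout.csv", index=False)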

Feature engineering

Feature engineering is the addition or transformation of one or more features to create new features from the original dataset. In the Provision Platform, as in most modern tools, feature engineering is done with components and pipelines. Yet in most cases you don't need to add features, as the AutoML engine performs all of the standard feature engineering by itself.

Here we are going to add a fold column based on the date features in order to properly evaluate our model's stability. A specific component has been developed by the data science team, starting from the Provision boilerplate, and pushed to a private repo.

The component can now be integrated into the component library of the project.

Go to the pipelines section of your project and, under the Pipeline Components tab, click New pipeline Component.

Create a new component

And select your repo and branch.

Import component from your repo

Once the component is built, its status will switch to ok and we can use it in a pipeline. Create a new pipeline template with three nodes:

  • An import dataset node, to read the trainset
  • The newly created component ("build fold")
  • A save dataset node, to save the feature-engineered dataset into your Data section

A simple feature engineering pipeline

Then create a new scheduled run that you are going to execute manually once on your trainset.

Create a new scheduled run

Set your trainset as the input dataset

Configure your fold component parameters

Once the configuration is done, select "Manual" as the trigger and run your scheduled run. In a few seconds, a new dataset with a new fold column should be available in your data section as a pipeline output.

Pipeline output dataset

You now have a dataset with engineered features for training models and a holdout to validate them.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min

For the sake of this guide, we built a very basic feature engineering pipeline, but you can add as many transformations as you want and build a very complex pipeline.

Here we only have one component, which adds a fold column computed as the year modulo 4. You can perform the same feature engineering on your local machine with the following code. If you want to build your own component, you can follow this guide or others.

import pandas as pd


def addfold(df: pd.DataFrame, dtcol: str = "dt", foldon: str = "year", nfolds: int = 3) -> pd.DataFrame:
    # Guard against invalid fold counts.
    if nfolds <= 0:
        nfolds = 3

    df[dtcol] = pd.to_datetime(df[dtcol])

    # Default: fold on the month.
    df["fold"] = df[dtcol].dt.month % nfolds

    if foldon == "year":
        df["fold"] = df[dtcol].dt.year % nfolds
    if foldon == "day":
        df["fold"] = df[dtcol].dt.day % nfolds
    if foldon == "hour":
        df["fold"] = df[dtcol].dt.hour % nfolds

    return df
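For example, on a small made-up dataframe, folding on the year with nfolds=4 gives:

sample = pd.DataFrame({
    "dt": ["2019-02-01", "2020-02-01", "2021-02-01", "2022-02-01"],
    "Weekly Sales": [24000, 26500, 23800, 25100],  # made-up values for illustration
})

# Fold on the year modulo 4: 2019 % 4 = 3, 2020 % 4 = 0, and so on.
print(addfold(sample, dtcol="dt", foldon="year", nfolds=4)["fold"].tolist())
# [3, 0, 1, 2]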

Define the problem

This is the most important part and the one that should be allocated the most time.

In this step, you're going to define with the Line of Business how to qualify the project as a success, and you, as a data scientist, are going to translate this into data science metrics.

Regression metrics

Choosing the best metric is out of the scope of this document, but you must spend time with your business teams and ask these kinds of questions:

  • Imagine that I have the perfect model: what do I gain from it?
  • How much money do I lose if I forecast 110 sales instead of 100?
  • How much money do I lose if I forecast 90 sales instead of 100?
  • Are all the predicted products equal?
  • Should I forecast the total number of items sold, the total amount of sales (in € or $), the total weight of my items, or the total volume?
  • How far in advance should I forecast?

As a data scientist using an AutoML platform, your role is not to code in Python or create Dockerfiles, but to translate business problems into Machine Learning parameters.
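To make that translation concrete, here is a minimal sketch (with made-up costs, not from this guide) showing why the over/under-forecasting questions above matter: a symmetric metric such as MAE scores over- and under-forecasting identically, while a business cost function that penalizes lost sales more heavily does not, which may push you towards a different metric or a custom loss.

import numpy as np

y_true = np.array([100, 100, 100, 100])
y_over = np.array([110, 110, 110, 110])    # always over-forecasts by 10 sales
y_under = np.array([90, 90, 90, 90])       # always under-forecasts by 10 sales

def mae(y_true, y_pred):
    return np.mean(np.abs(y_pred - y_true))

def business_cost(y_true, y_pred, over_cost=1.0, under_cost=3.0):
    # Hypothetical costs: a missed sale (under-forecast) costs three times more than overstock.
    err = y_pred - y_true
    return np.mean(np.where(err > 0, err * over_cost, -err * under_cost))

print(mae(y_true, y_over), mae(y_true, y_under))                      # 10.0 10.0 -> MAE cannot tell them apart
print(business_cost(y_true, y_over), business_cost(y_true, y_under))  # 10.0 30.0 -> the business can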

In the Provision Platform, you can build what is called an Experiment to help refine your objectives.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min
Define the problem | Done | a metric to validate the models | 1 week

Experiment

An experiment is a set of model-building runs that share a common target, with slightly different parameters in each version. In each experiment, many models are automatically built and evaluated in cross-validation, as well as on the holdout dataset if you provide one.

In our case, the models will be trained on our engineered dataset with a fold column and evaluated on a holdout provided by the IT Team.

It is very important to have a good validation strategy to guarantee that the model built in the experiment phase will stay stable in production. Here we choose to:

  • Build a fold column from the modulo of the year number, so that we stay confident that the model has learned trends that remain stable across years
  • Validate on a holdout with sales from a year that is not in the trainset

Hence, if the holdout score is close to the cross-validation score, we know that our model is going to hold up when launched in production and shared across the company.
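If you want to double-check this strategy outside the platform, you can reproduce it locally: cross-validate a simple model using the fold column as the group, then compare the score with the holdout score. A minimal sketch with scikit-learn, assuming numeric features and hypothetical file and column names:

import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

# Hypothetical file names: the pipeline output with the fold column, and the holdout from the IT team.
train = pd.read_csv("trainset_with_fold.csv")
holdout = pd.read_csv("holdout.csv")

target = "Weekly Sales"
features = [c for c in train.columns if c not in (target, "fold")]  # assumes numeric features

model = GradientBoostingRegressor(random_state=0)

# Cross-validate by leaving one fold (one group of years) out at a time.
cv_mae = -cross_val_score(
    model, train[features], train[target],
    groups=train["fold"], cv=LeaveOneGroupOut(),
    scoring="neg_mean_absolute_error",
).mean()

model.fit(train[features], train[target])
holdout_mae = mean_absolute_error(holdout[target], model.predict(holdout[features]))

# If the two scores are close, the model should stay stable once in production.
print("Cross-validation MAE:", cv_mae, "Holdout MAE:", holdout_mae)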

To create a new experiment, go to the Experiments section of your project and click New Experiment. You could choose to import external models if you have some, but here we are using the Provision AutoML engine. As we want to forecast sales, choose "Tabular" and "Regression". Give a name to your experiment and click "Create experiment".

Setting the experiment up

When you create a new experiment, no version exists yet, so you will be prompted to create one. The next screen is where you set up all of your experiment parameters:

Experiment parameters

  • The train dataset: use the output of the scheduled run from step 2, with the engineered features
  • The holdout dataset: use a dataset with the same target as the trainset but with data that is not in the trainset
  • The metric: use the metric that best reflects the business objectives defined in step 3. You can change it in each version of your experiment, so run as many versions as you need if you are not sure
  • The target: here we choose "Weekly Sales"
  • The fold column: use the column built during the feature engineering phase

Note that you can go to the models and feature engineering tabs to change some of the AutoML configuration, but in most cases the default configuration is fine.

Once done, click on Train to launch the training. The platform will immediately start to build and select models with the best hyperparameters. The models will stack up in the "Models" tab of your experiment:

The experiment dashboard

Note that you can launch another version of your experiment whenever you want, to test other metrics for example, by using the new version button in the top right corner.

If you have several versions, the experiment dashboard will always display the latest one, but you can switch to another version with the version dropdown menu or the versions list tab.

The experiment dashboard

You can launch as many versions as you want and they will run in parallel. You can now grab a coffee and wait until the models are built! Depending on the size of your dataset and the plan you subscribed to, expect to wait from 10 minutes to 2 hours before having enough models to evaluate your experiment. In our case, we got our models in approximately 20 minutes.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min
Define the problem | Done | a metric to validate the models | 1 week
Experiment | Done | ~100 models | ~20 min

Evaluate

After a few minutes, you should have between 15 and 40 models for each version, depending on the options you chose.

List of experiment versions

This step is all about evaluating all the models produced and selecting 2 to 4 models to deploy for testing in real conditions.

First, have a quick look at the list of versions below (versions tab of your experiment). A small 3-star rating gives you information about each version's quality. In this instance, version 3, which has been trained on Mean Absolute Error, looks the most promising. Click on a specific version to enter the version dashboard for a deeper analysis.

On the version dashboard, you have several indicators, but the most important is the models comparator:

List of models of a version

You can quickly see:

  • the performance of each model, evaluated on the metric you chose for this version
  • the stability of each model (represented with a small error bar), computed on a cross-validation of the trainset using the fold column provided

 

Simple models

The Provision Platform always produces what we sometimes call "simple models": a linear regression and a decision tree of depth 5 only, called simple-LR and simple-DT. It is always a good idea to compare the performance of these models against the more complex ones and ask yourself whether they could be good enough for your problem.

Indeed, as they are very simple:

  • they can be implemented in SQL (auto-generated code is even provided on the model analysis page)
  • they are often more explainable and better accepted by the business teams, as they are easier to understand and use.

As a data scientist, deciding to use a simple if-else rule instead of a complex blend of gradient boosting models, if it solves the issue, is entirely within your purview!

 

On the experiment above, LGB-3, XGB-4, and CB-2 look promising, so we are going to take a closer look. Click on a model's bar in the plot to enter the detailed model analysis, CB-2 for example.

Here you have more details about the model you selected, such as various metrics and the actual-vs-predicted scatterplot.

All the metrics of the model

Predicted vs actual

You can download the cross-validation file if you want to run your own evaluation. CB-2 is quite good, but if we look at the scatterplot, we see that performance drops in the 40k to 80k range. If we go to the LGB-3 page, we see more stable performance.

Predicted vs actual (LGB-3)
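The cross-validation file makes this kind of range analysis easy to script. A minimal sketch, assuming the downloaded file contains the actual "Weekly Sales" and a "prediction" column (hypothetical file and column names; check the file you download):

import pandas as pd

# Hypothetical file and column names; check the cross-validation file you downloaded.
cv = pd.read_csv("cross_validation_CB-2.csv")
actual, predicted = cv["Weekly Sales"], cv["prediction"]

# Mean absolute error per range of actual sales, to spot where a model degrades (e.g. 40k to 80k).
ranges = pd.cut(actual, bins=[0, 20_000, 40_000, 80_000, actual.max()])
print((predicted - actual).abs().groupby(ranges).mean())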

Evaluating a model is out of the scope of this guide, but be aware that this is another step where you MUST involve your business team and explain each metric and chart to them, so that you can choose the model that best solves their problem through group consensus.

The model analysis page is full of metrics to parse and you can run as many experiments as you want in order to find the model that fits the business problem the best.

After discussions with the LoB, we decided to keep LGB-3 and XGB-4: one because it performs well and the other because its performance is stable when evaluated on the holdout.

In order to refine this, we are now going to deploy both models and see how they perform in the real world.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min
Define the problem | Done | a metric to validate the models | 1 week
Experiment | Done | ~100 models | ~20 min
Evaluate | Done | 2 models selected with the business team | 1 week

Deploy

In this step, two models will be deployed in order to test them on real data and real usage. While deployed, their performance will be closely monitored to decide whether they are fit for production-grade use.

Go to the "Deployments" section of your project and click on deploy a new experiment. Select LGB-3 as the main model and XGB-4 as a challenger, in order to see which one performs best on real data.

Set your main and challenger

The main model will be used for prediction, but each time you call it, a prediction will also be made with the challenger model, and a chart will be generated so you can compare them.

Wait a few minutes to get:

  • a standalone webapp that a human user can test (the "Application link" URL)
  • a batch predictor available for scheduling predictions
  • a REST API for calling the model from other software (the "Documentation API" link)

Set your main and challenger

That's all. Your model can now be called from any client in your company, and all requests will be logged for further monitoring. Yet, in order to send predictions to the sales team each week, you need to schedule them.
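For the application team, calling the deployed model usually boils down to an authenticated HTTP request. The exact route, authentication scheme, and payload are described in the "Documentation API" link of your deployment; the sketch below only illustrates the general shape, with a hypothetical URL, token, and feature payload:

import requests

# Hypothetical values: take the real URL and credentials from your deployment's API documentation.
PREDICTION_URL = "https://<your-instance>/deployments/<deployment-id>/predict"
TOKEN = "<your-api-token>"

payload = {"Store": 1, "Date": "2022-11-07", "IsHoliday": False}  # hypothetical feature names

response = requests.post(
    PREDICTION_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=payload,
    timeout=30,
)
response.raise_for_status()
print(response.json())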

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min
Define the problem | Done | a metric to validate the models | 1 week
Experiment | Done | ~100 models | ~20 min
Evaluate | Done | 2 models selected with the business team | 1 week
Deploy | Done | Model available across the organization | 5 min

Schedule

Once a model is deployed, it can be used for scheduled predictions. The first step is to insert it into a pipeline template, and then to create a new schedule using this template.

Note that you will need help from your IT team in this step, in order to define the names of the tables where you will read the features from and write the predictions to each week. You can use the same tables, overwritten each week, for example "sales to predict" to read and "sales predicted" to write, or a more complex naming scheme.

First, you need to create two new assets:

  • a new datasource linked to the table where the IT team will put the features for prediction each week
  • a new exporter to push the results

Create an exporter to push data to your CRM

You can then use them in a new pipeline template, again with three nodes:

  • an import from the datasource, where the datasource is the table with all the weekly features
  • a deployment prediction regression node
  • an export dataset node that uses the exporter above

Template

Once you have your template, create a new Schedule based on it.

Use your template in a schedule run

Choose the name of your deployment as the experiment deployment ID

Use your template in a schedule run

Then, instead of the manual trigger, use a periodic one, with the configuration that best fits your needs (here, a weekly prediction each Monday at 7:00 AM).

Scheduling a prediction each Monday morning

Click run and wait a few seconds. Your prediction is now scheduled to run every Monday, reading from the "sales to predict" table and writing to the "sales predicted" table of your database.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min
Define the problem | Done | a metric to validate the models | 1 week
Experiment | Done | ~100 models | ~20 min
Evaluate | Done | 2 models selected with the business team | 1 week
Deploy | Done | Model available across the organization | 5 min
Schedule | Done | Predictions in the CRM each Monday at 09:00 | 20 min

Monitoring

Once a model is deployed, each call to it is logged, whether it is a single unit call or a scheduled batch. You can track your model in the Deployments section of your project: click on a deployed experiment name in the list to access the deployment dashboard.

Train and production distribution

You can compare the feature distributions of the trainset with the feature distributions seen in production and check for drift. The target distributions of the main model and the challenger model are shown side by side with the production distribution, in order to evaluate performance in a real application.
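The platform computes this drift for you, but if you want to reproduce the check on your own data, a common approach is the Population Stability Index (PSI) between the training and production distributions of a feature. A minimal sketch with made-up values (the usual rule of thumb is that a PSI above roughly 0.2 signals significant drift):

import numpy as np

def psi(train_values, prod_values, n_bins=10):
    """Population Stability Index between the training and production samples of one feature."""
    # Bin edges taken from the training distribution; production values are clipped into that range.
    edges = np.quantile(train_values, np.linspace(0, 1, n_bins + 1))
    prod_clipped = np.clip(prod_values, edges[0], edges[-1])
    train_pct = np.histogram(train_values, bins=edges)[0] / len(train_values)
    prod_pct = np.histogram(prod_clipped, bins=edges)[0] / len(prod_values)
    # Avoid log(0) on empty bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - train_pct) * np.log(prod_pct / train_pct)))

# Made-up feature values: production sales shifted upwards compared to training.
rng = np.random.default_rng(0)
print(psi(rng.normal(25_000, 3_000, 5_000), rng.normal(27_000, 3_000, 1_000)))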

Under the monitoring/usage tab sit some SLA statistics about the number of calls, the average response time, and errors.

By tracking all these indicators for a month or more, you can evaluate how your model behaves in production and check that it performs the way you expected when you evaluated it in the experiment step.

Status

Task | Status | Output | Time spent
Data acquisition | Done | one trainset, one holdout | 5 min
Feature engineering | Done | one engineered dataset with features, one holdout | 20 min
Define the problem | Done | a metric to validate the models | 1 week
Experiment | Done | ~100 models | ~20 min
Evaluate | Done | 2 models selected with the business team | 1 week
Deploy | Done | Model available across the organization | 5 min
Schedule | Done | Predictions in the CRM each Monday at 09:00 | 20 min
Monitor | Done | Monitoring dashboard | 1 month

Conclusion

In this guide, you saw how to complete the whole data science process in less than a morning, going from data to a fully deployed model, shared across the company with full monitoring.

Using a tool to solve the technical issues of data science, like finding the best model, deploying it, or importing the data, allows you to spend more time on what truly matters: talking with the Line of Business team to translate their problem into data science configuration and metrics.