Thinking of starting a new data science project? Use this checklist as a guide to save yourself a headache and your company a ton of money.

It’s 2022 and Machine Learning is finally approaching the “Plateau of Productivity”. Algorithms are leaving the data science team, and Artificial Intelligence projects are now fully supported and funded enterprise-wide initiatives, like web development or real-time decisioning. The scope is now far larger than just “modeling”.

To help you start this new year, here are the “10 things to check before starting a Data Science project”

These tips are not about the modeling part but about making Data Science productive.

There is something to improve 

Call it an OKR or a KPI, but do not start a project without a quantitative objective that you and your stakeholder(s) agree on and document. Spend some time together, agree on a formula to calculate your objective, and then decide on the scope.
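To make this concrete, here is a minimal sketch (the metric name and numbers are made up) of turning the agreed objective into a formula both sides can compute:

```python
# Hypothetical example: the agreed objective is the share of actual churners
# that the model flags before they churn.
def churn_detection_rate(flagged_before_churn: int, total_churners: int) -> float:
    """Share of actual churners that were flagged before they churned."""
    if total_churners == 0:
        return 0.0
    return flagged_before_churn / total_churners

# Agreed target with the stakeholder: detect at least 80% of churners.
assert churn_detection_rate(flagged_before_churn=160, total_churners=200) >= 0.80
```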

Your problem is not underdetermined or overdetermined

You must predict some target value, that’s great! Now let’s count how many samples (rows) you have versus the number of features (columns). In the first case, you have a lot more features than samples. Your dataset is wider than tall and your problem is underdetermined, like a system of equations with more variables than equations.

The trap with this kind of data is that you will probably achieve good results. Systems like this will always find a model that perfectly fits the data. But this is overfitting, and nobody likes overfitting: it means the performance you saw while experimenting will collapse once the model is pushed into production.

In the second case, you have far more samples than features. The likely outcome here is that your model is barely better than a simple average or “most common value” model.

Of course, sometimes it works, because there is a lot of correlation between samples or features and your 800-feature dataset is in fact one feature in disguise. But remember: always be suspicious when you get a very wide or a very tall dataset.
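A quick sanity check on the shape of your dataset might look like this (the file name and thresholds are purely illustrative):

```python
import pandas as pd

# Hypothetical dataset; replace with your own file or table.
df = pd.read_csv("training_data.csv")

n_samples, n_features = df.shape[0], df.shape[1] - 1  # minus the target column

if n_features >= n_samples:
    print("Wide dataset: more features than samples, high risk of overfitting.")
elif n_samples > 100 * n_features:  # arbitrary "very tall" heuristic
    print("Very tall dataset: check that the few features carry enough signal.")
else:
    print(f"{n_samples} samples for {n_features} features looks reasonable.")
```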

Agree on a validation strategy

Which number(s) will be the “good-to-go” marker for you and your stakeholder? From simplest to most complex:

  • performance on a simple data fold
  • performance on a complex custom data fold
  • performance on holdout data
  • performance of a pre-production model
  • performance of cohort A vs. cohort B, where A uses the predictive model and B does not (A/B testing)

Of course, A/B testing is the best, but also the most expensive. You have to trade project cost against expected performance.
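As an illustration, here is a small scikit-learn sketch (toy data standing in for your own) of the two cheapest options, cross-validation folds and a holdout set:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_score, train_test_split

# Toy data standing in for your real dataset.
X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# Keep a holdout set that is only touched once, at the very end.
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.2, random_state=0
)

model = RandomForestClassifier(random_state=0)

# 1. Performance on cross-validation folds (cheapest check).
cv_auc = cross_val_score(model, X_train, y_train, cv=5, scoring="roc_auc")
print("CV AUC:", cv_auc.mean())

# 2. Performance on the holdout set (closer to production behaviour).
model.fit(X_train, y_train)
holdout_auc = roc_auc_score(y_holdout, model.predict_proba(X_holdout)[:, 1])
print("Holdout AUC:", holdout_auc)
```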

Have a baseline performance value

Always have a baseline score coming from a trivial model like “always predict 1”, “predict the most common output” or “predict 1 if age < 40”.

Sometimes making a very good prediction with one line of SQL is better than making a perfect prediction with a system that takes three FTEs and months to develop.

Remember: if just two “IF THEN ELSE” rules or a “SELECT AVG(target) FROM datas GROUP BY age” already perform well, you probably don’t need ML.
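For instance, scikit-learn’s DummyClassifier gives you such a baseline in a few lines (toy data again stands in for yours):

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

# "Predict the most common output" baseline; any real model must beat this.
baseline = DummyClassifier(strategy="most_frequent")
print("Baseline accuracy:", cross_val_score(baseline, X, y, cv=5).mean())
```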

Check performance, variance, size and speed of your model

Machine learning models are useful only if they bring more value than cost.

Gains come from performance and speed. If your model spends 500 ms on each prediction, a single sequential process can make at most about 172,000 predictions a day (86,400 seconds / 0.5 s); at an average of €0.001 earned per prediction, that is roughly €170 a day.

Costs come from the size and complexity of your model. The more memory your model requires, the more it is going to cost in production. Put that next to your expected gain and decide.

Last, variance. Always check the stability of your model and think about what is going to happen with an unstable one. Sometimes you will prefer to sacrifice a bit of performance for stability.
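A rough way to measure all four at once, using toy data and the pickled size as a proxy for memory footprint, could look like this:

```python
import pickle
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
model = RandomForestClassifier(random_state=0)

# Performance and variance across cross-validation folds.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("AUC:", scores.mean(), "variance:", scores.var())

# Size: the serialized model is a rough proxy for its memory footprint.
model.fit(X, y)
print("Model size (bytes):", len(pickle.dumps(model)))

# Speed: average latency of a single-row prediction.
row = X[:1]
start = time.perf_counter()
for _ in range(100):
    model.predict(row)
print("Latency per prediction (ms):", (time.perf_counter() - start) / 100 * 1000)
```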

The data is available for predictions

Remember that the goal of your model is to leave the lab and be applied to new, never-before-seen data.

Be sure that data can be delivered to your model once it is in production, whether that means a human who uploads Excel files every Monday morning or a fully automated pipeline with a remote database, access control, SQL queries and a scheduler. Either way, just have a plan.
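For example, a minimal sketch of the “automated pipeline” option (the connection string and table name are hypothetical, echoing the cheatsheet below) might be:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical connection string and table.
engine = create_engine("postgresql://user:password@db-host/customer_prod")

def load_new_data() -> pd.DataFrame:
    """Weekly batch of fresh rows for the model to score."""
    return pd.read_sql("SELECT * FROM new_data", engine)

# In production this function would be called by your scheduler (cron, Airflow, ...).
```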

The predictions can be accessed

Once you’ve got input flowing into your model, check that you have some kind of pipeline to deliver its predictions. A model that makes predictions but does not write them anywhere consumable is useless.

Yes, it sounds stupid but trust me, this is a common source of failure.
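A matching sketch for the output side (again with a hypothetical destination table) could be as simple as:

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical destination database and table.
engine = create_engine("postgresql://user:password@db-host/customer_prod")

def publish_predictions(scored: pd.DataFrame) -> None:
    """Write predictions where downstream consumers can actually read them."""
    scored.to_sql("new_pred", engine, if_exists="append", index=False)
```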

Actions can be taken from your predictions

Remember the first point? A Machine Learning project should always have something to improve. Some action must be taken from the predictions coming out of your model. It could be assigning top priority to an email or ordering less inventory. Check and recheck that your model is actionable.

For example, predicting the rain 1 minute before it starts doesn’t help anyone.
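To make it concrete, turning predictions into an action list can be this small (the 0.78 threshold and column names are only illustrative):

```python
import pandas as pd

# Hypothetical scored customers; the 0.78 threshold mirrors the cheatsheet example.
scored = pd.DataFrame({
    "customer_id": [101, 102, 103],
    "churn_proba": [0.91, 0.42, 0.80],
})

call_list = scored[scored["churn_proba"] > 0.78]
print(call_list)  # these customers get a retention call
```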

Assert that your model works

Of course everybody wants something that works, but the point is: can you assert it?

Here you are dealing with monitoring, feedback loops and A/B testing. The claim sounds obvious, but validating that the predictions are good is actually quite hard. For each completed prediction, you must have a way to verify whether it matched the actual outcome.

Moreover, what happens when you take action from a prediction to avoid the expected outcome? 

Do you have some A/B testing strategy that lets some predictions go on without intervention to test the theory? For example, your model tells you that John Doe has a high probability of churning. If you ignore this record and a selection of others, does the prediction hold?

This is also very important in the last step of your data science project. Ask yourself and the project team members how you will know you can trust your model in production. In practice, the only way is to compare “using the model” vs. “not using the model” through A/B testing.
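A bare-bones version of that comparison (with a made-up intervention log) might look like:

```python
import pandas as pd

# Hypothetical log: predicted churn probability, whether we intervened (group A)
# or deliberately did nothing (group B), and what actually happened.
log = pd.DataFrame({
    "churn_proba": [0.90, 0.85, 0.80, 0.92, 0.88, 0.81],
    "group":       ["A", "A", "A", "B", "B", "B"],
    "churned":     [0, 0, 1, 1, 1, 0],
})

churn_by_group = log.groupby("group")["churned"].mean()
print(churn_by_group)
# If group A (with intervention) churns much less than group B (no intervention),
# the model plus the action taken from it is actually creating value.
```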

Know when to retrain your model

Be upfront with stakeholders that models can become obsolete and have a plan to train alternatives.

The answer could be “never, the model was trained on 2 billion people so it will never get it wrong.” The most common answer will be based on a period of time (e.g. retrain each month), an amount of data (e.g. retrain every 10,000 new samples), or a drift of the features or the target (e.g. “the average age is 10% greater than in the training set”).
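For the drift-based trigger, a minimal sketch (the threshold and numbers are invented) could be:

```python
import pandas as pd

def mean_drift(train_col: pd.Series, live_col: pd.Series) -> float:
    """Relative shift of the mean between training data and live data."""
    return abs(live_col.mean() - train_col.mean()) / abs(train_col.mean())

# Hypothetical trigger: retrain if the average age drifted by more than 10%.
train_age = pd.Series([25, 32, 41, 38, 29])
live_age = pd.Series([44, 51, 39, 47, 40])

if mean_drift(train_age, live_age) > 0.10:
    print("Feature drift detected: schedule a retraining run.")
```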

Summary

While this is my checklist, I’d encourage you to take the parts that matter to you and add to it as you see fit. For a quick summary, here’s a snapshot of what was covered above.

The Machine Learning Project Cheatsheet

Step 1. There is something to improve. Example: number of actual churners detected before they churn.

Step 2. Your problem is not underdetermined or overdetermined. Example: 400 features for 60,000 samples.

Step 3. Agree on a validation strategy. Example: use the model during the first 4 weeks of 2022.

Step 4. Have a baseline performance value. Example: predicting the target average gives AUC 0.67; “if age < 60” gives AUC 0.84.

Step 5. Check performance, variance, size and speed of your model. Example: AUC 0.95, variance 0.008, size in RAM 5 MB, 30 ms per prediction.

Step 6. The data is available for predictions. Example: table “new_data” in DB “customer_prod”, refreshed each week.

Step 7. The predictions can be accessed. Example: write the predictions to table “new_pred” in DB “customer_prod” each week.

Step 8. Actions can be taken from your predictions. Example: each customer with a churn probability above 0.78 is called.

Step 9. Assert that your model works. Example: 95% of churners were caught.

Step 10. Know when to retrain your model. Example: retrain every 6 months OR if drift exceeds 30%.


About the author

Arnold Zephir

User advocacy