There can be a knowledge gap when transitioning from the exploratory Machine Learning projects typical of research and study to industry-level projects. Industry projects generally have three additional goals: to be collaborative, reproducible, and reusable. These goals serve to enhance business continuity, increase efficiency, and reduce cost. Although I am nowhere near finding a perfect solution, I would like to document some tips for transforming exploratory, notebook-based ML code into an industry-ready project designed for scalability and sustainability.
I have categorized these tips into three key strategies:
- Improvement 1: Modularization — Break Down Code into Smaller Pieces
- Improvement 2: Versioning — Data, Code and Model Versioning
- Improvement 3: Consistency — Consistent Structure and Naming Convention
Problem Statement
One struggle I have faced is keeping the entire data science project in a single notebook, which is common while learning data science. As you may have experienced, a data science lifecycle contains repeatable code components; for instance, the same data preprocessing steps are applied to transform both the train data and the inference data. If this is not handled properly, different versions of the same function end up copied and reused in multiple locations. Not only does this decrease the consistency of the code, it also makes troubleshooting the entire notebook more challenging.
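To make the idea of a repeatable component concrete, here is a minimal sketch in which the preprocessing steps live in one place and are called on both datasets. The function name `preprocess`, the constants `DROP_COLS` and `NUMERIC_COLS`, and the toy data are my own assumptions for illustration. The column names are borrowed from the example in this section, and the steps mirror it as written, including filling missing values with each dataset's own column means.

```python
import pandas as pd

# Column names taken from the example in this section.
DROP_COLS = ['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am']
NUMERIC_COLS = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']


def preprocess(data: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical shared helper: one definition of the preprocessing steps."""
    # drop() returns a new DataFrame, so the caller's data is left untouched.
    data = data.drop(columns=DROP_COLS)
    # Fill missing numeric values with the column means of the data passed in.
    data[NUMERIC_COLS] = data[NUMERIC_COLS].fillna(data[NUMERIC_COLS].mean())
    # Derive the month from the date and store it as a string.
    data['Month'] = pd.to_datetime(data['Date']).dt.month.apply(str)
    return data


# Toy data (made up) so the sketch runs on its own.
raw = pd.DataFrame({
    'Date': ['2020-01-15', '2020-02-20'],
    'MinTemp': [10.0, None],
    'MaxTemp': [20.0, 24.0],
    'Rainfall': [0.0, 1.2],
    'WindGustSpeed': [30.0, None],
    'WindSpeed9am': [5.0, 7.0],
    'Evaporation': [1.0, 2.0],
    'Sunshine': [8.0, 9.0],
    'Cloud3pm': [3.0, 4.0],
    'Cloud9am': [2.0, 5.0],
})

# The same function serves both datasets, so the steps cannot drift apart.
train_data = preprocess(raw)
inference_data = preprocess(raw.copy())
```

With a single definition, a change to a preprocessing step is made once and reaches both the train and the inference path.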
Bad Example
train_data = train_data.drop(['Evaporation', 'Sunshine', 'Cloud3pm', 'Cloud9am'], axis=1)
numeric_cols = ['MinTemp', 'MaxTemp', 'Rainfall', 'WindGustSpeed', 'WindSpeed9am']
train_data[numeric_cols] = train_data[numeric_cols].fillna(train_data[numeric_cols].mean())
train_data['Month'] = pd.to_datetime(train_data['Date']).dt.month.apply(str)
inference_data = inference_data.drop(['Evaporation', 'Sunshine'…