Journey to Full-Stack Data Scientist: Model Deployment

An introduction to productionizing machine learning models using APIs and Docker.

Growing Responsibilities of Data Scientists

The title of data scientist is ever-changing and often vague. It usually describes someone who is fluent in mathematics, programming, and machine learning. They spend time cleaning data, building models, fine-tuning, and running experiments. They must also have great communication skills, a good grasp of their domain, and other soft skills.

However, this is not always exactly the case. If you spend enough time scrolling through job boards, you will see that “Data Scientist” postings can differ quite a bit. Some read more like a data engineer role, focusing on pipelines and big data platforms. Some are closer to a data analyst role, focusing on data cleaning and dashboarding. And as of late, there are many that are similar to software or ML engineering, focusing on object-oriented programming, building applications, deploying models, and sometimes even web development.

Image by Author

And then there are postings that expect all of this and more: hence, the “full-stack Data Scientist”. With this in mind, data scientists should consider going beyond developing models in a notebook and expanding their skill set into other areas like ML Ops. As Pau Labarta Bajo says: “ML models inside Jupyter notebooks have a business value of $0.00”.

This article will go over how data scientists can successfully deploy their machine learning models from notebooks to fully productionized APIs by using FastAPI and Docker.

Thoughts on the “Full-Stack” Data Scientist

First, my personal opinion on the “full-stack data scientist”. With all of these emerging expectations, it is important for us to learn and be comfortable with other skills that we may not have learned in our education or early career. However, the expectation seems to be to master all of these skills, on top of keeping up with traditional data science. And while there are a few out there who are capable of this, it is not feasible for most of us.

I don't believe that becoming a full-stack data scientist means mastering every one of these skills, technologies, etc. I think that being a full-stack data scientist is about being able to wear all of the hats in the data science lifecycle through continuous learning and development.

While it may not be my expertise, I should be able to collaborate with data engineers to optimize pipelines. And while I am much more comfortable with developing models, I should be able to wear my “ML Engineer” hat and help get a model into deployment. A great data scientist will always have their niches, but will also have a working knowledge of other areas and can quickly learn new skills if and when they need to.

Model Development

First, for our example, we need to develop a model. Since this article focuses on model deployment, we will not worry about the performance of the model. Instead, we will build a simple model with limited features to focus on learning model deployment.

In this example, we will predict a data professional’s salary based on a few features, such as experience, job title, company size, etc.

See data here: https://www.kaggle.com/datasets/ruchi798/data-science-job-salaries (CC0: Public Domain). I slightly modified the data to reduce the number of options for certain features.

#import packages for data manipulation
import pandas as pd
import numpy as np

#import packages for machine learning
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.metrics import mean_squared_error, r2_score

#import packages for data management
import joblib

First, let’s take a look at the data.

Image by Author

Since all of our features are categorical, we will encode them as numbers. Below, we use ordinal encoders to encode experience level and company size. These are ordinal because they represent some kind of progression (0 = entry level, 1 = mid-level, etc.).

For job title and employment type, we will create dummy variables for each option (note we drop the first to avoid multicollinearity).

#use ordinal encoder to encode experience level
encoder = OrdinalEncoder(categories=[['EN', 'MI', 'SE', 'EX']])
salary_data['experience_level_encoded'] = encoder.fit_transform(salary_data[['experience_level']])

#use ordinal encoder to encode company size
encoder = OrdinalEncoder(categories=[['S', 'M', 'L']])
salary_data['company_size_encoded'] = encoder.fit_transform(salary_data[['company_size']])

#encode employment type and job title using dummy columns
salary_data = pd.get_dummies(salary_data, columns = ['employment_type', 'job_title'], drop_first = True, dtype = int)

#drop original columns
salary_data = salary_data.drop(columns = ['experience_level', 'company_size'])
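
To make the two encodings concrete, here is a minimal sketch on a made-up four-row DataFrame (the rows and the variable name toy are illustrative and not taken from the Kaggle data). It shows that OrdinalEncoder numbers the categories from 0 in the order given, and that get_dummies with drop_first drops the alphabetically first category.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data, only to illustrate the encoders
toy = pd.DataFrame({
    "experience_level": ["EN", "MI", "SE", "EX"],
    "job_title": ["Data Analyst", "Data Engineer", "Data Scientist", "Data Analyst"],
})

# Ordinal encoding: categories are numbered 0, 1, 2, ... in the order listed
encoder = OrdinalEncoder(categories=[["EN", "MI", "SE", "EX"]])
toy["experience_level_encoded"] = encoder.fit_transform(toy[["experience_level"]]).ravel()

# Dummy encoding: one 0/1 column per job title, minus the first (baseline) one
toy = pd.get_dummies(toy, columns=["job_title"], drop_first=True, dtype=int)

print(toy)
```

In this toy frame, "Data Analyst" becomes the baseline: a row with zeros in every job_title_ column is a Data Analyst.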

Now that we have transformed our model inputs, we can create our training and test sets. We will input these features into a simple linear regression model to predict the employee’s salary.

#define independent and dependent features
X = salary_data.drop(columns = 'salary_in_usd')
y = salary_data['salary_in_usd']

#split between training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state = 104, test_size = 0.2, shuffle = True)

#fit linear regression model
regr = linear_model.LinearRegression()
regr.fit(X_train, y_train)

#make predictions
y_pred = regr.predict(X_test)

#print the coefficients
print("Coefficients: \n", regr.coef_)

#print the MSE
print("Mean squared error: %.2f" % mean_squared_error(y_test, y_pred))

#print the R2 value (r2_score returns the plain R2, not the adjusted one)
print("R2: %.2f" % r2_score(y_test, y_pred))

Let’s see how our model did.

Image by Author

Looks like our R-squared is 0.27, yikes. A lot more work would need to be done with this model. We would likely need more data and additional information on the observations. But for the sake of this article, we will move forward and save our model.
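
For reference, scikit-learn's r2_score returns the ordinary R-squared. If an adjusted R-squared is wanted, it can be derived from R-squared, the number of observations, and the number of features. The sketch below uses the standard formula; the function name and the sample counts in the usage line are illustrative assumptions, not values from the model above.

```python
def adjusted_r2(r2: float, n_samples: int, n_features: int) -> float:
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    if n_samples - n_features - 1 <= 0:
        raise ValueError("need more samples than features + 1")
    return 1 - (1 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Illustrative numbers only: R^2 of 0.27 with 7 features and 100 test rows
print(round(adjusted_r2(0.27, n_samples=100, n_features=7), 3))
```

The adjustment penalizes extra features, so the adjusted value is always at or below the plain R-squared when the latter is below 1.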

#save model using joblib
joblib.dump(regr, 'lin_regress.sav')
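
Before building an API around the saved file, it can be worth checking that the model survives the round trip to disk. The following is a minimal sketch on synthetic data (the arrays and the temporary file location are made up for illustration): it dumps a fitted LinearRegression with joblib, loads it back, and confirms that the predictions match.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data standing in for the encoded salary features
rng = np.random.default_rng(0)
X = rng.random((20, 3))
y = X @ np.array([1.0, 2.0, 3.0]) + 5.0

model = LinearRegression().fit(X, y)

# Save to a temporary directory, then load the model back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "lin_regress.sav")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions
same = np.allclose(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same)
```

Keep in mind that a saved model is tied to the library versions that created it, so the environment that loads the file should use the same scikit-learn version.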

Creating an API

There are several ways to deploy a model. One of those ways is with an API. An API (Application Programming Interface) enables two pieces of software to communicate with each other. There are several API architectures, such as SOAP, RPC, and REST. We will use a REST API, one of the most popular and flexible architectures for accessing a web service.

For our framework, we will use FastAPI (https://fastapi.tiangolo.com/), which is great for beginners as it is fairly easy to use and has tons of documentation and examples.

With REST APIs, there are five methods that are commonly used: POST, GET, PUT, PATCH, and DELETE. These correspond to create, read, update, and delete operations (PUT and PATCH both perform updates). Our script below (main.py) will follow these steps:

  1. Initialize the FastAPI framework and define the request format.
  2. Load the saved model.
  3. Create a GET endpoint that returns a welcome message.
  4. Create a POST endpoint that accepts new data and returns a prediction.
  5. Define the host IP and port (location to operate the API).

import uvicorn
import pandas as pd
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

# Initialize FastAPI
app = FastAPI()

# Define the request body format for predictions
class PredictionFeatures(BaseModel):
    experience_level_encoded: float
    company_size_encoded: float
    employment_type_PT: int
    job_title_Data_Engineer: int
    job_title_Data_Manager: int
    job_title_Data_Scientist: int
    job_title_Machine_Learning_Engineer: int

# Global variable to store the loaded model
model = None

# Load the saved model from disk
def load_model():
    global model
    model = joblib.load('lin_regress.sav')

# Load the model immediately when the script runs
load_model()


# API Root endpoint
@app.get("/")
async def index():
    return {"message": "Welcome to the Data Science Income API. Use the /predict feature to predict your income."}

# Prediction endpoint
@app.post("/predict")
async def predict(features: PredictionFeatures):

    # Create input DataFrame for prediction
    # (column names must match the training columns, which contain spaces)
    input_data = pd.DataFrame([{
        "experience_level_encoded": features.experience_level_encoded,
        "company_size_encoded": features.company_size_encoded,
        "employment_type_PT": features.employment_type_PT,
        "job_title_Data Engineer": features.job_title_Data_Engineer,
        "job_title_Data Manager": features.job_title_Data_Manager,
        "job_title_Data Scientist": features.job_title_Data_Scientist,
        "job_title_Machine Learning Engineer": features.job_title_Machine_Learning_Engineer
    }])

    # Predict using the loaded model
    prediction = model.predict(input_data)[0]

    # Convert the NumPy float to a plain Python float for the JSON response
    return {
        "Salary (USD)": float(prediction)
    }

if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
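
One benefit of declaring the request body as a pydantic model is that malformed requests are rejected before they reach the model; FastAPI turns such a validation error into a 422 response. The sketch below reproduces that check outside the API. The range limits added with Field are an assumption based on the ordinal encodings used earlier (0 to 3 for experience, 0 to 2 for company size), and the class and helper names are hypothetical, not part of the script above.

```python
from pydantic import BaseModel, Field, ValidationError

# Hypothetical stricter variant of the request model, with range checks
class CheckedFeatures(BaseModel):
    experience_level_encoded: float = Field(ge=0, le=3)  # EN=0 ... EX=3
    company_size_encoded: float = Field(ge=0, le=2)      # S=0, M=1, L=2
    employment_type_PT: int = Field(ge=0, le=1)          # 0/1 dummy flag

def is_valid(payload: dict) -> bool:
    """Return True if the payload passes validation, False otherwise."""
    try:
        CheckedFeatures(**payload)
        return True
    except ValidationError:
        return False

good = {"experience_level_encoded": 2.0, "company_size_encoded": 1.0, "employment_type_PT": 0}
bad = {"experience_level_encoded": "senior", "company_size_encoded": 5.0, "employment_type_PT": 0}

print(is_valid(good))
print(is_valid(bad))
```

Whether to enforce ranges at the API boundary is a design choice; without them, a linear model will happily extrapolate from values it never saw during training.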

Now let’s use the command line to test the API. First, change the directory to your project. Then, run the API using uvicorn.

cd "C:\Users\adavi\OneDrive\Desktop\Salary Model"
py -m uvicorn main:app --reload

The command line gives me a link to follow. I am then greeted with the message from the GET endpoint. Nice!

Image by Author

Lastly, let’s create a test script to submit new data and retrieve a prediction. Using the requests library, we define the URL and submit a new observation.

import requests

url = 'http://127.0.0.1:8000/predict'

#dummy data to test API
data = {"experience_level_encoded": 3.0,  #EX (EN=0, MI=1, SE=2, EX=3)
        "company_size_encoded": 2.0,      #L (S=0, M=1, L=2)
        "employment_type_PT": 0,
        "job_title_Data_Engineer": 0,
        "job_title_Data_Manager": 1,
        "job_title_Data_Scientist": 0,
        "job_title_Machine_Learning_Engineer": 0}

#make a POST request to the API
response = requests.post(url, json=data)

#print response
print(response.json())

Image by Author

The prediction is then returned in JSON format thanks to the POST endpoint. Great, we have a functioning API!
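
Because the API expects already-encoded features, a client has to repeat the encoding from the training step. A small helper can hide that detail. The sketch below is one possible approach; the function name is hypothetical, and it assumes that any job title or employment type not listed was the baseline category dropped by get_dummies.

```python
# Ordinal mappings matching the category order used during training
EXPERIENCE_LEVELS = {"EN": 0.0, "MI": 1.0, "SE": 2.0, "EX": 3.0}
COMPANY_SIZES = {"S": 0.0, "M": 1.0, "L": 2.0}

# Job titles that have their own dummy column in the request format
JOB_TITLES = ["Data Engineer", "Data Manager", "Data Scientist", "Machine Learning Engineer"]

def build_payload(experience: str, company_size: str, employment_type: str, job_title: str) -> dict:
    """Translate raw category values into the JSON body expected by /predict."""
    payload = {
        "experience_level_encoded": EXPERIENCE_LEVELS[experience],
        "company_size_encoded": COMPANY_SIZES[company_size],
        "employment_type_PT": int(employment_type == "PT"),
    }
    # One 0/1 flag per job title; field names use underscores, as in the request model
    for title in JOB_TITLES:
        payload["job_title_" + title.replace(" ", "_")] = int(job_title == title)
    return payload

print(build_payload("EX", "L", "FT", "Data Manager"))
```

The resulting dictionary can be passed directly as the json argument of requests.post.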

Deploying Model Using Docker

What is Docker?

Now we have a way to interact with our model, but the model is still not deployed. Let’s say we have a team of 20 people, all of whom we want to have the API running on their computer. This is likely to be a headache. Replicating data science applications can be challenging as there are a number of roadblocks, such as different operating systems, dependencies, tech stacks, etc.

This is where Docker comes in. Docker is a platform that enables developers to package their applications and all of their dependencies in “containers”. Anyone who has access to a container can run the application without worrying about downloading the correct versions of packages, changing operating systems, etc. Docker containers are also fast to start and lightweight, which gives them an advantage over full virtual machines.

Download Docker Desktop here: https://www.docker.com/

Creating a Dockerfile and Image

Before we create a container, we must first create an image. A Docker image is a snapshot of the application and its dependencies. It basically outlines the instructions for the container.

To create an image, you must create a Dockerfile (https://docs.docker.com/reference/dockerfile/). The Dockerfile is a text-based document that is stored inside the project and provides the instructions on how to assemble the image. By default, Docker looks for a file named exactly “Dockerfile”, with no extension, so it should not be saved as a .txt file. The easiest way to create a Dockerfile is through VSCode. Simply add a new file, and name it “Dockerfile”.

I built the following Dockerfile using Docker's beginner documentation. It follows these steps:

  1. Start from a Python 3.9 base image.
  2. Create a new directory and copy the project files.
  3. Install the necessary packages using requirements.txt.
  4. Specify the port (8000).
  5. Run the application.

# A Dockerfile is a text document that contains all the commands
# a user could call on the command line to assemble an image.

FROM python:3.9.4-buster

# Our base image (Debian Buster with Python 3.9.4) is now set.

RUN mkdir build

# We create a folder named build for our files.

WORKDIR /build

# Now we set our WORKDIR to /build

COPY . .

# FROM [path to files in the folder where we run docker build]
# TO [current WORKDIR]
# We copy our files (files listed in .dockerignore are ignored)
# to the WORKDIR

RUN pip install --no-cache-dir -r requirements.txt

# OK, now we pip install our requirements

EXPOSE 8000

# This instruction informs Docker that the container listens on port 8000

WORKDIR /build/app

# Now we set our WORKDIR to /build/app for simplicity

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

# This command runs our uvicorn server
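
The Dockerfile installs packages from requirements.txt, which is not shown in this article. One way to draft that file is to pin the versions currently installed in the environment where the model was trained. The sketch below does this with the standard library; the package list is an assumption based on the imports used in the scripts above, and the helper name is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Assumed package list, based on the imports in the scripts above
PACKAGES = ["fastapi", "uvicorn", "pandas", "scikit-learn", "joblib"]

def pin_requirements(packages):
    """Return ('name==version' lines, names that are not installed)."""
    pinned, missing = [], []
    for name in packages:
        try:
            pinned.append(f"{name}=={version(name)}")
        except PackageNotFoundError:
            missing.append(name)
    return pinned, missing

pinned, missing = pin_requirements(PACKAGES)
print("\n".join(pinned))
if missing:
    print("Not installed:", ", ".join(missing))
```

The printed lines can be saved as requirements.txt next to the Dockerfile.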

Now that we have our Dockerfile, we can create the image with the following command. The name of the image will be “apiserver”.

#build docker image
docker build . -t apiserver

If we navigate to Docker Desktop, we can see that the image was successfully created.

Image by Author

Creating a Docker Container

Now that we have an image, creating the container is very simple. Once we run the image with a few instructions, the container is created. Below, we run the image and specify the port.

#run docker image
#access at http://localhost:8000
docker run --rm -it -p 8000:8000/tcp apiserver:latest

If we navigate back to Docker Desktop again, we can see the container. Docker gives containers random names, which can become difficult to track. If you develop many applications, it is useful to give them explicit names (for example, with the --name flag of docker run).

Image by Author

The model is now deployed! Going back to our team of 20, all they need is Docker installed on their machine and access to our image. Then they can run a container from it and use the API as needed.

Conclusion

In conclusion, with new expectations for data scientists, it is vital to learn other skills like software engineering and ML Ops. The need for “full-stack data scientists” is growing as organizations need those that can engage in all stages of the data science lifecycle.

Taking machine learning models out of notebooks and into production is a great first step toward becoming a full-stack data scientist. With tools like FastAPI and Docker, you can share the hard work that went into building your model by letting others use it too.

I hope you have enjoyed my article! Please feel free to comment, ask questions, or request other topics.

Connect with me on LinkedIn: https://www.linkedin.com/in/alexdavis2020/


Journey to Full-Stack Data Scientist: Model Deployment was originally published in Towards Data Science on Medium.