Serverless ML: Deploying Lightweight Models at Scale

Deploying ML models 'into production' as scalable APIs can be tricky. This post looks at how Serverless Functions can make deployment easier for some applications, and gives an example project to get you started deploying your own models as Google Cloud Functions.

A deployment conundrum

Deploying machine learning (ML) models into production can sometimes be something of a stumbling block for Data Science (DS) teams. A common mode of deployment is to find somewhere to host your models and expose them via APIs. In practice, this can make it easy for your end users to integrate your model outputs directly into their applications and business processes. Furthermore, if the customer trusts the validity of your outputs and performance of your API, this can drive huge business value: your models can make a direct and lasting impact on the target business problem.

However, if you don't have access to ongoing technical support in the form of DevOps or MLOps teams, then wading through cloud services to set up load balancers, API gateways, continuous integration and delivery pipelines, security settings etc. can be quite a lot of overhead. Moreover, unless you're pretty confident with these concepts, delivering (and monitoring) an ML API for which you can guarantee security and performance at scale and thereby engender the trust of your users can be challenging.

Of course, there's an ever-increasing number of services available to help you with this process. Perhaps your first port of call would be one of the major cloud providers' managed model deployment services (such as SageMaker). Or perhaps you'd look at one of the thriving MLOps tools/platforms, such as Cortex or Seldon, or maybe you'd even choose to just dive headfirst into something like Kubeflow and TensorFlow Serving. It can be a bit bewildering when you're starting out.


There's another problem here too. While these tools and platforms automate many ML-specific tasks and often reduce the cognitive burden of deploying ML models, they still involve a lot of overhead: you will still need to spend a fair while reading documentation and playing with examples before you're likely to feel confident using these tools in practice. Indeed, in general, if you want to deploy models at scale, you are going to have to spend a significant amount of time learning a fair few technologies, and getting comfortable with a fair few software engineering concepts to boot. There's no shortcut there, even with these new tools.

But while this is all true, there are of course always special cases. And in these special cases, you can take a few shortcuts that can help you get an ML model 'into production' quickly and with minimal overhead. That's what this post is about: it'll give a high-level overview of what counts as a special case; give a brief introduction to the concept of serverless computing; and then introduce a code example (with repo!) for deploying a special-case ML model as a Google Cloud Function in only a few lines of code.

Special cases

There's a long-standing, often-quoted but occasionally misapplied (ignored?) notion in DS, software engineering and elsewhere of starting small, clean and simple and then increasing complexity over time. From the perspective of ML models, this means starting with the simplest ML model that works (i.e. produces useful business value), and going from there. For example, this might mean that – if the given business problem allows – you might want to try playing around with some simple linear models before reaching for a monster gradient boosted tree or deep neural network.

These more complex modelling approaches are naturally appealing for those in the DS/ML field: they're clearly very powerful in many applications after all. But there are downsides. Explainability can rapidly become difficult, for one. Poor generalisation on small datasets is a real danger in some contexts too. From a deployment perspective, there are potentially several issues as well. First off: the packages and libraries that provide these capabilities are typically heavy. They often have large binaries (i.e. take up a lot of disk space), significant memory and CPU (and sometimes GPU/TPU) requirements, sometimes (relatively) poor inference performance, and typically produce bloated images (if using a technology like Docker, for example).

In comparison, a simple linear model typically has minimal dependencies (plus, if needed, it takes only a few tens of lines of pure NumPy code to implement a lightweight linear model from scratch), near-zero resource requirements after training, and lightning-fast inference performance to boot. At the risk of sounding old-fashioned: you can go a long way with a bunch of well-designed linear models.
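To make the 'few tens of lines of pure NumPy' claim concrete, here's a minimal sketch of a binary logistic regression trained with plain batch gradient descent. It is an illustration only, not the code used in the example project: the class name, learning rate, iteration count and synthetic data are all assumptions made up for this snippet.

```python
import numpy as np


class TinyLogisticRegression:
    """Binary logistic regression fitted with batch gradient descent (illustrative)."""

    def __init__(self, learning_rate=0.1, n_iter=2000):
        self.learning_rate = learning_rate
        self.n_iter = n_iter

    @staticmethod
    def _sigmoid(z):
        # Clip to keep np.exp well inside floating point range.
        return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y, dtype=float)
        # Standardise features so gradient descent copes with mixed scales.
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.scale_[self.scale_ == 0] = 1.0
        Z = (X - self.mean_) / self.scale_
        self.coef_ = np.zeros(Z.shape[1])
        self.intercept_ = 0.0
        for _ in range(self.n_iter):
            error = self._sigmoid(Z @ self.coef_ + self.intercept_) - y
            # Gradient of the mean log-loss with respect to weights and bias.
            self.coef_ -= self.learning_rate * (Z.T @ error) / len(y)
            self.intercept_ -= self.learning_rate * error.mean()
        return self

    def predict_proba(self, X):
        Z = (np.asarray(X, dtype=float) - self.mean_) / self.scale_
        return self._sigmoid(Z @ self.coef_ + self.intercept_)

    def predict(self, X):
        return (self.predict_proba(X) >= 0.5).astype(int)


# Synthetic, linearly separable data purely for demonstration.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

model = TinyLogisticRegression().fit(X, y)
accuracy = (model.predict(X) == y).mean()
print(f"Training accuracy: {accuracy:.2%}")
```

Once fitted, everything the model needs at inference time is a handful of floats (the means, scales, coefficients and intercept), which is why models like this sit so comfortably inside resource-constrained environments.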


All of this to say: if you can do a good job of addressing a business problem using linear models, you'd do well to at least start there. And if you do start there (and are using Python – a limitation of serverless, not a tribal thing!), then you're one of the special cases. Congratulations! Deploying your lightweight models as serverless functions might save you a lot of time and pain, and might be ideal for your use-case.

Going serverless

Over the last few years, the concept of serverless computing has blown up in the software engineering world. The basic idea is that cloud providers can go a long way towards abstracting away (hiding) the complexity of deploying applications into production environments by largely doing away with the need for engineers to manually configure their servers, load balancers, containers etc. before being able to generate any business value.

In some ways, this is analogous to the problem faced by some Data Scientists as outlined above: there's often a lot of overhead involved in getting a model 'out of the door', in stable service and ultimately generating business value. Serverless computing aims to remove much of this overhead: in principle, the objective is to have you, the developer, write a simple function and immediately deploy it 'into production' in a theoretically infinitely scalable and secure manner.


What's more, serverless computing is explicitly aimed at supporting 'event driven' applications. For example, if you need a function (model) to run every time a specific file changes in your cloud storage, or every day at a specific time, or perhaps every time a new customer signs up for your service, you can configure your function to do this with ease. You get this sort of functionality for free. Sounds pretty cool, right?
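As a rough sketch of what an event-driven function can look like, the snippet below follows the two-argument `(event, context)` style that Google's Python runtime uses for background functions. The function name, the `.csv` filter and the returned message are hypothetical, and the `bucket` and `name` keys assume a Cloud Storage trigger; check the provider documentation for the exact event payload you'll receive.

```python
def score_new_file(event, context):
    """Runs when a storage object changes (assumed Cloud Storage trigger)."""
    bucket = event["bucket"]  # assumed key: the bucket that raised the event
    name = event["name"]  # assumed key: the path of the changed object

    # Ignore anything that isn't a CSV file (hypothetical business rule).
    if not name.endswith(".csv"):
        return None

    # A real function would load the file and run the model here.
    message = f"Would score gs://{bucket}/{name}"
    print(message)
    return message


# Local smoke test with a hand-built event; no cloud services are involved.
result = score_new_file({"bucket": "my-bucket", "name": "patients.csv"}, context=None)
```

Because the handler is just a plain Python function, you can exercise it locally with a hand-built event dictionary like this before wiring it up to a real trigger.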

This post isn't going to cover the basics of serverless beyond the discussion here. If you'd like a deeper dive into how serverless works, and what its relative strengths and weaknesses are, then there's a previous post on this you should check out:

Now for an example!

Time for code

This example will show you how to structure and build a simple ML pipeline for training a Scikit-Learn model (in this case a simple LogisticRegression model) to predict heart disease using the UCI Heart Disease dataset, and then deploying it as a Google Cloud Function. This isn't going to be an exercise in data analysis, so don't expect much discussion of data exploration and modelling decisions! If you'd rather just dive into the code, here you go:

For everyone else, read on!

Before you begin...

You'll need to register for a Google Cloud Account and make sure you've read the previous post introducing serverless computing too. Google are currently offering $300 of 'free' credit when you sign up which will be more than enough for this tutorial and a few of your own projects too. Just remember to disable your account when you're done! As a disclaimer: this post is not affiliated with Google in any way – it is simply a generous offer that could be handy for those wanting to grow their knowledge of cloud services!

Additionally, you'll need to have Python 3.7 installed on your system, and access to GitHub. If you regularly work with Python and GitHub, you should be fine. If not, you can check the version of Python you have installed with:

python --version

If you don't see Python 3.7.x (where x will be some minor version), you'll need to install it. You might find pyenv helpful for this. Here's a guide on installing specific versions of Python with pyenv. You can find out how to get started with GitHub with GitHub's 'Hello World' guide.

Cloning the repository

First things first: you'll need the repository. You can clone the repository with:

git clone https://github.com/markdouthwaite/serverless-scikit-learn-demo

Or you can clone the repository from GitHub directly.

What is inside the box?

Now, navigate into this newly cloned directory. The repository provides a simple framework for structuring your project and code ready for deploying it as a cloud function. There are a few files that you should familiarise yourself with:

requirements.txt - Python's venerable (if flawed) convention for capturing a project's dependencies. In here you'll find a list of the packages you'll need to run your Cloud Function. Google's service looks for this file and will automatically install the packages listed in it before running your function.
steps/train.py - This script trains your model. It builds a Scikit-Learn Pipeline that provides a concise 'canonical' way of binding your pre- and post-processing to your model as a single portable block. This makes it a lot easier and cleaner to deploy as a Cloud Function (and as an API in general!). When it has finished training, it'll do a simple evaluation of the model, and print some stats to the terminal. The resulting model will be pickled with joblib and saved in the artifacts directory as pipeline.joblib. In practice you might find it useful to save these files in Google Cloud Storage, but storing them locally will work for now.
main.py - This module contains your 'handler'. This is what users will interact with when they call your service when it is deployed. You might notice that the structure of the init_predict_handler function is a little odd. This is because the function needs to load your model when the main module is first loaded, and your function needs to maintain a reference to the loaded model. You could simply load it outside the function scope too, of course, but the structure shown soothes my OCD by limiting the 'visibility' of the model only to the function itself, and not to any other code you may write accessing the module.
app.py - This module provides a minimal Flask app setup for testing your cloud function locally (i.e. before deploying it). You can launch it with python app.py and call it as normal.
datasets/default.csv - The 'default' dataset for this example. This is the UCI Heart Disease dataset in CSV format.
resources/payload.json - An example payload for you to send to the API when it is deployed. Convenient, eh?
notebooks/eda.ipynb - A dummy notebook to illustrate where you might want to store Exploratory Data Analysis (EDA) code for/during your model development.
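The 'slightly odd' structure described for main.py is essentially a closure: an initialiser loads the model once and returns a handler that keeps a private reference to it. The sketch below shows the general pattern only and is not the repository's actual code: `DummyModel`, `FakeRequest`, the `age` field and the "healthy" label are hypothetical stand-ins so the snippet runs without any dependencies. It assumes the request object exposes a Flask-style `get_json()` method.

```python
import json


class DummyModel:
    """Stand-in for the unpickled pipeline (hypothetical, for illustration)."""

    def predict(self, rows):
        return [1 if row["age"] > 50 else 0 for row in rows]


def init_predict_handler(load_model):
    """Load the model once and return a handler that closes over it."""
    # Runs a single time, when the module is first imported. In a real project
    # load_model might wrap something like joblib.load on the saved artifact.
    model = load_model()

    def handler(request):
        payload = request.get_json()  # assumes a Flask-style request object
        label = model.predict([payload])[0]
        diagnosis = "heart-disease" if label == 1 else "healthy"
        return json.dumps({"diagnosis": diagnosis})

    return handler


# Module-level name that would be used as the function's entry point.
predict_handler = init_predict_handler(DummyModel)


class FakeRequest:
    """Minimal request double for local testing (hypothetical)."""

    def __init__(self, body):
        self._body = body

    def get_json(self):
        return self._body


response = predict_handler(FakeRequest({"age": 63}))
print(response)
```

The model is only reachable through `handler`, so nothing else in the module can accidentally modify it, and repeated calls reuse the already-loaded object rather than reloading it per request.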

If you've read the previous post on serverless concepts, this structure should make some sense to you. Anyway, now for the fun part, actually running the code.

Training the model

Naturally, to deploy a model, you need a model in the first place. To do this, you'll need to use the train.py script. This script uses Fire to help you configure the inputs, outputs and model parameters from the command line. You can run the script with:

python steps/train.py --path=datasets/default.csv --tag=_example --dump

What is this doing? It's telling the script to load data from datasets/default.csv, to label the output model with the tag _example, and to dump (write) the model file to the target location (which would be artifacts/pipeline_example.joblib). You should see an output like:

Training accuracy: 86.78%
Validation accuracy: 88.52%
ROC AUC score: 0.95

Not a terrible model eh? And all from old-school logistic regression. Now for the party piece: deploying the model.

Deploying the model

Now you have your model, you can deploy it. You'll need to install Google Cloud's command line tool. You can do this for your system using their guide. With that done, in your terminal inside your project directory, you can then deploy your new model with:

gcloud functions deploy heart-disease --entry-point=predict_handler --runtime=python37 --allow-unauthenticated --project={project-id} --trigger-http

You'll need to substitute {project-id} for your project ID (you can get this from your Google Cloud console). After a few moments, your model should be live. You'll be able to query your nice new API at:

https://{subdomain}.cloudfunctions.net/heart-disease

When your model is deployed, the terminal output will give you the specific URL you can call (substitute your {subdomain}). Simple right? Your newly deployed model will essentially be infinitely scalable to boot, which is nice.

Querying the model

Finally, it's time to get some predictions. You can send:

curl --location --request POST 'https://{subdomain}.cloudfunctions.net/heart-disease' --header 'Content-Type: application/json' -d @resources/payload.json

And you should get:

{"diagnosis":"heart-disease"}

In response. That's it, you have a live ML API. Congratulations!
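If you'd rather call the API from Python than from curl, a minimal sketch using only the standard library might look like the following. The helper names and the `age` field are hypothetical (the real payload lives in resources/payload.json), and the URL is a placeholder you'd swap for the one printed when you deployed.

```python
import json
import urllib.request


def build_request(url, payload):
    """Build a JSON POST request equivalent to the curl command above."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(url, payload, timeout=10):
    """Send the payload to the deployed function and decode the JSON reply."""
    request = build_request(url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request makes no network call, so this is safe to run anywhere.
example = build_request("https://example.invalid/heart-disease", {"age": 63})
print(example.get_method(), example.full_url)
```

Calling `predict(...)` with your real URL and the contents of resources/payload.json should then return the parsed response as a dictionary.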

Things to remember

As you can see, deploying your ML models as serverless functions is probably the fastest, simplest route to getting a stable, scalable ML API deployed. But as always, it can come at a price. Here are some things to remember:

Serverless functions (like Google Cloud Functions) are generally constrained by resources (CPU, RAM), so loading and using 'big' models in serverless functions will often be problematic.
Serverless functions work best when they're highly responsive (i.e. when inference time is very fast in the ML case). If your model is a bit slow, or depends on other services that are slow (e.g. slow SQL queries), this might not be the best option either.
Serverless functions are designed to process lots of small requests often. If your use-case involves big, batched queries, they might not be a good fit – there are often hard timeout restrictions on requests.
Serverless functions that have not been used for a period of time are spun down. As soon as you make a call to them, they're spun back up. This creates a short lag as the function is 'warmed up'. Additionally, if you're using Python, for example, the notoriously slow start time for the Python interpreter can become problematic.
Serverless functions generally have first-class support for common 'infrastructure' languages like Node.js and Python. If you're using R, MATLAB, Julia etc. (including as dependencies for Python via rpy2) the support will vary from non-existent to poor. There are workarounds, but these typically come with performance penalties and increased complexity (in some ways reducing the value of the 'serverless' concept).
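One quick way to sanity-check the responsiveness point before you deploy is to time your prediction function locally. The helper below is a rough sketch: the function name, the number of calls and the toy 'model' are all hypothetical, and it only measures warm, in-process latency, so it tells you nothing about cold starts or network overhead.

```python
import statistics
import time


def measure_latency(predict, payload, n_calls=100):
    """Return median and worst-case latency of predict(payload) in milliseconds."""
    timings = []
    for _ in range(n_calls):
        start = time.perf_counter()
        predict(payload)
        timings.append((time.perf_counter() - start) * 1000.0)
    return {"median_ms": statistics.median(timings), "max_ms": max(timings)}


# Toy stand-in for a model's predict function (hypothetical).
stats = measure_latency(lambda row: sum(row) > 1.0, [0.2, 0.5, 0.9])
print(stats)
```

If the median here is already a sizeable fraction of the response time you're aiming for, that's a hint the model may not be a comfortable fit for a serverless function.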