Design ML Model & Code Deployment System

In this post, we describe the design of an ML model and code deployment system that runs across multi-regional data centers and supports archival, disaster recovery, low latency, and multi-instance deployment. We built this product as a One-Click Deployment System to get ML models and code into production faster in Walmart Search.

Functional Requirements:
Users should be able to deploy code and ML models from the UI.
Different models can be deployed to multiple instances within a single environment. For example, in the dev environment, every instance can run a different model.
Models should be archived in the blob store.
Disaster recovery should be enabled in case any distributed blob store goes down.

Non-Functional Requirements:
Upload and download latency for the model and package to the distributed blob store should be low, approximately within 1 minute.
The system should handle about 100 deployments per day.
The Deployment Service should be highly available.

User Flow:

As shown in the image below, the developer selects the endpoint/instance to deploy to in the dev/stage/prod environment. This in turn pushes the model and package to blob storage (the blob store).

Capacity Estimation:

Machine-learned model binary files are typically large.

ML Model and package file size : 4 GB

Number of deployments handled by search services daily : 100

Number of environments (dev, stage and prod) : 3

Blob Store Storage for one month : 4 GB * 100 * 30 * 3=~ 36 TB

Storage for three DataCenters for west, south, east (disaster recovery)=~ 36 TB * 3 = 108 TB

Storage for one year = 108 TB * 12 = 1296 TB =~ 1.3 PB (we will revisit this in the Archival process)
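The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the function and parameter names are illustrative and not part of the original system, and it assumes decimal units (1 TB = 1000 GB).

```python
def storage_estimate_tb(model_gb=4, deployments_per_day=100, days=30,
                        environments=3, data_centers=3, months=12):
    """Return (monthly_tb, replicated_monthly_tb, yearly_tb) in decimal TB."""
    # One month of artifacts across all environments, in GB.
    monthly_gb = model_gb * deployments_per_day * days * environments
    monthly_tb = monthly_gb / 1000
    # Every data center keeps a full copy.
    replicated_tb = monthly_tb * data_centers
    yearly_tb = replicated_tb * months
    return monthly_tb, replicated_tb, yearly_tb


monthly, replicated, yearly = storage_estimate_tb()
print(monthly, replicated, yearly)  # 36.0 108.0 1296.0
```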

High Level Design:

High Level Design — 10000 feet overview

API Design:

Deploy API: POST /deploy/:modelName/:modelVersion?dataCenter=dc-all&endpoint=variant-2

UnDeploy API: POST /undeploy/:modelName/:modelVersion?dataCenter=dc-all&endpoint=variant-2
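To illustrate how a client might fill in these routes, here is a small sketch that builds the request path and query string. The helper name build_url and its defaults are assumptions made for illustration; only the route shape and the dataCenter and endpoint query parameters come from the API definitions above.

```python
from urllib.parse import quote, urlencode


def build_url(action, model_name, model_version,
              data_center="dc-all", endpoint="variant-2"):
    """Build the request path for the Deploy or UnDeploy API (hypothetical helper)."""
    if action not in ("deploy", "undeploy"):
        raise ValueError(f"unknown action: {action}")
    # Escape path segments so unusual model names cannot break the route.
    path = f"/{action}/{quote(model_name, safe='')}/{quote(model_version, safe='')}"
    query = urlencode({"dataCenter": data_center, "endpoint": endpoint})
    return f"{path}?{query}"


print(build_url("deploy", "reranking_model", "1.0.32"))
# /deploy/reranking_model/1.0.32?dataCenter=dc-all&endpoint=variant-2
```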

Database Design:

We use CloudSql to store the metadata of the deployments. On every Deploy/Undeploy API call, we create an entry that records the deployment. This lets us track the history, status, failures (including workflow failures), and user details of each deployment.
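The post does not show the actual schema, so the following is only a sketch of what such a metadata table could look like. It uses Python's built-in sqlite3 as a stand-in for CloudSql, and every table name, column name, and the IN_PROGRESS and FAILED status values are assumptions.

```python
import sqlite3

# Hypothetical schema: one row per Deploy/UnDeploy API call.
SCHEMA = """
CREATE TABLE deployments (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    model_name TEXT NOT NULL,
    model_version TEXT NOT NULL,
    environment TEXT NOT NULL,
    endpoint TEXT NOT NULL,
    action TEXT NOT NULL,
    status TEXT NOT NULL,
    failure_reason TEXT,
    requested_by TEXT NOT NULL,
    created_at TEXT DEFAULT CURRENT_TIMESTAMP
)
"""


def record_request(conn, model_name, model_version, environment,
                   endpoint, action, requested_by):
    """Insert a row for a new API call and return its id."""
    cur = conn.execute(
        "INSERT INTO deployments (model_name, model_version, environment,"
        " endpoint, action, status, requested_by)"
        " VALUES (?, ?, ?, ?, ?, 'IN_PROGRESS', ?)",
        (model_name, model_version, environment, endpoint, action, requested_by),
    )
    conn.commit()
    return cur.lastrowid


def update_status(conn, deployment_id, status, failure_reason=None):
    """Set the final status once the workflow finishes."""
    conn.execute(
        "UPDATE deployments SET status = ?, failure_reason = ? WHERE id = ?",
        (status, failure_reason, deployment_id),
    )
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory stand-in for the real database
conn.execute(SCHEMA)
row_id = record_request(conn, "reranking_model", "1.0.32", "dev",
                        "variant-1", "DEPLOY", "some.user")
update_status(conn, row_id, "DEPLOYED")
```

Keeping one row per API call, rather than one row per model, is what preserves the deployment history.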

We use blob storage for the ML model and package. To make it highly available and avoid a single point of failure, we keep blob store replicas in three regions (south, west, and east).

Algo Design:

Deploy API Flow:

When the user clicks Deploy in the client UI after choosing the environment and instance, the Deploy API asynchronously triggers a workflow (an Airflow DAG) that pushes the model and package to the blob storage of all three regions. If the model and package fail to store in any region's blob store, the deployment is stopped and marked as failed. We store the metadata of the deployment in an application.properties file in the blob store. This file holds the model name and version of that deployment and lives at the path below in the blob store of all three regions. After the deployment workflow completes successfully, we update the deployment status in the metadata DB to LIVE / DEPLOYED.

Directory path : Container / data / model_name / env / application.properties

Example: search / data / reranking_model / dev / application.properties
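The all-or-nothing behaviour of the deploy flow can be sketched as follows. This is not the actual Airflow DAG: plain dictionaries stand in for the regional blob stores, and the function name, the UploadError class, and the property keys model.name and model.version are all assumptions.

```python
REGIONS = ("south", "west", "east")


class UploadError(Exception):
    """Raised when a regional blob store cannot accept the artifact."""


def deploy(stores, model_name, version, env, endpoint, payload):
    """Push payload to every region; fail the deployment if any upload fails.

    `stores` maps region name -> dict acting as a stand-in blob store.
    Returns the status that would be written to the metadata DB.
    """
    artifact_key = f"search/data/{model_name}/{env}/{endpoint}/{version}"
    props_key = f"search/data/{model_name}/{env}/application.properties"
    try:
        for region in REGIONS:
            if region not in stores:
                raise UploadError(f"blob store unavailable: {region}")
            stores[region][artifact_key] = payload
    except UploadError:
        return "FAILED"
    # Publish the pointer file only after every region holds the artifact.
    # Hypothetical keys; the post does not specify the file contents.
    props = f"model.name={model_name}\nmodel.version={version}\n"
    for region in REGIONS:
        stores[region][props_key] = props
    return "DEPLOYED"
```

Writing application.properties last means a failed upload never leaves a region pointing at a version it does not have. The sketch does not clean up artifacts that were already uploaded before the failure.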

At runtime, on a daily restart or a manual deployment, the service reads application.properties from the above path, downloads the model and package from the region-specific blob storage, and loads them on system startup. Connecting to the blob storage in the service's own region reduces download latency during startup.
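A sketch of the startup lookup is shown below. It assumes application.properties uses simple key=value lines with the hypothetical keys model.name and model.version, since the post does not give the file format. The function only resolves which region and path to download from and leaves the download itself out.

```python
def parse_properties(text):
    """Parse simple key=value lines, skipping blanks and # comments."""
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        props[key.strip()] = value.strip()
    return props


def resolve_artifact(props_text, region, env, endpoint, container="search"):
    """Return (region, blob_path) to download at startup, or None if nothing is deployed."""
    props = parse_properties(props_text)
    # Hypothetical keys; the real file may use different names.
    name, version = props.get("model.name"), props.get("model.version")
    if not name or not version:
        return None  # empty record: nothing to load until the next deployment
    return region, f"{container}/data/{name}/{env}/{endpoint}/{version}"
```

Passing the service's own region keeps the download local to that region's blob store.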

UnDeploy API flow:

When the user clicks UnDeploy in the client UI after choosing the environment and instance, the UnDeploy API asynchronously triggers a workflow (an Airflow DAG) that removes the model and package from the blob storage of all three regions. After the undeployment workflow completes successfully, we update the deployment status in the metadata DB to UNDEPLOYED. The application.properties file is also updated with an empty record, and it stays that way until the next deployment takes place.
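The undeploy flow can be sketched in the same spirit. Dictionaries again stand in for the regional blob stores, and the function name and the representation of the empty record as an empty string are assumptions.

```python
def undeploy(stores, model_name, version, env, endpoint, container="search"):
    """Remove the artifact from every region and blank the pointer file.

    `stores` maps region name -> dict acting as a stand-in blob store.
    Returns the status that would be written to the metadata DB.
    """
    artifact_key = f"{container}/data/{model_name}/{env}/{endpoint}/{version}"
    props_key = f"{container}/data/{model_name}/{env}/application.properties"
    for store in stores.values():
        store.pop(artifact_key, None)  # tolerate a region that never had the artifact
        store[props_key] = ""          # empty record until the next deployment
    return "UNDEPLOYED"
```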

WorkFlow Design:

We follow the directory structure below to store the models and package in the blob storage of all three regions.

Directory path : Container / data / model_name / env / endpoint / version

Example : search / data / reranking_model / dev / variant-1 / 1.0.32
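One way to keep this layout consistent across the workflow and the service is to encode it in a single small type. The class below is an illustrative sketch, not code from the original system; it builds a path from its components and parses one back.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArtifactPath:
    """Container / data / model_name / env / endpoint / version (hypothetical helper)."""
    container: str
    model_name: str
    env: str
    endpoint: str
    version: str

    def to_path(self):
        return "/".join([self.container, "data", self.model_name,
                         self.env, self.endpoint, self.version])

    @classmethod
    def from_path(cls, path):
        # Accept both "a/b" and the spaced "a / b" form used in this post.
        parts = [p.strip() for p in path.split("/")]
        if len(parts) != 6 or parts[1] != "data":
            raise ValueError(f"unexpected artifact path: {path}")
        container, _, model_name, env, endpoint, version = parts
        return cls(container, model_name, env, endpoint, version)
```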

Multi-Instance Deployment per Environment

From the client UI, the user selects the DataCenter (blob-store-all) and the Endpoint (variant-1, variant-2) within an environment (dev, stage, prod) for the particular model and package to be deployed.

For the microservice at runtime, we configure an environment variable per instance. In the example below, the environment variable named endpoint holds the value used to follow the directory path in the blob storage of the respective region.

- name: dev-variant-01
  flows: [master]
  target:
    - cluster_id: [north-dev-a3]
  helm:
    values:
      scaling:
        enabled: true
      env:
        endpoint: variant-1
        deploy_env: dev
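On the service side, reading these two variables might look like the sketch below. The variable names endpoint and deploy_env come from the configuration above; the function name and the fail-fast behaviour on missing values are assumptions.

```python
import os


def instance_settings(environ=None):
    """Read the per-instance settings injected by the deployment config.

    Returns (deploy_env, endpoint). Pass a dict to override os.environ in tests.
    """
    environ = os.environ if environ is None else environ
    missing = [k for k in ("endpoint", "deploy_env") if not environ.get(k)]
    if missing:
        # Fail at startup rather than silently loading the wrong model.
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return environ["deploy_env"], environ["endpoint"]
```

For the instance configured above, this returns ("dev", "variant-1"), which selects the env and endpoint segments of the blob storage path.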

Archival Process

At any given point in time, the solution is designed to keep only two versions in blob storage, so archival is handled through the undeployment workflow. Every deployment from the UI automatically triggers an undeployment API call that deletes/archives the oldest deployment in that environment. This frees a lot of space: the petabyte-scale yearly estimate above shrinks to 72 GB across all replicas.

Two versions : 4 GB + 4 GB = 8 GB

With three environments : 3 * 8 GB = 24 GB

With three data centers/replicas : 3 * 24 GB = 72 GB
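The retention rule and the resulting steady-state footprint can be sketched as below. Both function names are hypothetical, and the versions are assumed to be tracked oldest first.

```python
def versions_to_archive(live_versions, new_version, max_versions=2):
    """Return (kept, archived) after deploying new_version.

    `live_versions` is ordered oldest first. Redeploying a version that is
    already live moves it to the newest position instead of duplicating it.
    """
    updated = [v for v in live_versions if v != new_version] + [new_version]
    cut = max(len(updated) - max_versions, 0)
    return updated[cut:], updated[:cut]


def steady_state_gb(model_gb=4, versions=2, environments=3, replicas=3):
    """Storage needed once only `versions` copies are retained everywhere."""
    return model_gb * versions * environments * replicas


kept, archived = versions_to_archive(["1.0.30", "1.0.31"], "1.0.32")
print(kept, archived)      # ['1.0.31', '1.0.32'] ['1.0.30']
print(steady_state_gb())   # 72
```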

Disaster Recovery

Automated Process: An on-demand DAG handles this workflow. Replicas of the binary and the package jar are maintained in a GCP blob store. We have a feature to download all the directories in a container from both GCP and the blob store. If the entire blob storage is corrupted, we use the GCP blob store and copy its contents to all the region-specific blob storage.
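Conceptually, the recovery step is a full copy from the backup to every regional store. The sketch below models that with dictionaries standing in for the GCP backup and the regional blob stores; the function name and the wholesale-replace strategy are assumptions, not the actual DAG.

```python
def restore_regions(backup, stores, regions=("south", "west", "east")):
    """Overwrite each regional store with the contents of the backup store.

    `backup` and each entry in `stores` are dicts acting as stand-in blob stores.
    Returns the number of objects copied per region.
    """
    copied = {}
    for region in regions:
        # Replace the corrupted contents wholesale with a fresh copy.
        stores[region] = dict(backup)
        copied[region] = len(backup)
    return copied
```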

Latency Metrics

Uploading a 4 GB model and package to the blob stores of all three regions with low latency is quite challenging. We started with an upload latency of 45 minutes using the blob storage Python and Java SDKs with async processing. After we switched to the blob store CLI, the latency came down to 1 minute 40 seconds.

Similarly, downloading with the blob storage Python SDK takes about 1 minute during service startup.
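To put these timings in perspective, the snippet below converts them into average throughput for a single 4 GB payload. It assumes decimal units (1 GB = 1000 MB) and treats each figure as the wall-clock time for one payload; the post does not say whether the three regional uploads ran in parallel, so these are effective rates rather than measured network speeds.

```python
def throughput_mb_per_s(size_gb, seconds):
    """Average throughput in decimal MB/s for one payload."""
    return size_gb * 1000 / seconds


sdk_upload = throughput_mb_per_s(4, 45 * 60)  # 45 minutes with the SDKs
cli_upload = throughput_mb_per_s(4, 100)      # 1 min 40 s with the CLI
download = throughput_mb_per_s(4, 60)         # about 1 minute at startup
speedup = (45 * 60) / 100

print(round(sdk_upload, 2), cli_upload, round(download, 1), speedup)
# 1.48 40.0 66.7 27.0
```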

Summary

This post covered the design considerations and challenges of implementing a model and code deployment system, including how high availability and disaster recovery are maintained in such a product. If you have any follow-up thoughts, please leave them in the comments.

Design ML Model & Code Deployment System was originally published in Walmart Global Tech Blog on Medium.

Article Link: Design ML Model & Code Deployment System | by Sanjay T | Walmart Global Tech Blog | Jun, 2021 | Medium