Walmart’s Multi-Cloud Machine Learning Platform

Authors: Pamidi Pradeep & Thomas Vengal

Need for Multi-Cloud

Business demands and technology adoption have changed rapidly over the last couple of decades. A major technology shift came with the commoditization of cloud infrastructure and with post-pandemic changes in human behaviour patterns. These changes have forced large enterprises to rethink and redefine their longer-term strategy so that they can scale and adapt to ever-changing and demanding business needs, including IT (Information Technology) infrastructure availability for 24x7x365 operations.

As the Fortune #1 company in the world, Walmart has been at the forefront of technology adoption. The scale of its data systems is among the largest on the planet. Although Walmart is best known for its large chain of retail stores, many distinct functions beyond pure retail also operate at high volume, such as finance, logistics, e-commerce, customer service, and supply chain. Walmart requires economies of scale to operate in a highly competitive and price-sensitive market with fluctuating, inflationary costs.

Accelerated digital transformation has increased the adoption of cloud services. The cloud provides an edge in infrastructure management through on-demand computing capability and reduced capital expenditure. It also brings the first challenge: deciding which cloud provider an enterprise should choose. Increasingly, companies have been standardizing on the available commercial cloud vendors to gain overall operational efficiencies that eventually lower costs, improve speed of execution, and shorten time to market. If any concern arises, however, switching to another cloud provider in a short span of time is very difficult. The barriers include the substantial cost, time, and effort that go into setting up solutions on a specific cloud service, along with application interdependencies and the technical debt accumulated in the process. Hence, it is imperative to consider a multi-cloud deployment strategy, especially for newer Machine Learning projects that are deployed in mission-critical applications.

Challenges and Requirements

The prime need is to use services from various cloud technology providers to the best advantage of the enterprise without incurring extensive cost and time. Adding the high-end technology components needed for Machine Learning (ML) and Artificial Intelligence (AI) into the mix amplifies the complexity. At big data scale, AI solutions are becoming increasingly resource intensive, with ever-growing demand for complex algorithmic computation. A major challenge for enterprises has been to bring together products and services from different vendors to meet business expectations, since no single vendor can solve all the problems. The selected solution must be future proof and flexible enough to cater to evolving AI workloads and the other needs of an ever-growing high-tech enterprise.

The cloud solution must cover two basic requirements:

1. Portability: Build once and deploy anywhere. One of the major challenges is to ensure that solutions can be deployed to any cloud provider without incurring migration costs and time delays. This lowers vendor dependency or lock-in and gives the flexibility to choose the provider that offers the best price to performance.
2. Best of the Breed: Make use of the best available services from various third-party vendors (open source software, commercial packaged software vendors, and cloud vendors).
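
As a rough illustration of choosing a provider by price to performance, the sketch below ranks candidates by work done per unit of cost. The provider names, costs, and throughput figures are invented placeholders, not real pricing or benchmark data, and the metric itself is an assumption about how such a comparison might be made.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ProviderQuote:
    """Hypothetical benchmark result for one provider (illustrative values only)."""
    name: str
    cost_per_hour: float  # illustrative currency units per hour
    throughput: float     # illustrative work units per hour, e.g. predictions


def price_performance(quote: ProviderQuote) -> float:
    """Work done per unit of cost; higher is better."""
    if quote.cost_per_hour <= 0:
        raise ValueError("cost_per_hour must be positive")
    return quote.throughput / quote.cost_per_hour


def best_provider(quotes: Sequence[ProviderQuote]) -> ProviderQuote:
    """Return the quote with the highest price-to-performance ratio."""
    if not quotes:
        raise ValueError("at least one quote is required")
    return max(quotes, key=price_performance)


# Placeholder numbers chosen only to exercise the function.
quotes = [
    ProviderQuote("provider-a", cost_per_hour=4.0, throughput=1000.0),
    ProviderQuote("provider-b", cost_per_hour=3.0, throughput=900.0),
    ProviderQuote("provider-c", cost_per_hour=5.0, throughput=1100.0),
]
print(best_provider(quotes).name)  # provider-b (900 / 3 = 300 units per cost unit)
```

A comparison like this is only meaningful when the same portable workload runs on every candidate, which is why portability comes first in the list above.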

Considering the scale and breadth of technology usage, creating a single platform for the newer technology demands of AI has been a major challenge. Hence, before building an ML platform at Walmart, the following requirements must be considered:

1) Lifecycle: Managing the end-to-end Data Science lifecycle involves stages from data ingestion through model training and model evaluation to model deployment. After deployment, regular monitoring for model fairness and bias, along with retraining processes, is required to avoid model decay over time.
2) Development Tools: Multiple technology tools are required, such as
a. Programming language support (Python, PySpark, Scala, R, SQL and so on)
b. Developer tools availability (Jupyter, Theia, PyCharm, RStudio and so on)
c. Latest state of the art AI/ML algorithms and libraries (TensorFlow, Keras, PyTorch and so on)
d. Reporting tools (Grafana, RShiny and so on)
3) Data: Enabling quick and easy access to the required data
4) Location: The physical location of the infrastructure where development and deployment occur, such as
a. Cloud infrastructure to run with high availability, elastic demand, and disaster recovery requirements in different regions.
b. Edge Infrastructure to run at remote stores with low power, high availability, and minimal maintenance
5) Compute Type: Large scale computing requirements across CPUs (Central Processing Units), GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) to run the training and inference jobs.
6) Scale: Usage with several million ML models that need to run in parallel for different prediction algorithms.
7) Availability: Disaster Recovery of IT systems to run on redundant infrastructure across multiple regions and multiple service providers.
8) Governance: Simple and uniform processes and tools, with gating procedures, so that changes have minimal impact on this large scale of operation
9) Economies of Scale: Keeping the costs low with increasing scale of data volume and algorithmic complexity

Solution

It is nearly impossible to find a single cloud vendor that meets all of Walmart’s requirements (the nine listed above). In addition, depending on a single vendor would be a catastrophic strategy for long-term sustenance. Hence, the solution was to build an AI/ML platform where multiple third-party vendors can be hosted and managed, so that a combination of best of breed technology is available from multiple cloud vendors. For portability, the platform provides the flexibility of multi-cloud technologies under a single umbrella called the Triplet Model, which encompasses three different cloud technologies: Walmart Cloud, GCP (Google Cloud Platform), and Azure Cloud.

Fig: Multi-cloud Triplet Deployment Architecture from element Machine Learning Platform

Driven by the multi-cloud strategy, Walmart developed an in-house AI/ML platform called the element Machine Learning platform. The platform is built in a unified, multi-cloud manner and can run on the Triplet Model. It has been helping data scientists build solutions at speed and at scale. It brings in best of breed tools from cloud providers, open source, and third-party providers. Data scientists can now focus on the solution, while the element platform provides access to the tools and standardizes the process to bring efficiency, speed, and lower cost.

In this multi-cloud deployment process, ML development can happen on one cloud and ML run-time on another, with Walmart’s Cloud Native Platform (WCNP) acting as a cloud “abstraction layer”. The AI/ML platform is built using best of breed open-source technologies and can plug and play external open-source or cloud technologies. Any code written in a development environment is translated into symmetrical deployment and serving across different regions and clouds, which solves the portability problem. The running cost covers only the basic compute infrastructure, not the premium of higher-value AI/ML tools.

The element platform supports development in various programming languages, such as Python, Scala, R, and SQL, and enables developers to use tools such as Jupyter Notebooks, Theia, RStudio, Google Vertex, PyCharm, and so on. It provides direct data connectivity to over two dozen data source systems, which helps data scientists unlock data for data science projects. It is also integrated with over twenty internal IT tools and systems that are required for managing the complete MLOps (Machine Learning Operations) lifecycle, such as code versioning, CI/CD, authentication, monitoring, alerting, logging, and so on. As part of the element platform, all of the above integrations need to be customized so that they are available for triplet operations.

Fig: element integration with various IT tools and products within Walmart
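
The idea of one definition being translated into symmetrical deployments across clouds and regions can be sketched as follows. This is a minimal illustration under stated assumptions, not element's or WCNP's actual interface: the `ModelSpec` and `Deployment` types, the `expand` function, the region names, and the image path are all hypothetical. Only the three cloud names come from the Triplet Model described above.

```python
from dataclasses import dataclass
from typing import Dict, List

# Cloud names follow the Triplet Model in the text; the regions are made up.
TRIPLET_TARGETS: Dict[str, List[str]] = {
    "walmart-cloud": ["region-1"],
    "gcp": ["region-1"],
    "azure": ["region-1"],
}


@dataclass(frozen=True)
class ModelSpec:
    """Cloud-neutral description of a model service (hypothetical fields)."""
    name: str
    image: str     # container image reference, placeholder value below
    replicas: int


@dataclass(frozen=True)
class Deployment:
    """One concrete deployment of a spec to a cloud and region."""
    cloud: str
    region: str
    spec: ModelSpec


def expand(spec: ModelSpec, targets: Dict[str, List[str]]) -> List[Deployment]:
    """Produce one deployment per (cloud, region), all sharing the same spec."""
    return [
        Deployment(cloud, region, spec)
        for cloud, regions in sorted(targets.items())
        for region in regions
    ]


spec = ModelSpec(name="demand-forecast", image="registry.example/demand-forecast:1.0", replicas=2)
deployments = expand(spec, TRIPLET_TARGETS)
for d in deployments:
    print(d.cloud, d.region, d.spec.name)
```

Because every deployment carries the identical spec, any cloud-specific detail has to live in the layer that consumes these records, which is the role the text assigns to the abstraction layer.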

Any code developed on element can be deployed on Walmart’s private cloud or edge systems, on Google Cloud, and on Microsoft Azure. The element ML platform abstracts the complexities of provisioning resources across multiple cloud vendors and streamlines the process of managing them. Being a platform, it is also able to optimize resource usage and share infrastructure across varying ML workloads from different projects.
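
One common way to hide per-vendor provisioning behind a single entry point is an abstract interface with one backend per cloud. The sketch below assumes that pattern purely for illustration; the source does not describe element's internals, and `CloudProvisioner`, `InMemoryProvisioner`, `Platform`, and `ComputeRequest` are hypothetical names. The backend here only records requests and never calls a real cloud API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ComputeRequest:
    """Cloud-neutral resource request (hypothetical fields)."""
    cpus: int
    gpus: int = 0


class CloudProvisioner(ABC):
    """Interface every cloud backend would implement in this sketch."""

    @abstractmethod
    def provision(self, request: ComputeRequest) -> str:
        """Allocate resources and return an opaque handle."""


class InMemoryProvisioner(CloudProvisioner):
    """Stand-in backend that records requests instead of calling a cloud API."""

    def __init__(self, cloud: str) -> None:
        self.cloud = cloud
        self.allocated: List[ComputeRequest] = []

    def provision(self, request: ComputeRequest) -> str:
        self.allocated.append(request)
        return f"{self.cloud}-{len(self.allocated)}"


class Platform:
    """Single entry point that routes a request to the chosen backend."""

    def __init__(self, backends: Dict[str, CloudProvisioner]) -> None:
        self._backends = backends

    def provision(self, cloud: str, request: ComputeRequest) -> str:
        if cloud not in self._backends:
            raise ValueError(f"unknown cloud: {cloud!r}")
        return self._backends[cloud].provision(request)


platform = Platform({name: InMemoryProvisioner(name) for name in ("walmart-cloud", "gcp", "azure")})
handle = platform.provision("gcp", ComputeRequest(cpus=8, gpus=1))
print(handle)  # gcp-1
```

Callers only ever see `Platform.provision`, so adding or replacing a vendor means registering another backend rather than changing project code.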

In summary, the element ML platform helps Data Science teams at Walmart to onboard quickly, to benefit from training with best of breed tools, and to deploy portably on the optimal cloud in terms of high performance, low cost, high availability, and high scalability.

Walmart’s Multi-Cloud Machine Learning Platform was originally published in Walmart Global Tech Blog on Medium.

Article Link: Walmart’s “Element” Multi-Cloud Machine Learning Platform | by Thomas Vengal | Walmart Global Tech Blog | Medium