A bird’s-eye view of the key components of a modern enterprise ML Platform
A query from an investor friend nudged me into a deep dive into the ML Platform landscape over the Thanksgiving weekend. As I was gathering notes, I thought it’d be good to share them for the benefit of folks interested in understanding the space.
I was particularly interested in the different components that make up a modern ML platform for a company, and I looked at 3 different classes of solutions:
- OSS and closed-source solutions built primarily for internal consumption at high-tech companies like Uber, Airbnb, Lyft, and DoorDash.
- Commercial startups I’m aware of that offer these services as enterprise-grade software solutions. This is by no means an exhaustive list.
- Commercial managed services from the 3 big cloud vendors: AWS, Azure, and Google Cloud.
I identified the following key services, spanning the exploration, build, train, test & production stages of the model development lifecycle (MDLC):
- Managed Notebook service
- Model training service
- Data labeling service
- Workflow orchestration & Data pipelines
- Feature store
- Model management services
- Model serving/deployment service
- Model monitoring and explainability service
Managed Notebook service
Jupyter notebooks are the most popular IDE for data scientists. Besides their browser-based UI, notebooks offer a great deal of flexibility for day-to-day tasks such as feature exploration, feature engineering, data wrangling, model building, and data analysis. Built-in visualization tools and the ability to save and share notebooks as simple HTML files have made Jupyter notebooks ubiquitous in the data science world.
In a company setting, a managed notebook service can offer additional utilities over a vanilla notebook running on a laptop, which can drive significant efficiency for a sizeable team of data scientists. For example:
- Pre-built environments with commonly used ML libraries and easy access to algorithms.
- Built-in connectivity to access-protected data warehouses and feature stores.
- Built-in support for cloud-backed, GPU-based model runtimes, which developer laptops cannot offer.
- Ability to spin up training jobs from within notebooks.
OSS/Well known frameworks: Airbnb Redspot service, Uber DSW, Kubeflow
Enterprise Startups: Determined AI, Databricks, Domino Data Lab, Paperspace Gradient
Cloud Vendors: AWS SageMaker
Model training service
Model training jobs are iterative, stateful, incremental, and compute-intensive, and they often require distributed computing for speedy turnarounds. To iterate fast on model development, data scientists need to be able to spin up training jobs without jumping through the hoops of hardware setup with their IT team.
In addition, smart ML training services provide access to automated hyperparameter search algorithms, which help training converge on a good configuration more efficiently. They also offer built-in resiliency via checkpoint saving and restoring. The ability to provision CPU/GPU resources for distributed training algorithms, and support for multiple popular frameworks, are also key characteristics of an enterprise-wide training platform.
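To make the two ideas above a bit more concrete, here is a minimal sketch of random hyperparameter search combined with checkpoint save and restore. It is purely illustrative and not how any of the platforms listed here implement it: `train_step` is a made-up toy loss, and the function names, the JSON checkpoint format, and the log-uniform sampling range are all my own assumptions.

```python
import json
import os
import random
import tempfile


def train_step(lr, step):
    """Toy stand-in for one training step: loss shrinks faster for lr near 0.1."""
    return 1.0 / (1.0 + step * (1.0 - abs(lr - 0.1)))


def run_trial(lr, total_steps, ckpt_path):
    """Train for total_steps, resuming from ckpt_path if a checkpoint exists."""
    start, loss = 0, None
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            state = json.load(f)
        start, loss = state["step"], state["loss"]
    for step in range(start + 1, total_steps + 1):
        loss = train_step(lr, step)
        with open(ckpt_path, "w") as f:  # checkpoint after every step
            json.dump({"step": step, "loss": loss}, f)
    return loss


def random_search(n_trials, total_steps, workdir, seed=0):
    """Sample learning rates log-uniformly and keep the best final loss."""
    rng = random.Random(seed)
    best = None
    for trial in range(n_trials):
        lr = 10 ** rng.uniform(-4, 0)  # assumed search range: 1e-4 to 1
        ckpt = os.path.join(workdir, f"trial_{trial}.json")
        loss = run_trial(lr, total_steps, ckpt)
        if best is None or loss < best["loss"]:
            best = {"trial": trial, "lr": lr, "loss": loss}
    return best


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        print(random_search(n_trials=5, total_steps=10, workdir=workdir))
```

If a trial is interrupted, calling `run_trial` again with the same checkpoint path picks up from the last saved step instead of starting over, which is the resiliency property a training service would give you out of the box.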
OSS/Well known frameworks: Airbnb BigQueue service, Uber Horovod, Kubeflow Fairing, TFX.
Enterprise Startups: Determined AI, Databricks, Domino Data Lab
Cloud Vendors: AWS SageMaker, Azure ML
Data labeling service
Data powers ML applications. High-quality labeled training data is essential for high model performance. Labeling data in a scalable, reliable way requires purpose-built workflows and expertise. Smart, intuitively designed labeling workflows can bring significant efficiency gains when managing a large labeling workforce on large data projects.
Enterprise Startups: Scale AI, Snorkel, Labelbox
Cloud Vendors: AWS SageMaker Ground Truth
Workflow orchestration and Data pipelines
Machine learning production workflows involve data movement between multiple sub-systems at data ingestion, preparation, training, and scoring time. Data needs to be moved and massaged into the desired format before it is fed into models for training and inference.
Various teams and products inside an organization store data in different locations, and that data may be consumed in a completely different part of the organization, possibly in a different data center, cloud, or even geography. Workflow orchestration systems allow for composing complex workflows, scheduling them, and executing them reliably at scale in production.
Modern workflow services build on top of Kubernetes and focus on expressive ways to represent and execute complex DAG-style workflows, while offloading compute scale-out to the underlying k8s system.
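As a rough illustration of what "DAG-style" means here, the sketch below runs a tiny workflow in dependency order using Python's standard `graphlib` module (Python 3.9+). The task names and the `run_workflow` helper are hypothetical, and everything runs in a single process; real orchestrators add scheduling, retries, and distributed execution on top of this basic idea.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, deps):
    """Run tasks in dependency order.

    tasks: name -> callable taking a dict of upstream results
    deps:  name -> set of upstream task names
    """
    order = TopologicalSorter({name: deps.get(name, set()) for name in tasks})
    results = {}
    for name in order.static_order():  # raises CycleError on a cyclic graph
        upstream = {d: results[d] for d in deps.get(name, set())}
        results[name] = tasks[name](upstream)
    return results


# Hypothetical four-step pipeline: ingest -> prepare -> train -> score
tasks = {
    "ingest": lambda up: [3, 1, 2],
    "prepare": lambda up: sorted(up["ingest"]),
    "train": lambda up: sum(up["prepare"]) / len(up["prepare"]),
    "score": lambda up: [x - up["train"] for x in up["prepare"]],
}
deps = {
    "prepare": {"ingest"},
    "train": {"prepare"},
    "score": {"prepare", "train"},
}
results = run_workflow(tasks, deps)
```

Each task only sees the outputs of the tasks it declares as dependencies, so the workflow definition itself documents how data flows between steps.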
OSS/ Well known frameworks: Flyte, Airflow, Argo, Kubeflow Pipelines
Enterprise Startups: Prefect, Pachyderm, Databricks, Domino Data Lab
Cloud Vendors: AWS Data Pipeline, Google AI Platform Pipelines (Kubeflow Pipelines), Azure ML Pipelines
Feature store
Feature handling is a fairly involved process, both during model development and at production time: it spans feature discovery, exploration, extraction, transformation, and serving. Feature stores are responsible for making these operations repeatable and reliable at scale, serving features to multiple ML applications for scoring, and tracking versions, data lineage, and metadata. Feature stores also make backtesting of models easier by backfilling historical data when new features are added.
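One way to picture the backfilling and backtesting point is a point-in-time lookup: for each labeled event, fetch the feature values as they were at that moment. The sketch below is a toy in-memory version under my own assumptions; `ToyFeatureStore` and its methods are hypothetical names, not the API of any of the products listed here.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyFeatureStore:
    """In-memory, append-only store of timestamped feature values."""

    def __init__(self):
        # (entity_id, feature_name) -> list of (timestamp, value), kept sorted
        self._rows = defaultdict(list)

    def write(self, entity_id, feature, timestamp, value):
        rows = self._rows[(entity_id, feature)]
        rows.append((timestamp, value))
        rows.sort(key=lambda r: r[0])

    def read_as_of(self, entity_id, feature, timestamp):
        """Latest value at or before timestamp, or None if none exists yet."""
        rows = self._rows.get((entity_id, feature), [])
        idx = bisect_right([t for t, _ in rows], timestamp)
        return rows[idx - 1][1] if idx else None

    def training_rows(self, labels, features):
        """Join labeled events to feature values as of each event time."""
        return [
            {"entity": e, "ts": ts, "label": y,
             **{f: self.read_as_of(e, f, ts) for f in features}}
            for e, ts, y in labels
        ]


store = ToyFeatureStore()
store.write("u1", "orders_7d", 10, 2)
store.write("u1", "orders_7d", 20, 5)
# Backfill a newly added feature with a historical timestamp.
store.write("u1", "avg_basket", 8, 30.0)
rows = store.training_rows([("u1", 15, 1)], ["orders_7d", "avg_basket"])
```

Because reads are keyed by event time, a backtest at timestamp 15 sees `orders_7d` as 2 rather than the later value 5, which avoids leaking future information into training data.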
OSS/ Well known frameworks: Airbnb Zipline, Uber’s Michelangelo Palette, Facebook’s FBLearner, Kubeflow
Enterprise Startups: Tecton, Scribble, Hopsworks from Logical Clocks
Model management and experiment management
Once models are trained, they can be cataloged and tracked using model management systems.
Data scientists also have to work with business and product teams to run multiple experiments. Large teams need to track changes to models across experiments in a scalable way, which involves capturing things like code versions, data versions, hyperparameters, environment, and metrics.
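For a sense of what gets captured per experiment, here is a minimal sketch of a run record holding the items just mentioned. The `log_run` and `best_run` helpers and the record layout are hypothetical, and the "registry" is just a Python list; real experiment trackers persist this to a server and add UIs, artifact storage, and comparison tools.

```python
import hashlib
import json
import platform
import time


def fingerprint(obj):
    """Stable short hash of any JSON-serializable object."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def log_run(registry, code_version, data_rows, hyperparams, metrics):
    """Append one experiment record to registry (a plain list) and return it."""
    record = {
        "run_id": len(registry) + 1,
        "logged_at": time.time(),
        "code_version": code_version,            # e.g. a git commit hash
        "data_version": fingerprint(data_rows),  # content hash of the dataset
        "hyperparams": dict(hyperparams),
        "environment": {"python": platform.python_version()},
        "metrics": dict(metrics),
    }
    registry.append(record)
    return record


def best_run(registry, metric, higher_is_better=True):
    """Pick the run with the best value of the given metric."""
    pick = max if higher_is_better else min
    return pick(registry, key=lambda r: r["metrics"][metric])
```

Hashing the data alongside the code version is one simple way to tell whether two runs with different metrics actually trained on the same dataset.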
The key challenge lies in supporting popular model frameworks, like XGBoost, TensorFlow, Keras, PyTorch, scikit-learn, and LightGBM.
OSS/ Well known frameworks: MLflow, TensorBoard
Enterprise Startups: Weights and Biases, Neptune, Comet ML, Paperspace Gradient
Model serving/deployment service
Once models are trained, they need to be deployed, either to serve live applications or for offline scoring, and that calls for a consistent and reliable mechanism within the company. Model serving and deployment involve a few different considerations:
- CPU/GPU/TPU based environments
- Support for composable, stacked models
- Model dependency management
Many enterprises, fresh from their digital transformation journeys, recognize the need for standardized deployment formats across teams to drive efficiencies. As a result, containerized model formats are gaining popularity as Kubernetes can be used to serve them up at scale while hiding the complexity of scaling, memory, and CPU utilization.
OSS/ Well known frameworks: KFServing, TensorFlow Serving, Seldon Core
Enterprise Startups: Algorithmia, Seldon, Anyscale, Paperspace Gradient
Cloud Vendors: AWS SageMaker
Model monitoring and explainability
Once models are deployed in production, keeping an eye on their performance is key to ensuring that the hard work the teams have put in delivers its ROI against your business goals. The basic premise of machine learning, which is that a trained model generalizes to unseen data, can actually be validated through constant monitoring in production.
Model training is not a one-time process. Models need to be updated and re-trained over time. Model monitoring solutions mainly look for bias, drift, outliers, data quality issues, and also overall performance of the model over time, which inform retraining decisions. Model monitoring solutions also allow for comparison of challenger-champion models and help speed up the model promotion process.
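As one example of what a drift check can look like, the sketch below computes the Population Stability Index (PSI) between a baseline sample of a feature and a live sample. PSI is just one commonly used drift metric and I am not claiming any vendor listed here uses it; the equal-width binning over the baseline range and the `eps` floor for empty bins are simplifying assumptions of mine.

```python
import math


def psi(expected, actual, n_bins=10, eps=1e-6):
    """Population Stability Index between a baseline and a live sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / n_bins or 1.0  # guard against a constant baseline

    def bin_fractions(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so live values outside the baseline range land in edge bins.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor empty bins at eps so the log below stays defined.
        return [max(c / len(values), eps) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]
shifted = [x + 0.5 for x in baseline]
no_drift = psi(baseline, baseline)
drift = psi(baseline, shifted)
```

A score of zero means the two binned distributions are identical, and the score grows as the live data moves away from the baseline, so a monitoring job could alert or trigger retraining once it crosses a threshold the team has agreed on.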
Model explainability goes a step further and inspects why the model makes a decision. Techniques like LIME and SHAP work well for simpler models, but for deep learning models, techniques like Integrated Gradients offer deeper visibility into the attributions of the different features and their interactions.
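To give a feel for how Integrated Gradients assigns attributions, here is a small sketch on a toy logistic regression model, where the gradient can be written by hand. The weights, input, and all-zeros baseline are made-up values for illustration, and the path integral is approximated with a midpoint Riemann sum; for real deep learning models you would rely on autodiff and a library such as Captum rather than this hand-rolled version.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(w, b, x):
    """Toy differentiable model: logistic regression."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def gradient(w, b, x):
    """Analytic gradient of predict with respect to the input x."""
    p = predict(w, b, x)
    return [p * (1.0 - p) * wi for wi in w]


def integrated_gradients(w, b, x, baseline, steps=200):
    """Midpoint Riemann-sum approximation of Integrated Gradients."""
    totals = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        # Point on the straight line from the baseline to the input.
        point = [b0 + alpha * (xi - b0) for xi, b0 in zip(x, baseline)]
        for i, g in enumerate(gradient(w, b, point)):
            totals[i] += g
    # Average gradient along the path, scaled by the input difference.
    return [(xi - b0) * t / steps for xi, b0, t in zip(x, baseline, totals)]


w, b = [2.0, -1.0, 0.0], 0.1           # hypothetical model parameters
x, baseline = [1.0, 0.5, 3.0], [0.0, 0.0, 0.0]
attributions = integrated_gradients(w, b, x, baseline)
```

A useful sanity check is that the attributions sum to approximately `predict(w, b, x) - predict(w, b, baseline)`, and that the third feature, whose weight is zero, receives zero attribution.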
OSS/ Well known frameworks (Explainability): SHAP, LIME, Captum
Enterprise Startups: Fiddler, Arthur, Arize
Cloud Vendors: AWS SageMaker
If you think there’s any other key component that didn’t make it to the list, please leave a comment, so that I can follow up!
*Update 1 on 12/7/20: AWS announced 3 new SageMaker services during AWS re:Invent on 12/1/20