28/01/2021

ML Platform landscape overview

A bird’s-eye view into the key components of a modern enterprise ML Platform

Pranil Dasika Nov 28, 2020·6 min read

A query from an investor friend nudged me into a deep dive into the ML Platform landscape over the Thanksgiving weekend. As I was gathering notes, I thought it’d be good to share them for the benefit of folks interested in understanding the space.

Machine Learning Platform Architecture — ML Platform Services (Image by Author)

I was particularly interested in the different components that make up a modern ML platform for a company, and I looked at 3 different classes of solutions:

Both OSS and closed source solutions primarily built for internal consumption at hi-tech companies like Uber, Airbnb, Lyft, DoorDash.
Commercial startups I’m aware of that offer these services as enterprise-grade software solutions. By no means is this meant to be an exhaustive list.
Commercial managed services from the 3 big cloud vendors: AWS, Azure, and Google Cloud.

I identified the following key services, spanning the exploration, build, train, test & production stages of the model development lifecycle (MDLC):

Managed Notebook service
Model training service
Data labeling service
Workflow orchestration & Data pipelines
Feature-stores
Model management services
Model serving/deployment service
Model monitoring and explainability service

Managed Notebook service

Jupyter notebooks are the most popular IDE for data scientists. Besides their browser-based UI, notebooks offer a great deal of flexibility for day-to-day tasks such as feature exploration, feature engineering, data wrangling, model building, and data analysis. Built-in visualization tools and the ability to save and share notebooks as simple HTML files also help make Jupyter notebooks ubiquitous in the data science world.

In a company setting, a managed notebook service can offer additional utilities over a vanilla notebook run on a laptop, which can drive significant efficiency gains for a sizable team of data scientists. For example:

Pre-built environments with commonly used ML libraries and easy access to algorithms.
Built-in connectivity to access-protected data warehouses and feature stores.
Built-in support for cloud-backed, GPU-based model runtimes, which developer laptops cannot offer.
Ability to spin up training jobs from within notebooks.

OSS/Well known frameworks: Airbnb Redspot service, Uber DSW, Kubeflow

Enterprise Startups: Determined AI, Databricks, Domino Data Lab, Paperspace Gradient

Cloud Vendors: AWS SageMaker

Model Training

Model training jobs are iterative, stateful, incremental, and compute-intensive, and they often require distributed computing for speedy turnarounds. To iterate fast on model development, data scientists need to be able to spin up training jobs without having to work through hardware setup with their IT team.

In addition, smart ML training services provide access to automated hyper-parameter search algorithms, which allow for efficient and optimized training convergence. They also offer built-in resiliency via checkpoint saving and restoring. The ability to provision CPU/GPU resources for distributed training algorithms, and support for multiple popular frameworks, are also key characteristics of an enterprise-wide training platform.
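To make the two ideas above concrete, here is a minimal sketch of random hyper-parameter search combined with checkpoint saving and restoring. It is an illustration under stated assumptions, not how any of the listed products work: the toy objective, the function names (`train_one_trial`, `random_search`) and the JSON checkpoint format are all hypothetical.

```python
import json
import os
import random
import tempfile


def train_one_trial(learning_rate, steps, checkpoint_path):
    """Minimise the toy loss (w - 3)^2 with gradient descent.

    If a checkpoint file exists, training resumes from it instead of
    starting again from step 0.
    """
    state = {"step": 0, "w": 0.0}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)
    while state["step"] < steps:
        grad = 2.0 * (state["w"] - 3.0)
        state["w"] -= learning_rate * grad
        state["step"] += 1
        # Save progress every 10 steps so a restarted job loses little work.
        if state["step"] % 10 == 0:
            with open(checkpoint_path, "w") as f:
                json.dump(state, f)
    return (state["w"] - 3.0) ** 2


def random_search(num_trials, steps, workdir, seed=0):
    """Try randomly sampled learning rates and keep the best trial."""
    rng = random.Random(seed)
    best = None
    for trial in range(num_trials):
        # Sample the learning rate log-uniformly between 10^-3 and 10^-0.5.
        learning_rate = 10 ** rng.uniform(-3, -0.5)
        path = os.path.join(workdir, f"trial_{trial}.json")
        loss = train_one_trial(learning_rate, steps, path)
        if best is None or loss < best["loss"]:
            best = {"trial": trial, "learning_rate": learning_rate, "loss": loss}
    return best


with tempfile.TemporaryDirectory() as workdir:
    best = random_search(num_trials=5, steps=50, workdir=workdir)
    print(best)
```

A real training service would run the trials in parallel on provisioned CPU/GPU workers and would typically use a smarter search strategy than plain random sampling; the sketch only shows the shape of the loop.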

OSS/Well known frameworks: Airbnb BigQueue service, Uber Horovod, Kubeflow Fairing, TFX.

Enterprise Startups: Determined AI, Databricks, Domino Data Lab

Cloud Vendors: AWS SageMaker, Azure ML

References:

https://determined.ai/blog/determined-ai-sagemaker-comparison/

Data labeling

Data powers ML applications. High-quality labeled training data is essential for high-performing models. Labeling data in a scalable, reliable way requires purpose-built workflows and expertise. Well-designed, intuitive labeling workflows can bring significant efficiency gains when managing a large labeling workforce on large data projects.

Enterprise Startups: Scale AI, Snorkel, Labelbox

Cloud Vendors: AWS SageMaker Ground Truth

Workflow orchestration and Data pipelines

Machine learning production workflows involve data movement between multiple sub-systems at data ingestion, preparation, training, and scoring time. Data needs to be moved and massaged into the desired format before it is fed into models for training and inference.

Various teams and products inside an organization use different locations to store data that may be consumed in a completely different part of the organization, which could be in a different data center or cloud or even geography. Workflow orchestration systems allow for composing complex workflows, scheduling, and executing them reliably at scale in production.

Modern workflow services build on top of Kubernetes and focus on expressive ways to represent and execute complex DAG-style workflows, while offloading compute scale-out to the underlying Kubernetes (k8s) system.
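As a rough illustration of what "DAG-style" means here, the sketch below declares four pipeline steps and their dependencies, then runs them in dependency order using Python's standard-library `graphlib`. The task names and the `run_workflow` helper are hypothetical and do not reflect the API of any of the tools listed below; real orchestrators add scheduling, retries, and distributed execution on top of this idea.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, dependencies):
    """Run each task after all of its upstream tasks have finished.

    tasks: dict mapping a task name to a function that takes a dict of
           upstream results and returns this task's result.
    dependencies: dict mapping a task name to the set of task names it
           depends on.
    """
    results = {}
    # static_order() yields every task after all of its predecessors and
    # raises graphlib.CycleError if the graph is not a DAG.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: results[dep] for dep in dependencies.get(name, ())}
        results[name] = tasks[name](upstream)
    return results


# A toy pipeline: ingest -> prepare -> train -> score.
tasks = {
    "ingest": lambda up: [3, 1, 2],
    "prepare": lambda up: sorted(up["ingest"]),
    "train": lambda up: sum(up["prepare"]) / len(up["prepare"]),
    "score": lambda up: [x - up["train"] for x in up["prepare"]],
}
dependencies = {
    "ingest": set(),
    "prepare": {"ingest"},
    "train": {"prepare"},
    "score": {"prepare", "train"},
}

results = run_workflow(tasks, dependencies)
print(results["score"])
```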

OSS/Well known frameworks: Flyte, Airflow, Argo, Kubeflow Pipelines

Enterprise Startups: Prefect, Pachyderm, Databricks, Domino Data Lab

Cloud Vendors: AWS Data Pipeline, Google AI Platform Pipelines (Kubeflow Pipelines), Azure ML Pipelines

References: https://www.youtube.com/watch?v=oXPgX7G_eow&t=1152s

A comparison of data processing frameworks

Feature-stores

Feature handling is a fairly involved process, both during model development and at production time: it spans feature discovery, exploration, extraction, transformation, and serving. Feature stores are responsible for making these operations repeatable and reliable at scale, serving features to multiple ML applications for scoring, and tracking versions, data lineage, and metadata. Feature stores also make backtesting of models easier by backfilling historical data when new features are added.
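One way to picture the serving and backfilling responsibilities is a store that keeps timestamped feature values and answers "what was this feature's value as of time T?". The sketch below is a deliberately tiny in-memory version; the class name, method names, and integer timestamps are assumptions made for illustration and are not taken from any of the products listed below.

```python
from bisect import bisect_right
from collections import defaultdict


class InMemoryFeatureStore:
    """Toy feature store keyed by (entity_id, feature_name)."""

    def __init__(self):
        # Parallel lists per key, kept sorted by timestamp.
        self._timestamps = defaultdict(list)
        self._values = defaultdict(list)

    def write(self, entity_id, feature_name, timestamp, value):
        """Record a value; out-of-order writes (backfills) are allowed."""
        key = (entity_id, feature_name)
        idx = bisect_right(self._timestamps[key], timestamp)
        self._timestamps[key].insert(idx, timestamp)
        self._values[key].insert(idx, value)

    def read_as_of(self, entity_id, feature_name, timestamp):
        """Return the latest value written at or before `timestamp`, else None."""
        key = (entity_id, feature_name)
        idx = bisect_right(self._timestamps.get(key, []), timestamp)
        if idx == 0:
            return None
        return self._values[key][idx - 1]


store = InMemoryFeatureStore()
store.write("user_1", "orders_7d", timestamp=10, value=2)
store.write("user_1", "orders_7d", timestamp=20, value=5)

print(store.read_as_of("user_1", "orders_7d", 15))  # value as of t=15
print(store.read_as_of("user_1", "orders_7d", 5))   # nothing known yet at t=5

# Backfill an older value; historical reads now see it.
store.write("user_1", "orders_7d", timestamp=1, value=0)
print(store.read_as_of("user_1", "orders_7d", 5))
```

Reading "as of" a timestamp is what keeps backtests honest: a model evaluated on historical data only sees feature values that would have been available at that moment.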

OSS/Well known frameworks: Airbnb Zipline, Uber’s Michelangelo Palette, Facebook’s FBLearner, Kubeflow

Enterprise Startups: Tecton, Scribble Data, Hopsworks from Logical Clocks

References:

https://www.tecton.ai/blog/what-is-a-feature-store/

https://www.scribbledata.io/resources-feature-store-guide

https://hackernoon.com/the-essential-architectures-for-every-data-scientist-and-big-data-engineer-f21u3e5c

Model management and experiment management

Once models are trained, they can be cataloged and tracked using model management systems.

Data scientists also have to work with business and product teams to run multiple experiments. Large teams need to track changes to models across experiments in a scalable way, which involves capturing code versions, data versions, hyperparameters, environment, and metrics.

The key challenge lies in supporting popular model frameworks, like XGBoost, TensorFlow, Keras, PyTorch, scikit-learn, and LightGBM.
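To show what "capturing" an experiment might look like in practice, here is a minimal sketch of a run record that bundles code version, data version, hyperparameters, environment, and metrics into one JSON document. The `ExperimentRun` class, the `fingerprint` helper, and all the example values are hypothetical; dedicated tools such as those listed below store far richer metadata.

```python
import hashlib
import json
import platform
from dataclasses import asdict, dataclass, field


def fingerprint(data: bytes) -> str:
    """Short content hash, used here as a stand-in for a data version."""
    return hashlib.sha256(data).hexdigest()[:12]


@dataclass
class ExperimentRun:
    name: str
    code_version: str
    data_version: str
    hyperparameters: dict
    environment: dict = field(
        default_factory=lambda: {"python": platform.python_version()}
    )
    metrics: dict = field(default_factory=dict)

    def log_metric(self, key, value):
        self.metrics[key] = value

    def to_json(self):
        # sort_keys makes records easy to diff between runs.
        return json.dumps(asdict(self), sort_keys=True)


run = ExperimentRun(
    name="baseline-xgb",
    code_version="abc1234",  # hypothetical commit hash
    data_version=fingerprint(b"user_id,label\n1,0\n2,1\n"),
    hyperparameters={"max_depth": 6, "learning_rate": 0.1},
)
run.log_metric("auc", 0.91)  # illustrative value, not a real result
print(run.to_json())
```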

OSS/Well known frameworks: MLflow, TensorBoard

Enterprise Startups: Weights and Biases, Neptune, Comet ML, Paperspace Gradient

References:

https://towardsdatascience.com/tracking-ml-experiments-using-mlflow-7910197091bb

Machine Learning Experiment Management: How to Organize Your Model Development Process

Model serving/deployment

Once models are trained, they need to be deployed, either to serve live applications or for offline scoring, and this calls for a consistent and reliable mechanism within the company. Model serving and deployment involve a few different considerations:

CPU/GPU/TPU based environments
Support for composable, stacked models
Model dependency management

Many enterprises, fresh from their digital transformation journeys, recognize the need for standardized deployment formats across teams to drive efficiencies. As a result, containerized model formats are gaining popularity as Kubernetes can be used to serve them up at scale while hiding the complexity of scaling, memory, and CPU utilization.

OSS/Well known frameworks: KFServing, TF Serving, Seldon Core

Enterprise Startups: Algorithmia, Seldon, Anyscale, Paperspace Gradient

Cloud Vendors: AWS SageMaker

References:

https://huyenchip.com/2020/06/22/mlops.html

Model monitoring and explainability

Once models are deployed in production, keeping an eye on their performance is key to ensuring you get the ROI for all the hard work the teams have put in to reach your business goals. The basic premise of machine learning, which is to train a generalized model that works on unseen data, can actually be validated through constant monitoring in production.

Model training is not a one-time process. Models need to be updated and re-trained over time. Model monitoring solutions mainly look for bias, drift, outliers, data quality issues, and also overall performance of the model over time, which inform retraining decisions. Model monitoring solutions also allow for comparison of challenger-champion models and help speed up the model promotion process.
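As one concrete example of a drift check, the sketch below computes the Population Stability Index (PSI) between a feature's training-time values and its production values. PSI is just one of several possible drift measures and is my choice for illustration, not something the vendors below are confirmed to use; the function name, the bin count, and the small floor for empty bins are assumptions.

```python
import math


def population_stability_index(expected, actual, num_bins=10):
    """PSI between a baseline sample and a production sample.

    Bins are equal-width over the baseline's range; production values
    outside that range are counted in the first or last bin.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / num_bins
    if width == 0:
        width = 1.0  # all baseline values identical; avoid dividing by zero

    def bin_fractions(values):
        counts = [0] * num_bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), num_bins - 1)
            counts[idx] += 1
        # A small floor avoids log(0) for empty bins.
        return [max(c / len(values), 1e-6) for c in counts]

    e = bin_fractions(expected)
    a = bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [float(x) for x in range(100)]
shifted = [x + 30.0 for x in baseline]

print(population_stability_index(baseline, baseline))  # identical samples
print(population_stability_index(baseline, shifted))   # shifted distribution
```

Identical samples give a PSI of zero, and the value grows as the two distributions diverge. A commonly quoted rule of thumb treats values above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature and the use case.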

Model explainability goes a step further and inspects why a model makes a given decision. Techniques like LIME and SHAP work well for simpler models; for deep learning models, techniques like Integrated Gradients offer deeper visibility into the attributions of the different features and their interactions.
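For readers curious about what Integrated Gradients computes, here is a from-scratch sketch on a toy two-feature model. It approximates the path integral of the gradient along a straight line from a baseline input to the actual input, using numerical gradients and a midpoint Riemann sum. The toy `model`, the zero baseline, and the step count are assumptions for illustration; this does not use the API of Captum or any other library, which rely on automatic differentiation instead.

```python
import math


def numerical_gradient(f, x, eps=1e-5):
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(f, x, baseline, steps=200):
    """Attribute f(x) - f(baseline) to each input feature."""
    total = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        grad = numerical_gradient(f, point)
        total = [t + g for t, g in zip(total, grad)]
    # Scale the average gradient by the distance travelled per feature.
    return [(xi - b) * t / steps for xi, b, t in zip(x, baseline, total)]


def model(x):
    """Toy model: a sigmoid over two features with an interaction term."""
    z = 2.0 * x[0] - 1.0 * x[1] + 0.5 * x[0] * x[1]
    return 1.0 / (1.0 + math.exp(-z))


x = [1.0, 2.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(model, x, baseline)
print(attributions)
print(sum(attributions), model(x) - model(baseline))
```

The last line illustrates the method's completeness property: the attributions sum (up to numerical error) to the difference between the model's output at the input and at the baseline.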

OSS/Well known frameworks (Explainability): SHAP, LIME, Captum

Enterprise Startups: Fiddler, Arthur, Arize

Cloud Vendors: AWS SageMaker

References: https://blog.fiddler.ai/2020/04/explainable-monitoring-stop-flying-blind-and-monitor-your-ai/

If you think there’s any other key component that didn’t make it to the list, please leave a comment, so that I can follow up!

*Update 1 on 12/7/20: AWS announced 3 new SageMaker services during AWS re:Invent on 12/1/20:

AWS SageMaker Feature Store
AWS SageMaker Data Wrangler
AWS SageMaker Pipelines, which is different from the AWS Data Pipeline service mentioned above.
