The Hierarchy of ML tooling on the Public Cloud
Not all ML services are built the same 🪛 🧰
1 ML Services on the Public Cloud
Not all ML services are built the same. As a consultant working in the public cloud, I can tell you that you are spoilt for choice when it comes to Artificial Intelligence (AI) / Machine Learning (ML) tooling on the 3 big public clouds - Azure, AWS, and GCP.
It can be overwhelming to process and synthesize the wave of information, especially when these services are constantly shipping new features.
Just imagine how much of a nightmare it would be to explain to a layman which platform to choose, and why you chose a particular tool to solve your machine learning problem.
I’m writing this post to address that problem for others, as well as for myself, so you walk away with a succinct, distilled understanding of what the public cloud has to offer. For the sake of simplicity, I will use the terms AI and ML interchangeably throughout this post.
2 Building a Custom ML System… should be a Last Resort
Before we jump into the tooling comparison, let’s understand why we should even use managed services on the public cloud. It’s a valid question to ask - why not build your own custom infrastructure and ML model from scratch? To answer that, let’s take a quick look at the ML lifecycle.
The below diagram depicts a typical ML lifecycle (the cycle is iterative):
As you can see, there are many parts to the entire lifecycle that must be considered.
A famous paper published by Google showed that writing the model training code is only a small fraction of the effort that goes into building maintainable ML models in production.
This phenomenon is known as the hidden technical debt of ML systems in production, and managing that debt is what industry now calls Machine Learning Operations (MLOps), an umbrella term for all the engineering work surrounding the model code.
Below is a visual explanation to support the above point, adapted from Google’s paper:
I won’t go into a detailed explanation of each stage in the lifecycle, but here’s a summarized list of definitions. If you’re interested in learning more, I would recommend reading Machine Learning Design Patterns Chapter 9 on ML Lifecycle and AI Readiness for a detailed answer.
- Data pre-processing - prepare data for ML training; data pipeline engineering
- Feature engineering - transform input data into new features that are closely aligned with the ML model learning objective
- Model training - training and initial validation of the ML model; iterate through algorithms, train / test splits, perform hyperparameter tuning
- Model evaluation - model performance assessed against predetermined evaluation metrics
- Model versioning - version control of model artifacts; model training parameters, model pipeline
- Model serving - serving model predictions via batch or real-time inference
- Model deployment - automated build, test, deployment to production, and model retraining
- Model monitoring - monitor infrastructure, input data quality, and model predictions
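To make the hidden-technical-debt point concrete, the "model training" step above can genuinely be just a few lines of code; it is everything around it (pipelines, versioning, serving, monitoring) that consumes the effort. A minimal sketch in plain Python, with toy stand-ins for the training, evaluation, and versioning stages:

```python
import hashlib
import json

def train(xs, ys):
    """'Model training': fit y = a*x + b by closed-form least squares."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    return {"a": a, "b": mean_y - a * mean_x}

def evaluate(model, xs, ys):
    """'Model evaluation': mean squared error against a holdout set."""
    preds = [model["a"] * x + model["b"] for x in xs]
    return sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(ys)

def version(model):
    """'Model versioning' (toy): content-hash the serialized parameters."""
    blob = json.dumps(model, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:8]

model = train([1, 2, 3, 4], [2, 4, 6, 8])   # learns y = 2x
mse = evaluate(model, [1, 2, 3, 4], [2, 4, 6, 8])
```

The `train` function is the tiny box in Google’s diagram; the data pipelines, monitoring, and serving infrastructure it implies are what the managed services below exist to provide.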
The ML lifecycle also does not consider the supporting platform infrastructure, which has to be secure from an encryption, networking, and identity and access management (IAM) perspective.
Cloud services provide managed compute infrastructure, development environments, centralized IAM, encryption features, and network protection services that can achieve security compliance with internal IT policies. Hence you really should not be building these capabilities yourself; leverage the power of the cloud to add ML capabilities to your product roadmap.
This section illustrates that writing the model training code is a relatively tiny part of the entire ML lifecycle, and that the actual data preparation, evaluation, deployment, and monitoring of ML models in production are difficult.
Naturally, the conclusion is that building your own custom infrastructure and ML model takes considerable time and effort, and the decision to do so should be a last resort.
3 Hierarchy of ML tooling
Here is where leveraging public cloud services comes in to fill the gap. Broadly, these hyperscalers package and provide two offerings to customers:
- 🧰 AI services, which come in one of three flavors:
  - 🔨 Pre-Trained Standard - use the base model only; no option to customize by bringing your own training data.
  - ⚒️ Pre-Trained Customizable - use the base model as-is, with optional customization by bringing your own training data.
  - ⚙️ Bring Your Own Data - mandatory to bring your own training data.
- 🪛 ML Platform.
For the sharper readers of this post: I have purposefully omitted a few honorable AI service mentions from the hierarchy:
- Data Warehouse built-in ML models which enable ML development using SQL syntax. Further reading can be done on BigQuery ML, Redshift ML, and Synapse dedicated SQL pool PREDICT function. These services are meant to be used by data analysts, given that your data is already inside the cloud data warehouse.
- AI Builder for Microsoft Power Platform, and Amazon SageMaker Canvas. These services are meant to be used by non-technical business users a.k.a. citizen data scientists.
- Azure OpenAI, which is a nascent service regulated by Microsoft; you are required to request approval for a trial.
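To illustrate the SQL-syntax approach mentioned above, here is a sketch of the BigQuery ML workflow; the dataset, table, and column names are hypothetical, and the live query call requires GCP credentials:

```python
# BigQuery ML trains a model inside the warehouse with a single statement;
# the dataset/table/column names below are hypothetical.
create_model_sql = """
CREATE OR REPLACE MODEL `my_dataset.churn_model`
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM `my_dataset.customers`
"""

# Batch predictions are also plain SQL via ML.PREDICT.
predict_sql = """
SELECT *
FROM ML.PREDICT(MODEL `my_dataset.churn_model`,
                TABLE `my_dataset.new_customers`)
"""

if __name__ == "__main__":
    # Requires GCP credentials and the google-cloud-bigquery package.
    from google.cloud import bigquery
    client = bigquery.Client()
    client.query(create_model_sql).result()  # training happens in-warehouse
    for row in client.query(predict_sql).result():
        print(dict(row))
```

Redshift ML and the Synapse `PREDICT` function follow the same pattern: the model never leaves the warehouse, which is exactly why these suit analysts whose data is already there.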
3.1 ML Platform 🪛
We will first discuss the ML Platform before discussing AI services. The platform provides auxiliary tooling required for MLOps.
Each public cloud has their own version of the ML Platform:
- Azure - Azure Machine Learning
- AWS - Amazon SageMaker
- GCP - Vertex AI
Who is it for?
Persona-wise, this is for teams that have internal data scientist resources, want to build custom state-of-the-art (SOTA) models with their own training data, and develop frameworks for custom management of MLOps across the ML lifecycle.
How do I use it?
Requirement-wise, the business use case would need them to engineer a custom ML model implementation that AI services in Section 3.2 do not have the capabilities to meet.
As much as possible, this should not be your first option when looking to leverage a service on the public cloud.
Even with the ML platform, considerable time and effort has to be invested into learning the features on the ML platform, and writing the code to build out a custom MLOps framework using the hyperscaler software development kits (SDKs).
Instead, first look for an AI service in the next Section 3.2 that could meet your need.
What technology capabilities does the service provide?
When you utilize a cloud ML platform, you gain access to a fully hyperscaler-managed environment that you would otherwise be pulling your hair out trying to get right:
- Managed Compute Infrastructure - clusters of machines with default environments, containing ubiquitous built-in ML libraries and cloud-native SDKs. Compute can be used for distributed training, or to power model endpoints for serving batch and real-time predictions.
- Managed Development Environments - in the form of Notebooks, or through your choice of IDE, given that there is integration with the ML platform.
This host of utilities enables data scientists and ML engineers to fully focus on the ML lifecycle instead of infrastructure configuration and dependency management.
Built-in libraries and cloud-native SDKs make it easier for data scientists to write custom code for more seamless engineering throughout the ML lifecycle.
The following table shows the technology features of each cloud ML platform:
Capability | Feature | Azure | AWS | GCP |
---|---|---|---|---|
Graphical User Interface (GUI) | Studio workspace | Azure ML Studio | SageMaker Studio | Vertex AI Workbench |
Data pre-processing | Data labeling | Azure ML Data Labeling | SageMaker Groud Truth (Plus) | Vertex AI Data Labeling |
Feature engineering | Feature store | Open-source Feast library | SageMaker Feature Store | Vertex AI Feature Store |
Feature engineering | Automatic feature engineering | Azure AutoML | SageMaker AutoPilot | Vertex AI AutoML |
Model evaluation | Model Experiment tracking | Azure ML Experiments | SageMaker Experiments | Vertex AI Experiments / Vertex AI TensorBoard |
Model evaluation | Automatic Hyperparameter tuning | Azure ML Sweep | SageMaker Hyperparameter Tuning | Vertex AI Vizier |
ML pipeline orchestration | Pipeline orchestration | Azure ML Pipelines | SageMaker Pipelines | Vertex AI Pipelines |
Model versioning | Model registry | Azure ML Registry | SageMaker Model Registry | Vertex AI Model Registry |
Model serving | Model endpoints | Azure ML Endpoints | SageMaker Batch Transform / SageMaker Endpoints | Vertex AI Prediction |
Model deployment | Container registry | Azure Container Registry | Elastic Container Registry | Cloud Artifact Registry |
Model deployment | Continuous Integration (CI) / Continuous Delivery (CD) | Azure DevOps | CodeCommit, CodeBuild, CodeDeploy, CodePipeline | Cloud Code, Cloud Build, Cloud Deploy |
Model monitoring | Infrastructure monitoring | Azure Monitor | CloudWatch | Cloud Operations Suite |
Model monitoring | Concept and data drift | Azure Monitor | SageMaker Model Monitoring | Vertex AI Model Monitoring |
Model monitoring | Model understanding and interpretability | Azure ML Responsible AI | SageMaker Clarify | Vertex AI Explainable AI |
3.2 AI Services 🧰
Next, we will discuss AI services. They enable ML development using a low-code / no-code approach, and mitigate the overhead of managing MLOps.
The over-arching argument for these services is neatly put below by Jeff Atwood:
The best code is no code at all.
Every new line of code you willingly bring into the world is code that has to be debugged, code that has to be read and understood, code that has to be supported. Every time you write new code, you should do so reluctantly, under duress, because you completely exhausted all your other options.
Who is it for?
Persona-wise, these are for the teams who DO NOT HAVE EITHER:
- Internal data scientist resources.
- Own training data to train a custom ML model.
- Investment of resources, effort, and time to engineer a custom ML model end-to-end.
How do I use it?
Requirement-wise, the ML business use case would be met by cloud provider AI service capabilities.
The goal is to add ML features into the product by leveraging hyperscaler base models and training data; so the team can prioritize core application development, integrate with the AI service via retrieving predictions from API endpoints, and ultimately spend minimal effort on model training and MLOps.
What technology capabilities does the service provide?
We’re going to organize the following comparison table by the technology capabilities the AI service provides. This is closely interlinked with but should be differentiated from the ML business use case.
For example, Amazon Comprehend service gives you the capability
to do text classification. That capability is used to build models for business use cases
such as:
- Sentiment analysis of customer reviews.
- Content quality moderation.
- Multi-class item classification into custom-defined categories.
For certain AI services, the technology capability and business use case is exactly the same; in that scenario the AI service was built to solve that exact ML business use case.
Note that I have excluded or avoided mention of industry specific version of AI services. Just know that hyperscalers train models specifically to achieve higher model performance in these domains and you should use them over the generic version of the service for the particular industry or domain.
Notable mentions of these services include Amazon Comprehend Medical, Amazon HealthLake, Amazon Lookout for{domain}
, Amazon Transcribe Call Analytics, Google Cloud Retail Search etc.
- 🔨
Pre-Trained Standard
- Use base model only, No Option to customize by bringing your own training data. - ⚒️
Pre-Trained Customizable
- Can use base model, and Optional customization by bringing your own training data. - ⚙️
Bring Your Own Data
- Mandatory to bring your own training data.
--- Speech ---
AI / ML Category | Capability | Azure | AWS | GCP |
---|---|---|---|---|
Speech | Speech to Text | Speech Service, — with Custom Speech option ⚒️ | Amazon Transcribe — with Custom vocab and lang models ⚒️ | Cloud Speech to Text — with Model Adaptation option ⚒️ |
Speech | Text to Speech | Speech Service — with Custom Synthesis and Custom Neural Voice option ⚒️ | Amazon Polly — with Custom Synthesis and Brand Voice option ⚒️ | Cloud Text to Speech — with Custom Synthesis and Custom Voice option ⚒️ |
Speech | Speech to Speech Translation | Speech Service 🔨 | Combination of Amazon Transcribe, Translate, and Polly | Combination of Cloud Speech to Text, Translation, Media Translation, Text to Speech |
Speech | Speech to Text Translation | Speech Service 🔨 | Combination of Amazon Transcribe, and Translate | Cloud Media Translation 🔨 |
Speech | Speaker Recognition | Speech Service ⚙️ | Amazon Connect Voice ID ⚙️ | Dialogflow CX Speaker ID ⚙️ |
--- Natural Language ---
--- Vision ---
AI / ML Category | Capability | Azure | AWS | GCP |
---|---|---|---|---|
Vision | Optical Character Recognition (OCR) | Vision Service 🔨, or Form Recognizer, with Custom Model option ⚒️ | Amazon Textract 🔨 | Cloud Vision API 🔨, or Document AI, with Custom Extraction processor option |
Vision | Object Detection | Vision Service, with Model Customization option ⚒️, or Custom Vision ⚙️ | Amazon Rekognition, with Custom Labels option ⚒️ | Cloud Vision API 🔨, or Vertex AI AutoML Vision ⚙️ |
Vision | Image Classification | Vision Service, with Model Customization option ⚒️ or Custom Vision ⚙️ | Amazon Rekognition, with Custom Labels option ⚒️ | Cloud Vision API 🔨, or Vertex AI AutoML Vision ⚙️ |
Vision | Image Moderation | Vision Service, with Model Customization option ⚒️ or Custom Vision ⚙️ | Amazon Rekognition, with Custom Labels option ⚒️ | Cloud Vision API, or Vertex AI AutoML Vision ⚙️ |
Vision | Facial Recognition | Vision Service 🔨, or Custom Facial Recognition ⚙️ | Amazon Rekognition, with Custom Facial Recognition option ⚒️ | Cloud Vision API 🔨 |
--- Decision ---
AI / ML Category | Capability | Azure | AWS | GCP |
---|---|---|---|---|
Decision | Fraud Detection | Anomaly Detector API ⚙️, or Azure AutoML Classification ⚙️ | Amazon Fraud Detector ⚙️ | Vertex AI AutoML Tabular Classification ⚙️ |
Decision | Forecasting | Azure AutoML Timeseries ⚙️ | Amazon Forecast ⚙️ | Timeseries Insights API ⚙️, or Vertex AI AutoML Tabular Forecasting ⚙️ |
Decision | Personalized Recommendations | Azure Personalizer ⚙️ | Amazon Personalize ⚙️ | Recommendations AI ⚙️ |
--- Search ---
AI / ML Category | Capability | Azure | AWS | GCP |
---|---|---|---|---|
Search | Full Text Search | Bing Custom Search ⚙️, or Azure Cognitive Search, with AI enrichment option ⚙️ | Amazon OpenSearch ⚙️, or Amazon Kendra ⚙️ | Cloud Search ⚙️ |
Search | Semantic Search | Azure Cognitive Search ⚙️ | Amazon OpenSearch ⚙️, or Amazon Kendra ⚙️ | Vertex AI Matching Engine ⚙️ |
Search | Image Search | Vision Service ⚙️, or Bing Image Search 🔨, or Bing Visual Search 🔨 | Combination of Amazon Rekognition, and Amazon OpenSearch | Vision API Product Search ⚙️ |
4 Further Reading on Topics
We have covered considerable ground in this post regarding the spectrum of ML services the public cloud offers, however there are still other concepts that we have to consider when building an ML system.
I would encourage you to explore and find your own answers to these concepts that were not discussed as AI / ML become more deeply embedded within the products we use.
What ML tooling do the 3 public clouds offer to implement the following functionality?
- Model data lineage and provenance
- Model catalog
- Human review for post-prediction ground truth labeling
- Models that work on video data
- Models that do generic regression and classification
Special Thanks / References
A special mention and thanks to the authors and creators of the following resources, that helped me to write this post:
ML Tooling
AI Services
- 📚 What is Azure Machine Learning?
- 📚 How Azure Machine Learning works: Architecture and Concepts
- 📚 Amazon SageMaker features
- 📚 Vertex AI
- 📚 Vertex AI Documentation
ML Platform
- 📚 Choose a Microsoft Cognitive Services Technology
- 📚 Azure Cognitive Services Documentation
- 📚 What is Azure AutoML?
- 🎦 Azure OpenAI Use Cases
- 📚 AWS Machine Learning Services
- 📚 AWS AutoML Solutions
- 📚 Google AI and Machine Learning Products
- 📚 Google AI and Machine Learning Solutions
- 📚 Google Cloud AutoML