How Can You Do Deep Learning in the Cloud?
Deep learning is at the center of most artificial intelligence initiatives. It is based on the concept of a deep neural network, which passes inputs through multiple layers of connections. Neural networks can perform many complex cognitive tasks, improving performance dramatically compared to classical machine learning algorithms. However, they often require huge data volumes to train, and can be very computationally intensive.
Cloud computing services are helping make deep learning more accessible, making it easier to manage large datasets and train algorithms on distributed hardware.
Cloud services are an enabler for deep learning in four respects:
- Provide access to large-scale computing capacity on demand, making it possible to distribute model training across multiple machines.
- Provide access to special hardware configurations, including GPUs, FPGAs, and massively parallel high performance computing (HPC) systems.
- Do not require an upfront investment—you can get advanced hardware, or large quantities of hardware, without having to purchase it. Pay only for the time you use.
- Assist with management of deep learning workflows—cloud services provide advanced features for managing datasets and algorithms, training models and deploying them efficiently to production.
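To make the pay-per-use point concrete, here is a minimal sketch of a break-even calculation. The purchase price and hourly rate are hypothetical placeholders, not real vendor prices, and the function names are our own. The sketch ignores power, maintenance, and staffing costs, which in practice shift the break-even point.

```python
def break_even_hours(purchase_cost: float, hourly_rate: float) -> float:
    """Hours of on-demand use at which renting costs as much as buying."""
    if hourly_rate <= 0:
        raise ValueError("hourly_rate must be positive")
    return purchase_cost / hourly_rate


def cheaper_option(purchase_cost: float, hourly_rate: float, expected_hours: float) -> str:
    """Return 'cloud' or 'purchase', whichever costs less for the expected usage."""
    cloud_cost = hourly_rate * expected_hours
    return "cloud" if cloud_cost < purchase_cost else "purchase"


# Hypothetical numbers for illustration only (not real vendor prices).
GPU_SERVER_PRICE = 30_000.0  # assumed up-front purchase cost
CLOUD_HOURLY_RATE = 3.0      # assumed on-demand price per hour

hours = break_even_hours(GPU_SERVER_PRICE, CLOUD_HOURLY_RATE)
print(f"Break-even after {hours:.0f} hours of use")
print(cheaper_option(GPU_SERVER_PRICE, CLOUD_HOURLY_RATE, expected_hours=2_000))
```

With these assumed figures, a team that expects only a few thousand hours of training per year stays well below the break-even point, which is the situation the pay-per-use model is designed for.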
This is part of an extensive series of guides about managed services.
In this article, you will learn:
- Top Deep Learning Services in the Cloud
  - IaaS vs. PaaS
  - Deep Learning on AWS with SageMaker
  - Google Cloud Machine Learning Services
  - Microsoft Azure Machine Learning
- How to Choose a Cloud Deep Learning Platform
  - Data Preparation
  - Scale-Up and Scale-Out Training
  - Deep Learning Frameworks Support
  - Pre-Tuned AI Services
  - Monitor Prediction Performance
What Are the Most Popular Deep Learning Services in the Cloud?
Let’s briefly review the deep learning offerings of major cloud providers—Amazon, Google Cloud, and Microsoft Azure.
IaaS vs. PaaS
In each of these clouds, it is possible to run deep learning workloads in a “do it yourself” model. This involves selecting machine images that come pre-installed with deep learning infrastructure, and running them in an infrastructure as a service (IaaS) model, for example as Amazon EC2 instances or Google Compute Engine VMs.
All the cloud providers we review below offer compute instances suitable for deep learning models, which provide specialized hardware such as graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and, on Google Cloud, Tensor Processing Units (TPUs). To learn about the compute options offered by each cloud provider, refer to the related articles listed at the end of this guide.
Below, we focus on the platform as a service (PaaS) offering each cloud provides for deep learning users. These PaaS offerings provide the hardware needed for deep learning workloads, as well as software services for managing deep learning pipelines, from data ingestion to production deployment and real-world inference.
Deep Learning on AWS with SageMaker
Amazon Web Services provides the SageMaker service, which lets you build and manage machine learning models on the cloud, with a focus on deep learning.
- SageMaker services include:
  - Ground Truth—lets you create and manage training data sets
  - Studio—cloud-based development environment for machine learning models
  - Autopilot—builds and trains models automatically
  - Tuning—helps tune hyperparameters for a model
- Jupyter notebook support—lets users share and collaborate on their own models and code.
- AWS Marketplace—provides pre-built algorithms and models created by third parties, which can be purchased on a pay-per-use basis.
- Framework support—supports popular machine learning and deep learning frameworks and libraries, including TensorFlow, PyTorch, MXNet, Keras, Gluon, Scikit-learn, Horovod, and Deep Graph Library.
Learn more in our guide to AWS deep learning.
Google Cloud Machine Learning Services
Google's set of machine learning services, together called Cloud AI, includes general purpose and dedicated services for specific use cases:
- Cloud AutoML suite—lets you build, train, and deploy models to production using cloud infrastructure
- AI Hub—provides a repository of components and algorithms that can be used to build models. Unlike the AWS model, AI Hub is focused on free knowledge sharing, not on commercial offerings of AI components.
- Data labeling service—lets you prepare and identify data for machine learning models.
- Visual AI and Video AI—these are two purpose-built services that provide preconfigured deep learning pipelines for processing image and video data.
Microsoft Azure Machine Learning
Azure Machine Learning is a complete environment for training, deploying, and managing machine learning models.
Key features of Azure Machine Learning:
- Drag-and-drop model designer—used to build machine learning models with no code. The designer includes neural network components for two-class classification, multi-class classification, and regression, as well as DenseNet and ResNet.
- MLOps—supports a DevOps-style method for building and managing machine learning pipelines and workflows.
- Security and governance—integrated into the service, letting you verify compliance of machine learning processes, and perform identity and privacy management according to your organization’s governance policies.
- Frameworks support—supports PyTorch, TensorFlow, Keras, MXNet, scikit-learn, and Chainer.
Learn more in our guide to Azure deep learning.
How Should You Choose a Cloud Deep Learning Platform?
Here are a few key considerations when selecting your cloud-based deep learning service.
Data Preparation
Data preparation can be one of the heaviest and most sensitive parts of a deep learning project. There are two common ways to prepare large volumes of data for analytics, which are also used to create deep learning datasets from raw data:
- Extract, transform, load (ETL)—transforms data as it is pulled from the source and creates a ready-made dataset that can be used for analytics purposes.
- Extract, load, transform (ELT)—provides greater flexibility by letting you store raw data in a data lake and then transform it into the required format on demand.
Check which data services are provided by your cloud vendor and whether they support ETL, ELT, or both. Understand which data storage, database or data warehouse services you will use, and how they can make data preparation easier.
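The difference between the two approaches comes down to when the transformation runs and what gets stored. The following is a minimal sketch using in-memory lists in place of a real source, warehouse, and data lake. The record fields and function names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional

# Toy records standing in for a raw source; field names are hypothetical.
RAW_SOURCE = [
    {"id": 1, "text": "  Cat ", "label": "ANIMAL"},
    {"id": 2, "text": "Car", "label": "vehicle "},
    {"id": 3, "text": "", "label": "animal"},  # empty text: unusable for training
]


def transform(record: Dict) -> Optional[Dict]:
    """Clean one record; return None if it cannot be used."""
    text = record["text"].strip().lower()
    if not text:
        return None
    return {"id": record["id"], "text": text, "label": record["label"].strip().lower()}


def etl(source: List[Dict]) -> List[Dict]:
    """ETL: transform while extracting, and store only the ready-made dataset."""
    warehouse = []
    for record in source:
        cleaned = transform(record)
        if cleaned is not None:
            warehouse.append(cleaned)
    return warehouse


def elt_load(source: List[Dict]) -> List[Dict]:
    """ELT step 1: copy raw records into the data lake unchanged."""
    return [dict(record) for record in source]


def elt_transform(data_lake: List[Dict]) -> List[Dict]:
    """ELT step 2: transform on demand, leaving the raw copy intact."""
    cleaned = (transform(record) for record in data_lake)
    return [record for record in cleaned if record is not None]


ready_dataset = etl(RAW_SOURCE)        # only cleaned records are kept
data_lake = elt_load(RAW_SOURCE)       # all raw records are kept
on_demand = elt_transform(data_lake)   # same result, produced later
print(len(ready_dataset), len(data_lake), len(on_demand))
```

Both paths end with the same training-ready records, but only the ELT path still holds the raw data, so a different transformation can be applied later without going back to the source.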
Scale-Up and Scale-Out Training
Data scientists typically start by developing a model on a local notebook, but it is not feasible to train most deep learning models on a local workstation. A key capability of a cloud deep learning service is the ability to integrate with notebooks and push training jobs seamlessly to cloud-based compute instances.
Evaluate how easy it is to run training jobs on hardware like GPUs, TPUs, and FPGAs, to manage these jobs across data science teams, and to visualize and interpret their results.
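Scale-out training usually means data parallelism: each machine computes gradients on its own shard of the data, and the gradients are averaged before the model is updated. The following is a minimal single-process sketch of that idea for a one-parameter model. It is not the API of any cloud service or framework, and all names are our own. In a real cluster the per-shard gradients would be computed on separate machines and combined with an all-reduce operation.

```python
def gradient(w, shard):
    """Gradient of mean squared error for the model y = w * x on one shard."""
    return sum(2 * (w * x - y) * x for x, y in shard) / len(shard)


def split_into_shards(data, num_workers):
    """Give each worker an equal-sized slice (assumes len(data) is divisible by num_workers)."""
    size = len(data) // num_workers
    return [data[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_step(w, shards, learning_rate):
    """One synchronous step: local gradients per worker, then an average."""
    local_grads = [gradient(w, shard) for shard in shards]  # would run in parallel
    avg_grad = sum(local_grads) / len(local_grads)          # stands in for all-reduce
    return w - learning_rate * avg_grad


data = [(float(x), 3.0 * x) for x in range(1, 9)]  # samples of y = 3x
shards = split_into_shards(data, num_workers=4)

w = 0.0
for _ in range(200):
    w = data_parallel_step(w, shards, learning_rate=0.01)
print(round(w, 3))
```

Because the shards are the same size, the averaged gradient equals the gradient over the full dataset, so the result matches single-machine training while the per-step work is divided among the workers.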
Deep Learning Frameworks Support
Each cloud machine learning service supports different frameworks. You can typically get the broadest framework support in an IaaS model, when deploying deep learning directly on compute instances. However, if you use a full MLOps platform, you will be limited to the frameworks it supports.
Look for support of the following frameworks, which your data science team may need to use now or in the future:
- Deep learning frameworks—TensorFlow, PyTorch, Keras, MXNet, Deep Java Library
- Classical machine learning—Scikit-learn, R, Spark MLlib, H2O.ai, Java-ML
- Job scheduling and distribution—Horovod, Kubernetes, Slurm, LSF (see our detailed comparison of job schedulers)
Also evaluate the ability to integrate your own code and algorithms with the platform’s library of built-in algorithms. This can improve productivity, because you can draw on existing building blocks and only develop unique aspects of your model.
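One simple way to run this evaluation is to compare your team's required frameworks against each platform's list. The sketch below uses the SageMaker and Azure Machine Learning framework lists given earlier in this article. The team requirements and function names are hypothetical. A framework missing from a list only means this article does not mention it, so confirm against each vendor's current documentation.

```python
# Framework lists copied from this article's SageMaker and Azure sections.
# They may be incomplete or out of date; verify with vendor documentation.
PLATFORM_FRAMEWORKS = {
    "Amazon SageMaker": {
        "tensorflow", "pytorch", "mxnet", "keras", "gluon",
        "scikit-learn", "horovod", "deep graph library",
    },
    "Azure Machine Learning": {
        "pytorch", "tensorflow", "keras", "mxnet", "scikit-learn", "chainer",
    },
}


def missing_frameworks(required, platform):
    """Return the required frameworks the platform does not list, sorted by name."""
    supported = PLATFORM_FRAMEWORKS[platform]
    return sorted({name.lower() for name in required} - supported)


def rank_platforms(required):
    """Order platforms by how few required frameworks they are missing."""
    return sorted(
        PLATFORM_FRAMEWORKS,
        key=lambda platform: (len(missing_frameworks(required, platform)), platform),
    )


team_needs = ["PyTorch", "Horovod", "Scikit-learn"]  # hypothetical team requirements
for platform in rank_platforms(team_needs):
    print(platform, "missing:", missing_frameworks(team_needs, platform))
```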
Pre-Tuned AI Services
Most cloud platforms provide pre-trained, pre-optimized AI services for many applications including:
- Image classification
- Object recognition
- Video data extraction
- Language translation
- Speech synthesis
- Recommendation engines
The advantage of these types of services is that they have been trained on massive data volumes that are not available to individual companies. They can provide very high accuracy for general use cases, and provide excellent performance and low latency in production. Best of all, they are ready to use out of the box.
Monitor Prediction Performance
Deploying a model is only the start, not the end point, of your AI journey. Data changes and user requirements change, and it is essential to monitor a model’s performance over time, tune it, augment it, and if necessary, replace it. Evaluate the tools a cloud service provides for monitoring model performance when it is already in production, and how easy it is to release updates and improvements to live deep learning models.
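As an illustration of what production monitoring involves, here is a minimal sketch of a rolling accuracy check. It assumes that ground-truth labels eventually become available for live predictions, which is not always the case. The class name, window size, and threshold are hypothetical, and managed cloud services provide their own monitoring tools with different interfaces.

```python
from collections import deque


class AccuracyMonitor:
    """Track accuracy over the most recent predictions and flag drops below a threshold."""

    def __init__(self, window_size=100, min_accuracy=0.9):
        self.outcomes = deque(maxlen=window_size)  # True for each correct prediction
        self.min_accuracy = min_accuracy

    def record(self, predicted, actual):
        """Store whether one live prediction matched its ground-truth label."""
        self.outcomes.append(predicted == actual)

    def accuracy(self):
        """Accuracy over the current window, or None if nothing was recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_attention(self):
        """True once the window is full and accuracy is below the threshold."""
        full = len(self.outcomes) == self.outcomes.maxlen
        return full and self.accuracy() < self.min_accuracy


# Hypothetical stream of (predicted, actual) pairs from a live model.
monitor = AccuracyMonitor(window_size=4, min_accuracy=0.75)
for predicted, actual in [("cat", "cat"), ("dog", "dog"), ("cat", "dog"), ("dog", "cat")]:
    monitor.record(predicted, actual)
print(monitor.accuracy(), monitor.needs_attention())
```

A check like this can trigger retraining or a rollback when accuracy on recent data falls, which is the kind of workflow to look for in a cloud service's monitoring tools.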
Deep Learning in the Cloud with Run:ai
Run:ai automates resource management and orchestration for machine learning infrastructure. With Run:ai, you can automatically run as many compute-intensive experiments as needed.
Our AI Orchestration Platform for GPU-based computers running AI/ML workloads provides:
- Advanced queueing and fair scheduling to allow users to easily and automatically share clusters of GPUs.
- Distributed training on multiple GPU nodes to accelerate model training times.
- Fractional GPUs to seamlessly run multiple workloads on a single GPU of any type.
- Visibility into workloads and resource utilization to improve user productivity.
Run:ai simplifies machine learning infrastructure orchestration, helping data scientists accelerate their productivity and the quality of their models.
Learn more about the Run:ai GPU virtualization platform.
Learn More About Cloud Deep Learning
There’s a lot more to learn about cloud deep learning. To continue your research, take a look at the rest of our blogs on this topic:
AWS Deep Learning: Choosing the Best Option for You
Amazon Web Services (AWS) is a cloud computing pioneer providing a wide range of scalable, affordable, and innovative cloud services, including a dedicated solution for deep learning. AWS offers a fully-managed machine learning service called SageMaker, and AWS Deep Learning AMI (DLAMI), which is a custom EC2 machine image, as well as deep learning containers.
This article explains in detail the various deep learning services offered by AWS, and how to leverage AWS technology for training deep learning models.
Azure Machine Learning: From Basic ML to Distributed Deep Learning Models
Microsoft Azure is a top cloud computing vendor offering many enterprise-grade services, including a dedicated solution for machine learning and deep learning, called Azure Machine Learning (Azure ML). Azure ML leverages virtual machines (VMs), datasets, datastores, code models, and deployment environments to enable effective training of deep learning models.
This article explains how Azure ML works, and how to perform distributed training of deep learning models on Azure.
Google TPU: Architecture and Performance Best Practices
Google provides cloud computing services, including dedicated solutions for artificial intelligence (AI), machine learning, and deep learning. Google has long been considered a pioneer and innovator in AI and software development, creating solutions that are adopted worldwide. Tensor Processing Units (TPUs) are another Google innovation, created to help accelerate machine learning.
This article explains what a TPU is, how the technology works, and explores key best practices for optimal cloud TPU performance.
Google Cloud GPU: The Basics and a Quick Tutorial
Google Cloud Platform (GCP) is the world’s third largest cloud provider. Google offers a number of virtual machines (VMs) that provide graphics processing units (GPUs), including the NVIDIA Tesla K80, P4, T4, P100, and V100.
Learn about Google Cloud GPU and TPU options, and learn how to set up a compute instance with an attached GPU in a few easy steps.
Triton Inference Server: The Basics and a Quick Tutorial
NVIDIA’s open-source Triton Inference Server offers backend support for most machine learning (ML) frameworks, as well as custom C++ and Python backends. This reduces the need for multiple inference servers for different frameworks and allows you to simplify your machine learning infrastructure.
Learn about the NVIDIA Triton Inference Server, its key features, models and model repositories, client libraries, and get started with a quick tutorial.