Multi-Tenancy for AI Clusters: Enabling Scalability and Security

by
Oz Bar Shalom
January 17, 2024

Private and public clouds are built on shared infrastructure. Resources are shared dynamically among multiple organizations, or among multiple business units within one organization, each with its own needs and requirements.

Multi-tenancy is a concept that plays a crucial role in modern IT infrastructure, especially in the context of cloud computing and shared resources.

The key challenge in multi-tenancy lies in maintaining the isolation and security of resources, including compute and memory resources, storage, networking, and data, between these different tenants. This separation ensures that each tenant's data and operations remain distinct and secure.

Multi-tenancy has two primary use cases, each with different security and operational requirements:

  1. Public Clouds: Public cloud providers offer infrastructure, platforms and tools that serve multiple customers, including commercial organizations and government agencies. These platforms need to ensure the highest level of isolation and security to meet the stringent requirements of their customers. Each tenant must have confidence that their data and resources are kept separate from others.
  2. Enterprise Solutions: Enterprises may adopt multi-tenancy to serve their internal departments or business units as individual customers. In this scenario, the emphasis may shift from stringent isolation to resource efficiency and cost-effectiveness, as all tenants belong to the same organization.

Challenges in Multi-Tenancy

Multi-tenancy brings forth several challenges which can vary depending on the type of organizations and users involved.

  1. Isolation and Security: One of the primary challenges in multi-tenancy is ensuring adequate isolation and security between tenants. Each organization or user must be prevented from accessing or interfering with the compute resources, network, and data resources of others.
  2. Customization and Tool Flexibility: Tenants may require customization and the ability to bring their own tools and applications. Balancing customization options with the need for standardized services is a challenge.
  3. Resource Allocation and Quota Management: Multi-tenancy involves sharing physical resources such as GPUs, CPUs, RAM, storage, and networking. Allocating these resources fairly and efficiently is difficult, especially when some tenants have higher demands than others. To balance resource efficiency with isolation, multi-tenant solutions often implement resource quotas for each tenant while offering different consumption models, such as reserved, on-demand, and spot instances.
  4. Centralized Management: Managing all tenants from a central point is essential for effective control. This requires a management platform that can handle user authentication, permissions, and resource provisioning.
  5. Centralized Monitoring: Effective monitoring is necessary to track resource utilization, detect anomalies, and ensure the overall health of the multi-tenant environment. Centralized monitoring tools should provide insights into the performance of individual tenants while maintaining their privacy.
  6. Cost Management: Managing the costs associated with multi-tenancy, including infrastructure scaling and maintenance, can be complex. Public cloud providers must strike a balance between offering competitive pricing and covering their operational costs.
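The per-tenant quota idea from item 3 can be illustrated with a minimal Python sketch. The `TenantQuota` class and `try_allocate` function are hypothetical names introduced here for illustration, and the model tracks GPUs only; a real platform would also cover CPU, memory, storage, and the different consumption models.

```python
from dataclasses import dataclass


@dataclass
class TenantQuota:
    """Hypothetical per-tenant GPU quota record (illustrative only)."""
    name: str
    gpu_quota: int        # maximum GPUs the tenant may hold at once
    gpu_in_use: int = 0   # GPUs currently allocated to the tenant


def try_allocate(tenant: TenantQuota, requested: int, cluster_free: int) -> bool:
    """Admit a request only if it fits the tenant quota and the free capacity."""
    if requested <= 0:
        raise ValueError("requested must be positive")
    within_quota = tenant.gpu_in_use + requested <= tenant.gpu_quota
    if within_quota and requested <= cluster_free:
        tenant.gpu_in_use += requested
        return True
    # Rejected: the tenant would exceed its quota or the cluster is full.
    return False
```

A strict check like this protects other tenants but can leave GPUs idle when a tenant is below its quota, which is why quotas are often combined with more flexible scheduling policies.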

Isolation Levels of Multi-Tenancy

In multi-tenancy, the level of isolation can vary depending on the use case and customer requirements. There are two main approaches to implementing multi-tenancy isolation:

  1. Virtual Private Cloud (VPC) per tenant: At the highest degree of multi-tenancy isolation, each tenant may have dedicated physical resources, including servers and storage, isolated in a virtual network. This ensures complete separation between tenants and is often preferred when the tenants are different enterprises that prioritize security and control. It can, however, be the most expensive option. Public clouds are an example of a multi-tenant offering in which each tenant receives its own VPC environment.
    Most cloud providers offer virtual machines (VMs) to different tenants, with isolation enforced at the hypervisor layer. Some public clouds also offer bare-metal compute, where isolation is implemented on smart network interface cards, also called Data Processing Units (DPUs), which offload the isolation software from the host server.
  2. Shared VPC with logical isolation: At a lower degree of isolation, tenants run in the same VPC and share all of its physical resources, with software isolation preventing tenants from accessing or viewing the workloads and data of other tenants. This is the most cost-effective solution and is suitable for scenarios where tenants are more concerned with rapid enablement than strict isolation. This approach offers the highest resource efficiency and flexibility but relies on logical security controls to prevent unauthorized access between tenants.
    The Kubernetes framework, for example, provides tools like namespaces, access controls, quotas, and network policies to help cluster administrators manage and isolate multiple tenants sharing a single cluster. More advanced solutions in the Kubernetes ecosystem, like vCluster or Capsule, offer enhanced multi-tenancy management and isolation capabilities, such as provisioning and managing tenants.
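As a rough illustration of the Kubernetes building blocks mentioned above, the Python sketch below generates three manifests for one tenant: a Namespace, a ResourceQuota that caps GPU requests, and a NetworkPolicy that admits ingress only from pods in the same namespace. The function name, the tenant name, and the GPU limit are assumptions made for this example, and the `nvidia.com/gpu` resource name assumes the NVIDIA device plugin is installed. Access controls (RBAC) are omitted for brevity.

```python
import json


def tenant_manifests(tenant: str, gpu_limit: int) -> list:
    """Build Namespace, ResourceQuota, and NetworkPolicy manifests for a tenant."""
    namespace = {
        "apiVersion": "v1",
        "kind": "Namespace",
        "metadata": {"name": tenant},
    }
    quota = {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": f"{tenant}-quota", "namespace": tenant},
        # Caps the total GPUs that pods in this namespace may request.
        "spec": {"hard": {"requests.nvidia.com/gpu": str(gpu_limit)}},
    }
    network_policy = {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "same-namespace-only", "namespace": tenant},
        "spec": {
            "podSelector": {},            # applies to every pod in the namespace
            "policyTypes": ["Ingress"],
            # A bare podSelector matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }
    return [namespace, quota, network_policy]


if __name__ == "__main__":
    # Hypothetical tenant "team-a" with a limit of 4 GPUs.
    bundle = {"apiVersion": "v1", "kind": "List", "items": tenant_manifests("team-a", 4)}
    print(json.dumps(bundle, indent=2))
```

The printed JSON could be reviewed and then applied with `kubectl apply -f`. Note that a NetworkPolicy only takes effect when the cluster's network plugin enforces it.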

The selection of an isolation level hinges upon the unique use case and the demands of the customer. Cloud service providers have the flexibility to provide enterprises with heightened isolation levels to guarantee data security and compliance, while offering startups reduced levels of isolation to emphasize resource efficiency and cost-effectiveness. When enterprises construct a private cloud infrastructure for internal purposes, they frequently lean towards lower levels of isolation between various business units, aiming to optimize resource flexibility and cost efficiency.

Left figure: Multi-tenancy with a VPC per tenant. Right figure: Multi-tenancy in a shared VPC with a lower degree of isolation, using Kubernetes namespaces.

Run:ai's Multi-Tenancy Solution for AI Clusters

Run:ai's multi-tenancy solution is designed to meet the unique demands of AI workloads. It provides a multi-tenancy platform-as-a-service (PaaS) solution at the Kubernetes level that can serve multiple teams or business units within one organization, offering a wide range of capabilities, including:

  • Isolation and Advanced Resource Quota Management: Run:ai's solution offers different levels of isolation to accommodate various customer requirements, from dedicated resources to shared clusters with logical resource quotas, including fair-share allocation, preemption, and over-quota scheduling.
  • Centralized Management and Enterprise Security: Platform operators have centralized control over the platform while ensuring data privacy and security for all tenants.
  • Resource Efficiency: The platform optimizes resource allocation and utilization, enabling efficient sharing of resources among tenants through advanced scheduling and GPU sharing capabilities such as fractional GPUs and dynamic Multi-Instance GPU (MIG) allocations.
  • Bring your own ML Tools: Run:ai is an open and flexible platform, allowing researchers to integrate their own tools and libraries. Run:ai supports and simplifies the usage of a wide variety of ML tools, allowing different tenants to choose different tools: Jupyter notebooks, VS Code, and PyCharm for interactive development; Weights and Biases, Comet.ml, MLflow, and TensorBoard for experiment tracking; PyTorch Lightning, Ray, Horovod, MPI, and more for distributed training; and NVIDIA Triton Inference Server, Seldon, and more for model serving.
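To make the terms fair share, over-quota, and preemption more concrete, here is a simplified Python sketch. It is not Run:ai's scheduler; the `allocate_gpus` function and its policy (guaranteed quota first, then spare GPUs handed one at a time to the tenant with the lowest allocation-to-quota ratio) are assumptions chosen for illustration.

```python
def allocate_gpus(total_gpus: int, quotas: dict, demands: dict) -> dict:
    """Return GPUs per tenant: guaranteed quota first, then over-quota sharing."""
    if any(q <= 0 for q in quotas.values()):
        raise ValueError("every quota must be positive")
    if sum(quotas.values()) > total_gpus:
        raise ValueError("quotas exceed cluster capacity")

    # Step 1: each tenant gets what it asks for, up to its guaranteed quota.
    alloc = {t: min(demands.get(t, 0), quotas[t]) for t in quotas}
    spare = total_gpus - sum(alloc.values())

    # Step 2: hand out spare GPUs one by one to tenants with unmet demand,
    # favoring the tenant that is lowest relative to its quota (ties by name).
    while spare > 0:
        hungry = [t for t in quotas if alloc[t] < demands.get(t, 0)]
        if not hungry:
            break
        pick = min(hungry, key=lambda t: (alloc[t] / quotas[t], t))
        alloc[pick] += 1
        spare -= 1
    return alloc


if __name__ == "__main__":
    quotas = {"a": 4, "b": 2, "c": 2}
    # Tenant "a" is mostly idle, so "b" and "c" run over quota.
    print(allocate_gpus(8, quotas, {"a": 1, "b": 5, "c": 3}))
    # When "a" needs its full quota again, over-quota GPUs are reclaimed.
    print(allocate_gpus(8, quotas, {"a": 4, "b": 5, "c": 3}))
```

Recomputing the allocation when demand changes is a stand-in for preemption: workloads running over quota lose their borrowed GPUs once the owning tenant asks for its guaranteed share.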

Conclusion

Multi-tenancy in AI is a pivotal concept that enables organizations to harness the power of AI while addressing unique challenges related to security, resource allocation, and diverse business models. Whether in public clouds or internal enterprise solutions, multi-tenancy offers flexibility and scalability. To navigate the complex landscape of multi-tenancy, organizations must carefully balance the levels of isolation and security to meet the specific needs of their tenants. In an era where AI is transforming industries, multi-tenancy paves the way for innovation and collaboration while safeguarding data and resources.