Today, Azure is proud to take the next step in our commitment to enabling customers to harness the power of AI (Artificial Intelligence) at scale. For AI, the bar for innovation has never been higher, with the hardware requirements for training models far outpacing Moore's Law. Technology leaders across industries are discovering new ways to apply the power of machine learning, accelerated analytics, and AI to make sense of unstructured data. The natural language models of today are exponentially larger than the largest models of four short years ago.

OpenAI’s GPT-3 model, for instance, has three orders of magnitude more parameters (roughly 175 billion versus 25 million) than the ResNet-50 image classification model that was at the forefront of AI in the mid-2010s. These demanding workloads required the development of a new class of system within Microsoft Azure, built from the ground up using the latest hardware innovations.

The Azure team has built on our experience virtualizing the latest GPU technology and building the public cloud industry’s leading InfiniBand-enabled HPC virtual machines to offer something entirely new for AI in the cloud. Each deployment of an ND A100 v4 cluster rivals the largest AI supercomputers in the industry in raw scale and advanced technology. These VMs deliver the same 1.6 Tb/s of total dedicated InfiniBand bandwidth per VM, and the same AMD Rome-powered compute cores behind every NVIDIA A100 GPU, as the most powerful dedicated on-premises HPC systems. On top of this, Azure adds the massive scale, elasticity, and versatility of deployment that Microsoft’s customers and internal AI engineering teams expect.

This unparalleled interconnect scale and capability in a cloud offering, with each GPU directly paired with a high-throughput, low-latency InfiniBand interface, gives our customers a unique dimension of on-demand scaling without having to manage their own datacenters.

Today, at SC20, we’re announcing the public preview of the ND A100 v4 VM family, available from one virtual machine to world-class supercomputer scale, with each individual VM featuring:

  • Eight of the latest NVIDIA A100 Tensor Core GPUs, each with 40 GB of HBM2 memory, offering a typical per-GPU performance improvement of 1.7x to 3.2x over V100 GPUs, or up to 20x when layering features such as the new mixed-precision modes, sparsity, and Multi-Instance GPU (MIG), for a significantly lower total cost of training and improved time-to-solution
  • VM system-level GPU interconnect based on NVLink 3.0 and NVSwitch
  • One 200 Gb/s InfiniBand HDR link per GPU, with full NCCL2 support and GPUDirect RDMA, for 1.6 Tb/s of total bandwidth per virtual machine (see the distributed training sketch after this list)
  • 40 Gb/s front-end Azure networking
  • 6.4 TB of local NVMe storage
  • InfiniBand-connected job sizes in the thousands of GPUs, featuring any-to-any and all-to-all communication without requiring topology-aware scheduling
  • 96 physical AMD Rome CPU cores with 900 GB of DDR4 RAM
  • PCIe Gen 4 for the fastest possible connections between GPUs, network, and host CPUs, delivering up to twice the I/O performance of PCIe Gen 3-based platforms
  • Like other Azure GPU virtual machines, ND A100 v4 is also available with the Azure Machine Learning (AML) service for interactive AI development, distributed training, batch inferencing, and automation with MLOps. Customers can choose to deploy through AML or traditional VM Scale Sets, and soon through many other Azure-native deployment options such as Azure Kubernetes Service. In all of these cases, optimized configuration of the systems and the InfiniBand backend network is handled automatically.
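
To make the interconnect bullets above concrete, here is a minimal sketch of multi-GPU, multi-node training with PyTorch’s NCCL backend, which is how the per-GPU InfiniBand HDR links are typically exercised; NCCL discovers the fabric and uses GPUDirect RDMA automatically, so the script needs no InfiniBand-specific code. The script name, toy model, and launch command are illustrative assumptions, not part of this announcement.

```python
# train_ddp.py -- minimal sketch: one process per GPU, NCCL over InfiniBand.
# Illustrative launch across two ND A100 v4 VMs (8 GPUs each):
#   torchrun --nnodes=2 --nproc_per_node=8 \
#       --rdzv_backend=c10d --rdzv_endpoint=<head-node>:29500 train_ddp.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK, and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")  # NCCL routes over the IB HDR links
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)        # pin this process to one A100

    # Toy model stands in for a real workload (e.g., BERT pre-training).
    model = torch.nn.Linear(1024, 1024).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    for step in range(10):
        x = torch.randn(32, 1024, device=local_rank)
        loss = model(x).square().mean()
        opt.zero_grad()
        loss.backward()   # gradient all-reduce runs over NCCL / InfiniBand
        opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Launched with one process per GPU (eight per VM), each rank’s gradient all-reduce traffic maps naturally onto the dedicated 200 Gb/s link behind its GPU.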

Azure Machine Learning provides a tuned virtual machine (pre-installed with the required drivers and libraries) and container-based environments optimized for the ND A100 v4 family. Sample recipes and Jupyter notebooks help users get started quickly with multiple frameworks, including PyTorch and TensorFlow, and with training state-of-the-art models like BERT. With Azure Machine Learning, customers have access to the same tools and capabilities in Azure as our AI engineering teams.
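
As a rough sketch of what that workflow can look like with the azureml-core Python SDK (v1), the snippet below provisions an ND A100 v4 cluster and submits a distributed training job; the workspace configuration, cluster name, node counts, curated environment name, and training script are assumptions for illustration only.

```python
# Sketch: submit a multi-node distributed job to ND A100 v4 via Azure ML.
# Workspace config, cluster name, and environment name are assumptions.
from azureml.core import Workspace, Experiment, ScriptRunConfig, Environment
from azureml.core.compute import AmlCompute, ComputeTarget
from azureml.core.runconfig import MpiConfiguration

ws = Workspace.from_config()  # reads config.json for an existing workspace

# Provision (or attach to) an ND A100 v4 cluster.
cluster_name = "nd-a100-v4-cluster"
try:
    compute = ComputeTarget(workspace=ws, name=cluster_name)
except Exception:
    cfg = AmlCompute.provisioning_configuration(
        vm_size="Standard_ND96asr_v4",  # ND A100 v4: 8x A100, 96 vCPUs
        min_nodes=0,
        max_nodes=2,
    )
    compute = ComputeTarget.create(ws, cluster_name, cfg)
    compute.wait_for_completion(show_output=True)

# One process per GPU: 8 per node across 2 nodes = 16 ranks total.
distr = MpiConfiguration(process_count_per_node=8, node_count=2)

run_config = ScriptRunConfig(
    source_directory="./src",
    script="train_ddp.py",
    compute_target=compute,
    environment=Environment.get(ws, name="AzureML-PyTorch-1.6-GPU"),
    distributed_job_config=distr,
)
Experiment(ws, "nd-a100-v4-demo").submit(run_config)
```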

Originally published at HPCwire.