Broadcom has come up with some interesting mechanisms to address the challenges of building an Ethernet-based fabric that supports AI workloads. These mechanisms, which include a scheduling framework, cells, and credits, are intended to minimize congestion, latency, and dropped frames or packets in the fabric. In this post I talk about what I learned at Network Field Day 32 about how Broadcom builds an Ethernet fabric optimized for AI using its Jericho3-AI and Ramon ASICs.
AI Workload Challenges For Ethernet Networks
Training AI models on huge data sets can take time: days or weeks of computation before you get results. The data sets are spread across hundreds or thousands of individual GPUs. No single GPU has the complete model or data set, so data has to be exchanged among these GPUs.
One key metric for AI training is job completion time. Data scientists want data exchanges to happen as quickly as possible because computation can’t continue until the entire flow has been received by the target. The very last bytes of a flow dictate when you can start the next computation cycle, which means expensive GPUs are sitting idle while traffic goes from point A to point B. That idle time increases if, for example, flows are slowed down due to congestion in the network, or if frames or packets get dropped and have to be re-transmitted.
The time it takes for the final packets or frames of a flow to reach the target is called tail latency. Networks that support AI workloads seek to minimize tail latency.
AI workloads tend to generate fewer flows across the network, but the flows are higher bandwidth (so-called elephant flows). Without mechanisms to control congestion or balance loads, a flood of high-bandwidth traffic could saturate links. Frames may arrive out of order, or get dropped due to collisions or saturated buffers. Dropped frames have to be retransmitted, which increases tail latency.
One option is to use a lossless interconnect such as InfiniBand. But network professionals versed in InfiniBand are scarce, and InfiniBand chips and networking gear tend to be more expensive than Ethernet-based equipment. The Ethernet standard has also consistently increased its throughput: 400G and 800G ports are available, with 1.6T on the horizon.
Given these advantages, chip makers and networking vendors have experimented with various mechanisms to make Ethernet better suited to the demands of AI and HPC workloads. That means addressing issues such as collisions, incast, and link failures.
Making A Schedule
Broadcom has several options for minimizing tail latency in AI-focused Ethernet fabrics. These approaches, which are implemented in switch hardware, differ depending on the ASIC. This post focuses on Broadcom’s Jericho3-AI ASIC.
Broadcom has developed what it calls a scheduled fabric for Jericho3-AI. A scheduled fabric uses a leaf-spine configuration with Jericho3-AIs as the leaves and its Ramon ASICs in the spine. All of the intelligence for the scheduled fabric resides in the leaf switches.
One aspect of the fabric is a credit-based signaling system. Ingress and egress switches exchange credits with one another to signal when traffic should be admitted into the fabric.
For example, if GPU A sends a flow addressed to GPU B, the leaf switch connected to GPU A implements a Virtual Output Queue (VOQ) and buffers the traffic. This switch then requests a sending credit from the receiver switch to which GPU B is attached. The receiver switch will only send that credit if and when GPU B is available to process the flow. In other words, the flow isn’t allowed onto the network until the receiver is ready to handle it. Broadcom says this credit-based framework minimizes congestion and any subsequent flow collisions.
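The request/grant exchange above can be sketched in a few lines of Python. This is an illustrative model only: the class names, the `receiver_ready` flag, and the single-queue structure are my assumptions, and the real mechanism is implemented in ASIC hardware, not software.

```python
# Hypothetical sketch of a credit-based VOQ handshake between leaf
# switches. All names here are illustrative, not Broadcom's API.
from collections import deque

class EgressLeaf:
    """Leaf switch attached to the destination GPU."""
    def __init__(self):
        self.receiver_ready = False  # e.g., GPU B free to process a flow

    def request_credit(self):
        # Grant a sending credit only when the attached GPU can
        # actually consume the flow.
        return self.receiver_ready

class IngressLeaf:
    """Leaf switch attached to the source GPU."""
    def __init__(self, egress):
        self.voq = deque()   # Virtual Output Queue toward this egress
        self.egress = egress

    def enqueue(self, frame):
        self.voq.append(frame)  # buffer traffic until a credit arrives

    def try_send(self):
        sent = []
        # Traffic enters the fabric only after the egress grants credit.
        while self.voq and self.egress.request_credit():
            sent.append(self.voq.popleft())
        return sent

egress = EgressLeaf()
ingress = IngressLeaf(egress)
ingress.enqueue("frame-1")
assert ingress.try_send() == []           # receiver busy: frame stays queued
egress.receiver_ready = True
assert ingress.try_send() == ["frame-1"]  # credit granted: frame enters fabric
```

The key design point is that backpressure happens at the ingress edge: a flow that can't be consumed never occupies fabric links or spine buffers in the first place.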
Cell Division
The Jericho3-AI ASIC also supports what Broadcom calls Perfect Load Balancing (PLB). The PLB mechanism “sprays” traffic equally across all the links in the fabric. This ensures that individual links don’t get oversubscribed and provides more consistent performance.
To spray traffic equally, the ASIC divides Ethernet frames into equal-size cells. The number of cells depends on frame size; a 64-byte frame might be sent as a single cell, while a jumbo frame would be divided into many cells.
Each cell gets a sequence number to mark the start and end of the frame, as well as header information to indicate the destination port and target device. When cells pass from a leaf switch to a Ramon-based spine switch, the spine only has to perform a header lookup to send the cell to the correct destination.
The leaf switch directly connected to the destination GPU buffers and reassembles the incoming cells, then sends the reassembled frame on to its destination.
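The segment, spray, and reassemble steps described above can be sketched as follows. The cell size, header fields, and round-robin spraying are illustrative assumptions on my part; Broadcom hasn't published the actual cell format.

```python
# Hypothetical sketch of cell-based spraying and reassembly.
# CELL_SIZE, the header layout, and round-robin link selection are
# assumptions for illustration, not Broadcom's implementation.
CELL_SIZE = 64  # equal-size cells; the cell count depends on frame size

def segment(frame: bytes, dest_port: int):
    """Ingress leaf: split a frame into equal-size, sequenced cells."""
    total = (len(frame) + CELL_SIZE - 1) // CELL_SIZE
    cells = []
    for seq in range(total):
        payload = frame[seq * CELL_SIZE:(seq + 1) * CELL_SIZE]
        # The header carries destination and sequencing info, so the
        # spine needs only a lookup and the egress leaf can reorder.
        cells.append({"dest": dest_port, "seq": seq,
                      "last": seq == total - 1, "payload": payload})
    return cells

def spray(cells, links):
    """Distribute cells round-robin across all fabric links."""
    return [(links[i % len(links)], cell) for i, cell in enumerate(cells)]

def reassemble(cells):
    """Egress leaf: reorder by sequence number and rebuild the frame."""
    ordered = sorted(cells, key=lambda c: c["seq"])
    assert ordered[-1]["last"], "final cell missing"
    return b"".join(c["payload"] for c in ordered)

frame = bytes(200)                          # a 200-byte frame -> 4 cells
cells = segment(frame, dest_port=7)
sprayed = spray(cells, links=["link0", "link1", "link2"])
received = [cell for _, cell in sprayed]    # may arrive out of order
assert reassemble(received) == frame
```

Because cells are uniform and sequenced, any link can carry any cell, which is what lets the fabric keep all links equally loaded regardless of how large individual flows are.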
Broadcom says its scheduled fabric, load balancing, and cell division are proprietary. Broadcom is a founding member of the Ultra Ethernet Consortium, a new standards body that aims to optimize the Ethernet protocol for AI and HPC workloads, but Broadcom has no plans to open-source these techniques.
Learn More
Broadcom participated in Network Field Day 32, which I attended virtually as a delegate. You can see all Broadcom’s presentations here. In the interest of full disclosure, I received a Broadcom-branded mug as a delegate gift.