Packet Pushers

Where Too Much Technology Would Be Barely Enough


A Look At Broadcom’s Jericho3-AI Ethernet Fabric: Schedules, Credits, And Cells

Drew Conry-Murray August 8, 2023

Broadcom has come up with some interesting mechanisms to address the challenges of building an Ethernet-based fabric that supports AI workloads. These mechanisms, which include a scheduling framework, cells, and credits, are intended to minimize congestion, latency, and dropped frames or packets in the fabric. In this post I talk about what I learned at Network Field Day 32 about how Broadcom builds an Ethernet fabric optimized for AI using its Jericho3-AI and Ramon ASICs.

AI Workload Challenges For Ethernet Networks

Training AI models on huge data sets can take time: days or weeks of computation before you get results. The data sets are spread across hundreds or thousands of individual GPUs. No single GPU has the complete model or data set, so data has to be exchanged among these GPUs.

One key metric for AI training is job completion time. Data scientists want data exchanges to happen as quickly as possible because computation can’t continue until the entire flow has been received by the target. The very last bytes of a flow dictate when you can start the next computation cycle, which means expensive GPUs are sitting idle while traffic goes from point A to point B. That idle time increases if, for example, flows are slowed down due to congestion in the network, or if frames or packets get dropped and have to be re-transmitted.

The arrival of the final packets or frames at the target is called tail latency. Networks that support AI workloads seek to minimize tail latency.
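A toy calculation makes the point: because computation can't resume until the slowest flow lands, job completion time is gated by the tail, not the average. The numbers below are invented for illustration.

```python
# Toy illustration: one congested or retransmitted flow dominates the
# exchange, even though the average latency looks healthy.
flow_latencies_ms = [10, 11, 10, 12, 95]  # hypothetical per-flow latencies

average = sum(flow_latencies_ms) / len(flow_latencies_ms)  # 27.6 ms
tail = max(flow_latencies_ms)                              # 95 ms

# GPUs sit idle for the full tail latency each exchange cycle,
# so shaving the tail matters far more than shaving the average.
assert average < 30
assert tail == 95
```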

AI workloads tend to generate fewer flows across the network, but the flows are higher bandwidth (so-called elephant flows). Without mechanisms to control congestion or balance loads, a flood of high-bandwidth traffic could saturate links. Frames may arrive out of order, or get dropped due to collisions or saturated buffers. Dropped frames have to be retransmitted, which increases tail latency.

One option is to use a lossless interconnect such as InfiniBand. But network professionals versed in InfiniBand are scarce, and InfiniBand chips and networking gear tend to be more expensive than Ethernet-based equipment. The Ethernet standard has also consistently increased its throughput: 400G and 800G ports are available, with 1.6T on the horizon.

Given these advantages, chip makers and networking vendors have experimented with various mechanisms to make Ethernet better suited to the demands of AI and HPC workloads. That means addressing issues such as collisions, incast, and link failures.

Making A Schedule

Broadcom has several options for minimizing tail latency in AI-focused Ethernet fabrics. These approaches, which are implemented in switch hardware, differ depending on the ASIC. This post focuses on Broadcom’s Jericho3-AI ASIC.

Broadcom has developed what it calls a scheduled fabric for Jericho3-AI. A scheduled fabric uses a leaf-spine configuration with Jericho3-AIs as the leaves and its Ramon ASICs in the spine. All of the intelligence for the scheduled fabric resides in the leaf switches.

One aspect of the fabric is a credit-based signaling system. Ingress and egress switches exchange credits with one another to signal when traffic should be put into the fabric.

For example, if GPU A sends a flow addressed to GPU B, the leaf switch connected to GPU A implements a Virtual Output Queue (VOQ) and buffers the traffic. This switch then requests a sending credit from the receiver switch to which GPU B is attached. The receiver switch will only send that credit if and when GPU B is available to process the flow. In other words, the flow isn’t allowed onto the network until the receiver is ready to handle it. Broadcom says this credit-based framework minimizes congestion and any subsequent flow collisions.
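The handshake above can be sketched in a few lines. This is a minimal illustration of the request/grant pattern, not Broadcom's implementation; the class and method names are invented, and the real mechanism lives in the ASICs.

```python
from collections import deque

class EgressLeaf:
    """Sketch of the receiver-side switch (hypothetical names)."""
    def __init__(self):
        self.receiver_ready = False  # whether the attached GPU can accept traffic

    def request_credit(self):
        # Grant a credit only when the attached GPU is ready to process the flow
        return self.receiver_ready

class IngressLeaf:
    """Sketch of the sender-side switch with a Virtual Output Queue."""
    def __init__(self, egress):
        self.voq = deque()   # VOQ: traffic is buffered at the edge
        self.egress = egress

    def enqueue(self, frame):
        self.voq.append(frame)  # buffer instead of flooding the fabric

    def try_send(self):
        # Frames enter the fabric only after the egress side grants a credit
        sent = []
        while self.voq and self.egress.request_credit():
            sent.append(self.voq.popleft())
        return sent

egress = EgressLeaf()
ingress = IngressLeaf(egress)
ingress.enqueue("frame-1")
assert ingress.try_send() == []           # receiver not ready: frame stays buffered
egress.receiver_ready = True
assert ingress.try_send() == ["frame-1"]  # credit granted, frame released
```

The design point is that backpressure is applied before traffic touches the fabric, rather than relying on drops or pause frames after links are already congested.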

Cell Division

The Jericho3-AI ASIC also supports what Broadcom calls Perfect Load Balancing (PLB). The PLB mechanism “sprays” traffic equally across all the links in the fabric. This ensures that individual links don’t get oversubscribed and provides more consistent performance.

To spray traffic equally, the ASIC divides Ethernet frames into equal-size cells. The number of cells depends on frame size; a 64-byte frame might be sent as a single cell, while a jumbo frame would be divided into many.

Each cell gets a sequence number to mark the start and end of the frame, as well as header information to indicate the destination port and target device. When cells pass from a leaf switch to a Ramon-based spine switch, the spine only has to perform a header lookup to send the cell to the correct destination.

The leaf switch directly connected to the destination GPU buffers and reassembles the incoming cells, then sends the reassembled frame on to its destination.
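The split-spray-reassemble cycle can be sketched as follows. The cell size, header fields, and round-robin link selection here are assumptions for illustration only, not Broadcom's actual cell format.

```python
CELL_SIZE = 64  # bytes per cell (hypothetical)

def to_cells(frame: bytes, dest_port: int):
    # Divide the frame into equal-size cells; each cell carries a sequence
    # number plus enough header for the spine to route it by lookup alone
    chunks = [frame[i:i + CELL_SIZE] for i in range(0, len(frame), CELL_SIZE)]
    last = len(chunks) - 1
    return [{"dest_port": dest_port, "seq": n, "last": n == last, "payload": c}
            for n, c in enumerate(chunks)]

def spray(cells, num_links):
    # Round-robin cells across every fabric link so no single link is oversubscribed
    links = [[] for _ in range(num_links)]
    for n, cell in enumerate(cells):
        links[n % num_links].append(cell)
    return links

def reassemble(received):
    # The egress leaf buffers cells, restores order by sequence number,
    # and rebuilds the original frame
    ordered = sorted(received, key=lambda c: c["seq"])
    assert ordered[-1]["last"]
    return b"".join(c["payload"] for c in ordered)

frame = bytes(9000)  # a jumbo frame becomes many cells
per_link = spray(to_cells(frame, dest_port=7), num_links=4)
arrived = [c for link in per_link for c in link]  # cells arrive interleaved
assert reassemble(arrived) == frame
```

Because every cell is the same size, spraying them evenly keeps link utilization uniform regardless of how large the original frames were.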

Broadcom says its scheduled fabric, load balancing, and cell division are proprietary. Broadcom is a founding member of the Ultra Ethernet Consortium, a new standards body that aims to optimize the Ethernet protocol for AI and HPC workloads, but Broadcom has no plans to open-source these techniques.

Learn More

Broadcom participated in Network Field Day 32, which I attended virtually as a delegate. You can see all Broadcom’s presentations here. In the interest of full disclosure, I received a Broadcom-branded mug as a delegate gift.

About Drew Conry-Murray

Drew Conry-Murray has been writing about information technology for more than 15 years, with an emphasis on networking, security, and cloud. He's co-host of The Network Break podcast and a Tech Field Day delegate. He loves real tea and virtual donuts, and is delighted that his job lets him talk with so many smart, passionate people. He writes novels in his spare time. Follow him on Twitter @Drew_CM or reach out at [email protected].


© Copyright 2023 Packet Pushers Interactive, LLC · All Rights Reserved