Optimizing Ethernet to Meet AI Infrastructure Demands

Published: 25 October 2024
Channel: Packet Pushers

Ethernet competes with InfiniBand as a network fabric for AI workloads such as model training. One issue is that AI jobs don’t tolerate latency, drops, and retransmits. In other words, AI workloads do best with a lossless network. And while Ethernet has kept up with increasing demands to support greater bandwidth and throughput, it was never designed to be lossless.

ASIC makers, switch vendors, and industry groups aim to optimize Ethernet to make it a suitable network fabric for AI jobs. On today’s Heavy Networking we talk about those efforts, explore different optimization techniques, compare Ethernet and InfiniBand, and discuss general design considerations to be aware of if your organization decides to go down the Ethernet path for AI.

Our guests are Chris Kane, Sr. Manager Systems Engineering at Arista Networks, and Pete Lumbis, Principal Engineer at Nvidia.
Chris Kane - / chris-kane-297aa915

Pete Lumbis - / alumbis

Chris Kane on X - @ccie14430

Pete Lumbis on X - @peteccde

RoCE Networks for Distributed AI Training at Scale - Meta - https://engineering.fb.com/2024/08/05...

SemiAnalysis - semiconductor research - https://www.semianalysis.com/

Understanding ASICs for Network Engineers with Pete Lumbis - Packet Pushers video -
https://packetpushers.net/blog/unders...

A Look At Broadcom’s Jericho3-AI Ethernet Fabric: Schedules, Credits, And Cells - Packet Pushers blog - https://packetpushers.net/blog/a-look...

Heavy Networking 739: High Stakes Network Observability for High Frequency Trading (how to build an observability platform for extremely high traffic & latency sensitive networks) - Packet Pushers podcast -
https://packetpushers.net/podcasts/he...

Optimized Network Architectures for Training Large Language Models With Billions of Parameters - Academic Whitepaper (“Rail-only” networking for LLMs)
https://people.csail.mit.edu/ghobadi/...

Situational Awareness paper: Leopold Aschenbrenner
https://situational-awareness.ai/

Semiconductor Fabrication 101 - Purdue University
https://engineering.purdue.edu/online...

NVIDIA DGX SuperPod Network Design (AI Server with 8 GPUs and 8 NICs) - NVIDIA -
https://docs.NVIDIA.com/dgx-superpod/...

InfiniBand Essentials - free introductory InfiniBand training from NVIDIA
https://academy.NVIDIA.com/en/course/...

Heavy Networking is the flagship show of the Packet Pushers network. Visit our website to find more great networking and technology podcasts, along with tutorial videos, the Human Infrastructure newsletter, and loads more resources for building your IT career. https://packetpushers.net


Watch video Optimizing Ethernet to Meet AI Infrastructure Demands online without registration, duration hours minute second in high quality. This video was added by user Packet Pushers 25 October 2024, don't forget to share it with your friends and acquaintances, it has been viewed on our site 282 once and liked it 4 people.