Is Your GPU Really Working Efficiently in the Data Center? N Ways to... - Xiao Zhang & Wu Ying Jun

Published: 04 September 2024
on channel: CNCF [Cloud Native Computing Foundation]


Is Your GPU Really Working Efficiently in the Data Center? N Ways to Improve GPU Usage - Xiao Zhang, DaoCloud & Wu Ying Jun, China Mobile

AI has spread across many industries, and companies have bought large numbers of expensive GPUs for training and inference.
Is MFU (Model FLOPs Utilization) where it should be?
Are GPUs being held exclusively by many applications that barely use them?
Do these AI devices work efficiently 24/7?
This session draws on our large-scale production experience to summarize N ways to improve the MFU of AI accelerators.
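
For readers who want a concrete handle on the metric: MFU is commonly computed as the model FLOPs actually achieved per second divided by the hardware's theoretical peak. The sketch below is not taken from the talk. It assumes the widely used approximation of about 6 FLOPs per parameter per training token for a dense transformer (forward plus backward pass, attention cost ignored), and the function name and all input numbers are hypothetical.

```python
def model_flops_utilization(params, tokens_per_second,
                            num_accelerators, peak_flops_per_accelerator):
    """Estimate MFU for dense-transformer training.

    Assumption (not from the talk): training costs roughly
    6 * params FLOPs per token, ignoring attention FLOPs.
    """
    if num_accelerators <= 0 or peak_flops_per_accelerator <= 0:
        raise ValueError("accelerator count and peak FLOPs must be positive")
    # FLOPs the model actually consumes per second across the whole job.
    achieved_flops = 6.0 * params * tokens_per_second
    # Theoretical ceiling of the allocated hardware.
    peak_flops = num_accelerators * peak_flops_per_accelerator
    return achieved_flops / peak_flops


if __name__ == "__main__":
    # Illustrative numbers only: a 100B-parameter model, 200k tokens/s,
    # 1024 accelerators, each with an assumed 312 TFLOPS peak.
    mfu = model_flops_utilization(100e9, 200_000, 1024, 312e12)
    print(f"MFU = {mfu:.1%}")
```

Because the peak figure depends on numeric precision and on vendor specifications, the same job can report different MFU values depending on which peak is plugged in, so comparisons only make sense against a fixed baseline.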

We will share our experience training LLMs with hundreds of billions of parameters on a large-scale K8s cluster with thousands of AI accelerators (GPUs or NPUs), covering model parallelism, switch-affinity scheduling, checkpoint efficiency optimization, and recovery from checkpoints.
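
The abstract does not describe how switch-affinity scheduling is implemented. As one possible reading, the idea is to place a job's workers under as few network switches as possible so that collective communication stays local. The following is a minimal sketch under that assumption; the function name, the node-to-switch mapping, and the greedy policy are all hypothetical and not the speakers' scheduler.

```python
from collections import defaultdict


def pick_nodes_by_switch(free_nodes, nodes_needed):
    """Choose nodes for a job so that they span as few switches as possible.

    free_nodes: dict mapping node name -> switch id (hypothetical labels).
    Returns a list of node names, or None if there are not enough free nodes.
    """
    # Group free nodes by the switch they hang off.
    by_switch = defaultdict(list)
    for node, switch in sorted(free_nodes.items()):
        by_switch[switch].append(node)

    # Best case: one switch can host the whole job. Pick the smallest
    # group that fits, to leave larger groups for larger jobs.
    fitting = [nodes for nodes in by_switch.values() if len(nodes) >= nodes_needed]
    if fitting:
        return min(fitting, key=len)[:nodes_needed]

    # Otherwise fill from the largest groups first, which minimizes
    # the number of switches the job has to cross.
    chosen = []
    for nodes in sorted(by_switch.values(), key=len, reverse=True):
        chosen.extend(nodes[: nodes_needed - len(chosen)])
        if len(chosen) == nodes_needed:
            return chosen
    return None


if __name__ == "__main__":
    free = {"n1": "sw-a", "n2": "sw-a", "n3": "sw-b", "n4": "sw-b", "n5": "sw-b"}
    print(pick_nodes_by_switch(free, 2))
    print(pick_nodes_by_switch(free, 4))
```

In a real Kubernetes cluster the switch identity would typically come from node labels, and the placement would be enforced by the scheduler rather than by a standalone function like this one.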

We will also show how to improve MFU through GPU sharing, how to handle tidal (peak and off-peak) workloads with hybrid training-inference deployments, and how to raise GPU utilization by grouping nodes and matching training applications with inference applications.


