Session

The Future of AI Networks: Advancing TCP with Device Memory and Collective Communication

Speakers

Anjali Singhai
Shaopeng He
Sridhar Samudrala

Label

Nuts and Bolts

Session Type

Talk

Description

This talk explores extending TCP to support Collective Communication (CC) semantics. The work builds on device memory TCP (Devmem TCP), which Google introduced at the last NetDev conference, and marks a pivotal evolution in the AI network landscape. NCCL is currently the predominant library for mapping CC semantics onto both RDMA and TCP. Unlike traditional point-to-point communication, CC enables communication within a group of peers, which is crucial for the complex, performance-sensitive interactions of AI networks.

These enhancements simplify the framework for CC semantics and introduce direct device access and the potential for random access, moving beyond conventional stream-only access. They are essential for a broad range of applications across AI, high-performance computing (HPC), and storage, including NVMe over TCP. We expect the evolution of TCP semantics to inspire diverse implementations within the industry, much as Google’s Falcon and AWS’s EFA have done under RDMA semantics. Our work extends these innovations to TCP, broadening its applicability and potential for widespread adoption.

For practical deployment, we have enabled this enhanced TCP on Intel NICs, specifically the Intel IPU series with the IDPF driver, so that it can be used in established environments such as HPC MPI and the AI NCCL framework. In the session, we will discuss our implementation strategies for these NICs and report on our progress. We will also present detailed performance data showing that the enhanced TCP is comparable to RDMA and outperforms standard TCP, particularly with larger packet sizes. The talk gives an overview of this work and its potential impact on the future of network communications.