Session

Toward Host-Pluggable Congestion Control for RDMA/IP Datacenter Transports

Speakers

Vivek Kashyap

Label

Nuts and Bolts

Session Type

Talk

Description

Building on the Netdev 0x19 talk on congestion control in AI/ML datacenter networks, this talk presents a concrete step toward host-pluggable congestion control for RDMA/IP datacenter transports. The previous talk surveyed modern datacenter congestion-control approaches, the limitations of fixed endpoint behavior, and the need for congestion-control algorithms to become more programmable and adaptable as workloads evolve.

This follow-up focuses on a practical implementation model that moves congestion-control policy out of fixed firmware or hardware implementations and into a host/hybrid control framework. A host component running in userspace, or alternatively in the kernel, periodically issues probe packets and uses hardware timestamping to obtain path RTT measurements. These measurements are converted into a congestion estimate for the path. The resulting control value is then distributed across the active Queue Pairs associated with that peer or path and applied through a driver-mediated QP update interface. The talk will share results from this implementation running with NIC-embedded congestion control disabled, without relying on DCQCN/PFC behavior, to demonstrate that a host-driven control loop can manage RDMA congestion.
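As a rough illustration of the loop described above, the following Python sketch walks through the three steps: fold a probe RTT sample into a per-path estimate, derive an aggregate rate, and split it across the path's active QPs. The abstract does not specify the estimator or the sharing policy, so everything here is an assumption made for illustration: the EWMA smoothing, the delay-target additive-increase/multiplicative-decrease rule, the equal split, all constants, and the names `PathState`, `on_probe_rtt`, `distribute`, `apply_rates`, and the `set_qp_rate` callback standing in for the driver-mediated QP update interface.

```python
from dataclasses import dataclass, field


@dataclass
class PathState:
    """Per-peer (or per-path) congestion state kept by the host component."""
    base_rtt_us: float      # lowest RTT seen so far; stands in for the uncongested RTT
    rate_mbps: float        # aggregate sending rate currently allowed on this path
    srtt_us: float = 0.0    # smoothed probe RTT (0.0 means no sample yet)
    qp_ids: list = field(default_factory=list)


# Hypothetical tuning constants, chosen only for illustration.
EWMA_GAIN = 0.125           # weight of each new RTT sample
TARGET_QUEUE_US = 20.0      # tolerated queuing delay above the base RTT
ADD_STEP_MBPS = 50.0        # additive increase per probe sample
MIN_RATE_MBPS = 10.0
LINE_RATE_MBPS = 100_000.0


def on_probe_rtt(path, rtt_us):
    """Fold one hardware-timestamped RTT sample into the path's aggregate rate."""
    path.base_rtt_us = min(path.base_rtt_us, rtt_us)
    if path.srtt_us == 0.0:
        path.srtt_us = rtt_us
    else:
        path.srtt_us += EWMA_GAIN * (rtt_us - path.srtt_us)

    queuing_us = path.srtt_us - path.base_rtt_us
    if queuing_us <= TARGET_QUEUE_US:
        # Delay is within target: probe for more bandwidth.
        path.rate_mbps += ADD_STEP_MBPS
    else:
        # Back off in proportion to how far the delay overshoots the target.
        overshoot = (queuing_us - TARGET_QUEUE_US) / queuing_us
        path.rate_mbps *= 1.0 - 0.5 * overshoot
    path.rate_mbps = max(MIN_RATE_MBPS, min(LINE_RATE_MBPS, path.rate_mbps))
    return path.rate_mbps


def distribute(path):
    """Split the path's aggregate rate equally across its active QPs."""
    if not path.qp_ids:
        return {}
    share = path.rate_mbps / len(path.qp_ids)
    return {qp: share for qp in path.qp_ids}


def apply_rates(per_qp, set_qp_rate):
    """Push each QP's rate through a caller-supplied driver callback."""
    for qp, rate in per_qp.items():
        set_qp_rate(qp, rate)
```

In this sketch the policy (`on_probe_rtt`, `distribute`) is kept separate from the mechanism (`set_qp_rate`), so a different estimator or a weighted split could be swapped in without touching the driver-facing call.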

The intent is not to claim that probe RTT is the only useful congestion signal. Rather, probe-driven feedback provides a deployable starting point for separating congestion-control policy from device-specific implementation. Commoditizing the control loop through a host-accessible framework allows new algorithms to be prototyped, tuned, compared, and modified without requiring every change to be embedded directly in NIC firmware or hardware. The same host/driver framework can also be extended to incorporate additional endpoint signals such as ECN counts, ACK or progress counters, retransmit and retry events, selective-recovery information, and path-health indicators.

This flexibility matters because modern datacenter workloads are heterogeneous. AI/ML collectives, storage transfers, KV-cache movement, HPC messages, and front-end traffic may share Ethernet/IP infrastructure but differ in latency sensitivity, throughput demand, and burst behavior. A host-pluggable substrate allows congestion behavior to be adapted by workload, path, and policy rather than being constrained to a single fixed transport or firmware mechanism.

The talk will describe the probe-driven control loop, the interaction between the host congestion-control component and active QPs, early implementation results, the tradeoffs between userspace, kernel, firmware, and hardware placement, and the minimal endpoint interfaces needed to make RDMA congestion-control algorithms easier to deploy and evolve in datacenter environments.