Talk: "HW High-Availability and Link Aggregation for Ethernet switch and NIC RDMA using Linux bonding/team" (Or Gerlitz, Tzahi Oved)


The Linux networking stack supports High-Availability (HA) and Link Aggregation (LAG) through the bonding and team drivers, both of which create a software netdevice on top of two or more netdevs.

These HA devices are set up as "upper" devices acting over "lower" devices. The core networking stack uses a notifier mechanism to announce the setup and tear-down of such relations.
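
As a rough illustration of the upper/lower announcement flow, the following is a user-space Python model, not kernel code. The event names mirror the kernel's NETDEV_PRECHANGEUPPER and NETDEV_CHANGEUPPER notifier events; the classes `NotifierChain` and `Stack`, the device names, and the convention that a callback returns `False` to veto are assumptions made for this sketch.

```python
# Toy model of upper/lower link announcements. Only the event names are
# borrowed from the kernel; all classes and helpers here are hypothetical.
PRECHANGEUPPER = "PRECHANGEUPPER"
CHANGEUPPER = "CHANGEUPPER"


class NotifierChain:
    """Ordered list of callbacks invoked for every event."""

    def __init__(self):
        self._callbacks = []

    def register(self, callback):
        self._callbacks.append(callback)

    def call(self, event, lower, upper, linking):
        # A callback that returns False vetoes the operation and stops the walk.
        for cb in self._callbacks:
            if cb(event, lower, upper, linking) is False:
                return False
        return True


class Stack:
    """Tracks which upper device each lower device is linked to."""

    def __init__(self):
        self.chain = NotifierChain()
        self.uppers = {}  # lower name -> upper name

    def link(self, lower, upper):
        # Listeners may refuse before anything changes.
        if not self.chain.call(PRECHANGEUPPER, lower, upper, True):
            return False
        self.uppers[lower] = upper
        self.chain.call(CHANGEUPPER, lower, upper, True)
        return True

    def unlink(self, lower):
        upper = self.uppers.pop(lower)
        self.chain.call(CHANGEUPPER, lower, upper, False)


events = []
stack = Stack()
stack.chain.register(lambda ev, lo, up, linking: events.append((ev, lo, up, linking)))
stack.link("eth0", "bond0")
stack.link("eth1", "bond0")
stack.unlink("eth1")
```

A hardware driver would play the role of the registered callback, reacting to each link and unlink it observes.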

We show how to take advantage of the standard bonding/team drivers and their associated notifiers to reflect HA/LAG into the HW and achieve enhanced functionality.

We present four use cases dealing with RDMA, SR-IOV Virtual Functions (VFs) and a physical switch.

In the 1st RDMA case, the RDMA stack presents a RoCE (RDMA over Converged Ethernet) device with one port, backed by two bonded Ethernet NICs. The HW is set up so that RDMA connections established over this device, which are offloaded from the networking stack, are subject to HA and LAG.

In the SR-IOV case, the PF host netdevices are bonded while the VF sees a HW device with one port. The HW setup done by the PF driver causes all VF traffic (both conventional TCP/IP and offloaded RDMA) to be subject to HA and LAG.

In the physical switch case, the creation of a LAG above the port netdevices is propagated to the device driver using network notifiers.

The device driver can either program the device to create the hardware LAG, or forbid the operation if hardware resources are exhausted or if it lacks support for certain LAG parameters.
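
The accept-or-forbid decision can be sketched as follows, again as a user-space Python model rather than driver code. The `SwitchDriver` class, the `max_lags` limit, the port names and the transmit-type strings are all hypothetical; a real driver would make this decision in its notifier handler based on its own hardware constraints.

```python
from dataclasses import dataclass, field


@dataclass
class SwitchDriver:
    """Hypothetical switch driver deciding whether a LAG can be offloaded."""

    max_lags: int = 2  # assumed hardware limit on LAG groups
    supported_tx_types: frozenset = frozenset({"hash"})  # assumed capability
    hw_lags: dict = field(default_factory=dict)  # LAG name -> set of ports

    def prechangeupper(self, port, lag, tx_type):
        """Return True to accept the enslavement, False to forbid it."""
        if tx_type not in self.supported_tx_types:
            return False  # unsupported LAG parameter
        if lag not in self.hw_lags and len(self.hw_lags) >= self.max_lags:
            return False  # no free hardware LAG entry
        return True

    def changeupper(self, port, lag, linking):
        """Mirror the accepted change into the (modelled) hardware table."""
        if linking:
            self.hw_lags.setdefault(lag, set()).add(port)
        else:
            members = self.hw_lags[lag]
            members.discard(port)
            if not members:
                del self.hw_lags[lag]


def enslave(driver, port, lag, tx_type):
    """Ask the driver first; program the hardware only if it agrees."""
    if not driver.prechangeupper(port, lag, tx_type):
        return False
    driver.changeupper(port, lag, True)
    return True


drv = SwitchDriver(max_lags=1)
first = enslave(drv, "swp1", "bond0", "hash")
second = enslave(drv, "swp2", "bond0", "hash")
no_room = enslave(drv, "swp3", "bond1", "hash")
bad_mode = enslave(drv, "swp4", "bond0", "active-backup")
```

In this model the third call is refused because the single hardware LAG entry is taken, and the fourth because the transmit type is not in the supported set.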

The creation of further upper devices on top of the LAG is propagated to the lower port netdevices in the same way as if the upper device had been created directly on top of them.
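
A minimal sketch of this propagation, under the assumption that the device hierarchy is kept as a plain mapping from each upper device to its lowers. The helpers `ports_under` and `add_upper` and the device names are hypothetical and only model the behaviour described above.

```python
def ports_under(dev, lowers):
    """Return the leaf (port) devices below `dev`, following lower links."""
    children = lowers.get(dev, [])
    if not children:
        return [dev]  # a device with no lowers is itself a port
    leaves = []
    for child in children:
        leaves.extend(ports_under(child, lowers))
    return leaves


def add_upper(upper, dev, lowers, notify):
    """Stack `upper` on `dev` and tell every port underneath about it."""
    lowers.setdefault(upper, []).append(dev)
    for port in ports_under(dev, lowers):
        notify(port, upper)


lowers = {"bond0": ["swp1", "swp2"]}
seen = []


def record(port, upper):
    seen.append((port, upper))


# A VLAN device created on the LAG reaches both member ports ...
add_upper("bond0.10", "bond0", lowers, record)
# ... exactly as an upper created directly on a port reaches that port.
add_upper("br0", "swp3", lowers, record)
```

Each port sees the new upper through the same callback regardless of whether a LAG sits in between.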

In the 2nd RDMA case, we propose an architecture for an OS-bypass Ethernet and RDMA bonding driver: a new kernel module for aggregating IB device network interfaces.

An IB device (struct ib_device) exposes the verbs network programming API, which allows OS bypass for raw Ethernet networking and RDMA operations.

The driver will provide a method for aggregating multiple IB device interfaces into a single logical bonded interface. This aggregation will allow existing verbs applications to use a single logical device transparently and benefit from networking HA, load balancing and NUMA locality.

The IB bonding driver works similarly to, and in conjunction with, the standard Linux bonding/team drivers, which continue to support standard network aggregation.

In the talk we will present the architecture of the planned driver along with several configurations and supported offloads, and describe various aggregation modes including Active-Active, Active-Passive, resource allocation according to device affinity, and SR-IOV bonding configuration.
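
To make the aggregation modes concrete, here is a small Python sketch of how a bonded device might choose the underlying device for a new resource. It is not the planned driver's design: the `Slave` and `BondSelector` classes, the mode strings, the device names and the round-robin policy for Active-Active are all assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Slave:
    """Hypothetical underlying device of the bond."""

    name: str
    link_up: bool
    numa_node: int


class BondSelector:
    """Chooses the underlying device for each new resource."""

    def __init__(self, slaves, mode):
        self.slaves = slaves
        self.mode = mode  # "active-active", "active-passive" or "affinity"
        self._next = count()

    def pick(self, numa_node=None):
        alive = [s for s in self.slaves if s.link_up]
        if not alive:
            raise RuntimeError("no active slave")
        if self.mode == "active-passive":
            # Always the first live device; the others are standby.
            return alive[0]
        if self.mode == "affinity" and numa_node is not None:
            # Prefer a device on the caller's NUMA node when one is up.
            local = [s for s in alive if s.numa_node == numa_node]
            if local:
                return local[0]
        # Active-active (and the affinity fallback): spread round-robin.
        return alive[next(self._next) % len(alive)]


slaves = [Slave("ibdev0", True, 0), Slave("ibdev1", True, 1)]
spread = [BondSelector(slaves, "active-active").pick().name for _ in range(2)]
local = BondSelector(slaves, "affinity").pick(numa_node=1).name
standby = BondSelector(slaves, "active-passive")
before = standby.pick().name
slaves[0].link_up = False  # simulate a link failure on the active device
after = standby.pick().name
```

Note that `spread` builds a fresh selector for each pick, so both picks land on the first device; reusing one selector is what alternates between devices.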