Session

Characterizing IOTLB Wall for Multi-100-Gbps Linux-based Networking

Speakers

Alireza Farshin
Luigi Rizzo

Label

Moonshot

Session Type

Talk

Description

Many emerging technologies (e.g., AI, VR, and cloud VM hosting) require high network bandwidth. Consequently, NIC link speeds are quickly moving toward the 1 Tbps range and beyond. These speed increases, in turn, introduce new bottlenecks in the communication between system components.

In this talk, we focus on the performance impact of latency in transfers between the NIC and host memory—specifically, on how the IOMMU affects performance and how to mitigate its effects.

The core of the problem is that data transfers between the NIC and host memory have to traverse multiple blocks (i.e., the PCIe bus, address translation (IOMMU), the memory fabric, and the memory chips) and are subject to ordering constraints. There is limited buffering (high-speed multiport memories) between these blocks, and when the bandwidth-delay product of the path exceeds the amount of buffering, the system has to throttle traffic or possibly even drop it (e.g., in the case of received packets). As a reference, high-end systems can only cope with 1.5-2 µs of average latency.
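
As a back-of-the-envelope illustration of that budget, the sketch below (in Python) computes the data in flight implied by the figures above; the on-chip buffer size it compares against is a purely assumed placeholder, not a measured value.

    # Rough bandwidth-delay-product estimate for NIC <-> host memory transfers.
    # The 200 Gbps rate and ~2 us latency come from the text above; the amount
    # of on-chip buffering is an assumed placeholder.
    LINK_RATE_BPS = 200e9             # 200 Gbps link
    AVG_LATENCY_S = 2e-6              # ~2 us average completion latency
    ASSUMED_BUFFER_BYTES = 32 * 1024  # hypothetical on-chip buffering

    bytes_in_flight = LINK_RATE_BPS * AVG_LATENCY_S / 8
    print(f"data in flight: {bytes_in_flight / 1024:.0f} KiB")  # ~49 KiB

    if bytes_in_flight > ASSUMED_BUFFER_BYTES:
        print("buffering exceeded: traffic must be throttled or dropped")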

The IOMMU can consume a substantial amount of this budget, especially if each translation actually needs to walk the IOMMU page table. The table has a cache (IOTLB) with only a limited number of entries. Depending on the pattern of requests, it is quite possible to overflow the cache and cause a dramatic drop in performance, which we call the “IOTLB wall”. For instance, buffer management (e.g., the page_pool API in the Linux kernel) and various offloading capabilities of modern NICs (e.g., LRO and TSO) could affect the memory request pattern, and consequently the number of IOTLB misses.
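
To give a feel for why the request pattern matters, here is a minimal toy model of a fully associative, LRU IOTLB (in Python); the entry count, 4 KiB page size, and replacement policy are illustrative assumptions rather than the parameters of any particular IOMMU.

    from collections import OrderedDict

    # Toy fully associative, LRU IOTLB. Entry count, page size, and policy are
    # illustrative assumptions, not hardware parameters.
    IOTLB_ENTRIES = 64
    PAGE_SIZE = 4096

    def miss_rate(io_addresses):
        tlb, misses = OrderedDict(), 0
        for addr in io_addresses:
            page = addr // PAGE_SIZE
            if page in tlb:
                tlb.move_to_end(page)        # hit: refresh the LRU position
            else:
                misses += 1                  # miss: walk the IOMMU page table
                tlb[page] = True
                if len(tlb) > IOTLB_ENTRIES:
                    tlb.popitem(last=False)  # evict the least recently used entry
        return misses / len(io_addresses)

    # A small, recycled buffer pool stays within the modeled IOTLB...
    recycled = [(i % 32) * PAGE_SIZE for i in range(100_000)]
    # ...while buffers scattered over many pages keep overflowing it.
    scattered = [(i % 4096) * PAGE_SIZE for i in range(100_000)]
    print(miss_rate(recycled), miss_rate(scattered))  # ~0.0 vs 1.0

With the recycled pool the working set fits in the modeled cache and misses are negligible; with the scattered pool every access misses, which is the regime behind the IOTLB wall described above.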

In the talk, we will first model the problem, describe scenarios that exhibit the phenomenon, and show experimental data that match our model. More specifically, we characterize IOTLB behavior and its effects on recent Intel Xeon Scalable and AMD EPYC processors at 200 Gbps, analyzing the impact of the different factors that contribute to IOTLB misses and cause a throughput drop (up to 20% compared to the no-IOMMU case in our experiments).

Subsequently, we will discuss how the problem can be mitigated with various techniques: some obvious ones, such as using larger IOTLB mappings to reduce the number of entries needed, and some less obvious ones that try to increase address locality and thus make better use of the available IOTLB entries. We hope this talk serves as a call to arms to rethink Linux-based I/O management at high link rates, which will only continue to grow over time.
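
As a rough illustration of the first idea, the number of IOTLB entries needed to cover a given DMA working set shrinks with the mapping size; the 256 MiB working set below and the one-entry-per-mapping assumption are purely illustrative.

    # IOTLB entries needed to cover one DMA working set at different mapping sizes.
    # The 256 MiB working set is an illustrative assumption.
    WORKING_SET = 256 * 1024 * 1024

    for label, page_size in (("4 KiB", 4 << 10), ("2 MiB", 2 << 20), ("1 GiB", 1 << 30)):
        entries = -(-WORKING_SET // page_size)  # ceiling division
        print(f"{label} mappings -> {entries} IOTLB entries")
    # 4 KiB -> 65536 entries, 2 MiB -> 128, 1 GiB -> 1: larger mappings make it
    # far less likely that the working set overflows the IOTLB.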