Session
RPC Latency Breaker: Where Did My RPC Time Go?
Speakers
Satish Kumar
Fam Zheng
Label
Hands On
Session Type
Talk
Description
RPC latency in datacenter environments is difficult to diagnose because a single RPC spans multiple subsystems: the kernel receive stack on each host, application scheduling delays, business logic, and the physical network. Service mesh frameworks record end-to-end latency but cannot attribute it to a specific component.
SO_TIMESTAMPING provides kernel timestamps, but retrieving them from the socket error queue is slow and adds overhead, forcing most deployments to rely on sampling rather than per-RPC coverage. It also requires application changes to consume the data and correlate timestamps with the RPC identifier. Recent eBPF-based developments solve the slow-retrieval problem by collecting these timestamps directly in BPF and can correlate them with the originating system call, but they still cannot correlate them with the RPC.
Furthermore, current solutions emphasize TX-path, per-packet tracing, capturing time windows such as qdisc scheduling delays and ACK RTT. We argue that TX-path delays are rarely the source of problems — they occur within the application’s own execution context and are typically fast. Network RTT windows are already well-served by existing tools like BCC’s tcprtt.
We present RPC Latency Breaker, which decomposes every RPC into four components: (1) full RPC time, from the requester’s sendmsg to its recvmsg of the response; (2) NIC-to-softirq delay on both sides, capturing NIC processing and softirq scheduling latency; (3) softirq-to-application-read delay on both sides, capturing softirq processing and application scheduling delays; and (4) network RTT, reflecting physical network latency. The breakdown is derived solely from syscall-level hooks and SO_TIMESTAMPING RX timestamps, and it is produced for every RPC. It requires no per-packet tracing, no application changes, and no clock synchronization.
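To make the four components concrete, here is a minimal sketch of how such a breakdown could be computed once the per-host timestamps for one RPC are in hand. The `HostStamps` type, its field names, and the `breakdown` function are hypothetical and not taken from the tool. The sketch also assumes that the NIC and software RX timestamps on a host are comparable with that host's syscall timestamps, and it treats network RTT as the residual left after subtracting the responder's local span, which folds the (typically small) TX-path delays into the RTT figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HostStamps:
    """Timestamps for one RPC taken on ONE host, in that host's clock (ns)."""
    nic_rx: int      # NIC RX timestamp of the inbound message
    softirq_rx: int  # software RX timestamp taken in softirq context
    recv: int        # recvmsg hands the inbound message to the application
    send: int        # sendmsg of the outbound message


def breakdown(req: HostStamps, resp: HostStamps) -> dict:
    """Split one RPC into components using only same-host differences."""
    # Time the responder held the RPC, measured on the responder's clock.
    responder_span = resp.send - resp.nic_rx
    return {
        "full_rpc": req.recv - req.send,
        "req_nic_to_softirq": req.softirq_rx - req.nic_rx,
        "req_softirq_to_read": req.recv - req.softirq_rx,
        "resp_nic_to_softirq": resp.softirq_rx - resp.nic_rx,
        "resp_softirq_to_read": resp.recv - resp.softirq_rx,
        # Assumption: time between the responder's read and its reply
        # is attributed to application (business logic) work.
        "resp_app_time": resp.send - resp.recv,
        # Assumption: RTT is the residual of the requester-side wait.
        "network_rtt": (req.nic_rx - req.send) - responder_span,
    }


# Requester clock starts near 1_000 ns; responder clock is offset by ~1 ms.
req = HostStamps(nic_rx=1_900, softirq_rx=1_920, recv=1_950, send=1_000)
resp = HostStamps(nic_rx=1_000_200, softirq_rx=1_000_210,
                  recv=1_000_250, send=1_000_700)
result = breakdown(req, resp)
```

In this sketch every component other than `full_rpc` sums back to `full_rpc`, so nothing is counted twice and nothing is left unattributed.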
Correlating timestamps with individual RPCs — without application-level RPC identifiers — is solved by exploiting the ping-pong pattern inherent in RPC traffic: on a given TCP flow, sendmsg and recvmsg events strictly alternate between requester and responder. We detect this pattern by analyzing time gaps between consecutive events and pair timestamps across the two endpoints using TCP socket sequence numbers: a sender’s write_seq at sendmsg precisely matches the receiver’s copied_seq at the corresponding recvmsg, establishing a per-RPC mapping without any application cooperation.
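The pairing step can be sketched as a lookup keyed on sequence numbers. The `SockEvent` type, the `direction` label, and both functions below are hypothetical names chosen for illustration. The sketch assumes `write_seq` and `copied_seq` are sampled at a point where they are equal for a whole message (for example at syscall return, with no partial reads), and it ignores 32-bit sequence wraparound. The alternation check is a simplification and does not reproduce the time-gap analysis described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SockEvent:
    direction: str  # hypothetical label for one direction of a TCP flow
    kind: str       # "send" or "recv"
    seq: int        # write_seq for a send, copied_seq for a recv
    ts_ns: int      # timestamp in the LOCAL host's clock


def is_ping_pong(events):
    """True if one host's events on a flow strictly alternate send/recv."""
    kinds = [e.kind for e in sorted(events, key=lambda e: e.ts_ns)]
    return len(kinds) >= 2 and all(a != b for a, b in zip(kinds, kinds[1:]))


def pair_by_seq(sends, recvs):
    """Match each recv to the send whose sequence number it equals."""
    by_seq = {(e.direction, e.seq): e for e in sends if e.kind == "send"}
    pairs = []
    for r in recvs:
        if r.kind != "recv":
            continue
        s = by_seq.get((r.direction, r.seq))
        if s is not None:
            pairs.append((s, r))  # timestamps stay in their own clocks
    return pairs


# Requester sends two requests; the responder reads them on its own clock.
sender_events = [
    SockEvent("A->B", "send", 1100, 10),
    SockEvent("A->B", "send", 1250, 900),
]
receiver_events = [
    SockEvent("A->B", "recv", 1100, 5_000_020),
    SockEvent("A->B", "recv", 1250, 5_000_910),
]
pairs = pair_by_seq(sender_events, receiver_events)
```

The matched pairs only say which events belong to the same message. The two timestamps in a pair come from different clocks and are never subtracted from each other.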
Clock synchronization across the two hosts is not required because timestamps from different clock sources are never directly compared. Each latency component is computed by differencing timestamps from the same host: NIC-to-softirq and softirq-to-read delays are purely local measurements, and full RPC time is derived from the requester’s own clock alone. Socket sequence number alignment identifies which events correspond across hosts but carries no timing information from the remote clock. As a result, even large inter-host clock skews do not affect the accuracy of the breakdown.
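The cancellation can be written out explicitly. As a sketch, assume the responder's clock reads $t + \delta$ whenever the requester's clock reads $t$, with the skew $\delta$ unknown but constant over one RPC. The symbols are illustrative and not taken from the talk, and the residual form used for the RTT is an assumption about how the last component could be obtained.

```latex
\begin{align*}
\Delta_{\text{resp}}
  &= \bigl(t^{\text{resp}}_{\text{send}} + \delta\bigr)
   - \bigl(t^{\text{resp}}_{\text{nic}} + \delta\bigr)
   = t^{\text{resp}}_{\text{send}} - t^{\text{resp}}_{\text{nic}} \\
T_{\text{full}}
  &= t^{\text{req}}_{\text{recv}} - t^{\text{req}}_{\text{send}} \\
\text{RTT}
  &\approx \bigl(t^{\text{req}}_{\text{nic}} - t^{\text{req}}_{\text{send}}\bigr)
   - \Delta_{\text{resp}}
\end{align*}
```

The skew $\delta$ appears only inside $\Delta_{\text{resp}}$, where it cancels, so every term is a difference of two readings from the same clock.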
We will present case studies from our production environment demonstrating how the tool has identified a range of RPC latency problems. Using this breakdown, we can readily categorize an issue as a slow physical network, application scheduling delay, or slow business logic. Once an issue is categorized, it is immediately clear where to focus the investigation.
Important Dates
| Closing of CFS | June 1st |
| Notification by | June 10th |
| Conference dates | July 13th-16th |