# PPing using XDP and TC-BPF

A re-implementation of [Kathie Nichols' passive ping
(pping)](https://github.com/pollere/pping) utility using XDP (on ingress) and
TC-BPF (on egress) for the packet capture logic.

## Simple description

Passive Ping (PPing) is a simple tool for passively measuring per-flow RTTs. It
can be used on end hosts as well as on any (BPF-capable Linux) device which can
see both directions of the traffic (e.g. a router or middlebox). Currently it
only works for TCP traffic which uses the TCP timestamp option, but it could be
extended to also work with, for example, TCP seq/ACK numbers, the QUIC spin bit
and ICMP echo-reply messages. See the [TODO-list](./TODO.md) for more potential
features (which may or may not ever get implemented).

The fundamental logic of pping is to timestamp a pseudo-unique identifier for
outgoing packets, and then look for matches in the incoming packets. If a match
is found, the RTT is simply calculated as the time difference between the
current time and the stored timestamp.
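
In BPF terms this can be reduced to one hash-map update on egress and one lookup
on ingress. The sketch below only illustrates the idea; unlike the real
`pping_kern.c`, it keys `packet_ts` on just the identifier rather than on the
flow tuple plus identifier.

```c
/* Minimal sketch of the core idea (not the actual pping_kern.c code).
 * The real implementation keys packet_ts on both the flow tuple and the
 * identifier; here only the identifier is used, to keep it short. */
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, __u32);   /* packet identifier, e.g. TCP TSval */
	__type(value, __u64); /* send timestamp in nanoseconds */
} packet_ts SEC(".maps");

/* Egress: remember when an identifier was first seen */
static __always_inline void timestamp_identifier(__u32 id)
{
	__u64 now = bpf_ktime_get_ns();

	bpf_map_update_elem(&packet_ts, &id, &now, BPF_NOEXIST);
}

/* Ingress: if the echoed identifier has a stored timestamp,
 * RTT = now - stored timestamp */
static __always_inline __u64 lookup_rtt(__u32 echoed_id)
{
	__u64 *tstamp = bpf_map_lookup_elem(&packet_ts, &echoed_id);

	if (!tstamp)
		return 0;
	return bpf_ktime_get_ns() - *tstamp;
}
```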

This tool, just as Kathie's original pping implementation, uses TCP timestamps
as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
itself) is timestamped. Incoming packets are then parsed for the TSecr, which is
the echoed TSval from the receiver. TCP timestamps are not necessarily unique
for every packet (they have a limited update frequency, which appears to be
1000 Hz on modern Linux systems), so only the first instance of an identifier is
timestamped and matched against the first incoming packet carrying that
identifier. The mechanism that ensures only the first packet is timestamped and
matched differs from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
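
The sketch below shows one way the TCP timestamp option (kind 8, length 10)
could be located and its TSval/TSecr extracted. It is a simplified stand-in for
the actual parser in `pping_kern.c`, and assumes the TCP header itself has
already been validated against `data_end`.

```c
/* Simplified sketch of parsing the TCP timestamp option. A real BPF
 * parser must bound every dereference against data_end (as done below)
 * and keep the option walk verifier-friendly. */
#include <linux/types.h>
#include <linux/tcp.h>
#include <bpf/bpf_endian.h>

struct tcp_timestamp_opt {
	__u8 kind;   /* 8 = timestamp option */
	__u8 len;    /* always 10 */
	__u32 tsval; /* outgoing identifier (egress) */
	__u32 tsecr; /* echoed identifier (ingress) */
} __attribute__((packed));

static int parse_tcp_ts(struct tcphdr *tcph, void *data_end,
			__u32 *tsval, __u32 *tsecr)
{
	__u8 *pos = (__u8 *)(tcph + 1); /* options follow the fixed header */
	int opt_len = tcph->doff * 4 - sizeof(struct tcphdr);
	int i;

	/* TCP options occupy at most 40 bytes */
	for (i = 0; i < 40 && opt_len > 0; i++) {
		struct tcp_timestamp_opt *opt = (struct tcp_timestamp_opt *)pos;

		if (pos + 2 > (__u8 *)data_end)
			return -1;
		if (opt->kind == 0) /* end-of-options */
			return -1;
		if (opt->kind == 1) { /* NOP, single byte */
			pos++;
			opt_len--;
			continue;
		}
		if (opt->kind == 8 && opt->len == 10) {
			if ((void *)(opt + 1) > data_end)
				return -1;
			*tsval = bpf_ntohl(opt->tsval);
			*tsecr = bpf_ntohl(opt->tsecr);
			return 0;
		}
		if (opt->len < 2)
			return -1;
		pos += opt->len;
		opt_len -= opt->len;
	}
	return -1;
}
```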

## Design and technical description

### Files:
- **pping.c:** Userspace program that loads and attaches the BPF programs, polls
  the perf-buffer `rtt_events` to print out RTT messages, and periodically cleans
  up the hash-maps from old entries. Also passes user options to the BPF
  programs by setting a "global variable" (stored in the program's .rodata
  section); a sketch of this mechanism follows after this list.
- **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and
  XDP (ingress), as well as several common functions, a global constant `config`
  (set from userspace) and map definitions. The tc program `pping_egress()`
  parses outgoing packets for identifiers. If an identifier is found and the
  sampling strategy allows it, a timestamp for the packet is created in
  `packet_ts`. The XDP program `pping_ingress()` parses incoming packets for an
  identifier. If one is found, it looks up the `packet_ts` map for a match on the
  reverse flow (to match source/dest on egress). If there is a match, it
  calculates the RTT from the stored timestamp and deletes the entry. The
  calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
  `rtt_events`.
- **bpf_egress_loader.sh:** A shell script that's used by `pping.c` to set up a
  clsact qdisc and attach the `pping_egress()` program to egress using
  tc. **Note**: Unless your iproute2 comes with libbpf support, tc will use
  iproute's own loading mechanism when loading and attaching object files
  directly through the tc command line. To ensure that libbpf is always used to
  load `pping_egress()`, `pping.c` actually loads the program and pins it to
  `/sys/fs/bpf/pping/classifier`, and tc only attaches the pinned program.
- **functions.sh and parameters.sh:** Imported by `bpf_egress_loader.sh`.
- **pping.h:** Common header file included by `pping.c` and
  `pping_kern.c`. Contains some common structs used by both (which are part of
  the maps).
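
To make the "global variable" mechanism concrete, here is a small sketch of how
such a config constant could be declared in the BPF program and set from
userspace before loading. The field names, and the use of a libbpf skeleton,
are illustrative assumptions rather than the actual pping code.

```c
#include <linux/types.h>

/* In the BPF program: a constant placed in the .rodata section.
 * The field shown here is an example, not the actual pping config. */
struct config {
	__u64 rate_limit_ns; /* e.g. minimum time between per-flow timestamps */
};

const volatile struct config config = { 0 };

/* In the userspace loader (sketch, assuming a libbpf skeleton generated
 * with "bpftool gen skeleton pping_kern.o"). .rodata is frozen at load
 * time, so it must be written between open and load:
 *
 *     struct pping_kern *skel = pping_kern__open();
 *     skel->rodata->config.rate_limit_ns = 100 * 1000 * 1000;
 *     err = pping_kern__load(skel);
 */
```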

### BPF Maps:
- **flow_state:** A hash-map storing some basic state for each flow, such as the
  last seen identifier for the flow and when the last timestamp entry for the
  flow was created. Entries are created by `pping_egress()`, and can be updated
  or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
  are eventually removed by `pping.c`. Pinned at `/sys/fs/bpf/pping`.
- **packet_ts:** A hash-map storing a timestamp for a specific packet
  identifier. Entries are created by `pping_egress()` and removed by
  `pping_ingress()` if a match is found. Leftover entries are eventually
  removed by `pping.c`. Pinned at `/sys/fs/bpf/pping`.
- **rtt_events:** A perf-buffer used by `pping_ingress()` to push calculated RTTs
  to `pping.c`, which continuously polls the buffer to print out the RTTs. A
  sketch of how these maps might be declared follows below.
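
The sketch below shows how these maps might be declared in `pping_kern.c` using
BTF-style map definitions. The key/value structs and sizes are simplified
stand-ins for the real definitions (which live in `pping.h`), and the pinning
assumes the loader sets the pin root path to `/sys/fs/bpf/pping`.

```c
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* Simplified stand-ins for the structs in pping.h */
struct flow_tuple {
	__u32 saddr, daddr;
	__u16 sport, dport;
};

struct flow_state {
	__u32 last_id;         /* last seen outgoing identifier */
	__u64 last_timestamp;  /* when the last packet_ts entry was created */
	__u64 packets, bytes;  /* simplified per-flow counters */
};

struct packet_id {
	struct flow_tuple flow;
	__u32 id;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct flow_tuple);
	__type(value, struct flow_state);
	__uint(pinning, LIBBPF_PIN_BY_NAME); /* under the pin root, e.g. /sys/fs/bpf/pping */
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct packet_id);
	__type(value, __u64); /* timestamp in ns */
	__uint(pinning, LIBBPF_PIN_BY_NAME);
} packet_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} rtt_events SEC(".maps");
```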

### A note on concurrency

The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
programs need to see the global view to function properly, using `PERCPU` maps
is not an option. The program must be able to match against stored packet
timestamps regardless of which CPU the packets are processed on, and must also
have a global view of the flow state in order for the sampling to work
correctly.

As the BPF programs may run concurrently on different CPU cores accessing these
global hash maps, this may result in some concurrency issues. In practice, I do
not believe these will occur particularly often, as I'm under the impression
that packets from the same flow will typically be processed by the same
CPU. Furthermore, most of the concurrency issues will not be that problematic
even if they do occur. For now, I've therefore left these concurrency issues
unattended, even though some of them could be avoided with atomic operations
and/or spinlocks, in order to keep things simple and not hurt performance.

The (known) potential concurrency issues are:

#### Tracking last seen identifier

The tc/egress program keeps track of the last seen outgoing identifier for each
flow by storing it in the `flow_state` map. This is done to detect the first
packet with a new identifier. If multiple packets are processed concurrently,
several of them could potentially detect themselves as being first with the same
identifier (which only matters if they also pass the rate-limit check).
Alternatively, if the concurrent packets have different identifiers, there may be
a lost update (but for TCP timestamps, concurrent packets would typically be
expected to have the same timestamp).

A possibly more severe issue is out-of-order packets. If a packet with an old
identifier arrives out of order, that identifier could be detected as a new
identifier. Consider, for example, the following sequence of four packets with
just two different identifiers (id1 and id2):

    id1 -> id2 -> id1 -> id2

The tc/egress program would consider each of these packets to carry a new
identifier and try to create a new timestamp for each of them, if the sampling
strategy allows it. However, even if the sampling strategy allows it, the
(incorrect) creation of timestamps for id1 and id2 the second time would only
succeed if the first timestamps for id1 and id2 have already been matched
against (and thus deleted). Even if that is the case, they would only result in
reporting an incorrect RTT if there are also new matches against these
identifiers.

This issue could be avoided entirely by requiring that new-id > old-id instead
of simply checking that new-id != old-id, as TCP timestamps should monotonically
increase. That may however not be a suitable solution if/when we add support
for other types of identifiers.
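
Since TCP timestamps are 32-bit values that can wrap around, such a check would
likely need to be a serial-number style comparison rather than a plain `>`. A
minimal sketch of what that could look like (not what the current code does):

```c
/* Wrap-around safe "strictly newer than" check for 32-bit identifiers,
 * in the same spirit as the kernel's before()/after() macros for TCP
 * sequence numbers. Not what the current code does - it only checks
 * new_id != old_id. */
static __always_inline int id_is_newer(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}
```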

#### Rate-limiting new timestamps

In the tc/egress program, packets to timestamp are sampled by using a per-flow
rate-limit, which is enforced by storing in the `flow_state` map when the last
timestamp was created. If multiple packets perform this check concurrently, it's
possible that several of them conclude they are allowed to create timestamps
before any of them is able to update `last_timestamp`. When they update
`last_timestamp` it might also be slightly incorrect; however, if they are
processed concurrently they should also generate very similar timestamps.
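
The race exists because the check and the update are two separate, non-atomic
steps, roughly as in the sketch below (the field and parameter names are
illustrative):

```c
/* Sketch of the per-flow rate-limit check in the egress path. Between
 * the comparison and the store, another CPU can run the same code for
 * a different packet of the flow and also conclude it may timestamp. */
static __always_inline int allow_new_timestamp(struct flow_state *f_state,
					       __u64 limit_ns)
{
	__u64 now = bpf_ktime_get_ns();

	if (now - f_state->last_timestamp < limit_ns)
		return 0; /* too soon, skip timestamping this packet */

	f_state->last_timestamp = now; /* not atomic - lost updates possible */
	return 1;
}
```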

If the packets have different identifiers (which would typically not be
expected for concurrent TCP timestamps), this would allow some packets to
bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
additional map space and report a few more RTTs than expected (however, the
reported RTTs should still be correct).

If the packets have the same identifier, they must first have managed to bypass
the previous check for unique identifiers (see the [previous
point](#tracking-last-seen-identifier)), and only one of them will be able to
successfully store a timestamp entry.

#### Matching against stored timestamps

The XDP/ingress program could potentially match multiple concurrent packets with
the same identifier against a single timestamp entry in `packet_ts` before any
of them manages to delete the timestamp entry. This would result in multiple RTTs
being reported for the same identifier, but if they are processed concurrently
these RTTs should be very similar, so it would mainly result in over-reporting
rather than reporting incorrect RTTs.
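
This happens because the lookup and the delete are two separate map operations,
roughly as sketched below (simplified; the real event pushed to `rtt_events`
also carries the flow tuple):

```c
/* Sketch of the match-and-report step in the ingress path. Two CPUs
 * handling packets with the same identifier can both get past the
 * lookup before either delete takes effect, so the same stored
 * timestamp may be reported more than once. */
static __always_inline void report_rtt_if_match(struct xdp_md *ctx,
						struct packet_id *pid)
{
	__u64 *tstamp = bpf_map_lookup_elem(&packet_ts, pid);
	__u64 rtt;

	if (!tstamp)
		return;

	rtt = bpf_ktime_get_ns() - *tstamp;
	bpf_map_delete_elem(&packet_ts, pid);
	bpf_perf_event_output(ctx, &rtt_events, BPF_F_CURRENT_CPU,
			      &rtt, sizeof(rtt));
}
```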

#### Updating flow statistics

Both the tc/egress and XDP/ingress programs will try to update some flow
statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and bytes
sent/received. This is not done in an atomic fashion, so there could potentially
be some lost updates resulting in an underestimate.
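
If these lost updates ever turn out to matter, the counters could be updated
with atomic adds instead (one of the atomic operations mentioned above); a
sketch with illustrative field names:

```c
/* Sketch: using atomic adds for the per-flow counters would avoid lost
 * updates at a small cost. Field names are illustrative; the current
 * code uses plain (non-atomic) additions. */
static __always_inline void count_packet(struct flow_state *f_state,
					 __u64 pkt_len)
{
	__sync_fetch_and_add(&f_state->packets, 1);
	__sync_fetch_and_add(&f_state->bytes, pkt_len);
}
```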

Furthermore, whenever the XDP/ingress program calculates an RTT, it will check
if this is the lowest RTT seen so far for the flow. If multiple RTTs are
calculated concurrently, several of them could pass this check and there may be
a lost update. It should only be possible for multiple RTTs to be calculated
concurrently if either the [timestamp rate-limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
against the same timestamp](#matching-against-stored-timestamps).

It's worth noting that with sampling the reported minimum RTT is only an
estimate anyway (the packet with the true minimum RTT may never have its RTT
calculated). And even without sampling there is some inherent sampling, due to
TCP timestamps only being updated at a limited rate (1000 Hz).

## Similar projects

Passively measuring the RTT for TCP traffic is not a novel concept, and there
exist a number of other tools that can do so. A good overview of how passive
RTT calculation using TCP timestamps (as in this project) works is provided in
[this paper](https://doi.org/10.1145/2523426.2539132) from 2013.

- [pping](https://github.com/pollere/pping): This project is largely a
  re-implementation of Kathie's pping, but by using BPF and XDP, as well as
  implementing some filtering logic, the hope is to be able to create an
  always-on tool that can scale well even to large numbers of massive flows.
- [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
  the "machine-friendly" (`-m`) output from Kathie's pping tool. Running this
  implementation of pping with `--format="ppviz"` will generate output that can
  be used by ppviz.
- [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
  can analyze a tcpdump file and, among other things, calculate RTTs based on
  seq/ACK numbers (`-r` or `-R` flag).
- **Dapper**: A passive TCP data-plane monitoring tool implemented in P4 which
  can, among other things, calculate the RTT based on matching seq/ACK
  numbers. [Paper](https://doi.org/10.1145/3050220.3050228). [Unofficial
  source](https://github.com/muhe1991/p4-programs-survey/tree/master/dapper).
- [P4 Tofino TCP RTT measurement](https://github.com/Princeton-Cabernet/p4-projects/tree/master/RTT-tofino):
  A passive TCP RTT monitor based on seq/ACK numbers implemented in P4 for
  Tofino programmable switches. [Paper](https://doi.org/10.1145/3405669.3405823).