pping: Update README with info on concurrency issues

Also, remove comments about concurrency issues from code in pping_kern.c,
as they are now documented in the README.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
--- a/pping/README.md
+++ b/pping/README.md
@@ -15,7 +15,7 @@ echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
 The fundamental logic of pping is to timestamp a pseudo-unique identifier for
 outgoing packets, and then look for matches in the incoming packets. If a match
 is found, the RTT is simply calculated as the time difference between the
-current time and the timestamp.
+current time and the stored timestamp.
 
 This tool, just as Kathie's original pping implementation, uses TCP timestamps
 as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
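In code, that fundamental logic boils down to roughly the following. This is an illustrative sketch with simplified declarations, not the actual pping_kern.c code; `packet_ts` and `p_id` are the map and packet identifier described in this README.

```C
/* Egress: timestamp the outgoing packet's identifier. */
__u64 now = bpf_ktime_get_ns();
bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);

/* Ingress: a reply echoing the identifier yields the RTT as the
 * difference between the current time and the stored timestamp. */
__u64 *ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (ts) {
        __u64 rtt = bpf_ktime_get_ns() - *ts;
        /* ... report rtt and delete the entry ... */
}
```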
@@ -73,6 +73,98 @@ matched differs from the one in Kathie's pping, and is further described in
 - **rtt_events:** A perf-buffer used by `pping_ingress()` to push calculated RTTs
   to `pping.c`, which continuously polls the map to print out the RTTs.
 
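On the userspace side, polling such a perf-buffer follows the standard libbpf pattern. A minimal sketch with a recent libbpf (the event layout, names, and buffer sizes here are illustrative assumptions, not the actual pping.c code):

```C
#include <stdio.h>
#include <linux/types.h>
#include <bpf/libbpf.h>

/* Illustrative event layout; must match what the BPF side pushes. */
struct rtt_event {
        __u64 rtt;
        /* ... flow tuple, timestamp, min_rtt ... */
};

/* Invoked by libbpf for every event pushed to the rtt_events buffer. */
static void handle_rtt_event(void *ctx, int cpu, void *data, __u32 size)
{
        const struct rtt_event *e = data;
        printf("RTT: %llu ns\n", (unsigned long long)e->rtt);
}

/* After loading the BPF object: set up and poll the buffer. */
static int poll_rtt_events(int rtt_events_map_fd)
{
        struct perf_buffer *pb;

        pb = perf_buffer__new(rtt_events_map_fd, 8 /* pages per CPU */,
                              handle_rtt_event, NULL /* lost-event cb */,
                              NULL, NULL);
        if (!pb)
                return -1;

        while (perf_buffer__poll(pb, 100 /* timeout, ms */) >= 0)
                ; /* keep polling until error or signal */

        perf_buffer__destroy(pb);
        return 0;
}
```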
+### A note on concurrency
+The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
+programs need to see the global view to function properly, using `PERCPU` maps
+is not an option. The program must be able to match against stored packet
+timestamps regardless of the CPU the packets are processed on, and must also
+have a global view of the flow state in order for the sampling to work
+correctly.
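For illustration, this is roughly what that choice looks like in a BTF-style libbpf map definition. This is a sketch only; the key/value types and sizes are assumptions, not the exact definitions in pping_kern.c.

```C
/* Shared across all CPUs: an egress timestamp stored on one core can
 * be looked up by the ingress program on any other core. */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);   /* illustrative size */
        __type(key, struct packet_id);
        __type(value, __u64);         /* timestamp in ns */
} packet_ts SEC(".maps");

/* A BPF_MAP_TYPE_PERCPU_HASH variant would give every CPU its own
 * private copy of each entry, so cross-CPU timestamp matching and
 * flow sampling would break. */
```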
+
+As the BPF programs may run concurrently on different CPU cores accessing these
+global hash maps, this may result in some concurrency issues. In practice, I do
+not believe these will occur particularly often, as I'm under the impression
+that packets from the same flow will typically be processed by the same
+CPU. Furthermore, most of the concurrency issues will not be that problematic
+even if they do occur. For now, I've therefore left these concurrency issues
+unattended, even if some of them could be avoided with atomic operations and/or
+spinlocks, in order to keep things simple and not hurt performance.
+
+The (known) potential concurrency issues are:
+
+#### Tracking last seen identifier
+The tc/egress program keeps track of the last seen outgoing identifier for each
+flow, by storing it in the `flow_state` map. This is done to detect the first
+packet with a new identifier. If multiple packets are processed concurrently,
+several of them could potentially detect themselves as being first with the
+same identifier (which only matters if they also pass the rate-limit check).
+Alternatively, if the concurrent packets have different identifiers there may
+be a lost update (but for TCP timestamps, concurrent packets would typically
+be expected to have the same timestamp).
+
+A possibly more severe issue is out-of-order packets. If a packet with an old
+identifier arrives out of order, that identifier could be detected as a new
+identifier. If, for example, the following flow of four packets with just two
+different identifiers (id1 and id2) were to occur:
+
+id1 -> id2 -> id1 -> id2
+
+then the tc/egress program would consider each of these packets to have a new
+identifier and try to create a new timestamp for each of them, if the sampling
+strategy allows it. However, even if the sampling strategy allows it, the
+(incorrect) creation of timestamps for id1 and id2 the second time would only
+be successful in case the first timestamps for id1 and id2 have already been
+matched against (and thus deleted). Even if that is the case, they would only
+result in reporting an incorrect RTT in case there are also new matches against
+these identifiers.
+
+This issue could be avoided entirely by requiring that new-id > old-id instead
+of simply checking that new-id != old-id, as TCP timestamps should increase
+monotonically. That may however not be a suitable solution if/when we add
+support for other types of identifiers.
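A hedged sketch of what that stricter check could look like, next to the current one (assuming the identifier is the 32-bit TSval, so the comparison needs to be wrap-aware in the style of RFC 7323 sequence-number arithmetic):

```C
/* Current check: any identifier change counts as new, so an
 * out-of-order packet with an old TSval is also treated as new. */
if (f_state->last_id == p_id.identifier)
        goto out;

/* Stricter alternative: only strictly increasing identifiers count.
 * The signed cast makes the comparison safe across 32-bit wraparound. */
if ((__s32)(p_id.identifier - f_state->last_id) <= 0)
        goto out;
```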
+
+#### Rate-limiting new timestamps
+In the tc/egress program, packets to timestamp are sampled by using a per-flow
+rate-limit, which is enforced by storing when the last timestamp was created in
+the `flow_state` map. If multiple packets perform this check concurrently, it's
+possible that multiple packets think they are allowed to create timestamps
+before any of them are able to update the `last_timestamp`. When they update
+`last_timestamp` it might also be slightly incorrect; however, if they are
+processed concurrently then they should also generate very similar timestamps.
+
+If the packets have different identifiers (which would typically not be
+expected for concurrent TCP timestamps), then this would allow some packets to
+bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
+additional map space and report some additional RTT(s) more than expected
+(however, the reported RTTs should still be correct).
+
+If the packets have the same identifier, they must first have managed to bypass
+the previous check for unique identifiers (see [previous
+point](#tracking-last-seen-identifier)), and only one of them will be able to
+successfully store a timestamp entry.
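The reason only one of them can succeed is the create-only semantics used when inserting the timestamp entry. A sketch of the pattern, assuming the entry is created with the `BPF_NOEXIST` flag, which makes `bpf_map_update_elem()` fail if the key already exists:

```C
__u64 now = bpf_ktime_get_ns();

/* Create-only insert: if two CPUs race past the rate-limit check with
 * the same identifier, the kernel lets exactly one create the entry. */
if (bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST) == 0)
        f_state->last_timestamp = now;
```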
+
+#### Matching against stored timestamps
+The XDP/ingress program could potentially match multiple concurrent packets
+with the same identifier against a single timestamp entry in `packet_ts`,
+before any of them manage to delete the timestamp entry. This would result in
+multiple RTTs being reported for the same identifier, but if they are processed
+concurrently these RTTs should be very similar, so this would mainly result in
+over-reporting rather than reporting incorrect RTTs.
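If this over-reporting ever became a problem, the deletion itself could arbitrate between the racing packets, since only one delete of a given entry can succeed. A hedged sketch of that alternative (not what the current code does; `report_rtt()` is a hypothetical stand-in for pushing the event):

```C
__u64 *p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (!p_ts)
        goto out;

__u64 rtt = bpf_ktime_get_ns() - *p_ts;

/* bpf_map_delete_elem() returns 0 only for the packet that actually
 * removes the entry; concurrent matchers get an error and stay quiet. */
if (bpf_map_delete_elem(&packet_ts, &p_id) == 0)
        report_rtt(&rtt); /* hypothetical reporting helper */
```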
+
+#### Updating flow min-RTT
+Whenever the XDP/ingress program calculates an RTT, it will check if this is
+the lowest RTT seen so far for the flow. If multiple RTTs are calculated
+concurrently, then several could pass this check concurrently and there may be
+a lost update. It should only be possible for multiple RTTs to be calculated
+concurrently in case either the [timestamp rate-limit was
+bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
+against the same timestamp](#matching-against-stored-timestamps).
+
+It's worth noting that with sampling, the reported minimum RTT is only an
+estimate anyway (the RTT may never be calculated for the packet with the true
+minimum RTT). And even without sampling there is some inherent sampling due to
+TCP timestamps only being updated at a limited rate (1000 Hz).
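For completeness, the lost update is a plain check-then-write. If it ever needed to be exact, a compare-and-swap retry loop would be one option on kernels with BPF atomics support (v5.12+, via clang's `__sync_val_compare_and_swap`). A sketch, assuming `min_rtt` is a `__u64` field:

```C
/* Racy check-then-write: two CPUs can both pass the check, and the
 * larger RTT may be the one left in the map (a lost update). */
if (event.rtt < f_state->min_rtt)
        f_state->min_rtt = event.rtt;

/* CAS alternative: retry until our RTT is stored or a smaller one is. */
__u64 old = f_state->min_rtt;
while (event.rtt < old) {
        __u64 seen = __sync_val_compare_and_swap(&f_state->min_rtt,
                                                 old, event.rtt);
        if (seen == old)
                break;  /* our value won */
        old = seen;     /* re-check against the concurrent update */
}
```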
+
 ## Similar projects
 Passively measuring the RTT for TCP traffic is not a novel concept, and there
 exist a number of other tools that can do so. A good overview of how passive
@@ -84,9 +176,9 @@ RTT calculation using TCP timestamps (as in this project) works is provided in
 implementing some filtering logic the hope is to be able to create an always-on
 tool that can scale well even to large amounts of massive flows.
 - [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
-  the "machine-friendly" output from Kathie's pping tool. If/when we implement a
-  similar machine readable output option it should hopefully work with this
-  implementation as well.
+  the "machine-friendly" (-m) output from Kathie's pping tool. Running this
+  implementation of pping with --format="ppviz" will generate output that can be
+  used by ppviz.
 - [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
   can analyze a tcpdump file and among other things calculate RTTs based on
   seq/ACK numbers (`-r` or `-R` flag).
--- a/pping/pping_kern.c
+++ b/pping/pping_kern.c
@@ -265,36 +265,11 @@ int pping_egress(struct __sk_buff *skb)
 	}
 
 	// Check if identifier is new
-	/* The gap between checking and updating last_id may cause concurrency
-	 * issues where multiple packets may simultaneously think they are the
-	 * first with a new identifier. As long as all of the identifiers are
-	 * the same though, only one should be able to create a timestamp entry.
-	 *
-	 * A bigger issue is that older identifiers (for example due to
-	 * out-of-order packets) may pass this check and update the current
-	 * identifier to an old one. This means that both the packet with the
-	 * old identifier itself as well as the next packet with the current
-	 * identifier may be considered packets with new identifiers (even if
-	 * both have been seen before). For TCP timestamps this could be
-	 * prevented by changing the check to '>=' instead, but it may not be
-	 * suitable for other protocols, such as QUIC and its spinbit.
-	 *
-	 * For now, just hope that the rate limit saves us from creating an
-	 * incorrect timestamp. That may however also fail, either due to it
-	 * happening at a time when it's not limited by rate sampling, or
-	 * because of the rate check failing due to concurrency issues.
-	 */
 	if (f_state->last_id == p_id.identifier)
 		goto out;
 	f_state->last_id = p_id.identifier;
 
 	// Check rate-limit
-	/*
-	 * The window between checking and updating last_timestamp may cause
-	 * concurrency issues, where multiple packets simultaneously pass the
-	 * rate limit. However, as long as they have the same identifier, only
-	 * a single timestamp entry should successfully be created.
-	 */
 	p_ts = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
 	if (p_ts < f_state->last_timestamp ||
 	    p_ts - f_state->last_timestamp < config.rate_limit)
@@ -341,14 +316,11 @@ int pping_ingress(struct xdp_md *ctx)
 
 	event.rtt = now - *p_ts;
 	event.timestamp = now;
-	/*
-	 * Attempt to delete timestamp entry as soon as RTT is calculated.
-	 * But could have potential concurrency issue where multiple packets
-	 * manage to match against the identifier before it can be deleted.
-	 */
+	// Delete timestamp entry as soon as RTT is calculated
 	bpf_map_delete_elem(&packet_ts, &p_id);
 
-	// Update flow's min-RTT, may have concurrency issues
+	// Update flow's min-RTT
 	f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
 	if (!f_state)
 		goto validflow_out;
@@ -358,6 +330,7 @@ int pping_ingress(struct xdp_md *ctx)
 
 	event.min_rtt = f_state->min_rtt;
 
+	// Push event to perf-buffer
 	__builtin_memcpy(&event.flow, &p_id.flow, sizeof(struct network_tuple));
 	bpf_perf_event_output(ctx, &rtt_events, BPF_F_CURRENT_CPU, &event,
 			      sizeof(event));