pping: Update README with info on concurrency issues

Also, remove comments about concurrency issues from code in
pping_kern.c as it is now documented in README.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Author: Simon Sundberg
Date:   2021-05-06 17:54:31 +02:00
parent d92109b3c8
commit fb454cd716
2 changed files with 100 additions and 35 deletions

README.md

@@ -15,7 +15,7 @@ echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
The fundamental logic of pping is to timestamp a pseudo-unique identifier for
outgoing packets, and then look for matches in the incoming packets. If a match
is found, the RTT is simply calculated as the time difference between the
current time and the stored timestamp.

This tool, just as Kathie's original pping implementation, uses TCP timestamps
as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
@@ -73,6 +73,98 @@ matched differs from the one in Kathie's pping, and is further described in
- **rtt_events:** A perf-buffer used by `pping_ingress()` to push calculated
  RTTs to `pping.c`, which continuously polls the map to print out the RTTs.

### A note on concurrency
The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
programs need to see the global view to function properly, using `PERCPU` maps
is not an option. The program must be able to match against stored packet
timestamps regardless of which CPU the packets are processed on, and must
likewise have a global view of the flow state in order for the sampling to
work correctly.
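
As a sketch of what this looks like, a shared (cross-CPU) hash map can be
declared in libbpf's BTF style roughly as below. The struct layouts here are
simplified illustrations, not the exact definitions used by pping:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Simplified sketches; the real structs in pping carry more members. */
struct network_tuple {
	__u32 saddr, daddr;   /* flow addresses (IPv4 only, for brevity) */
	__u16 sport, dport;
	__u8 proto;
};

struct flow_state {
	__u64 last_timestamp; /* when the last timestamp entry was created */
	__u64 min_rtt;        /* lowest RTT seen so far for the flow */
	__u32 last_id;        /* last seen outgoing identifier (TSval) */
};

/* BPF_MAP_TYPE_HASH is shared by all CPUs, so the egress and ingress
 * programs see the same entries no matter which core processes a packet.
 * BPF_MAP_TYPE_PERCPU_HASH would avoid the races described below, but each
 * CPU would only see its own private copy, breaking cross-CPU timestamp
 * matching and per-flow sampling. */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct network_tuple);
	__type(value, struct flow_state);
} flow_state SEC(".maps");
```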
As the BPF programs may run concurrently on different CPU cores while
accessing these global hash maps, some concurrency issues may arise. In
practice, I do not believe these will occur particularly often, as I'm under
the impression that packets from the same flow will typically be processed by
the same CPU. Furthermore, most of the concurrency issues will not be that
problematic even if they do occur. For now, I've therefore left these
concurrency issues unattended, even if some of them could be avoided with
atomic operations and/or spinlocks, in order to keep things simple and not
hurt performance.

The (known) potential concurrency issues are:
#### Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing identifier for each
flow by storing it in the `flow_state` map. This is done to detect the first
packet with a new identifier. If multiple packets are processed concurrently,
several of them could potentially detect themselves as being first with the
same identifier (which only matters if they also pass the rate-limit check).
Alternatively, if the concurrent packets have different identifiers, there may
be a lost update (although for TCP timestamps, concurrent packets would
typically be expected to have the same timestamp).
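
Condensed into a helper, this check-then-update pattern looks roughly as
follows (a simplified sketch of the logic in `pping_egress()`, not the actual
function):

```c
/* Sketch of the egress check-then-update pattern for the last seen
 * identifier. f_state points into the shared flow_state map, so other CPUs
 * can touch the same entry between the read and the write. */
static __always_inline int identifier_is_new(struct flow_state *f_state,
					     __u32 identifier)
{
	if (f_state->last_id == identifier) /* 1. read last seen identifier */
		return 0;
	/* Race window: another CPU can pass the check above for the same
	 * flow before the write below lands, so several packets with the
	 * same identifier may all appear to be "first", and packets with
	 * different identifiers may overwrite each other (lost update). */
	f_state->last_id = identifier;      /* 2. update last seen identifier */
	return 1;
}
```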
A possibly more severe issue is out-of-order packets. If a packet with an old
identifier arrives out of order, that identifier could be detected as a new
identifier. If, for example, the following sequence of four packets with just
two different identifiers (id1 and id2) were to occur:

    id1 -> id2 -> id1 -> id2

then the tc/egress program would consider each of these packets to have a new
identifier and try to create a new timestamp for each of them, if the sampling
strategy allows it. However, even if the sampling strategy allows it, the
(incorrect) creation of timestamps for id1 and id2 the second time would only
succeed if the first timestamps for id1 and id2 have already been matched
against (and thus deleted). Even if that is the case, they would only result
in an incorrect RTT being reported if there are also new matches against these
identifiers.
This issue could be avoided entirely by requiring that new-id > old-id instead
of simply checking that new-id != old-id, as TCP timestamps should increase
monotonically. That may, however, not be a suitable solution if/when we add
support for other types of identifiers.
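
A sketch of such a stricter check is shown below. Note that a plain `>` would
misbehave once the 32-bit TSval wraps around, so a wraparound-safe comparison
in the style of RFC 7323 (PAWS) is used instead. This is a hypothetical
alternative, not something the current code does:

```c
/* Hypothetical stricter check: only identifiers that move forward are
 * treated as new, so an out-of-order packet with an old TSval can never
 * reset last_id. The signed subtraction makes the comparison safe across
 * 32-bit TSval wraparound. */
static __always_inline int tcp_ts_newer(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}

/* Usage, in place of the current new-id != old-id check:
 *
 *	if (!tcp_ts_newer(p_id.identifier, f_state->last_id))
 *		goto out;
 *	f_state->last_id = p_id.identifier;
 */
```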
#### Rate-limiting new timestamps
In the tc/egress program, packets to timestamp are sampled by using a per-flow
rate limit, which is enforced by storing in the `flow_state` map when the last
timestamp was created. If multiple packets perform this check concurrently,
several of them may conclude that they are allowed to create timestamps before
any of them has updated `last_timestamp` (a condensed sketch of this window is
included at the end of this subsection). The `last_timestamp` they then write
may also end up slightly incorrect, but packets processed concurrently should
generate very similar timestamps anyway.

If the packets have different identifiers (which would typically not be
expected for concurrent TCP timestamps), this would allow some packets to
bypass the rate limit. Bypassing the rate limit means the flow uses up some
additional map space and reports a few more RTTs than expected (the reported
RTTs should still be correct, however).

If the packets have the same identifier, they must first have managed to
bypass the previous check for unique identifiers (see the [previous
point](#tracking-last-seen-identifier)), and only one of them will be able to
successfully store a timestamp entry.
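
Condensed into a helper, the rate-limit window looks roughly as follows
(simplified from `pping_egress()`; the `rate_limit` parameter corresponds to
`config.rate_limit` in the real code, the minimum spacing between timestamps
in nanoseconds):

```c
/* Sketch of the rate-limit window in pping_egress(). Returns non-zero if
 * the flow is allowed to create a new timestamp entry. */
static __always_inline int rate_limit_ok(struct flow_state *f_state,
					 __u64 rate_limit)
{
	__u64 p_ts = bpf_ktime_get_ns();

	if (p_ts < f_state->last_timestamp ||
	    p_ts - f_state->last_timestamp < rate_limit)
		return 0; /* too soon since the last timestamp */
	/* Race window: several CPUs can pass the check above before any of
	 * them performs the write below, letting a few extra packets
	 * through (their timestamps will still be nearly identical). */
	f_state->last_timestamp = p_ts;
	return 1;
}
```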
#### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent packets
with the same identifier against a single timestamp entry in `packet_ts`
before any of them manages to delete the entry. This would result in multiple
RTTs being reported for the same identifier, but as the packets are processed
concurrently these RTTs should be very similar, so this would mainly result in
over-reporting rather than in incorrect RTTs.
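
If this over-reporting ever needed to be eliminated, one option (not used by
the current code) would be to let the return value of `bpf_map_delete_elem()`
arbitrate, since only one of several concurrent deletes of the same key can
succeed. A sketch, using the names from `pping_ingress()` and glossing over
the possibility of the entry being recycled between the read and the delete:

```c
/* Hypothetical mitigation, not in the current code: read the stored
 * timestamp first, then let the delete decide who gets to report. Only one
 * concurrent bpf_map_delete_elem() on the same key returns 0 (success), so
 * each timestamp entry would be reported at most once. */
event.rtt = now - *p_ts;
if (bpf_map_delete_elem(&packet_ts, &p_id) != 0)
	goto validflow_out; /* another CPU already consumed this entry */
```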
#### Updating flow min-RTT
Whenever the XDP/ingress program calculates an RTT, it checks whether this is
the lowest RTT seen so far for the flow. If multiple RTTs are calculated
concurrently, several of them could pass this check at the same time and there
may be a lost update. It should only be possible for multiple RTTs to be
calculated concurrently if either the [timestamp rate limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
against the same timestamp](#matching-against-stored-timestamps).
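
If the lost update ever needed to be avoided, a bounded compare-and-swap loop
could make the update atomic. This is a hypothetical sketch, assuming a
toolchain and kernel with BPF atomics support (roughly clang 12+ and Linux
5.12+); it is not what the current code does:

```c
/* Hypothetical lost-update-free min-RTT update using compare-and-swap.
 * The retry loop is bounded so the verifier can prove termination. */
static __always_inline void update_min_rtt(struct flow_state *f_state,
					   __u64 rtt)
{
	__u64 old_min = f_state->min_rtt;

	for (int i = 0; i < 4; i++) {
		if (rtt >= old_min)
			return; /* not a new minimum */
		/* Atomically: if min_rtt still equals old_min, replace it
		 * with rtt; either way, get the value that was there. */
		__u64 seen = __sync_val_compare_and_swap(&f_state->min_rtt,
							 old_min, rtt);
		if (seen == old_min)
			return; /* our update won */
		old_min = seen; /* another CPU updated first; re-check */
	}
}
```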
It is worth noting that with sampling, the reported minimum RTT is only an
estimate anyway (the RTT may never be calculated for the packet that
experienced the true minimum RTT). And even without sampling, there is some
inherent sampling caused by TCP timestamps only being updated at a limited
rate (1000 Hz).

## Similar projects

Passively measuring the RTT for TCP traffic is not a novel concept, and there
exists a number of other tools that can do so. A good overview of how passive
@@ -84,9 +176,9 @@ RTT calculation using TCP timestamps (as in this project) works is provided in
implementing some filtering logic the hope is to be able to create an always-on
tool that can scale well even to large amounts of massive flows.
- [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
  the "machine-friendly" (-m) output from Kathie's pping tool. Running this
  implementation of pping with --format="ppviz" will generate output that can
  be used by ppviz.
- [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
  can analyze a tcpdump file and among other things calculate RTTs based on
  seq/ACK numbers (`-r` or `-R` flag).

pping_kern.c

@@ -265,36 +265,11 @@ int pping_egress(struct __sk_buff *skb)
 	}
 
 	// Check if identfier is new
-	/* The gap between checking and updating last_id may cause concurrency
-	 * issues where multiple packets may simultaneously think they are the
-	 * first with a new identifier. As long as all of the identifiers are
-	 * the same though, only one should be able to create a timestamp entry.
-	 * A bigger issue is that older identifiers (for example due to
-	 * out-of-order packets) may pass this check and update the current
-	 * identifier to an old one. This means that both the packet with the
-	 * old identifier itself as well the next packet with the current
-	 * identifier may be considered packets with new identifiers (even if
-	 * both have been seen before). For TCP timestamps this could be
-	 * prevented by changing the check to '>=' instead, but it may not be
-	 * suitable for other protocols, such as QUIC and its spinbit.
-	 *
-	 * For now, just hope that the rate limit saves us from creating an
-	 * incorrect timestamp. That may however also fail, either due to it
-	 * happening at a time it's not limited by rate sampling, or because
-	 * of the rate check failing due to concurrency issues.
-	 */
 	if (f_state->last_id == p_id.identifier)
 		goto out;
 	f_state->last_id = p_id.identifier;
 
 	// Check rate-limit
-	/*
-	 * The window between checking and updating last_timestamp may cause
-	 * concurrency issues, where multiple packets simultaneously pass the
-	 * rate limit. However, as long as they have the same identifier, only
-	 * a single timestamp entry should successfully be created.
-	 */
 	p_ts = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
 	if (p_ts < f_state->last_timestamp ||
 	    p_ts - f_state->last_timestamp < config.rate_limit)
@@ -341,14 +316,11 @@ int pping_ingress(struct xdp_md *ctx)
 	event.rtt = now - *p_ts;
 	event.timestamp = now;
 
-	/*
-	 * Attempt to delete timestamp entry as soon as RTT is calculated.
-	 * But could have potential concurrency issue where multiple packets
-	 * manage to match against the identifier before it can be deleted.
-	 */
+	// Delete timestamp entry as soon as RTT is calculated
 	bpf_map_delete_elem(&packet_ts, &p_id);
 
-	// Update flow's min-RTT, may have concurrency issues
+	// Update flow's min-RTT
 	f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
 	if (!f_state)
 		goto validflow_out;
@@ -358,6 +330,7 @@ int pping_ingress(struct xdp_md *ctx)
 	event.min_rtt = f_state->min_rtt;
 
+	// Push event to perf-buffer
 	__builtin_memcpy(&event.flow, &p_id.flow, sizeof(struct network_tuple));
 	bpf_perf_event_output(ctx, &rtt_events, BPF_F_CURRENT_CPU, &event,
 			      sizeof(event));