pping: Update README with info on concurrency issues

Also, remove comments about concurrency issues from code in pping_kern.c,
as they are now documented in the README.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
--- a/pping/README.md
+++ b/pping/README.md
@@ -15,7 +15,7 @@ echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
 The fundamental logic of pping is to timestamp a pseudo-unique identifier for
 outgoing packets, and then look for matches in the incoming packets. If a match
 is found, the RTT is simply calculated as the time difference between the
-current time and the timestamp.
+current time and the stored timestamp.
 
 This tool, just as Kathie's original pping implementation, uses TCP timestamps
 as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
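In code, that fundamental logic boils down to roughly the following. This is an illustrative sketch with simplified declarations, not the actual pping_kern.c code; `packet_ts` and `p_id` are the map and packet identifier described in this README.

```C
/* Egress: timestamp the outgoing packet's identifier. */
__u64 now = bpf_ktime_get_ns();
bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);

/* Ingress: a reply echoing the identifier yields the RTT as the
 * difference between the current time and the stored timestamp. */
__u64 *ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (ts) {
        __u64 rtt = bpf_ktime_get_ns() - *ts;
        /* ... report rtt and delete the entry ... */
}
```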
@@ -73,6 +73,98 @@ matched differs from the one in Kathie's pping, and is further described in
 - **rtt_events:** A perf-buffer used by `pping_ingress()` to push calculated RTTs
   to `pping.c`, which continuously polls the map to print out the RTTs.
 
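On the userspace side, polling such a perf-buffer follows the standard libbpf pattern. A minimal sketch with a recent libbpf (the event layout, names, and buffer sizes here are illustrative assumptions, not the actual pping.c code):

```C
#include <stdio.h>
#include <linux/types.h>
#include <bpf/libbpf.h>

/* Illustrative event layout; must match what the BPF side pushes. */
struct rtt_event {
        __u64 rtt;
        /* ... flow tuple, timestamp, min_rtt ... */
};

/* Invoked by libbpf for every event pushed to the rtt_events buffer. */
static void handle_rtt_event(void *ctx, int cpu, void *data, __u32 size)
{
        const struct rtt_event *e = data;
        printf("RTT: %llu ns\n", (unsigned long long)e->rtt);
}

/* After loading the BPF object: set up and poll the buffer. */
static int poll_rtt_events(int rtt_events_map_fd)
{
        struct perf_buffer *pb;

        pb = perf_buffer__new(rtt_events_map_fd, 8 /* pages per CPU */,
                              handle_rtt_event, NULL /* lost-event cb */,
                              NULL, NULL);
        if (!pb)
                return -1;

        while (perf_buffer__poll(pb, 100 /* timeout, ms */) >= 0)
                ; /* keep polling until error or signal */

        perf_buffer__destroy(pb);
        return 0;
}
```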
+### A note on concurrency
+The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
+programs need to see the global view to function properly, using `PERCPU` maps
+is not an option. The program must be able to match against stored packet
+timestamps regardless of the CPU the packets are processed on, and must also
+have a global view of the flow state in order for the sampling to work
+correctly.
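For illustration, this is roughly what that choice looks like in a BTF-style libbpf map definition. This is a sketch only; the key/value types and sizes are assumptions, not the exact definitions in pping_kern.c.

```C
/* Shared across all CPUs: an egress timestamp stored on one core can
 * be looked up by the ingress program on any other core. */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);   /* illustrative size */
        __type(key, struct packet_id);
        __type(value, __u64);         /* timestamp in ns */
} packet_ts SEC(".maps");

/* A BPF_MAP_TYPE_PERCPU_HASH variant would give every CPU its own
 * private copy of each entry, so cross-CPU timestamp matching and
 * flow sampling would break. */
```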
+
+As the BPF programs may run concurrently on different CPU cores accessing these
+global hash maps, this may result in some concurrency issues. In practice, I do
+not believe these will occur particularly often, as I'm under the impression
+that packets from the same flow will typically be processed by the same
+CPU. Furthermore, most of the concurrency issues will not be that problematic
+even if they do occur. For now, I've therefore left these concurrency issues
+unattended, even if some of them could be avoided with atomic operations and/or
+spinlocks, in order to keep things simple and not hurt performance.
+
+The (known) potential concurrency issues are:
+
+#### Tracking last seen identifier
+The tc/egress program keeps track of the last seen outgoing identifier for each
+flow, by storing it in the `flow_state` map. This is done to detect the first
+packet with a new identifier. If multiple packets are processed concurrently,
+several of them could potentially detect themselves as being first with the
+same identifier (which only matters if they also pass the rate-limit check).
+Alternatively, if the concurrent packets have different identifiers there may
+be a lost update (but for TCP timestamps, concurrent packets would typically
+be expected to have the same timestamp).
+
+A possibly more severe issue is out-of-order packets. If a packet with an old
+identifier arrives out of order, that identifier could be detected as a new
+identifier. If, for example, the following flow of four packets with just two
+different identifiers (id1 and id2) were to occur:
+
+id1 -> id2 -> id1 -> id2
+
+then the tc/egress program would consider each of these packets to have a new
+identifier and try to create a new timestamp for each of them, if the sampling
+strategy allows it. However, even if the sampling strategy allows it, the
+(incorrect) creation of timestamps for id1 and id2 the second time would only
+be successful in case the first timestamps for id1 and id2 have already been
+matched against (and thus deleted). Even if that is the case, they would only
+result in reporting an incorrect RTT in case there are also new matches against
+these identifiers.
+
+This issue could be avoided entirely by requiring that new-id > old-id instead
+of simply checking that new-id != old-id, as TCP timestamps should increase
+monotonically. That may however not be a suitable solution if/when we add
+support for other types of identifiers.
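A hedged sketch of what that stricter check could look like, next to the current one (assuming the identifier is the 32-bit TSval, so the comparison needs to be wrap-aware in the style of RFC 7323 sequence-number arithmetic):

```C
/* Current check: any identifier change counts as new, so an
 * out-of-order packet with an old TSval is also treated as new. */
if (f_state->last_id == p_id.identifier)
        goto out;

/* Stricter alternative: only strictly increasing identifiers count.
 * The signed cast makes the comparison safe across 32-bit wraparound. */
if ((__s32)(p_id.identifier - f_state->last_id) <= 0)
        goto out;
```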
+
+#### Rate-limiting new timestamps
+In the tc/egress program, packets to timestamp are sampled by using a per-flow
+rate-limit, which is enforced by storing when the last timestamp was created in
+the `flow_state` map. If multiple packets perform this check concurrently, it's
+possible that multiple packets think they are allowed to create timestamps
+before any of them are able to update the `last_timestamp`. When they update
+`last_timestamp` it might also be slightly incorrect; however, if they are
+processed concurrently then they should also generate very similar timestamps.
+
+If the packets have different identifiers (which would typically not be
+expected for concurrent TCP timestamps), then this would allow some packets to
+bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
+additional map space and report some additional RTT(s) more than expected
+(however, the reported RTTs should still be correct).
+
+If the packets have the same identifier, they must first have managed to bypass
+the previous check for unique identifiers (see [previous
+point](#tracking-last-seen-identifier)), and only one of them will be able to
+successfully store a timestamp entry.
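The reason only one of them can succeed is the create-only semantics used when inserting the timestamp entry. A sketch of the pattern, assuming the entry is created with the `BPF_NOEXIST` flag, which makes `bpf_map_update_elem()` fail if the key already exists:

```C
__u64 now = bpf_ktime_get_ns();

/* Create-only insert: if two CPUs race past the rate-limit check with
 * the same identifier, the kernel lets exactly one create the entry. */
if (bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST) == 0)
        f_state->last_timestamp = now;
```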
+
+#### Matching against stored timestamps
+The XDP/ingress program could potentially match multiple concurrent packets
+with the same identifier against a single timestamp entry in `packet_ts`,
+before any of them manage to delete the timestamp entry. This would result in
+multiple RTTs being reported for the same identifier, but if they are processed
+concurrently these RTTs should be very similar, so this would mainly result in
+over-reporting rather than reporting incorrect RTTs.
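If this over-reporting ever became a problem, the deletion itself could arbitrate between the racing packets, since only one delete of a given entry can succeed. A hedged sketch of that alternative (not what the current code does; `report_rtt()` is a hypothetical stand-in for pushing the event):

```C
__u64 *p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (!p_ts)
        goto out;

__u64 rtt = bpf_ktime_get_ns() - *p_ts;

/* bpf_map_delete_elem() returns 0 only for the packet that actually
 * removes the entry; concurrent matchers get an error and stay quiet. */
if (bpf_map_delete_elem(&packet_ts, &p_id) == 0)
        report_rtt(&rtt); /* hypothetical reporting helper */
```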
+
+#### Updating flow min-RTT
+Whenever the XDP/ingress program calculates an RTT, it will check if this is
+the lowest RTT seen so far for the flow. If multiple RTTs are calculated
+concurrently, then several could pass this check concurrently and there may be
+a lost update. It should only be possible for multiple RTTs to be calculated
+concurrently in case either the [timestamp rate-limit was
+bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
+against the same timestamp](#matching-against-stored-timestamps).
+
+It's worth noting that with sampling, the reported minimum RTT is only an
+estimate anyway (the RTT may never be calculated for the packet with the true
+minimum RTT). And even without sampling there is some inherent sampling due to
+TCP timestamps only being updated at a limited rate (1000 Hz).
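For completeness, the lost update is a plain check-then-write. If it ever needed to be exact, a compare-and-swap retry loop would be one option on kernels with BPF atomics support (v5.12+, via clang's `__sync_val_compare_and_swap`). A sketch, assuming `min_rtt` is a `__u64` field:

```C
/* Racy check-then-write: two CPUs can both pass the check, and the
 * larger RTT may be the one left in the map (a lost update). */
if (event.rtt < f_state->min_rtt)
        f_state->min_rtt = event.rtt;

/* CAS alternative: retry until our RTT is stored or a smaller one is. */
__u64 old = f_state->min_rtt;
while (event.rtt < old) {
        __u64 seen = __sync_val_compare_and_swap(&f_state->min_rtt,
                                                 old, event.rtt);
        if (seen == old)
                break;  /* our value won */
        old = seen;     /* re-check against the concurrent update */
}
```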
+
 ## Similar projects
 Passively measuring the RTT for TCP traffic is not a novel concept, and there
 exist a number of other tools that can do so. A good overview of how passive
@@ -84,9 +176,9 @@ RTT calculation using TCP timestamps (as in this project) works is provided in
 implementing some filtering logic the hope is to be able to create an always-on
 tool that can scale well even to large amounts of massive flows.
 - [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
-  the "machine-friendly" output from Kathie's pping tool. If/when we implement a
-  similar machine readable output option it should hopefully work with this
-  implementation as well.
+  the "machine-friendly" (-m) output from Kathie's pping tool. Running this
+  implementation of pping with --format="ppviz" will generate output that can be
+  used by ppviz.
 - [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
   can analyze a tcpdump file and among other things calculate RTTs based on
   seq/ACK numbers (`-r` or `-R` flag).
--- a/pping/pping_kern.c
+++ b/pping/pping_kern.c
@@ -265,36 +265,11 @@ int pping_egress(struct __sk_buff *skb)
 	}
 
 	// Check if identifier is new
-	/* The gap between checking and updating last_id may cause concurrency
-	 * issues where multiple packets may simultaneously think they are the
-	 * first with a new identifier. As long as all of the identifiers are
-	 * the same though, only one should be able to create a timestamp entry.
-	 *
-	 * A bigger issue is that older identifiers (for example due to
-	 * out-of-order packets) may pass this check and update the current
-	 * identifier to an old one. This means that both the packet with the
-	 * old identifier itself as well as the next packet with the current
-	 * identifier may be considered packets with new identifiers (even if
-	 * both have been seen before). For TCP timestamps this could be
-	 * prevented by changing the check to '>=' instead, but it may not be
-	 * suitable for other protocols, such as QUIC and its spinbit.
-	 *
-	 * For now, just hope that the rate limit saves us from creating an
-	 * incorrect timestamp. That may however also fail, either due to it
-	 * happening at a time when it's not limited by rate sampling, or
-	 * because of the rate check failing due to concurrency issues.
-	 */
 	if (f_state->last_id == p_id.identifier)
 		goto out;
 	f_state->last_id = p_id.identifier;
 
 	// Check rate-limit
-	/*
-	 * The window between checking and updating last_timestamp may cause
-	 * concurrency issues, where multiple packets simultaneously pass the
-	 * rate limit. However, as long as they have the same identifier, only
-	 * a single timestamp entry should successfully be created.
-	 */
 	p_ts = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
 	if (p_ts < f_state->last_timestamp ||
 	    p_ts - f_state->last_timestamp < config.rate_limit)
@@ -341,14 +316,11 @@ int pping_ingress(struct xdp_md *ctx)
 
 	event.rtt = now - *p_ts;
 	event.timestamp = now;
-	/*
-	 * Attempt to delete timestamp entry as soon as RTT is calculated.
-	 * But could have potential concurrency issue where multiple packets
-	 * manage to match against the identifier before it can be deleted.
-	 */
+	// Delete timestamp entry as soon as RTT is calculated
 	bpf_map_delete_elem(&packet_ts, &p_id);
 
-	// Update flow's min-RTT, may have concurrency issues
+	// Update flow's min-RTT
 	f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
 	if (!f_state)
 		goto validflow_out;
@@ -358,6 +330,7 @@ int pping_ingress(struct xdp_md *ctx)
 
 	event.min_rtt = f_state->min_rtt;
 
+	// Push event to perf-buffer
 	__builtin_memcpy(&event.flow, &p_id.flow, sizeof(struct network_tuple));
 	bpf_perf_event_output(ctx, &rtt_events, BPF_F_CURRENT_CPU, &event,
 			      sizeof(event));