Introduction
This file is intended to document some of the challenges and design decisions for adding sampling functionality to pping. It is partly based on discussions from my supervisor meeting on 2021-02-22, and the contents of my status slides from that meeting.
Purpose of sampling
The main purpose of adding sampling to pping is to prevent a massive number of timestamp entries from being created and quickly filling up the map. A full map prevents new entries from being made until old ones can be cleared out; a few large flows could thus "hog" all the map entries and prevent RTTs from other flows from being reported. Sampling is therefore only used on egress, to determine if a timestamp entry should be created for a packet. All packets on ingress will still be parsed and checked for a potential match.
A secondary purpose of the sampling is to reduce the amount of output that pping creates. In most circumstances, getting 1000 RTT reports per second from a single flow is probably not of interest, and makes pping less useful as a direct command-line utility.
Considered sampling approaches
There are a number of different ways that the sampling could be performed, ex:
- Sample every N packets per flow
- Not very flexible
- If the same rate is used for all flows, small flows would get very few samples.
- Sample completely random packets
- Probably not a good idea...
- Head sampling (sample the first few packets of each flow)
- Not suitable for monitoring long flows
- RTT may change over lifetime of flow (due to buffer bloat)
- Probabilistic approach
- Probabilistic approaches have been used, for example, to capture the most relevant information with limited overhead in INT (In-band Network Telemetry)
- Could potentially be configured across multiple devices, so that pping on all of the devices together capture the most relevant traffic.
- While it could potentially work well, I'm not very familiar with these approaches. It would take considerable research on my side to figure out how these methods work, how to best apply them to pping, and how to implement them in BPF.
- Time-based sampling, limiting the rate at which entries can be created per flow
- Intuitively simple
- Should correspond quite well to the output you would probably want, i.e. a few entries per flow (regardless of how heavy the flows are) stating their current RTT.
I believe that time-based sampling is the most promising solution that I can implement in a reasonable time. In the future additional sampling methods could potentially be added.
Considerations for time-based sampling
Time interval
For the time-based sampling, we must determine how to set the interval between when new timestamp entries are allowed to be created.
Static time interval
The simplest alternative is probably to use a static limit, ex 100ms. This would provide a rather simple and predictable limit for how often entries can be created (per flow), and how much output you would get (per flow).
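As a minimal sketch of what such a static limit could look like (all names and constants here are illustrative, not from the pping source; a BPF version would keep the flow state in a map and get the time from `bpf_ktime_get_ns()`):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative static per-flow rate limit. Times are in nanoseconds,
 * matching what bpf_ktime_get_ns() would provide in a BPF program. */
#define RATE_LIMIT_NS (100 * 1000000ULL) /* 100 ms */

struct flow_state {
	uint64_t last_timestamp_ns; /* when the last timestamp entry was created */
};

/* Return true if enough time has passed since the flow's last
 * timestamp entry, and update the flow state if so. */
static bool allow_new_timestamp(struct flow_state *fs, uint64_t now_ns)
{
	if (now_ns - fs->last_timestamp_ns < RATE_LIMIT_NS)
		return false;
	fs->last_timestamp_ns = now_ns;
	return true;
}
```
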
RTT-based time interval
It may be desirable to use a more dynamic time limit, adapted to each flow. One way to do this would be to base the time limit on the RTT of the flow. Flows with short RTTs could be expected to undergo more rapid changes than flows with long RTTs. This would require keeping track of the RTT for each flow, for example as a moving average. Additionally, some fallback is required before the RTT for the flow is known.
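One way to sketch such an RTT-based interval (names are purely illustrative; the smoothing constant of 1/8 is borrowed from TCP's SRTT calculation in RFC 6298):

```c
#include <stdint.h>

/* Illustrative sketch (not pping's actual code): track a moving average
 * of the flow's RTT and derive the sampling interval from it, falling
 * back to a static interval before any RTT is known. */
#define FALLBACK_INTERVAL_NS (100 * 1000000ULL) /* 100 ms */

struct flow_rtt {
	uint64_t srtt_ns; /* smoothed RTT, 0 = not yet known */
};

/* Exponentially weighted moving average with alpha = 1/8. */
static void update_srtt(struct flow_rtt *fr, uint64_t rtt_ns)
{
	if (fr->srtt_ns == 0)
		fr->srtt_ns = rtt_ns;
	else
		fr->srtt_ns = fr->srtt_ns - (fr->srtt_ns >> 3) + (rtt_ns >> 3);
}

/* Sample at most once per smoothed RTT; fall back while RTT is unknown. */
static uint64_t sampling_interval(const struct flow_rtt *fr)
{
	return fr->srtt_ns ? fr->srtt_ns : FALLBACK_INTERVAL_NS;
}
```
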
User configurable
Regardless of whether a static or an RTT-based interval (or some other alternative) is used, it should probably be user configurable (including allowing the user to disable sampling entirely).
Allowing bursts
It may be desirable to allow multiple packets in a short burst to be timestamped. Due to delayed ACKs, one may only get a response for every other packet. If the first packet is timestamped, and shortly after a second packet is sent (that has a different identifier), then the response will effectively be for the second packet, and no match for the timestamped identifier will be found. For flows of the right (or wrong, depending on how you look at it) intensity, slow enough that consecutive packets are likely to get different TCP timestamps, but fast enough for the delayed ACKs to acknowledge multiple packets, you essentially have a 50/50 chance of timestamping the wrong identifier and missing the RTT.
To handle this, you could timestamp multiple consecutive packets (with unique identifiers) in a short burst. You probably need to limit this burst both in number of packets and in the timeframe after the first packet during which additional packets may be included. For example, allowing up to 3 packets (with different identifiers) to be timestamped for up to 4 ms after the first one of them is timestamped.
If allowing bursts of timestamps to be created, it may also be desirable to rate limit the output, in order to not get a burst of similar RTTs for the flow in the output (which may also skew averages and other post-processing).
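The burst allowance described above could be sketched roughly as follows (names and constants are illustrative; in BPF this state would live in the per-flow map):

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative burst allowance: after the first timestamp of a sampling
 * period, allow up to BURST_MAX packets to also be timestamped within
 * BURST_WINDOW_NS of the first one. */
#define RATE_LIMIT_NS   (100 * 1000000ULL) /* 100 ms between bursts */
#define BURST_WINDOW_NS (4 * 1000000ULL)   /* 4 ms burst window */
#define BURST_MAX       3                  /* packets per burst */

struct burst_state {
	uint64_t burst_start_ns; /* time of the first packet in the burst */
	uint32_t burst_count;    /* packets timestamped in this burst */
};

static bool allow_timestamp_burst(struct burst_state *bs, uint64_t now_ns)
{
	/* Within an open burst window? */
	if (bs->burst_count > 0 &&
	    now_ns - bs->burst_start_ns <= BURST_WINDOW_NS) {
		if (bs->burst_count >= BURST_MAX)
			return false;
		bs->burst_count++;
		return true;
	}
	/* Otherwise, start a new burst only if the rate limit allows. */
	if (bs->burst_start_ns != 0 &&
	    now_ns - bs->burst_start_ns < RATE_LIMIT_NS)
		return false;
	bs->burst_start_ns = now_ns;
	bs->burst_count = 1;
	return true;
}
```
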
Handling duplicate identifiers
TCP timestamps are only updated at a limited rate (ex. 1000 Hz), and thus you can have multiple consecutive packets with the same TCP timestamp if they're sent fast enough. For the calculated RTT to be correct, you should only match the first sent packet with a unique identifier against the first received packet with a matching identifier. Otherwise, you may for example have a sequence of 100 packets with the same identifier, and match the last of the outgoing packets with the first incoming response, which may underestimate the RTT by as much as the TCP timestamp clock period (ex. 1 ms).
Current solution
The current solution to this is very simple. For outgoing packets, a timestamp entry is only allowed to be created if no previous entry for the identifier exists (realized through the `BPF_NOEXIST` flag to the `bpf_map_update_elem()` call). Thus only the first outgoing packet with a specific identifier can be timestamped. On ingress, the first packet with a matching identifier will mark the timestamp as used, preventing later incoming responses from using that timestamp. The reason why the timestamp is marked as used rather than directly deleted once a matching packet on ingress is found is to avoid the egress side creating a new entry for the same identifier. This could occur if the RTT is shorter than the TCP timestamp clock period, and could result in a massively underestimated RTT. This is the same mechanic that is used in the original pping, as explained here.
New solution
The current solution will no longer work if sampling is introduced. With sampling, there's no guarantee that the sampled packet will be the first outgoing packet in the sequence of packets with identical timestamps. Thus the RTT may still be underestimated by as much as the TCP timestamp clock period (ex. 1 ms). Therefore, a new solution is needed. The current idea is to keep track of the last-seen identifier of each flow, and only allow a packet to be sampled for timestamping if its identifier differs from the last-seen identifier of the flow, i.e. if it is the first packet in the flow with that identifier. This could be problematic with some sampling approaches, as it requires that the packet both is the first one with a specific identifier and is elected for sampling. For the rate-limited sampling it should however work quite well, as it will only delay the sampling until a packet with a new identifier is found.
Another advantage of this solution is that it should allow timestamp entries to be deleted as soon as the matching response is found on ingress. The timestamp no longer needs to be kept around only to prevent egress from creating a new timestamp with the same identifier, as this new solution should take care of that. This would help a lot with keeping the map clean, as timestamp entries would then automatically be removed as soon as they are no longer needed. The periodic cleanup from userspace would only be needed to remove the occasional entries that were never matched for some reason (e.g. the previously mentioned issue with delayed ACKs, the flow stopping, the reverse flow not being observable, etc.).
One issue for this new solution is handling out-of-order packets. If a packet with an older identifier is a bit delayed, it may arrive after the last-seen identifier for the flow has been updated. This old identifier may then be considered new (as it differs from the current one), allowing an entry to be created for it and reverting the last-seen identifier to a previous one. The next packet carrying what used to be the current identifier may then also be detected as a new identifier (as the out-of-order packet reverted the last-seen identifier to an old one), creating a bit of a ping-pong effect. For TCP timestamps this can easily be avoided by simply requiring the new identifier to be greater than the last-seen identifier (as TCP timestamps should be monotonically increasing). That solution may however not be suitable if one wants to reuse this mechanic for other protocols, such as the QUIC spin bit.
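A sketch of this last-seen identifier check (illustrative, not pping's actual code; since TCP timestamps are 32-bit counters that wrap, the "greater than" is done modulo 2^32, in the same spirit as TCP sequence number comparisons):

```c
#include <stdbool.h>
#include <stdint.h>

/* Wraparound-safe "a is strictly after b" for 32-bit counters. */
static bool tsval_after(uint32_t a, uint32_t b)
{
	return (int32_t)(a - b) > 0;
}

struct id_state {
	uint32_t last_id; /* last-seen TCP timestamp value */
	bool has_id;      /* false until the first packet is seen */
};

/* Return true if this packet carries a new (strictly later) identifier,
 * updating the per-flow state. Out-of-order packets with older
 * identifiers are rejected, avoiding the ping-pong effect. */
static bool is_new_identifier(struct id_state *st, uint32_t id)
{
	if (st->has_id && !tsval_after(id, st->last_id))
		return false;
	st->last_id = id;
	st->has_id = true;
	return true;
}
```
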
Keeping per-flow information
In order for the per-flow rate limiting to work, some per-flow state must be maintained, namely when the last timestamp for that flow was added (so that one can check that sufficient time has passed before attempting to add another one).
There may be some drawbacks to having to keep per-flow state. First off, there will be some additional overhead from having to keep track of this state. However, the savings from sampling the per-packet state (the identifier/timestamp mappings) should hopefully cover the overhead of keeping some per-flow state (and then some).
Another issue that is worth keeping in mind is that this flow-state will also need to be cleaned up eventually. This cleanup could be handled in a similar manner to how the current per-packet state is cleaned up: by having the userspace process occasionally remove old entries. In this case, an entry could be deemed old if a long time has passed since the last timestamp was added for the flow, ex. 300 seconds as used by the original pping. Additionally, one can parse the packets for indications that the connection is being closed (ex. TCP FIN/RST), and then directly delete the flow-state for that flow from the BPF programs.
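Such a userspace cleanup could be sketched as follows (a plain array stands in for the BPF map here; a real implementation would instead walk the map with `bpf_map_get_next_key()` and delete entries with `bpf_map_delete_elem()`):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative userspace-style cleanup: a flow is considered stale if no
 * timestamp has been added for it in FLOW_TIMEOUT_NS, mirroring the
 * 300 s timeout of the original pping. */
#define FLOW_TIMEOUT_NS (300ULL * 1000000000ULL)

struct flow_entry {
	uint64_t last_timestamp_ns;
	int in_use;
};

/* Remove stale flows from a simple array "map"; return how many were deleted. */
static size_t cleanup_flows(struct flow_entry *flows, size_t n, uint64_t now_ns)
{
	size_t deleted = 0;
	for (size_t i = 0; i < n; i++) {
		if (flows[i].in_use &&
		    now_ns - flows[i].last_timestamp_ns > FLOW_TIMEOUT_NS) {
			flows[i].in_use = 0;
			deleted++;
		}
	}
	return deleted;
}
```
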
Later on, this per-flow state could potentially be expanded to include other information deemed useful (such as ex. minimum and average RTT).
Alternative solution - keeping identifier in flow-state
One idea that came up during my supervisor meeting, was that instead of creating timestamps for individual packets as is currently done, you only create a number of timestamps for each flow. That is, instead of creating per-packet entries in a separate map, you include a number of timestamp/identifier pairs in the flow-state information itself.
While this would potentially be rather efficient, limiting the number of timestamp entries to a fixed number per flow, I'm opposed to this idea for a few reasons:
- The sampling rate would be inherently tied to the RTT of the flow. While this may in many cases be desirable, it is not very flexible. It would also make it hard to ex. turn off sampling completely.
- The number of timestamps per flow would need to be fixed and known at compile time(?), as the timestamp/identifier pairs are kept in the flow-state information itself, and the flow-state information needs to be of a known and fixed size when creating the maps. This may also result in some wasted space if the flow-state includes slots for several timestamp/identifier pairs but most flows only make use of a few (although having an additional timestamp entry map of fixed size wastes space in a similar manner).
- If a low number of timestamp/identifier pairs is kept, selecting an identifier that is missed (ex. due to delayed ACKs) could effectively block new timestamps from being created (and thus RTTs from being calculated) for the flow for a relatively long while. New timestamps can only be created if there is a free slot, and a slot can only be freed by either getting a matching reply, or waiting until it can be safely assumed that the response was missed (and not just delayed).
Graceful degradation
Another aspect I've been asked to consider is how to gracefully reduce the functionality of pping as the timestamp entry map gets full (as with sufficiently many and heavy flows, it's likely inevitable).
What currently happens when the timestamp entry map is full is simply that no more entries can be made until some have been cleared out. With a rate limit on the number of entries per flow, as well as direct deletion of entries upon match, I believe this is a reasonable way to handle the situation. As soon as some RTTs for current flows have been reported, space for new entries will become available. The next outgoing packet with a valid identifier, from any flow that is not currently waiting out its rate limit, will then be able to grab the free spot. This will still favor heavy flows over smaller flows, as heavy flows are more likely to get a packet in first, but they will at least be limited by the rate limit, and thus have to take turns with other flows.
It is also worth noting that, as per-flow state will need to be kept, there will be a strict limit on the number of concurrent flows that can be monitored, corresponding to the number of entries that the per-flow state map can hold. Once the per-flow state map is full, no new flows can be added until one is cleared. It also doesn't make sense to add packet timestamp entries for flows whose state cannot be tracked, as the rate limit cannot be enforced then.
I see a few ways to more actively handle degradation, depending on what one views as desirable:
1. One can attempt to monitor many flows, with infrequent RTT calculations for each. In this case, the userspace process that occasionally clears out the timestamp map could automatically decrease the per-flow rate limit if it detects that the map is getting close to full. That way, fewer entries would be generated per flow, and flows would be forced to take turns to a greater degree when the map is completely full. Similarly, one may wish to reduce the timeout for old flows if the per-flow map is getting full, in order to more quickly allow new flows to be monitored, keeping only the most active flows around.
2. One can attempt to monitor fewer flows, but with more frequent RTT calculations for each. The easiest way to achieve this is probably to set a smaller size on the per-flow map relative to the per-packet timestamp map. In case one wants to primarily focus on heavier flows, one could possibly add ex. packet rate to the per-flow information, and remove the flows with the lowest packet rates.
3. One can attempt to focus on flows with shorter RTTs. Flows with shorter RTTs should make more efficient use of timestamp entries, as their entries can be cleared out faster, allowing for new ones. On the other hand, flows with longer RTTs may be the more interesting ones, as they are more likely to indicate some issue.
4. One can simply try to create a larger map (and copy over the old contents) once the map is approaching full. This way one can start with reasonably small maps, and only start eating up more memory if required.
While I'm leaning towards option 1 or 4, I don't have a very strong personal opinion here, and would like some input on what others (who may have more experience with network measurements) think are reasonable trade-offs to do.
Implementation considerations
There are of course several more practical considerations as well when implementing the sampling, some of which I'll try to address here.
"Global" vs PERCPU maps
In general, it's likely wise to go with PERCPU maps over "global" (aka non-PERCPU) maps whenever possible, as PERCPU maps should be more performant and also avoid concurrency issues. Of course, this only applies if the BPF programs don't need to act on global state.
For pping, I unfortunately see no way for the program to work with only information local to each CPU core. The per-packet identifiers and timestamps need to be global, as there is no guarantee that the same core that timestamped a packet will process the response to that packet. Likewise, the per-flow information, like the time of the last timestamping, also needs to be global. Otherwise the rate limit would be per-CPU-per-flow rather than just per-flow.
In practice, packets from the same flow are apparently often handled by the same CPU, but this is not guaranteed, and therefore not something we can rely on (especially as state needs to be shared by both ingress and egress). One could try to use a CPU map to enforce this behavior, but that's probably not a great idea.
Concurrency issues
In addition to the performance hit, sharing global state between multiple concurrent processes risks running into concurrency issues unless access is synchronized in some manner (in BPF, the two mechanisms I know of are atomic adds and spin-locks for maps). At the risk of misunderstanding the memory model for BPF programs (which, from what I can tell, I'm probably not alone about), I will attempt to explain the potential concurrency issues I see with the pping implementation.
The current pping implementation already has a potential concurrency issue. When matches for identifiers are found on ingress, a check is performed to see if the timestamp has already been used or not. Multiple packets processed in parallel could potentially all find that the timestamp is unused before any of them manages to mark it as used for the others. This may result in pping matching several responses to a single timestamp entry and reporting the RTTs for each of them. I do not consider this a significant issue, however: if the packets are concurrent enough that they all manage to look up the used status before another has time to set it, the differences in time between them should be very small, and they will therefore compute very similar RTTs. So the reported RTTs should still be rather accurate, just over-reported.
When adding sampling and per-flow information, some additional concurrency issues may be encountered. Mainly, multiple packets may find that they are allowed to add a new timestamp before any of them manages to update the time of the last added timestamp in the per-flow state. This may lead to multiple attempts at creating a timestamp at approximately the same time. For TCP timestamps, all the identifiers are likely to be identical (as the TCP timestamp itself is only updated at a limited rate), so only one of them should succeed anyway. If using identifiers that are more unique, however, such as TCP sequence numbers, then it's possible that a short burst of entries would be created instead of just a single entry within the rate limit for the flow.
Overall, I don't think these concurrency issues are that severe, as they should still result in accurate RTTs, just some possible over-reporting. I don't believe these issues warrant the performance impact and potential code complexity of trying to synchronize access. Furthermore, from what I understand, these concurrency issues are not too likely to occur in practice, as packets from the same flow are often processed on the same core.
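For completeness, if one did want to synchronize the rate-limit check, one option would be to claim the per-flow slot with a compare-and-swap, so that of several concurrent packets only one succeeds. The sketch below uses C11 atomics; in BPF the closest equivalent would be `__sync_val_compare_and_swap` (supported since Linux 5.12). As argued above, this may well not be worth the cost:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative CAS-based claim of the per-flow rate-limit slot.
 * Names and constants are not from the pping source. */
#define RATE_LIMIT_NS (100 * 1000000ULL) /* 100 ms */

static bool claim_timestamp_slot(_Atomic uint64_t *last_ns, uint64_t now_ns)
{
	uint64_t prev = atomic_load(last_ns);

	if (now_ns - prev < RATE_LIMIT_NS)
		return false;
	/* Of several concurrent callers, only the first swap succeeds. */
	return atomic_compare_exchange_strong(last_ns, &prev, now_ns);
}
```
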
Global variable vs single-entry map
With BTF, BPF programs now seem to support the use of global variables. These global variables can supposedly be modified from userspace, and should, from what I've heard, also be more efficient than map lookups. They therefore seem like a promising way to pass user-configured options from userspace to the BPF programs.
I would however need to look up how to actually use these, as the examples I've seen have used a slightly different libbpf setup, where a "skeleton" header file is compiled and imported into the userspace program. There should be some examples in the xdp-tools repository.
The alternative, I guess, would be to use a `BPF_MAP_TYPE_PERCPU_ARRAY` with a single entry, which is filled in with the user-configured option by the userspace program.
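As a hedged, non-runnable sketch of what the global-variable approach might look like (all names here are hypothetical, and this is based on my understanding of the skeleton workflow rather than tested code): the BPF side declares the option as a global, and userspace sets it through the generated skeleton between opening and loading the program.

```c
/* BPF side (e.g. pping_kern.c) -- hypothetical name */
volatile const __u64 rate_limit_ns = 100000000; /* default 100 ms, overridable from userspace */

/* Userspace side, using a skeleton generated with `bpftool gen skeleton`.
 * const globals land in .rodata and must be set before load. */
struct pping_kern *skel = pping_kern__open();
skel->rodata->rate_limit_ns = opts.rate_limit_ns;
pping_kern__load(skel);
```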