pping: Add document about sampling design

Add a document outlining my thoughts for how to implement sampling. Intended both as a basis for discussion, as well as being a form of documentation.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
2024-05-06 15:54:53 +00:00 · 2021-02-25 19:00:41 +01:00
parent a9c276cb54
commit ae1d89c7c9
1 changed files with 154 additions and 0 deletions
--- a/pping/SAMPLING_DESIGN.md
+++ b/pping/SAMPLING_DESIGN.md
@@ -0,0 +1,154 @@
+# Introduction
+This file is intended to document some of the challenges and design
+decisions for adding sampling functionality to pping.
+
+## Purpose of sampling
+The main purpose of adding sampling to pping is to prevent a massive
+amount of timestamp entries being created and quickly filling up the
+map. This prevents new entries from being made until old ones can be
+cleared out. A few large flows could thus "hog" all the map entries,
+and prevent RTTs from other flows from being reported. Sampling is
+therefore only used on egress to determine if a timestamp entry should
+be created for a packet. All packets on ingress will still be parsed
+and checked for a potential match.
+
+A secondary purpose of the sampling is to reduce the amount of output
+that pping creates. In most circumstances, getting 1000 RTT reports
+per second from a single flow will probably not be of interest, and
+that volume of output makes pping less useful as a direct
+command-line utility.
+
+# Considered sampling approaches
+There are a number of different ways that the sampling could be
+performed, e.g.:
+
+- Sample every N packets per flow
+  - Not very flexible
+  - If same rate is used for all flows small flows would get very few
+    samples.
+- Sample completely random packets
+  - Probably not a good idea...
+- Probabilistic approach
+  - Probabilistic approaches have been used, for example, to capture
+    the most relevant information with limited overhead in INT
+  - Could potentially be configured across multiple devices, so that
+    pping on all of the devices together capture the most relevant
+    traffic.
+  - While it could potentially work well, I'm not very familiar with
+    these approaches. Would take considerable research from my side
+    to figure out how these methods work, how to best apply it to
+    pping, and how to implement it in BPF.
+- Use time-based sampling, limiting how often entries can be created
+  per flow
+  - Intuitively simple
+  - Should correspond quite well with the output you would probably
+    want? I.e. a few entries per flow (regardless of how heavy they
+    are) stating their current RTT.
+
+I believe that time-based sampling is the most promising solution that
+I can implement in a reasonable time. In the future additional
+sampling methods could potentially be added.
+
+# Considerations for time-based sampling
+## Time interval
+For time-based sampling, we must determine how to set the minimum
+interval between the creation of new timestamp entries.
+
+### Static time interval
+The simplest alternative is probably to use a static limit, e.g.
+100 ms. This would provide a rather simple and predictable limit for
+how often entries can be created (per flow), and how much output you
+would get (per flow).
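+
+A minimal sketch of such a static limit is shown below. It is plain
+userspace C rather than BPF, and struct flow_state,
+SAMPLE_INTERVAL_NS and allow_timestamp() are hypothetical names used
+for illustration only; they are not taken from the pping source.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Assumed static limit of 100 ms between entries for one flow. */
+#define SAMPLE_INTERVAL_NS (100ULL * 1000 * 1000)
+
+/* Hypothetical per-flow state. */
+struct flow_state {
+	uint64_t last_timestamp_ns; /* 0 means no entry created yet */
+};
+
+/* Returns true if a new timestamp entry may be created at now_ns.
+ * Assumes now_ns comes from a monotonic clock. */
+static bool allow_timestamp(struct flow_state *flow, uint64_t now_ns)
+{
+	if (flow->last_timestamp_ns != 0 &&
+	    now_ns - flow->last_timestamp_ns < SAMPLE_INTERVAL_NS)
+		return false;
+
+	flow->last_timestamp_ns = now_ns;
+	return true;
+}
+```
+
+In this sketch the first packet of a flow is always timestamped, and
+later packets are skipped until the interval has elapsed.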
+
+### RTT-based time interval
+It may be desirable to use a more dynamic time limit, which is
+adapted to each flow. One way to do this would be to base the time
+limit on the RTT for the flow. Flows with short RTTs could be expected
+to undergo more rapid changes than flows with long RTTs. This would
+require keeping track of the RTT for each flow, for example as a
+moving average. Additionally, some fallback is required before the
+RTT for the flow is known.
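+
+One possible way to track the RTT and derive the limit from it is
+sketched below, again in plain userspace C. The exponentially
+weighted moving average with weight 1/8, the choice of one smoothed
+RTT as the interval, and the names struct flow_rtt, update_srtt(),
+sample_interval_ns() and FALLBACK_INTERVAL_NS are all assumptions made
+for the example, not decisions taken in this design.
+
+```c
+#include <stdint.h>
+
+/* Assumed fallback used until the first RTT sample arrives. */
+#define FALLBACK_INTERVAL_NS (100ULL * 1000 * 1000)
+
+/* Hypothetical per-flow RTT state. */
+struct flow_rtt {
+	uint64_t srtt_ns; /* smoothed RTT, 0 means not yet known */
+};
+
+/* Fold a new RTT sample into the moving average (weight 1/8). */
+static void update_srtt(struct flow_rtt *flow, uint64_t rtt_ns)
+{
+	if (flow->srtt_ns == 0)
+		flow->srtt_ns = rtt_ns;
+	else
+		flow->srtt_ns = flow->srtt_ns - flow->srtt_ns / 8 +
+				rtt_ns / 8;
+}
+
+/* Interval between timestamp entries: one smoothed RTT if it is
+ * known, otherwise the static fallback. */
+static uint64_t sample_interval_ns(const struct flow_rtt *flow)
+{
+	return flow->srtt_ns ? flow->srtt_ns : FALLBACK_INTERVAL_NS;
+}
+```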
+
+### User configurable
+Regardless of whether a static or RTT-based interval (or some other
+alternative) is used, it should probably be user configurable
+(including allowing the user to disable sampling entirely).
+
+## Allowing bursts
+It may be desirable to allow multiple packets in a short burst to be
+timestamped. Due to delayed ACKs, one may only get a response for
+every other packet. If the first packet is timestamped, and shortly
+after a second packet is sent (that has a different identifier), then
+the response will effectively be for the second packet, and no match
+for the timestamped identifier will be found. For flows of the right
+(or wrong, depending on how you look at it) intensity, slow enough
+that consecutive packets are likely to get different TCP timestamps
+but fast enough for the delayed ACKs to acknowledge multiple packets,
+you essentially have a 50/50 chance of timestamping the wrong
+identifier and missing the RTT.
+
+## Handling duplicate identifiers
+TCP timestamps are only updated at a limited rate (e.g. 1000 Hz), and
+thus you can have multiple consecutive packets with the same TCP
+timestamp if they're sent fast enough. For the calculated RTT to be
+correct, you should only match the first sent packet with a unique
+identifier with the first received packet with a matching
+identifier. Otherwise, you may for example have a sequence of 100
+packets with the same identifier, and match the last of the outgoing
+packets with the first incoming response, which may underestimate the
+RTT by as much as one TCP timestamp clock period (e.g. 1 ms).
+
+### Current solution
+The current solution to this is very simple. For outgoing packets, a
+timestamp entry is only allowed to be created if no previous entry for
+the identifier exists (realized through the BPF_NOEXIST flag to the
+bpf_map_update_elem() call). Thus only the first outgoing packet with
+a specific identifier can be timestamped. On ingress, the first packet
+with a matching identifier will mark the timestamp as used, preventing
+later incoming responses from using that timestamp. The reason why the
+timestamp is marked as used rather than directly deleted once a
+matching packet on ingress is found, is to avoid the egress side
+creating a new entry for the same identifier. This could occur if the
+RTT is shorter than the TCP timestamp clock period, and could result
+in a massively underestimated RTT. This is the same mechanism that is
+used in the original pping, as explained
+[here](https://github.com/pollere/pping/blob/777eb72fd9b748b4bb628ef97b7fff19b751f1fd/pping.cpp#L155-L168).
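+
+The create-if-absent and mark-as-used behaviour can be illustrated
+with the userspace sketch below. A small array stands in for the BPF
+map, and entries are keyed on the identifier alone (the flow is left
+out to keep the example short). The names struct ts_entry,
+egress_timestamp() and ingress_match() are hypothetical and do not
+come from the pping source.
+
+```c
+#include <stdbool.h>
+#include <stddef.h>
+#include <stdint.h>
+
+#define MAX_ENTRIES 16 /* tiny table standing in for the BPF map */
+
+struct ts_entry {
+	bool present;     /* slot holds an entry */
+	bool used;        /* a response has already matched this entry */
+	uint32_t id;      /* packet identifier, e.g. the TCP timestamp */
+	uint64_t sent_ns; /* egress time of the first packet with id */
+};
+
+static struct ts_entry table[MAX_ENTRIES];
+
+static struct ts_entry *find_entry(uint32_t id)
+{
+	for (size_t i = 0; i < MAX_ENTRIES; i++)
+		if (table[i].present && table[i].id == id)
+			return &table[i];
+	return NULL;
+}
+
+/* Egress: create an entry only if none exists for this identifier,
+ * mimicking an update with BPF_NOEXIST. */
+static bool egress_timestamp(uint32_t id, uint64_t now_ns)
+{
+	if (find_entry(id))
+		return false;
+	for (size_t i = 0; i < MAX_ENTRIES; i++) {
+		if (!table[i].present) {
+			table[i] = (struct ts_entry){ .present = true,
+						      .used = false,
+						      .id = id,
+						      .sent_ns = now_ns };
+			return true;
+		}
+	}
+	return false; /* table full */
+}
+
+/* Ingress: the first match yields an RTT and marks the entry as
+ * used. Returns false if no entry exists or it was already used. */
+static bool ingress_match(uint32_t id, uint64_t now_ns, uint64_t *rtt_ns)
+{
+	struct ts_entry *e = find_entry(id);
+
+	if (!e || e->used)
+		return false;
+	e->used = true;
+	*rtt_ns = now_ns - e->sent_ns;
+	return true;
+}
+```
+
+Because the used entry stays in the table, a later outgoing packet
+with the same identifier is still refused on egress.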
+
+### New solution
+The current solution will no longer work if sampling is
+introduced. With sampling, there's no guarantee that the sampled
+packet will be the first outgoing packet in the sequence of packets
+with identical timestamps. Thus the RTT may still be underestimated by
+as much as one TCP timestamp clock period (e.g. 1 ms). Therefore, a
+new solution is needed. The current idea is to keep track of what the
+most recent identifier of each flow is, and only allow a packet to be
+sampled for timestamping if its identifier differs from the tracked
+identifier of the flow, i.e. it is the first packet in the flow with
+that identifier. This would perhaps be problematic with some sampling
+approaches as it requires that the packet is both the first one with a
+specific identifier, as well as being elected for sampling. However,
+for the rate-limited sampling it should work quite well, as it will
+only delay the sampling until a packet with a new identifier is found.
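+
+A rough sketch of how the identifier check could combine with
+rate-limited sampling follows, in plain userspace C. The 100 ms
+interval and the names struct flow_sampling and should_timestamp()
+are assumptions made for the example and are not part of pping.
+
+```c
+#include <stdbool.h>
+#include <stdint.h>
+
+/* Assumed static rate limit of 100 ms per flow. */
+#define SAMPLE_INTERVAL_NS (100ULL * 1000 * 1000)
+
+/* Hypothetical per-flow state. */
+struct flow_sampling {
+	bool seen_id;            /* last_id is valid */
+	bool sampled;            /* last_sample_ns is valid */
+	uint32_t last_id;        /* most recent identifier on egress */
+	uint64_t last_sample_ns; /* when a packet was last timestamped */
+};
+
+/* Returns true if this outgoing packet should be timestamped. */
+static bool should_timestamp(struct flow_sampling *flow, uint32_t id,
+			     uint64_t now_ns)
+{
+	bool first_with_id = !flow->seen_id || flow->last_id != id;
+
+	/* Always track the newest identifier, even when rate limited,
+	 * so later packets with the same identifier are not mistaken
+	 * for the first one. */
+	flow->seen_id = true;
+	flow->last_id = id;
+
+	if (!first_with_id)
+		return false;
+	if (flow->sampled &&
+	    now_ns - flow->last_sample_ns < SAMPLE_INTERVAL_NS)
+		return false;
+
+	flow->sampled = true;
+	flow->last_sample_ns = now_ns;
+	return true;
+}
+```
+
+A packet that carries a new identifier while the rate limit is still
+active is skipped here, and sampling resumes with the first packet
+that carries a new identifier after the interval has elapsed.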
+
+Another advantage with this solution is that it should allow for
+timestamp entries to be deleted as soon as the matching response is
+found on ingress. The timestamp no longer needs to be kept around only
+to prevent egress from creating a new timestamp with the same
+identifier, as this new solution should take care of that. This would
+help a lot
+with keeping the map clean, as the timestamp entries would then
+automatically be removed as soon as they are no longer needed. The
+periodic cleanup from userspace would only be needed to remove the
+occasional entries that were never matched for some reason (e.g. the
+previously mentioned issue with delayed ACKs, flow stopped, the
+reverse flow can't be observed etc.).
+
+# Implementation considerations
+TODO (can partly be found in
+[status-slides](https://github.com/xdp-project/bpf-research/blob/master/meetings/simon/work_summary_20210222.org))
+## "Global" vs PERCPU maps
+## Concurrency issues
+## Global variable vs single-entry map
+
+
+
+
+