mirror of
https://github.com/xdp-project/bpf-examples.git
synced 2024-05-06 15:54:53 +00:00
pping: Add document about sampling design
Add a document outlining my thoughts for how to implement sampling. Intended both as a basis for discussion, as well as being a form of documentation. Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
pping/SAMPLING_DESIGN.md
# Introduction
This file documents some of the challenges and design decisions
involved in adding sampling functionality to pping.

## Purpose of sampling
The main purpose of adding sampling to pping is to prevent a massive
number of timestamp entries from being created and quickly filling up
the map. A full map prevents new entries from being made until old
ones can be cleared out. A few large flows could thus "hog" all the map
entries and prevent RTTs from other flows from being reported. Sampling
is therefore only used on egress, to determine if a timestamp entry
should be created for a packet. All packets on ingress will still be
parsed and checked for a potential match.

A secondary purpose of sampling is to reduce the amount of output
that pping creates. In most circumstances, getting 1000 RTT reports
per second from a single flow will probably not be of interest, and
that volume of output makes pping less useful as a direct command-line
utility.

# Considered sampling approaches
There are a number of different ways that the sampling could be
performed, e.g.:

- Sample every N packets per flow
  - Not very flexible
  - If the same rate is used for all flows, small flows would get very
    few samples.
- Sample completely random packets
  - Probably not a good idea...
- Probabilistic approach
  - Probabilistic approaches have been used, for example, to capture
    the most relevant information with limited overhead in INT
  - Could potentially be configured across multiple devices, so that
    pping on all of the devices together captures the most relevant
    traffic.
  - While it could potentially work well, I'm not very familiar with
    these approaches. It would take considerable research from my side
    to figure out how these methods work, how to best apply them to
    pping, and how to implement them in BPF.
- Use time-based sampling, limiting the rate at which entries can be
  created per flow
  - Intuitively simple
  - Should correspond quite well with the output you would probably
    want? I.e. a few entries per flow (regardless of how heavy they
    are) stating their current RTT.

I believe that time-based sampling is the most promising solution that
I can implement in a reasonable time. In the future, additional
sampling methods could potentially be added.

# Considerations for time-based sampling
## Time interval
For time-based sampling, we must determine how to set the minimum
interval between the creation of new timestamp entries.

### Static time interval
The simplest alternative is probably to use a static limit, e.g.
100 ms. This would provide a rather simple and predictable limit for
how often entries can be created (per flow), and how much output you
would get (per flow).

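A minimal userspace sketch of what a static per-flow rate limit could
look like. The struct, field and function names are hypothetical and
only serve as illustration; this is plain C, not the BPF code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow state; the field names are illustrative only. */
struct rl_flow_state {
	uint64_t last_timestamp_ns; /* when an entry was last created */
	bool has_timestamped;       /* false until the first entry is created */
};

/*
 * Returns true if a new timestamp entry may be created for this flow at
 * time now_ns, given a static minimum interval. If so, the time of
 * creation is recorded in the flow state.
 */
bool rate_limit_allows(struct rl_flow_state *flow, uint64_t now_ns,
		       uint64_t min_interval_ns)
{
	assert(flow != NULL);

	if (flow->has_timestamped &&
	    now_ns - flow->last_timestamp_ns < min_interval_ns)
		return false; /* too soon since the last entry */

	flow->last_timestamp_ns = now_ns;
	flow->has_timestamped = true;
	return true;
}
```

With a 100 ms interval, a flow that sends a packet every 50 ms would
get a timestamp entry for every other packet, regardless of how many
packets it sends in between.
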
### RTT-based time interval
It may be desirable to use a more dynamic time limit that is
adapted to each flow. One way to do this would be to base the time
limit on the RTT for the flow. Flows with short RTTs could be expected
to undergo more rapid changes than flows with long RTTs. This would
require keeping track of the RTT for each flow, for example as a moving
average. Additionally, some fallback is required before the RTT for
the flow is known.

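A rough sketch of how an RTT-based interval could be derived, written
as plain userspace C. It assumes an exponentially weighted moving
average with weight 1/8 for new samples (the same weight TCP uses for
its smoothed RTT), an interval equal to the smoothed RTT, and a 100 ms
fallback. All names and all three of these choices are assumptions made
for illustration, not decisions taken in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fallback interval, used until an RTT is known for the flow. */
#define FALLBACK_INTERVAL_NS (100ULL * 1000 * 1000)

/* Hypothetical per-flow RTT state. */
struct rtt_state {
	uint64_t srtt_ns; /* smoothed RTT, 0 while no RTT is known */
};

/* Fold a new RTT sample into the moving average (weight 1/8). */
void update_srtt(struct rtt_state *s, uint64_t rtt_ns)
{
	assert(s != NULL);

	if (s->srtt_ns == 0)
		s->srtt_ns = rtt_ns; /* first sample initializes the average */
	else
		s->srtt_ns = s->srtt_ns - s->srtt_ns / 8 + rtt_ns / 8;
}

/* Minimum time between timestamp entries for this flow. */
uint64_t sampling_interval_ns(const struct rtt_state *s)
{
	assert(s != NULL);

	return s->srtt_ns ? s->srtt_ns : FALLBACK_INTERVAL_NS;
}
```

Using integer division keeps the arithmetic simple enough for BPF, at
the cost of some rounding error. Scaling the interval (e.g. to a
multiple of the smoothed RTT) would be a natural user-configurable
knob.
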
### User configurable
Regardless of whether a static or RTT-based interval (or some other
alternative) is used, it should probably be user configurable
(including allowing the user to disable sampling entirely).

## Allowing bursts
It may be desirable to allow multiple packets in a short burst to be
timestamped. Due to delayed ACKs, one may only get a response for every
other packet. If the first packet is timestamped, and shortly after a
second packet is sent (that has a different identifier), then the
response will effectively be for the second packet, and no match for
the timestamped identifier will be found. For flows of the right (or
wrong, depending on how you look at it) intensity, slow enough that
consecutive packets are likely to get different TCP timestamps, but
fast enough for the delayed ACKs to acknowledge multiple packets, you
essentially have a 50/50 chance of timestamping the wrong identifier
and missing the RTT.

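One possible way to allow bursts, sketched in plain userspace C, is to
permit up to a fixed number of timestamp entries within each interval
instead of just one. This is only one of several conceivable designs;
the names and the `burst` parameter are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow state for burst-tolerant rate limiting. */
struct burst_state {
	uint64_t window_start_ns; /* time of the first entry in the window */
	uint32_t used_in_window;  /* entries created in the current window */
	bool started;             /* false until the first entry is created */
};

/*
 * Returns true if a timestamp entry may be created at now_ns. Up to
 * `burst` entries are allowed per interval; a new window starts with
 * the first allowed packet after the previous window has expired.
 */
bool burst_allows(struct burst_state *f, uint64_t now_ns,
		  uint64_t interval_ns, uint32_t burst)
{
	assert(f != NULL && burst > 0);

	if (!f->started || now_ns - f->window_start_ns >= interval_ns) {
		f->window_start_ns = now_ns;
		f->used_in_window = 1;
		f->started = true;
		return true;
	}

	if (f->used_in_window < burst) {
		f->used_in_window++;
		return true;
	}

	return false;
}
```

With `burst` set to 2, two back-to-back packets with different
identifiers would both be timestamped, so a delayed ACK covering both
of them would still find a match.
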
## Handling duplicate identifiers
TCP timestamps are only updated at a limited rate (e.g. 1000 Hz), and
thus you can have multiple consecutive packets with the same TCP
timestamp if they're sent fast enough. For the calculated RTT to be
correct, you should only match the first sent packet with a unique
identifier with the first received packet with a matching
identifier. Otherwise, you may for example have a sequence of 100
packets with the same identifier, and match the last of the outgoing
packets with the first incoming response, which may underestimate the
RTT by as much as the TCP timestamp clock period (e.g. 1 ms).

### Current solution
The current solution to this is very simple. For outgoing packets, a
timestamp entry is only allowed to be created if no previous entry for
the identifier exists (realized through the BPF_NOEXIST flag to the
bpf_map_update_elem() call). Thus only the first outgoing packet with
a specific identifier can be timestamped. On ingress, the first packet
with a matching identifier will mark the timestamp as used, preventing
later incoming responses from using that timestamp. The reason why the
timestamp is marked as used rather than directly deleted once a
matching packet on ingress is found, is to avoid the egress side
creating a new entry for the same identifier. This could occur if the
RTT is shorter than the TCP timestamp clock period, and could result in
a massively underestimated RTT. This is the same mechanism that is used
in the original pping, as explained
[here](https://github.com/pollere/pping/blob/777eb72fd9b748b4bb628ef97b7fff19b751f1fd/pping.cpp#L155-L168).

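The behaviour described above can be illustrated with a small userspace
model. The fixed-size table below merely stands in for the BPF map, and
`ts_create_noexist()` mimics what an update with BPF_NOEXIST achieves;
all names are hypothetical and none of this is the actual pping code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_ENTRIES 16

struct ts_entry {
	uint32_t id;      /* packet identifier, e.g. the TCP timestamp */
	uint64_t sent_ns; /* when the outgoing packet was seen */
	bool in_use;      /* slot holds an entry */
	bool used;        /* entry has already been matched by a response */
};

struct ts_table {
	struct ts_entry slots[MAX_ENTRIES];
};

struct ts_entry *ts_lookup(struct ts_table *t, uint32_t id)
{
	for (size_t i = 0; i < MAX_ENTRIES; i++)
		if (t->slots[i].in_use && t->slots[i].id == id)
			return &t->slots[i];
	return NULL;
}

/* Egress: create an entry only if none exists for this identifier. */
bool ts_create_noexist(struct ts_table *t, uint32_t id, uint64_t now_ns)
{
	assert(t != NULL);

	if (ts_lookup(t, id))
		return false; /* an entry (used or not) already exists */

	for (size_t i = 0; i < MAX_ENTRIES; i++) {
		if (!t->slots[i].in_use) {
			t->slots[i] = (struct ts_entry){
				.id = id, .sent_ns = now_ns, .in_use = true,
			};
			return true;
		}
	}
	return false; /* table is full */
}

/* Ingress: only the first response matches; the entry is marked used. */
bool ts_match(struct ts_table *t, uint32_t id, uint64_t now_ns,
	      uint64_t *rtt_ns)
{
	assert(t != NULL && rtt_ns != NULL);

	struct ts_entry *e = ts_lookup(t, id);
	if (!e || e->used)
		return false;

	e->used = true; /* keep the entry so egress cannot recreate it */
	*rtt_ns = now_ns - e->sent_ns;
	return true;
}
```

Because a matched entry stays in the table, a later outgoing packet
with the same identifier is still rejected by `ts_create_noexist()`.
The price is that used entries occupy map space until they are cleaned
up.
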
### New solution
The current solution will no longer work if sampling is
introduced. With sampling, there's no guarantee that the sampled
packet will be the first outgoing packet in the sequence of packets
with identical timestamps. Thus the RTT may still be underestimated by
as much as the TCP timestamp clock period (e.g. 1 ms). Therefore, a new
solution is needed. The current idea is to keep track of the most
recent identifier of each flow, and only allow a packet to be
sampled for timestamping if its identifier differs from the tracked
identifier of the flow, i.e. it is the first packet in the flow with
that identifier. This would perhaps be problematic with some sampling
approaches, as it requires that the packet is both the first one with a
specific identifier and elected for sampling. However, for the
rate-limited sampling it should work quite well, as it will only delay
the sampling until a packet with a new identifier is found.

Another advantage of this solution is that it should allow timestamp
entries to be deleted as soon as the matching response is found on
ingress. The timestamp no longer needs to be kept around only to
prevent egress from creating a new timestamp with the same identifier,
as this new solution should take care of that. This would help a lot
with keeping the map clean, as the timestamp entries would then
automatically be removed as soon as they are no longer needed. The
periodic cleanup from userspace would only be needed to remove the
occasional entries that were never matched for some reason (e.g. the
previously mentioned issue with delayed ACKs, the flow stopped, the
reverse flow can't be observed, etc.).

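As a minimal sketch of the egress-side rule described in this section
(track the most recent identifier per flow, and only sample a packet
that carries a new identifier and passes the rate limit), written as
plain userspace C. The struct, its fields and the function name are
hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-flow sampling state. */
struct sample_state {
	uint32_t last_id;           /* most recent identifier seen on egress */
	uint64_t last_timestamp_ns; /* when an entry was last created */
	bool seen_any;              /* false until the first packet is seen */
	bool has_timestamped;       /* false until the first entry is created */
};

/*
 * Returns true if the outgoing packet with identifier `id` should get a
 * timestamp entry. The packet must be the first one in the flow with
 * that identifier, and the rate limit must have expired.
 */
bool should_timestamp(struct sample_state *f, uint32_t id,
		      uint64_t now_ns, uint64_t min_interval_ns)
{
	assert(f != NULL);

	bool new_id = !f->seen_any || id != f->last_id;

	/* Track the identifier even if the packet ends up not being sampled. */
	f->last_id = id;
	f->seen_any = true;

	if (!new_id)
		return false; /* not the first packet with this identifier */

	if (f->has_timestamped &&
	    now_ns - f->last_timestamp_ns < min_interval_ns)
		return false; /* rate limited */

	f->last_timestamp_ns = now_ns;
	f->has_timestamped = true;
	return true;
}
```

Note that a packet arriving after the interval has expired is still
skipped if its identifier has already been seen; sampling is simply
delayed until the next packet with a new identifier.
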
# Implementation considerations
TODO (can partly be found in
[status-slides](https://github.com/xdp-project/bpf-research/blob/master/meetings/simon/work_summary_20210222.org))
## "Global" vs PERCPU maps
## Concurrency issues
## Global variable vs single-entry map