# TODO

## Protocols

- [x] TCP (based on timestamp options)
  - [x] Skip pure ACKs for egress
    - Timestamping pure ACKs may lead to erroneous RTTs (ex. the delay
      before the application attempts to send data may be recognized
      as an RTT)
  - [x] Skip non-ACKs for ingress
    - The echoed TSecr is not valid if the ACK flag is not set
  - [ ] Add fallback to SEQ/ACK in case of no timestamp?
    - Some machines may not use TCP timestamps (either not supported
      at all, or disabled as in ex. Windows 10)
    - If one only considers SEQ/ACK (and doesn't check for SACK
      options), this could result in ex. delay from retransmissions
      being included in the RTT
- [x] ICMP (ex. Echo/Reply)
- [ ] QUIC (based on spinbit)
- [ ] DNS queries

## General pping

- [x] Add sampling so that RTT is not calculated for every packet
      (with a unique value) for large flows
  - [ ] Allow short bursts to bypass sampling in order to handle
        delayed ACKs, reordered or lost packets etc.
- [x] Keep some per-flow state
  - Will likely be needed for the sampling
  - [x] Could potentially include keeping track of average RTT, which
        may be useful for some decisions (ex. how often to sample,
        when an entry can be removed etc.)
  - [x] Could potentially include keeping track of minimum RTT (as
        done by the original pping), ex. to track bufferbloat
  - [ ] Could potentially include keeping track of whether the flow
        is bi-directional
    - The original pping checks if a flow is bi-directional before
      adding timestamps, but this could miss shorter flows
- [ ] Dynamically grow the maps if they are starting to get full
- [ ] Use libxdp to load the XDP program

## Done

- [x] Clean up commits and add signed-off-by tags
- [x] Add SPDX-license-identifier tags
- [x] Format C code in kernel style
- [x] Use existing functionality to reuse maps by using BTF-defined
      maps
- [x] Use BTF-defined maps for TC-BPF as well if iproute has libbpf
      support
- [x] Cleanup: unload TC-BPF at program shutdown, and unpin the map -
      in the userspace part
- [x] Add IPv6 support
- [x] Refactor to support easy addition of other protocols
- [x] Load the tc-bpf program with libbpf (only attach it with tc)
- [x] Switch to the libbpf TC-BPF API for attaching the TC-BPF program
- [x] Add option for machine-readable output (as original pping)
  - It may be a good idea to keep the same format as the original
    pping, so that tools such as [ppviz](https://github.com/pollere/ppviz)
    work for both pping implementations.
- [x] Add timestamps to output (as original pping)
- [x] Add support for other hooks
  - TC-BPF on ingress instead of XDP
- [x] Improve map cleaning:
  - [x] Use BPF iterators to clean up maps with BPF programs instead
        of looping through entries in userspace.
  - [x] Use a dynamic time to live for map entries based on the
        flow's RTT instead of the static 10s limit
    - Keeping entries around for a long time allows the map to grow
      unnecessarily large, which slows down the cleaning and may
      block new entries
- [x] Keep track of outstanding timestamps, only match when necessary
  - Can avoid doing lookups in the timestamp hash map if we know that
    there are no outstanding (unmatched) timestamps for the flow

# Potential issues

## Limited information in different output formats

The ppviz format is a bit limited in what information it can
include. One of these limitations is that it does not include any
protocol information, as it was designed with only TCP in mind. When
using PPing with protocols other than TCP, it may therefore not be
possible to distinguish between flows of different protocols. PPing
will therefore emit a warning if attempting to use the ppviz format
with protocols other than TCP, but will still allow it.

Another piece of information tracked by PPing which can't be included
in the ppviz format is whether the calculated RTT includes the local
processing delay or not (that is, whether the packet was timestamped
on ingress and matched on egress instead of being timestamped on
egress and matched on ingress). Currently this information is only
included in the JSON format, but could potentially be added to the
standard format if deemed important.

## Cannot detect end of ICMP "flow"

ICMP is not a flow-based protocol, and therefore there is no signaling
that an ICMP "flow" is about to close. Consequently, there is no way
for PPing to automatically detect that an ICMP flow has stopped and
delete its flow-state entry (and send a timely flow closing event).

A consequence of this is that ICMP flow entries will stick around and
occupy space in the flow-state map until they are cleaned out by the
periodic cleanup process. The current timeout for ICMP flows is 30
seconds, which means a lot of short ICMP flows could quickly fill up
the flow map and potentially block other flows for a considerable
time.

## RTT-based sampling

The RTT-based sampling feature means that timestamp entries may only
be created at an interval proportional to the flow's RTT. This allows
flows with shorter RTTs to get more frequent RTT samples than flows
with long RTTs. However, as the flow's RTT can only be updated based
on the calculated RTT samples, this creates a situation where the
RTT's update rate depends on itself: flows with short RTTs will
update the RTT more often, which in turn affects how often they can
update the RTT.
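
As a minimal sketch of this feedback loop (the field names, types and
fallback interval are illustrative, not ePPing's exact definitions):

```c
#include <stdbool.h>
#include <linux/types.h>

#define DEFAULT_INTERVAL (100 * 1000 * 1000ULL) /* assumed fallback: 100 ms in ns */

/* Hypothetical per-flow state; the real struct has more fields */
struct flow_state {
	__u64 last_timestamp; /* when the last timestamp entry was created (ns) */
	__u64 srtt;           /* smoothed RTT estimate for the flow (ns) */
};

/* Only allow a new timestamp entry once per sRTT. Note the feedback
 * loop: srtt is itself only updated when a timestamped packet is
 * matched against a reply, so the sampling rate both depends on and
 * limits how often srtt can change. */
static bool allow_new_timestamp(struct flow_state *f, __u64 now)
{
	__u64 interval = f->srtt ? f->srtt : DEFAULT_INTERVAL;

	if (now - f->last_timestamp < interval)
		return false;

	f->last_timestamp = now; /* not atomic; see concurrency issues below */
	return true;
}
```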

This mainly becomes problematic if basing the sampling rate on the
sRTT, which may grow. In that case the sRTT will generally be prone
to growing faster than it shrinks: if it starts out low it will
quickly be updated to higher RTTs, but once it is high it will take
longer for it to decrease to a lower RTT again.

## Losing debug/warning information

The "map full" and "map cleaning" events are pushed through the same
perf buffer as the RTT and flow events. Just like RTT events may be
lost if too many of them are pushed, these debug/warning messages may
also be lost if a lot of other events are being pushed. If one
considers these debug/warning messages more critical than the normal
RTT reports, it may be worth pushing them through a separate channel
to make it less likely that they are lost.
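
A sketch of what such a separate channel could look like, as a second
BTF-defined perf buffer (the map names here are hypothetical):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Perf buffer for the normal RTT and flow events */
struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");

/* Separate perf buffer reserved for debug/warning events, so they
 * are not dropped alongside a burst of RTT reports */
struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} debug_events SEC(".maps");

/* In the BPF programs, warnings would then go to the dedicated buffer:
 *
 *	bpf_perf_event_output(ctx, &debug_events, BPF_F_CURRENT_CPU,
 *			      &event, sizeof(event));
 */
```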

## RTT-based cleanup of timestamp entries

For most flows the RTT will likely be well below one second, which
allows for removing timestamp entries much earlier than the 10s static
limit. This helps keep the timestamp map size down, thereby decreasing
the risk of there being no room for new entries, as well as reducing
the number of entries the cleanup process must go through. However, if
the RTT for a flow grows quickly (due to ex. bufferbloat), the actual
RTT of the flow may increase to beyond 8 times the initial RTT before
ePPing has collected enough RTT samples to increase the sRTT to a
similar level. This may cause timestamps to be deleted too early,
before they have had time to actually match against a reply, in which
case one also loses the RTT sample that would have been used to update
the sRTT, causing the sRTT to remain too low.
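
A sketch of this kind of dynamic lifetime; the factor of 8 matches
the headroom implied above, but the exact computation and bounds used
by ePPing may differ:

```c
#include <linux/types.h>

#define MIN_TTL_NS (10 * 1000 * 1000ULL)   /* assumed lower bound: 10 ms */
#define MAX_TTL_NS (10ULL * 1000000000ULL) /* the old static limit: 10 s */

/* Hypothetical dynamic time-to-live for a timestamp entry, scaled by
 * the flow's smoothed RTT and clamped to sane bounds. An entry older
 * than this is eligible for removal by the cleanup process. */
static __u64 timestamp_entry_ttl(__u64 srtt_ns)
{
	__u64 ttl = 8 * srtt_ns;

	if (ttl < MIN_TTL_NS)
		ttl = MIN_TTL_NS;
	if (ttl > MAX_TTL_NS)
		ttl = MAX_TTL_NS;

	return ttl;
}
```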

In practice this issue is limited by the fact that the cleanup process
only runs periodically, so even for flows with a very low RTT the
average time to delete a timestamp entry would still be 500 ms (at the
default rate of 1 Hz). But this also limits the usefulness of trying
to delete entries earlier.

Overall, a better approach in the future might be to get rid of the
periodic cleanup process entirely, and instead only evict old entries
if they're blocking a new entry from being inserted. This would be
similar to using LRU maps, but would need some way to prevent an
existing entry from being removed in case it's too young.

## Periodic map cleanup may miss entries on delete

Due to how the hash map traversal is implemented, not all entries may
be traversed each time a map is iterated through. Specifically, if an
entry is deleted and there are other entries remaining in the same
bucket, those remaining entries will not be traversed. As the periodic
cleanup may delete entries as it is traversing the map, this may
result in some of the entries which share a bucket with deleted
entries not always being traversed.

In general this should not cause a large problem, as those entries
will simply be traversed the next time the map is iterated over
instead. However, it may cause certain entries to remain in the hash
map a bit longer than expected (and, if the map is full, subsequently
block new entries from being created for that duration).
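
For a userspace-driven cleanup, one way to sidestep this would be to
collect the keys of expired entries during the traversal and only
delete them once the traversal is done, so that deletions cannot
disturb the iteration. A sketch, with placeholder key/value types and
age check:

```c
#include <stdbool.h>
#include <bpf/bpf.h>
#include <linux/types.h>

#define MAX_DEL 1024 /* arbitrary batch size for this sketch */
#define TIMEOUT_NS (10ULL * 1000000000ULL) /* placeholder timeout: 10 s */

struct map_value { __u64 timestamp; }; /* placeholder value type */

static bool is_expired(const struct map_value *val, __u64 now)
{
	return now - val->timestamp > TIMEOUT_NS;
}

/* Traverse first, delete afterwards */
static void clean_map(int map_fd, __u64 now)
{
	__u64 key, next_key, del_keys[MAX_DEL];
	struct map_value val;
	int n = 0, err, i;

	err = bpf_map_get_next_key(map_fd, NULL, &next_key);
	while (!err && n < MAX_DEL) {
		key = next_key;
		err = bpf_map_get_next_key(map_fd, &key, &next_key);

		if (!bpf_map_lookup_elem(map_fd, &key, &val) &&
		    is_expired(&val, now))
			del_keys[n++] = key;
	}

	for (i = 0; i < n; i++)
		bpf_map_delete_elem(map_fd, &del_keys[i]);
}
```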

## Concurrency issues

The program uses "global" (not `PERCPU`) hash maps to keep state. As
the BPF programs need to see the global view to function properly,
using `PERCPU` maps is not an option. The program must be able to
match against stored packet timestamps regardless of the CPU the
packets are processed on, and must also have a global view of the flow
state in order for the sampling to work correctly.
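
For illustration, such a global BTF-defined hash map looks roughly
like this (the key/value types and size are placeholders, not
ePPing's exact definitions):

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct network_tuple { __u32 placeholder; }; /* stand-in flow key */
struct flow_state { __u32 placeholder; };    /* stand-in flow state */

/* A single hash map shared by all CPUs; with BPF_MAP_TYPE_PERCPU_HASH
 * each CPU would only see its own copy of the flow state, breaking
 * timestamp matching and sampling across CPUs */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct network_tuple);
	__type(value, struct flow_state);
} flow_state SEC(".maps");
```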

As the BPF programs may run concurrently on different CPU cores
accessing these global hash maps, this may result in some concurrency
issues. In practice, I do not believe these will occur particularly
often, as the hash-map entries are per-flow, and I'm under the
impression that packets from the same flow will typically be processed
by the same CPU. Furthermore, most of the concurrency issues will not
be that problematic even if they do occur. For now, I've therefore
left these concurrency issues unattended, even though some of them
could be avoided with atomic operations and/or spinlocks, in order to
keep things simple and not hurt performance.

The (known) potential concurrency issues are:

### Tracking last seen identifier

The tc/egress program keeps track of the last seen outgoing identifier
for each flow by storing it in the `flow_state` map. This is done to
detect the first packet with a new identifier. If multiple packets are
processed concurrently, several of them could potentially detect
themselves as being first with the same identifier (which only matters
if they also pass the rate-limit check). Alternatively, if the
concurrent packets have different identifiers, there may be a lost
update (but for TCP timestamps, concurrent packets would typically be
expected to have the same timestamp).
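
In sketch form, the race lives between the check and the store (field
names are illustrative); a compare-and-swap could close the window if
one were willing to pay for it:

```c
#include <stdbool.h>
#include <linux/types.h>

struct flow_state { __u32 last_id; }; /* illustrative field */

/* Racy check-and-update: two packets processed concurrently can both
 * see last_id != id before either stores id, and both conclude they
 * are "first" with this identifier. */
static bool is_new_identifier(struct flow_state *f, __u32 id)
{
	if (f->last_id == id)
		return false;

	f->last_id = id; /* a concurrent update can be lost here */
	return true;
}

/* With BPF atomics (kernel >= 5.12, built with -mcpu=v3), a
 * compare-and-swap would let exactly one packet win:
 *
 *	__u32 old = f->last_id;
 *	if (old != id &&
 *	    __sync_val_compare_and_swap(&f->last_id, old, id) == old)
 *		return true;
 */
```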

### Rate-limiting new timestamps

In the tc/egress program, packets to timestamp are sampled by using a
per-flow rate limit, which is enforced by storing in the `flow_state`
map when the last timestamp was created. If multiple packets perform
this check concurrently, it's possible that several of them think they
are allowed to create timestamps before any of them has been able to
update `last_timestamp`. When they update `last_timestamp` it might
also end up slightly incorrect; however, if they are processed
concurrently then they should also generate very similar timestamps.

If the packets have different identifiers (which would typically not
be expected for concurrent TCP timestamps), then this would allow some
packets to bypass the rate limit. By bypassing the rate limit, the
flow would use up some additional map space and report some additional
RTT(s) beyond what's expected (however, the reported RTTs should still
be correct).

If the packets have the same identifier, they must first have managed
to bypass the previous check for unique identifiers (see [previous
point](#tracking-last-seen-identifier)), and only one of them will be
able to successfully store a timestamp entry.

### Matching against stored timestamps

The XDP/ingress program could potentially match multiple concurrent
packets with the same identifier against a single timestamp entry in
`packet_ts` before any of them manage to delete the timestamp
entry. This would result in multiple RTTs being reported for the same
identifier, but if they are processed concurrently these RTTs should
be very similar, so it would mainly result in over-reporting rather
than in incorrect RTTs.
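
One possible way to avoid the over-reporting would be to only report
an RTT when the delete itself succeeds, as `bpf_map_delete_elem()`
returns success to exactly one of several concurrent callers. A
sketch, with simplified key and reporting details:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Stand-in for pushing an RTT event, ex. via bpf_perf_event_output() */
static void report_rtt(void *ctx, __u64 rtt)
{
	/* ... */
}

/* Report at most one RTT per timestamp entry: only the caller whose
 * bpf_map_delete_elem() succeeds gets to report. The map and key
 * types are simplified placeholders. */
static void match_timestamp(void *ctx, void *packet_ts, const void *key)
{
	__u64 *tstamp = bpf_map_lookup_elem(packet_ts, key);

	if (!tstamp)
		return;

	__u64 rtt = bpf_ktime_get_ns() - *tstamp;

	if (bpf_map_delete_elem(packet_ts, key) == 0)
		report_rtt(ctx, rtt);
}
```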

### Updating flow statistics

Both the tc/egress and XDP/ingress programs will try to update some
flow statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and
bytes sent/received. This is not done in an atomic fashion, so there
could potentially be some lost updates, resulting in an underestimate.
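
In sketch form (with illustrative fields), the difference between the
current non-atomic updates and an atomic alternative:

```c
#include <linux/types.h>

struct flow_state { __u64 packets, bytes; }; /* illustrative fields */

/* Plain increments compile to load + add + store, so concurrent
 * updates from different CPUs can overwrite each other */
static void update_stats(struct flow_state *f, __u32 pkt_len)
{
	f->packets++;
	f->bytes += pkt_len;
}

/* A possible fix (not what ePPing currently does): BPF's atomic add
 * avoids the lost updates at some performance cost */
static void update_stats_atomic(struct flow_state *f, __u32 pkt_len)
{
	__sync_fetch_and_add(&f->packets, 1);
	__sync_fetch_and_add(&f->bytes, pkt_len);
}
```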

Furthermore, whenever the XDP/ingress program calculates an RTT, it
will check if this is the lowest RTT seen so far for the flow. If
multiple RTTs are calculated concurrently, several of them could pass
this check at the same time and there may be a lost update. It should
only be possible for multiple RTTs to be calculated concurrently in
case either the [timestamp rate limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed
to match against the same
timestamp](#matching-against-stored-timestamps).

It's worth noting that with sampling, the reported minimum RTT is only
an estimate anyway (the packet with the true minimum RTT may never
have its RTT calculated). And even without sampling there is some
inherent sampling due to TCP timestamps only being updated at a
limited rate (1000 Hz).

### Outputting flow opening/closing events

A flow is not considered opened until a reply has been seen for
it. The `flow_state` map keeps information about whether the flow has
been opened, which is checked and updated for each reply. The check
and update of this information is not performed atomically, which may
result in multiple replies thinking they are the first, emitting
multiple flow-opened events, in case they are processed concurrently.

Likewise, when a flow is closed, the program checks whether the flow
has been opened to determine if a flow closing message should be
sent. If multiple replies are processed concurrently, it's possible
that one of them updates the flow-open information and emits a flow
opening message, while another reply closes the flow without
considering it to ever have been opened, and thus never sends a flow
closing message.
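
If this ever became a problem worth fixing, an atomic
compare-and-swap on the flow-open flag would let exactly one reply
win the open transition. A sketch (requires BPF atomics, kernel >=
5.12 with -mcpu=v3; not what ePPing currently does):

```c
#include <stdbool.h>
#include <linux/types.h>

struct flow_state { __u32 is_open; }; /* illustrative field */

/* Exactly one concurrent reply sees the 0 -> 1 transition succeed
 * and should be the one to emit the flow-opened event; a later close
 * can then read back a consistent open state. */
static bool mark_flow_open(struct flow_state *f)
{
	return __sync_val_compare_and_swap(&f->is_open, 0, 1) == 0;
}
```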