TODO
Protocols
- TCP (based on timestamp options; see the sketch after this list)
  - Skip pure ACKs for egress
    - Timestamping pure ACKs may lead to erroneous RTTs (ex. delay between application attempting to send data being recognized as an RTT)
  - Skip non-ACKs for ingress
    - The echoed TSecr is not valid if the ACK-flag is not set
  - Add fallback to SEQ/ACK in case of no timestamp?
    - Some machines may not use TCP timestamps (either not supported at all, or disabled as in ex. Windows 10)
    - If one only considers SEQ/ACK (and doesn't check for SACK options), it could result in ex. the delay from a retransmission being included in the RTT
- ICMP (ex. Echo/Reply)
- QUIC (based on spinbit)
- DNS queries
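As a rough illustration of the timestamp-based approach for TCP, the sketch below walks the TCP options to extract TSval/TSecr from the timestamp option (kind 8, length 10), which is what the egress/ingress programs would match on. It is a minimal, illustrative version: the bounds handling, option walk and names are assumptions and do not necessarily mirror the actual ePPing parser.

```c
#include <linux/types.h>
#include <linux/tcp.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define MAX_TCP_OPTIONS 10 /* assumed bound to keep the loop verifier-friendly */

/* Walk the TCP options and return 0 if a timestamp option was found, with
 * TSval/TSecr stored in host byte order. Assumes tcph itself has already
 * been validated against data_end by the caller. */
static __always_inline int parse_tcp_ts(struct tcphdr *tcph, void *data_end,
					__u32 *tsval, __u32 *tsecr)
{
	__u8 *pos = (__u8 *)(tcph + 1); /* options start after the header */
	__u8 *opt_end = (__u8 *)tcph + tcph->doff * 4;
	int i;

	for (i = 0; i < MAX_TCP_OPTIONS; i++) {
		if (pos + 1 > opt_end || pos + 1 > (__u8 *)data_end)
			return -1;

		if (*pos == 0) /* end of option list */
			return -1;
		if (*pos == 1) { /* NOP, single byte */
			pos++;
			continue;
		}

		if (pos + 2 > opt_end || pos + 2 > (__u8 *)data_end)
			return -1;

		if (*pos == 8 && *(pos + 1) == 10) { /* timestamp option */
			if (pos + 10 > opt_end || pos + 10 > (__u8 *)data_end)
				return -1;
			*tsval = bpf_ntohl(*(__u32 *)(pos + 2));
			*tsecr = bpf_ntohl(*(__u32 *)(pos + 6));
			return 0;
		}

		/* skip other options by their length field (min 2 bytes) */
		pos += *(pos + 1) > 1 ? *(pos + 1) : 2;
	}
	return -1;
}
```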
General pping
- Add sampling so that RTT is not calculated for every packet (with unique value) for large flows
  - Allow short bursts to bypass sampling in order to handle delayed ACKs, reordered or lost packets etc.
- Keep some per-flow state (a hypothetical layout is sketched after this list)
  - Will likely be needed for the sampling
  - Could potentially include keeping track of average RTT, which may be useful for some decisions (ex. how often to sample, when an entry can be removed etc.)
  - Could potentially include keeping track of minimum RTT (as done by the original pping), ex. to track bufferbloat
  - Could potentially include keeping track of whether the flow is bi-directional
    - Original pping checks if a flow is bi-directional before adding timestamps, but this could miss shorter flows
- Dynamically grow the maps if they are starting to get full
- Use libxdp to load the XDP program
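A hypothetical per-flow state covering the points above (sampling rate-limit, average/minimum RTT, bi-directional tracking) might look like the sketch below. The field names, sizes and the key type are illustrative assumptions, not the actual ePPing definitions.

```c
#include <linux/types.h>
#include <linux/bpf.h>
#include <stdbool.h>
#include <bpf/bpf_helpers.h>

struct flow_key { /* assumed 5-tuple key; the real key type may differ */
	__u32 saddr[4], daddr[4]; /* room for IPv6 addresses */
	__u16 sport, dport;
	__u8  proto;
};

struct flow_state {
	__u64 last_timestamp;   /* when a packet was last timestamped (ns) */
	__u64 min_rtt;          /* lowest RTT seen so far (ns) */
	__u64 srtt;             /* smoothed/average RTT estimate (ns) */
	__u32 last_id;          /* last seen outgoing identifier (ex. TSval) */
	__u32 pkts;             /* packets seen for this flow direction */
	__u64 bytes;            /* bytes seen for this flow direction */
	bool  is_bidirectional; /* a reply has been seen for this flow */
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384);
	__type(key, struct flow_key);
	__type(value, struct flow_state);
} flow_state SEC(".maps");
```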
Done
- Clean up commits and add signed-off-by tags
- Add SPDX-license-identifier tags
- Format C-code in kernel style
- Use existing functionality to reuse maps by using BTF-defined maps
  - Use BTF-defined maps for TC-BPF as well if iproute has libbpf support
- Cleanup: Unload TC-BPF at program shutdown, and unpin map
  - In userspace part
- Add IPv6 support
- Refactor to support easy addition of other protocols
- Load the tc-bpf program with libbpf (only attach it with tc)
  - Switch to libbpf TC-BPF API for attaching the TC-BPF program
- Add option for machine-readable output (as original pping)
  - It may be a good idea to keep the same format as the original pping, so that tools such as ppviz work for both pping implementations.
- Add timestamps to output (as original pping)
- Add support for other hooks
  - TC-BPF on ingress instead of XDP
- Improve map cleaning:
  - Use BPF iter to clean up maps with BPF programs instead of looping through entries in userspace.
  - Use a dynamic time to live for map entries based on the flow's RTT instead of the static 10s limit
    - Keeping entries around for a long time allows the map to grow unnecessarily large, which slows down the cleaning and may block new entries
  - Keep track of outstanding timestamps, only match when necessary
    - Can avoid doing lookups in the timestamp hash map if we know that there are no outstanding (unmatched) timestamps for the flow
Potential issues
Limited information in different output formats
The ppviz format is a bit limited in what information it can include. One of these limitations is that it does not include any protocol information, as it was designed with only TCP in mind. If PPing is used with protocols other than TCP, it may therefore not be possible to distinguish flows that use different protocols. PPing will thus emit a warning if attempting to use the ppviz format with protocols other than TCP, but will still allow it.
Another piece of information tracked by PPing which can't be included in the ppviz format is whether the calculated RTT includes the local processing delay or not (that is, whether it was timestamped on ingress and matched on egress instead of being timestamped on egress and matched on ingress). Currently this information is only included in the JSON format, but it could potentially be added to the standard format if deemed important.
Cannot detect end of ICMP "flow"
ICMP is not a flow-based protocol, and there is therefore no signaling that an ICMP "flow" is about to close. Consequently, there is no way for PPing to automatically detect that an ICMP flow has stopped and delete its flow-state entry (and send a timely flow closing event).
A consequence of this is that the ICMP flow entries will stick around and occupy a space in the flow state map until they are cleaned out by the periodic cleanup process. The current timeout for ICMP flows is 30 seconds, which means a lot of short ICMP flows could quickly fill up the flow map and potentially block other flows for a considerable time.
RTT-based sampling
The RTT-based sampling feature means that timestamp entries may only be created at an interval proportional to the flow's RTT. This allows flows with shorter RTTs to get more frequent RTT samples than flows with long RTTs. However, as the flow's RTT can only be updated based on the calculated RTT samples, this creates a situation where the RTT's update rate depends on the RTT itself: flows with short RTTs will update the RTT more often, which in turn affects how often they can update it.
This mainly becomes problematic if the sampling rate is based on the sRTT, which may grow. In that case the sRTT will generally be prone to growing faster than it shrinks: if it starts at a low value it will quickly be updated towards higher RTTs, but once it is high it will take longer for it to decrease to a lower RTT again.
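A minimal sketch of such an RTT-proportional rate limit is shown below, assuming the sampling interval is derived from the flow's sRTT with a configurable floor, and reusing the hypothetical struct flow_state from the earlier sketch. The names and scaling are illustrative, not the exact ePPing logic.

```c
/* Allow creating a new timestamp for this flow at most once per sRTT (but
 * never more often than min_interval_ns). Note that because srtt is itself
 * only updated when a sample is taken, the update rate of the RTT estimate
 * depends on the RTT estimate, as described above. */
static __always_inline bool allow_new_timestamp(struct flow_state *fs,
						__u64 now_ns,
						__u64 min_interval_ns)
{
	__u64 interval = fs->srtt > min_interval_ns ? fs->srtt : min_interval_ns;

	if (now_ns - fs->last_timestamp < interval)
		return false;

	fs->last_timestamp = now_ns; /* not atomic; see "Concurrency issues" */
	return true;
}
```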
Losing debug/warning information
The "map full" and "map cleaning" events are pushed through the same perf-buffer as the rtt and flow events. Just like rtt events may be lost if too many of them are pushed, these debug/warning messages may also be lost if there's a lot of other events being pushed. In case one considers these debug/warning messages more critical than the normal RTT reports, it may be worth considering pushing them through a separate channel to make it less likely that they are lost.
RTT-based cleanup of timestamp entries
For most flows the RTT will likely be well below one second, which allows for removing timestamp entries much earlier than the 10s static limit. This helps keep the timestamp map size down, thereby decreasing the risk of there being no room for new entries as well as reducing the number of entries the cleanup process must go through. However, if the RTT for a flow grows quickly (due to ex. bufferbloat), the actual RTT of the flow may increase to beyond 8 times the initial RTT before ePPing has collected enough RTT samples to increase the sRTT to a similar level. This may cause timestamps to be deleted too early, before they have had time to actually match against a reply, in which case one also loses the RTT sample that would have been used to update the sRTT, causing the sRTT to remain too low.
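As a rough illustration, an RTT-based lifetime check along these lines could look like the sketch below. It assumes the lifetime is a fixed multiple of the flow's sRTT with a lower bound; the multiplier (the "8 times" figure mentioned above is reused purely for illustration), the floor and the names are assumptions, not ePPing's exact values, and the usual BPF includes are assumed.

```c
#define NSEC_PER_SEC 1000000000ULL
#define TS_LIFETIME_RTT_MULT 8                  /* assumed multiple of sRTT */
#define TS_LIFETIME_MIN      (1 * NSEC_PER_SEC) /* assumed floor, 1 s */

/* A timestamp entry created at 'created_ns' is considered stale once it is
 * older than a few sRTTs; if the flow's real RTT has outgrown the sRTT
 * faster than samples arrive, this may expire entries too early. */
static bool timestamp_expired(__u64 created_ns, __u64 now_ns, __u64 srtt_ns)
{
	__u64 lifetime = srtt_ns * TS_LIFETIME_RTT_MULT;

	if (lifetime < TS_LIFETIME_MIN)
		lifetime = TS_LIFETIME_MIN;

	return now_ns - created_ns > lifetime;
}
```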
In practice this issue is limited by the fact that the cleanup process only runs periodically, so even for flows with a very low RTT the average time until a timestamp entry is deleted would still be 500 ms (at the default rate of 1 Hz). But this also limits the usefulness of trying to delete entries earlier.
Overall, a better approach in the future might be to get rid of the periodic cleanup process entirely, and instead only evict old entries if they're blocking a new entry from being inserted. This would be similar to using LRU maps, but would need some way to prevent an existing entry from being removed in case it's too young.
Periodical map cleanup may miss entries on delete
Due to how the hash map traversal is implemented, all entries may not be traversed each time a map is iterated through. Specifically, if an entry is deleted and there are other entries remaining in the same bucket, those remaining entries will not be traversed. As the periodical cleanup may delete entries while it traverses the map, this may result in some of the entries which share a bucket with deleted entries not always being traversed.
In general this should not cause a large problem, as those entries will simply be traversed the next time the map is iterated over instead. However, it may cause certain entries to remain in the hash map a bit longer than expected (and, if the map is full, subsequently block new entries from being created for that duration).
Concurrency issues
The program uses "global" (not PERCPU) hash maps to keep state. As
the BPF programs need to see the global view to function properly,
using PERCPU maps is not an option. The program must be able to
match against stored packet timestamps regardless of the CPU the
packets are processed on, and must also have a global view of the flow
state in order for the sampling to work correctly.
As the BPF programs may run concurrently on different CPU cores accessing these global hash maps, this may result in some concurrency issues. In practice, I do not believe these will occur particularly often as the hash-map entries are per-flow, and I'm under the impression that packets from the same flow will typically be processed by the same CPU. Furthermore, most of the concurrency issues will not be that problematic even if they do occur. For now, I've therefore left these concurrency issues unattended, even if some of them could be avoided with atomic operations and/or spinlocks, in order to keep things simple and not hurt performance.
The (known) potential concurrency issues are:
Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing identifier
for each flow, by storing it in the flow_state map. This is done to
detect the first packet with a new identifier. If multiple packets are
processed concurrently, several of them could potentially detect
themselves as being first with the same identifier (which only matters
if they also pass the rate-limit check); alternatively, if the
concurrent packets have different identifiers there may be a lost
update (but for TCP timestamps, concurrent packets would typically be
expected to have the same timestamp).
A possibly more severe issue is out-of-order packets. If a packet with an old identifier arrives out of order, that identifier could be detected as a new identifier. If for example the following flow of four packets with just two different identifiers (id1 and id2) were to occur:
id1 -> id2 -> id1 -> id2
Then the tc/egress program would consider each of these packets to have a new identifier and try to create a new timestamp for each of them, if the sampling strategy allows it. However, even if the sampling strategy allows it, the (incorrect) creation of timestamps for id1 and id2 the second time will only succeed if the first timestamps for id1 and id2 have already been matched against (and thus deleted). Even then, they would only result in an incorrect RTT being reported if there are also new matches against these identifiers.
This issue could be avoided entirely by requiring that new-id > old-id instead of simply checking that new-id != old-id, as TCP timestamps should monotonically increase. That may however not be a suitable solution for other types of identifiers.
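A wrap-around-safe version of the suggested new-id > old-id check could look like the sketch below, using the same idiom as the kernel's before()/after() macros for TCP sequence numbers, since TCP timestamps use the same modular arithmetic. The helper name is illustrative.

```c
/* Returns true if new_id is strictly newer than old_id, treating the 32-bit
 * identifiers as a circular space so wrap-around is handled correctly. */
static __always_inline bool identifier_after(__u32 new_id, __u32 old_id)
{
	return (__s32)(new_id - old_id) > 0;
}

/*
 * Used in place of the plain inequality check, an out-of-order packet
 * carrying an old TSval would then not be mistaken for a new identifier:
 *
 *	if (identifier_after(tsval, fs->last_id))
 *		fs->last_id = tsval;
 */
```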
Rate-limiting new timestamps
In the tc/egress program packets to timestamp are sampled by using a
per-flow rate-limit, which is enforced by storing when the last
timestamp was created in the flow_state map. If multiple packets
perform this check concurrently, it's possible that multiple packets
think they are allowed to create timestamps before any of them are
able to update the last_timestamp. When they update last_timestamp
it might also be slightly incorrect; however, if they are processed
concurrently then they should also generate very similar timestamps.
If the packets have different identifiers (which would typically not be expected for concurrent TCP timestamps), this would allow some packets to bypass the rate-limit. By bypassing the rate-limit, the flow would use up some additional map space and report a few more RTTs than expected (however, the reported RTTs should still be correct).
If the packets have the same identifier, they must first have managed to bypass the previous check for unique identifiers (see previous point), and only one of them will be able to successfully store a timestamp entry.
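If one did want to close this check-then-update window, a compare-and-swap on last_timestamp could ensure that only one of the concurrent packets wins the right to create a timestamp, as in the sketch below. This requires BPF atomics support (clang -mcpu=v3 and a reasonably recent kernel) and is shown only as a possible mitigation, not something ePPing currently does; the names reuse the hypothetical flow_state fields from the earlier sketch.

```c
/* Only the packet that successfully swaps the old last_timestamp value for
 * 'now_ns' is allowed to create a new timestamp entry; concurrent packets
 * that lose the race see a changed value and back off. */
static __always_inline bool try_claim_timestamp(struct flow_state *fs,
						__u64 now_ns, __u64 interval_ns)
{
	__u64 prev = fs->last_timestamp;

	if (now_ns - prev < interval_ns)
		return false; /* rate limit has not expired yet */

	return __sync_val_compare_and_swap(&fs->last_timestamp,
					   prev, now_ns) == prev;
}
```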
Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent
packets with the same identifier against a single timestamp entry in
packet_ts, before any of them manage to delete the timestamp
entry. This would result in multiple RTTs being reported for the same
identifier, but if they are processed concurrently these RTTs should
be very similar, so would mainly result in over-reporting rather than
reporting incorrect RTTs.
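One way such over-reporting could be avoided is to let the result of the map deletion decide who reports, as in the sketch below, relying on bpf_map_delete_elem() succeeding exactly once per entry. The surrounding names (key type, reporting helper) are hypothetical, and this is not necessarily how ePPing handles it today.

```c
/* Look up the stored timestamp and compute the RTT, but only report it if
 * this invocation is the one that actually deleted the entry. */
static __always_inline void match_and_report(void *packet_ts_map, void *key)
{
	__u64 *stamp = bpf_map_lookup_elem(packet_ts_map, key);

	if (!stamp)
		return;

	__u64 rtt = bpf_ktime_get_ns() - *stamp;

	/* Delete first; only the CPU whose delete succeeds reports the RTT. */
	if (bpf_map_delete_elem(packet_ts_map, key) == 0)
		report_rtt(key, rtt); /* hypothetical reporting helper */
}
```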
Updating flow statistics
Both the tc/egress and XDP/ingress programs will try to update some flow statistics each time they successfully parse a packet with an identifier. Specifically, they'll update the number of packets and bytes sent/received. This is not done in an atomic fashion, so there could potentially be some lost updates, resulting in an underestimate.
Furthermore, whenever the XDP/ingress program calculates an RTT, it will check if this is the lowest RTT seen so far for the flow. If multiple RTTs are calculated concurrently, then several could pass this check concurrently and there may be a lost update. It should only be possible for multiple RTTs to be calculated concurrently in case either the timestamp rate-limit was bypassed or multiple packets managed to match against the same timestamp.
It's worth noting that with sampling the reported minimum-RTT is only an estimate anyway (the RTT may never be calculated for the packet with the true minimum RTT). And even without sampling there is some inherent sampling due to TCP timestamps only being updated at a limited rate (1000 Hz).
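If these lost updates ever mattered, the packet/byte counters could be updated with atomic adds, which clang compiles to BPF atomic instructions for map-backed memory. The short sketch below reuses the hypothetical flow_state fields from the earlier sketch; the minimum-RTT update would instead need a compare-and-swap loop, or can be left as the benign race described above.

```c
/* Per-flow statistics updated with atomic adds to avoid lost updates when
 * several CPUs process packets from the same flow concurrently. */
static __always_inline void update_flow_counters(struct flow_state *fs,
						 __u32 pkt_len)
{
	__sync_fetch_and_add(&fs->pkts, 1);
	__sync_fetch_and_add(&fs->bytes, pkt_len);
}
```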
Outputting flow opening/closing events
A flow is not considered opened until a reply has been seen for
it. The flow_state map keeps information about whether the flow has
been opened or not, which is checked and updated for each reply. This
check and update is not performed atomically, so if multiple replies
are processed concurrently, several of them may think they are the
first and emit duplicate flow-opened events.
Likewise, when flows are closed it checks if the flow has been opened to determine if a flow closing message should be sent. If multiple replies are processed concurrently, it's possible that one of them updates the flow-open information and emits a flow opening message, while another reply closes the flow without considering it to ever have been opened, and thus no flow closing message is sent.