Add counters for runtime errors in the BPF programs to the global
counters. Specifically, add counters for failing to create entries in
the packet-timestamp, flow-state and aggregation-subnet maps. The
counters can easily be extended to include other errors in the
future. Output any non-zero counters in an errors section at the
end of the global-counters report.
Example standard entry (linebreaks not part of actual output):
13:53:40.450555237: TCP=(pkts=110983, bytes=899455326), ICMP=(pkts=16,
bytes=1568), ECN=(Not-ECT=110999), errors=(store-packet-ts=210,
create-flow-state=8, create-agg-subnet-state=110999)
Example JSON entry:
{
  "timestamp": 1698235250698609700,
  "protocol_counters": {
    "TCP": {
      "packets": 111736,
      "bytes": 898999024
    },
    "ICMP": {
      "packets": 20,
      "bytes": 1960
    }
  },
  "ecn_counters": {
    "no_ECT": 111756
  },
  "errors": {
    "store_packet_ts": 165,
    "create_flow_state": 10,
    "create_agg_subnet_state": 111756
  }
}
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add counters for the 4 ECN code points (00=Not-ECT, 01=ECT1, 10=ECT0
and 11=CE) to the global counters. These are reported together with
the global protocol counters when running in aggregated mode.
Example standard entry (linebreaks not part of actual output):
19:32:40.224309565: non-IP=(pkts=6, bytes=252), UDP=(pkts=9,
bytes=495), ECN=(Not-ECT=4, ECT1=3, CE=2)
Example JSON entry:
{
  "timestamp": 1698082435757528300,
  "protocol_counters": {
    "non_IP": {
      "packets": 6,
      "bytes": 252
    },
    "UDP": {
      "packets": 9,
      "bytes": 495
    }
  },
  "ecn_counters": {
    "no_ECT": 4,
    "ECT1": 3,
    "CE": 2
  }
}
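For reference, the ECN codepoint is carried in the two
least-significant bits of the IPv4 TOS / IPv6 traffic class
field (RFC 3168). A minimal sketch of extracting it (helper name is
illustrative):

/* 0 = Not-ECT, 1 = ECT(1), 2 = ECT(0), 3 = CE */
static __u8 ecn_codepoint(__u8 tos)
{
    return tos & 0x3;
}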
Originally planned to also include a counter for ECN-echo in the TCP
header. However, adding parsing of a TCP field for ALL TCP packets is
currently challenging because the parsing of the TCP header is
conditional. First, the TCP header is only parsed if the program is
configured to capture TCP RTTs (which can be disabled by passing the
-C/--icmp flag without the -T/--tcp flag). Second, parsing the TCP
header is tied to parsing TCP timestamps, and the function parsing
the TCP timestamps signals failure regardless of whether it failed to
parse the TCP header itself or just the TCP timestamps. Parsing a TCP
field (like ECE) is thus only safe for the subset of packets where
TCP timestamps could successfully be parsed, which would create
misleading stats, as the other ECN counters cover all IP traffic.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add global per-protocol counters for the aggregated output. These
counters include all packets the eBPF program processes (even those
for which it cannot parse an IP address and thus cannot add to the
per-subnet packet counts). Output the global counts at the end of
every aggregated report.
Example with standard output (linebreaks not part of output):
15:47:28.544011000: non-IP(pkts=6, bytes=252), TCP(pkts=88316,
bytes=3094356024), ICMP(pkts=3983, bytes=390110), 47(pkts=80)
Example with JSON output:
{
  "timestamp": 1697635992487286800,
  "protocol_counters": {
    "non_IP": {
      "packets": 4,
      "bytes": 168
    },
    "TCP": {
      "packets": 344633,
      "bytes": 16609641822
    },
    "ICMP": {
      "packets": 3960,
      "bytes": 388016
    },
    "47": {
      "packets": 60
    }
  }
}
Some implementation details:
Internally keep packet and byte counters for non-IP, TCP, UDP, ICMP
and ICMPv6, i.e. the "common protocols". To catch any other non-common
IP-protocol, keep an array of packet counters for every possible
IP-protocol [0, 255]. In the output, provide names for the common
protocols (e.g. "TCP"), while only outputting the protocol number of
non-common protocols. To avoid excessive output, only output counters
that are non-zero. This way, output is minimized while still allowing
for detecting unexpected (or even illegal) protocol numbers.
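A rough sketch of what such a counter layout could look like (names
here are illustrative, not the actual struct definitions):

struct pkt_byte_counters {
    __u64 packets;
    __u64 bytes;
};

/* Sketch - full packet/byte counters for the common protocols, plus
 * a packet counter for every possible IP-protocol number */
struct global_counters {
    struct pkt_byte_counters non_ip;
    struct pkt_byte_counters tcp;
    struct pkt_byte_counters udp;
    struct pkt_byte_counters icmp;
    struct pkt_byte_counters icmpv6;
    __u64 proto_pkts[256]; /* indexed by IP protocol number */
};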
Unlike the per-prefix stats, do not reset the global counters.
Instead, keep a copy of the previous counts and calculate the
difference in user space to report the change since the previous
report. This unsynchronized approach is simpler than the synchronized
approach of swapping between two map instances used by the per-prefix
stats, but may result in small inconsistencies (e.g. the packet count
and byte count may mismatch if the counters are fetched while an eBPF
program has updated the packet counter but not yet the byte counter).
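The user-space delta calculation is then straightforward (a sketch
reusing the illustrative struct above; actual names differ):

/* Report the change since the previous report without resetting the
 * kernel-side counters */
static void report_global_tcp_delta(const struct global_counters *cur,
                                    struct global_counters *prev)
{
    __u64 pkts = cur->tcp.packets - prev->tcp.packets;
    __u64 bytes = cur->tcp.bytes - prev->tcp.bytes;

    if (pkts || bytes)
        printf("TCP=(pkts=%llu, bytes=%llu)", pkts, bytes);

    prev->tcp = cur->tcp; /* remember current counts for next report */
}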
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Additions in the coming commits increase the maximum stack space used
by the eBPF programs past the 512 byte limit (causing verifier
rejection). To avoid this, move the relatively large packet_info
struct to a single-entry per-CPU array map.
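The pattern looks roughly like this (map and helper names are
illustrative):

/* Single-entry per-CPU array used as scratch space for packet_info,
 * keeping the large struct off the 512-byte BPF stack */
struct {
    __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
    __uint(max_entries, 1);
    __type(key, __u32);
    __type(value, struct packet_info);
} map_packet_info SEC(".maps");

static struct packet_info *get_scratch_pinfo(void)
{
    __u32 key = 0;

    /* May in theory return NULL, so callers must still NULL-check */
    return bpf_map_lookup_elem(&map_packet_info, &key);
}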
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
When running in aggregated mode (-a/--aggregate), the previous
per-prefix packet and byte counters only included traffic that the RTT
was tracked for, i.e. by default only TCP traffic with TCP
timestamps (for which a flow state could be created) was counted. This
makes it hard to correlate RTTs with traffic load, as the total
traffic load to/from a given prefix is not known.
Therefore, split up the per-prefix counters into 3 sets of counters:
- One for TCP traffic with timestamps, i.e. the TCP traffic we can
track RTTs of
- One for TCP traffic without timestamps, i.e. TCP traffic we cannot
track due to relying on TCP timestamps
- One for non-TCP traffic, which when combined with the other counters
gives the total amount of (IP) traffic going to/from a prefix
Do NOT create NEW prefix entries for traffic which the RTT cannot be
tracked for. This means that if some prefix only sees traffic of a
type that RTTs cannot be captured for, that traffic will end up in
the global /0 backup entries.
To keep the standard output somewhat manageable (it is already quite
wide), only output the total packet and byte counts for the traffic
to/from the prefix. For the JSON format, output the counters for each
individual set (TCP_TS, TCP_noTS, and other) which are non-empty.
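Conceptually, the per-prefix counters now look something like
this (a sketch; the real member names may differ):

struct pkt_byte_count {
    __u64 packets;
    __u64 bytes;
};

/* Kept once for rx and once for tx per prefix */
struct prefix_traffic_stats {
    struct pkt_byte_count tcp_ts;   /* TCP with timestamps - RTT-trackable */
    struct pkt_byte_count tcp_nots; /* TCP without timestamps */
    struct pkt_byte_count other;    /* remaining (non-TCP) IP traffic */
};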
Example standard entry after update (same as before update, linebreaks
not part of actual output):
14:42:10.451929078: 10.11.1.10/32 -> rxpkts=4303, rxbytes=347742,
txpkts=37658, txbytes=1888184076, rtt-count=1202, min=0.006963 ms,
mean=2 ms, median=2 ms, p95=2 ms, max=3.10063 ms
Example JSON entry before update:
{
  "timestamp": 1697638197074346000,
  "ip_prefix": "10.11.1.10/32",
  "rx_packets": 2495,
  "tx_packets": 12121,
  "rx_bytes": 164670,
  "tx_bytes": 601306338,
  "count_rtt": 743,
  "min_rtt": 7717,
  "mean_rtt": 2021530,
  "median_rtt": 2000000,
  "p95_rtt": 2000000,
  "max_rtt": 4985117,
  "histogram": [
    739,
    4
  ]
}
Example JSON entry after update:
{
  "timestamp": 1697635990442789000,
  "ip_prefix": "10.11.1.10/32",
  "rx_stats": {
    "TCP_TS": {
      "packets": 1458,
      "bytes": 96232
    },
    "TCP_noTS": {
      "packets": 1,
      "bytes": 74
    },
    "other": {
      "packets": 1874,
      "bytes": 183460
    }
  },
  "tx_stats": {
    "TCP_TS": {
      "packets": 17270,
      "bytes": 905414662
    },
    "TCP_noTS": {
      "packets": 1,
      "bytes": 74
    },
    "other": {
      "packets": 1898,
      "bytes": 184204
    }
  },
  "count_rtt": 629,
  "min_rtt": 7775,
  "mean_rtt": 2038160,
  "median_rtt": 2000000,
  "p95_rtt": 2000000,
  "max_rtt": 13431771,
  "histogram": [
    627,
    0,
    0,
    2
  ]
}
This commit considerably increases the overhead for traffic types
that RTTs aren't tracked for. Previously, the eBPF programs would
abort early as soon as they discovered that a packet was of a type
they couldn't track the RTT for. Now, all IP packets will have their
IP address processed and later used to look up the relevant prefixes
to update the packet counters. However, the overhead for packets that
the RTT can't be tracked for should still be considerably lower than
for packets it can be tracked for, and the overhead for the latter
should not increase much from previously.
Potential bug/issue. If the program is configured to NOT track RTTs
for TCP traffic (by using the -I/--icmp flag without the -T/--tcp
flag), the program will not parse the TCP header and thus be unable to
detect whether it contains TCP timestamps. Therefore, all TCP packets
will then be counted as TCP packets without timestamps, regardless of
whether they actually have timestamps.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
For the aggregated stats, report RX and TX from the perspective of the
capture point, instead of the perspective of the subnet.
Consider the following setup, consisting of subnet A, the capture
point (CP) where we're running ePPing, and subnet B.
A <-----> CP <-----> B
Now consider that we have a TCP stream uploading data from A to B, so
that we can capture RTTs between when the data packet from A reaches
CP to when the ACK from B gets back to the CP, i.e. CP -> B -> CP.
Previously, the RX stats for a subnet referred to packets received by
the subnet, i.e. packets with dst address in the subnet. Likewise, TX
packets were packets transmitted by the subnet, i.e. packets with src
address in the subnet. So the data packet from A -> B would be
reported as TX for subnet A and RX for subnet B.
However, the RTTs are by default (can be changed by the
--aggregate-reverse flag) aggregated from the perspective of the
capture point, so that the RTT CP -> B -> CP would be reported as an
RTT observed for subnet B.
Make the TX and RX stats consistent with the RTT, so that all subnet
stats are from the perspective of the CP. Make RX refer to packets
the CP has received from the subnet, i.e. packets with their src in
the subnet, and TX refer to packets the CP has transmitted to the
subnet, i.e. packets with their dst in the subnet. So report a data
packet from A -> B as RX for subnet A and TX for subnet B.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The (per-subnet) aggregated stats already include packet and byte
counts, so they are not strictly only RTTs. Future commits will
further extend the non-RTT-related statistics that are aggregated.
Therefore, rename structs, functions and parameters from
"aggregated_rtts" to "aggregated_stats".
To clarify which members of the aggregated_rtt_stats struct (now
renamed to aggregated_stats) are related to the RTT, prefix their
names with "rtt_", e.g. "min" -> "rtt_min".
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
When maps are not preallocated, the creation of map entries may
sometimes unpredictably fail with ENOMEM, despite plenty of free
memory being available. Solving this memory allocation issue may take
some time, so in the meantime let's just preallocate the memory for
the aggregation maps as well.
Preallocating the maps means the memory usage will be the same
regardless of the amount of traffic actually observed (i.e. regardless
of the number of aggregation entries that need to be created). To
compensate for this higher out-of-the-box memory usage, decrease the
histogram resolution from 1000 1ms bins to 250 4ms bins.
The memory usage (for the aggregation maps) should be approximately:
(56 + NR_BINS * 4) * CPUS * MAP_AGGREGATION_SIZE * 4
With the current values, that translates to roughly 66 MiB per CPU
core (down from ~254 MiB/core with 1000 bins).
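As a sanity check of those figures (assuming MAP_AGGREGATION_SIZE is
16384, which is consistent with the numbers above):
(56 + 250 * 4) * 16384 * 4 = 69,206,016 bytes ≈ 66 MiB per CPU,
versus (56 + 1000 * 4) * 16384 * 4 ≈ 254 MiB per CPU with 1000 bins.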
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The aggregation maps may become full, in which case the BPF programs
will fail to create new entries for any IP-prefixes not currently in
the map. This would previously result in stats from traffic that
cannot fit into any aggregation entry being missed entirely.
Add a fallback entry for each map, so that in case the aggregation map
is full stats from any new IP-prefix will be added to this fallback
entry instead. The fallback entry is reported as 0.0.0.0/0 (for IPv4)
or ::/0 (for IPv6) in the output.
Note that this is implemented by adding specific "magic values" as
special keys in the aggregation maps (see IPV{4,6}_BACKUP_KEY in
pping.h). These special keys have been selected so that no real
traffic should collide with them by using prefixes from blocks
reserved for documentation. Furthermore, these entries are added by
the user space program AFTER the BPF programs are attached (as it's
not possible to do it in-between loading and attaching when using
libxdp). In case the BPF programs manage to fill the maps before the
user space component can create the backup entries, creating them
will fail and the program will abort.
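The BPF-side logic conceptually becomes the following (a sketch with
assumed struct and helper names; the real keys are the
IPV{4,6}_BACKUP_KEY values in pping.h):

/* Get (or create) the stats entry for a prefix, falling back to the
 * pre-created backup entry if the map is full */
static struct aggregated_stats *
lookup_or_create_stats(void *agg_map, const void *prefix_key,
                       const void *backup_key)
{
    struct aggregated_stats empty = {};
    struct aggregated_stats *stats;

    stats = bpf_map_lookup_elem(agg_map, prefix_key);
    if (stats)
        return stats;

    if (bpf_map_update_elem(agg_map, prefix_key, &empty,
                            BPF_NOEXIST) == 0)
        return bpf_map_lookup_elem(agg_map, prefix_key);

    /* Map full (or other failure) - use the 0.0.0.0/0 or ::/0 entry */
    return bpf_map_lookup_elem(agg_map, backup_key);
}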
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
In addition to RTTs, also aggregate the number of packets and bytes
transmitted and received for each IP-prefix. If both the src and dst
IP address of a packet are within the same IP-prefix, the packet will
be counted as both sent to and received by that prefix.
The packet stats are added for all successfully parsed
packets (i.e. packets that contain a valid identifier of some sort),
regardless of whether the packet actually produces a valid RTT sample. This
means some IP-prefixes may only have packet stats, and no RTT stats,
so only output the packet stats in those instances. From a performance
perspective, it also means each BPF program needs to perform two
lookups of the aggregation map (one for src IP and one for dst IP) for
every packet that is successfully parsed. This is a substantial
increase from only having to perform a single lookup on the subset of
packets that produce an RTT sample.
Packets that are not successfully parsed (i.e. they don't contain a
valid identifier, e.g. UDP traffic) are still ignored to minimize
overhead, and will therefore not be included in the aggregated packet
stats. This means the aggregated packet stats may not include all
traffic for an IP-prefix. Future commits may add some counters to also
account for traffic that is not fully parsed.
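In sketch form, the two entries come from the two separate map
lookups (one on the src IP, one on the dst IP), and each packet then
updates both (struct and member names are assumed):

static void count_packet_for_prefixes(struct aggregated_stats *src_stats,
                                      struct aggregated_stats *dst_stats,
                                      __u32 pkt_len)
{
    /* src prefix: traffic transmitted by that prefix */
    if (src_stats) {
        src_stats->tx_packets++;
        src_stats->tx_bytes += pkt_len;
    }

    /* dst prefix: traffic received by that prefix */
    if (dst_stats) {
        dst_stats->rx_packets++;
        dst_stats->rx_bytes += pkt_len;
    }
}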
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add a field with the total packet length (including all headers) to
the packet_info struct. This information will be needed in later
commits which add byte counts to the aggregated information.
Note that this information is already part of the parsing_context
struct, but it won't be available after the packet has been
parsed (once the parse_packet_identifier_{tc,xdp}() functions have
finished). It is unfortunately not trivial to replace the current
instances which use pkt_len from the parsing_context to instead take
it from packet_info, as e.g. parse_tcp_identifier() already takes 5
arguments, and packet_info is not one of them. Therefore, keep the
pkt_len in both parsing_context and packet_info for now.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Keep track of when the last update was made to each IP-prefix in the
aggregation map, and delete entries which are older than
--aggregate-timeout (30 seconds by default). If the user specifies
zero (0), that is interpreted as never expire an entry (which is
consistent with how the --cleanup-interval operates).
Note that as the BPF programs rotate between two maps (an active one
for BPF progs to use, and an inactive one the user space can operate
on), it may expire an aggregation prefix from one of the maps even if
it has seen recent action in the other map.
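The core of the expiry check is simple (a sketch; member names are
assumed):

/* Run from the periodic cleanup on each entry in the inactive map */
static bool agg_entry_expired(const struct aggregated_stats *entry,
                              __u64 now, __u64 timeout_ns)
{
    /* A timeout of 0 means entries never expire */
    return timeout_ns && now - entry->last_updated > timeout_ns;
}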
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
By default ePPing will aggregate RTTs based on the src IP of the reply
packet. I.e. the RTT A->B->A will be aggregated based on IP of B. In
some scenarios it may be more interesting to aggregate based on the
dst IP of the reply packet (IP of A in above example). Therefore, add
a switch (--aggregate-reverse) which makes ePPing aggregate RTTs
based on the dst IP of the reply packet instead of the src IP. In
other words, by default ePPing will aggregate traffic based on where
it's going to, but with this switch you can make ePPing aggregate
traffic based on where it's coming from instead.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Instead of keeping all RTTs since ePPing started, reset the aggregated
stats after each time they're reported so the report only shows the
RTTs since the last report.
To avoid concurrency issues due to user space reading and resetting
the map while the BPF programs are updating it, use two aggregation
maps, one active and one inactive. Each time user space wishes to
report the aggregated RTTs it first switches which map is actively
used by the BPF progs, and then reads and resets the now inactive map.
As the RTT stats are now periodically reset, change the
histogram (aggregated_rtt_stats.bins) to use __u32 instead of __u64
counters, as the risk of overflow is low (even if 1 million RTTs/s
were added to the same bin, it would take over an hour to overflow,
and reports are likely more frequent than that).
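One possible way to implement the rotation from user space (a sketch
under assumed names and mechanism - here a small config map holding
the index of the active aggregation map, read by the BPF programs):

/* Flip which aggregation map the BPF programs actively update; it is
 * then safe to read and reset the now-inactive map */
static int rotate_agg_maps(int config_map_fd, __u32 *active_idx)
{
    __u32 key = 0, next = 1 - *active_idx;

    if (bpf_map_update_elem(config_map_fd, &key, &next, BPF_ANY))
        return -1;

    *active_idx = next;
    return 0;
}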
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option -a or --aggregate to provide an aggregate report of RTT
samples every X seconds. This is currently mutually exclusive with the
normal per-RTT sample reports.
The aggregated stats are never reset, and thus contain all RTTs since
the start of tracing. The next commit will change this to reset the
stats after every report, so that each report only contains the RTTs
since the last report.
The RTTs are aggregated and reported per IP-prefix, where the user can
modify the size of the prefixes used for IPv4 and IPv6 using the
--aggregate-subnet-v4/v6 flags.
In this initial implementation for aggregating RTTs, the minimum and
maximum RTT are tracked and all RTTs are added to a histogram. It uses
a predetermined number of bins of equal width (set to 1000 bins, each
1 ms wide), see RTT_AGG_NR_BINS and RTT_AGG_BIN_WIDTH in pping.h. In
the future this could be changed to use more sophisticated histograms
that better capture a wide variety of RTTs.
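The binning itself boils down to the following (a sketch using the
constants from pping.h; out-of-range RTTs are assumed to be clamped
into the last bin):

/* Map an RTT to its fixed-width histogram bin (rtt and bin width in
 * the same time unit) */
static __u32 rtt_to_bin_idx(__u64 rtt)
{
    __u64 bin = rtt / RTT_AGG_BIN_WIDTH;

    return bin < RTT_AGG_NR_BINS ? bin : RTT_AGG_NR_BINS - 1;
}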
Implement the periodic reporting of RTTs by using a
timerfd (configured to the user-provided interval) and add it to the
main epoll-loop.
To minimize overhead from the hash lookups, use separate maps for IPv4
and IPv6, so that for IPv4 traffic the hashmap key is only 4
bytes (instead of 16). Furthermore, limit the maximum IPv6 prefix
length to /64 so that the IPv6 map can use an 8-byte key.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Instead of specifying the map size directly in the map definitions,
add them as defines at the top of the file to make them easier to
change (don't have to find the correct map among the map
definitions). This pattern will also simplify future additions of
maps, where multiple maps may share the same size.
While at it, increase the default map sizes to 131072 (2^17) entries,
as the previous value of 16384 (2^14) was fairly underdimensioned,
especially for the packet_ts map. If only half of the timestamps are
echoed back (due to e.g. delayed ACKs), in theory just
16k / (500 * 1) = 32 concurrent flows would be enough to fill it up
with stale entries (assuming the default cleanup interval of 1s).
Increasing the size of these maps increases the corresponding memory
cost from 2^14 * (48 + 4) = 832 KiB and 2^14 * (44 + 144) = 2.94 MiB
to 2^17 * (48 + 4) = 6.5 MiB and 2^17 * (44 + 144) = 23.5 MiB,
respectively, which should generally not be too problematic.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Extract the logic for filling in and sending an RTT event to a
function. This makes it consistent with other send_*_event() functions
and will make it easier/cleaner to add an option to aggregate the RTT
instead of sending it.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
When compiled with LLVM-15 (clang-15 and llc-15), the verifier would
reject the tsmap_cleanup program, as reported in #63. To prevent
this, add a NULL-check for df_state after the map lookup, to convince
the verifier that we're not trying to dereference a pointer to a map
value before checking it for NULL. This fix ensures that the bytecode
generated by LLVM-12 through LLVM-15 passes the verifier (tested on
kernel 5.19.3).
There was already a NULL-check for df_state in the (inlined by the
compiler) function fstate_from_dfkey(), which checked df_state before
accessing its fields (the specific access that angered the verifier
was df_state->dir2). However, with LLVM-15 the compiler reorders the
operations so that df_state->dir2 is accessed before the NULL-check
is performed, thus upsetting the verifier. This commit removes the
internal NULL-check in fstate_from_dfkey() and instead performs the
relevant NULL-check directly in the tsmap_cleanup prog. In all other
places where fstate_from_dfkey() ends up being called there are
already NULL-checks for df_state that enable early returns.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Set the ingress_ifindex to the ctx->ingress_ifindex rather than
ctx->rx_queue_index. This fixes a bug that was accidentally introduced
in commit #add8885, and which broke the localfilt functionality if the
XDP hook was used on ingress (the FIB lookup would fail).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing wait until the first shift of identifier (the "edge")
before starting to timestamp packets for new flows (for TCP flows we
do not see the start of).
The reason this is necessary is that if ePPing starts monitoring a
flow in the middle of it (ePPing did not see the start of the flow),
we cannot know whether the first TSval we see is actually the first
instance of that TSval in the flow. We therefore have to wait until
the next TSval to ensure we get the first instance of a TSval;
otherwise we may underestimate the RTT by up to the TCP timestamp
update period. To avoid the first RTT sample potentially being
underestimated, this fix essentially ignores the first RTT sample
instead.
However, it is not always necessary to wait until the first shift. For
TCP traffic where we see the initial handshake we know that we've seen
the start of the flow. Furthermore, for ICMP traffic it's generally
unlikely that there are duplicate identifiers to begin with, so allow
ICMP flows to start timestamping right away as well.
It should be noted that after the previous commit (which changed
ePPing to ignore TCP SYN-packets by default), ePPing will never see
the handshake and thus has to assume that it started to monitor all
flows in the middle. Therefore, ePPing will (by default) now miss both
the RTT during the handshake and the RTT for the first few packets
sent after the handshake (until the TSval is updated).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing ignore TCP SYN packets by default, so that the
initial handshake phase of the connection is ignored. Add an
option (--include-syn/-s) to explicitly include SYN packets.
The main reason it can be a good idea to ignore SYN packets is to
avoid being affected by SYN-flood attacks. When ePPing includes
SYN-packets it becomes quite vulnerable to SYN flooding, which will
quickly fill up its flow_state table and block actually useful flows
from being tracked. As ePPing considers a connection opened as soon
as it sees the SYN-ACK (it does not wait for the final ACK), flow
state created by SYN-flood attacks will also stay around in the
flow-state table for a long time (currently 5 minutes), as no RST/FIN
will be sent that could be used to close it.
The drawback from ignoring SYN-packets is that no RTTs will be
collected during the handshake phase, and all connections will be
considered opened due to "first observed packet".
A more refined approach could be to properly track the full TCP
handshake (SYN + SYN-ACK + ACK) instead of the more generic "open once
we see reply in reverse direction" used now. However, this adds a fair
bit of additional protocol-specific logic. Furthermore, to track the
full handshake we will still need to store some flow-state before the
handshake is completed, and thus such a solution would still be
vulnerable to SYN-flood attacks (although the incomplete flow states
could potentially be cleaned up faster).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The mechanism to ensure that only the first instance of each TSval is
timestamped is a simple equality check. This check may fail if there
are reordered packets.
Consider a sequence of packets A, B, C and D, where A and B have
TSval=1 and C and D have TSval=2. If all packets arrive in
order (ABCD), then A and C will correctly be the only packets that
are timestamped (as B and D have the same TSval as the previously
observed one). However, consider if B is reordered so the packets
instead arrive as ACBD. In this scenario ePPing will attempt to
timestamp all of them (instead of only A and C), as each packet now
has a different (but not always higher) TSval than the last seen
packet. Note that it will only successfully create timestamps for the
later duplicated TSvals if the previous timestamp for the same TSval
has already been cleared out, so this is mainly an issue when
RTT < 1 ms.
Fix this by only allowing a packet to be timestamped if its TSval is
strictly higher (accounting for wrap-around) than the last seen
TSval, and likewise only updating the last seen TSval if it is
strictly higher than the previous one.
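The strictly-higher comparison can be done with standard
serial-number (wrap-around) arithmetic on the 32-bit TSval (a minimal
sketch):

/* True if TSval a is strictly after b, treating the 32-bit values as
 * wrapping sequence numbers (same trick as the kernel's after()) */
static bool tsval_after(__u32 a, __u32 b)
{
    return (__s32)(a - b) > 0;
}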
To allow this comparison, also convert TSval and TSecr from network
byte order to host byte order when parsing the packet. Delaying the
conversion until the comparison between the packet's TSval and the
last seen TSval could potentially save the overhead of bpf_ntohl for
some packets that never reach this check, but most TCP packets will
end up performing the check, so the performance difference should be
minimal. Therefore, opt for the simpler approach of converting TSval
and TSecr directly, which also makes them easier to interpret when
e.g. dumping the maps.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The debug counter for timed out (deleted by periodical cleanup) flow
states was never incremented, so fix that.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Use global functions to make use of function-by-function verification.
This allows the verifier to analyze parts of the program independently
of each other, which should help reduce the verification
complexity (the number of instructions the verifier must go through
to verify the program) and prevent it from growing exponentially with
every loop or branch added to the code.
In this case, break out the packet parsing (parse_packet_identifier)
as a global function, so that it can be handled separately from the
logic after it (updating flow state, saving timestamps, matching
replies to timestamps, calculating and pushing RTTs, etc.). To do this,
create small separate wrapper functions (parse_packet_identifier_tc()
and parse_packet_identifier_xdp()) for tc/xdp, so that the verifier
can correctly identify the arguments as pointers to
context (PTR_TO_CTX) when evaluating the global functions. Also create
small wrapper functions pping_tc() and pping_xdp() which call the
corresponding parse_packet_identifier_tc/xdp function.
For this to work in XDP mode (which is the default), the kernel must
have been patched with a fix that addresses an issue with how global
functions are verified for XDP programs, see:
https://lore.kernel.org/all/20220606075253.28422-1-toke@redhat.com/
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Do not provide pointers into the original packet from packet_info
anymore (which the verifier has to ensure are valid), and instead
directly parse all necessary data in parse_packet_identifier and then
only use the parsed data in later functions.
This allows a cleaner separation of concerns, where the parsing
functions parse all necessary data from the packets, and other
functions that need information about the packet only rely on the data
provided in packet_info (and do not attempt to parse any data on their
own).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Move the is_egress and ingress_ifindex members from the
parsing_context struct to the packet_info struct. Also change the
member is_egress to is_ingress to better fit with the ingress_ifindex
member.
These members were only in parsing_context because they were
convenient to fill in right from the start. However, it semantically
makes little sense for parsing_context to contain them, as they are
not used for any parsing; they fit better in packet_info. This also
allows later functions (is_local_address(), pping_timestamp_packet()
and pping_match_packet()) to drop their dependency on
parsing_context, which they only used for the is_egress and
ingress_ifindex members (they do not do any parsing). After this
change, parsing_context is only used for the initial parsing, and
packet_info contains all the necessary data for all the functions
related to the pping logic that runs after the packet has been parsed.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The header files included from pping_kern.c include definitions of AF_INET
and AF_INET6, leading to warnings like:
pping_kern.c:25:9: warning: 'AF_INET' macro redefined [-Wmacro-redefined]
/usr/include/bits/socket.h:97:9: note: previous definition is here
pping_kern.c:26:9: warning: 'AF_INET6' macro redefined [-Wmacro-redefined]
/usr/include/bits/socket.h:105:9: note: previous definition is here
2 warnings generated.
Fix this by guarding the definitions behind suitable ifdefs.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The connection state had 3 boolean flags related to what state it was
in (is_empty, has_opened and has_closed). Only specific combinations
of these flags really made sense (has_opened/has_closed didn't really
mean anything if is_empty, and if has_closed one would expect is_empty
to be false and has_opened to be true, etc.). Therefore, replace
these combinations of boolean values with a single enum that
indicates whether the flow is empty, waiting to open (outgoing packet
seen but no response yet), open or closed.
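An illustrative version of such an enum (a sketch; not necessarily
the names used):

enum connection_state {
    CONNECTION_STATE_EMPTY,    /* flow-state slot unused */
    CONNECTION_STATE_WAITOPEN, /* outgoing packet seen, no reply yet */
    CONNECTION_STATE_OPEN,     /* reply seen - flow considered open */
    CONNECTION_STATE_CLOSED,   /* FIN/RST seen or flow timed out */
};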
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Combine the flow state entries for both the "forward" and "reverse"
direction of the flow into a single dualflow state. Change the
flowstate map to use the dualflow state so that state for both
directions can be retrieved using a single map lookup.
As flow states are now kept in pairs, we can no longer directly
create/delete states in the BPF map each time a flow opens/closes in
one direction. Therefore, update all logic related to
creating/deleting flows. For example, use the "empty" slot in the
dualflow state instead of creating a new map entry, and only delete
the dualflow state entry once both directions of the flow have
closed/timed out.
Some implementation details:
Have implemented a simple memcmp function as I could not get the
__builtin_memcmp function to work (got error "libbpf: failed to
find BTF for extern 'memcmp': -2").
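Such a bounded memcmp can look like this (a sketch; the verifier
needs the length to be a known constant or otherwise bounded):

/* Byte-wise comparison replacement for __builtin_memcmp */
static int my_memcmp(const void *a, const void *b, __u32 len)
{
    const __u8 *pa = a, *pb = b;
    __u32 i;

    for (i = 0; i < len; i++) {
        if (pa[i] != pb[i])
            return pa[i] > pb[i] ? 1 : -1;
    }
    return 0;
}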
To ensure that both directions of the flow always look up the same
entry, use the "sorted" flow tuple (the (ip, port) pair that is
smaller is always first) as key. This is what the memcmp is used for.
To avoid storing two copies of the flow tuple (src -> dst and dst ->
src) and doing additional memcmps, always store the flow state for the
"sorted" direction as the first direction and the reverse as the
second direction. Then simply check whether a flow is sorted or not
to determine which direction in the dualflow state it matches. Have
attempted to at least partially abstract this detail away from most of
the code by adding some get_flowstate_from* helpers.
The dual flow state simply stores the two (single direction) flow
states as the struct members dir1 and dir2. Use these two (admittedly
poorly named) members instead of a single array of size 2 in order to
avoid some issues with the verifier being worried that the array index
might be out of bounds.
Have added some new boolean members to the flow state to keep track of
"connection state". In addition the the previous has_opened, I now
also have a member for if the flow is "empty" or if it has been
closed. These are needed to cope with having to keep individual flow
states for both directions of the flow around as long as one direction
of the flow is used. I plan to replace these boolean "connection
state" members with a single enum in a future commit.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Refactor the functions for parsing protocol-specific packet
identifiers (parse_tcp_identifier, parse_icmp6_identifier and
parse_icmp_identifier) so they no longer directly fill in the
packet_info struct. Instead, make the functions take additional
pointers as arguments and fill in a protocol_info struct.
The reason for this change is to decouple the
parse_<protocol>_identifier functions from the logic of how the
packet_info struct should be filled. parse_packet_identifier is now
solely responsible for correctly filling in the members of the
packet_info struct, instead of working in tandem with the
parse_<protocol>_identifier functions, each filling in some members.
This might result in a minimal performance degradation as some values
are now first filled in the protocol_info struct and later copied to
packet_info instead of being filled in directly in packet_info.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Format code using clang-format from the kernel tree. However, leave
code in its original format in some instances where clang-format
clearly reduces readability (e.g. do not remove the alignment of
comments for struct members and long options).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add a counter of outstanding (unmatched) timestamp entries to the
flow state. Before a timestamp lookup is attempted, check whether
there are any outstanding timestamps, and otherwise skip the
unnecessary hash map lookup.
Use a 32-bit counter for outstanding timestamps to allow atomic
increments/decrements using __sync_fetch_and_add. This operation is
not supported on smaller integers, which is why such a large counter
is used. The atomicity is needed because the counter may be
concurrently accessed by the ingress/egress hooks as well as the
periodical map cleanup.
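In sketch form, the check and the atomic update look like this (field
and map names are assumed):

/* Skip the hash-map lookup when the flow has nothing to match */
if (f_state->outstanding_timestamps == 0)
    return;

ts = bpf_map_lookup_elem(&packet_ts, &key);
if (ts) {
    /* Matched - atomically decrement so concurrent hooks and the
     * periodic cleanup see a consistent count */
    __sync_fetch_and_add(&f_state->outstanding_timestamps, -1);
    /* ... calculate RTT and delete the timestamp entry ... */
}

The corresponding increment,
__sync_fetch_and_add(&f_state->outstanding_timestamps, 1), happens
when a new timestamp entry is successfully created.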
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add conditions that allow removing old flow and timestamp entries
sooner.
For the flow map, add conditions that allow unopened flows and ICMP
flows to be removed earlier than open TCP flows (currently both are
set to 30 sec instead of 300 sec).
For timestamp entries, allow them to be removed once they are older
than TIMESTAMP_RTT_LIFETIME (currently 8) times the flow's sRTT.
Add some debug info to the periodical map cleanup process. Push debug
information through the events perf buffer by using newly added
map_clean_event.
The old user space map cleanup process had some simple debug
information that was lost when transitioning to using bpf_iter
instead. Therefore, add back similar (but more extensive) debug
information, now collected from the BPF side. In addition to stats on
entries deleted by the cleanup process, also include stats on entries
deleted by ePPing itself due to matching (for timestamp entries) or
detecting FIN/RST (for flow entries).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
To improve the performance of the map cleanup, switch from the
user-space loop to using BPF iterators. With BPF iterators, a BPF
program is run on each element in the map, so the cleanup can be done
entirely in kernel space. This should hopefully also avoid the issue
the previous user-space loop had with restarting in case an element
was removed by the BPF programs during the cleanup.
Due to the removal of the user-space map cleanup logic, no longer
provide any debug information about how many entries there are in
each map and how many of them were removed by the garbage collection.
This will be added back in the next commit.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Send a warning notifying the user that PPing failed to create a
flow/timestamp entry due to the corresponding map being full. To avoid
sending a warning for every packet, only emit warnings every
WARN_MAP_FULL_INTERVAL (which is currently hard-coded to 1s).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Wait with sending a flow open message until a reply has been seen for
the flow. Likewise, only emit a flow closing event if the flow has
first been opened (that is, a reply has been seen).
This introduces potential (but unlikely) concurrency issues for flow
opening/closing messages which are further described in the README.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Perform both timestamping and matching on both ingress and egress
hooks. This makes it more similar to Kathie's pping, allowing the tool
to capture RTTs in both directions when deployed on just a single
interface.
Like Kathie's pping, by default filter out RTTs for packets going to
the local machine (will only include local processing delays). This
behavior can be disabled by passing the -l/--include-local option.
As packets that are timestamped on ingress and matched on egress will
include the local machine's processing delay, add the
"match_on_egress" member to the JSON output, which can be used to
differentiate between RTTs that include the local processing delay
and those that don't.
Finally, report the source and destination addresses from the perspective
of the reply packet, rather than the timestamped packet, to be
consistent with Kathie's pping.
Overall, refactor large parts of pping_kern to allow both timestamping
and matching, as well as updating both the flow and reverse flow and
handling flow events related to them, in one go. Also update the
README to reflect these changes.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option (-R, --rtt-rate) to adapt the sampling rate based on
the RTT of the flow. The sampling period will be C * RTT, where C is
a configurable constant (e.g. 1.0 to get one sample every RTT), and
RTT is either the current minimum (default) or smoothed RTT of the
flow (chosen via the -t or --rtt-type option).
The smoothed RTT (sRTT) is updated for each calculated RTT, and is
calculated in a similar manner to srtt in the kernel's TCP stack. The
sRTT is a moving average of all RTTs, and is calculated according to
the formula:
srtt = 7/8 * prev_srtt + 1/8 * rtt
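In integer arithmetic that formula is typically implemented with
shifts; a minimal sketch:

/* srtt = 7/8 * prev_srtt + 1/8 * rtt, using shifts instead of
 * division */
static __u64 update_srtt(__u64 prev_srtt, __u64 rtt)
{
    return prev_srtt - (prev_srtt >> 3) + (rtt >> 3);
}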
To allow the user to pass a non-integer C (e.g. 0.1 to get 10 RTT
samples per RTT-period), fixed-point arithmetic is used in the eBPF
programs (due to the lack of float support). The maximum value for C
has been limited to 10000 to make it unlikely that the C * RTT
calculation overflows (with C = 10000, overflow only occurs if
RTT > 28 seconds).
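In fixed-point form the sampling period could be computed as
follows - a sketch assuming 16 fractional bits, which is consistent
with the stated overflow bound (2^64 / (10000 * 2^16) ≈ 28.1 s in
nanoseconds):

#define FIXPOINT_SHIFT 16 /* assumed number of fractional bits */

/* rate_c is the user's C converted to fixed point in user space,
 * e.g. (__u64)(C * (1 << FIXPOINT_SHIFT)) */
static __u64 sampling_period_ns(__u64 rate_c, __u64 rtt_ns)
{
    return (rate_c * rtt_ns) >> FIXPOINT_SHIFT;
}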
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Only push flow events for opening/closing flows if the
creation/deletion of the flow-state was successful (as indicated by
the bpf_map_*_elem() return value). This should avoid outputting
several flow creation/deletion messages in case multiple instances are
trying to create/delete a flow concurrently, as could theoretically
occur previously.
Also set the last_timestamp value before creating a new flow, to avoid
a race condition where the userspace cleanup might incorrectly
determine that a flow is old before the last_timestamp value can be
set. Explicitly skip the rate-limit for the first packet of a new flow
to avoid it failing the rate-limit. This also fixes an issue where the
first packet of a new flow would previously fail the rate-limit if the
rate-limit was higher than the current system uptime (CLOCK_MONOTONIC).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add command-line flags for each protocol that pping should attempt to
parse and report RTTs for (currently -T/--tcp and -C/--icmp). If no
protocol is specified, assume TCP. To clarify this, output a message
on startup stating how ePPing has been configured (output format,
tracked protocols and which interface it runs on).
Additionally, as the ppviz format was only designed for TCP it does
not have any field for which protocol an entry belongs to. Therefore,
emit a warning in case the user selects the ppviz format with anything
other than TCP.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Allow pping to passively monitor RTT for ICMP echo request/reply
flows. Use the echo identifier as ports, and the echo sequence number
as packet identifier.
Additionally, add protocol to standard output format in order to be
able to distinguish between TCP and ICMP flows.
The ppviz format does not include the protocol, making it impossible
to distinguish between TCP and ICMP traffic. A warning for using the
ppviz format together with ICMP will be added in the future.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The echoed TCP timestamp (TSecr) is only valid if the ACK flag is
set. So make sure to only attempt to match on ACK packets.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The libbpf API has deprecated a number of functions used by the pping
loader. While a couple of functions have simply been renamed,
bpf_object__find_program_by_title has been completely deprecated in
favor of bpf_object__find_program_by_name. Therefore, change so that
BPF programs are found based on the C function names rather than
section names.
Also remove defines of section names as they are no longer used, and
change the section names in pping_kern.c to use "tc" instead of
"classifier/ingress" and "classifier/egress".
Finally replace the flags json_format and json_ppviz in pping_config
with a single enum for the different output formats. This makes the
logic for which output format to use clearer compared to relying on
multiple (supposedly) mutually exclusive flags (and implicitly
assuming standard format if neither flag was set).
One potential concern with this commit is that it introduces some
"magical strings". In case the function names in pping_kern.c are
changed it will require multiple changes in pping.c.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make several changes to functions related to attaching and detaching
the BPF programs:
- Check the BPF program id when detaching programs to ensure that the
correct programs are removed.
- When attaching tc-programs, keep track of whether the clsact qdisc
was created or existed previously. Attempt to delete the qdisc if it
was created and attaching failed. If the --force argument was given,
also attempt to delete the qdisc on shutdown in case it did not
previously exist.
- Rely on XDP flags to replace existing XDP program if --force is used
rather than explicitly detaching any XDP program first.
- Print out hints for why pping might have failed attaching the XDP
program.
Also, use libbpf_strerror instead of strerror to better display
libbpf-specific error codes, and for more reliable error handling in
general (don't need to ensure the error codes are positive).
Finally, change return codes of tc programs to TC_ACT_UNSPEC from
TC_ACT_OK to allow other TC-BPF programs to be used on the same
interface as pping.
Concerns with this commit:
- When attaching a tc program libbpf will emit a warning if the
clsact qdisc already exists on the interface. The fact that the
clsact already exists is not an issue, and is handled in tc_attach
by checking for EEXIST, so the warning could be a bit
misleading/confusing for the user.
- The tc_attach and xdp_attach functions attempt to return the u32
prog_id in an int. In case the programs are assigned a very high
id (> 2^31) this may cause it to be interpreted as an error instead.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
For some machines, XDP may not be suitable due to e.g. lack of XDP
support in the NIC driver or another program already being attached
to the XDP hook on the desired interface. Therefore, add an option to
use the tc-ingress hook instead of XDP for attaching the pping
ingress BPF program.
In practice, this adds an additional BPF program to the object file (a
TC ingress program). To avoid loading an unnecessary BPF program, also
explicitly disable autoloading for the ingress program not selected.
Also, change the tc programs to return TC_ACT_OK instead of
BPF_OK. While both should be compatible, the TC_ACT_* return codes
seem to be more commonly used for TC-BPF programs.
Concerns with this commit:
- The error messages for XDP attach failures have gotten slightly less
descriptive. I plan to improve the code for attaching and detaching
XDP programs in a separate commit, and will then address that.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make the flow_timeout function call the current output function to
simulate a flow-closing event. Also some other minor cleanup/fixes.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add "flow events" (flow opening or closing so far) which will trigger
a printout of message.
Note: The ppviz format will only print out the traditional rtt events
as the format does not include opening/closing messages.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>