Make the user space process reopen the output file (if used with the
-w/--write option) when it receives a SIGHUP signal. This makes it
possible to, for example, rotate the output files with logrotate.
In case the program receives the SIGHUP signal BEFORE the output
file has been moved, the program will print a warning and then
continue writing to its current file handle until another SIGHUP
signal is received.
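Since signals are already delivered through a signalfd in the main
epoll loop (see the signalfd commit further down), the handling is
roughly along the following lines. This is a sketch with illustrative
names, not the actual implementation:

    struct signalfd_siginfo si;
    read(sfd, &si, sizeof(si));

    if (si.ssi_signo == SIGHUP) {
        /* Tries an exclusive open of the configured path; if the old
         * file has not been moved yet the path still exists, so the
         * open fails, a warning is printed and the current file
         * handle is kept. */
        reopen_output_file(&out);
    } else if (si.ssi_signo == SIGINT || si.ssi_signo == SIGTERM) {
        break; /* graceful shutdown */
    }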
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option -w/--write <filename> which writes the output to the
provided file instead of to stdout. Fail if the provided file
already exists to avoid data loss (if truncating the file) or corrupting
data (if appending to the file; JSON is not concatenable).
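The exclusive-create open is the key part; a minimal sketch, with
illustrative names and error message wording:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>

    static FILE *open_output_file(const char *filename)
    {
        /* O_EXCL together with O_CREAT makes open() fail with EEXIST
         * if the file already exists, instead of truncating or
         * appending to it. */
        int fd = open(filename, O_WRONLY | O_CREAT | O_EXCL, 0644);

        if (fd < 0) {
            if (errno == EEXIST)
                fprintf(stderr, "%s already exists - refusing to overwrite it\n",
                        filename);
            return NULL;
        }
        return fdopen(fd, "w");
    }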
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Refactor code for various output functions to avoid hard-coding stdout
and relying on global variables. Collect output-related parameters
into a new output_context struct which can easily be passed as an
explicit argument to output-related functions. Get rid of the global
json_ctx and print_event_func variables, as the corresponding information
is now included in the output_context, which is directly passed to the
functions that require it.
Overall, these changes aim to make how and where output is written
more flexible. This will make it easier to, for example, add support
for writing directly to files in the future.
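As a rough sketch, the struct could look something like the
following (member types and names are illustrative; the real
definition is part of this commit):

    struct output_context {
        FILE *stream;             /* stdout, or a file in later commits */
        enum pping_output_format format;
        void *json_ctx;           /* JSON writer state, replaces the global */
        void (*print_event_func)(void *ctx, int cpu, void *data,
                                 __u32 size); /* replaces the global */
    };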
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an initial entry containing information about the settings used
for the aggregation. For the standard output format it reuses the
message that was previously written to stderr (but is now written to
stdout), while for the JSON format it adds a more detailed JSON entry.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
In many scenarios, the upper range of an aggregation histogram may be
empty (e.g. the histogram covers RTTs up to 1000 ms but the highest
observed RTT was 100 ms, leaving 900 trailing empty bins). As trailing
empty bins contain no useful information, simply truncate the
histograms to the highest non-empty bin.
The truncation of histograms has two benefits.
1. It avoids unnecessary processing of empty bins when internally
calculating statistics from the histograms. This should not have any
impact on the output.
2. It reduces the size of the histogram in the JSON output
format. This can potentially save a lot of space in instances where
the maximum observed RTT for a prefix during an aggregation interval
is significantly lower than the highest histogram bin. Removing
trailing empty bins (unlike non-trailing ones) does not require
encoding any additional information (like the number of removed bins
or the index of the remaining ones). It can also never make the
histogram take up more space. Thus there are no obvious drawbacks with
"compressing" the histograms in this manner.
In the future it may be relevant to implement other ways to compress
the histograms, which may be more efficient for certain
distributions (e.g. very sparse histograms). However, as this method
of removing trailing empty bins is both simple and without drawbacks,
it makes sense to make it the default behavior for now.
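The truncation itself only requires finding the highest non-empty
bin; a sketch:

    /* Returns the index of the highest non-empty bin, or -1 if the
     * histogram is entirely empty. Only bins[0..returned index] then
     * need to be processed or emitted. */
    static int highest_nonempty_bin(const __u32 *bins, int n_bins)
    {
        int i;

        for (i = n_bins - 1; i >= 0; i--)
            if (bins[i] > 0)
                return i;
        return -1;
    }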
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The aggregation maps may become full, in which case the BPF programs
will fail to create new entries for any IP-prefixes not currently in
the map. Previously, this would result in stats from traffic that
could not fit into any aggregation entry simply being missed.
Add a fallback entry for each map, so that in case the aggregation map
is full stats from any new IP-prefix will be added to this fallback
entry instead. The fallback entry is reported as 0.0.0.0/0 (for IPv4)
or ::/0 (for IPv6) in the output.
Note that this is implemented by adding specific "magic values" as
special keys in the aggregation maps (see IPV{4,6}_BACKUP_KEY in
pping.h). These special keys have been selected so that no real
traffic should collide with them by using prefixes from blocks
reserved for documentation. Furthermore, these entries are added by
the user space program AFTER the BPF programs are attached (as it's
not possible to do it in-between loading and attaching when using
libxdp). In case the BPF programs manage to fill the maps before the
user space component can create the backup entries, creating them
will fail and the program will abort.
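Creating the fallback entries from user space is then roughly as
follows (the key macro is the real one from pping.h; the value struct
and error handling are illustrative):

    struct aggregated_rtt_stats empty = { 0 };
    __u32 key = IPV4_BACKUP_KEY; /* an address from a reserved
                                  * documentation block, e.g.
                                  * 192.0.2.0/24 (RFC 5737) */

    /* Must happen after attach (libxdp limitation, see above). If the
     * BPF programs have already filled the map, BPF_NOEXIST makes
     * this fail, and the program aborts. */
    if (bpf_map_update_elem(map_fd_v4, &key, &empty, BPF_NOEXIST)) {
        fprintf(stderr, "Failed creating IPv4 fallback entry - aborting\n");
        goto cleanup;
    }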
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
In addition to RTTs, also aggregate the number of packets and bytes
transmitted and received for each IP-prefix. If both the src and dst
IP address of a packet are within the same IP-prefix, the packet will
be counted as both transmitted and received by that prefix.
The packet stats are added for all successfully parsed
packets (i.e. packets that contain a valid identifier of some sort),
regardless of whether the packet actually produces a valid RTT sample. This
means some IP-prefixes may only have packet stats, and no RTT stats,
so only output the packet stats in those instances. From a performance
perspective, it also means each BPF program needs to perform two
lookups of the aggregation map (one for src IP and one for dst IP) for
every packet that is successfully parsed. This is a substantial
increase from only having to perform a single lookup on the subset of
packets that produce an RTT sample.
Packets that are not successfully parsed (i.e. they don't contain a
valid identifier, e.g. UDP traffic) are still ignored to minimize
overhead, and will therefore not be included in the aggregated packet
stats. This means the aggregated packet stats may not include all
traffic for an IP-prefix. Future commits may add some counters to also
account for traffic that is not fully parsed.
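Conceptually, the per-packet work in the BPF programs becomes
something like the following (helper and member names are
illustrative):

    /* Two lookups per parsed packet: the src prefix transmitted the
     * packet and the dst prefix received it. */
    struct aggregated_rtt_stats *tx = lookup_prefix_entry(&p_info->src);
    struct aggregated_rtt_stats *rx = lookup_prefix_entry(&p_info->dst);

    if (tx) {
        tx->tx_packets++;
        tx->tx_bytes += p_info->pkt_len;
    }
    if (rx) {
        rx->rx_packets++;
        rx->rx_bytes += p_info->pkt_len;
    }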
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add a field with the total packet length (including all headers) to
the packet_info struct. This information will be needed in later
commits which add byte counts to the aggregated information.
Note that this information is already part of the parsing_context
struct, but this won't be available after the packet has been
parsed (once the parse_packet_identifier_{tc,xdp}() functions have
finished). It is unfortunately not trivial to replace current instances
which use pkt_len from the parsing_context to instead take it from
packet_info, as e.g. parse_tcp_identifier() already takes 5
arguments, and packet_info is not one of them. Therefore, keep both
the pkt_len in parsing_context and packet_info for now.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Keep track of when the last update was made to each IP-prefix in the
aggregation map, and delete entries which are older than
--aggregate-timeout (30 seconds by default). If the user specifies
zero (0), it is interpreted as never expiring an entry (which is
consistent with how the --cleanup-interval operates).
Note that as the BPF programs rotate between two maps (an active one
for BPF progs to use, and an inactive one the user space can operate
on), it may expire an aggregation prefix from one of the maps even if
it has seen recent action in the other map.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add support for outputting the aggregated reports in JSON format. This
format includes the raw histogram bin counts, making it possible to
post-process the aggregated RTT statistics.
The user specifies the format for the aggregated output in the same
way as for the per-RTT output, by using the -F/--format argument. If
the user attempts to use the ppviz format for the aggregated
output (which is not supported) the program will error out.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Create the start of the JSON array at program start (if configured
to use the JSON format) instead of at the first report. This ensures
that ePPing provides valid JSON output (an empty array, []) even if
the program is stopped before any report is generated. Before this
change, ePPing could generate empty output (""), which is not valid
JSON, if it was stopped before the first report.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Provide some statistics (min, mean, median, p95, max) instead of
dumping the raw bin counts.
While the raw bin counts provide more information and can be used for
further post processing, they are hard for a human to parse and make
sense of. Therefore, they are more suitable for a data-oriented
format, such as the JSON output.
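Deriving a percentile from the histogram is a simple cumulative scan;
a sketch (RTT_AGG_BIN_WIDTH is from pping.h, other names are
illustrative):

    /* Returns the upper edge of the bin in which the given percentile
     * falls. Assumes total > 0. */
    static __u64 histogram_percentile(const __u32 *bins, int n_bins,
                                      __u64 total, double percentile)
    {
        __u64 target = (__u64)(total * percentile / 100.0 + 0.5);
        __u64 seen = 0;
        int i;

        for (i = 0; i < n_bins; i++) {
            seen += bins[i];
            if (seen >= target)
                return (__u64)(i + 1) * RTT_AGG_BIN_WIDTH;
        }
        return (__u64)n_bins * RTT_AGG_BIN_WIDTH;
    }

The median is then simply the 50th percentile, and the mean can be
estimated by weighting each bin's midpoint with its count.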
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
By default ePPing will aggregate RTTs based on the src IP of the reply
packet. I.e. the RTT A->B->A will be aggregated based on the IP of B. In
some scenarios it may be more interesting to aggregate based on the
dst IP of the reply packet (IP of A in above example). Therefore, add
a switch (--aggregate-reverse) which makes ePPing aggregate RTTs
based on the dst IP of the reply packet instead of the src IP. In
other words, by default ePPing will aggregate traffic based on where
it's going to, but with this switch you can make ePPing aggregate
traffic based on where it's coming from instead.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Instead of keeping all RTTs since ePPing started, reset the aggregated
stats after each time they're reported so the report only shows the
RTTs since the last report.
To avoid concurrency issues due to user space reading and resetting
the map while the BPF programs are updating it, use two aggregation
maps, one active and one inactive. Each time user space wishes to
report the aggregated RTTs it first switches which map is actively
used by the BPF progs, and then reads and resets the now inactive map.
As the RTT stats are now periodically reset, change the
histogram (aggregated_rtt_stats.bins) to use __u32 instead of __u64
counters, as the risk of overflow is low (even if 1 million RTTs/s are
added to the same bin, it would take over an hour to overflow, and
reports are likely more frequent than that).
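A rough sketch of the rotation from the user space side; the
mechanism for telling the BPF programs which map is active (here a
hypothetical single-entry array map) may differ from the actual
implementation:

    static int active_idx;

    static void report_aggregated_rtts(int map_fds[2], int switch_fd)
    {
        int key = 0;

        /* Flip which map the BPF programs update... */
        active_idx = 1 - active_idx;
        bpf_map_update_elem(switch_fd, &key, &active_idx, BPF_ANY);

        /* ...then read and zero the now-inactive map without racing
         * the BPF programs. */
        report_and_reset_map(map_fds[1 - active_idx]); /* hypothetical */
    }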
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option -a or --aggregate to provide an aggregate report of RTT
samples every X seconds. This is currently mutually exclusive with the
normal per-RTT sample reports.
The aggregated stats are never reset, and thus contain all RTTs since
the start of tracing. The next commit will change this to reset the
stats after every report, so that each report only contains the RTTs
since the last report.
The RTTs are aggregated and reported per IP-prefix, where the user can
modify the size of the prefixes used for IPv4 and IPv6 using the
--aggregate-subnet-v4/v6 flags.
In this initial implementation for aggregating RTTs, the minimum and
maximum RTT are tracked and all RTTs are added to a histogram. It uses
a predetermined number of bins of equal width (set to 1000 bins, each
1 ms wide), see RTT_AGG_NR_BINS and RTT_AGG_BIN_WIDTH in pping.h. In
the future this could be changed to use more sophisticated histograms
that better capture a wide variety of RTTs.
Implement the periodic reporting of RTTs by using a
timerfd (configured with the user-provided interval) which is added
to the main epoll loop.
To minimize overhead from the hash lookups, use separate maps for IPv4
and IPv6, so that for IPv4 traffic the hashmap key is only 4
bytes (instead of 16). Furthermore, limit the maximum IPv6 prefix size
to /64 so that the IPv6 map can use an 8 byte key.
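Adding an RTT sample to an entry then amounts to something like the
following (the constants are from pping.h; the rest is a sketch,
assuming min == 0 means unset):

    __u64 bin = rtt / RTT_AGG_BIN_WIDTH; /* 1 ms wide bins */

    if (bin >= RTT_AGG_NR_BINS)
        bin = RTT_AGG_NR_BINS - 1; /* clamp to the last bin */
    stats->bins[bin]++;

    if (stats->min == 0 || rtt < stats->min)
        stats->min = rtt;
    if (rtt > stats->max)
        stats->max = rtt;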
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Instead of specifying the map size directly in the map definitions,
add them as defines at the top of the file to make them easier to
change (don't have to find the correct map among the map
definitions). This pattern will also simplify future additions of
maps, where multiple maps may share the same size.
While at it, increase the default map sizes to 131072 (2^17) entries,
as the previous value of 16384 (2^14) was fairly underdimensioned,
especially for the packet_ts map. If only half of the timestamps are
echoed back (due to e.g. delayed ACKs), in theory just
16k / (500 * 1) = 32 concurrent flows would be enough to fill it up
with stale entries (assuming the default cleanup interval of
1s). Increasing the size of these maps will increase the corresponding
memory cost from 2^14 * (48 + 4) = 832 KiB and 2^14 * (44 + 144) =
2.94 MiB to 2^17 * (48 + 4) = 6.5 MiB and 2^17 * (44 + 144) = 23.5
MiB, respectively, which should generally not be too problematic.
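The pattern is simply (macro name and key/value types illustrative,
value from this commit; BTF map definition style as used by libbpf):

    #define MAP_ENTRIES 131072 /* 2^17, shared by several maps */

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, MAP_ENTRIES);
        __type(key, struct packet_id);
        __type(value, __u64);
    } packet_ts SEC(".maps");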
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Remove the global keep_running variable and instead write to a pipe to
tell the main thread to abort in case periodical map cleanup
fails. Add the reading-side of this pipe to the epoll loop in the main
thread and abort if anything is written to the pipe. To abort the main
thread, update the main loop so it silently stops if it receives the
special value PPING_ABORT.
As the map cleaning thread can now immediately tell the main loop to
abort, it is no longer necessary to have a short
timeout (EPOLL_TIMEOUT_MS) on the main loop to quickly detect changes
in the keep_running flag. So change the epoll loop to wait indefinitely
for one of the fds to update instead of timing out frequently.
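A sketch of the mechanism (names other than PPING_ABORT are
illustrative):

    static int abort_pipe[2]; /* pipe(abort_pipe) at startup;
                               * abort_pipe[0] is added to the epoll
                               * instance, abort_pipe[1] is kept by
                               * the cleanup thread */

    /* Cleanup thread, on fatal error: */
    __u64 msg = PPING_ABORT;
    write(abort_pipe[1], &msg, sizeof(msg));

    /* Main loop, when epoll reports abort_pipe[0] as readable: */
    read(abort_pipe[0], &msg, sizeof(msg));
    if (msg == PPING_ABORT)
        goto stop; /* silently stop the main loop */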
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Use the signalfd API to handle graceful shutdown on SIGINT/SIGTERM. To
watch the signalfd, create an epoll instance and add both the signalfd
and the perf-buffer to the epoll instance so that both can be
monitored in the main loop with epoll_wait().
This prevents the signal handler from interrupting the perf-buffer
polling and avoids the other issues with asynchronous signal
handling. Furthermore, the restructuring of the main loop to support
watching multiple file descriptors makes it possible to add additional
events to the main loop in the future (such as a periodical task
triggered by a timerfd).
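The setup is along these lines (a condensed sketch without error
handling):

    #include <signal.h>
    #include <sys/epoll.h>
    #include <sys/signalfd.h>

    sigset_t mask;
    sigemptyset(&mask);
    sigaddset(&mask, SIGINT);
    sigaddset(&mask, SIGTERM);
    sigprocmask(SIG_BLOCK, &mask, NULL); /* disable async delivery */

    int sfd = signalfd(-1, &mask, 0);
    int epfd = epoll_create1(0);

    struct epoll_event ev = { .events = EPOLLIN, .data.fd = sfd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);
    /* ... add the perf-buffer fd the same way, then epoll_wait() in
     * the main loop and read a struct signalfd_siginfo from sfd to
     * handle SIGINT/SIGTERM synchronously ... */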
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Fix two edge cases with the parse_bounded_double() function.
1. It accepted an empty string without raising an error. This should not
have been an issue in practice as getopt_long() should have detected
it as a lack of argument. This is addressed by adding a check of
whether it has parsed anything at all.
2. It could overflow/underflow without raising an error. This is
addressed by adding a check of errno (which is set in case of
overflow/underflow, but not in case of conversion error).
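A sketch of the two added checks, assuming the parsing is based on
strtod():

    #include <errno.h>
    #include <stdlib.h>

    char *endptr;

    errno = 0;
    double val = strtod(str, &endptr);

    if (endptr == str)   /* case 1: nothing parsed, e.g. empty string */
        return -EINVAL;
    if (errno == ERANGE) /* case 2: overflow or underflow */
        return -EINVAL;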
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The parse_arguments() function used to have a separate variable for
each float (or rather, a double) value it would parse from the user. As
only one argument is parsed at a time, this is redundant and would
require more and more variables as new options are added. Replace all
these variables with a single "user_float", which is used for all
options that parse a float from the user.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Extract the logic for filling in and sending an RTT event to a
function. This makes it consistent with other send_*_event() functions
and will make it easier/cleaner to add an option to aggregate the RTT
instead of sending it.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The enum pping_output_format was uppercased, which is unconventional
for a type. Change it to lowercase.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
When compiled with LLVM-15 (clang-15 and llc-15), the verifier would
reject the tsmap_cleanup program as reported in #63. To prevent this
add a NULL-check for df_state after the map lookup, to convince the
verifier that we're not trying to dereference a pointer to a map value
before checking for NULL. This fix ensures that the generated bytecode
by LLVM-12 to LLVM-15 passes the verifier (tested on kernel 5.19.3).
There was already a NULL-check for df_state in the (inlined by the
compiler) function fstate_from_dfkey() which checked df_state before
accessing its fields (the specific access that angered the verifier
was df_state->dir2). However, with LLVM-15 the compiler reorders the
operations so that df_state->dir2 is accessed before the NULL-check is
performed, thus upsetting the verifier. This commit removes the
internal NULL-check in fstate_from_dfkey() and instead performs the
relevant NULL-check directly in the tsmap_cleanup prog. In all
other places that fstate_from_dfkey() ends up being called there are
already NULL-checks for df_state to enable early returns.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Previously the program would only print out an error message if the
cleanup of a map failed, and then keep running. Each time the
periodical cleanup failed the error message would be repeated, but no
further action taken. Change this behavior so that the program instead
terminates the cleanup thread and aborts the rest of the program.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Due to a kernel bug for XDP programs loaded via libxdp that use global
functions (see https://lore.kernel.org/bpf/8735gkwy8h.fsf@toke.dk/t/),
XDP mode only works on relatively recent kernels where the bug is
patched (or kernels where the patch has been backported). As many
users may not have such a recent kernel, they will only see a confusing
verifier error like the following:
Starting ePPing in standard mode tracking TCP on test123
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: prog 'pping_xdp_ingress': BPF program load failed: Invalid argument
libbpf: prog 'pping_xdp_ingress': -- BEGIN PROG LOAD LOG --
Func#1 is safe for any args that match its prototype
Validating pping_xdp_ingress() func#0...
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int pping_xdp_ingress(struct xdp_md *ctx)
0: (bf) r6 = r1 ; R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
1: (b7) r7 = 0 ; R7_w=invP0
; struct packet_info p_info = { 0 };
2: (7b) *(u64 *)(r10 -8) = r7 ; R7_w=invP0 R10=fp0 fp-8_w=00000000
3: (7b) *(u64 *)(r10 -16) = r7 ; R7_w=invP0 R10=fp0 fp-16_w=00000000
4: (7b) *(u64 *)(r10 -24) = r7 ; R7_w=invP0 R10=fp0 fp-24_w=00000000
5: (7b) *(u64 *)(r10 -32) = r7 ; R7_w=invP0 R10=fp0 fp-32_w=00000000
6: (7b) *(u64 *)(r10 -40) = r7 ; R7_w=invP0 R10=fp0 fp-40_w=00000000
7: (7b) *(u64 *)(r10 -48) = r7 ; R7_w=invP0 R10=fp0 fp-48_w=00000000
8: (7b) *(u64 *)(r10 -56) = r7 ; R7_w=invP0 R10=fp0 fp-56_w=00000000
9: (7b) *(u64 *)(r10 -64) = r7 ; R7_w=invP0 R10=fp0 fp-64_w=00000000
10: (7b) *(u64 *)(r10 -72) = r7 ; R7_w=invP0 R10=fp0 fp-72_w=00000000
11: (7b) *(u64 *)(r10 -80) = r7 ; R7_w=invP0 R10=fp0 fp-80_w=00000000
12: (7b) *(u64 *)(r10 -88) = r7 ; R7_w=invP0 R10=fp0 fp-88_w=00000000
13: (7b) *(u64 *)(r10 -96) = r7 ; R7_w=invP0 R10=fp0 fp-96_w=00000000
14: (7b) *(u64 *)(r10 -104) = r7 ; R7_w=invP0 R10=fp0 fp-104_w=00000000
15: (7b) *(u64 *)(r10 -112) = r7 ; R7_w=invP0 R10=fp0 fp-112_w=00000000
16: (7b) *(u64 *)(r10 -120) = r7 ; R7_w=invP0 R10=fp0 fp-120_w=00000000
17: (7b) *(u64 *)(r10 -128) = r7 ; R7_w=invP0 R10=fp0 fp-128_w=00000000
18: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
;
19: (07) r2 += -128 ; R2=fp-128
; if (parse_packet_identifer_xdp(ctx, &p_info) < 0)
20: (85) call pc+13
R1 type=ctx expected=fp
Caller passes invalid args into func#1
processed 206542 insns (limit 1000000) max_states_per_insn 32 total_states 13238 peak_states 792 mark_read 40
-- END PROG LOAD LOG --
libbpf: failed to load program 'pping_xdp_ingress'
libbpf: failed to load object 'pping_kern.o'
Failed attaching ingress BPF program on interface test123: Invalid argument
Failed loading and attaching BPF programs in pping_kern.o
To help users that run into this issue when loading the program in
generic or unspecified mode, add a small hint suggesting upgrading
the kernel or using the tc ingress mode instead in case attaching
the XDP program fails.
However, if loaded in native mode, give the suggestion to try
loading in generic mode instead. While libbpf and libxdp already add
some messages hinting at this, this hint clarifies how to do this with
ePPing (using the --xdp-mode argument).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option to let the user configure which mode to load the XDP
program in (unspecified, native or generic).
Set the default mode to native (was unspecified previously) as that is
what the user most likely wants to use (generic mode, or unspecified
falling back on generic, will likely have worse performance).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Using the XDP ingress hook requires a newer kernel (needs Toke's patch
fixing the verification of global functions for BPF_PROG_TYPE_EXT
programs) than tc mode, and will likely perform worse than tc if
running in generic mode (due to no driver support for
XDP). Furthermore, even when XDP works and has driver support, its
performance benefit over tc is likely small as the packets are always
passed on to the network stack regardless (not creating a fast-path
that bypasses the network stack). Therefore, use the tc ingress hook
as default instead, and only use XDP if explicitly required by the
user (-I/--ingress-hook xdp).
This partly addresses issue #49, as ePPing should no longer by default
get the confusing error message from failing verification if the
kernel lacks Toke's verifier patch.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Define the BPF program names in the user space component. The strings
corresponding to the BPF program names were previously inserted in
several places, including in multiple string comparisons, which is
error prone and could lead to subtle errors if the program names were
changed and not updated correctly in all places. With the program name
strings defined as macros, they only have to be changed in a single place.
Currently only the names of the ingress programs occur in multiple
places, but also define the name for the egress program to be
consistent.
Note that even after this change one has to sync the defined values
with the actual program names declared in the pping_kern.c
file. Ideally, these would all be defined in a single place, but I am
not aware of a convenient way to make that happen (cannot use the defined
strings as function names as they are not identifiers, and if defined
as identifiers instead it would not be possible to use them as
strings).
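The pattern looks like this (pping_xdp_ingress matches the program
name visible in the verifier log further up; the other names are
illustrative):

    #define PROG_INGRESS_XDP "pping_xdp_ingress"
    #define PROG_INGRESS_TC  "pping_tc_ingress"
    #define PROG_EGRESS_TC   "pping_tc_egress"

    /* used in place of repeated string literals, e.g.: */
    prog = bpf_object__find_program_by_name(obj, PROG_INGRESS_XDP);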
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The userspace loader would only check if the tc clsact was created
when the egress program was loaded. Thus, if the ingress program
created the clsact, the egress program would not have to create it,
and ePPing would falsely believe it did not create a clsact, failing
to remove it on shutdown even if --force was used. Fix this by
checking if either the ingress or the egress program created the clsact.
This bug was introduced as a sneaky side effect of commit
78b45bde56 (pping: Use libxdp to load
and attach XDP program). Before this commit the egress program (for
which there is only a tc alternative) would be loaded first, and thus
it was sufficient to check if it created the clsact. When switching to
libxdp however, the ingress program (specifically the XDP program) had
to be loaded first, and thus the order of loading ingress and egress
program were swapped. Therefore, it was no longer sufficient to only
check the egress program as the tc ingress program may have created
the clsact before the egress program is attached (and only
checking the ingress program would also not be enough as the tc
ingress program may never be loaded if XDP mode is used instead).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Set the ingress_ifindex to the ctx->ingress_ifindex rather than
ctx->rx_queue_index. This fixes a bug that was accidently introduced
in commit #add8885, and which broke the localfilt functionality if the
XDP hook was used on ingress (the FIB lookup would fail).
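The fix is essentially a one-line change (the variable holding the
parsed value is illustrative):

    /* before (bug): */
    p_info->ingress_ifindex = ctx->rx_queue_index;
    /* after: */
    p_info->ingress_ifindex = ctx->ingress_ifindex;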
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing wait until the first shift of identifier (the "edge")
before starting to timestamp packets for new flows (for TCP flows we
do not see the start of).
The reason this is necessary is that if ePPing starts monitoring a flow
in the middle of it (ePPing did not see the start of the flow), then
we cannot know if the first TSval we see is actually the first
instance of the TSval in that flow, so we have to wait until the next
TSval to ensure we get the first instance of a TSval (otherwise we may
underestimate the RTT by up to the TCP timestamp update period). To
avoid the first RTT sample potentially being underestimated, this fix
essentially ignores the first RTT sample.
However, it is not always necessary to wait until the first shift. For
TCP traffic where we see the initial handshake we know that we've seen
the start of the flow. Furthermore, for ICMP traffic it's generally
unlikely that there are duplicate identifiers to begin with, so also
allow ICMP traffic to start timestamping right away.
It should be noted that after the previous commit (which changed
ePPing to ignore TCP SYN-packets by default), ePPing will never see
the handshake and thus has to assume that it started to monitor all
flows in the middle. Therefore, ePPing will (by default) now miss both
the RTT during the handshake, as well as RTT for the first few packets
sent after the handshake (until the TSval is updated).
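Conceptually, the timestamping condition becomes something like the
following sketch (member and helper names are hypothetical):

    /* Only timestamp flows we joined mid-stream once the identifier
     * has shifted away from the first value we observed ("the edge"). */
    static bool ready_to_timestamp(struct flow_state *f_state,
                                   struct packet_info *p_info)
    {
        if (f_state->opened_by_handshake) /* saw the flow start */
            return true;
        if (p_info->protocol == IPPROTO_ICMP ||
            p_info->protocol == IPPROTO_ICMPV6)
            return true; /* duplicate identifiers unlikely */
        return p_info->pid != f_state->first_identifier;
    }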
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing ignore TCP SYN packets by default, so that the
initial handshake phase of the connection is ignored. Add an
option (--include-syn/-s) to explicitly include SYN packets.
The main reason it can be a good idea to ignore SYN-packets is to
limit the impact of SYN-flood attacks. When ePPing also includes
SYN-packets it becomes quite vulnerable to SYN-flood attacks, which
will quickly fill up its flow_state table, blocking actual useful
flows from being tracked. As ePPing will consider the connection
opened as soon as it sees the SYN-ACK (it will not wait for final
ACK), flow-state created from SYN-flood attacks will also stay around
in the flow-state table for a long time (5 minutes currently) as no
RST/FIN will be sent that can be used to close it.
The drawback of ignoring SYN-packets is that no RTTs will be
collected during the handshake phase, and all connections will be
considered opened due to "first observed packet".
A more refined approach could be to properly track the full TCP
handshake (SYN + SYN-ACK + ACK) instead of the more generic "open once
we see reply in reverse direction" used now. However, this adds a fair
bit of additional protocol-specific logic. Furthermore, to track the
full handshake we will still need to store some flow-state before the
handshake is completed, and thus such a solution would still be
vulnerable to SYN-flood attacks (although the incomplete flow states
could potentially be cleaned up faster).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The previous commit fixes the issue of reordered packets being able to
bypass the unique TSval check, so remove the corresponding section
from the issues in the TODO.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The mechanism to ensure that only the first instance of each TSval is
timestamped is a simple equals check. This check may fail if there
are reordered packets.
Consider a sequence of packets A, B, C and D, where A and B have
TSval=1 and C and D have TSval=2. If all packets arrive in
order (ABCD), then A and C will correctly be the only packets that are
timestamped (as B and D will have the same TSval as the previously
observed one). However, consider if B is reordered so the packets
instead arrive as ACBD. In this scenario ePPing will attempt to
timestamp all of them (instead of only A and C), as each packet now has
a different (but not always higher) TSval than the last seen
packet. Note that it will only successfully create the timestamps for
the later duplicated TSvals if the previous timestamp for the same
TSval has already been cleared out, so this is mainly an issue when
RTT < 1ms.
Fix this by only allowing a packet to be timestamped if its TSval is
strictly higher (accounting for wrap-around) than the last seen TSval,
and likewise only update last seen TSval if it is strictly higher than
the previous one.
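The strictly-higher check with wrap-around handling is the classic
serial number comparison (the same idea as the kernel's
before()/after() helpers for TCP sequence numbers); a sketch:

    /* True if TSval a is strictly after b, treating the 32-bit TSval
     * as a wrapping counter. */
    static bool tsval_after(__u32 a, __u32 b)
    {
        return (__s32)(a - b) > 0;
    }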
To allow this calculation, also convert TSval and TSecr from network
byte order to host byte order when parsing the packet. While delaying
the transform from network to host byte order until the comparison
between the packet's TSval and last seen TSval could potentially save
the overhead of bpf_ntohl for some packets that do not need to go
through this check, most TCP packets will end up performing this
check, so the performance difference should be minimal. Therefore, opt
for the simpler approach of converting TSval and TSecr directly, which
also makes them easier to interpret if e.g. dumping the maps.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The debug counter for timed out (deleted by periodical cleanup) flow
states was never incremented, so fix that.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Use global functions to make use of function-by-function verification.
This allows the verifier to analyze parts of the program independently
of each other, which should help reduce the verification
complexity (the number of instructions the verifier must go through to
verify the program) and prevent it from growing exponentially with
every loop or branch added to the code.
In this case, break out the packet parsing (parse_packet_identifier)
as a global function, so that it can be handled separately from the
logic after it (updating flow state, saving timestamps, matching
replies to timestamps, calculating and pushing RTTs, etc.). To do this,
create small separate wrapper functions (parse_packet_identifier_tc()
and parse_packet_identifier_xdp()) for tc/xdp, so that the verifier
can correctly identify the arguments as pointers to
context (PTR_TO_CTX) when evaluating the global functions. Also create
small wrapper functions pping_tc() and pping_xdp() which call the
corresponding parse_packet_identifier_tc/xdp function.
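The resulting structure is roughly the following (simplified; the real
signatures are in pping_kern.c):

    /* Global (non-static) function - verified once, independently of
     * its callers. */
    __noinline int parse_packet_identifier_xdp(struct xdp_md *ctx,
                                               struct packet_info *p_info)
    {
        /* ... shared parsing logic ... */
        return 0;
    }

    SEC("xdp")
    int pping_xdp_ingress(struct xdp_md *ctx)
    {
        struct packet_info p_info = { 0 };

        if (parse_packet_identifier_xdp(ctx, &p_info) < 0)
            return XDP_PASS;
        /* ... flow state updates, timestamping, RTT matching ... */
        return XDP_PASS;
    }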
For this to work in XDP mode (which is the default), the kernel must
have been patched with a fix that addresses an issue with how global
functions are verified for XDP programs, see:
https://lore.kernel.org/all/20220606075253.28422-1-toke@redhat.com/
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Do not provide pointers into the original packet from packet_info
anymore (which the verifier has to ensure are valid), and instead
directly parse all necessary data in parse_packet_identifier and then
only use the parsed data in later functions.
This allows a cleaner separation of concerns, where the parsing
functions parse all necessary data from the packets, and other
functions that need information about the packet only rely on the data
provided in packet_info (and do not attempt to parse any data on their
own).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Move the is_egress and ingress_ifindex members from the parsing_context
struct to the packet_info struct. Also change the member is_egress to
is_ingress to better fit with the ingress_ifindex member.
These members were only in parsing_context because they were
convenient to fill in right from the start. However, it semantically
makes little sense for the parsing_context to contain these because
they are not used for any parsing, and they fit better with the
packet_info. This also allows later functions (is_local_address(),
pping_timestamp_packet() and pping_match_packet()) to get rid of their
dependency on parsing_context, as it was only used for the is_egress
and ingress_ifindex members (they do not do any parsing). After this
change, parsing_context is only used for the initial parsing, and
packet_info contains all the necessary data for all the functions
related to the pping logic that runs after the packet has been parsed.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Since we now have libxdp as a submodule, we can switch pping over to using
libxdp for loading its XDP program. This requires switching the order of
attachment, because libxdp needs to be fed the BPF object in an unloaded
state (and will load it as part of attaching).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The header files included from pping_kern.c include definitions of AF_INET
and AF_INET6, leading to warnings like:
pping_kern.c:25:9: warning: 'AF_INET' macro redefined [-Wmacro-redefined]
^
/usr/include/bits/socket.h:97:9: note: previous definition is here
^
pping_kern.c:26:9: warning: 'AF_INET6' macro redefined [-Wmacro-redefined]
^
/usr/include/bits/socket.h:105:9: note: previous definition is here
^
2 warnings generated.
Fix this by guarding the definitions behind suitable ifdefs.
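The guards are simply (the values are the standard Linux ones):

    #ifndef AF_INET
    #define AF_INET 2
    #endif
    #ifndef AF_INET6
    #define AF_INET6 10
    #endif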
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The connection state had 3 boolean flags related to what state it was
in (is_empty, has_opened and has_closed). Only specific combinations
of these flags really made sense (has_opened/has_closed didn't really
mean anything if is_empty, and if has_closed one would expect is_empty
to be false and has_opened to be true, etc.). Therefore, replace these
combinations of boolean values with a single enum which is used to
check if the flow is empty, waiting to open (outgoing packet seen but
no response), open, or closed.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Combine the flow state entries for both the "forward" and "reverse"
direction of the flow into a single dualflow state. Change the
flowstate map to use the dualflow state so that state for both
directions can be retrieved using a single map lookup.
As flow states are now kept in pairs, we cannot directly create/delete
states from the BPF map each time a flow opens/closes in one
direction. Therefore, update all logic related to creating/deleting
flows. For example, use the "empty" slot in the dualflow state instead
of creating a new map entry, and only delete the dualflow state entry
once both directions of the flow have closed/timed out.
Some implementation details:
Have implemented a simple memcmp function as I could not get the
__builtin_memcmp function to work (got error "libbpf: failed to
find BTF for extern 'memcmp': -2").
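The replacement is a straightforward byte-wise loop that the verifier
can handle; a sketch:

    static int my_memcmp(const void *s1, const void *s2, __u32 size)
    {
        const __u8 *p1 = s1, *p2 = s2;
        __u32 i;

        for (i = 0; i < size; i++) {
            if (p1[i] != p2[i])
                return p1[i] > p2[i] ? 1 : -1;
        }
        return 0;
    }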
To ensure that both directions of the flow always look up the same
entry, use the "sorted" flow tuple (the (ip, port) pair that is
smaller is always first) as key. This is what the memcmp is used for.
To avoid storing two copies of the flow tuple (src -> dst and dst ->
src) and doing additional memcmps, always store the flow state for the
"sorted" direction as the first direction and the reverse as the
second direction. Then simply check if a flow is sorted or not to
determine which direction in the dualflow state matches. Have
attempted to at least partially abstract this detail away from most of
the code by adding some get_flowstate_from* helpers.
The dualflow state simply stores the two (single direction) flow
states as the struct members dir1 and dir2. Use these two (admittedly
poorly named) members instead of a single array of size 2 in order to
avoid some issues with the verifier being worried that the array index
might be out of bounds.
Have added some new boolean members to the flow state to keep track of
"connection state". In addition to the previous has_opened, I now
also have members for whether the flow is "empty" and whether it has
been closed. These are needed to cope with having to keep individual flow
states for both directions of the flow around as long as one direction
of the flow is used. I plan to replace these boolean "connection
state" members with a single enum in a future commit.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Refactor functions for parsing protocol-specific packet
identifiers (parse_tcp_identifier, parse_icmp6_identifier and
parse_icmp_identifier) so they no longer directly fill in the
packet_info struct. Instead make the functions take additional
pointers as arguments and fill in a protocol_info struct.
The reason for this change is to decouple the
parse_<protocol>_identifier functions from the logic of how the
packet_info struct should be filled. The parse_packet_identifier
function is now solely responsible for filling in the members of the
packet_info struct correctly, instead of working in tandem with the
parse_<protocol>_identifier functions, each filling in some members.
This might result in a minor performance degradation, as some values
are now first filled into the protocol_info struct and later copied to
packet_info instead of being written directly into packet_info.
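A sketch of what such a protocol_info struct might contain (member
names are illustrative):

    struct protocol_info {
        __u32 pid;        /* identifier from the packet, e.g. TCP TSval */
        __u32 reply_pid;  /* echoed identifier, e.g. TCP TSecr */
        bool pid_valid;
        bool reply_pid_valid;
        enum flow_event_type event_type; /* e.g. flow opening/closing */
    };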
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Format code using clang-format from the kernel tree. However, leave
code in its original format in some instances where clang-format clearly
reduces readability (e.g. do not remove the alignment of comments
for struct members and long options).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add a counter of outstanding (unmatched) timestamped entries in the
flow state. Before a timestamp lookup is attempted, check whether there
are any outstanding timestamps, and otherwise skip the unnecessary hash
map lookup.
Use a 32 bit counter for outstanding timestamps to allow atomic
increments/decrements using __sync_fetch_and_add. This operation is
not supported on smaller integers, which is why such a large counter
is used. The atomicity is needed because the counter may be
concurrently accessed by both the ingress/egress hook as well as the
periodical map cleanup.
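A sketch of the check and the atomic updates (the member name is
illustrative):

    /* Skip the timestamp map lookup entirely when nothing can match */
    if (f_state->outstanding_timestamps == 0)
        return;

    /* When creating a timestamp entry: */
    __sync_fetch_and_add(&f_state->outstanding_timestamps, 1);

    /* When matching or cleaning up an entry: */
    __sync_fetch_and_add(&f_state->outstanding_timestamps, -1);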
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>