Add a field with the total packet length (including all headers) to
the packet_info struct. This information will be needed in later
commits which add byte counts to the aggregated information.
Note that this information is already part of the parsing_context
struct, but this won't be available after the packet has been
parsed (once the parse_packet_identifier_{tc,xdp}() functions have
finished). It is unfortunately not trivial to replace the current
instances which use pkt_len from the parsing_context to instead take
it from packet_info, as e.g. parse_tcp_identifier() already takes 5
arguments, and packet_info is not one of them. Therefore, keep both
the pkt_len in parsing_context and packet_info for now.
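A rough sketch of the addition (the field type and surrounding members
are illustrative; the real struct packet_info in pping.h has more
fields):
#include <linux/types.h>
struct packet_info {
        /* ... existing members (timestamps, flow tuple, etc.) ... */
        __u32 pkt_len; /* total packet length, including all headers */
};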
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Keep track of when the last update was made to each IP-prefix in the
aggregation map, and delete entries which are older than
--aggregate-timeout (30 seconds by default). If the user specifies
zero (0), entries never expire (which is consistent with how
--cleanup-interval operates).
Note that as the BPF programs rotate between two maps (an active one
for BPF progs to use, and an inactive one the user space can operate
on), an aggregated prefix may expire from one of the maps even if it
has seen recent activity in the other map.
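The expiry test itself is simple; a minimal sketch (function and
parameter names are illustrative, all times in the same unit, e.g. ns):
#include <stdbool.h>
#include <linux/types.h>
static bool aggregation_entry_expired(__u64 now, __u64 last_update,
                                      __u64 timeout)
{
        if (timeout == 0) /* 0 = never expire */
                return false;
        return now - last_update > timeout;
}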
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add support for outputting the aggregated reports in JSON format. This
format includes the raw histogram bin counts, making it possible to
post-process the aggregated RTT statistics.
The user specifies the format for the aggregated output in the same
way as for the per-RTT output, by using the -F/--format argument. If
the user attempts to use the ppviz format for the aggregated
output (which is not supported) the program will error out.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Create the start of the JSON array at program startup (if configured
to use JSON format) instead of at the first report. This ensures that
ePPing provides valid JSON output (an empty array, []) even if the
program is stopped before any report is generated. Before this change,
ePPing could generate empty output (""), which is not valid JSON
output, if it was stopped before the first report.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Provide some statistics (min, mean, median, p95, max) instead of
dumping the raw bin counts.
While the raw bin counts provide more information and can be used for
further post processing, they are hard for a human to parse and make
sense of. Therefore, they are more suitable for a data-oriented
format, such as the JSON output.
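For reference, a percentile like p95 can be derived from the bin
counts roughly as sketched below (names are illustrative; min and max
are tracked separately by the BPF programs):
#include <linux/types.h>
static int histogram_percentile_bin(const __u32 *bins, int nr_bins,
                                    __u64 total_count, double percentile)
{
        __u64 target = (__u64)(percentile / 100.0 * total_count);
        __u64 cumulative = 0;
        int i;
        for (i = 0; i < nr_bins; i++) {
                cumulative += bins[i];
                if (cumulative >= target)
                        return i; /* RTT is roughly i * bin width */
        }
        return nr_bins - 1;
}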
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
By default ePPing will aggregate RTTs based on the src IP of the reply
packet. I.e. the RTT A->B->A will be aggregated based on the IP of B. In
some scenarios it may be more interesting to aggregate based on the
dst IP of the reply packet (IP of A in above example). Therefore, add
a switch (--aggregate-reverse) which makes ePPing aggregate RTTs
based on the dst IP of the reply packet instead of the src IP. In
other words, by default ePPing will aggregate traffic based on where
it's going to, but with this switch you can make ePPing aggregate
traffic based on where it's coming from instead.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Instead of keeping all RTTs since ePPing started, reset the aggregated
stats after each time they're reported so the report only shows the
RTTs since the last report.
To avoid concurrency issues due to user space reading and resetting
the map while the BPF programs are updating it, use two aggregation
maps, one active and one inactive. Each time user space wishes to
report the aggregated RTTs it first switches which map is actively
used by the BPF progs, and then reads and resets the now inactive map.
As the RTT stats are now periodically reset, change the
histogram (aggregated_rtt_stats.bins) to use __u32 instead of __u64
counters as the risk of overflowing is low (even if 1 million RTTs/s
are added to the same bin, it would take over an hour to overflow,
and reports are likely made more frequently than that).
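Conceptually, the user-space reporting step looks roughly like the
sketch below (how the active map index is communicated to the BPF
programs is left out/simplified here):
static int active_map_idx; /* map currently updated by the BPF programs */
static void rotate_aggregation_maps(void)
{
        int now_inactive = active_map_idx;
        active_map_idx = !active_map_idx; /* BPF progs switch to other map */
        /* The map at now_inactive is no longer written to by the BPF side,
         * so user space can safely read out its aggregated stats, emit the
         * report, and zero the entries for the next rotation. */
        (void)now_inactive;
}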
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option -a or --aggregate to provide an aggregate report of RTT
samples every X seconds. This is currently mutually exclusive with the
normal per-RTT sample reports.
The aggregated stats are never reset, and thus contain all RTTs since
the start of tracing. The next commit will change this to reset the
stats after every report, so that each report only contains the RTTs
since the last report.
The RTTs are aggregated and reported per IP-prefix, where the user can
modify the size of the prefixes used for IPv4 and IPv6 using the
--aggregate-subnet-v4/v6 flags.
In this initial implementation for aggregating RTTs, the minimum and
maximum RTT are tracked and all RTTs are added to a histogram. It uses
a predetermined number of bins of equal width (set to 1000 bins, each
1 ms wide), see RTT_AGG_NR_BINS and RTT_AGG_BIN_WIDTH in pping.h. In
the future this could be changed to use more sophisticated histograms
that better capture a wide variety of RTTs.
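As a sketch, mapping an RTT to a bin with this fixed-width scheme is
just a division and a clamp (the bin width is assumed to be expressed
in ns here):
#include <linux/types.h>
#define RTT_AGG_NR_BINS   1000
#define RTT_AGG_BIN_WIDTH 1000000 /* 1 ms in ns (assumed unit) */
static __u32 rtt_to_bin_idx(__u64 rtt)
{
        __u64 idx = rtt / RTT_AGG_BIN_WIDTH;
        return idx < RTT_AGG_NR_BINS ? (__u32)idx : RTT_AGG_NR_BINS - 1;
}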
Implement the periodic reporting of RTTs by using a
timerfd (configured to the user-provided interval) and adding it to
the main epoll loop.
To minimize overhead from the hash lookups, use separate maps for IPv4
and IPv6, so that for IPv4 traffic the hashmap key is only 4
bytes (instead of 16). Furthermore, limit the maximum IPv6 prefix
length to 64 so that the IPv6 map can use an 8-byte key.
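The IPv4 map key is then just the address masked down to the
configured prefix, roughly as below (the IPv6 case does the equivalent
masking on the upper 64 bits):
#include <arpa/inet.h>
#include <linux/types.h>
static __u32 ipv4_map_key(__u32 addr, __u8 prefix_len)
{
        /* addr in network byte order; guard against the undefined shift
         * by 32 for a /0 prefix */
        __u32 mask = prefix_len ? htonl(~0U << (32 - prefix_len)) : 0;
        return addr & mask;
}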
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Instead of specifying the map size directly in the map definitions,
add them as defines at the top of the file to make them easier to
change (no need to find the correct map among the map
definitions). This pattern will also simplify future additions of
maps, where multiple maps may share the same size.
While at it, increase the default map sizes to 131072 (2^17) entries,
as the previous value of 16384 (2^14) was fairly underdimensioned,
especially for the packet_ts map. If only half of the timestamps are
echoed back (due to e.g. delayed ACKs), it would in theory be enough
with just 16k / (500 * 1) = 32 concurrent flows to fill it up with
stale entries (assuming the default cleanup interval of 1s).
Increasing the size of these maps will increase the corresponding
memory cost from 2^14 * (48 + 4) = 832 KiB and 2^14 * (44 + 144) =
2.94 MiB to 2^17 * (48 + 4) = 6.5 MiB and 2^17 * (44 + 144) = 23.5
MiB, respectively, which should generally not be too problematic.
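The pattern itself is just a define referenced from the map
definition, along the lines of (define name and key/value types below
are placeholders):
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#define MAP_PACKET_TS_SIZE 131072 /* 2^17 */
struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, MAP_PACKET_TS_SIZE);
        __type(key, __u64);   /* placeholder; the real types live in pping.h */
        __type(value, __u64);
} packet_ts SEC(".maps");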
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Remove the global keep_running variable and instead write to a pipe to
tell the main thread to abort in case periodical map cleanup
fails. Add the reading side of this pipe to the epoll loop in the
main thread, and update the main loop so it silently stops if it
receives the special value PPING_ABORT through the pipe.
As the map cleaning thread can now immediately tell the main loop to
abort, it is no longer necessary to have a short
timeout (EPOLL_TIMEOUT_MS) on the main loop to quickly detect changes
in the keep_running flag. So change the epoll loop to wait indefinitely
for one of the fds to update instead of timing out frequently.
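Sketch of the mechanism (fd handling simplified, error handling mostly
trimmed):
#include <stdio.h>
#include <unistd.h>
static int abort_pipe[2]; /* [0] read end in the epoll set, [1] write end */
static void abort_main_loop(void)
{
        char buf = 0;
        /* Called from the map cleanup thread on failure; in the real code
         * the value written/read is PPING_ABORT */
        if (write(abort_pipe[1], &buf, sizeof(buf)) != sizeof(buf))
                fprintf(stderr, "Failed signaling the main loop to abort\n");
}
/* The main loop can then block without a timeout:
 *     nfds = epoll_wait(epfd, events, max_events, -1);
 */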
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Use the signalfd API to handle graceful shutdown on SIGINT/SIGTERM. To
watch the signalfd, create an epoll instance and add both the signalfd
and the perf-buffer to the epoll instance so that both can be
monitored in the main loop with epoll_wait().
This prevents the signal handler from interrupting the perf-buffer
polling and avoids the other issues with asynchronous signal
handling. Furthermore, the restructuring of the main loop to support
watching multiple file descriptors makes it possible to add additional
events to the main loop in the future (such as a periodical task
triggered by a timerfd).
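The setup looks roughly like the sketch below (error handling trimmed
for brevity):
#include <signal.h>
#include <sys/epoll.h>
#include <sys/signalfd.h>
static int add_signalfd_to_epoll(int epfd)
{
        struct epoll_event ev = { .events = EPOLLIN };
        sigset_t mask;
        int sfd;
        sigemptyset(&mask);
        sigaddset(&mask, SIGINT);
        sigaddset(&mask, SIGTERM);
        /* Block normal delivery so the signals are only seen via the fd */
        sigprocmask(SIG_BLOCK, &mask, NULL);
        sfd = signalfd(-1, &mask, SFD_CLOEXEC);
        ev.data.fd = sfd;
        epoll_ctl(epfd, EPOLL_CTL_ADD, sfd, &ev);
        return sfd;
}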
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Fix two edge cases with the parse_bounded_double() function.
1. It accepted an empty string without raising an error. This should
not have been an issue in practice as getopt_long() should have
detected it as a lack of argument. This is addressed by adding a check
for whether anything was parsed at all.
2. It could overflow/underflow without raising an error. This is
addressed by adding a check of errno (which is set in case of
overflow/underflow, but not in case of conversion error).
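In terms of strtod(), the two added checks amount to roughly the
following (sketch; the real parse_bounded_double() also validates the
bounds and prints its own error messages):
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>
static bool parse_double_strict(const char *str, double *val)
{
        char *endptr;
        errno = 0;
        *val = strtod(str, &endptr);
        if (endptr == str)   /* nothing parsed at all, e.g. empty string */
                return false;
        if (errno == ERANGE) /* overflow/underflow is reported via errno */
                return false;
        return true;
}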
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The parse_arguments() function used to have a separate variable for
each float (or rather a double) value it would parse from the user. As
only one argument is parsed at a time this is redundant and will
require more and more variables as new options are added. Replace all
these variables with a single "user_float", which is used for all
options that parse a float from the user.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Extract the logic for filling in and sending an RTT event into its own
function. This makes it consistent with other send_*_event() functions
and will make it easier/cleaner to add an option to aggregate the RTT
instead of sending it.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The enum pping_output_format was uppercased, which is unconventional
for a type. Change it to lowercase.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The system libbpf check would unconditionally fail with a modern libbpf
because the example program used outdated and removed library functions.
Update the test program so that the system libbpf check can pass again.
Signed-off-by: Ronan Pigott <ronan@rjp.ie>
Update the xdp-tools submodule version to the newest upstream. Among other
things, this contains some build fixes for newer versions of libbpf.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The defines.mk file always set -DDEBUG, inherited from the xdp-tools build
system. However, the configure script in this repository doesn't actually
support the PRODUCTION variable, so change the define to only set -DDEBUG
if a DEBUG variable is supplied to 'make'. This way DEBUG can be turned on
with a command-line DEBUG=1 parameter to 'make', but will be unset
otherwise.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Newer versions of libbpf deprecated the 'classifier' section names in
favour of just 'tc'. Update the nat64 code accordingly.
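I.e. section definitions change along these lines (the program name
below is illustrative):
#include <linux/bpf.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
/* Was: SEC("classifier") */
SEC("tc")
int nat64_example(struct __sk_buff *skb)
{
        /* ... translation logic ... */
        return TC_ACT_OK;
}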
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
When translating packets, we also need to update the TCP and UDP checksums
as they are computed over a pseudo header that also includes the IP
addresses.
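This can be done incrementally, e.g. with bpf_l4_csum_replace() and
BPF_F_PSEUDO_HDR, or by folding the address difference into the
checksum by hand as in the rough sketch below (all arguments read
directly from the packet so the byte order is consistent; UDP's
special zero-checksum handling omitted):
#include <linux/types.h>
/* RFC 1624-style update of a TCP/UDP checksum when one 32-bit
 * pseudo-header word (e.g. an IPv4 address) changes */
static __u16 csum_update_word(__u16 check, __u32 old_word, __u32 new_word)
{
        __u32 sum = (__u16)~check;
        sum += (__u16)~(old_word >> 16) + (__u16)~(old_word & 0xffff);
        sum += (new_word >> 16) + (new_word & 0xffff);
        /* Fold the carries back into 16 bits */
        sum = (sum & 0xffff) + (sum >> 16);
        sum = (sum & 0xffff) + (sum >> 16);
        return (__u16)~sum;
}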
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The attach mode is by default set to XDP_MODE_NATIVE and needs to be
overwritten to XDP_MODE_SKB when the '-S' option is used. Instead of
being overwritten, the attach mode was ORed with the new value, and
so the program was always running in NATIVE mode. This patch fixes
that.
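A minimal illustration of the difference (variable/parameter names are
illustrative):
#include <stdbool.h>
#include <xdp/libxdp.h>
static void apply_skb_mode_opt(enum xdp_attach_mode *mode, bool opt_skb)
{
        if (opt_skb)
                *mode = XDP_MODE_SKB; /* overwrite the NATIVE default
                                       * instead of ORing into it */
}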
Signed-off-by: Tirthendu Sarkar <tirthendu.sarkar@intel.com>
Fix for issue 78; veth does not support zerocopy in bind flags
Remove the XDP_ZEROCOPY flag from the setting of
port_params_default.xsk_cfg.bind_flags in AF_XDP-forwarding/xsk_fwd.c.
With this change, libxdp first tries to set up zerocopy, and when it
finds that this is not available it falls back to an implementation
which copies the data. So performance will not be impacted for devices
which do support zerocopy.
Signed-off-by: Chris Ward <tjcw@uk.ibm.com>
Old system include headers don't have the SO_PREFER_BUSY_POLL and
SO_BUSY_POLL_BUDGET socket option defines. Add conditional defines to the
AF_XDP-example userspace code so we can still compile if they are missing.
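The conditional defines look roughly like this (values as in the
kernel's asm-generic socket.h):
#ifndef SO_PREFER_BUSY_POLL
#define SO_PREFER_BUSY_POLL 69
#endif
#ifndef SO_BUSY_POLL_BUDGET
#define SO_BUSY_POLL_BUDGET 70
#endif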
Fixes #76.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
This updates the xdp-tools embedded version to fix an issue with the
feature testing of the custom libbpf copy that the bpf-examples repository
is using.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Update the xdp-tools submodule version; in particular, this brings in
a bugfix for AF_XDP that wasn't in the previous submodule version.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Turns out we didn't actually bail out if libmnl was not found. Let's do
that so it becomes obvious what's missing (otherwise the build will just
fail later).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
No reason to keep using the old version, updating doesn't even require
any other source code changes.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
We're using libxdp features from an unreleased version of the library, so
always use (and configure) the submodule version of it.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The nat64-bpf example uses libmnl, so add a check for it in configure, and
bail if it's not available.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
When compiled with LLVM-15 (clang-15 and llc-15), the verifier would
reject the tsmap_cleanup program as reported in #63. To prevent this
add a NULL-check for df_state after the map lookup, to convince the
verifier that we're not trying to dereference a pointer to a map value
before checking for NULL. This fix ensures that the bytecode generated
by LLVM-12 through LLVM-15 passes the verifier (tested on kernel 5.19.3).
There was already a NULL-check for df_state in the (inlined by the
compiler) function fstate_from_dfkey() which checked df_state before
accessing its fields (the specific access that angered the verifier
was df_state->dir2). However, with LLVM-15 the compiler reorders the
operations so that df_state->dir2 is accessed before the NULL-check is
performed, thus upsetting the verifier. This commit removes the
internal NULL-check in fstate_from_dfkey() and instead performs the
relevant NULL-check directly in the tsmap_cleanup prog. In all
other places that fstate_from_dfkey() ends up being called there are
already NULL-checks for df_state to enable early returns.
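Schematically, the cleanup path now does something like the sketch
below (struct and map parameter names are illustrative, not the exact
pping_kern.c definitions):
#include <bpf/bpf_helpers.h>
struct dual_flow_state; /* actual definition lives in the pping sources */
static int tsmap_cleanup_sketch(void *flow_map, void *df_key)
{
        struct dual_flow_state *df_state;
        df_state = bpf_map_lookup_elem(flow_map, df_key);
        if (!df_state) /* checked before any member access, so the verifier
                        * never sees a potentially-NULL dereference */
                return 0;
        /* only here is it safe to touch df_state->dir2 or to call
         * fstate_from_dfkey(df_state, ...) */
        return 0;
}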
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Previously the program would only print out an error message if the
cleanup of a map failed, and then keep running. Each time the
periodical cleanup failed the error message would be repeated, but no
further action was taken. Change this behavior so that the cleanup
thread is terminated and the rest of the program aborts.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Due to a kernel bug for XDP programs loaded via libxdp that use global
functions (see https://lore.kernel.org/bpf/8735gkwy8h.fsf@toke.dk/t/),
XDP mode only works on relatively recent kernels where the bug is
patched (or kernels where the patch has been backported). As many
users may not have such a recent kernel they will only see a confusing
verifier error like the following:
Starting ePPing in standard mode tracking TCP on test123
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: prog 'pping_xdp_ingress': BPF program load failed: Invalid argument
libbpf: prog 'pping_xdp_ingress': -- BEGIN PROG LOAD LOG --
Func#1 is safe for any args that match its prototype
Validating pping_xdp_ingress() func#0...
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int pping_xdp_ingress(struct xdp_md *ctx)
0: (bf) r6 = r1 ; R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
1: (b7) r7 = 0 ; R7_w=invP0
; struct packet_info p_info = { 0 };
2: (7b) *(u64 *)(r10 -8) = r7 ; R7_w=invP0 R10=fp0 fp-8_w=00000000
3: (7b) *(u64 *)(r10 -16) = r7 ; R7_w=invP0 R10=fp0 fp-16_w=00000000
4: (7b) *(u64 *)(r10 -24) = r7 ; R7_w=invP0 R10=fp0 fp-24_w=00000000
5: (7b) *(u64 *)(r10 -32) = r7 ; R7_w=invP0 R10=fp0 fp-32_w=00000000
6: (7b) *(u64 *)(r10 -40) = r7 ; R7_w=invP0 R10=fp0 fp-40_w=00000000
7: (7b) *(u64 *)(r10 -48) = r7 ; R7_w=invP0 R10=fp0 fp-48_w=00000000
8: (7b) *(u64 *)(r10 -56) = r7 ; R7_w=invP0 R10=fp0 fp-56_w=00000000
9: (7b) *(u64 *)(r10 -64) = r7 ; R7_w=invP0 R10=fp0 fp-64_w=00000000
10: (7b) *(u64 *)(r10 -72) = r7 ; R7_w=invP0 R10=fp0 fp-72_w=00000000
11: (7b) *(u64 *)(r10 -80) = r7 ; R7_w=invP0 R10=fp0 fp-80_w=00000000
12: (7b) *(u64 *)(r10 -88) = r7 ; R7_w=invP0 R10=fp0 fp-88_w=00000000
13: (7b) *(u64 *)(r10 -96) = r7 ; R7_w=invP0 R10=fp0 fp-96_w=00000000
14: (7b) *(u64 *)(r10 -104) = r7 ; R7_w=invP0 R10=fp0 fp-104_w=00000000
15: (7b) *(u64 *)(r10 -112) = r7 ; R7_w=invP0 R10=fp0 fp-112_w=00000000
16: (7b) *(u64 *)(r10 -120) = r7 ; R7_w=invP0 R10=fp0 fp-120_w=00000000
17: (7b) *(u64 *)(r10 -128) = r7 ; R7_w=invP0 R10=fp0 fp-128_w=00000000
18: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
;
19: (07) r2 += -128 ; R2=fp-128
; if (parse_packet_identifer_xdp(ctx, &p_info) < 0)
20: (85) call pc+13
R1 type=ctx expected=fp
Caller passes invalid args into func#1
processed 206542 insns (limit 1000000) max_states_per_insn 32 total_states 13238 peak_states 792 mark_read 40
-- END PROG LOAD LOG --
libbpf: failed to load program 'pping_xdp_ingress'
libbpf: failed to load object 'pping_kern.o'
Failed attaching ingress BPF program on interface test123: Invalid argument
Failed loading and attaching BPF programs in pping_kern.o
To help users that run into this issue when loading the program in
generic or unspecified mode, add a small hint suggesting to
upgrade the kernel or use the tc ingress mode instead in case
attaching the XDP program fails.
However, if loaded in native mode, give the suggestion to try loading
in generic mode instead. While libbpf and libxdp already add
some messages hinting at this, this hint clarifies how to do this with
ePPing (using the --xdp-mode argument).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option to let the user configure which mode to load the XDP
program in (unspecified, native or generic).
Set the default mode to native (was unspecified previously) as that is
what the user most likely wants to use (generic or unspecified falling
back on generic will likely have worse performance).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Using the XDP ingress hook requires a newer kernel (needs Toke's patch
fixing the verification of global functions for BPF_PROG_TYPE_EXT
programs) than tc mode, and will likely perform worse than tc if
running in generic mode (due to no driver support for
XDP). Furthermore, even when XDP works and has driver support, its
performance benefit over tc is likely small as the packets are always
passed on to the network stack regardless (not creating a fast-path
that bypasses the network stack). Therefore, use the tc ingress hook
as default instead, and only use XDP if explicitly required by the
user (-I/--ingress hook xdp).
This partly addresses issue #49, as ePPing should no longer by default
get the confusing error message from failing verification if the
kernel lacks Toke's verifier patch.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Define the BPF program names in the user space component. The strings
corresponding to the BPF program names were previously inserted in
several places, including in multiple string comparisons, which is
error prone and could lead to subtle errors if the program names are
changed and not updated correctly in all places. With the program
name strings defined, they only have to be changed in a single place.
Currently only the names of the ingress programs occur in multiple
places, but also define the name for the egress program to be
consistent.
Note that even after this change one has to sync the defined values
with the actual program names declared in the pping_kern.c
file. Ideally, these would all be defined in a single place, but I am
not aware of a convenient way to make that happen (cannot use the defined
strings as function names as they are not identifiers, and if defined
as identifiers instead it would not be possible to use them as
strings).
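The pattern is simply along these lines (define name illustrative; the
program name is the one also visible in the load log quoted in an
earlier commit message):
#include <bpf/libbpf.h>
#define PROG_XDP_INGRESS_NAME "pping_xdp_ingress"
static struct bpf_program *get_xdp_ingress_prog(struct bpf_object *obj)
{
        /* Lookups and comparisons go through the define, so a renamed BPF
         * program only requires updating a single string */
        return bpf_object__find_program_by_name(obj, PROG_XDP_INGRESS_NAME);
}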
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The userspace loader would only check if the tc clsact was created
when the egress program was loaded. Thus, if the ingress program
created the clsact, the egress program would not have to create it,
and ePPing would falsely believe it did not create a clsact and fail
to remove it on shutdown even if --force was used. Fix this by
checking if either the ingress or egress program created the clsact.
This bug was introduced as a sneaky side effect of commit
78b45bde56 (pping: Use libxdp to load
and attach XDP program). Before this commit the egress program (for
which there is only a tc alternative) would be loaded first, and thus
it was sufficient to check if it created the clsact. When switching to
libxdp however, the ingress program (specifically the XDP program) had
to be loaded first, and thus the order of loading the ingress and
egress programs was swapped. Therefore, it was no longer sufficient
to only check the egress program, as the tc ingress program may have
created the clsact before the egress program is attached (and only
checking the ingress program would also not be enough as the tc
ingress program may never be loaded if XDP mode is used instead).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
traffic-pacing-edt: Fixed an operator precedence issue in codel_impl.h
This bug caused the CoDel control law to always stay at the initial 100 ms drop interval.
After this fix the control law behaves correctly, by becoming more aggressive (smaller next drop intervals)
in accordance with the inverse square root (that TCP traffic responds to).
Set the ingress_ifindex to the ctx->ingress_ifindex rather than
ctx->rx_queue_index. This fixes a bug that was accidentally introduced
in commit add8885, and which broke the localfilt functionality if the
XDP hook was used on ingress (the FIB lookup would fail).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing wait until the first shift of identifier (the "edge")
before starting to timestamp packets for new flows (for TCP flows we
do not see the start of).
The reason this is necessary is that if ePPing starts monitoring a flow
in the middle of it (ePPing did not see the start of the flow), then
we cannot know if the first TSval we see is actually the first
instance of the TSval in that flow, so we have to wait until the next
TSval to ensure we get the first instance of a TSval (otherwise we may
underestimate the RTT by up to the TCP timestamp update period). To
avoid the first RTT sample potentially being underestimated this fix
essentially ignores the first RTT sample instead.
However, it is not always necessary to wait until the first shift. For
TCP traffic where we see the initial handshake we know that we've seen
the start of the flow. Furthermore, for ICMP traffic it's generally
unlikely that there are duplicate identifiers to begin with, so also
allow ICMP flows to start timestamping right away.
It should be noted that after the previous commit (which changed
ePPing to ignore TCP SYN-packets by default), ePPing will never see
the handshake and thus has to assume that it started to monitor all
flows in the middle. Therefore, ePPing will (by default) now miss both
the RTT during the handshake, as well as RTT for the first few packets
sent after the handshake (until the TSval is updated).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing ignore TCP SYN packets by default, so that the
initial handshake phase of the connection is ignored. Add an
option (--include-syn/-s) to explicitly include SYN packets.
The main reason it can be a good idea to ignore SYN packets is to
avoid being affected by SYN-flood attacks. When ePPing also includes
SYN-packets it becomes quite vulnerable to SYN-flood attacks, which
will quickly fill up its flow_state table, blocking actual useful
flows from being tracked. As ePPing will consider the connection
opened as soon as it sees the SYN-ACK (it will not wait for the final
ACK), flow-state created from SYN-flood attacks will also stay around
in the flow-state table for a long time (5 minutes currently) as no
RST/FIN will be sent that can be used to close it.
The drawback from ignoring SYN-packets is that no RTTs will be
collected during the handshake phase, and all connections will be
considered opened due to "first observed packet".
A more refined approach could be to properly track the full TCP
handshake (SYN + SYN-ACK + ACK) instead of the more generic "open once
we see reply in reverse direction" used now. However, this adds a fair
bit of additional protocol-specific logic. Furthermore, to track the
full handshake we will still need to store some flow-state before the
handshake is completed, and thus such a solution would still be
vulnerable to SYN-flood attacks (although the incomplete flow states
could potentially be cleaned up faster).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The get_next_interval_sqrt function has the line:
__u64 val = (__u64)CODEL_EXCEED_INTERVAL << 16 / get_sqrt_sh16(cnt);
However, the division operator has higher precedence than the shift
operator. Therefore, 16 / get_sqrt_sh16(cnt) will always evaluate to zero.
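The fix is to parenthesize the shift so it is applied before the
division, presumably:
__u64 val = ((__u64)CODEL_EXCEED_INTERVAL << 16) / get_sqrt_sh16(cnt);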
Signed-off-by: Frey Alfredsson <freysteinn@freysteinn.com>