We're using libxdp features from an unreleased version of the library, so
always use (and configure) the submodule version of it.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The nat64-bpf example uses libmnl, so add a check for it in configure, and
bail if it's not available.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
When compiled with LLVM-15 (clang-15 and llc-15), the verifier would
reject the tsmap_cleanup program, as reported in #63. To prevent this,
add a NULL-check for df_state after the map lookup, to convince the
verifier that we're not trying to dereference a pointer to a map value
before checking it for NULL. This fix ensures that the bytecode
generated by LLVM-12 through LLVM-15 passes the verifier (tested on
kernel 5.19.3).
There was already a NULL-check for df_state in the (inlined by the
compiler) function fstate_from_dfkey(), which checked df_state before
accessing its fields (the specific access that angered the verifier
was df_state->dir2). However, with LLVM-15 the compiler reorders the
operations so that df_state->dir2 is accessed before the NULL-check is
performed, thus upsetting the verifier. This commit removes the
internal NULL-check in fstate_from_dfkey() and instead performs the
relevant NULL-check directly in the tsmap_cleanup prog. In all other
places where fstate_from_dfkey() ends up being called there are
already NULL-checks for df_state to enable early returns.
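As a rough illustration of the pattern (the map, key and struct below
are simplified stand-ins, not the exact definitions from pping_kern.c):

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Simplified stand-ins for the real flow-state types/map */
struct dual_flow_state {
        __u32 dir1;
        __u32 dir2;
};

struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 16384);
        __type(key, __u64);
        __type(value, struct dual_flow_state);
} flow_state SEC(".maps");

static __u32 lookup_dir2(__u64 *df_key)
{
        struct dual_flow_state *df_state;

        df_state = bpf_map_lookup_elem(&flow_state, df_key);
        if (!df_state)         /* NULL-check directly after the lookup... */
                return 0;      /* ...before any field is accessed */

        return df_state->dir2; /* safe: the verifier has seen the check */
}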
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Previously the program would only print an error message if the
cleanup of a map failed, and then keep running. Each time the
periodic cleanup failed the error message would be repeated, but no
further action was taken. Change this behavior so that the program
instead terminates the cleanup thread and aborts the rest of the
program.
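A minimal sketch of the new behavior, assuming a pthread-based cleanup
loop (clean_map() and CLEANUP_INTERVAL are hypothetical stand-ins, not
the actual ePPing code):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CLEANUP_INTERVAL 1 /* seconds; hypothetical value */

static int clean_map(void); /* hypothetical: returns non-zero on failure */

static void *periodic_map_cleanup(void *arg)
{
        while (1) {
                if (clean_map()) {
                        fprintf(stderr,
                                "Failed cleaning up map, aborting program\n");
                        abort(); /* take down the whole program, not just this thread */
                }
                sleep(CLEANUP_INTERVAL);
        }
        return NULL;
}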
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Due to a kernel bug for XDP programs loaded via libxdp that use global
functions (see https://lore.kernel.org/bpf/8735gkwy8h.fsf@toke.dk/t/),
XDP mode only works on relatively recent kernels where the bug is
patched (or kernels where the patch has been backported). As many
users may not have such a recent kernel, they will only see a
confusing verifier error like the following:
Starting ePPing in standard mode tracking TCP on test123
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: elf: skipping unrecognized data section(7) xdp_metadata
libbpf: prog 'pping_xdp_ingress': BPF program load failed: Invalid argument
libbpf: prog 'pping_xdp_ingress': -- BEGIN PROG LOAD LOG --
Func#1 is safe for any args that match its prototype
Validating pping_xdp_ingress() func#0...
0: R1=ctx(id=0,off=0,imm=0) R10=fp0
; int pping_xdp_ingress(struct xdp_md *ctx)
0: (bf) r6 = r1 ; R1=ctx(id=0,off=0,imm=0) R6_w=ctx(id=0,off=0,imm=0)
1: (b7) r7 = 0 ; R7_w=invP0
; struct packet_info p_info = { 0 };
2: (7b) *(u64 *)(r10 -8) = r7 ; R7_w=invP0 R10=fp0 fp-8_w=00000000
3: (7b) *(u64 *)(r10 -16) = r7 ; R7_w=invP0 R10=fp0 fp-16_w=00000000
4: (7b) *(u64 *)(r10 -24) = r7 ; R7_w=invP0 R10=fp0 fp-24_w=00000000
5: (7b) *(u64 *)(r10 -32) = r7 ; R7_w=invP0 R10=fp0 fp-32_w=00000000
6: (7b) *(u64 *)(r10 -40) = r7 ; R7_w=invP0 R10=fp0 fp-40_w=00000000
7: (7b) *(u64 *)(r10 -48) = r7 ; R7_w=invP0 R10=fp0 fp-48_w=00000000
8: (7b) *(u64 *)(r10 -56) = r7 ; R7_w=invP0 R10=fp0 fp-56_w=00000000
9: (7b) *(u64 *)(r10 -64) = r7 ; R7_w=invP0 R10=fp0 fp-64_w=00000000
10: (7b) *(u64 *)(r10 -72) = r7 ; R7_w=invP0 R10=fp0 fp-72_w=00000000
11: (7b) *(u64 *)(r10 -80) = r7 ; R7_w=invP0 R10=fp0 fp-80_w=00000000
12: (7b) *(u64 *)(r10 -88) = r7 ; R7_w=invP0 R10=fp0 fp-88_w=00000000
13: (7b) *(u64 *)(r10 -96) = r7 ; R7_w=invP0 R10=fp0 fp-96_w=00000000
14: (7b) *(u64 *)(r10 -104) = r7 ; R7_w=invP0 R10=fp0 fp-104_w=00000000
15: (7b) *(u64 *)(r10 -112) = r7 ; R7_w=invP0 R10=fp0 fp-112_w=00000000
16: (7b) *(u64 *)(r10 -120) = r7 ; R7_w=invP0 R10=fp0 fp-120_w=00000000
17: (7b) *(u64 *)(r10 -128) = r7 ; R7_w=invP0 R10=fp0 fp-128_w=00000000
18: (bf) r2 = r10 ; R2_w=fp0 R10=fp0
;
19: (07) r2 += -128 ; R2=fp-128
; if (parse_packet_identifer_xdp(ctx, &p_info) < 0)
20: (85) call pc+13
R1 type=ctx expected=fp
Caller passes invalid args into func#1
processed 206542 insns (limit 1000000) max_states_per_insn 32 total_states 13238 peak_states 792 mark_read 40
-- END PROG LOAD LOG --
libbpf: failed to load program 'pping_xdp_ingress'
libbpf: failed to load object 'pping_kern.o'
Failed attaching ingress BPF program on interface test123: Invalid argument
Failed loading and attaching BPF programs in pping_kern.o
To help users that run into this issue when loading the program in
generic or unspecified mode, add a small hint suggesting to upgrade
the kernel or use the tc ingress mode instead in case attaching the
XDP program fails. If the program is loaded in native mode, instead
give the suggestion to try loading it in generic mode. While libbpf
and libxdp already add some messages hinting at this, this hint
clarifies how to do it with ePPing (using the --xdp-mode argument).
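A hedged sketch of the kind of hint being added (the exact messages
and the surrounding error handling in the ePPing loader may differ):

#include <stdio.h>
#include <string.h>
#include <xdp/libxdp.h>

/* Hypothetical error path after attaching the XDP program fails */
static void print_xdp_attach_hint(enum xdp_attach_mode mode, int err,
                                  const char *ifname)
{
        fprintf(stderr, "Failed attaching XDP program to %s: %s\n",
                ifname, strerror(-err));

        if (mode == XDP_MODE_NATIVE)
                fprintf(stderr,
                        "Try loading the XDP program in generic mode instead (--xdp-mode generic)\n");
        else
                fprintf(stderr,
                        "Try upgrading the kernel or using the tc ingress hook instead\n");
}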
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Add an option to let the user configure which mode to load the XDP
program in (unspecified, native or generic).
Set the default mode to native (previously it was unspecified) as
that is what the user most likely wants to use (generic mode, or
unspecified falling back on generic, will likely have worse
performance).
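A rough sketch of how the option could map onto libxdp's attach modes
(parse_xdp_mode() is an illustrative helper, not necessarily the
actual ePPing code):

#include <string.h>
#include <xdp/libxdp.h>

/* Map the --xdp-mode argument onto libxdp's enum xdp_attach_mode */
static enum xdp_attach_mode parse_xdp_mode(const char *arg)
{
        if (strcmp(arg, "native") == 0)
                return XDP_MODE_NATIVE; /* new default */
        if (strcmp(arg, "generic") == 0)
                return XDP_MODE_SKB;    /* generic/SKB mode */
        return XDP_MODE_UNSPEC;         /* let libxdp decide */
}

The chosen mode can then be passed as the mode argument to
xdp_program__attach().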
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Using the XDP ingress hook requires a newer kernel than tc mode (it
needs Toke's patch fixing the verification of global functions for
BPF_PROG_TYPE_EXT programs), and it will likely perform worse than tc
if running in generic mode (due to no driver support for
XDP). Furthermore, even when XDP works and has driver support, its
performance benefit over tc is likely small as the packets are always
passed on to the network stack regardless (no fast-path that bypasses
the network stack is created). Therefore, use the tc ingress hook as
default instead, and only use XDP if explicitly requested by the user
(-I/--ingress-hook xdp).
This partly addresses issue #49, as ePPing should no longer by default
get the confusing error message from failing verification if the
kernel lacks Toke's verifier patch.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Define the BPF program names in the user space component. The strings
corresponding to the BPF program names were previously spelled out in
several places, including in multiple string comparisons, which is
error prone and could lead to subtle errors if the program names are
changed and not updated correctly in all places. With the program
names defined once, they only have to be changed in a single place.
Currently only the names of the ingress programs occur in multiple
places, but also define the name for the egress program to be
consistent.
Note that even after this change one has to keep the defined values
in sync with the actual program names declared in the pping_kern.c
file. Ideally, these would all be defined in a single place, but I am
not aware of a convenient way to make that happen (the defined
strings cannot be used as function names as they are not identifiers,
and if they were defined as identifiers instead it would not be
possible to use them as strings).
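A small sketch of the idea (the XDP ingress program name matches the
log above; the tc program names here are illustrative):

#include <bpf/libbpf.h>

/* Defined once in the userspace component */
#define INGRESS_PROG_XDP "pping_xdp_ingress"
#define INGRESS_PROG_TC  "pping_tc_ingress" /* illustrative name */
#define EGRESS_PROG_TC   "pping_tc_egress"  /* illustrative name */

/* ...and reused wherever a program has to be looked up by name */
static struct bpf_program *get_xdp_ingress_prog(struct bpf_object *obj)
{
        return bpf_object__find_program_by_name(obj, INGRESS_PROG_XDP);
}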
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The userspace loader would only check if the tc clsact qdisc was
created when the egress program was loaded. Thus, if the ingress
program created the clsact, the egress program would not have to
create it, and ePPing would falsely believe it did not create a
clsact, failing to remove it on shutdown even if --force was
used. Fix this by checking if either the ingress or the egress
program created the clsact.
This bug was introduced as a sneaky side effect of commit
78b45bde56 (pping: Use libxdp to load and attach XDP program). Before
this commit the egress program (for which there is only a tc
alternative) would be loaded first, and thus it was sufficient to
check if it created the clsact. When switching to libxdp, however,
the ingress program (specifically the XDP program) had to be loaded
first, and thus the order of loading the ingress and egress programs
was swapped. Therefore, it was no longer sufficient to only check the
egress program, as the tc ingress program may have created the clsact
before the egress program is attached (and only checking the ingress
program would also not be enough, as the tc ingress program may never
be loaded if XDP mode is used instead).
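Conceptually, the check now amounts to something like this (variable
names are illustrative):

#include <stdbool.h>

/* Set by the tc ingress/egress attach paths when they had to create
 * the clsact qdisc themselves (illustrative names). */
static bool ingress_created_clsact;
static bool egress_created_clsact;

/* On shutdown, the clsact is only removed if either hook created it */
static bool created_clsact(void)
{
        return ingress_created_clsact || egress_created_clsact;
}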
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
traffic-pacing-edt: Fixed an operator precedence issue in codel_impl.h
This bug caused the CoDel control law to always stay at the initial
100 ms drop interval. After this fix the control law behaves
correctly, becoming more aggressive (smaller next drop intervals) in
accordance with the inverse square root law (that TCP traffic
responds to).
Set the ingress_ifindex to the ctx->ingress_ifindex rather than
ctx->rx_queue_index. This fixes a bug that was accidentally
introduced in commit add8885, and which broke the localfilt
functionality if the XDP hook was used on ingress (the FIB lookup
would fail).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing wait until the first shift of identifier (the "edge")
before starting to timestamp packets for new flows (TCP flows whose
start we do not see).
The reason this is necessary is that if ePPing starts monitoring a
flow in the middle of it (ePPing did not see the start of the flow),
then we cannot know if the first TSval we see is actually the first
instance of that TSval in the flow, so we have to wait until the next
TSval to ensure we get the first instance of a TSval (otherwise we
may underestimate the RTT by up to the TCP timestamp update
period). To avoid the first RTT sample potentially being
underestimated, this fix essentially ignores the first RTT sample.
However, it is not always necessary to wait until the first shift. For
TCP traffic where we see the initial handshake we know that we've seen
the start of the flow. Furthermore, for ICMP traffic it's generally
unlikely that there are duplicate identifiers to begin with, so ICMP
flows are also allowed to start timestamping right away.
It should be noted that after the previous commit (which changed
ePPing to ignore TCP SYN packets by default), ePPing will never see
the handshake and thus has to assume that it started to monitor all
flows in the middle. Therefore, ePPing will (by default) now miss
both the RTT during the handshake and the RTT for the first few
packets sent after the handshake (until the TSval is updated).
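A simplified sketch of the logic, assuming a per-flow "edge seen"
flag (the real flow-state layout and field names in ePPing differ):

#include <stdbool.h>
#include <linux/types.h>

/* Simplified per-flow state */
struct flow_state {
        __u32 last_tsval; /* first/last observed TSval for the flow */
        bool  edge_seen;  /* true once the TSval has shifted */
};

/* Returns true once it is safe to start timestamping this flow */
static bool ready_to_timestamp(struct flow_state *fs, __u32 tsval,
                               bool saw_flow_start)
{
        if (saw_flow_start) /* TCP handshake observed, or ICMP flow */
                return true;

        if (!fs->edge_seen && tsval != fs->last_tsval)
                fs->edge_seen = true; /* first identifier shift (the "edge") */

        return fs->edge_seen;
}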
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing ignore TCP SYN packets by default, so that the
initial handshake phase of the connection is ignored. Add an
option (--include-syn/-s) to explicitly include SYN packets.
The main reason it can be a good idea to ignore SYN packets is to
avoid being affected by SYN-flood attacks. When ePPing includes
SYN packets it becomes quite vulnerable to SYN-flood attacks, which
will quickly fill up its flow_state table, blocking actually useful
flows from being tracked. As ePPing will consider the connection
opened as soon as it sees the SYN-ACK (it will not wait for the final
ACK), flow state created from SYN-flood attacks will also stay around
in the flow-state table for a long time (currently 5 minutes) as no
RST/FIN will be sent that could be used to close it.
The drawback of ignoring SYN packets is that no RTTs will be
collected during the handshake phase, and all connections will be
considered opened due to "first observed packet".
A more refined approach could be to properly track the full TCP
handshake (SYN + SYN-ACK + ACK) instead of the more generic "open
once we see a reply in the reverse direction" used now. However, this
adds a fair bit of additional protocol-specific logic. Furthermore,
to track the full handshake we would still need to store some flow
state before the handshake is completed, and thus such a solution
would still be vulnerable to SYN-flood attacks (although the
incomplete flow states could potentially be cleaned up faster).
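As a minimal sketch of the default filter (the option flag name is
illustrative):

#include <stdbool.h>
#include <linux/tcp.h>

/* Skip packets with the SYN flag set unless --include-syn/-s was
 * given, so the handshake (and SYN floods) never create flow state. */
static bool skip_tcp_packet(const struct tcphdr *tcph, bool include_syn)
{
        return tcph->syn && !include_syn;
}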
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The get_next_interval_sqrt function has the line:
__u64 val = (__u64)CODEL_EXCEED_INTERVAL << 16 / get_sqrt_sh16(cnt);
However, the division operator has higher precedence than the shift
operator. Therefore, 16 / get_sqrt_sh16(cnt) will always evaluate to zero.
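The fix is to parenthesize the shift so that it is evaluated before
the division, i.e. something like:

__u64 val = ((__u64)CODEL_EXCEED_INTERVAL << 16) / get_sqrt_sh16(cnt);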
Signed-off-by: Frey Alfredsson <freysteinn@freysteinn.com>
Move the xdpsock sample application from the Linux repo to the
bpf-examples repo. This example demonstrates a number of capabilities
of AF_XDP sockets.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Move the xsk_fwd example application from the Linux repo to
bpf-examples. This sample demonstrates the ability to share a umem
between multiple sockets by implementing a simple packet forwarding
application. It also has a buffer pool manager for allocating and
freeing packet buffers.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
‘bpf_object__find_program_by_title’ is deprecated: libbpf v0.7+:
use bpf_object__find_program_by_name() instead
See: https://github.com/libbpf/libbpf/issues/297
libbpf#297 Deprecate bpf_program__title() in favor of
bpf_program__section_name(). “Title” term is confusing and
unconventional, it’s SEC() in code and “section name” everywhere else.
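For reference, the change amounts to looking programs up by their C
function name instead of their SEC() title, roughly (object and
program names are illustrative):

#include <bpf/libbpf.h>

static struct bpf_program *find_prog(struct bpf_object *obj)
{
        /* Old (deprecated since libbpf v0.7), lookup by SEC() title:
         *   return bpf_object__find_program_by_title(obj, "xdp");
         */

        /* New: lookup by the BPF program's C function name */
        return bpf_object__find_program_by_name(obj, "xdp_prog_main");
}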
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
This makes it possible to use make -j to run simultaneous make
processes. It does make the pretty output unordered, though.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Example programs seem to get out of sync (bit rot) more easily when
nobody sees the compile issues.
Thus, add more of them to the top-level Makefile.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
‘bpf_program__next’ is deprecated: libbpf v0.7+:
use bpf_object__next_program() instead
Also use bpf_xdp_attach() and bpf_xdp_detach() APIs.
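A short sketch of the replacements (ifindex and flags are
illustrative; error handling omitted):

#include <bpf/bpf.h>
#include <bpf/libbpf.h>
#include <linux/if_link.h>

static int attach_first_prog(struct bpf_object *obj, int ifindex)
{
        /* Replacement for the deprecated bpf_program__next() */
        struct bpf_program *prog = bpf_object__next_program(obj, NULL);

        if (!prog)
                return -1;

        /* New libbpf XDP attach API */
        return bpf_xdp_attach(ifindex, bpf_program__fd(prog),
                              XDP_FLAGS_UPDATE_IF_NOEXIST, NULL);
}

static int detach_prog(int ifindex)
{
        return bpf_xdp_detach(ifindex, XDP_FLAGS_UPDATE_IF_NOEXIST, NULL);
}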
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
The distro kernel UAPI headers evolve too slowly.
Thus, maintain a mirror in headers/linux/ in this project.
Libbpf has been overly eager to get features into its releases and
depends on kernel commit 6089fb325cf7 ("bpf: Add btf enum64 support"),
which has not yet appeared in an official kernel release.
Thus, this headers/linux/btf.h update comes from the bpf-next git tree.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
We have not found a way to get the BTF object ID via the BTF files
exposed in sysfs.
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
Skip BTF IDs that don't originate from the kernel, as this program is
looking for kernel module BTF.
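A hedged sketch of the kind of filtering involved, using libbpf's BTF
ID iteration APIs (error handling trimmed; the surrounding tool logic
is not shown):

#include <stdint.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <linux/bpf.h>

static void walk_kernel_btf(void)
{
        __u32 id = 0;

        while (!bpf_btf_get_next_id(id, &id)) {
                struct bpf_btf_info info = {};
                __u32 len = sizeof(info);
                char name[64] = {};
                int fd;

                info.name = (uintptr_t)name;
                info.name_len = sizeof(name);

                fd = bpf_btf_get_fd_by_id(id);
                if (fd < 0)
                        continue;

                /* kernel_btf is only set for BTF loaded by the kernel
                 * itself (vmlinux and kernel modules); skip the rest. */
                if (!bpf_obj_get_info_by_fd(fd, &info, &len) && info.kernel_btf) {
                        /* name now holds e.g. "vmlinux" or a module name */
                }

                close(fd);
        }
}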
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
The previous commit fixes the issue of reordered packets being able to
bypass the unique TSval check, so remove the corresponding section
from the issues in the TODO.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The mechanism to ensure that only the first instance of each TSval is
timestamped is a simple equality check. This check may fail if there
are reordered packets.
Consider a sequence of packets A, B, C and D, where A and B have
TSval=1 and C and D have TSval=2. If all packets arrive in
order (ABCD), then A and C will correctly be the only packets that are
timestamped (as B and D will have the same TSval as the previously
observed one). However, consider if B is reordered so that the
packets instead arrive as ACBD. In this scenario ePPing will attempt
to timestamp all of them (instead of only A and C), as each packet
now has a different (but not always higher) TSval than the last seen
packet. Note that it will only successfully create timestamps for
the later duplicated TSvals if the previous timestamp for the same
TSval has already been cleared out, so this is mainly an issue when
RTT < 1 ms.
Fix this by only allowing a packet to be timestamped if its TSval is
strictly higher (accounting for wrap-around) than the last seen
TSval, and likewise only update the last seen TSval if it is strictly
higher than the previous one.
To allow this comparison, also convert TSval and TSecr from network
byte order to host byte order when parsing the packet. While delaying
the conversion from network to host byte order until the comparison
between the packet's TSval and the last seen TSval could potentially
save the overhead of bpf_ntohl for some packets that do not need to
go through this check, most TCP packets will end up performing this
check, so the performance difference should be minimal. Therefore,
opt for the simpler approach of converting TSval and TSecr directly,
which also makes them easier to interpret if e.g. dumping the maps.
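A sketch of the wrap-around-safe comparison, with the TSval treated as
a 32-bit sequence number (the helper name and surrounding usage are
illustrative):

#include <stdbool.h>
#include <linux/types.h>

/* True if 'a' is strictly after 'b' in 32-bit sequence space,
 * i.e. strictly higher while accounting for wrap-around. */
static bool tsval_after(__u32 a, __u32 b)
{
        return (__s32)(a - b) > 0;
}

/* Only timestamp and update the last seen TSval on a strictly newer
 * value, e.g.:
 *
 *   if (tsval_after(pkt_tsval, fs->last_tsval)) {
 *           ...create timestamp entry...
 *           fs->last_tsval = pkt_tsval;
 *   }
 */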
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Manually opening a file under /sys/kernel/btf/ and trying to get
info via bpf_obj_get_info_by_fd() doesn't give us anything.
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>