Using the XDP ingress hook requires a newer kernel than tc mode (it
needs Toke's patch fixing the verification of global functions for
BPF_PROG_TYPE_EXT programs), and it will likely perform worse than tc
if running in generic mode (due to no driver support for
XDP). Furthermore, even when XDP works and has driver support, its
performance benefit over tc is likely small as the packets are always
passed on to the network stack regardless (not creating a fast-path
that bypasses the network stack). Therefore, use the tc ingress hook
as default instead, and only use XDP if explicitly required by the
user (-I/--ingress-hook xdp).
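A minimal sketch of the selection logic described above, assuming the
value of the -I option is available as a string (NULL when the option
is not given). The enum and function names are hypothetical and are not
taken from the ePPing source.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical names for illustration only. */
enum ingress_hook { HOOK_TC, HOOK_XDP };

/*
 * arg is the value passed to -I, or NULL if the option was not given.
 * Returns 0 on success and -1 for an unrecognised value.
 */
int select_ingress_hook(const char *arg, enum ingress_hook *hook)
{
	*hook = HOOK_TC; /* tc is the default ingress hook */

	if (!arg || strcmp(arg, "tc") == 0)
		return 0;

	if (strcmp(arg, "xdp") == 0) {
		*hook = HOOK_XDP; /* only when explicitly requested */
		return 0;
	}

	return -1;
}
```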
This partly addresses issue #49, as ePPing should no longer by default
get the confusing error message from failing verification if the
kernel lacks Toke's verifier patch.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Define the BPF program names in the user space component. The strings
corresponding to the BPF program names were previously repeated in
several places, including in multiple string comparisons, which is
error prone and could lead to subtle errors if the program names are
changed and not updated correctly in all places. With the program name
strings defined, they only have to be changed in a single place.
Currently only the names of the ingress programs occur in multiple
places, but also define the name for the egress program to be
consistent.
Note that even after this change one has to sync the defined values
with the actual program names declared in the pping_kern.c
file. Ideally, these would all be defined in a single place, but I am
not aware of a convenient way to make that happen (cannot use the
defined strings as function names as they are not identifiers, and if
defined as identifiers instead it would not be possible to use them as
strings).
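As an illustration of the idea, the program names could be defined once
and reused in every comparison. The macro names and string values below
are assumptions made for this sketch; the real values have to match the
function names in pping_kern.c.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Hypothetical macro names and values. They must be kept in sync by
 * hand with the program names declared in the BPF source file.
 */
#define INGRESS_PROG_XDP "pping_xdp_ingress"
#define INGRESS_PROG_TC  "pping_tc_ingress"
#define EGRESS_PROG_TC   "pping_tc_egress"

/* Every comparison uses the macro, so a rename is a one-line change. */
bool is_xdp_ingress_prog(const char *prog_name)
{
	return strcmp(prog_name, INGRESS_PROG_XDP) == 0;
}
```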
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The userspace loader would only check if the tc clsact was created
when the egress program was loaded. Thus, if the ingress program
created the clsact, the egress program would not have to create it,
and ePPing would falsely believe it did not create a clsact and fail
to remove it on shutdown even if --force was used. Fix this by
checking if either ingress or egress created the clsact.
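The fix can be sketched as follows, assuming that removal on shutdown
requires both --force and ePPing having created the clsact itself, as
described above. The struct and field names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping filled in while attaching the tc programs. */
struct tc_attach_state {
	bool ingress_created_clsact;
	bool egress_created_clsact;
};

/*
 * The clsact is ours to remove if either attach step created it.
 * Checking only the egress flag misses the case where the tc ingress
 * program was attached first and created the clsact.
 */
bool should_remove_clsact(const struct tc_attach_state *s, bool force)
{
	return force &&
	       (s->ingress_created_clsact || s->egress_created_clsact);
}
```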
This bug was introduced as a sneaky side effect of commit
78b45bde56 (pping: Use libxdp to load
and attach XDP program). Before this commit the egress program (for
which there is only a tc alternative) would be loaded first, and thus
it was sufficient to check if it created the clsact. When switching to
libxdp however, the ingress program (specifically the XDP program) had
to be loaded first, and thus the order of loading ingress and egress
programs was swapped. Therefore, it was no longer sufficient to only
check the egress program as the tc ingress program may have created
the clsact before the egress program is attached (and only
checking the ingress program would also not be enough as the tc
ingress program may never be loaded if XDP mode is used instead).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Set the ingress_ifindex to the ctx->ingress_ifindex rather than
ctx->rx_queue_index. This fixes a bug that was accidentally introduced
in commit #add8885, and which broke the localfilt functionality if the
XDP hook was used on ingress (the FIB lookup would fail).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing wait until the first shift of identifier (the "edge")
before starting to timestamp packets for new flows (for TCP flows we
do not see the start of).
The reason this is necessary is that if ePPing starts monitoring a flow
in the middle of it (ePPing did not see the start of the flow), then
we cannot know if the first TSval we see is actually the first
instance of the TSval in that flow, so we have to wait until the next
TSval to ensure we get the first instance of a TSval (otherwise we may
underestimate the RTT by up to the TCP timestamp update period). To
avoid the first RTT sample potentially being underestimated this fix
essentially ignores the first RTT sample instead.
However, it is not always necessary to wait until the first shift. For
TCP traffic where we see the initial handshake we know that we've seen
the start of the flow. Furthermore, for ICMP traffic it's generally
unlikely that there are duplicate identifiers to begin with, so also
allow that to start timestamping right away.
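A rough user-space sketch of the "wait for the first edge" rule. The
struct, field and function names are hypothetical, and a plain
inequality is used to detect a new identifier to keep the example
short; it is not the actual BPF code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-flow state for this sketch. */
struct flow_sketch {
	uint32_t last_id;      /* last identifier (e.g. TSval) seen */
	bool has_last_id;      /* false until the first packet is seen */
	bool wait_first_edge;  /* true if we joined the flow mid-way */
};

/* saw_flow_start: TCP handshake observed, or ICMP traffic. */
void flow_init(struct flow_sketch *f, bool saw_flow_start)
{
	f->last_id = 0;
	f->has_last_id = false;
	f->wait_first_edge = !saw_flow_start;
}

/* Returns true if the packet carrying identifier id may be timestamped. */
bool should_timestamp(struct flow_sketch *f, uint32_t id)
{
	bool first_pkt = !f->has_last_id;
	bool new_id = first_pkt || id != f->last_id;

	f->last_id = id;
	f->has_last_id = true;

	if (!new_id)
		return false;

	if (f->wait_first_edge) {
		/* The first identifier may be a repeat of one sent
		 * before we started watching, so skip it. */
		if (first_pkt)
			return false;
		f->wait_first_edge = false; /* first edge reached */
	}
	return true;
}
```

With a flow joined mid-way and identifiers 5, 5, 6, only the packet
carrying 6 is timestamped; with a flow whose start was seen, the very
first packet is timestamped.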
It should be noted that after the previous commit (which changed
ePPing to ignore TCP SYN-packets by default), ePPing will never see
the handshake and thus has to assume that it started to monitor all
flows in the middle. Therefore, ePPing will (by default) now miss both
the RTT during the handshake, as well as RTT for the first few packets
sent after the handshake (until the TSval is updated).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Make ePPing ignore TCP SYN packets by default, so that the
initial handshake phase of the connection is ignored. Add an
option (--include-syn/-s) to explicitly include SYN packets.
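The filtering decision can be illustrated with a small helper. This is
a sketch with a hypothetical function name; it treats any segment with
the SYN bit set (SYN and SYN-ACK alike) as part of the handshake.

```c
#include <stdbool.h>
#include <stdint.h>

#define TCP_FLAG_SYN 0x02 /* SYN bit in the TCP flags byte */

/*
 * include_syn mirrors the --include-syn/-s option. Returns true if the
 * packet should be ignored.
 */
bool skip_tcp_packet(uint8_t tcp_flags, bool include_syn)
{
	return !include_syn && (tcp_flags & TCP_FLAG_SYN) != 0;
}
```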
The main reason it can be a good idea to avoid SYN-packets is to avoid
being affected by SYN-flood attacks. When ePPing also includes
SYN-packets it becomes quite vulnerable to SYN-flood attacks, which
will quickly fill up its flow_state table, blocking actual useful
flows from being tracked. As ePPing will consider the connection
opened as soon as it sees the SYN-ACK (it will not wait for final
ACK), flow-state created from SYN-flood attacks will also stay around
in the flow-state table for a long time (5 minutes currently) as no
RST/FIN will be sent that can be used to close it.
The drawback from ignoring SYN-packets is that no RTTs will be
collected during the handshake phase, and all connections will be
considered opened due to "first observed packet".
A more refined approach could be to properly track the full TCP
handshake (SYN + SYN-ACK + ACK) instead of the more generic "open once
we see reply in reverse direction" used now. However, this adds a fair
bit of additional protocol-specific logic. Furthermore, to track the
full handshake we will still need to store some flow-state before the
handshake is completed, and thus such a solution would still be
vulnerable to SYN-flood attacks (although the incomplete flow states
could potentially be cleaned up faster).
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Move the xdpsock sample application from the Linux repo to the
bpf-examples repo. This example demonstrates a number of capabilities
of AF_XDP sockets.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Move the xsk_fwd example application from the Linux repo to
bpf-examples. This sample demonstrates the ability to share a umem
between multiple sockets by implementing a simple packet forwarding
application. It also has a buffer pool manager for allocating and
freeing packet buffers.
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
‘bpf_object__find_program_by_title’ is deprecated: libbpf v0.7+:
use bpf_object__find_program_by_name() instead
See: https://github.com/libbpf/libbpf/issues/297
libbpf#297 Deprecate bpf_program__title() in favor of
bpf_program__section_name(). “Title” term is confusing and
unconventional, it’s SEC() in code and “section name” everywhere else.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
This makes it possible to use make -j to run multiple make
processes simultaneously. This does make the pretty output unordered.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Example programs seem to get out-of-sync (bit rot) more
easily when nobody sees the compile issues.
Thus, add more to the top-level Makefile.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
‘bpf_program__next’ is deprecated: libbpf v0.7+:
use bpf_object__next_program() instead
Also use bpf_xdp_attach() and bpf_xdp_detach() APIs.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
The distro kernel UAPI headers evolve too slowly.
Thus, maintain a mirror in headers/linux/ in this project.
Libbpf has been overly eager to get features into its releases
and depends on kernel commit 6089fb325cf7 ("bpf: Add btf enum64 support"),
which has not been released in an official kernel release yet.
Thus, this headers/linux/btf.h update comes from bpf-next git.
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
As we have not found a way to get the BTF object ID via the
sysfs filesystem BTF files.
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
Skip BTF IDs that don't originate from the kernel, as this
program is looking for kernel module BTF.
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
The previous commit fixes the issue of reordered packets being able to
bypass the unique TSval check, so remove the corresponding section
from the issues in the TODO.
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
The mechanism to ensure that only the first instance of each TSval is
timestamped is a simple equals check. This check may fail if there
are reordered packets.
Consider a sequence of packets A, B, C and D, where A and B have
TSval=1 and C and D have TSval=2. If all packets arrive in
order (ABCD), then A and C will correctly be the only packets that are
timestamped (as B and D will have the same TSval as the previously
observed one). However, consider if B is reordered so instead the
packets arrive as ACBD. In this scenario ePPing will attempt to
timestamp all packets (instead of only A and C), as each packet now has
a different (but not always higher) TSval than the last seen
packet. Note that it will only successfully create the timestamps for
the later duplicated TSvals if the previous timestamp for the same
TSval has already been cleared out, so this is mainly an issue when
RTT < 1ms.
Fix this by only allowing a packet to be timestamped if its TSval is
strictly higher (accounting for wrap-around) than the last seen TSval,
and likewise only update last seen TSval if it is strictly higher than
the previous one.
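The strictly-higher check with wrap-around can be sketched in plain C
as below. The names are hypothetical and the state is reduced to what
the example needs; TSvals are assumed to already be in host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if a is strictly after b in 32-bit serial-number order, i.e.
 * the unsigned difference a - b is non-zero and less than 2^31.
 */
bool tsval_after(uint32_t a, uint32_t b)
{
	uint32_t diff = a - b;

	return diff != 0 && diff < UINT32_C(0x80000000);
}

/* Hypothetical per-flow state. */
struct tsval_state {
	uint32_t last_tsval;
	bool valid; /* false until the first TSval is recorded */
};

/* Returns true if the packet may be timestamped, updating the state. */
bool tsval_should_timestamp(struct tsval_state *s, uint32_t tsval)
{
	if (s->valid && !tsval_after(tsval, s->last_tsval))
		return false; /* duplicate or older (reordered) TSval */

	s->last_tsval = tsval;
	s->valid = true;
	return true;
}
```

For the ACBD arrival order from the example (TSvals 1, 2, 1, 2), only A
and C pass the check; the reordered B and the duplicate D are rejected.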
To allow this calculation, also convert TSval and TSecr from network
byte order to host byte order when parsing the packet. While delaying
the transform from network to host byte order until the comparison
between the packet's TSval and last seen TSval could potentially save
the overhead of bpf_ntohl for some packets that do not need to go
through this check, most TCP packets will end up performing this
check, so the performance difference should be minimal. Therefore, opt
for the simpler approach of converting TSval and TSecr directly, which
also makes them easier to interpret when e.g. dumping the maps.
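A user-space sketch of converting both fields at parse time. It assumes
opt_data points at the 8 value bytes of a TCP timestamp option (TSval
followed by TSecr), and uses the POSIX ntohl() as a stand-in for the
byte-order helper used in the BPF program. The function name is
hypothetical.

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/*
 * Read TSval and TSecr from the value part of a TCP timestamp option
 * and return them in host byte order. memcpy avoids unaligned loads.
 */
void read_tcp_timestamps(const uint8_t *opt_data,
			 uint32_t *tsval, uint32_t *tsecr)
{
	uint32_t raw;

	memcpy(&raw, opt_data, sizeof(raw));
	*tsval = ntohl(raw);

	memcpy(&raw, opt_data + 4, sizeof(raw));
	*tsecr = ntohl(raw);
}
```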
Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Manually opening the /sys/kernel/btf/ file and trying to get
info via bpf_obj_get_info_by_fd() doesn't give us anything.
Signed-off-by: Jesper Dangaard Brouer <netoptimizer@brouer.com>
This contains a fix to the xdp-tools configure script so it works with the
Dash shell used on Debian and derivatives.
Fixes #50.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The trick with printing debug output as a u64 got it in the wrong byte
order; fix that by swapping everything appropriately before printing. Also
add some more information to the drop debug print.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
There were a couple of issues with the IGMP and multicast handling: the
packet parsing checked the MAC address for whether it was a multicast
address before it looked at the IP header, which meant it never got to the
IGMP packets (because they are also sent as multicast). Also, we need to
redirect IGMP packets to the bond master on egress to make sure
subscriptions work as they're supposed to.
Fix the parsing, add the redirect, and also remove the explicit check for
IGMP packets on ingress, as that will already be matched by the multicast
check.
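The parsing-order problem can be illustrated with a simplified
classifier. This is only a sketch with hypothetical names, not the
actual pkt-loop-filter code: the point is that the IP protocol has to
be examined before the multicast bit of the destination MAC, otherwise
IGMP packets are never identified.

```c
#include <stdbool.h>
#include <stdint.h>

#define IP_PROTO_IGMP 2 /* IANA protocol number for IGMP */

/* Hypothetical classification result. */
enum pkt_class { PKT_UNICAST, PKT_MULTICAST, PKT_IGMP };

enum pkt_class classify_pkt(const uint8_t dst_mac[6], bool is_ipv4,
			    uint8_t ip_proto)
{
	/* Look at the IP header first: IGMP is itself sent to a
	 * multicast MAC, so the MAC check below would hide it. */
	if (is_ipv4 && ip_proto == IP_PROTO_IGMP)
		return PKT_IGMP;

	/* Lowest bit of the first octet is the group (multicast) bit. */
	if (dst_mac[0] & 0x01)
		return PKT_MULTICAST;

	return PKT_UNICAST;
}
```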
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
This reverts commit d3aaec4bdd ("pkt-loop-filter: Check ifindex against
state before dropping packets") - we should not accept packets that are
looped back to the same port either.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The exception for gratuitous ARPs is only supposed to be for entries that
would otherwise be dropped due to the loop filtering logic. In addition, we
should record egress gratuitous ARPs and make sure they don't trigger the
exception when looping back (this is 'rule 4' of the openvswitch SLB
bonding logic).
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
We were indiscriminately dropping packets when the map lookup succeeded,
let's actually check the ifindex first.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
We shouldn't be filtering incoming gratuitous ARPs based on the ifindex
learning. So parse ARP packets and allow them through if they have
identical source and destination IPs.
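The gratuitous ARP test described above can be sketched like this,
assuming ARP for IPv4 over Ethernet. The struct covers only the address
fields that follow the fixed 8-byte ARP header; the names are
hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Address part of an ARP packet for IPv4 over Ethernet. */
struct arp_ipv4_body {
	uint8_t sha[6]; /* sender hardware address */
	uint8_t spa[4]; /* sender protocol (IPv4) address */
	uint8_t tha[6]; /* target hardware address */
	uint8_t tpa[4]; /* target protocol (IPv4) address */
};

/* A gratuitous ARP has identical source and destination IPs. */
bool is_gratuitous_arp(const struct arp_ipv4_body *arp)
{
	return memcmp(arp->spa, arp->tpa, sizeof(arp->spa)) == 0;
}
```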
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
When pinning of the bpf_link fails, we keep running to keep the PID alive.
However, staying in the foreground causes problems with scripts that
expect the setup to finish running; so fork into the background instead
and write a PID file so we can kill the running instance on unload.
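One way to do this is sketched below. The function names and the choice
to let the parent write the PID file are assumptions for illustration;
the actual loader may differ.

```c
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Write pid to path as a decimal number. Returns 0 on success. */
int write_pid_file(const char *path, pid_t pid)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	if (fprintf(f, "%d\n", (int)pid) < 0) {
		fclose(f);
		return -1;
	}
	return fclose(f) == 0 ? 0 : -1;
}

/*
 * Fork into the background. Returns 0 in the child (which keeps
 * running to hold the probe alive), the child's PID in the parent
 * (which can then exit so calling scripts continue), or -1 on error.
 */
pid_t background_with_pid_file(const char *path)
{
	pid_t pid = fork();

	if (pid < 0)
		return -1;

	if (pid > 0) {
		/* Parent: record the child's PID for a later unload. */
		if (write_pid_file(path, pid) < 0) {
			kill(pid, SIGTERM); /* do not leave it untracked */
			return -1;
		}
		return pid;
	}

	setsid(); /* child: detach from the controlling terminal */
	return 0;
}
```

On unload, the PID can be read back from the file and the process
stopped with kill().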
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
When running in the fallback mode where we keep running in the foreground
to keep the kprobe alive, we should unload the cls_bpf programs after being
interrupted instead of just exiting.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Support for bpf_link-based attaching of kprobes was added to kernel 5.15
with commit: b89fbfbb854c ("bpf: Implement minimal BPF perf link"). Prior
to this, it is not possible to pin kprobe attachments in bpffs, which
causes the pkt-loop-filter to fail. Add a fallback where we just keep
running in the foreground to keep the probe alive if bpf_link pinning
fails.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
The type of the net->net_cookie field member was changed in kernel 5.12
with commit 3d368ab87cf6 ("net: initialize net->net_cookie at netns setup").
Older versions of the kernel define net->net_cookie as an atomic64_t
instead of a u64. This causes CO-RE reading of the field to fail due to the
type mismatch. Handle this by adding CO-RE checks for the old type as well
and using the CO-RE facility to check for the right type at load time.
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>