#+Title: How to transfer info from XDP-prog to AF_XDP

This BPF-example shows how to use BTF to create a communication channel
between an XDP BPF-prog (running kernel-side) and an AF_XDP user-space
process, via XDP-hints in the metadata area.

* XDP-hints via local BTF info

XDP-hints have been discussed as a facility where NIC drivers provide
information in the metadata area (located just before the packet
header starts). There has been little progress on kernel drivers
adding XDP-hints, as end-users are unsure how to decode and consume
the information (a chicken-and-egg problem).

In this example we let the BPF-object file define and contain the
BTF-info about the XDP-hints data-structures. Thus, no kernel or
driver changes are needed, as the BTF type-definitions are *locally
defined*. The XDP-hints are then used as a communication channel
between the XDP BPF-prog and the AF_XDP userspace program.

The API for decoding the BTF data-structures has been added to
separate files (in [[file:lib_xsk_extend.c]] and [[file:lib_xsk_extend.h]]) to
make this reusable for other projects, with the goal of getting this
included in libbpf or libxdp. The API takes a =struct btf= pointer as
input argument when searching for a struct-name. This BTF pointer is
obtained from opening the BPF-object ELF file, but it could also come
from the kernel (e.g. via =btf__load_vmlinux_btf()=) or even from a
kernel module (e.g. via =btf__load_module_btf()=). See the
[[https://github.com/xdp-project/bpf-examples/blob/master/BTF-playground/btf_module_read.c][btf_module_read]] example for how to do this.
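
As a minimal sketch (using plain libbpf BTF calls, not the
lib_xsk_extend API itself; the function name is illustrative),
obtaining a =struct btf= pointer from the BPF-object ELF file and
looking up an XDP-hints struct by name could look like this:

#+begin_src C
/* Sketch: load BTF from a BPF-object ELF file and look up the local
 * BTF ID of an XDP-hints struct by name (plain libbpf calls). */
#include <stdio.h>
#include <bpf/btf.h>

int find_xdp_hints_btf_id(const char *bpf_obj_path)
{
	struct btf *btf;
	__s32 btf_id;

	btf = btf__parse(bpf_obj_path, NULL); /* BTF from the local ELF file */
	if (!btf)
		return -1;

	/* The same lookup could run against btf__load_vmlinux_btf() */
	btf_id = btf__find_by_name_kind(btf, "xdp_hints_rx_time",
					BTF_KIND_STRUCT);
	if (btf_id < 0)
		fprintf(stderr, "No such struct in BTF\n");

	btf__free(btf);
	return btf_id;
}
#+end_src
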
The requirement for being a valid XDP-hints data-struct is that the
last member in the struct is named =btf_id= and has a size of 4 bytes
(32-bit). See the C code example below. This =btf_id= member is used
for identifying which struct has been put into the metadata area. The
kernel-side BPF-prog obtains the ID via the API
=bpf_core_type_id_local()= and stores it in =btf_id=. The userspace
API reads the =btf_id= by reading 4 bytes just before the packet
header start (offset -4), and can check the ID against the IDs that
were available via the =struct btf= pointer.

#+begin_src C
struct xdp_hints_rx_time {
	__u64 rx_ktime;
	__u32 btf_id;
} __attribute__((aligned(4))) __attribute__((packed));
#+end_src
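
For illustration, a sketch of the userspace side reading the =btf_id=
at offset -4 from the packet header start (the lib_xsk_extend API
wraps this pattern in helper functions; the function name here is
illustrative):

#+begin_src C
/* Sketch: userspace reads the btf_id stored 4 bytes before the packet
 * header start, i.e. the last member of the XDP-hints struct that the
 * BPF-prog placed in the metadata area. */
#include <linux/types.h>

static __u32 meta_read_btf_id(const void *pkt_data)
{
	const __u8 *pkt = pkt_data;

	return *(const __u32 *)(pkt - sizeof(__u32));
}
#+end_src
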

The =btf_id= is placed as the last member because the metadata area,
located just before the packet header starts, can only grow
"backwards" (via the BPF-helper =bpf_xdp_adjust_meta()=). To store a
larger struct, the metadata is grown by a larger negative offset. The
BTF type-information knows the size (and offsets) of all
data-structures. Thus, knowing the =btf_id=, we can deduce the size of
the metadata area; placing the =btf_id= as the last member keeps it at
a known location.

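For illustration, a sketch of the kernel-side pattern (simplified; the
real file:af_xdp_kern.c additionally redirects packets to the AF_XDP
socket via an XSKMAP, where this sketch just passes the packet):

#+begin_src C
/* Sketch: grow the metadata area backwards and stamp the struct with
 * its local BTF ID, so userspace can identify it. */
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_core_read.h>

struct xdp_hints_rx_time {
	__u64 rx_ktime;
	__u32 btf_id;
} __attribute__((aligned(4))) __attribute__((packed));

SEC("xdp")
int xdp_store_hints(struct xdp_md *ctx)
{
	struct xdp_hints_rx_time *meta;
	void *data;

	/* Grow metadata area via a negative offset (grows "backwards") */
	if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
		return XDP_ABORTED;

	data = (void *)(long)ctx->data;
	meta = (void *)(long)ctx->data_meta;
	if ((void *)(meta + 1) > data) /* bounds check for the verifier */
		return XDP_ABORTED;

	meta->rx_ktime = bpf_ktime_get_ns();
	/* BTF ID local to this BPF-object; matches the ID userspace
	 * finds in the ELF file's BTF */
	meta->btf_id = bpf_core_type_id_local(struct xdp_hints_rx_time);

	return XDP_PASS;
}

char _license[] SEC("license") = "GPL";
#+end_src
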
* Why is XDP RX-timestamp essential for AF_XDP

In this example, the kernel-side XDP BPF-prog (file:af_xdp_kern.c)
takes a timestamp (=bpf_ktime_get_ns()=) and stores it in the metadata
as an XDP-hint. This makes it possible to measure the time-delay from
XDP softirq execution to when AF_XDP gets the packet out of its
RX-ring. (See the Interesting data-points section below.)

The real value for the application use-case (in question) is that it
doesn't need to waste CPU time spinning to get accurate timestamps for
packet arrival. The application only needs timestamps on the
synchronization traffic ([[https://en.wikipedia.org/wiki/TTEthernet][PCF frames]]).

The other Time-triggered traffic arrives at a deterministic time
(according to an established time schedule based on PCF). The
application prefers to bulk receive the Time-triggered traffic, which
can be achieved by waking up at the right time (according to the
time-schedule). Thus, it would be wasteful to busy-poll with the only
purpose of getting better timing accuracy for the PCF frames.

** Interesting data-points

The time-delay from XDP softirq execution to when AF_XDP gets the
packet gives us some interesting data-points, and tells us about
system latency and sleep behavior.

The data-points are interesting, as there is (obviously) a big
difference between waiting for a wakeup (via =poll= or =select=) and
using spin-mode, plus the effects of userspace running on the same or
a different CPU core, CPU sleep-state modes, and RT-patched kernels.

| Driver/HW | Test   | core   | time-delay avg | min      | max         | System |
|-----------+--------+--------+----------------+----------+-------------+--------|
| igc/i225  | spin   | same   | 1575 ns        | 849 ns   | 2123 ns     | A      |
| igc/i225  | spin   | remote | 2639 ns        | 2337 ns  | 4019 ns     | A      |
| igc/i225  | wakeup | same   | 22881 ns       | 21190 ns | 30619 ns    | A      |
| igc/i225  | wakeup | remote | 50353 ns       | 47420 ns | 56156 ns    | A      |
|-----------+--------+--------+----------------+----------+-------------+--------|
| conf upd  |        |        |                |          | no C-states | *B*    |
|-----------+--------+--------+----------------+----------+-------------+--------|
| igc/i225  | spin   | same   | 1402 ns        | 805 ns   | 2867 ns     | B      |
| igc/i225  | spin   | remote | 1056 ns        | 419 ns   | 2798 ns     | B      |
| igc/i225  | wakeup | same   | 3177 ns        | 2210 ns  | 9136 ns     | B      |
| igc/i225  | wakeup | remote | 4095 ns        | 3029 ns  | 10595 ns    | B      |
|-----------+--------+--------+----------------+----------+-------------+--------|

The latency is affected a lot by the CPU's power-saving states, which
can be limited globally by changing =/dev/cpu_dma_latency= (see
section below). The main difference between system *A* and *B* is
that =cpu_dma_latency= has been changed to such a low value that the
CPU doesn't use C-states. (Side-note: the tool =tuned-adm profile
latency-performance= was used, thus other tunings might also have
happened.)

System *RT1* has a Real-Time patched kernel, and =cpu_dma_latency= has
no effect (likely due to kernel config).

| Driver/HW | Test   | core   | time-delay avg | min     | max     | System |
|-----------+--------+--------+----------------+---------+---------+--------|
| igb/i210  | spin   | same   | 2577 ns        | 2129 ns | 4155 ns | RT1    |
| igb/i210  | spin   | remote | 788 ns         | 551 ns  | 1473 ns | RT1    |
| igb/i210  | wakeup | same   | 6209 ns        | 5644 ns | 8178 ns | RT1    |
| igb/i210  | wakeup | remote | 5239 ns        | 4463 ns | 7390 ns | RT1    |

Systems table:

| Name | CPU @ GHz            | Kernel          | Kernel options | cpu_dma_latency           |
|------+----------------------+-----------------+----------------+---------------------------|
| A    | E5-1650 v4 @ 3.60GHz | 5.15.0-net-next | PREEMPT        | default (2000000000 usec) |
| B    | E5-1650 v4 @ 3.60GHz | 5.15.0-net-next | PREEMPT        | 2 usec                    |
| RT1  | i5-6500TE @ 2.30GHz  | 5.13.0-rt1+     | PREEMPT_RT     | 2 ms, but no effect       |

** C-states wakeup time

It is possible to view the system's time (in usec) to wake up from a
certain C-state, via the below =grep= command:

#+BEGIN_SRC sh
# grep -H . /sys/devices/system/cpu/cpu0/cpuidle/*/latency
/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency:40
/sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
#+END_SRC

** Meaning of cpu_dma_latency

The global CPU latency limit is controlled via the file
=/dev/cpu_dma_latency=, which contains a binary value (interpreted as
a signed 32-bit integer). Reading the contents can be annoying from
the command line, so let's provide a practical example:

Reading =/dev/cpu_dma_latency=:
#+begin_src sh
$ sudo hexdump --format '"%d\n"' /dev/cpu_dma_latency
2000000000
#+end_src
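
A process can also set this limit itself: writing a signed 32-bit
value to =/dev/cpu_dma_latency= holds the limit for as long as the
file descriptor stays open. A minimal sketch (the value 2 matches
system *B* above):

#+begin_src C
/* Sketch: request a low CPU latency limit via /dev/cpu_dma_latency.
 * The kernel honors the limit only while the fd stays open. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	int32_t latency = 2; /* signed 32-bit value, e.g. 2 as on system B */
	int fd = open("/dev/cpu_dma_latency", O_WRONLY);

	if (fd < 0) {
		perror("open /dev/cpu_dma_latency");
		return 1;
	}
	if (write(fd, &latency, sizeof(latency)) != sizeof(latency)) {
		perror("write");
		close(fd);
		return 1;
	}
	pause(); /* keep fd open for as long as the limit should apply */
	close(fd);
	return 0;
}
#+end_src
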
* AF_XDP documentation

When developing your AF_XDP application, we recommend familiarising
yourself with the core AF_XDP concepts by reading the kernel
[[https://www.kernel.org/doc/html/latest/networking/af_xdp.html][documentation for AF_XDP]]. XDP-tools also contains documentation in
[[https://github.com/xdp-project/xdp-tools/blob/master/lib/libxdp/README.org#using-af_xdp-sockets][libxdp for AF_XDP]], explaining how to use the API and the difference
between the control-path and data-path APIs.

It is particularly important to understand the *four different
ring-queues*, which are all Single-Producer Single-Consumer (SPSC)
ring-queues. A set of these four queues is needed *for each queue* on
the network device (netdev), as the sketch below shows.
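
As a sketch (assuming the =xsk_umem__create()= and
=xsk_socket__create()= API from libbpf/libxdp's =xsk.h=; struct and
function names here are illustrative): the FILL and COMPLETION rings
belong to the UMEM, while the RX and TX rings belong to each socket:

#+begin_src C
/* Sketch: the four SPSC ring-queues for one netdev queue. */
#include <bpf/xsk.h>

struct my_xsk_rings {
	struct xsk_ring_prod fill; /* userspace -> kernel: free frames  */
	struct xsk_ring_cons comp; /* kernel -> userspace: sent frames  */
	struct xsk_ring_cons rx;   /* kernel -> userspace: recv packets */
	struct xsk_ring_prod tx;   /* userspace -> kernel: xmit packets */
};

static int setup_xsk(struct my_xsk_rings *q, struct xsk_umem **umem,
		     void *umem_area, __u64 size,
		     struct xsk_socket **xsk,
		     const char *ifname, __u32 queue_id)
{
	int err;

	/* FILL + COMPLETION rings are created together with the UMEM */
	err = xsk_umem__create(umem, umem_area, size,
			       &q->fill, &q->comp, NULL);
	if (err)
		return err;

	/* RX + TX rings are created with the socket, bound to one queue */
	return xsk_socket__create(xsk, ifname, queue_id, *umem,
				  &q->rx, &q->tx, NULL);
}
#+end_src
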
* Example: bind to all queues

Usually AF_XDP examples make a point out of forcing the end-user to
select a specific queue or channel ID, to show that AF_XDP sockets
operate on a single queue ID.

In this example, the default behavior is to set up AF_XDP sockets for
*ALL* configured queues/channels available, and "listen" for packets
on all of the queues. This way we can avoid setting up hardware
filters or reducing the number of channels to 1 (a popular
workaround).

This also means memory consumption increases as the NIC has more
queues available. For AF_XDP all the "UMEM" memory is preallocated by
userspace and registered with the kernel. AF_XDP trades memory for
speed. Each frame is a full memory-page of 4K (4096 bytes). For each
channel/queue ID the program allocates 4096 frames, which takes up
16MB of memory per channel (4096 frames x 4096 bytes = 16 MiB).
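
The per-channel sizing arithmetic as a small sketch (constant and
function names are hypothetical, not from the example program):

#+begin_src C
/* Sketch: per-queue UMEM sizing matching the arithmetic above. */
#include <stddef.h>
#include <sys/mman.h>

#define FRAME_SIZE 4096 /* one full 4K memory-page per frame */
#define NUM_FRAMES 4096 /* frames allocated per channel/queue ID */

/* 4096 frames * 4096 bytes = 16 MiB of UMEM for one channel */
static void *alloc_umem_for_queue(void)
{
	size_t size = (size_t)NUM_FRAMES * FRAME_SIZE;

	/* Returns MAP_FAILED (not NULL) on error */
	return mmap(NULL, size, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
}
#+end_src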