Merge pull request #16 from simosund/pping_Better_output
Pping better output

pping/Makefile
@@ -3,7 +3,7 @@
USER_TARGETS := pping
BPF_TARGETS := pping_kern

LDFLAGS += -pthread
LDLIBS += -pthread
EXTRA_DEPS += pping.h

LIB_DIR = ../lib

pping/README.md
@@ -15,7 +15,7 @@ echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
The fundamental logic of pping is to timestamp a pseudo-unique identifier for
outgoing packets, and then look for matches in the incoming packets. If a match
is found, the RTT is simply calculated as the time difference between the
current time and the timestamp.
current time and the stored timestamp.
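
To make that match-and-subtract step concrete, here is a rough BPF-style sketch of the core idea (an illustration only, not the actual programs; the real `pping_egress()`/`pping_ingress()` in this commit also handle sampling, flow state and event reporting, and the `packet_ts` map and `struct packet_id` are defined in the pping.h/pping_kern.c changes further down):

```c
/* Sketch of the core timestamp/match logic only. */
static __always_inline void handle_packet(struct packet_id *p_id, bool outgoing)
{
    __u64 now = bpf_ktime_get_ns();
    __u64 *stored;

    if (outgoing) {
        /* Timestamp the pseudo-unique identifier of the outgoing packet */
        bpf_map_update_elem(&packet_ts, p_id, &now, BPF_NOEXIST);
    } else if ((stored = bpf_map_lookup_elem(&packet_ts, p_id))) {
        /* Match on the incoming side: RTT = now - stored timestamp */
        __u64 rtt = now - *stored;

        bpf_map_delete_elem(&packet_ts, p_id);
        /* ... report rtt to user space ... */
    }
}
```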

This tool, just as Kathie's original pping implementation, uses TCP timestamps
as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
@@ -28,6 +28,89 @@ the identifier. The mechanism to ensure only the first packet is timestamped and
matched differs from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).

## Output formats
pping currently supports 3 different formats: *standard*, *ppviz* and *json*. In
general, the output consists of two different types of events: flow-events, which
give information that a flow has started/ended, and RTT-events, which provide
information on a computed RTT within a flow.

### Standard format
The standard format is quite similar to Kathie's pping default output, and
is generally intended to be an easily understood, human-readable format, writing a
single line per event.

An example of the format is provided below:
```shell
16:00:46.142279766 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src
16:00:46.147705205 5.425439 ms 5.425439 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:47.148905125 5.261430 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:48.151666385 5.972284 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:49.152489316 6.017589 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:49.878508114 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
```

### ppviz format
The ppviz format is primarily intended to be used to generate data that can be
visualized by Kathie's [ppviz](https://github.com/pollere/ppviz) tool. The
format is essentially a CSV format, using a single space as the separator, and
is further described [here](http://www.pollere.net/ppviz.html).

Note that the optional *FBytes*, *DBytes* and *PBytes* from the format
specification have not been included here, and do not appear to be used by
ppviz. Furthermore, flow events are not included in the output, as those are
not used by ppviz.

An example of the format is provided below:
```shell
1623420121.483727575 0.005298909 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420122.484530934 0.006016639 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420123.485899736 0.005590783 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420124.490584753 0.006123511 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420125.492190751 0.005624835 0.005298909 10.11.1.1:5201+10.11.1.2:59532
```
### JSON format
The JSON format is primarily intended to be machine-readable, and thus uses no
spacing or newlines between entries to reduce the overhead. External tools such
as [jq](https://stedolan.github.io/jq/) can be used to pretty-print the format.

The format consists of an array at the root-level, and each flow or RTT event is
added as an object to the root-array. The events contain some additional fields
in the JSON format which are not displayed by the other formats. All times
(*timestamp*, *rtt* and *min_rtt*) are provided as integers in nanoseconds.

An example of a (pretty-printed) flow-event is provided below:
```json
{
  "timestamp": 1623420837244545000,
  "src_ip": "10.11.1.1",
  "src_port": 5201,
  "dest_ip": "10.11.1.2",
  "dest_port": 59572,
  "protocol": "TCP",
  "flow_event": "opening",
  "reason": "SYN-ACK",
  "triggered_by": "src"
}
```

An example of a (pretty-printed) RTT-event is provided below:
```json
{
  "timestamp": 1623420838254558500,
  "src_ip": "10.11.1.1",
  "src_port": 5201,
  "dest_ip": "10.11.1.2",
  "dest_port": 59572,
  "protocol": "TCP",
  "rtt": 5977708,
  "min_rtt": 5441848,
  "sent_packets": 9393,
  "sent_bytes": 492457296,
  "rec_packets": 5922,
  "rec_bytes": 37
}
```

## Design and technical description


@@ -47,7 +130,8 @@ matched differs from the one in Kathie's pping, and is further described in
  reverse flow (to match source/dest on egress). If there is a match, it
  calculates the RTT from the stored timestamp and deletes the entry. The
  calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
  `rtt_events`.
  `events`. Both `pping_egress()` and `pping_ingress()` can also push flow-events
  to the `events` buffer.
- **bpf_egress_loader.sh:** A shell script that's used by `pping.c` to setup a
  clsact qdisc and attach the `pping_egress()` program to egress using
  tc. **Note**: Unless your iproute2 comes with libbpf support, tc will use
@@ -65,13 +149,111 @@ matched differs from the one in Kathie's pping, and is further described in
  last seen identifier for the flow and when the last timestamp entry for the
  flow was created. Entries are created by `pping_egress()`, and can be updated
  or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
  are eventually removed by `pping.c`. Pinned at `/sys/fs/bpf/pping`.
  are eventually removed by `pping.c`.
- **packet_ts:** A hash-map storing a timestamp for a specific packet
  identifier. Entries are created by `pping_egress()` and removed by
  `pping_ingress()` if a match is found. Leftover entries are eventually
  removed by `pping.c`. Pinned at `/sys/fs/bpf/pping`.
- **rtt_events:** A perf-buffer used by `pping_ingress()` to push calculated RTTs
  to `pping.c`, which continuously polls the map to print out the RTTs.
  `pping_ingress()` if a match is found. Leftover entries are eventually removed
  by `pping.c`.
- **events:** A perf-buffer used by the BPF programs to push flow or RTT events
  to `pping.c`, which continuously polls the map and prints them out.
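
For context, the user-space side of that polling follows the standard libbpf perf-buffer pattern. A minimal sketch (simplified from pping.c, using the same older `perf_buffer_opts`-based libbpf API that this commit uses; the `handle_event()` callback and page count here are illustrative assumptions):

```c
#include <bpf/libbpf.h>
#include <linux/types.h>

static volatile int keep_running = 1;

/* Illustrative callback - pping.c installs print_event_standard/json/ppviz here. */
static void handle_event(void *ctx, int cpu, void *data, __u32 data_size)
{
    /* cast data to union pping_event and print it */
}

static int poll_events(struct bpf_object *obj)
{
    struct perf_buffer_opts pb_opts = { .sample_cb = handle_event };
    int map_fd = bpf_object__find_map_fd_by_name(obj, "events");
    struct perf_buffer *pb = perf_buffer__new(map_fd, 64, &pb_opts);

    if (libbpf_get_error(pb))
        return -1;

    while (keep_running)
        perf_buffer__poll(pb, 100 /* ms */);

    perf_buffer__free(pb);
    return 0;
}
```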

### A note on concurrency
The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
programs need to see the global view to function properly, using `PERCPU` maps
is not an option. The program must be able to match against stored packet
timestamps regardless of the CPU the packets are processed on, and must also
have a global view of the flow state in order for the sampling to work
correctly.

As the BPF programs may run concurrently on different CPU cores accessing these
global hash maps, this may result in some concurrency issues. In practice, I do
not believe these will occur particularly often, as I'm under the impression
that packets from the same flow will typically be processed by the same
CPU. Furthermore, most of the concurrency issues will not be that problematic
even if they do occur. For now, I've therefore left these concurrency issues
unattended, even if some of them could be avoided with atomic operations and/or
spinlocks, in order to keep things simple and not hurt performance.

The (known) potential concurrency issues are:

#### Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing identifier for each
flow, by storing it in the `flow_state` map. This is done to detect the first
packet with a new identifier. If multiple packets are processed concurrently,
several of them could potentially detect themselves as being first with the same
identifier (which only matters if they also pass the rate-limit check),
alternatively if the concurrent packets have different identifiers there may be
a lost update (but for TCP timestamps, concurrent packets would typically be
expected to have the same timestamp).

A possibly more severe issue is out-of-order packets. If a packet with an old
identifier arrives out of order, that identifier could be detected as a new
identifier. If for example the following flow of four packets with just two
different identifiers (id1 and id2) were to occur:

id1 -> id2 -> id1 -> id2

Then the tc/egress program would consider each of these packets to have new
identifiers and try to create a new timestamp for each of them if the sampling
strategy allows it. However, even if the sampling strategy allows it, the
(incorrect) creation of timestamps for id1 and id2 the second time would only be
successful in case the first timestamps for id1 and id2 have already been
matched against (and thus deleted). Even if that is the case, they would only
result in reporting an incorrect RTT in case there are also new matches against
these identifiers.

This issue could be avoided entirely by requiring that new-id > old-id instead
of simply checking that new-id != old-id, as TCP timestamps should monotonically
increase. That may however not be a suitable solution if/when we add support for
other types of identifiers.
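
Sketched in C, the racy check-then-update and the stricter alternative look roughly like this (a simplified illustration based on the `pping_egress()` logic in this commit, not a change it makes; `f_state` is the flow's entry in `flow_state`):

```c
/* Current approach: detect a new identifier, then store it. Two CPUs can
 * both see last_id != identifier before either one updates it. */
if (f_state->last_id == p_id.identifier)
    goto out;
f_state->last_id = p_id.identifier;

/* Possible alternative for monotonically increasing identifiers such as TCP
 * TSval: also rejects old identifiers arriving out of order, but would not
 * suit identifiers that do not increase (e.g. the QUIC spin bit). */
if (p_id.identifier <= f_state->last_id)
    goto out;
f_state->last_id = p_id.identifier;
```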

#### Rate-limiting new timestamps
In the tc/egress program packets to timestamp are sampled by using a per-flow
rate-limit, which is enforced by storing when the last timestamp was created in
the `flow_state` map. If multiple packets perform this check concurrently, it's
possible that multiple packets think they are allowed to create timestamps
before any of them are able to update the `last_timestamp`. When they update
`last_timestamp` it might also be slightly incorrect, however if they are
processed concurrently then they should also generate very similar timestamps.

If the packets have different identifiers (which would typically not be
expected for concurrent TCP timestamps), then this would allow some packets to
bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
additional map space and report some additional RTT(s) more than expected
(however the reported RTTs should still be correct).

If the packets have the same identifier, they must first have managed to bypass
the previous check for unique identifiers (see [previous point](#Tracking last
seen identifier)), and only one of them will be able to successfully store a
timestamp entry.

#### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent packets with
the same identifier against a single timestamp entry in `packet_ts`, before any
of them manage to delete the timestamp entry. This would result in multiple RTTs
being reported for the same identifier, but if they are processed concurrently
these RTTs should be very similar, so would mainly result in over-reporting
rather than reporting incorrect RTTs.
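
One way such duplicate reports could be suppressed (a possible mitigation sketched here, not something this commit does) is to let the map deletion itself arbitrate, since only one of the concurrent deletions of the same entry can succeed:

```c
/* Sketch only: report the RTT only if we are the one who deleted the entry. */
__u64 *p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (!p_ts)
    goto out;

__u64 rtt = bpf_ktime_get_ns() - *p_ts;
if (bpf_map_delete_elem(&packet_ts, &p_id) == 0) {
    /* we won the race - push the RTT event to the perf buffer */
}
```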

#### Updating flow statistics
Both the tc/egress and XDP/ingress programs will try to update some flow
statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and bytes
sent/received. This is not done in an atomic fashion, so there could potentially
be some lost updates resulting in an underestimate.

Furthermore, whenever the XDP/ingress program calculates an RTT, it will check
if this is the lowest RTT seen so far for the flow. If multiple RTTs are
calculated concurrently, then several could pass this check concurrently and
there may be a lost update. It should only be possible for multiple RTTs to be
calculated concurrently in case either the [timestamp rate-limit was
bypassed](#Rate-limiting new timestamps) or [multiple packets managed to match
against the same timestamp](#Matching against stored timestamps).

It's worth noting that with sampling the reported minimum-RTT is only an
estimate anyway (the RTT may never be calculated for the packet with the true
minimum RTT). And even without sampling there is some inherent sampling due to
TCP timestamps only being updated at a limited rate (1000 Hz).
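
If those lost statistics updates ever turned out to matter, the counters could be bumped with atomic builtins instead; a sketch of that trade-off (not a change made by this commit):

```c
/* Current, non-atomic updates - cheap, but concurrent increments from
 * different CPUs can be lost. */
f_state->sent_pkts++;
f_state->sent_bytes += remaining_pkt_payload(&pctx);

/* Atomic alternative (compiles to a BPF atomic add on supported targets) -
 * avoids lost updates at a small extra cost. */
__sync_fetch_and_add(&f_state->sent_pkts, 1);
__sync_fetch_and_add(&f_state->sent_bytes, remaining_pkt_payload(&pctx));
```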

## Similar projects
Passively measuring the RTT for TCP traffic is not a novel concept, and there
@@ -84,9 +266,9 @@ RTT calculation using TCP timestamps (as in this project) works is provided in
implementing some filtering logic the hope is to be able to create an always-on
tool that can scale well even to large numbers of massive flows.
- [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
  the "machine-friendly" output from Kathie's pping tool. If/when we implement a
  similar machine readable output option it should hopefully work with this
  implementation as well.
  the "machine-friendly" (-m) output from Kathie's pping tool. Running this
  implementation of pping with --format="ppviz" will generate output that can be
  used by ppviz.
- [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
  can analyze a tcpdump file and among other things calculate RTTs based on
  seq/ACK numbers (`-r` or `-R` flag).

pping/TODO.md
@@ -25,7 +25,7 @@
- [ ] Could potentially include keeping track of average RTT, which
      may be useful for some decisions (ex. how often to sample,
      when entry can be removed etc)
- [ ] Could potentially include keeping track of minimum RTT (as
- [x] Could potentially include keeping track of minimum RTT (as
      done by the original pping), ex. to track bufferbloat
- [ ] Could potentially include keeping track of if flow is
      bi-directional
@@ -38,11 +38,6 @@
      unnecessarily large, which slows down the cleaning and may block
      new entries
- [ ] Use libxdp to load XDP program
- [ ] Add option for machine-readable output (as original pping)
  - It may be a good idea to keep the same format as original pping,
    so that tools such as [ppviz](https://github.com/pollere/ppviz)
    works for both pping implementations.
- [ ] Add timestamps to output (as original pping)
- [ ] Add support for other hooks
  - Ex TC-BPF on ingress instead of XDP?

@@ -59,3 +54,9 @@
- [x] Add IPv6 support
- [x] Refactor to support easy addition of other protocols
- [x] Load tc-bpf program with libbpf (only attach it with tc)
- [x] Switch to libbpf TC-BPF API for attaching the TC-BPF program
- [x] Add option for machine-readable output (as original pping)
  - It may be a good idea to keep the same format as original pping,
    so that tools such as [ppviz](https://github.com/pollere/ppviz)
    works for both pping implementations.
- [x] Add timestamps to output (as original pping)

pping/pping.c
@@ -23,6 +23,7 @@ static const char *__doc__ =
#include <time.h>
#include <pthread.h>

#include "json_writer.h"
#include "pping.h" //common structs for user-space and BPF parts

#define NS_PER_SECOND 1000000000UL

@@ -40,6 +41,9 @@ static const char *__doc__ =

#define MAX_PATH_LEN 1024

#define MON_TO_REAL_UPDATE_FREQ \
    (1 * NS_PER_SECOND) // Update offset between CLOCK_MONOTONIC and CLOCK_REALTIME once per second

/*
 * BPF implementation of pping using libbpf
 * Uses TC-BPF for egress and XDP for ingress

@@ -64,20 +68,24 @@ struct map_cleanup_args {
struct pping_config {
    struct bpf_config bpf_config;
    __u64 cleanup_interval;
    int xdp_flags;
    int ifindex;
    char ifname[IF_NAMESIZE];
    bool force;
    char *object_path;
    char *ingress_sec;
    char *egress_sec;
    char *pin_dir;
    char *packet_map;
    char *flow_map;
    char *rtt_map;
    char *event_map;
    int xdp_flags;
    int ifindex;
    char ifname[IF_NAMESIZE];
    bool json_format;
    bool ppviz_format;
    bool force;
};

static volatile int keep_running = 1;
static json_writer_t *json_ctx = NULL;
static void (*print_event_func)(void *, int, void *, __u32) = NULL;

static const struct option long_options[] = {
    { "help", no_argument, NULL, 'h' },
@@ -85,6 +93,7 @@ static const struct option long_options[] = {
    { "rate-limit", required_argument, NULL, 'r' }, // Sampling rate-limit in ms
    { "force", no_argument, NULL, 'f' }, // Detach any existing XDP program on interface
    { "cleanup-interval", required_argument, NULL, 'c' }, // Map cleaning interval in s
    { "format", required_argument, NULL, 'F' }, // Which format to output in (standard/json/ppviz)
    { 0, 0, NULL, 0 }
};

@@ -135,8 +144,11 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
    double rate_limit_ms, cleanup_interval_s;

    config->ifindex = 0;
    config->force = false;
    config->json_format = false;
    config->ppviz_format = false;

    while ((opt = getopt_long(argc, argv, "hfi:r:c:", long_options,
    while ((opt = getopt_long(argc, argv, "hfi:r:c:F:", long_options,
                              NULL)) != -1) {
        switch (opt) {
        case 'i':
@@ -173,6 +185,16 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
            config->cleanup_interval =
                cleanup_interval_s * NS_PER_SECOND;
            break;
        case 'F':
            if (strcmp(optarg, "json") == 0) {
                config->json_format = true;
            } else if (strcmp(optarg, "ppviz") == 0) {
                config->ppviz_format = true;
            } else if (strcmp(optarg, "standard") != 0) {
                fprintf(stderr, "format must be \"standard\", \"json\" or \"ppviz\"\n");
                return -EINVAL;
            }
            break;
        case 'f':
            config->force = true;
            break;
@@ -336,16 +358,16 @@ static int tc_bpf_clear(char *interface)
 * Returns time of CLOCK_MONOTONIC as nanoseconds in a single __u64.
 * On failure, the value 0 is returned (and errno will be set).
 */
static __u64 get_time_ns(void)
static __u64 get_time_ns(clockid_t clockid)
{
    struct timespec t;
    if (clock_gettime(CLOCK_MONOTONIC, &t) != 0)
    if (clock_gettime(clockid, &t) != 0)
        return 0;

    return (__u64)t.tv_sec * NS_PER_SECOND + (__u64)t.tv_nsec;
}

static bool packet_ts_timeout(void *val_ptr, __u64 now)
static bool packet_ts_timeout(void *key_ptr, void *val_ptr, __u64 now)
{
    __u64 ts = *(__u64 *)val_ptr;
    if (now > ts && now - ts > TIMESTAMP_LIFETIME)
@@ -353,11 +375,23 @@ static bool packet_ts_timeout(void *val_ptr, __u64 now)
    return false;
}

static bool flow_timeout(void *val_ptr, __u64 now)
static bool flow_timeout(void *key_ptr, void *val_ptr, __u64 now)
{
    struct flow_event fe;
    __u64 ts = ((struct flow_state *)val_ptr)->last_timestamp;
    if (now > ts && now - ts > FLOW_LIFETIME)

    if (now > ts && now - ts > FLOW_LIFETIME) {
        if (print_event_func) {
            fe.event_type = EVENT_TYPE_FLOW;
            fe.timestamp = now;
            memcpy(&fe.flow, key_ptr, sizeof(struct network_tuple));
            fe.event_info.event = FLOW_EVENT_CLOSING;
            fe.event_info.reason = EVENT_REASON_FLOW_TIMEOUT;
            fe.source = EVENT_SOURCE_USERSPACE;
            print_event_func(NULL, 0, &fe, sizeof(fe));
        }
        return true;
    }
    return false;
}

@@ -369,12 +403,12 @@ static bool flow_timeout(void *val_ptr, __u64 now)
 */
//TODO - maybe add some pointer to arguments for del_decision_func?
static int clean_map(int map_fd, size_t key_size, size_t value_size,
                     bool (*del_decision_func)(void *, __u64))
                     bool (*del_decision_func)(void *, void *, __u64))
{
    int removed = 0;
    void *key, *prev_key, *value;
    bool delete_prev = false;
    __u64 now_nsec = get_time_ns();
    __u64 now_nsec = get_time_ns(CLOCK_MONOTONIC);

#ifdef DEBUG
    int entries = 0;
@@ -401,7 +435,7 @@ static int clean_map(int map_fd, size_t key_size, size_t value_size,
        }

        if (bpf_map_lookup_elem(map_fd, key, value) == 0)
            delete_prev = del_decision_func(value, now_nsec);
            delete_prev = del_decision_func(key, value, now_nsec);
#ifdef DEBUG
        entries++;
#endif
@@ -412,10 +446,11 @@ static int clean_map(int map_fd, size_t key_size, size_t value_size,
            removed++;
        }
#ifdef DEBUG
    duration = get_time_ns() - now_nsec;
    printf("%d: Gone through %d entries and removed %d of them in %llu.%09llu s\n",
           map_fd, entries, removed, duration / NS_PER_SECOND,
           duration % NS_PER_SECOND);
    duration = get_time_ns(CLOCK_MONOTONIC) - now_nsec;
    fprintf(stderr,
            "%d: Gone through %d entries and removed %d of them in %llu.%09llu s\n",
            map_fd, entries, removed, duration / NS_PER_SECOND,
            duration % NS_PER_SECOND);
#endif
cleanup:
    free(key);
@@ -441,12 +476,33 @@ static void *periodic_map_cleanup(void *args)
    pthread_exit(NULL);
}

static __u64 convert_monotonic_to_realtime(__u64 monotonic_time)
{
    static __u64 offset = 0;
    static __u64 offset_updated = 0;
    __u64 now_mon = get_time_ns(CLOCK_MONOTONIC);
    __u64 now_rt;

    if (offset == 0 ||
        (now_mon > offset_updated &&
         now_mon - offset_updated > MON_TO_REAL_UPDATE_FREQ)) {
        now_mon = get_time_ns(CLOCK_MONOTONIC);
        now_rt = get_time_ns(CLOCK_REALTIME);

        if (now_rt < now_mon)
            return 0;
        offset = now_rt - now_mon;
        offset_updated = now_mon;
    }
    return monotonic_time + offset;
}

/*
 * Wrapper around inet_ntop designed to handle the "bug" that mapped IPv4
 * addresses are formatted as IPv6 addresses for AF_INET6
 */
static int format_ip_address(int af, const struct in6_addr *addr, char *buf,
                             size_t size)
static int format_ip_address(char *buf, size_t size, int af,
                             const struct in6_addr *addr)
{
    if (af == AF_INET)
        return inet_ntop(af, &addr->s6_addr[12],
@@ -456,18 +512,193 @@ static int format_ip_address(int af, const struct in6_addr *addr, char *buf,
    return -EINVAL;
}

static void handle_rtt_event(void *ctx, int cpu, void *data, __u32 data_size)
static const char *proto_to_str(__u16 proto)
{
    static char buf[8];

    switch (proto) {
    case IPPROTO_TCP:
        return "TCP";
    case IPPROTO_ICMP:
        return "ICMP";
    case IPPROTO_ICMPV6:
        return "ICMPv6";
    default:
        snprintf(buf, sizeof(buf), "%d", proto);
        return buf;
    }
}

static const char *flowevent_to_str(enum flow_event_type fe)
{
    switch (fe) {
    case FLOW_EVENT_NONE:
        return "none";
    case FLOW_EVENT_OPENING:
        return "opening";
    case FLOW_EVENT_CLOSING:
        return "closing";
    default:
        return "unknown";
    }
}

static const char *eventreason_to_str(enum flow_event_reason er)
{
    switch (er) {
    case EVENT_REASON_SYN:
        return "SYN";
    case EVENT_REASON_SYN_ACK:
        return "SYN-ACK";
    case EVENT_REASON_FIRST_OBS_PCKT:
        return "first observed packet";
    case EVENT_REASON_FIN:
        return "FIN";
    case EVENT_REASON_FIN_ACK:
        return "FIN-ACK";
    case EVENT_REASON_RST:
        return "RST";
    case EVENT_REASON_FLOW_TIMEOUT:
        return "flow timeout";
    default:
        return "unknown";
    }
}

static const char *eventsource_to_str(enum flow_event_source es)
{
    switch (es) {
    case EVENT_SOURCE_EGRESS:
        return "src";
    case EVENT_SOURCE_INGRESS:
        return "dest";
    case EVENT_SOURCE_USERSPACE:
        return "userspace-cleanup";
    default:
        return "unknown";
    }
}

static void print_flow_ppvizformat(FILE *stream, const struct network_tuple *flow)
{
    const struct rtt_event *e = data;
    char saddr[INET6_ADDRSTRLEN];
    char daddr[INET6_ADDRSTRLEN];

    format_ip_address(e->flow.ipv, &e->flow.saddr.ip, saddr, sizeof(saddr));
    format_ip_address(e->flow.ipv, &e->flow.daddr.ip, daddr, sizeof(daddr));
    format_ip_address(saddr, sizeof(saddr), flow->ipv, &flow->saddr.ip);
    format_ip_address(daddr, sizeof(daddr), flow->ipv, &flow->daddr.ip);
    fprintf(stream, "%s:%d+%s:%d", saddr, ntohs(flow->saddr.port), daddr,
            ntohs(flow->daddr.port));
}

    printf("%llu.%06llu ms %s:%d+%s:%d\n", e->rtt / NS_PER_MS,
           e->rtt % NS_PER_MS, saddr, ntohs(e->flow.saddr.port), daddr,
           ntohs(e->flow.daddr.port));
static void print_ns_datetime(FILE *stream, __u64 monotonic_ns)
{
    char timestr[9];
    __u64 ts = convert_monotonic_to_realtime(monotonic_ns);
    time_t ts_s = ts / NS_PER_SECOND;

    strftime(timestr, sizeof(timestr), "%H:%M:%S", localtime(&ts_s));
    fprintf(stream, "%s.%09llu", timestr, ts % NS_PER_SECOND);
}

static void print_event_standard(void *ctx, int cpu, void *data,
                                 __u32 data_size)
{
    const union pping_event *e = data;

    if (e->event_type == EVENT_TYPE_RTT) {
        print_ns_datetime(stdout, e->rtt_event.timestamp);
        printf(" %llu.%06llu ms %llu.%06llu ms ",
               e->rtt_event.rtt / NS_PER_MS,
               e->rtt_event.rtt % NS_PER_MS,
               e->rtt_event.min_rtt / NS_PER_MS,
               e->rtt_event.min_rtt % NS_PER_MS);
        print_flow_ppvizformat(stdout, &e->rtt_event.flow);
        printf("\n");
    } else if (e->event_type == EVENT_TYPE_FLOW) {
        print_ns_datetime(stdout, e->flow_event.timestamp);
        printf(" ");
        print_flow_ppvizformat(stdout, &e->flow_event.flow);
        printf(" %s due to %s from %s\n",
               flowevent_to_str(e->flow_event.event_info.event),
               eventreason_to_str(e->flow_event.event_info.reason),
               eventsource_to_str(e->flow_event.source));
    }
}

static void print_event_ppviz(void *ctx, int cpu, void *data, __u32 data_size)
{
    const struct rtt_event *e = data;
    __u64 time = convert_monotonic_to_realtime(e->timestamp);

    if (e->event_type != EVENT_TYPE_RTT)
        return;

    printf("%llu.%09llu %llu.%09llu %llu.%09llu ", time / NS_PER_SECOND,
           time % NS_PER_SECOND, e->rtt / NS_PER_SECOND,
           e->rtt % NS_PER_SECOND, e->min_rtt / NS_PER_SECOND, e->min_rtt);
    print_flow_ppvizformat(stdout, &e->flow);
    printf("\n");
}

static void print_common_fields_json(json_writer_t *ctx,
                                     const union pping_event *e)
{
    const struct network_tuple *flow = &e->rtt_event.flow;
    char saddr[INET6_ADDRSTRLEN];
    char daddr[INET6_ADDRSTRLEN];

    format_ip_address(saddr, sizeof(saddr), flow->ipv, &flow->saddr.ip);
    format_ip_address(daddr, sizeof(daddr), flow->ipv, &flow->daddr.ip);

    jsonw_u64_field(ctx, "timestamp",
                    convert_monotonic_to_realtime(e->rtt_event.timestamp));
    jsonw_string_field(ctx, "src_ip", saddr);
    jsonw_hu_field(ctx, "src_port", ntohs(flow->saddr.port));
    jsonw_string_field(ctx, "dest_ip", daddr);
    jsonw_hu_field(ctx, "dest_port", ntohs(flow->daddr.port));
    jsonw_string_field(ctx, "protocol", proto_to_str(flow->proto));
}

static void print_rttevent_fields_json(json_writer_t *ctx,
                                       const struct rtt_event *re)
{
    jsonw_u64_field(ctx, "rtt", re->rtt);
    jsonw_u64_field(ctx, "min_rtt", re->min_rtt);
    jsonw_u64_field(ctx, "sent_packets", re->sent_pkts);
    jsonw_u64_field(ctx, "sent_bytes", re->sent_bytes);
    jsonw_u64_field(ctx, "rec_packets", re->rec_pkts);
    jsonw_u64_field(ctx, "rec_bytes", re->rec_bytes);
}

static void print_flowevent_fields_json(json_writer_t *ctx,
                                        const struct flow_event *fe)
{
    jsonw_string_field(ctx, "flow_event",
                       flowevent_to_str(fe->event_info.event));
    jsonw_string_field(ctx, "reason",
                       eventreason_to_str(fe->event_info.reason));
    jsonw_string_field(ctx, "triggered_by", eventsource_to_str(fe->source));
}

static void print_event_json(void *ctx, int cpu, void *data, __u32 data_size)
{
    const union pping_event *e = data;

    if (e->event_type != EVENT_TYPE_RTT && e->event_type != EVENT_TYPE_FLOW)
        return;

    if (!json_ctx) {
        json_ctx = jsonw_new(stdout);
        jsonw_start_array(json_ctx);
    }

    jsonw_start_object(json_ctx);
    print_common_fields_json(json_ctx, e);
    if (e->event_type == EVENT_TYPE_RTT)
        print_rttevent_fields_json(json_ctx, &e->rtt_event);
    else // flow-event
        print_flowevent_fields_json(json_ctx, &e->flow_event);
    jsonw_end_object(json_ctx);
}

static void handle_missed_rtt_event(void *ctx, int cpu, __u64 lost_cnt)
@@ -593,17 +824,18 @@ int main(int argc, char *argv[])
        .pin_dir = "/sys/fs/bpf/pping",
        .packet_map = "packet_ts",
        .flow_map = "flow_state",
        .rtt_map = "rtt_events",
        .event_map = "events",
        .xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
        .force = false,
    };

    struct perf_buffer *pb = NULL;
    struct perf_buffer_opts pb_opts = {
        .sample_cb = handle_rtt_event,
        .sample_cb = print_event_standard,
        .lost_cb = handle_missed_rtt_event,
    };

    print_event_func = print_event_standard;

    // Detect if running as root
    if (geteuid() != 0) {
        printf("This program must be run as root.\n");
@@ -626,6 +858,14 @@ int main(int argc, char *argv[])
        return EXIT_FAILURE;
    }

    if (config.json_format) {
        pb_opts.sample_cb = print_event_json;
        print_event_func = print_event_json;
    } else if (config.ppviz_format) {
        pb_opts.sample_cb = print_event_ppviz;
        print_event_func = print_event_ppviz;
    }

    err = load_attach_bpfprogs(&obj, &config, &tc_attached, &xdp_attached);
    if (err) {
        fprintf(stderr,
@@ -643,13 +883,13 @@ int main(int argc, char *argv[])

    // Set up perf buffer
    pb = perf_buffer__new(bpf_object__find_map_fd_by_name(obj,
                                                          config.rtt_map),
                                                          config.event_map),
                          PERF_BUFFER_PAGES, &pb_opts);
    err = libbpf_get_error(pb);
    if (err) {
        pb = NULL;
        fprintf(stderr, "Failed to open perf buffer %s: %s\n",
                config.rtt_map, strerror(err));
                config.event_map, strerror(err));
        goto cleanup;
    }

@@ -695,5 +935,10 @@ cleanup:
            config.pin_dir, strerror(-err));
    }

    if (config.json_format && json_ctx) {
        jsonw_end_array(json_ctx);
        jsonw_destroy(&json_ctx);
    }

    return err != 0;
}

pping/pping.h
@@ -4,10 +4,37 @@

#include <linux/types.h>
#include <linux/in6.h>
#include <stdbool.h>

#define INGRESS_PROG_SEC "xdp"
#define EGRESS_PROG_SEC "classifier"

/* For the event_type members of rtt_event and flow_event */
#define EVENT_TYPE_FLOW 1
#define EVENT_TYPE_RTT 2

enum __attribute__((__packed__)) flow_event_type {
    FLOW_EVENT_NONE,
    FLOW_EVENT_OPENING,
    FLOW_EVENT_CLOSING
};

enum __attribute__((__packed__)) flow_event_reason {
    EVENT_REASON_SYN,
    EVENT_REASON_SYN_ACK,
    EVENT_REASON_FIRST_OBS_PCKT,
    EVENT_REASON_FIN,
    EVENT_REASON_FIN_ACK,
    EVENT_REASON_RST,
    EVENT_REASON_FLOW_TIMEOUT
};

enum __attribute__((__packed__)) flow_event_source {
    EVENT_SOURCE_EGRESS,
    EVENT_SOURCE_INGRESS,
    EVENT_SOURCE_USERSPACE
};

struct bpf_config {
    __u64 rate_limit;
};
@@ -39,7 +66,12 @@ struct network_tuple {
};

struct flow_state {
    __u64 min_rtt;
    __u64 last_timestamp;
    __u64 sent_pkts;
    __u64 sent_bytes;
    __u64 rec_pkts;
    __u64 rec_bytes;
    __u32 last_id;
    __u32 reserved;
};
@@ -49,10 +81,52 @@ struct packet_id {
    __u32 identifier; //tsval for TCP packets
};

/*
 * An RTT event message that can be passed from the bpf-programs to user-space.
 * The initial event_type member is used to allow multiplexing between
 * different event types in a single perf buffer. Members up to and including
 * flow are identical to other event types.
 * Uses explicit padding instead of packing based on recommendations in cilium's
 * BPF reference documentation at https://docs.cilium.io/en/stable/bpf/#llvm.
 */
struct rtt_event {
    __u64 rtt;
    __u64 event_type;
    __u64 timestamp;
    struct network_tuple flow;
    __u32 padding;
    __u64 rtt;
    __u64 min_rtt;
    __u64 sent_pkts;
    __u64 sent_bytes;
    __u64 rec_pkts;
    __u64 rec_bytes;
    __u32 reserved;
};

struct flow_event_info {
    enum flow_event_type event;
    enum flow_event_reason reason;
};

/*
 * A flow event message that can be passed from the bpf-programs to user-space.
 * The initial event_type member is used to allow multiplexing between
 * different event types in a single perf buffer. Members up to and including
 * flow are identical to other event types.
 */
struct flow_event {
    __u64 event_type;
    __u64 timestamp;
    struct network_tuple flow;
    struct flow_event_info event_info;
    enum flow_event_source source;
    __u8 reserved;
};

union pping_event {
    __u64 event_type;
    struct rtt_event rtt_event;
    struct flow_event flow_event;
};

#endif

pping/pping_kern.c
@@ -60,7 +60,7 @@ struct {
    __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} rtt_events SEC(".maps");
} events SEC(".maps");

// Help functions

@@ -130,14 +130,15 @@ static int parse_tcp_ts(struct tcphdr *tcph, void *data_end, __u32 *tsval,

/*
 * Attempts to fetch an identifier for TCP packets, based on the TCP timestamp
 * option. If successful, identifier will be set to TSval if is_ingress, TSecr
 * otherwise, the port-members of saddr and daddr will be set to the TCP source
 * and dest, respectively, and 0 will be returned. On failure, -1 will be
 * returned. Additionally, if the connection is closing (FIN or RST flag), sets
 * flow_closing to true.
 * option.
 * If successful, identifier will be set to TSval if is_ingress, or TSecr
 * otherwise, the port-members of saddr and daddr will be set to the TCP source
 * and dest, respectively, fei will be filled appropriately (based on
 * SYN/FIN/RST) and 0 will be returned.
 * On failure, -1 will be returned.
 */
static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
                                __be16 *dport, bool *flow_closing,
                                __be16 *dport, struct flow_event_info *fei,
                                __u32 *identifier)
{
    __u32 tsval, tsecr;
@@ -146,18 +147,27 @@ static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
    if (parse_tcphdr(&ctx->nh, ctx->data_end, &tcph) < 0)
        return -1;

    // Check if connection is closing
    if (tcph->fin || tcph->rst) {
        *flow_closing = true;
        /* bpf_printk("Detected connection closing on %d\n", */
        /*            ctx->is_egress); //Upsets verifier? */
    }

    // Do not timestamp pure ACKs
    if (ctx->is_egress && ctx->nh.pos - ctx->data >= ctx->pkt_len &&
        !tcph->syn)
        return -1;

    // Check if connection is opening/closing
    if (tcph->syn) {
        fei->event = FLOW_EVENT_OPENING;
        fei->reason =
            tcph->ack ? EVENT_REASON_SYN_ACK : EVENT_REASON_SYN;
    } else if (tcph->rst) {
        fei->event = FLOW_EVENT_CLOSING;
        fei->reason = EVENT_REASON_RST;
    } else if (!ctx->is_egress && tcph->fin) {
        fei->event = FLOW_EVENT_CLOSING;
        fei->reason =
            tcph->ack ? EVENT_REASON_FIN_ACK : EVENT_REASON_FIN;
    } else {
        fei->event = FLOW_EVENT_NONE;
    }

    if (parse_tcp_ts(tcph, ctx->data_end, &tsval, &tsecr) < 0)
        return -1; //Possible TODO, fall back on seq/ack instead

@@ -170,7 +180,7 @@ static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
/*
 * Attempts to parse the packet limited by the data and data_end pointers,
 * to retrieve a protocol dependent packet identifier. If successful, the
 * pointed to p_id will be filled with parsed information from the packet
 * pointed to p_id and fei will be filled with parsed information from the
 * packet, and 0 will be returned. On failure, -1 will be returned.
 * If is_egress saddr and daddr will match source and destination of packet,
 * respectively, and identifier will be set to the identifier for an outgoing
@@ -179,7 +189,8 @@ static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
 * set to the identifier of a response.
 */
static int parse_packet_identifier(struct parsing_context *ctx,
                                   struct packet_id *p_id, bool *flow_closing)
                                   struct packet_id *p_id,
                                   struct flow_event_info *fei)
{
    int proto, err;
    struct ethhdr *eth;
@@ -201,18 +212,18 @@ static int parse_packet_identifier(struct parsing_context *ctx,
    // Parse IPv4/6 header
    if (proto == bpf_htons(ETH_P_IP)) {
        p_id->flow.ipv = AF_INET;
        proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph);
        p_id->flow.proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph);
    } else if (proto == bpf_htons(ETH_P_IPV6)) {
        p_id->flow.ipv = AF_INET6;
        proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h);
        p_id->flow.proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h);
    } else {
        return -1;
    }

    // Add new protocols here
    if (proto == IPPROTO_TCP) {
    if (p_id->flow.proto == IPPROTO_TCP) {
        err = parse_tcp_identifier(ctx, &saddr->port, &daddr->port,
                                   flow_closing, &p_id->identifier);
                                   fei, &p_id->identifier);
        if (err)
            return -1;
    } else {
@@ -230,6 +241,32 @@ static int parse_packet_identifier(struct parsing_context *ctx,
    return 0;
}

/*
 * Returns the number of unparsed bytes left in the packet (bytes after nh.pos)
 */
static __u32 remaining_pkt_payload(struct parsing_context *ctx)
{
    // pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer)
    // data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register
    __u32 parsed_bytes = ctx->nh.pos - ctx->data;
    return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0;
}

/*
 * Fills in event_type, timestamp, flow, source and reserved.
 * Does not fill in the flow_info.
 */
static void fill_flow_event(struct flow_event *fe, __u64 timestamp,
                            struct network_tuple *flow,
                            enum flow_event_source source)
{
    fe->event_type = EVENT_TYPE_FLOW;
    fe->timestamp = timestamp;
    __builtin_memcpy(&fe->flow, flow, sizeof(struct network_tuple));
    fe->source = source;
    fe->reserved = 0; // Make sure it's initialized
}

// Programs

// TC-BPF for parsing packet identifier from egress traffic and add to map
@@ -237,7 +274,8 @@ SEC(EGRESS_PROG_SEC)
int pping_egress(struct __sk_buff *skb)
{
    struct packet_id p_id = { 0 };
    __u64 p_ts;
    struct flow_event fe;
    __u64 now;
    struct parsing_context pctx = {
        .data = (void *)(long)skb->data,
        .data_end = (void *)(long)skb->data_end,
@@ -245,63 +283,56 @@ int pping_egress(struct __sk_buff *skb)
        .nh = { .pos = pctx.data },
        .is_egress = true,
    };
    bool flow_closing = false;
    struct flow_state *f_state;
    struct flow_state new_state = { 0 };

    if (parse_packet_identifier(&pctx, &p_id, &flow_closing) < 0)
    if (parse_packet_identifier(&pctx, &p_id, &fe.event_info) < 0)
        goto out;

    // Delete flow and create no timestamp entry if flow is closing
    if (flow_closing) {
        bpf_map_delete_elem(&flow_state, &p_id.flow);
    now = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
    f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);

    // Flow closing - try to delete flow state and push closing-event
    if (fe.event_info.event == FLOW_EVENT_CLOSING) {
        if (!f_state) {
            bpf_map_delete_elem(&flow_state, &p_id.flow);
            fill_flow_event(&fe, now, &p_id.flow,
                            EVENT_SOURCE_EGRESS);
            bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU,
                                  &fe, sizeof(fe));
        }
        goto out;
    }

    // Check flow state
    f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
    if (!f_state) { // No previous state - attempt to create it
    // No previous state - attempt to create it and push flow-opening event
    if (!f_state) {
        bpf_map_update_elem(&flow_state, &p_id.flow, &new_state,
                            BPF_NOEXIST);
        f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
        if (!f_state)

        if (!f_state) // Creation failed
            goto out;

        if (fe.event_info.event != FLOW_EVENT_OPENING) {
            fe.event_info.event = FLOW_EVENT_OPENING;
            fe.event_info.reason = EVENT_REASON_FIRST_OBS_PCKT;
        }
        fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_EGRESS);
        bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &fe,
                              sizeof(fe));
    }

    // Check if identifier is new
    /* The gap between checking and updating last_id may cause concurrency
     * issues where multiple packets may simultaneously think they are the
     * first with a new identifier. As long as all of the identifiers are
     * the same though, only one should be able to create a timestamp entry.
    f_state->sent_pkts++;
    f_state->sent_bytes += remaining_pkt_payload(&pctx);

     * A bigger issue is that older identifiers (for example due to
     * out-of-order packets) may pass this check and update the current
     * identifier to an old one. This means that both the packet with the
     * old identifier itself as well as the next packet with the current
     * identifier may be considered packets with new identifiers (even if
     * both have been seen before). For TCP timestamps this could be
     * prevented by changing the check to '>=' instead, but it may not be
     * suitable for other protocols, such as QUIC and its spinbit.
     *
     * For now, just hope that the rate limit saves us from creating an
     * incorrect timestamp. That may however also fail, either due to it
     * happening at a time it's not limited by rate sampling, or because
     * of the rate check failing due to concurrency issues.
     */
    // Check if identifier is new
    if (f_state->last_id == p_id.identifier)
        goto out;
    f_state->last_id = p_id.identifier;

    // Check rate-limit
    /*
     * The window between checking and updating last_timestamp may cause
     * concurrency issues, where multiple packets simultaneously pass the
     * rate limit. However, as long as they have the same identifier, only
     * a single timestamp entry should successfully be created.
     */
    p_ts = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
    if (p_ts < f_state->last_timestamp ||
        p_ts - f_state->last_timestamp < config.rate_limit)
    if (now < f_state->last_timestamp ||
        now - f_state->last_timestamp < config.rate_limit)
        goto out;

    /*
@@ -310,8 +341,8 @@ int pping_egress(struct __sk_buff *skb)
     * the next available map slot somewhat fairer between heavy and sparse
     * flows.
     */
    f_state->last_timestamp = p_ts;
    bpf_map_update_elem(&packet_ts, &p_id, &p_ts, BPF_NOEXIST);
    f_state->last_timestamp = now;
    bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);

out:
    return BPF_OK;
@@ -323,7 +354,9 @@ int pping_ingress(struct xdp_md *ctx)
{
    struct packet_id p_id = { 0 };
    __u64 *p_ts;
    struct rtt_event event = { 0 };
    struct flow_event fe;
    struct rtt_event re = { 0 };
    struct flow_state *f_state;
    struct parsing_context pctx = {
        .data = (void *)(long)ctx->data,
        .data_end = (void *)(long)ctx->data_end,
@@ -331,30 +364,51 @@ int pping_ingress(struct xdp_md *ctx)
        .nh = { .pos = pctx.data },
        .is_egress = false,
    };
    bool flow_closing = false;
    __u64 now;

    if (parse_packet_identifier(&pctx, &p_id, &flow_closing) < 0)
    if (parse_packet_identifier(&pctx, &p_id, &fe.event_info) < 0)
        goto out;

    // Delete flow, but allow final attempt at RTT calculation
    if (flow_closing)
        bpf_map_delete_elem(&flow_state, &p_id.flow);
    f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
    if (!f_state)
        goto out;

    f_state->rec_pkts++;
    f_state->rec_bytes += remaining_pkt_payload(&pctx);

    now = bpf_ktime_get_ns();
    p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
    if (!p_ts)
        goto out;
    if (!p_ts || now < *p_ts)
        goto validflow_out;

    event.rtt = bpf_ktime_get_ns() - *p_ts;
    /*
     * Attempt to delete timestamp entry as soon as RTT is calculated.
     * But could have potential concurrency issue where multiple packets
     * manage to match against the identifier before it can be deleted.
     */
    re.rtt = now - *p_ts;

    // Delete timestamp entry as soon as RTT is calculated
    bpf_map_delete_elem(&packet_ts, &p_id);

    __builtin_memcpy(&event.flow, &p_id.flow, sizeof(struct network_tuple));
    bpf_perf_event_output(ctx, &rtt_events, BPF_F_CURRENT_CPU, &event,
                          sizeof(event));
    if (f_state->min_rtt == 0 || re.rtt < f_state->min_rtt)
        f_state->min_rtt = re.rtt;

    re.event_type = EVENT_TYPE_RTT;
    re.timestamp = now;
    re.min_rtt = f_state->min_rtt;
    re.sent_pkts = f_state->sent_pkts;
    re.sent_bytes = f_state->sent_bytes;
    re.rec_pkts = f_state->rec_pkts;
    re.rec_bytes = f_state->rec_bytes;

    // Push event to perf-buffer
    __builtin_memcpy(&re.flow, &p_id.flow, sizeof(struct network_tuple));
    bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &re, sizeof(re));

validflow_out:
    // Wait with deleting flow until having pushed final RTT message
    if (fe.event_info.event == FLOW_EVENT_CLOSING && f_state) {
        bpf_map_delete_elem(&flow_state, &p_id.flow);
        fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_INGRESS);
        bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe,
                              sizeof(fe));
    }

out:
    return XDP_PASS;