Merge pull request #16 from simosund/pping_Better_output

Pping better output
Toke Høiland-Jørgensen
2021-06-24 14:28:19 +02:00
committed by GitHub
9 changed files with 1155 additions and 128 deletions


@@ -3,7 +3,7 @@
USER_TARGETS := pping
BPF_TARGETS := pping_kern
LDFLAGS += -pthread
LDLIBS += -pthread
EXTRA_DEPS += pping.h
LIB_DIR = ../lib


@@ -15,7 +15,7 @@ echo-reply messages. See the [TODO-list](./TODO.md) for more potential features
The fundamental logic of pping is to timestamp a pseudo-unique identifier for
outgoing packets, and then look for matches in the incoming packets. If a match
is found, the RTT is simply calculated as the time difference between the
current time and the timestamp.
current time and the stored timestamp.
This tool, just as Kathie's original pping implementation, uses TCP timestamps
as identifiers. For outgoing packets, the TSval (which is a timestamp in and of
@@ -28,6 +28,89 @@ the identifier. The mechanism to ensure only the first packet is timestamped and
matched differs from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
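Condensed to its essence, the timestamp-and-match logic described above looks
roughly like the sketch below. This is a simplification of the BPF programs
included in this commit (parsing, flow state and sampling are omitted), not the
complete implementation:

```c
/* Simplified sketch only - see pping_kern.c for the real programs */

/* Egress: timestamp the identifier (TSval) of an outgoing packet */
__u64 now = bpf_ktime_get_ns();
bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);

/* Ingress: a reply echoing the identifier (TSecr) yields the RTT */
__u64 *ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (ts) {
	__u64 rtt = bpf_ktime_get_ns() - *ts;
	bpf_map_delete_elem(&packet_ts, &p_id);
}
```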
## Output formats
pping currently supports three different formats: *standard*, *ppviz* and *json*. In
general, the output consists of two different types of events: flow-events, which
indicate that a flow has started or ended, and RTT-events, which provide
information on a computed RTT within a flow.
### Standard format
The standard format is quite similar to the default output of Kathie's pping, and
is generally intended to be an easily understood, human-readable format that writes
a single line per event.
An example of the format is provided below:
```shell
16:00:46.142279766 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src
16:00:46.147705205 5.425439 ms 5.425439 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:47.148905125 5.261430 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:48.151666385 5.972284 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:49.152489316 6.017589 ms 5.261430 ms 10.11.1.1:5201+10.11.1.2:59528
16:00:49.878508114 10.11.1.1:5201+10.11.1.2:59528 closing due to RST from dest
```
### ppviz format
The ppviz format is primarily intended to be used to generate data that can be
visualized by Kathie's [ppviz](https://github.com/pollere/ppviz) tool. The
format is essentially a CSV format, using a single space as the separator, and
is further described [here](http://www.pollere.net/ppviz.html).
Note that the optional *FBytes*, *DBytes* and *PBytes* from the format
specification have not been included here, and do not appear to be used by
ppviz. Furthermore, flow events are not included in the output, as those are
not used by ppviz.
An example of the format is provided below:
```shell
1623420121.483727575 0.005298909 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420122.484530934 0.006016639 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420123.485899736 0.005590783 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420124.490584753 0.006123511 0.005298909 10.11.1.1:5201+10.11.1.2:59532
1623420125.492190751 0.005624835 0.005298909 10.11.1.1:5201+10.11.1.2:59532
```
### JSON format
The JSON format is primarily intended to be machine-readable, and thus uses no
spacing or newlines between entries to reduce the overhead. External tools such
as [jq](https://stedolan.github.io/jq/) can be used to pretty-print the format.
The format consists of an array at the root-level, and each flow or RTT event is
added as an object to the root-array. The events contain some additional fields
in the JSON format which are not displayed by the other formats. All times
(*timestamp*, *rtt* and *min_rtt*) are provided as integers in nanoseconds.
An example of a (pretty-printed) flow-event is provided below:
```json
{
"timestamp": 1623420837244545000,
"src_ip": "10.11.1.1",
"src_port": 5201,
"dest_ip": "10.11.1.2",
"dest_port": 59572,
"protocol": "TCP",
"flow_event": "opening",
"reason": "SYN-ACK",
"triggered_by": "src"
}
```
An example of a (pretty-printed) RTT-event is provided below:
```json
{
"timestamp": 1623420838254558500,
"src_ip": "10.11.1.1",
"src_port": 5201,
"dest_ip": "10.11.1.2",
"dest_port": 59572,
"protocol": "TCP",
"rtt": 5977708,
"min_rtt": 5441848,
"sent_packets": 9393,
"sent_bytes": 492457296,
"rec_packets": 5922,
"rec_bytes": 37
}
```
## Design and technical description
!["Design of eBPF pping](./eBPF_pping_design.png)
@@ -47,7 +130,8 @@ matched differs from the one in Kathie's pping, and is further described in
reverse flow (to match source/dest on egress). If there is a match, it
calculates the RTT from the stored timestamp and deletes the entry. The
calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
`rtt_events`.
`events`. Both `pping_egress()` and `pping_ingress()` can also push flow-events
to the `events` buffer.
- **bpf_egress_loader.sh:** A shell script that's used by `pping.c` to set up a
clsact qdisc and attach the `pping_egress()` program to egress using
tc. **Note**: Unless your iproute2 comes with libbpf support, tc will use
@@ -65,13 +149,111 @@ matched differs from the one in Kathie's pping, and is further described in
last seen identifier for the flow and when the last timestamp entry for the
flow was created. Entries are created by `pping_egress()`, and can be updated
or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
are eventually removed by `pping.c`. Pinned at `/sys/fs/bpf/pping`.
are eventually removed by `pping.c`.
- **packet_ts:** A hash-map storing a timestamp for a specific packet
identifier. Entries are created by `pping_egress()` and removed by
`pping_ingress()` if a match is found. Leftover entries are eventually
removed by `pping.c`. Pinned at `/sys/fs/bpf/pping`.
- **rtt_events:** A perf-buffer used by `pping_ingress()` to push calculated RTTs
to `pping.c`, which continuously polls the map the print out the RTTs.
`pping_ingress()` if a match is found. Leftover entries are eventually removed
by `pping.c`.
- **events:** A perf-buffer used by the BPF programs to push flow or RTT events
to `pping.c`, which continuously polls the map and prints them out. A rough
sketch of the map declarations is given below the list.
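For orientation, the maps described in this list might be declared roughly as
below using libbpf's BTF-defined map syntax. This is a sketch: the `events`
definition matches the one further down in this diff, while the hash-map sizes
(`max_entries`) are assumed placeholders and the authoritative definitions live
in `pping_kern.c`:

```c
/* Sketch of the map layout; max_entries values are assumptions */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, struct network_tuple);  /* the flow */
	__type(value, struct flow_state);   /* last id, counters, min RTT */
	__uint(max_entries, 16384);         /* assumed capacity */
} flow_state SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__type(key, struct packet_id);      /* flow + identifier */
	__type(value, __u64);               /* timestamp in ns */
	__uint(max_entries, 16384);         /* assumed capacity */
} packet_ts SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
	__uint(key_size, sizeof(__u32));
	__uint(value_size, sizeof(__u32));
} events SEC(".maps");
```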
### A note on concurrency
The program uses "global" (not `PERCPU`) hash maps to keep state. As the BPF
programs need to see the global view to function properly, using `PERCPU` maps
is not an option. The program must be able to match against stored packet
timestamps regardless of the CPU the packets are processed on, and must also
have a global view of the flow state in order for the sampling to work
correctly.
As the BPF programs may run concurrently on different CPU cores accessing these
global hash maps, this may result in some concurrency issues. In practice, I do
not believe these will occur particularly often, as I'm under the impression
that packets from the same flow will typically be processed by the same
CPU. Furthermore, most of the concurrency issues will not be that problematic
even if they do occur. For now, I've therefore left these concurrency issues
unattended, even if some of them could be avoided with atomic operations and/or
spinlocks, in order to keep things simple and not hurt performance.
The (known) potential concurrency issues are:
#### Tracking last seen identifier
The tc/egress program keeps track of the last seen outgoing identifier for each
flow, by storing it in the `flow_state` map. This is done to detect the first
packet with a new identifier. If multiple packets are processed concurrently,
several of them could potentially detect themselves as being first with the same
identifier (which only matters if they also pass the rate-limit check),
alternatively if the concurrent packets have different identifiers there may be
a lost update (but for TCP timestamps, concurrent packets would typically be
expected to have the same timestamp).
A possibly more severe issue is out-of-order packets. If a packet with an old
identifier arrives out of order, that identifier could be detected as a new
identifier. If for example the following flow of four packets with just two
different identifiers (id1 and id2) were to occur:
`id1 -> id2 -> id1 -> id2`
Then the tc/egress program would consider each of these packets to have new
identifiers and try to create a new timestamp for each of them if the sampling
strategy allows it. However, even if the sampling strategy allows it, the
(incorrect) creation of timestamps for id1 and id2 the second time would only
succeed if the first timestamps for id1 and id2 have already been matched
against (and thus deleted). Even then, they would only result in reporting an
incorrect RTT if there are also new matches against these identifiers.
This issue could be avoided entirely by requiring that new-id > old-id instead
of simply checking that new-id != old-id, as TCP timestamps should monotonically
increase. That may however not be a suitable solution if/when we add support for
other types of identifiers.
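For concreteness, the current check and the stricter variant suggested above
would look roughly like this. Only the first check mirrors the egress program in
this commit; the wrap-safe `>` comparison is just a sketch of the idea for TCP
timestamps:

```c
/* Current behaviour: any changed identifier counts as new */
if (f_state->last_id == p_id.identifier)
	goto out;

/* Sketch of the stricter idea: only accept increasing identifiers,
 * using serial-number arithmetic so that TSval wrap-around is tolerated */
if ((__s32)(p_id.identifier - f_state->last_id) <= 0)
	goto out;

f_state->last_id = p_id.identifier;
```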
#### Rate-limiting new timestamps
In the tc/egress program packets to timestamp are sampled by using a per-flow
rate-limit, which is enforced by storing when the last timestamp was created in
the `flow_state` map. If multiple packets perform this check concurrently, it's
possible that multiple packets think they are allowed to create timestamps
before any of them are able to update the `last_timestamp`. When they update
`last_timestamp` it might also be slightly incorrect, however if they are
processed concurrently then they should also generate very similar timestamps.
If the packets have different identifiers (which would typically not be
expected for concurrent TCP timestamps), then this would allow some packets to
bypass the rate-limit. By bypassing the rate-limit, the flow would use up some
additional map space and report some additional RTT(s) more than expected
(however the reported RTTs should still be correct).
If the packets have the same identifier, they must first have managed to bypass
the previous check for unique identifiers (see the [previous
point](#tracking-last-seen-identifier)), and only one of them will be able to successfully store a
timestamp entry.
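The window in question sits between the rate-limit check and the update of
`last_timestamp`, roughly as in this simplified excerpt of the egress logic
found later in this diff:

```c
/* Check rate-limit; several CPUs may pass this check concurrently */
now = bpf_ktime_get_ns();
if (now < f_state->last_timestamp ||
    now - f_state->last_timestamp < config.rate_limit)
	goto out;

/* Another CPU may already have passed the check before this update lands */
f_state->last_timestamp = now;
bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);
```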
#### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent packets with
the same identifier against a single timestamp entry in `packet_ts`, before any
of them manage to delete the timestamp entry. This would result in multiple RTTs
being reported for the same identifier, but if they are processed concurrently
these RTTs should be very similar, so would mainly result in over-reporting
rather than reporting incorrect RTTs.
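If this over-reporting ever needs to be avoided, one possible mitigation (a
sketch, not what the current code does) would be to report the RTT only when the
delete of the timestamp entry succeeds, since only one concurrent deleter can
win:

```c
/* Sketch: only the packet whose delete succeeds reports the RTT */
__u64 *p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (!p_ts)
	goto out;
re.rtt = bpf_ktime_get_ns() - *p_ts;
if (bpf_map_delete_elem(&packet_ts, &p_id) == 0)
	bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
			      &re, sizeof(re));
```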
#### Updating flow statistics
Both the tc/egress and XDP/ingress programs will try to update some flow
statistics each time they successfully parse a packet with an
identifier. Specifically, they'll update the number of packets and bytes
sent/received. This is not done in an atomic fashion, so there could potentially
be some lost updates, resulting in an underestimate.
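If those lost updates ever turn out to matter, the counters could be bumped with
the atomic-add builtin that clang supports for BPF instead of plain increments,
at some extra cost. A sketch of that alternative (not the committed code):

```c
/* Current, non-atomic updates (may lose increments under concurrency):
 *	f_state->sent_pkts++;
 *	f_state->sent_bytes += remaining_pkt_payload(&pctx);
 * Atomic alternative: */
__sync_fetch_and_add(&f_state->sent_pkts, 1);
__sync_fetch_and_add(&f_state->sent_bytes, remaining_pkt_payload(&pctx));
```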
Furthermore, whenever the XDP/ingress program calculates an RTT, it will check
if this is the lowest RTT seen so far for the flow. If multiple RTTs are
calculated concurrently, then several could pass this check concurrently and
there may be a lost update. It should only be possible for multiple RTTs to be
calculated concurrently in case either the [timestamp rate-limit was
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
against the same timestamp](#matching-against-stored-timestamps).
It's worth noting that with sampling the reported minimum-RTT is only an
estimate anyway (the RTT may never be calculated for the packet with the true
minimum RTT). And even without sampling there is some inherent sampling due to TCP
timestamps only being updated at a limited rate (1000 Hz).
## Similar projects
Passively measuring the RTT for TCP traffic is not a novel concept, and there
@@ -84,9 +266,9 @@ RTT calculation using TCP timestamps (as in this project) works is provided in
implementing some filtering logic the hope is to be able to create an always-on
tool that can scale well even to a large number of massive flows.
- [ppviz](https://github.com/pollere/ppviz): Web-based visualization tool for
the "machine-friendly" output from Kathie's pping tool. If/when we implement a
similar machine readable output option it should hopefully work with this
implementation as well.
the "machine-friendly" (-m) output from Kathie's pping tool. Running this
implementation of pping with --format="ppviz" will generate output that can be
used by ppviz.
- [tcptrace](https://github.com/blitz/tcptrace): A post-processing tool which
can analyze a tcpdump file and among other things calculate RTTs based on
seq/ACK numbers (`-r` or `-R` flag).


@@ -25,7 +25,7 @@
- [ ] Could potentially include keeping track of average RTT, which
may be useful for some decisions (ex. how often to sample,
when entry can be removed etc)
- [ ] Could potentially include keeping track of minimum RTT (as
- [x] Could potentially include keeping track of minimum RTT (as
done by the original pping), ex. to track bufferbloat
- [ ] Could potentially include keeping track of if flow is
bi-directional
@@ -38,11 +38,6 @@
unnecessarily large, which slows down the cleaning and may block
new entries
- [ ] Use libxdp to load XDP program
- [ ] Add option for machine-readable output (as original pping)
- It may be a good idea to keep the same format as original pping,
so that tools such as [ppviz](https://github.com/pollere/ppviz)
works for both pping implementations.
- [ ] Add timestamps to output (as original pping)
- [ ] Add support for other hooks
- Ex TC-BPF on ingress instead of XDP?
@@ -59,3 +54,9 @@
- [x] Add IPv6 support
- [x] Refactor to support easy addition of other protocols
- [x] Load tc-bpf program with libbpf (only attach it with tc)
- [x] Switch to libbpf TC-BPF API for attaching the TC-BPF program
- [x] Add option for machine-readable output (as original pping)
- It may be a good idea to keep the same format as original pping,
so that tools such as [ppviz](https://github.com/pollere/ppviz)
works for both pping implementations.
- [x] Add timestamps to output (as original pping)


@@ -23,6 +23,7 @@ static const char *__doc__ =
#include <time.h>
#include <pthread.h>
#include "json_writer.h"
#include "pping.h" //common structs for user-space and BPF parts
#define NS_PER_SECOND 1000000000UL
@@ -40,6 +41,9 @@ static const char *__doc__ =
#define MAX_PATH_LEN 1024
#define MON_TO_REAL_UPDATE_FREQ \
(1 * NS_PER_SECOND) // Update offset between CLOCK_MONOTONIC and CLOCK_REALTIME once per second
/*
* BPF implementation of pping using libbpf
* Uses TC-BPF for egress and XDP for ingress
@@ -64,20 +68,24 @@ struct map_cleanup_args {
struct pping_config {
struct bpf_config bpf_config;
__u64 cleanup_interval;
int xdp_flags;
int ifindex;
char ifname[IF_NAMESIZE];
bool force;
char *object_path;
char *ingress_sec;
char *egress_sec;
char *pin_dir;
char *packet_map;
char *flow_map;
char *rtt_map;
char *event_map;
int xdp_flags;
int ifindex;
char ifname[IF_NAMESIZE];
bool json_format;
bool ppviz_format;
bool force;
};
static volatile int keep_running = 1;
static json_writer_t *json_ctx = NULL;
static void (*print_event_func)(void *, int, void *, __u32) = NULL;
static const struct option long_options[] = {
{ "help", no_argument, NULL, 'h' },
@@ -85,6 +93,7 @@ static const struct option long_options[] = {
{ "rate-limit", required_argument, NULL, 'r' }, // Sampling rate-limit in ms
{ "force", no_argument, NULL, 'f' }, // Detach any existing XDP program on interface
{ "cleanup-interval", required_argument, NULL, 'c' }, // Map cleaning interval in s
{ "format", required_argument, NULL, 'F' }, // Which format to output in (standard/json/ppviz)
{ 0, 0, NULL, 0 }
};
@@ -135,8 +144,11 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
double rate_limit_ms, cleanup_interval_s;
config->ifindex = 0;
config->force = false;
config->json_format = false;
config->ppviz_format = false;
while ((opt = getopt_long(argc, argv, "hfi:r:c:", long_options,
while ((opt = getopt_long(argc, argv, "hfi:r:c:F:", long_options,
NULL)) != -1) {
switch (opt) {
case 'i':
@@ -173,6 +185,16 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
config->cleanup_interval =
cleanup_interval_s * NS_PER_SECOND;
break;
case 'F':
if (strcmp(optarg, "json") == 0) {
config->json_format = true;
} else if (strcmp(optarg, "ppviz") == 0) {
config->ppviz_format = true;
} else if (strcmp(optarg, "standard") != 0) {
fprintf(stderr, "format must be \"standard\", \"json\" or \"ppviz\"\n");
return -EINVAL;
}
break;
case 'f':
config->force = true;
break;
@@ -336,16 +358,16 @@ static int tc_bpf_clear(char *interface)
* Returns the time of the given clockid as nanoseconds in a single __u64.
* On failure, the value 0 is returned (and errno will be set).
*/
static __u64 get_time_ns(void)
static __u64 get_time_ns(clockid_t clockid)
{
struct timespec t;
if (clock_gettime(CLOCK_MONOTONIC, &t) != 0)
if (clock_gettime(clockid, &t) != 0)
return 0;
return (__u64)t.tv_sec * NS_PER_SECOND + (__u64)t.tv_nsec;
}
static bool packet_ts_timeout(void *val_ptr, __u64 now)
static bool packet_ts_timeout(void *key_ptr, void *val_ptr, __u64 now)
{
__u64 ts = *(__u64 *)val_ptr;
if (now > ts && now - ts > TIMESTAMP_LIFETIME)
@@ -353,11 +375,23 @@ static bool packet_ts_timeout(void *val_ptr, __u64 now)
return false;
}
static bool flow_timeout(void *val_ptr, __u64 now)
static bool flow_timeout(void *key_ptr, void *val_ptr, __u64 now)
{
struct flow_event fe;
__u64 ts = ((struct flow_state *)val_ptr)->last_timestamp;
if (now > ts && now - ts > FLOW_LIFETIME)
if (now > ts && now - ts > FLOW_LIFETIME) {
if (print_event_func) {
fe.event_type = EVENT_TYPE_FLOW;
fe.timestamp = now;
memcpy(&fe.flow, key_ptr, sizeof(struct network_tuple));
fe.event_info.event = FLOW_EVENT_CLOSING;
fe.event_info.reason = EVENT_REASON_FLOW_TIMEOUT;
fe.source = EVENT_SOURCE_USERSPACE;
print_event_func(NULL, 0, &fe, sizeof(fe));
}
return true;
}
return false;
}
@@ -369,12 +403,12 @@ static bool flow_timeout(void *val_ptr, __u64 now)
*/
//TODO - maybe add some pointer to arguments for del_decision_func?
static int clean_map(int map_fd, size_t key_size, size_t value_size,
bool (*del_decision_func)(void *, __u64))
bool (*del_decision_func)(void *, void *, __u64))
{
int removed = 0;
void *key, *prev_key, *value;
bool delete_prev = false;
__u64 now_nsec = get_time_ns();
__u64 now_nsec = get_time_ns(CLOCK_MONOTONIC);
#ifdef DEBUG
int entries = 0;
@@ -401,7 +435,7 @@ static int clean_map(int map_fd, size_t key_size, size_t value_size,
}
if (bpf_map_lookup_elem(map_fd, key, value) == 0)
delete_prev = del_decision_func(value, now_nsec);
delete_prev = del_decision_func(key, value, now_nsec);
#ifdef DEBUG
entries++;
#endif
@@ -412,10 +446,11 @@ static int clean_map(int map_fd, size_t key_size, size_t value_size,
removed++;
}
#ifdef DEBUG
duration = get_time_ns() - now_nsec;
printf("%d: Gone through %d entries and removed %d of them in %llu.%09llu s\n",
map_fd, entries, removed, duration / NS_PER_SECOND,
duration % NS_PER_SECOND);
duration = get_time_ns(CLOCK_MONOTONIC) - now_nsec;
fprintf(stderr,
"%d: Gone through %d entries and removed %d of them in %llu.%09llu s\n",
map_fd, entries, removed, duration / NS_PER_SECOND,
duration % NS_PER_SECOND);
#endif
cleanup:
free(key);
@@ -441,12 +476,33 @@ static void *periodic_map_cleanup(void *args)
pthread_exit(NULL);
}
static __u64 convert_monotonic_to_realtime(__u64 monotonic_time)
{
static __u64 offset = 0;
static __u64 offset_updated = 0;
__u64 now_mon = get_time_ns(CLOCK_MONOTONIC);
__u64 now_rt;
if (offset == 0 ||
(now_mon > offset_updated &&
now_mon - offset_updated > MON_TO_REAL_UPDATE_FREQ)) {
now_mon = get_time_ns(CLOCK_MONOTONIC);
now_rt = get_time_ns(CLOCK_REALTIME);
if (now_rt < now_mon)
return 0;
offset = now_rt - now_mon;
offset_updated = now_mon;
}
return monotonic_time + offset;
}
/*
* Wrapper around inet_ntop designed to handle the "bug" that mapped IPv4
* addresses are formatted as IPv6 addresses for AF_INET6
*/
static int format_ip_address(int af, const struct in6_addr *addr, char *buf,
size_t size)
static int format_ip_address(char *buf, size_t size, int af,
const struct in6_addr *addr)
{
if (af == AF_INET)
return inet_ntop(af, &addr->s6_addr[12],
@@ -456,18 +512,193 @@ static int format_ip_address(int af, const struct in6_addr *addr, char *buf,
return -EINVAL;
}
static void handle_rtt_event(void *ctx, int cpu, void *data, __u32 data_size)
static const char *proto_to_str(__u16 proto)
{
static char buf[8];
switch (proto) {
case IPPROTO_TCP:
return "TCP";
case IPPROTO_ICMP:
return "ICMP";
case IPPROTO_ICMPV6:
return "ICMPv6";
default:
snprintf(buf, sizeof(buf), "%d", proto);
return buf;
}
}
static const char *flowevent_to_str(enum flow_event_type fe)
{
switch (fe) {
case FLOW_EVENT_NONE:
return "none";
case FLOW_EVENT_OPENING:
return "opening";
case FLOW_EVENT_CLOSING:
return "closing";
default:
return "unknown";
}
}
static const char *eventreason_to_str(enum flow_event_reason er)
{
switch (er) {
case EVENT_REASON_SYN:
return "SYN";
case EVENT_REASON_SYN_ACK:
return "SYN-ACK";
case EVENT_REASON_FIRST_OBS_PCKT:
return "first observed packet";
case EVENT_REASON_FIN:
return "FIN";
case EVENT_REASON_FIN_ACK:
return "FIN-ACK";
case EVENT_REASON_RST:
return "RST";
case EVENT_REASON_FLOW_TIMEOUT:
return "flow timeout";
default:
return "unknown";
}
}
static const char *eventsource_to_str(enum flow_event_source es)
{
switch (es) {
case EVENT_SOURCE_EGRESS:
return "src";
case EVENT_SOURCE_INGRESS:
return "dest";
case EVENT_SOURCE_USERSPACE:
return "userspace-cleanup";
default:
return "unknown";
}
}
static void print_flow_ppvizformat(FILE *stream, const struct network_tuple *flow)
{
const struct rtt_event *e = data;
char saddr[INET6_ADDRSTRLEN];
char daddr[INET6_ADDRSTRLEN];
format_ip_address(e->flow.ipv, &e->flow.saddr.ip, saddr, sizeof(saddr));
format_ip_address(e->flow.ipv, &e->flow.daddr.ip, daddr, sizeof(daddr));
format_ip_address(saddr, sizeof(saddr), flow->ipv, &flow->saddr.ip);
format_ip_address(daddr, sizeof(daddr), flow->ipv, &flow->daddr.ip);
fprintf(stream, "%s:%d+%s:%d", saddr, ntohs(flow->saddr.port), daddr,
ntohs(flow->daddr.port));
}
printf("%llu.%06llu ms %s:%d+%s:%d\n", e->rtt / NS_PER_MS,
e->rtt % NS_PER_MS, saddr, ntohs(e->flow.saddr.port), daddr,
ntohs(e->flow.daddr.port));
static void print_ns_datetime(FILE *stream, __u64 monotonic_ns)
{
char timestr[9];
__u64 ts = convert_monotonic_to_realtime(monotonic_ns);
time_t ts_s = ts / NS_PER_SECOND;
strftime(timestr, sizeof(timestr), "%H:%M:%S", localtime(&ts_s));
fprintf(stream, "%s.%09llu", timestr, ts % NS_PER_SECOND);
}
static void print_event_standard(void *ctx, int cpu, void *data,
__u32 data_size)
{
const union pping_event *e = data;
if (e->event_type == EVENT_TYPE_RTT) {
print_ns_datetime(stdout, e->rtt_event.timestamp);
printf(" %llu.%06llu ms %llu.%06llu ms ",
e->rtt_event.rtt / NS_PER_MS,
e->rtt_event.rtt % NS_PER_MS,
e->rtt_event.min_rtt / NS_PER_MS,
e->rtt_event.min_rtt % NS_PER_MS);
print_flow_ppvizformat(stdout, &e->rtt_event.flow);
printf("\n");
} else if (e->event_type == EVENT_TYPE_FLOW) {
print_ns_datetime(stdout, e->flow_event.timestamp);
printf(" ");
print_flow_ppvizformat(stdout, &e->flow_event.flow);
printf(" %s due to %s from %s\n",
flowevent_to_str(e->flow_event.event_info.event),
eventreason_to_str(e->flow_event.event_info.reason),
eventsource_to_str(e->flow_event.source));
}
}
static void print_event_ppviz(void *ctx, int cpu, void *data, __u32 data_size)
{
const struct rtt_event *e = data;
__u64 time = convert_monotonic_to_realtime(e->timestamp);
if (e->event_type != EVENT_TYPE_RTT)
return;
printf("%llu.%09llu %llu.%09llu %llu.%09llu ", time / NS_PER_SECOND,
time % NS_PER_SECOND, e->rtt / NS_PER_SECOND,
e->rtt % NS_PER_SECOND, e->min_rtt / NS_PER_SECOND, e->min_rtt % NS_PER_SECOND);
print_flow_ppvizformat(stdout, &e->flow);
printf("\n");
}
static void print_common_fields_json(json_writer_t *ctx,
const union pping_event *e)
{
const struct network_tuple *flow = &e->rtt_event.flow;
char saddr[INET6_ADDRSTRLEN];
char daddr[INET6_ADDRSTRLEN];
format_ip_address(saddr, sizeof(saddr), flow->ipv, &flow->saddr.ip);
format_ip_address(daddr, sizeof(daddr), flow->ipv, &flow->daddr.ip);
jsonw_u64_field(ctx, "timestamp",
convert_monotonic_to_realtime(e->rtt_event.timestamp));
jsonw_string_field(ctx, "src_ip", saddr);
jsonw_hu_field(ctx, "src_port", ntohs(flow->saddr.port));
jsonw_string_field(ctx, "dest_ip", daddr);
jsonw_hu_field(ctx, "dest_port", ntohs(flow->daddr.port));
jsonw_string_field(ctx, "protocol", proto_to_str(flow->proto));
}
static void print_rttevent_fields_json(json_writer_t *ctx,
const struct rtt_event *re)
{
jsonw_u64_field(ctx, "rtt", re->rtt);
jsonw_u64_field(ctx, "min_rtt", re->min_rtt);
jsonw_u64_field(ctx, "sent_packets", re->sent_pkts);
jsonw_u64_field(ctx, "sent_bytes", re->sent_bytes);
jsonw_u64_field(ctx, "rec_packets", re->rec_pkts);
jsonw_u64_field(ctx, "rec_bytes", re->rec_bytes);
}
static void print_flowevent_fields_json(json_writer_t *ctx,
const struct flow_event *fe)
{
jsonw_string_field(ctx, "flow_event",
flowevent_to_str(fe->event_info.event));
jsonw_string_field(ctx, "reason",
eventreason_to_str(fe->event_info.reason));
jsonw_string_field(ctx, "triggered_by", eventsource_to_str(fe->source));
}
static void print_event_json(void *ctx, int cpu, void *data, __u32 data_size)
{
const union pping_event *e = data;
if (e->event_type != EVENT_TYPE_RTT && e->event_type != EVENT_TYPE_FLOW)
return;
if (!json_ctx) {
json_ctx = jsonw_new(stdout);
jsonw_start_array(json_ctx);
}
jsonw_start_object(json_ctx);
print_common_fields_json(json_ctx, e);
if (e->event_type == EVENT_TYPE_RTT)
print_rttevent_fields_json(json_ctx, &e->rtt_event);
else // flow-event
print_flowevent_fields_json(json_ctx, &e->flow_event);
jsonw_end_object(json_ctx);
}
static void handle_missed_rtt_event(void *ctx, int cpu, __u64 lost_cnt)
@@ -593,17 +824,18 @@ int main(int argc, char *argv[])
.pin_dir = "/sys/fs/bpf/pping",
.packet_map = "packet_ts",
.flow_map = "flow_state",
.rtt_map = "rtt_events",
.event_map = "events",
.xdp_flags = XDP_FLAGS_UPDATE_IF_NOEXIST,
.force = false,
};
struct perf_buffer *pb = NULL;
struct perf_buffer_opts pb_opts = {
.sample_cb = handle_rtt_event,
.sample_cb = print_event_standard,
.lost_cb = handle_missed_rtt_event,
};
print_event_func = print_event_standard;
// Detect if running as root
if (geteuid() != 0) {
printf("This program must be run as root.\n");
@@ -626,6 +858,14 @@ int main(int argc, char *argv[])
return EXIT_FAILURE;
}
if (config.json_format) {
pb_opts.sample_cb = print_event_json;
print_event_func = print_event_json;
} else if (config.ppviz_format) {
pb_opts.sample_cb = print_event_ppviz;
print_event_func = print_event_ppviz;
}
err = load_attach_bpfprogs(&obj, &config, &tc_attached, &xdp_attached);
if (err) {
fprintf(stderr,
@@ -643,13 +883,13 @@ int main(int argc, char *argv[])
// Set up perf buffer
pb = perf_buffer__new(bpf_object__find_map_fd_by_name(obj,
config.rtt_map),
config.event_map),
PERF_BUFFER_PAGES, &pb_opts);
err = libbpf_get_error(pb);
if (err) {
pb = NULL;
fprintf(stderr, "Failed to open perf buffer %s: %s\n",
config.rtt_map, strerror(err));
config.event_map, strerror(err));
goto cleanup;
}
@@ -695,5 +935,10 @@ cleanup:
config.pin_dir, strerror(-err));
}
if (config.json_format && json_ctx) {
jsonw_end_array(json_ctx);
jsonw_destroy(&json_ctx);
}
return err != 0;
}


@@ -4,10 +4,37 @@
#include <linux/types.h>
#include <linux/in6.h>
#include <stdbool.h>
#define INGRESS_PROG_SEC "xdp"
#define EGRESS_PROG_SEC "classifier"
/* For the event_type members of rtt_event and flow_event */
#define EVENT_TYPE_FLOW 1
#define EVENT_TYPE_RTT 2
enum __attribute__((__packed__)) flow_event_type {
FLOW_EVENT_NONE,
FLOW_EVENT_OPENING,
FLOW_EVENT_CLOSING
};
enum __attribute__((__packed__)) flow_event_reason {
EVENT_REASON_SYN,
EVENT_REASON_SYN_ACK,
EVENT_REASON_FIRST_OBS_PCKT,
EVENT_REASON_FIN,
EVENT_REASON_FIN_ACK,
EVENT_REASON_RST,
EVENT_REASON_FLOW_TIMEOUT
};
enum __attribute__((__packed__)) flow_event_source {
EVENT_SOURCE_EGRESS,
EVENT_SOURCE_INGRESS,
EVENT_SOURCE_USERSPACE
};
struct bpf_config {
__u64 rate_limit;
};
@@ -39,7 +66,12 @@ struct network_tuple {
};
struct flow_state {
__u64 min_rtt;
__u64 last_timestamp;
__u64 sent_pkts;
__u64 sent_bytes;
__u64 rec_pkts;
__u64 rec_bytes;
__u32 last_id;
__u32 reserved;
};
@@ -49,10 +81,52 @@ struct packet_id {
__u32 identifier; //tsval for TCP packets
};
/*
* An RTT event message that can be passed from the bpf-programs to user-space.
* The initial event_type member is used to allow multiplexing between
* different event types in a single perf buffer. Members up to and including
* flow are identical to other event types.
* Uses explicit padding instead of packing based on recommendations in cilium's
* BPF reference documentation at https://docs.cilium.io/en/stable/bpf/#llvm.
*/
struct rtt_event {
__u64 rtt;
__u64 event_type;
__u64 timestamp;
struct network_tuple flow;
__u32 padding;
__u64 rtt;
__u64 min_rtt;
__u64 sent_pkts;
__u64 sent_bytes;
__u64 rec_pkts;
__u64 rec_bytes;
__u32 reserved;
};
struct flow_event_info {
enum flow_event_type event;
enum flow_event_reason reason;
};
/*
* A flow event message that can be passed from the bpf-programs to user-space.
* The initial event_type member is used to allow multiplexing between
* different event types in a single perf buffer. Members up to and including
* flow are identical to other event types.
*/
struct flow_event {
__u64 event_type;
__u64 timestamp;
struct network_tuple flow;
struct flow_event_info event_info;
enum flow_event_source source;
__u8 reserved;
};
union pping_event {
__u64 event_type;
struct rtt_event rtt_event;
struct flow_event flow_event;
};
#endif


@@ -60,7 +60,7 @@ struct {
__uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
__uint(key_size, sizeof(__u32));
__uint(value_size, sizeof(__u32));
} rtt_events SEC(".maps");
} events SEC(".maps");
// Help functions
@@ -130,14 +130,15 @@ static int parse_tcp_ts(struct tcphdr *tcph, void *data_end, __u32 *tsval,
/*
* Attempts to fetch an identifier for TCP packets, based on the TCP timestamp
* option. If sucessful, identifier will be set to TSval if is_ingress, TSecr
* otherwise, the port-members of saddr and daddr will be set the the TCP source
* and dest, respectively, and 0 will be returned. On failure, -1 will be
* returned. Additionally, if the connection is closing (FIN or RST flag), sets
* flow_closing to true.
* option.
* If successful, identifier will be set to TSval if is_ingress, or TSecr
* otherwise, the port-members of saddr and daddr will be set to the TCP source
* and dest, respectively, fei will be filled appropriately (based on
* SYN/FIN/RST) and 0 will be returned.
* On failure, -1 will be returned.
*/
static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
__be16 *dport, bool *flow_closing,
__be16 *dport, struct flow_event_info *fei,
__u32 *identifier)
{
__u32 tsval, tsecr;
@@ -146,18 +147,27 @@ static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
if (parse_tcphdr(&ctx->nh, ctx->data_end, &tcph) < 0)
return -1;
// Check if connection is closing
if (tcph->fin || tcph->rst) {
*flow_closing = true;
/* bpf_printk("Detected connection closing on %d\n", */
/* ctx->is_egress); //Upsets verifier? */
}
// Do not timestamp pure ACKs
if (ctx->is_egress && ctx->nh.pos - ctx->data >= ctx->pkt_len &&
!tcph->syn)
return -1;
// Check if connection is opening/closing
if (tcph->syn) {
fei->event = FLOW_EVENT_OPENING;
fei->reason =
tcph->ack ? EVENT_REASON_SYN_ACK : EVENT_REASON_SYN;
} else if (tcph->rst) {
fei->event = FLOW_EVENT_CLOSING;
fei->reason = EVENT_REASON_RST;
} else if (!ctx->is_egress && tcph->fin) {
fei->event = FLOW_EVENT_CLOSING;
fei->reason =
tcph->ack ? EVENT_REASON_FIN_ACK : EVENT_REASON_FIN;
} else {
fei->event = FLOW_EVENT_NONE;
}
if (parse_tcp_ts(tcph, ctx->data_end, &tsval, &tsecr) < 0)
return -1; //Possible TODO, fall back on seq/ack instead
@@ -170,7 +180,7 @@ static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
/*
* Attempts to parse the packet limited by the data and data_end pointers,
* to retrieve a protocol dependent packet identifier. If successful, the
* pointed to p_id will be filled with parsed information from the packet
* pointed to p_id and fei will be filled with parsed information from the
* packet, and 0 will be returned. On failure, -1 will be returned.
* If is_egress saddr and daddr will match source and destination of packet,
* respectively, and identifier will be set to the identifier for an outgoing
@@ -179,7 +189,8 @@ static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
* set to the identifier of a response.
*/
static int parse_packet_identifier(struct parsing_context *ctx,
struct packet_id *p_id, bool *flow_closing)
struct packet_id *p_id,
struct flow_event_info *fei)
{
int proto, err;
struct ethhdr *eth;
@@ -201,18 +212,18 @@ static int parse_packet_identifier(struct parsing_context *ctx,
// Parse IPv4/6 header
if (proto == bpf_htons(ETH_P_IP)) {
p_id->flow.ipv = AF_INET;
proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph);
p_id->flow.proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph);
} else if (proto == bpf_htons(ETH_P_IPV6)) {
p_id->flow.ipv = AF_INET6;
proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h);
p_id->flow.proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h);
} else {
return -1;
}
// Add new protocols here
if (proto == IPPROTO_TCP) {
if (p_id->flow.proto == IPPROTO_TCP) {
err = parse_tcp_identifier(ctx, &saddr->port, &daddr->port,
flow_closing, &p_id->identifier);
fei, &p_id->identifier);
if (err)
return -1;
} else {
@@ -230,6 +241,32 @@ static int parse_packet_identifier(struct parsing_context *ctx,
return 0;
}
/*
* Returns the number of unparsed bytes left in the packet (bytes after nh.pos)
*/
static __u32 remaining_pkt_payload(struct parsing_context *ctx)
{
// pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer)
// data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register
__u32 parsed_bytes = ctx->nh.pos - ctx->data;
return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0;
}
/*
* Fills in event_type, timestamp, flow, source and reserved.
* Does not fill in the event_info.
*/
static void fill_flow_event(struct flow_event *fe, __u64 timestamp,
struct network_tuple *flow,
enum flow_event_source source)
{
fe->event_type = EVENT_TYPE_FLOW;
fe->timestamp = timestamp;
__builtin_memcpy(&fe->flow, flow, sizeof(struct network_tuple));
fe->source = source;
fe->reserved = 0; // Make sure it's initialized
}
// Programs
// TC-BPF for parsing packet identifier from egress traffic and add to map
@@ -237,7 +274,8 @@ SEC(EGRESS_PROG_SEC)
int pping_egress(struct __sk_buff *skb)
{
struct packet_id p_id = { 0 };
__u64 p_ts;
struct flow_event fe;
__u64 now;
struct parsing_context pctx = {
.data = (void *)(long)skb->data,
.data_end = (void *)(long)skb->data_end,
@@ -245,63 +283,56 @@ int pping_egress(struct __sk_buff *skb)
.nh = { .pos = pctx.data },
.is_egress = true,
};
bool flow_closing = false;
struct flow_state *f_state;
struct flow_state new_state = { 0 };
if (parse_packet_identifier(&pctx, &p_id, &flow_closing) < 0)
if (parse_packet_identifier(&pctx, &p_id, &fe.event_info) < 0)
goto out;
// Delete flow and create no timestamp entry if flow is closing
if (flow_closing) {
bpf_map_delete_elem(&flow_state, &p_id.flow);
now = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
// Flow closing - try to delete flow state and push closing-event
if (fe.event_info.event == FLOW_EVENT_CLOSING) {
if (f_state) {
bpf_map_delete_elem(&flow_state, &p_id.flow);
fill_flow_event(&fe, now, &p_id.flow,
EVENT_SOURCE_EGRESS);
bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU,
&fe, sizeof(fe));
}
goto out;
}
// Check flow state
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
if (!f_state) { // No previous state - attempt to create it
// No previous state - attempt to create it and push flow-opening event
if (!f_state) {
bpf_map_update_elem(&flow_state, &p_id.flow, &new_state,
BPF_NOEXIST);
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
if (!f_state)
if (!f_state) // Creation failed
goto out;
if (fe.event_info.event != FLOW_EVENT_OPENING) {
fe.event_info.event = FLOW_EVENT_OPENING;
fe.event_info.reason = EVENT_REASON_FIRST_OBS_PCKT;
}
fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_EGRESS);
bpf_perf_event_output(skb, &events, BPF_F_CURRENT_CPU, &fe,
sizeof(fe));
}
// Check if identifier is new
/* The gap between checking and updating last_id may cause concurrency
* issues where multiple packets may simultaneously think they are the
* first with a new identifier. As long as all of the identifiers are
* the same though, only one should be able to create a timestamp entry.
f_state->sent_pkts++;
f_state->sent_bytes += remaining_pkt_payload(&pctx);
* A bigger issue is that older identifiers (for example due to
* out-of-order packets) may pass this check and update the current
* identifier to an old one. This means that both the packet with the
* old identifier itself as well the next packet with the current
* identifier may be considered packets with new identifiers (even if
* both have been seen before). For TCP timestamps this could be
* prevented by changing the check to '>=' instead, but it may not be
* suitable for other protocols, such as QUIC and its spinbit.
*
* For now, just hope that the rate limit saves us from creating an
* incorrect timestamp. That may however also fail, either due to the
* to it happening in a time it's not limited by rate sampling, or
* because of rate check failing due to concurrency issues.
*/
// Check if identifier is new
if (f_state->last_id == p_id.identifier)
goto out;
f_state->last_id = p_id.identifier;
// Check rate-limit
/*
* The window between checking and updating last_timestamp may cause
* concurrency issues, where multiple packets simultaneously pass the
* rate limit. However, as long as they have the same identifier, only
* a single timestamp entry should successfully be created.
*/
p_ts = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
if (p_ts < f_state->last_timestamp ||
p_ts - f_state->last_timestamp < config.rate_limit)
if (now < f_state->last_timestamp ||
now - f_state->last_timestamp < config.rate_limit)
goto out;
/*
@@ -310,8 +341,8 @@ int pping_egress(struct __sk_buff *skb)
* the next available map slot somewhat fairer between heavy and sparse
* flows.
*/
f_state->last_timestamp = p_ts;
bpf_map_update_elem(&packet_ts, &p_id, &p_ts, BPF_NOEXIST);
f_state->last_timestamp = now;
bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);
out:
return BPF_OK;
@@ -323,7 +354,9 @@ int pping_ingress(struct xdp_md *ctx)
{
struct packet_id p_id = { 0 };
__u64 *p_ts;
struct rtt_event event = { 0 };
struct flow_event fe;
struct rtt_event re = { 0 };
struct flow_state *f_state;
struct parsing_context pctx = {
.data = (void *)(long)ctx->data,
.data_end = (void *)(long)ctx->data_end,
@@ -331,30 +364,51 @@ int pping_ingress(struct xdp_md *ctx)
.nh = { .pos = pctx.data },
.is_egress = false,
};
bool flow_closing = false;
__u64 now;
if (parse_packet_identifier(&pctx, &p_id, &flow_closing) < 0)
if (parse_packet_identifier(&pctx, &p_id, &fe.event_info) < 0)
goto out;
// Delete flow, but allow final attempt at RTT calculation
if (flow_closing)
bpf_map_delete_elem(&flow_state, &p_id.flow);
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
if (!f_state)
goto out;
f_state->rec_pkts++;
f_state->rec_bytes += remaining_pkt_payload(&pctx);
now = bpf_ktime_get_ns();
p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (!p_ts)
goto out;
if (!p_ts || now < *p_ts)
goto validflow_out;
event.rtt = bpf_ktime_get_ns() - *p_ts;
/*
* Attempt to delete timestamp entry as soon as RTT is calculated.
* But could have potential concurrency issue where multiple packets
* manage to match against the identifier before it can be deleted.
*/
re.rtt = now - *p_ts;
// Delete timestamp entry as soon as RTT is calculated
bpf_map_delete_elem(&packet_ts, &p_id);
__builtin_memcpy(&event.flow, &p_id.flow, sizeof(struct network_tuple));
bpf_perf_event_output(ctx, &rtt_events, BPF_F_CURRENT_CPU, &event,
sizeof(event));
if (f_state->min_rtt == 0 || re.rtt < f_state->min_rtt)
f_state->min_rtt = re.rtt;
re.event_type = EVENT_TYPE_RTT;
re.timestamp = now;
re.min_rtt = f_state->min_rtt;
re.sent_pkts = f_state->sent_pkts;
re.sent_bytes = f_state->sent_bytes;
re.rec_pkts = f_state->rec_pkts;
re.rec_bytes = f_state->rec_bytes;
// Push event to perf-buffer
__builtin_memcpy(&re.flow, &p_id.flow, sizeof(struct network_tuple));
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &re, sizeof(re));
validflow_out:
// Wait with deleting flow until having pushed final RTT message
if (fe.event_info.event == FLOW_EVENT_CLOSING && f_state) {
bpf_map_delete_elem(&flow_state, &p_id.flow);
fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_INGRESS);
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe,
sizeof(fe));
}
out:
return XDP_PASS;