pping: Do both timestamping and matching on ingress and egress

Perform both timestamping and matching on both ingress and egress
hooks. This makes it more similar to Kathie's pping, allowing the tool
to capture RTTs in both directions when deployed on just a single
interface.

Like Kathie's pping, filter out RTTs for packets going to the local machine
by default (such RTTs would only include local processing delay). This
behavior can be disabled by passing the -l/--include-local option.

As packets that are timestamped on ingress and matched on egress will
include the local machine's processing delay, add a "match_on_egress"
member to the JSON output so that RTTs which include the local processing
delay can be differentiated from those which don't.

Finally, report the source and destination addresses from the perspective
of the reply packet, rather than the timestamped packet, to be
consistent with Kathie's pping.

Overall, refactor large parts of pping_kern so that timestamping and
matching, as well as updating both the flow and reverse flow and handling
the flow-events related to them, can be done in one go. Also update the
README to reflect these changes.

Signed-off-by: Simon Sundberg <simon.sundberg@kau.se>
Simon Sundberg
2022-02-10 16:16:24 +01:00
parent 928a4144a9
commit 8a8f538759
5 changed files with 446 additions and 302 deletions

README.md

@@ -13,20 +13,21 @@ spinbit and DNS queries. See the [TODO-list](./TODO.md) for more potential
features (which may or may not ever get implemented).
The fundamental logic of pping is to timestamp a pseudo-unique identifier for
outgoing packets, and then look for matches in the incoming packets. If a match
is found, the RTT is simply calculated as the time difference between the
current time and the stored timestamp.
packets, and then look for matches in the reply packets. If a match is found,
the RTT is simply calculated as the time difference between the current time and
the stored timestamp.
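
To make that principle concrete, the following is a minimal, standalone C
sketch of the timestamp-and-match idea: store a timestamp per identifier, and
on the reply compute the RTT as the difference from the stored value. It is
purely illustrative (fixed-size table, single process, hypothetical names) and
is not the BPF implementation found in pping_kern.c.

```c
#define _POSIX_C_SOURCE 199309L
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#define TABLE_SIZE 256

/* One slot per identifier (e.g. a TCP TSval) and the time it was first seen */
struct ts_entry {
	uint32_t identifier;
	uint64_t ts_ns;
	int used;
};

static struct ts_entry table[TABLE_SIZE];

static uint64_t now_ns(void)
{
	struct timespec t;

	clock_gettime(CLOCK_MONOTONIC, &t);
	return (uint64_t)t.tv_sec * 1000000000ULL + t.tv_nsec;
}

/* Timestamp an outgoing identifier, keeping only the first occurrence */
static void timestamp_id(uint32_t id)
{
	struct ts_entry *e = &table[id % TABLE_SIZE];

	if (e->used && e->identifier == id)
		return;
	e->identifier = id;
	e->ts_ns = now_ns();
	e->used = 1;
}

/* Match an identifier echoed back in a reply; returns the RTT in ns, or 0 */
static uint64_t match_id(uint32_t echoed_id)
{
	struct ts_entry *e = &table[echoed_id % TABLE_SIZE];

	if (!e->used || e->identifier != echoed_id)
		return 0;
	e->used = 0; /* delete the entry once an RTT has been calculated */
	return now_ns() - e->ts_ns;
}

int main(void)
{
	timestamp_id(1234);            /* "outgoing" packet carries id 1234 */
	uint64_t rtt = match_id(1234); /* "reply" echoes the same id back   */

	printf("RTT: %llu ns\n", (unsigned long long)rtt);
	return 0;
}
```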
This tool, just as Kathie's original pping implementation, uses TCP timestamps
as identifiers for TCP traffic. For outgoing packets, the TSval (which is a
timestamp in and off itself) is timestamped. Incoming packets are then parsed
for the TSecr, which are the echoed TSval values from the receiver. The TCP
timestamps are not necessarily unique for every packet (they have a limited
update frequency, appears to be 1000 Hz for modern Linux systems), so only the
first instance of an identifier is timestamped, and matched against the first
incoming packet with the identifier. The mechanism to ensure only the first
packet is timestamped and matched differs from the one in Kathie's pping, and is
further described in [SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
as identifiers for TCP traffic. The TSval (which is a timestamp in and of
itself) is used as an identifier and timestamped. Reply packets in the reverse
flow are then parsed for the TSecr, which are the echoed TSval values from the
receiver. The TCP timestamps are not necessarily unique for every packet (they
have a limited update frequency, which appears to be 1000 Hz on modern Linux
systems), so only the first instance of an identifier is timestamped, and
matched against the first incoming packet with a matching reply identifier. The
mechanism to ensure only the first packet is timestamped and matched differs
from the one in Kathie's pping, and is further described in
[SAMPLING_DESIGN](./SAMPLING_DESIGN.md).
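
As an illustration of what extracting those values involves, below is a
standalone C sketch that walks a buffer of TCP options until it finds the
timestamp option (kind 8, length 10) and reads TSval and TSecr from it. The
real parse_tcp_ts() in pping_kern.c operates on verifier-checked packet
pointers with a bounded loop and keeps the values in network byte order, so
this is only a rough outline; the function name and the conversion to host
byte order are illustrative choices, not taken from the source.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <arpa/inet.h>

/* Scan TCP options for the timestamp option (kind 8, length 10) and extract
 * TSval and TSecr in host byte order. Returns 0 on success, -1 otherwise. */
static int find_tcp_timestamps(const uint8_t *opts, size_t len,
			       uint32_t *tsval, uint32_t *tsecr)
{
	size_t i = 0;

	while (i < len) {
		uint8_t kind = opts[i];

		if (kind == 0)   /* end of option list */
			break;
		if (kind == 1) { /* NOP, single byte */
			i++;
			continue;
		}
		if (i + 1 >= len)
			break;
		uint8_t optlen = opts[i + 1];

		if (optlen < 2 || i + optlen > len)
			break;   /* malformed option */
		if (kind == 8 && optlen == 10) {
			memcpy(tsval, &opts[i + 2], 4);
			memcpy(tsecr, &opts[i + 6], 4);
			*tsval = ntohl(*tsval);
			*tsecr = ntohl(*tsecr);
			return 0;
		}
		i += optlen;
	}
	return -1;
}

int main(void)
{
	/* Two NOPs followed by a timestamp option */
	uint8_t opts[] = { 1, 1, 8, 10,
			   0x00, 0x00, 0x12, 0x34,   /* TSval  */
			   0x00, 0x00, 0xab, 0xcd }; /* TSecr  */
	uint32_t tsval, tsecr;

	if (find_tcp_timestamps(opts, sizeof(opts), &tsval, &tsecr) == 0)
		printf("TSval=%u TSecr=%u\n", tsval, tsecr);
	return 0;
}
```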
For ICMP echo, it uses the echo identifier as port numbers, and the echo sequence
number as the identifier to match against. Linux systems will typically use different
@@ -48,7 +49,7 @@ single line per event.
An example of the format is provided below:
```shell
16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src
16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest
16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528
16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528
@@ -96,7 +97,7 @@ An example of a (pretty-printed) flow-event is provided below:
"protocol": "TCP",
"flow_event": "opening",
"reason": "SYN-ACK",
"triggered_by": "src"
"triggered_by": "dest"
}
```
@@ -114,7 +115,8 @@ An example of a (pretty-printed) RTT-event is provided below:
"sent_packets": 9393,
"sent_bytes": 492457296,
"rec_packets": 5922,
"rec_bytes": 37
"rec_bytes": 37,
"match_on_egress": false
}
```
@@ -123,22 +125,20 @@ An example of a (pretty-printed) RTT-event is provided below:
### Files:
- **pping.c:** Userspace program that loads and attaches the BPF programs, pulls
the perf-buffer `rtt_events` to print out RTT messages and periodically cleans
the perf-buffer `events` to print out RTT messages and periodically cleans
up the hash-maps from old entries. Also passes user options to the BPF
programs by setting a "global variable" (stored in the program's .rodata
section).
- **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and
XDP (ingress), as well as several common functions, a global constant `config`
(set from userspace) and map definitions. The tc program `pping_egress()`
parses outgoing packets for identifiers. If an identifier is found and the
sampling strategy allows it, a timestamp for the packet is created in
`packet_ts`. The XDP program `pping_ingress()` parses incomming packets for an
identifier. If found, it looks up the `packet_ts` map for a match on the
reverse flow (to match source/dest on egress). If there is a match, it
calculates the RTT from the stored timestamp and deletes the entry. The
calculated RTT (together with the flow-tuple) is pushed to the perf-buffer
`events`. Both `pping_egress()` and `pping_ingress` can also push flow-events
to the `events` buffer.
- **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and
ingress (XDP or tc), as well as several common functions, a global constant
`config` (set from userspace) and map definitions. Essentially the same pping
program is loaded on both ingress and egress. All packets are parsed for both
an identifier that can be used to create a timestamp entry in `packet_ts`, and a
reply identifier that can be used to match the packet with a previously
timestamped one in the reverse flow. If a match is found, an RTT is calculated
and an RTT-event is pushed to userspace through the perf-buffer `events`. For
each packet with a valid identifier, the program also keeps track of and
updates the state of the flow and the reverse flow, stored in the `flow_state` map.
- **pping.h:** Common header file included by `pping.c` and
`pping_kern.c`. Contains some common structs used by both (which are part of
the maps).
@@ -146,13 +146,12 @@ An example of a (pretty-printed) RTT-event is provided below:
### BPF Maps:
- **flow_state:** A hash-map storing some basic state for each flow, such as the
last seen identifier for the flow and when the last timestamp entry for the
flow was created. Entries are created by `pping_egress()`, and can be updated
or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries
are eventually removed by `pping.c`.
flow was created. Entries are created, updated and deleted by the BPF pping
programs. Leftover entries are eventually removed by userspace (`pping.c`).
- **packet_ts:** A hash-map storing a timestamp for a specific packet
identifier. Entries are created by `pping_egress()` and removed by
`pping_ingress()` if a match is found. Leftover entries are eventually removed
by `pping.c`.
identifier. Entries are created by the BPF pping program if a valid identifier
is found, and removed if a match is found. Leftover entries are eventually
removed by userspace (`pping.c`).
- **events:** A perf-buffer used by the BPF programs to push flow or RTT events
to `pping.c`, which continuously polls the map and prints them out.
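
The actual map definitions are in pping_kern.c and are not part of this diff.
As a rough illustration, a BTF-style libbpf declaration for a map like
`packet_ts` could look as follows; the max_entries value and the exact
declaration are assumptions, not copied from the source:

```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>
#include "pping.h" /* assumed to provide struct packet_id */

/* Hash map from packet identifier to the time (in ns) it was observed */
struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 16384); /* illustrative size */
	__type(key, struct packet_id);
	__type(value, __u64);
} packet_ts SEC(".maps");
```

Entries would then be created with `bpf_map_update_elem(&packet_ts, &pid, &now,
BPF_NOEXIST)` and matched with `bpf_map_lookup_elem()`, as the pping_kern.c
diff further down shows.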
@@ -222,9 +221,9 @@ additional map space and report some additional RTT(s) more than expected
(however the reported RTTs should still be correct).
If the packets have the same identifier, they must first have managed to bypass
the previous check for unique identifiers (see [previous point](#Tracking last
seen identifier)), and only one of them will be able to successfully store a
timestamp entry.
the previous check for unique identifiers (see [previous
point](#tracking-last-seen-identifier)), and only one of them will be able to
successfully store a timestamp entry.
#### Matching against stored timestamps
The XDP/ingress program could potentially match multiple concurrent packets with
@@ -246,8 +245,8 @@ if this is the lowest RTT seen so far for the flow. If multiple RTTs are
calculated concurrently, then several could pass this check concurrently and
there may be a lost update. It should only be possible for multiple RTTs to be
calculated concurrently in case either the [timestamp rate-limit was
bypassed](#Rate-limiting new timestamps) or [multiple packets managed to match
against the same timestamp](#Matching against stored timestamps).
bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match
against the same timestamp](#matching-against-stored-timestamps).
It's worth noting that with sampling the reported minimum-RTT is only an
estimate anyway (it may never calculate the RTT for the packet with the true minimum

(Binary image file changed, not shown: 54 KiB before, 45 KiB after)

pping.c

@@ -15,11 +15,8 @@ static const char *__doc__ =
#include <unistd.h>
#include <getopt.h>
#include <stdbool.h>
#include <limits.h>
#include <signal.h> // For detecting Ctrl-C
#include <sys/resource.h> // For setting rlmit
#include <sys/wait.h>
#include <sys/stat.h>
#include <time.h>
#include <pthread.h>
@@ -108,6 +105,7 @@ static const struct option long_options[] = {
{ "ingress-hook", required_argument, NULL, 'I' }, // Use tc or XDP as ingress hook
{ "tcp", no_argument, NULL, 'T' }, // Calculate and report RTTs for TCP traffic (with TCP timestamps)
{ "icmp", no_argument, NULL, 'C' }, // Calculate and report RTTs for ICMP echo-reply traffic
{ "include-local", no_argument, NULL, 'l' }, // Also report "internal" RTTs
{ 0, 0, NULL, 0 }
};
@@ -172,11 +170,12 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
double rate_limit_ms, cleanup_interval_s, rtt_rate;
config->ifindex = 0;
config->bpf_config.localfilt = true;
config->force = false;
config->bpf_config.track_tcp = false;
config->bpf_config.track_icmp = false;
while ((opt = getopt_long(argc, argv, "hfTCi:r:R:t:c:F:I:", long_options,
while ((opt = getopt_long(argc, argv, "hflTCi:r:R:t:c:F:I:", long_options,
NULL)) != -1) {
switch (opt) {
case 'i':
@@ -257,6 +256,9 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config)
return -EINVAL;
}
break;
case 'l':
config->bpf_config.localfilt = false;
break;
case 'f':
config->force = true;
config->xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
@@ -504,9 +506,9 @@ static bool flow_timeout(void *key_ptr, void *val_ptr, __u64 now)
if (print_event_func) {
fe.event_type = EVENT_TYPE_FLOW;
fe.timestamp = now;
fe.flow = *(struct network_tuple *)key_ptr;
fe.event_info.event = FLOW_EVENT_CLOSING;
fe.event_info.reason = EVENT_REASON_FLOW_TIMEOUT;
reverse_flow(&fe.flow, key_ptr);
fe.flow_event_type = FLOW_EVENT_CLOSING;
fe.reason = EVENT_REASON_FLOW_TIMEOUT;
fe.source = EVENT_SOURCE_USERSPACE;
print_event_func(NULL, 0, &fe, sizeof(fe));
}
@@ -657,6 +659,7 @@ static const char *flowevent_to_str(enum flow_event_type fe)
case FLOW_EVENT_OPENING:
return "opening";
case FLOW_EVENT_CLOSING:
case FLOW_EVENT_CLOSING_BOTH:
return "closing";
default:
return "unknown";
@@ -674,8 +677,6 @@ static const char *eventreason_to_str(enum flow_event_reason er)
return "first observed packet";
case EVENT_REASON_FIN:
return "FIN";
case EVENT_REASON_FIN_ACK:
return "FIN-ACK";
case EVENT_REASON_RST:
return "RST";
case EVENT_REASON_FLOW_TIMEOUT:
@@ -688,9 +689,9 @@ static const char *eventreason_to_str(enum flow_event_reason er)
static const char *eventsource_to_str(enum flow_event_source es)
{
switch (es) {
case EVENT_SOURCE_EGRESS:
case EVENT_SOURCE_PKT_SRC:
return "src";
case EVENT_SOURCE_INGRESS:
case EVENT_SOURCE_PKT_DEST:
return "dest";
case EVENT_SOURCE_USERSPACE:
return "userspace-cleanup";
@@ -740,8 +741,8 @@ static void print_event_standard(void *ctx, int cpu, void *data,
printf(" %s ", proto_to_str(e->rtt_event.flow.proto));
print_flow_ppvizformat(stdout, &e->flow_event.flow);
printf(" %s due to %s from %s\n",
flowevent_to_str(e->flow_event.event_info.event),
eventreason_to_str(e->flow_event.event_info.reason),
flowevent_to_str(e->flow_event.flow_event_type),
eventreason_to_str(e->flow_event.reason),
eventsource_to_str(e->flow_event.source));
}
}
@@ -790,15 +791,16 @@ static void print_rttevent_fields_json(json_writer_t *ctx,
jsonw_u64_field(ctx, "sent_bytes", re->sent_bytes);
jsonw_u64_field(ctx, "rec_packets", re->rec_pkts);
jsonw_u64_field(ctx, "rec_bytes", re->rec_bytes);
jsonw_bool_field(ctx, "match_on_egress", re->match_on_egress);
}
static void print_flowevent_fields_json(json_writer_t *ctx,
const struct flow_event *fe)
{
jsonw_string_field(ctx, "flow_event",
flowevent_to_str(fe->event_info.event));
flowevent_to_str(fe->flow_event_type));
jsonw_string_field(ctx, "reason",
eventreason_to_str(fe->event_info.reason));
eventreason_to_str(fe->reason));
jsonw_string_field(ctx, "triggered_by", eventsource_to_str(fe->source));
}

pping.h

@@ -18,7 +18,8 @@ typedef __u64 fixpoint64;
enum __attribute__((__packed__)) flow_event_type {
FLOW_EVENT_NONE,
FLOW_EVENT_OPENING,
FLOW_EVENT_CLOSING
FLOW_EVENT_CLOSING,
FLOW_EVENT_CLOSING_BOTH
};
enum __attribute__((__packed__)) flow_event_reason {
@@ -26,14 +27,13 @@ enum __attribute__((__packed__)) flow_event_reason {
EVENT_REASON_SYN_ACK,
EVENT_REASON_FIRST_OBS_PCKT,
EVENT_REASON_FIN,
EVENT_REASON_FIN_ACK,
EVENT_REASON_RST,
EVENT_REASON_FLOW_TIMEOUT
};
enum __attribute__((__packed__)) flow_event_source {
EVENT_SOURCE_EGRESS,
EVENT_SOURCE_INGRESS,
EVENT_SOURCE_PKT_SRC,
EVENT_SOURCE_PKT_DEST,
EVENT_SOURCE_USERSPACE
};
@@ -43,7 +43,8 @@ struct bpf_config {
bool use_srtt;
bool track_tcp;
bool track_icmp;
__u8 reserved[5];
bool localfilt;
__u32 reserved;
};
/*
@@ -108,12 +109,8 @@ struct rtt_event {
__u64 sent_bytes;
__u64 rec_pkts;
__u64 rec_bytes;
__u32 reserved;
};
struct flow_event_info {
enum flow_event_type event;
enum flow_event_reason reason;
bool match_on_egress;
__u8 reserved[7];
};
/*
@@ -126,7 +123,8 @@ struct flow_event {
__u64 event_type;
__u64 timestamp;
struct network_tuple flow;
struct flow_event_info event_info;
enum flow_event_type flow_event_type;
enum flow_event_reason reason;
enum flow_event_source source;
__u8 reserved;
};
@@ -137,4 +135,19 @@ union pping_event {
struct flow_event flow_event;
};
/*
* Convenience function for getting the corresponding reverse flow.
* PPing needs to keep track of flow in both directions, and sometimes
* also needs to reverse the flow to report the "correct" (consistent
* with Kathie's PPing) src and dest address.
*/
static void reverse_flow(struct network_tuple *dest, struct network_tuple *src)
{
dest->ipv = src->ipv;
dest->proto = src->proto;
dest->saddr = src->daddr;
dest->daddr = src->saddr;
dest->reserved = 0;
}
#endif

pping_kern.c

@@ -25,6 +25,9 @@
#define AF_INET6 10
#define MAX_TCP_OPTIONS 10
// Mask for IPv6 flowlabel + traffic class - used in fib lookup
#define IPV6_FLOWINFO_MASK __cpu_to_be32(0x0FFFFFFF)
/*
* This struct keeps track of the data and data_end pointers from the xdp_md or
* __skb_buff contexts, as well as a currently parsed to position kept in nh.
@@ -33,11 +36,43 @@
* header encloses.
*/
struct parsing_context {
void *data; //Start of eth hdr
void *data_end; //End of safe acessible area
struct hdr_cursor nh; //Position to parse next
__u32 pkt_len; //Full packet length (headers+data)
bool is_egress; //Is packet on egress or ingress?
void *data; // Start of eth hdr
void *data_end; // End of safe accessible area
struct hdr_cursor nh; // Position to parse next
__u32 pkt_len; // Full packet length (headers+data)
__u32 ingress_ifindex; // Interface packet arrived on
bool is_egress; // Is packet on egress or ingress?
};
/*
* Struct filled in by parse_packet_identifier.
*
* Note: As long as parse_packet_identifier is successful, the flow-parts of
* pid and reply_pid should be valid, regardless of the values of pid_valid
* and reply_pid_valid. The *pid_valid members are there to indicate that the
* identifier part of *pid is valid and can be used for timestamping/lookup.
* The reason for not keeping the flow parts as entirely separate members
* is to save some performance by avoiding a copy for lookup/insertion
* in the packet_ts map.
*/
struct packet_info {
union {
struct iphdr *iph;
struct ipv6hdr *ip6h;
};
union {
struct icmphdr *icmph;
struct icmp6hdr *icmp6h;
struct tcphdr *tcph;
};
__u64 time; // Arrival time of packet
__u32 payload; // Size of packet data (excluding headers)
struct packet_id pid; // identifier to timestamp (ex. TSval)
struct packet_id reply_pid; // identifier to match against (ex. TSecr)
bool pid_valid; // identifier can be used to timestamp packet
bool reply_pid_valid; // reply_identifier can be used to match packet
enum flow_event_type event_type; // flow event triggered by packet
enum flow_event_reason event_reason; // reason for triggering flow event
};
char _license[] SEC("license") = "GPL";
@@ -70,13 +105,24 @@ struct {
/*
* Maps an IPv4 address into an IPv6 address according to RFC 4291 sec 2.5.5.2
*/
static void map_ipv4_to_ipv6(__be32 ipv4, struct in6_addr *ipv6)
static void map_ipv4_to_ipv6(struct in6_addr *ipv6, __be32 ipv4)
{
__builtin_memset(&ipv6->in6_u.u6_addr8[0], 0x00, 10);
__builtin_memset(&ipv6->in6_u.u6_addr8[10], 0xff, 2);
ipv6->in6_u.u6_addr32[3] = ipv4;
}
/*
* Returns the number of unparsed bytes left in the packet (bytes after nh.pos)
*/
static __u32 remaining_pkt_payload(struct parsing_context *ctx)
{
// pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer)
// data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register
__u32 parsed_bytes = ctx->nh.pos - ctx->data;
return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0;
}
/*
* Parses the TSval and TSecr values from the TCP options field. If successful
* the TSval and TSecr values will be stored at tsval and tsecr (in network
@@ -134,198 +180,191 @@ static int parse_tcp_ts(struct tcphdr *tcph, void *data_end, __u32 *tsval,
/*
* Attempts to fetch an identifier for TCP packets, based on the TCP timestamp
* option.
* If successful, identifier will be set to TSval if is_ingress, or TSecr
* otherwise, the port-members of saddr and daddr will be set to the TCP source
* and dest, respectively, fei will be filled appropriately (based on
* SYN/FIN/RST) and 0 will be returned.
* On failure, -1 will be returned.
*
* Will use the TSval as pid and TSecr as reply_pid, and the TCP source and dest
* as port numbers.
*
* If successful, the pid (identifier + flow.port), reply_pid, pid_valid,
* reply_pid_valid, event_type and event_reason members of p_info will be set
* appropriately and 0 will be returned.
* On failure -1 will be returned (no guarantees on values set in p_info).
*/
static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport,
__be16 *dport, struct flow_event_info *fei,
__u32 *identifier)
static int parse_tcp_identifier(struct parsing_context *pctx,
struct packet_info *p_info)
{
__u32 tsval, tsecr;
struct tcphdr *tcph;
if (parse_tcphdr(&ctx->nh, ctx->data_end, &tcph) < 0)
if (parse_tcphdr(&pctx->nh, pctx->data_end, &p_info->tcph) < 0)
return -1;
// Do not timestamp pure ACKs
if (ctx->is_egress && ctx->nh.pos - ctx->data >= ctx->pkt_len &&
!tcph->syn)
return -1;
// Do not match on non-ACKs (TSecr not valid)
if (!ctx->is_egress && !tcph->ack)
return -1;
// Check if connection is opening/closing
if (tcph->syn) {
fei->event = FLOW_EVENT_OPENING;
fei->reason =
tcph->ack ? EVENT_REASON_SYN_ACK : EVENT_REASON_SYN;
} else if (tcph->rst) {
fei->event = FLOW_EVENT_CLOSING;
fei->reason = EVENT_REASON_RST;
} else if (!ctx->is_egress && tcph->fin) {
fei->event = FLOW_EVENT_CLOSING;
fei->reason =
tcph->ack ? EVENT_REASON_FIN_ACK : EVENT_REASON_FIN;
} else {
fei->event = FLOW_EVENT_NONE;
}
if (parse_tcp_ts(tcph, ctx->data_end, &tsval, &tsecr) < 0)
if (parse_tcp_ts(p_info->tcph, pctx->data_end, &p_info->pid.identifier,
&p_info->reply_pid.identifier) < 0)
return -1; //Possible TODO, fall back on seq/ack instead
*sport = tcph->source;
*dport = tcph->dest;
*identifier = ctx->is_egress ? tsval : tsecr;
p_info->pid.flow.saddr.port = p_info->tcph->source;
p_info->pid.flow.daddr.port = p_info->tcph->dest;
// Do not timestamp pure ACKs (no payload)
p_info->pid_valid =
pctx->nh.pos - pctx->data < pctx->pkt_len || p_info->tcph->syn;
// Do not match on non-ACKs (TSecr not valid)
p_info->reply_pid_valid = p_info->tcph->ack;
// Check if connection is opening/closing
if (p_info->tcph->rst) {
p_info->event_type = FLOW_EVENT_CLOSING_BOTH;
p_info->event_reason = EVENT_REASON_RST;
} else if (p_info->tcph->fin) {
p_info->event_type = FLOW_EVENT_CLOSING;
p_info->event_reason = EVENT_REASON_FIN;
} else if (p_info->tcph->syn) {
p_info->event_type = FLOW_EVENT_OPENING;
p_info->event_reason = p_info->tcph->ack ?
EVENT_REASON_SYN_ACK :
EVENT_REASON_SYN;
} else {
p_info->event_type = FLOW_EVENT_NONE;
}
return 0;
}
/*
* Attemps to fetch an identifier for an ICMPv6 header, based on the echo
* Attempts to fetch an identifier for an ICMPv6 header, based on the echo
* request/reply sequence number.
* If successful, identifer will be set to the echo sequence number, both
* sport and dport will be set to the echo identifier, and 0 will be returned.
* On failure, -1 will be returned.
* Note: Will store the 16-bit echo sequence number in network byte order in
* the 32-bit identifier.
*
* Will use the echo sequence number as pid/reply_pid and the echo identifier
* as port numbers. Echo requests will only generate a valid pid and echo
* replies will only generate a valid reply_pid.
*
* If successful, the pid (identifier + flow.port), reply_pid, pid_valid,
* reply_pid_valid and event_type of p_info will be set appropriately and 0
* will be returned.
* On failure, -1 will be returned (no guarantees on p_info members).
*
* Note: Will store the 16-bit sequence number in network byte order
* in the 32-bit (reply_)pid.identifier.
*/
static int parse_icmp6_identifier(struct parsing_context *ctx, __u16 *sport,
__u16 *dport, struct flow_event_info *fei,
__u32 *identifier)
static int parse_icmp6_identifier(struct parsing_context *pctx,
struct packet_info *p_info)
{
struct icmp6hdr *icmp6h;
if (parse_icmp6hdr(&ctx->nh, ctx->data_end, &icmp6h) < 0)
if (parse_icmp6hdr(&pctx->nh, pctx->data_end, &p_info->icmp6h) < 0)
return -1;
if (ctx->is_egress && icmp6h->icmp6_type != ICMPV6_ECHO_REQUEST)
return -1;
if (!ctx->is_egress && icmp6h->icmp6_type != ICMPV6_ECHO_REPLY)
return -1;
if (icmp6h->icmp6_code != 0)
if (p_info->icmp6h->icmp6_code != 0)
return -1;
fei->event = FLOW_EVENT_NONE;
*sport = icmp6h->icmp6_identifier;
*dport = *sport;
*identifier = icmp6h->icmp6_sequence;
if (p_info->icmp6h->icmp6_type == ICMPV6_ECHO_REQUEST) {
p_info->pid.identifier = p_info->icmp6h->icmp6_sequence;
p_info->pid_valid = true;
p_info->reply_pid_valid = false;
} else if (p_info->icmp6h->icmp6_type == ICMPV6_ECHO_REPLY) {
p_info->reply_pid.identifier = p_info->icmp6h->icmp6_sequence;
p_info->reply_pid_valid = true;
p_info->pid_valid = false;
} else {
return -1;
}
p_info->event_type = FLOW_EVENT_NONE;
p_info->pid.flow.saddr.port = p_info->icmp6h->icmp6_identifier;
p_info->pid.flow.daddr.port = p_info->pid.flow.saddr.port;
return 0;
}
/*
* Same as parse_icmp6_identifier, but for an ICMP(v4) header instead.
*/
static int parse_icmp_identifier(struct parsing_context *ctx, __u16 *sport,
__u16 *dport, struct flow_event_info *fei,
__u32 *identifier)
static int parse_icmp_identifier(struct parsing_context *pctx,
struct packet_info *p_info)
{
struct icmphdr *icmph;
if (parse_icmphdr(&ctx->nh, ctx->data_end, &icmph) < 0)
if (parse_icmphdr(&pctx->nh, pctx->data_end, &p_info->icmph) < 0)
return -1;
if (ctx->is_egress && icmph->type != ICMP_ECHO)
return -1;
if (!ctx->is_egress && icmph->type != ICMP_ECHOREPLY)
return -1;
if (icmph->code != 0)
if (p_info->icmph->code != 0)
return -1;
fei->event = FLOW_EVENT_NONE;
*sport = icmph->un.echo.id;
*dport = *sport;
*identifier = icmph->un.echo.sequence;
if (p_info->icmph->type == ICMP_ECHO) {
p_info->pid.identifier = p_info->icmph->un.echo.sequence;
p_info->pid_valid = true;
p_info->reply_pid_valid = false;
} else if (p_info->icmph->type == ICMP_ECHOREPLY) {
p_info->reply_pid.identifier = p_info->icmph->un.echo.sequence;
p_info->reply_pid_valid = true;
p_info->pid_valid = false;
} else {
return -1;
}
p_info->event_type = FLOW_EVENT_NONE;
p_info->pid.flow.saddr.port = p_info->icmph->un.echo.id;
p_info->pid.flow.daddr.port = p_info->pid.flow.saddr.port;
return 0;
}
/*
* Attempts to parse the packet limited by the data and data_end pointers,
* to retrieve a protocol dependent packet identifier. If sucessful, the
* pointed to p_id and fei will be filled with parsed information from the
* packet, and 0 will be returned. On failure, -1 will be returned.
* If is_egress saddr and daddr will match source and destination of packet,
* respectively, and identifier will be set to the identifer for an outgoing
* packet. Otherwise, saddr and daddr will be swapped (will match
* destination and source of packet, respectively), and identifier will be
* set to the identifier of a response.
* Attempts to parse the packet defined by pctx for a valid packet identifier
* and reply identifier, filling in p_info.
*
* If successful, all members of p_info will be set appropriately and 0 will
* be returned.
* On failure -1 will be returned (no guarantees on p_info members).
*/
static int parse_packet_identifier(struct parsing_context *ctx,
struct packet_id *p_id,
struct flow_event_info *fei)
static int parse_packet_identifier(struct parsing_context *pctx,
struct packet_info *p_info)
{
int proto, err;
struct ethhdr *eth;
struct iphdr *iph;
struct ipv6hdr *ip6h;
struct flow_address *saddr, *daddr;
// Switch saddr <--> daddr on ingress to match egress
if (ctx->is_egress) {
saddr = &p_id->flow.saddr;
daddr = &p_id->flow.daddr;
} else {
saddr = &p_id->flow.daddr;
daddr = &p_id->flow.saddr;
}
proto = parse_ethhdr(&ctx->nh, ctx->data_end, &eth);
p_info->time = bpf_ktime_get_ns();
proto = parse_ethhdr(&pctx->nh, pctx->data_end, &eth);
// Parse IPv4/6 header
if (proto == bpf_htons(ETH_P_IP)) {
p_id->flow.ipv = AF_INET;
p_id->flow.proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph);
p_info->pid.flow.ipv = AF_INET;
p_info->pid.flow.proto =
parse_iphdr(&pctx->nh, pctx->data_end, &p_info->iph);
} else if (proto == bpf_htons(ETH_P_IPV6)) {
p_id->flow.ipv = AF_INET6;
p_id->flow.proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h);
p_info->pid.flow.ipv = AF_INET6;
p_info->pid.flow.proto =
parse_ip6hdr(&pctx->nh, pctx->data_end, &p_info->ip6h);
} else {
return -1;
}
// Parse identifier from suitable protocol
if (config.track_tcp && p_id->flow.proto == IPPROTO_TCP)
err = parse_tcp_identifier(ctx, &saddr->port, &daddr->port, fei,
&p_id->identifier);
else if (config.track_icmp && p_id->flow.proto == IPPROTO_ICMPV6 &&
p_id->flow.ipv == AF_INET6)
err = parse_icmp6_identifier(ctx, &saddr->port, &daddr->port,
fei, &p_id->identifier);
else if (config.track_icmp && p_id->flow.proto == IPPROTO_ICMP &&
p_id->flow.ipv == AF_INET)
err = parse_icmp_identifier(ctx, &saddr->port, &daddr->port,
fei, &p_id->identifier);
if (config.track_tcp && p_info->pid.flow.proto == IPPROTO_TCP)
err = parse_tcp_identifier(pctx, p_info);
else if (config.track_icmp &&
p_info->pid.flow.proto == IPPROTO_ICMPV6 &&
p_info->pid.flow.ipv == AF_INET6)
err = parse_icmp6_identifier(pctx, p_info);
else if (config.track_icmp && p_info->pid.flow.proto == IPPROTO_ICMP &&
p_info->pid.flow.ipv == AF_INET)
err = parse_icmp_identifier(pctx, p_info);
else
return -1; // No matching protocol
if (err)
return -1; // Failed parsing protocol
// Successfully parsed packet identifier - fill in IP-addresses and return
if (p_id->flow.ipv == AF_INET) {
map_ipv4_to_ipv6(iph->saddr, &saddr->ip);
map_ipv4_to_ipv6(iph->daddr, &daddr->ip);
if (p_info->pid.flow.ipv == AF_INET) {
map_ipv4_to_ipv6(&p_info->pid.flow.saddr.ip,
p_info->iph->saddr);
map_ipv4_to_ipv6(&p_info->pid.flow.daddr.ip,
p_info->iph->daddr);
} else { // IPv6
saddr->ip = ip6h->saddr;
daddr->ip = ip6h->daddr;
p_info->pid.flow.saddr.ip = p_info->ip6h->saddr;
p_info->pid.flow.daddr.ip = p_info->ip6h->daddr;
}
reverse_flow(&p_info->reply_pid.flow, &p_info->pid.flow);
p_info->payload = remaining_pkt_payload(pctx);
return 0;
}
/*
* Returns the number of unparsed bytes left in the packet (bytes after nh.pos)
*/
static __u32 remaining_pkt_payload(struct parsing_context *ctx)
{
// pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer)
// data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register
__u32 parsed_bytes = ctx->nh.pos - ctx->data;
return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0;
}
/*
* Calculate a smooted rtt similar to how TCP stack does it in
* Calculate a smoothed rtt similar to how TCP stack does it in
* net/ipv4/tcp_input.c/tcp_rtt_estimator().
*
* NOTE: Will cause roundoff errors, but if RTTs > 1000ns errors should be small
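/*
 * The calculate_srtt() implementation itself is not visible in this hunk. A
 * rough, hedged sketch of the kind of EWMA the comment above refers to (the
 * smoothed RTT moves 1/8 of the way towards each new sample, as in
 * tcp_rtt_estimator()) could look like the following; the actual
 * implementation may differ in its details:
 */
static __u64 srtt_sketch(__u64 prev_srtt, __u64 rtt)
{
	if (prev_srtt == 0)
		return rtt; /* first sample initializes the srtt */
	/* srtt = 7/8 * srtt + 1/8 * rtt, using integer shifts */
	return prev_srtt - (prev_srtt >> 3) + (rtt >> 3);
}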
@@ -352,86 +391,172 @@ static bool is_rate_limited(__u64 now, __u64 last_ts, __u64 rtt)
}
/*
* Fills in event_type, timestamp, flow, source and reserved.
* Does not fill in the flow_info.
* Sends a flow-event message based on p_info.
*
* The rev_flow argument indicates whether the message is for the flow in
* the current direction or for the reverse flow, and the flow and source
* members are set accordingly.
*/
static void fill_flow_event(struct flow_event *fe, __u64 timestamp,
struct network_tuple *flow,
enum flow_event_source source)
static void send_flow_event(void *ctx, struct packet_info *p_info,
bool rev_flow)
{
fe->event_type = EVENT_TYPE_FLOW;
fe->timestamp = timestamp;
fe->flow = *flow;
fe->source = source;
fe->reserved = 0; // Make sure it's initilized
struct flow_event fe = {
.event_type = EVENT_TYPE_FLOW,
.flow_event_type = p_info->event_type,
.reason = p_info->event_reason,
.timestamp = p_info->time,
.reserved = 0, // Make sure it's initialized
};
if (rev_flow) {
fe.flow = p_info->pid.flow;
fe.source = EVENT_SOURCE_PKT_SRC;
} else {
fe.flow = p_info->reply_pid.flow;
fe.source = EVENT_SOURCE_PKT_DEST;
}
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe, sizeof(fe));
}
/*
* Main function for handling the pping egress path.
* Parses the packet for an identifer and attemps to store a timestamp for it
* in the packet_ts map.
* Attempt to create a new flow-state and push flow-opening message
* Returns a pointer to the flow_state if successful, NULL otherwise
*/
static void pping_egress(void *ctx, struct parsing_context *pctx)
static struct flow_state *create_flow(void *ctx, struct packet_info *p_info)
{
struct packet_id p_id = { 0 };
struct flow_event fe;
struct flow_state *f_state;
struct flow_state new_state = { 0 };
bool new_flow = false;
__u64 now;
if (parse_packet_identifier(pctx, &p_id, &fe.event_info) < 0)
return;
new_state.last_timestamp = p_info->time;
if (bpf_map_update_elem(&flow_state, &p_info->pid.flow, &new_state,
BPF_NOEXIST) != 0)
return NULL;
now = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
if (p_info->event_type != FLOW_EVENT_OPENING) {
p_info->event_type = FLOW_EVENT_OPENING;
p_info->event_reason = EVENT_REASON_FIRST_OBS_PCKT;
}
send_flow_event(ctx, p_info, false);
// Flow closing - try to delete flow state and push closing-event
if (fe.event_info.event == FLOW_EVENT_CLOSING) {
if (!f_state) {
bpf_map_delete_elem(&flow_state, &p_id.flow);
fill_flow_event(&fe, now, &p_id.flow,
EVENT_SOURCE_EGRESS);
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&fe, sizeof(fe));
}
return;
return bpf_map_lookup_elem(&flow_state, &p_info->pid.flow);
}
static struct flow_state *update_flow(void *ctx, struct packet_info *p_info,
bool *new_flow)
{
struct flow_state *f_state;
*new_flow = false;
f_state = bpf_map_lookup_elem(&flow_state, &p_info->pid.flow);
if (!f_state && p_info->pid_valid) {
*new_flow = true;
f_state = create_flow(ctx, p_info);
}
// No previous state - attempt to create it and push flow-opening event
if (!f_state) {
new_state.last_timestamp = now;
if (bpf_map_update_elem(&flow_state, &p_id.flow, &new_state,
BPF_NOEXIST) == 0) {
new_flow = true;
if (fe.event_info.event != FLOW_EVENT_OPENING) {
fe.event_info.event = FLOW_EVENT_OPENING;
fe.event_info.reason =
EVENT_REASON_FIRST_OBS_PCKT;
}
fill_flow_event(&fe, now, &p_id.flow,
EVENT_SOURCE_EGRESS);
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU,
&fe, sizeof(fe));
}
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
if (!f_state) // Creation failed
return;
}
if (!f_state)
return NULL;
f_state->sent_pkts++;
f_state->sent_bytes += remaining_pkt_payload(pctx);
f_state->sent_bytes += p_info->payload;
return f_state;
}
static struct flow_state *update_rev_flow(struct packet_info *p_info)
{
struct flow_state *f_state;
f_state = bpf_map_lookup_elem(&flow_state, &p_info->reply_pid.flow);
if (!f_state)
return NULL;
f_state->rec_pkts++;
f_state->rec_bytes += p_info->payload;
return f_state;
}
static void delete_closed_flows(void *ctx, struct packet_info *p_info)
{
// Flow closing - try to delete flow state and push closing-event
if (p_info->event_type == FLOW_EVENT_CLOSING ||
p_info->event_type == FLOW_EVENT_CLOSING_BOTH) {
if (!bpf_map_delete_elem(&flow_state, &p_info->pid.flow))
send_flow_event(ctx, p_info, false);
}
// Also close reverse flow
if (p_info->event_type == FLOW_EVENT_CLOSING_BOTH) {
if (!bpf_map_delete_elem(&flow_state, &p_info->reply_pid.flow))
send_flow_event(ctx, p_info, true);
}
}
/*
* Return true if p_info->pid.flow.daddr is a "local" address.
*
* Works by performing a fib lookup for p_info->pid.flow.
* Lookup struct filled based on examples from
* samples/bpf/xdp_fwd_kern.c/xdp_fwd_flags() and
* tools/testing/selftests/bpf/progs/test_tc_neigh_fib.c
*/
static bool is_local_address(struct packet_info *p_info, void *ctx,
struct parsing_context *pctx)
{
int ret;
struct bpf_fib_lookup lookup;
__builtin_memset(&lookup, 0, sizeof(lookup));
lookup.ifindex = pctx->ingress_ifindex;
lookup.family = p_info->pid.flow.ipv;
if (lookup.family == AF_INET) {
lookup.tos = p_info->iph->tos;
lookup.tot_len = bpf_ntohs(p_info->iph->tot_len);
lookup.ipv4_src = p_info->iph->saddr;
lookup.ipv4_dst = p_info->iph->daddr;
} else if (lookup.family == AF_INET6) {
struct in6_addr *src = (struct in6_addr *)lookup.ipv6_src;
struct in6_addr *dst = (struct in6_addr *)lookup.ipv6_dst;
lookup.flowinfo = *(__be32 *)p_info->ip6h & IPV6_FLOWINFO_MASK;
lookup.tot_len = bpf_ntohs(p_info->ip6h->payload_len);
*src = p_info->pid.flow.saddr.ip; //verifier did not like ip6h->saddr
*dst = p_info->pid.flow.daddr.ip;
}
lookup.l4_protocol = p_info->pid.flow.proto;
lookup.sport = 0;
lookup.dport = 0;
ret = bpf_fib_lookup(ctx, &lookup, sizeof(lookup), 0);
return ret == BPF_FIB_LKUP_RET_NOT_FWDED ||
ret == BPF_FIB_LKUP_RET_FWD_DISABLED;
}
/*
* Attempt to create a timestamp-entry for packet p_info for flow in f_state
*/
static void pping_timestamp_packet(struct flow_state *f_state, void *ctx,
struct parsing_context *pctx,
struct packet_info *p_info, bool new_flow)
{
if (!f_state || !p_info->pid_valid)
return;
if (config.localfilt && !pctx->is_egress &&
is_local_address(p_info, ctx, pctx))
return;
// Check if identifier is new
if (f_state->last_id == p_id.identifier)
if (!new_flow && f_state->last_id == p_info->pid.identifier)
return;
f_state->last_id = p_id.identifier;
f_state->last_id = p_info->pid.identifier;
// Check rate-limit
if (!new_flow &&
is_rate_limited(now, f_state->last_timestamp,
is_rate_limited(p_info->time, f_state->last_timestamp,
config.use_srtt ? f_state->srtt : f_state->min_rtt))
return;
@@ -441,44 +566,32 @@ static void pping_egress(void *ctx, struct parsing_context *pctx)
* the next available map slot somewhat fairer between heavy and sparse
* flows.
*/
f_state->last_timestamp = now;
bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST);
f_state->last_timestamp = p_info->time;
return;
bpf_map_update_elem(&packet_ts, &p_info->pid, &p_info->time,
BPF_NOEXIST);
}
/*
* Main function for handling the pping ingress path.
* Parses the packet for an identifer and tries to lookup a stored timestmap.
* If it finds a match, it pushes an rtt_event to the events buffer.
* Attempt to match packet in p_info with a timestamp from flow in f_state
*/
static void pping_ingress(void *ctx, struct parsing_context *pctx)
static void pping_match_packet(struct flow_state *f_state, void *ctx,
struct parsing_context *pctx,
struct packet_info *p_info)
{
struct packet_id p_id = { 0 };
struct flow_event fe;
struct rtt_event re = { 0 };
struct flow_state *f_state;
__u64 *p_ts;
__u64 now;
if (parse_packet_identifier(pctx, &p_id, &fe.event_info) < 0)
if (!f_state || !p_info->reply_pid_valid)
return;
f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow);
if (!f_state)
p_ts = bpf_map_lookup_elem(&packet_ts, &p_info->reply_pid);
if (!p_ts || p_info->time < *p_ts)
return;
f_state->rec_pkts++;
f_state->rec_bytes += remaining_pkt_payload(pctx);
now = bpf_ktime_get_ns();
p_ts = bpf_map_lookup_elem(&packet_ts, &p_id);
if (!p_ts || now < *p_ts)
goto validflow_out;
re.rtt = now - *p_ts;
re.rtt = p_info->time - *p_ts;
// Delete timestamp entry as soon as RTT is calculated
bpf_map_delete_elem(&packet_ts, &p_id);
bpf_map_delete_elem(&packet_ts, &p_info->reply_pid);
if (f_state->min_rtt == 0 || re.rtt < f_state->min_rtt)
f_state->min_rtt = re.rtt;
@@ -486,25 +599,40 @@ static void pping_ingress(void *ctx, struct parsing_context *pctx)
// Fill event and push to perf-buffer
re.event_type = EVENT_TYPE_RTT;
re.timestamp = now;
re.timestamp = p_info->time;
re.min_rtt = f_state->min_rtt;
re.sent_pkts = f_state->sent_pkts;
re.sent_bytes = f_state->sent_bytes;
re.rec_pkts = f_state->rec_pkts;
re.rec_bytes = f_state->rec_bytes;
re.flow = p_id.flow;
re.flow = p_info->pid.flow;
re.match_on_egress = pctx->is_egress;
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &re, sizeof(re));
}
validflow_out:
// Wait with deleting flow until having pushed final RTT message
if (fe.event_info.event == FLOW_EVENT_CLOSING &&
bpf_map_delete_elem(&flow_state, &p_id.flow) == 0) {
fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_INGRESS);
bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe,
sizeof(fe));
/*
* Will parse the ingress/egress packet in pctx and attempt to create a
* timestamp for it and match it against the reverse flow.
*/
static void pping(void *ctx, struct parsing_context *pctx)
{
struct packet_info p_info = { 0 };
struct flow_state *f_state;
bool new_flow;
if (parse_packet_identifier(pctx, &p_info) < 0)
return;
if (p_info.event_type != FLOW_EVENT_CLOSING &&
p_info.event_type != FLOW_EVENT_CLOSING_BOTH) {
f_state = update_flow(ctx, &p_info, &new_flow);
pping_timestamp_packet(f_state, ctx, pctx, &p_info, new_flow);
}
return;
f_state = update_rev_flow(&p_info);
pping_match_packet(f_state, ctx, pctx, &p_info);
delete_closed_flows(ctx, &p_info);
}
// Programs
@@ -521,7 +649,7 @@ int pping_tc_egress(struct __sk_buff *skb)
.is_egress = true,
};
pping_egress(skb, &pctx);
pping(skb, &pctx);
return TC_ACT_UNSPEC;
}
@@ -535,10 +663,11 @@ int pping_tc_ingress(struct __sk_buff *skb)
.data_end = (void *)(long)skb->data_end,
.pkt_len = skb->len,
.nh = { .pos = pctx.data },
.ingress_ifindex = skb->ingress_ifindex,
.is_egress = false,
};
pping_ingress(skb, &pctx);
pping(skb, &pctx);
return TC_ACT_UNSPEC;
}
@@ -552,10 +681,11 @@ int pping_xdp_ingress(struct xdp_md *ctx)
.data_end = (void *)(long)ctx->data_end,
.pkt_len = pctx.data_end - pctx.data,
.nh = { .pos = pctx.data },
.ingress_ifindex = ctx->ingress_ifindex,
.is_egress = false,
};
pping_ingress(ctx, &pctx);
pping(ctx, &pctx);
return XDP_PASS;
}