diff --git a/pping/README.md b/pping/README.md index 19b6526..966f452 100644 --- a/pping/README.md +++ b/pping/README.md @@ -13,20 +13,21 @@ spinbit and DNS queries. See the [TODO-list](./TODO.md) for more potential features (which may or may not ever get implemented). The fundamental logic of pping is to timestamp a pseudo-unique identifier for -outgoing packets, and then look for matches in the incoming packets. If a match -is found, the RTT is simply calculated as the time difference between the -current time and the stored timestamp. +packets, and then look for matches in the reply packets. If a match is found, +the RTT is simply calculated as the time difference between the current time and +the stored timestamp. This tool, just as Kathie's original pping implementation, uses TCP timestamps -as identifiers for TCP traffic. For outgoing packets, the TSval (which is a -timestamp in and off itself) is timestamped. Incoming packets are then parsed -for the TSecr, which are the echoed TSval values from the receiver. The TCP -timestamps are not necessarily unique for every packet (they have a limited -update frequency, appears to be 1000 Hz for modern Linux systems), so only the -first instance of an identifier is timestamped, and matched against the first -incoming packet with the identifier. The mechanism to ensure only the first -packet is timestamped and matched differs from the one in Kathie's pping, and is -further described in [SAMPLING_DESIGN](./SAMPLING_DESIGN.md). +as identifiers for TCP traffic. The TSval (which is a timestamp in and of +itself) is used as an identifier and timestamped. Reply packets in the reverse +flow are then parsed for the TSecr, which are the echoed TSval values from the +receiver. The TCP timestamps are not necessarily unique for every packet (they +have a limited update frequency, which appears to be 1000 Hz for modern Linux +systems), so only the first instance of an identifier is timestamped, and +matched against the first incoming packet with a matching reply identifier. The +mechanism to ensure only the first packet is timestamped and matched differs +from the one in Kathie's pping, and is further described in +[SAMPLING_DESIGN](./SAMPLING_DESIGN.md). For ICMP echo, it uses the echo identifier as port numbers, and echo sequence number as identifer to match against. Linux systems will typically use different @@ -48,7 +49,7 @@ single line per event.
An example of the format is provided below: ```shell -16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from src +16:00:46.142279766 TCP 10.11.1.1:5201+10.11.1.2:59528 opening due to SYN-ACK from dest 16:00:46.147705205 5.425439 ms 5.425439 ms TCP 10.11.1.1:5201+10.11.1.2:59528 16:00:47.148905125 5.261430 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528 16:00:48.151666385 5.972284 ms 5.261430 ms TCP 10.11.1.1:5201+10.11.1.2:59528 @@ -96,7 +97,7 @@ An example of a (pretty-printed) flow-event is provided below: "protocol": "TCP", "flow_event": "opening", "reason": "SYN-ACK", - "triggered_by": "src" + "triggered_by": "dest" } ``` @@ -114,7 +115,8 @@ An example of a (pretty-printed) RTT-event is provided below: "sent_packets": 9393, "sent_bytes": 492457296, "rec_packets": 5922, - "rec_bytes": 37 + "rec_bytes": 37, + "match_on_egress": false } ``` @@ -123,22 +125,20 @@ An example of a (pretty-printed) RTT-event is provided below: ### Files: - **pping.c:** Userspace program that loads and attaches the BPF programs, pulls - the perf-buffer `rtt_events` to print out RTT messages and periodically cleans + the perf-buffer `events` to print out RTT messages and periodically cleans up the hash-maps from old entries. Also passes user options to the BPF programs by setting a "global variable" (stored in the programs .rodata section). -- **pping_kern.c:** Contains the BPF programs that are loaded on tc (egress) and - XDP (ingress), as well as several common functions, a global constant `config` - (set from userspace) and map definitions. The tc program `pping_egress()` - parses outgoing packets for identifiers. If an identifier is found and the - sampling strategy allows it, a timestamp for the packet is created in - `packet_ts`. The XDP program `pping_ingress()` parses incomming packets for an - identifier. If found, it looks up the `packet_ts` map for a match on the - reverse flow (to match source/dest on egress). If there is a match, it - calculates the RTT from the stored timestamp and deletes the entry. The - calculated RTT (together with the flow-tuple) is pushed to the perf-buffer - `events`. Both `pping_egress()` and `pping_ingress` can also push flow-events - to the `events` buffer. +- **pping_kern.c:** Contains the BPF programs that are loaded on egress (tc) and + ingress (XDP or tc), as well as several common functions, a global constant + `config` (set from userspace) and map definitions. Essentially the same pping + program is loaded on both ingress and egress. All packets are parsed for both + an identifier that can be used to create a timestamp entry in `packet_ts`, and a + reply identifier that can be used to match the packet with a previously + timestamped one in the reverse flow. If a match is found, an RTT is calculated + and an RTT-event is pushed to userspace through the perf-buffer `events`. For + each packet with a valid identifier, the program also keeps track of and + updates the state of the flow and reverse flow, stored in the `flow_state` map. - **pping.h:** Common header file included by `pping.c` and `pping_kern.c`. Contains some common structs used by both (are part of the maps). @@ -146,13 +146,12 @@ An example of a (pretty-printed) RTT-event is provided below: ### BPF Maps: - **flow_state:** A hash-map storing some basic state for each flow, such as the last seen identifier for the flow and when the last timestamp entry for the - flow was created.
Entries are created by `pping_egress()`, and can be updated - or deleted by both `pping_egress()` and `pping_ingress()`. Leftover entries - are eventually removed by `pping.c`. + flow was created. Entries are created, updated and deleted by the BPF pping + programs. Leftover entries are eventually removed by userspace (`pping.c`). - **packet_ts:** A hash-map storing a timestamp for a specific packet - identifier. Entries are created by `pping_egress()` and removed by - `pping_ingress()` if a match is found. Leftover entries are eventually removed - by `pping.c`. + identifier. Entries are created by the BPF pping program if a valid identifier + is found, and removed if a match is found. Leftover entries are eventually + removed by userspace (`pping.c`). - **events:** A perf-buffer used by the BPF programs to push flow or RTT events to `pping.c`, which continuously polls the map the prints them out. @@ -222,9 +221,9 @@ additional map space and report some additional RTT(s) more than expected (however the reported RTTs should still be correct). If the packets have the same identifier, they must first have managed to bypass -the previous check for unique identifiers (see [previous point](#Tracking last -seen identifier)), and only one of them will be able to successfully store a -timestamp entry. +the previous check for unique identifiers (see [previous +point](#tracking-last-seen-identifier)), and only one of them will be able to +successfully store a timestamp entry. #### Matching against stored timestamps The XDP/ingress program could potentially match multiple concurrent packets with @@ -246,8 +245,8 @@ if this is the lowest RTT seen so far for the flow. If multiple RTTs are calculated concurrently, then several could pass this check concurrently and there may be a lost update. It should only be possible for multiple RTTs to be calculated concurrently in case either the [timestamp rate-limit was -bypassed](#Rate-limiting new timestamps) or [multiple packets managed to match -against the same timestamp](#Matching against stored timestamps). +bypassed](#rate-limiting-new-timestamps) or [multiple packets managed to match +against the same timestamp](#matching-against-stored-timestamps). 
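The lost update can be seen directly in the shape of the min-RTT bookkeeping: it is a plain read-compare-write on the shared per-flow state, with no atomicity between the comparison and the store. The snippet below is a simplified sketch of that pattern (the struct and function names are illustrative stand-ins, not the literal code in `pping_kern.c`):

```c
/* Simplified sketch of the non-atomic min-RTT update discussed above.
 * flow_state_sketch stands in for the per-flow state kept in the
 * flow_state map; __u64 comes from <linux/types.h>. */
#include <linux/types.h>

struct flow_state_sketch {
	__u64 min_rtt; /* 0 = no RTT recorded yet */
};

static void update_min_rtt(struct flow_state_sketch *f_state, __u64 rtt)
{
	/* Two CPUs handling packets from the same flow can both read the
	 * old min_rtt here and both pass this check... */
	if (f_state->min_rtt == 0 || rtt < f_state->min_rtt)
		f_state->min_rtt = rtt; /* ...so the smaller write may be lost. */
}
```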
It's worth noting that with sampling the reported minimum-RTT is only an estimate anyways (may never calculate RTT for packet with the true minimum diff --git a/pping/eBPF_pping_design.png b/pping/eBPF_pping_design.png index ab91002..8423b25 100644 Binary files a/pping/eBPF_pping_design.png and b/pping/eBPF_pping_design.png differ diff --git a/pping/pping.c b/pping/pping.c index 08aaac0..c9c596a 100644 --- a/pping/pping.c +++ b/pping/pping.c @@ -15,11 +15,8 @@ static const char *__doc__ = #include #include #include -#include #include // For detecting Ctrl-C #include // For setting rlmit -#include -#include #include #include @@ -108,6 +105,7 @@ static const struct option long_options[] = { { "ingress-hook", required_argument, NULL, 'I' }, // Use tc or XDP as ingress hook { "tcp", no_argument, NULL, 'T' }, // Calculate and report RTTs for TCP traffic (with TCP timestamps) { "icmp", no_argument, NULL, 'C' }, // Calculate and report RTTs for ICMP echo-reply traffic + { "include-local", no_argument, NULL, 'l' }, // Also report "internal" RTTs { 0, 0, NULL, 0 } }; @@ -172,11 +170,12 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config) double rate_limit_ms, cleanup_interval_s, rtt_rate; config->ifindex = 0; + config->bpf_config.localfilt = true; config->force = false; config->bpf_config.track_tcp = false; config->bpf_config.track_icmp = false; - while ((opt = getopt_long(argc, argv, "hfTCi:r:R:t:c:F:I:", long_options, + while ((opt = getopt_long(argc, argv, "hflTCi:r:R:t:c:F:I:", long_options, NULL)) != -1) { switch (opt) { case 'i': @@ -257,6 +256,9 @@ static int parse_arguments(int argc, char *argv[], struct pping_config *config) return -EINVAL; } break; + case 'l': + config->bpf_config.localfilt = false; + break; case 'f': config->force = true; config->xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST; @@ -504,9 +506,9 @@ static bool flow_timeout(void *key_ptr, void *val_ptr, __u64 now) if (print_event_func) { fe.event_type = EVENT_TYPE_FLOW; fe.timestamp = now; - fe.flow = *(struct network_tuple *)key_ptr; - fe.event_info.event = FLOW_EVENT_CLOSING; - fe.event_info.reason = EVENT_REASON_FLOW_TIMEOUT; + reverse_flow(&fe.flow, key_ptr); + fe.flow_event_type = FLOW_EVENT_CLOSING; + fe.reason = EVENT_REASON_FLOW_TIMEOUT; fe.source = EVENT_SOURCE_USERSPACE; print_event_func(NULL, 0, &fe, sizeof(fe)); } @@ -657,6 +659,7 @@ static const char *flowevent_to_str(enum flow_event_type fe) case FLOW_EVENT_OPENING: return "opening"; case FLOW_EVENT_CLOSING: + case FLOW_EVENT_CLOSING_BOTH: return "closing"; default: return "unknown"; @@ -674,8 +677,6 @@ static const char *eventreason_to_str(enum flow_event_reason er) return "first observed packet"; case EVENT_REASON_FIN: return "FIN"; - case EVENT_REASON_FIN_ACK: - return "FIN-ACK"; case EVENT_REASON_RST: return "RST"; case EVENT_REASON_FLOW_TIMEOUT: @@ -688,9 +689,9 @@ static const char *eventreason_to_str(enum flow_event_reason er) static const char *eventsource_to_str(enum flow_event_source es) { switch (es) { - case EVENT_SOURCE_EGRESS: + case EVENT_SOURCE_PKT_SRC: return "src"; - case EVENT_SOURCE_INGRESS: + case EVENT_SOURCE_PKT_DEST: return "dest"; case EVENT_SOURCE_USERSPACE: return "userspace-cleanup"; @@ -740,8 +741,8 @@ static void print_event_standard(void *ctx, int cpu, void *data, printf(" %s ", proto_to_str(e->rtt_event.flow.proto)); print_flow_ppvizformat(stdout, &e->flow_event.flow); printf(" %s due to %s from %s\n", - flowevent_to_str(e->flow_event.event_info.event), - 
eventreason_to_str(e->flow_event.event_info.reason), + flowevent_to_str(e->flow_event.flow_event_type), + eventreason_to_str(e->flow_event.reason), eventsource_to_str(e->flow_event.source)); } } @@ -790,15 +791,16 @@ static void print_rttevent_fields_json(json_writer_t *ctx, jsonw_u64_field(ctx, "sent_bytes", re->sent_bytes); jsonw_u64_field(ctx, "rec_packets", re->rec_pkts); jsonw_u64_field(ctx, "rec_bytes", re->rec_bytes); + jsonw_bool_field(ctx, "match_on_egress", re->match_on_egress); } static void print_flowevent_fields_json(json_writer_t *ctx, const struct flow_event *fe) { jsonw_string_field(ctx, "flow_event", - flowevent_to_str(fe->event_info.event)); + flowevent_to_str(fe->flow_event_type)); jsonw_string_field(ctx, "reason", - eventreason_to_str(fe->event_info.reason)); + eventreason_to_str(fe->reason)); jsonw_string_field(ctx, "triggered_by", eventsource_to_str(fe->source)); } diff --git a/pping/pping.h b/pping/pping.h index abf49e4..1d50517 100644 --- a/pping/pping.h +++ b/pping/pping.h @@ -18,7 +18,8 @@ typedef __u64 fixpoint64; enum __attribute__((__packed__)) flow_event_type { FLOW_EVENT_NONE, FLOW_EVENT_OPENING, - FLOW_EVENT_CLOSING + FLOW_EVENT_CLOSING, + FLOW_EVENT_CLOSING_BOTH }; enum __attribute__((__packed__)) flow_event_reason { @@ -26,14 +27,13 @@ enum __attribute__((__packed__)) flow_event_reason { EVENT_REASON_SYN_ACK, EVENT_REASON_FIRST_OBS_PCKT, EVENT_REASON_FIN, - EVENT_REASON_FIN_ACK, EVENT_REASON_RST, EVENT_REASON_FLOW_TIMEOUT }; enum __attribute__((__packed__)) flow_event_source { - EVENT_SOURCE_EGRESS, - EVENT_SOURCE_INGRESS, + EVENT_SOURCE_PKT_SRC, + EVENT_SOURCE_PKT_DEST, EVENT_SOURCE_USERSPACE }; @@ -43,7 +43,8 @@ struct bpf_config { bool use_srtt; bool track_tcp; bool track_icmp; - __u8 reserved[5]; + bool localfilt; + __u32 reserved; }; /* @@ -108,12 +109,8 @@ struct rtt_event { __u64 sent_bytes; __u64 rec_pkts; __u64 rec_bytes; - __u32 reserved; -}; - -struct flow_event_info { - enum flow_event_type event; - enum flow_event_reason reason; + bool match_on_egress; + __u8 reserved[7]; }; /* @@ -126,7 +123,8 @@ struct flow_event { __u64 event_type; __u64 timestamp; struct network_tuple flow; - struct flow_event_info event_info; + enum flow_event_type flow_event_type; + enum flow_event_reason reason; enum flow_event_source source; __u8 reserved; }; @@ -137,4 +135,19 @@ union pping_event { struct flow_event flow_event; }; +/* + * Convenience function for getting the corresponding reverse flow. + * PPing needs to keep track of flow in both directions, and sometimes + * also needs to reverse the flow to report the "correct" (consistent + * with Kathie's PPing) src and dest address. + */ +static void reverse_flow(struct network_tuple *dest, struct network_tuple *src) +{ + dest->ipv = src->ipv; + dest->proto = src->proto; + dest->saddr = src->daddr; + dest->daddr = src->saddr; + dest->reserved = 0; +} + #endif diff --git a/pping/pping_kern.c b/pping/pping_kern.c index ad9ff9e..312796f 100644 --- a/pping/pping_kern.c +++ b/pping/pping_kern.c @@ -25,6 +25,9 @@ #define AF_INET6 10 #define MAX_TCP_OPTIONS 10 +// Mask for IPv6 flowlabel + traffic class - used in fib lookup +#define IPV6_FLOWINFO_MASK __cpu_to_be32(0x0FFFFFFF) + /* * This struct keeps track of the data and data_end pointers from the xdp_md or * __skb_buff contexts, as well as a currently parsed to position kept in nh. @@ -33,11 +36,43 @@ * header encloses. 
*/ struct parsing_context { - void *data; //Start of eth hdr - void *data_end; //End of safe acessible area - struct hdr_cursor nh; //Position to parse next - __u32 pkt_len; //Full packet length (headers+data) - bool is_egress; //Is packet on egress or ingress? + void *data; // Start of eth hdr + void *data_end; // End of safe accessible area + struct hdr_cursor nh; // Position to parse next + __u32 pkt_len; // Full packet length (headers+data) + __u32 ingress_ifindex; // Interface packet arrived on + bool is_egress; // Is packet on egress or ingress? +}; + +/* + * Struct filled in by parse_packet_identifier. + * + * Note: As long as parse_packet_identifier is successful, the flow-parts of pid + * and reply_pid should be valid, regardless of value for pid_valid and + * reply_pid_valid. The *pid_valid members are there to indicate that the + * identifier part of *pid is valid and can be used for timestamping/lookup. + * The reason for not keeping the flow parts as entirely separate members + * is to save some performance by avoiding a copy for lookup/insertion + * in the packet_ts map. + */ +struct packet_info { + union { + struct iphdr *iph; + struct ipv6hdr *ip6h; + }; + union { + struct icmphdr *icmph; + struct icmp6hdr *icmp6h; + struct tcphdr *tcph; + }; + __u64 time; // Arrival time of packet + __u32 payload; // Size of packet data (excluding headers) + struct packet_id pid; // identifier to timestamp (ex. TSval) + struct packet_id reply_pid; // identifier to match against (ex. TSecr) + bool pid_valid; // identifier can be used to timestamp packet + bool reply_pid_valid; // reply_identifier can be used to match packet + enum flow_event_type event_type; // flow event triggered by packet + enum flow_event_reason event_reason; // reason for triggering flow event }; char _license[] SEC("license") = "GPL"; @@ -70,13 +105,24 @@ struct { /* * Maps an IPv4 address into an IPv6 address according to RFC 4291 sec 2.5.5.2 */ -static void map_ipv4_to_ipv6(__be32 ipv4, struct in6_addr *ipv6) +static void map_ipv4_to_ipv6(struct in6_addr *ipv6, __be32 ipv4) { __builtin_memset(&ipv6->in6_u.u6_addr8[0], 0x00, 10); __builtin_memset(&ipv6->in6_u.u6_addr8[10], 0xff, 2); ipv6->in6_u.u6_addr32[3] = ipv4; } +/* + * Returns the number of unparsed bytes left in the packet (bytes after nh.pos) + */ +static __u32 remaining_pkt_payload(struct parsing_context *ctx) +{ + // pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - pos not ok because value - pointer) + // data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register + __u32 parsed_bytes = ctx->nh.pos - ctx->data; + return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0; +} + /* * Parses the TSval and TSecr values from the TCP options field. If sucessful * the TSval and TSecr values will be stored at tsval and tsecr (in network @@ -134,198 +180,191 @@ static int parse_tcp_ts(struct tcphdr *tcph, void *data_end, __u32 *tsval, /* * Attempts to fetch an identifier for TCP packets, based on the TCP timestamp * option. - * If successful, identifier will be set to TSval if is_ingress, or TSecr - * otherwise, the port-members of saddr and daddr will be set to the TCP source - * and dest, respectively, fei will be filled appropriately (based on - * SYN/FIN/RST) and 0 will be returned. - * On failure, -1 will be returned. + * + * Will use the TSval as pid and TSecr as reply_pid, and the TCP source and dest + * as port numbers.
+ * + * If successful, the pid (identifer + flow.port), reply_pid, pid_valid, + * reply_pid_valid, event_type and event_reason members of p_info will be set + * appropriately and 0 will be returned. + * On failure -1 will be returned (no guarantees on values set in p_info). */ -static int parse_tcp_identifier(struct parsing_context *ctx, __be16 *sport, - __be16 *dport, struct flow_event_info *fei, - __u32 *identifier) +static int parse_tcp_identifier(struct parsing_context *pctx, + struct packet_info *p_info) { - __u32 tsval, tsecr; - struct tcphdr *tcph; - - if (parse_tcphdr(&ctx->nh, ctx->data_end, &tcph) < 0) + if (parse_tcphdr(&pctx->nh, pctx->data_end, &p_info->tcph) < 0) return -1; - // Do not timestamp pure ACKs - if (ctx->is_egress && ctx->nh.pos - ctx->data >= ctx->pkt_len && - !tcph->syn) - return -1; - - // Do not match on non-ACKs (TSecr not valid) - if (!ctx->is_egress && !tcph->ack) - return -1; - - // Check if connection is opening/closing - if (tcph->syn) { - fei->event = FLOW_EVENT_OPENING; - fei->reason = - tcph->ack ? EVENT_REASON_SYN_ACK : EVENT_REASON_SYN; - } else if (tcph->rst) { - fei->event = FLOW_EVENT_CLOSING; - fei->reason = EVENT_REASON_RST; - } else if (!ctx->is_egress && tcph->fin) { - fei->event = FLOW_EVENT_CLOSING; - fei->reason = - tcph->ack ? EVENT_REASON_FIN_ACK : EVENT_REASON_FIN; - } else { - fei->event = FLOW_EVENT_NONE; - } - - if (parse_tcp_ts(tcph, ctx->data_end, &tsval, &tsecr) < 0) + if (parse_tcp_ts(p_info->tcph, pctx->data_end, &p_info->pid.identifier, + &p_info->reply_pid.identifier) < 0) return -1; //Possible TODO, fall back on seq/ack instead - *sport = tcph->source; - *dport = tcph->dest; - *identifier = ctx->is_egress ? tsval : tsecr; + p_info->pid.flow.saddr.port = p_info->tcph->source; + p_info->pid.flow.daddr.port = p_info->tcph->dest; + + // Do not timestamp pure ACKs (no payload) + p_info->pid_valid = + pctx->nh.pos - pctx->data < pctx->pkt_len || p_info->tcph->syn; + + // Do not match on non-ACKs (TSecr not valid) + p_info->reply_pid_valid = p_info->tcph->ack; + + // Check if connection is opening/closing + if (p_info->tcph->rst) { + p_info->event_type = FLOW_EVENT_CLOSING_BOTH; + p_info->event_reason = EVENT_REASON_RST; + } else if (p_info->tcph->fin) { + p_info->event_type = FLOW_EVENT_CLOSING; + p_info->event_reason = EVENT_REASON_FIN; + } else if (p_info->tcph->syn) { + p_info->event_type = FLOW_EVENT_OPENING; + p_info->event_reason = p_info->tcph->ack ? + EVENT_REASON_SYN_ACK : + EVENT_REASON_SYN; + } else { + p_info->event_type = FLOW_EVENT_NONE; + } + return 0; } /* - * Attemps to fetch an identifier for an ICMPv6 header, based on the echo + * Attempts to fetch an identifier for an ICMPv6 header, based on the echo * request/reply sequence number. - * If successful, identifer will be set to the echo sequence number, both - * sport and dport will be set to the echo identifier, and 0 will be returned. - * On failure, -1 will be returned. - * Note: Will store the 16-bit echo sequence number in network byte order in - * the 32-bit identifier. + * + * Will use the echo sequence number as pid/reply_pid and the echo identifier + * as port numbers. Echo requests will only generate a valid pid and echo + * replies will only generate a valid reply_pid. + * + * If successful, the pid (identifier + flow.port), reply_pid, pid_valid, + * reply pid_valid and event_type of p_info will be set appropriately and 0 + * will be returned. + * On failure, -1 will be returned (no guarantees on p_info members). 
+ * + * Note: Will store the 16-bit sequence number in network byte order + * in the 32-bit (reply_)pid.identifier. */ -static int parse_icmp6_identifier(struct parsing_context *ctx, __u16 *sport, - __u16 *dport, struct flow_event_info *fei, - __u32 *identifier) +static int parse_icmp6_identifier(struct parsing_context *pctx, + struct packet_info *p_info) { - struct icmp6hdr *icmp6h; - - if (parse_icmp6hdr(&ctx->nh, ctx->data_end, &icmp6h) < 0) + if (parse_icmp6hdr(&pctx->nh, pctx->data_end, &p_info->icmp6h) < 0) return -1; - if (ctx->is_egress && icmp6h->icmp6_type != ICMPV6_ECHO_REQUEST) - return -1; - if (!ctx->is_egress && icmp6h->icmp6_type != ICMPV6_ECHO_REPLY) - return -1; - if (icmp6h->icmp6_code != 0) + if (p_info->icmp6h->icmp6_code != 0) return -1; - fei->event = FLOW_EVENT_NONE; - *sport = icmp6h->icmp6_identifier; - *dport = *sport; - *identifier = icmp6h->icmp6_sequence; + if (p_info->icmp6h->icmp6_type == ICMPV6_ECHO_REQUEST) { + p_info->pid.identifier = p_info->icmp6h->icmp6_sequence; + p_info->pid_valid = true; + p_info->reply_pid_valid = false; + } else if (p_info->icmp6h->icmp6_type == ICMPV6_ECHO_REPLY) { + p_info->reply_pid.identifier = p_info->icmp6h->icmp6_sequence; + p_info->reply_pid_valid = true; + p_info->pid_valid = false; + } else { + return -1; + } + + p_info->event_type = FLOW_EVENT_NONE; + p_info->pid.flow.saddr.port = p_info->icmp6h->icmp6_identifier; + p_info->pid.flow.daddr.port = p_info->pid.flow.saddr.port; return 0; } /* * Same as parse_icmp6_identifier, but for an ICMP(v4) header instead. */ -static int parse_icmp_identifier(struct parsing_context *ctx, __u16 *sport, - __u16 *dport, struct flow_event_info *fei, - __u32 *identifier) +static int parse_icmp_identifier(struct parsing_context *pctx, + struct packet_info *p_info) { - struct icmphdr *icmph; - - if (parse_icmphdr(&ctx->nh, ctx->data_end, &icmph) < 0) + if (parse_icmphdr(&pctx->nh, pctx->data_end, &p_info->icmph) < 0) return -1; - if (ctx->is_egress && icmph->type != ICMP_ECHO) - return -1; - if (!ctx->is_egress && icmph->type != ICMP_ECHOREPLY) - return -1; - if (icmph->code != 0) + if (p_info->icmph->code != 0) return -1; - fei->event = FLOW_EVENT_NONE; - *sport = icmph->un.echo.id; - *dport = *sport; - *identifier = icmph->un.echo.sequence; + if (p_info->icmph->type == ICMP_ECHO) { + p_info->pid.identifier = p_info->icmph->un.echo.sequence; + p_info->pid_valid = true; + p_info->reply_pid_valid = false; + } else if (p_info->icmph->type == ICMP_ECHOREPLY) { + p_info->reply_pid.identifier = p_info->icmph->un.echo.sequence; + p_info->reply_pid_valid = true; + p_info->pid_valid = false; + } else { + return -1; + } + + p_info->event_type = FLOW_EVENT_NONE; + p_info->pid.flow.saddr.port = p_info->icmph->un.echo.id; + p_info->pid.flow.daddr.port = p_info->pid.flow.saddr.port; return 0; } /* - * Attempts to parse the packet limited by the data and data_end pointers, - * to retrieve a protocol dependent packet identifier. If sucessful, the - * pointed to p_id and fei will be filled with parsed information from the - * packet, and 0 will be returned. On failure, -1 will be returned. - * If is_egress saddr and daddr will match source and destination of packet, - * respectively, and identifier will be set to the identifer for an outgoing - * packet. Otherwise, saddr and daddr will be swapped (will match - * destination and source of packet, respectively), and identifier will be - * set to the identifier of a response. 
+ * Attempts to parse the packet defined by pctx for a valid packet identifier + * and reply identifier, filling in p_info. + * + * If succesful, all members of p_info will be set appropriately and 0 will + * be returned. + * On failure -1 will be returned (no garantuees on p_info members). */ -static int parse_packet_identifier(struct parsing_context *ctx, - struct packet_id *p_id, - struct flow_event_info *fei) +static int parse_packet_identifier(struct parsing_context *pctx, + struct packet_info *p_info) { int proto, err; struct ethhdr *eth; - struct iphdr *iph; - struct ipv6hdr *ip6h; - struct flow_address *saddr, *daddr; - // Switch saddr <--> daddr on ingress to match egress - if (ctx->is_egress) { - saddr = &p_id->flow.saddr; - daddr = &p_id->flow.daddr; - } else { - saddr = &p_id->flow.daddr; - daddr = &p_id->flow.saddr; - } - - proto = parse_ethhdr(&ctx->nh, ctx->data_end, ð); + p_info->time = bpf_ktime_get_ns(); + proto = parse_ethhdr(&pctx->nh, pctx->data_end, ð); // Parse IPv4/6 header if (proto == bpf_htons(ETH_P_IP)) { - p_id->flow.ipv = AF_INET; - p_id->flow.proto = parse_iphdr(&ctx->nh, ctx->data_end, &iph); + p_info->pid.flow.ipv = AF_INET; + p_info->pid.flow.proto = + parse_iphdr(&pctx->nh, pctx->data_end, &p_info->iph); } else if (proto == bpf_htons(ETH_P_IPV6)) { - p_id->flow.ipv = AF_INET6; - p_id->flow.proto = parse_ip6hdr(&ctx->nh, ctx->data_end, &ip6h); + p_info->pid.flow.ipv = AF_INET6; + p_info->pid.flow.proto = + parse_ip6hdr(&pctx->nh, pctx->data_end, &p_info->ip6h); } else { return -1; } // Parse identifer from suitable protocol - if (config.track_tcp && p_id->flow.proto == IPPROTO_TCP) - err = parse_tcp_identifier(ctx, &saddr->port, &daddr->port, fei, - &p_id->identifier); - else if (config.track_icmp && p_id->flow.proto == IPPROTO_ICMPV6 && - p_id->flow.ipv == AF_INET6) - err = parse_icmp6_identifier(ctx, &saddr->port, &daddr->port, - fei, &p_id->identifier); - else if (config.track_icmp && p_id->flow.proto == IPPROTO_ICMP && - p_id->flow.ipv == AF_INET) - err = parse_icmp_identifier(ctx, &saddr->port, &daddr->port, - fei, &p_id->identifier); + if (config.track_tcp && p_info->pid.flow.proto == IPPROTO_TCP) + err = parse_tcp_identifier(pctx, p_info); + else if (config.track_icmp && + p_info->pid.flow.proto == IPPROTO_ICMPV6 && + p_info->pid.flow.ipv == AF_INET6) + err = parse_icmp6_identifier(pctx, p_info); + else if (config.track_icmp && p_info->pid.flow.proto == IPPROTO_ICMP && + p_info->pid.flow.ipv == AF_INET) + err = parse_icmp_identifier(pctx, p_info); else return -1; // No matching protocol if (err) return -1; // Failed parsing protocol // Sucessfully parsed packet identifier - fill in IP-addresses and return - if (p_id->flow.ipv == AF_INET) { - map_ipv4_to_ipv6(iph->saddr, &saddr->ip); - map_ipv4_to_ipv6(iph->daddr, &daddr->ip); + if (p_info->pid.flow.ipv == AF_INET) { + map_ipv4_to_ipv6(&p_info->pid.flow.saddr.ip, + p_info->iph->saddr); + map_ipv4_to_ipv6(&p_info->pid.flow.daddr.ip, + p_info->iph->daddr); } else { // IPv6 - saddr->ip = ip6h->saddr; - daddr->ip = ip6h->daddr; + p_info->pid.flow.saddr.ip = p_info->ip6h->saddr; + p_info->pid.flow.daddr.ip = p_info->ip6h->daddr; } + + reverse_flow(&p_info->reply_pid.flow, &p_info->pid.flow); + p_info->payload = remaining_pkt_payload(pctx); + return 0; } /* - * Returns the number of unparsed bytes left in the packet (bytes after nh.pos) - */ -static __u32 remaining_pkt_payload(struct parsing_context *ctx) -{ - // pkt_len - (pos - data) fails because compiler transforms it to pkt_len - pos + data (pkt_len - 
pos not ok because value - pointer) - // data + pkt_len - pos fails on (data+pkt_len) - pos due to math between pkt_pointer and unbounded register - __u32 parsed_bytes = ctx->nh.pos - ctx->data; - return parsed_bytes < ctx->pkt_len ? ctx->pkt_len - parsed_bytes : 0; -} - -/* - * Calculate a smooted rtt similar to how TCP stack does it in + * Calculate a smoothed rtt similar to how TCP stack does it in * net/ipv4/tcp_input.c/tcp_rtt_estimator(). * * NOTE: Will cause roundoff errors, but if RTTs > 1000ns errors should be small @@ -352,86 +391,172 @@ static bool is_rate_limited(__u64 now, __u64 last_ts, __u64 rtt) } /* - * Fills in event_type, timestamp, flow, source and reserved. - * Does not fill in the flow_info. + * Sends a flow-event message based on p_info. + * + * The rev_flow argument is used to inform if the message is for the flow + * in the current direction or the reverse flow, and will adapt the flow and + * source members accordingly. */ -static void fill_flow_event(struct flow_event *fe, __u64 timestamp, - struct network_tuple *flow, - enum flow_event_source source) +static void send_flow_event(void *ctx, struct packet_info *p_info, + bool rev_flow) { - fe->event_type = EVENT_TYPE_FLOW; - fe->timestamp = timestamp; - fe->flow = *flow; - fe->source = source; - fe->reserved = 0; // Make sure it's initilized + struct flow_event fe = { + .event_type = EVENT_TYPE_FLOW, + .flow_event_type = p_info->event_type, + .reason = p_info->event_reason, + .timestamp = p_info->time, + .reserved = 0, // Make sure it's initilized + }; + + if (rev_flow) { + fe.flow = p_info->pid.flow; + fe.source = EVENT_SOURCE_PKT_SRC; + } else { + fe.flow = p_info->reply_pid.flow; + fe.source = EVENT_SOURCE_PKT_DEST; + } + + bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe, sizeof(fe)); } /* - * Main function for handling the pping egress path. - * Parses the packet for an identifer and attemps to store a timestamp for it - * in the packet_ts map. 
+ * Attempt to create a new flow-state and push flow-opening message + * Returns a pointer to the flow_state if successful, NULL otherwise */ -static void pping_egress(void *ctx, struct parsing_context *pctx) +static struct flow_state *create_flow(void *ctx, struct packet_info *p_info) { - struct packet_id p_id = { 0 }; - struct flow_event fe; - struct flow_state *f_state; struct flow_state new_state = { 0 }; - bool new_flow = false; - __u64 now; - if (parse_packet_identifier(pctx, &p_id, &fe.event_info) < 0) - return; + new_state.last_timestamp = p_info->time; + if (bpf_map_update_elem(&flow_state, &p_info->pid.flow, &new_state, + BPF_NOEXIST) != 0) + return NULL; - now = bpf_ktime_get_ns(); // or bpf_ktime_get_boot_ns - f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow); + if (p_info->event_type != FLOW_EVENT_OPENING) { + p_info->event_type = FLOW_EVENT_OPENING; + p_info->event_reason = EVENT_REASON_FIRST_OBS_PCKT; + } + send_flow_event(ctx, p_info, false); - // Flow closing - try to delete flow state and push closing-event - if (fe.event_info.event == FLOW_EVENT_CLOSING) { - if (!f_state) { - bpf_map_delete_elem(&flow_state, &p_id.flow); - fill_flow_event(&fe, now, &p_id.flow, - EVENT_SOURCE_EGRESS); - bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, - &fe, sizeof(fe)); - } - return; + return bpf_map_lookup_elem(&flow_state, &p_info->pid.flow); +} + +static struct flow_state *update_flow(void *ctx, struct packet_info *p_info, + bool *new_flow) +{ + struct flow_state *f_state; + *new_flow = false; + + f_state = bpf_map_lookup_elem(&flow_state, &p_info->pid.flow); + if (!f_state && p_info->pid_valid) { + *new_flow = true; + f_state = create_flow(ctx, p_info); } - // No previous state - attempt to create it and push flow-opening event - if (!f_state) { - new_state.last_timestamp = now; - if (bpf_map_update_elem(&flow_state, &p_id.flow, &new_state, - BPF_NOEXIST) == 0) { - new_flow = true; - - if (fe.event_info.event != FLOW_EVENT_OPENING) { - fe.event_info.event = FLOW_EVENT_OPENING; - fe.event_info.reason = - EVENT_REASON_FIRST_OBS_PCKT; - } - fill_flow_event(&fe, now, &p_id.flow, - EVENT_SOURCE_EGRESS); - bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, - &fe, sizeof(fe)); - } - - f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow); - if (!f_state) // Creation failed - return; - } + if (!f_state) + return NULL; f_state->sent_pkts++; - f_state->sent_bytes += remaining_pkt_payload(pctx); + f_state->sent_bytes += p_info->payload; + + return f_state; +} + +static struct flow_state *update_rev_flow(struct packet_info *p_info) +{ + struct flow_state *f_state; + + f_state = bpf_map_lookup_elem(&flow_state, &p_info->reply_pid.flow); + if (!f_state) + return NULL; + + f_state->rec_pkts++; + f_state->rec_bytes += p_info->payload; + + return f_state; +} + +static void delete_closed_flows(void *ctx, struct packet_info *p_info) +{ + // Flow closing - try to delete flow state and push closing-event + if (p_info->event_type == FLOW_EVENT_CLOSING || + p_info->event_type == FLOW_EVENT_CLOSING_BOTH) { + if (!bpf_map_delete_elem(&flow_state, &p_info->pid.flow)) + send_flow_event(ctx, p_info, false); + } + + // Also close reverse flow + if (p_info->event_type == FLOW_EVENT_CLOSING_BOTH) { + if (!bpf_map_delete_elem(&flow_state, &p_info->reply_pid.flow)) + send_flow_event(ctx, p_info, true); + } +} + +/* + * Return true if p_info->pid.flow.daddr is a "local" address. + * + * Works by performing a fib lookup for p_info->pid.flow. 
+ * Lookup struct filled based on examples from + * samples/bpf/xdp_fwd_kern.c/xdp_fwd_flags() and + * tools/testing/selftests/bpf/progs/test_tc_neigh_fib.c + */ +static bool is_local_address(struct packet_info *p_info, void *ctx, + struct parsing_context *pctx) +{ + int ret; + struct bpf_fib_lookup lookup; + __builtin_memset(&lookup, 0, sizeof(lookup)); + + lookup.ifindex = pctx->ingress_ifindex; + lookup.family = p_info->pid.flow.ipv; + + if (lookup.family == AF_INET) { + lookup.tos = p_info->iph->tos; + lookup.tot_len = bpf_ntohs(p_info->iph->tot_len); + lookup.ipv4_src = p_info->iph->saddr; + lookup.ipv4_dst = p_info->iph->daddr; + } else if (lookup.family == AF_INET6) { + struct in6_addr *src = (struct in6_addr *)lookup.ipv6_src; + struct in6_addr *dst = (struct in6_addr *)lookup.ipv6_dst; + + lookup.flowinfo = *(__be32 *)p_info->ip6h & IPV6_FLOWINFO_MASK; + lookup.tot_len = bpf_ntohs(p_info->ip6h->payload_len); + *src = p_info->pid.flow.saddr.ip; //verifier did not like ip6h->saddr + *dst = p_info->pid.flow.daddr.ip; + } + + lookup.l4_protocol = p_info->pid.flow.proto; + lookup.sport = 0; + lookup.dport = 0; + + ret = bpf_fib_lookup(ctx, &lookup, sizeof(lookup), 0); + + return ret == BPF_FIB_LKUP_RET_NOT_FWDED || + ret == BPF_FIB_LKUP_RET_FWD_DISABLED; +} + +/* + * Attempt to create a timestamp-entry for packet p_info for flow in f_state + */ +static void pping_timestamp_packet(struct flow_state *f_state, void *ctx, + struct parsing_context *pctx, + struct packet_info *p_info, bool new_flow) +{ + if (!f_state || !p_info->pid_valid) + return; + + if (config.localfilt && !pctx->is_egress && + is_local_address(p_info, ctx, pctx)) + return; // Check if identfier is new - if (f_state->last_id == p_id.identifier) + if (!new_flow && f_state->last_id == p_info->pid.identifier) return; - f_state->last_id = p_id.identifier; + f_state->last_id = p_info->pid.identifier; // Check rate-limit if (!new_flow && - is_rate_limited(now, f_state->last_timestamp, + is_rate_limited(p_info->time, f_state->last_timestamp, config.use_srtt ? f_state->srtt : f_state->min_rtt)) return; @@ -441,44 +566,32 @@ static void pping_egress(void *ctx, struct parsing_context *pctx) * the next available map slot somewhat fairer between heavy and sparse * flows. */ - f_state->last_timestamp = now; - bpf_map_update_elem(&packet_ts, &p_id, &now, BPF_NOEXIST); + f_state->last_timestamp = p_info->time; - return; + bpf_map_update_elem(&packet_ts, &p_info->pid, &p_info->time, + BPF_NOEXIST); } /* - * Main function for handling the pping ingress path. - * Parses the packet for an identifer and tries to lookup a stored timestmap. - * If it finds a match, it pushes an rtt_event to the events buffer. 
+ * Attempt to match packet in p_info with a timestamp from flow in f_state */ -static void pping_ingress(void *ctx, struct parsing_context *pctx) +static void pping_match_packet(struct flow_state *f_state, void *ctx, + struct parsing_context *pctx, + struct packet_info *p_info) { - struct packet_id p_id = { 0 }; - struct flow_event fe; struct rtt_event re = { 0 }; - struct flow_state *f_state; __u64 *p_ts; - __u64 now; - if (parse_packet_identifier(pctx, &p_id, &fe.event_info) < 0) + if (!f_state || !p_info->reply_pid_valid) return; - f_state = bpf_map_lookup_elem(&flow_state, &p_id.flow); - if (!f_state) + p_ts = bpf_map_lookup_elem(&packet_ts, &p_info->reply_pid); + if (!p_ts || p_info->time < *p_ts) return; - f_state->rec_pkts++; - f_state->rec_bytes += remaining_pkt_payload(pctx); - - now = bpf_ktime_get_ns(); - p_ts = bpf_map_lookup_elem(&packet_ts, &p_id); - if (!p_ts || now < *p_ts) - goto validflow_out; - - re.rtt = now - *p_ts; + re.rtt = p_info->time - *p_ts; // Delete timestamp entry as soon as RTT is calculated - bpf_map_delete_elem(&packet_ts, &p_id); + bpf_map_delete_elem(&packet_ts, &p_info->reply_pid); if (f_state->min_rtt == 0 || re.rtt < f_state->min_rtt) f_state->min_rtt = re.rtt; @@ -486,25 +599,40 @@ static void pping_ingress(void *ctx, struct parsing_context *pctx) // Fill event and push to perf-buffer re.event_type = EVENT_TYPE_RTT; - re.timestamp = now; + re.timestamp = p_info->time; re.min_rtt = f_state->min_rtt; re.sent_pkts = f_state->sent_pkts; re.sent_bytes = f_state->sent_bytes; re.rec_pkts = f_state->rec_pkts; re.rec_bytes = f_state->rec_bytes; - re.flow = p_id.flow; + re.flow = p_info->pid.flow; + re.match_on_egress = pctx->is_egress; bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &re, sizeof(re)); +} -validflow_out: - // Wait with deleting flow until having pushed final RTT message - if (fe.event_info.event == FLOW_EVENT_CLOSING && - bpf_map_delete_elem(&flow_state, &p_id.flow) == 0) { - fill_flow_event(&fe, now, &p_id.flow, EVENT_SOURCE_INGRESS); - bpf_perf_event_output(ctx, &events, BPF_F_CURRENT_CPU, &fe, - sizeof(fe)); +/* + * Will parse the ingress/egress packet in pctx and attempt to create a + * timestamp for it and match it against the reverse flow. 
+ */ +static void pping(void *ctx, struct parsing_context *pctx) +{ + struct packet_info p_info = { 0 }; + struct flow_state *f_state; + bool new_flow; + + if (parse_packet_identifier(pctx, &p_info) < 0) + return; + + if (p_info.event_type != FLOW_EVENT_CLOSING && + p_info.event_type != FLOW_EVENT_CLOSING_BOTH) { + f_state = update_flow(ctx, &p_info, &new_flow); + pping_timestamp_packet(f_state, ctx, pctx, &p_info, new_flow); } - return; + f_state = update_rev_flow(&p_info); + pping_match_packet(f_state, ctx, pctx, &p_info); + + delete_closed_flows(ctx, &p_info); } // Programs @@ -521,7 +649,7 @@ int pping_tc_egress(struct __sk_buff *skb) .is_egress = true, }; - pping_egress(skb, &pctx); + pping(skb, &pctx); return TC_ACT_UNSPEC; } @@ -535,10 +663,11 @@ int pping_tc_ingress(struct __sk_buff *skb) .data_end = (void *)(long)skb->data_end, .pkt_len = skb->len, .nh = { .pos = pctx.data }, + .ingress_ifindex = skb->ingress_ifindex, .is_egress = false, }; - pping_ingress(skb, &pctx); + pping(skb, &pctx); return TC_ACT_UNSPEC; } @@ -552,10 +681,11 @@ int pping_xdp_ingress(struct xdp_md *ctx) .data_end = (void *)(long)ctx->data_end, .pkt_len = pctx.data_end - pctx.data, .nh = { .pos = pctx.data }, + .ingress_ifindex = ctx->ingress_ifindex, .is_egress = false, }; - pping_ingress(ctx, &pctx); + pping(ctx, &pctx); return XDP_PASS; }
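As a closing illustration of how the single `events` perf buffer is consumed, the sketch below dispatches on the `event_type` field shared by both event structs, roughly mirroring what `pping.c` does in its print handlers. It is a minimal sketch, not the actual `pping.c` code: `handle_event` is an illustrative name, the perf-buffer setup and polling are omitted, and it assumes the `pping.h` definitions from this patch.

```c
// Minimal sketch of dispatching events read from the "events" perf buffer.
// Assumes the structs/enums defined in pping.h by this patch.
#include <stdio.h>
#include "pping.h"

static void handle_event(void *ctx, int cpu, void *data, __u32 size)
{
	const union pping_event *e = data;

	if (size < sizeof(e->rtt_event.event_type))
		return;

	if (e->rtt_event.event_type == EVENT_TYPE_RTT)
		/* RTT in nanoseconds, plus whether it was matched on egress */
		printf("rtt=%llu ns, egress_match=%d\n",
		       (unsigned long long)e->rtt_event.rtt,
		       e->rtt_event.match_on_egress);
	else if (e->flow_event.event_type == EVENT_TYPE_FLOW)
		printf("flow event %d (reason %d)\n",
		       e->flow_event.flow_event_type, e->flow_event.reason);
}
```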