
NAT64 BPF implementation

This directory contains a stateless NAT64 translator, like Tayga, implemented entirely in BPF. It attaches to the TC hooks of an interface and translates incoming IPv6 packets whose destination is in the configured NAT64 prefix, and routes v4 packets back out through that interface based on the (v4) prefix used for translation.

Running

To run the translator on eth0 with an IPv4 prefix of 10.0.1.0/24, using the default well-known v6 prefix (64:ff9b::/96) and allowing IPv6 sources in fc00::/8 (the -a argument), run:

sudo ./nat64 -i eth0 -4 10.0.1.0/24 -a fc00::/8

To unload, run the same command again with -u added. The other parameters must still be specified, as they are needed to clean up properly. To specify another v6 prefix, use -6.

The userspace utility will install the necessary routing rules and set up the BPF programs, then exit. The translator will then keep running entirely in the kernel until unloaded (with -u).

Assumptions

The operation of this NAT64 translator makes a few assumptions:

  • A single v6 NAT64 prefix is used, and the prefix length is always 96 (i.e., the v4 addresses live in the last four bytes). By default the well-known prefix 64:ff9b::/96 is used.
  • IPv6 source addresses are mapped into a configured IPv4 prefix one-to-one. Regular NAT4 can be applied afterwards to map to a single public IP. A separate v4 prefix should be used for every interface that the translator runs on. Source address v6-to-v4 mappings are dynamically created as new sources appear, and time out after two hours.
  • An allowlist of IPv6 source prefixes that should be subject to translation is maintained.
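The fixed /96 prefix length makes the address mapping a plain copy of the last four bytes. A minimal sketch of that embedding in plain userspace C follows; the function names (nat64_embed, nat64_extract, nat64_in_prefix) are hypothetical and chosen for illustration, not taken from the BPF programs in this directory.

```c
#include <arpa/inet.h>   /* inet_pton(), for callers that parse text addresses */
#include <netinet/in.h>  /* struct in6_addr, struct in_addr */
#include <string.h>      /* memcpy(), memcmp() */

/* Build the IPv6 address that represents `v4` inside a /96 NAT64 prefix.
 * Only the first 12 bytes of `prefix` are kept; the last four bytes are
 * overwritten with the IPv4 address (already in network byte order). */
static inline struct in6_addr nat64_embed(const struct in6_addr *prefix,
                                          struct in_addr v4)
{
    struct in6_addr out = *prefix;

    memcpy(&out.s6_addr[12], &v4.s_addr, 4);
    return out;
}

/* Recover the IPv4 address from the last four bytes of a NAT64 address. */
static inline struct in_addr nat64_extract(const struct in6_addr *v6)
{
    struct in_addr v4;

    memcpy(&v4.s_addr, &v6->s6_addr[12], 4);
    return v4;
}

/* Return 1 if `addr` lies inside the /96 `prefix`, 0 otherwise. */
static inline int nat64_in_prefix(const struct in6_addr *prefix,
                                  const struct in6_addr *addr)
{
    return memcmp(prefix->s6_addr, addr->s6_addr, 12) == 0;
}
```

With the well-known prefix, embedding 192.0.2.1 gives 64:ff9b::c000:201, and extracting from that address returns 192.0.2.1 again.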

How it works

Two BPF programs are attached to the ingress and egress hooks of the interface being configured. The ingress program will process IPv6 packets, and any packet with a destination address in the configured NAT64 prefix will be either translated (if the source is allowed), or dropped. The egress program processes IPv4 packets and any packet with a destination in the configured v4 prefix will be either translated (if a v6 address is found in the state map) or dropped.
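The ingress decision described above can be sketched as a small pure function. This is an illustration in plain userspace C, not the code of the BPF program: the type and function names (v6_prefix, prefix_match, ingress_verdict) are hypothetical, and passing non-NAT64 traffic through untouched is an assumption, since the text only says what happens to packets destined to the NAT64 prefix.

```c
#include <netinet/in.h>  /* struct in6_addr */
#include <stddef.h>      /* size_t */
#include <string.h>      /* memcmp() */

enum verdict { VERDICT_PASS, VERDICT_TRANSLATE, VERDICT_DROP };

struct v6_prefix {
    struct in6_addr addr;
    unsigned int len;    /* prefix length in bits, 0..128 */
};

/* Return 1 if the first `p->len` bits of `a` equal those of the prefix. */
static inline int prefix_match(const struct v6_prefix *p,
                               const struct in6_addr *a)
{
    unsigned int bytes = p->len / 8, bits = p->len % 8;

    if (p->len > 128)
        return 0;
    if (memcmp(p->addr.s6_addr, a->s6_addr, bytes) != 0)
        return 0;
    if (bits) {
        /* Compare only the leading `bits` bits of the partial byte. */
        unsigned char mask = (unsigned char)(0xff << (8 - bits));

        if ((p->addr.s6_addr[bytes] ^ a->s6_addr[bytes]) & mask)
            return 0;
    }
    return 1;
}

/* Decide what to do with an incoming IPv6 packet. */
static inline enum verdict ingress_verdict(const struct v6_prefix *nat64,
                                           const struct v6_prefix *allowed,
                                           size_t n_allowed,
                                           const struct in6_addr *src,
                                           const struct in6_addr *dst)
{
    size_t i;

    if (!prefix_match(nat64, dst))
        return VERDICT_PASS;          /* assumption: not ours, leave alone */
    for (i = 0; i < n_allowed; i++)
        if (prefix_match(&allowed[i], src))
            return VERDICT_TRANSLATE; /* allowed source: translate to v4 */
    return VERDICT_DROP;              /* NAT64 destination, source not allowed */
}
```

With a NAT64 prefix of 64:ff9b::/96 and an allowlist of fc00::/8, a packet from fc00::1 to 64:ff9b::c000:201 is translated, the same destination from 2001:db8::1 is dropped, and a packet to 2001:db8::2 is passed. The egress side would follow the same shape, with the allowlist check replaced by a lookup in the state map.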

To make sure the v4 traffic makes it to the right interface, a v4-via-v6 route is installed on that interface with a gateway address of the network address of the v6 prefix, and a fake neighbour entry is installed to avoid the kernel doing neighbour lookups of the gateway. This gets the packets to where the BPF program can process them, and after translation a new neighbour lookup will be performed with the new v6 destination.

Note that because of the place of the BPF hook in ingress processing, the ingress BPF program will need to redirect the packet to the same interface after translation for re-processing as an IPv4 packet. This means that things like tcpdump will see first the original IPv6 packet, and then the translated IPv4 packet. On egress the translation happens earlier, so only the translated packet will be seen.

Limitations / known issues

At least the first two of these should probably be fixed before deploying this:

  • The IP headers in ICMP error message payloads are not translated, which probably breaks ICMP errors.
  • The BPF programs assume the interface is an Ethernet interface, so translation won't work on layer 3 devices (like Wireguard tunnels).
  • IP options and IPv6 extension headers are not handled at all. In particular this means that fragmented IPv6 packets, which carry a fragment extension header, won't pass the translator.
  • The BPF programs support specifying multiple allowed source IPv6 prefixes, as well as doing ahead-of-time static mappings, but the userspace component doesn't support these yet.
  • The userspace program also has no way to print its status, or dump the state of the translation table. The BPF maps can be inspected with bpftool as a stopgap measure, though.