Introduction
IPng’s network is built up in two main layers: (1) an MPLS transport layer, which is disconnected from the Internet, and (2) a VPP overlay, which carries the Internet. I created a BGP Free core transport network, which uses MPLS switches from a company called Centec. These switches offer IPv4, IPv6, VxLAN, GENEVE and GRE all in silicon, are very cheap on power and relatively affordable per port.
Centec switches allow for a modest but not huge number of routes in the hardware forwarding tables. I loadtested them in [a previous article] at line rate (well, at least 8x10G with 64-byte packets, around 110Mpps), and they forward IPv4, IPv6 and MPLS traffic effortlessly, at 45 watts.
I wrote more about the Centec switches in [my review] of them back in 2022.
IPng Site Local
I leverage this internal transport network for more than just MPLS. The transport switches are perfectly capable of line rate (at 100G+) IPv4 and IPv6 forwarding as well. When designing IPng Site Local, I created a number plan that assigns IPv4 from the 198.19.0.0/16 prefix, and IPv6 from the 2001:678:d78:500::/56 prefix. Within these, I allocate blocks for Loopback addresses, PointToPoint subnets, and hypervisor networks for VMs and internal traffic.
Take a look at the diagram to the right. Each site has one or more Centec switches (in red), and there are three redundant gateways that connect the IPng Site Local network to the Internet (in orange). I run lots of services in this red portion of the network: site to site backups [Borgbackup], ZFS replication [ZRepl], a message bus using [Nats], and of course monitoring with SNMP and Prometheus all make use of this network. But this network isn’t only for internal services and management traffic: I also actively use it to expose public services!
For example, I operate a bunch of [NGINX Frontends] that have a public IPv4/IPv6 address, and act as reverse proxies for webservices (like [ublog.tech] or [Rallly]) which run on VMs and Docker hosts which don’t have public IP addresses. Another example, which I wrote about [last week], is a bunch of mail services that run on VMs without public access, but are each carefully exposed via reverse proxies (like Postfix, Dovecot, or [Roundcube]). It’s an incredibly versatile network design!
Border Gateways
Seeing as IPng Site Local uses native IPv6, it’s rather straightforward to give each hypervisor and VM an IPv6 address, and configure IPv4 only on the externally facing NGINX Frontends. As a reverse proxy, NGINX will create a new TCP session to the internal server, and that’s a fine solution. However, I also want my internal hypervisors and servers to have full Internet connectivity. For IPv6, this feels pretty straightforward, as I can just route the 2001:678:d78:500::/56 through a firewall that blocks incoming traffic, and call it a day. For IPv4, similarly I can use classic NAT just like one would in a residential network.
But what if I wanted to go IPv6-only? This poses a small challenge, because while IPng is fully IPv6 capable, and has been since the early 2000s, the rest of the internet is not quite there yet. For example, the quite popular [GitHub] hosting site still has only an IPv4 address. Come on, folks, what’s taking you so long?! It is for this purpose that NAT64 was invented. Described in [RFC6146]:
Stateful NAT64 translation allows IPv6-only clients to contact IPv4 servers using unicast UDP, TCP, or ICMP. One or more public IPv4 addresses assigned to a NAT64 translator are shared among several IPv6-only clients. When stateful NAT64 is used in conjunction with DNS64, no changes are usually required in the IPv6 client or the IPv4 server.
The rest of this article describes version 2 of the IPng SL border gateways, which opens the path for IPng to go IPv6-only. By the way, I thought it would be super complicated, but in hindsight: I should have done this years ago!
Gateway Design
Let me take a closer look at the orange boxes that I drew in the network diagram above. I call these machines Border Gateways. Their job is to sit between IPng Site Local and the Internet. They’ll each have one network interface connected to the Centec switch, and another connected to the VPP routers at AS8298. They will provide two main functions: firewalling, so that no unwanted traffic enters IPng Site Local, and network address translation, so that:
- IPv4 users from 198.19.0.0/16 can reach external IPv4 addresses,
- IPv6 users from 2001:678:d78:500::/56 can reach external IPv6,
- IPv6-only users can reach external IPv4 addresses, a neat trick.
IPv4 and IPv6 NAT
Let me start off with the basic table stakes. You’ll likely be familiar with masquerading, a NAT technique in Linux that uses the public IPv4 address assigned by your provider, allowing many internal clients, often using [RFC1918] addresses, to access the internet via that shared IPv4 address. You may not have come across IPv6 masquerading, but it’s equally possible to take an internal (private, non-routable) IPv6 network and access the internet via a shared IPv6 address.
I will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:
| Machine | IPv4 pool | IPv6 pool |
|---|---|---|
| border0.chbtl0.net.ipng.ch | 194.126.235.0/30 | 2001:678:d78::3:0:0/125 |
| border0.chrma0.net.ipng.ch | 194.126.235.4/30 | 2001:678:d78::3:1:0/125 |
| border0.chplo0.net.ipng.ch | 194.126.235.8/30 | 2001:678:d78::3:2:0/125 |
| border0.nlams0.net.ipng.ch | 194.126.235.12/30 | 2001:678:d78::3:3:0/125 |
Linux iptables masquerading will only work with the IP addresses assigned to the external interface, so I will need to use a slightly different approach to be able to use these pools. In case you’re wondering: IPng’s internal network has grown to a size where I can no longer expose it all behind a single IPv4 address, because there would not be enough TCP/UDP ports. Luckily, NATing via a pool is pretty easy using the SNAT target:
pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/rc.firewall.ipng-sl
# IPng Site Local: Enable stateful firewalling on IPv4/IPv6 forwarding
iptables -P FORWARD DROP
ip6tables -P FORWARD DROP
iptables -I FORWARD -i enp1s0f1 -m state --state NEW -s 198.19.0.0/16 -j ACCEPT
ip6tables -I FORWARD -i enp1s0f1 -m state --state NEW -s 2001:678:d78:500::/56 -j ACCEPT
iptables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
ip6tables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
# IPng Site Local: Enable NAT on external interface using NAT pools
iptables -t nat -I POSTROUTING -s 198.19.0.0/16 -o enp1s0f0 \
-j SNAT --to 194.126.235.4-194.126.235.7
ip6tables -t nat -I POSTROUTING -s 2001:678:d78:500::/56 -o enp1s0f0 \
-j SNAT --to 2001:678:d78::3:1:0-2001:678:d78::3:1:7
EOF
From the top: I’ll first make it the default for the kernel to refuse to FORWARD any traffic that is not explicitly accepted. I will only allow traffic that comes in via enp1s0f1 (the internal interface), and only if it comes from the assigned IPv4 and IPv6 site local prefixes. On the way back, I’ll allow traffic that matches states created on the way out. This is the firewalling portion of the setup.
Then, two POSTROUTING rules turn on network address translation. If the source address is any of the site local prefixes, I’ll rewrite it to come from the IPv4 or IPv6 pool addresses, respectively. This is the NAT44 and NAT66 portion of the setup.
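The reasoning behind using a pool rather than a single address can be sketched with a bit of arithmetic. This is a rough illustration only: the usable port range of 1024-65535 is an assumption (it mirrors the range used for Jool’s pool4 further down), the function names are made up for this example, and the result is an upper bound for flows of one protocol towards a single remote address and port, not a statement about how the kernel allocates ports.

```python
import ipaddress

# Assumption: ports 1024-65535 are usable on each pool address, per protocol.
PORT_LOW, PORT_HIGH = 1024, 65535


def pool_addresses(prefix: str) -> int:
    """Return the number of addresses contained in a NAT pool prefix."""
    return ipaddress.ip_network(prefix).num_addresses


def max_bindings_per_destination(prefix: str) -> int:
    """Upper bound on concurrent flows of one protocol to one remote ip:port."""
    ports_per_address = PORT_HIGH - PORT_LOW + 1
    return pool_addresses(prefix) * ports_per_address


if __name__ == "__main__":
    print(pool_addresses("194.126.235.4/30"))                # 4
    print(pool_addresses("2001:678:d78::3:1:0/125"))         # 8
    print(max_bindings_per_destination("194.126.235.4/30"))  # 258048
```

A single address would offer a quarter of that, which is where a pool starts to pay off for a larger internal network.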
NAT64: Jool
So far, so good. But this article is about NAT64 :-) Here’s where I grossly overestimated how difficult it might be – and if there’s one takeaway from my story here, it should be that NAT64 is as straightforward as the others! Enter [Jool], an Open Source SIIT and NAT64 for Linux. It’s available in Debian as a DKMS kernel module and userspace tool, and it integrates cleanly with both iptables and netfilter.
Jool is a network address and port translating implementation, which is referred to as NAPT, just like regular IPv4 NAT. When internal IPv6 clients try to reach an external endpoint, Jool will make note of the internal src6:port, then select an external IPv4 address:port, rewrite the packet, and on the way back, correlate the src4:port with the internal src6:port, and rewrite the packet. If this sounds an awful lot like NAT, then you’re not wrong! The only difference is, Jool will also translate the address family: it will rewrite the internal IPv6 addresses to external IPv4 addresses.
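To make the bookkeeping concrete, here is a toy binding table in Python. It is a sketch of the NAPT idea only and not how Jool is implemented: the class and method names are hypothetical, and a real translator also keeps separate tables per protocol, tracks session state, and expires idle bindings.

```python
import ipaddress
from itertools import product


class Nat64Table:
    """Toy NAPT table mapping (src6, port6) <-> (src4, port4)."""

    def __init__(self, pool4: str, port_low: int = 1024, port_high: int = 65535):
        addrs = [str(a) for a in ipaddress.ip_network(pool4)]
        # Lazily walk every (address, port) pair of the IPv4 pool, in order.
        self._free = product(addrs, range(port_low, port_high + 1))
        self._out = {}   # (src6, port6) -> (src4, port4)
        self._back = {}  # (src4, port4) -> (src6, port6)

    def outbound(self, src6: str, port6: int):
        """Return the IPv4 binding for an internal client, creating it if new."""
        key = (src6, port6)
        if key not in self._out:
            binding = next(self._free)  # raises StopIteration once exhausted
            self._out[key] = binding
            self._back[binding] = key
        return self._out[key]

    def inbound(self, dst4: str, port4: int):
        """Map a returning packet back to the internal client, or None."""
        return self._back.get((dst4, port4))


if __name__ == "__main__":
    table = Nat64Table("194.126.235.4/30")
    binding = table.outbound("2001:678:d78:50b::f", 40000)
    print(binding)                  # ('194.126.235.4', 1024)
    print(table.inbound(*binding))  # ('2001:678:d78:50b::f', 40000)
```

The two dictionaries are the whole trick: one is consulted on the way out, the other on the way back.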
Installing Jool is as simple as this:
pim@border0-chrma0:~$ sudo apt install jool-dkms jool-tools
pim@border0-chrma0:~$ sudo mkdir /etc/jool
pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/jool/jool.conf
{
  "comment": {
    "description": "Full NAT64 configuration for border0.chrma0.net.ipng.ch",
    "last update": "2024-05-21"
  },
  "instance": "default",
  "framework": "netfilter",
  "global": { "pool6": "2001:678:d78:564::/96", "lowest-ipv6-mtu": 1280, "logging-debug": false },
  "pool4": [
    { "protocol": "TCP", "prefix": "194.126.235.4/30", "port range": "1024-65535" },
    { "protocol": "UDP", "prefix": "194.126.235.4/30", "port range": "1024-65535" },
    { "protocol": "ICMP", "prefix": "194.126.235.4/30" }
  ]
}
EOF
pim@border0-chrma0:~$ sudo systemctl start jool
.. and that, as they say, is all there is to it! There are two things I make note of here:
- I have assigned 2001:678:d78:564::/96 as NAT64 pool6, which means that if this machine sees any traffic destined to that prefix, it’ll activate Jool, select an available IPv4 address:port from the pool4, and send the packet to the IPv4 destination address which it takes from the last 32 bits of the original IPv6 destination address.
- Cool trick: I am reusing the same IPv4 pool as for regular NAT. The Jool kernel module happily coexists with the iptables implementation!
DNS64: Unbound
There’s one vital piece of information missing, and it took me a little while to appreciate that. If I take an IPv6 only host, like Summer, and I try to connect to an IPv4-only host, how does that even work?
pim@summer:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
eno1 UP 2001:678:d78:50b::f/64 fe80::7e4d:8fff:fe03:3c00/64
pim@summer:~$ ip -6 ro
2001:678:d78:50b::/64 dev eno1 proto kernel metric 256 pref medium
fe80::/64 dev eno1 proto kernel metric 256 pref medium
default via 2001:678:d78:50b::1 dev eno1 proto static metric 1024 pref medium
pim@summer:~$ host github.com
github.com has address 140.82.121.4
pim@summer:~$ ping github.com
ping: connect: Network is unreachable
Now comes the really clever reveal: NAT64 works by assigning an IPv6 prefix that snugly fits the entire IPv4 address space, typically 64:ff9b::/96, but operators can choose any prefix they’d like. For IPng’s site local network, I decided to assign 2001:678:d78:564::/96 for this purpose (this is the global.pool6 attribute in Jool’s config file I described above). A resolver can then tweak DNS lookups for IPv6-only hosts to return addresses from that IPv6 range. This tweaking is called DNS64, described in [RFC6147]:
DNS64 is a mechanism for synthesizing AAAA records from A records. DNS64 is used with an IPv6/IPv4 translator to enable client-server communication between an IPv6-only client and an IPv4-only server, without requiring any changes to either the IPv6 or the IPv4 node, for the class of applications that work through NATs.
I run the popular [Unbound] resolver at IPng, deployed as a set of anycasted instances across the network. With two lines of configuration only, I can turn on this feature:
pim@border0-chrma0:~$ cat << EOF | sudo tee /etc/unbound/unbound.conf.d/dns64.conf
server:
module-config: "dns64 iterator"
dns64-prefix: 2001:678:d78:564::/96
EOF
pim@border0-chrma0:~$ sudo systemctl restart unbound
The behavior of the resolver now changes in a very subtle but cool way:
pim@summer:~$ host github.com
github.com has address 140.82.121.3
github.com has IPv6 address 2001:678:d78:564::8c52:7903
pim@summer:~$ host 2001:678:d78:564::8c52:7903
3.0.9.7.2.5.c.8.0.0.0.0.0.0.0.0.4.6.5.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa
domain name pointer lb-140-82-121-3-fra.github.com.
Before, [github.com] did not return an AAAA record, so there was no way for Summer to connect to it. Now, not only does Unbound return an AAAA record, it also rewrites the PTR request: knowing that I’m asking for something in the DNS64 range of 2001:678:d78:564::/96, Unbound will take the last 32 bits (8c52:7903, which is the hex encoding of the original IPv4 address 140.82.121.3), and return the answer for a PTR lookup of the original 3.121.82.140.in-addr.arpa instead. Game changer!
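The address arithmetic behind this is small enough to sketch in a few lines of Python using the standard ipaddress module. This is an illustration of the mapping for a /96 prefix only, not Unbound’s code; the function names are hypothetical.

```python
import ipaddress

DNS64_PREFIX = ipaddress.ip_network("2001:678:d78:564::/96")


def synthesize_aaaa(a_record: str) -> str:
    """Place an IPv4 address into the last 32 bits of the DNS64 prefix."""
    v4 = ipaddress.IPv4Address(a_record)
    v6 = ipaddress.IPv6Address(int(DNS64_PREFIX.network_address) | int(v4))
    return str(v6)


def ptr_name_for(addr6: str) -> str:
    """Return the reverse DNS name to look up for an IPv6 address.

    Addresses inside the DNS64 prefix are mapped back to the in-addr.arpa
    name of the embedded IPv4 address; all others keep their ip6.arpa name.
    """
    addr = ipaddress.IPv6Address(addr6)
    if addr in DNS64_PREFIX:
        v4 = ipaddress.IPv4Address(int(addr) & 0xFFFFFFFF)
        return v4.reverse_pointer
    return addr.reverse_pointer


if __name__ == "__main__":
    print(synthesize_aaaa("140.82.121.3"))
    print(ptr_name_for("2001:678:d78:564::8c52:7903"))
```

Run against the github.com address from the listing above, the first call yields 2001:678:d78:564::8c52:7903 and the second yields 3.121.82.140.in-addr.arpa.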
DNS64 + NAT64
What I learned from this, is that the combination of these two tools provides the magic:
- When an IPv6-only client asks for AAAA for an IPv4-only hostname, Unbound will synthesize an AAAA from the IPv4 address, casting it into the last 32 bits of its NAT64 prefix 2001:678:d78:564::/96
- When an IPv6-only client tries to send traffic to 2001:678:d78:564::/96, Jool will do the address family (and address/port) translation. This is represented by the red (ipv6) flow in the diagram to the right turning into a green (ipv4) flow to the left.
What’s left for me to do is to ensure that (a) the NAT64 prefix is routed from IPng Site Local to the gateways and (b) the IPv4 and IPv6 NAT address pools are routed from the Internet to the gateways.
Internal: OSPF
I use Bird2 to accomplish the dynamic routing, and considering the Centec switch network is by design BGP Free, I will use OSPF and OSPFv3 for these announcements. Using OSPF has an important benefit: I can selectively turn on and off the Bird announcements to the Centec IPng Site Local network. Seeing as there will be multiple redundant gateways, if one of them goes down (either due to failure or because of maintenance), the network will quickly reconverge on another replica. Neat!
Here’s how I configure the OSPF import and export filters:
filter ospf_import {
if (net.type = NET_IP4 && net ~ [ 198.19.0.0/16 ]) then accept;
if (net.type = NET_IP6 && net ~ [ 2001:678:d78:500::/56 ]) then accept;
reject;
}
filter ospf_export {
if (net.type=NET_IP4 && !(net~[198.19.0.255/32,0.0.0.0/0])) then reject;
if (net.type=NET_IP6 && !(net~[2001:678:d78:564::/96,2001:678:d78:500::1:0/128,::/0])) then reject;
ospf_metric1 = 200; unset(ospf_metric2);
accept;
}
When learning prefixes from the Centec switch, I will only accept precisely the IPng Site Local IPv4 (198.19.0.0/16) and IPv6 (2001:678:d78:500::/56) supernets. On sending prefixes to the Centec switches, I will announce:
- 198.19.0.255/32 and 2001:678:d78:500::1:0/128: These are the anycast addresses of the Unbound resolver.
- 0.0.0.0/0 and ::/0: These are default routes for IPv4 and IPv6 respectively.
- 2001:678:d78:564::/96: This is the NAT64 prefix, which will attract the IPv6-only traffic towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation of github.com, which is reachable only at legacy address 140.82.121.3.
I have to be careful with the announcements into OSPF. The cost of E1 routes is the external metric plus the internal cost within OSPF to reach that network. The cost of E2 routes will always be just the external metric: it takes no notice of the internal cost to reach that router. Therefore, I emit these prefixes with ospf_metric1 set and without Bird’s ospf_metric2, so that the closest border gateway is always used.
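The difference between the two external route types can be sketched as follows. The gateway names and all cost values are hypothetical, chosen to make the difference visible; the comparison rules follow my reading of RFC 2328, where E2 routes are compared on the external metric first and the internal distance only breaks ties.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gateway:
    name: str
    internal_cost: int    # OSPF cost from the client's switch to this gateway
    external_metric: int  # metric the gateway attaches to the external route


def best_e1(gateways):
    """E1: compare the sum of internal cost and external metric."""
    return min(gateways, key=lambda g: g.internal_cost + g.external_metric)


def best_e2(gateways):
    """E2: compare the external metric first; internal cost only breaks ties."""
    return min(gateways, key=lambda g: (g.external_metric, g.internal_cost))


if __name__ == "__main__":
    # Hypothetical values: a nearby gateway and a faraway one that happens
    # to advertise a lower external metric.
    gateways = [Gateway("near", 10, 200), Gateway("far", 500, 100)]
    print(best_e1(gateways).name)  # near
    print(best_e2(gateways).name)  # far
```

With E1, a lower external metric on a distant gateway cannot outweigh the internal distance unless the difference is large enough, which keeps traffic on the closest gateway in the common case.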
With that, I can see the following:
pim@summer:~$ traceroute6 github.com
traceroute to github.com (2001:678:d78:564::8c52:7903), 30 hops max, 80 byte packets
1 msw0.chbtl0.net.ipng.ch (2001:678:d78:50b::1) 4.134 ms 4.640 ms 4.796 ms
2 border0.chbtl0.net.ipng.ch (2001:678:d78:503::13) 0.751 ms 0.818 ms 0.688 ms
3 * * *
4 * * * ^C
I’m not quite there yet, I have one more step to go. What’s happening at the Border Gateway? Let me take a look at this, while I ping6 to github.com:
pim@summer:~$ ping6 github.com
PING github.com(lb-140-82-121-4-fra.github.com (2001:678:d78:564::8c52:7904)) 56 data bytes
... (nothing)
pim@border0-chbtl0:~$ sudo tcpdump -ni any src host 2001:678:d78:50b::f or dst host 140.82.121.4
11:25:19.225509 enp1s0f1 In IP6 2001:678:d78:50b::f > 2001:678:d78:564::8c52:7904:
ICMP6, echo request, id 3904, seq 7, length 64
11:25:19.225603 enp1s0f0 Out IP 194.126.235.3 > 140.82.121.4:
ICMP echo request, id 61668, seq 7, length 64
Unbound and Jool are doing great work. Unbound saw my DNS request for IPv4-only github.com, and synthesized a DNS64 response for me. Jool then saw the inbound packet from enp1s0f1, the internal interface pointed at IPng Site Local. This is because the 2001:678:d78:564::/96 prefix is announced in OSPFv3, so every host knows to route traffic for that prefix towards this border gateway. Then, I see the NAT64 in action on the outbound interface enp1s0f0. Here, one of the IPv4 pool addresses is selected as source address. But there is no return packet, because there is no route back from the Internet, yet.
External: BGP
The final step for me is to allow return traffic, from the Internet to the IPv4 and IPv6 pools to reach this Border Gateway instance. For this, I configure BGP with the following Bird2 configuration snippet:
filter bgp_import {
if (net.type = NET_IP4 && !(net = 0.0.0.0/0)) then reject;
if (net.type = NET_IP6 && !(net = ::/0)) then reject;
accept;
}
filter bgp_export {
if (net.type = NET_IP4 && !(net ~ [ 194.126.235.4/30 ])) then reject;
if (net.type = NET_IP6 && !(net ~ [ 2001:678:d78::3:1:0/125 ])) then reject;
# Add BGP Wellknown community no-export (FFFF:FF01)
bgp_community.add((65535,65281));
accept;
}
I then establish an eBGP session from private AS64513 to two of IPng Networks’ core routers at AS8298. I add the wellknown BGP no-export community (FFFF:FF01) so that these prefixes are learned in AS8298, but never propagated. It’s not strictly necessary, because AS8298 won’t announce more specifics like these anyway, but it’s a nice way to really assert that these are meant to stay local. Because AS8298 is already announcing 194.126.235.0/24 and 2001:678:d78::/48 supernets, return traffic will already be able to reach IPng’s routers upstream. With these more specific announcements of the /30 and /125 pools, the upstream VPP routers will be able to route the return traffic to this specific server.
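Why the more specific announcement wins can be shown with a toy longest-prefix-match lookup, together with the hex form of the no-export community. This is a sketch only: the route table and next-hop labels are invented for the example, and a real router does this in a FIB rather than by scanning a dictionary.

```python
import ipaddress


def longest_prefix_match(routes, destination):
    """Return the next hop of the most specific route covering destination."""
    dst = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in routes.items():
        net = ipaddress.ip_network(prefix)
        # Membership is False when the address families differ.
        if dst in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None


def community_hex(high: int, low: int) -> str:
    """Format a standard BGP community pair as HHHH:LLLL in hex."""
    return f"{high:04X}:{low:04X}"


if __name__ == "__main__":
    # Hypothetical view from an AS8298 router: the aggregate plus one pool.
    routes = {
        "194.126.235.0/24": "aggregate",
        "194.126.235.4/30": "border0.chrma0",
    }
    print(longest_prefix_match(routes, "194.126.235.5"))    # border0.chrma0
    print(longest_prefix_match(routes, "194.126.235.200"))  # aggregate
    print(community_hex(65535, 65281))                      # FFFF:FF01
```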
And with that, the ping to Unbound’s DNS64 provided IPv6 address for github.com shoots to life.
Results
I deployed four of these Border Gateways using Ansible: one at my office in Brüttisellen, one in Zurich, one in Geneva and one in Amsterdam. They do all three types of NAT, and serve DNS64:
- Announcing the IPv4 default 0.0.0.0/0 will allow them to serve as NAT44 gateways for 198.19.0.0/16
- Announcing the IPv6 default ::/0 will allow them to serve as NAT66 gateway for 2001:678:d78:500::/56
- Announcing the IPv6 nat64 prefix 2001:678:d78:564::/96 will allow them to serve as NAT64 gateway
- Announcing the IPv4 and IPv6 anycast address for nscache.net.ipng.ch allows them to serve DNS64
Each individual service can be turned on or off. For example, if a replica stops announcing the IPv4 default into the Centec network, it will no longer attract NAT44 traffic. Similarly, if it stops announcing the NAT64 prefix, it will no longer attract NAT64 traffic. OSPF in the IPng Site Local network will automatically select an alternative replica in such cases. Shutting down Bird2 altogether will immediately drain the machine of all traffic, which is rerouted to the remaining replicas.
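The per-service steering can be modelled in a few lines. This is a simplified sketch, not the OSPF implementation: the cost values are hypothetical, and the model simply picks the cheapest gateway that still announces a given prefix.

```python
def pick_gateway(announcements, cost, prefix):
    """Return the lowest-cost gateway still announcing prefix, or None."""
    live = [gw for gw, prefixes in announcements.items() if prefix in prefixes]
    return min(live, key=lambda gw: cost[gw]) if live else None


if __name__ == "__main__":
    nat44, nat64 = "0.0.0.0/0", "2001:678:d78:564::/96"
    # Hypothetical total costs as seen from one client's switch.
    cost = {"border0.chbtl0": 210, "border0.chrma0": 260}
    announcements = {
        "border0.chbtl0": {nat44, nat64},
        "border0.chrma0": {nat44, nat64},
    }
    print(pick_gateway(announcements, cost, nat64))  # border0.chbtl0

    # Withdraw only the NAT64 prefix at the closest gateway.
    announcements["border0.chbtl0"].discard(nat64)
    print(pick_gateway(announcements, cost, nat64))  # border0.chrma0
    print(pick_gateway(announcements, cost, nat44))  # border0.chbtl0
```

Withdrawing one prefix moves only that service to another replica, while the others stay where they are.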
If you’re curious, here’s a few minutes of me playing with failover, while watching YouTube videos concurrently [asciinema, gif]:
What’s Next
I’ve added an Ansible module in which I can configure the individual instances’ IPv4 and IPv6 NAT pools, and turn on/off the three NAT types by means of steering the OSPF announcements. I can also turn on/off the Anycast Unbound announcements, in much the same way.
If you’re a regular reader of my stories, you may be asking: Why didn’t you use VPP? And that would be an excellent question. I need to noodle a little bit more with respect to having all three NAT types concurrently working alongside Linux CP for the Bird and Unbound stuff, but I think in the future you might see a followup article on how to do all of this in VPP. Stay tuned!