About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Thanks to the [Linux ControlPlane] plugin, higher level control plane software becomes available, that is to say: things like BGP, OSPF, LDP, VRRP and so on become quite natural for VPP.
IPng Networks is a small service provider that has built a network based entirely on open source: [Debian] servers with widely available Intel and Mellanox 10G/25G/100G network cards, paired with [VPP] for the dataplane, and [Bird2] for the controlplane.
As a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at which an initial allocation was a /19, and subsequent allocations usually a /20 based on justification. Then it watered down to a /22 for new Local Internet Registries, then that became a /24 for new LIRs, and ultimately we ran out. What was once a plentiful resource, has now become a very constrained resource.
In this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring one of the newer routing protocols: Babel.
🙁 A sad waste
I have to go back to something very fundamental about routing. When RouterA holds a routing table, it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a packet, it’ll look up the destination address, and then forward the packet on to RouterB which is the next router in the path towards the destination:
- RouterA does a route lookup in its routing table. For destination 192.0.2.1, the covering prefix is 192.0.2.0/24 and it might find that it can reach it via IPv4 next hop 100.64.0.1.
- RouterA then does another lookup in its routing table, to figure out how it can reach 100.64.0.1. It may find that this address is directly connected, say to interface eth0, on which RouterA is 100.64.0.2/30.
- Assuming that eth0 is an ethernet device, which the vast majority of interfaces are, then RouterA can look up the link-layer address for that IPv4 address 100.64.0.1, by using ARP.
- The ARP request asks, quite literally who-has 100.64.0.1? using a broadcast message on eth0, to which the other RouterB will answer 100.64.0.1 is-at 90:e2:ba:3f:ca:d5.
- Now that RouterA knows that, it can forward along the IP packet out on its eth0 device and towards 90:e2:ba:3f:ca:d5. Huzzah.
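The lookup steps above can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the dictionaries ROUTES and ARP_CACHE and the helper names lookup() and resolve() are made up for illustration, and a real router uses a FIB and a neighbor table rather than Python dictionaries. The addresses are the ones from the example.

```python
import ipaddress

# Toy routing table, mirroring the example above: prefix -> (next hop, interface).
# A next hop of None means the prefix is directly connected on that interface.
ROUTES = {
    ipaddress.ip_network("192.0.2.0/24"): ("100.64.0.1", None),
    ipaddress.ip_network("100.64.0.0/30"): (None, "eth0"),
}

# Toy ARP cache: (interface, IPv4 address) -> link-layer address.
ARP_CACHE = {("eth0", "100.64.0.1"): "90:e2:ba:3f:ca:d5"}


def lookup(address):
    """Longest-prefix match; returns (next hop, interface) or None."""
    addr = ipaddress.ip_address(address)
    covering = [prefix for prefix in ROUTES if addr in prefix]
    if not covering:
        return None
    return ROUTES[max(covering, key=lambda prefix: prefix.prefixlen)]


def resolve(destination, max_depth=4):
    """Follow next hops until a connected route is found, then consult ARP."""
    target = destination
    for _ in range(max_depth):
        route = lookup(target)
        if route is None:
            raise LookupError(f"no route to {target}")
        next_hop, interface = route
        if next_hop is None:
            # Directly connected: the link-layer address comes from ARP.
            return interface, ARP_CACHE[(interface, target)]
        target = next_hop
    raise LookupError(f"next hop recursion too deep for {destination}")


print(resolve("192.0.2.1"))
```

Resolving 192.0.2.1 first hits the /24, recurses to 100.64.0.1 in the connected /30, and ends at eth0 with RouterB's MAC address. The only thing the /30 contributes is that last step.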
🥰 A clever trick
I can’t help but notice that the only purpose of having the 100.64.0.0/30 transit network between these two routers is to:
- provide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses, using ARP resolution.
- provide a means for the routers to send ICMP messages. For example, in a traceroute, each hop along the way will respond with a TTL exceeded message. And I do like traceroutes!
Let me discuss these two purposes in more detail:
1. IPv4 ARP, née IPv6 NDP
One really neat trick is simply replacing ARP resolution by something that can resolve the link-layer MAC address in a different way. As it turns out, IPv6 has an equivalent called Neighbor Discovery Protocol, with which a router can determine the link-layer address of a neighbor, or verify that a neighbor is still reachable via a cached link-layer address. This uses ICMPv6 to send out a query in the form of a Neighbor Solicitation, which is answered with a Neighbor Advertisement.
Why am I talking about IPv6 neighbor discovery when I’m explaining IPv4 forwarding, you may be wondering? Well, because of this neat trick that the IPv4 prefix brokers don’t want you to know:
pim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1
pim@vpp0-0:~$ ip -br a show e1
e1 UP fe80::5054:ff:fef0:1101/64
pim@vpp0-0:~$ ip ro get 192.0.2.0
192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
cache
pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110
fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE
pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0
tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
16:21:30.002878 52:54:00:f0:11:01 > 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98:
(tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84)
192.168.10.0 > 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64
While it looks counter-intuitive at first, this is actually pretty straight forward. When the router gets a packet destined for 192.0.2.0/24, it will know that the next hop is some link-local IPv6 address, which it can resolve by NDP on ethernet interface e1. It can then simply forward the IPv4 datagram to the MAC address it found.
Who would’ve thunk that you do not need ARP or even IPv4 on the interface at all?
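As an aside, the link-local addresses in the transcript above line up with the MAC addresses NDP returned: fe80::5054:ff:fef0:1110 belongs to 52:54:00:f0:11:10. That is consistent with the modified EUI-64 scheme, which the sketch below reproduces. The function name mac_to_link_local() is made up for this example, and whether a given host derives its link-local address this way is an assumption, as many systems use other methods. NDP itself does not depend on it.

```python
import ipaddress


def mac_to_link_local(mac):
    """Derive a modified EUI-64 IPv6 link-local address from a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError(f"not a 48-bit MAC address: {mac}")
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe in the middle to form the 64-bit interface identifier.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80::/64 prefix followed by the interface identifier.
    return ipaddress.IPv6Address(b"\xfe\x80" + bytes(6) + interface_id)


print(mac_to_link_local("52:54:00:f0:11:10"))
```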
2. Originating ICMP messages
The Internet Control Message Protocol is described in [RFC792]. It’s mostly used to carry diagnostic and debugging information. These messages are either originated by end hosts, for example the “destination unreachable, port unreachable” types of messages, or by intermediate routers, for example with most other kinds of “destination unreachable” packets.
Path MTU Discovery, described in [RFC1191], allows a host to discover the maximum packet size that a route is able to carry. There are a few different types of PMTUd, but the most common one uses ICMPv4 packets coming from these intermediate routers, informing the sender that packets which are marked as un-fragmentable cannot be transmitted because they are too large.
Without the ability for a router to send these ICMPv4 packets, end to end connectivity quality might break undetected. So, every router that is able to forward IPv4 traffic SHOULD be able to originate ICMPv4 traffic.
If you’re curious, you can read more in this [IETF Draft] from Juliusz Chroboczek et al. It’s really insightful, yet elegant.
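To make the PMTUd dependency concrete, here is a minimal sketch of the decision a router takes for an oversized IPv4 packet. It is a simplified model, not code from any real dataplane: the Verdict class, the forward_decision() function and its parameter names are invented for this example. The ICMP type 3, code 4 ("fragmentation needed and DF set") and the next-hop MTU field come from RFC 792 and RFC 1191.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    action: str
    icmp_type: Optional[int] = None
    icmp_code: Optional[int] = None
    next_hop_mtu: Optional[int] = None
    icmp_source: Optional[str] = None


def forward_decision(packet_len, dont_fragment, egress_mtu, router_ipv4):
    """Decide what to do with an IPv4 packet leaving an interface."""
    if packet_len <= egress_mtu:
        return Verdict("forward")
    if not dont_fragment:
        return Verdict("fragment")
    if router_ipv4 is None:
        # No IPv4 address to originate from: the sender never learns why.
        return Verdict("drop-silently")
    # Destination unreachable (3), fragmentation needed and DF set (4).
    return Verdict("drop-and-signal", 3, 4, egress_mtu, router_ipv4)


print(forward_decision(9000, True, 1500, "192.168.10.1"))
print(forward_decision(9000, True, 1500, None))
```

The second call is the failure mode this section is about: without any IPv4 address on the router, the oversized packet vanishes and the sender cannot adjust.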
Introducing Babel
I’ve learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets, as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to originate ICMPv4 packets, therefore it needs at least one IPv4 address.
These two claims mean that I need at most one IPv4 address on each router. Could it be?!
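A quick back-of-the-envelope calculation shows what is at stake. The sketch below assumes a hypothetical chain of four routers joined by three point to point links, and the function name ipv4_addresses_needed() is made up for illustration. It counts one loopback address per router plus the full size of every transit prefix.

```python
def ipv4_addresses_needed(routers, links, transit_prefixlen=None):
    """Count IPv4 addresses: one loopback per router plus transit networks."""
    loopbacks = routers
    if transit_prefixlen is None:
        # No IPv4 transit networks at all, only loopbacks.
        return loopbacks
    addresses_per_link = 2 ** (32 - transit_prefixlen)
    return loopbacks + links * addresses_per_link


for prefixlen in (30, 31, None):
    print(prefixlen, ipv4_addresses_needed(routers=4, links=3, transit_prefixlen=prefixlen))
```

With /30 transit networks, this small chain consumes 16 addresses, with /31 it consumes 10, and without IPv4 transit networks it consumes only the 4 loopbacks.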
Babel is a loop-avoiding distance-vector routing protocol that is designed to be robust and efficient both in networks using prefix-based routing and in networks using flat routing (“mesh networks”), and both in relatively stable wired networks and in highly dynamic wireless networks.
The definitive [RFC8966] describes it in great detail, and previous work is in [RFC7557] and [RFC6126]. Lots of reading :) Babel is a hybrid routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4 and IPv6), regardless of which protocol the Babel packets are themselves being carried over.
I quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops across address families, which is described in [RFC9229]:
When a packet is routed according to a given routing table entry, the forwarding plane typically uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol [RFC4861] in the case of IPv6 and the Address Resolution Protocol (ARP) [RFC826] in the case of IPv4) to map the next-hop address to a link-layer address (a “Media Access Control (MAC) address”), which is then used to construct the link-layer frames that encapsulate forwarded packets.
It is apparent from the description above that there is no fundamental reason why the destination prefix and the next-hop address should be in the same address family: there is nothing preventing an IPv6 packet from being routed through a next hop with an IPv4 address (in which case the next hop’s MAC address will be obtained using ARP) or, conversely, an IPv4 packet from being routed through a next hop with an IPv6 address. (In fact, it is even possible to store link-layer addresses directly in the next-hop entry of the routing table, which is commonly done in networks using the OSI protocol suite).
Babel and Bird2
There’s an implementation of Babel in Bird2, the routing solution that I use at AS8298. What made me extra enthusiastic, is that I found out the functionality described in RFC9229 was committed about a year ago in Bird2 [ref], with a hat-tip to Toke Høiland-Jørgensen.
The Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than this commit, so my first order of business is to get a Debian package:
pim@summer:~/src$ sudo apt install devscripts
pim@summer:~/src$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz
pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz
pim@summer:~/src/bird-2.14$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz
pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i
pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us
And that yields me a fresh Bird 2.14 package. I can’t help but wonder though, why did the semantic versioning [ref] of 2.0.X change to 2.14? I found an answer in the NEWS file of the 2.13 release [link].
It’s a little bit of a disappointment, but I quickly get over myself because I want to take this
Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer!
Babel and the LAB
I decide to take an IPng [lab] out for a spin. These labs come with four VPP routers and two Debian machines connected like so:
The configuration snippet for Bird2 is very simple, as most of the defaults are sensible:
pim@vpp0-0:~$ cat << EOF | sudo tee -a /etc/bird/bird.conf
protocol babel {
  interface "e*" {
    type wired;
    extended next hop on;
  };
  ipv6 { import all; export all; };
  ipv4 { import all; export all; };
}
EOF
pim@vpp0-0:~$ birdc show babel interfaces
BIRD 2.14 ready.
babel1:
Interface State Auth RX cost Nbrs Timer Next hop (v4) Next hop (v6)
e1 Up No 96 1 0.958 :: fe80::5054:ff:fef0:1101
pim@vpp0-0:~$ birdc show babel neigh
BIRD 2.14 ready.
babel1:
IP address Interface Metric Routes Hellos Expires Auth RTT (ms)
fe80::5054:ff:fef0:1110 e1 96 8 16 5.003 No 4.831
pim@vpp0-0:~$ birdc show babel entries
BIRD 2.14 ready.
babel1:
Prefix Router ID Metric Seqno Routes Sources
192.168.10.0/32 00:00:00:00:c0:a8:0a:00 0 1 0 0
192.168.10.0/24 00:00:00:00:c0:a8:0a:00 0 1 1 0
192.168.10.1/32 00:00:00:00:c0:a8:0a:01 96 7 1 0
2001:678:d78:200::/128 00:00:00:00:c0:a8:0a:00 0 1 0 0
2001:678:d78:200::/60 00:00:00:00:c0:a8:0a:00 0 1 1 0
2001:678:d78:200::1/128 00:00:00:00:c0:a8:0a:01 96 7 1 0
Based on this simple configuration, Bird2 will start the babel protocol on e0 and e1, and it quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol database (called entries), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and 2001:678:d78:200::), the neighbor’s IPv4 and IPv6 loopbacks (192.168.10.1 and 2001:678:d78:200::1), and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60).
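In the entries output above, the low four bytes of each 64-bit Babel Router ID happen to spell out that router's IPv4 loopback in hex (c0:a8:0a:00 is 192.168.10.0). The small sketch below decodes it. The helper name router_id_to_ipv4() is made up, and reading the Router ID this way is an observation from this lab output, not something the Babel specification requires.

```python
import ipaddress


def router_id_to_ipv4(router_id):
    """Interpret the low 32 bits of a 64-bit Babel Router ID as an IPv4 address."""
    raw = bytes(int(part, 16) for part in router_id.split(":"))
    if len(raw) != 8:
        raise ValueError(f"not a 64-bit router ID: {router_id}")
    return ipaddress.IPv4Address(raw[4:])


print(router_id_to_ipv4("00:00:00:00:c0:a8:0a:00"))
print(router_id_to_ipv4("00:00:00:00:c0:a8:0a:01"))
```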
The coolest part is the extended next hop on statement, which enables Babel to set the nexthop to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table:
pim@vpp0-0:~$ ip ro
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32
unreachable 192.168.10.0/24 proto bird metric 32
pim@vpp0-0:~$ ip -6 ro
2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium
2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium
unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium
fe80::/64 dev loop0 proto kernel metric 256 pref medium
fe80::/64 dev e1 proto kernel metric 256 pref medium
✅ Setting IPv4 routes over IPv6 nexthops works!
Babel and VPP
For the [VPP] configuration, I start off with a pretty much empty configuration, creating only a loopback interface called loop0, setting the interfaces up, and exposing them in LinuxCP:
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface state GigabitEthernet10/0/1 up
vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and IPv6 loopbacks across the network.
Adding support to VPP
IPv6 pings and looks good. However, IPv4 endpoints do not ping yet. The first thing I check is whether VPP understands how to interpret an IPv4 route with an IPv6 nexthop. I think it does, because I remember reviewing a change from Adrian during our MPLS [project], which he submitted in this [Gerrit]. His change allows VPP to use routes with rtnl_route_nh_get_via() to map them to a different address family, exactly what I am looking for. The routes are correctly installed in the FIB:
pim@vpp0-0:~$ vppctl show ip fib 192.168.10.1
ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ]
192.168.10.1/32 fib:0 index:31 locks:2
lcp-rt-dynamic refs:1 src-flags:added,contributing,active,
path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ]
path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved,
fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1
[@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd
forwarding: unicast-ip4-chain
[@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]]
[0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800
Using the Open vSwitch tap, I can clearly see the packets go out from vpp0-0.e1 and into vpp0-1.e0, but there is no response, so they are getting lost in vpp0-1 somewhere. I take a look at a packet trace on vpp0-1, where I’m expecting the ICMP packet:
pim@vpp0-1:~$ vppctl show trace
07:42:53:178694: dpdk-input
GigabitEthernet10/0/0 rx queue 0
buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0
ext-hdr-valid
PKT MBUF: port 0, nb_segs 1, pkt_len 98
buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0
packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0
rss 0x0 fdir.hi 0x0 fdir.lo 0x0
IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178765: ethernet-input
frame: flags 0x1, hw-if-index 1, sw-if-index 1
IP4: 52:54:00:f0:11:01 -> 52:54:00:f0:11:10
07:42:53:178791: ip4-input
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178810: ip4-not-enabled
ICMP: 192.168.10.0 -> 192.168.10.1
tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN
fragment id 0xf52b, flags DONT_FRAGMENT
ICMP echo_request checksum 0x43b7 id 26166
07:42:53:178833: error-drop
rx:GigabitEthernet10/0/0
07:42:53:178835: drop
dpdk-input: no error
Okay, that checks out! Going over this packet trace, the ip4-input node indeed got handed a packet, which it promptly rejected by forwarding it to ip4-not-enabled, which drops it. It kind of makes sense: the VPP dataplane doesn’t think it’s logical to handle IPv4 traffic on an interface which does not have an IPv4 address. Except that I’m bending the rules a little bit by doing exactly that.
Approach 1: force-enable ip4 in VPP
There’s an internal function ip4_sw_interface_enable_disable() which is called to enable IPv4 processing on an interface once the first IPv4 address is added. So my first fix is to force this to be enabled for any interface that is exposed via Linux Control Plane, notably in lcp_itf_pair_create() [here].
This approach is partially effective:
pim@vpp0-0:~$ ip ro get 192.168.10.1
192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0
cache
pim@vpp0-0:~$ ping -c5 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms
64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms
64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms
^C
--- 192.168.10.1 ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 4006ms
rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms
pim@vpp0-0:~$ traceroute 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
1 * * *
2 * * *
3 192.168.10.3 (192.168.10.3) 10.418 ms 10.343 ms 11.362 ms
I take a moment to think about why the routers in the middle are not responding to the traceroute, and it dawns on me that when a router needs to send an ICMPv4 TTL Exceeded message, it can’t select an IPv4 address to originate the message from, as the interface has none.
🟠 Forwarding works, but ❌ PMTUd does not!
Approach 2: Use unnumbered interfaces
Looking at my options, I see that VPP is capable of using so-called unnumbered interfaces. These can be left unconfigured, but borrow an address from another interface. It’s a good idea to borrow from loop0, which has a valid IPv4 and IPv6 address. It looks like this in VPP:
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# show interface address
GigabitEthernet10/0/0 (dn):
unnumbered, use loop0
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
GigabitEthernet10/0/1 (up):
unnumbered, use loop0
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
loop0 (up):
L3 192.168.10.0/32
L3 2001:678:d78:200::/128
The Linux ControlPlane configuration will always synchronize interface information from VPP to Linux, as I described back then when I [worked on the plugin]. Babel starts and sets next hops for IPv4 that look like this:
pim@vpp0-2:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64
e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64
e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64
pim@vpp0-2:~$ ip ro
192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink
unreachable 192.168.10.0/24 proto bird metric 32
192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink
192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink
While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors (192.168.10.1 and 192.168.10.3) are not reachable:
pim@vpp0-2:~# ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
From 192.168.10.2 icmp_seq=1 Destination Host Unreachable
From 192.168.10.2 icmp_seq=2 Destination Host Unreachable
From 192.168.10.2 icmp_seq=3 Destination Host Unreachable
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms
I take a look at why that might be, and I notice this on the neighbor vpp0-1 when I try to ping it from vpp0-2:
vpp0-1# show err
Count Node Reason Severity
5 arp-reply IP4 source address not local to sub error
1 arp-reply IP4 source address matches local in error
Oh, snap! I traced this down to src/vnet/arp/arp.c around line 522 where I can see that VPP, when it receives an ARP request, wants that to be coming from a peer that is in its own subnet. But with a point to point link like this one, there is nobody else in the 192.168.10.1/32 subnet! I think this error should not be returned if the interface is arp_unnumbered(), defined further up in the same source file. I write a small patch in Gerrit [40482] which removes this requirement and the test that asserts the previous behavior, allowing the ARP request to succeed, and things shoot to life:
pim@vpp0-2:~$ ping -c3 192.168.10.1
PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data.
64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms
64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms
64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms
--- 192.168.10.1 ping statistics ---
3 packets transmitted, 3 received, 0% packet loss, time 2004ms
rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms
I make a mental note to discuss this ARP relaxation Gerrit with [vpp-dev], and I’ll see where that takes me.
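The ARP source check and its relaxation can be modelled as follows. This is a toy model in Python and not VPP’s C code: the function arp_source_accepted() and its parameters are invented for illustration, and the actual patch in Gerrit may be structured differently.

```python
import ipaddress


def arp_source_accepted(sender_ip, local_prefixes, unnumbered, relaxed):
    """Decide whether an ARP request from sender_ip should be answered."""
    sender = ipaddress.ip_address(sender_ip)
    on_link = any(
        sender in ipaddress.ip_network(prefix, strict=False)
        for prefix in local_prefixes
    )
    if on_link:
        return True
    # Strict behavior: reject as "source address not local to subnet".
    # Relaxed behavior: let unnumbered interfaces answer anyway.
    return relaxed and unnumbered


# vpp0-1 (192.168.10.1/32, borrowed from loop0) gets a request from vpp0-2.
print(arp_source_accepted("192.168.10.2", ["192.168.10.1/32"], unnumbered=True, relaxed=False))
print(arp_source_accepted("192.168.10.2", ["192.168.10.1/32"], unnumbered=True, relaxed=True))
```

With a /32 on the interface no peer can ever be on-link, so the strict check rejects every request, while the relaxed variant accepts it for unnumbered interfaces only.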
✅ Forwarding IPv4 routes over IPv4 point-to-point nexthops works!
Approach 3: VPP Unnumbered Hack
At this point, I think I’m good, but one of the cool features of Babel is that it can use IPv6 next hops for IPv4 destinations. Setting GigabitEthernet10/0/X to unnumbered will make 192.168.10.X/32 reappear on the e0 and e1 interfaces, which will make Babel prefer the more classic IPv4 next-hops. So can I trick it somehow to use IPv6 anyway?
One option is to ask Babel to use extended next hop even when IPv4 is available, which would be a change to Bird (and possibly a violation of the Babel specification, I should read up on that).
But I think there’s another way, so I take a look at the VPP code which prints out the unnumbered, use loop0 message, and I find a way to know if an interface is borrowing addresses in this way. I decide to change the LCP plugin to inhibit sync’ing the addresses if they belong to an interface which is unnumbered. Because I don’t know for sure if everybody would find this behavior desirable, I make sure to guard the behavior behind a backwards compatible configuration option.
If you’re curious, please take a look at the change in my [GitHub repo], in which I:
- add a new configuration option, lcp-sync-unnumbered, which defaults to on. That would be what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux.
- add a CLI call to change the value, lcp lcp-sync-unnumbered [on|enable|off|disable]
- extend the CLI call to show the LCP plugin state, as an additional output of lcp show
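The intended effect of the new option can be summarised in a few lines. This is a behavioral sketch only, not the plugin’s C implementation: addresses_to_sync() and its arguments are hypothetical names, and it assumes that lcp-sync gates all address synchronization from VPP to Linux.

```python
def addresses_to_sync(own, borrowed, lcp_sync, lcp_sync_unnumbered):
    """Return the addresses that would be copied from VPP to the Linux interface."""
    if not lcp_sync:
        return []
    synced = list(own)
    if lcp_sync_unnumbered:
        # Default behavior: borrowed (unnumbered) addresses are copied too.
        synced += borrowed
    return synced


borrowed_from_loop0 = ["192.168.10.0/32", "2001:678:d78:200::/128"]
print(addresses_to_sync([], borrowed_from_loop0, lcp_sync=True, lcp_sync_unnumbered=True))
print(addresses_to_sync([], borrowed_from_loop0, lcp_sync=True, lcp_sync_unnumbered=False))
```

With the option off, an unnumbered interface ends up with no synced addresses in Linux, which is what lets Babel fall back to IPv6 next hops.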
And with that, the VPP configuration becomes:
vpp0-0# lcp lcp-sync on
vpp0-0# lcp lcp-sync-unnumbered off
vpp0-0# create loopback interface instance 0
vpp0-0# set interface state loop0 up
vpp0-0# set interface ip address loop0 192.168.10.0/32
vpp0-0# set interface ip address loop0 2001:678:d78:200::/128
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0
vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-0# set interface state GigabitEthernet10/0/0 up
vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1
vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-0# set interface state GigabitEthernet10/0/1 up
vpp0-0# lcp create loop0 host-if loop0
vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0
vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1
Results
I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have to admit:
pim@vpp0-0:~$ ip -br a
lo UNKNOWN 127.0.0.1/8 ::1/128
loop0 UNKNOWN 192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64
e0 UP fe80::5054:ff:fef0:1100/64
e1 UP fe80::5054:ff:fef0:1101/64
e2 DOWN
e3 DOWN
pim@vpp0-0:~$ traceroute -n 192.168.10.3
traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets
1 192.168.10.1 1.882 ms 2.231 ms 1.472 ms
2 192.168.10.2 4.243 ms 3.492 ms 2.797 ms
3 192.168.10.3 6.689 ms 5.925 ms 5.157 ms
pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3
traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets
1 2001:678:d78:200::1 2.543 ms 1.762 ms 2.154 ms
2 2001:678:d78:200::2 4.943 ms 3.063 ms 3.562 ms
3 2001:678:d78:200::3 6.273 ms 6.694 ms 7.086 ms
✅ Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!
I recorded a little [screencast] that shows my work, so far:
Additional thoughts
Comparing OSPFv2 and Babel
Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid use of IPv4 transit networks, by making use of this peer pattern, which is similar but not quite the same as what I discussed in Approach 2 above:
$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0
$ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1
The Linux ControlPlane plugin is not currently capable of accepting the peer netlink message, and I can see a problem: VPP does not allow for two interfaces to have the same IP address, unless one is borrowing from another using unnumbered. I wonder why that is …
I could certainly give implementing that peer pattern in Netlink a go, but I’m not enthusiastic. To consume the netlink message correctly, the plugin would need to assert that the left hand (source) IPv4 address strictly corresponds to a loopback, and then internally rewrite the address addition into an unnumbered use, and also somehow reject (delete?) the netlink configuration otherwise. Ick!
I think there’s a more idiomatic way of doing this in VPP. OSPFv2 doesn’t really need to use the peer pattern, as long as the point to point peer is reachable. Babel is emitting a static route over the interface after using IPv6 to learn its peer’s IPv4 address, which is really neat! I suppose for OSPFv2 setting a manual static route for the peer into the device would do the trick as well.
The VPP idiom for the peer pattern above, which Babel does naturally, and OSPFv2 could be manually configured to do, would look like this:
vpp0-2# set interface ip address loop0 192.168.10.2/32
vpp0-2# set interface state loop0 up
vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0
vpp0-2# set interface state GigabitEthernet10/0/0 up
vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0
vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0
vpp0-2# set interface state GigabitEthernet10/0/1 up
vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1
Either way, using point to point connections (like these explicit static routes, or the implied static routes that the peer pattern will yield) over an ethernet broadcast medium will require the ARP [Gerrit] to be merged. This one seems reasonably straight forward, because point to point over an ethernet broadcast medium works successfully on equipment from many popular vendors, and I can’t find any RFC that forbids it. Perhaps VPP is being a bit too strict.
To Unnumbered or Not To Unnumbered
I’m torn between Approach 2 and Approach 3. While on the one hand, setting the unnumbered interface would be best reflected in Linux, it is not without problems. If the operator subsequently tries to remove one of the addresses on e0 or e1, that will yield a desync between Linux and VPP (Linux will have removed the address, but VPP will still be unnumbered). On the other hand, tricking Linux (and the operator) to believe there isn’t an IPv4 (and IPv6) address configured on the interface, is also not great.
Of the two approaches, I think I prefer Approach 3 (changing the Linux CP plugin to not sync unnumbered addresses), because it minimizes the chance of operator error. If you’re reading this and have an Opinion™, would you please let me know?
What’s Next
I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see if it makes sense to upstream my changes. Personally, I think it’s a reasonable direction, because (a) both changes are backwards compatible and (b) its semantics are pretty straight forward. I’ll also add some configuration knobs to [vppcfg] to make it easier to configure VPP in this way.
Of course, migrating AS8298 won’t be overnight, I need to gain a bit more confidence, and obviously upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of peer review. And finally I need to roll this new IPv4-less IGP out very carefully and without interruptions, which considering the IGP is the most fundamental building block of the network, may be tricky.
But, I am uncomfortably excited by the prospect of having my network go entirely without backbone transit networks. By the way: Babel is amazing!