VPP

About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two.

After completing the Linux CP plugin, interfaces and their attributes such as addresses and routes can be shared between VPP and the Linux kernel in a clever way, so running software like FRR or Bird on top of VPP and achieving >100Mpps and >100Gbps forwarding rates is easily within reach!

If you’ve read my previous articles (thank you!), you will have noticed that I have done a lot of work on making VPP work well in an ISP (BGP/OSPF) environment with Linux CP. However, there are many other cool things about VPP that make it a very competent advanced services router. One that has always been super interesting to me is being able to offer L2 connectivity over a wide area network. For example, a virtual leased line from our Colo in Zurich to Amsterdam NIKHEF. This article explores this space.

NOTE: If you’re only interested in results, scroll all the way down to the markdown table and graph for performance stats.

Introduction

ISPs can offer ethernet services, often called Virtual Leased Lines (VLLs), Layer2 VPN (L2VPN) or Ethernet Backhaul. They mean the same thing: imagine a switchport in location A that appears to be transparently and directly connected to a switchport in location B, with the ISP (layer3, so IPv4 and IPv6) network in between. The “simple” old-school setup would be to have switches which define VLANs and are all interconnected. But we collectively learned that it’s a bad idea for several reasons:

  • Large broadcast domains tend to encounter L2 forwarding loops sooner rather than later
  • Spanning-Tree and its kin are a stopgap, but they often disable an entire port from forwarding, which can be expensive if that port is connected to a dark fiber into another datacenter far away.
  • Large VLAN setups that are intended to interconnect with other operators run into overlapping VLAN tags, which means switches have to do tag rewriting and filtering and such.
  • Traffic engineering is all but non-existent in L2-only networking domains, while L3 has all sorts of smart TE extensions, ECMP, and so on.

The canonical solution is for ISPs to encapsulate the ethernet traffic of their customers in some tunneling mechanism, for example MPLS or an IP tunneling protocol. Fundamentally, these are the same, except for the chosen protocol and the overhead/cost of forwarding. MPLS is a very thin layer under the packet; among the IP based tunneling mechanisms, GRE, VXLAN and GENEVE are commonly used, but many others exist.

They all work roughly the same:

  • An IP packet has a maximum transmission unit (MTU) of 1500 bytes, while the ethernet header is typically an additional 14 bytes: a 6 byte source MAC, 6 byte destination MAC, and 2 byte ethernet type, which is 0x0800 for an IPv4 datagram, 0x0806 for ARP, and 0x86dd for IPv6, and many others [ref].
    • If VLANs are used, an additional 4 bytes are needed [ref] making the ethernet frame at most 1518 bytes long, with an ethertype of 0x8100.
    • If QinQ or QinAD are used, yet again 4 bytes are needed [ref], making the ethernet frame at most 1522 bytes long, with an ethertype of either 0x8100 or 0x9100, depending on the implementation.
  • We can take such an ethernet frame, and make it the payload of another IP packet, encapsulating the original ethernet frame in a new IPv4 or IPv6 packet. We can then route it over an IP network to a remote site.
  • Upon receipt of such a packet, by looking at the headers the remote router can determine that this packet represents an encapsulated ethernet frame, unpack it all, and forward the original frame onto a given interface.
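
As a small sanity check of the frame sizes mentioned in the list above, here is the arithmetic restated as a tiny sketch:

```python
# Maximum ethernet frame sizes implied by a 1500 byte IP MTU, as per the list above.
MTU = 1500
print(MTU + 14)          # 1514 bytes: untagged ethernet (6+6+2 byte header)
print(MTU + 14 + 4)      # 1518 bytes: one 802.1q VLAN tag
print(MTU + 14 + 4 + 4)  # 1522 bytes: QinQ / QinAD (two tags)
```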

IP Tunneling Protocols

First let’s get some theory out of the way – I’ll discuss three common IP tunneling protocols here, and then move on to demonstrate how they are configured in VPP and perhaps more importantly, how they perform in VPP. Each tunneling protocol has its own advantages and disadvantages, but I’ll stick to the basics first:

GRE: Generic Routing Encapsulation

Generic Routing Encapsulation (GRE, described in RFC2784) is a very old and well known tunneling protocol. The packet is an IP datagram with protocol number 47, consisting of a header with 1 bit flagging the presence of a checksum, 12 reserved bits, 3 bits for the version (normally set to all-zeros), and 16 bits for the inner protocol (ether)type, so 0x0800 for IPv4, 0x8100 for 802.1q and so on. It’s a very small header of only 4 bytes, plus an optional key (4 bytes) and sequence number (also 4 bytes), which means that to be able to transport any ethernet frame (including the fancy QinQ and QinAD ones), the underlay must have an end-to-end MTU of at least 1522 + 20 (IPv4) + 12 (GRE) = 1554 bytes for IPv4 and 1574 bytes for IPv6.

VXLAN: Virtual Extensible LAN

Virtual Extensible LAN (VXLAN, described in RFC7348) is a UDP datagram which has a header consisting of 8 bits worth of flags, 24 bits reserved for future expansion, 24 bits of Virtual Network Identifier (VNI) and an additional 8 bits of reserved space at the end. It uses UDP port 4789 as assigned by IANA. VXLAN encapsulation adds 20 (IPv4) + 8 (UDP) + 8 (VXLAN) = 36 bytes; with an IPv6 outer header (40 bytes) it adds 56 bytes. This means that to be able to transport any ethernet frame, the underlay network must have an end-to-end MTU of at least 1522+36 = 1558 bytes for IPv4 and 1578 bytes for IPv6.
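
For illustration, here is a minimal sketch (not VPP’s code, just the RFC7348 layout restated in Python) of what the 8 byte VXLAN header looks like on the wire, using the VNI 8298 that I’ll configure later:

```python
import struct

def vxlan_header(vni: int) -> bytes:
    """Build the 8 byte VXLAN header: 8 bits of flags (only the I bit, 0x08, set),
    24 reserved bits, a 24 bit VNI, and a final 8 reserved bits."""
    flags = 0x08                                  # I flag: a valid VNI is present
    return struct.pack("!B3xI", flags, vni << 8)  # VNI occupies the top 24 bits of the last word

print(vxlan_header(8298).hex())  # 0800000000206a00 -> VNI 0x206a == 8298
```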

GENEVE: Generic Network Virtualization Encapsulation

GEneric NEtwork Virtualization Encapsulation (GENEVE, described in RFC8926) is somewhat similar to VXLAN, although it was an attempt to stop the wild growth of tunneling protocols (I’m sure there is an XKCD out there specifically for this approach). The packet is also a UDP datagram, with destination port 6081, followed by an 8 byte GENEVE specific header containing 2 bits of version, 6 bits of option length, 8 bits of flags and reserved bits, a 16 bit inner ethertype, a 24 bit Virtual Network Identifier (VNI), and 8 bits of reserved space. With GENEVE, several options are available and are tacked onto the GENEVE header, but they are typically not used. If they are, though, the options can add an additional 16 bytes, which means that to be able to transport any ethernet frame, the underlay network must have an end-to-end MTU of at least 1522+52 = 1574 bytes for IPv4 and 1594 bytes for IPv6.
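
To tie the three protocols together, here is a small sketch that reproduces the minimum underlay MTU figures quoted above, assuming the worst-case 1522 byte (QinQ) customer frame and the 16 bytes of GENEVE options mentioned:

```python
# Minimum end-to-end underlay MTU needed to carry a 1522 byte customer frame.
CUSTOMER_FRAME = 1522                  # ethernet + dot1q + QinQ, from the section above
OUTER_IP = {"IPv4": 20, "IPv6": 40}    # outer IP header size
ENCAP = {
    "GRE":    4 + 4 + 4,               # header + optional key + optional sequence number
    "VXLAN":  8 + 8,                   # UDP + VXLAN header
    "GENEVE": 8 + 8 + 16,              # UDP + GENEVE header + 16 bytes of options
}

for proto, overhead in ENCAP.items():
    for af, iphdr in OUTER_IP.items():
        print(f"{proto} over {af}: {CUSTOMER_FRAME + iphdr + overhead} bytes")
# GRE: 1554 / 1574, VXLAN: 1558 / 1578, GENEVE: 1574 / 1594
```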

Hardware setup

First let’s take a look at the physical setup. I’m using three servers and a switch in the IPng Networks lab:

Loadtest Setup

  • hvn0: Dell R720xd, load generator
    • Dual E5-2620, 24 CPUs, 2 threads per core, 2 numa nodes
    • 64GB DDR3 at 1333MT/s
    • Intel X710 4-port 10G, Speed 8.0GT/s Width x8 (64 Gbit/s)
  • Hippo and Rhino: VPP routers
    • ASRock B550 Taichi
    • Ryzen 5950X 32 CPUs, 2 threads per core, 1 numa node
    • 64GB DDR4 at 2133 MT/s
    • Intel 810C 2-port 100G, Speed 16.0 GT/s Width x16 (256 Gbit/s)
  • fsw0: FS.com switch S5860-48SC, 8x 100G, 48x 10G
    • VLAN 4 (blue) connects Rhino’s Hu12/0/1 to Hippo’s Hu12/0/1
    • VLAN 5 (red) connects hvn0’s enp5s0f0 to Rhino’s Hu12/0/0
    • VLAN 6 (green) connects hvn0’s enp5s0f1 to Hippo’s Hu12/0/0
    • All switchports have jumbo frames enabled and are set to 9216 bytes.

Further, Hippo and Rhino are running VPP at head vpp v22.02-rc0~490-gde3648db0, and hvn0 is running T-Rex v2.93 in L2 mode, with MAC address 00:00:00:01:01:00 on the first port and MAC address 00:00:00:02:01:00 on the second port. This machine can saturate 10G in both directions with small packets, even when using only one flow, as can be seen when the ports are simply looped back onto one another, for example by physically cross-connecting them with an SFP+ or DAC, or in my case by putting fsw0 ports Te0/1 and Te0/2 in the same VLAN together:

TRex on hvn0

Now that I have shared all the context and hardware, I’m ready to actually dive in to what I wanted to talk about: what does all this virtual leased line business look like for VPP? Ready? Here we go!

Direct L2 CrossConnect

The simplest thing I can show in VPP, is to configure a layer2 cross-connect (l2 xconnect) between two ports. In this case, VPP doesn’t even need to have an IP address; all I do is bring up the ports and set their MTU to be able to carry 1522 byte frames (ethernet at 1514, dot1q at 1518, and QinQ at 1522 bytes). The configuration is identical on both Rhino and Hippo:

set interface state HundredGigabitEthernet12/0/0 up
set interface state HundredGigabitEthernet12/0/1 up
set interface mtu packet 1522 HundredGigabitEthernet12/0/0
set interface mtu packet 1522 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/0 HundredGigabitEthernet12/0/1
set interface l2 xconnect HundredGigabitEthernet12/0/1 HundredGigabitEthernet12/0/0

I’d say the only thing to keep in mind here is that the cross-connect commands only link in one direction (receive in A, forward to B), and that’s why I have to type them twice (receive in B, forward to A). Of course, this must be really cheap on VPP – because all it has to do now is receive from DPDK and immediately schedule for transmit on the other port. Looking at show runtime I can see how much CPU time is spent in each of VPP’s nodes:

Time 1241.5, 10 sec internal node vector rate 28.70 loops/sec 475009.85
  vector rates in 1.4879e7, out 1.4879e7, drop 0.0000e0, punt 0.0000e0
             Name                      Calls         Vectors        Clocks    Vectors/Call  
HundredGigabitEthernet12/0/1-o     650727833     18472218801        7.49e0           28.39
HundredGigabitEthernet12/0/1-t     650727833     18472218801        4.12e1           28.39
ethernet-input                     650727833     18472218801        5.55e1           28.39
l2-input                           650727833     18472218801        1.52e1           28.39
l2-output                          650727833     18472218801        1.32e1           28.39

In this simple cross connect mode, the only thing VPP has to do is receive the ethernet, funnel it into l2-input, and immediately send it straight through l2-output back out, which does not cost much in terms of CPU cycles at all. In total, this CPU thread is forwarding 14.88Mpps (line rate 10G at 64 bytes), at an average of 133 cycles per packet (not counting the time spent in DPDK). The CPU has room to spare in this mode; in other words, even one CPU thread can handle this workload at line rate. Impressive!
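
As an aside, the ~133 cycles per packet figure is simply the Clocks column of the show runtime output above added up (those clocks are per vector element, i.e. per packet); a quick sketch of that arithmetic:

```python
# Per-node clocks/packet from 'show runtime' for the L2XC test, added up.
l2xc_clocks = {
    "HundredGigabitEthernet12/0/1-o": 7.49,
    "HundredGigabitEthernet12/0/1-t": 41.2,
    "ethernet-input":                 55.5,
    "l2-input":                       15.2,
    "l2-output":                      13.2,
}
print(round(sum(l2xc_clocks.values()), 2))  # 132.59 cycles/packet, excluding time spent in DPDK
```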

Although cool, doing an L2 crossconnect like this isn’t super useful. Usually, the customer leased line has to be transported to another location, and for that we’ll need some form of encapsulation …

Crossconnect over IPv6 VXLAN

Let’s start with VXLAN. The concept is pretty straightforward in VPP. Based on the configuration I put in Rhino and Hippo above, I will first have to bring Hu12/0/1 out of L2 mode, give it an IPv6 address on both routers, create a tunnel with a given VNI, and then crossconnect the customer side Hu12/0/0 into vxlan_tunnel0 and vice-versa. Piece of cake:

## On Rhino
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::1/64
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0

## On Hippo
set interface mtu packet 1600 HundredGigabitEthernet12/0/1
set interface l3 HundredGigabitEthernet12/0/1
set interface ip address HundredGigabitEthernet12/0/1 2001:db8::2/64
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0

Of course, now we’re actually beginning to make VPP do some work, and the exciting thing is that if there were an (opaque) ISP network between Rhino and Hippo, this would work just fine, considering the encapsulation is ‘just’ IPv6 UDP. Under the covers, for each received frame, VPP has to encapsulate it into VXLAN, and route the resulting L3 packet by doing an IPv6 routing table lookup:

Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 32132.74
  vector rates in 8.5423e6, out 8.5423e6, drop 0.0000e0, punt 0.0000e0
             Name                   Calls         Vectors          Clocks    Vectors/Call  
HundredGigabitEthernet12/0/0-o     333777        85445944          2.74e0          255.99
HundredGigabitEthernet12/0/0-t     333777        85445944          5.28e1          255.99
ethernet-input                     333777        85445944          4.25e1          255.99
ip6-input                          333777        85445944          1.25e1          255.99
ip6-lookup                         333777        85445944          2.41e1          255.99
ip6-receive                        333777        85445944          1.71e2          255.99
ip6-udp-lookup                     333777        85445944          1.55e1          255.99
l2-input                           333777        85445944          8.94e0          255.99
l2-output                          333777        85445944          4.44e0          255.99
vxlan6-input                       333777        85445944          2.12e1          255.99

I can definitely see a lot more action here. In this mode, VPP is handling 8.54Mpps on this CPU thread before saturating. At full load, VPP is spending 356 CPU cycles per packet, of which almost half is in the ip6-receive node.

Crossconnect over IPv4 VXLAN

Seeing that ip6-receive is such a big part of the cost (almost half!), I wonder what it might look like if I change the tunnel to use IPv4. So I’ll give Rhino and Hippo an IPv4 address as well, delete the vxlan tunnel I made before (the IPv6 one), and create a new one with IPv4:

set interface ip address HundredGigabitEthernet12/0/1 10.0.0.0/31
create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298      
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0

set interface ip address HundredGigabitEthernet12/0/1 10.0.0.1/31
create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 del
create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298      
set interface state vxlan_tunnel0 up
set interface mtu packet 1522 vxlan_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0
set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0

And after letting this run for a few seconds, I can take a look and see how the ip4-* version of the VPP code performs:

Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 53309.71
  vector rates in 1.4151e7, out 1.4151e7, drop 0.0000e0, punt 0.0000e0
             Name                  Calls         Vectors       Clocks    Vectors/Call  
HundredGigabitEthernet12/0/0-o    552890       141539600       2.76e0          255.99
HundredGigabitEthernet12/0/0-t    552890       141539600       5.30e1          255.99
ethernet-input                    552890       141539600       4.13e1          255.99
ip4-input-no-checksum             552890       141539600       1.18e1          255.99
ip4-lookup                        552890       141539600       1.68e1          255.99
ip4-receive                       552890       141539600       2.74e1          255.99
ip4-udp-lookup                    552890       141539600       1.79e1          255.99
l2-input                          552890       141539600       8.68e0          255.99
l2-output                         552890       141539600       4.41e0          255.99
vxlan4-input                      552890       141539600       1.76e1          255.99

Throughput is now quite a bit higher, clocking a cool 14.2Mpps (just short of line rate!) at 202 CPU cycles per packet, considerably less time spent than in IPv6, but keep in mind that VPP has an ~empty routing table in all of these tests.

Crossconnect over IPv6 GENEVE

Another popular cross connect type, also based on IPv4 and IPv6 UDP packets, is GENEVE. The configuration is almost identical, so I delete the IPv4 VXLAN and create an IPv6 GENEVE tunnel instead:

create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 del
create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0

create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 del
create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0

All the while, the TRex on the customer machine hvn0 is sending 14.88Mpps in both directions, and after just a short (second or so) interruption, the GENEVE tunnel comes up, cross-connects into the customer Hu12/0/0 interfaces, and starts to carry traffic:

Thread 8 vpp_wk_7 (lcore 8)
Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 29688.03
  vector rates in 8.3179e6, out 8.3179e6, drop 0.0000e0, punt 0.0000e0
             Name                 Calls         Vectors       Clocks    Vectors/Call  
HundredGigabitEthernet12/0/0-o   324981        83194664       2.74e0          255.99
HundredGigabitEthernet12/0/0-t   324981        83194664       5.18e1          255.99
ethernet-input                   324981        83194664       4.26e1          255.99
geneve6-input                    324981        83194664       3.87e1          255.99
ip6-input                        324981        83194664       1.22e1          255.99
ip6-lookup                       324981        83194664       2.39e1          255.99
ip6-receive                      324981        83194664       1.67e2          255.99
ip6-udp-lookup                   324981        83194664       1.54e1          255.99
l2-input                         324981        83194664       9.28e0          255.99
l2-output                        324981        83194664       4.47e0          255.99

Similar to VXLAN over IPv6, GENEVE-v6 is also comparatively slow (I say comparatively, because you should not expect anything like this performance when using Linux or BSD kernel routing!). The lower throughput is again due to the costly ip6-receive node. It is a slightly worse performer, at 8.32Mpps per core and 368 CPU cycles per packet.

Crossconnect over IPv4 GENEVE

I now suspect that GENEVE over IPv4 will show gains similar to those I saw when switching VXLAN from IPv6 to IPv4 above. So I remove the IPv6 tunnel, create a new IPv4 tunnel instead, and hook it back up to the customer port on both Rhino and Hippo, like so:

create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 del
create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0

create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 del
create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298
set interface state geneve_tunnel0 up
set interface mtu packet 1522 geneve_tunnel0
set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0
set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0

And the results, indeed a significant improvement:

Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 48639.97
  vector rates in 1.3737e7, out 1.3737e7, drop 0.0000e0, punt 0.0000e0
             Name                 Calls         Vectors       Clocks    Vectors/Call  
HundredGigabitEthernet12/0/0-o   536763       137409904       2.76e0          255.99
HundredGigabitEthernet12/0/0-t   536763       137409904       5.19e1          255.99
ethernet-input                   536763       137409904       4.19e1          255.99
geneve4-input                    536763       137409904       2.39e1          255.99
ip4-input-no-checksum            536763       137409904       1.18e1          255.99
ip4-lookup                       536763       137409904       1.69e1          255.99
ip4-receive                      536763       137409904       2.71e1          255.99
ip4-udp-lookup                   536763       137409904       1.79e1          255.99
l2-input                         536763       137409904       8.81e0          255.99
l2-output                        536763       137409904       4.47e0          255.99

So, close to line rate again! Performance of GENEVE-v4 clocks in at 13.7Mpps per core or 207 CPU cycles per packet.

Crossconnect over IPv6 GRE

Now I can’t help but wonder: if those ip4|6-udp-lookup nodes burn valuable CPU cycles, GRE might do better, because it’s an L3 protocol (protocol number 47) and never has to inspect beyond the IP header. So I delete the GENEVE tunnel and give GRE a go too:

create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 del
create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0

create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 del
create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0

Results:

Time 10.0, 10 sec internal node vector rate 255.99 loops/sec 37129.87
  vector rates in 9.9254e6, out 9.9254e6, drop 0.0000e0, punt 0.0000e0
             Name                 Calls         Vectors       Clocks    Vectors/Call  
HundredGigabitEthernet12/0/0-o   387881        99297464       2.80e0          255.99
HundredGigabitEthernet12/0/0-t   387881        99297464       5.21e1          255.99
ethernet-input                   775762       198594928       5.97e1          255.99
gre6-input                       387881        99297464       2.81e1          255.99
ip6-input                        387881        99297464       1.21e1          255.99
ip6-lookup                       387881        99297464       2.39e1          255.99
ip6-receive                      387881        99297464       5.09e1          255.99
l2-input                         387881        99297464       9.35e0          255.99
l2-output                        387881        99297464       4.40e0          255.99

The performance of GRE-v6 (in transparent ethernet bridge, aka TEB, mode) is 9.9Mpps per core or 243 CPU cycles per packet, and I’ll also note that while the ip6-receive node in all the UDP based tunnels was in the 170 clocks/packet arena, we’re now down to only 51 or so, indeed a huge improvement.

Crossconnect over IPv4 GRE

To round off the set, I’ll remove the IPv6 GRE tunnel and put an IPv4 GRE tunnel in place:

create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb del
create gre tunnel src 10.0.0.0 dst 10.0.0.1 teb 
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0

create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb del
create gre tunnel src 10.0.0.1 dst 10.0.0.0 teb 
set interface state gre0 up
set interface mtu packet 1522 gre0
set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0
set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0

And without further ado:

Time 10.0, 10 sec internal node vector rate 255.87 loops/sec 52898.61
  vector rates in 1.4039e7, out 1.4039e7, drop 0.0000e0, punt 0.0000e0
             Name                 Calls         Vectors       Clocks    Vectors/Call  
HundredGigabitEthernet12/0/0-o   548684       140435080       2.80e0          255.95
HundredGigabitEthernet12/0/0-t   548684       140435080       5.22e1          255.95
ethernet-input                  1097368       280870160       2.92e1          255.95
gre4-input                       548684       140435080       2.51e1          255.95
ip4-input-no-checksum            548684       140435080       1.19e1          255.95
ip4-lookup                       548684       140435080       1.68e1          255.95
ip4-receive                      548684       140435080       2.03e1          255.95
l2-input                         548684       140435080       8.72e0          255.95
l2-output                        548684       140435080       4.43e0          255.95

The performance of GRE-v4 (in transparent ethernet bridge, aka TEB, mode) is 14.0Mpps per core or 171 CPU cycles per packet. This is really very low, the best of all the tunneling protocols, but (for obvious reasons) it will not outperform a direct L2 crossconnect, as that cuts out the L3 (and L4) middleperson entirely. Whohoo!

Conclusions

First, let me give a recap of the tests I did, ordered from best to worst performer, left to right.

| Test          | L2XC    | GRE-v4 | VXLAN-v4 | GENEVE-v4 | GRE-v6 | VXLAN-v6 | GENEVE-v6 |
|---------------|---------|--------|----------|-----------|--------|----------|-----------|
| pps/core      | >14.88M | 14.34M | 14.15M   | 13.74M    | 9.93M  | 8.54M    | 8.32M     |
| cycles/packet | 132.59  | 171.45 | 201.65   | 207.44    | 243.35 | 355.72   | 368.09    |

(!) Achtung! Because in the L2XC mode the CPU was not fully loaded (VPP was processing only ~28 frames per vector), it did not yet reach its optimal per-packet efficiency. Under full load, the cycles/packet will be somewhat lower than what is shown here.

Taking a closer look at the VPP nodes in use, below I draw a graph of CPU cycles spent in each VPP node, for each type of cross connect, where the lower the stack, the faster that cross connect will be:

Cycles by node

Although GRE-v4 is clearly the winner, I still would not use it, for the following reason: VPP does not support GRE keys, and considering GRE is an L3 protocol, I would have to use unique IPv4 or IPv6 addresses for each tunnel src/dst pair, otherwise VPP will not know, upon receipt of a GRE packet, which tunnel it belongs to. For IPv6 this is not a huge deal (I can bind a whole /64 to a loopback and just be done with it), but GRE-v6 does not perform as well as VXLAN-v4 or GENEVE-v4.

VXLAN and GENEVE are nearly equal performers, both over IPv4 and over IPv6. In both cases, IPv4 is significantly faster than IPv6. But due to the VNI field in the header, contrary to GRE, both VXLAN and GENEVE can share the same src/dst IP across any number of tunnels, which is a huge benefit.

Multithreading

Usually, the customer facing side is an ethernet port (or sub-interface with tag popping) that will be receiving IPv4 or IPv6 traffic (either tagged or untagged), which allows the NIC to use RSS to assign this inbound traffic to multiple queues, and thus multiple CPU threads. That’s great: it means encapsulation performance scales linearly across threads.

Once the traffic is encapsulated, it risks becoming a single flow with respect to the remote host, if Rhino were to send everything from 10.0.0.0:4789 to Hippo’s 10.0.0.1:4789. However, the VPP VXLAN and GENEVE implementations both inspect the inner payload and use it to vary the source port (thanks to Neale for pointing this out, it’s in vxlan/encap.c:246). Deterministically choosing the source port based on the inner flow allows Hippo to use RSS on the receiving end, which lets these tunneling protocols scale linearly. I proved this for myself by attaching a port-mirror to the switch and copying all traffic between Hippo and Rhino to a spare machine in the rack:

pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 4789
11:19:54.887763 IP 10.0.0.1.4452 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888283 IP 10.0.0.1.42537 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.888285 IP 10.0.0.0.17895 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899353 IP 10.0.0.1.40751 > 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.899355 IP 10.0.0.0.35475 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298
11:19:54.904642 IP 10.0.0.0.60633 > 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298

pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 6081
11:22:55.802406 IP 10.0.0.0.32299 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.802409 IP 10.0.0.1.44011 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807711 IP 10.0.0.1.45503 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.807712 IP 10.0.0.0.45532 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.841495 IP 10.0.0.0.61694 > 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a:
11:22:55.851719 IP 10.0.0.1.47581 > 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a:

Considering I was sending the T-Rex profile bench.py with tunables vm=var2,size=64, the former of which chooses randomized source and destination (inner) IP addresses in the loadtester, I can conclude that the outer source port is chosen based on a hash of the inner packet. Slick!!
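
For intuition only, here is a minimal sketch of that idea (an illustration of inner-flow-based source port selection, not VPP’s actual implementation): the outer UDP source port is derived deterministically from a hash over the inner flow, so each inner flow maps to a stable outer 5-tuple, while different inner flows spread across RSS queues on the receiving router.

```python
import zlib

VXLAN_DST_PORT = 4789

def outer_src_port(src_ip: str, dst_ip: str, src_port: int, dst_port: int) -> int:
    """Pick an outer UDP source port from a hash of the inner flow (illustrative only)."""
    key = f"{src_ip},{dst_ip},{src_port},{dst_port}".encode()
    return 1024 + zlib.crc32(key) % (65536 - 1024)  # stay out of the well-known port range

# Same inner flow -> same outer port; a different inner flow -> (very likely) a different port.
print(outer_src_port("16.0.0.1", "48.0.0.1", 1025, 12), "->", VXLAN_DST_PORT)
print(outer_src_port("16.0.0.2", "48.0.0.1", 1025, 12), "->", VXLAN_DST_PORT)
```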

Final conclusion

The most important practical conclusion to draw is that I can feel safe to offer L2VPN services at IPng Networks using VPP and a VXLAN or GENEVE IPv4 underlay – our backbone is 9000 bytes everywhere, so it will be possible to provide up to 8942 bytes of customer payload taking into account the VXLAN-v4 overhead. At least gigabit symmetric VLLs filled with 64b packets will not be a problem for the routers we have, as they forward approximately 10.2Mpps per core and 35Mpps per chassis when fully loaded. Even considering the overhead and CPU consumption that VXLAN encap/decap brings with it, due to the use of multiple transmit and receive threads, the router would have plenty of room to spare.
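
A quick sanity check of that 8942 byte figure, assuming the worst case QinQ customer frame and the VXLAN-v4 overhead discussed earlier:

```python
BACKBONE_MTU   = 9000          # IPng backbone MTU
CUSTOMER_L2    = 14 + 4 + 4    # customer ethernet header plus two VLAN tags (QinQ)
VXLAN_V4_ENCAP = 20 + 8 + 8    # outer IPv4 + UDP + VXLAN headers

print(BACKBONE_MTU - CUSTOMER_L2 - VXLAN_V4_ENCAP)  # 8942 bytes of customer payload
```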

Appendix

The backing data for the graph in this article are captured in this Google Sheet.

VPP Configuration

For completeness, the startup.conf used on both Rhino and Hippo:

unix {
  nodaemon
  log /var/log/vpp/vpp.log
  full-coredump
  cli-listen /run/vpp/cli.sock
  cli-prompt rhino#
  gid vpp
}

api-trace { on }
api-segment { gid vpp }
socksvr { default }

memory {
  main-heap-size 1536M
  main-heap-page-size default-hugepage
}

cpu {
  main-core 0
  corelist-workers 1-15
}

buffers {
  buffers-per-numa 300000
  default data-size 2048
  page-size default-hugepage
}

statseg {
  size 1G
  page-size default-hugepage
  per-node-counters off
}

dpdk {
  dev default {
    num-rx-queues 7
  }
  decimal-interface-names
  dev 0000:0c:00.0
  dev 0000:0c:00.1
}

plugins {
  plugin lcpng_nl_plugin.so { enable }
  plugin lcpng_if_plugin.so { enable }
}

logging {
  default-log-level info
  default-syslog-log-level crit
  class linux-cp/if { rate-limit 10000 level debug syslog-level debug }
  class linux-cp/nl { rate-limit 10000 level debug syslog-level debug }
}

lcpng {
  default netns dataplane
  lcp-sync
  lcp-auto-subint
}

Other Details

For posterity, some other stats on the VPP deployment. First of all, a confirmation that PCIe 4.0 x16 slots were used, and that the Comms DDP was loaded:

[    0.433903] pci 0000:0c:00.0: [8086:1592] type 00 class 0x020000
[    0.433924] pci 0000:0c:00.0: reg 0x10: [mem 0xea000000-0xebffffff 64bit pref]
[    0.433946] pci 0000:0c:00.0: reg 0x1c: [mem 0xee010000-0xee01ffff 64bit pref]
[    0.433964] pci 0000:0c:00.0: reg 0x30: [mem 0xfcf00000-0xfcffffff pref]
[    0.434104] pci 0000:0c:00.0: reg 0x184: [mem 0xed000000-0xed01ffff 64bit pref]
[    0.434106] pci 0000:0c:00.0: VF(n) BAR0 space: [mem 0xed000000-0xedffffff 64bit pref] (contains BAR0 for 128 VFs)
[    0.434128] pci 0000:0c:00.0: reg 0x190: [mem 0xee220000-0xee223fff 64bit pref]
[    0.434129] pci 0000:0c:00.0: VF(n) BAR3 space: [mem 0xee220000-0xee41ffff 64bit pref] (contains BAR3 for 128 VFs)
[   11.216343] ice 0000:0c:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.30.0
[   11.280567] ice 0000:0c:00.0: PTP init successful
[   11.317826] ice 0000:0c:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8
[   11.317828] ice 0000:0c:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode.
[   11.317829] ice 0000:0c:00.0: Commit DCB Configuration to the hardware
[   11.320608] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)

And this is how the NIC shows up in VPP; in particular, the rx/tx burst modes and functions are interesting:

hippo# show hardware-interfaces 
              Name                Idx   Link  Hardware
HundredGigabitEthernet12/0/0       1     up   HundredGigabitEthernet12/0/0
  Link speed: 100 Gbps
  RX Queues:
    queue thread         mode      
    0     vpp_wk_0 (1)   polling   
    1     vpp_wk_1 (2)   polling   
    2     vpp_wk_2 (3)   polling   
    3     vpp_wk_3 (4)   polling   
    4     vpp_wk_4 (5)   polling   
    5     vpp_wk_5 (6)   polling   
    6     vpp_wk_6 (7)   polling   
  Ethernet address b4:96:91:b3:b1:10
  Intel E810 Family
    carrier up full duplex mtu 9190  promisc
    flags: admin-up promisc maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported
    Devargs: 
    rx: queues 7 (max 64), desc 1024 (min 64 max 4096 align 32)
    tx: queues 16 (max 64), desc 1024 (min 64 max 4096 align 32)
    pci: device 8086:1592 subsystem 8086:0002 address 0000:0c:00.00 numa 0
    max rx packet len: 9728
    promiscuous: unicast on all-multicast on
    vlan offload: strip off filter off qinq off
    rx offload avail:  vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip 
                       outer-ipv4-cksum vlan-filter vlan-extend jumbo-frame 
                       scatter keep-crc rss-hash 
    rx offload active: ipv4-cksum jumbo-frame scatter 
    tx offload avail:  vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum 
                       tcp-tso outer-ipv4-cksum qinq-insert multi-segs mbuf-fast-free 
                       outer-udp-cksum 
    tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs 
    rss avail:         ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4 
                       ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6 
                       l2-payload 
    rss active:        ipv4-frag ipv4-tcp ipv4-udp ipv4 ipv6-frag ipv6-tcp 
                       ipv6-udp ipv6 
    tx burst mode: Scalar
    tx burst function: ice_xmit_pkts
    rx burst mode: Offload Vector AVX2 Scattered
    rx burst function: ice_recv_scattered_pkts_vec_avx2_offload

Finally, in case it’s interesting, an output of lscpu, lspci and dmidecode as run on Hippo (Rhino is an identical machine).