About this series
Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Over the years, folks have asked me regularly “What about BSD?” and to my surprise, late last year I read an announcement from the FreeBSD Foundation [ref] as they looked back over 2023 and forward to 2024:
Porting the Vector Packet Processor to FreeBSD
Vector Packet Processing (VPP) is an open-source, high-performance user space networking stack that provides fast packet processing suitable for software-defined networking and network function virtualization applications. VPP aims to optimize packet processing through vectorized operations and parallelism, making it well-suited for high-speed networking applications. In November of this year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other tasks such as testing FreeBSD on common virtualization platforms to improve the desktop experience, improving hardware support on arm64 platforms, and adding support for low power idle on Intel and arm64 hardware.
In my first [article], I wrote a sort of hello world by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom's VPP branch compiles, runs and pings. In this article, I'll take a look at some comparative performance numbers.
Comparing implementations
FreeBSD has an extensive network stack: regular kernel-based functionality such as routing, filtering and bridging; a faster netmap-based datapath, which ships with userspace utilities like a netmap bridge; and of course completely userspace-based dataplanes, such as the VPP project that I'm working on here. Last week, I learned that VPP has a netmap driver, and from previous travels I am already quite familiar with its DPDK-based forwarding. I decide to do a baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the [article] for details on the setup.
The loadtests will use a common set of different configurations, using Cisco T-Rex's default benchmark profile called bench.py:
- var2-1514b: Large Packets, multiple flows with modulating source and destination IPv4 addresses, often called an ‘iperf test’, with packets of 1514 bytes.
- var2-imix: Mixed Packets, multiple flows, often called an ‘imix test’, which includes a bunch of 64b, 390b and 1514b packets.
- var2-64b: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive queues and kernel or application threads.
- 64b: Small Packets, but now single flow, often called ’linerate test’, with a packet size of 64 bytes, limiting to one receive queue.
Each of these four loadtests can run either unidirectionally (port0 -> port1) or bidirectionally (port0 <-> port1). This yields eight different loadtests, each taking about 8 minutes. I put the kettle on and get underway.
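For a flavor of what driving these looks like: from the T-Rex stateless console, a single loadtest is kicked off roughly like this (a sketch -- the tunables follow bench.py, and the multiplier and port list vary per test):

trex>start -f stl/bench.py -t vm=var2,size=64 -m 100% --port 0 1
trex>tui

The tui command then shows the live per-port packet and bit rates while the test runs.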
FreeBSD 14: Kernel Bridge
The machine I'm testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD igb(4) driver), a dual-port Intel X552 (10Gbps SFP+, using the ix(4) driver), and a dual-port Intel XXV710 (25Gbps SFP28, using the ixl(4) driver). I decide to live it up a little, and choose the 25G ports for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU will struggle a little bit at very high packet rates. No pain, no gain, amirite?
I take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC kernel that has support for the DPDK modules I'll need later. For my first loadtest, I create a kernel-based bridge as follows, just tying the two 25G interfaces together:
[pim@france /usr/obj]$ uname -a
FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024 root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64
[pim@france ~]$ dmesg | grep ixl
ixl0: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7
ixl1: <Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k> mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7
[pim@france ~]$ sudo ifconfig bridge0 create
[pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up
[pim@france ~]$ sudo ifconfig ixl0 up
[pim@france ~]$ sudo ifconfig ixl1 up
[pim@france ~]$ ifconfig bridge0
bridge0: flags=1008843<UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP> metric 0 mtu 1500
options=0
ether 58:9c:fc:10:6c:2e
id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15
maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200
root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0
member: ixl1 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 4 priority 128 path cost 800
member: ixl0 flags=143<LEARNING,DISCOVER,AUTOEDGE,AUTOPTP>
ifmaxaddr 0 port 3 priority 128 path cost 800
groups: bridge
nd6 options=9<PERFORMNUD,IFDISABLED>
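This bridge won't survive a reboot. I don't need it to for these loadtests, but for completeness, persisting it should look something like this in /etc/rc.conf (a sketch, following rc.conf(5) conventions):

cloned_interfaces="bridge0"
ifconfig_bridge0="addm ixl0 addm ixl1 up"
ifconfig_ixl0="up"
ifconfig_ixl1="up"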
One thing I quickly realize is that FreeBSD, with hyperthreading enabled, does have 8 threads available, but only 4 of them participate in forwarding. When I put the machine under load, I see a curious 399% spent in kernel while I see 402% in idle:
When I then do a single-flow unidirectional loadtest, the expected outcome is that only one CPU participates (100% in kernel and 700% in idle) and if I perform a single-flow bidirectional loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in kernel and 600% in idle).
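For those following along: I'm watching this in another terminal with top(1), showing per-CPU usage, system processes and individual threads:

[pim@france ~]$ top -P -S -H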
While the math checks out, the performance is a little bit less impressive:
Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
---|---|---|---|---|
vm=var2,size=1514 | Unidirectional | 2.02Mpps | 24.77Gbps | 99% |
vm=var2,size=imix | Unidirectional | 3.48Mpps | 10.23Gbps | 43% |
vm=var2,size=64 | Unidirectional | 3.61Mpps | 2.43Gbps | 9.7% |
size=64 | Unidirectional | 1.22Mpps | 0.82Gbps | 3.2% |
vm=var2,size=1514 | Bidirectional | 3.77Mpps | 46.31Gbps | 93% |
vm=var2,size=imix | Bidirectional | 3.81Mpps | 11.22Gbps | 24% |
vm=var2,size=64 | Bidirectional | 4.02Mpps | 2.69Gbps | 5.4% |
size=64 | Bidirectional | 2.29Mpps | 1.54Gbps | 3.1% |
Conclusion: FreeBSD's kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU thread, and I can use only four of them. FreeBSD is happy to forward big packets, and I can reasonably reach 2x25Gbps, but once I start ramping up the packets/sec by lowering the packet size, things very quickly deteriorate.
FreeBSD 14: netmap Bridge
Tom pointed out a tool in the source tree called the netmap bridge, originally written by Luigi Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub repository [ref].
What is netmap anyway? It's a framework for extremely fast and efficient packet I/O for userspace and kernel clients, and for virtual machines. It runs on FreeBSD, Linux and some versions of Windows. As an aside, my buddy Pavel from FastNetMon pointed out a blog post from 2015 in which Cloudflare folks described a way to do DDoS mitigation on Linux: traffic classification programs the network cards to move certain offensive traffic to a dedicated hardware queue, and a netmap client services that queue. If you're curious (I certainly was!), you might take a look at that cool write-up [here].
I compile the code and put it to work. The man-page tells me that I need to fiddle with the interfaces a bit:
- they need to be set to promiscuous mode, which makes sense as they have to receive ethernet frames sent to MAC addresses other than their own;
- any hardware offloading needs to be turned off, notably -rxcsum -txcsum -tso4 -tso6 -lro;
- my user needs write permission to /dev/netmap to bind the interfaces from userspace.
[pim@france /usr/src/tools/tools/netmap]$ make
[pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap
[pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc
[pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap
[pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1
065.804686 main [290] ------- zerocopy supported
065.804708 main [297] Wait 4 secs for link to come up...
075.810547 main [301] Ready to go, ixl0 0x0/4 <-> ixl1 0x0/4.
I start my first loadtest, which fails almost immediately. It's an interesting behavior pattern which I've not seen before. After staring at the problem, and reading the code of bridge.c, which is a remarkably straightforward program, I restart the bridge utility, and traffic passes again, but only for a little while. Whoops!
I took a [screencast] in case any kind soul on freebsd-net wants to take a closer look at this:
I start a bit of trial and error, from which I conclude that if I send a lot of traffic (like 10Mpps), forwarding is fine; but if I send a little traffic (like 1kpps), at some point forwarding stops altogether. So while it's not great, this does allow me to measure the total throughput just by sending a lot of traffic, say 30Mpps, and seeing what amount comes out the other side.
Here I go, and I’m having fun:
Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
---|---|---|---|---|
vm=var2,size=1514 | Unidirectional | 2.04Mpps | 24.72Gbps | 100% |
vm=var2,size=imix | Unidirectional | 8.16Mpps | 23.76Gbps | 100% |
vm=var2,size=64 | Unidirectional | 10.83Mpps | 5.55Gbps | 29% |
size=64 | Unidirectional | 11.42Mpps | 5.83Gbps | 31% |
vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.27Gbps | 96% |
vm=var2,size=imix | Bidirectional | 11.31Mpps | 32.74Gbps | 77% |
vm=var2,size=64 | Bidirectional | 11.39Mpps | 5.83Gbps | 15% |
size=64 | Bidirectional | 11.57Mpps | 5.93Gbps | 16% |
Conclusion: FreeBSD's netmap implementation is also bound by packets/sec, and in this setup the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. What I find cool is that single flow or multiple flows doesn't seem to matter that much; in fact, the bidirectional 64b single-flow loadtest was the most favorable at 11.57Mpps, which is an order of magnitude better than using just the kernel (which clocked in at 1.2Mpps per CPU thread).
FreeBSD 14: VPP with netmap
It’s good to have a baseline on this machine on how the FreeBSD kernel itself performs. But of course this series is about Vector Packet Processing, so I now turn my attention to the VPP branch that Tom shared with me. I wrote a bunch of details about the VM and bare metal install in my [first article] so I’ll just go straight to the configuration parts:
DBGvpp# create netmap name ixl0
DBGvpp# create netmap name ixl1
DBGvpp# set int state netmap-ixl0 up
DBGvpp# set int state netmap-ixl1 up
DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1
DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0
DBGvpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
netmap-ixl0 1 up 9000/0/0/0 rx packets 25622
rx bytes 1537320
tx packets 25437
tx bytes 1526220
netmap-ixl1 2 up 9000/0/0/0 rx packets 25437
rx bytes 1526220
tx packets 25622
tx bytes 1537320
At this point I can pretty much rule out that the netmap bridge.c is the issue, because the behavior here is the same: a few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester receives no more packets, even though T-Rex is still sending. However, about a minute later I can see that the RX and TX counters continue to increase in the VPP dataplane:
DBGvpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
local0 0 down 0/0/0/0
netmap-ixl0 1 up 9000/0/0/0 rx packets 515843
rx bytes 30950580
tx packets 515657
tx bytes 30939420
netmap-ixl1 2 up 9000/0/0/0 rx packets 515657
rx bytes 30939420
tx packets 515843
tx bytes 30950580
.. and I can see that every packet that VPP received is accounted for: interface ixl0 has received 515843 packets, and ixl1 claims to have transmitted exactly that number of packets. So I think perhaps they are getting lost somewhere on egress, between the kernel and the Intel XXV710 network card.
However, in contrast to the previous case, I cannot sustain any reasonable amount of traffic: be it 1Kpps, 10Kpps or 10Mpps, the system pretty consistently comes to a halt mere seconds after introducing the load. Restarting VPP makes it forward traffic again for a few seconds, only to end up in the same upset state. I don't learn much.
Conclusion: This setup with VPP using netmap does not yield results, for the moment. I suspect that whatever causes the netmap bridge to stall in the previous test is likely also the culprit here.
FreeBSD 14: VPP with DPDK
But not all is lost - I have one test left, and judging by what I learned last week when bringing up the first test environment, this one is going to be a fair bit better. In my previous loadtests, the network interfaces were bound to their usual kernel driver (ixl(4) in the case of the Intel XXV710 interfaces), but now I'm going to mix it up a little and rebind these interfaces to a specific DPDK driver called nic_uio(4), which stands for Network Interface Card Userspace Input/Output:
[pim@france ~]$ cat << EOF | sudo tee -a /boot/loader.conf
nic_uio_load="YES"
hw.nic_uio.bdfs="6:0:0,6:0:1"
EOF
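The bus:device:function values for hw.nic_uio.bdfs can be read off with pciconf(8) - roughly like this for these two ports, with most of the output elided:

[pim@france ~]$ pciconf -l | grep ixl
ixl0@pci0:6:0:0:  class=0x020000 ...
ixl1@pci0:6:0:1:  class=0x020000 ...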
After I reboot, the network interfaces are gone from the output of ifconfig(8), which is good. I start up VPP with a minimal config file [ref], which defines three worker threads and starts DPDK with 3 RX queues and 4 TX queues. A common question is why there would be one more TX queue than RX queues. The explanation is that in VPP, there is one (1) main thread and zero or more worker threads. If the main thread wants to send traffic (for example, in a plugin like LLDP, which sends periodic announcements), it is most efficient for it to use a transmit queue specific to the main thread. Any return traffic will be picked up by the DPDK Poll Mode Driver on the worker threads (as main does not run one of these). That's why the general rule is num(TX) = num(RX) + 1. A sketch of what such a config might look like follows below.
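Tom's exact config is in the [ref] above; as a sketch, the relevant stanzas of a startup.conf for this setup might look as follows, assuming the two 25G ports live at PCI addresses 0000:06:00.0 and 0000:06:00.1 (matching the nic_uio bdfs set earlier):

cpu {
  main-core 0
  corelist-workers 1-3
}

dpdk {
  dev 0000:06:00.0 { num-rx-queues 3 num-tx-queues 4 }
  dev 0000:06:00.1 { num-rx-queues 3 num-tx-queues 4 }
}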
[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf
[pim@france ~/src/vpp]$ gmake run-release
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1
vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0
vpp# set int state TwentyFiveGigabitEthernet6/0/0 up
vpp# set int state TwentyFiveGigabitEthernet6/0/1 up
vpp# show int
Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count
TwentyFiveGigabitEthernet6/0/0 1 up 9000/0/0/0 rx packets 11615035382
rx bytes 1785998048960
tx packets 700076496
tx bytes 161043604594
TwentyFiveGigabitEthernet6/0/1 2 up 9000/0/0/0 rx packets 700076542
rx bytes 161043674054
tx packets 11615035440
tx bytes 1785998136540
local0 0 down 0/0/0/0
And with that, the dataplane shoots to life and starts forwarding (lots of) packets. To my great relief, sending either 1kpps or 1Mpps “just works”. I can run my loadtest as per normal, first with 1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of course, both unidirectionally and bidirectionally.
I take a look at the system load while the loadtests are running:
It is fully expected that the VPP process is spinning at 300% +epsilon of CPU time. This is because it has started three worker threads, and these are executing the DPDK Poll Mode Driver, which is essentially a tight loop that asks the network cards for work and, if any packets have arrived, executes on that work. As such, each worker thread is always burning 100% of its assigned CPU.
That said, I can take a look at finer-grained statistics in the dataplane itself:
vpp# show run
Thread 0 vpp_main (lcore 0)
Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19
vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
ip4-full-reassembly-expire-wal any wait 0 0 18 2.39e3 0.00
ip6-full-reassembly-expire-wal any wait 0 0 18 3.08e3 0.00
unix-cli-process-0 active 0 0 9 7.62e4 0.00
unix-epoll-input polling 13066 0 0 1.50e5 0.00
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01
vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 2.20e1 12.63
TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 9.54e1 12.63
dpdk-input polling 1531252 5047800 0 1.45e2 3.29
ethernet-input active 399663 5047800 0 3.97e1 12.63
l2-input active 399663 5047800 0 2.93e1 12.63
l2-output active 399663 5047800 0 2.53e1 12.63
unix-epoll-input polling 1494 0 0 3.09e2 0.00
(et cetera)
I showed only one worker thread's output, but there are actually three worker threads, and they are all doing similar work, because each of them picks up roughly 33% of the traffic, one worker for each of the three RX queues in the network card.
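If you're curious which worker services which RX queue, VPP can show the mapping with standard CLI (output will vary):

vpp# show interface rx-placement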
While the overall CPU load is 300%, here I can see a different picture. Thread 0 (the main thread) is doing essentially ~nothing. It is polling a set of unix sockets in the node called unix-epoll-input, but other than that, main doesn't have much on its plate. Thread 1 however is a worker thread, and I can see that it is busy doing work:
- dpdk-input: it's polling the NIC for work; it has been called 1.53M times, and in total it has handled just over 5.04M vectors (which are packets). So I can derive that each time the Poll Mode Driver returns work, there are on average 3.29 vectors (packets), and each packet takes about 145 CPU clocks.
- ethernet-input: the DPDK vectors are all ethernet frames coming from the loadtester. Seeing as I have cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice versa, VPP knows that it should handle the packets in the L2 forwarding path.
- l2-input: is called with the (list of N) ethernet frames, which all get cross connected to the output interface, in this case Tf6/0/1.
- l2-output: prepares the ethernet frames for output on their egress interface.
- TwentyFiveGigabitEthernet6/0/1-output (note: the name is truncated in the output above): if this were L3 traffic, this would be the place where the destination MAC address is inserted into the ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames through to the final egress node in DPDK.
- TwentyFiveGigabitEthernet6/0/1-tx (note: also truncated): hands them to the DPDK driver for marshalling onto the wire.
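As an aside: to get a clean sample like the 10-second interval shown above, it helps to zero the statistics right before measuring. This is standard VPP CLI:

vpp# clear runtime
(let the loadtest run for ten seconds or so)
vpp# show runtime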
Halfway through, I see that there’s an issue with the distribution of ingress traffic over the three workers, maybe you can spot it too:
---------------
Thread 1 vpp_wk_0 (lcore 1)
Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84
vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.34e1 30.93
TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.37e2 30.93
TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.45e1 30.93
TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.34e2 30.93
dpdk-input polling 7128012 413802792 0 8.77e1 58.05
ethernet-input active 13378125 413802792 0 2.77e1 30.93
l2-input active 6809002 413802792 0 1.81e1 60.77
l2-output active 6809002 413802792 0 1.68e1 60.77
unix-epoll-input polling 6954 0 0 6.61e2 0.00
---------------
Thread 2 vpp_wk_1 (lcore 2)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68
vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 1.27e1 256.00
TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 2.64e2 256.00
TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 1.39e1 256.00
TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 2.74e2 256.00
dpdk-input polling 456112 233529344 0 1.41e2 512.00
ethernet-input active 912224 233529344 0 5.71e1 256.00
l2-input active 912224 233529344 0 3.66e1 256.00
l2-output active 912224 233529344 0 1.70e1 256.00
unix-epoll-input polling 445 0 0 9.59e2 0.00
---------------
Thread 3 vpp_wk_2 (lcore 3)
Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43
vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 8.94e0 256.00
TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 2.81e2 256.00
TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 9.54e0 256.00
TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 2.72e2 256.00
dpdk-input polling 456113 233529856 0 1.61e2 512.00
ethernet-input active 912226 233529856 0 4.50e1 256.00
l2-input active 912226 233529856 0 2.93e1 256.00
l2-output active 912226 233529856 0 1.23e1 256.00
unix-epoll-input polling 445 0 0 1.03e3 0.00
Thread 1 (vpp_wk_0) is handling 7.29Mpps and is moderately loaded, while threads 2 and 3 are each handling 4.12Mpps and are completely pegged. That said, the relative number of CPU clocks they spend per packet is reasonably similar, but the numbers don't quite add up:
- Thread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this number by adding up all of the values in the Clocks column, except for the unix-epoll-input node. But that's somewhat strange, because this Xeon D-1518 clocks at 2.2GHz – and yet 7.29M * 449 works out to 3.27GHz. My experience (in Linux) is that these numbers actually line up quite well.
- Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of makes sense, as the cycles/packet is roughly double that of thread 1, and the packets/sec roughly half … and the total of 4.12M * 816 is 3.36GHz.
- I see similar values for thread 3: 4.12Mpps and 819 CPU cycles per packet, which amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread.
When I look at the thread-to-CPU placement, I get another surprise:
vpp# show threads
ID Name Type LWP Sched Policy (Priority) lcore Core Socket State
0 vpp_main 100346 (nil) (n/a) 0 42949674294967
1 vpp_wk_0 workers 100473 (nil) (n/a) 1 42949674294967
2 vpp_wk_1 workers 100474 (nil) (n/a) 2 42949674294967
3 vpp_wk_2 workers 100475 (nil) (n/a) 3 42949674294967
vpp# show cpu
Model name: Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz
Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3
Flags: sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe
rdseed aes invariant_tsc
Base frequency: 2.19 GHz
The numbers in show threads are all messed up, and I don't quite know what to make of it yet. I think the perhaps overly Linux-specific implementation of the thread pool management is throwing off FreeBSD a bit. Perhaps some profiling could be useful, so I make a note to discuss this with Tom or the freebsd-net mailing list, who will know a fair bit more about this type of stuff on FreeBSD than I do.
Anyway, functionally this works. Performance-wise, I have some questions :-) I let all eight loadtests complete, and without further ado, here are the results:
Type | Uni/BiDir | Packets/Sec | L2 Bits/Sec | Line Rate |
---|---|---|---|---|
vm=var2,size=1514 | Unidirectional | 2.01Mpps | 24.45Gbps | 99% |
vm=var2,size=imix | Unidirectional | 8.07Mpps | 23.42Gbps | 99% |
vm=var2,size=64 | Unidirectional | 23.93Mpps | 12.25Gbps | 64% |
size=64 | Unidirectional | 12.80Mpps | 6.56Gbps | 34% |
vm=var2,size=1514 | Bidirectional | 3.91Mpps | 47.35Gbps | 86% |
vm=var2,size=imix | Bidirectional | 13.38Mpps | 38.81Gbps | 82% |
vm=var2,size=64 | Bidirectional | 15.56Mpps | 7.97Gbps | 21% |
size=64 | Bidirectional | 20.96Mpps | 10.73Gbps | 28% |
Conclusion: I have to say that 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby only being able to make use of one DPDK worker), and 20.96Mpps on a bidirectional 64b single-flow loadtest, is not too shabby. But seeing as one CPU thread can do 12.8Mpps, I would expect three CPU threads to perform at 38.4Mpps or thereabouts, yet I'm seeing only 23.9Mpps and some unexplained variance in per-thread performance.
Results
I learned a lot! Some highlights:
- The netmap implementation is not playing ball for the moment: forwarding consistently stops, both with the bridge.c utility and with the VPP netmap plugin.
- It is clear, though, that netmap is a fair bit faster (11.4Mpps) than kernel forwarding, which came in at roughly 1.2Mpps per CPU thread. What's a bit troubling is that netmap doesn't yet work very well in VPP – traffic forwarding also stops there.
- DPDK performs quite well on FreeBSD: I manage to see a throughput of 20.96Mpps, which is almost twice the throughput of netmap. That's cool, but I can't quite explain the stark variance in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads? Perhaps an equivalent of isolcpus in the Linux kernel would help? (See the cpuset(1) sketch below.)
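On that last point: FreeBSD doesn't have isolcpus, but cpuset(1) can approximate it by keeping everything else away from the worker cores. A sketch, assuming VPP's workers run on cores 1-3 as configured earlier – not something I've validated on this machine:

# restrict the default cpuset (set 1), which all boot-time processes belong to, to core 0
[pim@france ~]$ sudo cpuset -l 0 -s 1
# then widen the mask for the VPP process itself; its workers pin themselves to cores 1-3
[pim@france ~]$ sudo cpuset -l 0-3 -p $(pgrep -o vpp)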
For the curious, I’ve bundled up a few files that describe the machine and its setup: [dmesg] [pciconf] [loader.conf] [VPP startup.conf]