VPP with sFlow - Part 2

Introduction

sFlow Logo

Last month, I picked up a project together with Neil McKee of [inMon], the caretakers of [sFlow]: an industry standard technology for monitoring high speed switched networks. sFlow gives complete visibility into the use of networks, enabling performance optimization, accounting/billing for usage, and defense against security threats.

The open source software dataplane [VPP] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [DPDK] and [RDMA]. A clever design choice in the so-called Host sFlow Daemon [host-sflow] allows a small portion of code to grab the samples, for example in a merchant silicon ASIC or FPGA, but equally in the VPP software dataplane, and then transmit these samples using a Linux kernel feature called [PSAMPLE]. This greatly reduces the complexity of the code that has to live in the forwarding path, while at the same time bringing consistency to the sFlow delivery pipeline by (re)using the hsflowd business logic for the more complex state keeping, packet marshalling and transmission from the Agent to a central Collector.

Last month, Neil and I discussed the proof of concept [ref] and I described this in a [first article]. Then, we iterated on the VPP plugin, playing with a few different approaches to strike a balance between performance, code complexity, and agent features. This article describes our journey.

VPP: an sFlow plugin

There are three things Neil and I specifically want to take a look at:

  1. If sFlow is not enabled on a given interface, there should not be a regression on other interfaces.
  2. If sFlow is enabled, but a packet is not sampled, the overhead should be as small as possible, targeting single digit CPU cycles per packet.
  3. If sFlow actually selects a packet for sampling, it should be moved out of the dataplane as quickly as possible, targeting double digit CPU cycles per sample.

For all of this validation and loadtesting, I use a bare metal VPP machine which receives load from a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.

1. RX Queue Placement

It’s important that the network card that is receiving the traffic gets serviced by a worker thread on the same NUMA domain. Since my machine has two processors (and thus, two NUMA nodes), I will align the NICs with the correct processor, like so:

set interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0
set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2
set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4
set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6

set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1
set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3
set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5
set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7

2. L3 IPv4/MPLS interfaces

I will take two pairs of interfaces, one pair on NUMA0 and the other on NUMA1, so that I can compare L3 IPv4 or MPLS running without sFlow (these are TenGig3/0/, which I will call the baseline pairs) against the same setup running with sFlow (these are TenGig130/0/, which I’ll call the experiment pairs).

comment { L3: IPv4 interfaces }
set int state TenGigabitEthernet3/0/0 up
set int state TenGigabitEthernet3/0/1 up
set int state TenGigabitEthernet130/0/0 up
set int state TenGigabitEthernet130/0/1 up
set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31
set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31
set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31
set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31
ip route add 16.0.0.0/24 via 100.64.0.0
ip route add 48.0.0.0/24 via 100.64.1.0
ip route add 16.0.2.0/24 via 100.64.4.0
ip route add 48.0.2.0/24 via 100.64.5.0
ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static
ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static
ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static
ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static

Here, the only specific trick worth mentioning is the use of ip neighbor to pre-populate the L2 adjacencies for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to when a packet has to be forwarded to 100.64.0.0 or 100.64.5.0, and it saves VPP from having to use ARP resolution.

The configuration for an MPLS label switching router (LSR), also called a P-router, is added next:

comment { MPLS interfaces }
mpls table add 0
set interface mpls TenGigabitEthernet3/0/0 enable
set interface mpls TenGigabitEthernet3/0/1 enable
set interface mpls TenGigabitEthernet130/0/0 enable
set interface mpls TenGigabitEthernet130/0/1 enable
mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17
mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16
mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21
mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20

3. L2 CrossConnect interfaces

Here, I will also use NUMA0 as my baseline (sFlow disabled) pair, and an equivalent pair of TenGig interfaces on NUMA1 as my experiment (sFlow enabled) pair. This way, I can compare the performance impact of enabling sFlow, and I can also assert that no regression occurs in the baseline pair when I enable a feature in the experiment pair, which should really never happen.

comment { L2 xconnected interfaces }
set int state TenGigabitEthernet3/0/2 up
set int state TenGigabitEthernet3/0/3 up
set int state TenGigabitEthernet130/0/2 up
set int state TenGigabitEthernet130/0/3 up
set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3
set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2
set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3
set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2

4. T-Rex Configuration

The Cisco T-Rex loadtester is running on another machine in the same rack. Physically, it has eight ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [ref]. From there, eight ports go to my VPP machine. The LAB switch just has VLANs with two ports in each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0, VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight VLANs are used.

The configuration for T-Rex then becomes:

- version: 2
  interfaces: [ '06:00.0', '06:00.1', '83:00.0', '83:00.1', '87:00.0', '87:00.1', '85:00.0', '85:00.1' ]
  port_info:
    - src_mac:  00:1b:21:06:00:00
      dest_mac: 9c:69:b4:61:a1:dc
    - src_mac:  00:1b:21:06:00:01
      dest_mac: 9c:69:b4:61:a1:dd

    - src_mac:  00:1b:21:83:00:00
      dest_mac: 00:1b:21:83:00:01
    - src_mac:  00:1b:21:83:00:01
      dest_mac: 00:1b:21:83:00:00

    - src_mac:  00:1b:21:87:00:00
      dest_mac: 9c:69:b4:61:75:d0
    - src_mac:  00:1b:21:87:00:01
      dest_mac: 9c:69:b4:61:75:d1

    - src_mac:  9c:69:b4:85:00:00
      dest_mac: 9c:69:b4:85:00:01
    - src_mac:  9c:69:b4:85:00:01
      dest_mac: 9c:69:b4:85:00:00

Do you see how the first pair sends from src_mac 00:1b:21:06:00:00? That’s the T-Rex side, and it encodes the PCI device 06:00.0 in the MAC address. It sends traffic to dest_mac 9c:69:b4:61:a1:dc, which is the MAC address of VPP’s TenGig3/0/0 interface. Looking back at the ip neighbor VPP config above, it becomes much easier to see who is sending traffic to whom.

For L2XC, the MAC addresses don’t matter. VPP will set the NIC in promiscuous mode, which means it’ll accept any ethernet frame, not only those sent to the NIC’s own MAC address. Therefore, in L2XC mode (the second and fourth pairs), I just use the MAC addresses from T-Rex. I find debugging connections and looking up FDB entries on the Mellanox switch much, much easier this way.

With all config in place, but with sFlow disabled, I run a quick bidirectional loadtest using 256b packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!

T-Rex Acceptance Loadtest

The name of the game is now to run a loadtest that shows the packet throughput and CPU cycles spent for each of the plugin iterations, comparing their performance on ports with and without sFlow enabled. For each iteration, I will use exactly the same VPP configuration, I will generate unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP’s performance both at baseline and at a somewhat unfavorable 1:100 sampling rate.

Ready? Here I go!

v1: Workers send RPC to main

TL/DR: 13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in baseline

The first iteration goes all the way back to a proof of concept from last year. It’s described in detail in my [first post]. The performance results are not stellar:

  • ☢ When slamming a single sFlow enabled interface, all interfaces regress. When sending 8Mpps of IPv4 traffic through a baseline interface, that is, an interface without sFlow enabled, only 5.2Mpps get through. This is considered a mortal sin in VPP-land.
  • ✅ Passing through packets without sampling them costs about 13 CPU cycles; not bad.
  • ❌ Sampling packets, specifically at higher rates (say, 1:100 or, worse, 1:10), completely destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through.

Here’s the bloodbath as seen from T-Rex:

T-Rex Loadtest for v1

Debrief: When we talked through these issues, we drew the conclusion that it would be much faster if, when a worker thread produces a sample, the worker appended it to a producer queue and moved on, instead of sending an RPC to main and taking the spinlock. This way, no locks are needed, and each worker thread gets its own producer queue.

Then, we can create a separate thread (or even a pool of threads), possibly scheduled on a different CPU (or in main), that runs a loop iterating over all sFlow sample queues, consuming the samples and sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if too many come in.

v2: Workers send PSAMPLE directly

TL/DR: 7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces

But before we do that, we have one curiosity itch to scratch: what if we sent the sample directly from the worker? With such a model, if it works, we would need no RPCs or sample queue at all. Of course, in this model any sample has to be rewritten into a PSAMPLE packet and written via the netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty certain, though: it should be much faster than sending an RPC to the main thread.

After a short refactor, Neil commits [d278273], which adds compiler macros SFLOW_SEND_FROM_WORKER (v2) and SFLOW_SEND_VIA_MAIN (v1). When workers send directly, they invoke sflow_send_sample_from_worker() instead of sending an RPC with vl_api_rpc_call_main_thread() as in the previous version.
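
To make the two paths concrete, here is a compile-time sketch of that dispatch. The macro names and the two functions mentioned above come from the commit; the sflow_sample_t layout and the RPC callback are illustrative assumptions on my part, not the plugin’s actual definitions:

#include <vlibmemory/api.h>     /* vl_api_rpc_call_main_thread() */

/* Illustrative sample record; the real plugin defines its own. */
typedef struct
{
  u32 sw_if_index;              /* input interface of the sampled packet */
  u32 sampling_N;               /* configured 1:N sampling rate */
  u16 header_len;               /* how much of the packet header we copied */
  u8 header[256];               /* first bytes of the sampled packet */
} sflow_sample_t;

void sflow_send_sample_from_worker (sflow_sample_t *sample); /* v2 send path */

static void
sflow_sample_rpc_cb (void *arg)   /* hypothetical; runs on the main thread */
{
  sflow_sample_t *sample = arg;
  /* ... marshal into a PSAMPLE netlink message and send it ... */
  (void) sample;
}

static inline void
sflow_sample_taken (sflow_sample_t *sample)
{
#ifdef SFLOW_SEND_FROM_WORKER
  /* v2: the worker builds and send()s the PSAMPLE message itself. */
  sflow_send_sample_from_worker (sample);
#else /* SFLOW_SEND_VIA_MAIN */
  /* v1: copy the sample over to the main thread via VPP's RPC mechanism,
   * which is what caused the cross-interface regression. */
  vl_api_rpc_call_main_thread (sflow_sample_rpc_cb, (u8 *) sample,
                               sizeof (*sample));
#endif
}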

The code currently uses clib_warning() to print stats from the dataplane, which is pretty expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU counters so we can more accurately count the cumulative time spent in each part of the calls, see [6ca61d2]. I can now see these with vppctl show err instead.
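
Roughly, the cycle accounting looks like the sketch below. The counter names mirror what shows up in show err; the enum, the helper function, and the reuse of the illustrative sflow_sample_t from the previous sketch are my assumptions, not the commit itself:

#include <vlib/vlib.h>
#include <vppinfra/time.h>      /* clib_cpu_time_now() */

/* Hypothetical error counter indices; the strings are the ones visible in
 * the 'show err' output further down. */
typedef enum
{
  SFLOW_ERROR_PROCESSED,        /* "sflow packets processed" */
  SFLOW_ERROR_SAMPLED,          /* "sflow packets sampled" */
  SFLOW_ERROR_CYCLES,           /* "CPU cycles in sent samples" */
  SFLOW_N_ERROR,
} sflow_error_t;

static inline void
sflow_send_and_count (vlib_main_t *vm, u32 node_index, sflow_sample_t *sample)
{
  u64 t0 = clib_cpu_time_now ();

  /* The expensive part we want to measure: building the PSAMPLE netlink
   * message and writing it to the socket, from the worker thread. */
  sflow_send_sample_from_worker (sample);

  u64 cycles = clib_cpu_time_now () - t0;
  vlib_node_increment_counter (vm, node_index, SFLOW_ERROR_SAMPLED, 1);
  vlib_node_increment_counter (vm, node_index, SFLOW_ERROR_CYCLES, cycles);
}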

When loadtesting this, the deadly sin of impacting performance of interfaces that do not have sFlow enabled is gone. The throughput is not great, though. Instead of showing screenshots of T-Rex, I can also take a look at the throughput as measured by VPP itself. In its show runtime statistics, each worker thread shows both the CPU cycles spent and how many packets/sec it received and transmitted:

pim@hvn6-lab:~$ export C="v2-100"; vppctl clear run; vppctl clear err; sleep 30; \
                vppctl show run > $C-runtime.txt; vppctl show err > $C-err.txt
pim@hvn6-lab:~$ grep 'vector rates' v2-100-runtime.txt  | grep -v 'in 0'
  vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep 'sflow' v2-100-runtime.txt 
Name            State        Calls      Vectors  Suspends  Clocks    Vectors/Call  
sflow           active      844916    216298496         0  8.69e1          256.00
sflow           active     1107466    283511296         0  8.26e1          256.00
pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt 
 217929472               sflow                     sflow packets processed         error  
   1614519               sflow                      sflow packets sampled          error  
2606893106               sflow                    CPU cycles in sent samples       error  
 280697344               sflow                     sflow packets processed         error  
   2078203               sflow                      sflow packets sampled          error  
1844674406               sflow                    CPU cycles in sent samples       error  

At a glance, I can see in the first grep the in and out vector (== packet) rates for each worker thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the RX queues to worker threads, and this now pays dividends: worker thread 0 is servicing TenGig3/0/0 (as even worker thread numbers are on NUMA domain 0), while worker thread 1 is servicing TenGig130/0/0. What’s cool about this is that it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.

Looking at the output of vppctl show err, I can learn another interesting detail. See how there are 1614519 sampled packets out of 217929472 processed packets (ie. roughly a 1:100 rate)? I added a CPU clock cycle counter that counts the cumulative clocks spent once samples are taken. I can see that VPP spent 2606893106 CPU cycles sending these samples. That’s about 1615 CPU cycles per sent sample, which is pretty terrible.

Debrief: We both understand that assembling and send()ing the netlink messages from within the dataplane is a pretty bad idea. But it’s great to see that removing the use of RPCs immediately improves performance on non-enabled interfaces, and we learned what the cost is of sending those samples. An easy step forward from here is to create a producer/consumer queue, where the workers can just copy the packet into a queue or ring buffer, and have an external pthread consume from the queue/ring in another thread that won’t block the dataplane.

v3: SVM FIFO from workers, dedicated PSAMPLE pthread

TL/DR: 9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages

Neil checks in after committing [7a78e05]: he has introduced a macro SFLOW_SEND_FIFO which tries this new approach. There’s a pretty elaborate FIFO queue implementation in svm/fifo_segment.h. Neil uses this to create a segment called fifo-sflow-worker, to which each worker can write its samples in the dataplane node. A new thread called spt_process_samples can then call svm_fifo_dequeue() on all workers’ queues and pump those samples into Netlink.

The overhead of copying the samples onto a VPP native svm_fifo seems to be two orders of magnitude lower than writing directly to Netlink, even though the svm_fifo library code has many bells and whistles that we don’t need. But, perhaps due to these bells and whistles, we may be holding it wrong, as invariably after a short while the Netlink writes return Message too long errors.

pim@hvn6-lab:~$ grep 'vector rates' v3fifo-sflow-100-runtime.txt  | grep -v 'in 0'
  vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt 
Name            State        Calls      Vectors  Suspends  Clocks    Vectors/Call  
sflow           active     1096132    280609792         0  1.63e1          256.00
sflow           active     1584577    405651712         0  1.46e1          256.00
pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt 
 280635904               sflow                     sflow packets processed         error  
   2079194               sflow                      sflow packets sampled          error  
 733447310               sflow                    CPU cycles in sent samples       error  
 405689856               sflow                     sflow packets processed         error  
   3004118               sflow                      sflow packets sampled          error  
1844674407               sflow                    CPU cycles in sent samples       error  

Two things of note here. Firstly, the average clocks spent in the sFlow node have gone down from 86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after a sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles in this version. Also, any risk of Netlink writes stalling the dataplane has been eliminated, because they are now offloaded to a different thread entirely.

Debrief: It’s not great that we created a new Linux pthread for the consumer of the samples. VPP has an elaborate thread management system and collaborative multitasking in its threading model, which adds introspection like clock counters, names, show runtime, show threads and so on. I can’t help but wonder: wouldn’t we be able to just move the spt_process_samples() thread into a VPP process node instead?

v3bis: SVM FIFO, PSAMPLE process in Main

TL/DR: 9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages

Neil agrees that there’s no good reason to keep this out of main, and conjures up [df2dab8d], which rewrites the thread into an sflow_process_samples() function, using VLIB_REGISTER_NODE to add it to VPP in an idiomatic way. As a really nice benefit, we can now count how many CPU cycles are spent, in main, each time this process wakes up and does some work. It’s a widely used pattern in VPP.
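
For reference, the shape of such a process node is roughly the following; the wake-up interval and the drain loop body are illustrative, not the plugin’s actual code:

#include <vlib/vlib.h>

static uword
sflow_process_samples (vlib_main_t *vm, vlib_node_runtime_t *rt,
                       vlib_frame_t *f)
{
  uword *event_data = 0;

  while (1)
    {
      /* Sleep collaboratively; main happily runs other process nodes and
       * API/CLI handling in the meantime. */
      vlib_process_wait_for_event_or_clock (vm, 100e-6 /* seconds */);
      vlib_process_get_events (vm, &event_data);
      vec_reset_length (event_data);

      /* Drain each worker's FIFO and write the samples to PSAMPLE netlink
       * in batches, safely outside of the packet forwarding path. */
    }
  return 0; /* not reached */
}

VLIB_REGISTER_NODE (sflow_process_samples_node) = {
  .function = sflow_process_samples,
  .type = VLIB_NODE_TYPE_PROCESS,
  .name = "sflow-process-samples",
};

Because it is a regular VPP node, its clocks and suspends now show up in show runtime, which is exactly what the next loadtest output shows.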

Because of the FIFO queue message corruption, Netlink messages are failing to send at an alarming rate, which causes lots of clib_warning() messages to be spewed on the console. I replace those with a counter of failed Netlink messages instead, and commit the refactor as [6ba4715].

pim@hvn6-lab:~$ grep 'vector rates' v3bis-100-runtime.txt  | grep -v 'in 0'
  vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt 
Name                   State       Calls    Vectors  Suspends  Clocks    Vectors/Call  
sflow-process-samples  any wait        0          0     28052  4.66e4            0.00
sflow                  active    1134102  290330112         0  1.42e1          256.00
sflow                  active    1647240  421693440         0  1.32e1          256.00
pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt 
     77945               sflow                        sflow PSAMPLE sent           error  
       863               sflow                    sflow PSAMPLE send failed        error  
 290376960               sflow                     sflow packets processed         error  
   2151184               sflow                      sflow packets sampled          error  
 421761024               sflow                     sflow packets processed         error  
   3119625               sflow                      sflow packets sampled          error  

With this iteration, I make a few observations. Firstly, the sflow-process-samples node shows up and informs me that, when handling the samples from the worker FIFO queues, the process is using about 46'600 CPU cycles per wakeup. Secondly, the replacement of clib_warning() with the sflow PSAMPLE send failed counter reduced the average time in the dataplane from 16.3 to 14.2 cycles. Nice.

Debrief: A sad conclusion: of the roughly 5.3M samples taken, only 77k make it through to Netlink. All these send failures and corrupt packets are really messing things up. So while the provided FIFO implementation in svm/fifo_segment.h is idiomatic, it is also much more complex than we thought, and we fear that it may not be safe to read from another thread.

v4: Custom lockless FIFO, PSAMPLE process in Main

TL/DR: 9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!

After reading around a bit in DPDK’s [kni_fifo], Neil produces a gem of a commit in [42bbb64], where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions: sflow_fifo_enqueue() to be called in the workers, and sflow_fifo_dequeue() to be called in the main thread’s sflow-process-samples process. He then makes this thread-safe by doing what I consider black magic, in commit [dd8af17], which makes use of clib_atomic_load_acq_n() and clib_atomic_store_rel_n() macros from VPP’s vppinfra/atomics.h.
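
The actual commit uses those clib_atomic_*() wrappers; to show the idea without dragging in the whole plugin, here is a minimal, self-contained sketch of one worker’s FIFO (single producer, single consumer) written with C11 atomics, with illustrative names and sizes:

#include <stdatomic.h>
#include <stdbool.h>

#define SFLOW_FIFO_DEPTH 4      /* power of two, so the counters wrap cleanly */

typedef struct { unsigned char data[256]; unsigned len; } sample_t;

typedef struct
{
  sample_t slots[SFLOW_FIFO_DEPTH];
  _Atomic unsigned head;        /* bumped by the consumer (main) */
  _Atomic unsigned tail;        /* bumped by the producer (worker) */
} sflow_fifo_t;

/* Worker side: returns false (drop the sample) when the FIFO is full. */
static bool
fifo_enqueue (sflow_fifo_t *f, const sample_t *s)
{
  unsigned tail = atomic_load_explicit (&f->tail, memory_order_relaxed);
  unsigned head = atomic_load_explicit (&f->head, memory_order_acquire);
  if (tail - head == SFLOW_FIFO_DEPTH)
    return false;                            /* full: drop, never block */
  f->slots[tail % SFLOW_FIFO_DEPTH] = *s;    /* copy the sample in */
  atomic_store_explicit (&f->tail, tail + 1, memory_order_release);
  return true;
}

/* Main side: returns false when there is nothing to dequeue. */
static bool
fifo_dequeue (sflow_fifo_t *f, sample_t *out)
{
  unsigned head = atomic_load_explicit (&f->head, memory_order_relaxed);
  unsigned tail = atomic_load_explicit (&f->tail, memory_order_acquire);
  if (head == tail)
    return false;                            /* empty */
  *out = f->slots[head % SFLOW_FIFO_DEPTH];  /* copy the sample out */
  atomic_store_explicit (&f->head, head + 1, memory_order_release);
  return true;
}

The release store on tail publishes the copied sample to main, and the acquire load of head makes sure a slot is only reused after main has finished reading it; that is the entire trick that the acquire/release macros express on the VPP side.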

What I really like about this change is that it introduces a FIFO implementation in about twenty lines of code, which means the sampling code path in the dataplane becomes really easy to follow, and will be even faster than it was before. I take it out for a loadtest:

pim@hvn6-lab:~$ grep 'vector rates' v4-100-runtime.txt  | grep -v 'in 0'
  vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt 
Name                   State       Calls    Vectors  Suspends  Clocks    Vectors/Call  
sflow-process-samples  any wait        0          0     17767  1.52e6            0.00
sflow                  active    1121156  287015936         0  1.56e1          256.00
sflow                  active    1605772  411077632         0  1.53e1          256.00
pim@hvn6-lab:~$ grep sflow v4-100-err.txt 
   3553600               sflow                        sflow PSAMPLE sent           error  
 287101184               sflow                     sflow packets processed         error  
   2127024               sflow                      sflow packets sampled          error  
    350224               sflow                      sflow packets dropped          error  
 411199744               sflow                     sflow packets processed         error  
   3043693               sflow                      sflow packets sampled          error  
   1266893               sflow                      sflow packets dropped          error  

This is starting to be a very nice implementation! With this iteration of the plugin, all the corruption is gone, and there is only a slight regression (because we’re now actually sending the messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink. With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken on the first interface, 350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!

Doing the math, each worker can enqueue 1776800 samples in 30 seconds, which is about 59k/s per interface. I can also see that the second interface, which is doing L2XC and has a much higher packets/sec throughput, drops more samples, because main spends an equal amount of time reading samples from each queue. In other words: in an overload scenario, one interface cannot crowd out another. Slick.

Finally, completing my math, each worker has enqueued 1776800 samples to its FIFO, and I see that main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the sflow PSAMPLE send failed counter remains zero.

Debrief: In the meantime, Neil has been working on the host-sflow daemon changes to pick up these netlink messages. There’s also a bit of work to do in retrieving the packet and byte counters of the VPP interfaces, so he is creating a module in host-sflow that can consume these messages from VPP. He calls this mod_vpp, and he mails a screenshot of his work in progress. I’ll discuss the end-to-end changes with hsflowd in a followup article, and focus my efforts here on documenting the VPP parts only. But, as a teaser, here’s a screenshot of validated sflow-tool output from a VPP instance using our sFlow plugin and his pending host-sflow changes, which integrate the rest of the business logic outside of the VPP dataplane, where it’s arguably expensive to make mistakes.

Neil admits to an itch that he has been meaning to scratch all this time. In VPP’s plugins/sflow/node.c, we insert the node between device-input and ethernet-input. Here, most of the time the plugin is really just shoveling the ethernet packets through to ethernet-input. To make better use of the CPU’s caches, the loop that does this shoveling can handle one packet at a time, two packets at a time, or even four packets at a time. Although the code is super repetitive and somewhat ugly, it does actually speed up processing, in terms of CPU cycles spent per packet, if you shovel four of them at a time.

v5: Quad Bucket Brigade in worker

TL/DR: 9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main

Neil calls this the Quad Bucket Brigade, and one last finishing touch is to move from his default 2-packet to 4-packet shoveling. In commit [285d8a0], he extends a common pattern in VPP dataplane nodes: each time the node iterates, it’ll now pre-fetch up to eight packets (p0-p7) if the vector is long enough, and handle them four at a time (b0-b3). He also adds a few compiler hints for branch prediction: almost no packets will have a trace enabled, so he can use the PREDICT_FALSE() macro to allow the compiler to further optimize the code.
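
For flavor, this is roughly what such a quad loop looks like in a VPP dataplane node. It is the generic prefetch pattern, not a copy of the plugin’s node.c:

#include <vlib/vlib.h>

static_always_inline uword
sflow_quad_loop_sketch (vlib_main_t *vm, vlib_node_runtime_t *node,
                        vlib_frame_t *frame)
{
  u32 *from = vlib_frame_vector_args (frame);
  u32 n_left = frame->n_vectors;
  vlib_buffer_t *bufs[VLIB_FRAME_SIZE], **b = bufs;

  vlib_get_buffers (vm, from, bufs, n_left);

  while (n_left >= 8)
    {
      /* Prefetch the *next* four buffer headers while working on this four. */
      vlib_prefetch_buffer_header (b[4], LOAD);
      vlib_prefetch_buffer_header (b[5], LOAD);
      vlib_prefetch_buffer_header (b[6], LOAD);
      vlib_prefetch_buffer_header (b[7], LOAD);

      /* ... maybe sample b[0]..b[3], then shovel them on to ethernet-input ... */

      b += 4;
      n_left -= 4;
    }

  while (n_left > 0)
    {
      /* Leftovers (fewer than four) are handled one at a time. */
      b += 1;
      n_left -= 1;
    }

  return frame->n_vectors;
}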

Reading the dataplane code, I find it incredibly ugly, but that’s the price to pay for ultra fast throughput. How do we see the effect, though? My low-tech proposal is to make the sampling very sparse, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO is almost never called. Then, what’s left for the sFlow dataplane node is really just to shovel the packets from device-input into ethernet-input.

To measure the relative improvement, I do one test with, and one without, commit [285d8a09].

pim@hvn6-lab:~$ grep 'vector rates' v5-10M-runtime.txt  | grep -v 'in 0'
  vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt 
Name                   State       Calls    Vectors  Suspends  Clocks    Vectors/Call  
sflow-process-samples  any wait        0          0     28467  9.36e3            0.00
sflow                  active    1158325  296531200         0  1.09e1          256.00
sflow                  active    1679742  430013952         0  1.11e1          256.00

pim@hvn6-lab:~$ grep 'vector rates' v5-noquadbrigade-10M-runtime.txt  | grep -v in\ 0
  vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0
  vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0
pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt 
Name                   State       Calls    Vectors  Suspends  Clocks    Vectors/Call  
sflow-process-samples  any wait        0          0     28462  9.57e3            0.00
sflow                  active    1137571  291218176         0  1.26e1          256.00
sflow                  active    1641991  420349696         0  1.20e1          256.00

Would you look at that, this optimization actually works as advertised! There is a meaningful progression from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput. Quad-Bucket-Brigade, yaay!

I’ll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100 packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. You’ll recall at the beginning of this article I did an acceptance loadtest with sFlow disabled, but this is the exact same result with sFlow enabled:

T-Rex sFlow Acceptance Loadtest

This picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k samples/sec taken from the dataplane, with correct rate limiting due to a per-worker FIFO depth limit, yielding 25k samples/sec sent to Netlink.

What’s Next

Checking in on the three main things we wanted to ensure with the plugin:

  1. ✅ If sFlow is not enabled on a given interface, there is no regression on other interfaces.
  2. ✅ If sFlow is enabled, copying packets costs 11 CPU cycles on average.
  3. ✅ If sFlow takes a sample, it takes only marginally more CPU time to enqueue.
    • No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput,
    • 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput,
    • and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only.

The hard part is finished, but we’re not entirely done yet. What’s left is to implement a set of packet and byte counters, to send this information along with possible Linux CP data (such as the TAP interface ID on the Linux side), and to add the VPP module to hsflowd. I’ll write about that part in a followup article.

Neil has introduced vpp-dev@ to this plugin, and so far there have been no objections. But he has pointed folks to an out-of-tree GitHub repo, and I may add a Gerrit instead so it becomes part of the ecosystem. Our work so far is captured in Gerrit [41680], which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, and add some VPP-specific tidbits like FEATURE.yaml and *.rst documentation, but this should be in reasonable shape.

Acknowledgements

I’d like to thank Neil McKee from inMon for his dedication to getting things right, including the finer details such as logging, error handling, API specifications, and documentation. He has been a true pleasure to work with and learn from.