A while ago, in June 2021, we were discussing home routers that can keep up with 1G+ internet connections in the CommunityRack Telegram channel. Of course at IPng Networks we are fond of the Supermicro Xeon D1518 [ref], which has a bunch of 10Gbit X552 and 1Gbit i350 and i210 Intel NICs, but it does come at a certain price.
For smaller applications, the PC Engines APU6 [ref] is kind of cool and definitely more affordable. But, in this chat, Patrick offered an alternative, the [Fitlet2], which is a small, passively cooled, and expandable IoT-esque machine.
Fast forward 18 months, and Patrick decided to sell off his units, so I bought one off of him and decided to loadtest it. Considering the pricetag (the unit I will be testing ships for around $400) and its ability to use (1G/SFP) fiber optics, it may be a pretty cool one!
Executive Summary
TL/DR: Definitely a cool VPP router, 3x 1Gbit line rate, A- would buy again
With some care on the VPP configuration (notably RX/TX descriptors), this unit can handle L2XC at (almost) line rate in both directions (2.94Mpps out of a theoretical 2.97Mpps) with one VPP worker thread, which is not just good, it's Good Enough™. And at that point there is still plenty of headroom on the CPU, as the Atom E3950 has 4 cores.
In IPv4 routing, using two VPP worker threads and 2 RX/TX queues on each NIC, the machine keeps up with 64 byte traffic in both directions (i.e. 2.97Mpps), again with compute power to spare, using only two of the four CPU cores on the Atom E3950.
For a $400,- machine that draws close to 11 Watts fully loaded, sporting 8GB of RAM (with a maximum of 16GB), this Fitlet2 is a gem: it will easily keep up with 3x 1Gbit in a production environment, while carrying multiple full BGP tables (900K IPv4 and 170K IPv6 routes), with room to spare. It's a classy little machine!
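For the curious, here's a sketch of what the startup.conf for that two-worker, two-queue IPv4 result might look like. The exact worker core list and queue counts are my interpretation of the description above, not a verbatim copy of the config used:

cpu {
  main-core 0
  corelist-workers 1-2
}
dpdk {
  dev default {
    num-rx-queues 2
    num-tx-queues 2
    num-rx-desc 512
    num-tx-desc 1024
  }
  no-multi-seg
}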
Detailed findings
The first thing that I noticed when it arrived is how small it is! The Fitlet2's motherboard carries a non-removable Atom E3950 CPU running at 1.6GHz, from the Goldmont series. This is a notoriously slow budget CPU: it comes with 4C/4T, each CPU thread has 24kB of L1 and 1MB of L2 cache, and there is no L3 cache on this CPU at all. That means performance in applications like VPP (which try to leverage these caches) will be poorer. The main question on my mind is: does the CPU have enough oomph to keep up with the 1G network cards? I'll want this CPU to be able to handle roughly 4.5Mpps in total, in order for the Fitlet2 to count itself amongst the wirespeed routers.
Looking further, the Fitlet2 has one HDMI and one MiniDP port, two USB2 and two USB3 ports, and two Intel i211 NICs with RJ45 ports (these are 1Gbit). There's a helpful MicroSD slot, two LEDs, and an audio in- and output 3.5mm jack. The power button does worry me a little bit: I feel like just brushing against it may turn the machine off. I do appreciate the cooling situation - the top finned plate mates with the CPU on the top of the motherboard, and the bottom bracket holds a sizable aluminium cooling block which further helps dissipate heat, without needing any active cooling. The Fitlet folks claim this machine can run in environments anywhere between -50C and +112C, which I won't be doing :)
Inside, there's a single DDR3 SODIMM slot for memory (the one I have came with 8GB at 1600MT/s) and a custom, albeit open-specification, expansion board called a FACET card, which stands for Function And Connectivity Extension T-Card. Well, okay then! The FACET card in this little machine sports one extra Intel i210-IS NIC, an M.2 slot for an SSD, and an M.2 E-key slot for a WiFi card. The NIC is a 1Gbit SFP-capable device. You can see its optic cage on the FACET card above, next to the yellow CMOS / clock battery.
The whole thing is fed by a 12V power brick delivering 2A, and a nice touch is that the barrel connector has a plastic bracket that locks it into the chassis by turning it 90 degrees, so it won't flap around in the breeze and detach. I wish other embedded PCs would ship with those, as I've been fumbling around in 19" racks that are, let me say, less tightly cable organized, and may or may not have disconnected the CHIX routeserver at some point in the past. Sorry, Max :)
For the curious, here’s a list of interesting details: [lspci] - [dmidecode] - [likwid-topology] - [dmesg].
Preparing the Fitlet2
First, I grab a USB key and install Debian Bullseye (11.5) on it, using the UEFI installer. After booting, I carry through the instructions from my [VPP Production] post. Notably, I create the dataplane namespace, run an SSH and SNMP agent there, and set isolcpus=1-3 on the kernel commandline so that I can give three worker threads to VPP. I start off giving it only one (1) worker thread, because this way I can take a look at the performance of a single CPU before scaling out to the three (3) threads that this CPU can offer. I also let DPDK take its defaults, notably allowing the poll-mode drivers to choose their proposed number of queues and descriptors:
- GigabitEthernet1/0/0: Intel Corporation I211 Gigabit Network Connection (rev 03)
    rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8)
    tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8)
- GigabitEthernet3/0/0: Intel Corporation I210 Gigabit Fiber Network Connection (rev 03)
    rx: queues 1 (max 4), desc 512 (min 32 max 4096 align 8)
    tx: queues 2 (max 4), desc 512 (min 32 max 4096 align 8)
I observe that the i211 NIC allows for a maximum of two (2) RX/TX queues, while the (older!) i210
will allow for four (4) of them. And another thing that I see here is that there are two (2) TX
queues active, but I only have one worker thread, so what gives? This is because besides the worker thread(s) there is always a main thread, which may itself need to send traffic out on an interface, so it always gets attached to a TX queue of its own.
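For completeness, here is a minimal sketch of the cpu stanza in startup.conf that matches this single-worker setup. The core choice (main thread on core 0, one worker on the isolated core 1) is my assumption based on the isolcpus setting above; the stanza syntax itself is standard VPP:

cpu {
  main-core 0            ## the VPP main thread
  corelist-workers 1     ## one worker, pinned to an isolated core
}

Scaling out later is then just a matter of growing corelist-workers to 1-3.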
When exploring new hardware, I find it useful to take a look at the output of a few tactical show
commands on the CLI, such as:
1. What CPU is in this machine?
vpp# show cpu
Model name: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz
Microarch model (family): [0x6] Goldmont ([0x5c] Apollo Lake) stepping 0x9
Flags: sse3 pclmulqdq ssse3 sse41 sse42 rdrand pqe rdseed aes sha invariant_tsc
Base frequency: 1.59 GHz
2. Which devices on the PCI bus, PCIe speed details, and driver?
vpp# show pci
Address Sock VID:PID Link Speed Driver Product Name Vital Product Data
0000:01:00.0 0 8086:1539 2.5 GT/s x1 uio_pci_generic
0000:02:00.0 0 8086:1539 2.5 GT/s x1 igb
0000:03:00.0 0 8086:1536 2.5 GT/s x1 uio_pci_generic
Note: This device at slot 02:00.0
is the second onboard RJ45 i211 NIC. I have used this one
to log in to the Fitlet2 and more easily kill/restart VPP and so on, but I could of course just as
well give it to VPP, in which case I'd have three gigabit interfaces to play with (see the sketch just after this list)!
3. What details are known for the physical NICs?
vpp# show hardware GigabitEthernet1/0/0
GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0
Link speed: 1 Gbps
RX Queues:
queue thread mode
0 vpp_wk_0 (1) polling
TX Queues:
TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers]
queue shared thread(s)
0 no 0
1 no 1
Ethernet address 00:01:c0:2a:eb:a8
Intel e1000
carrier up full duplex max-frame-size 2048
flags: admin-up maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported
rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8)
tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8)
pci: device 8086:1539 subsystem 8086:0000 address 0000:01:00.00 numa 0
max rx packet len: 16383
promiscuous: unicast off all-multicast on
vlan offload: strip off filter off qinq off
rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter
vlan-extend scatter keep-crc rss-hash
rx offload active: ipv4-cksum scatter
tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum
tcp-tso multi-segs
tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs
rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp
ipv6-udp ipv6-ex ipv6
rss active: none
tx burst function: (not available)
rx burst function: (not available)
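As an aside to the note under item 2 above: if I did want to hand that second onboard i211 (0000:02:00.0) to VPP as well, a sketch of the change would be to release it from the kernel igb driver (for example with DPDK's dpdk-devbind.py) and whitelist it in the dpdk stanza of startup.conf:

dpdk {
  dev 0000:02:00.0    ## the second onboard i211, currently my management port on the kernel igb driver
}

Of course, that would leave me without an easy way into the machine to kill/restart VPP, which is exactly why I kept it on the kernel side for these loadtests.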
Configuring VPP
After this exploratory exercise, I have learned enough about the hardware to be able to take the Fitlet2 out for a spin. To configure the VPP instance, I turn to [vppcfg], which can take a YAML configuration file describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP API. I’ve written a few more posts on how it does that, notably on its [syntax] and its [planner]. A complete configuration guide on vppcfg can be found [here].
pim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb
pim@fitlet:~$ sudo apt install python3-pip
pim@fitlet:~$ sudo pip install vppcfg-0.0.3-py3-none-any.whl
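Before diving in, a quick sketch of the vppcfg workflow I'll be leaning on below, using the l2xc.yaml file I create in the next section. The subcommand names here are from memory of the vppcfg documentation linked above, so treat this as a sketch; in this post I only use plan and paste its output by hand:

pim@fitlet:~$ vppcfg check -c l2xc.yaml   ## validate the YAML offline
pim@fitlet:~$ vppcfg plan  -c l2xc.yaml   ## show the CLI statements needed to reach the target state
pim@fitlet:~$ vppcfg sync  -c l2xc.yaml   ## apply the changes to the running dataplane via the API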
Methodology
Method 1: Single CPU Thread Saturation
First I will take VPP out for a spin by creating an L2 Cross Connect where any ethernet frame
received on Gi1/0/0
will be directly transmitted as-is on Gi3/0/0
and vice versa. This is a
relatively cheap operation for VPP, as it will not have to do any routing table lookups. The
configuration looks like this:
pim@fitlet:~$ cat << EOF > l2xc.yaml
interfaces:
  GigabitEthernet1/0/0:
    mtu: 1500
    l2xc: GigabitEthernet3/0/0
  GigabitEthernet3/0/0:
    mtu: 1500
    l2xc: GigabitEthernet1/0/0
EOF
pim@fitlet:~$ vppcfg plan -c l2xc.yaml
[INFO ] root.main: Loading configfile l2xc.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134
comment { vppcfg sync: 10 CLI statement(s) follow }
set interface l2 xconnect GigabitEthernet1/0/0 GigabitEthernet3/0/0
set interface l2 tag-rewrite GigabitEthernet1/0/0 disable
set interface l2 xconnect GigabitEthernet3/0/0 GigabitEthernet1/0/0
set interface l2 tag-rewrite GigabitEthernet3/0/0 disable
set interface mtu 1500 GigabitEthernet1/0/0
set interface mtu 1500 GigabitEthernet3/0/0
set interface mtu packet 1500 GigabitEthernet1/0/0
set interface mtu packet 1500 GigabitEthernet3/0/0
set interface state GigabitEthernet1/0/0 up
set interface state GigabitEthernet3/0/0 up
[INFO ] vppcfg.reconciler.write: Wrote 11 lines to (stdout)
[INFO ] root.main: Planning succeeded
After I paste these commands on the CLI, I start T-Rex in L2 stateless mode and generate some activity by starting the bench profile on port 0, with packets of 64 bytes in size and with varying IPv4 source and destination addresses and ports:
tui>start -f stl/bench.py -m 1.48mpps -p 0
-t size=64,vm=var2
Let me explain a few highlights from the picture to the right. When starting this profile, I specified 1.48Mpps, which is the maximum number of packets/second that can be generated on a 1Gbit link when using 64 byte frames (the smallest permissible ethernet frames). I do this because the loadtester comes with 10Gbit (and 100Gbit) ports, but the Fitlet2 has only 1Gbit ports. Then, I see that port0 is indeed transmitting (Tx pps) 1.48 Mpps, shown in dark blue. This is about 992 Mbps on the wire (the Tx bps L1). Due to the overhead of ethernet (each 64 byte frame needs an additional 20 bytes on the wire [details]), the Tx bps L2 is about 64/84 * 992.35 = 756.08 Mbps, which lines up.
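As a quick sanity check of these numbers (my own back-of-the-envelope, not part of the T-Rex output): each 64 byte frame occupies 64+20=84 bytes on the wire, because of the 8 bytes of preamble/SFD and the 12 byte inter-frame gap:

pim@fitlet:~$ awk 'BEGIN {
  frame = 64; overhead = 20;       ## preamble+SFD (8) plus inter-frame gap (12)
  printf "line rate      : %.2f Mpps\n", 1e9 / ((frame + overhead) * 8) / 1e6;
  printf "L1 @ 1.48 Mpps : %.2f Mbps\n", 1.48e6 * (frame + overhead) * 8 / 1e6;
  printf "L2 @ 1.48 Mpps : %.2f Mbps\n", 1.48e6 * frame * 8 / 1e6;
}'
line rate      : 1.49 Mpps
L1 @ 1.48 Mpps : 994.56 Mbps
L2 @ 1.48 Mpps : 757.76 Mbps

which is within a percent of the 992.35 and 756.08 Mbps that T-Rex reports, as T-Rex is running just shy of exactly 1.48 Mpps.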
Then, after the Fitlet2 tries its best to forward those from its receiving Gi1/0/0 port onto its transmitting port Gi3/0/0, they are received again by T-Rex on port 1. Here, I can see that the Rx pps is 1.29 Mpps, with an Rx bps of 660.49 Mbps (which is the L2 counter), and in bright red at the top I see the drop_rate is about 95.59 Mbps. In other words, the Fitlet2 is not keeping up.
But, after I take a look at the runtime statistics, I see that the CPU isn’t very busy at all:
vpp# show run
...
Thread 1 vpp_wk_0 (lcore 1)
Time 23.8, 10 sec internal node vector rate 4.30 loops/sec 1638976.68
vector rates in 1.2908e6, out 1.2908e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
GigabitEthernet3/0/0-output active 6323688 27119700 0 9.14e1 4.29
GigabitEthernet3/0/0-tx active 6323688 27119700 0 1.79e2 4.29
dpdk-input polling 44406936 27119701 0 5.35e2 .61
ethernet-input active 6323689 27119701 0 1.42e2 4.29
l2-input active 6323689 27119701 0 9.94e1 4.29
l2-output active 6323689 27119701 0 9.77e1 4.29
Very interesting! Notice that the vector rates in .. out .. line above says the thread is receiving only 1.29Mpps, and it is managing to send all of them out as well. When a VPP worker is busy, each DPDK call will yield many packets, up to 256 in one call, which means the number of vectors per call will rise. Here, I see that DPDK is returning an average of only 0.61 packets each time it polls the NIC, and each time a batch of packets is sent off into the VPP graph, there is an average of 4.29 packets per loop. If the CPU were the bottleneck, the Vectors/Call column would look more like 256 - so the bottleneck must be in the NIC.
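One way to confirm this, rather than just infer it, is to peek at the NIC's extended statistics before and after a run. This is a sketch, as the exact counter names depend on the DPDK driver, but for the igb/e1000 family I would expect an rx missed counter (something like rx_missed_errors) to climb whenever the NIC runs out of free RX descriptors:

vpp# clear hardware-interfaces
vpp# show hardware GigabitEthernet1/0/0 detail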
Remember above, when I showed the show hardware
command output? There’s a clue in there. The
Fitlet2 has two onboard i211 NICs and one i210 NIC on the FACET card. Despite the lower number,
the i210 is a bit more advanced
[datasheet]. If I reverse the
direction of flow (so receiving on the i210 Gi3/0/0, and transmitting on the i211 Gi1/0/0), things
look a fair bit better:
vpp# show run
...
Thread 1 vpp_wk_0 (lcore 1)
Time 12.6, 10 sec internal node vector rate 4.02 loops/sec 853956.73
vector rates in 1.4799e6, out 1.4799e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
GigabitEthernet1/0/0-output active 4642964 18652932 0 9.34e1 4.02
GigabitEthernet1/0/0-tx active 4642964 18652420 0 1.73e2 4.02
dpdk-input polling 12200880 18652933 0 3.27e2 1.53
ethernet-input active 4642965 18652933 0 1.54e2 4.02
l2-input active 4642964 18652933 0 1.04e2 4.02
l2-output active 4642964 18652933 0 1.01e2 4.02
Hey, would you look at that! The line up top here shows vector rates of in 1.4799e6 (which is 1.48Mpps) and outbound is the same number. And in this configuration as well, the DPDK node isn’t even reading that many packets, and the graph traversal is on average with 4.02 packets per run, which means that this CPU can do in excess of 1.48Mpps on one (1) CPU thread. Slick!
So what is the maximum throughput per CPU thread? To show this, I will saturate both ports with line rate traffic, and see what makes it through the other side. After instructing the T-Rex to perform the following profile:
tui>start -f stl/bench.py -m 1.48mpps -p 0 1 \
-t size=64,vm=var2
T-Rex will faithfully start to send traffic on both ports and expect the same amount back from the Fitlet2 (the Device Under Test or DUT). I can see that from T-Rex port 1->0 all traffic makes its way back, but from port 0->1 there is a little bit of loss (for the 1.48Mpps sent, only 1.43Mpps is returned). This is the same phenomenon that I explained above – the i211 NIC is not quite as good at eating packets as the i210 NIC is.
Even when doing this though, the (still) single threaded VPP is keeping up just fine, CPU wise:
vpp# show run
...
Thread 1 vpp_wk_0 (lcore 1)
Time 13.4, 10 sec internal node vector rate 13.59 loops/sec 122820.33
vector rates in 2.9599e6, out 2.8834e6, drop 0.0000e0, punt 0.0000e0
Name State Calls Vectors Suspends Clocks Vectors/Call
GigabitEthernet1/0/0-output active 1822674 19826616 0 3.69e1 10.88
GigabitEthernet1/0/0-tx active 1822674 19597360 0 1.51e2 10.75
GigabitEthernet3/0/0-output active 1823770 19826612 0 4.79e1 10.87
GigabitEthernet3/0/0-tx active 1823770 19029508 0 1.56e2 10.43
dpdk-input polling 1827320 39653228 0 1.62e2 21.70
ethernet-input active 3646444 39653228 0 7.67e1 10.87
l2-input active 1825356 39653228 0 4.96e1 21.72
l2-output active 1825356 39653228 0 4.58e1 21.72
Here we can see 2.96Mpps received (vector rates in) while only 2.88Mpps are transmitted (vector rates out). First off, this lines up perfectly with the reporting of T-Rex in the screenshot above, and it also shows that one direction loses more packets than the other. We’re dropping some 80kpps, but where did they go? Looking at the statistics counters, which include any packets which had errors in processing, we learn more:
vpp# show err
Count Node Reason Severity
3109141488 l2-output L2 output packets error
3109141488 l2-input L2 input packets error
9936649 GigabitEthernet1/0/0-tx Tx packet drops (dpdk tx failure) error
32120469 GigabitEthernet3/0/0-tx Tx packet drops (dpdk tx failure) error
Aha! From previous experience I know that when DPDK signals packet drops due to 'tx failure', this is often because it is handing packets off to the NIC, which has a ring of TX descriptors to hold them while the hardware transmits them onto the wire, and this NIC has run out of slots, which means the packet has to be dropped and a kitten gets hurt. But, I can raise the number of RX and TX descriptors by setting them in VPP's startup.conf file:
dpdk {
  dev default {
    num-rx-desc 512   ## default
    num-tx-desc 1024
  }
  no-multi-seg
}
And with that simple tweak, I’ve succeeded in configuring the Fitlet2 in a way that it is capable of receiving and transmitting 64 byte packets in both directions at (almost) line rate, with one CPU thread.
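After restarting VPP with this change, I would expect show hardware to reflect the new descriptor count, along the lines of the following (abridged, and assuming the same i210 port as before):

vpp# show hardware GigabitEthernet3/0/0
...
  rx: queues 1 (max 4), desc 512 (min 32 max 4096 align 8)
  tx: queues 2 (max 4), desc 1024 (min 32 max 4096 align 8)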
Method 2: Rampup using trex-loadtest.py
For this test, I decide to put the Fitlet2 into L3 mode (up until now it was set up in L2 Cross
Connect mode). To do this, I give the interfaces an IPv4 address and set a route for the loadtest
traffic (which will be coming from 16.0.0.0/8
and going to 48.0.0.0/8
). I will once again look
to vppcfg
to do this, because manipulating the YAML files like this allows me to easily and reliably
swap back and forth, letting vppcfg
do the mundane chore of figuring out what commands to type, in
which order, safely.
From my existing L2XC dataplane configuration, I switch to L3 like so:
pim@fitlet:~$ cat << EOF > l3.yaml
interfaces:
  GigabitEthernet1/0/0:
    mtu: 1500
    lcp: e1-0-0
    addresses: [ 100.64.10.1/30 ]
  GigabitEthernet3/0/0:
    mtu: 1500
    lcp: e3-0-0
    addresses: [ 100.64.10.5/30 ]
EOF
pim@fitlet:~$ vppcfg plan -c l3.yaml
[INFO ] root.main: Loading configfile l3.yaml
[INFO ] vppcfg.config.valid_config: Configuration validated successfully
[INFO ] root.main: Configuration is valid
[INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134
comment { vppcfg prune: 2 CLI statement(s) follow }
set interface l3 GigabitEthernet1/0/0
set interface l3 GigabitEthernet3/0/0
comment { vppcfg create: 2 CLI statement(s) follow }
lcp create GigabitEthernet1/0/0 host-if e1-0-0
lcp create GigabitEthernet3/0/0 host-if e3-0-0
comment { vppcfg sync: 2 CLI statement(s) follow }
set interface ip address GigabitEthernet1/0/0 100.64.10.1/30
set interface ip address GigabitEthernet3/0/0 100.64.10.5/30
[INFO ] vppcfg.reconciler.write: Wrote 9 lines to (stdout)
[INFO ] root.main: Planning succeeded
One small note – vppcfg
cannot set routes, and this is by design as the Linux Control Plane is
meant to take care of that. I can either set routes using ip
in the dataplane
network namespace,
like so:
pim@fitlet:~$ sudo nsenter --net=/var/run/netns/dataplane
root@fitlet:/home/pim# ip route add 16.0.0.0/8 via 100.64.10.2
root@fitlet:/home/pim# ip route add 48.0.0.0/8 via 100.64.10.6
Or, alternatively, I can set them directly on VPP in the CLI, interestingly with identical syntax:
pim@fitlet:~$ vppctl
vpp# ip route add 16.0.0.0/8 via 100.64.10.2
vpp# ip route add 48.0.0.0/8 via 100.64.10.6
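Either way, a quick way to double-check that the dataplane picked the routes up is to ask VPP for the FIB entries, which should show 100.64.10.2 and 100.64.10.6 as the respective next hops:

vpp# show ip fib 16.0.0.0/8
vpp# show ip fib 48.0.0.0/8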
The loadtester will run a bunch of profiles (1514b, imix, 64b with multiple flows, and 64b with only one flow), either in unidirectional or bidirectional mode, which gives me a wealth of data to share:
| Loadtest       | 1514b         | imix         | Multi 64b     | Single 64b    |
|----------------|---------------|--------------|---------------|---------------|
| Bidirectional  | 81.7k (100%)  | 327k (100%)  | 1.48M (100%)  | 1.43M (98.8%) |
| Unidirectional | 73.2k (89.6%) | 255k (78.2%) | 1.18M (79.4%) | 1.23M (82.7%) |
Caveats
While all results of the loadtests are navigable [here], I will cherrypick one interesting bundle showing the results of all (bi- and unidirectional) tests:
I have to admit I was a bit stumped by the unidirectional loadtests - these push traffic into the i211 (onboard RJ45) NIC, and out of the i210 (FACET SFP) NIC. What I find super weird (and can't really explain) is that the unidirectional results, which in the end serve half the packets/sec, are lower than the bidirectional ones, which were almost perfect, dropping only a little bit of traffic at the very end. A picture says a thousand words - so here's a graph of all the loadtests, which you can also find by clicking on the links in the table.
Appendix
Generating the data
The JSON files that are emitted by my loadtester script can be fed directly into Michal’s visualizer to plot interactive graphs (which I’ve done for the table above):
DEVICE=Fitlet2

## Loadtest
SERVER=${SERVER:=hvn0.lab.ipng.ch}
TARGET=${TARGET:=l3}
RATE=${RATE:=10} ## % of line
DURATION=${DURATION:=600}
OFFSET=${OFFSET:=10}
PROFILE=${PROFILE:="ipng"}

for DIR in unidirectional bidirectional; do
  for SIZE in 1514 imix 64; do
    FLAGS=""
    [ $DIR == "unidirectional" ] && FLAGS="-u "
    ## Multiple Flows
    ./trex-loadtest -s ${SERVER} ${FLAGS} -p ${PROFILE}.py -t "offset=${OFFSET},vm=var2,size=${SIZE}" \
      -rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-var2-${SIZE}-${DIR}.json
    [ "$SIZE" == "64" ] && {
      ## Specialcase: Single Flow
      ./trex-loadtest -s ${SERVER} ${FLAGS} -p ${PROFILE}.py -t "offset=${OFFSET},size=${SIZE}" \
        -rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-${SIZE}-${DIR}.json
    }
  done
done

## Graphs
ruby graph.rb -t "${DEVICE} All Loadtests" ${DEVICE}*.json -o ${DEVICE}.html
ruby graph.rb -t "${DEVICE} Unidirectional Loadtests" ${DEVICE}*unidir*.json \
  -o ${DEVICE}.unidirectional.html
ruby graph.rb -t "${DEVICE} Bidirectional Loadtests" ${DEVICE}*bidir*.json \
  -o ${DEVICE}.bidirectional.html

for i in ${PROFILE}-var2-1514 ${PROFILE}-var2-imix ${PROFILE}-var2-64 ${PROFILE}-64; do
  ruby graph.rb -t "${DEVICE} Unidirectional Loadtests" ${DEVICE}*-${i}*unidirectional.json \
    -o ${DEVICE}.$i-unidirectional.html
  ruby graph.rb -t "${DEVICE} Bidirectional Loadtests" ${DEVICE}*-${i}*bidirectional.json \
    -o ${DEVICE}.$i-bidirectional.html
done