Introduction

I’m still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic, and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it’s a bit more convenient to connect them all to a switch. However, for this to work I would need (at least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real quick.

Or do they?

Hardware

SN2700

I thought I’d ask the #nlnog IRC channel for advice, and of course the usual suspects came past, such as Juniper, Arista, and Cisco. But somebody mentioned “How about Mellanox, like SN2700?” and I remembered my buddy Eric was a fan of those switches. I looked them up on the refurbished market and I found one for EUR 1’400,- for 32x100G which felt suspiciously low priced… but I thought YOLO and I ordered it. It arrived a few days later via UPS from Denmark to Switzerland.

The switch specs are pretty impressive, with 32x100G QSFP28 ports, each of which can be broken out into a set of sub-ports (of 1/10/25/50G), with a specified switch throughput of 6.4Tbps at 4.76Gpps, while consuming only ~150W all-up.

Further digging revealed that the architecture of this switch consists of two main parts:

SN2700

  1. an AMD64 component with an mSATA disk to boot from, two e1000 network cards, and a single USB and RJ45 serial port with standard pinout. It has a PCIe connection to a switch board in the front of the chassis; furthermore, it’s equipped with 8GB of RAM in an SO-DIMM, and its CPU is a two-core Celeron(R) 1047UE @ 1.40GHz.

  2. the silicon used in this switch is called Spectrum, and it identifies itself in Linux as PCI device 03:00.0, Mellanox Technologies MT52100. The front dataplane with its 32x100G ports is thus separate from the Linux-based controlplane.

SN2700

When turning on the device, the serial port comes to life and shows me a BIOS, after which it quickly jumps into GRUB2 and wants me to install an operating system using something called ONIE. I’ve heard of that, but now it’s time for me to learn a little bit more about it. I ask around and there are plenty of ONIE images to be found for this particular type of chip - some are open source, some are semi-open source (as in: were once available but are now behind paywalls, etc).

Before messing around with the switch and possibly locking myself out or bricking it, I take out the 16GB mSATA and make a copy of it for safekeeping. Doing so makes me feel somewhat invincible: how badly could I mess up this switch, if I can just copy back a bitwise backup of the 16GB mSATA? I’m about to find out, so read on!
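
For reference, taking that bitwise backup is a one-liner with dd. A minimal sketch, assuming the mSATA shows up as /dev/sdb in a USB adapter (the device name is an assumption, so double-check with lsblk first):

pim@summer:~$ lsblk -o NAME,SIZE,MODEL        # confirm which device is the 16GB mSATA
pim@summer:~$ sudo dd if=/dev/sdb of=sn2700-msata.img bs=1M status=progress conv=fsync
pim@summer:~$ sha256sum sn2700-msata.img      # so I can verify the image before restoring it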

Software

The Mellanox SN2700 switch is an ONIE (Open Network Install Environment) based platform that supports a multitude of operating systems, as well as utilizing the advantages of Open Ethernet and the capabilities of the Mellanox Spectrum® ASIC. The SN2700 has three modes of operation:

  • Preinstalled with Mellanox Onyx (successor to MLNX-OS Ethernet), a home-grown operating system utilizing common networking user experiences and industry standard CLI.
  • Preinstalled with Cumulus Linux, a revolutionary operating system taking the Linux user experience from servers to switches and providing a rich routing functionality for large scale applications.
  • Provided with a bare ONIE image ready to be installed with the aforementioned or other ONIE-based operating systems.

I asked around a bit more and found that there’s a few more things one might do with this switch. One of them is [SONiC], which stands for Software for Open Networking in the Cloud, and has support for the Spectrum and notably the SN2700 switch. Cool!

I also learned about [DENT], which utilizes the Linux Kernel, Switchdev, and other Linux based projects as the basis for building a new standardized network operating system without abstractions or overhead. Unfortunately, while the Spectrum chipset is known to DENT, this particular layout on SN2700 is not supported.

Finally, my buddy fall0ut said “why not just Debian with switchdev?” and my eyes opened wide. I had not yet come across [switchdev], a standard Linux kernel driver model for switch devices which offload the forwarding (data)plane from the kernel. As it turns out, Mellanox did a really good job writing a switchdev implementation in the [linux kernel] for the Spectrum series of silicon, and it’s all upstreamed. Wait, what?!

Mellanox Switchdev

I start by reading the [brochure], which shows me the intentions Mellanox had when designing and marketing these switches. It seems that they really meant it when they said this thing is a fully customizable Linux switch, check out this paragraph:

Once the Mellanox Switchdev driver is loaded into the Linux Kernel, each of the switch’s physical ports is registered as a net_device within the kernel. Using standard Linux tools (for example, bridge, tc, iproute), ports can be bridged, bonded, tunneled, divided into VLANs, configured for L3 routing and more. Linux switching and routing tables are reflected in the switch hardware. Network traffic is then handled directly by the switch. Standard Linux networking applications can be natively deployed and run on switchdev. This may include open source routing protocol stacks, such as Quagga, Bird and XORP, OpenFlow applications, or user-specific implementations.

Installing Debian on SN2700

.. they had me at Bird :) so off I go, to install a vanilla Debian AMD64 Bookworm on a 120G mSATA I had lying around. After installing it, I notice that the coveted mlxsw driver is not shipped with Debian’s default kernel image, so I decide to build my own kernel, letting the [Debian docs] take my hand and guide me through it.

I find a reference on the Mellanox [GitHub wiki] which shows me which kernel modules to include to successfully use the Spectrum under Linux, so I think I know what to do:

pim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \
  libncurses5-dev libelf-dev libssl-dev dwarves bison
pim@summer:/usr/src$ sudo apt install linux-source-6.1                                   
pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz                                      
pim@summer:/usr/src$ cd linux-source-6.1/
pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-12-amd64 .config
pim@summer:/usr/src/linux-source-6.1$ cat << EOF | sudo tee -a .config
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE_DEMUX=m
CONFIG_NET_IPGRE=m
CONFIG_IPV6_GRE=m
CONFIG_IP_MROUTE_MULTIPLE_TABLES=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IPV6_MULTIPLE_TABLES=y
CONFIG_BRIDGE=m
CONFIG_VLAN_8021Q=m
CONFIG_BRIDGE_VLAN_FILTERING=y
CONFIG_BRIDGE_IGMP_SNOOPING=y
CONFIG_NET_SWITCHDEV=y
CONFIG_NET_DEVLINK=y
CONFIG_MLXFW=m
CONFIG_MLXSW_CORE=m
CONFIG_MLXSW_CORE_HWMON=y
CONFIG_MLXSW_CORE_THERMAL=y
CONFIG_MLXSW_PCI=m
CONFIG_MLXSW_I2C=m
CONFIG_MLXSW_MINIMAL=y
CONFIG_MLXSW_SWITCHX2=m
CONFIG_MLXSW_SPECTRUM=m
CONFIG_MLXSW_SPECTRUM_DCB=y
CONFIG_LEDS_MLXCPLD=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_CLS=y
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_CLS_MATCHALL=m
CONFIG_NET_CLS_FLOWER=m
CONFIG_NET_ACT_GACT=m
CONFIG_NET_ACT_MIRRED=m
CONFIG_NET_ACT_SAMPLE=m
CONFIG_NET_ACT_VLAN=m
CONFIG_NET_L3_MASTER_DEV=y
CONFIG_NET_VRF=m
EOF
pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig
pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg

After appending all of those config statements to the end of the .config file, I run a gratuitous make menuconfig, which figures out how to merge my additions with what was already in the file. Since I started from the standard Bookworm 6.1 kernel config that came with the default installer, the result is a minimal diff against what Debian itself ships.

After Summer stretches her legs a bit compiling this kernel for me, look at the result:

pim@summer:/usr/src$ dpkg -c linux-image-6.1.55_6.1.55-4_amd64.deb | grep mlxsw
drwxr-xr-x root/root         0 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/
-rw-r--r-- root/root    414897 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko
-rw-r--r-- root/root     19721 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko
-rw-r--r-- root/root     31817 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko
-rw-r--r-- root/root     65161 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko
-rw-r--r-- root/root   1425065 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko

Good job, Summer! On my mSATA disk, I tell GRUB to boot the kernel with the following parameters, which make the kernel skip the spiffy interface names like enp6s0 or eno1 and instead simply enumerate the ports one by one as eth0 and so on:

pim@fafo:~$ grep GRUB_CMDLINE /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT=""
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0"

Mellanox SN2700 running Debian+Switchdev

Debian

I insert the freshly installed Debian Bookworm with the custom compiled 6.1.55+mlxsw kernel into the switch, and it boots on the first try. I see 34 (!) ethernet ports. The first two come from the Intel NIC but carry a Mellanox MAC address (starting with 0c:42:a1), while the other 32 share a common Mellanox prefix (starting with 04:3f:72). I also notice that the MAC addresses skip one between subsequent ports, which leads me to believe that these 100G ports can be split in two (perhaps 2x50G, 2x40G, 2x25G or 2x10G, which I intend to find out later). According to the official spec sheet, the switch allows 2-way breakout ports as well as converter modules, to insert for example a 25G SFP28 into a QSFP28 switchport.

Honestly, I did not think I would get this far, so I humorously (at least, I think so) decide to call this switch [FAFO].

First off, the mlxsw driver loaded:

root@fafo:~# lsmod | grep mlx
mlxsw_spectrum        708608  0
mlxsw_pci              36864  1 mlxsw_spectrum
mlxsw_core            217088  2 mlxsw_pci,mlxsw_spectrum
mlxfw                  36864  1 mlxsw_core
vxlan                 106496  1 mlxsw_spectrum
ip6_tunnel             45056  1 mlxsw_spectrum
objagg                 53248  1 mlxsw_spectrum
psample                20480  1 mlxsw_spectrum
parman                 16384  1 mlxsw_spectrum
bridge                311296  1 mlxsw_spectrum

I run sensors-detect and pwmconfig, let the fans calibrate, and write out the fan control config file. The fans come back down to a more chill (pun intended) speed, and I take a closer look. It seems all fans and all thermometers, including the ones in the QSFP28 cages and the Spectrum switch ASIC, are accounted for:

root@fafo:~# sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +30.0°C  (high = +87.0°C, crit = +105.0°C)
Core 0:        +29.0°C  (high = +87.0°C, crit = +105.0°C)
Core 1:        +30.0°C  (high = +87.0°C, crit = +105.0°C)

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +27.8°C  (crit = +106.0°C)
temp2:        +29.8°C  (crit = +106.0°C)

mlxsw-pci-0300
Adapter: PCI adapter
fan1:            6239 RPM
fan2:            5378 RPM
fan3:            6268 RPM
fan4:            5378 RPM
fan5:            6326 RPM
fan6:            5442 RPM
fan7:            6268 RPM
fan8:            5315 RPM
temp1:            +37.0°C  (highest = +41.0°C)
front panel 001:  +23.0°C  (crit = +73.0°C, emerg = +75.0°C)
front panel 002:  +24.0°C  (crit = +73.0°C, emerg = +75.0°C)
front panel 003:  +23.0°C  (crit = +73.0°C, emerg = +75.0°C)
front panel 004:  +26.0°C  (crit = +73.0°C, emerg = +75.0°C)
...

From the top, first I see the classic CPU core temps, then an ACPI interface whose purpose I’m not quite sure of (possibly the motherboard, but not the PSU, because pulling one out does not change any values). Finally, the sensors using driver mlxsw-pci-0300 are those on the switch PCB carrying the Spectrum silicon, and there’s a thermometer for each of the QSFP28 cages, possibly reading from the optic, as most of them read empty except the first four, into which I inserted optics. Slick!
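
As an aside, persisting that fan configuration is plain Debian fare. Roughly what I ran, as a sketch (sensors-detect comes from the lm-sensors package, while pwmconfig and the service that reads /etc/fancontrol come from the fancontrol package):

root@fafo:~# apt-get install lm-sensors fancontrol
root@fafo:~# sensors-detect          # probe for hwmon devices, answering the prompts
root@fafo:~# pwmconfig               # calibrate the fans, writes /etc/fancontrol
root@fafo:~# systemctl enable --now fancontrol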

Ethernet

I notice that the ports are in a bit of a weird order. Firstly, eth0-1 are the two 1G ports on the Debian machine. But then, the rest of the ports are the Mellanox Spectrum ASIC:

  • eth2-17 correspond to port 17-32, which seems normal, but
  • eth18-19 correspond to port 15-16
  • eth20-21 correspond to port 13-14
  • eth30-31 correspond to port 3-4
  • eth32-33 correspond to port 1-2

The switchports are actually sequentially numbered with respect to MAC addresses, with eth2 starting at 04:3f:72:74:a9:41 and finally eth33 having 04:3f:72:74:a9:7f (out of a block of 64 consecutive MACs, two per port).

Somehow though, the ports are wired in a different way on the front panel. As it turns out, I can insert a little udev ruleset that will take care of this:

root@fafo:~# cat << EOF > /etc/udev/rules.d/10-local.rules 
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="mlxsw_spectrum*", \
    NAME="sw$attr{phys_port_name}"
EOF

After rebooting the switch, the ports are now called swp1 .. swp32 and they also correspond with their physical ports on the front panel. One way to check this, is using ethtool --identify swp1 which will blink the LED of port 1, until I press ^C. Nice.
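
The rule keys off the phys_port_name attribute that mlxsw exposes for each front panel port (p1, p2, and so on, with p1s0-style names for split ports), which can be checked directly:

root@fafo:~# cat /sys/class/net/swp1/phys_port_name
p1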

Debian SN2700: Diagnostics

The first thing I’m curious to try is whether Link Layer Discovery Protocol [LLDP] works. This is a vendor-neutral protocol that network devices use to advertise their identity to peers over Ethernet. I install an open source LLDP daemon and plug in a DAC from port 1 to a Centec switch in the lab.

And indeed, quickly after that, I see two devices, the first on the Linux machine eth0 which is the Unifi switch that has my LAN, and the second is the Centec behind swp1:

root@fafo:~# apt-get install lldpd
root@fafo:~# lldpcli show nei summary
-------------------------------------------------------------------------------
LLDP neighbors:
-------------------------------------------------------------------------------
Interface:    eth0, via: LLDP
  Chassis:     
    ChassisID:    mac 44:d9:e7:05:ff:46
    SysName:      usw6-BasementServerroom
  Port:        
    PortID:       local Port 9
    PortDescr:    fafo.lab
    TTL:          120
Interface:    swp1, via: LLDP
  Chassis:     
    ChassisID:    mac 60:76:23:00:01:ea
    SysName:      sw3.lab
  Port:        
    PortID:       ifname eth-0-25
    PortDescr:    eth-0-25
    TTL:          120

With this I learn that the switch forwards these datagrams (ethernet type 0x88CC) from the dataplane to the Linux controlplane. I would call this punting in VPP language, but switchdev calls it trapping, and I can see the LLDP packets when tcpdumping on ethernet device swp1. So today I learned how to trap packets :-)
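
Seeing those trapped packets is as simple as pointing tcpdump at the port netdev and filtering on the LLDP ethertype, for example:

root@fafo:~# tcpdump -e -n -i swp1 ether proto 0x88cc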

Debian SN2700: ethtool

One popular diagnostics tool that is useful (and hopefully well known, because it’s awesome) is ethtool, a command-line tool in Linux for managing network interface devices. It allows me to modify the parameters of the ports and their transceivers, as well as query information about those devices.

Here are a few common examples, all of which work on this switch running Debian:

  • ethtool swp1: Shows link capabilities (eg, 1G/10G/25G/40G/100G)
  • ethtool -s swp1 speed 40000 duplex full autoneg off: Force speed/duplex
  • ethtool -m swp1: Shows transceiver diagnostics like SFP+ light levels, link levels (also --module-info)
  • ethtool -p swp1: Flashes the transceiver port LED (also --identify)
  • ethtool -S swp1: Shows packet and octet counters, and sizes, discards, errors, and so on (also --statistics)

I specifically love the digital diagnostics monitoring (DDM), originally specified in [SFF-8472], which allows me to read the EEPROM of optical transceivers and get all sorts of critical diagnostics. I wish DPDK and VPP had that!
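
For example, pulling just the optical diagnostics out of the module EEPROM might look like this (the exact field names vary per transceiver, so the grep pattern is merely an illustration):

root@fafo:~# ethtool -m swp1 | grep -Ei 'temperature|voltage|power'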

In reading up on the switchdev ecosystem, I stumbled across devlink, an API to expose device information and resources not directly related to any device class, such as switch ASIC configuration. As a fun fact, devlink was written by the same engineer who wrote the mlxsw driver for Linux, Jiří Pírko. Its documentation can be found in the [linux kernel], and it ships with any modern iproute2 distribution. The specific (somewhat terse) documentation of the mlxsw driver [lives there] as well.

There’s a lot to explore here, but I’ll focus my attention on three things:

When learning that the switch also does IPv4 and IPv6 routing, I immediately thought: how many prefixes can be offloaded to the ASIC? One way to find out is to query what types of resources it has:

root@fafo:~# devlink resource show pci/0000:03:00.0 
pci/0000:03:00.0:
  name kvd size 258048 unit entry dpipe_tables none
    resources:
      name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none
        resources:
          name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none
          name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none
          name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none
      name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none
      name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none
  name span_agents size 3 occ 0 unit entry dpipe_tables none
  name counters size 32000 occ 4 unit entry dpipe_tables none
    resources:
      name rif size 8192 occ 0 unit entry dpipe_tables none
      name flow size 23808 occ 4 unit entry dpipe_tables none
  name global_policers size 1000 unit entry dpipe_tables none
    resources:
      name single_rate_policers size 968 occ 0 unit entry dpipe_tables none
  name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none
  name rifs size 1000 occ 1 unit entry dpipe_tables none
  name physical_ports size 64 occ 36 unit entry dpipe_tables none

There’s a lot to unpack here, but this is a tree of resources, each with names and children. Let me focus on the first one, called kvd, which stands for Key Value Database (in other words, a set of lookup tables). It contains a bunch of children called linear, hash_double and hash_single. The kernel [docs] explain it in more detail, but this is where the switch keeps its FIB in Content Addressable Memory (CAM), holding elements of certain types, lengths and counts. All up, the size is 252K entries, which is not huge, but also certainly not tiny!

Here I learn that it’s subdivided into:

  • linear: 96K entries of flat memory using an index, further divided into regions:
    • singles: 16K entries of size 1, for nexthops
    • chunks: 48K entries of size 32, for multipath routes with <32 entries
    • large_chunks: 32K entries of size 512, for multipath routes with <512 entries
  • hash_single: 92K entries of hash table for keys smaller than 64 bits (eg. L2 FIB, IPv4 FIB and neighbors)
  • hash_double: 63K entries of hash table for keys larger than 64 bits (eg. IPv6 FIB and neighbors)
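
If a different balance is needed, say more IPv4 host entries at the expense of IPv6, the devlink resource API allows resizing these partitions, followed by a driver reload to apply them. A sketch with illustrative sizes, which must respect the size_gran granularity and the overall kvd size shown above:

root@fafo:~# devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 126976
root@fafo:~# devlink resource set pci/0000:03:00.0 path /kvd/hash_double size 32768
root@fafo:~# devlink dev reload pci/0000:03:00.0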

Now that I know the memory layout and regions of the CAM, I can start making some guesses about the FIB size. The devlink pipeline debug API (DPIPE) is aimed at providing the user visibility into the ASIC’s pipeline in a generic way. The API is described in detail in the [kernel docs], so I feel free to take a peek at the dataplane configuration innards:

root@fafo:~# devlink dpipe table show pci/0000:03:00.0
pci/0000:03:00.0:
  name mlxsw_erif size 1000 counters_enabled false
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
    action:
      type field_modify header mlxsw_meta field l3_forward
      type field_modify header mlxsw_meta field l3_drop
  name mlxsw_host4 size 0 counters_enabled false resource_path /kvd/hash_single resource_units 1
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
      type field_exact header ipv4 field destination ip
    action:
      type field_modify header ethernet field destination mac
  name mlxsw_host6 size 0 counters_enabled false resource_path /kvd/hash_double resource_units 2
    match:
      type field_exact header mlxsw_meta field erif_port mapping ifindex
      type field_exact header ipv6 field destination ip
    action:
      type field_modify header ethernet field destination mac
  name mlxsw_adj size 0 counters_enabled false resource_path /kvd/linear resource_units 1
    match:
      type field_exact header mlxsw_meta field adj_index
      type field_exact header mlxsw_meta field adj_size
      type field_exact header mlxsw_meta field adj_hash_index
    action:
      type field_modify header ethernet field destination mac
      type field_modify header mlxsw_meta field erif_port mapping ifindex

From this I can puzzle together how the CAM is actually used:

  • mlxsw_host4: matches on the interface port and IPv4 destination IP, using hash_single above with one unit for each entry, and when looking that up, puts the result into the ethernet destination MAC (in other words, the FIB entry points at an L2 nexthop!)
  • mlxsw_host6: matches on the interface port and IPv6 destination IP using hash_double with two units for each entry.
  • mlxsw_adj: holds the L2 adjacencies, and the lookup key is an index, size and hash index, where the returned value is used to rewrite the destination MAC and select the egress port!

Now that I know the types of tables and what they are matching on (and which action they then perform), I can also take a look at the actual data in the FIB. For example, if I create an IPv4 interface on the switch and ping a host on the directly connected network there, I can see an entry show up in the IPv4 host table, like so:

root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ping 100.65.1.2
root@fafo:~# devlink dpipe table dump pci/0000:03:00.0  name mlxsw_host4
pci/0000:03:00.0:
  index 0
    match_value:
      type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 71 value 1
      type field_exact header ipv4 field destination ip value 100.65.1.2
    action_value:
      type field_modify header ethernet field destination mac value b4:96:91:b3:b1:10

To decipher what the switch is doing: if the ifindex is 71 (which corresponds to swp31), and the IPv4 destination address is 100.65.1.2, then the destination MAC address will be set to b4:96:91:b3:b1:10, so the switch knows where to send this ethernet datagram.

And now I have found what I need to know to be able to answer the question of the FIB size. This switch can take 92K IPv4 routes and 31.5K IPv6 routes, and I can even inspect the FIB in great detail. Rock on!
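
As an aside, dpipe can also keep per-table hit counters, which devlink toggles at runtime; handy for checking whether a given lookup stage is actually being exercised. A sketch:

root@fafo:~# devlink dpipe table set pci/0000:03:00.0 name mlxsw_host4 counters enable
root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4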

But reading the switch chip configuration and FIB is not all that devlink can do: it can also make changes! One particularly interesting one is the ability to split and unsplit ports. What this means is that a 100Gbit port is internally divided into four so-called lanes of 25Gbit each, while a 40Gbit port is internally divided into four lanes of 10Gbit each. Splitting a port is the act of taking such a port and reconfiguring its lanes.

Let me show you, by means of an example, what splitting the first two switchports might look like. They begin their life as 100G ports, which support a number of link speeds, notably 100G, 50G and 25G, but also 40G, 10G, and finally 1G:

root@fafo:~# ethtool swp1
Settings for swp1:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                40000baseCR4/Full
                                40000baseSR4/Full
                                40000baseLR4/Full
                                25000baseCR/Full
                                25000baseSR/Full
                                50000baseCR2/Full
                                100000baseSR4/Full
                                100000baseCR4/Full
                                100000baseLR4_ER4/Full

root@fafo:~# devlink port show | grep 'swp[12] '
pci/0000:03:00.0/61: type eth netdev swp1 flavour physical port 1 splittable true lanes 4
pci/0000:03:00.0/63: type eth netdev swp2 flavour physical port 2 splittable true lanes 4
root@fafo:~# devlink port split pci/0000:03:00.0/61 count 4
[  629.593819] mlxsw_spectrum 0000:03:00.0 swp1: link down
[  629.722731] mlxsw_spectrum 0000:03:00.0 swp2: link down
[  630.049709] mlxsw_spectrum 0000:03:00.0: EMAD retries (1/5) (tid=64b1a5870000c726)
[  630.092179] mlxsw_spectrum 0000:03:00.0 swp1s0: renamed from eth2
[  630.148860] mlxsw_spectrum 0000:03:00.0 swp1s1: renamed from eth2
[  630.375401] mlxsw_spectrum 0000:03:00.0 swp1s2: renamed from eth2
[  630.375401] mlxsw_spectrum 0000:03:00.0 swp1s3: renamed from eth2

root@fafo:~# ethtool swp1s0
Settings for swp1s0:
        Supported ports: [ FIBRE ]
        Supported link modes:   1000baseKX/Full
                                10000baseKR/Full
                                25000baseCR/Full
                                25000baseSR/Full

Whoa, what just happened here? The switch took the port defined by pci/0000:03:00.0/61 which says it is splittable and has four lanes, and split it into four NEW ports called swp1s0-swp1s3, and the resulting ports are 25G, 10G or 1G.

Warning

However, I make an important observation. When splitting swp1 in four, the switch also removed port swp2. Remember that at the beginning of this article I mentioned that the MAC addresses seemed to skip one entry between subsequent interfaces? Now I understand why: when splitting a port in two, it will use the second MAC address for the second 50G port; but if I split it into four, it’ll use the MAC addresses from the adjacent port and decommission that port. In other words: this switch can do 32x100G, or 64x50G, or 64x25G/10G/1G.

It doesn’t matter which of the PCI interfaces I split on. The operation is also reversible: I can issue devlink port unsplit to return the port to its aggregate state (eg. 4 lanes and 100Gbit), which will remove the swp1s0-3 ports and put back swp1 and swp2 again.
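
Undoing the split is a single command against the devlink port handle of a split port; the handle below is illustrative, so check devlink port show for the current ones first:

root@fafo:~# devlink port show | grep swp1s
root@fafo:~# devlink port unsplit pci/0000:03:00.0/61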

What I find particularly impressive about this, is that for most hardware vendors, this splitting of ports requires a reboot of the chassis, while here it can happen entirely online. Well done, Mellanox!

Performance

OK, so this all seems to work, but does it work well? If you’re a reader of my blog you’ll know that I love doing loadtests, so I boot my machine, Hippo, and I connect it with two 100G DACs to the switch on ports 31 and 32:

[    1.354802] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[    1.447677] ice 0000:0c:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg
[    1.561979] ice 0000:0c:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link)
[    7.738198] ice 0000:0c:00.0 enp12s0f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
  Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None
[    7.802572] ice 0000:0c:00.1 enp12s0f1: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC,
  Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None

I hope you’re hungry, Hippo, cuz you’re about to get fed!

Debian SN2700: L2

To use the switch in L2 mode, I intuitively create a Linux bridge, say br0, and add ports to it. From the Mellanox documentation I learn that there can be multiple bridges, each isolated from one another, but only one such bridge can have vlan_filtering set. VLAN filtering makes the switch accept tagged frames only from a list of configured VLANs, and drop the rest. This is what you’d imagine a regular commercial switch would provide.

So off I go, creating the bridge to which I’ll add two ports (HundredGigabitEthernet ports swp31 and swp32), and I will allow for the maximum MTU size of 9216, also known as [Jumbo Frames]:

root@fafo:~# ip link add name br0 type bridge
root@fafo:~# ip link set br0 type bridge vlan_filtering 1 mtu 9216 up
root@fafo:~# ip link set swp31 mtu 9216 master br0 up
root@fafo:~# ip link set swp32 mtu 9216 master br0 up

These two ports are now access ports, that is to say they accept and emit only untagged traffic, and due to the vlan_filtering flag, they will drop all other frames. Using the standard bridge utility from Linux, I can manipulate the VLANs on these ports.

First, I’ll remove the default VLAN and add VLAN 1234 to both ports, specifying that VLAN 1234 is the so-called Port VLAN ID (pvid). This makes them the equivalent of Cisco’s switchport access vlan 1234:

root@fafo:~# bridge vlan del vid 1 dev swp31
root@fafo:~# bridge vlan del vid 1 dev swp32
root@fafo:~# bridge vlan add vid 1234 dev swp31 pvid
root@fafo:~# bridge vlan add vid 1234 dev swp32 pvid

Then, I’ll add a few tagged VLANs to the ports, so that they become the Cisco equivalent of a trunk port allowing these tagged VLANs and assuming untagged traffic is still VLAN 1234:

root@fafo:~# for port in swp31 swp32; do for vlan in 100 200 300 400; do \
               bridge vlan add vid $vlan dev $port; done; done
root@fafo:~# bridge vlan
port              vlan-id  
swp31             100 200 300 400
                  1234 PVID
swp32             100 200 300 400
                  1234 PVID
br0               1 PVID Egress Untagged

When these commands are run against the swp* interfaces, they are picked up by the mlxsw kernel driver and transmitted to the Spectrum switch chip; in other words, these commands end up programming the silicon. Traffic through the switch ports on the front rarely (if ever) gets forwarded to the Linux kernel: very similar to [VPP], the traffic stays mostly in the dataplane. Some traffic, such as LLDP (and, as we’ll see later, IPv4 ARP and IPv6 neighbor discovery), will be forwarded from the switch chip over the PCIe link to the kernel, after which the results are transmitted back via PCIe to program the switch chip’s L2/L3 Forwarding Information Base (FIB).
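
One way to convince yourself that the silicon, and not the kernel, is doing the forwarding is to look at the bridge’s forwarding database: MAC addresses learned in hardware carry the offload flag. A quick check might look like this:

root@fafo:~# bridge -d fdb show br br0 | grep offload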

Now I turn my attention to the loadtest, by configuring T-Rex in L2 Stateless mode. I start a bidirectional loadtest with 256b packets at 50% of line rate, which looks just fine:

Trex L2

At this point I can already conclude that this is all happening in the dataplane. The Spectrum ASIC is connected to the Debian machine over a relatively narrow PCIe link (limited to 16 Gb/s, as the dmesg below shows), so the Debian kernel is in no way able to process more than a token amount of traffic; and yet I’m seeing 100Gbit go through the switch chip while the CPU load on the kernel stays pretty much zero. I can however retrieve the link statistics using ip stats, and those show me the actual counters of the silicon, not just the trapped packets. If you’ll recall, in VPP the only packets that the TAP interfaces see are those that are punted, and the Linux kernel there is completely oblivious to the total dataplane throughput. Here, the interface shows the correct dataplane packet and byte counters, which means that things like SNMP will automatically just do the right thing.

root@fafo:~# dmesg | grep 03:00.*bandwidth
[    2.180410] pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s
  PCIe x4 link at 0000:00:01.2 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link)

root@fafo:~# uptime
 03:19:16 up 2 days, 14:14,  1 user,  load average: 0.00, 0.00, 0.00

root@fafo:~# ip stats show dev swp32 group link
72: swp32: group link
    RX:      bytes      packets errors dropped  missed   mcast           
     5106713943502  15175926564      0       0       0     103 
    TX:      bytes      packets errors dropped carrier collsns           
    23464859508367 103495791750      0       0       0       0 

Debian SN2700: IPv4 and IPv6

I now take a look at the L3 capabilities of the switch. To do this, I simply destroy the bridge br0, which releases the enslaved switchports. I then convert the T-Rex loadtester to use an L3 profile, and configure the switch as follows:

root@fafo:~# ip addr add 100.65.1.1/30 dev swp31
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev swp31
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 dev swp31

root@fafo:~# ip addr add 100.65.2.1/30 dev swp32
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev swp32
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 dev swp32

Several other routers I’ve loadtested have the same (cosmetic) issue: T-Rex doesn’t reply to ARP packets after the first few seconds. So I first set the IPv4 address, then add a static L2 adjacency for the T-Rex side (on MAC b4:96:91:b3:b1:10), and route 16.0.0.0/8 to port 0 and 48.0.0.0/8 to port 1 of the loadtester.

Trex L3

I start a stateless L3 loadtest with 192 byte packets in both directions, and the switch keeps up just fine. Taking a closer look at the ip stats instrumentation, I see that there’s the ability to turn on L3 counters in addition to the L2 (ethernet) counters. So I do that on my two router ports while they are happily forwarding 58.9Mpps, and I can now see the difference between what is forwarded in hardware by the dataplane and what is handled by the CPU:

root@fafo:~# ip stats set dev swp31 l3_stats on
root@fafo:~# ip stats set dev swp32 l3_stats on
root@fafo:~# ip stats show dev swp32 group offload subgroup l3_stats
72: swp32: group offload subgroup l3_stats on used on
    RX:       bytes       packets errors dropped   mcast
    270222574848200 1137559577576      0       0       0 
    TX:       bytes       packets errors dropped
    281073635911430 1196677185749      0       0 
root@fafo:~# ip stats show dev swp32 group offload subgroup cpu_hit
72: swp32: group offload subgroup cpu_hit
    RX:  bytes packets errors dropped  missed   mcast           
       1068742   17810      0       0       0       0 
    TX:  bytes packets errors dropped carrier collsns           
        468546    2191      0       0       0       0 

The statistics above clearly demonstrate that the lion’s share of the packets have been forwarded by the ASIC, and only a few (notably things like IPv6 neighbor discovery, IPv4 ARP, LLDP, and of course any traffic to the IP addresses configured on the router) will go to the kernel.

Debian SN2700: BVI (or VLAN Interfaces)

I’ve played around a little bit with L2 (switch) and L3 (router) ports, but there is a middle ground. I’ll keep the T-Rex loadtest running in L3 mode, but now I’ll reconfigure the switch to put the ports back into the bridge, each port in its own VLAN, and add so-called Bridge Virtual Interfaces, also known as VLAN interfaces: this is where the switch has a bunch of ports together in a VLAN, but the switch itself also has an IPv4 or IPv6 address in that VLAN, so it can act as a router.

I reconfigure the switch to put the interfaces back into VLAN 1000 and 2000 respectively, and move the IPv4 addresses and routes there – so here I go, first putting the switch interfaces back into L2 mode and adding them to the bridge, each in their own VLAN, by making them access ports:

root@fafo:~# ip link add name br0 type bridge vlan_filtering 1
root@fafo:~# ip link set br0 address 04:3f:72:74:a9:7d mtu 9216 up
root@fafo:~# ip link set swp31 master br0 mtu 9216 up
root@fafo:~# ip link set swp32 master br0 mtu 9216 up
root@fafo:~# bridge vlan del vid 1 dev swp31
root@fafo:~# bridge vlan del vid 1 dev swp32
root@fafo:~# bridge vlan add vid 1000 dev swp31 pvid
root@fafo:~# bridge vlan add vid 2000 dev swp32 pvid

From the ASIC specs, I understand that these BVIs need to (re)use a MAC address from one of the members, so the first thing I do is give br0 the right MAC address. Then I put the switch ports into the bridge, remove VLAN 1, and put them in their respective VLANs. At this point, the loadtester reports 100% packet loss, because the two ports can no longer see each other at layer 2, and the layer 3 configs have been removed. But I can restore connectivity with two BVIs as follows:

root@fafo:~# for vlan in 1000 2000; do
  ip link add link br0 name br0.$vlan type vlan id $vlan
  bridge vlan add dev br0 vid $vlan self
  ip link set br0.$vlan up mtu 9216
done

root@fafo:~# ip addr add 100.65.1.1/24 dev br0.1000
root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2
root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev br0.1000

root@fafo:~# ip addr add 100.65.2.1/24 dev br0.2000
root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2
root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev br0.2000

And with that, the loadtest shoots back into action:

Trex L3 BVI

First, a quick overview of the situation I have created:

root@fafo:~# bridge vlan
port              vlan-id  
swp31             1000 PVID
swp32             2000 PVID
br0               1 PVID Egress Untagged

root@fafo:~# ip -4 ro
default via 198.19.5.1 dev eth0 onlink rt_trap 
16.0.0.0/8 via 100.65.1.2 dev br0.1000 offload rt_offload 
48.0.0.0/8 via 100.65.2.2 dev br0.2000 offload rt_offload 
100.65.1.0/24 dev br0.1000 proto kernel scope link src 100.65.1.1 rt_offload 
100.65.2.0/24 dev br0.2000 proto kernel scope link src 100.65.2.1 rt_offload 
198.19.5.0/26 dev eth0 proto kernel scope link src 198.19.5.62 rt_trap 

root@fafo:~# ip -4 nei
198.19.5.1 dev eth0 lladdr 00:1e:08:26:ec:f3 REACHABLE 
100.65.1.2 dev br0.1000 lladdr b4:96:91:b3:b1:10 offload PERMANENT 
100.65.2.2 dev br0.2000 lladdr b4:96:91:b3:b1:11 offload PERMANENT 

Looking at the situation now, compared to the regular IPv4 L3 loadtest, there is one important difference. The switch can now have any number of ports in VLAN 1000, which will all do L2 forwarding amongst themselves at line rate. When they need to send IPv4 traffic out, they will ARP for the gateway (for example 100.65.1.1/24); that ARP gets trapped and forwarded to the CPU, after which the ARP reply goes out so that the machines know where to find the gateway. From that point on, IPv4 forwarding happens once again in hardware, which is shown by the keyword rt_offload in the routing table (br0, in the ASIC), compared to rt_trap (eth0, in the kernel). Similarly for the IPv4 neighbors: the L2 adjacency is programmed into the CAM (the output of which I took a look at above), so forwarding can be done directly by the ASIC without intervention from the CPU.

As a result, these VLAN interfaces (which are synonymous with BVIs) work at line rate out of the box.

Results

This switch is phenomenal, and Jiří Pírko and the Mellanox team truly outdid themselves with their mlxsw switchdev implementation. I have in my hands a very affordable 32x100G or 64x(50G, 25G, 10G, 1G) and anything in between, with IPv4 and IPv6 forwarding in hardware, with a limited FIB size, not too dissimilar from the [Centec] switches that IPng Networks runs in its AS8298 network, albeit without MPLS forwarding capabilities.

Still, as a lab switch for testing 25G and 100G topologies, this is very good value for the money, and it helps that it runs Debian and is fully configurable with things like Kees and Ansible. Considering there’s a whole range of 48x10G and 48x25G switches from Mellanox as well, all completely open and officially allowed to run OSS stuff, these make a perfect fit for IPng Networks!

Acknowledgements

This article was written after fussing around and finding out, but a few references were particularly helpful, and I’d like to acknowledge the following super useful sites:

And to the community for writing and maintaining this excellent switchdev implementation.