FrysIX eVPN: think different

FrysIX Logo

Introduction

Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega is the home of the Frysian Internet Exchange called [Frys-IX]. Back in 2021, a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of the most densely populated facilities in western Europe. He was looking for a few launching customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on my [bucketlist]. Arend and his IT company [ERITAP] took delivery of that rack in May of 2021, and this is when the internet exchange with Frysian roots was born.

In the years from 2021 until now, Arend and I have been operating the exchange with reasonable success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs with about ten switches in six datacenters across the Amsterdam metro area. It’s shifting a cool 800Gbit of traffic or so. It’s dope, and very rewarding to be able to contribute to this community!

Frys-IX is growing

We have several members with a 2x100G LAG and even though all inter-datacenter links are either dark fiber or WDM, we’re starting to feel the growing pains as we set our sights on the next step of growth. You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining the infamous [One TeraBit Club]. Thomas: we’re on our way!

It became clear that we will not be able to keep a dependable peering platform if FrysIX remains a single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and balancing traffic over those ports). We need to modernize in order to stay ahead of the growth curve.

Hello Nokia

Nokia 7220-D4

The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration, high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity to your data center networks and peering network environments. These devices are built around the Broadcom Trident chipset; in the case of the “D4” platform, this is a Trident4 with 28x100G and 8x400G ports. Whoot!

Nokia 7220-D3

What I find particularly awesome about the Trident series is its speed (total bandwidth of 12.8Tbps per router), low power use (without optics, the IXR-7220-D4 consumes about 150W) and a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of 2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right. That’s a 32x100G router.

ERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these beautiful Nokia devices. If you haven’t yet, you should definitely read about these versatile routers on the [Nokia] website, and some details of the merchant silicon switch chips in use on the [Broadcom] website.

eVPN: A small rant

Topology Concept

First, I need to get something off my chest. Consider a topology for an internet exchange platform, taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost every design or reference architecture I can find on the Internet assumes folks want to build a [Clos network], which has a topology consisting of leaf and spine switches. The spine switches have a different set of features than the leaf ones; notably, they don’t have to do provider edge functionality like VXLAN encapsulation and decapsulation. Almost all of these designs show how one might build a leaf-spine network for hyperscale.

Critique 1: my ‘spine’ (IXR-7220-D4 routers) must also be provider edge. Practically speaking, in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to connect between the facilities, and six 100G ports to connect the smaller breakout switches. That would leave a massive amount of capacity unused: 22x 100G and 6x400G ports, to be exact.

Critique 2: all ’leaf’ (either IXR-7220-D2 routers or Arista switches) can’t realistically connect to both ‘spines’. Our devices are spread out over two (and in practice, more like six) datacenters, and it’s prohibitively expensive to get 100G waves or dark fiber to create a full mesh. It’s much more economical to create a star-topology that minimizes cross-datacenter fiber spans.

Critique 3: Most of these ‘spine-leaf’ reference architectures assume that the interior gateway protocol is eBGP in what they call the underlay, and on top of that, some secondary eBGP that’s called the overlay. Frankly, such a design makes my head spin a little bit. These designs assume hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP needs either a ‘full mesh’, or external route reflectors.

Critique 4: These reference designs also make an assumption that all fiber is local and while optics and links can fail, it will be relatively rare to drain a link. However, in cross-datacenter networks, draining links for maintenance is very common, for example if the dark fiber provider needs to perform repairs on a span that was damaged. With these eBGP-over-eBGP connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic.

Setting aside eVPN for a second, if I were to build an IP transport network, like I did when I built [IPng Site Local], I would use a much more intuitive and simple (I would even dare say elegant) design:

  1. Take a classic IGP like [OSPF], or perhaps [IS-IS]. There is no benefit, to me at least, to use BGP as an IGP.
  2. I would give each of the links between the switches an IPv4 /31 and enable IPv6 link-local, and give each switch a loopback address with a /32 IPv4 and a /128 IPv6.
  3. If I had multiple links between two given switches, I would probably just use ECMP if my devices supported it, and fall back to a LACP signaled bundle-ethernet otherwise.
  4. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed to the datacenter fabric mindset), I would simply install iBGP against two or three route reflectors, and exchange routing information within the same single AS number.
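
The addressing in step 2 can be sketched in a few lines of Python. This is only an illustration, assuming the two pools that the demo later in this article happens to use (198.19.16.0/24 for loopbacks, 198.19.17.0/24 for transit links); the helper names `loopbacks` and `transit_links` are hypothetical.

```python
import ipaddress

# Assumption: these pools mirror the demo addressing used later in this
# article; the helper names below are made up for illustration.
LOOPBACK_POOL = ipaddress.ip_network("198.19.16.0/24")
TRANSIT_POOL = ipaddress.ip_network("198.19.17.0/24")


def loopbacks(routers):
    """Hand out one /32 loopback per router, in pool order."""
    # Iterating the network (rather than .hosts()) keeps the .0 address usable.
    return {name: f"{addr}/32" for name, addr in zip(routers, LOOPBACK_POOL)}


def transit_links(links):
    """Hand out one /31 per point-to-point link, one address per end."""
    plan = {}
    for (a, b), net in zip(links, TRANSIT_POOL.subnets(new_prefix=31)):
        low, high = list(net)  # a /31 holds exactly two addresses
        plan[(a, b)] = {a: f"{low}/31", b: f"{high}/31"}
    return plan


routers = ["equinix", "nikhef", "arista-leaf", "nokia-leaf"]
links = [("equinix", "nikhef"), ("nikhef", "arista-leaf")]

for name, addr in loopbacks(routers).items():
    print(f"{name:12} lo0 {addr}")
for ends in transit_links(links).values():
    print(ends)
```

Keeping loopbacks and transit links in separate pools makes it easy to match "all routers" with a single prefix later on.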

eVPN: A demo topology

Demo topology

So, that’s exactly how I’m going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can leave your comments below, and don’t forget to like-and-subscribe :-)

Arend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two 400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up to look like the picture on the right.

Underlay: Nokia’s SR Linux

We boot up the equipment, verify that all the optics and links are up, and connect the management ports to an OOB network that I can remotely log in to. This is the first time that either of us has worked on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.

[pim@nikhef ~]$ sr_cli
--{ running }--[  ]--
A:pim@nikhef# enter candidate
--{ candidate shared default }--[  ]--
A:pim@nikhef# set / interface lo0 admin-state enable
A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32
A:pim@nikhef# commit stay

There, my first config snippet! This creates a loopback interface and, similar to JunOS, a subinterface (which Juniper calls a unit) which enables IPv4 and gives it a /32 address. In SR Linux, any interface has to be associated with a network-instance; think of those as routing domains or VRFs. There’s a conveniently named default network-instance, to which I’ll add both this loopback and the point-to-point interface between the two 400G routers:

A:pim@nikhef# info flat interface ethernet-1/29
set / interface ethernet-1/29 admin-state enable
set / interface ethernet-1/29 subinterface 0 admin-state enable
set / interface ethernet-1/29 subinterface 0 ip-mtu 9190
set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable
set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31
set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable

A:pim@nikhef# set / network-instance default type default
A:pim@nikhef# set / network-instance default admin-state enable
A:pim@nikhef# set / network-instance default interface ethernet-1/29.0
A:pim@nikhef# set / network-instance default interface lo0.0
A:pim@nikhef# commit stay

Cool. Assuming I now also do this on the other IXR-7220-D4 router, called equinix (which gets the loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I should be able to do my first jumboframe ping:

A:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do
Using network instance default
PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data.
9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms
9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms
9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms
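
The numbers in that ping follow from simple header arithmetic. A minimal sketch, assuming an IPv4 header without options; the function name is hypothetical:

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(ip_mtu):
    """Largest `ping -s` value that fits in one unfragmented IPv4 packet."""
    return ip_mtu - IPV4_HEADER - ICMP_HEADER


payload = max_ping_payload(9190)
print(payload)                # 9162, the value passed to `ping -s`
print(payload + ICMP_HEADER)  # 9170, the "bytes from" figure in each reply
```

With `-M do` setting the don't-fragment bit, a payload even one byte larger would not fit through the 9190 byte ip-mtu.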

Underlay: SR Linux OSPF

OK, let’s get these two Nokia routers to speak OSPF, so that they can reach each other’s loopback. It’s really easy:

A:pim@nikhef# / network-instance default protocols ospf instance default
--{ candidate shared default }--[ network-instance default protocols ospf instance default ]--
A:pim@nikhef# set admin-state enable
A:pim@nikhef# set version ospf-v2
A:pim@nikhef# set router-id 198.19.16.1
A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point
A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true
A:pim@nikhef# commit stay

As in JunOS, I can descend into a configuration scope: the first line goes into the network-instance called default, then the protocols called ospf, and then the instance called default. Subsequent set commands operate at this scope. Once I commit this configuration (on the nikhef router and also the equinix router, with its own unique router-id), OSPF quickly springs into action:

A:pim@nikhef# show network-instance default protocols ospf neighbor 
=========================================================================================
Net-Inst default OSPFv2 Instance default Neighbors
=========================================================================================
+---------------------------------------------------------------------------------------+
| Interface-Name         Rtr Id            State        Pri   RetxQ    Time Before Dead |
+=======================================================================================+
| ethernet-1/29.0        198.19.16.0       full         1     0        36               |
+---------------------------------------------------------------------------------------+
-----------------------------------------------------------------------------------------
No. of Neighbors: 1
=========================================================================================

A:pim@nikhef# show network-instance default route-table all | more
IPv4 unicast route table of network instance default
+------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+
|        Prefix    | ID  | Route Type |  Route Owner | Active |  Origin  | Metric | Pref | Next-hop    |    Next-hop     |
|                  |     |            |              |        | Network  |        |      | (Type)      |   Interface     |
|                  |     |            |              |        | Instance |        |      |             |                 |
+==================+=====+============+==============+========+==========+========+======+=============+=================+
| 198.19.16.0/32   | 0   | ospfv2     | ospf_mgr     | True   | default  | 1      | 10   | 198.19.17.0 | ethernet-1/29.0 |
|                  |     |            |              |        |          |        |      | (direct)    |                 |
| 198.19.16.1/32   | 7   | host       | net_inst_mgr | True   | default  | 0      | 0    | None        | None            |
| 198.19.17.0/31   | 6   | local      | net_inst_mgr | True   | default  | 0      | 0    | 198.19.17.1 | ethernet-1/29.0 |
|                  |     |            |              |        |          |        |      | (direct)    |                 |
| 198.19.17.1/32   | 6   | host       | net_inst_mgr | True   | default  | 0      | 0    | None        | None            |
+==================+=====+============+==============+========+==========+========+======+=============+=================+

A:pim@nikhef# ping network-instance default 198.19.16.0
Using network instance default
PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data.
64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms
64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms

Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0 to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, going from 2 to N is easy. In my case, enabling several other point-to-point /31 transit networks on the nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF for these, makes the whole network shoot to life. Slick!
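
Going from 2 to N is mostly repetition, so it lends itself to templating. The sketch below only re-uses the `set` lines shown earlier for ethernet-1/29 and the OSPF instance, written with absolute paths. The function name is hypothetical, the /31 addresses are placeholders rather than the real FrysIX plan, and I'm assuming the OSPF line is accepted in its absolute `set / ...` form just like the interface lines are.

```python
# Assumption: the 192.0.2.x addresses below are placeholders, not the
# addresses actually configured on the nikhef router.
def p2p_commands(ifname, address, ip_mtu=9190, area="0.0.0.0"):
    """Return SR Linux `set` lines for one routed point-to-point link in OSPF."""
    sub = f"/ interface {ifname} subinterface 0"
    ospf = "/ network-instance default protocols ospf instance default"
    return [
        f"set / interface {ifname} admin-state enable",
        f"set {sub} admin-state enable",
        f"set {sub} ip-mtu {ip_mtu}",
        f"set {sub} ipv4 admin-state enable",
        f"set {sub} ipv4 address {address}",
        f"set {sub} ipv6 admin-state enable",
        f"set / network-instance default interface {ifname}.0",
        f"set {ospf} area {area} interface {ifname}.0 interface-type point-to-point",
    ]


# Four transit links, ethernet-1/1 through ethernet-1/4.
for port in range(1, 5):
    placeholder = f"192.0.2.{2 * port}/31"
    print("\n".join(p2p_commands(f"ethernet-1/{port}", placeholder)))
```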

Underlay: Arista

I’ll point out that one of the devices in this topology is an Arista. We have several of these ready for deployment at FrysIX. They are a lot more affordable and easy to find on the second hand / refurbished market. These switches come with 32x100G ports, and are really good at packet slinging because they’re based on the Broadcom Tomahawk chipset. They pack a few fewer features than the Trident chipset that powers the Nokia, but they happen to have all the features we need to run our internet exchange. So I turn my attention to the Arista in the topology. I am much more comfortable configuring the whole thing here, as it’s not my first time touching these devices:

arista-leaf#show run int loop0
interface Loopback0
   ip address 198.19.16.2/32
   ip ospf area 0.0.0.0
arista-leaf#show run int Ethernet32/1
interface Ethernet32/1
   description Core: Connected to nikhef:ethernet-1/2
   load-interval 1
   mtu 9190
   no switchport
   ip address 198.19.17.5/31
   ip ospf cost 1000
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
arista-leaf#show run section router ospf
router ospf 65500
   router-id 198.19.16.2
   redistribute connected
   network 198.19.0.0/16 area 0.0.0.0
   max-lsa 12000

I complete the configuration for the other two interfaces on this Arista: port Eth31/1 also connects to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to the nokia-leaf IXR-7220-D2 with a cost of 10. It’s nice to see OSPF in action: there are two equal-cost (but high cost) OSPF paths via router-id 198.19.16.1 (nikhef), and there’s one lower cost path via router-id 198.19.16.3 (nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -> nokia-leaf -> nikhef -> equinix). Dope!

arista-leaf#show ip ospf nei
Neighbor ID     Instance VRF      Pri State                  Dead Time   Address         Interface
198.19.16.1     65500    default  1   FULL                   00:00:36    198.19.17.4     Ethernet32/1
198.19.16.3     65500    default  1   FULL                   00:00:31    198.19.17.11    Ethernet30/1
198.19.16.1     65500    default  1   FULL                   00:00:35    198.19.17.2     Ethernet31/1

arista-leaf#traceroute 198.19.16.0
traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets
 1  198.19.17.11 (198.19.17.11)  0.220 ms  0.150 ms  0.206 ms
 2  198.19.17.6 (198.19.17.6)  0.169 ms  0.107 ms  0.099 ms
 3  198.19.16.0 (198.19.16.0)  0.434 ms  0.346 ms  0.303 ms

So far, so good! The underlay is up, every router can reach every other router on its loopback, and all OSPF adjacencies are formed. I’ll leave the 2x100G between nikhef and arista-leaf at high cost for now.
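
To see why OSPF prefers the scenic route, here is a small shortest-path sketch over the demo topology. The costs of 1000 and 10 on the Arista links come from the configuration above. The cost of 10 between nokia-leaf and nikhef and the cost of 1 between nikhef and equinix are assumptions on my part (the latter is at least consistent with the metric of 1 in the SR Linux route table).

```python
import heapq

# Assumption: only the 1000 and 10 on the arista-leaf links are stated in the
# article; the other two costs are guesses for illustration.
LINKS = [
    ("arista-leaf", "nikhef", 1000),
    ("arista-leaf", "nokia-leaf", 10),
    ("nokia-leaf", "nikhef", 10),
    ("nikhef", "equinix", 1),
]


def shortest_path(links, src, dst):
    """Dijkstra over an undirected graph; returns (total_cost, [hops])."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, []).append((b, cost))
        graph.setdefault(b, []).append((a, cost))
    queue = [(0, src, [src])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, link_cost in graph.get(node, []):
            if nxt not in seen:
                heapq.heappush(queue, (cost + link_cost, nxt, path + [nxt]))
    return None  # destination unreachable


print(shortest_path(LINKS, "arista-leaf", "equinix"))
```

Under these assumed costs the path via nokia-leaf totals 21, while the direct link to nikhef alone already costs 1000, which matches the hops in the traceroute.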

Overlay EVPN: SR Linux

The big-picture idea here is to use iBGP with the same private AS number, and because there are two main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for others. It means that they will have an iBGP session amongst themselves (198.19.16.0 <-> 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet. This way, I don’t have to configure any more than strictly necessary on the core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to configure BGP on the Nokias like this:

A:pim@nikhef# / network-instance default protocols bgp 
A:pim@nikhef# set admin-state enable
A:pim@nikhef# set autonomous-system 65500
A:pim@nikhef# set router-id 198.19.16.1
A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay
A:pim@nikhef# set afi-safi evpn admin-state enable
A:pim@nikhef# set preference ibgp 170
A:pim@nikhef# set route-advertisement rapid-withdrawal true
A:pim@nikhef# set route-advertisement wait-for-fib-install false
A:pim@nikhef# set group overlay peer-as 65500
A:pim@nikhef# set group overlay afi-safi evpn admin-state enable
A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable
A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable
A:pim@nikhef# set group overlay local-as as-number 65500
A:pim@nikhef# set group overlay route-reflector client true
A:pim@nikhef# set group overlay transport local-address 198.19.16.1
A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable
A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay
A:pim@nikhef# commit stay

I can see that iBGP sessions establish between all the devices:

A:pim@nikhef# show network-instance default protocols bgp neighbor
---------------------------------------------------------------------------------------------------------------------------
BGP neighbor summary for network-instance "default"              
Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow
---------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
|   Net-Inst  |    Peer     |  Group   | Flags |  Peer-AS |   State     |      Uptime   |  AFI/SAFI  |   [Rx/Active/Tx]   |
+=============+=============+==========+=======+==========+=============+===============+============+====================+
| default     | 198.19.16.0 | overlay  | S     | 65500    | established | 0d:0h:2m:32s  | evpn       | [0/0/0]            |
| default     | 198.19.16.2 | overlay  | D     | 65500    | established | 0d:0h:2m:27s  | evpn       | [0/0/0]            |
| default     | 198.19.16.3 | overlay  | D     | 65500    | established | 0d:0h:2m:41s  | evpn       | [0/0/0]            |
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+
---------------------------------------------------------------------------------------------------------------------------
Summary:
1 configured neighbors, 1 configured sessions are established, 0 disabled peers
2 dynamic peers

A few things to note here: there is one configured neighbor (this is the other IXR-7220-D4 router), and two dynamic peers, which are the Arista and the smaller IXR-7220-D2 router. The only address family that they are exchanging information for is the evpn family, and no prefixes have been learned or sent yet, shown by the [0/0/0] designation in the last column.
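
The S and D flags boil down to a prefix match on the peer's source address. A rough model of that decision, using the match prefix and static neighbor from the nikhef configuration (the function name is hypothetical, and real BGP implementations check more than just the address):

```python
import ipaddress

# Both values come from the nikhef BGP configuration above.
DYNAMIC_MATCH = ipaddress.ip_network("198.19.16.0/24")
STATIC_NEIGHBORS = {ipaddress.ip_address("198.19.16.0")}


def classify_peer(source_ip):
    """Return 'S', 'D' or None, loosely mirroring the Flags column."""
    addr = ipaddress.ip_address(source_ip)
    if addr in STATIC_NEIGHBORS:
        return "S"
    if addr in DYNAMIC_MATCH:
        return "D"
    return None  # outside the accepted range, so no session


for peer in ("198.19.16.0", "198.19.16.2", "198.19.16.3", "198.19.17.5"):
    print(peer, classify_peer(peer))
```

This is also why the leaf routers have to source their iBGP session from the loopback: a session coming from a /31 transit address such as 198.19.17.5 falls outside the match.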

Overlay EVPN: Arista

The Arista is also remarkably straightforward to configure. Here, I’ll simply enable the iBGP session as follows:

arista-leaf#show run section bgp
router bgp 65500
   neighbor evpn peer group
   neighbor evpn remote-as 65500
   neighbor evpn update-source Loopback0
   neighbor evpn ebgp-multihop 3
   neighbor evpn send-community extended
   neighbor evpn maximum-routes 12000 warning-only
   neighbor 198.19.16.0 peer group evpn
   neighbor 198.19.16.1 peer group evpn
   !
   address-family evpn
      neighbor evpn activate

arista-leaf#show bgp summary 
BGP summary information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Neighbor             AS Session State AFI/SAFI                AFI/SAFI State   NLRI Rcd   NLRI Acc
----------- ----------- ------------- ----------------------- -------------- ---------- ----------
198.19.16.0       65500 Established   IPv4 Unicast            Advertised              0          0
198.19.16.0       65500 Established   L2VPN EVPN              Negotiated              0          0
198.19.16.1       65500 Established   IPv4 Unicast            Advertised              0          0
198.19.16.1       65500 Established   L2VPN EVPN              Negotiated              0          0

On this leaf node, I’ll have a redundant iBGP session with the two core nodes. Since those core nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No matter how many additional Arista (or Nokia) devices I add to the network, all they’ll have to do is enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers. Voila!
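
The payoff of route reflectors shows up in the session count. A quick back-of-the-envelope sketch (function names are hypothetical), comparing a full iBGP mesh with two reflectors that peer with each other and with every client:

```python
def full_mesh_sessions(routers):
    """iBGP sessions when every router peers with every other router."""
    return routers * (routers - 1) // 2


def reflected_sessions(routers, reflectors=2):
    """Full mesh among the reflectors, plus one session per client per reflector."""
    clients = routers - reflectors
    return full_mesh_sessions(reflectors) + clients * reflectors


for n in (4, 10, 50):
    print(n, full_mesh_sessions(n), reflected_sessions(n))
```

For the four routers in this demo the difference is small (6 versus 5 sessions), but with the reflectors each new router adds exactly two sessions, and only on the new router itself thanks to the dynamic neighbors.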

VXLAN EVPN: SR Linux

Nokia documentation informs me that SR Linux uses a special interface called system0 to source its VXLAN traffic from, and that this interface needs to be added to the default network-instance. So it’s a matter of defining that interface and associating a VXLAN interface with it, like so:

A:pim@nikhef# set / interface system0 admin-state enable
A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable
A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32
A:pim@nikhef# set / network-instance default interface system0.0
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604
A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address
A:pim@nikhef# commit stay

This creates the plumbing for a VXLAN sub-interface called vxlan1.2604 which will accept/send traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering LAN), and it’ll use the system0.0 address to source that traffic from.
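
For a feel of what VNI 2604 looks like on the wire, here is a sketch that builds and parses the 8-byte VXLAN header as laid out in RFC 7348: one flags byte with the I bit set, 24 reserved bits, the 24-bit VNI, and 8 more reserved bits. The function names are hypothetical, and this covers only the VXLAN header, not the outer Ethernet/IP/UDP headers.

```python
import struct

VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag from RFC 7348


def vxlan_header(vni):
    """Build the 8-byte VXLAN header for the given 24-bit VNI."""
    if not 0 <= vni < 2**24:
        raise ValueError("VNI must fit in 24 bits")
    # First word: flags in the top byte. Second word: VNI in the top 24 bits.
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def vxlan_vni(header):
    """Extract the VNI from an 8-byte VXLAN header."""
    _, word = struct.unpack("!II", header[:8])
    return word >> 8


hdr = vxlan_header(2604)
print(hdr.hex())
print(vxlan_vni(hdr))
```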

The second part is to create what SR Linux calls a MAC-VRF and put some interface in it:

A:pim@nikhef# set / interface ethernet-1/9 admin-state enable
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4
A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G
A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable
A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable
A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged

A:pim@nikhef# / network-instance peeringlan 
A:pim@nikhef# set type mac-vrf
A:pim@nikhef# set admin-state enable
A:pim@nikhef# set interface ethernet-1/9/3.0
A:pim@nikhef# set vxlan-interface vxlan1.2604
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604
A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604
A:pim@nikhef# commit stay

In the first block here, Arend took what is a 100G port called ethernet-1/9 and split it into 4x25G ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that the third lane is plugged into the Debian machine. So on ethernet-1/9/3 I’ll create a sub-interface, make it type bridged (which I’ve also done on vxlan1.2604!) and allow any untagged traffic to enter it.


If you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very natural to you. I’ve written about the sub-interfaces logic on Cisco’s IOS/XR and VPP approach in a previous [article] which my buddy Fred lovingly calls VLAN Gymnastics because the ports are just so damn flexible. Worth a read!

The second block creates a new network-instance which I’ll name peeringlan. It associates the newly created untagged sub-interface ethernet-1/9/3.0 with the VXLAN interface, and starts a protocol for eVPN, instructing traffic in and out of this network-instance to use EVI 2604 on the VXLAN sub-interface, and signalling of all MAC addresses learned to use the specified route-distinguisher and import/export route-targets. For simplicity I’ve just used the same for each: 65500:2604.
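
The route-distinguisher and route-target look identical in the config, but they are encoded differently in BGP. A small sketch, assuming the common two-octet-AS formats: a type 0 route distinguisher (RFC 4364) and a transitive two-octet-AS route-target extended community (RFC 4360). The function names are hypothetical.

```python
import struct


def route_distinguisher(asn, number):
    """Type 0 RD: 2-byte type (0), 2-byte AS number, 4-byte assigned number."""
    return struct.pack("!HHI", 0, asn, number)


def route_target(asn, number):
    """Route-target extended community: type 0x00, sub-type 0x02, AS, number."""
    return struct.pack("!BBHI", 0x00, 0x02, asn, number)


print(route_distinguisher(65500, 2604).hex())
print(route_target(65500, 2604).hex())
```

Both are 8 bytes long: the RD only makes routes unique, while the RT is what the import and export policies match on.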

I continue to add an interface to the peeringlan network-instance on the other two Nokia routers: ethernet-1/9/3.0 on the equinix router and ethernet-1/9.0 on the nokia-leaf router. Each of these goes to a 10Gbps port on a Debian machine.

VXLAN EVPN: Arista

At this point I’m feeling pretty bullish about the whole project. Arista does not make it very difficult for me to configure it for L2 EVPN (which is called MAC-VRF here also):

arista-leaf#conf t
vlan 2604
   name v-peeringlan
interface Ethernet9/3
   speed forced 10000full
   switchport access vlan 2604

interface Loopback1
   ip address 198.19.18.2/32
interface Vxlan1
   vxlan source-interface Loopback1
   vxlan udp-port 4789
   vxlan vlan 2604 vni 2604

After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I’ll add a VTEP source address on Loopback1, and a VXLAN interface that uses that to source its traffic. Here, I’ll associate local VLAN 2604 with the Vxlan1 and its VNI 2604, to match up with how I configured the Nokias previously.

Finally, it’s a matter of tying these together by announcing the MAC addresses into the EVPN iBGP sessions:

arista-leaf#conf t
router bgp 65500
   vlan 2604
      rd 65500:2604
      route-target both 65500:2604
      redistribute learned
   !

Results

To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord server. In EOS, I can ask it to check for any obvious mistakes in two places:

arista-leaf#show vxlan config-sanity detail
Category                            Result  Detail                                            
---------------------------------- -------- --------------------------------------------------
Local VTEP Configuration Check        OK                                                      
  Loopback IP Address                 OK                                                      
  VLAN-VNI Map                        OK                                                      
  Flood List                          OK                                                      
  Routing                             OK                                                      
  VNI VRF ACL                         OK                                                      
  Decap VRF-VNI Map                   OK                                                      
  VRF-VNI Dynamic VLAN                OK                                                      
Remote VTEP Configuration Check       OK                                                      
  Remote VTEP                         OK                                                      
Platform Dependent Check              OK                                                      
  VXLAN Bridging                      OK                                                      
  VXLAN Routing                       OK    VXLAN Routing not enabled                         
CVX Configuration Check               OK                                                      
  CVX Server                          OK    Not in controller client mode                     
MLAG Configuration Check              OK    Run 'show mlag config-sanity' to verify MLAG config
  Peer VTEP IP                        OK    MLAG peer is not connected                        
  MLAG VTEP IP                        OK                                                      
  Peer VLAN-VNI                       OK                                                      
  Virtual VTEP IP                     OK                                                      
  MLAG Inactive State                 OK                                                      

arista-leaf#show bgp evpn sanity detail 
Category Check                Status Detail
-------- -------------------- ------ ------
General  Send community       OK           
General  Multi-agent mode     OK           
General  Neighbor established OK           
L2       MAC-VRF route-target OK           
         import and export                 
L2       MAC-VRF              OK           
         route-distinguisher               
L2       MAC-VRF redistribute OK           
L2       MAC-VRF overlapping  OK           
         VLAN                              
L2       Suppressed MAC       OK           
VXLAN    VLAN to VNI map for  OK           
         MAC-VRF                           
VXLAN    VRF to VNI map for   OK           
         IP-VRF                            

Results: Arista view

Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is easy:

arista-leaf#show bgp evpn summary 
BGP summary information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Neighbor Status Codes: m - Under maintenance
  Neighbor    V AS           MsgRcvd   MsgSent  InQ OutQ  Up/Down State   PfxRcd PfxAcc
  198.19.16.0 4 65500           3311      3867    0    0 18:06:28 Estab   7      7
  198.19.16.1 4 65500           3308      3873    0    0 18:06:28 Estab   7      7

arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3
BGP routing table information for VRF default
Router identifier 198.19.16.2, local AS number 65500
Route status codes: * - valid, > - active, S - Stale, E - ECMP head, e - ECMP
                    c - Contributing to ECMP, % - Pending BGP convergence
Origin codes: i - IGP, e - EGP, ? - incomplete
AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop

          Network                Next Hop              Metric  LocPref Weight  Path
 * >Ec    RD: 65500:2604 mac-ip e43a.6e5f.0c59
                                 198.19.18.3           -       100     0       i Or-ID: 198.19.16.3 C-LST: 198.19.16.1 
 *  ec    RD: 65500:2604 mac-ip e43a.6e5f.0c59
                                 198.19.18.3           -       100     0       i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 
 * >Ec    RD: 65500:2604 imet 198.19.18.3
                                 198.19.18.3           -       100     0       i Or-ID: 198.19.16.3 C-LST: 198.19.16.1 
 *  ec    RD: 65500:2604 imet 198.19.18.3
                                 198.19.18.3           -       100     0       i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 

There’s a lot to unpack here! The Arista sees that, within the route-distinguisher I configured (65500:2604), it is learning one MAC address behind next hop 198.19.18.3 (this is the VTEP for the nokia-leaf router), and it learns it over both iBGP sessions. The route carries originator ID 198.19.16.3 (the loopback of the nokia-leaf router) and arrives with two different cluster lists: the active path via iBGP speaker 198.19.16.1 (nikhef) and a second ECMP path via 198.19.16.0 (equinix).

I can also see that there’s a bunch of imet route entries, and Andy explained these to me. They are a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor discovery or ARP requests) flooded to them. Every router participating in this L2VPN will raise such an imet route, which I’ll see in duplicate as well (one copy from each iBGP session). This checks out.

Results: SR Linux view

The Nokia IXR-7220-D4 router called equinix has also learned a bunch of EVPN routing entries, which I can inspect as follows:

A:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary 
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Show report for the BGP route table of network-instance "default"
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Status codes: u=used, *=valid, >=best, x=stale, b=backup
Origin codes: i=IGP, e=EGP, ?=incomplete
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
BGP Router ID: 198.19.16.0      AS: 65500      Local AS: 65500
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type 2 MAC-IP Advertisement Routes
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
| Status |    Route-     | Tag-ID |    MAC-address    | IP-address | neighbor    | Path-|  Next-Hop   |  Label |              ESI               |   MAC Mobility   |
|        | distinguisher |        |                   |            |             |   id |             |        |                                |                  |
+========+===============+========+===================+============+=============+======+============-+========+================================+==================+
| u*>    | 65500:2604    | 0      | E4:3A:6E:5F:0C:57 | 0.0.0.0    | 198.19.16.1 | 0    | 198.19.18.1 | 2604   | 00:00:00:00:00:00:00:00:00:00  | -                |
| *      | 65500:2604    | 0      | E4:3A:6E:5F:0C:58 | 0.0.0.0    | 198.19.16.1 | 0    | 198.19.18.2 | 2604   | 00:00:00:00:00:00:00:00:00:00  | -                |
| u*>    | 65500:2604    | 0      | E4:3A:6E:5F:0C:58 | 0.0.0.0    | 198.19.16.2 | 0    | 198.19.18.2 | 2604   | 00:00:00:00:00:00:00:00:00:00  | -                |
| *      | 65500:2604    | 0      | E4:3A:6E:5F:0C:59 | 0.0.0.0    | 198.19.16.1 | 0    | 198.19.18.3 | 2604   | 00:00:00:00:00:00:00:00:00:00  | -                |
| u*>    | 65500:2604    | 0      | E4:3A:6E:5F:0C:59 | 0.0.0.0    | 198.19.16.3 | 0    | 198.19.18.3 | 2604   | 00:00:00:00:00:00:00:00:00:00  | -                |
+--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
Type 3 Inclusive Multicast Ethernet Tag Routes
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
| Status |     Route-distinguisher     | Tag-ID |    Originator-IP    |    neighbor     | Path-  |       Next-Hop        |
|        |                             |        |                     |                 |   id   |                       |
+========+=============================+========+=====================+=================+========+=======================+
| u*>    | 65500:2604                  | 0      | 198.19.18.1         | 198.19.16.1     | 0      | 198.19.18.1           |
| *      | 65500:2604                  | 0      | 198.19.18.2         | 198.19.16.1     | 0      | 198.19.18.2           |
| u*>    | 65500:2604                  | 0      | 198.19.18.2         | 198.19.16.2     | 0      | 198.19.18.2           |
| *      | 65500:2604                  | 0      | 198.19.18.3         | 198.19.16.1     | 0      | 198.19.18.3           |
| u*>    | 65500:2604                  | 0      | 198.19.18.3         | 198.19.16.3     | 0      | 198.19.18.3           |
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+
--------------------------------------------------------------------------------------------------------------------------
0 Ethernet Auto-Discovery routes 0 used, 0 valid
5 MAC-IP Advertisement routes 3 used, 5 valid
5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid
0 Ethernet Segment routes 0 used, 0 valid
0 IP Prefix routes 0 used, 0 valid
0 Selective Multicast Ethernet Tag routes 0 used, 0 valid
0 Selective Multicast Membership Report Sync routes 0 used, 0 valid
0 Selective Multicast Leave Sync routes 0 used, 0 valid
--------------------------------------------------------------------------------------------------------------------------

I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs here. Each MAC-IP entry is accounted for: I can see three routes learned from the nikhef switch (neighbor 198.19.16.1), one learned directly from the Arista switch and one learned directly from the nokia-leaf router, with next hops pointing at the VTEPs 198.19.18.1, 198.19.18.2 and 198.19.18.3. I also see the imet entries. One thing to note: the SR Linux implementation shows type-2 routes that carry no IP address as 0.0.0.0, while the Arista (in my opinion, more correctly) leaves the field as NULL (unspecified). But, everything looks great!
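The used/valid totals at the bottom of that report can be cross-checked against the rows themselves. A minimal sketch, with the five type-2 rows transcribed by hand from the table above; the tuple layout and variable names are illustrative only and not anything SR Linux exposes:

```python
from collections import Counter

# (status, MAC address, BGP neighbor, next hop), copied from the
# "Type 2 MAC-IP Advertisement Routes" table shown above.
TYPE2_ROUTES = [
    ("u*>", "E4:3A:6E:5F:0C:57", "198.19.16.1", "198.19.18.1"),
    ("*",   "E4:3A:6E:5F:0C:58", "198.19.16.1", "198.19.18.2"),
    ("u*>", "E4:3A:6E:5F:0C:58", "198.19.16.2", "198.19.18.2"),
    ("*",   "E4:3A:6E:5F:0C:59", "198.19.16.1", "198.19.18.3"),
    ("u*>", "E4:3A:6E:5F:0C:59", "198.19.16.3", "198.19.18.3"),
]

# Status codes per the report legend: u=used, *=valid.
used = sum(1 for status, _, _, _ in TYPE2_ROUTES if "u" in status)
valid = sum(1 for status, _, _, _ in TYPE2_ROUTES if "*" in status)

# How many routes each iBGP neighbor sent, and which VTEP they point at.
by_neighbor = Counter(neighbor for _, _, neighbor, _ in TYPE2_ROUTES)
by_next_hop = Counter(next_hop for _, _, _, next_hop in TYPE2_ROUTES)

print(f"{len(TYPE2_ROUTES)} MAC-IP routes, {used} used, {valid} valid")
print("per neighbor:", dict(by_neighbor))
print("per next hop:", dict(by_next_hop))
```

Every MAC has exactly one used route, and the two MACs that show up twice are the ones that are also relayed by the nikhef switch.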

Results: Debian view

There’s one more thing to show, and that’s kind of the ‘proof is in the pudding’ moment. As I said, Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!

root@debian:~ # ip netns add nikhef
root@debian:~ # ip link set enp1s0f0 netns nikhef
root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000
root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0
root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0

root@debian:~ # ip netns add arista-leaf
root@debian:~ # ip link set enp1s0f1 netns arista-leaf
root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000
root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1
root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1

root@debian:~ # ip netns add nokia-leaf
root@debian:~ # ip link set enp1s0f2 netns nokia-leaf
root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000
root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2
root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2

root@debian:~ # ip netns add equinix
root@debian:~ # ip link set enp1s0f3 netns equinix
root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000
root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3 
root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3 

root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29
192.0.2.10 is alive
192.0.2.11 is alive
192.0.2.12 is alive
192.0.2.13 is alive

root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13
2001:db8::10 is alive
2001:db8::11 is alive
2001:db8::12 is alive
2001:db8::13 is alive

root@debian:~# ip netns exec equinix ip nei
192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE 
192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE 
192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE 
fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE 
fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE 
fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE 
2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE 
2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE 
2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE 

The Debian machine puts each port of the network card into its own network namespace, and gives each of them both an IPv4 and an IPv6 address. I can then enter the nikhef network namespace, which has its NIC port connected to the IXR-7220-D4 router called nikhef, and ping all four endpoints. Similarly, I can enter the arista-leaf namespace and ping6 all four endpoints. Finally, I take a look at the IPv4 and IPv6 neighbor table in the namespace whose port is connected to the equinix router. All three remote MAC addresses are seen. This proves end-to-end connectivity across the EVPN VXLAN, and full interoperability. Booyah!
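The fe80:: entries in that neighbor table can be tied back to the same three MAC addresses. A minimal sketch, assuming the ports derive their link-local address using modified EUI-64 (the output above is consistent with that, but it is an assumption); the function name is made up for this example:

```python
import ipaddress


def mac_to_link_local(mac: str) -> str:
    """Derive the modified EUI-64 link-local address for a MAC address."""
    octets = bytearray(int(part, 16) for part in mac.split(":"))
    octets[0] ^= 0x02  # flip the universal/local bit
    # Insert ff:fe between the OUI and the NIC-specific half.
    interface_id = bytes(octets[:3]) + b"\xff\xfe" + bytes(octets[3:])
    # fe80::/64 prefix followed by the 64-bit interface identifier.
    return str(ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + interface_id))


# The three MAC addresses seen from the equinix namespace.
for mac in ("e4:3a:6e:5f:0c:57", "e4:3a:6e:5f:0c:58", "e4:3a:6e:5f:0c:59"):
    print(mac, "->", mac_to_link_local(mac))
```

The derived addresses line up with the link-local neighbors in the `ip nei` output, so the IPv4, global IPv6 and link-local entries all describe the same three ports.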

Performance? We got that! I’m not worried as these Nokia routers are rated for 12.8Tbps of VXLAN….

root@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12 
Connecting to host 192.0.2.12, port 5201
[  5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec  1.15 GBytes  9.91 Gbits/sec   19   1.52 MBytes       
[  5]   1.00-2.00   sec  1.15 GBytes  9.90 Gbits/sec    3   1.54 MBytes       
[  5]   2.00-3.00   sec  1.15 GBytes  9.90 Gbits/sec    1   1.54 MBytes       
[  5]   3.00-4.00   sec  1.15 GBytes  9.90 Gbits/sec    1   1.54 MBytes       
[  5]   4.00-5.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.54 MBytes       
[  5]   5.00-6.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.54 MBytes       
[  5]   6.00-7.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.54 MBytes       
[  5]   7.00-8.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.54 MBytes       
[  5]   8.00-9.00   sec  1.15 GBytes  9.90 Gbits/sec    0   1.54 MBytes       
[  5]   9.00-10.00  sec  1.15 GBytes  9.90 Gbits/sec    0   1.54 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec   24             sender
[  5]   0.00-10.00  sec  11.5 GBytes  9.90 Gbits/sec                  receiver

iperf Done.
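That 9.90 Gbits/sec is about as much TCP payload as a 10G port can carry with an MTU of 9000, which suggests the 10G port on the Debian machine is the limit here rather than the VXLAN fabric. A back-of-the-envelope sketch, assuming untagged Ethernet on the host port, IPv4 without options, and TCP timestamps enabled (the usual Linux default); these are assumptions, not measurements from this lab:

```python
LINK_RATE_GBPS = 10.0

IPV4_HEADER = 20            # bytes, no IP options (assumption)
TCP_HEADER = 20             # bytes, base header
TCP_TIMESTAMP_OPTION = 12   # bytes, timestamps enabled (assumption)
# Ethernet header (14) + FCS (4) + preamble/SFD (8) + inter-frame gap (12).
ETHERNET_OVERHEAD = 14 + 4 + 8 + 12


def tcp_goodput_gbps(mtu: int, link_rate_gbps: float = LINK_RATE_GBPS) -> float:
    """Best-case TCP payload rate for full-sized segments on one link."""
    payload = mtu - IPV4_HEADER - TCP_HEADER - TCP_TIMESTAMP_OPTION
    on_the_wire = mtu + ETHERNET_OVERHEAD
    return link_rate_gbps * payload / on_the_wire


print(f"MTU 9000: {tcp_goodput_gbps(9000):.2f} Gbit/s")
print(f"MTU 1500: {tcp_goodput_gbps(1500):.2f} Gbit/s")
```

The VXLAN encapsulation adds its own headers on the inter-switch links, but those links are much faster than the 10G host ports, so it does not show up in this particular test.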

What’s Next

There are a few improvements I can make before deploying this architecture to the internet exchange. Notably:

  • the functional equivalent of port security, that is to say only allowing one or two MAC addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port security will greatly improve our resilience.
  • SR Linux has the ability to suppress ARP, even on L2 MAC-VRF! It’s relatively well known for IRB based setups, but adding this to transparent bridge-domains is possible in Nokia [ref], using the syntax of protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for BUM flooding.
  • Andy informs me that Arista also has this feature. By setting router l2-vpn and arp learning bridged, the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router BUM flooding. If DE-CIX can do it, so can FrysIX :)
  • some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not as difficult as I thought, having some automation in place will avoid errors and mistakes. It would suck if the IXP collapsed because I botched a link drain or PNI configuration!
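To illustrate why ARP suppression cuts down on BUM flooding, here is a toy model of the idea: glean IP-to-MAC bindings from intercepted ARP requests, answer locally when the target is already known, and only flood otherwise. This is a conceptual sketch with made-up class and method names; it is not how SR Linux or Arista implement the feature, and it ignores that bindings also travel between routers.

```python
from typing import Dict, Optional


class ArpSuppressionTable:
    """Toy proxy-ARP table for one bridge domain (hypothetical model)."""

    def __init__(self) -> None:
        self._ip_to_mac: Dict[str, str] = {}
        self.flooded = 0
        self.answered_locally = 0

    def handle_request(self, sender_ip: str, sender_mac: str,
                       target_ip: str) -> Optional[str]:
        """Return the MAC to reply with, or None if the request must flood."""
        # Glean the sender's binding from the intercepted request.
        self._ip_to_mac[sender_ip] = sender_mac
        target_mac = self._ip_to_mac.get(target_ip)
        if target_mac is None:
            self.flooded += 1
            return None
        self.answered_locally += 1
        return target_mac


table = ArpSuppressionTable()
# First request: the target is unknown, so it still has to be flooded.
print(table.handle_request("192.0.2.10", "e4:3a:6e:5f:0c:57", "192.0.2.11"))
# The reverse lookup is answered from the gleaned binding, with no flooding.
print(table.handle_request("192.0.2.11", "e4:3a:6e:5f:0c:58", "192.0.2.10"))
print(f"flooded={table.flooded} answered_locally={table.answered_locally}")
```

On an exchange where every member resolves the same set of peers over and over, most requests would fall into the second category once the table is warm.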

Acknowledgements

I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker who jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista as well as SR Linux. In particular, I want to give him a big “Thank you!” for helping me understand symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure gold!

I also want to thank Niek for helping me take my first baby steps onto this platform and patiently answering my nerdly questions about the platform, the switch chip, and the configuration philosophy. Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with Arend and me on a video call, giving a bunch of operational tips and tricks along the way.

Finally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and OOB access, and for brainstorming the config with me!

Reference configurations

Here are the configs for all machines in this demonstration: [nikhef] | [equinix] | [nokia-leaf] | [arista-leaf]