[{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nLoad balancing is one of those topics that sounds deceptively simple until you think about it for a while. In this article I take the VPP load balancer plugin out for a spin, fix a handful of API bugs, and add two small new features that make running it in production a little bit easier.\nIntroduction IPng runs services that want to be reachable via as few public IP addresses as possible. Let\u0026rsquo;s say I want to run a DNS resolver or authoritative nameserver or even the IPng website, but I want these to be highly available and perhaps scale to more traffic than one backend server could provide. What are my options?\nMy first option is just put a bunch of servers online and give them all an A/AAAA record, and put them all in DNS, say 7 webservers, and then point ipng.ch to those. It\u0026rsquo;s clumsy, notably if one server is down for maintenance or failure, one seventh of the traffic may still want to reach it. Also, removing a server will have lots of lingering traffic stay on the webserver, as clients are sometimes slow to pick up the DNS changes, even if my TTL is low.\nLet me show you an example:\nThere are two main problems with this graph:\nLoad imbalance: there are seven webservers in this graph, but somehow only three of them are getting traffic, the others are not. One is much more heavily loaded (nginx0.chrma0) than the others. It\u0026rsquo;s receiving 1.2kqps while others are receiving ~40qps. This poses a risk when the clients that are somehow attracted to this instance grow, they may overwhelm this little webserver, even if there are six others that could help out!\nDrains take forever: The green graph was a drain of nginx0.nlams2 due to a pending maintenance window as the datacenter is closing and the server needs to be physically moved. I put in the DNS change at around 16:15 UTC and the traffic finally dropped at 21:45, a full five hours (!) later. And believe it or not, the TTL was 15 minutes on these records. Some clients just don\u0026rsquo;t get the hint \u0026hellip;\nLoad balancing 101 A naive load balancing solution is to simply round-robin: send each new packet to the next backend in the list. That works reasonably well for stateless UDP traffic like DNS, although even with DNS there is a gotcha: some DNS queries need TCP, for example those that are too big to fit in a single UDP packet, and they will not be tolerant of naive packet round robin. For TCP this naive load balancing solution quickly falls apart, because every packet in a connection needs to reach the same backend. Sending a SYN to backend A and the subsequent ACK to backend B will not establish a TCP connection.\nThe classical answer is to keep per-session state on the load balancer: a table that maps a 5-tuple of {source IP, destination IP, source port, destination port, protocol} to a chosen backend. That works, but it introduces a stateful bottleneck. At line rate on a load balancer handling millions of flows and packets/sec, maintaining and synchronising that table across multiple CPU threads is expensive. 
It also means that if the load balancer restarts, every existing TCP session breaks.\nWhat if there was some form of consistent hashing: given the 5-tuple of a packet, the load balancer might always select the same backend deterministically, without storing any per-session state. If backends come and go, only the flows that were assigned to the changed backend are affected — all other flows keep working. Google solved this problem at scale and published their solution. They call it Maglev.\nIntroducing Maglev Google\u0026rsquo;s Maglev load balancer has been running in production since 2008 and I happen to know several of its authors - as a personal aside I was sad to learn that Cody Smith, with whom I shared an office and a team for many years, passed away earlier this year. Rest in peace, Cody!\nThe Google team published their design at NSDI 2016 in the paper [Maglev: A Fast and Reliable Software Network Load Balancer]. It is worth reading in full — the paper is well written and covers not only the hashing algorithm but also the wider architecture of how Google handles frontend traffic at scale.\nThe key insight is that Maglev uses a pre-computed lookup table of size M (some large prime number, 65537 in the paper) filled with backend indices. To handle a packet, the forwarder computes a hash over the 5-tuple modulo M, looks up the table, and forwards to whatever backend is stored there. No per-session state is needed avoiding the need for session matching and lots of RAM, and the flow lookup can be done super efficiently.\nThe Maglev new flow table The interesting part is how that lookup table is filled. I learn that a simple approach might be to divide M slots evenly among N backends. That would work, but removing a backend would shift every remaining backend\u0026rsquo;s range, disrupting all flows and resetting TCP connections all over the place. Maglev uses a smarter fill algorithm:\nFor each backend i, derive two independent hash values from its identity (typically its IP address): an offset and a skip value. These define a preference list — a permutation of all M slots that this backend would like to occupy, in preference order. Iterate over all backends round-robin. Each backend claims its next preferred slot if it is still free. Continue until every slot is filled. The result is a table where each backend occupies approximately M/N slots, the distribution is uniform, and most importantly, adding or removing one backend only displaces approximately 1/N of the flows. All other flows keep hashing to the same backend. Slick!\nThe Maglev existing flow hash table Consistent hashing handles the common case well, but there is one subtlety: the hashing guarantees that the same 5-tuple always maps to the same backend, but only as long as the set of backends does not change. If a backend is added mid-stream, a fraction of existing TCP connections will start hashing to a different backend.\nTo protect long-lived connections, Maglev keeps a small per-CPU flow hash table: an LRU cache of recently seen 5-tuple to backend mappings. For every packet:\nLook up in the Maglev flow hash table. On a hit, forward to the cached backend (even if the Maglev table would now say something different). On a miss, look up the Maglev new-flow table, select the backend, and insert the mapping into the flow hash table. The flow hash table does not need to be exhaustive — it only needs to cover active connections. An LRU eviction policy handles the rest. 
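To make the fill algorithm concrete, here is a small standalone C sketch. The table size and hash function are deliberately toy choices (the real plugin uses its own hashes and a configurable table size), but the slot-claiming logic follows the description above:

/* Toy Maglev new-flow-table fill: each backend derives an (offset, skip)
 * permutation over the M slots and claims its next preferred free slot,
 * round-robin, until the table is full. Hashes and M are illustrative. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define M 13 /* table size: a prime, kept tiny here for readability */

/* stand-in for a real hash over the backend identity (e.g. its address) */
static uint32_t toy_hash (const char *s, uint32_t seed)
{
  uint32_t h = seed;
  while (*s)
    h = h * 31 + (uint8_t) *s++;
  return h;
}

int main (void)
{
  const char *backend[] = { "10.0.0.1", "10.0.0.2", "10.0.0.3" };
  const int n = 3;
  uint32_t offset[3], skip[3], next[3] = { 0, 0, 0 };
  int table[M];
  int filled = 0;

  memset (table, -1, sizeof (table)); /* -1 means "slot still free" */
  for (int i = 0; i < n; i++)
    {
      offset[i] = toy_hash (backend[i], 0x2937) % M;
      skip[i] = toy_hash (backend[i], 0x7e4f) % (M - 1) + 1;
    }

  /* round-robin over backends: each claims its next preferred free slot */
  while (filled < M)
    for (int i = 0; i < n && filled < M; i++)
      {
        uint32_t slot;
        do
          {
            slot = (offset[i] + next[i] * skip[i]) % M;
            next[i]++;
          }
        while (table[slot] != -1);
        table[slot] = i;
        filled++;
      }

  for (int s = 0; s < M; s++)
    printf ("slot %2d -> %s\n", s, backend[table[s]]);
  return 0;
}

Removing a backend and re-running the fill only hands out the slots that backend held; the remaining backends keep following their own preference lists, so their slots barely move.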
This means the load balancer is mostly stateless, as the Maglev table is deterministic and identical on every CPU, with just enough per-connection state to protect existing TCP sessions from transient backend changes.\nVPP LB: Plugin anatomy The VPP load balancer plugin lives in src/plugins/lb/. Its core data structures map directly to the Maglev design:\nVIP (Virtual IP): a prefix plus an optional {protocol, port} pair. This is the public-facing address that clients connect to. A VIP can be protocol-agnostic and forward all traffic to its backends, or it can be port-specific and forward only for example TCP/443 to its backends. AS (Application Server): a backend endpoint associated with a VIP. The plugin maintains a list of active ASes per VIP. New flow table: the Maglev lookup table, computed from the active AS list whenever an AS is added or removed. Size is configurable, defaulting to 1024 entries. It is filled by the clever algorithm described above. Flow hash table: per-worker LRU hash table of recent {5-tuple → AS} mappings. This is the connection affinity cache described above. Encapsulation: packets are forwarded to the AS by encapsulating them in either GRE (GRE4 or GRE6), or via L3DSR (direct server return using DSCP remarking). The AS decapsulates and responds directly to the client, bypassing the load balancer on the return path. When a new flow arrives, VPP computes a hash over the 5-tuple modulo the length of its new_flow_table, it then looks up the backend that will serve this client, stores it in the per-worker flow hash table, and encapsulates the packet towards the AS. Subsequent packets for the same 5-tuple hit the flow hash table directly, skipping the Maglev lookup entirely.\nA garbage collection timer periodically walks the flow table and removes entries for backends that have become inactive, preventing stale flows from reaching a long-gone AS. Operators can also remove these AS, and flush existing connections to them.\nObservations After reading the LB code in VPP, I am ready to make a few observations.\n1: Lameduck I have the choice of \u0026lsquo;remove AS from VIP\u0026rsquo;, by removing it from the Maglev new-flow table it will not get new flows assigned but if there are long-lived clients, the server will keep connections open potentially indefinitely. A good example is a websocket that streams data between a client and the webserver: it never disconnects!\nMy other choice is to \u0026lsquo;Remove and flush AS from VIP\u0026rsquo;, which will also remove it from being eligible for new flows, but forcibly remove all existing flows from the flow hash table at the same time. Yikes.\nI want a middle ground, operationally:\nRemove AS from VIP for new connections while keeping existing ones for a grace period. This is commonly referred to as lameduck mode. Remove AS from VIP for all connections, which will reset any lingering connections and move them to another backend where they reconnect and continue on their journey. 2: Slow undrain: From my own experience, adding a new AS needs to often be done carefully, for two reasons. First, sloshing traffic around can overwhelm a new / freshly started server which does lazy initialization (for example, a Java binary). Second, a new server may have a different configuration on purpose, for example different version of the server binary, or different parameters like caching flags and what-not. It may be good to ease in traffic and inspect it for a little while before bringing full load onto the server. 
This is commonly referred to as a canary backend. I\u0026rsquo;ll come back to this later.\nVPP LB: Bugs While playing around with the plugin\u0026rsquo;s binary API, I ran into a collection of bugs that made the plugin largely unusable via the API (as opposed to the CLI). I fixed those in Gerrit [45428].\nIPv4 VIP prefixlen offset bug: lb_add_del_vip() was computing the prefix length incorrectly for IPv4 addresses due to an off-by-one in the address family handling, producing VIPs that silently matched no traffic.\nWrong encap type on VIP create: Both lb_add_del_vip() and lb_add_del_vip_v2() were passing the encapsulation type through an incorrect enum mapping, so a VIP created with GRE4 encap via the API would actually end up configured with a different encap type internally.\nlb_vip_dump() returning wrong fields: The dump handler was returning a stale encap type and an incorrect protocol value, making it impossible to verify what was actually configured via the API.\nlb_as_dump() port filter broken: The AS dump call accepts an optional VIP filter. The port comparison was being done against an uninitialized variable, causing the filter to miss entries or match wrong ones depending on stack contents.\nMissing lb_conf_get(): There was no API call to retrieve the global LB configuration (flow table size, timeout values). I added lb_conf_get() so an operator or controlplane can verify the running configuration without resorting to CLI parsing.\n\u0026lsquo;show lb vips\u0026rsquo; unformatting error: The CLI handler dereferenced a pointer that is only valid in verbose mode, causing unexpected output (and a possible crash!) on a plain show lb vips.\nGC only triggered by CLI input: The garbage collector for the flow table was only invoked when the operator typed a CLI command. On a production load balancer, stale flow entries would accumulate indefinitely. So I added a periodic GC timer that automatically cleans up the flow hash table.\nWhile discussing on the vpp-dev mailing list, my buddy Jerome Tollet independently found two of these bugs (the encap type mismatch and the dump port filter) and reported them during review. Both are addressed in the latest patchset.\nVPP LB: New Feature - Weights My attempt to address the two observations above comes from an insight that they are actually the same class of problem: I want to be able to set a variable amount of traffic anywhere from 100% all the way down to 0% of load that a given backend is capable of handling, and I want to be able to flush (remove existing flows from the flow hash table) independently of the new-flow assignment. This is commonly referred to as weights in a load balancer, and in Gerrit [45487] I add per-AS weights to the Maglev new flow table, and decouple \u0026lsquo;flush\u0026rsquo; from \u0026lsquo;set weight\u0026rsquo; semantically.\nThe motivation comes from the two operational scenarios I kept running into while testing the plugin:\n1. Draining a backend without disrupting existing sessions. When a backend needs to go down for maintenance, the only option was lb as del flush, which both removes the AS and flushes the flow table. Flushing the flow table is disruptive: all existing TCP sessions that were pinned to any backend suddenly need to re-select, causing a brief spike of misdirected packets. What I actually want is to stop sending new flows to the AS while letting existing sessions drain naturally.\n2. Introducing a new backend gradually. 
When adding a new AS to a busy VIP, the Maglev algorithm immediately assigns it ~1/N of the new-flow table slots. On a VIP handling tens of thousands of new connections per second, that is a lot of traffic hitting a backend that may not yet be fully warmed up (think JVM JIT, filled caches, established database connections). It would be useful to introduce the new AS slowly and ramp it up over time.\nMy solution for both is to allow each AS to carry a weight in the range 0–100, which controls what fraction of the new flow table slots it is allowed to occupy:\nweight 100 (default): the AS gets its full ~1/N share of slots. This is the existing behavior, and remains the default. weight 1–99: the AS gets a proportionally smaller share. Useful for gradual introduction as well as gradual removal. weight 0: the AS gets no slots in the new flow table — no new flows are sent to it. The flow table entries for existing sessions remain intact, so those connections keep working until they naturally expire. The Maglev fill algorithm is made weight-aware by scaling each AS\u0026rsquo;s preference list length proportionally to its weight. The sort order is deterministic (sorted by (replica, address)) so the resulting table is identical regardless of the order ASes were added, which also has a bonus side effect of making anycast and ECMP VIPs work correctly.\nBecause VPP developers do not change API signatures once they are published, I added a few new API calls instead:\nlb_as_add_del_v2() — creates or deletes an AS with an explicit weight, and optionally flushes the flow table for that AS on deletion. lb_as_dump_v2() — returns the weight and the number of new-flow-table buckets currently assigned to each AS, which is useful for verifying the distribution. lb_as_set_weight() — changes the weight of an existing AS in place, optionally flushing the flow table, without needing to delete and recreate the AS. From the CLI, the weight is set with:\nvpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 vpp# lb as 192.0.2.0/32 10.0.0.1 weight 1 vpp# lb as 192.0.2.0/32 10.0.0.1 weight 10 vpp# lb as 192.0.2.0/32 10.0.0.1 weight 100 vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 vpp# lb as 192.0.2.0/32 10.0.0.1 weight 0 flush In the sequence above, backend AS 10.0.0.1 starts off fully drained, then getting a token amount of traffic by setting it to weight 1, then 10, and finally 100. When the backend needs to be removed, I can set weight 0 which will put it in lameduck mode but keep existing flows alive. A few minutes later, I can set it to weight 0 flush which will remove the remaining existing flows. The backend then can be safely removed, without having to wait 5+ hours like I did with the uncontrolled DNS \u0026lsquo;drain\u0026rsquo;.\nVPP LB: New Feature - Punt Unknown I\u0026rsquo;m still on the fence on this feature, but since I wrote it .. Gerrit [45431] adds a punt flag to port-based VIPs.\nBy default, when a VIP is configured with a specific protocol and port (e.g. TCP/443), any packet that arrives at that VIP\u0026rsquo;s address but does not match the configured {protocol, port} pair is sent by VPP to error-drop. This is the correct behavior for most cases: if I am load balancing TCP/443, I do not want stray UDP packets forwarded anywhere.\nThe problem is that this also drops ICMP. If an operator runs traceroute towards the VIP, or sends an ICMP echo, or a client receives an ICMP unreachable, all of that is silently discarded. 
This makes the VIP opaque from the network\u0026rsquo;s perspective and can complicate debugging.\nWhen creating a port-based VIP, I decide to add a punt flag, so any traffic that does not match the configured protocol/port pairs on the VIP will newly be punted to the local IP stack (ip4-local or ip6-local) instead of dropped. To make this work, I ask VPP to insert the VIP\u0026rsquo;s address into the FIB at a higher priority than device routes, so the punt path is actually reachable. This allows the load balancer to handle TCP/443 (or whatever protocol/port combinations are configured) while the local stack takes care of ICMP, traceroute, and anything else that arrives at that address and is not a part of the maglev configuration.\nThe punt flag is only permitted on port-based VIPs — on a protocol-agnostic VIP there is nothing left to punt, since all traffic is already matched and forwarded to application servers.\nEnabling this from the CLI is straightforward, at creation time:\nvpp# loopback create interface instance 0 vpp# lcp create loop0 host-if maglev0 vpp# set int state loop0 up vpp# set int ip address loop0 192.0.2.0/32 vpp# lb vip 192.0.2.0/32 protocol tcp port 443 encap gre4 punt In this configuration snippet, I first create a simple loopback device with a given IPv4 address, and plumb it through to Linux using the [Linux CP] plugin. This makes it reachable, I can ping it and traceroute to it just like any other Linux Interface Pair LIP. Then, I steal some traffic from it, by creating an LB VIP on this address. Without this feature, the VIP would become unreachable, as the LB plugin would take all traffic destined to the IPv4 address. But with the punt keyword, any traffic not matching the LB VIP(s) on this address, will be sent onwards to the IP stack and end up in Linux. For those of us who like pinging their VIPs, the punt feature flag on VIPs will come in handy.\nFor the same reason as with the other feature I wrote, I need to add new API calls rather than changing existing ones, so here I go:\nlb_add_del_vip_v3() — adds a is_punt flag to the VIP creation call. lb_vip_dump_v2() — returns is_punt in the VIP details, so an operator or controlplane can verify the configuration. What\u0026rsquo;s Next I am going to use Maglev at IPng Networks to load balance our services like SMTP, IMAP, HTTP, DNS and what-not. But before I can do that, I\u0026rsquo;m going to want to write some sort of controlplane that can manipulate the VIPs, AS weights, and do things like health checking. I\u0026rsquo;m inspired by [HAProxy] which I used to use way back when. I find its health checking algorithm particularly clever, so I will give that codebase a good read and with what I learn, create a health checking VPP Maglev controlplane which will give me much better insight into what traffic goes where.\nStay tuned!\n","date":"2026-04-30","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nLoad balancing is one of those topics that sounds deceptively simple until you think about it for a while. 
In this article I take the VPP load balancer plugin out for a spin, fix a handful of API bugs, and add two small new features that make running it in production a little bit easier.\n","permalink":"https://ipng.ch/s/articles/2026/04/30/vpp-with-maglev-loadbalancing-part-1/","section":"articles","title":"VPP with Maglev Loadbalancing - Part 1"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nSegment Routing is a lesser known technique that allows network operators to determine a path through their network by encoding the path inside headers in the packet itself, rather than relying on the IGP to determine the path. Originally created to help traffic engineering of MPLS packets, the concepts were carried forward for IPv6 as well.\nIn this article I take SRv6 out for a spin, implement some missing features in VPP, and stumble across, and manage to fix a nasty bug in its implementation.\nIntroduction SRv6 - Segment Routing for IPv6 - is defined in a number of RFCs.\n[RFC 8402]: Segment Routing Architecture. This document describes the fundamentals. It defines the general concepts of Segment Routing (nodes, segments, and steering) for both MPLS and IPv6. [RFC 8754]: IPv6 Segment Routing Header (SRH). This RFC defines the specific IPv6 Extension Header used for SRv6. It explains how segments are listed and how the Segments Left field works. [RFC 8986]: SRv6 Network Programming. This one describes the so-called \u0026ldquo;behaviors\u0026rdquo; associated with a Segment ID (SID). It defines functions like End (Endpoint), End.X (Layer-3 cross-connect), and End.DT4/6 (VRF decapsulation). While reading these RFCs, I learn that I can configure an SRv6 path through the network that picks up an ethernet packet on the ingress, and decapsulates and cross connects that ethernet packet to an interface on the egress: an L2VPN using Ethernet-over-IPv6. That sounds dope to me!\nSRv6 in VPP - Segment Routing Header For the dataplane, there are two parts of note. Firstly, when an IPv6 packet arrives with an IPv6 extension header, the so-called Segment Routing Header or SRH, any router supporting SRv6 needs to inspect it. The presence of an SRH changes the forwarding logic from a simple \u0026ldquo;look at the destination, do a FIB lookup for next hop, and send the packet on its merry way\u0026rdquo; to a more customized \u0026ldquo;process the instruction and update the IPv6 headers\u0026rdquo; kind of thing.\nIn IPv6, an (almost) arbitrary amount of headers can be chained from the base IPv6 packet header, to the ultimate layer4 protocol header like ICMP, TCP or UDP. In IPv4, this is not the case, there is only the L3 header (IPv4) and the L4 header (TCP/UDP/ICMP etc). These intermediate headers are called Routing Extension headers, and the SRH is the one with type 4.\nThe fields in this header are:\nNext Header: Identifies the type of header following the SRH. It can be another routing extension header or it might be the Layer4 protocol header like TCP, UDP or ICMP. Flags: IANA loves reserving optionality for the future. The authors of SRv6 added an 8-bit flags field, but none of them have been assigned yet. Tag: Moar optionality! 
This 16-bit tag is not defined in the RFC, simply stating that The allocation and use of tag is outside the scope of this document. OK then! Segments Left (SL): A counter indicating how many intermediate nodes still need to be visited. Last Entry: The index (starting from 0) of the last element in the Segment List. Segment List: This is an array of 128-bit IPv6 addresses, listed in reverse order of the path. The first segment to be visited is at the highest index. (optional) TLVs: These Type-Length-Value objects can encode other information, like HMAC signatures, operational and performance monitoring data, and so on. SRv6: Anatomy Much like magnets, you might be wondering SRv6 Routers: How do they work?. There are really only three relevant things: SR Policy (they determine how packets are steered into the SRv6 routing domain), SRv6 Source nodes (they handle the ingress part), and SRv6 Segment Endpoint Nodes (they handle both the intermediate routers that participate in SRv6, and also the egress part where the packet leaves the SRv6 routing domain).\nSRv6: Policies A Segment Routing Policy is the same for MPLS and SRv6. They are represented by either a stack of MPLS labels, or by a stack of IPv6 addresses, and they are uniquely identified by either an MPLS label or an IPv6 address as well. The identifier is called a Binding Segment ID or BSID, and the elements of the list are called Segment IDs or SIDs.\nBSID := SID [, SID] [, SID] ... 8298::1 := 2001:db8::1 , 2001:db8::2 , 2001:db8::3 These policies are written to the FIB in the router. I can now do a lookup for 8298::1, and find that it points to this SR Policy object with the list of three IPv6 addresses. In the case of MPLS, the BSID will be in the MPLS FIB and point at a list of three MPLS labels, but I\u0026rsquo;m going to stop talking about MPLS now :)\nSRv6: Source Node An SR Source Node originates an IPv6 packet with a Segment in the destination address, and it optionally adds an SRH with a list of instructions for the network. The SR Source Node is the ingress point and enables SRv6 processing in the network, which is called steering. Instead of setting the destination address to the final destination, the source node will set it to the first Segment, which is the first router that needs to be visited.\nSRv6: Transit Node Spoiler alert! This node type doesn\u0026rsquo;t have anything to do with SRv6. SRv6 packets really do look like normal packets, the IPv6 source address is the Source Node, and the destination address is the Transit Node, which can just forward it like any other packet using their routing table. Notably, those routers are not actively participating in SRv6 and they don\u0026rsquo;t need to know anything about it.\nSRv6: Segment Endpoint Node The Segment Endpoint Node is a router that is SRv6 capable. A packet may arrive with a locally configured address in the IPv6 destination. The magic happens here - one of two things can occur:\nThe Segment Routing Header is inspected. If Segments Left is 0, then the next header (typically UDP, TCP, ICMP) is processed. Otherwise, the next segment is read from the Segment List, and the IPv6 destination address is overwritten with it. The Segments Left field is decremented. 
In this case the packet is routed normally through a bunch of potential transit routers, who are blissfully ignorant of what is happening, and onto a next Segment Endpoint router.\nThe IPv6 destination address might have an entry in the forwarding table which points at a specific local meaning, called a Local Segment ID or LocalSID. The LocalSID tells this router what to do, for example decapsulate the packet and do a next-hop lookup in a specific routing table, useful for L3VPNs; or perhaps an instruction to decapsulate the packet and cross connect it to a local interface, useful for L2VPN. The key insight here is, that the local FIB entry can carry any type of further instruction.\nVPP: IPng LAB At this point I\u0026rsquo;m pretty sure I\u0026rsquo;ve bored you to tears with all the RFC stuff and theory. I do think that segment routing (both the MPLS and the SRv6 variant) are sufficiently complex that taking a read of the main RFCs at least once is useful. But for me, the fun part is seeing it work in practice. So I boot the [IPng Lab], which looks a bit like this.\nIn this environment, each of the VPP routers is running Bird2 with OSPF and OSPFv3. They are connected in a string, and each VPP router has an interface (Gi10/0/2) connected to a debian host called host0-0 (at the bottom), as well as an interface (Gi10/0/3) connected to a host called host0-1 (at the top). One really cool feature of the LAB is that all links are on an OpenVSwitch which is mirroring all traffic to a tap host called tap0-0, so I can see traffic clearly:\nroot@vpp0-0:/etc/bird# ping -n 2001:678:d78:200::3 -c1 PING 2001:678:d78:200::3 (2001:678:d78:200::3) 56 data bytes 64 bytes from 2001:678:d78:200::3: icmp_seq=1 ttl=62 time=3.24 ms --- 2001:678:d78:200::3 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 3.240/3.240/3.240/0.000 ms root@tap0-0:~# tcpdump -eni enp16s0f0 tcpdump: verbose output suppressed, use -v[v]... 
for full protocol decode listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes 10:39:23.558942 52:54:00:f0:11:01 \u0026gt; 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 122: vlan 20, p 0, ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 \u0026gt; 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64 10:39:23.558942 52:54:00:f0:11:11 \u0026gt; 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 122: vlan 21, p 0, ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 \u0026gt; 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64 10:39:23.559993 52:54:00:f0:11:21 \u0026gt; 52:54:00:f0:11:30, ethertype 802.1Q (0x8100), length 122: vlan 22, p 0, ethertype IPv6 (0x86dd), 2001:678:d78:201::1:0 \u0026gt; 2001:678:d78:200::3: ICMP6, echo request, id 12, seq 1, length 64 10:39:23.560179 52:54:00:f0:11:30 \u0026gt; 52:54:00:f0:11:21, ethertype 802.1Q (0x8100), length 122: vlan 22, p 0, ethertype IPv6 (0x86dd), 2001:678:d78:200::3 \u0026gt; 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64 10:39:23.561070 52:54:00:f0:11:20 \u0026gt; 52:54:00:f0:11:11, ethertype 802.1Q (0x8100), length 122: vlan 21, p 0, ethertype IPv6 (0x86dd), 2001:678:d78:200::3 \u0026gt; 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64 10:39:23.561248 52:54:00:f0:11:10 \u0026gt; 52:54:00:f0:11:01, ethertype 802.1Q (0x8100), length 122: vlan 20, p 0, ethertype IPv6 (0x86dd), 2001:678:d78:200::3 \u0026gt; 2001:678:d78:201::1:0: ICMP6, echo reply, id 12, seq 1, length 64 Here you can see the packet path from vpp0-0 sending one ICMPv6 echo request to vpp0-3, which responded with one ICMPv6 echo reply. I can see the packet on vlan 20, 21, 22 on the way out, and back again on vlan 22, 21 and 20.\nVPP: SRv6 Example Alright, here I go! With the following short snippet, I can sum up all of the theory above in a practical first example:\nvpp0-0# set sr encaps source addr 2001:678:d78:200:: vpp0-0# sr policy add bsid 8298::2:1 next 2001:678:d78:20F::3:1 encap vpp0-0# sr steer l2 GigabitEthernet10/0/2 via bsid 8298::2:1 vpp0-0# sr localsid address 2001:678:d78:20f::0:1 behavior end.dx2 GigabitEthernet10/0/2 vpp0-0# set int state GigabitEthernet10/0/2 up Looking at what I typed on vpp0-0, first I tell the system that its encapsulation source address is its IPv6 loopback address. Then I add a Binding SID with one Segment ID and I instruct this policy to encapsulate the packet. Then, I add an L2 steering from interface Gi10/0/2 via this BSID. At this point, vpp0-0 knows that if an ethernet frame comes in on that interface, it needs to encapsulate it in SRv6 from 2001:678:d78:200:: and send it to 2001:678:d78:20F::3:1. Finally, I tell the system that if an IPv6 packet arrives with destination address 2001:678:d78:20f::0:1, that it needs to decapsulate it and send the resulting L2 datagram out on Gi10/0/2.\nThere is one last thing I have to do, and that\u0026rsquo;s somehow attract this 2001:678:d78:20F::0:0/112 prefix to vpp0-0 and 2001:678:d78:20F::3:0/112 prefix to vpp0-3. I can do this by adding the prefix to loop0, like so:\nvpp0-0# create loopback interface instance 0 vpp0-0# set interface state loop0 up vpp0-0# set interface ip address loop0 192.168.10.0/32 vpp0-0# set interface ip address loop0 2001:678:d78:200::0/128 vpp0-0# set interface ip address loop0 2001:678:d78:20F::0:0/112 This will be picked up in OSPFv3, and all routers will install a FIB entry pointing at vpp0-0 for the /112. 
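For traffic to flow back, vpp0-3 needs the mirror image of this configuration: steer Gi10/0/3 into a policy that points at vpp0-0's LocalSID, and terminate 2001:678:d78:20f::3:1 with End.DX2 on Gi10/0/3. Roughly like this, using the same CLI as above (the return-path BSID 8298::1:1 is just an illustrative value I picked, not taken from the lab):

vpp0-3# set sr encaps source addr 2001:678:d78:200::3
vpp0-3# sr policy add bsid 8298::1:1 next 2001:678:d78:20F::0:1 encap
vpp0-3# sr steer l2 GigabitEthernet10/0/3 via bsid 8298::1:1
vpp0-3# sr localsid address 2001:678:d78:20f::3:1 behavior end.dx2 GigabitEthernet10/0/3
vpp0-3# set int state GigabitEthernet10/0/3 up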
Did it work?\nroot@host0-0:~# ping6 ff02::1%enp16s0f0 PING ff02::1%enp16s0f0 (ff02::1%enp16s0f0) 56 data bytes 64 bytes from fe80::5054:ff:fef0:1000%enp16s0f0: icmp_seq=1 ttl=64 time=0.156 ms 64 bytes from fe80::5054:ff:fef0:1013%enp16s0f0: icmp_seq=1 ttl=64 time=4.03 ms ^C --- ff02::1%enp16s0f0 ping statistics --- 1 packets transmitted, 1 received, +1 duplicates, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.156/2.092/4.029/1.936 ms Yes, it worked! I love it when a plan comes together! This IPv6 address that I pinged, ff02::1 is called all-hosts, and I can see one reply from fe80::5054:ff:fef0:1000 which is host0-0\u0026rsquo;s own link-local address, and a second reply from fe80::5054:ff:fef0:1013 which is host0-1\u0026rsquo;s address. I have created a point to point L2VPN or Virtual Leased Line between vpp0-0:Gi10/0/2 and vpp0-3:Gi10/0/3 and any ethernet traffic between these two ports is passed through the network as IPv6 packets including segment routing. Nice going!\nSRv6 on the Wire I learn something curious. I configure an IPv4 address on both hosts:\nroot@host0-0:~# ip addr add 192.0.2.0/31 dev enp16s0f0 root@host0-1:~# ip addr add 192.0.2.1/31 dev enp16s0f3 root@host0-1:~# ping 192.0.2.0 PING 192.0.2.0 (192.0.2.0) 56(84) bytes of data. 64 bytes from 192.0.2.0: icmp_seq=1 ttl=64 time=5.27 ms ^C --- 192.0.2.0 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 5.274/5.274/5.274/0.000 ms And then I take a look at this IPv4 ICMP packet on the wire:\n11:03:22.118770 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 30, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 35014, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 50, seq 1, length 64 11:03:22.119078 52:54:00:f0:11:01 \u0026gt; 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 156: vlan 20, p 0, ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 63, next-header Ethernet (143) payload length: 98) 2001:678:d78:200:: \u0026gt; 2001:678:d78:20f::3:1: 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 35014, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 50, seq 1, length 64 The first packet is coming in on vlan 30 (host0-0:enp16s0f0 to vpp0-0:Gi10/0/2). I then see it go out on vlan 20 (from vpp0-0 to vpp0-1). I see it is an IPv6 packet from 2001:678:d78:200:: (the encapsulation address I configured), and to 2001:678:d78:20f::3:1 (the BSID resolves to an SR Policy with a single segment: this address), and then I see the Ethernet inner payload with the ICMP echo packet. But where\u0026rsquo;s the Segment Routing Header??\nIt is here that I learn why the RFC says that SRH are optional. This packet has everything it needs to have using the destination address, 2001:678:d78:20f::3:1, which is routed towards the loopback interface of vpp0-3. There, it is looked up in the FIB and the Local Segment ID or LocalSID determines that packets to this address must be decapsulated and forwarded out on vpp0-3:Gi10/0/3.\nVPP: Let\u0026rsquo;s ZigZag So how do I get these elusive SRH headers? Easy: make more than one segment in the BSID, because then, the SR Source Node will have to encode it in the Segment List, for which it needs to construct an SRH.\nI want to tell vpp0-0 to do some scenic routing. I want it to send the packet first to vpp0-2, then vpp0-1 and then vpp0-3. 
I struggle a little bit, because how should I construct the Segment List ? If I put vpp0-2\u0026rsquo;s loopback address in there, the packet will be seen as local, and sent for local processing, in VPP\u0026rsquo;s ip6-receive node. I don\u0026rsquo;t want that to happen, instead I want VPP to inspect the SRH in this case. After reading a little bit in src/vnet/srv6/sr_localsid.c, I realize the trick is simple (once you know it, of course): I need to tell all routers to handle a specific localsid as End behavior, which will make the intermediate routers run end_srh_processing() which processes the SRH and does the destination swap.\nvpp0-3# sr localsid address 2001:678:d78:20F::3:ffff behavior end vpp0-2# sr localsid address 2001:678:d78:20F::2:ffff behavior end vpp0-1# sr localsid address 2001:678:d78:20F::1:ffff behavior end vpp0-0# sr localsid address 2001:678:d78:20F::0:ffff behavior end vpp0-0# sr policy add bsid 8298::2:2 next 2001:678:d78:20F::2:ffff next 2001:678:d78:20F::1:ffff next 2001:678:d78:20f::3:1 encap Now each router knows that if an IPv6 packet is destined to its :ffff address, that it needs to \u0026ldquo;End\u0026rdquo; the segment by inspecting the SRH. And the SR Policy for vpp0-0 is to send it first to ::2:ffff, which is vpp0-2, which now inspects the SRH and advances the Segment List.\nThe proof is in the tcpdump pudding, and it makes me smile to see the icmp-echo packet bounce back and forward on its scenic route:\nroot@tap0-0:~# tcpdump -veni enp16s0f0 src 2001:678:d78:200:: tcpdump: listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes 12:15:39.442587 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 30, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 5534, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 561, length 64 12:15:39.501353 52:54:00:f0:11:01 \u0026gt; 52:54:00:f0:11:10, ethertype 802.1Q (0x8100), length 212: vlan 20, p 0, ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 63, next-header Routing (43) payload length: 154) 2001:678:d78:200:: \u0026gt; 2001:678:d78:20f::2:ffff: RT6 (len=6, type=4, segleft=2, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff) 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 6, length 64 12:15:39.501902 52:54:00:f0:11:11 \u0026gt; 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0, ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 62, next-header Routing (43) payload length: 154) 2001:678:d78:200:: \u0026gt; 2001:678:d78:20f::2:ffff: RT6 (len=6, type=4, segleft=2, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff) 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 6, length 64 12:15:39.502658 52:54:00:f0:11:20 \u0026gt; 52:54:00:f0:11:11, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0, ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 61, next-header Routing (43) payload length: 154) 2001:678:d78:200:: \u0026gt; 2001:678:d78:20f::1:ffff: RT6 (len=6, type=4, segleft=1, last-entry=2, flags=0x0, tag=0, 
[0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff) 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 6, length 64 12:15:39.502990 52:54:00:f0:11:11 \u0026gt; 52:54:00:f0:11:20, ethertype 802.1Q (0x8100), length 212: vlan 21, p 0, ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 60, next-header Routing (43) payload length: 154) 2001:678:d78:200:: \u0026gt; 2001:678:d78:20f::3:1: RT6 (len=6, type=4, segleft=0, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff) 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 6, length 64 12:15:39.503813 52:54:00:f0:11:21 \u0026gt; 52:54:00:f0:11:30, ethertype 802.1Q (0x8100), length 212: vlan 22, p 0, ethertype IPv6 (0x86dd), (flowlabel 0x09d8f, hlim 59, next-header Routing (43) payload length: 154) 2001:678:d78:200:: \u0026gt; 2001:678:d78:20f::3:1: RT6 (len=6, type=4, segleft=0, last-entry=2, flags=0x0, tag=0, [0]2001:678:d78:20f::3:1, [1]2001:678:d78:20f::1:ffff, [2]2001:678:d78:20f::2:ffff) 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 64406, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 6, length 64 12:15:39.525605 52:54:00:f0:10:00 \u0026gt; 52:54:00:f0:10:13, ethertype 802.1Q-QinQ (0x88a8), length 102: vlan 43, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 5534, offset 0, flags [DF], proto ICMP (1), length 84) 192.0.2.0 \u0026gt; 192.0.2.1: ICMP echo request, id 51, seq 561, length 64 The echo-request packet can be observed seven times:\ncoming in on vlan 30 (between host0-0 and vpp0-0:Gi10/0/2), here it is simply an IPv4 packet. on vlan 20, encapsulated in an IPv6 packet, this time including SRH header showing where it is expected to go. on vlan 21, because the first segment wants the packet to go to vpp0-2. and vpp0-1 is acting as a transit router (just normally using IPv6 FIB lookup to pass it along) on vlan 21 again, because when vpp0-2 got it, it decremented the SRH Segments Left from 2 to 1, and sent it to the second segment, which is onwards to vpp0-1. on vlan 21 yet again, because when vpp0-1 got it, it decremented the SRH Segments Left from 1 to 0, and sent it to the third and final segment, which is onwards to vpp0-3. on vlan 22, because vpp0-2 is acting as a transit router here (the destination is now vpp0-3, not its own localsid), using its FIB to pass it along to vpp0-3, which decapsulates it with End.DX2 and sends it as an L2 packet on Gi10/0/3. coming out of vlan 43 (between vpp0-3:Gi10/0/3 and host0-1), where it is simply an IPv4 packet again. Some folks find it easier to visualize packets by looking at Wireshark output. I grabbed one of the packets from the wire, and here\u0026rsquo;s what it looks like:\nThe screenshot shows the packet observed on step 4 above - it is coming from vpp0-0\u0026rsquo;s loopback address and destined to the End localsid on vpp0-1, and I can see that the SRH has the list of 3 Segments in reversed order, where Address[0] is the final destination: a LocalSID on vpp0-3 configured as End.DX2. 
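For reference, this is the header being decoded there, sketched as a C struct using the RFC 8754 field names (VPP has its own internal definition of this header in the srv6 code; the struct below is purely illustrative):

/* The SRH as it appears on the wire, field names per RFC 8754.
 * Illustrative sketch, not VPP's internal definition. */
#include <stdint.h>

typedef struct { uint8_t addr[16]; } ip6_addr_t;

typedef struct __attribute__ ((packed)) {
  uint8_t  next_header;      /* 143 = Ethernet in the captures above */
  uint8_t  hdr_ext_len;      /* SRH length in 8-octet units, excluding the first 8 octets */
  uint8_t  routing_type;     /* 4 = Segment Routing Header */
  uint8_t  segments_left;    /* segments still to be visited */
  uint8_t  last_entry;       /* index of the last element in the segment list */
  uint8_t  flags;            /* 8 bits, none assigned yet */
  uint16_t tag;              /* use is outside the scope of the RFC */
  ip6_addr_t segment_list[]; /* 128-bit segments, in reverse order of the path */
} ipv6_sr_header_t;

_Static_assert (sizeof (ipv6_sr_header_t) == 8, "fixed part of the SRH is 8 octets");

With three segments, the hdr_ext_len works out to 6, matching the len=6 that tcpdump prints above (3 x 16 bytes = 48 bytes = 6 eight-octet units).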
I can also see that Segments Left is set to 1.\nVPP has a few relevant dataplane nodes:\nsr-pl-rewrite-encaps-l2: This node encapsulates ethernet at the ingress point by steering packets into an SR Policy named by its Binding Segment ID sr-localsid: This node implements End behavior, in this case sending to the next Segment Router by looking up its Local Segment ID in the FIB sr-localsid-d: This node decapsulates the ethernet, on an End.DX2 behavior, by looking up its Local Segment ID in the FIB VPP: Adding SRv6 encap/decap on sub-interface A few years ago, I thought maybe it\u0026rsquo;d be cool to use SRv6 for L2VPN at IPng. But I was quickly disappointed because SRv6 encap and decap is only implemented on the device-input path which means it will not work with sub-interfaces.\nA few weeks ago, I worked on Gerrit [44654], which implements policers on sub-interfaces. I wrote about it in a [policer article], but since my brain\u0026rsquo;s instruction cache is still warm with the code I wrote to enable L2 features on input- and output, I thought I\u0026rsquo;d give it another go. If you\u0026rsquo;re not interested in the software engineering parts, you can stop reading now :-)\n0. Remove vlan_index everywhere\nThe original author followed the RFC, where there is an End.DX2V behavior that allows to decapsulate to a VLAN tag on an interface, but they never implemented it and added a note to the code to that effect. I can see why, DX2V is not idiomatic for VPP, but there\u0026rsquo;s an alternative. It would make more sense to decapsulate with End.DX2 to a sub-interface. So I removed this from the codebase in all places except the API functions, where I marked them as \u0026rsquo;not implemented\u0026rsquo;, which is true at this point anyway.\n1. Add feature bitmap entries\nI added L2INPUT_FEAT_SRV6 to l2_input.h. This allows me to turn on an SRv6 feature bit, and on ingress, send L2 datagrams from l2-input node directly to sr-pl-rewrite-encaps-l2 node, regardless of the interface being a PHY like Gi10/0/0 or a SUB like Gi10/0/0.100. It comes at a small CPU cost though, because moving on the device-input arc directly to the encapsulation node will skip a bunch of L2 processing, like L2 ACL, and VLAN TAG Rewriting (which doesn\u0026rsquo;t make sense on an untagged interface anyway). But, in return I can apply SRv6 encapsulation to any interface type.\n2. Precompute DX2 headerlen\nIn the case of an End.DX2 to a sub-interface, I need to add either 4 bytes (single tag) or 8 bytes (QinQ or QinAD double tag) to the packet length. I know which at creation time, because I can look that up from the to-be-DX2\u0026rsquo;d interface. I\u0026rsquo;ll store this in the localsid structure as ls-\u0026gt;l2_len (either 14, 18, or 22 bytes).\n3a. Connect to l2-input on ingress\nWhen enabling the sr steer with keyword encap, I need to change two things: first, I need to allow VNET_SW_INTERFACE_TYPE_SUB in addition to the already present VNET_SW_INTERFACE_TYPE_HARDWARE, and then if the steer policy is SR_STEER_L2, I remove the bits which initialize the feature arc on device-input, and instead, call set_int_l2_mode() in MODE_L2_XC (cross connect), but then I sneakily clear the feature bitmap bit for L2INPUT_FEAT_XCONNECT, and instead set my new L2INPUT_FEAT_SRV6 bit. This means that from now on, any L2 frames will get sent to node sr-pl-rewrite-encaps-l2 instead of l2-output which is what the L2XC would\u0026rsquo;ve done. 
Finally, I initialize the L2 feature bitmap next-nodes for the encapsulation node in function sr_policy_rewrite_init().\n3b. Connect to l2-output on egress\nI call l2output_create_output_node_mapping() on the (sub)-interface, so that traffic into it will go to l2-output, where I can inspect the feature bitmap to see if I need to send it to decapsulation or not. I also need to update sr_localsid_next to remove interface-output and replace it with l2-output so that egress traffic visits l2-output. In end_decaps_srh_processing(), I need to set the l2_len on the buffer, and change the next node to be SR_LOCALSID_NEXT_L2_OUTPUT instead of SR_LOCALSID_NEXT_INTERFACE_OUTPUT, so that sub-interface processing can occur (eg, VLAN Tag Rewriting, ACLs, SPAN, and so on).\n4. Fix a bug in sr_policy_rewrite_encaps_l2\nI kind of thought I would be done, and it did work, but I had about 75% packet loss and iperf performance was 20Mbps or so, while on the bench I usually expect 350+ Mbps. I scratched my head a little bit, but then found a bug in the quad-loop processing of sr_policy_rewrite_encaps_l2(). Maybe you can spot it too?\nif (vec_len (sp0-\u0026gt;segments_lists) == 1) vnet_buffer (b0)-\u0026gt;ip.adj_index[VLIB_TX] = sp0-\u0026gt;segments_lists[0]; else { vnet_buffer (b0)-\u0026gt;ip.flow_hash = flow_label0; vnet_buffer (b0)-\u0026gt;ip.adj_index[VLIB_TX] = sp0-\u0026gt;segments_lists[(vnet_buffer (b0)-\u0026gt;ip.flow_hash \u0026amp; (vec_len (sp0-\u0026gt;segments_lists) - 1))]; } if (vec_len (sp1-\u0026gt;segments_lists) == 1) vnet_buffer (b1)-\u0026gt;ip.adj_index[VLIB_TX] = sp1-\u0026gt;segments_lists[1]; else { vnet_buffer (b1)-\u0026gt;ip.flow_hash = flow_label1; vnet_buffer (b1)-\u0026gt;ip.adj_index[VLIB_TX] = sp1-\u0026gt;segments_lists[(vnet_buffer (b1)-\u0026gt;ip.flow_hash \u0026amp; (vec_len (sp1-\u0026gt;segments_lists) - 1))]; } Once I found this, I became quite certain that nobody uses L2 encapsulation in VPP, because if 4+ packets would be present in the vector, for the second through fourth packet (b1-b3), and if the segment list had length 1, then the segment list index would incorrectly be set to garbage segment_lists[1] rather than the first and only segment segment_list[0]. Yikes! But it explains perfectly why I had roughly 75% packetloss, lots of TCP retransmits, and terrible throughput. I fix this bug and SRv6 encap starts to work flawlessly.\n5. Add tests\nI decide to add four tests: for {PHY, SUB} x {Encap, Decap}. On the encap side, I create a SR Policy with BSID a3::9999:1 which encapsulates from source a3:: and sends to Segment List [a4::, a5::, a6::c7]. I then steer L2 traffic from interface pg0 using this BSID. I\u0026rsquo;ll generate a packet and want to receive it from pg1 encapsulated with the correct SRH and destination address. On the decap side, I create an SRv6 packet and send it into pg1, and want to see it decapsulated and exit on interface pg0.\nI try to get consistency by adding a send_and_verify_pkts() which takes an argument as a validator function, either compare_rx_tx_packet_T_Encaps_L2() or compare_rx_tx_packet_End_DX2(). These four tests succeed, look at me!\n============================================================================== SRv6 L2 Sub-Interface Steering Test Case [main thread only] ============================================================================== Test SRv6 End.DX2 decapsulation to a hardware (phy) interface. 1.53 OK Test SRv6 End.DX2 decapsulation to a sub-interface (VLAN). 
1.00 OK Test SRv6 L2 encapsulation on a hardware (phy) interface. 1.97 OK Test SRv6 L2 encapsulation on a sub-interface (VLAN). 1.93 OK ============================================================================== TEST RESULTS: Scheduled tests: 4 Executed tests: 4 Passed tests: 4 ============================================================================== Results With this change, it becomes possible to sr steer into a sub-interface, and to have an sr localsid that outputs to a sub-interface, which I can demonstrate like so:\nvpp0-0# create sub-interfaces GigabitEthernet10/0/2 100 vpp0-0# set int l2 tag-rewrite GigabitEthernet10/0/2.100 pop 1 vpp0-0# set int state GigabitEthernet10/0/2.100 up vpp0-0# sr policy add bsid 8298::2:2 next 2001:678:d78:20f::3:2 encap vpp0-0# sr steer l2 GigabitEthernet10/0/2.100 via bsid 8298::2:2 vpp0-0# sr localsid address 2001:678:d78:20f::0:2 behavior end.dx2 GigabitEthernet10/0/2.100 vpp0-3# create sub-interfaces GigabitEthernet10/0/3 200 vpp0-3# set int l2 tag-rewrite GigabitEthernet10/0/3.200 pop 1 vpp0-3# set int state GigabitEthernet10/0/3.200 up vpp0-3# sr policy add bsid 8298::2:2 next 2001:678:d78:20F::2 encap vpp0-3# sr steer l2 GigabitEthernet10/0/3.200 via bsid 8298::2:2 vpp0-3# sr localsid address 2001:678:d78:20f::3:2 behavior end.dx2 GigabitEthernet10/0/3.200 One thing to remember, is that when sub-interfaces are created and used in L2 mode, they have to get the [VLAN Gymnastics] applied to them. In VPP terminology, it means applying VTR or VLAN Tag Rewrite feature, where the tag is removed upon ingress, and re-added on egress. That way, the ethernet frame that gets put into the SRv6 L2VPN is untagged. It allows me to have different encapsulation on both sides.\nNow, for the moment suprème, on the two hosts, I can now create this sub-interface and use the tagged L2VPN also:\nroot@host0-0:~# ip link add link enp16s0f0 name enp16s0f0.100 type vlan id 100 root@host0-0:~# ip link set enp16s0f0.100 up mtu 1500 root@host0-0:~# ip addr add 192.0.2.128/31 dev enp16s0f0.100 root@host0-1:~# ip link add link enp16s0f3 name enp16s0f3.200 type vlan id 200 root@host0-1:~# ip link set enp16s0f3.200 up mtu 1500 root@host0-1:~# ip addr add 192.0.2.129/31 dev enp16s0f3.200 root@host0-1:~# ping 192.0.2.128 PING 192.0.2.128 (192.0.2.128) 56(84) bytes of data. 64 bytes from 192.0.2.128: icmp_seq=1 ttl=64 time=9.88 ms 64 bytes from 192.0.2.128: icmp_seq=2 ttl=64 time=4.88 ms 64 bytes from 192.0.2.128: icmp_seq=3 ttl=64 time=7.07 ms ^C --- 192.0.2.128 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2003ms rtt min/avg/max/mdev = 4.880/7.273/9.876/2.044 ms root@host0-1:~# ip nei | grep 200 192.0.2.128 dev enp16s0f3.200 lladdr 52:54:00:f0:10:00 REACHABLE fe80::5054:ff:fef0:1000 dev enp16s0f3.200 lladdr 52:54:00:f0:10:00 DELAY What\u0026rsquo;s Next I\u0026rsquo;ve sent the change, which is about ~850 LOC, off for review. You can follow along on the gerrit on [44899]. I\u0026rsquo;m happy to have fixed the quad-loop encap bug, but it does show me that SRv6 (at least in L2 transport mode) is not super common for VPP, perhaps not common in the industry? I am not convinced that I want to use this in production on AS8298, but if I did, the basic functionality would be adding an IPv6 prefix to each of the loopback devices, in order to attract traffic to the router, add an \u0026lsquo;End\u0026rsquo; localsid on every router so that they can participate in multi-hop SRv6, and add some static config to [vppcfg] to do the encap/decap for L2VPN. 
By the way, there\u0026rsquo;s a whole world of encap and decap behaviors, including L3VPN for IPv4, IPv6, GTP-U, and so on.\nFor me, I\u0026rsquo;ve still set my sights on eVPN VxLAN as a destination, because that will give me multi-point ethernet mesh akin to VPLS. However there\u0026rsquo;s a lot of ground to cover for me, considering IPng uses Bird2 as a routing controlplane. Bird2 is starting to get eVPN support, but there\u0026rsquo;s a lot for me to learn. Stay tuned!\n","date":"2026-02-21","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nSegment Routing is a lesser known technique that allows network operators to determine a path through their network by encoding the path inside headers in the packet itself, rather than relying on the IGP to determine the path. Originally created to help traffic engineering of MPLS packets, the concepts were carried forward for IPv6 as well.\n","permalink":"https://ipng.ch/s/articles/2026/02/21/vpp-srv6-l2vpn/","section":"articles","title":"VPP SRv6 L2VPN"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nThere are some really fantastic features in VPP, some of which are less well known, and not always very well documented. In this article, I will describe a unique use case in which I think VPP will excel, notably acting as a gateway for Internet Exchange Points.\nA few years ago, I toyed with the idea to use VPP as an IXP Reseller concentrator, allowing several carriers to connect with say 10G or 25G ports, and carry sub-customers on tagged interfaces with safety (like MAC address ACLs) and rate limiting (say any given customer limited to 1Gbps on a 10G or 100G trunk), all provided by VPP. You can take a look at my [VPP IXP Gateway] article for details. I never ended up deploying it.\nIn this article, I follow up and fix a few shortcomings in VPP\u0026rsquo;s policer framework.\nIntroduction Consider the following policer in VPP:\nvpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit vpp# policer input name client-a GigabitEthernet10/0/1 vpp# policer output name client-a GigabitEthernet10/0/1 The idea is to give a committed information rate of 150Mbps with a committed burst rate of 15MB. The CIR represents the average bandwidth allowed for the interface, while the CB represents the maximum amount of data (in bytes) that can be sent at line speed in a single burst before the CIR kicks in to throttle the traffic.\nBack in October of 2023, I reached the conclusion that the policer works in the following modes:\nOn input, the policer is applied on device-input which means it takes frames directly from the Phy. It will not work on any sub-interfaces. It explains why the policer worked on untagged (Gi10/0/1) but not on tagged (Gi10/0/1.100) sub-interfaces. 
On output, the policer is applied on ip4-output and ip6-output, which works only for L3 enabled interfaces, not for L2 ones like the ones one might use on bridge domain or L2 cross connects. VPP Infra: L2 Feature Maps The benefit of using the device-input arc is that it\u0026rsquo;s efficient: every packet that comes from the device (Gi10/0/1) regardless of tagging or not, will be handed off to the policer plugin. It means any traffic (L2, L3, sub-interface, tagged, untagged) will all go through the same policer.\nIn src/vnet/l2/ there are two nodes called l2-input and l2-output. I can configure VPP to call these nodes before ip[46]-unicast and before ip[46]-output respectively. These L2 nodes have a feature bitmap with 32 entries. The l2-input / l2-output nodes use a bitmap walk: they find the highest set bit, and then dispatch the packet to a pre-configured graph node. Upon return, the feat-bitmap-next checks the next bit, and if that one is set, dispatches the packet to the next pre-configured graph node. This continues until all the bits are checked and packets have been handed to their respective graph node if any given bit is set.\nTo show what I can do with these nodes, let me dive into an example. When a packet arrives on an interface configured in L2 mode, either because it\u0026rsquo;s a bridge-domain or an L2XC, ethernet-input will send it to l2-input. This node does three things:\nIt will classify the packet, by reading the interface configuration (l2input_main.configs) for the sw_if_index, which contains the mode of the interface (bridge-domain, l2xc, or bvi). It also contains the feature bitmap: a statically configured set of features for this interface.\nIt will store the effective feature bitmap for each individual packet in the packet buffer. For bridge mode, depending on the packet being unicast or multicast, some features are disabled. For example, flooding for unicast packets is not performed, so those bits are cleared. The result is stored in a per-packet working copy that downstream nodes can be triggered on, in turn.\nFor each of the bits set in the packet buffer\u0026rsquo;s l2.feature_bitmap, starting from highest bit set, l2-input will set the next node, for example l2-input-vtr to do VLAN Tag Rewriting. Once that node is finished, it\u0026rsquo;ll clear its own bit, and search for the next one set, in order to set a new node.\nI note that processing order is HIGH to LOW bits. By reading l2_input.h, I can see that the full l2-input chain looks like this:\nl2-input → SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14) → ACL(13) → VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8) → FWD(7) → UU_FWD(6) → UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2) → XCONNECT(1) → DROP(0) l2-output → XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9) → STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3) → CFM(2) → SPAN(1) → OUTPUT(0) If none of the L2 processing nodes set the next node, ultimately feature-bitmap-drop gently takes the packet behind the shed and drops it. On the way out, ultimately the last OUTPUT bit sends the packet to interface-output, which hands off to the driver\u0026rsquo;s TX node.\nEnabling L2 features There\u0026rsquo;s lots of places in VPP where L2 feature bitmaps are set/cleared. 
Here\u0026rsquo;s a few examples:\n# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting) vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1 # ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0 vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0 # SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1 # Bridge domain level (affects bd_feature_bitmap, applied to all bridge members) vpp# set bridge-domain learn 1 # enable/disable LEARN in BD vpp# set bridge-domain forward 1 # enable/disable FWD in BD vpp# set bridge-domain flood 1 # enable/disable FLOOD in BD I\u0026rsquo;m starting to see how these L2 feature bitmaps are super powerful, yet flexible. I\u0026rsquo;m ready to add one!\nCreating L2 features First, I need to insert my new POLICER bit in l2_input.h and l2_output.h. Then, I can call l2input_intf_bitmap_enable() and its companion l2output_intf_bitmap_enable() to enable or disable the L2 feature, and point it at a new graph node.\n/* Enable policer both on L2 feature bitmap, and L3 feature arcs */ if (dir == VLIB_RX) { l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply); vnet_feature_enable_disable (\u0026#34;ip4-unicast\u0026#34;, \u0026#34;policer-input\u0026#34;, sw_if_index, apply, 0, 0); vnet_feature_enable_disable (\u0026#34;ip6-unicast\u0026#34;, \u0026#34;policer-input\u0026#34;, sw_if_index, apply, 0, 0); } else { l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply); vnet_feature_enable_disable (\u0026#34;ip4-output\u0026#34;, \u0026#34;policer-output\u0026#34;, sw_if_index, apply, 0, 0); vnet_feature_enable_disable (\u0026#34;ip6-output\u0026#34;, \u0026#34;policer-output\u0026#34;, sw_if_index, apply, 0, 0); } What this means is that if the interface happens to be in L2 mode, in other words when it is a bridge-domain member or when it is in an l2XC mode, I will enable the L2 features. However, for L3 packets, I will still proceed to enable the existing policer-input node by calling vnet_feature_enable_disable() on the IPv4 and IPv6 input arc. I make a mental note that MPLS and other non-IP traffic will not be policed in this way.\nUpdating Policer graph node The policer framework has an existing dataplane node called vnet_policer_inline() which I extend to take a flag is_l2. Using this flag, I can either set the next graph node to be vnet_l2_feature_next(), or, in the pre-existing L3 case, set vnet_feature_next() on the packets that move through the node. 
The nodes now look like this:\nVLIB_NODE_FN (policer_l2_input_node) (vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame) { return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */); } VLIB_REGISTER_NODE (policer_l2_input_node) = { .name = \u0026#34;l2-policer-input\u0026#34;, .vector_size = sizeof (u32), .format_trace = format_policer_trace, .type = VLIB_NODE_TYPE_INTERNAL, .n_errors = ARRAY_LEN(vnet_policer_error_strings), .error_strings = vnet_policer_error_strings, .n_next_nodes = VNET_POLICER_N_NEXT, .next_nodes = { [VNET_POLICER_NEXT_DROP] = \u0026#34;error-drop\u0026#34;, [VNET_POLICER_NEXT_HANDOFF] = \u0026#34;policer-input-handoff\u0026#34;, }, }; /* Register on IP unicast arcs for L3 routed sub-interfaces */ VNET_FEATURE_INIT (policer_ip4_unicast, static) = { .arc_name = \u0026#34;ip4-unicast\u0026#34;, .node_name = \u0026#34;policer-input\u0026#34;, .runs_before = VNET_FEATURES (\u0026#34;ip4-lookup\u0026#34;), }; VNET_FEATURE_INIT (policer_ip6_unicast, static) = { .arc_name = \u0026#34;ip6-unicast\u0026#34;, .node_name = \u0026#34;policer-input\u0026#34;, .runs_before = VNET_FEATURES (\u0026#34;ip6-lookup\u0026#34;), }; Here, I install the L3 feature before ip[46]-lookup, and hook up the L2 feature with a new node that really just calls the existing node but with is_l2 set to true. I do something very similar for the output direction, except there I\u0026rsquo;ll hook the L3 feature before ip[46]-output.\nTests! I think writing unit- and integration tests is a great idea. I add a new file test/test_policer_subif.py which actually tests all four new cases:\nL3 Input: on a routed sub-interface L3 Output: on a routed sub-interface L2 Input: on a bridge-domain sub-interface L2 Output: on a bridge-domain sub-interface The existing test/test_policer.py should also cover existing cases, and of course it\u0026rsquo;s important that my work does not break existing functionality. Lucky me, the existing tests all still pass :)\nTest: L3 in/output The tests use a VPP feature called packet-generator, which creates virtual devices upon which I can emit packets using ScaPY, and use pcap to receive them. For the input, first I\u0026rsquo;ll create the interface and apply a new policer to it:\nsub_if0 = VppDot1QSubint(self, self.pg0, 10) sub_if0.admin_up() sub_if0.config_ip4() sub_if0.resolve_arp() # Create policer action_tx = PolicerAction(VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0) policer = VppPolicer(self, \u0026#34;subif_l3_pol\u0026#34;, 80, 0, 1000, 0, conform_action=action_tx, exceed_action=action_tx, violate_action=action_tx, ) policer.add_vpp_config() # Apply policer to sub-interface input on pg0 policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True) The policer with name subif_l3_pol has a CIR of 80kbps, and EIR of 0kB, a CB of 1000 bytes, and EB of 0kB, and otherwise always accepts packets. 
I do this so that I can eventually detect if and how many packets were seen, and how many bytes were passed in the conform and violate actions.\nNext, I can generate a few packets and send them out from pg0, and wait to receive them on pg1:\n# Send packets with VLAN tag from sub_if0 to sub_if1 pkts = [] for i in range(NUM_PKTS): # NUM_PKTS = 67 pkt = ( Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac) / Dot1Q(vlan=10) / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4) / UDP(sport=1234, dport=1234) / Raw(b\u0026#34;\\xa5\u0026#34; * 100) ) pkts.append(pkt) # Send and verify packets are policed and forwarded rx = self.send_and_expect(self.pg0, pkts, self.pg1) stats = policer.get_stats() # Verify policing happened self.assertGreater(stats[\u0026#34;conform_packets\u0026#34;], 0) self.assertEqual(stats[\u0026#34;exceed_packets\u0026#34;], 0) self.assertGreater(stats[\u0026#34;violate_packets\u0026#34;], 0) self.logger.info(f\u0026#34;L3 sub-interface input policer stats: {stats}\u0026#34;) Similar to the L3 sub-interface input policer, I also write a test for L3 sub-interface output policer. The only difference between the two is that in the output case, the policer is applied to pg1 in the Dir.TX direction, while in the input case, it\u0026rsquo;s applied to pg0 in the Dir.RX direction.\nI can predict the outcome. Every packet is exactly 146 bytes:\n14 bytes src/dst MAC in Ether() 4 bytes VLAN tag (10) in Dot1Q() 20 bytes IPv4 header in IP() 8 bytes UDP header in UDP() 100 bytes of additional payload. When allowing a burst of 1000 bytes, that means 6 packets should make it through (876 bytes) in the conform bucket while the other 61 should be in the violate bucket. I won\u0026rsquo;t see any packets in the exceed bucket, because the policer I created is a simple one-rate, two-color 1R2C policer with EB set to 0, so every non-conforming packet goes straight to violate as there is no extra budget in the exceed bucket. However they are all sent, because the action was set to transmit in all cases.\npim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 15:21:46,868 L3 sub-interface input policer stats: {\u0026#39;conform_packets\u0026#39;: 7, \u0026#39;conform_bytes\u0026#39;: 896, \u0026#39;exceed_packets\u0026#39;: 0, \u0026#39;exceed_bytes\u0026#39;: 0, \u0026#39;violate_packets\u0026#39;: 60, \u0026#39;violate_bytes\u0026#39;: 7680} 15:21:47,919 L3 sub-interface output policer stats: {\u0026#39;conform_packets\u0026#39;: 6, \u0026#39;conform_bytes\u0026#39;: 876, \u0026#39;exceed_packets\u0026#39;: 0, \u0026#39;exceed_bytes\u0026#39;: 0, \u0026#39;violate_packets\u0026#39;: 61, \u0026#39;violate_bytes\u0026#39;: 8906} Whoops! So much for predicting the outcome! I see that 7 packets (896 bytes) make it through on input while 6 packets (876 bytes) make it through on output. In the input case, the packet size is 896/7 = 128 bytes, which is 18 bytes short. What\u0026rsquo;s going on?\nSide Quest: Policer Accounting On the vpp-dev mailinglist, Ben points out that the accounting will be changing when moving from device-input to ip[46]-input, because after device-input, the packet buffer is advanced to the L3 portion, and will start at the IPv4 or IPv6 header. Considering I was using dot1q tagged sub-interfaces, that means I will be short exactly 18 bytes. 
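To make that concrete, here is the same arithmetic as a small standalone Python sketch (for illustration only, it is not part of the change):
# Header sizes of the test packet, in bytes
ether, dot1q, ipv4, udp, payload = 14, 4, 20, 8, 100
l2_frame = ether + dot1q + ipv4 + udp + payload  # 146 bytes on the wire
l3_packet = l2_frame - ether - dot1q             # 128 bytes, where ip4-input starts counting
cb = 1000                                        # committed burst, in bytes
print(cb // l2_frame)   # 6 packets fit the burst when full L2 frames are counted
print(cb // l3_packet)  # 7 packets fit when only the L3 bytes are counted
Counting only the L3 bytes is what let a seventh packet slip into the conform bucket on input.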
The reason why this does not happen on the way out, is that ip[46]-rewrite have both already wound back the buffer to be able to insert the ethernet frame and encapsulation, so no adjustment is needed there.\nBen also points out that when applying the policer to the interface, I can detect at creation time if it\u0026rsquo;s a PHY, a single-tagged or a double-tagged interface, and store some information to help correct the accounting. We discuss a little bit on the mailinglist, and agree that it\u0026rsquo;s best for all four cases (L2 input/output and L3 input/output) to use the full L2 frame bytes in the accounting, which as an added benefit, also remains backwards compatible with the device-input accounting. Chapeau, Ben, you\u0026rsquo;re so clever!\nI add a little helper function:\nstatic u8 vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir) { if (dir == VLIB_TX) return 0; vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index); if (PREDICT_FALSE (hi-\u0026gt;hw_class_index != ethernet_hw_interface_class.index)) return 0; /* Not Ethernet */ vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index); if (si-\u0026gt;type == VNET_SW_INTERFACE_TYPE_SUB) { if (si-\u0026gt;sub.eth.flags.one_tag) return 18; /* Ethernet + single VLAN */ if (si-\u0026gt;sub.eth.flags.two_tags) return 22; /* Ethernet + QinQ */ } return 14; /* Untagged Ethernet */ } And in the policer struct, I also add a l2_overhead_by_sw_if_index[dir][sw_if_index] to store these values. That way, I do not need to do this calculation for every packet in the dataplane, but just blindly add the value I pre-computed at creation time. This is safe, because sub-interfaces cannot change their encapsulation after being created.\nIn the vnet_policer_police() dataplane function, I add an l2_overhead argument, and then call it like so:\nu16 l2_overhead0 = (is_l2) ? 0 : pm-\u0026gt;l2_overhead_by_sw_if_index[dir][sw_if_index0]; act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0); And with that, my two tests give the same results:\npim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep \u0026#39;policer stats\u0026#39; 15:38:39,720 L3 sub-interface input policer stats: {\u0026#39;conform_packets\u0026#39;: 6, \u0026#39;conform_bytes\u0026#39;: 876, \u0026#39;exceed_packets\u0026#39;: 0, \u0026#39;exceed_bytes\u0026#39;: 0, \u0026#39;violate_packets\u0026#39;: 61, \u0026#39;violate_bytes\u0026#39;: 8906} 15:38:40,715 L3 sub-interface output policer stats: {\u0026#39;conform_packets\u0026#39;: 6, \u0026#39;conform_bytes\u0026#39;: 876, \u0026#39;exceed_packets\u0026#39;: 0, \u0026#39;exceed_bytes\u0026#39;: 0, \u0026#39;violate_packets\u0026#39;: 61, \u0026#39;violate_bytes\u0026#39;: 8906} Yay, great success!\nTest: L2 in/output The tests for the L2 input and output case are not radically different. In the setup, rather than giving the VLAN sub-interfaces an IPv4 address, I\u0026rsquo;ll just add them to a bridge-domain:\n# Create VLAN sub-interfaces on pg0 and pg1 sub_if0 = VppDot1QSubint(self, self.pg0, 30) sub_if0.admin_up() sub_if1 = VppDot1QSubint(self, self.pg1, 30) sub_if1.admin_up() # Add both sub-interfaces to bridge domain 1 self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1) self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1) This puts the sub-interfaces in L2 mode, after which the l2-input and l2-output feature bitmaps kick in. 
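The traffic side of the L2 test is essentially the same as in the L3 case. Here is a sketch, assuming the same helpers and policer as before; the IP addresses are arbitrary since the bridge only looks at MAC addresses:
# Bridged traffic: the destination MAC is the host behind pg1, not VPP itself
pkts = [
    Ether(src=self.pg0.remote_mac, dst=self.pg1.remote_mac)
    / Dot1Q(vlan=30)
    / IP(src=\u0026#34;192.0.2.1\u0026#34;, dst=\u0026#34;192.0.2.2\u0026#34;)
    / UDP(sport=1234, dport=1234)
    / Raw(b\u0026#34;\\xa5\u0026#34; * 100)
    for _ in range(NUM_PKTS)
]
rx = self.send_and_expect(self.pg0, pkts, self.pg1)
stats = policer.get_stats()
self.assertGreater(stats[\u0026#34;conform_packets\u0026#34;], 0)
self.assertGreater(stats[\u0026#34;violate_packets\u0026#34;], 0)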
Without further ado:\npim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep \u0026#39;L2.*policer stats\u0026#39; 15:50:15,217 L2 sub-interface input policer stats: {\u0026#39;conform_packets\u0026#39;: 6, \u0026#39;conform_bytes\u0026#39;: 876, \u0026#39;exceed_packets\u0026#39;: 0, \u0026#39;exceed_bytes\u0026#39;: 0, \u0026#39;violate_packets\u0026#39;: 61, \u0026#39;violate_bytes\u0026#39;: 8906} 15:50:16,217 L2 sub-interface output policer stats: {\u0026#39;conform_packets\u0026#39;: 6, \u0026#39;conform_bytes\u0026#39;: 876, \u0026#39;exceed_packets\u0026#39;: 0, \u0026#39;exceed_bytes\u0026#39;: 0, \u0026#39;violate_packets\u0026#39;: 61, \u0026#39;violate_bytes\u0026#39;: 8906} Results The policer works in all sorts of cool scenarios now. Let me give a concrete example, where I create an L2XC with VTR and then apply a policer. I\u0026rsquo;ve written about VTR, which stands for VLAN Tag Rewriting before, in an old article lovingly called [VPP VLAN Gymnastics]. It all looks like this:\nvpp# create sub Gi10/0/0 100 vpp# create sub Gi10/0/1 200 vpp# set interface l2 xconnect Gi10/0/0.100 Gi10/0/1.200 vpp# set interface l2 xconnect Gi10/0/1.200 Gi10/0/0.100 vpp# set interface l2 tag-rewrite Gi10/0/0.100 pop 1 vpp# set interface l2 tag-rewrite Gi10/0/1.200 pop 1 vpp# policer add name pol-test rate kbps cir 150000 cb 15000000 conform-action transmit vpp# policer input name pol-test Gi10/0/0.100 After applying this configuration, the input bitmap on Gi10/0/0.100 becomes POLICER(14) | VTR(10) | XCONNECT(1) | DROP(0). Packets now take the following path through the dataplane:\nethernet-input → l2-input (computes bitmap, dispatches to bit 14) → l2-policer-input (clears bit 14, polices, dispatches to bit 10) → l2-input-vtr (clears bit 10, pops 1 tag, dispatches to bit 1) → l2-output (XCONNECT: sw_if_index[TX]=Gi10/0/1.200) → inline output VTR (pushes 1 tag for .200) → interface-output → Gi10/0/1-tx What\u0026rsquo;s Next I\u0026rsquo;ve sent the change, which was only about ~300 LOC, off for review. You can follow along on the gerrit on [44654]. I don\u0026rsquo;t think the policer got much slower after adding the l2 path, and one might argue it doesn\u0026rsquo;t matter because policing didn\u0026rsquo;t work on sub-interfaces or L2 output at all before this change. However, for the L3 input/output case, and for the PHY input case, there are a few CPU cycles added now to address the L2 and sub-int use cases. Perhaps I should do a side by side comparison of packets/sec throughput on the bench some time.\nIt would be great if VPP would support FQ-CoDel (Flow Queue-Controlled Delay), which is an algorithm and packet scheduler designed to eliminate bufferbloat, which is high latency caused by excessive buffering in network equipment, while ensuring fair bandwidth distribution among competing traffic flows. I know that Dave Täht - may he rest in peace - always wanted that.\nFor me, I\u0026rsquo;ve set my sights on eVPN VxLAN, and I also started toying with SRv6 L2 transport. I hope that in the spring I\u0026rsquo;ll have a bit more time to contribute to VPP and write about it. Stay tuned!\n","date":"2026-02-14","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. 
For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nThere are some really fantastic features in VPP, some of which are less well known, and not always very well documented. In this article, I will describe a unique use case in which I think VPP will excel, notably acting as a gateway for Internet Exchange Points.\n","permalink":"https://ipng.ch/s/articles/2026/02/14/vpp-policers/","section":"articles","title":"VPP Policers"},{"contents":" Introduction There once was a Dutch company called [DigiNotar], as the name suggests it was a form of digital notary, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.\nGoogle launched a project called Certificate Transparency, because it was becoming more common that the root of trust given to Certification Authorities could no longer be unilaterally trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).\nIn 2013, [RFC 6962] was published by the IETF. It describes an experimental protocol for publicly logging the existence of Transport Layer Security (TLS) certificates as they are issued or observed, in a manner that allows anyone to audit certificate authority (CA) activity and notice the issuance of suspect certificates as well as to audit the certificate logs themselves. The intent is that eventually clients would refuse to honor certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to the logs.\nIn the first two articles of this series, I explored [TesseraCT] and [Sunlight], two open source implementations of the Static CT protocol. In this final article, I\u0026rsquo;ll share the details on how I created the environment and production instances for four logs that IPng will be providing: Rennet and Lipase are two ingredients to make cheese and will serve as our staging/testing logs. Gouda and Halloumi are two delicious cheeses that pay homage to our heritage, Jeroen and I being Dutch and Antonis being Greek.\nHardware At IPng Networks, all hypervisors are from the same brand: Dell\u0026rsquo;s Poweredge line. In this project, Jeroen is also contributing a server, and it so happens that he also has a Dell Poweredge. We\u0026rsquo;re both running Debian on our hypervisor, so we install a fresh VM with Debian 13.0, codenamed Trixie, and give the machine 16GB of memory, 8 vCPU and a 16GB boot disk. Boot disks are placed on the hypervisor\u0026rsquo;s ZFS pool, and a blockdevice snapshot is taken every 6hrs. This allows the boot disk to be rolled back to a last known good point in case an upgrade goes south. If you haven\u0026rsquo;t seen it yet, take a look at [zrepl], a one-stop, integrated solution for ZFS replication. 
This tool is incredibly powerful, and can do snapshot management, sourcing / sinking to remote hosts, of course using incremental snapshots as they are native to ZFS.\nOnce the machine is up, we pass four enterprise-class storage drives, in our case 3.84TB Kioxia NVMe, model KXD51RUE3T84 which are PCIe 3.1 x4 lanes, and NVMe 1.2.1 specification with a good durability and reasonable (albeit not stellar) read throughput of ~2700MB/s, write throughput of ~800MB/s with 240 kIOPS random read and 21 kIOPS random write. My attention is also drawn to a specific specification point: these drives allow for 1.0 DWPD, which stands for Drive Writes Per Day, in other words they are not going to run themselves off a cliff after a few petabytes of writes, and I am reminded that a CT Log wants to write to disk a lot during normal operation.\nThe point of these logs is to keep them safe, and the most important aspects of the compute environment are the use of ECC memory to detect single bit errors, and dependable storage. Toshiba makes a great product.\nctlog1:~$ sudo zpool create -f -o ashift=12 -o autotrim=on -O atime=off -O xattr=sa \\ ssd-vol0 raidz2 /dev/disk/by-id/nvme-KXD51RUE3T84_TOSHIBA_*M ctlog1:~$ sudo zfs create -o encryption=on -o keyformat=passphrase ssd-vol0/enc ctlog1:~$ sudo zfs create ssd-vol0/logs ctlog1:~$ for log in lipase; do \\ for shard in 2025h2 2026h1 2026h2 2027h1 2027h2; do \\ sudo zfs create ssd-vol0/logs/${log}${shard} \\ done \\ done The hypervisor will use PCI passthrough for the NVMe drives, and we\u0026rsquo;ll handle ZFS directly on the VM. The first command creates a ZFS raidz2 pool using 4kB blocks, turns off atime (which avoids one metadata write for each read!), and turns on SSD trimming in ZFS, a very useful feature.\nThen I\u0026rsquo;ll create an encrypted volume for the configuration and key material. This way, if the machine is ever physically transported, the keys will be safe in transit. Finally, I\u0026rsquo;ll create the temporal log shards starting at 2025h2, all the way through to 2027h2 for our testing log called Lipase and our production log called Halloumi on Jeroen\u0026rsquo;s machine. On my own machine, it\u0026rsquo;ll be Rennet for the testing log and Gouda for the production log.\nSunlight I set up Sunlight first, as its authors have extensive operational notes both in terms of the [config] of Geomys\u0026rsquo; Tuscolo log, as well as on the [Sunlight] homepage. I really appreciate that Filippo added some [Gists] and [Doc] with pretty much all I need to know to run one too. Our Rennet and Gouda logs use a very similar approach for their configuration, with one notable exception: the VMs do not have a public IP address, and are tucked away in a private network called IPng Site Local. I\u0026rsquo;ll get back to that later.\nctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat \u0026lt;\u0026lt; EOF | tee sunlight-staging.yaml listen: - \u0026#34;[::]:16420\u0026#34; checkpoints: /ssd-vol0/shared/checkpoints.db logs: - shortname: rennet2025h2 inception: 2025-07-28 period: 200 poolsize: 750 submissionprefix: https://rennet2025h2.log.ct.ipng.ch monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch ccadbroots: testing extraroots: /ssd-vol0/enc/sunlight/extra-roots-staging.pem secret: /ssd-vol0/enc/sunlight/keys/rennet2025h2.seed.bin cache: /ssd-vol0/logs/rennet2025h2/cache.db localdirectory: /ssd-vol0/logs/rennet2025h2/data notafterstart: 2025-07-01T00:00:00Z notafterlimit: 2026-01-01T00:00:00Z ... 
EOF ctlog@ctlog0:/ssd-vol0/enc/sunlight$ cat \u0026lt;\u0026lt; EOF | tee skylight-staging.yaml listen: - \u0026#34;[::]:16421\u0026#34; homeredirect: https://ipng.ch/s/ct/ logs: - shortname: rennet2025h2 monitoringprefix: https://rennet2025h2.mon.ct.ipng.ch localdirectory: /ssd-vol0/logs/rennet2025h2/data staging: true ... In the first configuration file, I\u0026rsquo;ll tell Sunlight (the write path component) to listen on port :16420 and I\u0026rsquo;ll tell Skylight (the read path component) to listen on port :16421. I\u0026rsquo;ve disabled the automatic certificate renewals, and will handle SSL upstream. A few notes on this:\nMost importantly, I will be using a common frontend pool with a wildcard certificate for *.ct.ipng.ch. I wrote about [DNS-01] before, it\u0026rsquo;s a very convenient way for IPng to do certificate pool management. I will be sharing a certificate for all log types. ACME/HTTP-01 could be made to work with a bit of effort; plumbing through the /.well-known/ URIs on the frontend and pointing them to these instances. But then the cert would have to be copied from Sunlight back to the frontends. I\u0026rsquo;ve noticed that when the log doesn\u0026rsquo;t exist yet, I can start Sunlight and it\u0026rsquo;ll create the bits and pieces on the local filesystem and start writing checkpoints. But if the log already exists, I am required to have the monitoringprefix active, otherwise Sunlight won\u0026rsquo;t start up. It\u0026rsquo;s a small thing, as I will have the read path operational in a few simple steps. Anyway, all five logshards for Rennet, and a few days later, for Gouda, are operational this way.\nSkylight provides all the things I need to serve the data back, which is a huge help. The [Static Log Spec] is very clear on things like compression, content-type, cache-control and other headers. Skylight makes this a breeze, as it reads a configuration file very similar to the Sunlight write-path one, and takes care of it all for me.\nTesseraCT Good news came to our community on August 14th, when Google\u0026rsquo;s TrustFabric team announced their Alpha milestone of [TesseraCT]. This release also moved the POSIX variant from experimental alongside the already further along GCP and AWS personalities. After playing around with it with Al and the team, I think I\u0026rsquo;ve learned enough to get us going in a public tesseract-posix instance.\nOne thing I liked about Sunlight is its compact YAML file that described the pertinent bits of the system, and that I can serve any number of logs with the same process. On the other hand, TesseraCT can serve only one log per process. Both have pros and cons, notably if any poisonous submission would be offered, Sunlight might take down all logs, while TesseraCT would only take down the log receiving the offensive submission. On the other hand, maintaining separate processes is cumbersome, and all log instances need to be meticulously configured.\nTesseraCT genconf I decide to automate this by vibing a little tool called tesseract-genconf, which I\u0026rsquo;ve published on [Gitea]. What it does is take a YAML file describing the logs, and outputs the bits and pieces needed to operate multiple separate processes that together form the sharded static log. 
I\u0026rsquo;ve attempted to stay mostly compatible with the Sunlight YAML configuration, and came up with a variant like this one:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat \u0026lt;\u0026lt; EOF | tee tesseract-staging.yaml listen: - \u0026#34;[::]:8080\u0026#34; roots: /ssd-vol0/enc/tesseract/roots.pem logs: - shortname: lipase2025h2 listen: \u0026#34;[::]:16900\u0026#34; submissionprefix: https://lipase2025h2.log.ct.ipng.ch monitoringprefix: https://lipase2025h2.mon.ct.ipng.ch extraroots: /ssd-vol0/enc/tesseract/extra-roots-staging.pem secret: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem localdirectory: /ssd-vol0/logs/lipase2025h2/data notafterstart: 2025-07-01T00:00:00Z notafterlimit: 2026-01-01T00:00:00Z ... EOF With this snippet, I have all the information I need. Here are the steps I take to construct the log itself:\n1. Generate keys\nThe keys are prime256v1 and the format that TesseraCT accepts has changed since I wrote up my first [deep dive] a few weeks ago. Now, the tool accepts a PEM format private key, from which the Log ID and Public Key can be derived. So off I go:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-key Creating /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem Creating /ssd-vol0/enc/tesseract/keys/lipase2026h1.pem Creating /ssd-vol0/enc/tesseract/keys/lipase2026h2.pem Creating /ssd-vol0/enc/tesseract/keys/lipase2027h1.pem Creating /ssd-vol0/enc/tesseract/keys/lipase2027h2.pem Of course, if a file already exists at that location, it\u0026rsquo;ll just print a warning like:\nKey already exists: /ssd-vol0/enc/tesseract/keys/lipase2025h2.pem (skipped) 2. Generate JSON/HTML\nI will be operating the read-path with NGINX. Log operators have started speaking about their log metadata in terms of a small JSON file called log.v3.json, and Skylight does a good job of exposing that one, alongside all the other pertinent metadata. So I\u0026rsquo;ll generate these files for each of the logs:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-html Creating /ssd-vol0/logs/lipase2025h2/data/index.html Creating /ssd-vol0/logs/lipase2025h2/data/log.v3.json Creating /ssd-vol0/logs/lipase2026h1/data/index.html Creating /ssd-vol0/logs/lipase2026h1/data/log.v3.json Creating /ssd-vol0/logs/lipase2026h2/data/index.html Creating /ssd-vol0/logs/lipase2026h2/data/log.v3.json Creating /ssd-vol0/logs/lipase2027h1/data/index.html Creating /ssd-vol0/logs/lipase2027h1/data/log.v3.json Creating /ssd-vol0/logs/lipase2027h2/data/index.html Creating /ssd-vol0/logs/lipase2027h2/data/log.v3.json It\u0026rsquo;s nice to see a familiar look-and-feel for these logs appear in those index.html files (which all cross-link to each other within the logs specified in tesseract-staging.yaml), which is dope.\n3. Generate Roots\nAntonis had seen this before (thanks for the explanation!) but TesseraCT does not natively implement fetching of the [CCADB] roots. 
But, he points out, you can just get them from any other running log instance, so I\u0026rsquo;ll implement a gen-roots command:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \\ --source https://tuscolo2027h1.sunlight.geomys.org --output production-roots.pem Fetching roots from: https://tuscolo2027h1.sunlight.geomys.org/ct/v1/get-roots 2025/08/25 08:24:58 Warning: Failed to parse certificate,carefully skipping: x509: negative serial number Creating production-roots.pem Successfully wrote 248 certificates to tusc.pem (out of 249 total) ctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf gen-roots \\ --source https://navigli2027h1.sunlight.geomys.org --output testing-roots.pem Fetching roots from: https://navigli2027h1.sunlight.geomys.org/ct/v1/get-roots Creating testing-roots.pem Successfully wrote 82 certificates to tusc.pem (out of 82 total) I can do this regularly, say daily, in a cronjob and if the files were to change, restart the TesseraCT processes. It\u0026rsquo;s not ideal (because the restart might be briefly disruptive), but it\u0026rsquo;s a reasonable option for the time being.\n4. Generate TesseraCT cmdline\nI will be running TesseraCT as a templated unit in systemd. These are system unit files that have an argument, they will have an @ in their name, like so:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat \u0026lt;\u0026lt; EOF | sudo tee /lib/systemd/system/tesseract@.service [Unit] Description=Tesseract CT Log service for %i ConditionFileExists=/ssd-vol0/logs/%i/data/.env After=network.target [Service] # The %i here refers to the instance name, e.g., \u0026#34;lipase2025h2\u0026#34; # This path should point to where your instance-specific .env files are located EnvironmentFile=/ssd-vol0/logs/%i/data/.env ExecStart=/home/ctlog/bin/tesseract-posix $TESSERACT_ARGS User=ctlog Group=ctlog Restart=on-failure RestartSec=5 [Install] WantedBy=multi-user.target EOF I can now implement a gen-env command for my tool:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-env Creating /ssd-vol0/logs/lipase2025h2/data/roots.pem Creating /ssd-vol0/logs/lipase2025h2/data/.env Creating /ssd-vol0/logs/lipase2026h1/data/roots.pem Creating /ssd-vol0/logs/lipase2026h1/data/.env Creating /ssd-vol0/logs/lipase2026h2/data/roots.pem Creating /ssd-vol0/logs/lipase2026h2/data/.env Creating /ssd-vol0/logs/lipase2027h1/data/roots.pem Creating /ssd-vol0/logs/lipase2027h1/data/.env Creating /ssd-vol0/logs/lipase2027h2/data/roots.pem Creating /ssd-vol0/logs/lipase2027h2/data/.env Looking at one of those .env files, I can show the exact commandline I\u0026rsquo;ll be feeding to the tesseract-posix binary:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ cat /ssd-vol0/logs/lipase2025h2/data/.env TESSERACT_ARGS=\u0026#34;--private_key=/ssd-vol0/enc/tesseract/keys/lipase2025h2.pem --origin=lipase2025h2.log.ct.ipng.ch --storage_dir=/ssd-vol0/logs/lipase2025h2/data --roots_pem_file=/ssd-vol0/logs/lipase2025h2/data/roots.pem --http_endpoint=[::]:16900 --not_after_start=2025-07-01T00:00:00Z --not_after_limit=2026-01-01T00:00:00Z\u0026#34; OTEL_EXPORTER_OTLP_ENDPOINT=http://localhost:4318 A quick operational note on OpenTelemetry (also often referred to as Otel): Al and the TrustFabric team added open telemetry to the TesseraCT personalities, as it was mostly already implemented in the underlying Tessera library. By default, it\u0026rsquo;ll try to send its telemetry to localhost using https, which makes sense in those cases where the collector is on a different machine. 
In my case, I\u0026rsquo;ll keep otelcol (the collector) on the same machine. Its job is to consume the Otel telemetry stream, and turn those back into Prometheus /metrics endpoint on port :9464.\nThe gen-env command also assembles the per-instance roots.pem file. For staging logs, it\u0026rsquo;ll take the file pointed to by the roots: key, and append any per-log extraroots: files. For me, these extraroots are empty and the main roots file points at either the testing roots that came from Rennet (our Sunlight staging log), or the production roots that came from Gouda. A job well done!\n5. Generate NGINX\nWhen I first ran my tests, I noticed that the log check tool called ct-fsck threw errors on my read path. Filippo explained that the HTTP headers matter in the Static CT specification. Tiles, Issuers, and Checkpoint must all have specific caching and content type headers set. This is what makes Skylight such a gem - I get to read it (and the spec!) to see what I\u0026rsquo;m supposed to be serving.\nAnd thus, the gen-nginx command is born, and listens on port :8080 for requests:\nctlog@ctlog1:/ssd-vol0/enc/tesseract$ tesseract-genconf -c tesseract-staging.yaml gen-nginx Creating nginx config: /ssd-vol0/logs/lipase2025h2/data/lipase2025h2.mon.ct.ipng.ch.conf Creating nginx config: /ssd-vol0/logs/lipase2026h1/data/lipase2026h1.mon.ct.ipng.ch.conf Creating nginx config: /ssd-vol0/logs/lipase2026h2/data/lipase2026h2.mon.ct.ipng.ch.conf Creating nginx config: /ssd-vol0/logs/lipase2027h1/data/lipase2027h1.mon.ct.ipng.ch.conf Creating nginx config: /ssd-vol0/logs/lipase2027h2/data/lipase2027h2.mon.ct.ipng.ch.conf All that\u0026rsquo;s left for me to do is symlink these from /etc/nginx/sites-enabled/ and the read-path is off to the races. With these commands in the tesseract-genconf tool, I am hoping that future travelers have an easy time setting up their static log. Please let me know if you\u0026rsquo;d like to use, or contribute, to the tool. You can find me in the Transparency Dev Slack, in #ct and also #cheese.\nIPng Frontends IPng Networks has a private internal network called [IPng Site Local], which is not routed on the internet. Our [Frontends] are the only things that have public IPv4 and IPv6 addresses. It allows for things like anycasted webservers and loadbalancing with [Maglev].\nThe IPng Site Local network kind of looks like the picture to the right. The hypervisors running the Sunlight and TesseraCT logs are at NTT Zurich1 in Rümlang, Switzerland. The IPng frontends are in green, and the sweet thing is, some of them run in IPng\u0026rsquo;s own ISP network (AS8298), while others run in partner networks (like IP-Max AS25091, and Coloclue AS8283). This means that I will benefit from some pretty solid connectivity redundancy.\nThe frontends are provisioned with Ansible. There are two aspects to them - firstly, a certbot instance maintains the Let\u0026rsquo;s Encrypt wildcard certificates for *.ct.ipng.ch. There\u0026rsquo;s a machine tucked away somewhere called lego.net.ipng.ch \u0026ndash; again, not exposed on the internet \u0026ndash; and its job is to renew certificates and copy them to the machines that need them. 
Next, a cluster of NGINX servers uses these certificates to expose IPng and customer services to the Internet.\nI can tie it all together with a snippet like so, for which I apologize in advance - it\u0026rsquo;s quite a wall of text:\nmap $http_user_agent $no_cache_ctlog_lipase { \u0026#34;~*TesseraCT fsck\u0026#34; 1; default 0; } server { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem; ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; server_name lipase2025h2.log.ct.ipng.ch; access_log /nginx/logs/lipase2025h2.log.ct.ipng.ch-access.log upstream buffer=512k flush=5s; include /etc/nginx/conf.d/ipng-headers.inc; location = / { proxy_http_version 1.1; proxy_set_header Host lipase2025h2.mon.ct.ipng.ch; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection \u0026#34;upgrade\u0026#34;; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_pass http://ctlog1.net.ipng.ch:8080/index.html; } location = /metrics { proxy_http_version 1.1; proxy_set_header Host $host; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection \u0026#34;upgrade\u0026#34;; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_pass http://ctlog1.net.ipng.ch:9464; } location / { proxy_http_version 1.1; proxy_set_header Host $host; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection \u0026#34;upgrade\u0026#34;; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_pass http://ctlog1.net.ipng.ch:16900; } } server { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/certs/ct.ipng.ch/fullchain.pem; ssl_certificate_key /etc/certs/ct.ipng.ch/privkey.pem; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; server_name lipase2025h2.mon.ct.ipng.ch; access_log /nginx/logs/lipase2025h2.mon.ct.ipng.ch-access.log upstream buffer=512k flush=5s; include /etc/nginx/conf.d/ipng-headers.inc; location = /checkpoint { proxy_http_version 1.1; proxy_set_header Host $host; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection \u0026#34;upgrade\u0026#34;; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_pass http://ctlog1.net.ipng.ch:8080; } location / { proxy_http_version 1.1; proxy_set_header Host $host; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection \u0026#34;upgrade\u0026#34;; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; include /etc/nginx/conf.d/ipng-upstream-headers.inc; proxy_cache ipng_cache; proxy_cache_key \u0026#34;$scheme://$host$request_uri\u0026#34;; proxy_cache_valid 200 24h; proxy_cache_revalidate off; proxy_cache_bypass $no_cache_ctlog_lipase; proxy_no_cache $no_cache_ctlog_lipase; proxy_pass http://ctlog1.net.ipng.ch:8080; } } Taking Lipase shard 2025h2 as an example, the submission path (on *.log.ct.ipng.ch) will show the same index.html as the monitoring path (on *.mon.ct.ipng.ch), to provide some consistency with Sunlight logs. 
Otherwise, the /metrics endpoint is forwarded to the otelcol running on port :9464, and the rest (the /ct/v1/ and so on) are sent to the first port :16900 of the TesseraCT.\nThen the read-path makes a special-case of the /checkpoint endpoint, which it does not cache. That request (as all others) is forwarded to port :8080 which is where NGINX is running. Other requests (notably /tile and /issuer) are cacheable, so I\u0026rsquo;ll cache these on the upstream NGINX servers, both for resilience as well as for performance. Having four of these NGINX upstream will allow the Static CT logs (regardless of being Sunlight or TesseraCT) to serve very high read-rates.\nWhat\u0026rsquo;s Next I need to spend a little bit of time thinking about rate limits, specifically write-ratelimits. I think I\u0026rsquo;ll use a request limiter in upstream NGINX, to allow for each IP or /24 or /48 subnet to only send a fixed number of requests/sec. I\u0026rsquo;ll probably keep that part private though, as it\u0026rsquo;s a good rule of thumb to never offer information to attackers.\nTogether with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and Sunlight logs on the public internet. One final step is to productionize both logs, and file the paperwork for them in the community. At this point our Sunlight log has been running for a month or so, and we\u0026rsquo;ve filed the paperwork for it to be included at Apple and Google.\nI\u0026rsquo;m going to have folks poke at Lipase as well, after which I\u0026rsquo;ll try to run a few ct-fsck to make sure the logs are sane, before offering them into the inclusion program as well. Wish us luck!\n","date":"2025-08-24","desc":" Introduction There once was a Dutch company called [DigiNotar], as the name suggests it was a form of digital notary, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.\nGoogle launched a project called Certificate Transparency, because it was becoming more common that the root of trust given to Certification Authorities could no longer be unilaterally trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).\n","permalink":"https://ipng.ch/s/articles/2025/08/24/certificate-transparency-part-3-operations/","section":"articles","title":"Certificate Transparency - Part 3 - Operations"},{"contents":" Introduction There once was a Dutch company called [DigiNotar], as the name suggests it was a form of digital notary, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.\nGoogle launched a project called Certificate Transparency, because it was becoming more common that the root of trust given to Certification Authorities could no longer be unilaterally trusted. 
These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).\nIn 2013, [RFC 6962] was published by the IETF. It describes an experimental protocol for publicly logging the existence of Transport Layer Security (TLS) certificates as they are issued or observed, in a manner that allows anyone to audit certificate authority (CA) activity and notice the issuance of suspect certificates as well as to audit the certificate logs themselves. The intent is that eventually clients would refuse to honor certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to the logs.\nIn a [previous article], I took a deep dive into a new open source implementation of Static CT Logs made by Google. There is however a very competent alternative called [Sunlight], which deserves some attention to get to know its look and feel, as well as its performance characteristics.\nSunlight I start by reading up on the project website, and learn:\nSunlight is a [Certificate Transparency] log implementation and monitoring API designed for scalability, ease of operation, and reduced cost. What started as the Sunlight API is now the [Static CT API] and is allowed by the CT log policies of the major browsers.\nSunlight was designed by Filippo Valsorda for the needs of the WebPKI community, through the feedback of many of its members, and in particular of the Sigsum, Google TrustFabric, and ISRG teams. It is partially based on the Go Checksum Database. Sunlight\u0026rsquo;s development was sponsored by Let\u0026rsquo;s Encrypt.\nI have a chat with Filippo and think I\u0026rsquo;m addressing an Elephant by asking him which of the two implementations, TesseraCT or Sunlight, he thinks would be a good fit. One thing he says really sticks with me: \u0026ldquo;The community needs any static log operator, so if Google thinks TesseraCT is ready, by all means use that. The diversity will do us good!\u0026rdquo;.\nTo find out if one or the other is \u0026lsquo;ready\u0026rsquo; is partly on the software, but importantly also on the operator. So I carefully take Sunlight out of its cardboard box, and put it onto the same Dell R630 that I used in my previous tests: two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6 pcs 1.2TB SAS3 drives (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise storage (Samsung part number P1633N19).\nSunlight: setup I download the source from GitHub, which, one of these days, will have an IPv6 address. Building the tools is easy enough, there are three main tools:\nsunlight: Which serves the write-path. Certification authorities add their certs here. sunlight-keygen: A helper tool to create the so-called seed file (key material) for a log. skylight: Which serves the read-path. /checkpoint and things like /tile and /issuer are served here in a spec-compliant way. The YAML configuration file is straightforward, and can define and handle multiple logs in one instance, which sets it apart from TesseraCT which can only handle one log per instance. 
There\u0026rsquo;s a submissionprefix which sunlight will use to accept writes, and a monitoringprefix which skylight will use for reads.\nI stumble across a small issue - I haven\u0026rsquo;t created multiple DNS hostnames for the test machine. So I decide to use a different port for one versus the other. The write path will use TLS on port 1443 while Sunlight will point to a normal HTTP port 1080. And considering I don\u0026rsquo;t have a certificate for *.lab.ipng.ch, I will use a self-signed one instead:\npim@ctlog-test:/etc/sunlight$ openssl genrsa -out ca.key 2048 pim@ctlog-test:/etc/sunlight$ openssl req -new -x509 -days 365 -key ca.key \\ -subj \u0026#34;/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=IPng Root CA\u0026#34; -out ca.crt pim@ctlog-test:/etc/sunlight$ openssl req -newkey rsa:2048 -nodes -keyout sunlight-key.pem \\ -subj \u0026#34;/C=CH/ST=ZH/L=Bruttisellen/O=IPng Networks GmbH/CN=*.lab.ipng.ch\u0026#34; -out sunlight.csr pim@ctlog-test:/etc/sunlight# openssl x509 -req -extfile \\ \u0026lt;(printf \u0026#34;subjectAltName=DNS:ctlog-test.lab.ipng.ch,DNS:ctlog-test.lab.ipng.ch\u0026#34;) -days 365 \\ -in sunlight.csr -CA ca.crt -CAkey ca.key -CAcreateserial -out sunlight.pem ln -s sunlight.pem skylight.pem ln -s sunlight-key.pem skylight-key.pem This little snippet yields sunlight.pem (the certificate) and sunlight-key.pem (the private key), and symlinks them to skylight.pem and skylight-key.pem for simplicity. With these in hand, I can start the rest of the show. First I will prepare the NVME storage with a few datasets in which Sunlight will store its data:\npim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/shared pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs pim@ctlog-test:~$ sudo zfs create ssd-vol0/sunlight-test/logs/sunlight-test pim@ctlog-test:~$ sudo chown -R pim:pim /ssd-vol0/sunlight-test Then I\u0026rsquo;ll create the Sunlight configuration:\npim@ctlog-test:/etc/sunlight$ sunlight-keygen -f sunlight-test.seed.bin Log ID: IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E= ECDSA public key: -----BEGIN PUBLIC KEY----- MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHR wRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== -----END PUBLIC KEY----- Ed25519 public key: -----BEGIN PUBLIC KEY----- 0pHg7KptAxmb4o67m9xNM1Ku3YH4bjjXbyIgXn2R2bk= -----END PUBLIC KEY----- The first block creates key material for the log, and I get a fun surprise: the Log ID starts precisely with the string IPng\u0026hellip; what are the odds that that would happen!? I should tell Antonis about this, it\u0026rsquo;s dope!\nAs a safety precaution, Sunlight requires the operator to make the checkpoints.db by hand, which I\u0026rsquo;ll also do:\npim@ctlog-test:/etc/sunlight$ sqlite3 /ssd-vol0/sunlight-test/shared/checkpoints.db \\ \u0026#34;CREATE TABLE checkpoints (logID BLOB PRIMARY KEY, body TEXT)\u0026#34; And with that, I\u0026rsquo;m ready to create my first log!\nSunlight: Setting up S3 When learning about [Tessera], I already kind of drew the conclusion that, for our case at IPng at least, running the fully cloud-native version with S3 storage and MySQL database, gave both poorer performance, but also more operational complexity. 
But I find it interesting to compare behavior and performance, so I\u0026rsquo;ll start by creating a Sunlight log using backing MinIO SSD storage.\nI\u0026rsquo;ll first create the bucket and a user account to access it:\npim@ctlog-test:~$ export AWS_ACCESS_KEY_ID=\u0026#34;\u0026lt;some user\u0026gt;\u0026#34; pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY=\u0026#34;\u0026lt;some password\u0026gt;\u0026#34; pim@ctlog-test:~$ export S3_BUCKET=sunlight-test pim@ctlog-test:~$ mc mb ssd/${S3_BUCKET} pim@ctlog-test:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; /tmp/minio-access.json { \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, \u0026#34;Statement\u0026#34;: [ { \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, \u0026#34;Action\u0026#34;: [ \u0026#34;s3:ListBucket\u0026#34;, \u0026#34;s3:PutObject\u0026#34;, \u0026#34;s3:GetObject\u0026#34;, \u0026#34;s3:DeleteObject\u0026#34; ], \u0026#34;Resource\u0026#34;: [ \u0026#34;arn:aws:s3:::${S3_BUCKET}/*\u0026#34;, \u0026#34;arn:aws:s3:::${S3_BUCKET}\u0026#34; ] } ] } EOF pim@ctlog-test:~$ mc admin user add ssd ${AWS_ACCESS_KEY_ID} ${AWS_SECRET_ACCESS_KEY} pim@ctlog-test:~$ mc admin policy create ssd ${S3_BUCKET}-access /tmp/minio-access.json pim@ctlog-test:~$ mc admin policy attach ssd ${S3_BUCKET}-access --user ${AWS_ACCESS_KEY_ID} pim@ctlog-test:~$ mc anonymous set public ssd/${S3_BUCKET} After setting up the S3 environment, all I must do is wire it up to the Sunlight configuration file:\npim@ctlog-test:/etc/sunlight$ cat \u0026lt;\u0026lt; EOF \u0026gt; sunlight-s3.yaml listen: - \u0026#34;[::]:1443\u0026#34; checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db logs: - shortname: sunlight-test inception: 2025-08-10 submissionprefix: https://ctlog-test.lab.ipng.ch:1443/ monitoringprefix: http://sunlight-test.minio-ssd.lab.ipng.ch:9000/ secret: /etc/sunlight/sunlight-test.seed.bin cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db s3region: eu-schweiz-1 s3bucket: sunlight-test s3endpoint: http://minio-ssd.lab.ipng.ch:9000/ roots: /etc/sunlight/roots.pem period: 200 poolsize: 15000 notafterstart: 2024-01-01T00:00:00Z notafterlimit: 2025-01-01T00:00:00Z EOF The one thing of note here is the use of roots: file which contains the Root CA for the TesseraCT loadtester which I\u0026rsquo;ll be using. In production, Sunlight can grab the approved roots from the so-called Common CA Database or CCADB. But you can also specify either all roots using the roots field, or additional roots on top of the ccadbroots field, using the extraroots field. That\u0026rsquo;s a handy trick! 
You can find more info on the [CCADB] homepage.\nI can then start Sunlight just like this:\npim@ctlog-test:/etc/sunlight$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml {\u0026#34;time\u0026#34;:\u0026#34;2025-08-10T13:49:36.091384532+02:00\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;source\u0026#34;:{\u0026#34;function\u0026#34;:\u0026#34;main.main.func1\u0026#34;,\u0026#34;file\u0026#34;:\u0026#34;/home/pim/src/sunlight/cmd/sunlight/sunlig ht.go\u0026#34;,\u0026#34;line\u0026#34;:341},\u0026#34;msg\u0026#34;:\u0026#34;debug server listening\u0026#34;,\u0026#34;addr\u0026#34;:{\u0026#34;IP\u0026#34;:\u0026#34;127.0.0.1\u0026#34;,\u0026#34;Port\u0026#34;:37477,\u0026#34;Zone\u0026#34;:\u0026#34;\u0026#34;}} time=2025-08-10T13:49:36.091+02:00 level=INFO msg=\u0026#34;debug server listening\u0026#34; addr=127.0.0.1:37477 {\u0026#34;time\u0026#34;:\u0026#34;2025-08-10T13:49:36.100471647+02:00\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;source\u0026#34;:{\u0026#34;function\u0026#34;:\u0026#34;main.main\u0026#34;,\u0026#34;file\u0026#34;:\u0026#34;/home/pim/src/sunlight/cmd/sunlight/sunlight.go\u0026#34; ,\u0026#34;line\u0026#34;:542},\u0026#34;msg\u0026#34;:\u0026#34;today is the Inception date, creating log\u0026#34;,\u0026#34;log\u0026#34;:\u0026#34;sunlight-test\u0026#34;} time=2025-08-10T13:49:36.100+02:00 level=INFO msg=\u0026#34;today is the Inception date, creating log\u0026#34; log=sunlight-test {\u0026#34;time\u0026#34;:\u0026#34;2025-08-10T13:49:36.119529208+02:00\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;source\u0026#34;:{\u0026#34;function\u0026#34;:\u0026#34;filippo.io/sunlight/internal/ctlog.CreateLog\u0026#34;,\u0026#34;file\u0026#34;:\u0026#34;/home/pim/src /sunlight/internal/ctlog/ctlog.go\u0026#34;,\u0026#34;line\u0026#34;:159},\u0026#34;msg\u0026#34;:\u0026#34;created log\u0026#34;,\u0026#34;log\u0026#34;:\u0026#34;sunlight-test\u0026#34;,\u0026#34;timestamp\u0026#34;:1754826576111,\u0026#34;logID\u0026#34;:\u0026#34;IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=\u0026#34;} time=2025-08-10T13:49:36.119+02:00 level=INFO msg=\u0026#34;created log\u0026#34; log=sunlight-test timestamp=1754826576111 logID=\u0026#34;IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=\u0026#34; {\u0026#34;time\u0026#34;:\u0026#34;2025-08-10T13:49:36.127702166+02:00\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;WARN\u0026#34;,\u0026#34;source\u0026#34;:{\u0026#34;function\u0026#34;:\u0026#34;filippo.io/sunlight/internal/ctlog.LoadLog\u0026#34;,\u0026#34;file\u0026#34;:\u0026#34;/home/pim/src/s unlight/internal/ctlog/ctlog.go\u0026#34;,\u0026#34;line\u0026#34;:296},\u0026#34;msg\u0026#34;:\u0026#34;failed to parse previously trusted roots\u0026#34;,\u0026#34;log\u0026#34;:\u0026#34;sunlight-test\u0026#34;,\u0026#34;roots\u0026#34;:\u0026#34;\u0026#34;} time=2025-08-10T13:49:36.127+02:00 level=WARN msg=\u0026#34;failed to parse previously trusted roots\u0026#34; log=sunlight-test roots=\u0026#34;\u0026#34; {\u0026#34;time\u0026#34;:\u0026#34;2025-08-10T13:49:36.127766452+02:00\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;source\u0026#34;:{\u0026#34;function\u0026#34;:\u0026#34;filippo.io/sunlight/internal/ctlog.LoadLog\u0026#34;,\u0026#34;file\u0026#34;:\u0026#34;/home/pim/src/sunlight/internal/ctlog/ctlog.go\u0026#34;,\u0026#34;line\u0026#34;:301},\u0026#34;msg\u0026#34;:\u0026#34;loaded 
log\u0026#34;,\u0026#34;log\u0026#34;:\u0026#34;sunlight-test\u0026#34;,\u0026#34;logID\u0026#34;:\u0026#34;IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=\u0026#34;,\u0026#34;size\u0026#34;:0, \u0026#34;timestamp\u0026#34;:1754826576111} time=2025-08-10T13:49:36.127+02:00 level=INFO msg=\u0026#34;loaded log\u0026#34; log=sunlight-test logID=\u0026#34;IPngJcHCHWi+s37vfFqpY9ouk+if78wAY2kl/sh3c8E=\u0026#34; size=0 timestamp=1754826576111 {\u0026#34;time\u0026#34;:\u0026#34;2025-08-10T13:49:36.540297532+02:00\u0026#34;,\u0026#34;level\u0026#34;:\u0026#34;INFO\u0026#34;,\u0026#34;source\u0026#34;:{\u0026#34;function\u0026#34;:\u0026#34;filippo.io/sunlight/internal/ctlog.(*Log).sequencePool\u0026#34;,\u0026#34;file\u0026#34;:\u0026#34;/home/pim/src/sunlight/internal/ctlog/ctlog.go\u0026#34;,\u0026#34;line\u0026#34;:972},\u0026#34;msg\u0026#34;:\u0026#34;sequenced pool\u0026#34;,\u0026#34;log\u0026#34;:\u0026#34;sunlight-test\u0026#34;,\u0026#34;old_tree_size\u0026#34;:0,\u0026#34;entries\u0026#34;:0,\u0026#34;start\u0026#34;:\u0026#34;2025-08-1 0T13:49:36.534500633+02:00\u0026#34;,\u0026#34;tree_size\u0026#34;:0,\u0026#34;tiles\u0026#34;:0,\u0026#34;timestamp\u0026#34;:1754826576534,\u0026#34;elapsed\u0026#34;:5788099} time=2025-08-10T13:49:36.540+02:00 level=INFO msg=\u0026#34;sequenced pool\u0026#34; log=sunlight-test old_tree_size=0 entries=0 start=2025-08-10T13:49:36.534+02:00 tree_size=0 tiles=0 timestamp=1754826576534 elapsed=5.788099ms ... Although that looks pretty good, I see that something is not quite right. When Sunlight comes up, it shares with me a few links, in the get-roots and json fields on the homepage, but neither of them work:\npim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/ct/v1/get-roots 404 page not found pim@ctlog-test:~$ curl -k https://ctlog-test.lab.ipng.ch:1443/log.v3.json 404 page not found I\u0026rsquo;m starting to think that using a non-standard listen port won\u0026rsquo;t work, or more precisely, adding a port in the monitoringprefix won\u0026rsquo;t work. I notice that the logname is called ctlog-test.lab.ipng.ch:1443 which I don\u0026rsquo;t think is supposed to have a port number in it. So instead, I make Sunlight listen on port 443 and omit the port in the submissionprefix, and give it and its companion Skylight the needed privileges to bind the privileged port like so:\npim@ctlog-test:~$ sudo setcap \u0026#39;cap_net_bind_service=+ep\u0026#39; /usr/local/bin/sunlight pim@ctlog-test:~$ sudo setcap \u0026#39;cap_net_bind_service=+ep\u0026#39; /usr/local/bin/skylight pim@ctlog-test:~$ sunlight -testcert -c /etc/sunlight/sunlight-s3.yaml And with that, Sunlight reports for duty and the links work. Hoi!\nSunlight: Loadtesting S3 I have some good experience loadtesting from the [TesseraCT article]. One important difference is that Sunlight wants to use SSL for the submission and monitoring paths, and I\u0026rsquo;ve created a snakeoil self-signed cert. 
CT Hammer does not accept that out of the box, so I need to make a tiny change to the Hammer:\npim@ctlog-test:~/src/tesseract$ git diff diff --git a/internal/hammer/hammer.go b/internal/hammer/hammer.go index 3828fbd..1dfd895 100644 --- a/internal/hammer/hammer.go +++ b/internal/hammer/hammer.go @@ -104,6 +104,9 @@ func main() { MaxIdleConns: *numWriters + *numReadersFull + *numReadersRandom, MaxIdleConnsPerHost: *numWriters + *numReadersFull + *numReadersRandom, DisableKeepAlives: false, + TLSClientConfig: \u0026amp;tls.Config{ + InsecureSkipVerify: true, + }, }, Timeout: *httpTimeout, } With that small bit of insecurity out of the way, Sunlight makes it otherwise pretty easy for me to construct the CT Hammer commandline:\npim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \\ --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \\ --log_url=http://sunlight-test.minio-ssd.lab.ipng.ch:9000/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \\ --max_read_ops=0 --num_writers=5000 --max_write_ops=100 pim@ctlog-test:/etc/sunlight$ T=0; O=0; while :; do \\ N=$(curl -sS http://sunlight-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;); \\ if [ \u0026#34;$N\u0026#34; -eq \u0026#34;$O\u0026#34; ]; then \\ echo -n .; \\ else \\ echo \u0026#34; $T seconds $((N-O)) certs\u0026#34;; O=$N; T=0; echo -n $N\\ ; fi; \\ T=$((T+1)); sleep 1; done 24915 1 seconds 96 certs 25011 1 seconds 92 certs 25103 1 seconds 93 certs 25196 1 seconds 87 certs On the first commandline I\u0026rsquo;ll start the loadtest at 100 writes/sec with the standard duplication probability of 10%, which allows me to test Sunlight\u0026rsquo;s ability to avoid writing duplicates. This means I should see on average a growth of the tree at about 90/s. Check. I raise the write-load to 500/s:\n39421 1 seconds 443 certs 39864 1 seconds 442 certs 40306 1 seconds 441 certs 40747 1 seconds 447 certs 41194 1 seconds 448 certs .. and to 1'000/s:\n57941 1 seconds 945 certs 58886 1 seconds 970 certs 59856 1 seconds 948 certs 60804 1 seconds 965 certs 61769 1 seconds 955 certs After a few minutes I see a few errors from CT Hammer:\nW0810 14:55:29.660710 1398779 analysis.go:134] (1 x) failed to create request: failed to write leaf: Post \u0026#34;https://ctlog-test.lab.ipng.ch/ct/v1/add-chain\u0026#34;: EOF W0810 14:55:30.496603 1398779 analysis.go:124] (1 x) failed to create request: write leaf was not OK. Status code: 500. 
Body: \u0026#34;failed to read body: read tcp 127.0.1.1:443-\u0026gt;127.0.0.1:44908: i/o timeout\\n\u0026#34; I raise the Hammer load to 5'000/sec (which means 4'500/s unique certs and 500 duplicates), and find the max committed writes/sec to max out at around 4'200/s:\n879637 1 seconds 4213 certs 883850 1 seconds 4207 certs 888057 1 seconds 4211 certs 892268 1 seconds 4249 certs 896517 1 seconds 4216 certs The error rate is a steady stream of errors like the one before:\nW0810 14:59:48.499274 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post \u0026#34;https://ctlog-test.lab.ipng.ch/ct/v1/add-chain\u0026#34;: EOF W0810 14:59:49.034194 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post \u0026#34;https://ctlog-test.lab.ipng.ch/ct/v1/add-chain\u0026#34;: EOF W0810 15:00:05.496459 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post \u0026#34;https://ctlog-test.lab.ipng.ch/ct/v1/add-chain\u0026#34;: EOF W0810 15:00:07.187181 1398779 analysis.go:124] (1 x) failed to create request: failed to write leaf: Post \u0026#34;https://ctlog-test.lab.ipng.ch/ct/v1/add-chain\u0026#34;: EOF At this load of 4'200/s, MinIO is not very impressed. Remember in the [other article] I loadtested it to about 7'500 ops/sec and the statistics below are about 50 ops/sec (2'800/min). I conclude that MinIO is, in fact, bored of this whole activity:\npim@ctlog-test:/etc/sunlight$ mc admin trace --stats ssd Duration: 18m58s ▱▱▱ RX Rate:↑ 115 MiB/m TX Rate:↓ 2.4 MiB/m RPM : 2821.3 ------------- Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min Errors s3.PutObject 37602 (70.3%) 1982.2 6.2ms 785µs 86.7ms 6.1ms 86.6ms ↑59K ↓0B ↑115M ↓1.4K 0 s3.GetObject 15918 (29.7%) 839.1 996µs 670µs 51.3ms 912µs 51.2ms ↑46B ↓3.0K ↑38K ↓2.4M 0 Sunlight still keeps its certificate cache on local disk. At a rate of 4'200/s, the ZFS pool has a write rate of about 105MB/s with about 877 ZFS writes per second.\npim@ctlog-test:/etc/sunlight$ zpool iostat -v ssd-vol0 10 capacity operations bandwidth pool alloc free read write read write -------------------------- ----- ----- ----- ----- ----- ----- ssd-vol0 59.1G 685G 0 2.55K 0 312M mirror-0 59.1G 685G 0 2.55K 0 312M wwn-0x5002538a05302930 - - 0 877 0 104M wwn-0x5002538a053069f0 - - 0 871 0 104M wwn-0x5002538a06313ed0 - - 0 866 0 104M -------------------------- ----- ----- ----- ----- ----- ----- pim@ctlog-test:/etc/sunlight$ zpool iostat -l ssd-vol0 10 capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim pool alloc free read write read write read write read write read write read write wait wait ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ssd-vol0 59.0G 685G 0 3.19K 0 388M - 8ms - 628us - 990us - 10ms - 88ms ssd-vol0 59.2G 685G 0 2.49K 0 296M - 5ms - 557us - 163us - 8ms - - ssd-vol0 59.6G 684G 0 2.04K 0 253M - 2ms - 704us - 296us - 4ms - - ssd-vol0 58.8G 685G 0 2.72K 0 328M - 6ms - 783us - 701us - 9ms - 68ms A few interesting observations:\nSunlight still uses a local sqlite3 database for the certificate tracking, which is more efficient than MariaDB/MySQL, let alone AWS RDS, so it has one less runtime dependency. The write rate to ZFS is significantly higher with Sunlight than TesseraCT (about 8:1). This is likely explained because the sqlite3 database lives on ZFS here, while TesseraCT uses MariaDB running on a different filesystem. The MinIO usage is a lot lighter. 
As I reduce the load to 1'000/s, which was the rate used in the TesseraCT test, I can see that the Get:Put ratio was 93:4 in TesseraCT, while it\u0026rsquo;s 70:30 here. TesseraCT was also consuming more IOPS, running at about 10.5k requests/minute, while Sunlight is significantly calmer at 2.8k requests/minute (almost 4x less!) The burst capacity of Sunlight is a fair bit higher than TesseraCT, likely due to its more efficient use of S3 backends. Conclusion: Sunlight S3+MinIO can handle 1'000/s reliably, and can spike to 4'200/s with only a few errors.\nSunlight: Loadtesting POSIX When I took a closer look at TesseraCT a few weeks ago, it struck me that while a cloud-native setup with S3 storage allows for cool things like storage scaling and read-path redundancy by creating synchronously replicated buckets, it does come with significant operational overhead and complexity. My main concern is the number of different moving parts, and Sunlight really has one very appealing property: it can run entirely on one machine without the need for any other moving parts - even the SQL database is linked in. That\u0026rsquo;s pretty slick.\npim@ctlog-test:/etc/sunlight$ cat \u0026lt;\u0026lt; EOF \u0026gt; sunlight.yaml listen: - \u0026#34;[::]:443\u0026#34; checkpoints: /ssd-vol0/sunlight-test/shared/checkpoints.db logs: - shortname: sunlight-test inception: 2025-08-10 submissionprefix: https://ctlog-test.lab.ipng.ch/ monitoringprefix: https://ctlog-test.lab.ipng.ch:1443/ secret: /etc/sunlight/sunlight-test.seed.bin cache: /ssd-vol0/sunlight-test/logs/sunlight-test/cache.db localdirectory: /ssd-vol0/sunlight-test/logs/sunlight-test/data roots: /etc/sunlight/roots.pem period: 200 poolsize: 15000 notafterstart: 2024-01-01T00:00:00Z notafterlimit: 2025-01-01T00:00:00Z EOF pim@ctlog-test:/etc/sunlight$ sunlight -testcert -c sunlight.yaml pim@ctlog-test:/etc/sunlight$ skylight -testcert -c skylight.yaml First I\u0026rsquo;ll start a hello-world loadtest at 100/s and take a look at the number of leaves in the checkpoint after a few minutes. I would expect three minutes\u0026rsquo; worth at 100/s with a duplicate probability of 10% to yield about 16'200 unique certificates in total.\npim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;; sleep 60; done 10086 15518 20920 26339 And would you look at that? (26339-10086) is right on the dot! One thing that I find particularly cool about Sunlight is its baked-in Prometheus metrics. This gives me some pretty solid insight into its performance. 
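I don't show my Prometheus server here, but a minimal scrape job for these metrics might look something like the sketch below; the job name and file path are placeholders of mine, and insecure_skip_verify is only there because this lab instance runs with -testcert and a self-signed certificate:
pim@ctlog-test:~$ cat << EOF > /tmp/sunlight-scrape.yaml
# Fragment to merge under scrape_configs: in prometheus.yml (sketch, untested)
- job_name: sunlight-test
  scheme: https
  metrics_path: /metrics
  tls_config:
    insecure_skip_verify: true   # lab only: Sunlight runs with -testcert
  static_configs:
    - targets: ['ctlog-test.lab.ipng.ch:443']
EOF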
Take a look for example at the write path latency tail (99th ptile):\npim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep \u0026#39;seconds.*quantile=\\\u0026#34;0.99\\\u0026#34;\u0026#39; sunlight_addchain_wait_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.207285993 sunlight_cache_get_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.001409719 sunlight_cache_put_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.002227985 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;discard\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.000224969 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;fetch\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 8.3003e-05 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;upload\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.042118751 sunlight_http_request_duration_seconds{endpoint=\u0026#34;add-chain\u0026#34;,log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.2259605 sunlight_sequencing_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.108987393 sunlight_sqlite_update_duration_seconds{quantile=\u0026#34;0.99\u0026#34;} 0.014922489 I\u0026rsquo;m seeing here that at a load of 100/s (with 90/s of unique certificates), the 99th percentile add-chain latency is 207ms, which makes sense because the period configuration field is set to 200ms. The filesystem operations (discard, fetch, upload) are de minimis and the sequencing duration is at 109ms. Excellent!\nBut can this thing go really fast? I do remember that the CT Hammer uses more CPU than TesseraCT, and I\u0026rsquo;ve seen it above also when running my 5'000/s loadtest that\u0026rsquo;s about all the hammer can take on a single Dell R630. So, as I did with the TesseraCT test, I\u0026rsquo;ll use the MinIO SSD and MinIO Disk machines to generate the load.\nI boot them, so that I can hammer, or shall I say jackhammer away:\npim@ctlog-test:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \\ --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \\ --log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \\ --max_read_ops=0 --num_writers=5000 --max_write_ops=5000 pim@minio-ssd:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \\ --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \\ --log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \\ --max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=1000000 pim@minio-disk:~/src/tesseract$ go run ./internal/hammer --origin=ctlog-test.lab.ipng.ch \\ --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAE6Hg60YncYt/V69kLmg4LlTO9RmHRwRllfa2cjURBJIKPpCUbgiiMX/jLQqmfzYrtveUws4SG8eT7+ICoa8xdAQ== \\ --log_url=https://ctlog-test.lab.ipng.ch:1443/ --write_log_url=https://ctlog-test.lab.ipng.ch/ \\ --max_read_ops=0 --num_writers=5000 --max_write_ops=5000 --serial_offset=2000000 This will generate 15'000/s of load, which I note does bring Sunlight to its knees, although it does remain stable (yaay!) 
with a somewhat more bursty checkpoint interval:\n5504780 1 seconds 4039 certs 5508819 1 seconds 10000 certs 5518819 . 2 seconds 7976 certs 5526795 1 seconds 2022 certs 5528817 1 seconds 9782 certs 5538599 1 seconds 217 certs 5538816 1 seconds 3114 certs 5541930 1 seconds 6818 certs So what I do instead is a somewhat simpler measurement of certificates per minute:\npim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;; sleep 60; done 6008831 6296255 6576712 This rate boils down to (6576712-6008831)/120 or 4'700/s of written certs, which at a duplication ratio of 10% means approximately 5'200/s of total accepted certs. At this rate, Sunlight is consuming about 10.3 CPUs/s, while Skylight is at 0.1 CPUs/s and the CT Hammer is at 11.1 CPUs/s; Given the 40 threads on this machine, I am not saturating the CPU, but I\u0026rsquo;m curious as this rate is significantly lower than TesseraCT. I briefly turn off the hammer on ctlog-test to allow Sunlight to monopolize the entire machine. The CPU use does reduce to about 9.3 CPUs/s suggesting that indeed, the bottleneck is not strictly CPU:\nWhen using only two CT Hammers (on minio-ssd.lab.ipng.ch and minio-disk.lab.ipng.ch), the CPU use on the ctlog-test.lab.ipng.ch machine definitely goes down (CT Hammer is kind of a CPU hog\u0026hellip;.), but the resulting throughput doesn\u0026rsquo;t change that much:\npim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;; sleep 60; done 7985648 8302421 8528122 8772758 What I find particularly interesting is that the total rate stays approximately 4'400/s ((8772758-7985648)/180), while the checkpoint latency varies considerably. One really cool thing I learned earlier is that Sunlight comes with baked in Prometheus metrics, which I can take a look at while keeping it under this load of ~10'000/sec:\npim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep \u0026#39;seconds.*quantile=\\\u0026#34;0.99\\\u0026#34;\u0026#39; sunlight_addchain_wait_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 1.889983538 sunlight_cache_get_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.000148819 sunlight_cache_put_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.837981208 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;discard\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.000433179 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;fetch\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} NaN sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;upload\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.067494558 sunlight_http_request_duration_seconds{endpoint=\u0026#34;add-chain\u0026#34;,log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 1.86894666 sunlight_sequencing_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 1.111400223 sunlight_sqlite_update_duration_seconds{quantile=\u0026#34;0.99\u0026#34;} 0.016859223 Comparing the throughput at 4'400/s with that first test of 100/s, I expect and can confirm a significant increase in all of these metrics. 
The 99th percentile addchain is now 1889ms (up from 207ms) and the sequencing duration is now 1111ms (up from 109ms).\nSunlight: Effect of period I fiddle a little bit with Sunlight\u0026rsquo;s configuration file, notably the period and poolsize. First I set period:2000 and poolsize:15000, which yields pretty much the same throughput:\npim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;; sleep 60; done 701850 1001424 1295508 1575789 With a generated load of 10'000/sec with a 10% duplication rate, I am offering roughly 9'000/sec of unique certificates, and I\u0026rsquo;m seeing (1575789 - 701850)/180 or about 4'855/sec come through. Just for reference, at this rate and with period:2000, the latency tail looks like this:\npim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep \u0026#39;seconds.*quantile=\\\u0026#34;0.99\\\u0026#34;\u0026#39; sunlight_addchain_wait_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 3.203510079 sunlight_cache_get_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.000108613 sunlight_cache_put_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.950453973 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;discard\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.00046192 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;fetch\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} NaN sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;upload\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.049007693 sunlight_http_request_duration_seconds{endpoint=\u0026#34;add-chain\u0026#34;,log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 3.570709413 sunlight_sequencing_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 1.5968609040000001 sunlight_sqlite_update_duration_seconds{quantile=\u0026#34;0.99\u0026#34;} 0.010847308 Then I also set a period:100 and poolsize:15000, which does improve a bit:\npim@ctlog-test:/etc/sunlight$ while :; do curl -ksS https://ctlog-test.lab.ipng.ch:1443/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;; sleep 60; done 560654 950524 1324645 1720362 With the same generated load of 10'000/sec with a 10% duplication rate, I am still offering roughly 9'000/sec of unique certificates, and I\u0026rsquo;m seeing (1720362 - 560654)/180 or about 6'440/sec come through, which is a fair bit better, at the expense of more disk activity. 
At this rate and with period:100, the latency tail looks like this:\npim@ctlog-test:/etc/sunlight$ curl -ksS https://ctlog-test.lab.ipng.ch/metrics | egrep \u0026#39;seconds.*quantile=\\\u0026#34;0.99\\\u0026#34;\u0026#39; sunlight_addchain_wait_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 1.616046445 sunlight_cache_get_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 7.5123e-05 sunlight_cache_put_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.534935803 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;discard\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.000377273 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;fetch\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 4.8893e-05 sunlight_fs_op_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,method=\u0026#34;upload\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.054685991 sunlight_http_request_duration_seconds{endpoint=\u0026#34;add-chain\u0026#34;,log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 1.946445877 sunlight_sequencing_duration_seconds{log=\u0026#34;sunlight-test\u0026#34;,quantile=\u0026#34;0.99\u0026#34;} 0.980602185 sunlight_sqlite_update_duration_seconds{quantile=\u0026#34;0.99\u0026#34;} 0.018385831 Conclusion: Sunlight on POSIX can reliably handle 4'400/s (with a duplicate rate of 10%) on this setup.\nWrapup - Observations From an operator\u0026rsquo;s point of view, TesseraCT and Sunlight handle quite differently. Both are easily up to the task of serving the current write-load (which is about 250/s).\nS3: When using the S3 backend, TesseraCT became quite unhappy above 800/s while Sunlight went all the way up to 4'200/s and sent significantly fewer requests to MinIO (about 4x less), while showing good telemetry on the use of S3 backends. In this mode, TesseraCT uses MySQL (in my case, MariaDB) which was not on the ZFS pool, but on the boot-disk.\nPOSIX: When using a normal filesystem, Sunlight seems to peak at 4'800/s while TesseraCT went all the way to 12'000/s. When doing so, disk I/O was quite similar between the two solutions, taking into account that TesseraCT runs BadgerDB, while Sunlight uses sqlite3, both are using their respective ZFS pool.\nNotable: Sunlight POSIX and S3 performance is roughly identical (both handle about 5'000/sec), while TesseraCT POSIX performance (12'000/s) is significantly better than its S3 (800/s). Some other observations:\nSunlight has a very opinionated configuration, and can run multiple logs with one configuration file and one binary. Its configuration was a bit constraining though, as I could not manage to use monitoringprefix or submissionprefix with http:// prefix - a likely security precaution - but also using ports in those prefixes (other than the standard 443) rendered Sunlight and Skylight unusable for me.\nSkylight only serves from local directory, it does not have support for S3. For operators using S3, an alternative could be to use NGINX in the serving path, similar to TesseraCT. Skylight does have a few things to teach me though, notably on proper compression, content type and other headers.\nTesseraCT does not have a configuration file, and will run exactly one log per binary instance. It uses flags to construct the environment, and is much more forgiving for creative origin (log name), and submission- and monitoring URLs. 
It\u0026rsquo;s happy to use regular \u0026lsquo;http://\u0026rsquo; for both, which comes in handy in those architectures where the system is serving behind a reverse proxy.\nThe TesseraCT Hammer tool then again does not like using self-signed certificates, and needs to be told to skip certificate validation in the case of Sunlight loadtests while it is running with the -testcert commandline.\nI consider all of these small and mostly cosmetic issues, because in production there will be proper TLS certificates issued and normal https:// serving ports with unique monitoring and submission hostnames.\nWhat\u0026rsquo;s Next Together with Antonis Chariton and Jeroen Massar, IPng Networks will be offering both TesseraCT and Sunlight logs on the public internet. One final step is to productionize both logs, and file the paperwork for them in the community. Although at this point our Sunlight log is already running, I\u0026rsquo;ll wait a few weeks to gather any additional intel, before wrapping up in a final article.\n","date":"2025-08-10","desc":" Introduction There once was a Dutch company called [DigiNotar], as the name suggests it was a form of digital notary, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.\nGoogle launched a project called Certificate Transparency, because it was becoming more common that the root of trust given to Certification Authorities could no longer be unilaterally trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).\n","permalink":"https://ipng.ch/s/articles/2025/08/10/certificate-transparency-part-2-sunlight/","section":"articles","title":"Certificate Transparency - Part 2 - Sunlight"},{"contents":" Certificate Transparency logs are \u0026ldquo;append-only\u0026rdquo; and publicly-auditable ledgers of certificates being created, updated, and expired. This is the homepage for IPng Networks\u0026rsquo; Certificate Transparency project.\nCertificate Transparency [CT] is a system for logging and monitoring certificate issuance. It greatly enhances everyone’s ability to monitor and study certificate issuance, and these capabilities have led to numerous improvements to the CA ecosystem and Web security. As a result, it is rapidly becoming critical Internet infrastructure. Originally developed by Google, the concept is now being adopted by many Certification Authorities who log their certificates, and professional Monitoring companies who observe the certificates and report anomalies.\nIPng Networks runs our logs under the domain ct.ipng.ch, split into a *.log.ct.ipng.ch for the write-path, and *.mon.ct.ipng.ch for the read-path.\nWe are [tracking] our logs for inclusion in the approved log lists for Google Chrome and Apple Safari. 
As of Oct'25, our logs have been added to these trusted lists and that change will propagate to people’s browsers with subsequent browser version releases.\nWe operate two popular implementations of Static Certificate Transparency software.\nSunlight [Sunlight] was designed by Filippo Valsorda for the needs of the WebPKI community, through the feedback of many of its members, and in particular of the Sigsum, Google TrustFabric, and ISRG teams. It is partially based on the Go Checksum Database. Sunlight\u0026rsquo;s development was sponsored by Let\u0026rsquo;s Encrypt.\nOur Sunlight logs:\nA staging log called [Rennet], incepted 2025-07-28, starting from temporal shard rennet2025h2. A production log called [Gouda], incepted 2025-07-30, starting from temporal shard gouda2025h2. TesseraCT [TesseraCT] is a Certificate Transparency (CT) log implementation by the TrustFabric team at Google. It was built to allow log operators to run production static-ct-api CT logs starting with temporal shards covering 2026 onwards, as the successor to Trillian\u0026rsquo;s CTFE.\nOur TesseraCT logs:\nA staging log called [Lipase], incepted 2025-08-22, starting from temporal shard lipase2025h2. A production log called [Halloumi], incepted 2025-08-24, starting from temporal shard halloumi2025h2. Shard halloumi2026h2 incorporated incorrect data into its Merkle Tree at entry 4357956 and 4552365, due to a [TesseraCT bug] and was retired on 2025-09-08, to be replaced by temporal shard halloumi2026h2a. We also submit them to [github.com/geomys/ct-archive].\nOperational Details You can read more details about our infrastructure on:\n[TesseraCT] - published on 2025-07-26. [Sunlight] - published on 2025-08-10. [Operations] - published on 2025-08-24. The operators of this infrastructure are Antonis Chariton, Jeroen Massar and Pim van Pelt. You can reach us via e-mail at [ct-ops@ipng.ch].\nArchived logs Logs are archived in the [c2sp.org/static-ct-api@v1.0.0] format, although if they were originally served through RFC 6962 APIs, leaves might miss the LeafIndex extension. IPng archives its static log shards at least two weeks after the notafterlimit, and removes the DNS entries at least two weeks after archiving.\nWe serve our archived logs from both S3 as well as [ct-archive-serve]:\nhalloumi2026h2.log.ct.ipng.ch - [S3] - [log.v3.json] halloumi2025h2.log.ct.ipng.ch - [S3] - [log.v3.json] lipase2025h2.log.ct.ipng.ch - [S3] - [log.v3.json] gouda2025h2.log.ct.ipng.ch - [S3] - [log.v3.json] rennet2025h2.log.ct.ipng.ch - [S3] - [log.v3.json] ","date":"2025-07-30","desc":" Certificate Transparency logs are \u0026ldquo;append-only\u0026rdquo; and publicly-auditable ledgers of certificates being created, updated, and expired. This is the homepage for IPng Networks\u0026rsquo; Certificate Transparency project.\nCertificate Transparency [CT] is a system for logging and monitoring certificate issuance. It greatly enhances everyone’s ability to monitor and study certificate issuance, and these capabilities have led to numerous improvements to the CA ecosystem and Web security. As a result, it is rapidly becoming critical Internet infrastructure. 
Originally developed by Google, the concept is now being adopted by many Certification Authorities who log their certificates, and professional Monitoring companies who observe the certificates and report anomalies.\n","permalink":"https://ipng.ch/s/ct/","section":"","title":"Certificate Transparency"},{"contents":" Introduction There once was a Dutch company called [DigiNotar], as the name suggests it was a form of digital notary, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.\nGoogle launched a project called Certificate Transparency, because it was becoming more common that the root of trust given to Certification Authorities could no longer be unilaterally trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).\nIn 2013, [RFC 6962] was published by the IETF. It describes an experimental protocol for publicly logging the existence of Transport Layer Security (TLS) certificates as they are issued or observed, in a manner that allows anyone to audit certificate authority (CA) activity and notice the issuance of suspect certificates as well as to audit the certificate logs themselves. The intent is that eventually clients would refuse to honor certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to the logs.\nThis series explores and documents how IPng Networks will be running two Static CT Logs with two different implementations. One will be [Sunlight], and the other will be [TesseraCT].\nStatic Certificate Transparency In this context, Logs are network services that implement the protocol operations for submissions and queries that are defined in a specification that builds on the previous RFC. A few years ago, my buddy Antonis asked me if I would be willing to run a log, but operationally they were very complex and expensive to run. However, over the years, the concept of Static Logs put running one in reach. This [Static CT API] defines a read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path RFC 6962 endpoints (for submission).\nAside from the different read endpoints, a log that implements the Static API is a regular CT log that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires no modification to submitters and TLS clients.\nIf you only read one document about Static CT, read Filippo Valsorda\u0026rsquo;s excellent [paper]. It describes a radically cheaper and easier to operate [Certificate Transparency] log that is backed by a consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the costs with no merge delay.\nScalable, Cheap, Reliable: choose two In the diagram, I\u0026rsquo;ve drawn an overview of IPng\u0026rsquo;s network. In red a European backbone network is provided by a [BGP Free Core network]. It operates a private IPv4, IPv6, and MPLS network, called IPng Site Local, which is not connected to the internet. 
On top of that, IPng offers L2 and L3 services, for example using [VPP].\nIn green I built a cluster of replicated NGINX frontends. They connect into IPng Site Local and can reach all hypervisors, VMs, and storage systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that SSL is added and removed here :-) [ref].\nThen in orange I built a set of [MinIO] S3 storage pools. Amongst others, I serve the static content from the IPng website from these pools, providing fancy redundancy and caching. I wrote about its design in [this article].\nFinally, I turn my attention to the blue which is two hypervisors, one run by [IPng] and the other by [Massar]. Each of them will be running one of the Log implementations. IPng provides two large ZFS storage tanks for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket using Restic.\nHaving explained all of this, I am well aware that end-to-end reliability will be coming from the fact that there are many independent Log operators, and folks wanting to validate certificates can simply monitor many. If there is a gap in coverage, say due to any given Log\u0026rsquo;s downtime, this will not necessarily be problematic. It does mean that I may have to suppress the SRE in me\u0026hellip;\nMinIO My first instinct is to leverage the distributed storage IPng has, but as I\u0026rsquo;ll show in the rest of this article, maybe a simpler, more elegant design could be superior, precisely because individual log reliability is not as important as having many available log instances to choose from.\nFrom operators in the field I understand that the world-wide generation of certificates is roughly 17M/day, which amounts to some 200-250qps of writes. Antonis explains that certs with a validity of 180 days or less will need two CT log entries, while certs with a validity of more than 180d will need three CT log entries. So the write rate is roughly 2.2x that, as an upper bound.\nMy first thought is to see how fast my open source S3 machines can go, really. I\u0026rsquo;m curious also as to the difference between SSD and spinning disks.\nI boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise storage (Samsung part number P1633N19).\nI spin up a 6-device MinIO cluster on both and take them out for a spin using [S3 Benchmark] from Wasabi Tech.\npim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \\ for t in 1 8 32; do \\ for z in 4M 1M 8k 4k; do \\ ./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \\ | tee -a minio-results.txt; \\ done; \\ done; \\ done The loadtest above does a bunch of runs with varying parameters. First it tries to read and write object sizes of 4MB, 1MB, 8kB and 4kB respectively. Then it tries to do this with either 1 thread, 8 threads or 32 threads. Finally it tests both the disk-based variant as well as the SSD based one. The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely dedicated to their task of running MinIO.\nThe left-hand side graph feels pretty natural to me. 
With one thread, uploading 8kB objects will quickly hit the IOPS rate of the disks, each of which has to participate in the write due to EC:3 encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not hit that limit, and has ~371 PUT/s which I found a bit underwhelming. But, when performing the loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.\nOn the read-side, I am pleasantly surprised that there\u0026rsquo;s not really that much of a difference between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread performance is also equivalent (at 7624 GET/s for disks with 7261 GET/s for SSDs). I do wonder why the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version, hardware) the same.\nSidequest: SeaweedFS Something that has long caught my attention is the way in which [SeaweedFS] approaches blob storage. Many operators have great success with many small file writes in SeaweedFS compared to MinIO and even AWS S3 storage. This is because writes with WeedFS are not broken into erasure-sets, which would require every disk to write a small part or checksum of the data, but rather files are replicated within the cluster in their entirety on different disks, racks or datacenters. I won\u0026rsquo;t bore you with the details of SeaweedFS but I\u0026rsquo;ll tack on a docker [compose file] that I used at the end of this article, if you\u0026rsquo;re curious.\nIn the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):\n4k: 3,384 ops/sec vs MinIO\u0026rsquo;s 111 ops/sec (30x faster!) 8k: 3,332 ops/sec vs MinIO\u0026rsquo;s 111 ops/sec (30x faster!) 1M: 383 ops/sec vs MinIO\u0026rsquo;s 44 ops/sec (9x faster) 4M: 104 ops/sec vs MinIO\u0026rsquo;s 32 ops/sec (4x faster) For the read-path, in GET operations MinIO is better at small objects, and really dominates the large objects:\n4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple of seconds, the Merkle tree is recomputed which is reasonably disk-intensive), SeaweedFS might be a slightly better choice. IPng Networks has three MinIO deployments, but no SeaweedFS deployments. Yet.\nTessera [Tessera] is a Go library for building tile-based transparency logs (tlogs) [ref]. It is the logical successor to the approach that Google took when building and operating Logs using its predecessor called [Trillian]. The implementation and its APIs bake-in current best-practices based on the lessons learned over the past decade of building and operating transparency logs in production environments and at scale.\nTessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin [introduce] it at last year\u0026rsquo;s summit. 
At a high level, it wraps what used to be a whole Kubernetes cluster full of components, into a single library that can be used with Cloud based services, either like AWS S3 and RDS database, or like GCP\u0026rsquo;s GCS storage and Spanner database. However, Google also made it easy to use a regular POSIX filesystem implementation.\nTesseraCT While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called [TesseraCT]. Because it leverages Tessera under the hood, TesseraCT can run on GCP, AWS, POSIX-compliant, or on S3-compatible systems alongside a MySQL database. In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two dates: [rangeBegin, rangeEnd). The certificate expiry range allows a Log to reject otherwise valid logging submissions for certificates that expire before or after this defined range, thus partitioning the set of publicly-trusted certificates that each Log will accept. I will be expected to keep logs for an extended period of time, say 3-5 years.\nIt\u0026rsquo;s time for me to figure out what this TesseraCT thing can do .. are you ready? Let\u0026rsquo;s go!\nTesseraCT: S3 and SQL TesseraCT comes with a few so-called personalities. These are implementations of the underlying storage infrastructure in an opinionated way. The first personality I look at is the aws one in cmd/tesseract/aws. I notice that this personality does make hard assumptions about the use of AWS which is unfortunate as the documentation says \u0026lsquo;.. or self-hosted S3 and MySQL database\u0026rsquo;. However, the aws personality assumes the AWS SecretManager in order to fetch its signing key. Before I can be successful, I need to untangle that.\nTesseraCT: AWS and Local Signer First, I change cmd/tesseract/aws/main.go to add two new flags:\n-signer_public_key_file: a path to the public key for checkpoints and SCT signer -signer_private_key_file: a path to the private key for checkpoints and SCT signer I then change the program to assume if these flags are both set, the user will want a NewLocalSigner instead of a NewSecretsManagerSigner. Now all I have to do is implement the signer interface in a package local_signer.go. There, function NewLocalSigner() will read the public and private PEM from file, decode them, and create an ECDSAWithSHA256Signer with them, a simple example to show what I mean:\n// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from // local disk files for signing digests. 
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) { // Read public key publicKeyPEM, err := os.ReadFile(publicKeyFile) publicPemBlock, rest := pem.Decode(publicKeyPEM) var publicKey crypto.PublicKey publicKey, err = x509.ParsePKIXPublicKey(publicPemBlock.Bytes) ecdsaPublicKey, ok := publicKey.(*ecdsa.PublicKey) // Read private key privateKeyPEM, err := os.ReadFile(privateKeyFile) privatePemBlock, rest := pem.Decode(privateKeyPEM) var ecdsaPrivateKey *ecdsa.PrivateKey ecdsaPrivateKey, err = x509.ParseECPrivateKey(privatePemBlock.Bytes) // Verify the correctness of the signer key pair if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) { return nil, errors.New(\u0026#34;signer key pair doesn\u0026#39;t match\u0026#34;) } return \u0026amp;ECDSAWithSHA256Signer{ publicKey: ecdsaPublicKey, privateKey: ecdsaPrivateKey, }, nil } In the snippet above I omitted all of the error handling, but the local signer logic itself is hopefully clear. And with that, I am liberated from Amazon\u0026rsquo;s Cloud offering and can run this thing all by myself!\nTesseraCT: Running with S3, MySQL, and Local Signer First, I need to create a suitable ECDSA key:\npim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem Then, I\u0026rsquo;ll install the MySQL server and create the databases:\npim@ctlog-test:~$ sudo apt install default-mysql-server pim@ctlog-test:~$ sudo mysql -u root CREATE USER \u0026#39;tesseract\u0026#39;@\u0026#39;localhost\u0026#39; IDENTIFIED BY \u0026#39;\u0026lt;db_passwd\u0026gt;\u0026#39;; CREATE DATABASE tesseract; CREATE DATABASE tesseract_antispam; GRANT ALL PRIVILEGES ON tesseract.* TO \u0026#39;tesseract\u0026#39;@\u0026#39;localhost\u0026#39;; GRANT ALL PRIVILEGES ON tesseract_antispam.* TO \u0026#39;tesseract\u0026#39;@\u0026#39;localhost\u0026#39;; Finally, I use the SSD MinIO lab-machine that I just loadtested to create an S3 bucket.\npim@ctlog-test:~$ mc mb minio-ssd/tesseract-test pim@ctlog-test:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; /tmp/minio-access.json { \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, \u0026#34;Statement\u0026#34;: [ { \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, \u0026#34;Action\u0026#34;: [ \u0026#34;s3:ListBucket\u0026#34;, \u0026#34;s3:PutObject\u0026#34;, \u0026#34;s3:GetObject\u0026#34;, \u0026#34;s3:DeleteObject\u0026#34; ], \u0026#34;Resource\u0026#34;: [ \u0026#34;arn:aws:s3:::tesseract-test/*\u0026#34;, \u0026#34;arn:aws:s3:::tesseract-test\u0026#34; ] } ] } EOF pim@ctlog-test:~$ mc admin user add minio-ssd \u0026lt;user\u0026gt; \u0026lt;secret\u0026gt; pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user \u0026lt;user\u0026gt; pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test After some fiddling, I understand that the AWS software development kit makes some assumptions that you\u0026rsquo;ll be using .. quelle surprise .. AWS services. But you can also use local S3 services by setting a few key environment variables. I had heard of the S3 access and secret key environment variables before, but I now need to also use a different S3 endpoint. That little detour into the codebase only took me .. 
several hours.\nArmed with that knowledge, I can build and finally start my TesseraCT instance:\npim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws . pim@ctlog-test:~$ export AWS_DEFAULT_REGION=\u0026#34;us-east-1\u0026#34; pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID=\u0026#34;\u0026lt;user\u0026gt;\u0026#34; pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY=\u0026#34;\u0026lt;secret\u0026gt;\u0026#34; pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3=\u0026#34;http://minio-ssd.lab.ipng.ch:9000/\u0026#34; pim@ctlog-test:~$ ./aws --http_endpoint=\u0026#39;[::]:6962\u0026#39; \\ --origin=ctlog-test.lab.ipng.ch/test-ecdsa \\ --bucket=tesseract-test \\ --db_host=ctlog-test.lab.ipng.ch \\ --db_user=tesseract \\ --db_password=\u0026lt;db_passwd\u0026gt; \\ --db_name=tesseract \\ --antispam_db_name=tesseract_antispam \\ --signer_public_key_file=/tmp/public_key.pem \\ --signer_private_key_file=/tmp/private_key.pem \\ --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem I0727 15:13:04.666056 337461 main.go:128] **** CT HTTP Server Starting **** Hah! I think most of the command line flags and environment variables should make sense, but I was struggling for a while with the --roots_pem_file and the --origin flags, so I phoned a friend (Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log infrastructure, each POST is expected to come from one of the certificate authorities listed in the --roots_pem_file. OK, that makes sense.\nThen, the --origin flag designates how my log calls itself. In the resulting checkpoint file it will enumerate a hash of the latest merged and published Merkle tree. In case a server serves multiple logs, it uses the --origin flag to make the distinction which checksum belongs to which.\npim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint ctlog-test.lab.ipng.ch/test-ecdsa 0 JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU= — ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo= When creating the bucket above, I used mc anonymous set public, which made the S3 bucket world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.\nTesseraCT: Loadtesting S3/MySQL The write path is a server on [::]:6962. I should be able to write a log to it, but how? Here\u0026rsquo;s where I am grateful to find a tool in the TesseraCT GitHub repository called hammer. This hammer sets up read and write traffic to a Static CT API log to test correctness and performance under load. The traffic is sent according to the [Static CT API] spec. Slick!\nThe tool starts a text-based UI (my favorite! also when using Cisco T-Rex loadtester) in the terminal that shows the current status, logs, and supports increasing/decreasing read and write traffic. This TUI allows for a level of interactivity when probing a new configuration of a log in order to find any cliffs where performance degrades. For real load-testing applications, especially headless runs as part of a CI pipeline, it is recommended to run the tool with -show_ui=false in order to disable the UI.\nI\u0026rsquo;m a bit lost in the somewhat terse [README.md], but my buddy Al comes to my rescue and explains the flags to me. 
First of all, the loadtester wants to hit the same --origin that I configured the write-path to accept. In my case this is ctlog-test.lab.ipng.ch/test-ecdsa. Then, it needs the public key for that Log, which I can find in /tmp/public_key.pem. The text there is the DER (Distinguished Encoding Rules), stored as a base64 encoded string. What follows next was the most difficult for me to understand, as I was thinking the hammer would read some log from the internet somewhere and replay it locally. Al explains that actually, the hammer tool synthetically creates all of these entries itself, and it regularly reads the checkpoint from the --log_url place, while it writes its certificates to --write_log_url. The last few flags just inform the hammer how many read and write ops/sec it should generate, and with that explanation my brain plays tadaa.wav and I am ready to go.\npim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \\ --origin=ctlog-test.lab.ipng.ch/test-ecdsa \\ --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \\ --log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \\ --write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \\ --max_read_ops=0 \\ --num_writers=5000 \\ --max_write_ops=100 Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in the HTTP write-path by accepting POST requests to /ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain, where hammer is offering them at a rate of 100qps, with a configured probability of duplicates set at 10%. What that means is that every now and again, it\u0026rsquo;ll repeat a previous request. The purpose of this is to stress test the so-called antispam implementation. When hammer sends its requests, it signs them with a certificate that was issued by the CA described in internal/hammer/testdata/test_root_ca_cert.pem, which is why TesseraCT accepts them.\nI raise the write load by using the \u0026lsquo;\u0026gt;\u0026rsquo; key a few times. I notice things are great at 500qps, which is nice because that\u0026rsquo;s double what we are to expect. But I start seeing a bit more noise at 600qps. When I raise the write-rate to 1000qps, all hell breaks loose on the logs of the server (and similar logs in the hammer loadtester):\nW0727 15:54:33.419881 348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn\u0026#39;t store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object \u0026#34;tile/data/000\u0026#34; in bucket \u0026#34;tesseract-test\u0026#34;: operation error S3: GetObject, context deadline exceeded W0727 15:55:02.727962 348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema. 
E0727 15:55:10.448973 348475 append_lifecycle.go:293] followerStats: follower \u0026#34;AWS antispam\u0026#34; EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections I see on the MinIO instance that it\u0026rsquo;s doing about 150/s of GETs and 15/s of PUTs, which is totally reasonable:\npim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd Duration: 6m9s ▰▱▱ RX Rate:↑ 34 MiB/m TX Rate:↓ 2.3 GiB/m RPM : 10588.1 ------------- Call Count RPM Avg Time Min Time Max Time Avg TTFB Max TTFB Avg Size Rate /min s3.GetObject 60558 (92.9%) 9837.2 4.3ms 708µs 48.1ms 3.9ms 47.8ms ↑144B ↓246K ↑1.4M ↓2.3G s3.PutObject 2199 (3.4%) 357.2 5.3ms 2.4ms 32.7ms 5.3ms 32.7ms ↑92K ↑32M s3.DeleteMultipleObjects 1212 (1.9%) 196.9 877µs 290µs 41.1ms 850µs 41.1ms ↑230B ↓369B ↑44K ↓71K s3.ListObjectsV2 1212 (1.9%) 196.9 18.4ms 999µs 52.8ms 18.3ms 52.7ms ↑131B ↓261B ↑25K ↓50K Another nice way to see what makes it through is this oneliner, which reads the checkpoint every second, and once it changes, shows the delta in seconds and how many certs were written:\npim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \\ N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;); \\ if [ \u0026#34;$N\u0026#34; -eq \u0026#34;$O\u0026#34; ]; then \\ echo -n .; \\ else \\ echo \u0026#34; $T seconds $((N-O)) certs\u0026#34;; O=$N; T=0; echo -n $N\\ ; fi; \\ T=$((T+1)); sleep 1; done 1012905 .... 5 seconds 2081 certs 1014986 .... 5 seconds 2126 certs 1017112 .... 5 seconds 1913 certs 1019025 .... 5 seconds 2588 certs 1021613 .... 5 seconds 2591 certs 1024204 .... 5 seconds 2197 certs So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate, TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0 CPUs/s. Overall, the machine is perfectly happily serving for a few hours under this load test.\nConclusion: a write-rate of 400/s should be safe with S3+MySQL\nTesseraCT: POSIX I have been playing with this idea of having a reliable read-path by having the S3 cluster be redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX? We discuss two very important benefits, but also two drawbacks:\nOn the plus side: There is no need for S3 storage, read/writing to a local ZFS raidz2 pool instead. There is no need for MySQL, as the POSIX implementation can use a local badger instance also on the local filesystem. On the drawbacks: There is a SPOF in the read-path, as the single VM must handle both. The write-path always has a SPOF on the TesseraCT VM. Local storage is more expensive than S3 storage, and can be used only for the purposes of one application (and at best, shared with other VMs on the same hypervisor). Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single-VM with a single-binary and no other moving parts. It greatly simplifies the architecture, and for the read-path I can (and will) still use multiple upstream NGINX machines in IPng\u0026rsquo;s network.\nI consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3 solid state storage (NetAPP part number X447_S1633800AMD), which I plug into the ctlog-test machine.\npim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on -o ssd-vol0 mirror \\ /dev/disk/by-id/wwn-0x5002538a0??????? 
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint=\u0026#39;[::]:6962\u0026#39; \\ --origin=ctlog-test.lab.ipng.ch/test-ecdsa \\ --private_key=/tmp/private_key.pem \\ --storage_dir=/ssd-vol0/tesseract-test \\ --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0 badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0 I0727 16:29:15.032845 363156 files.go:502] Initializing directory for POSIX log at \u0026#34;/ssd-vol0/tesseract-test\u0026#34; (this should only happen ONCE per log!) I0727 16:29:15.034101 363156 main.go:97] **** CT HTTP Server Starting **** pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint ctlog-test.lab.ipng.ch/test-ecdsa 0 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= — ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc Alright, I can see the log started and created an empty checkpoint file. Nice!\nBefore I can loadtest it, I will need to get the read-path to become visible. The hammer can read a checkpoint from local file:/// prefixes, but I\u0026rsquo;ll have to serve them over the network eventually anyway, so I create the following NGINX config for it:\nserver { listen 80 default_server backlog=4096; listen [::]:80 default_server backlog=4096; root /ssd-vol0/tesseract-test/; index index.html index.htm index.nginx-debian.html; server_name _; access_log /var/log/nginx/access.log combined buffer=512k flush=5s; location / { try_files $uri $uri/ =404; tcp_nopush on; sendfile on; tcp_nodelay on; keepalive_timeout 65; keepalive_requests 1000; } } Just a couple of small thoughts on this configuration. I\u0026rsquo;m using buffered access logs, to avoid excessive disk writes in the read-path. Then, I\u0026rsquo;m using kernel sendfile() which will instruct the kernel to serve the static objects directly, so that NGINX can move on. Further, I\u0026rsquo;ll allow for a long keepalive in HTTP 1.1, so that future requests can use the same TCP connection, and I\u0026rsquo;ll set the flag tcp_nodelay and tcp_nopush to just blast the data out without waiting.\nWithout much ado:\npim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint ctlog-test.lab.ipng.ch/test-ecdsa 0 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU= — ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA== TesseraCT: Loadtesting POSIX The loadtesting is roughly the same. I start the hammer with the same 500qps of write rate, which was roughly where the S3+MySQL variant topped. My checkpoint tracker shows the following:\npim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \\ N=$(curl -sS http://localhost/checkpoint | grep -E \u0026#39;^[0-9]+$\u0026#39;); \\ if [ \u0026#34;$N\u0026#34; -eq \u0026#34;$O\u0026#34; ]; then \\ echo -n .; \\ else \\ echo \u0026#34; $T seconds $((N-O)) certs\u0026#34;; O=$N; T=0; echo -n $N\\ ; fi; \\ T=$((T+1)); sleep 1; done 59250 ......... 10 seconds 5244 certs 64494 ......... 10 seconds 5000 certs 69494 ......... 10 seconds 5000 certs 74494 ......... 10 seconds 5000 certs 79494 ......... 10 seconds 5256 certs 79494 ......... 10 seconds 5256 certs 84750 ......... 
10 seconds 5244 certs 89994 ......... 10 seconds 5256 certs 95250 ......... 10 seconds 5000 certs 100250 ......... 10 seconds 5000 certs 105250 ......... 10 seconds 5000 certs I learn two things. First, the checkpoint interval in this posix variant is 10 seconds, compared to the 5 seconds of the aws variant I tested before. I dive into the code, because there doesn\u0026rsquo;t seem to be a --checkpoint_interval flag. In the tessera library, I find DefaultCheckpointInterval which is set to 10 seconds. I change it to be 2 seconds instead, and restart the posix binary:\n238250 . 2 seconds 1000 certs 239250 . 2 seconds 1000 certs 240250 . 2 seconds 1000 certs 241250 . 2 seconds 1000 certs 242250 . 2 seconds 1000 certs 243250 . 2 seconds 1000 certs 244250 . 2 seconds 1000 certs Very nice! Maybe I can write a few more certs? I restart the hammer with 5000/s, which somewhat to my surprise, ends up serving!\n642608 . 2 seconds 6155 certs 648763 . 2 seconds 10256 certs 659019 . 2 seconds 9237 certs 668256 . 2 seconds 8800 certs 677056 . 2 seconds 8729 certs 685785 . 2 seconds 8237 certs 694022 . 2 seconds 7487 certs 701509 . 2 seconds 8572 certs 710081 . 2 seconds 7413 certs The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly find out that the hammer is completely saturating the CPU on the machine, leaving very little room for the posix TesseraCT to serve. I\u0026rsquo;m going to need more machines!\nSo I start a hammer loadtester on the two now-idle MinIO servers, and run them at about 6000qps each, for a total of 12000 certs/sec. And my little posix binary is keeping up like a champ:\n2987169 . 2 seconds 23040 certs 3010209 . 2 seconds 23040 certs 3033249 . 2 seconds 21760 certs 3055009 . 2 seconds 21504 certs 3076513 . 2 seconds 23808 certs 3100321 . 2 seconds 22528 certs One thing is reasonably clear, the posix TesseraCT is CPU bound, not disk bound. The CPU is now running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The NetAPP enterprise solid state drives are not impressed:\npim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100 capacity operations bandwidth pool alloc free read write read write -------------------------- ----- ----- ----- ----- ----- ----- ssd-vol0 11.4G 733G 0 3.13K 0 117M mirror-0 11.4G 733G 0 3.13K 0 117M wwn-0x5002538a05302930 - - 0 1.04K 0 39.1M wwn-0x5002538a053069f0 - - 0 1.06K 0 39.1M wwn-0x5002538a06313ed0 - - 0 1.02K 0 39.1M -------------------------- ----- ----- ----- ----- ----- ----- pim@ctlog-test:~/src/tesseract$ zpool iostat -l ssd-vol0 10 capacity operations bandwidth total_wait disk_wait syncq_wait asyncq_wait scrub trim pool alloc free read write read write read write read write read write read write wait wait ---------- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ----- ssd-vol0 14.0G 730G 0 1.48K 0 35.4M - 2ms - 535us - 1us - 3ms - 50ms ssd-vol0 14.0G 730G 0 1.12K 0 23.0M - 1ms - 733us - 2us - 1ms - 44ms ssd-vol0 14.1G 730G 0 1.42K 0 45.3M - 508us - 122us - 914ns - 2ms - 41ms ssd-vol0 14.2G 730G 0 678 0 21.0M - 863us - 144us - 2us - 2ms - - Results OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I\u0026rsquo;m hammering now with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about the static log is that reads are all entirely done by NGINX. 
The only file that isn\u0026rsquo;t cacheable is the checkpoint file which gets updated every two seconds (or ten seconds in the default tessera settings).\nSo I start yet another hammer whose job it is to read back from the static filesystem:\npim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status Active connections: 10556 server accepts handled requests 25302 25302 1492918 Reading: 0 Writing: 1 Waiting: 10555 Active connections: 7791 server accepts handled requests 25764 25764 1727631 Reading: 0 Writing: 1 Waiting: 7790 And I can see that it\u0026rsquo;s keeping up quite nicely. In one minute, it handled (1727631-1492918) or 234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of saturating the ctlog-test machine though:\nBut after a little bit of fiddling, I can assert my conclusion:\nConclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX\nWhat\u0026rsquo;s Next I am going to offer such a machine in production together with Antonis Chariton and Jeroen Massar. I plan to do a few additional things:\nTest Sunlight as well on the same hardware. It would be nice to see a comparison between write rates of the two implementations. Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the local_signer.go and some Prometheus monitoring of the posix binary. Install and launch both under *.ct.ipng.ch, which in itself deserves its own report, showing how I intend to do log cycling and care/feeding, as well as report on the real production experience running these CT Logs. ","date":"2025-07-26","desc":" Introduction There once was a Dutch company called [DigiNotar], as the name suggests it was a form of digital notary, and they were in the business of issuing security certificates. Unfortunately, in June of 2011, their IT infrastructure was compromised and subsequently it issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.\nGoogle launched a project called Certificate Transparency, because it was becoming more common that the root of trust given to Certification Authorities could no longer be unilaterally trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).\n","permalink":"https://ipng.ch/s/articles/2025/07/26/certificate-transparency-part-1-tesseract/","section":"articles","title":"Certificate Transparency - Part 1 - TesseraCT"},{"contents":" Introduction You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I\u0026rsquo;m the very last on the planet to learn about something cool. My latest \u0026ldquo;A-Ha!\u0026quot;-moment was when I was configuring the eVPN fabric for [Frys-IX], and I wrote up an article about it [here] back in April.\nI can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased Lines, and these are straight forward because they typically only have two endpoints. A \u0026ldquo;regular\u0026rdquo; VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a look at an article on [L2 Gymnastics] for that. 
But the real kicker is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And that is a whole other level of awesome.\nRecap: VPP today VPP: VxLAN The current VPP VxLAN tunnel plugin does point-to-point tunnels, that is, they are configured with a source address, destination address, destination port and VNI. As I mentioned, a point-to-point ethernet transport is configured very easily:\nvpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 instance 0 vpp0# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/0 vpp0# set int l2 xconnect HundredGigabitEthernet10/0/0 vxlan_tunnel0 vpp0# set int state vxlan_tunnel0 up vpp0# set int state HundredGigabitEthernet10/0/0 up vpp1# create vxlan tunnel src 192.0.2.254 dst 192.0.2.1 vni 8298 instance 0 vpp1# set int l2 xconnect vxlan_tunnel0 HundredGigabitEthernet10/0/1 vpp1# set int l2 xconnect HundredGigabitEthernet10/0/1 vxlan_tunnel0 vpp1# set int state vxlan_tunnel0 up vpp1# set int state HundredGigabitEthernet10/0/1 up And with that, vpp0:Hu10/0/0 is cross connected with vpp1:Hu10/0/1 and ethernet flows between the two.\nVPP: Bridge Domains Now consider a VPLS with five different routers. While it\u0026rsquo;s possible to create a bridge-domain and add some local ports and four other VxLAN tunnels:\nvpp0# create bridge-domain 8298 vpp0# set int l2 bridge HundredGigabitEthernet10/0/1 8298 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 vni 8298 instance 0 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.3 vni 8298 instance 1 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.4 vni 8298 instance 2 vpp0# create vxlan tunnel src 192.0.2.1 dst 192.0.2.5 vni 8298 instance 3 vpp0# set int l2 bridge vxlan_tunnel0 8298 vpp0# set int l2 bridge vxlan_tunnel1 8298 vpp0# set int l2 bridge vxlan_tunnel2 8298 vpp0# set int l2 bridge vxlan_tunnel3 8298 To make this work, I will have to replicate this configuration to all other vpp1-vpp4 routers. While it does work, it\u0026rsquo;s really not very practical. When other VPP instances get added to a VPLS, every other router will have to have a new VxLAN tunnel created and added to its local bridge domain. Consider 1000s of VPLS instances on 100s of routers: it would yield ~100'000 VxLAN tunnels on every router, yikes!\nSuch a configuration reminds me in a way of iBGP in a large network: the naive approach is to have a full mesh of all routers speaking to all other routers, but that quickly becomes a maintenance headache. The canonical solution for this is to create iBGP Route Reflectors to which every router connects, and their job is to redistribute routing information between the fleet of routers. This turns the iBGP problem from an O(N^2) to an O(N) problem: all 1'000 routers connect to, say, three regional route reflectors for a total of 3'000 BGP connections, which is much better than ~1'000'000 BGP connections in the naive approach.\nRecap: eVPN Moving parts The reason I got so enthusiastic when I was playing with Arista and Nokia\u0026rsquo;s eVPN stuff is that it requires very little dataplane configuration, and a relatively intuitive controlplane configuration:\nDataplane: For each L2 broadcast domain (be it a L2XC or a Bridge Domain), really all I need is a single VxLAN interface with a given VNI, which should be able to send encapsulated ethernet frames to one or more other speakers in the same domain.
Controlplane: I will need to learn MAC addresses locally, and inform some BGP eVPN implementation of who-lives-where. Other VxLAN speakers learn of the MAC addresses I own, and will send me encapsulated ethernet for those addresses. Dataplane: For unknown layer2 destinations, like Broadcast, Unknown Unicast, and Multicast (BUM) traffic, I will want to keep track of which other VxLAN speakers these packets should be flooded to. I make note that this is not that different from flooding the packets to local interfaces, except here it\u0026rsquo;d be flooding them to remote VxLAN endpoints. ControlPlane: Flooding L2 traffic across wide area networks is typically considered icky, so a few tricks might be optionally deployed. Since the controlplane already knows which MAC lives where, it may as well also make note of any local IPv4 ARP and IPv6 neighbor discovery replies and teach its peers which IPv4/IPv6 addresses live where: a distributed neighbor table. For the controlplane parts, [FRRouting] has a working implementation for L2 (MAC-VRF) and L3 (IP-VRF). My favorite, [Bird], is slowly catching up, and has a few of these controlplane parts already working (mostly MAC-VRF). Commercial vendors like Arista, Nokia, Juniper, Cisco are ready to go. If we want VPP to interoperate, we may need to make a few changes.\nVPP: Changes needed Dynamic VxLAN I propose two changes to the VxLAN plugin, or perhaps a new plugin that changes the behavior, so that we don\u0026rsquo;t have to break any performance or functional promises to existing users. This new VxLAN interface behavior changes in the following ways:\nEach VxLAN interface has a local L2FIB attached to it: the keys are MAC addresses and the values are remote VTEPs. In its simplest form, the values would be just IPv4 or IPv6 addresses, because I can re-use the VNI and port information from the tunnel definition itself.\nEach VxLAN interface has a local flood-list attached to it. This list contains remote VTEPs that I am supposed to send \u0026lsquo;flood\u0026rsquo; packets to. Similar to the Bridge Domain, when packets are marked for flooding, I will need to prepare and replicate them, sending them to each VTEP.\nA set of APIs will be needed to manipulate these:\nInterface: I will need to have an interface create, delete and list call, which will be able to maintain the interfaces, their metadata like source address, source/destination port, VNI and such. L2FIB: I will need to add, replace, delete, and list which MAC addresses go where. With such a table, each time a packet is handled for a given Dynamic VxLAN interface, the dst_addr can be written into the packet. Flooding: For those packets that are not unicast (BUM), I will need to be able to add, remove and list which VTEPs should receive this packet. It would be pretty dope if the configuration looked something like this:\nvpp# create evpn-vxlan src \u0026lt;v46address\u0026gt; dst-port \u0026lt;port\u0026gt; vni \u0026lt;vni\u0026gt; instance \u0026lt;id\u0026gt; vpp# evpn-vxlan l2fib \u0026lt;iface\u0026gt; mac \u0026lt;mac\u0026gt; dst \u0026lt;v46address\u0026gt; [del] vpp# evpn-vxlan flood \u0026lt;iface\u0026gt; dst \u0026lt;v46address\u0026gt; [del] The VxLAN underlay transport can be either IPv4 or IPv6. Of course manipulating L2FIB or Flood destinations must match the address family of an interface of type evpn-vxlan.
A practical example might be:\nvpp# create evpn-vxlan src 2001:db8::1 dst-port 4789 vni 8298 instance 6 vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:02 dst 2001:db8::2 vpp# evpn-vxlan l2fib evpn-vxlan0 mac 00:01:02:82:98:03 dst 2001:db8::3 vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::2 vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::3 vpp# evpn-vxlan flood evpn-vxlan0 dst 2001:db8::4 By the way, while this could be a new plugin, it could also just be added to the existing VxLAN plugin. One way in which I might do this when creating a normal vxlan tunnel is to allow for its destination address to be either 0.0.0.0 for IPv4 or :: for IPv6. That would signal \u0026lsquo;dynamic\u0026rsquo; tunneling, upon which the L2FIB and Flood lists are used. It would slow down each VxLAN packet by the time it takes to call ip46_address_is_zero() which is only a handful of clocks.\nBridge Domain It\u0026rsquo;s important to understand that L2 learning is required for eVPN to function. Each router needs to be able to tell the iBGP eVPN session which MAC addresses should be forwarded to it. This rules out the simple case of L2XC because there, no learning is performed. The corollary is that a bridge-domain is required for any form of eVPN.\nThe L2 code in VPP already does most of what I\u0026rsquo;d need. It maintains an L2FIB in vnet/l2/l2_fib.c, which is keyed by bridge-id and MAC address, and its values are a 64 bit structure that points essentially to a sw_if_index output interface. The L2FIB of the eVPN needs a bit more information though, notably an ip46address struct to know which VTEP to send to. It\u0026rsquo;s tempting to add this extra data to the bridge domain code. I would recommend against it, because other implementations, for example MPLS, GENEVE or Carrier Pigeon IP, may need more than just the destination address. Even the VxLAN implementation I\u0026rsquo;m thinking about might want to be able to override other things like the destination port for a given VTEP, or even the VNI. Putting all of this stuff in the bridge-domain code will just clutter it, for all users, not just those users who might want eVPN.\nSimilarly, one might argue it is tempting to re-use/extend the behavior in vnet/l2/l2_flood.c, because if it\u0026rsquo;s already replicating BUM traffic, why not replicate it many times over the flood list for any member interface that happens to be a dynamic VxLAN interface? This would be a bad idea for a few reasons. Firstly, it is not guaranteed that the VxLAN plugin is loaded, and in doing this, I would leak internal details of VxLAN into the bridge-domain code. Secondly, the l2_flood.c code would potentially get messy if other types were added (like the MPLS and GENEVE above).\nA reasonable request is to mark such BUM frames once in the existing L2 code and when handing the replicated packet into the VxLAN node, to see the is_bum marker and once again replicate \u0026ndash; in the vxlan plugin \u0026ndash; these packets to the VTEPs in our local flood-list. Although it is a bit more work overall, this approach only requires a tiny change in the l2_flood.c code (the marking), and will keep all the logic tucked away where it is relevant, derisking the VPP vnet codebase.\nFundamentally, I think the cleanest design is to keep the dynamic VxLAN interface fully self-contained, and it would therefore maintain its own L2FIB and Flooding logic.
The only thing I would add to the L2 codebase is some form of BUM marker to allow for efficient flooding.\nControl Plane There are a few things the control plane has to do. Some external agent, like FRR or Bird, will be receiving a few types of eVPN messages. The ones I\u0026rsquo;m interested in are:\nType 2: MAC/IP Advertisement Route On the way in, these should be fed to the VxLAN L2FIB belonging to the bridge-domain. On the way out, learned addresses should be advertised to peers. Regarding IPv4/IPv6 addresses, that is the ARP / ND tables: we can talk about those later. Type 3: Inclusive Multicast Ethernet Tag Route On the way in, these will populate the VxLAN Flood list belonging to the bridge-domain. On the way out, each bridge-domain should advertise itself as IMET to peers. Type 5: IP Prefix Route Similar to IP information in Type 2, we can talk about those later once L3VPN/eVPN is needed. The \u0026lsquo;on the way in\u0026rsquo; stuff can be easily done with my proposed APIs in the Dynamic VxLAN (or a new eVPN VxLAN) plugin. Adding, removing, listing L2FIB and Flood lists is easy as far as VPP is concerned. It\u0026rsquo;s just that the controlplane implementation needs to somehow feed the API, so an external program may be needed, or alternatively the Linux Control Plane netlink plugin might be used to consume this information.\nThe \u0026lsquo;on the way out\u0026rsquo; stuff is a bit trickier. I will need to listen for the creation of new broadcast domains and associate them with the right IMET announcements, and for each MAC address learned, pick it up and advertise it into eVPN. Later, if ever ARP and ND proxying becomes important, I\u0026rsquo;ll have to revisit the bridge-domain feature to do IPv4 ARP and IPv6 Neighbor Discovery, and replace it with some code that populates the IPv4/IPv6 parts of the Type 2 messages on the way out, and similarly on the way in, populates an L3 neighbor cache for the bridge domain, so ARP and ND replies can be synthesized based on what we\u0026rsquo;ve learned in eVPN.\nDemonstration VPP: Current VxLAN I\u0026rsquo;ll build a small demo environment on Summer to show how the interaction of VxLAN and Bridge Domain works today:\nvpp# create tap host-if-name dummy0 host-mtu-size 9216 host-ip4-addr 192.0.2.1/24 vpp# set int state tap0 up vpp# set int ip address tap0 192.0.2.1/24 vpp# set ip neighbor tap0 192.0.2.254 01:02:03:82:98:fe static vpp# set ip neighbor tap0 192.0.2.2 01:02:03:82:98:02 static vpp# set ip neighbor tap0 192.0.2.3 01:02:03:82:98:03 static vpp# create vxlan tunnel src 192.0.2.1 dst 192.0.2.254 vni 8298 vpp# set int state vxlan_tunnel0 up vpp# create tap host-if-name vpptap0 host-mtu-size 9216 hw-addr 02:fe:64:dc:1b:82 vpp# set int state tap1 up vpp# create bridge-domain 8298 vpp# set int l2 bridge tap1 8298 vpp# set int l2 bridge vxlan_tunnel0 8298 I\u0026rsquo;ve created a tap device called dummy0 and given it an IPv4 address. Normally, I would use some DPDK or RDMA interface like TenGigabitEthernet10/0/0. Then I\u0026rsquo;ll populate some static ARP entries. Again, normally this would just be \u0026lsquo;use normal routing\u0026rsquo;. However, for the purposes of this demonstration, it helps to use a TAP device, as any packets I make VPP send to 192.0.2.254 and so on can be captured with tcpdump in Linux in addition to trace add in VPP.\nThen, I create a VxLAN tunnel with a default destination of 192.0.2.254 and the given VNI. Next, I create a TAP interface called vpptap0 with the given MAC address.
Finally, I bind these two interfaces together in a bridge-domain.\nI proceed to write a small ScaPY program:\n#!/usr/bin/env python3 from scapy.all import Ether, IP, UDP, Raw, sendp pkt = Ether(dst=\u0026#34;01:02:03:04:05:02\u0026#34;, src=\u0026#34;02:fe:64:dc:1b:82\u0026#34;, type=0x0800) / IP(src=\u0026#34;192.168.1.1\u0026#34;, dst=\u0026#34;192.168.1.2\u0026#34;) / UDP(sport=8298, dport=7) / Raw(load=b\u0026#34;ping\u0026#34;) print(pkt) sendp(pkt, iface=\u0026#34;vpptap0\u0026#34;) pkt = Ether(dst=\u0026#34;01:02:03:04:05:03\u0026#34;, src=\u0026#34;02:fe:64:dc:1b:82\u0026#34;, type=0x0800) / IP(src=\u0026#34;192.168.1.1\u0026#34;, dst=\u0026#34;192.168.1.3\u0026#34;) / UDP(sport=8298, dport=7) / Raw(load=b\u0026#34;ping\u0026#34;) print(pkt) sendp(pkt, iface=\u0026#34;vpptap0\u0026#34;) What will happen is, the ScaPY program will emit these frames into device vpptap0 which is in bridge-domain 8298. The bridge will learn our src MAC 02:fe:64:dc:1b:82, and look up the dst MAC 01:02:03:04:05:02, and because there hasn\u0026rsquo;t been traffic yet, it\u0026rsquo;ll flood to all member ports, one of which is the VxLAN tunnel. VxLAN will then encapsulate the packets to the other side of the tunnel.\npim@summer:~$ sudo ./vxlan-test.py Ether / IP / UDP 192.168.1.1:8298 \u0026gt; 192.168.1.2:echo / Raw Ether / IP / UDP 192.168.1.1:8298 \u0026gt; 192.168.1.3:echo / Raw pim@summer:~$ sudo tcpdump -evni dummy0 10:50:35.310620 02:fe:72:52:38:53 \u0026gt; 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82) 192.0.2.1.6345 \u0026gt; 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298 02:fe:64:dc:1b:82 \u0026gt; 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32) 192.168.1.1.8298 \u0026gt; 192.168.1.2.7: UDP, length 4 10:50:35.362552 02:fe:72:52:38:53 \u0026gt; 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82) 192.0.2.1.23916 \u0026gt; 192.0.2.254.4789: VXLAN, flags [I] (0x08), vni 8298 02:fe:64:dc:1b:82 \u0026gt; 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32) 192.168.1.1.8298 \u0026gt; 192.168.1.3.7: UDP, length 4 I want to point out that nothing, so far, is special. All of this works with upstream VPP just fine. I can see two VxLAN encapsulated packets, both destined to 192.0.2.254:4789. Cool.\nDynamic VPP VxLAN I wrote a prototype for a Dynamic VxLAN tunnel in [43433]. The good news is, this works. 
The bad news is, I think I\u0026rsquo;ll want to discuss my proposal (this article) with the community before going further down a potential rabbit hole.\nWith my gerrit patched in, I can do the following:\nvpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:02 dst 192.0.2.2 Added VXLAN dynamic destination for 01:02:03:04:05:02 on vxlan_tunnel0 dst 192.0.2.2 vpp# vxlan l2fib vxlan_tunnel0 mac 01:02:03:04:05:03 dst 192.0.2.3 Added VXLAN dynamic destination for 01:02:03:04:05:03 on vxlan_tunnel0 dst 192.0.2.3 vpp# show vxlan l2fib VXLAN Dynamic L2FIB entries: MAC Interface Destination Port VNI 01:02:03:04:05:02 vxlan_tunnel0 192.0.2.2 4789 8298 01:02:03:04:05:03 vxlan_tunnel0 192.0.2.3 4789 8298 Dynamic L2FIB entries: 2 I\u0026rsquo;ve instructed the VxLAN tunnel to change the tunnel destination based on the destination MAC.\nI run the script and tcpdump again:\npim@summer:~$ sudo tcpdump -evni dummy0 11:16:53.834619 02:fe:fe:ae:0d:a3 \u0026gt; 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3945 (-\u0026gt;3997)!) 192.0.2.1.6345 \u0026gt; 192.0.2.2.4789: VXLAN, flags [I] (0x08), vni 8298 02:fe:64:dc:1b:82 \u0026gt; 01:02:03:04:05:02, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32) 192.168.1.1.8298 \u0026gt; 192.168.1.2.7: UDP, length 4 11:16:53.882554 02:fe:fe:ae:0d:a3 \u0026gt; 01:02:03:82:98:fe, ethertype IPv4 (0x0800), length 96: (tos 0x0, ttl 253, id 0, offset 0, flags [none], proto UDP (17), length 82, bad cksum 3944 (-\u0026gt;3996)!) 192.0.2.1.23916 \u0026gt; 192.0.2.3.4789: VXLAN, flags [I] (0x08), vni 8298 02:fe:64:dc:1b:82 \u0026gt; 01:02:03:04:05:03, ethertype IPv4 (0x0800), length 46: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 32) 192.168.1.1.8298 \u0026gt; 192.168.1.3.7: UDP, length 4 Two important notes: Firstly, this works! For the MAC address ending in :02, send the packet to 192.0.2.2 instead of the default of 192.0.2.254. Same for the :03 MAC which now goes to 192.0.2.3. Nice! But secondly, the IPv4 header of the VxLAN packets was changed, so there needs to be a call to ip4_header_checksum() inserted somewhere. That\u0026rsquo;s an easy fix.\nWhat\u0026rsquo;s next I want to discuss a few things, perhaps at an upcoming VPP Community meeting. Notably:\nIs the VPP Developer community supportive of adding eVPN support? Does anybody want to help write it with me? Is changing the existing VxLAN plugin appropriate, or should I make a new plugin which adds dynamic endpoints, L2FIB and Flood lists for BUM traffic? Is it acceptable for me to add a BUM marker in l2_flood.c so that I can reuse all the logic from bridge-domain flooding as I extend to also do VTEP flooding? (perhaps later) VxLAN is the canonical underlay, but is there an appetite to extend also to, say, GENEVE or MPLS? (perhaps later) What\u0026rsquo;s a good way to tie in a controlplane like FRRouting or Bird2 into the dataplane (perhaps using a sidecar controller, or perhaps using Linux CP Netlink messages)? ","date":"2025-07-12","desc":" Introduction You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I\u0026rsquo;m the very last on the planet to learn about something cool. 
My latest \u0026ldquo;A-Ha!\u0026quot;-moment was when I was configuring the eVPN fabric for [Frys-IX], and I wrote up an article about it [here] back in April.\nI can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased Lines, and these are straight forward because they typically only have two endpoints. A \u0026ldquo;regular\u0026rdquo; VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a look at an article on [L2 Gymnastics] for that. But the real kicker is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS) or also called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And that is a whole other level of awesome.\n","permalink":"https://ipng.ch/s/articles/2025/07/12/vpp-and-evpn/vxlan-part-1/","section":"articles","title":"VPP and eVPN/VxLAN - Part 1"},{"contents":" Introduction Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Millions of customers of all sizes and industries store, manage, analyze, and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize and analyze data, and configure fine-tuned access controls to meet specific business and compliance requirements.\nAmazon\u0026rsquo;s S3 became the de facto standard object storage system, and there exist several fully open source implementations of the protocol. One of them is MinIO: designed to allow enterprises to consolidate all of their data on a single, private cloud namespace. Architected using the same principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost compared to the public cloud.\nIPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for example [PeerTube], [Mastodon], [Immich], [Pixelfed] and of course [Hugo]. These services all have one thing in common: they tend to use lots of storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives, mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be quite the headache.\nIn a [previous article], I talked through the install of a redundant set of three Minio machines. In this article, I\u0026rsquo;ll start putting them to good use.\nUse Case: Restic [Restic] is a modern backup program that can back up your files from multiple host OS, to many different storage types, easily, effectively, securely, verifiably and freely. With a sales pitch like that, what\u0026rsquo;s not to love? Actually, I am a long-time [BorgBackup] user, and I think I\u0026rsquo;ll keep that running. However, for resilience, and because I\u0026rsquo;ve heard only good things about Restic, I\u0026rsquo;ll make a second backup of the routers, hypervisors, and virtual machines using Restic.\nRestic can use S3 buckets out of the box (incidentally, so can BorgBackup). To configure it, I use a mixture of environment variables and flags. 
But first, let me create a bucket for the backups.\npim@glootie:~$ mc mb chbtl0/ipng-restic pim@glootie:~$ mc admin user add chbtl0/ \u0026lt;key\u0026gt; \u0026lt;secret\u0026gt; pim@glootie:~$ cat \u0026lt;\u0026lt; EOF | tee ipng-restic-access.json { \u0026#34;PolicyName\u0026#34;: \u0026#34;ipng-restic-access\u0026#34;, \u0026#34;Policy\u0026#34;: { \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, \u0026#34;Statement\u0026#34;: [ { \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, \u0026#34;Action\u0026#34;: [ \u0026#34;s3:DeleteObject\u0026#34;, \u0026#34;s3:GetObject\u0026#34;, \u0026#34;s3:ListBucket\u0026#34;, \u0026#34;s3:PutObject\u0026#34; ], \u0026#34;Resource\u0026#34;: [ \u0026#34;arn:aws:s3:::ipng-restic\u0026#34;, \u0026#34;arn:aws:s3:::ipng-restic/*\u0026#34; ] } ] }, } EOF pim@glootie:~$ mc admin policy create chbtl0/ ipng-restic-access.json pim@glootie:~$ mc admin policy attach chbtl0/ ipng-restic-access --user \u0026lt;key\u0026gt; First, I\u0026rsquo;ll create a bucket called ipng-restic. Then, I\u0026rsquo;ll create a user with a given secret key. To protect the innocent, and my backups, I\u0026rsquo;ll not disclose them. Next, I\u0026rsquo;ll create an IAM policy that allows for Get/List/Put/Delete to be performed on the bucket and its contents, and finally I\u0026rsquo;ll attach this policy to the user I just created.\nTo run a Restic backup, I\u0026rsquo;ll first have to create a so-called repository. The repository has a location and a password, which Restic uses to encrypt the data. Because I\u0026rsquo;m using S3, I\u0026rsquo;ll also need to specify the key and secret:\nroot@glootie:~# RESTIC_PASSWORD=\u0026#34;changeme\u0026#34; root@glootie:~# RESTIC_REPOSITORY=\u0026#34;s3:https://s3.chbtl0.ipng.ch/ipng-restic/$(hostname)/\u0026#34; root@glootie:~# AWS_ACCESS_KEY_ID=\u0026#34;\u0026lt;key\u0026gt;\u0026#34; root@glootie:~# AWS_SECRET_ACCESS_KEY=\u0026#34;\u0026lt;secret\u0026gt;\u0026#34; root@glootie:~# export RESTIC_PASSWORD RESTIC_REPOSITORY AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY root@glootie:~# restic init created restic repository 807cf25e85 at s3:https://s3.chbtl0.ipng.ch/ipng-restic/glootie.ipng.ch/ Restic prints out the fingerprint of the repository it just created. Taking a look at the MinIO install:\npim@glootie:~$ mc stat chbtl0/ipng-restic/glootie.ipng.ch/ Name : config Date : 2025-06-01 12:01:43 UTC Size : 155 B ETag : 661a43f72c43080649712e45da14da3a Type : file Metadata : Content-Type: application/octet-stream Name : keys/ Date : 2025-06-01 12:03:33 UTC Type : folder Cool. Now I\u0026rsquo;m ready to make my first full backup:\nroot@glootie:~# ARGS=\u0026#34;--exclude /proc --exclude /sys --exclude /dev --exclude /run\u0026#34; root@glootie:~# ARGS=\u0026#34;$ARGS --exclude-if-present .nobackup\u0026#34; root@glootie:~# restic backup $ARGS / ... processed 1141426 files, 131.111 GiB in 15:12 snapshot 34476c74 saved Once the backup completes, the Restic authors advise me to also do a check of the repository, and to prune it so that it keeps a finite number of daily, weekly and monthly backups.
My further journey for Restic looks a bit like this:\nroot@glootie:~# restic check using temporary cache in /tmp/restic-check-cache-2712250731 create exclusive lock for repository load indexes check all packs check snapshots, trees and blobs [0:04] 100.00% 1 / 1 snapshots no errors were found root@glootie:~# restic forget --prune --keep-daily 8 --keep-weekly 5 --keep-monthly 6 repository 34476c74 opened (version 2, compression level auto) Applying Policy: keep 8 daily, 5 weekly, 6 monthly snapshots keep 1 snapshots: ID Time Host Tags Reasons Paths --------------------------------------------------------------------------------- 34476c74 2025-06-01 12:18:54 glootie.ipng.ch daily snapshot / weekly snapshot monthly snapshot ---------------------------------------------------------------------------------- 1 snapshots Right on! I proceed to update the Ansible configs at IPng to roll this out against the entire fleet of 152 hosts at IPng Networks. I do this in a little tool called bitcron, which I wrote for a previous company I worked at: [BIT] in the Netherlands. Bitcron allows me to create relatively elegant cronjobs that can raise warnings, errors and fatal issues. If no issues are found, an e-mail can be sent to a bitbucket address, but if warnings or errors are found, a different monitored address will be used. Bitcron is kind of cool, and I wrote it in 2001. Maybe I\u0026rsquo;ll write about it, for old time\u0026rsquo;s sake. I wonder if the folks at BIT still use it?\nUse Case: NGINX OK, with the first use case out of the way, I turn my attention to a second - in my opinion more interesting - use case. In the [previous article], I created a public bucket called ipng-web-assets in which I stored 6.50GB of website data belonging to the IPng website, and some material I posted when I was on my [Sabbatical] last year.\nMinIO: Bucket Replication First things first: redundancy. These web assets are currently pushed to all four nginx machines, and statically served. If I were to replace them with a single S3 bucket, I would create a single point of failure, and that\u0026rsquo;s no bueno!\nOff I go, creating a replicated bucket using two MinIO instances (chbtl0 and ddln0):\npim@glootie:~$ mc mb ddln0/ipng-web-assets pim@glootie:~$ mc anonymous set download ddln0/ipng-web-assets pim@glootie:~$ mc admin user add ddln0/ \u0026lt;replkey\u0026gt; \u0026lt;replsecret\u0026gt; pim@glootie:~$ cat \u0026lt;\u0026lt; EOF | tee ipng-web-assets-access.json { \u0026#34;PolicyName\u0026#34;: \u0026#34;ipng-web-assets-access\u0026#34;, \u0026#34;Policy\u0026#34;: { \u0026#34;Version\u0026#34;: \u0026#34;2012-10-17\u0026#34;, \u0026#34;Statement\u0026#34;: [ { \u0026#34;Effect\u0026#34;: \u0026#34;Allow\u0026#34;, \u0026#34;Action\u0026#34;: [ \u0026#34;s3:DeleteObject\u0026#34;, \u0026#34;s3:GetObject\u0026#34;, \u0026#34;s3:ListBucket\u0026#34;, \u0026#34;s3:PutObject\u0026#34; ], \u0026#34;Resource\u0026#34;: [ \u0026#34;arn:aws:s3:::ipng-web-assets\u0026#34;, \u0026#34;arn:aws:s3:::ipng-web-assets/*\u0026#34; ] } ] }, } EOF pim@glootie:~$ mc admin policy create ddln0/ ipng-web-assets-access.json pim@glootie:~$ mc admin policy attach ddln0/ ipng-web-assets-access --user \u0026lt;replkey\u0026gt; pim@glootie:~$ mc replicate add chbtl0/ipng-web-assets \\ --remote-bucket https://\u0026lt;key\u0026gt;:\u0026lt;secret\u0026gt;@s3.ddln0.ipng.ch/ipng-web-assets What happens next is pure magic. 
I\u0026rsquo;ve told chbtl0 that I want it to replicate all existing and future changes to that bucket to its neighbor ddln0. Only minutes later, I check the replication status, just to see that it\u0026rsquo;s already done:\npim@glootie:~$ mc replicate status chbtl0/ipng-web-assets Replication status since 1 hour s3.ddln0.ipng.ch Replicated: 142 objects (6.5 GiB) Queued: ● 0 objects, 0 B (avg: 4 objects, 915 MiB ; max: 0 objects, 0 B) Workers: 0 (avg: 0; max: 0) Transfer Rate: 15 kB/s (avg: 88 MB/s; max: 719 MB/s Latency: 3ms (avg: 3ms; max: 7ms) Link: ● online (total downtime: 0 milliseconds) Errors: 0 in last 1 minute; 0 in last 1hr; 0 since uptime Configured Max Bandwidth (Bps): 644 GB/s Current Bandwidth (Bps): 975 B/s pim@summer:~/src/ipng-web-assets$ mc ls ddln0/ipng-web-assets/ [2025-06-01 12:42:22 CEST] 0B ipng.ch/ [2025-06-01 12:42:22 CEST] 0B sabbatical.ipng.nl/ MinIO has pumped the data from bucket ipng-web-assets to the other machine at an average of 88MB/s with a peak throughput of 719MB/s (probably for the larger VM images). And indeed, looking at the remote machine, it is fully caught up after the push, within only a minute or so with a completely fresh copy. Nice!\nMinIO: Missing directory index I take a look at what I just built, on the following URL:\nhttps://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/IMG_0406_0.mp4 That checks out, and I can see the mess that was my room when I first went on sabbatical. By the way, I totally cleaned it up, see [here] for proof. I can\u0026rsquo;t, however, see the directory listing:\npim@glootie:~$ curl https://ipng-web-assets.s3.ddln0.ipng.ch/sabbatical.ipng.nl/media/vdo/ \u0026lt;?xml version=\u0026#34;1.0\u0026#34; encoding=\u0026#34;UTF-8\u0026#34;?\u0026gt; \u0026lt;Error\u0026gt; \u0026lt;Code\u0026gt;NoSuchKey\u0026lt;/Code\u0026gt; \u0026lt;Message\u0026gt;The specified key does not exist.\u0026lt;/Message\u0026gt; \u0026lt;Key\u0026gt;sabbatical.ipng.nl/media/vdo/\u0026lt;/Key\u0026gt; \u0026lt;BucketName\u0026gt;ipng-web-assets\u0026lt;/BucketName\u0026gt; \u0026lt;Resource\u0026gt;/sabbatical.ipng.nl/media/vdo/\u0026lt;/Resource\u0026gt; \u0026lt;RequestId\u0026gt;1844EC0CFEBF3C5F\u0026lt;/RequestId\u0026gt; \u0026lt;HostId\u0026gt;dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8\u0026lt;/HostId\u0026gt; \u0026lt;/Error\u0026gt; That\u0026rsquo;s unfortunate, because some of the IPng articles link to a directory full of files, which I\u0026rsquo;d like to be shown so that my readers can navigate through the directories. Surely I\u0026rsquo;m not the first to encounter this? And sure enough, I\u0026rsquo;m not: I find a [ref] by user glowinthedark, who wrote a little Python script that generates index.html files for their Caddy file server. I\u0026rsquo;ll take me some of that Python, thank you!\nWith the following little script, my setup is complete:\npim@glootie:~/src/ipng-web-assets$ cat push.sh #!/usr/bin/env bash echo \u0026#34;Generating index.html files ...\u0026#34; for D in */media; do echo \u0026#34;* Directory $D\u0026#34; ./genindex.py -r $D done echo \u0026#34;Done (genindex)\u0026#34; echo \u0026#34;\u0026#34; echo \u0026#34;Mirroring directory to S3 Bucket\u0026#34; mc mirror --remove --overwrite .
chbtl0/ipng-web-assets/ echo \u0026#34;Done (mc mirror)\u0026#34; echo \u0026#34;\u0026#34; pim@glootie:~/src/ipng-web-assets$ ./push.sh Only a few seconds after I run ./push.sh, the replication is complete and I have two identical copies of my media:\nhttps://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/ https://ipng-web-assets.s3.ddln0.ipng.ch/ipng.ch/media/ NGINX: Proxy to Minio Before moving to S3 storage, my NGINX frontends all kept a copy of the IPng media on local NVME disk. That\u0026rsquo;s great for reliability, as each NGINX instance is completely hermetic and standalone. However, it\u0026rsquo;s not great for scaling: the current NGINX instances only have 16GB of local storage, and I\u0026rsquo;d rather not have my static web asset data outgrow that filesystem. From before, I already had an NGINX config that served the Hugo static data from /var/www/ipng.ch/ and the /media subdirectory from a different directory in /var/www/ipng-web-assets/ipng.ch/media.\nMoving to redundant S3 storage backends is straightforward:\nupstream minio_ipng { least_conn; server minio0.chbtl0.net.ipng.ch:9000; server minio0.ddln0.net.ipng.ch:9000; } server { ... location / { root /var/www/ipng.ch/; } location /media { proxy_set_header Host $http_host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_connect_timeout 300; proxy_http_version 1.1; proxy_set_header Connection \u0026#34;\u0026#34;; chunked_transfer_encoding off; rewrite (.*)/$ $1/index.html; proxy_pass http://minio_ipng/ipng-web-assets/ipng.ch/media; } } I want to make note of a few things:\nThe upstream definition here uses IPng Site Local entrypoints, considering the NGINX servers all have direct MTU=9000 access to the MinIO instances. I\u0026rsquo;ll put both in there, in a round-robin configuration favoring the replica with the fewest connections. Deeplinking to directory names without the trailing /index.html would serve a 404 from the backend, so I\u0026rsquo;ll intercept these and rewrite directory requests to always include the /index.html. The upstream endpoint used is path-based, that is to say it has the bucket name and website name included. This whole location used to be simply root /var/www/ipng-web-assets/ipng.ch/media/ so the mental change is quite small. NGINX: Caching After deploying the S3 upstream on all IPng websites, I can delete the old /var/www/ipng-web-assets/ directory and reclaim about 7GB of diskspace. This gives me an idea \u0026hellip;\nOn the one hand it\u0026rsquo;s great that I will pull these assets from Minio and all, but at the same time, it\u0026rsquo;s a tad inefficient to retrieve them from, say, Zurich to Amsterdam just to serve them onto the internet again. If at any time something on the IPng website goes viral, it\u0026rsquo;d be nice to be able to serve them directly from the edge, right?\nA webcache. What could possibly go wrong :)\nNGINX is really, really good at caching content. It has a powerful engine to store, scan, revalidate and match any content and upstream headers. It\u0026rsquo;s also very well documented, so I take a look at the proxy module\u0026rsquo;s documentation [here] and in particular a useful [blog] on their website.\nThe first thing I need to do is create what is called a key zone, which is a region of memory in which URL keys are stored with some metadata.
Having a copy of the keys in memory enables NGINX to quickly determine if a request is a HIT or a MISS without having to go to disk, greatly speeding up the check.\nIn /etc/nginx/conf.d/ipng-cache.conf I add the following NGINX cache:\nproxy_cache_path /var/www/nginx-cache levels=1:2 keys_zone=ipng_cache:10m max_size=8g inactive=24h use_temp_path=off; With this statement, I\u0026rsquo;ll create a 2-level directory hierarchy, and allocate 10MB of space, which should hold on the order of 100K entries. The maximum size I\u0026rsquo;ll allow the cache to grow to is 8GB, and I\u0026rsquo;ll mark any object inactive if it\u0026rsquo;s not been referenced for 24 hours. I learn that inactive is different from expired content. If a cache element has expired, but NGINX can\u0026rsquo;t reach the upstream for a new copy, it can be configured to serve an inactive (stale) copy from the cache. That\u0026rsquo;s dope, as it serves as an extra layer of defence in case the network or all available S3 replicas take the day off. I\u0026rsquo;ll ask NGINX to avoid writing objects first to a tmp directory and then moving them into the /var/www/nginx-cache directory. These are recommendations I grab from the manual.\nWithin the location block I configured above, I\u0026rsquo;m now ready to enable this cache. I\u0026rsquo;ll do that by adding a few include files, which I\u0026rsquo;ll reference in all sites that I want to make use of this cache:\nFirst, to enable the cache, I write the following snippet:\npim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-cache.inc proxy_cache ipng_cache; proxy_ignore_headers Cache-Control; proxy_cache_valid any 1h; proxy_cache_revalidate on; proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504; proxy_cache_background_update on; Then, I find it useful to emit a few debugging HTTP headers, and at the same time I see that Minio emits a bunch of HTTP headers that may not be safe for me to propagate, so I pen two more snippets:\npim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-strip-minio-headers.inc proxy_hide_header x-minio-deployment-id; proxy_hide_header x-amz-request-id; proxy_hide_header x-amz-id-2; proxy_hide_header x-amz-replication-status; proxy_hide_header x-amz-version-id; pim@nginx0-nlams1:~$ cat /etc/nginx/conf.d/ipng-add-upstream-headers.inc add_header X-IPng-Frontend $hostname always; add_header X-IPng-Upstream $upstream_addr always; add_header X-IPng-Upstream-Status $upstream_status always; add_header X-IPng-Cache-Status $upstream_cache_status; With that, I am ready to enable caching of the IPng /media location:\nlocation /media { ... include /etc/nginx/conf.d/ipng-strip-minio-headers.inc; include /etc/nginx/conf.d/ipng-add-upstream-headers.inc; include /etc/nginx/conf.d/ipng-cache.inc; ... } Results I run the Ansible playbook for the NGINX cluster and take a look at the replica at Coloclue in Amsterdam, called nginx0.nlams1.ipng.ch. Notably, it\u0026rsquo;ll have to retrieve the file from a MinIO replica in Zurich (12ms away), so it\u0026rsquo;s expected to take a little while.\nThe first attempt:\npim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \\ https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz ...
\u0026lt; last-modified: Sun, 01 Jun 2025 12:37:52 GMT \u0026lt; x-ipng-frontend: nginx0-nlams1 \u0026lt; x-ipng-cache-status: MISS \u0026lt; x-ipng-upstream: [2001:678:d78:503::b]:9000 \u0026lt; x-ipng-upstream-status: 200 100 711M 100 711M 0 0 26.2M 0 0:00:27 0:00:27 --:--:-- 26.6M OK, that\u0026rsquo;s respectable, I\u0026rsquo;ve read the file at 26MB/s. Of course I just turned on the cache, so the NGINX fetches the file from Zurich while handing it over to my curl here. It notifies me by means of a HTTP header that the cache was a MISS, and then which upstream server it contacted to retrieve the object.\nBut look at what happens the second time I run the same command:\npim@nginx0-nlams1:~$ curl -v -o /dev/null --connect-to ipng.ch:443:localhost:443 \\ https://ipng.ch/media/vpp-proto/vpp-proto-bookworm.qcow2.lrz \u0026lt; last-modified: Sun, 01 Jun 2025 12:37:52 GMT \u0026lt; x-ipng-frontend: nginx0-nlams1 \u0026lt; x-ipng-cache-status: HIT 100 711M 100 711M 0 0 436M 0 0:00:01 0:00:01 --:--:-- 437M Holy moly! First I see the object has the same Last-Modified header, but I now also see that the Cache-Status was a HIT, and there is no mention of any upstream server. I do however see the file come in at a whopping 437MB/s which is 16x faster than over the network!! Nice work, NGINX!\nWhat\u0026rsquo;s Next I\u0026rsquo;m going to deploy the third MinIO replica in Rümlang once the disks arrive. I\u0026rsquo;ll release the ~4TB of disk used currently in Restic backups for the fleet, and put that ZFS capacity to other use. Now, creating services like PeerTube, Mastodon, Pixelfed, Loops, NextCloud and what-have-you, will become much easier for me. And with the per-bucket replication between MinIO deployments, I also think this is a great way to auto-backup important data. First off, it\u0026rsquo;ll be RS8.4 on the MinIO node itself, and secondly, user data will be copied automatically to a neighboring facility.\nI\u0026rsquo;ve convinced myself that S3 storage is a great service to operate, and that MinIO is awesome.\n","date":"2025-06-01","desc":" Introduction Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Millions of customers of all sizes and industries store, manage, analyze, and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize and analyze data, and configure fine-tuned access controls to meet specific business and compliance requirements.\n","permalink":"https://ipng.ch/s/articles/2025/06/01/case-study-minio-s3-part-2/","section":"articles","title":"Case Study: Minio S3 - Part 2"},{"contents":" Introduction Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Millions of customers of all sizes and industries store, manage, analyze, and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. 
With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize and analyze data, and configure fine-tuned access controls to meet specific business and compliance requirements.\nAmazon\u0026rsquo;s S3 became the de facto standard object storage system, and there exist several fully open source implementations of the protocol. One of them is MinIO: designed to allow enterprises to consolidate all of their data on a single, private cloud namespace. Architected using the same principles as the hyperscalers, AIStor delivers performance at scale at a fraction of the cost compared to the public cloud.\nIPng Networks is an Internet Service Provider, but I also dabble in self-hosting things, for example [PeerTube], [Mastodon], [Immich], [Pixelfed] and of course [Hugo]. These services all have one thing in common: they tend to use lots of storage when they grow. At IPng Networks, all hypervisors ship with enterprise SAS flash drives, mostly 1.92TB and 3.84TB. Scaling up each of these services, and backing them up safely, can be quite the headache.\nThis article is for the storage buffs. I\u0026rsquo;ll set up a set of distributed MinIO nodes from scratch.\nPhysical I\u0026rsquo;ll start with the basics. I still have a few Dell R720 servers lying around; they are getting a bit older but still have 24 cores and 64GB of memory. First I need to get me some disks. I order 36 pcs of 16TB SATA enterprise disks, a mixture of Seagate EXOS and Toshiba MG series disks. I once learned (the hard way) that buying a big stack of disks from one production run is a risk - so I\u0026rsquo;ll mix and match the drives.\nThree trays of caddies and a melted credit card later, I have 576TB of SATA disks safely in hand. Each machine will carry 192TB of raw storage. The nice thing about this chassis is that Dell can ship them with 12x 3.5\u0026quot; SAS slots in the front, and 2x 2.5\u0026quot; SAS slots in the rear of the chassis.\nSo I\u0026rsquo;ll install Debian Bookworm on one small 480G SSD in software RAID1.\nCloning an install I have three identical machines so in total I\u0026rsquo;ll want six of these SSDs. I temporarily screw the other five in 3.5\u0026quot; drive caddies and plug them into the first installed Dell, which I\u0026rsquo;ve called minio-proto:\npim@minio-proto:~$ for i in b c d e f; do sudo dd if=/dev/sda of=/dev/sd${i} bs=512 count=1; sudo mdadm --manage /dev/md0 --add /dev/sd${i}1 done pim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=6 pim@minio-proto:~$ watch cat /proc/mdstat pim@minio-proto:~$ for i in a b c d e f; do sudo grub-install /dev/sd$i done The first command takes my installed disk, /dev/sda, and copies the first sector over to the other five. This will give them the same partition table. Next, I\u0026rsquo;ll add the first partition of each disk to the raidset. Then, I\u0026rsquo;ll expand the raidset to have six members, after which the kernel starts a recovery process that syncs the newly added partitions to /dev/md0 (by copying from /dev/sda to all other disks at once). Finally, I\u0026rsquo;ll watch this exciting movie and grab a cup of tea.\nOnce the disks are fully copied, I\u0026rsquo;ll shut down the machine and distribute the disks to their respective Dell R720, two each. Once they boot they will all be identical.
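Before pulling the disks out, it is worth double-checking that the resync has actually finished and that all six members really are active in the array. A quick sanity check could look like this (a sketch, not part of the original session):
pim@minio-proto:~$ cat /proc/mdstat          # no 'recovery' or 'resync' line should remain
pim@minio-proto:~$ sudo mdadm --detail /dev/md0 | grep -E 'State|Devices|Working|Failed'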
I\u0026rsquo;ll need to make sure their hostnames and machine/host-ids are unique, otherwise things like bridges will have overlapping MAC addresses - ask me how I know:\npim@minio-proto:~$ sudo mdadm --grow /dev/md0 --raid-devices=2 pim@minio-proto:~$ sudo rm /etc/ssh/ssh_host* pim@minio-proto:~$ sudo hostname minio0-chbtl0 pim@minio-proto:~$ sudo dpkg-reconfigure openssh-server pim@minio-proto:~$ sudo dd if=/dev/random of=/etc/hostid bs=4 count=1 pim@minio-proto:~$ sudo /usr/bin/dbus-uuidgen \u0026gt; /etc/machine-id pim@minio-proto:~$ sudo reboot After which I have three beautiful and unique machines:\nminio0.chbtl0.net.ipng.ch: which will go into my server rack at the IPng office. minio0.ddln0.net.ipng.ch: which will go to [Daedalean], doing AI since before it was all about vibe coding. minio0.chrma0.net.ipng.ch: which will go to [IP-Max], one of the best ISPs on the planet. 🥰 Deploying Minio The user guide that MinIO provides [ref] is super good, arguably one of the best documented open source projects I\u0026rsquo;ve ever seen. It shows me that I can do three types of install. A \u0026lsquo;Standalone\u0026rsquo; with one disk, a \u0026lsquo;Standalone Multi-Drive\u0026rsquo;, and a \u0026lsquo;Distributed\u0026rsquo; deployment. I decide to make three independent standalone multi-drive installs. This way, I have less shared fate, and will be immune to network partitions (as these are going to be in three different physical locations). I\u0026rsquo;ve also read about per-bucket replication, which will be an excellent way to get geographical distribution and active/active instances to work together.\nI feel good about the single-machine multi-drive decision. I follow the install guide [ref] for this deployment type.\nIPng Frontends At IPng I use a private IPv4/IPv6/MPLS network that is not connected to the internet. I call this network [IPng Site Local]. But how will users reach my Minio install? I have four redundantly and geographically deployed frontends, two in the Netherlands and two in Switzerland. I\u0026rsquo;ve described the frontend setup in a [previous article] and the certificate management in [this article].\nI\u0026rsquo;ve decided to run the service on these three regionalized endpoints:\ns3.chbtl0.ipng.ch which will back into minio0.chbtl0.net.ipng.ch s3.ddln0.ipng.ch which will back into minio0.ddln0.net.ipng.ch s3.chrma0.ipng.ch which will back into minio0.chrma0.net.ipng.ch The first thing I take note of is that S3 buckets can be addressed either by path, in other words something like s3.chbtl0.ipng.ch/my-bucket/README.md, or by virtual host, like so: my-bucket.s3.chbtl0.ipng.ch/README.md. A subtle difference, but from the docs I understand that Minio needs to have control of the whole space under its main domain.\nThere\u0026rsquo;s a small implication to this requirement \u0026ndash; the Web Console that ships with MinIO (eh, well, maybe that\u0026rsquo;s going to change, more on that later), will want to have its own domain-name, so I choose something simple: cons0-s3.chbtl0.ipng.ch and so on. This way, somebody might still be able to have a bucket name called cons0 :)\nLet\u0026rsquo;s Encrypt Certificates Alright, so I will be kneading nine domains into this new certificate, which I\u0026rsquo;ll simply call s3.ipng.ch. I configure it in Ansible:\ncertbot: certs: ...
s3.ipng.ch: groups: [ \u0026#39;nginx\u0026#39;, \u0026#39;minio\u0026#39; ] altnames: - \u0026#39;s3.chbtl0.ipng.ch\u0026#39; - \u0026#39;cons0-s3.chbtl0.ipng.ch\u0026#39; - \u0026#39;*.s3.chbtl0.ipng.ch\u0026#39; - \u0026#39;s3.ddln0.ipng.ch\u0026#39; - \u0026#39;cons0-s3.ddln0.ipng.ch\u0026#39; - \u0026#39;*.s3.ddln0.ipng.ch\u0026#39; - \u0026#39;s3.chrma0.ipng.ch\u0026#39; - \u0026#39;cons0-s3.chrma0.ipng.ch\u0026#39; - \u0026#39;*.s3.chrma0.ipng.ch\u0026#39; I run the certbot playbook and it does two things:\nOn the machines from groups nginx and minio, it will ensure there exists a user lego with an SSH key and write permissions to /etc/lego/; this is where the automation will write (and update) the certificate keys. On the lego machine, it\u0026rsquo;ll create two files. One is the certificate requestor, and the other is a certificate distribution script that will copy the cert to the right machine(s) when it renews. On the lego machine, I\u0026rsquo;ll run the cert request for the first time:\nlego@lego:~$ bin/certbot:s3.ipng.ch lego@lego:~$ RENEWED_LINEAGE=/home/lego/acme-dns/live/s3.ipng.ch bin/certbot-distribute The first script asks me to add the _acme-challenge DNS entries, which I\u0026rsquo;ll do, for example on the s3.chbtl0.ipng.ch instance (and similarly for the ddln0 and chrma0 ones):\n$ORIGIN chbtl0.ipng.ch. _acme-challenge.s3 CNAME 51f16fd0-8eb6-455c-b5cd-96fad12ef8fd.auth.ipng.ch. _acme-challenge.cons0-s3 CNAME 450477b8-74c9-4b9e-bbeb-de49c3f95379.auth.ipng.ch. s3 CNAME nginx0.ipng.ch. *.s3 CNAME nginx0.ipng.ch. cons0-s3 CNAME nginx0.ipng.ch. I push and reload the ipng.ch zonefile with these changes, after which the certificate gets requested and a cronjob added to check for renewals. The second script will copy the newly created cert to all three minio machines, and all four nginx machines. From now on, every 90 days, a new cert will be automatically generated and distributed.
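Once the distribution script has run, a quick way to confirm that the DNS delegations and the resulting certificate actually line up is to query the CNAMEs and inspect the SAN list on one of the machines that received the cert. A minimal sketch (the certificate path is the one the NGINX configs below use; the dig answer shown is the chbtl0 delegation from the zonefile above):

$ dig +short CNAME _acme-challenge.s3.chbtl0.ipng.ch
51f16fd0-8eb6-455c-b5cd-96fad12ef8fd.auth.ipng.ch.
$ openssl x509 -in /etc/certs/s3.ipng.ch/fullchain.pem -noout -ext subjectAltName
# All nine names should be listed: s3.*, cons0-s3.* and *.s3.* for chbtl0, ddln0 and chrma0.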
Slick!\nNGINX Configs With the LE wildcard certs in hand, I can create an NGINX frontend for these minio deployments.\nFirst, a simple redirector service that punts people on port 80 to port 443:\nserver { listen [::]:80; listen 0.0.0.0:80; server_name cons0-s3.chbtl0.ipng.ch s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch; access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log; include /etc/nginx/conf.d/ipng-headers.inc; location / { return 301 https://$server_name$request_uri; } } Next, the Minio API service itself which runs on port 9000, with a configuration snippet inspired by the MinIO [docs]:\nserver { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem; ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; server_name s3.chbtl0.ipng.ch *.s3.chbtl0.ipng.ch; access_log /var/log/nginx/s3.chbtl0.ipng.ch-access.log upstream; include /etc/nginx/conf.d/ipng-headers.inc; add_header Strict-Transport-Security \u0026#34;max-age=31536000; includeSubDomains\u0026#34; always; ignore_invalid_headers off; client_max_body_size 0; # Disable buffering proxy_buffering off; proxy_request_buffering off; location / { proxy_set_header Host $http_host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_connect_timeout 300; proxy_http_version 1.1; proxy_set_header Connection \u0026#34;\u0026#34;; chunked_transfer_encoding off; proxy_pass http://minio0.chbtl0.net.ipng.ch:9000; } } Finally, the Minio Console service which runs on port 9090:\ninclude /etc/nginx/conf.d/geo-ipng-trusted.inc; server { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/certs/s3.ipng.ch/fullchain.pem; ssl_certificate_key /etc/certs/s3.ipng.ch/privkey.pem; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; server_name cons0-s3.chbtl0.ipng.ch; access_log /var/log/nginx/cons0-s3.chbtl0.ipng.ch-access.log upstream; include /etc/nginx/conf.d/ipng-headers.inc; add_header Strict-Transport-Security \u0026#34;max-age=31536000; includeSubDomains\u0026#34; always; ignore_invalid_headers off; client_max_body_size 0; # Disable buffering proxy_buffering off; proxy_request_buffering off; location / { if ($geo_ipng_trusted = 0) { rewrite ^ https://ipng.ch/ break; } proxy_set_header Host $http_host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_set_header X-NginX-Proxy true; real_ip_header X-Real-IP; proxy_connect_timeout 300; chunked_transfer_encoding off; proxy_http_version 1.1; proxy_set_header Upgrade $http_upgrade; proxy_set_header Connection \u0026#34;upgrade\u0026#34;; proxy_pass http://minio0.chbtl0.net.ipng.ch:9090; } } This last one has an NGINX trick. It will only allow users in if they are in the map called geo_ipng_trusted, which contains a set of IPv4 and IPv6 prefixes. Visitors who are not in this map will receive an HTTP redirect back to the [IPng.ch] homepage instead.\nI run the Ansible Playbook which contains the NGINX changes to all frontends, but of course nothing runs yet, because I haven\u0026rsquo;t yet started MinIO backends.\nMinIO Backends The first thing I need to do is get those disks mounted. 
MinIO likes using XFS, so I\u0026rsquo;ll install that and prepare the disks as follows:\npim@minio0-chbtl0:~$ sudo apt install xfsprogs pim@minio0-chbtl0:~$ sudo modprobe xfs pim@minio0-chbtl0:~$ echo xfs | sudo tee -a /etc/modules pim@minio0-chbtl0:~$ sudo update-initramfs -k all -u pim@minio0-chbtl0:~$ for i in a b c d e f g h i j k l; do sudo mkfs.xfs /dev/sd$i; done pim@minio0-chbtl0:~$ blkid | awk \u0026#39;BEGIN {i=1} /TYPE=\u0026#34;xfs\u0026#34;/ { printf \u0026#34;%s /minio/disk%d xfs defaults 0 2\\n\u0026#34;,$2,i; i++; }\u0026#39; | sudo tee -a /etc/fstab pim@minio0-chbtl0:~$ for i in `seq 1 12`; do sudo mkdir -p /minio/disk$i; done pim@minio0-chbtl0:~$ sudo mount -t xfs -a pim@minio0-chbtl0:~$ sudo chown -R minio-user: /minio/ From the top: I\u0026rsquo;ll install xfsprogs which contains the things I need to manipulate XFS filesystems in Debian. Then I\u0026rsquo;ll install the xfs kernel module, and make sure it gets inserted upon subsequent startup by adding it to /etc/modules and regenerating the initrd for the installed kernels.\nNext, I\u0026rsquo;ll format all twelve 16TB disks (which are /dev/sda - /dev/sdl on these machines), and add their resulting blockdevice id\u0026rsquo;s to /etc/fstab so they get persistently mounted on reboot.\nFinally, I\u0026rsquo;ll create their mountpoints, mount all XFS filesystems, and chown them to the user that MinIO is running as. End result:\npim@minio0-chbtl0:~$ df -T Filesystem Type 1K-blocks Used Available Use% Mounted on udev devtmpfs 32950856 0 32950856 0% /dev tmpfs tmpfs 6595340 1508 6593832 1% /run /dev/md0 ext4 114695308 5423976 103398948 5% / tmpfs tmpfs 32976680 0 32976680 0% /dev/shm tmpfs tmpfs 5120 4 5116 1% /run/lock /dev/sda xfs 15623792640 121505936 15502286704 1% /minio/disk1 /dev/sde xfs 15623792640 121505968 15502286672 1% /minio/disk12 /dev/sdi xfs 15623792640 121505968 15502286672 1% /minio/disk11 /dev/sdl xfs 15623792640 121505904 15502286736 1% /minio/disk10 /dev/sdd xfs 15623792640 121505936 15502286704 1% /minio/disk4 /dev/sdb xfs 15623792640 121505968 15502286672 1% /minio/disk3 /dev/sdk xfs 15623792640 121505936 15502286704 1% /minio/disk5 /dev/sdc xfs 15623792640 121505936 15502286704 1% /minio/disk9 /dev/sdf xfs 15623792640 121506000 15502286640 1% /minio/disk2 /dev/sdj xfs 15623792640 121505968 15502286672 1% /minio/disk7 /dev/sdg xfs 15623792640 121506000 15502286640 1% /minio/disk8 /dev/sdh xfs 15623792640 121505968 15502286672 1% /minio/disk6 tmpfs tmpfs 6595336 0 6595336 0% /run/user/0 MinIO likes to be configured using environment variables - and this is likely because it\u0026rsquo;s a popular thing to run in a containerized environment like Kubernetes. The maintainers ship it also as a Debian package, which will read its environment from /etc/default/minio, and I\u0026rsquo;ll prepare that file as follows:\npim@minio0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/default/minio MINIO_DOMAIN=\u0026#34;s3.chbtl0.ipng.ch,minio0.chbtl0.net.ipng.ch\u0026#34; MINIO_ROOT_USER=\u0026#34;XXX\u0026#34; MINIO_ROOT_PASSWORD=\u0026#34;YYY\u0026#34; MINIO_VOLUMES=\u0026#34;/minio/disk{1...12}\u0026#34; MINIO_OPTS=\u0026#34;--console-address :9001\u0026#34; EOF pim@minio0-chbtl0:~$ sudo systemctl enable --now minio pim@minio0-chbtl0:~$ sudo journalctl -u minio May 31 10:44:11 minio0-chbtl0 minio[690420]: MinIO Object Storage Server May 31 10:44:11 minio0-chbtl0 minio[690420]: Copyright: 2015-2025 MinIO, Inc. 
May 31 10:44:11 minio0-chbtl0 minio[690420]: License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html May 31 10:44:11 minio0-chbtl0 minio[690420]: Version: RELEASE.2025-05-24T17-08-30Z (go1.24.3 linux/amd64) May 31 10:44:11 minio0-chbtl0 minio[690420]: API: http://198.19.4.11:9000 http://127.0.0.1:9000 May 31 10:44:11 minio0-chbtl0 minio[690420]: WebUI: https://cons0-s3.chbtl0.ipng.ch/ May 31 10:44:11 minio0-chbtl0 minio[690420]: Docs: https://docs.min.io pim@minio0-chbtl0:~$ sudo ipmitool sensor | grep Watts Pwr Consumption | 154.000 | Watts Incidentally - I am pretty pleased with this 192TB disk tank, sporting 24 cores, 64GB memory and 2x10G network, casually hanging out at 154 Watts of power all up. Slick!\nMinIO implements erasure coding as a core component in providing availability and resiliency during drive or node-level failure events. MinIO partitions each object into data and parity shards and distributes those shards across a single so-called erasure set. Under the hood, it uses a [Reed-Solomon] erasure coding implementation and partitions the object for distribution. From the MinIO website, I\u0026rsquo;ll borrow a diagram, shown to the right, to illustrate how this looks on a single node like mine.\nAnyway, MinIO detects 12 disks and installs an erasure set with 8 data disks and 4 parity disks, which it calls EC:4 encoding, also known in the industry as RS8.4. Just like that, the thing shoots to life. Awesome!\nMinIO Client On Summer, I\u0026rsquo;ll install the MinIO Client called mc. This is easy because the maintainers ship a Linux binary which I can just download. On OpenBSD, they don\u0026rsquo;t do that. Not a problem though, on Squanchy, Pencilvester and Glootie, I will just go install the client. Using the mc commandline, I can call any of the S3 APIs on my new MinIO instance:\npim@summer:~$ set +o history pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ \u0026lt;rootuser\u0026gt; \u0026lt;rootpass\u0026gt; pim@summer:~$ set -o history pim@summer:~$ mc admin info chbtl0/ ● s3.chbtl0.ipng.ch Uptime: 22 hours Version: 2025-05-24T17:08:30Z Network: 1/1 OK Drives: 12/12 OK Pool: 1 ┌──────┬───────────────────────┬─────────────────────┬──────────────┐ │ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │ │ 1st │ 0.8% (total: 116 TiB) │ 12 │ 1 │ └──────┴───────────────────────┴─────────────────────┴──────────────┘ 95 GiB Used, 5 Buckets, 5,859 Objects, 318 Versions, 1 Delete Marker 12 drives online, 0 drives offline, EC:4 Cool beans. I think I should get rid of this root account though: I\u0026rsquo;ve installed those credentials into the /etc/default/minio environment file, and I don\u0026rsquo;t want to keep using them out in the open. So I\u0026rsquo;ll make an account for myself and assign myself reasonable privileges, using the policy called consoleAdmin in the default install:\npim@summer:~$ set +o history pim@summer:~$ mc admin user add chbtl0/ \u0026lt;someuser\u0026gt; \u0026lt;somepass\u0026gt; pim@summer:~$ mc admin policy info chbtl0 consoleAdmin pim@summer:~$ mc admin policy attach chbtl0 consoleAdmin --user=\u0026lt;someuser\u0026gt; pim@summer:~$ mc alias set chbtl0 https://s3.chbtl0.ipng.ch/ \u0026lt;someuser\u0026gt; \u0026lt;somepass\u0026gt; pim@summer:~$ set -o history OK, I feel less gross now that I\u0026rsquo;m not operating as root on the MinIO deployment. Using my new user-powers, let me set some metadata on my new MinIO server:\npim@summer:~$ mc admin config set chbtl0/ site name=chbtl0 region=switzerland Successfully applied new settings.
Please restart your server \u0026#39;mc admin service restart chbtl0/\u0026#39;. pim@summer:~$ mc admin service restart chbtl0/ Service status: ▰▰▱ [DONE] Summary: ┌───────────────┬─────────────────────────────┐ │ Servers: │ 1 online, 0 offline, 0 hung │ │ Restart Time: │ 61.322886ms │ └───────────────┴─────────────────────────────┘ pim@summer:~$ mc admin config get chbtl0/ site site name=chbtl0 region=switzerland By the way, what\u0026rsquo;s really cool about these open standards is that both the Amazon aws client works with MinIO, but mc also works with AWS!\nMinIO Console Although I\u0026rsquo;m pretty good with APIs and command line tools, there\u0026rsquo;s some benefit also in using a Graphical User Interface. MinIO ships with one, but there was a bit of a kerfuffle in the MinIO community. Unfortunately, these are pretty common \u0026ndash; Redis (an open source key/value storage system) changed their offering abruptly. Terraform (an open source infrastructure-as-code tool) changed their licensing at some point. Ansible (an open source machine management tool) changed their offering also. MinIO developers decided to strip their console of ~all features recently. The gnarly bits are discussed on [reddit]. but suffice to say: the same thing that happened in literally 100% of the other cases, also happened here. Somebody decided to simply fork the code from before it was changed.\nEnter OpenMaxIO. A cringe worthy name, but it gets the job done. Reading up on the [GitHub], reviving the fully working console is pretty straight forward \u0026ndash; that is, once somebody spent a few days figuring it out. Thank you icesvz for this excellent pointer. With this, I can create a systemd service for the console and start it:\npim@minio0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/default/minio ## NOTE(pim): For openmaxio console service CONSOLE_MINIO_SERVER=\u0026#34;http://localhost:9000\u0026#34; MINIO_BROWSER_REDIRECT_URL=\u0026#34;https://cons0-s3.chbtl0.ipng.ch/\u0026#34; EOF pim@minio0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /lib/systemd/system/minio-console.service [Unit] Description=OpenMaxIO Console Service Wants=network-online.target After=network-online.target AssertFileIsExecutable=/usr/local/bin/minio-console [Service] Type=simple WorkingDirectory=/usr/local User=minio-user Group=minio-user ProtectProc=invisible EnvironmentFile=-/etc/default/minio ExecStart=/usr/local/bin/minio-console server Restart=always LimitNOFILE=1048576 MemoryAccounting=no TasksMax=infinity TimeoutSec=infinity OOMScoreAdjust=-1000 SendSIGKILL=no [Install] WantedBy=multi-user.target EOF pim@minio0-chbtl0:~$ sudo systemctl enable --now minio-console pim@minio0-chbtl0:~$ sudo systemctl restart minio The first snippet is an update to the MinIO configuration that instructs it to redirect users who are not trying to use the API to the console endpoint on cons0-s3.chbtl0.ipng.ch, and then the console-server needs to know where to find the API, which from its vantage point is running on localhost:9000. Hello, beautiful fully featured console:\nMinIO Prometheus MinIO ships with a prometheus metrics endpoint, and I notice on its console that it has a nice metrics tab, which is fully greyed out. This is most likely because, well, I don\u0026rsquo;t have a Prometheus install here yet. I decide to keep the storage nodes self-contained and start a Prometheus server on the local machine. 
I can always plumb that to IPng\u0026rsquo;s Grafana instance later.\nFor now, I\u0026rsquo;ll install Prometheus as follows:\npim@minio0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/default/minio ## NOTE(pim): Metrics for minio-console MINIO_PROMETHEUS_AUTH_TYPE=\u0026#34;public\u0026#34; CONSOLE_PROMETHEUS_URL=\u0026#34;http://localhost:19090/\u0026#34; CONSOLE_PROMETHEUS_JOB_ID=\u0026#34;minio-job\u0026#34; EOF pim@minio0-chbtl0:~$ sudo apt install prometheus pim@minio0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/default/prometheus ARGS=\u0026#34;--web.listen-address=\u0026#39;[::]:19090\u0026#39; --storage.tsdb.retention.size=16GB\u0026#34; EOF pim@minio0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/prometheus/prometheus.yml global: scrape_interval: 60s scrape_configs: - job_name: minio-job metrics_path: /minio/v2/metrics/cluster static_configs: - targets: [\u0026#39;localhost:9000\u0026#39;] labels: cluster: minio0-chbtl0 - job_name: minio-job-node metrics_path: /minio/v2/metrics/node static_configs: - targets: [\u0026#39;localhost:9000\u0026#39;] labels: cluster: minio0-chbtl0 - job_name: minio-job-bucket metrics_path: /minio/v2/metrics/bucket static_configs: - targets: [\u0026#39;localhost:9000\u0026#39;] labels: cluster: minio0-chbtl0 - job_name: minio-job-resource metrics_path: /minio/v2/metrics/resource static_configs: - targets: [\u0026#39;localhost:9000\u0026#39;] labels: cluster: minio0-chbtl0 - job_name: node static_configs: - targets: [\u0026#39;localhost:9100\u0026#39;] labels: cluster: minio0-chbtl0 pim@minio0-chbtl0:~$ sudo systemctl restart minio prometheus In the first snippet, I\u0026rsquo;ll tell MinIO where it should find its Prometheus instance. Since the MinIO console service is running on port 9090, and this is also the default port for Prometheus, I will run Prometheus on port 19090 instead. From reading the MinIO docs, I can see that normally MinIO will want Prometheus to authenticate to it before it\u0026rsquo;ll allow the endpoints to be scraped. I\u0026rsquo;ll turn that off by making these public. On the IPng Frontends, I can always remove access to /minio/v2 and simply use the IPng Site Local access for local Prometheus scrapers instead.\nAfter telling Prometheus its runtime arguments (in /etc/default/prometheus) and its scraping endpoints (in /etc/prometheus/prometheus.yml), I can restart minio and prometheus. A few minutes later, I can see the Metrics tab in the console come to life.\nBut now that I have this Prometheus running on the MinIO node, I can also add it to IPng\u0026rsquo;s Grafana configuration, by adding a new data source on minio0.chbtl0.net.ipng.ch:19090 and pointing the default Grafana [Dashboard] at it:\nA two-for-one: I can see metrics directly in the console, and I can also hook these per-node Prometheus instances into IPng\u0026rsquo;s Alertmanager, for which I\u0026rsquo;ve read some [docs] on the concepts. I\u0026rsquo;m really liking the experience so far!\nMinIO Nagios Prometheus is fancy and all, but at IPng Networks, I\u0026rsquo;ve been doing monitoring for a while now. As a dinosaur, I still have an active [Nagios] install, which autogenerates all of its configuration using the Ansible repository I have.
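As a quick aside before diving into the Nagios bits: the per-node Prometheus can also be spot-checked straight from the shell via its HTTP API. The targets endpoint is standard Prometheus; the MinIO metric name in the second query is an assumption on my part and may differ between MinIO releases:

$ curl -s http://minio0.chbtl0.net.ipng.ch:19090/api/v1/targets | grep -o '"health":"[a-z]*"' | sort | uniq -c
# All of the minio-job* targets ought to report "health":"up".
$ curl -s 'http://minio0.chbtl0.net.ipng.ch:19090/api/v1/query?query=minio_cluster_drive_offline_total'
# Assumed metric name; a value of 0 would mean all twelve drives are accounted for.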
So for the new Ansible group called minio, I will autogenerate the following snippet:\ndefine command { command_name ipng_check_minio command_line $USER1$/check_http -E -H $HOSTALIAS$ -I $ARG1$ -p $ARG2$ -u $ARG3$ -r \u0026#39;$ARG4$\u0026#39; } define service { hostgroup_name ipng:minio:ipv6 service_description minio6:api check_command ipng_check_minio!$_HOSTADDRESS6$!9000!/minio/health/cluster! use ipng-service-fast notification_interval 0 ; set \u0026gt; 0 if you want to be renotified } define service { hostgroup_name ipng:minio:ipv6 service_description minio6:prom check_command ipng_check_minio!$_HOSTADDRESS6$!19090!/classic/targets!minio-job use ipng-service-fast notification_interval 0 ; set \u0026gt; 0 if you want to be renotified } define service { hostgroup_name ipng:minio:ipv6 service_description minio6:console check_command ipng_check_minio!$_HOSTADDRESS6$!9090!/!MinIO Console use ipng-service-fast notification_interval 0 ; set \u0026gt; 0 if you want to be renotified } I\u0026rsquo;ve shown the snippet for IPv6 but I also have three services defined for legacy IP in the hostgroup ipng:minio:ipv4. The check command here uses -I for the IPv4 or IPv6 address to talk to, -p for the port to consult, -u for the URI to hit, and -r for a regular expression to expect in the output. For the Nagios aficionados out there: my Ansible groups correspond one to one with autogenerated Nagios hostgroups. This allows me to add arbitrary checks by group-type, like above in the ipng:minio group for IPv4 and IPv6.\nIn the MinIO [docs] I read up on the Healthcheck API. I choose to monitor the Cluster Write Quorum on my MinIO deployments. For Prometheus, I decide to hit the targets endpoint and expect the minio-job to be among them. Finally, for the MinIO Console, I expect to see a login screen with the words MinIO Console in the returned page. I guessed right, because Nagios is all green:\nMy First Bucket The IPng website is a statically generated Hugo site, and whenever I submit a change to my Git repo, a CI/CD runner (called [Drone]) picks up the change. It re-builds the static website, and copies it to four redundant NGINX servers.\nBut IPng\u0026rsquo;s website has amassed quite a few extra files (like VM images and VPP packages that I publish), which are copied separately using a simple push script I have in my home directory. This keeps all those big media files from cluttering the Git repository. I decide to move this stuff into S3:\npim@summer:~/src/ipng-web-assets$ echo \u0026#39;Gruezi World.\u0026#39; \u0026gt; ipng.ch/media/README.md pim@summer:~/src/ipng-web-assets$ mc mb chbtl0/ipng-web-assets pim@summer:~/src/ipng-web-assets$ mc mirror . chbtl0/ipng-web-assets/ ...ch/media/README.md: 6.50 GiB / 6.50 GiB ┃▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓┃ 236.38 MiB/s 28s pim@summer:~/src/ipng-web-assets$ mc anonymous set download chbtl0/ipng-web-assets/ OK, two things that immediately jump out at me. This stuff is fast: Summer is connected with a 2.5GbE network card, and she\u0026rsquo;s running hard, copying the 6.5GB of data that are in these web assets essentially at line rate. It doesn\u0026rsquo;t really surprise me because Summer is running off of Gen4 NVME, while MinIO has 12 spinning disks which each can write about 160MB/s or so sustained [ref], with 24 CPUs to tend to the NIC (2x10G) and disks (2x SSD, 12x LFF).
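Back of the envelope, and assuming the roughly 160MB/s of sustained sequential write per spindle quoted above: 12 x 160MB/s is about 1.9GB/s of raw write capacity, while the 2x10G NICs top out around 2.4GB/s and Summer\u0026rsquo;s single 2.5GbE link at roughly 300MB/s. Even with the EC:4 write amplification (for every 8 data shards another 4 parity shards are written, so roughly 1.5x the bytes hit the spindles), the observed ~236MiB/s mirror sits comfortably within what the disks can absorb.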
Should be plenty!\nThe second is that MinIO allows for buckets to be publicly shared in three ways: 1) read-only by setting download; 2) write-only by setting upload, and 3) read-write by setting public. I set download here, which means I should be able to fetch an asset now publicly:\npim@summer:~$ curl https://s3.chbtl0.ipng.ch/ipng-web-assets/ipng.ch/media/README.md Gruezi World. pim@summer:~$ curl https://ipng-web-assets.s3.chbtl0.ipng.ch/ipng.ch/media/README.md Gruezi World. The first curl here shows the path-based access, while the second one shows an equivalent virtual-host based access. Both retrieve the file I just pushed via the public Internet. Whoot!\nWhat\u0026rsquo;s Next I\u0026rsquo;m going to be moving [Restic] backups from IPng\u0026rsquo;s ZFS storage pool to this S3 service over the next few days. I\u0026rsquo;ll also migrate PeerTube and possibly Mastodon from NVME based storage to replicated S3 buckets as well. Finally, the IPng website media that I mentioned above, should make for a nice followup article. Stay tuned!\n","date":"2025-05-28","desc":" Introduction Amazon Simple Storage Service (Amazon S3) is an object storage service offering industry-leading scalability, data availability, security, and performance. Millions of customers of all sizes and industries store, manage, analyze, and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize and analyze data, and configure fine-tuned access controls to meet specific business and compliance requirements.\n","permalink":"https://ipng.ch/s/articles/2025/05/28/case-study-minio-s3-part-1/","section":"articles","title":"Case Study: Minio S3 - Part 1"},{"contents":" Introduction From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance. However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP performance almost the same as on bare metal. But did you know that VPP can also run in Docker?\nThe other day I joined the [ZANOG'25] in Durban, South Africa. One of the presenters was Nardus le Roux of Nokia, and he showed off a project called [Containerlab], which provides a CLI for orchestrating and managing container-based networking labs. It starts the containers, builds virtual wiring between them to create lab topologies of users\u0026rsquo; choice and manages the lab lifecycle.\nQuite regularly I am asked \u0026lsquo;when will you add VPP to Containerlab?\u0026rsquo;, but at ZANOG I made a promise to actually add it. In my previous [article], I took a good look at VPP as a dockerized container. In this article, I\u0026rsquo;ll explore how to make such a container run in Containerlab!\nCompleting the Docker container Just having VPP running by itself in a container is not super useful (although it is cool!). 
I decide first to add a few bits and bobs that will come in handy in the Dockerfile:\nFROM debian:bookworm ARG DEBIAN_FRONTEND=noninteractive ARG VPP_INSTALL_SKIP_SYSCTL=true ARG REPO=release EXPOSE 22/tcp RUN apt-get update \u0026amp;\u0026amp; apt-get -y install curl procps tcpdump iproute2 iptables \\ iputils-ping net-tools git python3 python3-pip vim-tiny openssh-server bird2 \\ mtr-tiny traceroute \u0026amp;\u0026amp; apt-get clean # Install VPP RUN mkdir -p /var/log/vpp /root/.ssh/ RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash RUN apt-get update \u0026amp;\u0026amp; apt-get -y install vpp vpp-plugin-core \u0026amp;\u0026amp; apt-get clean # Build vppcfg RUN pip install --break-system-packages build netaddr yamale argparse pyyaml ipaddress RUN git clone https://git.ipng.ch/ipng/vppcfg.git \u0026amp;\u0026amp; cd vppcfg \u0026amp;\u0026amp; python3 -m build \u0026amp;\u0026amp; \\ pip install --break-system-packages dist/vppcfg-*-py3-none-any.whl # Config files COPY files/etc/vpp/* /etc/vpp/ COPY files/etc/bird/* /etc/bird/ COPY files/init-container.sh /sbin/ RUN chmod 755 /sbin/init-container.sh CMD [\u0026#34;/sbin/init-container.sh\u0026#34;] A few notable additions:\nvppcfg is a handy utility I wrote and discussed in a previous [article]. Its purpose is to take YAML file that describes the configuration of the dataplane (like which interfaces, sub-interfaces, MTU, IP addresses and so on), and then apply this safely to a running dataplane. You can check it out in my [vppcfg] git repository. openssh-server will come in handy to log in to the container, in addition to the already available docker exec. bird2 which will be my controlplane of choice. At a future date, I might also add FRR, which may be a good alterantive for some. VPP works well with both. You can check out Bird on the nic.cz [website]. I\u0026rsquo;ll add a couple of default config files for Bird and VPP, and replace the CMD with a generic /sbin/init-container.sh in which I can do any late binding stuff before launching VPP.\nInitializing the Container VPP Containerlab: NetNS VPP\u0026rsquo;s Linux Control Plane plugin wants to run in its own network namespace. So the first order of business of /sbin/init-container.sh is to create it:\nNETNS=${NETNS:=\u0026#34;dataplane\u0026#34;} echo \u0026#34;Creating dataplane namespace\u0026#34; /usr/bin/mkdir -p /etc/netns/$NETNS /usr/bin/touch /etc/netns/$NETNS/resolv.conf /usr/sbin/ip netns add $NETNS VPP Containerlab: SSH Then, I\u0026rsquo;ll set the root password (which is vpp by the way), and start aan SSH daemon which allows for password-less logins:\necho \u0026#34;Starting SSH, with credentials root:vpp\u0026#34; sed -i -e \u0026#39;s,^#PermitRootLogin prohibit-password,PermitRootLogin yes,\u0026#39; /etc/ssh/sshd_config sed -i -e \u0026#39;s,^root:.*,root:$y$j9T$kG8pyZEVmwLXEtXekQCRK.$9iJxq/bEx5buni1hrC8VmvkDHRy7ZMsw9wYvwrzexID:20211::::::,\u0026#39; /etc/shadow /etc/init.d/ssh start VPP Containerlab: Bird2 I can already predict that Bird2 won\u0026rsquo;t be the only option for a controlplane, even though I\u0026rsquo;m a huge fan of it. 
Therefore, I\u0026rsquo;ll make it configurable to leave the door open for other controlplane implementations in the future:\nBIRD_ENABLED=${BIRD_ENABLED:=\u0026#34;true\u0026#34;} if [ \u0026#34;$BIRD_ENABLED\u0026#34; == \u0026#34;true\u0026#34; ]; then echo \u0026#34;Starting Bird in $NETNS\u0026#34; mkdir -p /run/bird /var/log/bird chown bird:bird /var/log/bird ROUTERID=$(ip -br a show eth0 | awk \u0026#39;{ print $3 }\u0026#39; | cut -f1 -d/) sed -i -e \u0026#34;s,.*router id .*,router id $ROUTERID; # Set by container-init.sh,\u0026#34; /etc/bird/bird.conf /usr/bin/nsenter --net=/var/run/netns/$NETNS /usr/sbin/bird -u bird -g bird fi I am reminded that Bird won\u0026rsquo;t start if it cannot determine its router id. When I start it in the dataplane namespace, it will immediately exit, because there will be no IP addresses configured yet. But luckily, it logs its complaint and it\u0026rsquo;s easily addressed. I decide to take the management IPv4 address from eth0 and write that into the bird.conf file, which otherwise does some basic initialization that I described in a previous [article], so I\u0026rsquo;ll skip that here. However, I do include an empty file called /etc/bird/bird-local.conf for users to further configure Bird2.\nVPP Containerlab: Binding veth pairs When Containerlab starts the VPP container, it\u0026rsquo;ll offer it a set of veth ports that connect this container to other nodes in the lab. This is done by the links list in the topology file [ref]. It\u0026rsquo;s my goal to take all of the interfaces that are of type veth, and generate a little snippet to grab them and bind them into VPP while setting their MTU to 9216 to allow for jumbo frames:\nCLAB_VPP_FILE=${CLAB_VPP_FILE:=/etc/vpp/clab.vpp} echo \u0026#34;Generating $CLAB_VPP_FILE\u0026#34; : \u0026gt; $CLAB_VPP_FILE MTU=9216 for IFNAME in $(ip -br link show type veth | cut -f1 -d@ | grep -v \u0026#39;^eth0$\u0026#39; | sort); do MAC=$(ip -br link show dev $IFNAME | awk \u0026#39;{ print $3 }\u0026#39;) echo \u0026#34; * $IFNAME hw-addr $MAC mtu $MTU\u0026#34; ip link set $IFNAME up mtu $MTU cat \u0026lt;\u0026lt; EOF \u0026gt;\u0026gt; $CLAB_VPP_FILE create host-interface name $IFNAME hw-addr $MAC set interface name host-$IFNAME $IFNAME set interface mtu $MTU $IFNAME set interface state $IFNAME up EOF done One thing I realized is that VPP will assign a random MAC address on its copy of the veth port, which is not great. I\u0026rsquo;ll explicitly configure it with the same MAC address as the veth interface itself, otherwise I\u0026rsquo;d have to put the interface into promiscuous mode.\nVPP Containerlab: VPPcfg I\u0026rsquo;m almost ready, but I have one more detail. The user will be able to offer a [vppcfg] YAML file to configure the interfaces and so on. If such a file exists, I\u0026rsquo;ll apply it to the dataplane upon startup:\nVPPCFG_VPP_FILE=${VPPCFG_VPP_FILE:=/etc/vpp/vppcfg.vpp} echo \u0026#34;Generating $VPPCFG_VPP_FILE\u0026#34; : \u0026gt; $VPPCFG_VPP_FILE if [ -r /etc/vpp/vppcfg.yaml ]; then vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml -o $VPPCFG_VPP_FILE fi Once the VPP process starts, it\u0026rsquo;ll execute /etc/vpp/bootstrap.vpp, which in turn executes these newly generated /etc/vpp/clab.vpp to grab the veth interfaces, and then /etc/vpp/vppcfg.vpp to further configure the dataplane. Easy peasy!\nAdding VPP to Containerlab Roman points out a previous integration for the 6WIND VSR in [PR#2540]. This serves as a useful guide to get me started. 
I fork the repo, create a branch so that Roman can also add a few commits, and together we start hacking in [PR#2571].\nFirst, I add the documentation skeleton in docs/manual/kinds/fdio_vpp.md, which links in from a few other places, and will be where the end-user facing documentation will live. That\u0026rsquo;s about half the contributed LOC, right there!\nNext, I\u0026rsquo;ll create a Go module in nodes/fdio_vpp/fdio_vpp.go which doesn\u0026rsquo;t do much other than creating the struct, and its required Register and Init functions. The Init function ensures the right capabilities are set in Docker, and the right devices are bound for the container.\nI notice that Containerlab rewrites the Dockerfile CMD string and prepends an if-wait.sh script to it. This is because when Containerlab starts the container, it\u0026rsquo;ll still be busy adding these link interfaces to it, and if a container starts too quickly, it may not see all the interfaces. So, containerlab informs the container using an environment variable called CLAB_INTFS, so this script simply sleeps for a while until that exact amount of interfaces are present. Ok, cool beans.\nRoman helps me a bit with Go templating. You see, I think it\u0026rsquo;ll be slick to have the CLI prompt for the VPP containers to reflect their hostname, because normally, VPP will assign vpp# . I add the template in nodes/fdio_vpp/vpp_startup_config.go.tpl and it only has one variable expansion: unix { cli-prompt {{ .ShortName }}# }. But I totally think it\u0026rsquo;s worth it, because when running many VPP containers in the lab, it could otherwise get confusing.\nRoman also shows me a trick in the function PostDeploy(), which will write the user\u0026rsquo;s SSH pubkeys to /root/.ssh/authorized_keys. This allows users to log in without having to use password authentication.\nCollectively, we decide to punt on the SaveConfig function until we\u0026rsquo;re a bit further along. I have an idea how this would work, basically along the lines of calling vppcfg dump and bind-mounting that file into the lab directory somewhere. This way, upon restarting, the YAML file can be re-read and the dataplane initialized. But it\u0026rsquo;ll be for another day.\nAfter the main module is finished, all I have to do is add it to clab/register.go and that\u0026rsquo;s just about it. In about 170 lines of code, 50 lines of Go template, and 170 lines of Markdown, this contribution is about ready to ship!\nContainerlab: Demo After I finish writing the documentation, I decide to include a demo with a quickstart to help folks along. A simple lab showing two VPP instances and two Alpine Linux clients can be found on [git.ipng.ch/ipng/vpp-containerlab]. Simply check out the repo and start the lab, like so:\n$ git clone https://git.ipng.ch/ipng/vpp-containerlab.git $ cd vpp-containerlab $ containerlab deploy --topo vpp.clab.yml Containerlab: configs The file vpp.clab.yml contains an example topology existing of two VPP instances connected each to one Alpine linux container, in the following topology:\nTwo relevant files for each VPP router are included in this [repository]:\nconfig/vpp*/vppcfg.yaml configures the dataplane interfaces, including a loopback address. config/vpp*/bird-local.conf configures the controlplane to enable BFD and OSPF. To illustrate these files, let me take a closer look at node vpp1. 
It\u0026rsquo;s VPP dataplane configuration looks like this:\npim@summer:~/src/vpp-containerlab$ cat config/vpp1/vppcfg.yaml interfaces: eth1: description: \u0026#39;To client1\u0026#39; mtu: 1500 lcp: eth1 addresses: [ 10.82.98.65/28, 2001:db8:8298:101::1/64 ] eth2: description: \u0026#39;To vpp2\u0026#39; mtu: 9216 lcp: eth2 addresses: [ 10.82.98.16/31, 2001:db8:8298:1::1/64 ] loopbacks: loop0: description: \u0026#39;vpp1\u0026#39; lcp: loop0 addresses: [ 10.82.98.0/32, 2001:db8:8298::/128 ] Then, I enable BFD, OSPF and OSPFv3 on eth2 and loop0 on both of the VPP routers:\npim@summer:~/src/vpp-containerlab$ cat config/vpp1/bird-local.conf protocol bfd bfd1 { interface \u0026#34;eth2\u0026#34; { interval 100 ms; multiplier 30; }; } protocol ospf v2 ospf4 { ipv4 { import all; export all; }; area 0 { interface \u0026#34;loop0\u0026#34; { stub yes; }; interface \u0026#34;eth2\u0026#34; { type pointopoint; cost 10; bfd on; }; }; } protocol ospf v3 ospf6 { ipv6 { import all; export all; }; area 0 { interface \u0026#34;loop0\u0026#34; { stub yes; }; interface \u0026#34;eth2\u0026#34; { type pointopoint; cost 10; bfd on; }; }; } Containerlab: playtime! Once the lab comes up, I can SSH to the VPP containers (vpp1 and vpp2) which have my SSH pubkeys installed thanks to Roman\u0026rsquo;s work. Barring that, I could still log in as user root using password vpp. VPP runs its own network namespace called dataplane, which is very similar to SR Linux default network-instance. I can join that namespace to take a closer look:\npim@summer:~/src/vpp-containerlab$ ssh root@vpp1 root@vpp1:~# nsenter --net=/var/run/netns/dataplane root@vpp1:~# ip -br a lo DOWN loop0 UP 10.82.98.0/32 2001:db8:8298::/128 fe80::dcad:ff:fe00:0/64 eth1 UNKNOWN 10.82.98.65/28 2001:db8:8298:101::1/64 fe80::a8c1:abff:fe77:acb9/64 eth2 UNKNOWN 10.82.98.16/31 2001:db8:8298:1::1/64 fe80::a8c1:abff:fef0:7125/64 root@vpp1:~# ping 10.82.98.1 PING 10.82.98.1 (10.82.98.1) 56(84) bytes of data. 64 bytes from 10.82.98.1: icmp_seq=1 ttl=64 time=9.53 ms 64 bytes from 10.82.98.1: icmp_seq=2 ttl=64 time=15.9 ms ^C --- 10.82.98.1 ping statistics --- 2 packets transmitted, 2 received, 0% packet loss, time 1002ms rtt min/avg/max/mdev = 9.530/12.735/15.941/3.205 ms From vpp1, I can tell that Bird2\u0026rsquo;s OSPF adjacency has formed, because I can ping the loop0 address of vpp2 router on 10.82.98.1. Nice! The two client nodes are running a minimalistic Alpine Linux container, which doesn\u0026rsquo;t ship with SSH by default. 
But of course I can still enter the containers using docker exec, like so:\npim@summer:~/src/vpp-containerlab$ docker exec -it client1 sh / # ip addr show dev eth1 531235: eth1@if531234: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN\u0026gt; mtu 9500 qdisc noqueue state UP link/ether 00:c1:ab:00:00:01 brd ff:ff:ff:ff:ff:ff inet 10.82.98.66/28 scope global eth1 valid_lft forever preferred_lft forever inet6 2001:db8:8298:101::2/64 scope global valid_lft forever preferred_lft forever inet6 fe80::2c1:abff:fe00:1/64 scope link valid_lft forever preferred_lft forever / # traceroute 10.82.98.82 traceroute to 10.82.98.82 (10.82.98.82), 30 hops max, 46 byte packets 1 10.82.98.65 (10.82.98.65) 5.906 ms 7.086 ms 7.868 ms 2 10.82.98.17 (10.82.98.17) 24.007 ms 23.349 ms 15.933 ms 3 10.82.98.82 (10.82.98.82) 39.978 ms 31.127 ms 31.854 ms / # traceroute 2001:db8:8298:102::2 traceroute to 2001:db8:8298:102::2 (2001:db8:8298:102::2), 30 hops max, 72 byte packets 1 2001:db8:8298:101::1 (2001:db8:8298:101::1) 0.701 ms 7.144 ms 7.900 ms 2 2001:db8:8298:1::2 (2001:db8:8298:1::2) 23.909 ms 22.943 ms 23.893 ms 3 2001:db8:8298:102::2 (2001:db8:8298:102::2) 31.964 ms 30.814 ms 32.000 ms From the vantage point of client1, the first hop represents the vpp1 node, which forwards to vpp2, which finally forwards to client2, which shows that both VPP routers are passing traffic. Dope!\nResults After all of this deep-diving, all that\u0026rsquo;s left is for me to demonstrate the Containerlab by means of this little screencast [asciinema]. I hope you enjoy it as much as I enjoyed creating it:\nAcknowledgements I wanted to give a shout-out Roman Dodin for his help getting the Containerlab parts squared away when I got a little bit stuck. He took the time to explain the internals and idiom of Containerlab project, which really saved me a tonne of time. He also pair-programmed the [PR#2471] with me over the span of two evenings.\nCollaborative open source rocks!\n","date":"2025-05-04","desc":" Introduction From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance. However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP performance almost the same as on bare metal. But did you know that VPP can also run in Docker?\n","permalink":"https://ipng.ch/s/articles/2025/05/04/vpp-in-containerlab-part-2/","section":"articles","title":"VPP in Containerlab - Part 2"},{"contents":" Introduction From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance. However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP performance almost the same as on bare metal. But did you know that VPP can also run in Docker?\nThe other day I joined the [ZANOG'25] in Durban, South Africa. 
One of the presenters was Nardus le Roux of Nokia, and he showed off a project called [Containerlab], which provides a CLI for orchestrating and managing container-based networking labs. It starts the containers, builds virtual wiring between them to create lab topologies of the users\u0026rsquo; choice and manages the lab lifecycle.\nQuite regularly I am asked \u0026lsquo;when will you add VPP to Containerlab?\u0026rsquo;, but at ZANOG I made a promise to actually add it. Here I go, on a journey to integrate VPP into Containerlab!\nContainerized VPP The folks at [Tigera] maintain a project called Calico, which accelerates Kubernetes CNI (Container Network Interface) by using [FD.io] VPP. Since the origins of Kubernetes are to run containers in a Docker environment, it stands to reason that it should be possible to run a containerized VPP. I start by reading up on how they create their Docker image, and I learn a lot.\nDocker Build Considering IPng runs bare metal Debian (currently Bookworm) machines, my Docker image will be based on debian:bookworm as well. The build starts off quite modest:\npim@summer:~$ mkdir -p src/vpp-containerlab pim@summer:~/src/vpp-containerlab$ cat \u0026lt;\u0026lt; EOF \u0026gt; Dockerfile.bookworm FROM debian:bookworm ARG DEBIAN_FRONTEND=noninteractive ARG VPP_INSTALL_SKIP_SYSCTL=true ARG REPO=release RUN apt-get update \u0026amp;\u0026amp; apt-get -y install curl procps \u0026amp;\u0026amp; apt-get clean # Install VPP RUN curl -s https://packagecloud.io/install/repositories/fdio/${REPO}/script.deb.sh | bash RUN apt-get update \u0026amp;\u0026amp; apt-get -y install vpp vpp-plugin-core \u0026amp;\u0026amp; apt-get clean CMD [\u0026#34;/usr/bin/vpp\u0026#34;,\u0026#34;-c\u0026#34;,\u0026#34;/etc/vpp/startup.conf\u0026#34;] EOF pim@summer:~/src/vpp-containerlab$ docker build -f Dockerfile.bookworm . -t pimvanpelt/vpp-containerlab One gotcha - when I install the upstream VPP Debian packages, their postinst script generates a sysctl file and tries to apply it. However, I can\u0026rsquo;t set sysctls in the container, so the build fails. I take a look at the VPP source code and find src/pkg/debian/vpp.postinst which helpfully contains a means to skip setting the sysctls, using an environment variable called VPP_INSTALL_SKIP_SYSCTL.\nRunning VPP in Docker With the Docker image built, I need to tweak the VPP startup configuration a little bit, to allow it to run well in a Docker environment. There are a few things I make note of:\nWe may not have huge pages on the host machine, so I\u0026rsquo;ll set all the page sizes to the Linux-default 4kB rather than 2MB or 1GB hugepages. This creates a performance regression, but in the case of Containerlab, we\u0026rsquo;re not here to build high performance stuff, but rather users will be doing functional testing. DPDK requires either UIO or VFIO kernel drivers, so that it can bind its so-called poll mode driver to the network cards. It also requires huge pages. Since my first version will be using only virtual ethernet interfaces, I\u0026rsquo;ll disable DPDK and VFIO altogether. VPP can run any number of CPU worker threads. In its simplest form, I can also run it with only one thread. Of course, this will not be a high performance setup, but since I\u0026rsquo;m already not using hugepages, I\u0026rsquo;ll use only 1 thread.
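Whether a given Docker host has hugepages configured at all is easy to check; if HugePages_Total reads zero, the 4kB page-size overrides in the startup configuration below are what keep VPP happy. A hypothetical transcript on a host without any hugepages configured:

pim@summer:~$ grep -E 'HugePages_Total|Hugepagesize' /proc/meminfo
HugePages_Total:       0
Hugepagesize:       2048 kB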
The VPP startup.conf configuration file I came up with:\npim@summer:~/src/vpp-containerlab$ cat \u0026lt;\u0026lt; EOF \u0026gt; clab-startup.conf unix { interactive log /var/log/vpp/vpp.log full-coredump cli-listen /run/vpp/cli.sock cli-prompt vpp-clab# cli-no-pager poll-sleep-usec 100 } api-trace { on } memory { main-heap-size 512M main-heap-page-size 4k } buffers { buffers-per-numa 16000 default data-size 2048 page-size 4k } statseg { size 64M page-size 4k per-node-counters on } plugins { plugin default { enable } plugin dpdk_plugin.so { disable } } EOF Just a couple of notes for those who are running VPP in production. Each of the *-page-size config settings takes the normal Linux pagesize of 4kB, which effectively keeps VPP from using any hugepages. Then, I\u0026rsquo;ll specifically disable the DPDK plugin, although I didn\u0026rsquo;t install it in the Dockerfile build, as it lives in its own dedicated Debian package called vpp-plugin-dpdk. Finally, I\u0026rsquo;ll make VPP use less CPU by telling it to sleep for 100 microseconds between each poll iteration. In production environments, VPP will use 100% of the CPUs it\u0026rsquo;s assigned, but in this lab, it will not be quite as hungry. By the way, even in this sleepy mode, it\u0026rsquo;ll still easily handle a gigabit of traffic!\nNow, VPP wants to run as root and it needs a few host features, notably tuntap devices and vhost, and a few capabilities, notably NET_ADMIN and SYS_PTRACE. I take a look at the [manpage]:\nCAP_SYS_NICE: allows setting real-time scheduling, CPU affinity and I/O scheduling class, and migrating and moving memory pages. CAP_NET_ADMIN: allows performing various network-related operations like interface configs, routing tables, nested network namespaces, multicast, setting promiscuous mode, and so on. CAP_SYS_PTRACE: allows tracing arbitrary processes using ptrace(2), and a few related kernel system calls. Being a networking dataplane implementation, VPP wants to be able to tinker with network devices. This is not typically allowed in Docker containers, although the Docker developers did make some concessions for those containers that need just that little bit more access. They described it in their [docs] as follows:\n| The \u0026ndash;privileged flag gives all capabilities to the container. When the operator executes docker | run \u0026ndash;privileged, Docker enables access to all devices on the host, and reconfigures AppArmor or | SELinux to allow the container nearly all the same access to the host as processes running outside | containers on the host. Use this flag with caution. For more information about the \u0026ndash;privileged | flag, see the docker run reference.\nAt this point, I feel I should point out that running a Docker container with the --privileged flag set does give it a lot of privileges. A container with --privileged is not a securely sandboxed process. Containers in this mode can get a root shell on the host and take control over the system.\nWith that little fine-print warning out of the way, I am going to Yolo like a boss:\npim@summer:~/src/vpp-containerlab$ docker run --name clab-pim \\ --cap-add=NET_ADMIN --cap-add=SYS_NICE --cap-add=SYS_PTRACE \\ --device=/dev/net/tun:/dev/net/tun --device=/dev/vhost-net:/dev/vhost-net \\ --privileged -v $(pwd)/clab-startup.conf:/etc/vpp/startup.conf:ro \\ docker.io/pimvanpelt/vpp-containerlab clab-pim Configuring VPP in Docker And with that, the Docker container is running!
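Before poking around inside, it does not hurt to verify that the devices and capabilities actually made it into the running container. All three commands below are standard Docker and Linux tooling, nothing VPP-specific:

pim@summer:~/src/vpp-containerlab$ docker inspect -f '{{.HostConfig.Privileged}} {{.HostConfig.CapAdd}}' clab-pim
pim@summer:~/src/vpp-containerlab$ docker exec clab-pim ls -l /dev/net/tun /dev/vhost-net
pim@summer:~/src/vpp-containerlab$ docker exec clab-pim grep CapEff /proc/1/status
# The effective capability mask of PID 1 (VPP) should include NET_ADMIN, SYS_NICE and SYS_PTRACE;
# with --privileged it will simply show the full mask.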
I post a screenshot on [Mastodon] and my buddy John responds with a polite but firm insistence that I explain myself. Here you go, buddy :)\nIn another terminal, I can play around with this VPP instance a little bit:\npim@summer:~$ docker exec -it clab-pim bash root@d57c3716eee9:/# ip -br l lo UNKNOWN 00:00:00:00:00:00 \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; eth0@if530566 UP 02:42:ac:11:00:02 \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; root@d57c3716eee9:/# ps auxw USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND root 1 2.2 0.2 17498852 160300 ? Rs 15:11 0:00 /usr/bin/vpp -c /etc/vpp/startup.conf root 10 0.0 0.0 4192 3388 pts/0 Ss 15:11 0:00 bash root 18 0.0 0.0 8104 4056 pts/0 R+ 15:12 0:00 ps auxw root@d57c3716eee9:/# vppctl _______ _ _ _____ ___ __/ __/ _ \\ (_)__ | | / / _ \\/ _ \\ _/ _// // / / / _ \\ | |/ / ___/ ___/ /_/ /____(_)_/\\___/ |___/_/ /_/ vpp-clab# show version vpp v25.02-release built by root on d5cd2c304b7f at 2025-02-26T13:58:32 vpp-clab# show interfaces Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count local0 0 down 0/0/0/0 Slick! I can see that the container has an eth0 device, which Docker has connected to the main bridged network. For now, there\u0026rsquo;s only one process running, pid 1 proudly shows VPP (as in Docker, the CMD field will simply replace init. Later on, I can imagine running a few more daemons like SSH and so on, but for now, I\u0026rsquo;m happy.\nLooking at VPP itself, it has no network interfaces yet, except for the default local0 interface.\nAdding Interfaces in Docker But if I don\u0026rsquo;t have DPDK, how will I add interfaces? Enter veth(4). From the [manpage], I learn that veth devices are virtual Ethernet devices. They can act as tunnels between network namespaces to create a bridge to a physical network device in another namespace, but can also be used as standalone network devices. veth devices are always created in interconnected pairs.\nOf course, Docker users will recognize this. It\u0026rsquo;s like bread and butter for containers to communicate with one another - and with the host they\u0026rsquo;re running on. I can simply create a Docker network and attach one half of it to a running container, like so:\npim@summer:~$ docker network create --driver=bridge clab-network \\ --subnet 192.0.2.0/24 --ipv6 --subnet 2001:db8::/64 5711b95c6c32ac0ed185a54f39e5af4b499677171ff3d00f99497034e09320d2 pim@summer:~$ docker network connect clab-network clab-pim --ip \u0026#39;\u0026#39; --ip6 \u0026#39;\u0026#39; The first command here creates a new network called clab-network in Docker. As a result, a new bridge called br-5711b95c6c32 shows up on the host. The bridge name is chosen from the UUID of the Docker object. Seeing as I added an IPv4 and IPv6 subnet to the bridge, it gets configured with the first address in both:\npim@summer:~/src/vpp-containerlab$ brctl show br-5711b95c6c32 bridge name bridge id STP enabled interfaces br-5711b95c6c32 8000.0242099728c6 no veth021e363 pim@summer:~/src/vpp-containerlab$ ip -br a show dev br-5711b95c6c32 br-5711b95c6c32 UP 192.0.2.1/24 2001:db8::1/64 fe80::42:9ff:fe97:28c6/64 fe80::1/64 The second command creates a veth pair, and puts one half of it in the bridge, and this interface is called veth021e363 above. 
The other half of it pops up as eth1 in the Docker container:\npim@summer:~/src/vpp-containerlab$ docker exec -it clab-pim bash root@d57c3716eee9:/# ip -br l lo UNKNOWN 00:00:00:00:00:00 \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; eth0@if530566 UP 02:42:ac:11:00:02 \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; eth1@if530577 UP 02:42:c0:00:02:02 \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; One of the many awesome features of VPP is its ability to attach to these veth devices by means of its af-packet driver, reusing the same MAC address (in this case 02:42:c0:00:02:02). I first take a look at the Linux [manpage] for it, and then read up on the VPP [documentation] on the topic.\nHowever, my attention is drawn to Docker assigning an IPv4 and IPv6 address to the container:\nroot@d57c3716eee9:/# ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 eth0@if530566 UP 172.17.0.2/16 eth1@if530577 UP 192.0.2.2/24 2001:db8::2/64 fe80::42:c0ff:fe00:202/64 root@d57c3716eee9:/# ip addr del 192.0.2.2/24 dev eth1 root@d57c3716eee9:/# ip addr del 2001:db8::2/64 dev eth1 I decide to remove them from here, as in the end, eth1 will be owned by VPP so it should be setting the IPv4 and IPv6 addresses. For the life of me, I don\u0026rsquo;t see how I can keep Docker from assigning IPv4 and IPv6 addresses to this container \u0026hellip; and the [docs] seem to be off as well, as they suggest I can pass a flag --ipv4=False but that flag doesn\u0026rsquo;t exist, at least not on my Bookworm Docker variant. I make a mental note to discuss this with the folks in the Containerlab community.\nAnyway, armed with this knowledge I can bind the container-side half of the veth pair, called eth1, to VPP, like so:\nroot@d57c3716eee9:/# vppctl _______ _ _ _____ ___ __/ __/ _ \\ (_)__ | | / / _ \\/ _ \\ _/ _// // / / / _ \\ | |/ / ___/ ___/ /_/ /____(_)_/\\___/ |___/_/ /_/ vpp-clab# create host-interface name eth1 hw-addr 02:42:c0:00:02:02 vpp-clab# set interface name host-eth1 eth1 vpp-clab# set interface mtu 1500 eth1 vpp-clab# set interface ip address eth1 192.0.2.2/24 vpp-clab# set interface ip address eth1 2001:db8::2/64 vpp-clab# set interface state eth1 up vpp-clab# show int addr eth1 (up): L3 192.0.2.2/24 L3 2001:db8::2/64 local0 (dn): Results After all this work, I\u0026rsquo;ve successfully created a Docker image based on Debian Bookworm and VPP 25.02 (the current stable release version), started a container with it, and added a network bridge in Docker, which binds the host summer to the container. Proof, as they say, is in the ping-pudding:\npim@summer:~/src/vpp-containerlab$ ping -c5 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes 64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.113 ms 64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.056 ms 64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.202 ms 64 bytes from 2001:db8::2: icmp_seq=4 ttl=64 time=0.102 ms 64 bytes from 2001:db8::2: icmp_seq=5 ttl=64 time=0.100 ms --- 2001:db8::2 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 4098ms rtt min/avg/max/mdev = 0.056/0.114/0.202/0.047 ms pim@summer:~/src/vpp-containerlab$ ping -c5 192.0.2.2 PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data.
64 bytes from 192.0.2.2: icmp_seq=1 ttl=64 time=0.043 ms 64 bytes from 192.0.2.2: icmp_seq=2 ttl=64 time=0.032 ms 64 bytes from 192.0.2.2: icmp_seq=3 ttl=64 time=0.019 ms 64 bytes from 192.0.2.2: icmp_seq=4 ttl=64 time=0.041 ms 64 bytes from 192.0.2.2: icmp_seq=5 ttl=64 time=0.027 ms --- 192.0.2.2 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 4063ms rtt min/avg/max/mdev = 0.019/0.032/0.043/0.008 ms And in case that simple ping-test wasn\u0026rsquo;t enough to get you excited, here\u0026rsquo;s a packet trace from VPP itself, while I\u0026rsquo;m performing this ping:\nvpp-clab# trace add af-packet-input 100 vpp-clab# wait 3 vpp-clab# show trace ------------------- Start of thread 0 vpp_main ------------------- Packet 1 00:07:03:979275: af-packet-input af_packet: hw_if_index 1 rx-queue 0 next-index 4 block 47: address 0x7fbf23b7d000 version 2 seq_num 48 pkt_num 0 tpacket3_hdr: status 0x20000001 len 98 snaplen 98 mac 92 net 106 sec 0x68164381 nsec 0x258e7659 vlan 0 vlan_tpid 0 vnet-hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 00:07:03:979293: ethernet-input IP4: 02:42:09:97:28:c6 -\u0026gt; 02:42:c0:00:02:02 00:07:03:979306: ip4-input ICMP: 192.0.2.1 -\u0026gt; 192.0.2.2 tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN fragment id 0x5813, flags DONT_FRAGMENT ICMP echo_request checksum 0xc16 id 21197 00:07:03:979315: ip4-lookup fib 0 dpo-idx 9 flow hash: 0x00000000 ICMP: 192.0.2.1 -\u0026gt; 192.0.2.2 tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN fragment id 0x5813, flags DONT_FRAGMENT ICMP echo_request checksum 0xc16 id 21197 00:07:03:979322: ip4-receive fib:0 adj:9 flow:0x00000000 ICMP: 192.0.2.1 -\u0026gt; 192.0.2.2 tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN fragment id 0x5813, flags DONT_FRAGMENT ICMP echo_request checksum 0xc16 id 21197 00:07:03:979323: ip4-icmp-input ICMP: 192.0.2.1 -\u0026gt; 192.0.2.2 tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN fragment id 0x5813, flags DONT_FRAGMENT ICMP echo_request checksum 0xc16 id 21197 00:07:03:979323: ip4-icmp-echo-request ICMP: 192.0.2.1 -\u0026gt; 192.0.2.2 tos 0x00, ttl 64, length 84, checksum 0x5e92 dscp CS0 ecn NON_ECN fragment id 0x5813, flags DONT_FRAGMENT ICMP echo_request checksum 0xc16 id 21197 00:07:03:979326: ip4-load-balance fib 0 dpo-idx 5 flow hash: 0x00000000 ICMP: 192.0.2.2 -\u0026gt; 192.0.2.1 tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN fragment id 0x2dc4, flags DONT_FRAGMENT ICMP echo_reply checksum 0x1416 id 21197 00:07:03:979325: ip4-rewrite tx_sw_if_index 1 dpo-idx 5 : ipv4 via 192.0.2.1 eth1: mtu:1500 next:3 flags:[] 0242099728c60242c00002020800 flow hash: 0x00000000 00000000: 0242099728c60242c00002020800450000542dc44000400188e1c0000202c000 00000020: 02010000141652cd00018143166800000000399d0900000000001011 00:07:03:979326: eth1-output eth1 flags 0x02180005 IP4: 02:42:c0:00:02:02 -\u0026gt; 02:42:09:97:28:c6 ICMP: 192.0.2.2 -\u0026gt; 192.0.2.1 tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN fragment id 0x2dc4, flags DONT_FRAGMENT ICMP echo_reply checksum 0x1416 id 21197 00:07:03:979327: eth1-tx af_packet: hw_if_index 1 tx-queue 0 tpacket3_hdr: status 0x1 len 108 snaplen 108 mac 0 net 0 sec 0x0 nsec 0x0 vlan 0 vlan_tpid 0 vnet-hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 buffer 0xf97c4: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0 local l2-hdr-offset 0 l3-hdr-offset 14 IP4: 
02:42:c0:00:02:02 -\u0026gt; 02:42:09:97:28:c6 ICMP: 192.0.2.2 -\u0026gt; 192.0.2.1 tos 0x00, ttl 64, length 84, checksum 0x88e1 dscp CS0 ecn NON_ECN fragment id 0x2dc4, flags DONT_FRAGMENT ICMP echo_reply checksum 0x1416 id 21197 Well, that\u0026rsquo;s a mouthful, isn\u0026rsquo;t it! Here, I get to show you VPP in action. After receiving the packet on its af-packet-input node from 192.0.2.1 (Summer, who is pinging us) to 192.0.2.2 (the VPP container), the packet traverses the dataplane graph. It goes through ethernet-input, then ip4-input, which sees that it\u0026rsquo;s destined to a configured IPv4 address, so the packet is handed to ip4-receive. That one sees that the IP protocol is ICMP, so it hands the packet to ip4-icmp-input which notices that the packet is an ICMP echo request, so off to ip4-icmp-echo-request our little packet goes. The ICMP plugin in VPP now answers by ip4-rewrite\u0026lsquo;ing the packet, sending the return to 192.0.2.1 at MAC address 02:42:09:97:28:c6 (this is Summer, the host doing the pinging!), after which the newly created ICMP echo-reply is handed to eth1-output which marshals it back into the kernel\u0026rsquo;s AF_PACKET interface using eth1-tx.\nBoom. I could not be more pleased.\nWhat\u0026rsquo;s Next This was a nice exercise for me! I\u0026rsquo;m going this direction because the [Containerlab] framework will start containers with given NOS images, not too dissimilar from the one I just made, and then attach veth pairs between the containers. I started dabbling with a [pull-request], but I got stuck with a part of the Containerlab code that pre-deploys config files into the containers. You see, I will need to generate two files:\nA startup.conf file that is specific to the containerlab Docker container. I\u0026rsquo;d like them to each set their own hostname so that the CLI has a unique prompt. I can do this by setting unix { cli-prompt {{ .ShortName }}# } in the template renderer. Containerlab will know all of the veth pairs that are planned to be created into each VPP container. I\u0026rsquo;ll need it to then write a little snippet of config that does the create host-interface spiel, to attach these veth pairs to the VPP dataplane (a sketch of both files follows below). I reached out to Roman from Nokia, who is one of the authors and current maintainer of Containerlab. Roman was keen to help out, and seeing as he knows the Containerlab stuff well, and I know the VPP stuff well, this is a reasonable partnership! Soon, he and I plan to have a bare-bones setup that will connect a few VPP containers together with an SR Linux node in a lab. Stand by!\nOnce we have that, there\u0026rsquo;s still quite some work for me to do. Notably:\nConfiguration persistence. clab allows you to save the running config. For that, I\u0026rsquo;ll need to introduce [vppcfg] and a means to invoke it when the lab operator wants to save their config, and then reconfigure VPP when the container restarts. I\u0026rsquo;ll need to have a few files from clab shared with the host, notably the startup.conf and vppcfg.yaml, as well as some manual pre- and post-flight configuration for the more esoteric stuff. Building the plumbing for this is a TODO for now.
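To make those two generated files a little more concrete, here is a sketch of what they could look like for a single VPP node. This is not Containerlab's actual template output - the file path /etc/vpp/clab.vpp and the hw-addr value are placeholders - but the unix { cli-prompt ... } knob and the host-interface commands mirror what was shown earlier in this article:

# startup.conf fragment, rendered per node by the template engine: a unique CLI
# prompt, plus a bootstrap script that VPP executes at startup.
unix {
  cli-prompt {{ .ShortName }}#
  startup-config /etc/vpp/clab.vpp
}

# /etc/vpp/clab.vpp (hypothetical path): one stanza per veth pair that Containerlab
# wired into the container, following the same commands I typed by hand above.
create host-interface name eth1 hw-addr 02:42:c0:00:02:02
set interface name host-eth1 eth1
set interface mtu 1500 eth1
set interface state eth1 up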
Acknowledgements I wanted to give a shout-out to Nardus le Roux who inspired me to contribute this Containerlab VPP node type, and to Roman Dodin for his help getting the Containerlab parts squared away when I got a little bit stuck.\nFirst order of business: get it to ping at all \u0026hellip; it\u0026rsquo;ll go faster from there on out :)\n","date":"2025-05-03","desc":" Introduction From time to time the subject of containerized VPP instances comes up. At IPng, I run the routers in AS8298 on bare metal (Supermicro and Dell hardware), as it allows me to maximize performance. However, VPP is quite friendly in virtualization. Notably, it runs really well on virtual machines like Qemu/KVM or VMWare. I can pass through PCI devices directly to the host, and use CPU pinning to allow the guest virtual machine access to the underlying physical hardware. In such a mode, VPP performance almost the same as on bare metal. But did you know that VPP can also run in Docker?\n","permalink":"https://ipng.ch/s/articles/2025/05/03/vpp-in-containerlab-part-1/","section":"articles","title":"VPP in Containerlab - Part 1"},{"contents":" Introduction Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega is the home of the Frysian Internet Exchange called [Frys-IX]. Back in 2021, a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of the most densely populated facilities in western Europe. He was looking for a few launching customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on my [bucketlist]. Arend and his IT company [ERITAP], took delivery of that rack in May of 2021, and this is when the internet exchange with Frysian roots was born.\nIn the years from 2021 until now, Arend and I have been operating the exchange with reasonable success. It grew from a handful of folks in that first rack, to now some 250 participating ISPs with about ten switches in six datacenters across the Amsterdam metro area. It\u0026rsquo;s shifting a cool 800Gbit of traffic or so. It\u0026rsquo;s dope, and very rewarding to be able to contribute to this community!\nFrys-IX is growing We have several members with a 2x100G LAG and even though all inter-datacenter links are either dark fiber or WDM, we\u0026rsquo;re starting to feel the growing pains as we set our sights to the next step growth. You see, when FrysIX did 13.37Gbit of traffic, Arend organized a barbecue. When it did 133.7Gbit of traffic, Arend organized an even bigger barbecue. Obviously, the next step is 1337Gbit and joining the infamous [One TeraBit Club]. Thomas: we\u0026rsquo;re on our way!\nIt became clear that we will not be able to keep a dependable peering platform if FrysIX remains a single L2 broadcast domain, and it also became clear that concatenating multiple 100G ports would be operationally expensive (think of all the dark fiber or WDM waves!), and brittle (think of LACP and balancing traffic over those ports). We need to modernize in order to stay ahead of the growth curve.\nHello Nokia The Nokia 7220 Interconnect Router (7220 IXR) for data center fabric provides fixed-configuration, high-capacity platforms that let you bring unmatched scale, flexibility and operational simplicity to your data center networks and peering network environments. These devices are built around the Broadcom Trident chipset, in the case of the \u0026ldquo;D4\u0026rdquo; platform, this is a Trident4 with 28x100G and 8x400G ports. 
Whoot!\nWhat I find particularly awesome about the Trident series is their speed (total bandwidth of 12.8Tbps per router), low power use (without optics, the IXR-7220-D4 consumes about 150W) and a plethora of advanced capabilities like L2/L3 filtering, IPv4, IPv6 and MPLS routing, and modern approaches to scale-out networking such as VXLAN based EVPN. At the FrysIX barbecue in September of 2024, FrysIX was gifted a rather powerful IXR-7220-D3 router, shown in the picture to the right. That\u0026rsquo;s a 32x100G router.\nERITAP has bought two (new in box) IXR-7220-D4 (8x400G,28x100G) routers, and has also acquired two IXR-7220-D2 (48x25G,8x100G) routers. So in total, FrysIX is now the proud owner of five of these beautiful Nokia devices. If you haven\u0026rsquo;t yet, you should definitely read about these versatile routers on the [Nokia] website, and some details of the merchant silicon switch chips in use on the [Broadcom] website.\neVPN: A small rant First, I need to get something off my chest. Consider a topology for an internet exchange platform, taking into account the available equipment, rackspace, power, and cross connects. Somehow, almost every design or reference architecture I can find on the Internet assumes folks want to build a [Clos network], which has a topology consisting of leaf and spine switches. The spine switches have a different set of features than the leaf ones, notably they don\u0026rsquo;t have to do provider edge functionality like VXLAN encap and decapsulation. Almost all of these designs show how one might build a leaf-spine network for hyperscale.\nCritique 1: my \u0026lsquo;spine\u0026rsquo; (IXR-7220-D4 routers) must also be provider edge. Practically speaking, in the picture above I have these beautiful Nokia IXR-7220-D4 routers, using two 400G ports to connect between the facilities, and six 100G ports to connect the smaller breakout switches. That would leave a massive amount of capacity unused: 22x100G and 6x400G ports, to be exact.\nCritique 2: all \u0026lsquo;leaf\u0026rsquo; (either IXR-7220-D2 routers or Arista switches) can\u0026rsquo;t realistically connect to both \u0026lsquo;spines\u0026rsquo;. Our devices are spread out over two (and in practice, more like six) datacenters, and it\u0026rsquo;s prohibitively expensive to get 100G waves or dark fiber to create a full mesh. It\u0026rsquo;s much more economical to create a star-topology that minimizes cross-datacenter fiber spans.\nCritique 3: Most of these \u0026lsquo;spine-leaf\u0026rsquo; reference architectures assume that the interior gateway protocol is eBGP in what they call the underlay, and on top of that, some secondary eBGP that\u0026rsquo;s called the overlay. Frankly, such a design makes my head spin a little bit. These designs assume hundreds of switches, in which case making use of one AS number per switch could make sense, as iBGP needs either a \u0026lsquo;full mesh\u0026rsquo; or external route reflectors.\nCritique 4: These reference designs also make the assumption that all fiber is local and that, while optics and links can fail, it will be relatively rare to drain a link. However, in cross-datacenter networks, draining links for maintenance is very common, for example if the dark fiber provider needs to perform repairs on a span that was damaged.
With these eBGP-over-eBGP connections, traffic engineering is more difficult than simply raising the OSPF (or IS-IS) cost of a link, to reroute traffic.\nSetting aside eVPN for a second, if I were to build an IP transport network, like I did when I built [IPng Site Local], I would use a much more intuitive and simple (I would even dare say elegant) design:\nTake a classic IGP like [OSPF], or perhaps [IS-IS]. There is no benefit, to me at least, to use BGP as an IGP. I would give each of the links between the switches an IPv4 /31 and enable link-local, and give each switch a loopback address with a /32 IPv4 and a /128 IPv6. If I had multiple links between two given switches, I would probably just use ECMP if my devices supported it, and fall back to a LACP signaled bundle-ethernet otherwise. If I were to need to use BGP (and for eVPN, this need exists), taking the ISP mindset (as opposed to the datacenter fabric mindset), I would simply install iBGP against two or three route reflectors, and exchange routing information within the same single AS number. eVPN: A demo topology So, that\u0026rsquo;s exactly how I\u0026rsquo;m going to approach the FrysIX eVPN design: OSPF for the underlay and iBGP for the overlay! I have a feeling that some folks will despise me for being contrarian, but you can leave your comments below, and don\u0026rsquo;t forget to like-and-subscribe :-)\nArend builds this topology for me in Jubbega - also known as FrysIX HQ. He takes the two 400G-capable routers and connects them. Then he takes an Arista DCS-7060CX switch, which is eVPN capable, with 32x100G ports, based on the Broadcom Tomahawk chipset, and a smaller Nokia IXR-7220-D2 with 48x25G and 8x100G ports, based on the Trident3 chipset. He wires all of this up to look like the picture on the right.\nUnderlay: Nokia\u0026rsquo;s SR Linux We boot up the equipment, verify that all the optics and links are up, and connect the management ports to an OOB network that I can remotely log in to. This is the first time that either of us work on Nokia, but I find it reasonably intuitive once I get a few tips and tricks from Niek.\n[pim@nikhef ~]$ sr_cli --{ running }--[ ]-- A:pim@nikhef# enter candidate --{ candidate shared default }--[ ]-- A:pim@nikhef# set / interface lo0 admin-state enable A:pim@nikhef# set / interface lo0 subinterface 0 admin-state enable A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 admin-state enable A:pim@nikhef# set / interface lo0 subinterface 0 ipv4 address 198.19.16.1/32 A:pim@nikhef# commit stay There, my first config snippet! This creates a loopback interface, and similar to JunOS, a subinterface (which Juniper calls a unit) which enables IPv4 and gives it an /32 address. In SR Linux, any interface has to be associated with a network-instance, think of those as routing domains or VRFs. 
There\u0026rsquo;s a conveniently named default network-instance, which I\u0026rsquo;ll add this and the point-to-point interface between the two 400G routers to:\nA:pim@nikhef# info flat interface ethernet-1/29 set / interface ethernet-1/29 admin-state enable set / interface ethernet-1/29 subinterface 0 admin-state enable set / interface ethernet-1/29 subinterface 0 ip-mtu 9190 set / interface ethernet-1/29 subinterface 0 ipv4 admin-state enable set / interface ethernet-1/29 subinterface 0 ipv4 address 198.19.17.1/31 set / interface ethernet-1/29 subinterface 0 ipv6 admin-state enable A:pim@nikhef# set / network-instance default type default A:pim@nikhef# set / network-instance default admin-state enable A:pim@nikhef# set / network-instance default interface ethernet-1/29.0 A:pim@nikhef# set / network-instance default interface lo0.0 A:pim@nikhef# commit stay Cool. Assuming I now also do this on the other IXR-7220-D4 router, called equinix (which gets the loopback address 198.19.16.0/32 and the point-to-point on the 400G interface of 198.19.17.0/31), I should be able to do my first jumboframe ping:\nA:pim@equinix# ping network-instance default 198.19.17.1 -s 9162 -M do Using network instance default PING 198.19.17.1 (198.19.17.1) 9162(9190) bytes of data. 9170 bytes from 198.19.17.1: icmp_seq=1 ttl=64 time=0.466 ms 9170 bytes from 198.19.17.1: icmp_seq=2 ttl=64 time=0.477 ms 9170 bytes from 198.19.17.1: icmp_seq=3 ttl=64 time=0.547 ms Underlay: SR Linux OSPF OK, let\u0026rsquo;s get these two Nokia routers to speak OSPF, so that they can reach each other\u0026rsquo;s loopback. It\u0026rsquo;s really easy:\nA:pim@nikhef# / network-instance default protocols ospf instance default --{ candidate shared default }--[ network-instance default protocols ospf instance default ]-- A:pim@nikhef# set admin-state enable A:pim@nikhef# set version ospf-v2 A:pim@nikhef# set router-id 198.19.16.1 A:pim@nikhef# set area 0.0.0.0 interface ethernet-1/29.0 interface-type point-to-point A:pim@nikhef# set area 0.0.0.0 interface lo0.0 passive true A:pim@nikhef# commit stay Similar to in JunOS, I can descend into a configuration scope: the first line goes into the network-instance called default and then the protocols called ospf, and then the instance called default. Subsequent set commands operate at this scope. Once I commit this configuration (on the nikhef router and also the equinix router, with its own unique router-id), OSPF quickly shoots in action:\nA:pim@nikhef# show network-instance default protocols ospf neighbor ========================================================================================= Net-Inst default OSPFv2 Instance default Neighbors ========================================================================================= +---------------------------------------------------------------------------------------+ | Interface-Name Rtr Id State Pri RetxQ Time Before Dead | +=======================================================================================+ | ethernet-1/29.0 198.19.16.0 full 1 0 36 | +---------------------------------------------------------------------------------------+ ----------------------------------------------------------------------------------------- No. 
of Neighbors: 1 ========================================================================================= A:pim@nikhef# show network-instance default route-table all | more IPv4 unicast route table of network instance default +------------------+-----+------------+--------------+--------+----------+--------+------+-------------+-----------------+ | Prefix | ID | Route Type | Route Owner | Active | Origin | Metric | Pref | Next-hop | Next-hop | | | | | | | Network | | | (Type) | Interface | | | | | | | Instance | | | | | +==================+=====+============+==============+========+==========+========+======+=============+=================+ | 198.19.16.0/32 | 0 | ospfv2 | ospf_mgr | True | default | 1 | 10 | 198.19.17.0 | ethernet-1/29.0 | | | | | | | | | | (direct) | | | 198.19.16.1/32 | 7 | host | net_inst_mgr | True | default | 0 | 0 | None | None | | 198.19.17.0/31 | 6 | local | net_inst_mgr | True | default | 0 | 0 | 198.19.17.1 | ethernet-1/29.0 | | | | | | | | | | (direct) | | | 198.19.17.1/32 | 6 | host | net_inst_mgr | True | default | 0 | 0 | None | None | +==================+=====+============+==============+========+==========+========+======+=============+=================+ A:pim@nikhef# ping network-instance default 198.19.16.0 Using network instance default PING 198.19.16.0 (198.19.16.0) 56(84) bytes of data. 64 bytes from 198.19.16.0: icmp_seq=1 ttl=64 time=0.484 ms 64 bytes from 198.19.16.0: icmp_seq=2 ttl=64 time=0.663 ms Delicious! OSPF has learned the loopback, and it is now reachable. As with most things, going from 0 to 1 (in this case: understanding how SR Linux works at all) is the most difficult part. Then going from 1 to 2 is critical (in this case: making two routers interact with OSPF), but from there on, going from 2 to N is easy (in my case: enabling several other point-to-point /31 transit networks on the nikhef router, using ethernet-1/1.0 through ethernet-1/4.0 with the correct MTU and turning on OSPF for these), makes the whole network shoot to life. Slick!\nUnderlay: Arista I\u0026rsquo;ll point out that one of the devices in this topology is an Arista. We have several of these ready for deployment at FrysIX. They are a lot more affordable and easy to find on the second hand / refurbished market. These switches come with 32x100G ports, and are really good at packet slinging because they\u0026rsquo;re based on the Broadcom Tomahawk chipset. They pack a few less features than the Trident chipset that powers the Nokia, but they happen to have all the features we need to run our internet exchange . So I turn my attention to the Arista in the topology. I am much more comfortable configuring the whole thing here, as it\u0026rsquo;s not my first time touching these devices:\narista-leaf#show run int loop0 interface Loopback0 ip address 198.19.16.2/32 ip ospf area 0.0.0.0 arista-leaf#show run int Ethernet32/1 interface Ethernet32/1 description Core: Connected to nikhef:ethernet-1/2 load-interval 1 mtu 9190 no switchport ip address 198.19.17.5/31 ip ospf cost 1000 ip ospf network point-to-point ip ospf area 0.0.0.0 arista-leaf#show run section router ospf router ospf 65500 router-id 198.19.16.2 redistribute connected network 198.19.0.0/16 area 0.0.0.0 max-lsa 12000 I complete the configuration for the other two interfaces on this Arista, port Eth31/1 connects also to the nikhef IXR-7220-D4 and I give it a high cost of 1000, while Eth30/1 connects only 1x100G to the nokia-leaf IXR-7220-D2 with a cost of 10. 
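The paragraph above summarizes the other two core-facing ports in prose; purely as an illustration, here is a sketch of what those two stanzas could look like, following the same pattern as Ethernet32/1. The /31 addresses are inferred from the OSPF neighbor table shown just below (the Arista takes the other half of each /31) and the description lines are omitted since I don't know the exact far-end ports - this is not a copy of the real FrysIX config:

interface Ethernet31/1
   mtu 9190
   no switchport
   ip address 198.19.17.3/31
   ip ospf cost 1000
   ip ospf network point-to-point
   ip ospf area 0.0.0.0
!
interface Ethernet30/1
   mtu 9190
   no switchport
   ip address 198.19.17.10/31
   ip ospf cost 10
   ip ospf network point-to-point
   ip ospf area 0.0.0.0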
It\u0026rsquo;s nice to see OSPF in action - there are two equal-cost (but high cost) OSPF paths via router-id 198.19.16.1 (nikhef), and there\u0026rsquo;s one lower cost path via router-id 198.19.16.3 (nokia-leaf). The traceroute nicely shows the scenic route (arista-leaf -\u0026gt; nokia-leaf -\u0026gt; nokia -\u0026gt; equinix). Dope!\narista-leaf#show ip ospf nei Neighbor ID Instance VRF Pri State Dead Time Address Interface 198.19.16.1 65500 default 1 FULL 00:00:36 198.19.17.4 Ethernet32/1 198.19.16.3 65500 default 1 FULL 00:00:31 198.19.17.11 Ethernet30/1 198.19.16.1 65500 default 1 FULL 00:00:35 198.19.17.2 Ethernet31/1 arista-leaf#traceroute 198.19.16.0 traceroute to 198.19.16.0 (198.19.16.0), 30 hops max, 60 byte packets 1 198.19.17.11 (198.19.17.11) 0.220 ms 0.150 ms 0.206 ms 2 198.19.17.6 (198.19.17.6) 0.169 ms 0.107 ms 0.099 ms 3 198.19.16.0 (198.19.16.0) 0.434 ms 0.346 ms 0.303 ms So far, so good! The underlay is up, every router can reach every other router on its loopback, and all OSPF adjacencies are formed. I\u0026rsquo;ll leave the 2x100G between nikhef and arista-leaf at high cost for now.\nOverlay EVPN: SR Linux The big-picture idea here is to use iBGP with the same private AS number, and because there are two main facilities (NIKHEF and Equinix), make each of those bigger IXR-7220-D4 routers act as route-reflectors for the others. It means that they will have an iBGP session amongst themselves (198.19.16.0 \u0026lt;-\u0026gt; 198.19.16.1) and otherwise accept iBGP sessions from any IP address in the 198.19.16.0/24 subnet. This way, I don\u0026rsquo;t have to configure any more than strictly necessary on the core routers. Any new router can just plug in, form an OSPF adjacency, and connect to both core routers. I proceed to configure BGP on the Nokias like this:\nA:pim@nikhef# / network-instance default protocols bgp A:pim@nikhef# set admin-state enable A:pim@nikhef# set autonomous-system 65500 A:pim@nikhef# set router-id 198.19.16.1 A:pim@nikhef# set dynamic-neighbors accept match 198.19.16.0/24 peer-group overlay A:pim@nikhef# set afi-safi evpn admin-state enable A:pim@nikhef# set preference ibgp 170 A:pim@nikhef# set route-advertisement rapid-withdrawal true A:pim@nikhef# set route-advertisement wait-for-fib-install false A:pim@nikhef# set group overlay peer-as 65500 A:pim@nikhef# set group overlay afi-safi evpn admin-state enable A:pim@nikhef# set group overlay afi-safi ipv4-unicast admin-state disable A:pim@nikhef# set group overlay afi-safi ipv6-unicast admin-state disable A:pim@nikhef# set group overlay local-as as-number 65500 A:pim@nikhef# set group overlay route-reflector client true A:pim@nikhef# set group overlay transport local-address 198.19.16.1 A:pim@nikhef# set neighbor 198.19.16.0 admin-state enable A:pim@nikhef# set neighbor 198.19.16.0 peer-group overlay A:pim@nikhef# commit stay I can see that iBGP sessions establish between all the devices:\nA:pim@nikhef# show network-instance default protocols bgp neighbor --------------------------------------------------------------------------------------------------------------------------- BGP neighbor summary for network-instance \u0026#34;default\u0026#34; Flags: S static, D dynamic, L discovered by LLDP, B BFD enabled, - disabled, * slow --------------------------------------------------------------------------------------------------------------------------- ---------------------------------------------------------------------------------------------------------------------------
+-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+ | Net-Inst | Peer | Group | Flags | Peer-AS | State | Uptime | AFI/SAFI | [Rx/Active/Tx] | +=============+=============+==========+=======+==========+=============+===============+============+====================+ | default | 198.19.16.0 | overlay | S | 65500 | established | 0d:0h:2m:32s | evpn | [0/0/0] | | default | 198.19.16.2 | overlay | D | 65500 | established | 0d:0h:2m:27s | evpn | [0/0/0] | | default | 198.19.16.3 | overlay | D | 65500 | established | 0d:0h:2m:41s | evpn | [0/0/0] | +-------------+-------------+----------+-------+----------+-------------+---------------+------------+--------------------+ --------------------------------------------------------------------------------------------------------------------------- Summary: 1 configured neighbors, 1 configured sessions are established, 0 disabled peers 2 dynamic peers A few things to note here - there one configured neighbor (this is the other IXR-7220-D4 router), and two dynamic peers, these are the Arista and the smaller IXR-7220-D2 router. The only address family that they are exchanging information for is the evpn family, and no prefixes have been learned or sent yet, shown by the [0/0/0] designation in the last column.\nOverlay EVPN: Arista The Arista is also remarkably straight forward to configure. Here, I\u0026rsquo;ll simply enable the iBGP session as follows:\narista-leaf#show run section bgp router bgp 65500 neighbor evpn peer group neighbor evpn remote-as 65500 neighbor evpn update-source Loopback0 neighbor evpn ebgp-multihop 3 neighbor evpn send-community extended neighbor evpn maximum-routes 12000 warning-only neighbor 198.19.16.0 peer group evpn neighbor 198.19.16.1 peer group evpn ! address-family evpn neighbor evpn activate arista-leaf#show bgp summary BGP summary information for VRF default Router identifier 198.19.16.2, local AS number 65500 Neighbor AS Session State AFI/SAFI AFI/SAFI State NLRI Rcd NLRI Acc ----------- ----------- ------------- ----------------------- -------------- ---------- ---------- 198.19.16.0 65500 Established IPv4 Unicast Advertised 0 0 198.19.16.0 65500 Established L2VPN EVPN Negotiated 0 0 198.19.16.1 65500 Established IPv4 Unicast Advertised 0 0 198.19.16.1 65500 Established L2VPN EVPN Negotiated 0 0 On this leaf node, I\u0026rsquo;ll have a redundant iBGP session with the two core nodes. Since those core nodes are peering amongst themselves, and are configured as route-reflectors, this is all I need. No matter how many additional Arista (or Nokia) devices I add to the network, all they\u0026rsquo;ll have to do is enable OSPF (so they can reach 198.19.16.0 and .1) and turn on iBGP sessions with both core routers. Voila!\nVXLAN EVPN: SR Linux Nokia documentation informs me that SR Linux uses a special interface called system0 to source its VXLAN traffic from, and to add this interface to the default network-instance. 
So it\u0026rsquo;s a matter of defining that interface and associate a VXLAN interface with it, like so:\nA:pim@nikhef# set / interface system0 admin-state enable A:pim@nikhef# set / interface system0 subinterface 0 admin-state enable A:pim@nikhef# set / interface system0 subinterface 0 ipv4 admin-state enable A:pim@nikhef# set / interface system0 subinterface 0 ipv4 address 198.19.18.1/32 A:pim@nikhef# set / network-instance default interface system0.0 A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 type bridged A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 ingress vni 2604 A:pim@nikhef# set / tunnel-interface vxlan1 vxlan-interface 2604 egress source-ip use-system-ipv4-address A:pim@nikhef# commit stay This creates the plumbing for a VXLAN sub-interface called vxlan1.2604 which will accept/send traffic using VNI 2604 (this happens to be the VLAN id we use at FrysIX for our production Peering LAN), and it\u0026rsquo;ll use the system0.0 address to source that traffic from.\nThe second part is to create what SR Linux calls a MAC-VRF and put some interface(s) in it:\nA:pim@nikhef# set / interface ethernet-1/9 admin-state enable A:pim@nikhef# set / interface ethernet-1/9 breakout-mode num-breakout-ports 4 A:pim@nikhef# set / interface ethernet-1/9 breakout-mode breakout-port-speed 10G A:pim@nikhef# set / interface ethernet-1/9/3 admin-state enable A:pim@nikhef# set / interface ethernet-1/9/3 vlan-tagging true A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 type bridged A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 admin-state enable A:pim@nikhef# set / interface ethernet-1/9/3 subinterface 0 vlan encap untagged A:pim@nikhef# / network-instance peeringlan A:pim@nikhef# set type mac-vrf A:pim@nikhef# set admin-state enable A:pim@nikhef# set interface ethernet-1/9/3.0 A:pim@nikhef# set vxlan-interface vxlan1.2604 A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 admin-state enable A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 vxlan-interface vxlan1.2604 A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 evi 2604 A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-distinguisher rd 65500:2604 A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target export-rt target:65500:2604 A:pim@nikhef# set protocols bgp-vpn bgp-instance 1 route-target import-rt target:65500:2604 A:pim@nikhef# commit stay In the first block here, Arend took what is a 100G port called ethernet-1/9 and split it into 4x25G ports. Arend forced the port speed to 10G because he has taken a 40G-4x10G DAC, and it happens that the third lane is plugged into the Debian machine. So on ethernet-1/9/3 I\u0026rsquo;ll create a sub-interface, make it type bridged (which I\u0026rsquo;ve also done on vxlan1.2604!) and allow any untagged traffic to enter it.\nIf you, like me, are used to either VPP or IOS/XR, this type of sub-interface stuff should feel very natural to you. I\u0026rsquo;ve written about the sub-interfaces logic on Cisco\u0026rsquo;s IOS/XR and VPP approach in a previous [article] which my buddy Fred lovingly calls VLAN Gymnastics because the ports are just so damn flexible. 
Worth a read!\nThe second block creates a new network-instance which I\u0026rsquo;ll name peeringlan, and it associates the newly created untagged sub-interface ethernet-1/9/3.0 with the VXLAN interface, and starts a protocol for eVPN instructing traffic in and out of this network-instance to use EVI 2604 on the VXLAN sub-interface, and signalling of all MAC addresses learned to use the specified route-distinguisher and import/export route-targets. For simplicity I\u0026rsquo;ve just used the same for each: 65500:2604.\nI continue to add an interface to the peeringlan network-instance on the other two Nokia routers: ethernet-1/9/3.0 on the equinix router and ethernet-1/9.0 on the nokia-leaf router. Each of these goes to a 10Gbps port on a Debian machine.\nVXLAN EVPN: Arista At this point I\u0026rsquo;m feeling pretty bullish about the whole project. Arista does not make it very difficult on me to configure it for L2 EVPN (which is called MAC-VRF here also):\narista-leaf#conf t vlan 2604 name v-peeringlan interface Ethernet9/3 speed forced 10000full switchport access vlan 2604 interface Loopback1 ip address 198.19.18.2/32 interface Vxlan1 vxlan source-interface Loopback1 vxlan udp-port 4789 vxlan vlan 2604 vni 2604 After creating VLAN 2604 and making port Eth9/3 an access port in that VLAN, I\u0026rsquo;ll add a VTEP endpoint called Loopback1, and a VXLAN interface that uses that to source its traffic. Here, I\u0026rsquo;ll associate local VLAN 2604 with the Vxlan1 and its VNI 2604, to match up with how I configured the Nokias previously.\nFinally, it\u0026rsquo;s a matter of tying these together by announcing the MAC addresses into the EVPN iBGP sessions:\narista-leaf#conf t router bgp 65500 vlan 2604 rd 65500:2604 route-target both 65500:2604 redistribute learned ! Results To validate the configurations, I learn a cool trick from my buddy Andy on the SR Linux discord server. 
In EOS, I can ask it to check for any obvious mistakes in two places:\narista-leaf#show vxlan config-sanity detail Category Result Detail ---------------------------------- -------- -------------------------------------------------- Local VTEP Configuration Check OK Loopback IP Address OK VLAN-VNI Map OK Flood List OK Routing OK VNI VRF ACL OK Decap VRF-VNI Map OK VRF-VNI Dynamic VLAN OK Remote VTEP Configuration Check OK Remote VTEP OK Platform Dependent Check OK VXLAN Bridging OK VXLAN Routing OK VXLAN Routing not enabled CVX Configuration Check OK CVX Server OK Not in controller client mode MLAG Configuration Check OK Run \u0026#39;show mlag config-sanity\u0026#39; to verify MLAG config Peer VTEP IP OK MLAG peer is not connected MLAG VTEP IP OK Peer VLAN-VNI OK Virtual VTEP IP OK MLAG Inactive State OK arista-leaf#show bgp evpn sanity detail Category Check Status Detail -------- -------------------- ------ ------ General Send community OK General Multi-agent mode OK General Neighbor established OK L2 MAC-VRF route-target OK import and export L2 MAC-VRF OK route-distinguisher L2 MAC-VRF redistribute OK L2 MAC-VRF overlapping OK VLAN L2 Suppressed MAC OK VXLAN VLAN to VNI map for OK MAC-VRF VXLAN VRF to VNI map for OK IP-VRF Results: Arista view Inspecting the MAC addresses learned from all four of the client ports on the Debian machine is easy:\narista-leaf#show bgp evpn summary BGP summary information for VRF default Router identifier 198.19.16.2, local AS number 65500 Neighbor Status Codes: m - Under maintenance Neighbor V AS MsgRcvd MsgSent InQ OutQ Up/Down State PfxRcd PfxAcc 198.19.16.0 4 65500 3311 3867 0 0 18:06:28 Estab 7 7 198.19.16.1 4 65500 3308 3873 0 0 18:06:28 Estab 7 7 arista-leaf#show bgp evpn vni 2604 next-hop 198.19.18.3 BGP routing table information for VRF default Router identifier 198.19.16.2, local AS number 65500 Route status codes: * - valid, \u0026gt; - active, S - Stale, E - ECMP head, e - ECMP c - Contributing to ECMP, % - Pending BGP convergence Origin codes: i - IGP, e - EGP, ? - incomplete AS Path Attributes: Or-ID - Originator ID, C-LST - Cluster List, LL Nexthop - Link Local Nexthop Network Next Hop Metric LocPref Weight Path * \u0026gt;Ec RD: 65500:2604 mac-ip e43a.6e5f.0c59 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1 * ec RD: 65500:2604 mac-ip e43a.6e5f.0c59 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 * \u0026gt;Ec RD: 65500:2604 imet 198.19.18.3 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.1 * ec RD: 65500:2604 imet 198.19.18.3 198.19.18.3 - 100 0 i Or-ID: 198.19.16.3 C-LST: 198.19.16.0 There\u0026rsquo;s a lot to unpack here! The Arista is seeing that from the route-distinguisher I configured on all the sessions, it is learning one MAC address on neighbor 198.19.18.3 (this is the VTEP for the nokia-leaf router) from both iBGP sessions. The MAC address is learned from originator 198.19.16.3 (the loopback of the nokia-leaf router), from two cluster members, the active one on iBGP speaker 198.19.16.1 (nikhef) and a backup member on 198.19.16.0 (equinix).\nI can also see that there\u0026rsquo;s a bunch of imet route entries, and Andy explained these to me. They are a signal from a VTEP participant that they are interested in seeing multicast traffic (like neighbor discovery or ARP requests) flooded to them. Every router participating in this L2VPN will raise such an imet route, which I\u0026rsquo;ll see in duplicates as well (one from each iBGP session). 
This checks out.\nResults: SR Linux view The Nokia IXR-7220-D4 router called equinix has also learned a bunch of EVPN routing entries, which I can inspect as follows:\nA:pim@equinix# show network-instance default protocols bgp routes evpn route-type summary -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Show report for the BGP route table of network-instance \u0026#34;default\u0026#34; -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Status codes: u=used, *=valid, \u0026gt;=best, x=stale, b=backup Origin codes: i=IGP, e=EGP, ?=incomplete -------------------------------------------------------------------------------------------------------------------------------------------------------------------- BGP Router ID: 198.19.16.0 AS: 65500 Local AS: 65500 -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Type 2 MAC-IP Advertisement Routes +--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+ | Status | Route- | Tag-ID | MAC-address | IP-address | neighbor | Path-| Next-Hop | Label | ESI | MAC Mobility | | | distinguisher | | | | | id | | | | | +========+===============+========+===================+============+=============+======+============-+========+================================+==================+ | u*\u0026gt; | 65500:2604 | 0 | E4:3A:6E:5F:0C:57 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.1 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | | * | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | | u*\u0026gt; | 65500:2604 | 0 | E4:3A:6E:5F:0C:58 | 0.0.0.0 | 198.19.16.2 | 0 | 198.19.18.2 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | | * | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.1 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | | u*\u0026gt; | 65500:2604 | 0 | E4:3A:6E:5F:0C:59 | 0.0.0.0 | 198.19.16.3 | 0 | 198.19.18.3 | 2604 | 00:00:00:00:00:00:00:00:00:00 | - | +--------+---------------+--------+-------------------+------------+-------------+------+-------------+--------+--------------------------------+------------------+ -------------------------------------------------------------------------------------------------------------------------------------------------------------------- Type 3 Inclusive Multicast Ethernet Tag Routes +--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+ | Status | Route-distinguisher | Tag-ID | Originator-IP | neighbor | Path- | Next-Hop | | | | | | | id | | +========+=============================+========+=====================+=================+========+=======================+ | u*\u0026gt; | 65500:2604 | 0 | 198.19.18.1 | 198.19.16.1 | 0 | 198.19.18.1 | | * | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.1 | 0 | 198.19.18.2 | | u*\u0026gt; | 65500:2604 | 0 | 198.19.18.2 | 198.19.16.2 | 0 | 198.19.18.2 | | * | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.1 | 0 | 198.19.18.3 | | u*\u0026gt; | 65500:2604 | 0 | 198.19.18.3 | 198.19.16.3 | 0 | 198.19.18.3 | 
+--------+-----------------------------+--------+---------------------+-----------------+--------+-----------------------+ -------------------------------------------------------------------------------------------------------------------------- 0 Ethernet Auto-Discovery routes 0 used, 0 valid 5 MAC-IP Advertisement routes 3 used, 5 valid 5 Inclusive Multicast Ethernet Tag routes 3 used, 5 valid 0 Ethernet Segment routes 0 used, 0 valid 0 IP Prefix routes 0 used, 0 valid 0 Selective Multicast Ethernet Tag routes 0 used, 0 valid 0 Selective Multicast Membership Report Sync routes 0 used, 0 valid 0 Selective Multicast Leave Sync routes 0 used, 0 valid -------------------------------------------------------------------------------------------------------------------------- I have to say, SR Linux output is incredibly verbose! But, I can see all the relevant bits and bobs here. Each MAC-IP entry is accounted for, I can see several nexthops pointing at the nikhef switch, one pointing at the nokia-leaf router and one pointing at the Arista switch. I also see the imet entries. One thing to note \u0026ndash; the SR Linux implementation leaves the type-2 routes empty with a 0.0.0.0 IPv4 address, while the Arista (in my opinion, more correctly) leaves them as NULL (unspecified). But, everything looks great!\nResults: Debian view There\u0026rsquo;s one more thing to show, and that\u0026rsquo;s kind of the \u0026lsquo;proof is in the pudding\u0026rsquo; moment. As I said, Arend hooked up a Debian machine with an Intel X710-DA4 network card, which sports 4x10G SFP+ connections. This network card is a regular in my AS8298 network, as it has excellent DPDK support and can easily pump 40Mpps with VPP. IPng 🥰 Intel X710!\nroot@debian:~ # ip netns add nikhef root@debian:~ # ip link set enp1s0f0 netns nikhef root@debian:~ # ip netns exec nikhef ip link set enp1s0f0 up mtu 9000 root@debian:~ # ip netns exec nikhef ip addr add 192.0.2.10/24 dev enp1s0f0 root@debian:~ # ip netns exec nikhef ip addr add 2001:db8::10/64 dev enp1s0f0 root@debian:~ # ip netns add arista-leaf root@debian:~ # ip link set enp1s0f1 netns arista-leaf root@debian:~ # ip netns exec arista-leaf ip link set enp1s0f1 up mtu 9000 root@debian:~ # ip netns exec arista-leaf ip addr add 192.0.2.11/24 dev enp1s0f1 root@debian:~ # ip netns exec arista-leaf ip addr add 2001:db8::11/64 dev enp1s0f1 root@debian:~ # ip netns add nokia-leaf root@debian:~ # ip link set enp1s0f2 netns nokia-leaf root@debian:~ # ip netns exec nokia-leaf ip link set enp1s0f2 up mtu 9000 root@debian:~ # ip netns exec nokia-leaf ip addr add 192.0.2.12/24 dev enp1s0f2 root@debian:~ # ip netns exec nokia-leaf ip addr add 2001:db8::12/64 dev enp1s0f2 root@debian:~ # ip netns add equinix root@debian:~ # ip link set enp1s0f3 netns equinix root@debian:~ # ip netns exec equinix ip link set enp1s0f3 up mtu 9000 root@debian:~ # ip netns exec equinix ip addr add 192.0.2.13/24 dev enp1s0f3 root@debian:~ # ip netns exec equinix ip addr add 2001:db8::13/64 dev enp1s0f3 root@debian:~# ip netns exec nikhef fping -g 192.0.2.8/29 192.0.2.10 is alive 192.0.2.11 is alive 192.0.2.12 is alive 192.0.2.13 is alive root@debian:~# ip netns exec arista-leaf fping 2001:db8::10 2001:db8::11 2001:db8::12 2001:db8::13 2001:db8::10 is alive 2001:db8::11 is alive 2001:db8::12 is alive 2001:db8::13 is alive root@debian:~# ip netns exec equinix ip nei 192.0.2.10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE 192.0.2.11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE 192.0.2.12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 
STALE fe80::e63a:6eff:fe5f:c57 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE fe80::e63a:6eff:fe5f:c58 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE fe80::e63a:6eff:fe5f:c59 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE 2001:db8::10 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:57 STALE 2001:db8::11 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:58 STALE 2001:db8::12 dev enp1s0f3 lladdr e4:3a:6e:5f:0c:59 STALE The Debian machine puts each network card into its own network namespace, and gives them both an IPv4 and an IPv6 address. I can then enter the nikhef network namespace, which has its NIC connected to the IXR-7220-D4 router called nikhef, and ping all four endpoints. Similarly, I can enter the arista-leaf namespace and ping6 all four endpoints. Finally, I take a look at the IPv6 and IPv4 neighbor table on the network card that is connected to the equinix router. All three MAC addresses are seen. This proves end to end connectivity across the EVPN VXLAN, and full interoperability. Booyah!\nPerformance? We got that! I\u0026rsquo;m not worried as these Nokia routers are rated for 12.8Tbps of VXLAN\u0026hellip;.\nroot@debian:~# ip netns exec equinix iperf3 -c 192.0.2.12 Connecting to host 192.0.2.12, port 5201 [ 5] local 192.0.2.10 port 34598 connected to 192.0.2.12 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 1.15 GBytes 9.91 Gbits/sec 19 1.52 MBytes [ 5] 1.00-2.00 sec 1.15 GBytes 9.90 Gbits/sec 3 1.54 MBytes [ 5] 2.00-3.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes [ 5] 3.00-4.00 sec 1.15 GBytes 9.90 Gbits/sec 1 1.54 MBytes [ 5] 4.00-5.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes [ 5] 5.00-6.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes [ 5] 6.00-7.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes [ 5] 7.00-8.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes [ 5] 8.00-9.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes [ 5] 9.00-10.00 sec 1.15 GBytes 9.90 Gbits/sec 0 1.54 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec 24 sender [ 5] 0.00-10.00 sec 11.5 GBytes 9.90 Gbits/sec receiver iperf Done. What\u0026rsquo;s Next There\u0026rsquo;s a few improvements I can make before deploying this architecture to the internet exchange. Notably:\nthe functional equivalent of port security, that is to say only allowing one or two MAC addresses per member port. FrysIX has a strict one-port-one-member-one-MAC rule, and having port security will greatly improve our resilience. SR Linux has the ability to suppress ARP, even on L2 MAC-VRF! It\u0026rsquo;s relatively well known for IRB based setups, but adding this to transparent bridge-domains is possible in Nokia [ref], using the syntax of protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true. This will glean the IP addresses based on intercepted ARP requests, and reduce the need for BUM flooding. Andy informs me that Arista also has this feature. By setting router l2-vpn and arp learning bridged, the suppression of ARP requests/replies also works in the same way. This greatly reduces cross-router BUM flooding. If DE-CIX can do it, so can FrysIX :) some automation - although configuring the MAC-VRF across Arista and SR Linux is definitely not as difficult as I thought, having some automation in place will avoid errors and mistakes. It would suck if the IXP collapsed because I botched a link drain or PNI configuration! 
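The ARP suppression item above comes with the exact SR Linux syntax, so here is a minimal sketch of how that knob might be applied to the peeringlan MAC-VRF from earlier, together with the Arista counterpart that Andy mentioned. Treat both as starting points to verify in the lab rather than a finished FrysIX configuration:

A:pim@nikhef# / network-instance peeringlan
A:pim@nikhef# set protocols bgp-evpn bgp-instance 1 routes bridge-table mac-ip advertise true
A:pim@nikhef# commit stay

arista-leaf#conf t
router l2-vpn
   arp learning bridged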
Acknowledgements I am relatively new to EVPN configurations, and wanted to give a shoutout to Andy Whitaker who jumped in very quickly when I asked a question on the SR Linux Discord. He was gracious with his time and spent a few hours on a video call with me, explaining EVPN in great detail both for Arista as well as SR Linux, and in particular wanted to give a big \u0026ldquo;Thank you!\u0026rdquo; for helping me understand symmetric and asymmetric IRB in the context of multivendor EVPN. Andy is about to start a new job at Nokia, and I wish him all the best. To my friends at Nokia: you caught a good one, Andy is pure gold!\nI also want to thank Niek for helping me take my first baby steps onto this platform and patiently answering my nerdly questions about the platform, the switch chip, and the configuration philosophy. Learning a new NOS is always a fun task, and it was made super fun because Niek spent an hour with Arend and I on a video call, giving a bunch of operational tips and tricks along the way.\nFinally, Arend and ERITAP are an absolute joy to work with. We took turns hacking on the lab, which Arend made available for me while I am traveling to Mississippi this week. Thanks for the kWh and OOB access, and for brainstorming the config with me!\nReference configurations Here\u0026rsquo;s the configs for all machines in this demonstration: [nikhef] | [equinix] | [nokia-leaf] | [arista-leaf]\n","date":"2025-04-09","desc":" Introduction Somewhere in the far north of the Netherlands, the country where I was born, a town called Jubbega is the home of the Frysian Internet Exchange called [Frys-IX]. Back in 2021, a buddy of mine, Arend, said that he was planning on renting a rack at the NIKHEF facility, one of the most densely populated facilities in western Europe. He was looking for a few launching customers and I was definitely in the market for a presence in Amsterdam. I even wrote about it on my [bucketlist]. Arend and his IT company [ERITAP], took delivery of that rack in May of 2021, and this is when the internet exchange with Frysian roots was born.\n","permalink":"https://ipng.ch/s/articles/2025/04/09/frysix-evpn-think-different/","section":"articles","title":"FrysIX eVPN: think different"},{"contents":"Introduction In the second half of last year, I picked up a project together with Neil McKee of [inMon], the care takers of [sFlow]: an industry standard technology for monitoring high speed networks. sFlow gives complete visibility into the use of networks enabling performance optimization, accounting/billing for usage, and defense against security threats.\nThe open source software dataplane [VPP] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [DPDK] and [RDMA]. A clever design choice in the so called Host sFlow Daemon [host-sflow], which allows for a small portion of code to grab the samples, for example in a merchant silicon ASIC or FPGA, but also in the VPP software dataplane. The agent then transmits these samples using a Linux kernel feature called [PSAMPLE]. 
This greatly reduces the complexity of code to be implemented in the forwarding path, while at the same time bringing consistency to the sFlow delivery pipeline by (re)using the hsflowd business logic for the more complex state keeping, packet marshalling and transmission from the Agent to a central Collector.\nIn this third article, I wanted to spend some time discussing how samples make their way out of the VPP dataplane, and into higher level tools.\nRecap: sFlow sFlow describes a method for Monitoring Traffic in Switched/Routed Networks, originally described in [RFC3176]. The current specification is version 5 and is homed on the sFlow.org website [ref]. Typically, a Switching ASIC in the dataplane (seen at the bottom of the diagram to the left) is asked to copy 1-in-N packets to a local sFlow Agent.\nSampling: The agent will copy the first N bytes (typically 128) of the packet into a sample. As the ASIC knows which interface the packet was received on, the inIfIndex will be added. After a routing decision is made, the nexthop and its L2 address and interface become known. The ASIC might annotate the sample with this outIfIndex and DstMAC metadata as well.\nDrop Monitoring: There\u0026rsquo;s one rather clever insight that sFlow gives: what if the packet was not routed or switched, but rather discarded? For this, sFlow is able to describe the reason for the drop. For example, the ASIC receive queue could have been overfull, or it did not find a destination to forward the packet to (no FIB entry), perhaps it was instructed by an ACL to drop the packet, or maybe it even tried to transmit the packet but the physical datalink layer had to abandon the transmission for whatever reason (link down, TX queue full, link saturation, and so on). It\u0026rsquo;s hard to overstate how important it is to have this so-called drop monitoring, as operators often spend hours and hours figuring out why packets are lost in their network or datacenter switching fabric.\nMetadata: The agent may have other metadata as well, such as which prefix was the source and destination of the packet, what additional RIB information is available (AS path, BGP communities, and so on). This may be added to the sample record as well.\nCounters: Since sFlow is sampling 1:N packets, the system can estimate total traffic in a reasonably accurate way. Peter and Sonia wrote a succinct [paper] about the math, so I won\u0026rsquo;t get into that here. Mostly because I am but a software engineer, not a statistician\u0026hellip; :) However, I will say this: if a fraction of the traffic is sampled but the Agent knows how many bytes and packets were forwarded in total, it can provide an overview with a quantifiable accuracy. This is why the Agent will periodically get the interface counters from the ASIC.\nCollector: One or more samples can be concatenated into UDP messages that go from the sFlow Agent to a central sFlow Collector. The heavy lifting in analysis is done upstream from the switch or router, which is great for performance.
Many thousands or even tens of thousands of agents can forward their samples and interface counters to a single central collector, which in turn can be used to draw up a near real time picture of the state of traffic through even the largest of ISP networks or datacenter switch fabrics.\nIn sFlow parlance [VPP] and its companion [hsflowd] together form an Agent (it sends the UDP packets over the network), and for example the commandline tool sflowtool could be a Collector (it receives the UDP packets).\nRecap: sFlow in VPP First, I have some pretty good news to report - our work on this plugin was [merged] and will be included in the VPP 25.02 release in a few weeks! Last weekend, I gave a lightning talk at [FOSDEM] in Brussels, Belgium, and caught up with a lot of community members and network- and software engineers. I had a great time.\nIn trying to keep the amount of code as small as possible, and therefore the probability of bugs that might impact VPP\u0026rsquo;s dataplane stability low, the architecture of the end to end solution consists of three distinct parts, each with their own risk and performance profile:\n1. sFlow worker node: Its job is to do what the ASIC does in the hardware case. As VPP moves packets from device-input to the ethernet-input nodes in its forwarding graph, the sFlow plugin will inspect 1-in-N, taking a sample for further processing. Here, we don\u0026rsquo;t try to be clever, simply copy the inIfIndex and the first bytes of the ethernet frame, and append them to a [FIFO] queue. If too many samples arrive, samples are dropped at the tail, and a counter incremented. This way, I can tell when the dataplane is congested. Bounded FIFOs also provide fairness: it allows for each VPP worker thread to get their fair share of samples into the Agent\u0026rsquo;s hands.\n2. sFlow main process: There\u0026rsquo;s a function running on the main thread, which shifts further processing time away from the dataplane. This sflow-process does two things. Firstly, it consumes samples from the per-worker FIFO queues (both forwarded packets in green, and dropped ones in red). Secondly, it keeps track of time and every few seconds (20 by default, but this is configurable), it\u0026rsquo;ll grab all interface counters from those interfaces for which I have sFlow turned on. VPP produces Netlink messages and sends them to the kernel.\n3. Host sFlow daemon: The third component is external to VPP: hsflowd subscribes to the Netlink messages. It goes without saying that hsflowd is a battle-hardened implementation running on hundreds of different silicon and software defined networking stacks. The PSAMPLE stuff is easy, this module already exists. But Neil implemented a mod_vpp which can grab interface names and their ifIndex, and counter statistics. VPP emits this data as Netlink USERSOCK messages alongside the PSAMPLEs.\nBy the way, I\u0026rsquo;ve written about Netlink before when discussing the [Linux Control Plane] plugin. It\u0026rsquo;s a mechanism for programs running in userspace to share information with the kernel. In the Linux kernel, packets can be sampled as well, and sent from kernel to userspace using a PSAMPLE Netlink channel. 
However, the pattern is that of a message producer/subscriber relationship, and nothing precludes one userspace process (vpp) from being the producer while another userspace process (hsflowd) acts as the consumer!\nAssuming the sFlow plugin in VPP produces samples and counters properly, hsflowd will do the rest, giving correctness and upstream interoperability pretty much for free. That\u0026rsquo;s slick!\nVPP: sFlow Configuration The solution that I offer is based on two moving parts. First, the VPP plugin configuration, which turns on sampling at a given rate on physical devices, also known as hardware-interfaces. Second, the open source component [host-sflow] can be configured as of release v2.11-5 [ref].\nI will show how to configure VPP in three ways:\n1. VPP Configuration via CLI\npim@vpp0-0:~$ vppctl vpp0-0# sflow sampling-rate 100 vpp0-0# sflow polling-interval 10 vpp0-0# sflow header-bytes 128 vpp0-0# sflow enable GigabitEthernet10/0/0 vpp0-0# sflow enable GigabitEthernet10/0/0 disable vpp0-0# sflow enable GigabitEthernet10/0/2 vpp0-0# sflow enable GigabitEthernet10/0/3 The first three commands set the global defaults - in my case I\u0026rsquo;m going to be sampling at 1:100 which is an unusually high rate. A production setup may take 1-in-linkspeed-in-megabits, so for a 1Gbps device 1:1'000 is appropriate. For 100GE, something between 1:10'000 and 1:100'000 is more appropriate, depending on link load. The second command sets the interface stats polling interval. The default is to gather these statistics every 20 seconds, but I set it to 10s here.\nNext, I tell the plugin how many bytes of the sampled ethernet frame should be taken. Common values are 64 and 128 but it doesn\u0026rsquo;t have to be a power of two. I want enough data to see the headers, like MPLS label(s), Dot1Q tag(s), IP header and TCP/UDP/ICMP header, but the contents of the payload are rarely interesting for statistics purposes.\nFinally, I can turn on the sFlow plugin on an interface with the sflow enable-disable CLI. In VPP, an idiomatic way to turn on and off things is to have an enabler/disabler. It feels a bit clunky maybe to write sflow enable $iface disable but it makes more logical sense if you parse that as \u0026ldquo;enable-disable\u0026rdquo; with the default being the \u0026ldquo;enable\u0026rdquo; operation, and the alternate being the \u0026ldquo;disable\u0026rdquo; operation.\n2. VPP Configuration via API\nI implemented a few API methods for the most common operations.
Here\u0026rsquo;s a snippet that obtains the same config as what I typed on the CLI above, but using these Python API calls:\nfrom vpp_papi import VPPApiClient, VPPApiJSONFiles import sys vpp_api_dir = VPPApiJSONFiles.find_api_dir([]) vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir) vpp = VPPApiClient(apifiles=vpp_api_files, server_address=\u0026#34;/run/vpp/api.sock\u0026#34;) vpp.connect(\u0026#34;sflow-api-client\u0026#34;) print(vpp.api.show_version().version) # Output: 25.06-rc0~14-g9b1c16039 vpp.api.sflow_sampling_rate_set(sampling_N=100) print(vpp.api.sflow_sampling_rate_get()) # Output: sflow_sampling_rate_get_reply(_0=655, context=3, sampling_N=100) vpp.api.sflow_polling_interval_set(polling_S=10) print(vpp.api.sflow_polling_interval_get()) # Output: sflow_polling_interval_get_reply(_0=661, context=5, polling_S=10) vpp.api.sflow_header_bytes_set(header_B=128) print(vpp.api.sflow_header_bytes_get()) # Output: sflow_header_bytes_get_reply(_0=665, context=7, header_B=128) vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=True) vpp.api.sflow_enable_disable(hw_if_index=2, enable_disable=True) print(vpp.api.sflow_interface_dump()) # Output: [ sflow_interface_details(_0=667, context=8, hw_if_index=1), # sflow_interface_details(_0=667, context=8, hw_if_index=2) ] print(vpp.api.sflow_interface_dump(hw_if_index=2)) # Output: [ sflow_interface_details(_0=667, context=9, hw_if_index=2) ] print(vpp.api.sflow_interface_dump(hw_if_index=1234)) ## Invalid hw_if_index # Output: [] vpp.api.sflow_enable_disable(hw_if_index=1, enable_disable=False) print(vpp.api.sflow_interface_dump()) # Output: [ sflow_interface_details(_0=667, context=10, hw_if_index=2) ] This short program toys around a bit with the sFlow API. I first set the sampling to 1:100 and get the current value. Then I set the polling interval to 10s and retrieve the current value again. Finally, I set the header bytes to 128, and retrieve the value again.\nEnabling and disabling sFlow on interfaces shows the idiom I mentioned before - the API being an *_enable_disable() call of sorts, and typically taking a boolean argument if the operator wants to enable (the default), or disable sFlow on the interface. Getting the list of enabled interfaces can be done with the sflow_interface_dump() call, which returns a list of sflow_interface_details messages.\nI demonstrated VPP\u0026rsquo;s Python API and how it works in a fair amount of detail in a [previous article], in case this type of stuff interests you.\n3. VPPCfg YAML Configuration\nWriting on the CLI and calling the API is good and all, but many users of VPP have noticed that it does not have any form of configuration persistence and that\u0026rsquo;s deliberate. VPP\u0026rsquo;s goal is to be a programmable dataplane, and explicitly has left the programming and configuration as an exercise for integrators. I have written a Python project that takes a YAML file as input and uses it to configure (and reconfigure, on the fly) the dataplane automatically, called [VPPcfg]. Previously, I wrote some implementation thoughts on its [datamodel] and its [operations] so I won\u0026rsquo;t repeat that here. 
Instead, I will just show the configuration:\npim@vpp0-0:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; vppcfg.yaml interfaces: GigabitEthernet10/0/0: sflow: true GigabitEthernet10/0/1: sflow: true GigabitEthernet10/0/2: sflow: true GigabitEthernet10/0/3: sflow: true sflow: sampling-rate: 100 polling-interval: 10 header-bytes: 128 EOF pim@vpp0-0:~$ vppcfg plan -c vppcfg.yaml -o /etc/vpp/config/vppcfg.vpp [INFO ] root.main: Loading configfile vppcfg.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.reconciler.write: Wrote 13 lines to /etc/vpp/config/vppcfg.vpp [INFO ] root.main: Planning succeeded pim@vpp0-0:~$ vppctl exec /etc/vpp/config/vppcfg.vpp The nifty thing about vppcfg is that if I were to change, say, the sampling-rate (setting it to 1000) and disable sFlow from an interface, say Gi10/0/0, I can re-run the vppcfg plan and vppcfg apply stages and the VPP dataplane will be reprogrammed to reflect the newly declared configuration.\nhsflowd: Configuration When sFlow is enabled, VPP will start to emit Netlink messages of type PSAMPLE with packet samples and of type USERSOCK with the custom messages containing interface names and counters. These latter custom messages have to be decoded, which is done by the mod_vpp module in hsflowd, starting from release v2.11-5 [ref].\nHere\u0026rsquo;s a minimalist configuration:\npim@vpp0-0:~$ cat /etc/hsflowd.conf sflow { collector { ip=127.0.0.1 udpport=16343 } collector { ip=192.0.2.1 namespace=dataplane } psample { group=1 } vpp { osIndex=off } } There are two important details that can be confusing at first: 1. kernel network namespaces 2. interface index namespaces\nhsflowd: Network namespace Network namespaces virtualize Linux\u0026rsquo;s network stack. Upon creation, a network namespace contains only a loopback interface, and subsequently interfaces can be moved between namespaces. Each network namespace will have its own set of IP addresses, its own routing table, socket listing, connection tracking table, firewall, and other network-related resources. When started by systemd, hsflowd and VPP will normally both run in the default network namespace.\nGiven this, I can conclude that when the sFlow plugin opens a Netlink channel, it will naturally do this in the network namespace that its VPP process is running in (the default namespace, normally). It is therefore important that the recipient of these Netlink messages, notably hsflowd, runs in the same namespace as VPP. It\u0026rsquo;s totally fine to run them together in a different namespace (e.g. a container in Kubernetes or Docker), as long as they can see each other.\nIt might pose a problem if the network connectivity lives in a different namespace than the default one. One common example (that I heavily rely on at IPng!) is to create Linux Control Plane interface pairs, LIPs, in a dataplane namespace. The main reason for doing this is to allow something like FRR or Bird to completely govern the routing table in the kernel and keep it in-sync with the FIB in VPP. In such a dataplane network namespace, typically every interface is owned by VPP.\nLuckily, hsflowd can attach to one (default) namespace to get the PSAMPLEs, but create a socket in a different (dataplane) namespace to send packets to a collector. This explains the second collector entry in the config-file above.
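Because a namespace mismatch is easy to overlook, a quick way to double-check that VPP and hsflowd really do share a network namespace is to compare the net namespace identity of both processes under /proc. The snippet below is a small sanity-check sketch, not part of VPP or hsflowd; the process names vpp and hsflowd are assumptions and may well differ on your system.

```python
#!/usr/bin/env python3
"""Sanity check: do two processes share a network namespace? Illustrative only."""
import os

def pids_by_name(name):
    """Return PIDs whose /proc/<pid>/comm matches `name` exactly."""
    pids = []
    for entry in os.listdir("/proc"):
        if not entry.isdigit():
            continue
        try:
            with open(f"/proc/{entry}/comm") as f:
                if f.read().strip() == name:
                    pids.append(int(entry))
        except OSError:
            continue  # process disappeared, or no permission to read it
    return pids

def netns_of(pid):
    """Network namespace identity, e.g. 'net:[4026531992]'. Needs privileges."""
    return os.readlink(f"/proc/{pid}/ns/net")

if __name__ == "__main__":
    # 'vpp' and 'hsflowd' are assumed process names; adjust as needed.
    for name in ("vpp", "hsflowd"):
        for pid in pids_by_name(name):
            print(name, pid, netns_of(pid))
```

If the printed net:[...] identifiers differ, the PSAMPLE messages will never arrive at hsflowd. Back to the two collector entries in the hsflowd.conf above.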
Here, hsflowd will send UDP packets to 192.0.2.1:6343 from within the (VPP) dataplane namespace, and to 127.0.0.1:16343 in the default namespace.\nhsflowd: osIndex I hope the previous section made some sense, because this one will be a tad more esoteric. When creating a network namespace, each interface will get its own uint32 interface index that identifies it, and such an ID is typically called an ifIndex. It\u0026rsquo;s important to note that the same number can (and will!) occur multiple times, once for each namespace. Let me give you an example:\npim@summer:~$ ip link 1: lo: \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; mtu 65536 qdisc noqueue state UNKNOWN ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: eno1: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq master ipng-sl state UP ... link/ether 00:22:19:6a:46:2e brd ff:ff:ff:ff:ff:ff altname enp1s0f0 3: eno2: \u0026lt;NO-CARRIER,BROADCAST,MULTICAST,UP\u0026gt; mtu 900 qdisc mq master ipng-sl state DOWN ... link/ether 00:22:19:6a:46:30 brd ff:ff:ff:ff:ff:ff altname enp1s0f1 pim@summer:~$ ip netns exec dataplane ip link 1: lo: \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; mtu 65536 qdisc noqueue state UNKNOWN ... link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: loop0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9216 qdisc mq state UP ... link/ether de:ad:00:00:00:00 brd ff:ff:ff:ff:ff:ff 3: xe1-0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9216 qdisc mq state UP ... link/ether 00:1b:21:bd:c7:18 brd ff:ff:ff:ff:ff:ff I want to draw your attention to the number at the beginning of the line. In the default namespace, ifIndex=3 corresponds to ifName=eno2 (which has no link, it\u0026rsquo;s marked DOWN). But in the dataplane namespace, that index corresponds to a completely different interface called ifName=xe1-0 (which is link UP).\nNow, let me show you the interfaces in VPP:\npim@summer:~$ vppctl show int | grep Gigabit | egrep \u0026#39;Name|loop0|tap0|Gigabit\u0026#39; Name Idx State MTU (L3/IP4/IP6/MPLS) GigabitEthernet4/0/0 1 up 9000/0/0/0 GigabitEthernet4/0/1 2 down 9000/0/0/0 GigabitEthernet4/0/2 3 down 9000/0/0/0 GigabitEthernet4/0/3 4 down 9000/0/0/0 TenGigabitEthernet5/0/0 5 up 9216/0/0/0 TenGigabitEthernet5/0/1 6 up 9216/0/0/0 loop0 7 up 9216/0/0/0 tap0 19 up 9216/0/0/0 Here, I want you to look at the second column Idx, which shows what VPP calls the sw_if_index (the software interface index, as opposed to hardware index). Here, ifIndex=3 corresponds to ifName=GigabitEthernet4/0/2, which is neither eno2 nor xe1-0. Oh my, yet another namespace!\nIt turns out that there are three (relevant) types of namespaces at play here:\nLinux network namespace; here using dataplane and default each with their own unique (and overlapping) numbering. VPP hardware interface namespace, also called PHYs (for physical interfaces). When VPP first attaches to or creates network interfaces like the ones from DPDK or RDMA, these will create an hw_if_index in a list. VPP software interface namespace. All interfaces (including hardware ones!) will receive a sw_if_index in VPP. A good example is sub-interfaces: if I create a sub-int on GigabitEthernet4/0/2, it will NOT get a hardware index, but it will get the next available software index (in this example, sw_if_index=7). 
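To make the overlap between those numbering spaces concrete, here is a small illustrative snippet, not part of any of the tooling above, that builds the ifIndex-to-ifName map for the default and the dataplane Linux namespaces side by side. It assumes an iproute2 with JSON output, a namespace called dataplane, and enough privileges to enter it.

```python
#!/usr/bin/env python3
"""Show that the same ifIndex means different interfaces per Linux namespace."""
import json
import subprocess

def linux_links(netns=None):
    """Return {ifindex: ifname} for a namespace (None = default namespace)."""
    cmd = ["ip", "-j", "link"] if netns is None else ["ip", "-j", "-n", netns, "link"]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return {link["ifindex"]: link["ifname"] for link in json.loads(out)}

if __name__ == "__main__":
    default_ns = linux_links()
    dataplane_ns = linux_links("dataplane")  # typically needs root
    for idx in sorted(set(default_ns) | set(dataplane_ns)):
        print(f"ifIndex={idx}: default={default_ns.get(idx, '-')} "
              f"dataplane={dataplane_ns.get(idx, '-')}")
```

On Summer, ifIndex=3 would show eno2 in one column and xe1-0 in the other: same number, entirely different interfaces.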
In Linux CP, I can see a mapping from one to the other, just look at this:\npim@summer:~$ vppctl show lcp lcp default netns dataplane lcp lcp-auto-subint off lcp lcp-sync on lcp lcp-sync-unnumbered on itf-pair: [0] loop0 tap0 loop0 2 type tap netns dataplane itf-pair: [1] TenGigabitEthernet5/0/0 tap1 xe1-0 3 type tap netns dataplane itf-pair: [2] TenGigabitEthernet5/0/1 tap2 xe1-1 4 type tap netns dataplane itf-pair: [3] TenGigabitEthernet5/0/0.20 tap1.20 xe1-0.20 5 type tap netns dataplane Those itf-pair describe our LIPs, and they have the coordinates to three things. 1) The VPP software interface (VPP ifName=loop0 with sw_if_index=7), which 2) Linux CP will mirror into the Linux kernel using a TAP device (VPP ifName=tap0 with sw_if_index=19). That TAP has one leg in VPP (tap0), and another in 3) Linux (with ifName=loop0 and ifIndex=2 in namespace dataplane).\nSo the tuple that fully describes a LIP is {7, 19, 'dataplane', 2}.\nClimbing back out of that rabbit hole, I am now finally ready to explain the feature. When sFlow in VPP takes its sample, it will be doing this on a PHY, that is a given interface with a specific hw_if_index. When it polls the counters, it\u0026rsquo;ll do it for that specific hw_if_index. It now has a choice: should it share with the world the representation of its namespace, or should it try to be smarter? If LinuxCP is enabled, this interface will likely have a representation in Linux. So the plugin will first resolve the sw_if_index belonging to that PHY, and using that, try to look up a LIP with it. If it finds one, it\u0026rsquo;ll know both the namespace in which it lives as well as the osIndex in that namespace. If it doesn\u0026rsquo;t find a LIP, it will at least have the sw_if_index at hand, so it\u0026rsquo;ll annotate the USERSOCK counter messages with this information instead.\nNow, hsflowd has a choice to make: does it share the Linux representation and hide VPP as an implementation detail? Or does it share the VPP dataplane sw_if_index? There are use cases relevant to both, so the decision was to let the operator decide, by setting osIndex either on (use Linux ifIndex) or off (use VPP sw_if_index).\nhsflowd: Host Counters Now that I understand the configuration parts of VPP and hsflowd, I decide to configure everything, but without enabling sFlow on any interfaces in VPP yet. Once I start the daemon, I can see that it sends a UDP packet every 30 seconds to the configured collector:\npim@vpp0-0:~$ sudo tcpdump -s 9000 -i lo -n tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on lo, link-type EN10MB (Ethernet), snapshot length 9000 bytes 15:34:19.695042 IP 127.0.0.1.48753 \u0026gt; 127.0.0.1.6343: sFlowv5, IPv4 agent 198.19.5.16, agent-id 100000, length 716 The tcpdump I have on my Debian bookworm machines doesn\u0026rsquo;t know how to decode the contents of these sFlow packets. Actually, neither does Wireshark. I\u0026rsquo;ve attached a file of these mysterious packets [sflow-host.pcap] in case you want to take a look. Neil however gives me a tip.
A full message decoder and otherwise handy Swiss army knife lives in [sflowtool].\nI can offer this pcap file to sflowtool, or let it just listen on the UDP port directly, and it\u0026rsquo;ll tell me what it finds:\npim@vpp0-0:~$ sflowtool -p 6343 startDatagram ================================= datagramSourceIP 127.0.0.1 datagramSize 716 unixSecondsUTC 1739112018 localtime 2025-02-09T15:40:18+0100 datagramVersion 5 agentSubId 100000 agent 198.19.5.16 packetSequenceNo 57 sysUpTime 987398 samplesInPacket 1 startSample ---------------------- sampleType_tag 0:4 sampleType COUNTERSSAMPLE sampleSequenceNo 33 sourceId 2:1 counterBlock_tag 0:2001 adaptor_0_ifIndex 2 adaptor_0_MACs 1 adaptor_0_MAC_0 525400f00100 counterBlock_tag 0:2010 udpInDatagrams 123904 udpNoPorts 23132459 udpInErrors 0 udpOutDatagrams 46480629 udpRcvbufErrors 0 udpSndbufErrors 0 udpInCsumErrors 0 counterBlock_tag 0:2009 tcpRtoAlgorithm 1 tcpRtoMin 200 tcpRtoMax 120000 tcpMaxConn 4294967295 tcpActiveOpens 0 tcpPassiveOpens 30 tcpAttemptFails 0 tcpEstabResets 0 tcpCurrEstab 1 tcpInSegs 89120 tcpOutSegs 86961 tcpRetransSegs 59 tcpInErrs 0 tcpOutRsts 4 tcpInCsumErrors 0 counterBlock_tag 0:2008 icmpInMsgs 23129314 icmpInErrors 32 icmpInDestUnreachs 0 icmpInTimeExcds 23129282 icmpInParamProbs 0 icmpInSrcQuenchs 0 icmpInRedirects 0 icmpInEchos 0 icmpInEchoReps 32 icmpInTimestamps 0 icmpInAddrMasks 0 icmpInAddrMaskReps 0 icmpOutMsgs 0 icmpOutErrors 0 icmpOutDestUnreachs 23132467 icmpOutTimeExcds 0 icmpOutParamProbs 23132467 icmpOutSrcQuenchs 0 icmpOutRedirects 0 icmpOutEchos 0 icmpOutEchoReps 0 icmpOutTimestamps 0 icmpOutTimestampReps 0 icmpOutAddrMasks 0 icmpOutAddrMaskReps 0 counterBlock_tag 0:2007 ipForwarding 2 ipDefaultTTL 64 ipInReceives 46590552 ipInHdrErrors 0 ipInAddrErrors 0 ipForwDatagrams 0 ipInUnknownProtos 0 ipInDiscards 0 ipInDelivers 46402357 ipOutRequests 69613096 ipOutDiscards 0 ipOutNoRoutes 80 ipReasmTimeout 0 ipReasmReqds 0 ipReasmOKs 0 ipReasmFails 0 ipFragOKs 0 ipFragFails 0 ipFragCreates 0 counterBlock_tag 0:2005 disk_total 6253608960 disk_free 2719039488 disk_partition_max_used 56.52 disk_reads 11512 disk_bytes_read 626214912 disk_read_time 48469 disk_writes 1058955 disk_bytes_written 8924332032 disk_write_time 7954804 counterBlock_tag 0:2004 mem_total 8326963200 mem_free 5063872512 mem_shared 0 mem_buffers 86425600 mem_cached 827752448 swap_total 0 swap_free 0 page_in 306365 page_out 4357584 swap_in 0 swap_out 0 counterBlock_tag 0:2003 cpu_load_one 0.030 cpu_load_five 0.050 cpu_load_fifteen 0.040 cpu_proc_run 1 cpu_proc_total 138 cpu_num 2 cpu_speed 1699 cpu_uptime 1699306 cpu_user 64269210 cpu_nice 1810 cpu_system 34690140 cpu_idle 3234293560 cpu_wio 3568580 cpuintr 0 cpu_sintr 5687680 cpuinterrupts 1596621688 cpu_contexts 3246142972 cpu_steal 329520 cpu_guest 0 cpu_guest_nice 0 counterBlock_tag 0:2006 nio_bytes_in 250283 nio_pkts_in 2931 nio_errs_in 0 nio_drops_in 0 nio_bytes_out 370244 nio_pkts_out 1640 nio_errs_out 0 nio_drops_out 0 counterBlock_tag 0:2000 hostname vpp0-0 UUID ec933791-d6af-7a93-3b8d-aab1a46d6faa machine_type 3 os_name 2 os_release 6.1.0-26-amd64 endSample ---------------------- endDatagram ================================= If you thought: \u0026ldquo;What an obnoxiously long paste!\u0026rdquo;, then my slightly RSI-induced mouse-hand might agree with you. But it is really cool to see that every 30 seconds, the collector will receive this form of heartbeat from the agent. 
There\u0026rsquo;s a lot of vital signs in this packet, including some non-obvious but interesting stats like CPU load, memory, disk use and disk IO, and kernel version information. It\u0026rsquo;s super dope!\nhsflowd: Interface Counters Next, I\u0026rsquo;ll enable sFlow in VPP on all four interfaces (Gi10/0/0-Gi10/0/3), set the sampling rate to something very high (1 in 100M), and the interface polling-interval to every 10 seconds. And indeed, every ten seconds or so I get a few packets, which I captured in [sflow-interface.pcap]. Most of the packets contain only one counter record, while some contain more than one (in the PCAP, packet #9 has two). If I update the polling-interval to every second, I can see that most of the packets have all four counters.\nThose interface counters, as decoded by sflowtool, look like this:\npim@vpp0-0:~$ sflowtool -r sflow-interface.pcap | \\ awk \u0026#39;/startSample/ { on=1 } { if (on) { print $0 } } /endSample/ { on=0 }\u0026#39; startSample ---------------------- sampleType_tag 0:4 sampleType COUNTERSSAMPLE sampleSequenceNo 745 sourceId 0:3 counterBlock_tag 0:1005 ifName GigabitEthernet10/0/2 counterBlock_tag 0:1 ifIndex 3 networkType 6 ifSpeed 0 ifDirection 1 ifStatus 3 ifInOctets 858282015 ifInUcastPkts 780540 ifInMulticastPkts 0 ifInBroadcastPkts 0 ifInDiscards 0 ifInErrors 0 ifInUnknownProtos 0 ifOutOctets 1246716016 ifOutUcastPkts 975772 ifOutMulticastPkts 0 ifOutBroadcastPkts 0 ifOutDiscards 127 ifOutErrors 28 ifPromiscuousMode 0 endSample ---------------------- What I find particularly cool about it, is that sFlow provides an automatic mapping between the ifName=GigabitEthernet10/0/2 (tag 0:1005), together with an object (tag 0:1), which contains the ifIndex=3, and lots of packet and octet counters both in the ingress and egress direction. This is super useful for upstream collectors, as they can now find the hostname, agent name and address, and the correlation between interface names and their indexes. Noice!\nhsflowd: Packet Samples Now it\u0026rsquo;s time to ratchet up the packet sampling, so I move it from 1:100M to 1:1000, while keeping the interface polling-interval at 10 seconds, and I ask VPP to sample 64 bytes of each packet that it inspects. On either side of my pet VPP instance, I start an iperf3 run to generate some traffic. I now see a healthy stream of sflow packets coming in on port 6343. Every 30 seconds or so they still contain a host counter, and every 10 seconds a set of interface counters comes by, but mostly these UDP packets are showing me samples. I\u0026rsquo;ve captured a few minutes of these in [sflow-all.pcap]. Although Wireshark doesn\u0026rsquo;t know how to interpret the sFlow counter messages, it does know how to interpret the sFlow sample messages, and it reveals one of them like this:\nLet me take a look at the picture from top to bottom. First, the outer header (from 127.0.0.1:48753 to 127.0.0.1:6343) is the sFlow agent sending to the collector. The agent identifies itself as having IPv4 address 198.19.5.16 with ID 100000 and an uptime of 1h52m. Then, it says it\u0026rsquo;s going to send 9 samples, the first of which says it\u0026rsquo;s from ifIndex=2 and at a sampling rate of 1:1000. It then shows that sample, saying that the frame length is 1518 bytes, and the first 64 bytes of those are sampled. Finally, the first sampled packet starts at the blue line.
It shows the SrcMAC and DstMAC, and that it was a TCP packet from 192.168.10.17:51028 to 192.168.10.33:5201 - my running iperf3, booyah!\nVPP: sFlow Performance One question I get a lot about this plugin is: what is the performance impact when using sFlow? I spent a considerable amount of time tinkering with this, and together with Neil bringing the plugin to what we both agree is the most efficient use of CPU. We could have gone a bit further, but that would require somewhat intrusive changes to VPP\u0026rsquo;s internals and as North of the Border (and the Simpsons!) would say: what we have isn\u0026rsquo;t just good, it\u0026rsquo;s good enough!\nI\u0026rsquo;ve built a small testbed based on two Dell R730 machines. On the left, I have a Debian machine running Cisco T-Rex using four quad-tengig network cards, the classic Intel i710-DA4. On the right, I have my VPP machine called Hippo (because it\u0026rsquo;s always hungry for packets), with the same hardware. I\u0026rsquo;ll build two halves. On the top NIC (Te3/0/0-3 in VPP), I will install IPv4 and MPLS forwarding on the purple circuit, and a simple Layer2 cross connect on the cyan circuit. On all four interfaces, I will enable sFlow. Then, I will mirror this configuration on the bottom NIC (Te130/0/0-3) in the red and green circuits, for which I will leave sFlow turned off.\nTo help you reproduce my results, and under the assumption that this is your jam, here\u0026rsquo;s the configuration for all of the kit:\n0. Cisco T-Rex\npim@trex:~ $ cat /srv/trex/8x10.yaml - version: 2 interfaces: [ \u0026#39;06:00.0\u0026#39;, \u0026#39;06:00.1\u0026#39;, \u0026#39;83:00.0\u0026#39;, \u0026#39;83:00.1\u0026#39;, \u0026#39;87:00.0\u0026#39;, \u0026#39;87:00.1\u0026#39;, \u0026#39;85:00.0\u0026#39;, \u0026#39;85:00.1\u0026#39; ] port_info: - src_mac: 00:1b:21:06:00:00 dest_mac: 9c:69:b4:61:a1:dc # Connected to Hippo Te3/0/0, purple - src_mac: 00:1b:21:06:00:01 dest_mac: 9c:69:b4:61:a1:dd # Connected to Hippo Te3/0/1, purple - src_mac: 00:1b:21:83:00:00 dest_mac: 00:1b:21:83:00:01 # L2XC via Hippo Te3/0/2, cyan - src_mac: 00:1b:21:83:00:01 dest_mac: 00:1b:21:83:00:00 # L2XC via Hippo Te3/0/3, cyan - src_mac: 00:1b:21:87:00:00 dest_mac: 9c:69:b4:61:75:d0 # Connected to Hippo Te130/0/0, red - src_mac: 00:1b:21:87:00:01 dest_mac: 9c:69:b4:61:75:d1 # Connected to Hippo Te130/0/1, red - src_mac: 9c:69:b4:85:00:00 dest_mac: 9c:69:b4:85:00:01 # L2XC via Hippo Te130/0/2, green - src_mac: 9c:69:b4:85:00:01 dest_mac: 9c:69:b4:85:00:00 # L2XC via Hippo Te130/0/3, green pim@trex:~ $ sudo t-rex-64 -i -c 4 --cfg /srv/trex/8x10.yaml When constructing the T-Rex configuration, I specifically set the destination MAC address for L3 circuits (the purple and red ones) using Hippo\u0026rsquo;s interface MAC address, which I can find with vppctl show hardware-interfaces. This way, T-Rex does not have to ARP for the VPP endpoint. On L2XC circuits (the cyan and green ones), VPP does not concern itself with the MAC addressing at all. It puts its interface in promiscuous mode, and simply writes out any ethernet frame received, directly to the egress interface.\n1. 
IPv4\nhippo# set int state TenGigabitEthernet3/0/0 up hippo# set int state TenGigabitEthernet3/0/1 up hippo# set int state TenGigabitEthernet130/0/0 up hippo# set int state TenGigabitEthernet130/0/1 up hippo# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31 hippo# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31 hippo# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31 hippo# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31 hippo# ip route add 16.0.0.0/24 via 100.64.0.0 hippo# ip route add 48.0.0.0/24 via 100.64.1.0 hippo# ip route add 16.0.2.0/24 via 100.64.4.0 hippo# ip route add 48.0.2.0/24 via 100.64.5.0 hippo# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static hippo# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static hippo# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static hippo# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static By the way, one note on this last piece: I\u0026rsquo;m setting static IPv4 neighbors so that Cisco T-Rex as well as VPP do not have to use ARP to resolve each other. You\u0026rsquo;ll see above that the T-Rex configuration also uses MAC addresses exclusively. Setting the ip neighbor like this allows VPP to know where to send return traffic.\n2. MPLS\nhippo# mpls table add 0 hippo# set interface mpls TenGigabitEthernet3/0/0 enable hippo# set interface mpls TenGigabitEthernet3/0/1 enable hippo# set interface mpls TenGigabitEthernet130/0/0 enable hippo# set interface mpls TenGigabitEthernet130/0/1 enable hippo# mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17 hippo# mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16 hippo# mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21 hippo# mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20 Here, the MPLS configuration implements a simple P-router, where incoming MPLS packets with label 16 will be sent back to T-Rex on Te3/0/1 to the specified IPv4 nexthop (for which I already know the MAC address), and with label 16 removed and new label 17 imposed, in other words a SWAP operation.\n3. L2XC\nhippo# set int state TenGigabitEthernet3/0/2 up hippo# set int state TenGigabitEthernet3/0/3 up hippo# set int state TenGigabitEthernet130/0/2 up hippo# set int state TenGigabitEthernet130/0/3 up hippo# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3 hippo# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2 hippo# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3 hippo# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2 I\u0026rsquo;ve added a layer2 cross connect as well because it\u0026rsquo;s computationally very cheap for VPP to receive an L2 (ethernet) datagram, and immediately transmit it on another interface. There\u0026rsquo;s no FIB lookup and not even an L2 nexthop lookup involved: VPP is just shoveling ethernet packets in-and-out as fast as it can!\nHere\u0026rsquo;s what a loadtest looks like when sending 80Gbps at 192b packets on all eight interfaces:\nThe leftmost ports p0 \u0026lt;-\u0026gt; p1 are sending IPv4+MPLS, while ports p2 \u0026lt;-\u0026gt; p3 are sending ethernet back and forth. All four of them have sFlow enabled, at a sampling rate of 1:10'000, the default. These four ports are my experiment, to show the CPU use of sFlow.
Then, ports p4 \u0026lt;-\u0026gt; p5 and p6 \u0026lt;-\u0026gt; p7 have the same configuration, but with sFlow turned off. They are my control, showing the CPU use without sFlow.\nFirst conclusion: This stuff works a treat. There is absolutely no impact on throughput at 80Gbps with 47.6Mpps, either with or without sFlow turned on. That\u0026rsquo;s wonderful news, as it shows that the dataplane has more CPU available than is needed for any combination of functionality.\nBut what is the limit? For this, I\u0026rsquo;ll take a deeper look at the runtime statistics by varying the CPU time spent and maximum throughput achievable on a single VPP worker, thus using a single CPU thread on this Hippo machine that has 44 cores and 44 hyperthreads. I switch the loadtester to emit 64 byte ethernet packets, the smallest I\u0026rsquo;m allowed to send.

| Loadtest | no sFlow | 1:1'000'000 | 1:10'000 | 1:1'000 | 1:100 |
|---|---|---|---|---|---|
| L2XC | 14.88Mpps | 14.32Mpps | 14.31Mpps | 14.27Mpps | 14.15Mpps |
| IPv4 | 10.89Mpps | 9.88Mpps | 9.88Mpps | 9.84Mpps | 9.73Mpps |
| MPLS | 10.11Mpps | 9.52Mpps | 9.52Mpps | 9.51Mpps | 9.45Mpps |
| sFlow Packets / 10sec | N/A | 337.42M total | 337.39M total | 336.48M total | 333.64M total |
| .. Sampled | N/A | 328 | 33.8k | 336k | 3.34M |
| .. Sent | N/A | 328 | 33.8k | 336k | 1.53M |
| .. Dropped | N/A | 0 | 0 | 0 | 1.81M |

Here I can make a few important observations.\nBaseline: One worker (thus, one CPU thread) can sustain 14.88Mpps of L2XC when sFlow is turned off, which implies that it has a little bit of CPU left over to do other work, if needed. With IPv4, I can see that the throughput is actually CPU limited: 10.89Mpps can be handled by one worker (thus, one CPU thread). I know that MPLS is a little bit more expensive computationally than IPv4, and that checks out. The total capacity is 10.11Mpps for one worker, when sFlow is turned off.\nOverhead: When I turn on sFlow on the interface, VPP will insert the sflow-node into the forwarding graph between device-input and ethernet-input. It means that the sFlow node will see every single packet, and it will have to move all of these into the next node, which costs about 9.5 CPU cycles per packet. The regression on L2XC is 3.8%, but I have to note that VPP was not CPU bound on the L2XC, so it used some CPU cycles which were still available, before regressing throughput. There is an immediate regression of 9.3% on IPv4 and 5.9% on MPLS, only to shuffle the packets through the graph.\nSampling Cost: But when doing higher rates of sampling, the further regression is not that terrible. Between 1:1'000'000 and 1:10'000, there\u0026rsquo;s barely a noticeable difference. Even in the worst case of 1:100, the regression is from 14.32Mpps to 14.15Mpps for L2XC, only 1.2%. The regressions for L2XC, IPv4 and MPLS are all very modest, at 1.2% (L2XC), 1.6% (IPv4) and 0.8% (MPLS). Of course, by using multiple hardware receive queues and multiple RX workers per interface, the cost can be kept well in hand.\nOverload Protection: At 1:1'000 and an effective rate of 33.65Mpps across all ports, I correctly observe 336k samples taken, and sent to PSAMPLE. At 1:100 however, there are 3.34M samples, but they are not fitting through the FIFO, so the plugin is dropping samples to protect the downstream sflow-process and hsflowd. I can see that here, 1.81M samples have been dropped, while 1.53M samples made it through.
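Just to convince myself the table adds up, here is a tiny back-of-the-envelope calculation. It is a throwaway snippet, not part of the plugin, using only the numbers from the 1:100 column above:

```python
# Back-of-the-envelope check of the 1:100 column in the table above.
total_pkts_per_10s = 333.64e6   # packets seen by the sflow node in 10 seconds
sampling_rate      = 100        # 1:100
sent_per_10s       = 1.53e6     # samples that fit through the FIFO

expected_samples = total_pkts_per_10s / sampling_rate
print(f"expected samples/10s : {expected_samples/1e6:.2f}M")  # ~3.34M, matches 'Sampled'
print(f"samples sent per sec : {sent_per_10s/10/1e3:.0f}K")   # ~153K/s
```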
By the way, this means VPP is happily sending a whopping 153K samples/sec to the collector!\nWhat\u0026rsquo;s Next Now that I\u0026rsquo;ve seen the UDP packets from our agent to a collector on the wire, and also how incredibly efficient the sFlow sampling implementation turned out, I\u0026rsquo;m super motivated to continue the journey with higher level collector receivers like ntopng, sflow-rt or Akvorado. In an upcoming article, I\u0026rsquo;ll describe how I rolled out Akvorado at IPng, and what types of changes would make the user experience even better (or simpler to understand, at least).\nAcknowledgements I\u0026rsquo;d like to thank Neil McKee from inMon for his dedication to getting things right, including the finer details such as logging, error handling, API specifications, and documentation. He has been a true pleasure to work with and learn from. Also, thank you to the VPP developer community, notably Benoit, Florin, Damjan, Dave and Matt, for helping with the review and getting this thing merged in time for the 25.02 release.\n","date":"2025-02-08","desc":"Introduction In the second half of last year, I picked up a project together with Neil McKee of [inMon], the care takers of [sFlow]: an industry standard technology for monitoring high speed networks. sFlow gives complete visibility into the use of networks enabling performance optimization, accounting/billing for usage, and defense against security threats.\nThe open source software dataplane [VPP] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [DPDK] and [RDMA]. A clever design choice in the so called Host sFlow Daemon [host-sflow], which allows for a small portion of code to grab the samples, for example in a merchant silicon ASIC or FPGA, but also in the VPP software dataplane. The agent then transmits these samples using a Linux kernel feature called [PSAMPLE]. This greatly reduces the complexity of code to be implemented in the forwarding path, while at the same time bringing consistency to the sFlow delivery pipeline by (re)using the hsflowd business logic for the more complex state keeping, packet marshalling and transmission from the Agent to a central Collector.\n","permalink":"https://ipng.ch/s/articles/2025/02/08/vpp-with-sflow-part-3/","section":"articles","title":"VPP with sFlow - Part 3"},{"contents":" Introduction A few months ago, I wrote about [an idea] to help boost the value of small Internet Exchange Points (IXPs). When such an exchange doesn\u0026rsquo;t have many members, then the operational costs of connecting to it (cross connects, router ports, finding peers, etc) are not very favorable.\nClearly, the benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s) traffic that must be delivered via their upstream transit providers, thereby reducing the average per-bit delivery cost and as well reducing the end to end latency as seen by their users or customers. Furthermore, the increased number of paths available through the IXP improves routing efficiency and fault-tolerance, and at the same time it avoids traffic going the scenic route to a large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.\nRefresher: FreeIX Remote Let\u0026rsquo;s take for example the [Free IX in Greece] that was announced at GRNOG16 in Athens on April 19th, 2024. This exchange initially targets Athens and Thessaloniki, with 2x100G between the two cities. 
Members can connect to either site for the cost of only a cross connect. The 1G/10G/25G ports will be Gratis, so please make sure to apply if you\u0026rsquo;re in this region! I myself have connected one very special router to Free IX Greece, which will be offering an outreach infrastructure by connecting to other Internet Exchange Points in Amsterdam, and allowing all FreeIX Greece members to benefit from that in the following way:\nFreeIX Remote uses AS50869 to peer with any network operator (or routeserver) available at public Internet Exchange Points or using private interconnects. To these peers, it looks like a completely normal service provider. It will connect to internet exchange points, and learn a bunch of routes and announce other routes.\nFreeIX Remote members can join the program, after which they are granted certain propagation permissions by FreeIX Remote at the point where they have a BGP session with AS50869. The prefixes learned on these member sessions are marked as such, and will be allowed to propagate. Members will receive some or all learned prefixes from AS50869.\nFreeIX members can set fine-grained BGP communities to determine which of their prefixes are propagated to and from which locations, by router, country or Internet Exchange Point.\nMembers at smaller internet exchange points greatly benefit from this type of outreach, by receiving large portions of the public internet directly at their preferred peering location. The Free IX Remote routers will carry member traffic to and from these remote Internet Exchange Points. My [previous article] went into a good amount of detail on the principles of operation, but back then I made a promise to come back to the actual implementation of such a complex routing topology. As a starting point, I work with the structure I shared in [IPng\u0026rsquo;s Routing Policy]. If you haven\u0026rsquo;t read that yet, I think it may make sense to take a look as many of the structural elements and concepts will be similar.\nImplementation The routing policy calls for three classes of (large) BGP communities: informational, permission and inhibit. It also defines a few classic BGP communities, but I\u0026rsquo;ll skip over those as they are not very interesting. Firstly, I will use the informational communities to tag which prefixes were learned by which router, in which country and at which internet exchange point, which I will call a group.\nThen, I will use the same structure to grant members permissions, that is to say, when AS50869 learns their prefixes, they will get tagged with specific action communities that enable propagation to other places. I will call this \u0026lsquo;Member-to-IXP\u0026rsquo;. Sometimes, I\u0026rsquo;d like to be able to inhibit propagation of \u0026lsquo;Member-to-IXP\u0026rsquo;, so there will be a third set of communities that perform this function. Finally, matching on the informational communities in a clever way will enable a symmetric \u0026lsquo;IXP-to-Member\u0026rsquo; propagation.\nTo structure this implementation, it helps if I think about it in the following way:\nLet\u0026rsquo;s say AS50869 is connected to IXP1, IXP2, IXP3 and IXP4. AS50869 has a member called M1 at IXP1, and that member is \u0026lsquo;permitted\u0026rsquo; to reach IXP2 and IXP3, but it is \u0026lsquo;inhibited\u0026rsquo; from reaching IXP4.
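Before diving into the actual implementation, here is a tiny thought experiment in Python that models that decision for M1. It is purely illustrative: the real thing will be Bird filters generated from YAML (shown below), not Python, and the names here are made up for the example.

```python
#!/usr/bin/env python3
"""Toy model of the FreeIX Remote propagation decision for the M1 example."""

def may_propagate(permissions, inhibits, target_ixp):
    """A prefix may go to target_ixp if a permission covers it and no inhibit does."""
    permitted = target_ixp in permissions or "all" in permissions
    if target_ixp in inhibits or "all" in inhibits:
        permitted = False
    return permitted

# Member M1 peers with AS50869 at IXP1: permitted to IXP2/IXP3, inhibited from IXP4.
m1_permissions = {"IXP2", "IXP3"}
m1_inhibits = {"IXP4"}

for ixp in ("IXP2", "IXP3", "IXP4"):
    print(ixp, "->", may_propagate(m1_permissions, m1_inhibits, ixp))
# IXP2 -> True, IXP3 -> True, IXP4 -> False
```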
My FreeIX Remote implementation now has to satisfy three main requirements:\nIngress: learn prefixes (from peers and members alike) at internet exchange points or private network interconnects, and \u0026lsquo;tag\u0026rsquo; them with the correct informational communities. Egress: Member-to-IXP: Announce M1\u0026rsquo;s prefixes to IXP2 and IXP3, but not to IXP4. Egress: IXP-to-Member: Announce IXP2\u0026rsquo;s and IXP3\u0026rsquo;s prefixes to M1, but not IXP4\u0026rsquo;s. Defining Countries and Routers I\u0026rsquo;ll start by giving each country which has at least one router a unique country_id in a YAML file, leaving the value 0 to mean \u0026lsquo;all\u0026rsquo; countries:\n$ cat config/common/countries.yaml country: all: 0 CH: 1 NL: 2 GR: 3 IT: 4 Each router has its own configuration file, and at the top, I\u0026rsquo;ll define some metadata which includes things like the country in which it operates, and its own unique router_id, like so:\n$ cat config/chrma0.net.free-ix.net.yaml device: id: 1 hostname: chrma0.free-ix.net shortname: chrma0 country: CH loopbacks: ipv4: 194.126.235.16 ipv6: \u0026#34;2a0b:dd80:3101::\u0026#34; location: \u0026#34;Hofwiesenstrasse, Ruemlang, Zurich, Switzerland\u0026#34; ... Defining communities Next, I define the BGP communities in class and subclass types, in the following YAML structure:\nebgp: community: legacy: noannounce: 0 blackhole: 666 inhibit: 3000 prepend1: 3100 prepend2: 3200 prepend3: 3300 large: class: informational: 1000 permission: 2000 inhibit: 3000 prepend1: 3100 prepend2: 3200 prepend3: 3300 subclass: all: 0 router: 10 country: 20 group: 30 asn: 40 Defining Members In order to keep this system manageable, I have to rely on automation. I intend to leverage the BGP community subclasses in a simple ACL system consisting of the following YAML, taking my buddy Antonios\u0026rsquo; network as an example:\n$ cat config/common/members.yaml member: 210312: description: DaKnObNET prefix_filter: AS-SET-DNET permission: [ router:chrma0 ] inhibit: [ group:chix ] ... The syntax of the permission and inhibit fields is identical. They are lists of key:value pairs where the key must be one of the subclasses (e.g. \u0026lsquo;router\u0026rsquo;, \u0026lsquo;country\u0026rsquo;, \u0026lsquo;group\u0026rsquo;, \u0026lsquo;asn\u0026rsquo;), and the value appropriate for that type. In this example, AS50869 is being asked to grant permissions for Antonios\u0026rsquo; prefixes to any peer connected to router:chrma0, but inhibit propagation to/from the exchange point called group:chix. I could extend this list, for example by adding a permission to country:NL or an inhibit to router:grskg0 and so on.\nI decide that sensible defaults are to give permissions to all, and keep inhibit empty. In other words: be very liberal in propagation, to maximize the value that FreeIX Remote can provide its members.\nIngress: Learning Prefixes With what I\u0026rsquo;ve defined so far, I can start to set informational BGP communities:\nThe prefixes learned on subclass router for chrma0 will have the value of device.id=1: (50869,1010,1) The prefixes learned on subclass country for chrma0 will take device.country=CH and look up in countries['CH'] that this means value 1: (50869,1020,1) When learning prefixes from a given internet exchange, Kees already knows its PeeringDB ixp_id, which is a unique value for each exchange point.
Thus, subclass group for chrma0 at [CommunityIX] is ixp_id=2013: (50869,1030,2013) Ingress: Learning from members I need to make sure that members send only the prefixes that I expect from them. To do this, I\u0026rsquo;ll make use of a common tool called [bgpq4], which cobbles together the prefixes belonging to an AS-SET by referencing one or more IRR databases.\nIn Python, I\u0026rsquo;ll prepare the Jinja context by generating the prefix filter lists like so:\nif session[\u0026#34;type\u0026#34;] == \u0026#34;member\u0026#34;: session = {**session, **data[\u0026#34;member\u0026#34;][asn]} pf = ebgp_merge_value(data[\u0026#34;ebgp\u0026#34;], group, session, \u0026#34;prefix_filter\u0026#34;, None) if pf: ctx[\u0026#34;prefix_filter\u0026#34;] = {} pfn = pf pfn = pfn.replace(\u0026#34;-\u0026#34;, \u0026#34;_\u0026#34;) pfn = pfn.replace(\u0026#34;:\u0026#34;, \u0026#34;_\u0026#34;) for af in [4, 6]: filter_name = \u0026#34;%s_%s_IPV%d\u0026#34; % (groupname.upper(), pfn, af) filter_contents = fetch_bgpq(filter_name, pf, af, allow_morespecifics=True) if \u0026#34;[\u0026#34; in filter_contents: ctx[\u0026#34;prefix_filter\u0026#34;][filter_name] = { \u0026#34;str\u0026#34;: filter_contents, \u0026#34;af\u0026#34;: af } ctx[\u0026#34;prefix_filter_ipv%d\u0026#34; % af] = True else: log.warning(f\u0026#34;Filter {filter_name} is empty!\u0026#34;) ctx[\u0026#34;prefix_filter_ipv%d\u0026#34; % af] = False First, if a given BGP session is of type member, I\u0026rsquo;ll merge the member[asn] dictionary into the ebgp.group.session[asn]. I\u0026rsquo;ve left out error handling for brevity, but in case the member YAML file doesn\u0026rsquo;t have an entry for the given ASN, it\u0026rsquo;ll just revert back to being of type peer.\nI\u0026rsquo;ll use a helper function ebgp_merge_value() to walk the YAML hierarchy from the member-data enriched session to the group and finally to the ebgp scope, looking for the existence of a key called prefix_filter and defaulting to None in case none was found. With the value of prefix_filter in hand (in this case AS-SET-DNET), I shell out to bgpq4 for IPv4 and IPv6 respectively. Sometimes, there are no IPv6 prefixes (why must you be like this?!) and sometimes there are no IPv4 prefixes (welcome to the Internet, kid!)\nAll of this context, including the session and group information, is then fed as context to a Jinja renderer, where I can use it in an import filter like so:
ebgp_import_{{session_type}}({{their_asn}}) then reject; {% endif %} # Add FreeIX Remote: Informational bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.router}},{{device.id}})); ## informational.router = {{ device.hostname }} bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.country}},{{country[device.country]}})); ## informational.country = {{ device.country }} {% if group.peeringdb_ix.id %} bgp_large_community.add(({{my_asn}},{{community.large.class.informational+community.large.subclass.group}},{{group.peeringdb_ix.id}})); ## informational.group = {{ group_name }} {% endif %} ## NOTE(pim): More comes here, see Member-to-IXP below accept; } Let me explain what\u0026rsquo;s going on here, as Jinja templating language that my generator uses is a bit \u0026hellip; chatty. The first block will print the dictionary of zero or more prefix_filter entries. If the prefix_filter context variable doesn\u0026rsquo;t exist, assume it\u0026rsquo;s the empty dictionary and thus, print no prefix lists.\nThen, I create a Bird2 filter and these must each have a globally unique name. I satisfy this requirement by giving it a name with the tuple of {group, their_asn}. The first thing this filter does, is inspect prefix_filter_ipv4 and prefix_filter_ipv6, and if they are explicitly set to False (for example, if a member doesn\u0026rsquo;t have any IRR prefixes associated with their AS-SET), then I\u0026rsquo;ll reject any prefixes from them. Then, I\u0026rsquo;ll match the prefixes with the prefix_filter, if provided, and reject any prefixes that aren\u0026rsquo;t in the list I\u0026rsquo;m expecting on this session. Assuming we\u0026rsquo;re still good to go, I\u0026rsquo;ll hand this prefix off to a function called ebgp_import_peer() for peers and ebgp_import_member() for members, both of which ensure BGP communities are scrubbed.\nfunction ebgp_import_peer(int remote_as) -\u0026gt; bool { # Scrub BGP Communities (RFC 7454 Section 11) bgp_community.delete([(50869, *)]); bgp_large_community.delete([(50869, *, *)]); # Scrub BLACKHOLE community bgp_community.delete((65535, 666)); return ebgp_import(remote_as); } function ebgp_import_member(int remote_as) -\u0026gt; bool { # We scrub only our own (informational, permissions) BGP Communities for members bgp_large_community.delete([(50869,1000..2999,*)]); return ebgp_import(remote_as); } After scrubbing the communities (peers are not allowed to set any communities, and members are not allowed to set their own informational or permissions communities, but they are allowed to inhibit themselves or prepend, if they wish), one last check is performed by calling the underlying ebgp_import():\nfunction ebgp_import(int remote_as) -\u0026gt; bool { if aspath_bogon() then return false; if (net.type = NET_IP4 \u0026amp;\u0026amp; ipv4_bogon()) then return false; if (net.type = NET_IP6 \u0026amp;\u0026amp; ipv6_bogon()) then return false; if (net.type = NET_IP4 \u0026amp;\u0026amp; ipv4_rpki_invalid()) then return false; if (net.type = NET_IP6 \u0026amp;\u0026amp; ipv6_rpki_invalid()) then return false; # Graceful Shutdown (https://www.rfc-editor.org/rfc/rfc8326.html) if (65535, 0) ~ bgp_community then bgp_local_pref = 0; return true; } Here, belt-and-suspenders checks are performed, notably bogon AS Paths, IPv4/IPv6 prefixes and RPKI invalids are filtered out. 
If the prefix has well-known community for [BGP Graceful Shutdown], honor it and set the local preference to zero (making sure to prefer any other available path).\nOK, after all these checks are done, I am finally ready to accept the prefix from this peer or member. It\u0026rsquo;s time to add the informational communities based on the router_id, the router\u0026rsquo;s country_id and (if this is a session at a public internet exchange point documented in PeeringDB), the group\u0026rsquo;s ixp_id.\nIngress Example: member Here\u0026rsquo;s what the rendered template looks like for Antonios\u0026rsquo; member session at CHIX:\n# bgpq4 -Ab4 -R 32 -l \u0026#39;define CHIX_AS_SET_DNET_IPV4\u0026#39; AS-SET-DNET define CHIX_AS_SET_DNET_IPV4 = [ 44.31.27.0/24{24,32}, 44.154.130.0/24{24,32}, 44.154.132.0/24{24,32}, 147.189.216.0/21{21,32}, 193.5.16.0/22{22,32}, 212.46.55.0/24{24,32} ]; # bgpq4 -Ab6 -R 128 -l \u0026#39;define CHIX_AS_SET_DNET_IPV6\u0026#39; AS-SET-DNET define CHIX_AS_SET_DNET_IPV6 = [ 2001:678:f5c::/48{48,128}, 2a05:dfc1:9174::/48{48,128}, 2a06:9f81:2500::/40{40,128}, 2a06:9f81:2600::/40{40,128}, 2a0a:6044:7100::/40{40,128}, 2a0c:2f04:100::/40{40,128}, 2a0d:3dc0::/29{29,128}, 2a12:bc0::/29{29,128} ]; filter ebgp_chix_210312_import { if (net.type = NET_IP4 \u0026amp;\u0026amp; ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject; if (net.type = NET_IP6 \u0026amp;\u0026amp; ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject; if ! ebgp_import_member(210312) then reject; # Add FreeIX Remote: Informational bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net bgp_large_community.add((50869,1020,1)); ## informational.country = CH bgp_large_community.add((50869,1030,2365)); ## informational.group = chix ## NOTE(pim): More comes here, see Member-to-IXP below accept; } Ingress Example: peer For completeness, here\u0026rsquo;s a regular peer Cloudflare at CHIX, and I hope you agree that the Jinja template renders down to something waaaay more readable now:\nfilter ebgp_chix_13335_import { if ! ebgp_import_peer(13335) then reject; # Add FreeIX Remote: Informational bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net bgp_large_community.add((50869,1020,1)); ## informational.country = CH bgp_large_community.add((50869,1030,2365)); ## informational.group = chix accept; } Most sessions will actually look like this one: just learning prefixes, scrubbing inbound communities that are nobody\u0026rsquo;s business to be setting but mine, tossing weird prefixes like bogons and then setting typically the three informational communities. I now know exactly which prefixes are picked up at group CHIX, which ones in country Switzerland, and which ones on router chrma0.\nEgress: Propagating Prefixes And with that, I\u0026rsquo;ve completed the \u0026rsquo;learning\u0026rsquo; part. Let me move to the \u0026lsquo;propagating\u0026rsquo; part. A design goal of FreeIX Remote is to have symmetric propagation. In my example above, member M1 should have its prefixes announced at IXP2 and IXP3, and all prefixes learned at IXP2 and IXP3 should be announced to member M1.\nFirst, let me create a helper function in the generator. It\u0026rsquo;s job is to take the symbolic member.*.permissions and member.*.inhibit lists and resolve them into a structure of numeric values suitable for BGP community list adding and matching. It\u0026rsquo;s a bit of a beast, but I\u0026rsquo;ve simplified it a bit. 
Notably, I\u0026rsquo;ve removed all the error and exception handling for brevity:\ndef parse_member_communities(data, asn, type): myasn = data[\u0026#34;ebgp\u0026#34;][\u0026#34;asn\u0026#34;] cls = data[\u0026#34;ebgp\u0026#34;][\u0026#34;community\u0026#34;][\u0026#34;large\u0026#34;][\u0026#34;class\u0026#34;] sub = data[\u0026#34;ebgp\u0026#34;][\u0026#34;community\u0026#34;][\u0026#34;large\u0026#34;][\u0026#34;subclass\u0026#34;] bgp_cl = [] member = data[\u0026#34;member\u0026#34;][asn] for perm in perms: if perm == \u0026#34;all\u0026#34;: el = { \u0026#34;class\u0026#34;: int(cls[type]), \u0026#34;subclass\u0026#34;: int(sub[\u0026#34;all\u0026#34;]), \u0026#34;value\u0026#34;: 0, \u0026#34;description\u0026#34;: f\u0026#34;{type}.all\u0026#34; } return [el] k, v = perm.split(\u0026#34;:\u0026#34;) if k == \u0026#34;country\u0026#34;: country_id = data[\u0026#34;country\u0026#34;][v] el = { \u0026#34;class\u0026#34;: int(cls[type]), \u0026#34;subclass\u0026#34;: int(sub[\u0026#34;country\u0026#34;]), \u0026#34;value\u0026#34;: int(country_id), \u0026#34;description\u0026#34;: f\u0026#34;{type}.{k} = {v}\u0026#34; } bgp_cl.append(el) elif k == \u0026#34;asn\u0026#34;: el = { \u0026#34;class\u0026#34;: int(cls[type]), \u0026#34;subclass\u0026#34;: int(sub[\u0026#34;asn\u0026#34;]), \u0026#34;value\u0026#34;: int(v), \u0026#34;description\u0026#34;: f\u0026#34;{type}.{k} = {v}\u0026#34; } bgp_cl.append(el) elif k == \u0026#34;router\u0026#34;: device_id = data[\u0026#34;_devices\u0026#34;][v][\u0026#34;id\u0026#34;] el = { \u0026#34;class\u0026#34;: int(cls[type]), \u0026#34;subclass\u0026#34;: int(sub[\u0026#34;router\u0026#34;]), \u0026#34;value\u0026#34;: int(device_id), \u0026#34;description\u0026#34;: f\u0026#34;{type}.{k} = {v}\u0026#34; } bgp_cl.append(el) elif k == \u0026#34;group\u0026#34;: group = data[\u0026#34;ebgp\u0026#34;][\u0026#34;groups\u0026#34;][v] if isinstance(group[\u0026#34;peeringdb_ix\u0026#34;], dict): ix_id = group[\u0026#34;peeringdb_ix\u0026#34;][\u0026#34;id\u0026#34;] else: ix_id = group[\u0026#34;peeringdb_ix\u0026#34;] el = { \u0026#34;class\u0026#34;: int(cls[type]), \u0026#34;subclass\u0026#34;: int(sub[\u0026#34;group\u0026#34;]), \u0026#34;value\u0026#34;: int(ix_id), \u0026#34;description\u0026#34;: f\u0026#34;{type}.{k} = {v}\u0026#34; } bgp_cl.append(el) else: log.warning (f\u0026#34;No implementation for {type} subclass \u0026#39;{k}\u0026#39; for member AS{asn}, skipping\u0026#34;) return bgp_cl The essence of this function is to take a human readable list of symbols, like \u0026lsquo;router:chrma0\u0026rsquo; and look up what subclass is called \u0026lsquo;router\u0026rsquo; and what router_id is \u0026lsquo;chrma0\u0026rsquo;. 
It does this for keywords \u0026lsquo;router\u0026rsquo;, \u0026lsquo;country\u0026rsquo;, \u0026lsquo;group\u0026rsquo; and \u0026lsquo;asn\u0026rsquo; and for a special keyword called \u0026lsquo;all\u0026rsquo; as well.\nRunning this a function on Antonios\u0026rsquo; member data above would reveal the following:\nMember 210312 has permissions: [{\u0026#39;class\u0026#39;: 2000, \u0026#39;subclass\u0026#39;: 10, \u0026#39;value\u0026#39;: 1, \u0026#39;description\u0026#39;: \u0026#39;permission.router = chrma0\u0026#39;}] Member 210312 has inhibits: [{\u0026#39;class\u0026#39;: 3000, \u0026#39;subclass\u0026#39;: 30, \u0026#39;value\u0026#39;: 2365, \u0026#39;description\u0026#39;: \u0026#39;inhibit.group = chix\u0026#39;}] The neat thing about this is, that this data will come in handy for both types of propagation, and the parse_member_communities() helper function returns pretty readable data, which will help in debugging and further understanding the ultimately generated configuration.\nEgress: Member-to-IXP OK, when I learned Antonios\u0026rsquo; prefixes, I have instructed the system to propagate them to all sessions on router chrma0, except sessions on group chix. This means that in the direction of from AS50869 to others, I can do the following:\n1. Tag permissions and inhibits on ingress\nI add a tiny bit of logic using this data structure I just created above. In the import filter, remember I added NOTE(pim): More comes here? After setting the informational communities, I also add these:\n{% if session_type == \u0026#34;member\u0026#34; %} {% if permissions %} # Add FreeIX Remote: Permission {% for el in permissions %} bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description }} {% endfor %} {% endif %} {% if inhibits %} # Add FreeIX Remote: Inhibit {% for el in inhibits %} bgp_large_community.add(({{my_asn}},{{el.class+el.subclass}},{{el.value}})); ## {{ el.description }} {% endfor %} {% endif %} {% endif %} Seeing as this block only gets rendered if the session type is member, let me show you how Antonios\u0026rsquo; import filter looks like in its full glory:\nfilter ebgp_chix_210312_import { if (net.type = NET_IP4 \u0026amp;\u0026amp; ! (net ~ CHIX_AS_SET_DNET_IPV4)) then reject; if (net.type = NET_IP6 \u0026amp;\u0026amp; ! (net ~ CHIX_AS_SET_DNET_IPV6)) then reject; if ! ebgp_import_member(210312) then reject; # Add FreeIX Remote: Informational bgp_large_community.add((50869,1010,1)); ## informational.router = chrma0.free-ix.net bgp_large_community.add((50869,1020,1)); ## informational.country = CH bgp_large_community.add((50869,1030,2365)); ## informational.group = chix # Add FreeIX Remote: Permission bgp_large_community.add((50869,2010,1)); ## permission.router = chrma0 # Add FreeIX Remote: Inhibit bgp_large_community.add((50869,3030,2365)); ## inhibit.group = chix accept; } Remember, the ebgp_import_member() helper will strip any informational (the 1000s) and permissions (the 2000s), but it would allow Antonios to set inhibits and prepends (the 3000s) so these BGP communities will still be allowed in. In other words, Antonios can\u0026rsquo;t give himself propagation rights (sorry, buddy!) but if he would like to make AS50869 stop sending his prefixes to, say, CommunityIX, he could simply add the BGP community (50869,3030,2013) on his announcements, and that will get honored. 
If he\u0026rsquo;d like AS50869 to prepend itself twice before announcing to peer AS8298, he could set (50869,3200,8298) and that will also get picked up.\n2. Match permissions and inhibits on egress\nNow that all of Antonios\u0026rsquo; prefixes are tagged with permissions and inhibits, I can reveal how I implemented the export filters for AS50869:\nfunction member_prefix(int group) -\u0026gt; bool { bool permitted = false; if (({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community || ({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community || ({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community || ({{ebgp.asn}}, {{ebgp.community.large.class.permission+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then { permitted = true; } if (({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.all}}, 0) ~ bgp_large_community || ({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.router}}, {{ device.id }}) ~ bgp_large_community || ({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.country}}, {{ country[device.country] }}) ~ bgp_large_community || ({{ebgp.asn}}, {{ebgp.community.large.class.inhibit+ebgp.community.large.subclass.group}}, group) ~ bgp_large_community) then { permitted = false; } return (permitted); } function valid_prefix(int group) -\u0026gt; bool { return (source_prefix() || member_prefix(group)); } function ebgp_export_peer(int remote_as; int group) -\u0026gt; bool { if (source != RTS_BGP \u0026amp;\u0026amp; source != RTS_STATIC) then return false; if !valid_prefix(group) then return false; bgp_community.delete([(50869, *)]); bgp_large_community.delete([(50869, *, *)]); return ebgp_export(remote_as); } From the bottom, the function ebgp_export_peer() is invoked on each peering session, and it gets as arguments the remote AS (for example 13335 for Cloudflare), and the group (for example 2365 for CHIX). The function ensures that it\u0026rsquo;s either a static route or a BGP route. Then it makes sure it\u0026rsquo;s a valid_prefix() for the group.\nThe valid_prefix() function first checks if it\u0026rsquo;s one of our own (as in: AS50869\u0026rsquo;s own) prefixes, which it does by calling source_prefix(), which I\u0026rsquo;ve omitted here as it would be a distraction. All it does is check if the prefix is in a static prefix list generated with bgpq4 for AS50869 itself. The more interesting observation is that to be eligible, the prefix needs to be either source_prefix() or member_prefix(group).\nThe propagation decision for \u0026lsquo;Member-to-IXP\u0026rsquo; actually happens in that member_prefix() function. It starts off by assuming the prefix is not permitted. Then it scans all relevant permissions communities which may be present in the RIB for this prefix:\nis the all permissions community (50869,2000,0) set? what about the router permission (50869,2010,R) for my router_id? perhaps the country permission (50869,2020,C) for my country_id? or maybe the group permission (50869,2030,G) for the ixp_id that this session lives on? If any of these conditions are true, then this prefix might be permitted, so I set the variable to True.
Next, I check and see if any of the inhibit communities are set, either by me (in members.yaml) or by the member on the live BGP session. If any one of them matches, then I flip the variable to False again. Once the verdict is known, I can return True or False here, which makes its way all the way up the call stack and ultimately announces the member prefix on the BGP session, or not. Slick!\nEgress: IXP-to-Member At this point, members\u0026rsquo; prefixes get announced at the correct internet exchange points, but I need to satisfy one more requirement: the prefixes picked up at those IXPs, should also be announced to members. For this, the helper dictionary with permissions and inhibits can be used in a clever way. What if I held them against the informational communities? For example, I have permitted Antonios to be annouced at any IXP connected to router chrma0, then all prefixes I learned at chrma0 are fair game, right? But, I configured an inhibit for Antonios\u0026rsquo; prefixes at CHIX. No problem, I have an informational community for all prefixes I learned from the CHIX group!\nI come to the realization that IXP-to-Member simply adds to the Member-to-IXP logic. Everything that I would announce to a peer, I will also announce to a member. Off I go, adding one last helper function to the BGP session Jinja template:\n{% if session_type == \u0026#34;member\u0026#34; %} function ebgp_export_{{group_name}}_{{their_asn}}(int remote_as; int group) -\u0026gt; bool { bool permitted = false; if (source != RTS_BGP \u0026amp;\u0026amp; source != RTS_STATIC) then return false; if valid_prefix(group) then return ebgp_export(remote_as); {% for el in permissions | default([]) %} if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=true; ## {{el.description}} {% endfor %} {% for el in inhibits | default([]) %} if (bgp_large_community ~ [({{ my_asn }},{{ 1000+el.subclass}},{% if el.value == 0%}*{% else %}{{el.value}}{% endif %})]) then permitted=false; ## {{el.description}} {% endfor %} if (permitted) then return ebgp_export(remote_as); return false; } {% endif %} Note that in essence, this new function still calls valid_prefix(), which in turn calls source_prefix() or member_prefix(group), so it announces the same prefixes that are also announced to sessions of type \u0026lsquo;peer\u0026rsquo;. But then, I\u0026rsquo;ll also inspect the informational communities, where the value of 0 is replaced with a wildcard, because \u0026lsquo;permit or inhibit all\u0026rsquo; would mean \u0026lsquo;match any of these BGP communities\u0026rsquo;. This template renders as follows for Antonios at CHIX:\nfunction ebgp_export_chix_210312(int remote_as; int group) -\u0026gt; bool { bool export = false; if (source != RTS_BGP \u0026amp;\u0026amp; source != RTS_STATIC) then return false; if valid_prefix(group) then return ebgp_export(remote_as); if (bgp_large_community ~ [(50869,1010,1)]) then export=true; ## permission.router = chrma0 if (bgp_large_community ~ [(50869,1030,2365)]) then export=false; ## inhibit.group = chix if (export) then return ebgp_export(remote_as); return false; } Results With this, the propagation logic is complete. Announcements are symmetric, that is to say the function ebgp_export_chix_210312() sees to it that Antonios gets the prefixes learned at router chrma0 but not those learned at group CHIX. 
Similarly, the ebgp_export_peer() ensures that Antonios\u0026rsquo; prefixes are propagated to any session at router chrma0 except those sessions at group CHIX.\nI have installed VPP with [OSPFv3] unnumbered interfaces, so each router has exactly one IPv4 and IPv6 loopback address. The router in Rümlang has been operational for a while, the one in Amsterdam (nlams0.free-ix.net) and Thessaloniki (grskg0.free-ix.net) have been deployed and are connecting to IXPs now, and the one in Milan (itmil0.free-ix.net) has been installed but is pending physical deployment at Caldara.\nI deployed a test setup with a few permissions and inhibits on the Rümlang router, with many thanks to Jurrian, Sam and Antonios for allowing me to guinnaepig-ize their member sessions. With the following test configuration:\nmember: 35202: description: OnTheGo (Sam Aschwanden) prefix_filter: AS-OTG permission: [ router:chrma0 ] inhibit: [ group:comix ] 210312: description: DaKnObNET prefix_filter: AS-SET-DNET permission: [ router:chrma0 ] inhibit: [ group:chix ] 212635: description: Jurrian van Iersel prefix_filter: AS212635:AS-212635 permission: [ router:chrma0 ] inhibit: [ group:chix, group:fogixp ] I can see the following prefix learn/announce counts towards members:\npim@chrma0:~$ for i in $(birdc show protocol | grep member | cut -f1 -d\u0026#39; \u0026#39;); do echo -n $i\\ ; birdc show protocol all $i | grep Routes; done chix_member_35202_ipv4_1 2 imported, 0 filtered, 159984 exported, 0 preferred chix_member_35202_ipv6_1 2 imported, 0 filtered, 61730 exported, 0 preferred chix_member_210312_ipv4_1 3 imported, 0 filtered, 3518 exported, 3 preferred chix_member_210312_ipv6_1 2 imported, 0 filtered, 1251 exported, 2 preferred comix_member_35202_ipv4_1 2 imported, 0 filtered, 159981 exported, 2 preferred comix_member_35202_ipv4_2 2 imported, 0 filtered, 159981 exported, 1 preferred comix_member_35202_ipv6_1 2 imported, 0 filtered, 61727 exported, 2 preferred comix_member_35202_ipv6_2 2 imported, 0 filtered, 61727 exported, 1 preferred fogixp_member_212635_ipv4_1 1 imported, 0 filtered, 442 exported, 1 preferred fogixp_member_212635_ipv6_1 14 imported, 0 filtered, 181 exported, 14 preferred freeix_ch_member_210312_ipv4_1 3 imported, 0 filtered, 3521 exported, 0 preferred freeix_ch_member_210312_ipv6_1 2 imported, 0 filtered, 1253 exported, 0 preferred Let me make a few observations:\nHurricane Electric AS6939 is present at CHIX, and they tend to announce a very large number of prefixes. So every member who is permitted (and not inhibited) at CHIX will see all of those: Sam\u0026rsquo;s AS35202 is inhibited on CommunityIX but not on CHIX, and he\u0026rsquo;s permitted on both. That explains why he is seeing the routes on both sessions. I\u0026rsquo;ve inhibited Jurrian\u0026rsquo;s AS212635 to/from both CHIX and FogIXP, which means he will be seeing CommunityIX (~245 IPv4, 85 IPv6 prefixes), and FreeIX CH (~173 IPv4 and ~60 IPv6). We also send him the member prefixes, which is about 35 or so additional prefixes. This explains why Jurrian is receiving from us ~440 IPv4 and ~180 IPv6. Antonios\u0026rsquo; AS210312, the exemplar in this article, is receiving all-but-CHIX. FogIXP yields 3077 or so IPv4 and 1056 IPv6 prefixes, while I\u0026rsquo;ve already added up FreeIX, CommunityIX, and our members (this is what we\u0026rsquo;re sending Jurrian!), at 330 resp 180, so Antonios should be getting about 3500 IPv4 prefixes and 1250 IPv6 prefixes. 
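Those counts line up with a simple model of the IXP-to-Member logic. Here is a quick name-level sketch in Python; the article only gives numeric group IDs for CHIX (2365) and CommunityIX (2013), so this stays with group names rather than community values, and it leaves out the handful of member-to-member prefixes.

```python
# Which source groups does each member receive, given 'permission: router:chrma0'
# plus their per-group inhibits? Group names stand in for the numeric IDs here.
GROUPS = ["chix", "comix", "fogixp", "freeix_ch"]

members = {
    "AS35202 (Sam)":       {"inhibit_groups": {"comix"}},
    "AS210312 (Antonios)": {"inhibit_groups": {"chix"}},
    "AS212635 (Jurrian)":  {"inhibit_groups": {"chix", "fogixp"}},
}

for name, m in members.items():
    # permission router:chrma0 matches every prefix learned on this router, so the
    # only thing that removes a source group is an inhibit on that group.
    receives = [g for g in GROUPS if g not in m["inhibit_groups"]]
    print(f"{name} receives prefixes learned at: {', '.join(receives)}")
```

This reproduces the observations: Sam gets everything except CommunityIX (and therefore all of Hurricane Electric's CHIX prefixes), Jurrian only CommunityIX and FreeIX CH, and Antonios everything except CHIX.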
In the other direction, I would expect to be announcing to peers only prefixes belonging to either AS50869 itself, or those of our members:\npim@chrma0:~$ for i in $(birdc show protocol | grep peer.*_1 | cut -f1 -d\u0026#39; \u0026#39;); do echo -n $i\\ ; birdc show protocol all $i | grep Routes || echo; done chix_peer_212100_ipv4_1 57618 imported, 0 filtered, 24 exported, 778 preferred chix_peer_212100_ipv6_1 21979 imported, 1 filtered, 37 exported, 7186 preferred chix_peer_13335_ipv4_1 4767 imported, 9 filtered, 24 exported, 4765 preferred chix_peer_13335_ipv6_1 371 imported, 1 filtered, 37 exported, 369 preferred chix_peer_6939_ipv4_1 151787 imported, 27 filtered, 24 exported, 133943 preferred chix_peer_6939_ipv6_1 61191 imported, 6 filtered, 37 exported, 16223 preferred comix_peer_44596_ipv4_1 594 imported, 0 filtered, 25 exported, 10 preferred comix_peer_44596_ipv6_1 1147 imported, 0 filtered, 50 exported, 0 preferred comix_peer_8298_ipv4_1 23 imported, 0 filtered, 25 exported, 0 preferred comix_peer_8298_ipv6_1 34 imported, 0 filtered, 50 exported, 0 preferred fogixp_peer_47498_ipv4_1 3286 imported, 1 filtered, 27 exported, 3077 preferred fogixp_peer_47498_ipv6_1 1838 imported, 0 filtered, 39 exported, 1056 preferred freeix_ch_peer_51530_ipv4_1 355 imported, 0 filtered, 28 exported, 0 preferred freeix_ch_peer_51530_ipv6_1 143 imported, 0 filtered, 53 exported, 0 preferred Some observations:\nNobody is inhibited at FreeIX Switzerland. It stands to reason therefore, that it has the most exported prefixes: 28 for IPv4 and 53 for IPv6. Two members are inhibited at CHIX, which makes it have the lowest amount of exported prefixes: 24 for IPv4 and 27 for IPv6. All members at each exchange (group) will have the same amount of prefixes. I can confirm that at CHIX, all thre peers have the same amount of announced prefixes. Similarly, at CommunityIX, all peers have the same amount. If Antonios, Sam or Jurrian would add an outgoing announcement to AS50869 with an additional inhibit BGP community (eg (50869,3020,1) to inhibit country Switzerland), they could tweak these numbers. What\u0026rsquo;s next This all adds up. I\u0026rsquo;d like to test the waters with my friendly neighborhood canaries a little bit, to make sure that announcements are expected, and traffic flows where appropriate. In the mean time, I\u0026rsquo;ll chase the deployment of LSIX, FrysIX, SpeedIX and possibly a few others in Amsterdam. And of course FreeIX Greece in Thessaloniki. I\u0026rsquo;ll try to get the Milano VPP router deployed (it\u0026rsquo;s already installed and configured, but currently powered off) and connected to PCIX, MIX and a few others.\nHow can you help? If you\u0026rsquo;re willing to participate with a VPP router and connect it to either multiple local internet exchanges (like I\u0026rsquo;ve demonstrated in Zurich), or better yet, to one or more of the other existing routers, I would welcome your contribution. [Contact] me for details.\nA bit further down the pike, a connection from Amsterdam to Zurich, from Zurich to Milan and from Milan to Thessaloniki is on the horizon. If you are willing and able to donate some bandwidth (point to point VPWS, VLL, L2VPN) and your transport network is capable of at least 2026 bytes of inner payload, please also [reach out] as I\u0026rsquo;m sure many small network operators would be thrilled.\n","date":"2024-10-21","desc":" Introduction A few months ago, I wrote about [an idea] to help boost the value of small Internet Exchange Points (IXPs). 
When such an exchange doesn\u0026rsquo;t have many members, then the operational costs of connecting to it (cross connects, router ports, finding peers, etc) are not very favorable.\nClearly, the benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s) traffic that must be delivered via their upstream transit providers, thereby reducing the average per-bit delivery cost and as well reducing the end to end latency as seen by their users or customers. Furthermore, the increased number of paths available through the IXP improves routing efficiency and fault-tolerance, and at the same time it avoids traffic going the scenic route to a large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.\n","permalink":"https://ipng.ch/s/articles/2024/10/21/freeix-remote-part-2/","section":"articles","title":"FreeIX Remote - Part 2"},{"contents":"Introduction Last month, I picked up a project together with Neil McKee of [inMon], the care takers of [sFlow]: an industry standard technology for monitoring high speed switched networks. sFlow gives complete visibility into the use of networks enabling performance optimization, accounting/billing for usage, and defense against security threats.\nThe open source software dataplane [VPP] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [DPDK] and [RDMA]. A clever design choice in the so called Host sFlow Daemon [host-sflow], which allows for a small portion of code to grab the samples, for example in a merchant silicon ASIC or FPGA, but also in the VPP software dataplane, and then transmit these samples using a Linux kernel feature called [PSAMPLE]. This greatly reduces the complexity of code to be implemented in the forwarding path, while at the same time bringing consistency to the sFlow delivery pipeline by (re)using the hsflowd business logic for the more complex state keeping, packet marshalling and transmission from the Agent to a central Collector.\nLast month, Neil and I discussed the proof of concept [ref] and I described this in a [first article]. Then, we iterated on the VPP plugin, playing with a few different approaches to strike a balance between performance, code complexity, and agent features. This article describes our journey.\nVPP: an sFlow plugin There are three things Neil and I specifically take a look at:\nIf sFlow is not enabled on a given interface, there should not be a regression on other interfaces. If sFlow is enabled, but a packet is not sampled, the overhead should be as small as possible, targetting single digit CPU cycles per packet in overhead. If sFlow actually selects a packet for sampling, it should be moved out of the dataplane as quickly as possible, targetting double digit CPU cycles per sample. For all these validation and loadtests, I use a bare metal VPP machine which is receiving load from a T-Rex loadtester on eight TenGig ports. I have configured VPP and T-Rex as follows.\n1. RX Queue Placement\nIt\u0026rsquo;s important that the network card that is receiving the traffic, gets serviced by a worker thread on the same NUMA domain. 
Since my machine has two processors (and thus, two NUMA nodes), I will align the NIC with the correct processor, like so:\nset interface rx-placement TenGigabitEthernet3/0/0 queue 0 worker 0 set interface rx-placement TenGigabitEthernet3/0/1 queue 0 worker 2 set interface rx-placement TenGigabitEthernet3/0/2 queue 0 worker 4 set interface rx-placement TenGigabitEthernet3/0/3 queue 0 worker 6 set interface rx-placement TenGigabitEthernet130/0/0 queue 0 worker 1 set interface rx-placement TenGigabitEthernet130/0/1 queue 0 worker 3 set interface rx-placement TenGigabitEthernet130/0/2 queue 0 worker 5 set interface rx-placement TenGigabitEthernet130/0/3 queue 0 worker 7 2. L3 IPv4/MPLS interfaces\nI will take two pairs of interfaces, one on NUMA0, and the other on NUMA1, so that I can make a comparison with L3 IPv4 or MPLS running without sFlow (these are TenGig3/0/, which I will call the baseline pairs) and two which are running with sFlow (these are TenGig130/0/, which I\u0026rsquo;ll call the experiment pairs).\ncomment { L3: IPv4 interfaces } set int state TenGigabitEthernet3/0/0 up set int state TenGigabitEthernet3/0/1 up set int state TenGigabitEthernet130/0/0 up set int state TenGigabitEthernet130/0/1 up set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31 set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31 set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31 set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31 ip route add 16.0.0.0/24 via 100.64.0.0 ip route add 48.0.0.0/24 via 100.64.1.0 ip route add 16.0.2.0/24 via 100.64.4.0 ip route add 48.0.2.0/24 via 100.64.5.0 ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static Here, the only specific trick worth mentioning is the use of ip neighbor to pre-populate the L2 adjacency for the T-Rex loadtester. This way, VPP knows which MAC address to send traffic to, in case a packet has to be forwarded to 100.64.0.0 or 100.64.5.0. It avoids VPP from having to use ARP resolution.\nThe configuration for an MPLS label switching router LSR or also called P-Router is added:\ncomment { MPLS interfaces } mpls table add 0 set interface mpls TenGigabitEthernet3/0/0 enable set interface mpls TenGigabitEthernet3/0/1 enable set interface mpls TenGigabitEthernet130/0/0 enable set interface mpls TenGigabitEthernet130/0/1 enable mpls local-label add 16 eos via 100.64.1.0 TenGigabitEthernet3/0/1 out-labels 17 mpls local-label add 17 eos via 100.64.0.0 TenGigabitEthernet3/0/0 out-labels 16 mpls local-label add 20 eos via 100.64.5.0 TenGigabitEthernet130/0/1 out-labels 21 mpls local-label add 21 eos via 100.64.4.0 TenGigabitEthernet130/0/0 out-labels 20 3. L2 CrossConnect interfaces\nHere, I will also use NUMA0 as my baseline (sFlow disabled) pair, and an equivalent pair of TenGig interfaces on NUMA1 as my experiment (sFlow enabled) pair. 
This way, I can both make a comparison on the performance impact of enabling sFlow, but I can also assert if any regression occurs in the baseline pair if I enable a feature in the experiment pair, which should really never happen.\ncomment { L2 xconnected interfaces } set int state TenGigabitEthernet3/0/2 up set int state TenGigabitEthernet3/0/3 up set int state TenGigabitEthernet130/0/2 up set int state TenGigabitEthernet130/0/3 up set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3 set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2 set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3 set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2 4. T-Rex Configuration\nThe Cisco T-Rex loadtester is running on another machine in the same rack. Physically, it has eight ports which are connected to a LAB switch, a cool Mellanox SN2700 running Debian [ref]. From there, eight ports go to my VPP machine. The LAB switch just has VLANs with two ports in each: VLAN 100 takes T-Rex port0 and connects it to TenGig3/0/0, VLAN 101 takes port1 and connects it to TenGig3/0/1, and so on. In total, sixteen ports and eight VLANs are used.\nThe configuration for T-Rex then becomes:\n- version: 2 interfaces: [ \u0026#39;06:00.0\u0026#39;, \u0026#39;06:00.1\u0026#39;, \u0026#39;83:00.0\u0026#39;, \u0026#39;83:00.1\u0026#39;, \u0026#39;87:00.0\u0026#39;, \u0026#39;87:00.1\u0026#39;, \u0026#39;85:00.0\u0026#39;, \u0026#39;85:00.1\u0026#39; ] port_info: - src_mac: 00:1b:21:06:00:00 dest_mac: 9c:69:b4:61:a1:dc - src_mac: 00:1b:21:06:00:01 dest_mac: 9c:69:b4:61:a1:dd - src_mac: 00:1b:21:83:00:00 dest_mac: 00:1b:21:83:00:01 - src_mac: 00:1b:21:83:00:01 dest_mac: 00:1b:21:83:00:00 - src_mac: 00:1b:21:87:00:00 dest_mac: 9c:69:b4:61:75:d0 - src_mac: 00:1b:21:87:00:01 dest_mac: 9c:69:b4:61:75:d1 - src_mac: 9c:69:b4:85:00:00 dest_mac: 9c:69:b4:85:00:01 - src_mac: 9c:69:b4:85:00:01 dest_mac: 9c:69:b4:85:00:00 Do you see how the first pair sends from src_mac 00:1b:21:06:00:00? That\u0026rsquo;s the T-Rex side, and it encodes the PCI device 06:00.0 in the MAC address. It sends traffic to dest_mac 9c:69:b4:61:a1:dc, which is the MAC address of VPP\u0026rsquo;s TenGig3/0/0 interface. Looking back at the ip neighbor VPP config above, it becomes much easier to see who is sending traffic to whom.\nFor L2XC, the MAC addresses don\u0026rsquo;t matter. VPP will set the NIC in promiscuous mode which means it\u0026rsquo;ll accept any ethernet frame, not only those sent to the NIC\u0026rsquo;s own MAC address. Therefore, in L2XC modes (the second and fourth pair), I just use the MAC addresses from T-Rex. I find debugging connections and looking up FDB entries on the Mellanox switch much, much easier this way.\nWith all config in place, but with sFlow disabled, I run a quick bidirectional loadtest using 256b packets at line rate, which shows 79.83Gbps and 36.15Mpps. All ports are forwarding, with MPLS, IPv4, and L2XC. Neat!\nThe name of the game is now to do a loadtest that shows the packet throughput and CPU cycles spent for each of the plugin iterations, comparing their performance on ports with and without sFlow enabled. For each iteration, I will use exactly the same VPP configuration, I will generate unidirectional 4x14.88Mpps of traffic with T-Rex, and I will report on VPP\u0026rsquo;s performance in baseline and a somewhat unfavorable 1:100 sampling rate.\nReady? 
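A quick sanity check on that line-rate figure first, assuming the usual 20 bytes of per-frame overhead (preamble plus inter-frame gap) on top of each 256-byte frame:

```python
# Line-rate arithmetic for the 8x10G / 256-byte acceptance test, assuming 20 bytes
# of preamble + inter-frame gap per frame on the wire.
frame, overhead = 256, 20
pps_per_port = 10e9 / ((frame + overhead) * 8)
print(round(pps_per_port / 1e6, 2))      # ~4.53 Mpps per 10G port
print(round(8 * pps_per_port / 1e6, 1))  # ~36.2 Mpps across eight ports, matching ~36.15Mpps above
```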
Here I go!
v1: Workers send RPC to main TL/DR: 13 cycles/packet on passthrough, 4.68Mpps L2, 3.26Mpps L3, with severe regression in baseline
The first iteration goes all the way back to a proof of concept from last year. It's described in detail in my [first post]. The performance results are not stellar:
☢ When slamming a single sFlow enabled interface, all interfaces regress. When sending 8Mpps of IPv4 traffic through a baseline interface, that is an interface without sFlow enabled, only 5.2Mpps get through. This is considered a mortal sin in VPP-land. ✅ Passing through packets without sampling them costs about 13 CPU cycles, not bad. ❌ Sampling a packet, specifically at higher rates (say, 1:100 or worse, 1:10) completely destroys throughput. When sending 4x14.88Mpps of traffic, only one third makes it through. Here's the bloodbath as seen from T-Rex:
Debrief: When we talked through these issues, we sort of drew the conclusion that it would be much faster if, when a worker thread produces a sample, instead of sending an RPC to main and taking the spinlock, the worker appends the sample to a producer queue and moves on. This way, no locks are needed, and each worker thread will have its own producer queue.
Then, we can create a separate thread (or even a pool of threads), scheduling possibly on a different CPU (or in main), that runs a loop iterating over all sflow sample queues, consuming the samples and sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too many coming in.
v2: Workers send PSAMPLE directly TL/DR: 7.21Mpps IPv4 L3, 9.45Mpps L2XC, 87 cycles/packet, no impact on disabled interfaces
But before we do that, we have one curiosity itch to scratch - what if we sent the sample directly from the worker? With such a model, if it works, we will need no RPCs or sample queue at all. Of course, in this model any sample will have to be rewritten into a PSAMPLE packet and written via the netlink socket. It would be less complex, but not as efficient as it could be. One thing is pretty certain, though: it should be much faster than sending an RPC to the main thread.
After a short refactor, Neil commits [d278273], which adds compiler macros SFLOW_SEND_FROM_WORKER (v2) and SFLOW_SEND_VIA_MAIN (v1). When workers send directly, they will invoke sflow_send_sample_from_worker() instead of sending an RPC with vl_api_rpc_call_main_thread() as in the previous version.
The code currently uses clib_warning() to print stats from the dataplane, which is pretty expensive. We should be using the VPP logging framework, but for the time being, I add a few CPU counters so we can more accurately count the cumulative time spent for each part of the calls, see [6ca61d2]. I can now see these with vppctl show err instead.
When loadtesting this, the deadly sin of impacting performance of interfaces that did not have sFlow enabled is gone. The throughput is not great, though. Instead of showing screenshots of T-Rex, I can also take a look at the throughput as measured by VPP itself.
In its show runtime statistics, each worker thread shows both CPU cycles spent, as well as how many packets/sec it received and how many it transmitted:\npim@hvn6-lab:~$ export C=\u0026#34;v2-100\u0026#34;; vppctl clear run; vppctl clear err; sleep 30; \\ vppctl show run \u0026gt; $C-runtime.txt; vppctl show err \u0026gt; $C-err.txt pim@hvn6-lab:~$ grep \u0026#39;vector rates\u0026#39; v2-100-runtime.txt | grep -v \u0026#39;in 0\u0026#39; vector rates in 1.0909e7, out 1.0909e7, drop 0.0000e0, punt 0.0000e0 vector rates in 7.2078e6, out 7.2078e6, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0 vector rates in 9.4476e6, out 9.4476e6, drop 0.0000e0, punt 0.0000e0 pim@hvn6-lab:~$ grep \u0026#39;sflow\u0026#39; v2-100-runtime.txt Name State Calls Vectors Suspends Clocks Vectors/Call sflow active 844916 216298496 0 8.69e1 256.00 sflow active 1107466 283511296 0 8.26e1 256.00 pim@hvn6-lab:~$ grep -i sflow v2-100-err.txt 217929472 sflow sflow packets processed error 1614519 sflow sflow packets sampled error 2606893106 sflow CPU cycles in sent samples error 280697344 sflow sflow packets processed error 2078203 sflow sflow packets sampled error 1844674406 sflow CPU cycles in sent samples error At a glance, I can see in the first grep, the in and out vector (==packet) rates for each worker thread that is doing meaningful work (ie. has more than 0pps of input). Remember that I pinned the RX queues to worker threads, and this now pays dividend: worker thread 0 is servicing TenGig3/0/0 (as even worker thread numbers are on NUMA domain 0), worker thread 1 is servicing TenGig130/0/0. What\u0026rsquo;s cool about this, is it gives me an easy way to compare baseline L3 (10.9Mpps) with experiment L3 (7.21Mpps). Equally, L2XC comes in at 14.88Mpps in baseline and 9.45Mpps in experiment.\nLooking at the output of vppctl show error, I can learn another interesting detail. See how there are 1614519 sampled packets out of 217929472 processed packets (ie. a roughly 1:100 rate)? I added a CPU clock cycle counter that counts cummulative clocks spent once samples are taken. I can see that VPP spent 2606893106 CPU cycles sending these samples. That\u0026rsquo;s 1615 CPU cycles per sent sample, which is pretty terrible.\nDebrief: We both understand that assembling and send()ing the netlink messages from within the dataplane is a pretty bad idea. But it\u0026rsquo;s great to see that removing the use of RPCs immediately improves performance on non-enabled interfaces, and we learned what the cost is of sending those samples. An easy step forward from here is to create a producer/consumer queue, where the workers can just copy the packet into a queue or ring buffer, and have an external pthread consume from the queue/ring in another thread that won\u0026rsquo;t block the dataplane.\nv3: SVM FIFO from workers, dedicated PSAMPLE pthread TL/DR: 9.34Mpps L3, 13.51Mpps L2XC, 16.3 cycles/packet, but with corruption on the FIFO queue messages\nNeil checks in after committing [7a78e05] that he has introduced a macro SFLOW_SEND_FIFO which tries this new approach. There\u0026rsquo;s a pretty elaborate FIFO queue implementation in svm/fifo_segment.h. Neil uses this to create a segment called fifo-sflow-worker, to which the worker can write its samples in the dataplane node. 
A new thread called spt_process_samples can then call svm_fifo_dequeue() from all workers\u0026rsquo; queues and pump those into Netlink.\nThe overhead of copying the samples onto a VPP native svm_fifo seems to be two orders of magnitude lower than writing directly to Netlink, even though the svm_fifo library code has many bells and whistles that we don\u0026rsquo;t need. But, perhaps due to these bells and whistles, we may be holding it wrong, as invariably after a short while the Netlink writes return Message too long errors.\npim@hvn6-lab:~$ grep \u0026#39;vector rates\u0026#39; v3fifo-sflow-100-runtime.txt | grep -v \u0026#39;in 0\u0026#39; vector rates in 1.0783e7, out 1.0783e7, drop 0.0000e0, punt 0.0000e0 vector rates in 9.3499e6, out 9.3499e6, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4728e7, out 1.4728e7, drop 0.0000e0, punt 0.0000e0 vector rates in 1.3516e7, out 1.3516e7, drop 0.0000e0, punt 0.0000e0 pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-runtime.txt Name State Calls Vectors Suspends Clocks Vectors/Call sflow active 1096132 280609792 0 1.63e1 256.00 sflow active 1584577 405651712 0 1.46e1 256.00 pim@hvn6-lab:~$ grep -i sflow v3fifo-sflow-100-err.txt 280635904 sflow sflow packets processed error 2079194 sflow sflow packets sampled error 733447310 sflow CPU cycles in sent samples error 405689856 sflow sflow packets processed error 3004118 sflow sflow packets sampled error 1844674407 sflow CPU cycles in sent samples error Two things of note here. Firstly, the average clocks spent in the sFlow node have gone down from 86 CPU cycles/packet to 16.3 CPU cycles. But even more importantly, the amount of time spent after the sample is taken is hugely reduced, from 1600+ cycles in v2 to a much more favorable 352 cycles in this version. Also, any risk of Netlink writes failing has been eliminated, because that\u0026rsquo;s now offloaded to a different thread entirely.\nDebrief: It\u0026rsquo;s not great that we created a new linux pthread for the consumer of the samples. VPP has an elaborate thread management system, and collaborative multitasking in its threading model, which adds introspection like clock counters, names, show runtime, show threads and so on. I can\u0026rsquo;t help but wonder: wouldn\u0026rsquo;t we just be able to move the spt_process_samples() thread into a VPP process node instead?\nv3bis: SVM FIFO, PSAMPLE process in Main TL/DR: 9.68Mpps L3, 14.10Mpps L2XC, 14.2 cycles/packet, still with corrupted FIFO queue messages\nNeil agrees that there\u0026rsquo;s no good reason to keep this out of main, and conjures up [df2dab8d] which rewrites the thread to an sflow_process_samples() function, using VLIB_REGISTER_NODE to add it to VPP in an idiomatic way. As a really nice benefit, we can now count how many CPU cycles are spent, in main, each time this process wakes up and does some work. It\u0026rsquo;s a widely used pattern in VPP.\nBecause of the FIFO queue message corruption, Netlink message are failing to send at an alarming rate, which is causing lots of clib_warning() messages to be spewed on console. 
I replace those with a counter of Failed Netlink messages instead, and commit refactor [6ba4715].\npim@hvn6-lab:~$ grep \u0026#39;vector rates\u0026#39; v3bis-100-runtime.txt | grep -v \u0026#39;in 0\u0026#39; vector rates in 1.0976e7, out 1.0976e7, drop 0.0000e0, punt 0.0000e0 vector rates in 9.6743e6, out 9.6743e6, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4866e7, out 1.4866e7, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4052e7, out 1.4052e7, drop 0.0000e0, punt 0.0000e0 pim@hvn6-lab:~$ grep sflow v3bis-100-runtime.txt Name State Calls Vectors Suspends Clocks Vectors/Call sflow-process-samples any wait 0 0 28052 4.66e4 0.00 sflow active 1134102 290330112 0 1.42e1 256.00 sflow active 1647240 421693440 0 1.32e1 256.00 pim@hvn6-lab:~$ grep sflow v3bis-100-err.txt 77945 sflow sflow PSAMPLE sent error 863 sflow sflow PSAMPLE send failed error 290376960 sflow sflow packets processed error 2151184 sflow sflow packets sampled error 421761024 sflow sflow packets processed error 3119625 sflow sflow packets sampled error With this iteration, I make a few observations. Firstly, the sflow-process-samples node shows up and informs me that, when handling the samples from the worker FIFO queues, the process is using 4660 CPU cycles. Secondly, the replacement of clib_warnign() with the sflow PSAMPLE send failed counter reduced time from 16.3 to 14.2 cycles on average in the dataplane. Nice.\nDebrief: A sad conclusion: of the 5.2M samples taken, only 77k make it through to Netlink. All these send failures and corrupt packets are really messing things up. So while the provided FIFO implementation in svm/fifo_segment.h is idiomatic, it is also much more complex than we thought, and we\u0026rsquo;re fearing that it may not be safe to read from another thread.\nv4: Custom lockless FIFO, PSAMPLE process in Main TL/DR: 9.56Mpps L3, 13.69Mpps L2XC, 15.6 cycles/packet, corruption fixed!\nAfter reading around a bit in DPDK\u0026rsquo;s [kni_fifo], Neil produces a gem of a commit in [42bbb64], where he introduces a tiny multiple-writer, single-consumer FIFO with two simple functions: sflow_fifo_enqueue() to be called in the workers, and sflow_fifo_dequeue() to be called in the main thread\u0026rsquo;s sflow-process-samples process. He then makes this thread-safe by doing what I consider black magic, in commit [dd8af17], which makes use of clib_atomic_load_acq_n() and clib_atomic_store_rel_n() macros from VPP\u0026rsquo;s vppinfra/atomics.h.\nWhat I really like about this change is that it introduces a FIFO implementation in about twenty lines of code, which means the sampling code path in the dataplane becomes really easy to follow, and will be even faster than it was before. 
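That twenty-line FIFO is small enough to sketch from its description. Below is a rough behavioural model in Python of a fixed-depth, drop-on-full sample queue with one worker writing and the main thread's sflow-process-samples draining it. The real code does this lock-free in C with clib_atomic_load_acq_n() and clib_atomic_store_rel_n() on the head and tail counters, which Python cannot reproduce, so treat this purely as an illustration of the enqueue/dequeue behaviour; the depth of 4 is what drives the drop accounting in the next loadtest.

```python
class SflowFifo:
    """Behavioural model of a tiny fixed-depth sample FIFO: a worker thread enqueues,
    the main thread dequeues. The drop-on-full policy is what protects main from
    being overwhelmed at aggressive sampling rates."""

    def __init__(self, depth=4):
        self.depth = depth
        self.slots = [None] * depth
        self.head = 0   # advanced by the consumer (dequeue side)
        self.tail = 0   # advanced by the producer (enqueue side)

    def enqueue(self, sample) -> bool:
        if self.tail - self.head >= self.depth:
            return False                        # full: the caller counts a dropped sample
        self.slots[self.tail % self.depth] = sample
        self.tail += 1                          # real code: atomic store-release
        return True

    def dequeue(self):
        if self.head == self.tail:              # real code: atomic load-acquire of tail
            return None                         # empty
        sample = self.slots[self.head % self.depth]
        self.head += 1
        return sample

# Worker side: try to enqueue, count drops instead of ever blocking the dataplane.
fifo = SflowFifo()
dropped = sum(0 if fifo.enqueue(f"sample-{i}") else 1 for i in range(6))
print(dropped)                                  # 2 of 6 samples dropped, the FIFO holds 4

# Main side (sflow-process-samples): drain whatever is there and ship it to PSAMPLE.
while (s := fifo.dequeue()) is not None:
    print("send to netlink:", s)
```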
I take it out for a loadtest:\npim@hvn6-lab:~$ grep \u0026#39;vector rates\u0026#39; v4-100-runtime.txt | grep -v \u0026#39;in 0\u0026#39; vector rates in 1.0958e7, out 1.0958e7, drop 0.0000e0, punt 0.0000e0 vector rates in 9.5633e6, out 9.5633e6, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0 vector rates in 1.3697e7, out 1.3697e7, drop 0.0000e0, punt 0.0000e0 pim@hvn6-lab:~$ grep sflow v4-100-runtime.txt Name State Calls Vectors Suspends Clocks Vectors/Call sflow-process-samples any wait 0 0 17767 1.52e6 0.00 sflow active 1121156 287015936 0 1.56e1 256.00 sflow active 1605772 411077632 0 1.53e1 256.00 pim@hvn6-lab:~$ grep sflow v4-100-err.txt 3553600 sflow sflow PSAMPLE sent error 287101184 sflow sflow packets processed error 2127024 sflow sflow packets sampled error 350224 sflow sflow packets dropped error 411199744 sflow sflow packets processed error 3043693 sflow sflow packets sampled error 1266893 sflow sflow packets dropped error This is starting to be a very nice implementation! With this iteration of the plugin, all the corruption is gone, there is a slight regression (because we\u0026rsquo;re now actually sending the messages). With the v3bis variant, only a tiny fraction of the samples made it through to netlink. With this v4 variant, I can see 2127024 + 3043693 packets sampled, but due to a carefully chosen FIFO depth of 4, the workers will drop samples so as not to overload the main process that is trying to write them out. At this unnatural rate of 1:100, I can see that of the 2127024 samples taken, 350224 are prematurely dropped (because the FIFO queue is full). This is a perfect defense in depth!\nDoing the math, both workers can enqueue 1776800 samples in 30 seconds, which is 59k/s per interface. I can also see that the second interface, which is doing L2XC and hits a much larger packets/sec throughput, is dropping more samples because it receives an equal amount of time from main reading samples from its queue. In other words: in an overload scenario, one interface cannot crowd out another. Slick.\nFinally, completing my math, each worker has enqueued 1776800 samples to their FIFOs, and I see that main has dequeued exactly 2x1776800 = 3553600 samples, all successfully written to Netlink, so the sflow PSAMPLE send failed counter remains zero.\nDebrief: In the mean time, Neil has been working on the host-sflow daemon changes to pick up these netlink messages. There\u0026rsquo;s also a bit of work to do with retrieving the packet and byte counters of the VPP interfaces, so he is creating a module in host-sflow that can consume some messages from VPP. He will call this mod_vpp, and he mails a screenshot of his work in progress. I\u0026rsquo;ll discuss the end-to-end changes with hsflowd in a followup article, and focus my efforts here on documenting the VPP parts only. But, as a teaser, here\u0026rsquo;s a screenshot of a validated sflow-tool output of a VPP instance using our sFlow plugin and his pending host-sflow changes to integrate the rest of the business logic outside of the VPP dataplane, where it\u0026rsquo;s arguably expensive to make mistakes.\nNeil admits to an itch that he has been meaning to scratch all this time. In VPP\u0026rsquo;s plugins/sflow/node.c, we insert the node between device-input and ethernet-input. Here, really most of the time the plugin is just shoveling the ethernet packets through to ethernet-input. 
To make use of some CPU instruction cache affinity, the loop that does this shovelling can do it one packet at a time, two packets at a time, or even four packets at a time. Although the code is super repetitive and somewhat ugly, it does actually speed up processing in terms of CPU cycles spent per packet, if you shovel four of them at a time.\nv5: Quad Bucket Brigade in worker TL/DR: 9.68Mpps L3, 14.0Mpps L2XC, 11 CPU cycles/packet, 1.28e5 CPU cycles in main\nNeil calls this the Quad Bucket Brigade, and one last finishing touch is to move from his default 2-packet to a 4-packet shoveling. In commit [285d8a0], he extends a common pattern in VPP dataplane nodes, each time the node iterates, it\u0026rsquo;ll pre-fetch now up to eight packets (p0-p7) if the vector is long enough, and handle them four at a time (b0-b3). He also adds a few compiler hints with branch prediction: almost no packets will have a trace enabled, so he can use PREDICT_FALSE() macros to allow the compiler to further optimize the code.\nI find reading the dataplane code, that it is incredibly ugly. But it\u0026rsquo;s the price to pay for ultra fast throughput. But how do we see the effect? My low-tech proposal is to enable sampling at a very high rate, say 1:10'000'000, so that the code path that grabs and enqueues the sample into the FIFO is almost never called. Then, what\u0026rsquo;s left for the sFlow dataplane node, really is to shovel the packets from device-input into ethernet-input.\nTo measure the relative improvement, I do one test with, and one without commit [285d8a09].\npim@hvn6-lab:~$ grep \u0026#39;vector rates\u0026#39; v5-10M-runtime.txt | grep -v \u0026#39;in 0\u0026#39; vector rates in 1.0981e7, out 1.0981e7, drop 0.0000e0, punt 0.0000e0 vector rates in 9.8806e6, out 9.8806e6, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4328e7, out 1.4328e7, drop 0.0000e0, punt 0.0000e0 pim@hvn6-lab:~$ grep sflow v5-10M-runtime.txt Name State Calls Vectors Suspends Clocks Vectors/Call sflow-process-samples any wait 0 0 28467 9.36e3 0.00 sflow active 1158325 296531200 0 1.09e1 256.00 sflow active 1679742 430013952 0 1.11e1 256.00 pim@hvn6-lab:~$ grep \u0026#39;vector rates\u0026#39; v5-noquadbrigade-10M-runtime.txt | grep -v in\\ 0 vector rates in 1.0959e7, out 1.0959e7, drop 0.0000e0, punt 0.0000e0 vector rates in 9.7046e6, out 9.7046e6, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4849e7, out 1.4849e7, drop 0.0000e0, punt 0.0000e0 vector rates in 1.4008e7, out 1.4008e7, drop 0.0000e0, punt 0.0000e0 pim@hvn6-lab:~$ grep sflow v5-noquadbrigade-10M-runtime.txt Name State Calls Vectors Suspends Clocks Vectors/Call sflow-process-samples any wait 0 0 28462 9.57e3 0.00 sflow active 1137571 291218176 0 1.26e1 256.00 sflow active 1641991 420349696 0 1.20e1 256.00 Would you look at that, this optimization actually works as advertised! There is a meaningful progression from v5-noquadbrigade (9.70Mpps L3, 14.00Mpps L2XC) to v5 (9.88Mpps L3, 14.32Mpps L2XC). So at the expense of adding 63 lines of code, there is a 2.8% increase in throughput. Quad-Bucket-Brigade, yaay!\nI\u0026rsquo;ll leave you with a beautiful screenshot of the current code at HEAD, as it is sampling 1:100 packets (!) on four interfaces, while forwarding 8x10G of 256 byte packets at line rate. 
You\u0026rsquo;ll recall at the beginning of this article I did an acceptance loadtest with sFlow disabled, but this is the exact same result with sFlow enabled:\nThis picture says it all: 79.98 Gbps in, 79.98 Gbps out; 36.22Mpps in, 36.22Mpps out. Also: 176k samples/sec taken from the dataplane, with correct rate limiting due to a per-worker FIFO depth limit, yielding 25k samples/sec sent to Netlink.\nWhat\u0026rsquo;s Next Checking in on the three main things we wanted to ensure with the plugin:\n✅ If sFlow is not enabled on a given interface, there is no regression on other interfaces. ✅ If sFlow is enabled, copying packets costs 11 CPU cycles on average ✅ If sFlow takes a sample, it takes only marginally more CPU time to enqueue. No sampling gets 9.88Mpps of IPv4 and 14.3Mpps of L2XC throughput, 1:1000 sampling reduces to 9.77Mpps of L3 and 14.05Mpps of L2XC throughput, and an overly harsh 1:100 reduces to 9.69Mpps and 13.97Mpps only. The hard part is finished, but we\u0026rsquo;re not entirely done yet. What\u0026rsquo;s left is to implement a set of packet and byte counters, and send this information along with possible Linux CP data (such as the TAP interface ID in the Linux side), and to add the module for VPP in hsflowd. I\u0026rsquo;ll write about that part in a followup article.\nNeil has introduced vpp-dev@ to this plugin, and so far there were no objections. But he has pointed folks to a github out of tree repo, and I may add a Gerrit instead so it becomes part of the ecosystem. Our work so far is captured in Gerrit [41680], which ends up being just over 2600 lines all-up. I do think we need to refactor a bit, add some VPP-specific tidbits like FEATURE.yaml and *.rst documentation, but this should be in reasonable shape.\nAcknowledgements I\u0026rsquo;d like to thank Neil McKee from inMon for his dedication to getting things right, including the finer details such as logging, error handling, API specifications, and documentation. He has been a true pleasure to work with and learn from.\n","date":"2024-10-06","desc":"Introduction Last month, I picked up a project together with Neil McKee of [inMon], the care takers of [sFlow]: an industry standard technology for monitoring high speed switched networks. sFlow gives complete visibility into the use of networks enabling performance optimization, accounting/billing for usage, and defense against security threats.\nThe open source software dataplane [VPP] is a perfect match for sampling, as it forwards packets at very high rates using underlying libraries like [DPDK] and [RDMA]. A clever design choice in the so called Host sFlow Daemon [host-sflow], which allows for a small portion of code to grab the samples, for example in a merchant silicon ASIC or FPGA, but also in the VPP software dataplane, and then transmit these samples using a Linux kernel feature called [PSAMPLE]. This greatly reduces the complexity of code to be implemented in the forwarding path, while at the same time bringing consistency to the sFlow delivery pipeline by (re)using the hsflowd business logic for the more complex state keeping, packet marshalling and transmission from the Agent to a central Collector.\n","permalink":"https://ipng.ch/s/articles/2024/10/06/vpp-with-sflow-part-2/","section":"articles","title":"VPP with sFlow - Part 2"},{"contents":"Introduction In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called Ciprian reached out to me after seeing my [DENOG #14] presentation. 
He was interested to learn about IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called [flowprobe] which is able to emit IPFIX records. Unfortunately I never really got it to work well in my tests, as either the records were corrupted, sub-interfaces didn't work, or the plugin would just crash the dataplane entirely. In the meantime, the folks at [Netgate] submitted quite a few fixes to flowprobe, but it remains a computationally expensive operation. Wouldn't copying one in a thousand or ten thousand packet headers with flow sampling be just as good?
In the months that followed, I discussed the feature with the incredible folks at [inMon], the original designers and maintainers of the sFlow protocol and toolkit. Neil from inMon wrote a prototype and put it on [GitHub], but for lack of time I didn't manage to get it to work, which was largely my fault by the way.
However, I have a bit of time on my hands in September and October, and just a few weeks ago, my buddy Pavel from [FastNetMon] pinged that very dormant thread about sFlow being a potentially useful tool for anti-DDoS protection using VPP. And I very much agree!
sFlow: Protocol Maintenance of the protocol is performed by the [sFlow.org] consortium, the authoritative source of the sFlow protocol specifications. The current version of sFlow is v5.
sFlow, short for sampled Flow, works at the ethernet layer of the stack, where it inspects one in N datagrams (typically 1:1000 or 1:10000) going through the physical network interfaces of a device. On the device, an sFlow Agent does the sampling. For each sample the Agent takes, the first M bytes (typically 128) are copied into an sFlow Datagram. Sampling metadata is added, such as the ingress (or egress) interface and sampling process parameters. The Agent can then optionally add forwarding information (such as router source- and destination prefix, MPLS LSP information, BGP communities, and what-not). Finally, the Agent will periodically read the octet and packet counters of the physical network interface(s). Ultimately, the Agent will send the samples and additional information over the network as a UDP datagram, to an sFlow Collector for further processing.
sFlow has been specifically designed to take advantage of the statistical properties of packet sampling and can be modeled using statistical sampling theory. This means that the sFlow traffic monitoring system will always produce statistically quantifiable measurements. You can read more about it in Peter Phaal and Sonia Panchen's [paper], I certainly did and my head spun a little bit at the math :)
sFlow: Netlink PSAMPLE sFlow is meant to be a very lightweight operation for the sampling equipment. It can typically be done in hardware, but there also exist several software implementations. One very clever thing, I think, is decoupling the sampler from the rest of the Agent. The Linux kernel has a packet sampling API called [PSAMPLE], which allows producers to send samples to a certain group, and then allows consumers to subscribe to samples of a certain group. The PSAMPLE API uses [NetLink] under the covers. The cool thing, for me anyway, is that I have a little bit of experience with Netlink due to my work on VPP's [Linux Control Plane] plugin.
The idea here is that some sFlow Agent, notably a VPP plugin, will be taking periodic samples from the physical network interfaces, and producing Netlink messages.
Then, some other program, notably outside of VPP, can consume these messages and further handle them, creating UDP packets with sFlow samples and counters and other information, and sending them to an sFlow Collector somewhere else on the network.\nThere\u0026rsquo;s a handy utility called [psampletest] which can subscribe to these PSAMPLE netlink groups and retrieve the samples. The first time I used all of this stuff, I wasn\u0026rsquo;t aware of this utility and I kept on getting errors. It turns out, there\u0026rsquo;s a kernel module that needs to be loaded: modprobe psample and psampletest helpfully does that for you [ref], so just make sure the module is loaded and added to /etc/modules before you spend as many hours as I did pulling out hair.\nVPP: sFlow Plugin For the purposes of my initial testing, I\u0026rsquo;ll simply take a look at Neil\u0026rsquo;s prototype on [GitHub] and see what I learn in terms of functionality and performance.\nsFlow Plugin: Anatomy The design is purposefully minimal, to do all of the heavy lifting outside of the VPP dataplane. The plugin will create a new VPP graph node called sflow, which the operator can insert after device-input, in other words, if enabled, the plugin will get a copy of all packets that are read from an input provider, such as dpdk-input or rdma-input. The plugin\u0026rsquo;s job is to process the packet, and if it\u0026rsquo;s not selected for sampling, just move it onwards to the next node, typically ethernet-input. Almost all of the interesting action is in node.c\nThe kicker is, that one in N packets will be selected to sample, after which:\nthe ethernet header (*en) is extracted from the packet the input interface (hw_if_index) is extracted from the VPP buffer. Remember, sFlow works with physical network interfaces! if there are too many samples from this worker thread being worked on, it is discarded and an error counter is incremented. This protects the main thread from being slammed with samples if there are simply too many being fished out of the dataplane. Otherwise: a new sflow_sample_t is created, with all the sampling process metadata filled in the first 128 bytes of the packet are copied into the sample an RPC is dispatched to the main thread, which will send the sample to the PSAMPLE channel Both a debug CLI command and API call are added:\nsflow enable-disable \u0026lt;interface-name\u0026gt; [\u0026lt;sampling_N\u0026gt;]|[disable] Some observations:\nFirst off, the sampling_N in Neil\u0026rsquo;s demo is a global rather than per-interface setting. It would make sense to make this be per-interface, as routers typically have a mixture of 1G/10G and faster 100G network cards available. It was a surprise when I set one interface to 1:1000 and the other to 1:10000 and then saw the first interface change its sampling rate also. It\u0026rsquo;s a small thing, and will not be an issue to change.\nSecondly, sending the RPC to main uses vl_api_rpc_call_main_thread(), which requires a spinlock in src/vlibmemory/memclnt_api.c:649. I\u0026rsquo;m somewhat worried that when many samples are sent from many threads, there will be lock contention and performance will suffer.\nsFlow Plugin: Functional I boot up the [IPng Lab] and install a bunch of sFlow tools on it, make sure the psample kernel module is loaded. In this first test I\u0026rsquo;ll take a look at tablestakes. I compile VPP with the sFlow plugin, and enable that plugin in startup.conf on each of the four VPP routers. 
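Before looking at the lab results, here is roughly what that sampling path boils down to, as a behavioural Python sketch. The real thing is a VPP graph node written in C; the function name, the random selection, and the 'pending' list standing in for the RPC path to main are all my own illustration, not the plugin's code.

```python
import random

HEADER_BYTES = 128   # the prototype copies the first 128 bytes of a sampled packet

def sflow_node(packet, hw_if_index, sampling_n, pending, max_pending, counters):
    """Behavioural sketch of the sflow node: every packet continues to ethernet-input;
    roughly one in sampling_n is copied out as a sample, unless too many samples from
    this worker are already waiting to be handed to the main thread."""
    counters["processed"] += 1
    if random.randrange(sampling_n) == 0:
        if len(pending) >= max_pending:
            counters["dropped"] += 1          # protect main from a flood of samples
        else:
            counters["sampled"] += 1
            pending.append({
                "input_if": hw_if_index,      # VPP's interface index, not a Linux ifIndex
                "sampling_n": sampling_n,
                "header": bytes(packet[:HEADER_BYTES]),
            })
    return "ethernet-input"                   # next node, sampled or not

counters = {"processed": 0, "sampled": 0, "dropped": 0}
pending = []
for _ in range(100_000):
    sflow_node(b"\x00" * 1514, hw_if_index=1, sampling_n=1000,
               pending=pending, max_pending=64, counters=counters)
    pending.clear()                           # pretend main drains between packets
print(counters)                               # roughly 100 samples per 100k packets
```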
For reference, the Lab looks like this:\nWhat I\u0026rsquo;ll do is start an iperf3 server on vpp0-3 and then hit it from vpp0-0, to generate a few TCP traffic streams back and forth, which will be traversing vpp0-2 and vpp0-1, like so:\npim@vpp0-3:~ $ iperf3 -s -D pim@vpp0-0:~ $ iperf3 -c vpp0-3.lab.ipng.ch -t 86400 -P 10 -b 10M Configuring VPP for sFlow While this iperf3 is running, I\u0026rsquo;ll log on to vpp0-2 to take a closer look. The first thing I do, is turn on packet sampling on vpp0-2\u0026rsquo;s interface that points at vpp0-3, which is Gi10/0/1, and the interface that points at vpp0-0, which is Gi10/0/0. That\u0026rsquo;s easy enough, and I will use a sampling rate of 1:1000 as these interfaces are GigabitEthernet:\nroot@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/0 1000 root@vpp0-2:~# vppctl sflow enable-disable GigabitEthernet10/0/1 1000 root@vpp0-2:~# vppctl show run | egrep \u0026#39;(Name|sflow)\u0026#39; Name State Calls Vectors Suspends Clocks Vectors/Call sflow active 5656 24168 0 9.01e2 4.27 Nice! VPP inserted the sflow node between dpdk-input and ethernet-input where it can do its business. But is it sending data? To answer this question, I can first take a look at the psampletest tool:\nroot@vpp0-2:~# psampletest pstest: modprobe psample returned 0 pstest: netlink socket number = 1637 pstest: getFamily pstest: generic netlink CMD = 1 pstest: generic family name: psample pstest: generic family id: 32 pstest: psample attr type: 4 (nested=0) len: 8 pstest: psample attr type: 5 (nested=0) len: 8 pstest: psample attr type: 6 (nested=0) len: 24 pstest: psample multicast group id: 9 pstest: psample multicast group: config pstest: psample multicast group id: 10 pstest: psample multicast group: packets pstest: psample found group packets=10 pstest: joinGroup 10 pstest: received Netlink ACK pstest: joinGroup 10 pstest: set headers... pstest: serialize... pstest: print before sending... pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=7 out=9 n=1000 seq=1 pktlen=1514 hdrlen=31 pkt=0x558c08ba4958 q=3 depth=33333333 delay=123456 pstest: send... pstest: send_psample getuid=0 geteuid=0 pstest: sendmsg returned 140 pstest: free... pstest: start read loop... pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=1 out=0 n=1000 seq=600320 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=1 out=0 n=1000 seq=600321 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=1 out=0 n=1000 seq=600322 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=2 out=0 n=1000 seq=600423 pktlen=66 hdrlen=70 pkt=0x7ffdb0d5a1e8 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=1 out=0 n=1000 seq=600324 pktlen=2048 hdrlen=132 pkt=0x7ffe0e4776c8 q=0 depth=0 delay=0 I am amazed! The psampletest output shows a few packets, considering I\u0026rsquo;m asking iperf3 to push 100Mbit using 9000 byte jumboframes (which would be something like 1400 packets/second), I can expect two or three samples per second. I immediately notice a few things:\n1. Network Namespace: The Netlink sampling channel belongs to a network namespace. The VPP process is running in the default netns, so its PSAMPLE netlink messages will be in that namespace. Thus, the psampletest and other tools must also run in that namespace. 
I mention this because in Linux CP, often times the controlplane interfaces are created in a dedicated dataplane network namespace.\n2. pktlen and hdrlen: The pktlen is wrong, and this is a bug. In VPP, packets are put into buffers of size 2048, and if there is a jumboframe, that means multiple buffers are concatenated for the same packet. The packet length here ought to be 9000 in one direction. Looking at the in=2 packet with length 66, that looks like a legitimate ACK packet on the way back. But why is the hdrlen set to 70 there? I\u0026rsquo;m going to want to ask Neil about that.\n3. ingress and egress: The in=1 and one packet with in=2 represent the input hw_if_index which is the ifIndex that VPP assigns to its devices. And looking at show interfaces, indeed number 1 corresponds with GigabitEthernet10/0/0 and 2 is GigabitEthernet10/0/1, which checks out:\nroot@vpp0-2:~# vppctl show int Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count GigabitEthernet10/0/0 1 up 9000/0/0/0 rx packets 469552764 rx bytes 4218754400233 tx packets 133717230 tx bytes 8887341013 drops 6050 ip4 469321635 ip6 225164 GigabitEthernet10/0/1 2 up 9000/0/0/0 rx packets 133527636 rx bytes 8816920909 tx packets 469353481 tx bytes 4218736200819 drops 6060 ip4 133489925 ip6 29139 4. ifIndexes are orthogonal: These in=1 or in=2 ifIndex numbers are constructs of the VPP dataplane. Notably, VPP\u0026rsquo;s numbering of interface index is strictly orthogonal to Linux, and it\u0026rsquo;s not guaranteed that there even exists an interface in Linux for the PHY upon which the sampling is happening. Said differently, in=1 here is meant to reference VPP\u0026rsquo;s GigabitEthernet10/0/0 interface, but in Linux, ifIndex=1 is a completely different interface (lo) in the default network namespace. Similarly in=2 for VPP\u0026rsquo;s Gi10/0/1 interface corresponds to interface enp1s0 in Linux:\nroot@vpp0-2:~# ip link 1: lo: \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 2: enp1s0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc fq_codel state UP mode DEFAULT group default qlen 1000 link/ether 52:54:00:f0:01:20 brd ff:ff:ff:ff:ff:ff 5. Counters: sFlow periodically polls the interface counters for all interfaces. It will normally use /proc/net/ entries for that, but there are two problems with this:\nThere may not exist a Linux representation of the interface, for example if it\u0026rsquo;s only doing L2 bridging or cross connects in the VPP dataplane, and it does not have a Linux Control Plane interface, or linux-cp is not used at all.\nEven if it does exist and it\u0026rsquo;s the \u0026ldquo;correct\u0026rdquo; ifIndex in Linux, for example if the Linux Interface Pair\u0026rsquo;s tuntap host_vif_index index is used, even then the statistics counters in the Linux representation will only count packets and octets of punted packets, that is to say, the stuff that LinuxCP has decided need to go to the Linux kernel through the TUN/TAP device. Important to note that east-west traffic that goes through the dataplane, is never punted to Linux, and as such, the counters will be undershooting: only counting traffic to the router, not through the router.\nVPP sFlow: Performance Now that I\u0026rsquo;ve shown that Neil\u0026rsquo;s proof of concept works, I will take a better look at the performance of the plugin. 
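To close the loop on the functional test numbers: at roughly 1400 packets/second of iperf3 traffic and a 1:1000 sampling rate, only a couple of samples per second are expected, and an estimate of the underlying packet rate is recovered by scaling the sample count back up by N. A quick back-of-the-envelope check:

```python
pps = 1400                 # approximate iperf3 packet rate quoted above
sampling_n = 1000
print(pps / sampling_n)    # ~1.4 samples/s per direction, i.e. "two or three" across both interfaces

# Scaling back up: over a 60 second window roughly 84 samples would be seen, and
# samples * N / window recovers the original packet rate, as an estimate.
samples_seen = 84
print(samples_seen * sampling_n / 60)   # ~1400 pps estimated from the samples
```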
I\u0026rsquo;ve made a mental note that the plugin sends RPCs from worker threads to the main thread to marshall the PSAMPLE messages out. I\u0026rsquo;d like to see how expensive that is, in general. So, I pull boot two Dell R730 machines in IPng\u0026rsquo;s Lab and put them to work. The first machine will run Cisco\u0026rsquo;s T-Rex loadtester with 8x 10Gbps ports (4x dual Intel 58299), while the second (identical) machine will run VPP also ith 8x 10Gbps ports (2x Intel i710-DA4).\nI will test a bunch of things in parallel. First off, I\u0026rsquo;ll test L2 (xconnect) and L3 (IPv4 routing), and secondly I\u0026rsquo;ll test that with and without sFlow turned on. This gives me 8 ports to configure, and I\u0026rsquo;ll start with the L2 configuration, as follows:\nvpp# set int state TenGigabitEthernet3/0/2 up vpp# set int state TenGigabitEthernet3/0/3 up vpp# set int state TenGigabitEthernet130/0/2 up vpp# set int state TenGigabitEthernet130/0/3 up vpp# set int l2 xconnect TenGigabitEthernet3/0/2 TenGigabitEthernet3/0/3 vpp# set int l2 xconnect TenGigabitEthernet3/0/3 TenGigabitEthernet3/0/2 vpp# set int l2 xconnect TenGigabitEthernet130/0/2 TenGigabitEthernet130/0/3 vpp# set int l2 xconnect TenGigabitEthernet130/0/3 TenGigabitEthernet130/0/2 Then, the L3 configuration looks like this:\nvpp# lcp create TenGigabitEthernet3/0/0 host-if xe0-0 vpp# lcp create TenGigabitEthernet3/0/1 host-if xe0-1 vpp# lcp create TenGigabitEthernet130/0/0 host-if xe1-0 vpp# lcp create TenGigabitEthernet130/0/1 host-if xe1-1 vpp# set int state TenGigabitEthernet3/0/0 up vpp# set int state TenGigabitEthernet3/0/1 up vpp# set int state TenGigabitEthernet130/0/0 up vpp# set int state TenGigabitEthernet130/0/1 up vpp# set int ip address TenGigabitEthernet3/0/0 100.64.0.1/31 vpp# set int ip address TenGigabitEthernet3/0/1 100.64.1.1/31 vpp# set int ip address TenGigabitEthernet130/0/0 100.64.4.1/31 vpp# set int ip address TenGigabitEthernet130/0/1 100.64.5.1/31 vpp# ip route add 16.0.0.0/24 via 100.64.0.0 vpp# ip route add 48.0.0.0/24 via 100.64.1.0 vpp# ip route add 16.0.2.0/24 via 100.64.4.0 vpp# ip route add 48.0.2.0/24 via 100.64.5.0 vpp# ip neighbor TenGigabitEthernet3/0/0 100.64.0.0 00:1b:21:06:00:00 static vpp# ip neighbor TenGigabitEthernet3/0/1 100.64.1.0 00:1b:21:06:00:01 static vpp# ip neighbor TenGigabitEthernet130/0/0 100.64.4.0 00:1b:21:87:00:00 static vpp# ip neighbor TenGigabitEthernet130/0/1 100.64.5.0 00:1b:21:87:00:01 static And finally, the Cisco T-Rex configuration looks like this:\n- version: 2 interfaces: [ \u0026#39;06:00.0\u0026#39;, \u0026#39;06:00.1\u0026#39;, \u0026#39;83:00.0\u0026#39;, \u0026#39;83:00.1\u0026#39;, \u0026#39;87:00.0\u0026#39;, \u0026#39;87:00.1\u0026#39;, \u0026#39;85:00.0\u0026#39;, \u0026#39;85:00.1\u0026#39; ] port_info: - src_mac: 00:1b:21:06:00:00 dest_mac: 9c:69:b4:61:a1:dc - src_mac: 00:1b:21:06:00:01 dest_mac: 9c:69:b4:61:a1:dd - src_mac: 00:1b:21:83:00:00 dest_mac: 00:1b:21:83:00:01 - src_mac: 00:1b:21:83:00:01 dest_mac: 00:1b:21:83:00:00 - src_mac: 00:1b:21:87:00:00 dest_mac: 9c:69:b4:61:75:d0 - src_mac: 00:1b:21:87:00:01 dest_mac: 9c:69:b4:61:75:d1 - src_mac: 9c:69:b4:85:00:00 dest_mac: 9c:69:b4:85:00:01 - src_mac: 9c:69:b4:85:00:01 dest_mac: 9c:69:b4:85:00:00 A little note on the use of ip neighbor in VPP and specific dest_mac in T-Rex. 
In L2 mode, because the VPP interfaces will be in promiscuous mode and simply pass through any ethernet frame received on interface Te3/0/2 and copy it out on Te3/0/3 and vice-versa, there is no need to tinker with MAC addresses. But in L3 mode, the NIC will only accept ethernet frames addressed to its MAC address, so you can see that for the first port in T-Rex, I am setting dest_mac: 9c:69:b4:61:a1:dc which is the MAC address of Te3/0/0 on VPP. And then on the way out, if VPP wants to send traffic back to T-Rex, I\u0026rsquo;ll give it a static ARP entry with ip neighbor .. static.\nWith that said, I can start a baseline loadtest like so: T-Rex is sending 10Gbps out on all eight interfaces (four of which are L3 routing, and four of which are L2 xconnecting), using a packet size of 1514 bytes. This amounts to roughly 813Kpps per port, or a cool 6.51Mpps in total. And I can see, in this baseline configuration, the VPP router is happy to do the work.\nI then enable sFlow on the second set of four ports, using a 1:1000 sampling rate:\npim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 This should yield about 3'250 or so samples per second, and things look pretty great:\npim@hvn6-lab:~$ vppctl show err Count Node Reason Severity 5034508 sflow sflow packets processed error 4908 sflow sflow packets sampled error 5034508 sflow sflow packets processed error 5111 sflow sflow packets sampled error 5034516 l2-output L2 output packets error 5034516 l2-input L2 input packets error 5034404 sflow sflow packets processed error 4948 sflow sflow packets sampled error 5034404 l2-output L2 output packets error 5034404 l2-input L2 input packets error 5034404 sflow sflow packets processed error 4928 sflow sflow packets sampled error 5034404 l2-output L2 output packets error 5034404 l2-input L2 input packets error 5034516 l2-output L2 output packets error 5034516 l2-input L2 input packets error I can see that the sflow packets sampled is roughly 0.1% of the sflow packets processed which checks out. I can also see in psampletest a flurry of activity, so I\u0026rsquo;m happy:\npim@hvn6-lab:~$ sudo psampletest ... pstest: grp=1 in=9 out=0 n=1000 seq=63388 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=8 out=0 n=1000 seq=63389 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=11 out=0 n=1000 seq=63390 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=10 out=0 n=1000 seq=63391 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0 pstest: psample netlink (type=32) CMD = 0 pstest: grp=1 in=11 out=0 n=1000 seq=63392 pktlen=1510 hdrlen=132 pkt=0x7ffd9e786158 q=0 depth=0 delay=0 I confirm that all four in interfaces (8, 9, 10 and 11) are sending samples, and those indexes correctly correspond to the VPP dataplane\u0026rsquo;s sw_if_index for TenGig130/0/0 - 3. Sweet! On this machine, each TenGig network interface has its own dedicated VPP worker thread.
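A quick sanity check of the numbers in this section, using only figures quoted above (pure arithmetic, nothing VPP-specific):

```python
#!/usr/bin/env python3
"""Back-of-the-envelope check of the baseline loadtest and sampling numbers above."""

line_rate_bps = 10e9          # per port
frame_size    = 1514          # bytes on the wire, plus ~20 bytes of L1 framing
pps_per_port  = line_rate_bps / ((frame_size + 20) * 8)
print(f"per-port rate      : {pps_per_port/1e3:.0f} Kpps")   # ~815 Kpps, ballpark of the ~813 Kpps above
print(f"total, 8 ports     : {8 * pps_per_port/1e6:.2f} Mpps")  # ~6.52 Mpps

sampling_rate = 1000          # 1:1000
sflow_ports   = 4
print(f"expected samples/s : {sflow_ports * pps_per_port / sampling_rate:.0f}")  # ~3'250

# Ratio from `show err` on one of the sFlow-enabled interfaces:
processed, sampled = 5_034_508, 4_908
print(f"observed ratio     : 1 in {processed/sampled:.0f}")   # ~1 in 1026, i.e. ~0.1%
```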
Considering I turned on sFlow sampling on four interfaces, I should see the cost I\u0026rsquo;m paying for the feature:\npim@hvn6-lab:~$ vppctl show run | grep -e \u0026#39;(Name|sflow)\u0026#39; Name State Calls Vectors Suspends Clocks Vectors/Call sflow active 3908218 14350684 0 9.05e1 3.67 sflow active 3913266 14350680 0 1.11e2 3.67 sflow active 3910828 14350687 0 1.08e2 3.67 sflow active 3909274 14350692 0 5.66e1 3.67 Alright, so for the 999 packets that went through and the one packet that got sampled, on average VPP is spending between 90 and 111 CPU cycles per packet, and the loadtest looks squeaky clean on T-Rex.\nVPP sFlow: Cost of passthru I decide to take a look at two edge cases. What if there are no samples being taken at all, and the sflow node is merely passing through all packets to ethernet-input? To simulate this, I will set up a bizarrely high sampling rate, say one in ten million. I\u0026rsquo;ll also make the T-Rex loadtester use only four ports, in other words, a unidirectional loadtest, and I\u0026rsquo;ll make it go much faster by sending smaller packets, say 128 bytes:\ntui\u0026gt;start -f stl/ipng.py -p 0 2 4 6 -m 99% -t size=128 pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 1000 disable pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/1 1000 disable pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 1000 disable pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/3 1000 disable pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10000000 pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10000000 The loadtester is now sending 33.5Mpps or thereabouts (4x 8.37Mpps), and I can confirm that the sFlow plugin is not sampling many packets:\npim@hvn6-lab:~$ vppctl show err Count Node Reason Severity 59777084 sflow sflow packets processed error 6 sflow sflow packets sampled error 59777152 l2-output L2 output packets error 59777152 l2-input L2 input packets error 59777104 sflow sflow packets processed error 6 sflow sflow packets sampled error 59777104 l2-output L2 output packets error 59777104 l2-input L2 input packets error pim@hvn6-lab:~$ vppctl show run | grep -e \u0026#39;(Name|sflow)\u0026#39; Name State Calls Vectors Suspends Clocks Vectors/Call sflow active 8186642 369674664 0 1.35e1 45.16 sflow active 25173660 369674696 0 1.97e1 14.68 Two observations:\nOne of these is busier than the other. Without looking further, I can already predict that the top one (doing 45.16 vectors/call) is the L3 thread. Reasoning: the L3 code path through the dataplane is a lot more expensive than \u0026lsquo;merely\u0026rsquo; L2 XConnect. As such, the packets will spend more time, and therefore the iterations of the dpdk-input loop will be further apart in time. And because of that, it\u0026rsquo;ll end up consuming more packets on each subsequent iteration, in order to catch up. The L2 path on the other hand, is quicker and therefore will have less packets waiting on subsequent iterations of dpdk-input.\nThe sflow plugin spends between 13.5 and 19.7 CPU cycles shoveling the packets into ethernet-input without doing anything to them. That\u0026rsquo;s pretty low! 
And the L3 path is a little bit more efficient per packet, which is very likely because it gets to amortize its L1/L2 CPU instruction cache over 45 packets each time it runs, while the L2 path can only amortize its instruction cache over 15 or so packets each time it runs.\nI let the loadtest run overnight, and the proof is in the pudding: sFlow enabled but not sampling works just fine:\nVPP sFlow: Cost of sampling The other interesting case is to figure out how much CPU it takes to execute the code path with the actual sampling. This one turns out a bit trickier to measure. While leaving the previous loadtest running at 33.5Mpps, I disable sFlow and then re-enable it at an abnormally high ratio of 1:10 packets:\npim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 disable pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 disable pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/0 10 pim@hvn6-lab:~$ vppctl sflow enable-disable TenGigabitEthernet130/0/2 10 The T-Rex view immediately reveals that VPP is not doing very well, as the throughput went from 33.5Mpps all the way down to 7.5Mpps. Ouch! Looking at the dataplane:\npim@hvn6-lab:~$ vppctl show err | grep sflow 340502528 sflow sflow packets processed error 12254462 sflow sflow packets dropped error 22611461 sflow sflow packets sampled error 422527140 sflow sflow packets processed error 8533855 sflow sflow packets dropped error 34235952 sflow sflow packets sampled error Ha, this new safeguard popped up: remember all the way at the beginning, I explained how there\u0026rsquo;s a safety net in the sflow plugin that will pre-emptively drop the sample if the RPC channel towards the main thread is seeing too many outstanding RPCs? That\u0026rsquo;s happening right now, under the moniker sflow packets dropped, and it\u0026rsquo;s roughly half of the samples.\nMy first attempt is to back off the loadtester to roughly 1.5Mpps per port (so 6Mpps in total, under the current limit of 7.5Mpps), but I\u0026rsquo;m disappointed: the VPP instance is now returning 665Kpps per port only, which is horrible, and it\u0026rsquo;s still dropping samples.\nMy second attempt is to turn off all ports but the last pair (the L2XC port), which returns 930Kpps from the offered 1.5Mpps. VPP is clearly not having a good time here.\nFinally, as a validation, I turn off all ports but the first pair (the L3 port, without sFlow), and ramp up the traffic to 8Mpps. Success (unsurprising to me). I also ramp up the second pair (the L2XC port, without sFlow), and VPP forwards all 16Mpps and is happy again.\nOnce I turn on the third pair (the L3 port, with sFlow), even at 1Mpps, the whole situation regresses again: the first two ports go down from 8Mpps to 5.2Mpps each; the third (offending) port delivers 740Kpps out of 1Mpps. Clearly, there\u0026rsquo;s some work to do in high-load situations!\nReasoning about the bottleneck But how expensive is sending samples, really? To try to get at least some pseudo-scientific answer, I turn off all ports again, and ramp up the one port pair (L3 + sFlow at 1:10 ratio) to full line rate: that is 64 byte packets at 14.88Mpps:\ntui\u0026gt;stop tui\u0026gt;start -f stl/ipng.py -m 100% -p 4 -t size=64 VPP is now on the struggle bus and is returning 3.16Mpps or 21% of that.
But, I think it\u0026rsquo;ll give me some reasonable data to try to feel out where the bottleneck is.\nThread 2 vpp_wk_1 (lcore 3) Time 6.3, 10 sec internal node vector rate 256.00 loops/sec 27310.73 vector rates in 3.1607e6, out 3.1607e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet130/0/1-outp active 77906 19943936 0 5.79e0 256.00 TenGigabitEthernet130/0/1-tx active 77906 19943936 0 6.88e1 256.00 dpdk-input polling 77906 19943936 0 4.41e1 256.00 ethernet-input active 77906 19943936 0 2.21e1 256.00 ip4-input active 77906 19943936 0 2.05e1 256.00 ip4-load-balance active 77906 19943936 0 1.07e1 256.00 ip4-lookup active 77906 19943936 0 1.98e1 256.00 ip4-rewrite active 77906 19943936 0 1.97e1 256.00 sflow active 77906 19943936 0 6.14e1 256.00 pim@hvn6-lab:pim# vppctl show err | grep sflow 551357440 sflow sflow packets processed error 19829380 sflow sflow packets dropped error 36613544 sflow sflow packets sampled error OK, the sflow plugin saw 551M packets, selected 36.6M of them for sampling, but ultimately only sent RPCs to the main thread for 16.8M samples after having dropped 19.8M of them. There are three code paths, each one extending the other:\nSuper cheap: pass through. I already learned that it takes about X=13.5 CPU cycles to pass through a packet. Very cheap: select sample and construct the RPC, but toss it, costing Y CPU cycles. Expensive: select sample, and send the RPC. Z CPU cycles in worker, and another amount in main. Now I don\u0026rsquo;t know what Y is, but seeing as the selection only copies some data from the VPP buffer into a new sflow_sample_t, and it uses clib_memcpy_fast() for the sample header, I\u0026rsquo;m going to assume it\u0026rsquo;s not drastically more expensive than the super cheap case, so for simplicity I\u0026rsquo;ll guesstimate that it takes Y=20 CPU cycles.\nWith that guess out of the way, I can see what the sflow plugin is consuming for the third case:\nAvgClocks = (Total * X + Sampled * Y + RPCSent * Z) / Total 61.4 = ( 551357440 * 13.5 + 36613544 * 20 + (36613544-19829380) * Z ) / 551357440 61.4 = ( 7443325440 + 732270880 + 16784164 * Z ) / 551357440 33853346816 = 7443325440 + 732270880 + 16784164 * Z 25677750496 = 16784164 * Z Z = 1529 Good to know! I find spending O(1500) cycles to send the sample pretty reasonable. However, for a dataplane that is trying to do 10Mpps per core, and a core running 2.2GHz, there are really only 220 CPU cycles to spend end-to-end. Spending an order of magnitude more than that once every ten packets feels dangerous to me.\nHere\u0026rsquo;s where I start my conjecture. If I count the CPU cycles spent in the table above, I will see 273 CPU cycles spent on average per packet. The CPU in the VPP router is an E5-2696 v4 @ 2.20GHz, which means it should be able to do 2.2e9/273 = 8.06Mpps per thread, more than double what I observe (3.16Mpps)! But, for all the vector rates in (3.1607e6), it also managed to emit the packets back out (same number: 3.1607e6).\nSo why isn\u0026rsquo;t VPP getting more packets from DPDK? I poke around a bit and find an important clue:\npim@hvn6-lab:~$ vppctl show hard TenGigabitEthernet130/0/0 | grep rx\\ missed; \\ sleep 10; vppctl show hard TenGigabitEthernet130/0/0 | grep rx\\ missed rx missed 4065539464 rx missed 4182788310 In those ten seconds, VPP missed (4182788310-4065539464)/10 = 11.72Mpps. I already measured that it forwarded 3.16Mpps and you know what? 11.7 + 3.16 is precisely 14.88Mpps. All packets are accounted for!
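The same back-of-the-envelope, written out as a few lines of Python so it's easy to re-run with a different guess for Y (all constants are copied from the outputs above; Y=20 remains a guess):

```python
#!/usr/bin/env python3
"""Re-derive the estimates from this section; Y stays a guesstimate."""

total    = 551_357_440            # sflow packets processed
sampled  =  36_613_544            # sflow packets sampled
dropped  =  19_829_380            # sflow packets dropped before the RPC was sent
rpc_sent = sampled - dropped      # 16_784_164 RPCs actually sent to main

avg_clocks = 61.4                 # sflow node, per `show runtime`
X, Y = 13.5, 20.0                 # pass-through cost (measured), sample-and-toss cost (guess)

# avg_clocks = (total*X + sampled*Y + rpc_sent*Z) / total   ->   solve for Z
Z = (avg_clocks * total - total * X - sampled * Y) / rpc_sent
print(f"Z ~ {Z:.0f} cycles per sample RPC in the worker")           # ~1530

clock_hz = 2.2e9                  # E5-2696 v4 worker core
print(f"budget at 10 Mpps : {clock_hz / 10e6:.0f} cycles/packet")   # 220
print(f"273 cycles/packet : {clock_hz / 273 / 1e6:.2f} Mpps possible")  # ~8.06

# And the rx-miss accounting over the ten second interval:
missed_mpps = (4_182_788_310 - 4_065_539_464) / 10 / 1e6
print(f"rx missed {missed_mpps:.2f} Mpps + forwarded 3.16 Mpps ~ offered 14.88 Mpps")
```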
It\u0026rsquo;s just, DPDK never managed to read them from the hardware: sad-trombone.wav\nAs a validation, I turned off sFlow while keeping that one port at 14.88Mpps. Now, 10.8Mpps were delivered:\nThread 2 vpp_wk_1 (lcore 3) Time 14.7, 10 sec internal node vector rate 256.00 loops/sec 40622.64 vector rates in 1.0794e7, out 1.0794e7, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet130/0/1-outp active 620012 158723072 0 5.66e0 256.00 TenGigabitEthernet130/0/1-tx active 620012 158723072 0 7.01e1 256.00 dpdk-input polling 620012 158723072 0 4.39e1 256.00 ethernet-input active 620012 158723072 0 1.56e1 256.00 ip4-input-no-checksum active 620012 158723072 0 1.43e1 256.00 ip4-load-balance active 620012 158723072 0 1.11e1 256.00 ip4-lookup active 620012 158723072 0 2.00e1 256.00 ip4-rewrite active 620012 158723072 0 2.02e1 256.00 Total Clocks: 201 per packet; 2.2GHz/201 = 10.9Mpps, and I am observing 10.8Mpps. As [North of the Border] would say: \u0026ldquo;That\u0026rsquo;s not just good, it\u0026rsquo;s good enough!\u0026rdquo;\nFor completeness, I turned on all eight ports again, at line rate (8x14.88 = 119Mpps 🥰), and saw that about 29Mpps of that made it through. Interestingly, what was 3.16Mpps in the single-port line rate loadtest, went up slighty to 3.44Mpps now. What puzzles me even more, is that the non-sFlow worker threads are also impacted. I spent some time thinking about this and poking around, but I did not find a good explanation why port pair 0 (L3, no sFlow) and 1 (L2XC, no sFlow) would be impacted. Here\u0026rsquo;s a screenshot of VPP on the struggle bus:\nHypothesis: Due to the spinlock in vl_api_rpc_call_main_thread(), the worker CPU is pegged for a longer time, during which the dpdk-input PMD can\u0026rsquo;t run, so it misses out on these sweet sweet packets that the network card had dutifully received for it, resulting in the rx-miss situation. While VPP\u0026rsquo;s performance measurement shows 273 CPU cycles per packet and 3.16Mpps, this accounts only for 862M cycles, while the thread has 2200M cycles, leaving a whopping 60% of CPU cycles unused in the dataplane. I still don\u0026rsquo;t understand why other worker threads are impacted, though.\nWhat\u0026rsquo;s Next I\u0026rsquo;ll continue to work with the folks in the sFlow and VPP communities and iterate on the plugin and other sFlow Agent machinery. In an upcoming article, I hope to share more details on how to tie the VPP plugin in to the hsflowd host sflow daemon in a way that the interface indexes, counters and packet lengths are all correct. Of course, the main improvement that we can make is to allow for the system to work better under load, which will take some thinking.\nI should do a few more tests with a debug binary and profiling turned on. I quickly ran a perf over the VPP (release / optimized) binary running on the bench, but it merely said 80% of time was spent in libvlib rather than libvnet in the baseline (sFlow turned off).\nroot@hvn6-lab:/home/pim# perf record -p 1752441 sleep 10 root@hvn6-lab:/home/pim# perf report --stdio --sort=dso # Overhead Shared Object (sFlow) Overhead Shared Object (baseline) # ........ ...................... ........ ........................ 
# 79.02% libvlib.so.24.10 54.27% libvlib.so.24.10 12.82% libvnet.so.24.10 33.91% libvnet.so.24.10 3.77% dpdk_plugin.so 10.87% dpdk_plugin.so 3.21% [kernel.kallsyms] 0.81% [kernel.kallsyms] 0.29% sflow_plugin.so 0.09% ld-linux-x86-64.so.2 0.28% libvppinfra.so.24.10 0.03% libc.so.6 0.21% libc.so.6 0.01% libvppinfra.so.24.10 0.17% libvlibapi.so.24.10 0.00% libvlibmemory.so.24.10 0.15% libvlibmemory.so.24.10 0.07% ld-linux-x86-64.so.2 0.00% vpp 0.00% [vdso] 0.00% libsvm.so.24.10 Unfortunately, I\u0026rsquo;m not much of a profiler expert, being merely a network engineer :) so I may have to ask for help. Of course, if you\u0026rsquo;re reading this, you may also offer help! There\u0026rsquo;s lots of interesting work to do on this sflow plugin, with matching ifIndex for consumers like hsflowd, reading interface counters from the dataplane (or from the Prometheus Exporter), and most importantly, ensuring it works well, or fails gracefully, under stringent load.\nFrom the cray-cray ideas department, what if we:\nIn worker thread, produced the sample but instead of sending an RPC to main and taking the lock, append it to a producer sample queue and move on. This way, no locks are needed, and each worker thread will have its own producer queue.\nCreate a separate worker (or even pool of workers), running on possibly a different CPU (or in main), that runs a loop iterating on all sflow sample queues consuming the samples and sending them in batches to the PSAMPLE Netlink group, possibly dropping samples if there are too many coming in.\nI\u0026rsquo;m reminded that this pattern exists already \u0026ndash; async crypto workers create a crypto-dispatch node that acts as poller for inbound crypto, and it hands off the result back into the worker thread: lockless at the expense of some complexity!\nAcknowledgements The plugin I am testing here is a prototype written by Neil McKee of inMon. I also wanted to say thanks to Pavel Odintsov of FastNetMon and Ciprian Balaceanu for showing an interest in this plugin, and Peter Phaal for facilitating a get-together last year.\nWho\u0026rsquo;s up for making this thing a reality?!\n","date":"2024-09-08","desc":"Introduction In January of 2023, an uncomfortably long time ago at this point, an acquaintance of mine called Ciprian reached out to me after seeing my [DENOG #14] presentation. He was interested to learn about IPFIX and was asking if sFlow would be an option. At the time, there was a plugin in VPP called [flowprobe] which is able to emit IPFIX records. Unfotunately I never really got it to work well in my tests, as either the records were corrupted, sub-interfaces didn\u0026rsquo;t work, or the plugin would just crash the dataplane entirely. In the meantime, the folks at [Netgate] submitted quite a few fixes to flowprobe, but it remains an expensive operation computationally. Wouldn\u0026rsquo;t copying one in a thousand or ten thousand packet headers with flow sampling not be just as good?\n","permalink":"https://ipng.ch/s/articles/2024/09/08/vpp-with-sflow-part-1/","section":"articles","title":"VPP with sFlow - Part 1"},{"contents":"Introduction In the before-days, I had a very modest personal website running on [ipng.nl] and [ipng.ch]. 
Over the years I\u0026rsquo;ve had quite a few different designs, and although one of them was hosted (on Google Sites) for a brief moment, they were mostly very much web 1.0, \u0026ldquo;The 90s called, they wanted their website back!\u0026rdquo; style.\nThe site didn\u0026rsquo;t have much other than a little blurb on a few open source projects of mine, and a gallery hosted on PicasaWeb [which Google subsequently turned down], and a mostly empty Blogger page. Would you imagine that I hand-typed the XHTML and CSS for this website, where the menu at the top (things like Home - Resume - History - Articles) would just have an HTML page which meticulously linked to the other HTML pages? It was the way of the world, in the 1990s.\nJekyll My buddy Michal suggested in May of 2021 that, if I was going to write all of the HTML skeleton by hand, I may as well switch to a static website generator. He\u0026rsquo;s fluent in Ruby, and suggested I take a look at [Jekyll], a static site generator. It takes text written in your favorite markup language and uses layouts to create a static website. You can tweak the site’s look and feel, URLs, the data displayed on the page, and more.\nI immediately fell in love! As an experiment, I moved [IPng.ch] to a new webserver, and kept my personal website on [IPng.nl]. I had always wanted to write a little bit more about technology, and since I was working on an interesting project [Linux Control Plane] in VPP, I thought it\u0026rsquo;d be nice to write a little bit about it, but certainly not while hand-crafting all of the HTML exoskeleton. I just wanted to write Markdown, and this is precisely the raison d\u0026rsquo;être of Jekyll!\nSince April 2021, I wrote in total 67 articles with Jekyll. Some of them proved to be quite popular, and (humblebrag) my website is widely considered one of the best resources for Vector Packet Processing, with my [VPP] series, [MPLS] series and a few others like the [Mastodon] series being amongst some of the top visited articles, with ~7.5-8K monthly unique visitors.\nThe catalyst There were two distinct events that led up to this. Firstly, I started a side project called [Free IX], which I also created in Jekyll. When I did that, I branched the [IPng.ch] site, but the build failed with Ruby errors. My buddy Antonios fixed those, and we were underway. Secondly, later on I attempted to upgrade the IPng website to the same fixes that Antonios had provided for Free-IX, and all hell broke loose (luckily, only in the staging environment). I spent several hours pulling my hair out re-assembling the dependencies, downgrading Jekyll, pulling new gems, downgrading Ruby. Finally, I got it to work again, only to see after my first production build, that the build immediately failed because the Docker container that does the build no longer liked what I had put in the Gemfile and _config.yml. It was something to do with the sass-embedded gem, and I spent waaaay too long fixing this incredibly frustrating breakage.\nHugo When I made my roadtrip from Zurich to the North Cape with my buddy Paul, we took extensive notes on our daily travels, and put them on a [2022roadtripnose] website. At the time, I was looking for a photo carousel for Jekyll, and while I found a few, none of them really worked in the way I wanted them to. I stumbled across [Hugo], which says on its website that it is one of the most popular open-source static site generators. With its amazing speed and flexibility, Hugo makes building websites fun again.
So I dabbled a bit and liked what I saw. I used the [notrack] theme from GitHub user @gevhaz, as they had made a really nice gallery widget (called a shortcode in Hugo).\nThe main reason for me to move to Hugo is that it is a standalone Go program, with no runtime or build time dependencies. The Hugo [GitHub] delivers ready to go build artifacts, tests amd releases regularly, and has a vibrant user community.\nMigrating I have only a few strong requirements if I am to move my website:\nThe site\u0026rsquo;s URL namespace MUST be identical (not just similar) to Jekyll. I do not want to lose my precious ranking on popular search engines. MUST be built in a CI/CD tool like Drone or Jenkins, and autodeploy Code MUST be hermetic, not pulling in external dependencies, neither in the build system (eg. Hugo itself) nor the website (eg. dependencies, themes, etc). Theme MUST support images, videos and SHOULD support asciinema. Theme SHOULD try to look very similar to the current Jekyll minima theme. Attempt 1: Auto import ❌ With that in mind, I notice that Hugo has a site importer, that can import a site from Jekyll! I run it, but it produces completely broken code, and Hugo doesn\u0026rsquo;t even want to compile the site. This turns out to be a theme issue, so I take Hugo\u0026rsquo;s advice and install the recommended theme. The site comes up, but is pretty screwed up. I now realize that the hugo import jekyll imports the markdown as-is, and only rewrites the frontmatter (the little blurb of YAML metadata at the top of each file). Two notable problems:\n1. images - I make liberal use of Markdown images, which in Jekyll can be decorated with CSS styling, like so:\n![Alt](/path/to/image){: style=\u0026#34;width:200px; float: right; margin: 1em;\u0026#34;} 2. post_url - Another widely used feature is cross-linking my own articles, using Jekyll template expansion, like so:\n.. Remember in my [[VPP Babel]({% post_url 2024-03-06-vpp-babel-1 %})] .. I do some grepping, and have 246 such Jekyll template expansions, and 272 images OK, that\u0026rsquo;s a dud.\nAttempt 2: Skeleton ✅ I decide to do this one step at a time. First, I create a completely new website hugo new site ipng.ch, download the notrack theme, and add only the front page index.md from the original IPng site. OK, that renders.\nNow comes a fun part: going over the notrack theme\u0026rsquo;s SCSS to adjust it to look and feel similar to the Jekyll minima theme. I change a bunch of stuff in the skeleton of the website:\nFirst, I take a look at the site media breakpoints, to feel correct for desktop screen, tablet screen and iPhone/Android screens. Then, I inspect the font family, size and H1/H2/H3\u0026hellip; magnifications, also scaling them with media size. Finally I notice the footer, which in notrack spans the whole width of the browser. I change it to be as wide as the header and main page.\nI go one by one on the site\u0026rsquo;s main pages and, just as on the Jekyll site, I make them into menu items at the top of the page. The [Services] page serves as my proof of concept, as it has both the image and the post_url pattern in Jekyll. It references six articles and has two images which float on the right side of the canvas. If I can figure out how to rewrite these to fit the Hugo variants of the same pattern, I should be home free.\nHugo: image The idiomatic way in notrack is an image shortcode. 
I hope you know where to find the curly braces on your keyboard - because geez, Hugo templating sure does like them!\n\u0026lt;figure class=\u0026#34;image-shortcode{{ with .Get \u0026#34;class\u0026#34; }} {{ . }}{{ end }} {{- with .Get \u0026#34;wide\u0026#34; }}{{- if eq . \u0026#34;true\u0026#34; }} wide{{ end -}}{{ end -}} {{- with .Get \u0026#34;frame\u0026#34; }}{{- if eq . \u0026#34;true\u0026#34; }} frame{{ end -}}{{ end -}} {{- with .Get \u0026#34;float\u0026#34; }} {{ . }}{{ end -}}\u0026#34; style=\u0026#34; {{- with .Get \u0026#34;width\u0026#34; }}width: {{ . }};{{ end -}} {{- with .Get \u0026#34;height\u0026#34; }}height: {{ . }};{{ end -}}\u0026#34;\u0026gt; {{- if .Get \u0026#34;link\u0026#34; -}} \u0026lt;a href=\u0026#34;{{ .Get \u0026#34;link\u0026#34; }}\u0026#34;{{ with .Get \u0026#34;target\u0026#34; }} target=\u0026#34;{{ . }}\u0026#34;{{ end -}} {{- with .Get \u0026#34;rel\u0026#34; }} rel=\u0026#34;{{ . }}\u0026#34;{{ end }}\u0026gt; {{- end }} \u0026lt;img src=\u0026#34;{{ .Get \u0026#34;src\u0026#34; | relURL }}\u0026#34; {{- if or (.Get \u0026#34;alt\u0026#34;) (.Get \u0026#34;caption\u0026#34;) }} alt=\u0026#34;{{ with .Get \u0026#34;alt\u0026#34; }}{{ replace . \u0026#34;\u0026#39;\u0026#34; \u0026#34;\u0026amp;#39;\u0026#34; }}{{ else -}} {{- .Get \u0026#34;caption\u0026#34; | markdownify| plainify }}{{ end }}\u0026#34; {{- end -}} /\u0026gt; \u0026lt;!-- Closing img tag --\u0026gt; {{- if .Get \u0026#34;link\u0026#34; }}\u0026lt;/a\u0026gt;{{ end -}} {{- if or (or (.Get \u0026#34;title\u0026#34;) (.Get \u0026#34;caption\u0026#34;)) (.Get \u0026#34;attr\u0026#34;) -}} \u0026lt;figcaption\u0026gt; {{ with (.Get \u0026#34;title\u0026#34;) -}} \u0026lt;h4\u0026gt;{{ . }}\u0026lt;/h4\u0026gt; {{- end -}} {{- if or (.Get \u0026#34;caption\u0026#34;) (.Get \u0026#34;attr\u0026#34;) -}}\u0026lt;p\u0026gt; {{- .Get \u0026#34;caption\u0026#34; | markdownify -}} {{- with .Get \u0026#34;attrlink\u0026#34; }} \u0026lt;a href=\u0026#34;{{ . }}\u0026#34;\u0026gt; {{- end -}} {{- .Get \u0026#34;attr\u0026#34; | markdownify -}} {{- if .Get \u0026#34;attrlink\u0026#34; }}\u0026lt;/a\u0026gt;{{ end }}\u0026lt;/p\u0026gt; {{- end }} \u0026lt;/figcaption\u0026gt; {{- end }} \u0026lt;/figure\u0026gt; From the top - Hugo creates a figure with a certain set of classes, the default image-shortcode but also classes for frame, wide and float to further decorate the image. Then it applies direct styling for width and height, optionally inserts a link (something I had missed out on in Jekyll), then inlines the \u0026lt;img\u0026gt; tag with an alt or (markdown based!) caption. It then reuses the caption or title or attr variables to assemble a \u0026lt;figcaption\u0026gt; block. 
I absolutely love it!\nI\u0026rsquo;ve rather consistently placed my images by themselves, on a single line, and they all have at least one style (be it width, or float), so it\u0026rsquo;s really straight forward to rewrite this with a little bit of Python:\ndef convert_image(line): p = re.compile(r\u0026#39;^!\\[(.+)\\]\\((.+)\\){:\\s*(.*)}\u0026#39;) m = p.match(line) if not m: return False alt=m.group(1) src=m.group(2) style=m.group(3) image_line = \u0026#34;{{\u0026lt; image \u0026#34; if sm := re.search(r\u0026#39;width:\\s*(\\d+px)\u0026#39;, style): image_line += f\u0026#39;width=\u0026#34;{sm.group(1)}\u0026#34; \u0026#39; if sm := re.search(r\u0026#39;float:\\s*(\\w+)\u0026#39;, style): image_line += f\u0026#39;float=\u0026#34;{sm.group(1)}\u0026#34; \u0026#39; image_line += f\u0026#39;src=\u0026#34;{src}\u0026#34; alt=\u0026#34;{alt}\u0026#34; \u0026gt;}}}}\u0026#39; print(image_line) return True with open(sys.argv[1], \u0026#34;r\u0026#34;, encoding=\u0026#34;utf-8\u0026#34;) as file_handle: for line in file_handle.readlines(): if not convert_image(line): print(line.rstrip()) Hugo: ref In Hugo, the idiomatic way to reference another document in the corpus is with the builtin ref shortcode, requiring a single argument: the path to a content document, with or without a file extension, with or without an anchor. Paths without a leading / are first resolved relative to the current page, then to the remainder of the site. This is super cool, because I can essentially reference any file by just its name!\nfor fn in $(find content/ -name \\*.md); do sed -i -r \u0026#39;s/{%[ ]?post_url (.*)[ ]?%}/{{\u0026lt; ref \\1 \u0026gt;}}/\u0026#39; $fn done And with that, the converted markdown from Jekyll renders perfectly in Hugo. Of course, other sites may use other templating commands, but for [IPng.ch], these were the only two special cases.\nHugo: URL redirects It is a hard requirement for me to keep the same URLs that I had from Jekyll. Luckily, this is a trivial matter for Hugo, as it supports URL aliases in the frontmatter. Jekyll will add a file extension to the article slugs, while Hugo uses only the directly and serves an index.html from it. Also, the default for Hugo is to put content in a different directory.\nThe first change I make is to the main hugo.toml config file:\n[permalinks] articles = \u0026#34;/s/articles/:year/:month/:day/:slug\u0026#34; That solves the main directory problem, as back then, I chose s/articles/ in Jekyll. 
Then, adding the URL redirect is a simple matter of looking up which filename Jekyll ultimately used, and adding a little frontmatter at the top of each article, for example my [VPP #1] article would get this addition:\n--- date: \u0026#34;2021-08-12T11:17:54Z\u0026#34; title: VPP Linux CP - Part1 aliases: - /s/articles/2021/08/12/vpp-1.html --- Hugo by default renders it in /s/articles/2021/08/12/vpp-linux-cp-part1/index.html but the addition of the alias makes it also generate a drop-in placeholder HTML page that offers a permanent redirect (cleverly setting noindex for web crawlers and offering the canonical link for the new place, aka a permanent redirect:\n$ curl https://ipng.ch/s/articles/2021/08/12/vpp-1.html \u0026lt;!DOCTYPE html\u0026gt; \u0026lt;html lang=\u0026#34;en-us\u0026#34;\u0026gt; \u0026lt;head\u0026gt; \u0026lt;title\u0026gt;https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/\u0026lt;/title\u0026gt; \u0026lt;link rel=\u0026#34;canonical\u0026#34; href=\u0026#34;https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/\u0026#34;\u0026gt; \u0026lt;meta name=\u0026#34;robots\u0026#34; content=\u0026#34;noindex\u0026#34;\u0026gt; \u0026lt;meta charset=\u0026#34;utf-8\u0026#34;\u0026gt; \u0026lt;meta http-equiv=\u0026#34;refresh\u0026#34; content=\u0026#34;0; url=https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/\u0026#34;\u0026gt; \u0026lt;/head\u0026gt; \u0026lt;/html\u0026gt; Hugo: Asciinema One thing that I always wanted to add is the ability to inline [Asciinema] screen recordings. First, I take a look at what is needed to serve Asciinema: One Javascript file, and one CSS file, followed by a named \u0026lt;div\u0026gt; which invokes the Javascript. Armed with that knowledge, I dive into the shortcode language a little bit:\n$ cat themes/hugo-theme-ipng/layouts/shortcodes/asciinema.html \u0026lt;div id=\u0026#39;{{ .Get \u0026#34;src\u0026#34; | replaceRE \u0026#34;[[:^alnum:]]\u0026#34; \u0026#34;\u0026#34; }}\u0026#39;\u0026gt;\u0026lt;/div\u0026gt; \u0026lt;script\u0026gt; AsciinemaPlayer.create(\u0026#34;{{ .Get \u0026#34;src\u0026#34; }}\u0026#34;, document.getElementById(\u0026#39;{{ .Get \u0026#34;src\u0026#34; | replaceRE \u0026#34;[[:^alnum:]]\u0026#34; \u0026#34;\u0026#34; }}\u0026#39;)); \u0026lt;/script\u0026gt; This file creates the id of the \u0026lt;div\u0026gt; by means of stripping all non-alphanumeric characters from the src argument of the shortcode. So if I were to create an {{\u0026lt; asciinema src='/casts/my.cast' \u0026gt;}}, the resulting DIV will be uniquely called castsmycast. This way, I can add multiple screencasts in the same document, which is dope.\nBut, as I now know, I need to load some CSS and JS so that the AsciinemaPlayer class becomes available. For this, I use a realtively new feature in Hugo, which allows for params to be set in the frontmatter, for example in the [VPP OSPF #2] article:\n--- date: \u0026#34;2024-06-22T09:17:54Z\u0026#34; title: VPP with loopback-only OSPFv3 - Part 2 aliases: - /s/articles/2024/06/22/vpp-ospf-2.html params: asciinema: true --- The presence of that params.asciinema can be used in any page, including the HTML skeleton of the theme, like so:\n$ cat themes/hugo-theme-ipng/layouts/partials/head.html \u0026lt;head\u0026gt; ... 
{{ if eq .Params.asciinema true -}} \u0026lt;link rel=\u0026#34;stylesheet\u0026#34; type=\u0026#34;text/css\u0026#34; href=\u0026#34;{{ \u0026#34;css/asciinema-player.css\u0026#34; | relURL }}\u0026#34; /\u0026gt; \u0026lt;script src=\u0026#34;{{ \u0026#34;js/asciinema-player.min.js\u0026#34; | relURL }}\u0026#34;\u0026gt;\u0026lt;/script\u0026gt; {{- end }} \u0026lt;/head\u0026gt; Now all that\u0026rsquo;s left for me to do is drop the two Asciinema player files in their respective theme directories, and for each article that wants to use an Asciinema, set the param and it\u0026rsquo;ll ship the CSS and Javascript to the browser. I think I\u0026rsquo;m going to have a good relationship with Hugo :)\nGitea: Large File Support One mistake I made with the old Jekyll based website, is that I checked in all of the images and binary files directly into Git. This bloats the repository and is otherwise completely unnecessary. For this new repository, I enable [Git LFS], which is available for OpenBSD (packages), Debian (apt) and MacOS (homebrew). Turning this on is very simple:\n$ brew install git-lfs $ cd ipng.ch $ git lfs install $ for i in gz png gif jpg jpeg tgz zip; do \\\\ git track \u0026#34;*.$i\u0026#34; \\\\ git lfs import --everything --include \u0026#34;*.$i\u0026#34; \\\\ done $ git push --force --all The force push rewrites the history of the repo to reference the binary blobs in LFS instead of directly in the repo. As a result, the size of the repository greatly shrinks, and handling it becomes easier once it grows. A really nice feature!\nGitea: CI/CD with Drone At IPng, I run a [Gitea] server, which is one of the coolest pieces of open source that I use on a daily basis. There\u0026rsquo;s a very clean integration of a continuous integration tool called [Drone] and these two tools are literally made for each other. Drone can be enabled for any Git repo in Gitea, and given the presence of a .drone.yml file, execute a set of steps upon repository events, called triggers. It can then run a sequence of steps, hermetically in a Docker container called a drone-runner, which first checks out the repository at the latest commit, and then does whatever I\u0026rsquo;d like with it. I\u0026rsquo;d like to build and distribute a Hugo website, please!\nAs it turns out, there is a [Drone Hugo] plugin available, but it seems to be very outdated. Luckily, this being open source and all, I can download the source on [GitHub], and in the Dockerfile, bump the Alpine version, the Go version and build the latest Hugo release, which is 0.130.1 at the moment. I really do need this version, because the params feature was introduced in 0.123 and the upstream package is still for 0.77 \u0026ndash; which is about four years old. Ouch!\nI build a docker image and upload it to my private repo at IPng which is hosted as well on Gitea, by the way. As I said, it really is a great piece of kit! 
In case anybody else would like to give it a whirl, ping me on Mastodon or e-mail and I\u0026rsquo;ll upload one to public Docker Hub as well.\nPutting it all together With Drone activated for this repo, and the Drone Hugo plugin built with a new version, I can submit the following file to the root directory of the ipng.ch repository:\n$ cat .drone.yml kind: pipeline name: default steps: - name: git-lfs image: alpine/git commands: - git lfs install - git lfs pull - name: build image: git.ipng.ch/ipng/drone-hugo:release-0.130.0 settings: hugo_version: 0.130.0 extended: true - name: rsync image: drillster/drone-rsync settings: user: drone key: from_secret: drone_sshkey hosts: - nginx0.chrma0.net.ipng.ch - nginx0.chplo0.net.ipng.ch - nginx0.nlams1.net.ipng.ch - nginx0.nlams2.net.ipng.ch port: 22 args: \u0026#39;-6u --delete-after\u0026#39; source: public/ target: /var/www/ipng.ch/ recursive: true secrets: [ drone_sshkey ] image_pull_secrets: - git_ipng_ch_docker The file is relatively self-explanatory. Before my first step runs, Drone already checks out the repo in the current working directory of the docker container. I then install package alpine/git and run the git lfs install and git lfs pull commands to resolve the LFS symlinks into actual files by pulling those objects that are referenced (and, notably, not all historical versions of any binary file ever added to the repo).\nThen, I run a step called build which invokes the Hugo Drone package that I created before.\nFinally, I run a step called rsync which uses package drillster/drone-rsync to rsync-over-ssh the files to the four NGINX servers running at IPng: two in Amsterdam, one in Geneva and one in Zurich.\nOne really cool feature is the use of so called Drone Secrets which are references to locked secrets such as the SSH key, and, notably, the Docker Repository credentials, because Gitea at IPng does not run a public docker repo. Using secrets is nifty, because it allows to safely check in the .drone.yml configuration file without leaking any specifics.\nNGINX and SSL Now that the website is automatically built and rsync\u0026rsquo;d to the webservers upon every git merge, all that\u0026rsquo;s left for me to do is arm the webservers with SSL certificates. I actually wrote a whole story about specifically that, as for *.ipng.ch and *.ipng.nl and a bunch of others, periodically there is a background task that retrieves multiple wildcard certificates with Let\u0026rsquo;s Encrypt, and distributes them to any server that needs them (like the NGINX cluster, or the Postfix cluster). I wrote about the [Frontends], the spiffy [DNS-01] certificate subsystem, and the internal network called [IPng Site Local] each in their own articles, so I won\u0026rsquo;t repeat that information here.\nThe Results The results are really cool, as I\u0026rsquo;ll demonstrate in this video. I can just submit and merge this change, and it\u0026rsquo;ll automatically kick off a build and push. Take a look at this video which was performed in real time as I pushed this very article live:\nThere should have been a video here but your browser does not seem to support it. ","date":"2024-08-12","desc":"Introduction In the before-days, I had a very modest personal website running on [ipng.nl] and [ipng.ch]. 
Over the years I\u0026rsquo;ve had quite a few different designs, and although one of them was hosted (on Google Sites) for a brief moment, they were mostly very much web 1.0, \u0026ldquo;The 90s called, they wanted their website back!\u0026rdquo; style.\nThe site didn\u0026rsquo;t have much other than a little blurb on a few open source projects of mine, and a gallery hosted on PicasaWeb [which Google subsequently turned down], and a mostly empty Blogger page. Would you imagine that I hand-typed the XHTML and CSS for this website, where the menu at the top (things like Home - Resume - History - Articles) would just have an HTML page which meticulously linked to the other HTML pages? It was the way of the world, in the 1990s.\n","permalink":"https://ipng.ch/s/articles/2024/08/12/case-study-from-jekyll-to-hugo/","section":"articles","title":"Case Study: From Jekyll to Hugo"},{"contents":"Introduction Last month, I took a good look at the Gowin R86S based on the Jasper Lake (N6005) CPU [ref], which is a really neat little 10G (and, if you fiddle with it a little bit, 25G!) router that runs off of USB-C power and can be rack mounted if you print a bracket. Check out my findings in this [article].\nDavid from Gowin reached out and asked me if I was willing to also take a look at their Alder Lake (N305) CPU, which comes in a 19\u0026quot; rack mountable chassis, running off of 110V/220V AC mains power, but also with a 2x25G ConnectX-4 network card. Why not! For critical readers: David sent me this machine, but made no attempt to influence this article.\nHardware Specs There are a few differences between this 19\u0026quot; model and the compact mini-pc R86S. The most obvious difference is the form factor. The R86S is super compact, not inherently rack mountable, although I 3D printed a bracket for it. Looking inside, the motherboard is mostly obscured by a large cooling block with fins that are flush with the top plate. There are 5 copper ports in the front: 2x Intel i226-V (these are 2.5Gbit) and 3x Intel i210 (these are 1Gbit), and one of them offers POE, which can be very handy to power a camera or wifi access point. A nice touch.\nThe Gowin server comes with an OCP v2.0 port, just like the R86S does. There\u0026rsquo;s a custom bracket with a ribbon cable to the motherboard, and in the bracket is housed a Mellanox ConnectX-4 LX 2x25Gbit network card.\nA look inside The machine comes with an Intel i3-N305 (Alder Lake) CPU running at a max clock of 3GHz and 4x8GB of LPDDR5 memory at 4800MT/s \u0026ndash; and considering the Alder Lake can make use of 4-channel memory, this thing should be plenty fast. The memory is soldered to the board, though, so there\u0026rsquo;s no option of expanding or changing the memory after buying the unit.\nUsing likwid-topology, I determine that the 8-core CPU has no hyperthreads, but just straight up 8 cores with 32kB of L1 cache, two times 2MB of L2 cache (one bank shared between cores 0-3, and another bank shared between cores 4-7), and 6MB of L3 cache shared between all 8 cores. This is again a step up from the Jasper Lake CPU, and should make VPP run a little bit faster.\nWhat I find a nice touch is that Gowin has shipped this board with a 128GB MMC flash disk, which appears in Linux as /dev/mmcblk0 and can be used to install an OS. However, there are also two NVME slots with M.2 2280, one M.SATA slot and two additional SATA slots with 4-pin power. On the side of the chassis is a clever bracket that holds three 2.5\u0026quot; SSDs in a staircase configuration.
That\u0026rsquo;s quite a lot of storage options, and given the CPU has some oompf, this little one could realistically be a NAS, although I\u0026rsquo;d prefer it to be a VPP router!\nThe copper RJ45 ports are all on the motherboard, and there\u0026rsquo;s an OCP breakout port that fits any OCP v2.0 network card. Gowin shipped it with a ConnectX-4 LX, but since I had a ConnectX-5 EN, I will take a look at performance with both cards. One critical observation, as with the Jasper Lake R86S, is that there are only 4 PCIe v3.0 lanes routed to the OCP, which means that the spiffy x8 network interfaces (both the Cx4 and the Cx5 I have here) will run at half speed. Bummer!\nThe power supply is a 100-240V switching PSU with about 150W of power available. When running idle, with one 1TB NVME drive, I measure 38.2W on the 220V side. When running VPP at full load, I measure 47.5W of total load. That\u0026rsquo;s totally respectable for a 2x 25G + 2x 2.5G + 3x 1G VPP router.\nI\u0026rsquo;ve added some pictures to a [Google Photos] album, if you\u0026rsquo;d like to take a look.\nVPP Loadtest: RDMA versus DPDK You (hopefully) came here to read about VPP stuff. For years now, I have been curious as to the performance and functional differences in VPP between using DPDK and the native RDMA driver support that Mellanox network cards offer. In this article, I\u0026rsquo;ll do four loadtests, with the stock Mellanox Cx4 that comes with the Gowin server, and with the Mellanox Cx5 card that I had bought for the R86S. I\u0026rsquo;ll take a look at the differences between DPDK on the one hand and RDMA on the other. This will yield, for me at least, a better understanding of the differences. Spoiler: there are not many!\nDPDK The Data Plane Development Kit (DPDK) is an open source software project managed by the Linux Foundation. It provides a set of data plane libraries and network interface controller polling-mode drivers for offloading ethernet packet processing from the operating system kernel to processes running in user space. This offloading achieves higher computing efficiency and higher packet throughput than is possible using the interrupt-driven processing provided in the kernel.\nYou can read more about it on [Wikipedia] or on the [DPDK Homepage]. VPP uses DPDK as one of the (more popular) drivers for network card interaction.\nDPDK: ConnectX-4 Lx This is the OCP network card that came with the Gowin server. It identifies in Linux as:\n0e:00.0 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] 0e:00.1 Ethernet controller: Mellanox Technologies MT27710 Family [ConnectX-4 Lx] Albeit with an important warning in dmesg, about the lack of PCIe lanes:\n[3.704174] pci 0000:0e:00.0: [15b3:1015] type 00 class 0x020000 [3.708154] pci 0000:0e:00.0: reg 0x10: [mem 0x60e2000000-0x60e3ffffff 64bit pref] [3.716221] pci 0000:0e:00.0: reg 0x30: [mem 0x80d00000-0x80dfffff pref] [3.724079] pci 0000:0e:00.0: Max Payload Size set to 256 (was 128, max 512) [3.732678] pci 0000:0e:00.0: PME# supported from D3cold [3.736296] pci 0000:0e:00.0: reg 0x1a4: [mem 0x60e4800000-0x60e48fffff 64bit pref] [3.756916] pci 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link) With PCIe v3.0\u0026rsquo;s 128b/130b encoding overhead, that means the card will have (128/130) * 32 = 31.508 Gbps available, and I\u0026rsquo;m actually not quite sure why the kernel claims 31.504G in the log message.
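That 31.5 Gb/s figure is easy to reproduce; a tiny sanity check of the PCIe arithmetic above:

```python
#!/usr/bin/env python3
"""PCIe gen3 x4 budget for the OCP slot, as discussed above."""

gt_per_s = 8.0          # PCIe v3.0: 8 GT/s per lane
lanes    = 4            # only x4 routed to the OCP bay
encoding = 128 / 130    # 128b/130b line coding overhead

usable_gbps = gt_per_s * lanes * encoding
print(f"usable: {usable_gbps:.3f} Gb/s")   # ~31.508, close to the 31.504 that dmesg reports
```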
Anyway, the card itself works just fine at this speed, and is immediately detected in DPDK while continuing to use the mlx5_core driver. This would be a bit different with Intel based cards, as there the driver has to be rebound to vfio_pci or uio_pci_generic. Here, the NIC itself remains visible (and usable!) in Linux, which is kind of neat.\nI do my standard set of eight loadtests: {unidirectional,bidirectional} x {1514b, 64b multiflow, 64b singleflow, MPLS}. This teaches me a lot about how the NIC uses flow hashing, and what it\u0026rsquo;s maximum performance is. Without further ado, here\u0026rsquo;s the results:\nLoadtest: Gowin CX4 DPDK L1 bits/sec Packets/sec % of Line 1514b-unidirectional 25.00 Gbps 2.04 Mpps 100.2 % 64b-unidirectional 7.43 Gbps 11.05 Mpps 29.7 % 64b-single-unidirectional 3.09 Gbps 4.59 Mpps 12.4 % 64b-mpls-unidirectional 7.34 Gbps 10.93 Mpps 29.4 % 1514b-bidirectional 22.63 Gbps 1.84 Mpps 45.2 % 64b-bidirectional 7.42 Gbps 11.04 Mpps 14.8 % 64b-single-bidirectional 5.33 Gbps 7.93 Mpps 10.7 % 64b-mpls-bidirectional 7.36 Gbps 10.96 Mpps 14.8 % Some observations:\nIn the large packet department, the NIC easily saturates the port speed in unidirectional, and saturates the PCI bus (x4) in bidirectional forwarding. I\u0026rsquo;m surprised that the bidirectional forwarding capacity is a bit lower (1.84Mpps versus 2.04Mpps). The NIC is using three queues, and the difference between single flow (which could only use one queue, and one CPU thread) is not exactly linear (4.59Mpps vs 11.05Mpps for 3 RX queues) The MPLS performance is higher than single flow, which I think means that the NIC is capable of hashing the packets based on the inner packet. Otherwise, while using the same MPLS label, the Cx3 and other NICs tend to just leverage only one receive queue. I\u0026rsquo;m very curious how this NIC stacks up between DPDK and RDMA \u0026ndash; read on below for my results!\nDPDK: ConnectX-5 EN I swap the card out of its OCP bay and replace it with a ConnectX-5 EN that I have from when I tested the [R86S]. 
It identifies as:\n0e:00.0 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] 0e:00.1 Ethernet controller: Mellanox Technologies MT27800 Family [ConnectX-5] And similar to the ConnectX-4, this card also complains about PCIe bandwidth:\n[6.478898] mlx5_core 0000:0e:00.0: firmware version: 16.25.4062 [6.485393] mlx5_core 0000:0e:00.0: 31.504 Gb/s available PCIe bandwidth, limited by 8.0 GT/s PCIe x4 link at 0000:00:1d.0 (capable of 63.008 Gb/s with 8.0 GT/s PCIe x8 link) [6.816156] mlx5_core 0000:0e:00.0: E-Switch: Total vports 10, per vport: max uc(1024) max mc(16384) [6.841005] mlx5_core 0000:0e:00.0: Port module event: module 0, Cable plugged [7.023602] mlx5_core 0000:0e:00.0: MLX5E: StrdRq(1) RqSz(8) StrdSz(2048) RxCqeCmprss(0) [7.177744] mlx5_core 0000:0e:00.0: Supported tc offload range - chains: 4294967294, prios: 4294967295 With that said, the loadtests are quite a bit more favorable for the newer ConnectX-5:\nLoadtest: Gowin CX5 DPDK L1 bits/sec Packets/sec % of Line 1514b-unidirectional 24.98 Gbps 2.04 Mpps 99.7 % 64b-unidirectional 10.71 Gbps 15.93 Mpps 42.8 % 64b-single-unidirectional 4.44 Gbps 6.61 Mpps 17.8 % 64b-mpls-unidirectional 10.36 Gbps 15.42 Mpps 41.5 % 1514b-bidirectional 24.70 Gbps 2.01 Mpps 49.4 % 64b-bidirectional 14.58 Gbps 21.69 Mpps 29.1 % 64b-single-bidirectional 8.38 Gbps 12.47 Mpps 16.8 % 64b-mpls-bidirectional 14.50 Gbps 21.58 Mpps 29.1 % Some observations:\nThe NIC also saturates 25G in one direction with large packets, and saturates the PCI bus when pushing in both directions. Single queue / thread operation at 6.61Mpps is a fair bit higher than Cx4 (which is 4.59Mpps) Multiple threads scale almost linearly, from 6.61Mpps in 1Q to 15.93Mpps in 3Q. That\u0026rsquo;s respectable! Bidirectional small packet performance is pretty great at 21.69Mpps, more than double that of the Cx4 (which is 11.04Mpps). MPLS rocks! The NIC forwards 21.58Mpps of MPLS traffic. One thing I should note, is that at this point, the CPUs are not fully saturated. Looking at Prometheus/Grafana for this set of loadtests:\nWhat I find interesting is that in no cases did any CPU thread run to 100% utilization. In the 64b single flow loadtests (from 14:00-14:10 and from 15:05-15:15), the CPU threads definitely got close, but they did not clip \u0026ndash; which does lead me to believe that the NIC (or the PCIe bus!) are the bottleneck.\nBy the way, the bidirectional single flow 64b loadtest shows two threads that have an overall slightly lower utilization (63%) versus the unidirectional single flow 64 loadtest (at 78.5%). I think this can be explained by the two threads being able to use/re-use each others\u0026rsquo; cache lines.\nConclusion: ConnectX-5 performs significantly better than ConnectX-4 with DPDK.\nRDMA RDMA supports zero-copy networking by enabling the network adapter to transfer data from the wire directly to application memory or from application memory directly to the wire, eliminating the need to copy data between application memory and the data buffers in the operating system. Such transfers require no work to be done by CPUs, caches, or context switches, and transfers continue in parallel with other system operations. This reduces latency in message transfer.\nYou can read more about it on [Wikipedia] VPP uses RDMA in a clever way, relying on the Linux library for rdma-core (libibverb) to create a custom userspace poll-mode driver, specifically for Ethernet packets. 
Despite using the RDMA APIs, this is not about RDMA (no Infiniband, no RoCE, no iWARP), just pure traditional Ethernet packets. Many VPP developers recommend and prefer RDMA for Mellanox devices. I myself have been more comfortable with DPDK. But, now is the time to FAFO.\nRDMA: ConnectX-4 Lx Considering I used three RX queues for DPDK, I instruct VPP now to use 3 receive queues for RDMA as well. I remove the dpdk_plugin.so from startup.conf, although I could also have kept the DPDK plugin running (to drive the 1.0G and 2.5G ports!) and de-selected the 0000:0e:00.0 and 0000:0e:00.1 PCI entries, so that the RDMA driver can grab them.\nThe VPP startup now looks like this:\nvpp# create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 512 tx-queue-size 512 num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026 vpp# create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 512 tx-queue-size 512 num-rx-queues 3 no-multi-seg no-striding max-pktlen 2026 vpp# set int mac address xxv0 02:fe:4a:ce:c2:fc vpp# set int mac address xxv1 02:fe:4e:f5:82:e7 I realize something pretty cool - the RDMA interface gets an ephemeral (randomly generated) MAC address, while the main network card in Linux stays available. The NIC internally has a hardware filter for the RDMA bound MAC address and gives it to VPP \u0026ndash; the implication is that the 25G NICs can also be used in Linux itself. That\u0026rsquo;s slick.\nPerformance wise:\nLoadtest: Gowin CX4 with RDMA L1 bits/sec Packets/sec % of Line 1514b-unidirectional 25.01 Gbps 2.04 Mpps 100.3 % 64b-unidirectional 12.32 Gbps 18.34 Mpps 49.1 % 64b-single-unidirectional 6.21 Gbps 9.24 Mpps 24.8 % 64b-mpls-unidirectional 11.95 Gbps 17.78 Mpps 47.8 % 1514b-bidirectional 26.24 Gbps 2.14 Mpps 52.5 % 64b-bidirectional 14.94 Gbps 22.23 Mpps 29.9 % 64b-single-bidirectional 11.53 Gbps 17.16 Mpps 23.1 % 64b-mpls-bidirectional 14.99 Gbps 22.30 Mpps 30.0 % Some thoughts:\nThe RDMA driver is significantly faster than DPDK in this configuration. Hah! 1514b are fine in both directions, RDMA slightly outperforms DPDK in the bidirectional test. 64b is massively faster: Unidirectional multiflow: RDMA 18.34Mpps, DPDK 11.05Mpps Bidirectional multiflow: RDMA 22.23Mpps, DPDK 11.04Mpps Bidirectional MPLS: RDMA 22.30Mpps, DPDK 10.93Mpps. Conclusion: I would say, roughly speaking, that RDMA outperforms DPDK on the Cx4 by a factor of two. That\u0026rsquo;s really cool, especially because ConnectX-4 network cards are found very cheap these days.\nRDMA: ConnectX-5 EN Well then what about the newer Mellanox ConnectX-5 card? Something surprising happens when I boot the machine and start the exact same configuration as with the Cx4, the loadtests results almost invariably suck:\nLoadtest: Gowin CX5 with RDMA L1 bits/sec Packets/sec % of Line 1514b-unidirectional 24.95 Gbps 2.03 Mpps 99.6 % 64b-unidirectional 6.19 Gbps 9.22 Mpps 24.8 % 64b-single-unidirectional 3.27 Gbps 4.87 Mpps 13.1 % 64b-mpls-unidirectional 6.18 Gbps 9.20 Mpps 24.7 % 1514b-bidirectional 24.59 Gbps 2.00 Mpps 49.2 % 64b-bidirectional 8.84 Gbps 13.15 Mpps 17.7 % 64b-single-bidirectional 5.57 Gbps 8.29 Mpps 11.1 % 64b-mpls-bidirectional 8.77 Gbps 13.05 Mpps 17.5 % Yikes! Cx5 in its default mode can still saturate 1514b loadtests, but turns into single digit with almost all other loadtest types. I\u0026rsquo;m surprised also that single flow loadtest clocks in at only 4.87Mpps, that\u0026rsquo;s about the same speed I sawwith the ConnectX-4 using DPDK. 
This does not look good at all, and honestly, I don\u0026rsquo;t believe it.\nSo I start fiddling with settings.\nConnectX-5 EN: Tuning Parameters There are a few things I found that might speed up processing in the ConnectX network card:\nAllowing for larger PCI packets - by default 512b, I can raise this to 1k, 2k or even 4k. setpci -s 0e:00.0 68.w will return some hex number ABCD, the A here stands for max read size. 0=128b, 1=256b, 2=512b, 3=1k, 4=2k, 8=4k. I can set the value by writing setpci -s 0e:00.0 68.w=3BCD, which immediately speeds up the loadtests! Mellanox recommends to turn on CQE compression, to allow for the PCI messages to be aggressively compressed, saving bandwidth. This helps specifically with smaller packets, as the PCI message overhead really starts to matter. mlxconfig -d 0e:00.0 set CQE_COMPRESSION=1 and reboot. For MPLS, the Cx5 can do flow matching on the inner packet (rather than hashing all packets to the same queue based on the MPLS label) \u0026ndash; mlxconfig -d 0e:00.0 set FLEX_PARSER_PROFILE_ENABLE=1 and reboot. Likely the number of receive queues matters, and can be set in the create interface rdma command. I notice that CQE_COMPRESSION and FLEX_PARSER_PROFILE_ENABLE help in all cases, so I set them and reboot. The PCI packets resizing also helps specifically with smaller packets, so I set that too in /etc/rc.local. The fourth variable is left over, which is varying receive queue count.\nHere\u0026rsquo;s a comparison that, to me at least, was surprising. With three receive queues, thus three CPU threads each receiving 4.7Mpps and sending 3.1Mpps, performance looked like this:\n$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026 $ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 num-rx-queues 3 mode dv no-multi-seg max-pktlen 2026 $ vppctl show run | grep vector\\ rates | grep -v in\\ 0 vector rates in 4.7586e6, out 3.2259e6, drop 3.7335e2, punt 0.0000e0 vector rates in 4.9881e6, out 3.2206e6, drop 3.8344e2, punt 0.0000e0 vector rates in 5.0136e6, out 3.2169e6, drop 3.7335e2, punt 0.0000e0 This is fishy - why is the inbound rate much higher than the outbound rate? The behavior is consistent in multi-queue setups. 
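To keep these runs easy to compare while I vary the queue count, I reset the counters and re-sample only the aggregate per-thread rates in a small loop. A minimal sketch, using the same grep as above (clear runtime zeroes the statistics, so each sample covers only the last ten seconds):
$ while true; do
>   vppctl clear runtime
>   sleep 10
>   vppctl show runtime | grep 'vector rates' | grep -v 'in 0'
> done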
If I create 2 queues it\u0026rsquo;s 8.45Mpps in and 7.98Mpps out:\n$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026 $ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 num-rx-queues 2 mode dv no-multi-seg max-pktlen 2026 $ vppctl show run | grep vector\\ rates | grep -v in\\ 0 vector rates in 8.4533e6, out 7.9804e6, drop 0.0000e0, punt 0.0000e0 vector rates in 8.4517e6, out 7.9798e6, drop 0.0000e0, punt 0.0000e0 And when I create only one queue, the same pattern appears:\n$ vppctl create interface rdma host-if enp14s0f0np0 name xxv0 rx-queue-size 1024 tx-queue-size 4096 num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026 $ vppctl create interface rdma host-if enp14s0f1np1 name xxv1 rx-queue-size 1024 tx-queue-size 4096 num-rx-queues 1 mode dv no-multi-seg max-pktlen 2026 $ vppctl show run | grep vector\\ rates | grep -v in\\ 0 vector rates in 1.2082e7, out 9.3865e6, drop 0.0000e0, punt 0.0000e0 But now that I\u0026rsquo;ve scaled down to only one queue (and thus one CPU thread doing all the work), I manage to find a clue in the show runtime command:\nThread 1 vpp_wk_0 (lcore 1) Time 321.1, 10 sec internal node vector rate 256.00 loops/sec 46813.09 vector rates in 1.2392e7, out 9.4015e6, drop 0.0000e0, punt 1.5571e-2 Name State Calls Vectors Suspends Clocks Vectors/Call ethernet-input active 15543357 3979099392 0 2.79e1 256.00 ip4-input-no-checksum active 15543352 3979098112 0 1.26e1 256.00 ip4-load-balance active 15543357 3979099387 0 9.17e0 255.99 ip4-lookup active 15543357 3979099387 0 1.43e1 255.99 ip4-rewrite active 15543357 3979099387 0 1.69e1 255.99 rdma-input polling 15543357 3979099392 0 2.57e1 256.00 xxv1-output active 15543357 3979099387 0 5.03e0 255.99 xxv1-tx active 15543357 3018807035 0 4.35e1 194.22 It takes a bit of practice to spot this, but see how xxv1-output is running at 256 vectors/call, while xxv1-tx is running at only 194.22 vectors/call? That means that VPP is dutifully handling the whole packet, but when it is handed off to RDMA to marshall onto the hardware, it\u0026rsquo;s getting lost! And indeed, this is corroborated by show errors:\n$ vppctl show err Count Node Reason Severity 3334 null-node blackholed packets error 7421 ip4-arp ARP requests throttled info 3 ip4-arp ARP requests sent info 1454511616 xxv1-tx no free tx slots error 16 null-node blackholed packets error Wow, over a billion packets have been routed by VPP but then had to be discarded because RDMA output could not keep up. Ouch.\nCompare the previous CPU utilization graph (from the Cx5/DPDK loadtest) with this Cx5/RDMA/1-RXQ loadtest:\nHere I can clearly see that the one CPU thread (in yellow for unidirectional) and the two CPU threads (one for each of the bidirectional flows) jump up to 100% and stay there. This means that when VPP is completely pegged, it is receiving 12.4Mpps per core, but only manages to get RDMA to send 9.40Mpps of those on the wire. The performance further deteriorates when multiple receive queues are in play.
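To put a number on how many packets are lost at the TX ring over a fixed interval, the error counters can be cleared and re-read while the loadtest runs. A small sketch:
$ vppctl clear errors
$ sleep 10
$ vppctl show errors | grep 'no free tx slots'
Dividing that counter by the ten second interval gives a drop rate that should line up with the gap between the in and out vector rates above.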
Note: 12.4Mpps is pretty great for these CPU threads.\nConclusion: Single Queue RDMA based Cx5 will allow for about 9Mpps per interface, which is a little bit better than DPDK performance; but Cx4 and Cx5 performance are not too far apart.\nSummary and closing thoughts Looking at the RDMA results for both Cx4 and Cx5, when using only one thread, gives fair performance with very low CPU cost per port \u0026ndash; however I could not manage to get rid of the no free tx slots errors, and VPP can consume / process / forward more packets than RDMA is willing to marshall out on the wire, which is disappointing.\nThat said, both RDMA and DPDK performance is line rate at 25G unidirectional with sufficiently large packets, and for small packets, it can realistically handle roughly 9Mpps per CPU thread. Considering the CPU has 8 threads \u0026ndash; of which 6 usable by VPP \u0026ndash; the machine has more CPU than it needs to drive the NICs. It should be a really great router at 10Gbps traffic rates, and a very fair router at 25Gbps either with RDMA or DPDK.\nHere\u0026rsquo;s a few files I gathered along the way, in case they are useful:\n[LSCPU] - [Likwid Topology] - [DMI Decode] - [LSBLK] Mellanox MCX4421A-ACAN: [dmesg] - [LSPCI] - [LSHW] Mellanox MCX542B-ACAN: [dmesg] - [LSPCI] - [LSHW] VPP Configs: [startup.conf] - [L2 Config] - [L3 Config] - [MPLS Config] ","date":"2024-08-03","desc":"Introduction Last month, I took a good look at the Gowin R86S based on Jasper Lake (N6005) CPU [ref], which is a really neat little 10G (and, if you fiddle with it a little bit, 25G!) router that runs off of USB-C power and can be rack mounted if you print a bracket. Check out my findings in this [article].\nDavid from Gowin reached out and asked me if I was willing to also take a look their Alder Lake (N305) CPU, which comes in a 19\u0026quot; rack mountable chassis, running off of 110V/220V AC mains power, but also with 2x25G ConnectX-4 network card. Why not! For critical readers: David sent me this machine, but made no attempt to influence this article.\n","permalink":"https://ipng.ch/s/articles/2024/08/03/review-gowin-1u-2x25g-alder-lake-n305/","section":"articles","title":"Review: Gowin 1U 2x25G (Alder Lake - N305)"},{"contents":"IPng is more than just a company. We operate a privately owned collection of hosting software and network services. As such, we can make decisions quickly. The roots of IPng go all the way back to the 90s, when IPv6 was called the next generation internet protocol, or IPng for short.\nRather than dazzle you with pictures of clouds, grandiose projections of our \u0026ldquo;global IP backbone\u0026rdquo;, and other claims that small businesses make to appear larger than they are, we\u0026rsquo;re happy to show what we know, what we own, and how we can help you accomplish your goals if you want to work with us.\nOur mission is to make networking and hosting services available to our customers, partners and users with a flexible yet uncompromising quality.\nOur story As most companies, it started with an opportunity. We got our hands on a physical location which had a raised floor at 60m2 and a significant power connection of 3x200A, and a metro fiber connection at 10Gig. We asked ourselves \u0026lsquo;what would it take to turn this into a colo?\u0026rsquo; and the rest is history. 
Thanks to our partners who benefit from this infrastructure as well, making this first small colocation site was not only interesting, but also very rewarding.\nThe networking and service provider industry is quite small and well organized into Network Operator Groups, so we work under the assumption that everybody knows everybody. We\u0026rsquo;d definitely like to pitch in and share what we have built, both the physical bits but also the narrative.\nOur Founder We have been operating autonomous systems and corporate networks for decades. In Switzerland, we incorporated in early 2021 into a limited liability company.\nPim van Pelt - [PBVP1-RIPE] started his career as a network engineer in the Netherlands, where he worked for Intouch, Freeler, and BIT. He helped raise awareness for IPv6, for example by launching it at AMS-IX back in 2001. He also operated [SixXS], a global IPv6 tunnel broker, from 2001 through to its sunset in 2017. Since 2006, Pim works as a Distinguished Software Engineer at Google in Zurich, Switzerland. In his free time, he goes [Geocaching], contributes to [open source] projects, and occasionally flies model helicopters.\n","date":"2024-07-28","desc":"IPng is more than just a company. We operate a privately owned collection of hosting software and network services. As such, we can make decisions quickly. The roots of IPng go all the way back to the 90s, when IPv6 was called the next generation internet protocol, or IPng for short.\nRather than dazzle you with pictures of clouds, grandiose projections of our \u0026ldquo;global IP backbone\u0026rdquo;, and other claims that small businesses make to appear larger than they are, we\u0026rsquo;re happy to show what we know, what we own, and how we can help you accomplish your goals if you want to work with us.\n","permalink":"https://ipng.ch/s/about/","section":"","title":"About"},{"contents":"At IPng, we understand that teams are built with more than one player. We try to think out of the box, and have an extensive technical understanding of the ISP industry and the experience to know when we can solve problems or when we need to ask for help to solve problems in any of your hosting and network needs. Why don\u0026rsquo;t you use IPng to:\nHost directly physical machines or virtual machines at a competetive rate, or Use our direct contacts to host a rack, a set of racks, or a cage in a datacenter. Provide internet access to your network, or Connect you to best in class network providers of high pedigree for larger needs. Ask us a question or two about the things, companies and people we know, or Connect you to other professionals who mitigate IT risks and advance your business. GitHubLinkedInMastodonEmailRSS As you can see, we are not necessarily posessive of our customer relationship. We\u0026rsquo;re good at some things, are opinionated on how things ought to be, and have a sorting hat of contacts, partners and providers that can pitch in to solve problems. We can of course coordinate, redirect, or partner directly.\nIPng Networks GmbH Chamber: CHE-290.311.610 E-Mail: info@ipng.ch\nLanguages: Dutch, German, French, and English (preferred).\n","date":"2024-07-28","desc":"At IPng, we understand that teams are built with more than one player. We try to think out of the box, and have an extensive technical understanding of the ISP industry and the experience to know when we can solve problems or when we need to ask for help to solve problems in any of your hosting and network needs. 
Why don\u0026rsquo;t you use IPng to:\nHost directly physical machines or virtual machines at a competetive rate, or Use our direct contacts to host a rack, a set of racks, or a cage in a datacenter. Provide internet access to your network, or Connect you to best in class network providers of high pedigree for larger needs. Ask us a question or two about the things, companies and people we know, or Connect you to other professionals who mitigate IT risks and advance your business. GitHubLinkedInMastodonEmailRSS As you can see, we are not necessarily posessive of our customer relationship. We\u0026rsquo;re good at some things, are opinionated on how things ought to be, and have a sorting hat of contacts, partners and providers that can pitch in to solve problems. We can of course coordinate, redirect, or partner directly.\n","permalink":"https://ipng.ch/s/contact/","section":"","title":"Contact"},{"contents":"Network Our network consists of routers and interconnections in two main sites in the Zurich metro: eShelter (NTT) in Ruemlang (next to the airport), and Interxion ZUR1 in Glattbrugg. These two locations are connected with dark fiber, and we have access to several local loop providers in both locations. For example, we can arrange connectivity to Colozueri, Equinix ZH4 and Equinix ZH5 directly, and using our partners (such as IP-Max, Init7, Openfactory), domestically and to most european cities.\nYou can read more about our network in this [informative post].\nIP Transit We operate [AS8298] which announces AS-IPNG for ourselves and our transit customers. We\u0026rsquo;re pretty firmly connected in and around Zurich, with a 10Gbit link to SwissIX, CommunityIX and CH-IX. We have a diverse set of transit providers which give us good reach to the world, including via Cogent AS174, Hurricane Electric AS6939, and for strong european presence, we receive transit from OpenFactory AS58299, Meerfarbig AS34549, and IP-Max AS25091.\nGaining access to this wealth of IPv4 and IPv6 coverage is as easy as finding an L2 connection to one of our points of presence, establishing a BGP session to us, and announcing your netblock(s). We\u0026rsquo;ll take it from there!\nYou can read more about our BGP capabilities in this [informative post].\nLocal Loop Ethernet Apropos, getting onto the internet is pretty easy if you are in a commercial colocation facility. Of course, any internet provider will offer various quality IPv4 and IPv6 connections with or without a static IP address. However getting from your house to the datacenter is often the most complicated and expensive project. At IPng, we had this challenge too \u0026ndash; and considering we solved it for ourselves, we can certainly also solve it for you! With our residential last-mile ISP collaboration (for example, ConnectionPoint, Init7, Solnet, and Swisscom BBCS), L2 services directly to your residence or office become easily possible.\nColocation We operate a private colocation facility in Zurich Albisrieden. The facility has 3x200A of power and about 60m2 of floor space. Hosting one or several machines, including Layer2 connectivity either to your own home, or to the main internet hubs of Zurich, are easily accomplished in our colocation facility. 
If more space is needed, we are regulars in most all Swiss carrier housing facilities, and can help broker a deal that is tailored to your needs.\nYou can read more about how we built our own colocation from scratch in this [informative post].\nSelf-Hosting For IPng it\u0026rsquo;s important to take back a little bit of responsibility for our online presence, away from centrally hosted services and to privately operated ones. We are experts at self-hosting, with services such as [Mastodon], [Pixelfed], [Loops], [PeerTube], [Mail] and myriad others.\nProject Design / Execution We design things, both logical and physical. Be it two dimensional or three dimensional, as life long tinkerers, we have a fair bit of experience in mechanical and electrical engineering. Of course, as a network business, designing and deploying autonomous systems and networks of any size is our home turf. Also fundamentally understanding how a network should perform, be it throughput, bandwidth, latency or jitter and packet loss: we\u0026rsquo;ve seen it all, from ADSL lines to 300km DWDM spans.\nFor some good examples, take a look at these case studies:\nFiber7 on LiteXchange SixXS Sunset Coloclue Loadtesting Debian on Mellanox SN2700 There\u0026rsquo;s many more papers and opinion pieces in our [Articles] section.\n","date":"2024-07-28","desc":"Network Our network consists of routers and interconnections in two main sites in the Zurich metro: eShelter (NTT) in Ruemlang (next to the airport), and Interxion ZUR1 in Glattbrugg. These two locations are connected with dark fiber, and we have access to several local loop providers in both locations. For example, we can arrange connectivity to Colozueri, Equinix ZH4 and Equinix ZH5 directly, and using our partners (such as IP-Max, Init7, Openfactory), domestically and to most european cities.\n","permalink":"https://ipng.ch/s/services/","section":"","title":"Services"},{"contents":"Introduction I am always interested in finding new hardware that is capable of running VPP. Of course, a standard issue 19\u0026quot; rack mountable machine like a Dell, HPE or SuperMicro machine is an obvious choice. They come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or size/cost of these 19\u0026quot; rack mountable machines can be prohibitive. Sometimes, just having a smaller form factor can be very useful:\nEnter the GoWin R86S!\nI stumbled across this lesser known build from GoWin, which is an ultra compact but modern design, featuring three 2.5GbE ethernet ports and optionally two 10GbE, or as I\u0026rsquo;ll show here, two 25GbE ports. What I really liked about the machine is that it comes with 32GB of LPDDR4 memory and can boot off of an m.2 NVME \u0026ndash; which makes it immediately an appealing device to put in the field. I noticed that the height of the machine is just a few millimeters smaller than 1U which is 1.75\u0026quot; (44.5mm), which gives me the bright idea to 3D print a bracket to be able to rack these and because they are very compact \u0026ndash; a width of 78mm only, I can manage to fit four of them in one 1U front, or maybe a Mikrotik CRS305 breakout switch. Slick!\nI picked up two of these R86S Pro and when they arrived, I noticed that their 10GbE is actually an Open Compute Project (OCP) footprint expansion card, which struck me as clever. 
It means that I can replace the Mellanox CX342A network card with perhaps something more modern, such as an Intel X520-DA2 or Mellanox MCX542B_ACAN which is even dual-25G! So I take to ebay and buy myself a few expansion OCP boards, which are surprisingly cheap, perhaps because the OCP form factor isn\u0026rsquo;t as popular as \u0026rsquo;normal\u0026rsquo; PCIe v3.0 cards.\nI put a Google photos album online [here], in case you\u0026rsquo;d like some more detailed shots.\nIn this article, I\u0026rsquo;ll write about a mixture of hardware, systems engineering (how the hardware like network cards and motherboard and CPU interact with one another), and VPP performance diagnostics. I hope that it helps a few wary Internet denizens feel their way around these challenging but otherwise fascinating technical topics. Ready? Let\u0026rsquo;s go!\nHardware Specs For the CHF 314,- I paid for each Intel Pentium N6005, this machine is delightful! They feature:\nIntel Pentium Silver N6005 @ 2.00GHz (4 cores) 2x16GB Micron LPDDR4 memory @2933MT/s 1x Samsung SSD 980 PRO 1TB NVME 3x Intel I226-V 2.5GbE network ports 1x OCP v2.0 connector with PCIe v3.0 x4 delivered USB-C power supply 2x USB3 (one on front, one on side) 1x USB2 (on the side) 1x MicroSD slot 1x MicroHDMI video out Wi-Fi 6 AX201 160MHz onboard To the right I\u0026rsquo;ve put the three OCP nework interface cards side by side. On the top, the Mellanox Cx3 (2x10G) that shipped with the R86S units. In the middle, a spiffy Mellanox Cx5 (2x25G), and at the bottom, the classic Intel 82599ES (2x10G) card. As I\u0026rsquo;ll demonstrate, despite having the same form factor, each of these have a unique story to tell, well beyond their rated portspeed.\nThere\u0026rsquo;s quite a few options for CPU out there - GoWin sells them with Jasper Lake (Celeron N5105 or Pentium N6005, the one I bought), but also with newer Alder Lake (N100 or N305). Price, performance and power draw will vary. I looked at a few differences in Passmark, and I think I made a good trade off between cost, power and performance. You may of course choose differently!\nThe R86S formfactor is very compact, coming in at (80mm x 120mm x 40mm), and the case is made of sturdy aluminium. It feels like a good quality build, and the inside is also pretty neat. In the kit, a cute little M2 hex driver is included. This allows me to remove the bottom plate (to service the NVME) and separate the case to access the OCP connector (and replace the NIC!). Finally, the two antennae at the back are tri-band, suitable for WiFi 6. There is one fan included in the chassis, with a few cut-outs in the top of the case, to let the air flow through the case. The fan is not noisy, but definitely noticeable.\nCompiling VPP on R86S I first install Debian Bookworm on them, and retrofit one of them with the Intel X520 and the other with the Mellanox Cx5 network cards. While the Mellanox Cx342A that comes with the R86S does have DPDK support (using the MLX4 poll mode driver), it has a quirk in that it does not enumerate both ports as unique PCI devices, causing VPP to crash with duplicate graph node names:\nvlib_register_node:418: more than one node named `FiftySixGigabitEthernet5/0/0-tx\u0026#39; Failed to save post-mortem API trace to /tmp/api_post_mortem.794 received signal SIGABRT, PC 0x7f9445aa9e2c The way VPP enumerates DPDK devices is by walking the PCI bus, but considering the Connect-X3 has two ports behind the same PCI address, it\u0026rsquo;ll try to create two interfaces, which fails. 
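A quick way to see this on the PCI bus is to list the Ethernet functions: the Cx342A enumerates a single function for both SFP+ ports, while cards like the X520 or Cx5 show one function per port. A sketch, and the exact device strings will vary by card and firmware:
$ lspci -D | grep -i 'mellanox\|82599'    # the Cx3 appears once, e.g. 0000:05:00.0, not once per port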
It\u0026rsquo;s pretty easily fixable with a small [patch]. Off I go, to compile VPP (version 24.10-rc0~88-ge3469369dd) with Mellanox DPDK support, to get the best side by side comparison between the Cx3 and X520 cards on the one hand needing DPDK, and the Cx5 card optionally also being able to use VPP\u0026rsquo;s RDMA driver. They will all be using DPDK in my tests.\nI\u0026rsquo;m not out of the woods yet, because VPP throws an error when enumerating and attaching the Mellanox Cx342. I read the DPDK documentation for this poll mode driver [ref] and find that when using DPDK applications, the mlx4_core driver in the kernel has to be initialized with a specific flag, like so:\nGRUB_CMDLINE_LINUX_DEFAULT=\u0026#34;isolcpus=1-3 iommu=on intel_iommu=on mlx4_core.log_num_mgm_entry_size=-1\u0026#34; And because I\u0026rsquo;m using iommu, the correct driver to load for Cx3 is vfio_pci, so I put that in /etc/modules, rebuild the initrd, and reboot the machine. With all of that sleuthing out of the way, I am now ready to take the R86S out for a spin and see how much this little machine is capable of forwarding as a router.\nPower: Idle and Under Load I note that the Intel Pentium Silver CPU has 4 cores, one of which will be used by OS and controlplane, leaving 3 worker threads left for VPP. The Pentium Silver N6005 comes with 32kB of L1 per core, and 1.5MB of L2 + 4MB of L3 cache shared between the cores. It\u0026rsquo;s not much, but then again the TDP is shockingly low 10 Watts. Before VPP runs (and makes the CPUs work really hard), the entire machine idles at 12 Watts. When powered on under full load, the Mellanox Cx3 and Intel x520-DA2 both sip 17 Watts of power and the Mellanox Cx5 slurps 20 Watts of power all-up. Neat!\nLoadtest Results For each network interface I will do a bunch of loadtests, to show different aspects of the setup. First, I\u0026rsquo;ll do a bunch of unidirectional tests, where traffic goes into one port and exits another. I will do this with either large packets (1514b), small packets (64b) but many flows, which allow me to use multiple hardware receive queues assigned to individual worker threads, or small packets with only one flow, limiting VPP to only one RX queue and consequently only one CPU thread. Because I think it\u0026rsquo;s hella cool, I will also loadtest MPLS label switching (eg. MPLS frame with label \u0026lsquo;16\u0026rsquo; on ingress, forwarded with a swapped label \u0026lsquo;17\u0026rsquo; on egress). In general, MPLS lookups can be a bit faster as they are (constant time) hashtable lookups, while IPv4 longest prefix match lookups use a trie. MPLS won\u0026rsquo;t be significantly faster than IPv4 in these tests, because the FIB is tiny with only a handful of entries.\nSecond, I\u0026rsquo;ll do the same loadtests but in both directions, which means traffic is both entering NIC0 and being emitted on NIC1, but also entering on NIC1 to be emitted on NIC0. In these loadtests, again large packets, small packets multi-flow, small packets single-flow, and MPLS, the network chip has to do more work to maintain its RX queues and its TX queues simultaneously. As I\u0026rsquo;ll demonstrate, this tends to matter quite a bit on consumer hardware.\nIntel i226-V (2.5GbE) This is a 2.5G network interface from the Foxville family, released in Q2 2022 with a ten year expected availability, it\u0026rsquo;s currently a very good choice. It is a consumer/client chip, which means I cannot expect super performance from it. 
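To find the PCI addresses that go into the dpdk stanza below, listing the Ethernet devices is all that is needed. A sketch:
$ lspci -nn | grep -i ethernet    # the three I226-V ports and the OCP card show up here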
In this machine, the three RJ45 ports are connected to PCI slot 01:00.0, 02:00.0 and 03:00.0, each at 5.0GT/s (this means they are PCIe v2.0) and they take one x1 PCIe lane to the CPU. I leave the first port as management, and take the second+third one and give them to VPP like so:\ndpdk { dev 0000:02:00.0 { name e0 } dev 0000:03:00.0 { name e1 } no-multi-seg decimal-interface-names uio-driver vfio-pci } The logical configuration then becomes:\nset int state e0 up set int state e1 up set int ip address e0 100.64.1.1/30 set int ip address e1 100.64.2.1/30 ip route add 16.0.0.0/24 via 100.64.1.2 ip route add 48.0.0.0/24 via 100.64.2.2 ip neighbor e0 100.64.1.2 50:7c:6f:20:30:70 ip neighbor e1 100.64.2.2 50:7c:6f:20:30:71 mpls table add 0 set interface mpls e0 enable set interface mpls e1 enable mpls local-label add 16 eos via 100.64.2.2 e1 mpls local-label add 17 eos via 100.64.1.2 e0 In the first block, I\u0026rsquo;ll bring up interfaces e0 and e1, give them an IPv4 address in a /30 transit net, and set a route to the other side. I\u0026rsquo;ll route packets destined to 16.0.0.0/24 to the Cisco T-Rex loadtester at 100.64.1.2, and I\u0026rsquo;ll route packets for 48.0.0.0/24 to the T-Rex at 100.64.2.2. To avoid the need to ARP for T-Rex, I\u0026rsquo;ll set some static ARP entries to the loadtester\u0026rsquo;s MAC addresses.\nIn the second block, I\u0026rsquo;ll enable MPLS, turn it on on the two interfaces, and add two FIB entries. If VPP receives an MPLS packet with label 16, it\u0026rsquo;ll forward it on to Cisco T-Rex on port e1, and if it receives a packet with label 17, it\u0026rsquo;ll forward it to T-Rex on port e0.\nWithout further ado, here are the results of the i226-V loadtest:\nIntel i226-V: Loadtest L2 bits/sec Packets/sec % of Line-Rate Unidirectional 1514b 2.44Gbps 202kpps 99.4% Unidirectional 64b Multi 1.58Gbps 3.28Mpps 88.1% Unidirectional 64b Single 1.58Gbps 3.28Mpps 88.1% Unidirectional 64b MPLS 1.57Gbps 3.27Mpps 87.9% Bidirectional 1514b 4.84Gbps 404kpps 99.4% Bidirectional 64b Multi 2.44Gbps 5.07Mpps 68.2% Bidirectional 64b Single 2.44Gbps 5.07Mpps 68.2% Bidirectional 64b MPLS 2.43Gbps 5.07Mpps 68.2% First response: very respectable!\nImportant Notes 1. L1 vs L2 There\u0026rsquo;s a few observations I want to make, as these numbers can be confusing. First off, VPP when given large packets, can easily sustain almost exactly (!) the line rate of 2.5GbE. There\u0026rsquo;s always a debate about these numbers, so let me offer offer some theoretical background \u0026ndash;\nThe L2 Ethernet frame that Cisco T-Rex sends consists of the source/destination MAC (6 bytes each), a type (2 bytes), the payload, and a frame checksum (4 bytes). It shows us this number as Tx bps L2. But on the wire, the PHY has to additionally send a preamble (7 bytes), a start frame delimiter (1 byte), and at the end, an interpacket gap (12 bytes), which is 20 bytes of overhead. This means that the total size on the wire will be 1534 bytes. It shows us this number as Tx bps L1. This 1534 byte L1 frame on the wire is 12272 bits. For a 2.5Gigabit line rate, this means we can send at most 2'500'000'000 / 12272 = 203715 packets per second. Regardless of L1 or L2, this number is always Tx pps. The smallest (L2) Ethernet frame we\u0026rsquo;re allowed to send, is 64 bytes, and anything shorter than this is called a Runt. On the wire, such a frame will be 84 bytes (672 bits). With 2.5GbE, this means 3.72Mpps is the theoretical maximum. 
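Those line-rate numbers are easy to sanity-check with shell arithmetic: divide the 2.5Gbit/s wire speed by the on-the-wire frame size, that is the L2 size plus the 20 bytes of preamble, start frame delimiter and interpacket gap:
$ echo $(( 2500000000 / ((1514 + 20) * 8) ))   # 1514 byte frames
203715
$ echo $(( 2500000000 / ((64 + 20) * 8) ))     # 64 byte frames
3720238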
When reading back loadtest results from Cisco T-Rex, it shows us packets per second (Rx pps), but it only shows us the Rx bps, which is the L2 bits/sec corresponding to the sending port\u0026rsquo;s Tx bps L2. When I describe the percentage of Line-Rate, I calculate this with what physically fits on the wire, e.g. the L1 bits/sec, because that makes most sense to me.\nWhen sending small 64b packets, the difference is significant: taking the above Unidirectional 64b Single as an example, I observed 3.28M packets/sec. This is a bandwidth of 3.28M*64*8 = 1.679Gbit of L2 traffic, but a bandwidth of 3.28M*(64+20)*8 = 2.204Gbit of L1 traffic, which is how I determine that it is 88.1% of Line-Rate.\n2. One RX queue A less pedantic observation is that there is no difference between Multi and Single flow loadtests. This is because the NIC only uses one RX queue, and therefore only one VPP worker thread. I did do a few loadtests with multiple receive queues, but it does not matter for performance. When performing this 3.28Mpps of load, I can see that VPP itself is not saturated. I can see that most of the time it\u0026rsquo;s just sitting there waiting for DPDK to give it work, which manifests as a relatively low vectors/call:\n--------------- Thread 2 vpp_wk_1 (lcore 2) Time 10.9, 10 sec internal node vector rate 40.39 loops/sec 68325.87 vector rates in 3.2814e6, out 3.2814e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call dpdk-input polling 61933 2846586 0 1.28e2 45.96 ethernet-input active 61733 2846586 0 1.71e2 46.11 ip4-input-no-checksum active 61733 2846586 0 6.54e1 46.11 ip4-load-balance active 61733 2846586 0 4.70e1 46.11 ip4-lookup active 61733 2846586 0 7.50e1 46.11 ip4-rewrite active 61733 2846586 0 7.23e1 46.11 e1-output active 61733 2846586 0 2.53e1 46.11 e1-tx active 61733 2846586 0 1.38e2 46.11 By the way, the other numbers here are fascinating as well. Take a look at them:\nCalls: How often has VPP executed this graph node. Vectors: How many packets (which are internally called vectors) have been handled. Vectors/Call: Every time VPP executes the graph node, on average how many packets are done at once? An unloaded VPP will hover around 1.00, and the maximum permissible is 256.00. Clocks: How many CPU cycles, on average, did each packet spend in each graph node. Interestingly, summing up this number gets very close to the total CPU clock cycles available (on this machine 2.4GHz). Zooming in on the clocks number a bit more: every time a packet was handled, roughly 594 CPU cycles were spent in VPP\u0026rsquo;s directed graph. An additional 128 CPU cycles were spent asking DPDK for work. Summing it all up, 3.28M*(594+128) = 2'369'170'800 which is eerily close to the 2.4GHz I mentioned above. I love it when the math checks out!!\nBy the way, in case you were wondering what happens on an unloaded VPP thread, the clocks spent in dpdk-input (and other polling nodes like unix-epoll-input) just go up to consume the whole core. I explain that in a bit more detail below.
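For completeness, here is that clock budget from the Clocks bullet above as a one-liner, 3.2814 Mpps times the roughly 722 cycles spent per packet:
$ echo $(( 3281400 * (594 + 128) ))
2369170800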
In only one direction, IPv4 and MPLS forwarding were roughly 3.28Mpps; but in both directions, this went down to 2.53Mpps in each direction (which is my reported 5.07Mpps). So it\u0026rsquo;s interesting to see how these i226-V chips do seem to care if they are only receiving or transmitting transmitting, or performing both receiving and transmitting.\nIntel X520 (10GbE) This network card is based on the classic Intel Niantic chipset, also known as the 82599ES chip, first released in 2009. It\u0026rsquo;s super reliable, but there is one downside. It\u0026rsquo;s a PCIe v2.0 device (5.0GT/s) and to be able to run two ports, it will need eight lanes of PCI connectivity. However, a quick inspection using dmesg shows me, that there are only 4 lanes brought to the OCP connector:\nixgbe 0000:05:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link at 0000:00:1c.4 (capable of 32.000 Gb/s with 5.0 GT/s PCIe x8 link) ixgbe 0000:05:00.0: MAC: 2, PHY: 1, PBA No: H31656-000 ixgbe 0000:05:00.0: 90:e2:ba:c5:c9:38 ixgbe 0000:05:00.0: Intel(R) 10 Gigabit Network Connection That\u0026rsquo;s a bummer! Because there are two Tengig ports on this OCP, and this chip is a PCIe v2.0 device which means the PCI encoding will be 8b/10b which means each lane can deliver about 80% of the 5.0GT/s, and 80% of 20GT/s is 16.0Gbit. By the way, when PCIe v3.0 was released, not only did the transfer speed go to 8.0GT/s per lane, the encoding also changed to 128b/130b which lowers the overhead from a whopping 20% to only 1.5%. It\u0026rsquo;s not a bad investment of time to read up on PCI Express standards on [Wikipedia], as PCIe limitations and blocked lanes (like in this case!) are the number one reason for poor VPP performance, as my buddy Sander also noted during my NLNOG talk last year.\nIntel X520: Loadtest Results Now that I\u0026rsquo;ve shown a few of these runtime statistics, I think it\u0026rsquo;s good to review three pertinent graphs. I proceed to hook up the loadtester to the 10G ports of the R86S unit that has the Intel X520-DA2 adapter. I\u0026rsquo;ll run the same eight loadtests: {1514b,64b,64b-1Q,MPLS} x {unidirectional,bidirectional}\nIn the table above, I showed the output of show runtime in the VPP debug CLI. These numbers are also exported in a prometheus exporter. I wrote about that in this [article]. In Grafana, I can draw these timeseries as graphs, and it shows me a lot about where VPP is spending its time. Each node in the directed graph counts how many vectors (packets) it has seen, and how many CPU cycles it has spent doing its work.\nIn VPP, a graph of vectors/sec means how many packets per second is the router forwarding. The graph above is on a logarithmic scale, and I\u0026rsquo;ve annotated each of the eight loadtests in orange. The first block of four are the Unidirectional tests and of course, higher values is better.\nI notice that some of these loadtests ramp up until a certain point, after which they become a flatline, which I drew orange arrows for. The first time this clearly happens is in the U3 loadtest. It makes sense to me, because having one flow implies only one worker thread, whereas in the U2 loadtest the system can make use of multiple receive queues and therefore multiple worker threads. It stands to reason that U2 has a slightly better performance than U3.\nThe fourth test, the MPLS loadtest, is forwarding the same identical packets with label 16, out on another interface with label 17. 
They are therefore also single flow, and this explains why the U4 loadtest looks very similar to the U3 one. Some NICs can hash MPLS traffic to multiple receive queues based on the inner payload, but I conclude that the Intel X520-DA2 aka 82599ES cannot do that.\nThe second block of four are the Bidirectional tests. Similar to the tests I did with the i226-V 2.5GbE NICs, here each of the network cards has to both receive traffic as well as sent traffic. It is with this graph that I can determine the overall throughput in packets/sec of these network interfaces. Of course the bits/sec and packets/sec also come from the T-Rex loadtester output JSON. Here they are, for the Intel X520-DA2:\nIntel 82599ES: Loadtest L2 bits/sec Packets/sec % of Line-Rate U1: Unidirectional 1514b 9.77Gbps 809kpps 99.2% U2: Unidirectional 64b Multi 6.48Gbps 13.4Mpps 90.1% U3: Unidirectional 64b Single 3.73Gbps 7.77Mpps 52.2% U4: Unidirectional 64b MPLS 3.32Gbps 6.91Mpps 46.4% B1: Bidirectional 1514b 12.9Gbps 1.07Mpps 65.6% B2: Bidirectional 64b Multi 6.08Gbps 12.7Mpps 42.7% B3: Bidirectional 64b Single 6.25Gbps 13.0Mpps 43.7% B4: Bidirectional 64b MPLS 3.26Gbps 6.79Mpps 22.8% A few further observations:\nU1\u0026rsquo;s loadtest shows that the machine can sustain 10Gbps in one direction, while B1 shows that bidirectional loadtests are not yielding twice as much throughput. This is very likely because the PCIe 5.0GT/s x4 link is constrained to 16Gbps total throughput, while the OCP NIC supports PCIe 5.0GT/s x8 (32Gbps). U3\u0026rsquo;s loadtest shows that one single CPU can do 7.77Mpps max, if it\u0026rsquo;s the only CPU that is doing work. This is likely because if it\u0026rsquo;s the only thread doing work, it gets to use the entire L2/L3 cache for itself. U2\u0026rsquo;s test shows that when multiple workers perform work, the throughput raises to 13.4Mpps, but this is not double that of a single worker. Similar to before, I think this is because the threads now need to share the CPU\u0026rsquo;s modest L2/L3 cache. B3\u0026rsquo;s loadtest shows that two CPU threads together can do 6.50Mpps each (for a total of 13.0Mpps), which I think is likely because each NIC now has to receive and transit packets. If you\u0026rsquo;re reading this and think you have an alternative explanation, do let me know!\nMellanox Cx3 (10GbE) When VPP is doing its work, it typically asks DPDK (or other input types like virtio, AVF, or RDMA) for a list of packets, rather than one individual packet. It then brings these packets, called vectors, through a directed acyclic graph inside of VPP. Each graph node does something specific to the packets, for example in ethernet-input, the node checks what ethernet type each packet is (ARP, IPv4, IPv6, MPLS, \u0026hellip;), and hands them off to the correct next node, such as ip4-input or mpls-input. If VPP is idle, there may be only one or two packets in the list, which means every time the packets go into a new node, a new chunk of code has to be loaded from working memory into the CPU\u0026rsquo;s instruction cache. Conversely, if there are many packets in the list, only the first packet may need to pull things into the i-cache, the second through Nth packet will become cache hits and execute much faster. Moreover, some nodes in VPP make use of processor optimizations like SIMD (single instruction, multiple data), to save on clock cycles if the same operation needs to be executed multiple times.\nThis graph shows the average CPU cycles per packet for each node. 
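For readers without the Prometheus setup, the same per-node cost can be eyeballed straight from the CLI by pulling the Clocks and Vectors/Call columns out of show runtime. A rough sketch, assuming the seven column layout shown earlier in this article:
$ vppctl show runtime | awk 'NF == 7 && $7 ~ /^[0-9]/ { print $1, $6, $7 }'   # name, clocks/packet, vectors/call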
In the first three loadtests (U1, U2 and U3), you can see four lines representing the VPP nodes ip4-input ip4-lookup, ip4-load-balance and ip4-rewrite. In the fourth loadtest U4, you can see only three nodes: mpls-input, mpls-lookup, and ip4-mpls-label-disposition-pipe (where the MPLS label \u0026lsquo;16\u0026rsquo; is swapped for outgoing label \u0026lsquo;17\u0026rsquo;).\nIt\u0026rsquo;s clear to me that when VPP has not many packets/sec to route (ie U1 loadtest), that the cost per packet is actually quite high at around 200 CPU cycles per packet per node. But, if I slam the VPP instance with lots of packets/sec (ie U3 loadtest), that VPP gets much more efficient at what it does. What used to take 200+ cycles per packet, now only takes between 34-52 cycles per packet, which is a whopping 5x increase in efficiency. How cool is that?!\nAnd with that, the Mellanox C3 loadtest completes, and the results are in:\nMellanox MCX342A-XCCN: Loadtest L2 bits/sec Packets/sec % of Line-Rate U1: Unidirectional 1514b 9.73Gbps 805kpps 99.7% U2: Unidirectional 64b Multi 1.11Gbps 2.30Mpps 15.5% U3: Unidirectional 64b Single 1.10Gbps 2.27Mpps 15.3% U4: Unidirectional 64b MPLS 1.10Gbps 2.27Mpps 15.3% B1: Bidirectional 1514b 18.7Gbps 1.53Mpps 94.9% B2: Bidirectional 64b Multi 1.54Gbps 2.29Mpps 7.69% B3: Bidirectional 64b Single 1.54Gbps 2.29Mpps 7.69% B4: Bidirectional 64b MPLS 1.54Gbps 2.29Mpps 7.69% Here\u0026rsquo;s something that I find strange though. VPP is clearly not saturated by these 64b loadtests. I know this, because in the case of the Intel X520-DA2 above, I could easily see 13Mpps in a bidirectional test, yet with this Mellanox Cx3 card, no matter if I do one direction or both directions, the max packets/sec tops at 2.3Mpps only \u0026ndash; that\u0026rsquo;s an order of magnitude lower.\nLooking at VPP, both worker threads (the one reading from Port 5/0/0, and the other reading from Port 5/0/1), are not very busy at all. If a VPP worker thread is saturated, this typically shows as a vectors/call of 256.00 and 100% of CPU cycles consumed. 
But here, that\u0026rsquo;s not the case at all, and most time is spent in DPDK waiting for traffic:\nThread 1 vpp_wk_0 (lcore 1) Time 31.2, 10 sec internal node vector rate 2.26 loops/sec 988626.15 vector rates in 1.1521e6, out 1.1521e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call FiftySixGigabitEthernet5/0/1-o active 15949560 35929200 0 8.39e1 2.25 FiftySixGigabitEthernet5/0/1-t active 15949560 35929200 0 2.59e2 2.25 dpdk-input polling 36250611 35929200 0 6.55e2 .99 ethernet-input active 15949560 35929200 0 2.69e2 2.25 ip4-input-no-checksum active 15949560 35929200 0 1.01e2 2.25 ip4-load-balance active 15949560 35929200 0 7.64e1 2.25 ip4-lookup active 15949560 35929200 0 9.26e1 2.25 ip4-rewrite active 15949560 35929200 0 9.28e1 2.25 unix-epoll-input polling 35367 0 0 1.29e3 0.00 --------------- Thread 2 vpp_wk_1 (lcore 2) Time 31.2, 10 sec internal node vector rate 2.43 loops/sec 659534.38 vector rates in 1.1517e6, out 1.1517e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call FiftySixGigabitEthernet5/0/0-o active 14845221 35913927 0 8.66e1 2.42 FiftySixGigabitEthernet5/0/0-t active 14845221 35913927 0 2.72e2 2.42 dpdk-input polling 23114538 35913927 0 6.99e2 1.55 ethernet-input active 14845221 35913927 0 2.65e2 2.42 ip4-input-no-checksum active 14845221 35913927 0 9.73e1 2.42 ip4-load-balance active 14845220 35913923 0 7.17e1 2.42 ip4-lookup active 14845221 35913927 0 9.03e1 2.42 ip4-rewrite active 14845221 35913927 0 8.97e1 2.42 unix-epoll-input polling 22551 0 0 1.37e3 0.00 I kind of wonder why that is. Is the Mellanox Connect-X3 such a poor performer? Or does it not like small packets? I\u0026rsquo;ve read online that Mellanox cards do some form of message compression on the PCI bus, something perhaps to turn off. I don\u0026rsquo;t know, but I don\u0026rsquo;t like it!\nMellanox Cx5 (25GbE) VPP has a few polling nodes, which are pieces of code that execute back-to-back in a tight execution loop. A classic example of a polling node is a Poll Mode Driver from DPDK: this will ask the network cards if they have any packets, and if so: marshall them through the directed graph of VPP. As soon as that\u0026rsquo;s done, the node will immediately ask again. If there is no work to do, this turns into a tight loop with DPDK continuously asking for work. There is however another, lesser known, polling node: unix-epoll-input. This node services a local pool of file descriptors, like the Linux Control Plane netlink socket for example, or the clients attached to the Statistics segment, CLI or API. You can see the open files with show unix files.\nThis design explains why the CPU load of a typical DPDK application is 100% of each worker thread. As an aside, you can ask the PMD to start off in interrupt mode, and only after a certain load switch seemlessly to polling mode. Take a look at set interface rx-mode on how to change from polling to interrupt or adaptive modes. For performance reasons, I always leave the node in polling mode (the default in VPP).\nThe stats segment shows how many clock cycles are being spent in each call of each node. It also knows how often nodes are called. Considering the unix-epoll-input and dpdk-input nodes will perform what is essentially a tight-loop, the CPU should always add up to 100%. 
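Those same per-node counters can also be read from outside VPP with the vpp_get_stats tool that ships with it. A sketch, assuming the counters still live under /sys/node/ in the stats segment as they did on the versions I have been using:
$ vpp_get_stats ls /sys/node                  # list the per-node counter names
$ vpp_get_stats dump /sys/node/clocks | head  # dump the per-node clock counters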
I found that one cool way to show how busy a VPP instance really is, is to look over all CPU threads, and sort through the fraction of time spent in each node:\nInput Nodes: are those which handle the receive path from DPDK and into the directed graph for routing \u0026ndash; for example ethernet-input, then ip4-input through to ip4-lookup and finally ip4-rewrite. This is where VPP usually spends most of its CPU cycles. Output Nodes: are those which handle the transmit path into DPDK. You\u0026rsquo;ll see these are nodes whose name ends in -output or -tx. You can also see that in U2, there are only two nodes consuming CPU, while in B2 there are four nodes (because two interfaces are transmitting!) epoll: the polling node called unix-epoll-input depicted in brown in this graph. dpdk: the polling node called dpdk-input depicted in green in this graph. If there is no work to do, as was the case at around 20:30 in the graph above, the dpdk and epoll nodes are the only two that are consuming CPU. If there\u0026rsquo;s lots of work to do, as was the case in the unidirectional 64b loadtest between 19:40-19:50, and the bidirectional 64b loadtest between 20:45-20:55, I can observe lots of other nodes doing meaningful work, ultimately starving the dpdk and epoll threads until an equilibrium is achieved. This is how I know the VPP process is the bottleneck and not, for example, the PCI bus.\nI let the eight loadtests run, and make note of the bits/sec and packets/sec for each, in this table for the Mellanox Cx5:\nMellanox MCX542_ACAT: Loadtest L2 bits/sec Packets/sec % of Line-Rate U1: Unidirectional 1514b 24.2Gbps 2.01Mpps 98.6% U2: Unidirectional 64b Multi 7.43Gbps 15.5Mpps 41.6% U3: Unidirectional 64b Single 3.52Gbps 7.34Mpps 19.7% U4: Unidirectional 64b MPLS 7.34Gbps 15.3Mpps 46.4% B1: Bidirectional 1514b 24.9Gbps 2.06Mpps 50.4% B2: Bidirectional 64b Multi 6.58Gbps 13.7Mpps 18.4% B3: Bidirectional 64b Single 3.15Gbps 6.55Mpps 8.81% B4: Bidirectional 64b MPLS 6.55Gbps 13.6Mpps 18.3% Some observations:\nThis Mellanox Cx5 runs quite a bit hotter than the other two cards. It\u0026rsquo;s a PCIe v3.0 which means that despite there only being 4 lanes to the OCP port, it can achieve 31.504 Gbit/s (in case you\u0026rsquo;re wondering, this is 128b/130b encoding on 8.0GT/s x4). It easily saturates 25Gbit in one direction with big packets in U1, but as soon as smaller packets are offered, each worker thread tops out at 7.34Mpps or so in U2. When testing in both directions, each thread can do about 6.55Mpps or so in B2. Similar to the other NICs, there is a clear slowdown due to CPU cache contention (when using multiple threads), and RX/TX simultaneously (when doing bidirectional tests). MPLS is a lot faster \u0026ndash; nearly double based on the use of multiple threads. I think this is because the Cx5 has a hardware hashing function for MPLS packets that looks at the inner payload to sort the traffic into multiple queues, while the Cx3 and Intel X520-DA2 do not. Summary and closing thoughts There\u0026rsquo;s a lot to say about these OCP cards. While the Intel is cheap, the Mellanox Cx3 is a bit quirky with its VPP enumeration, and the Mellanox Cx5 is a bit more expensive (and draws a fair bit more power, coming in at 20W), but it do 25Gbit reasonably, it\u0026rsquo;s pretty difficult to make a solid recommendation. 
What I find interesting is the very low limit in packets/sec on 64b packets coming from the Cx3, while at the same time there seems to be an added benefit in MPLS hashing that the other two cards do not have.\nAll things considered, I think I would recommend the Intel x520-DA2 (based on the Niantic chip, Intel 82599ES, total machine coming in at 17W). It seems like it pairs best with the available CPU on the machine. Maybe a Mellanox ConnectX-4 could be a good alternative though, hmmmm :)\nHere\u0026rsquo;s a few files I gathered along the way, in case they are useful:\n[LSCPU] - [Likwid Topology] - [DMI Decode] - [LSBLK] Mellanox Cx341: [dmesg] - [LSPCI] - [LSHW] - [VPP Patch] Mellanox Cx542: [dmesg] - [LSPCI] - [LSHW] Intel X520-DA2: [dmesg] - [LSPCI] - [LSHW] VPP Configs: [startup.conf] - [L2 Config] - [L3 Config] - [MPLS Config] ","date":"2024-07-05","desc":"Introduction I am always interested in finding new hardware that is capable of running VPP. Of course, a standard issue 19\u0026quot; rack mountable machine like a Dell, HPE or SuperMicro machine is an obvious choice. They come with redundant power supplies, PCIe v3.0 or better expansion slots, and can boot off of mSATA or NVME, with plenty of RAM. But for some people and in some locations, the power envelope or size/cost of these 19\u0026quot; rack mountable machines can be prohibitive. Sometimes, just having a smaller form factor can be very useful:\n","permalink":"https://ipng.ch/s/articles/2024/07/05/review-r86s-jasper-lake-n6005/","section":"articles","title":"Review: R86S (Jasper Lake - N6005)"},{"contents":" I have been a member of the Coloclue association in Amsterdam for a long time. This is a networking association in the social and technical sense of the word. [Coloclue] is based in Amsterdam with members throughout the Netherlands and Europe. Its goals are to facilitate learning about and operating IP based networks and services. It has about 225 members who, together, have built this network and deployed about 135 servers across 8 racks in 3 datacenters (Qupra, EUNetworks and NIKHEF). Coloclue is operating [AS8283] across several local and international internet exchange points.\nA small while ago, one of our members, Sebas, shared their setup with the membership. It generated a bit of a show-and-tell response, with Sebas and other folks on our mailinglist curious as to how we all deployed our stuff. My buddy Tim pinged me on Telegram: \u0026ldquo;This is something you should share for IPng as well!\u0026rdquo;, so this article is a bit different than my usual dabbles. It will be more of a show and tell: how did I deploy and configure the Amsterdam Chapter of IPng Networks?\nI\u0026rsquo;ll make this article a bit more picture-dense, to show the look-and-feel of the equipment.\nNetwork One thing that Coloclue and IPng Networks have in common is that we are networking clubs :) And readers of my articles may well know that I do so very much like writing about networking. During the Corona Pandemic, my buddy Fred asked \u0026ldquo;Hey you have this PI /24, why don\u0026rsquo;t you just announce it yourself? It\u0026rsquo;ll be fun!\u0026rdquo; \u0026ndash; and after resisting for a while, I finally decided to go for it. 
Fred owns a Swiss ISP called [IP-Max] and he was about to expand into Amsterdam, and in an epic roadtrip we deployed a point of presence for IPng Networks in each site where IP-Max has a point of presence.\nIn Amsterdam, I introduced Fred to Arend Brouwer from [ERITAP], and we deployed our stuff in a brand new rack he had acquired at NIKHEF. It was fun to be in an AirBnB, drive over to NIKHEF, and together with Arend move in to this completely new and empty rack in one of the most iconic internet places on the planet. I am deeply grateful for the opportunity.\nFor IP-Max, this deployment means a Nexus 3068PQ switch and an ASR9001 router, with one 10G wavelength towards Newtelco in Frankfurt, Germany, and another 10G wavelength towards ETIX in Lille, France. For IPng it means a Centec S5612X switch, and a Supermicro router. To the right you\u0026rsquo;ll see the network as it was deployed during that roadtrip - a ring of sites from Zurich, Frankfurt, Amsterdam, Lille, Paris, Geneva and back to Zurich. They are all identical in terms of hardware. Pictured to the right is our staging environment, in that AirBnB in Amsterdam: Fred\u0026rsquo;s Nexus and Cisco ASR9k, two of my Supermicro routers, and an APU which is used as an OOB access point.\nHardware Considering Coloclue is a computer and network association, lots of folks are interested in the physical bits. I\u0026rsquo;ll take some time to detail the hardware that I use for my network, focusing specifically on the Amsterdam sites.\nSwitches My switches are from a brand called Centec. They make their own switch silicon, and are known to be very power efficient, affordable cost per port, and what\u0026rsquo;s important for me is that the switches offer MPLS, VPLS and L2VPN, as well as VxLAN, GENEVE and GRE functionality, all in hardware.\nPictured on the right you can see two main Centec switch types that I use (in the red boxes above called IPng Site Local):\nCentec S5612X: 8x1G RJ45, 8x1G SFP, 12x10G SFP+, 2x40G QSFP+ and 8x25G SFP28 Centec S5548X: 48x1G RJ45, 2x40G QSFP+ and 4x25G SFP28 Centec S5624X: 24x10G SFP+ and 2x100G QSFP28 There are also bigger variants, such as the S7548N-8Z switch (48x25G, 8x100G, which is delicious), or S7800-32Z (32x100G, which is possibly even more yummy). Overall, I have very good experiences with these switches and the vendor ecosystem around them (optics, patch cables, WDM muxes, and so on).\nSandwidched between the switches, you\u0026rsquo;ll see some black Supermicro machines with 6x1G, 2x10G SFP+ and 2x 25G SFP28. I\u0026rsquo;ll detail them below, they are IPng\u0026rsquo;s default choice for low-power routers, as fully loaded they consume about 45W, and can forward 65Gbps and around 35Mpps or so, enough to kill a horse. And definitely enough for IPng!\nA photo album with a few pictures of the Centec switches, including their innards, lives [here]. In Amsterdam, I have one S5624X, which connects with 3x10G to IP-Max (one link to Lille, another to Frankfurt, and the third for local services off of IP-Max\u0026rsquo;s ASR9001). Pictured right is that Centec S5624X MPLS switch at NIKHEF, called msw0.nlams0.net.ipng.ch, which is where my Amsterdam story really begins: \u0026ldquo;Goed zat voor doordeweeks!\u0026rdquo;\nRouter The european ring that I built on my roadtrip with Fred consists of identical routers in each location. 
I was looking for a machine that has competent out of band operations with IPMI or iDRAC, could carry 32GB of ECC memory, had at least 4C/8T, and as many 10Gig ports as could realistically fit. I settled on the Supermicro 5018D-FN8T [ref], because of its relatively low power CPU (an Intel Xeon D-1518 at 35W TDP), ability to boot off of NVME or mSATA, PCIe v3.0 x8 expansion port, which carries an additional Intel X710-DA4 quad-tengig port.\nI\u0026rsquo;ve loadtested these routers extensively while I was working on the Linux Control Plane in VPP, and I can sustain full port speeds on all six TenGig ports, to a maximum of roughly 35Mpps of IPv4, IPv6 or MPLS traffic. Considering the machine, when fully loaded, will draw about 45 Watts, this is a very affordable and power efficient router. I love them!\nThe only one thing I\u0026rsquo;d change , is a second power supply. I personally have never had a PSU fail in any of the Supermicro routers I operate, but sometimes datacenters do need to take a feed offline, and that\u0026rsquo;s unfortunate if it causes an interruption. If I\u0026rsquo;d do it again, I would go for dual PSU, but I can\u0026rsquo;t complain either, as my router in NIKHEF has been running since 2021 without any power issues.\nI\u0026rsquo;m a huge fan of Supermicro\u0026rsquo;s IPMI design based on the ASpeed AST2400 BMC; it supports almost all IPMI features, notably serial-over-LAN, remote port off/cycle, remote KVM over HTML5, remote USB disk mount, remote install of operating systems, and requires no download of client software. It all just works in Firefox or Chrome or Safari \u0026ndash; and I\u0026rsquo;ve even reinstalled several routers remotely, as I described in my article [Debian on IPng\u0026rsquo;s VPP routers]. There\u0026rsquo;s just something magical about remote-mounting a Debian Bookworm iso image from my workstation in Brüttisellen, Switzerland, in a router running in Amsterdam, to then proceed to use KVM over HTML5 to reinstall the whole thing remotely. We didn\u0026rsquo;t have that, growing up!!\nHypervisors I host two machines at Coloclue. I started off way back in 2010 or so with one Dell R210-II. That machine still runs today, albeit in the Telehouse2 datacenter in Paris. At the end of 2022, I made a trip to Amsterdam to deploy three identical machines, all reasonably spec\u0026rsquo;d Dell PowerEdge R630 servers:\n8x32GB or 256GB Registered/Buffered (ECC) DDR4 2x Intel Xeon E5-2696 v4 (88 CPU threads in total) 1x LSI SAS3008 SAS12 controller 2x 500G Crucial MX500 (TLC) SSD 3x 3840G Seagate ST3840FM0003 (MLC) SAS12 Dell rNDC 2x Intel I350 (RJ45), 2x Intel X520 (10G SFP+) Once you go to enterprise storage, you will never want to go back. I take specific care to buy redundant boot drives, mostly Crucial MX500 because it\u0026rsquo;s TLC flash, and a bit more reliable. However, the MLC flash from these Seagate and HPE SAS-3 drives (12Gbps bus speeds) are next level. The Seagate 3.84TB drives are in a RAIDZ1 together, and read/write over them is a sustained 2.6GB/s and roughly 380Kops/sec per drive. It really makes the VMs on the hypervisor fly \u0026ndash; and at the same time has a much, much, much better durability and lifetime. Before I switched to enterprise storage, I would physically wear out a Samsung consumer SSd in about 12-15mo, and reads/writes would become unbearably slow over time. With these MLC based drives: no such problem. Ga hard!\nAll hypervisors run Debian Bookworm and have a dedicated iDRAC enterprise port + license. 
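Day to day, that mostly means ipmitool pointed at the iDRAC. A sketch, where the hostname and credentials are placeholders:
$ ipmitool -I lanplus -H idrac.hvn0-nlams1.example.net -U root -P secret chassis power status
$ ipmitool -I lanplus -H idrac.hvn0-nlams1.example.net -U root -P secret sol activate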
I find the Supermicro IPMI a little bit easier to work with, but the basic features are supported on the Dell as well: serial-over-LAN (which comes in super handy at Coloclue) and remote power on/off/cycle, power metering (using ipmitool sensors), and a clunky KVM over HTML if need be.\nColoclue: Routing Let\u0026rsquo;s dive into the Coloclue deployment; here\u0026rsquo;s an overview picture, with three colors: blue for Coloclue\u0026rsquo;s network components, red for IPng\u0026rsquo;s internal network, and orange for IPng\u0026rsquo;s public network services.\nColoclue currently has three main locations: Qupra, EUNetworks and NIKHEF. I\u0026rsquo;ve drawn the Coloclue network in blue. It\u0026rsquo;s pretty impressive, with a 10G wave between each of the locations. Within the two primary colocation sites (Qupra and EUNetworks), there are two core switches from Arista, which connect to top-of-rack switches from FS.com in an MLAG configuration. This means that each TOR is connected redundantly to both core switches with 10G. The switch in NIKHEF connects to a set of local internet exchanges, as well as to IP-Max, who deliver a bunch of remote IXPs to Coloclue, notably DE-CIX, FranceIX, and SwissIX. It is in NIKHEF where nikhef-core-1.switch.nl.coloclue.net (colored blue) connects to my msw0.nlams0.net.ipng.ch switch (in red). IPng Networks\u0026rsquo; European backbone then connects from here to Frankfurt and southbound onwards to Zurich, but it also connects from here to Lille and southbound onwards to Paris.\nIn the picture to the right you can see ERITAP\u0026rsquo;s rack R181 in NIKHEF, when it was \u0026hellip; younger. It did not take long for folks to request many cross connects (I myself already have a dozen or so, and I\u0026rsquo;m only one customer in this rack!)\nOne of the advantages of being a launching customer is that I got to see the rack when it was mostly empty. Here we can see Coloclue\u0026rsquo;s switch at the top (with the white flat ribbon RJ45 being my interconnect with it). Then there are two PC Engines APUs, which are IP-Max and IPng\u0026rsquo;s OOB serial machines. Then comes the ASR9001 called er01.zrh56.ip-max.net and under it the Nexus switch that IP-Max uses for its local customers (including Coloclue and IPng!).\nMy main router is all the way at the bottom of the picture, called nlams0.ipng.ch, one of those Supermicro D-1518 machines. It is connected with 2x10G in a LAG to the MPLS switch, and then to most of the available internet exchanges in NIKHEF. I also have two transit providers in Amsterdam: IP-Max (10Gbit) and A2B Internet (10Gbit).\npim@nlams0:~$ birdc show route count BIRD v2.15.1-4-g280daed5-x ready. 10735329 of 10735329 routes for 969311 networks in table master4 2469998 of 2469998 routes for 203707 networks in table master6 1852412 of 1852412 routes for 463103 networks in table t_roa4 438100 of 438100 routes for 109525 networks in table t_roa6 Total: 15495839 of 15495839 routes for 1745646 networks in 4 tables With the RIB at over 15M entries, I would say this site is very well connected!\nColoclue: Hypervisors I have one Dell R630 at Qupra (hvn0.nlams1.net.ipng.ch), one at EUNetworks (hvn0.nlams2.net.ipng.ch), and a third one with ERITAP at Equinix AM3 (hvn0.nlams3.net.ipng.ch). That last one is connected with a 10Gbit wavelength to IPng\u0026rsquo;s switch msw0.nlams0.net.ipng.ch, and another 10Gbit port to FrysIX.\nArend and I run a small internet exchange called [FrysIX].
I supply most of the services from that third hypervisor, which has a 10G connection to the local FrysIX switch at Equinix AM3. More recently, it became possible to request cross connects at Qupra, so I\u0026rsquo;ve put in a request to connect my hypervisor there to FrysIX with 10G as well - this will not be for peering purposes, but to be able to redundantly connect things like our routeservers, ixpmanager, librenms, sflow services, and so on. It\u0026rsquo;s nice to be able to have two hypervisors available, as it makes maintenance just that much easier.\nTurning my attention to the two hypervisors at Coloclue, one really cool feature that Coloclue offers, is an L2 VLAN connection from your colo server to the NIKHEF site over our 10G waves between the datacenters. I requested one of these in each site to NIKHEF using Coloclue\u0026rsquo;s VLAN 402 at Qupra, and VLAN 412 at EUNetworks. It is over these VLANs that I carry IPng Site Local to the hypervisors. I showed this in the overview diagram as an orange dashed line. I bridge Coloclue\u0026rsquo;s VLAN 105 (which is their eBGP VLAN which has loose uRPF filtering on the Coloclue routers) into a Q-in-Q transport towards NIKHEF. These two links are colored purple from EUnetworks and green from Qupra. Finally, I transport my own colocation VLAN to each site using another Q-in-Q transport with inner VLAN 100.\nThat may seem overly complicated, so let me describe these one by one:\n1. Colocation Connectivity:\nI will first create a bridge called coloclue, which I\u0026rsquo;ll give an MTU of 1500. I will add to that the port that connects to the Coloclue TOR switch, called eno4. However, I will give that port an MTU of 9216 as I will support jumbo frames on other VLANs later.\npim@hvn0-nlams1:~$ sudo ip link add coloclue type bridge pim@hvn0-nlams1:~$ sudo ip link set coloclue mtu 1500 up pim@hvn0-nlams1:~$ sudo ip link set eno4 mtu 9216 master coloclue up pim@hvn0-nlams1:~$ sudo ip addr add 94.142.244.54/24 dev coloclue pim@hvn0-nlams1:~$ sudo ip addr add 2a02:898::146:1/64 dev coloclue pim@hvn0-nlams1:~$ sudo ip route add default via 94.142.244.254 pim@hvn0-nlams1:~$ sudo ip route add default via 2a02:898::1 2. IPng Site Local:\nAll hypervisors at IPng are connected to a private network called IPng Site Local with IPv4 addresses from 198.19.0.0/16 and IPv6 addresses from 2001:678:d78:500::/56, both of which are not routed on the public Internet. I will give the hypervisor an address and a route towards IPng Site local like so:\npim@hvn0-nlams1:~$ sudo ip link add ipng-sl type bridge pim@hvn0-nlams1:~$ sudo ip link set ipng-sl mtu 9000 up pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.402 type vlan id 402 pim@hvn0-nlams1:~$ sudo ip link set link eno4.402 mtu 9216 master ipng-sl up pim@hvn0-nlams1:~$ sudo ip addr add 198.19.4.194/27 dev ipng-sl pim@hvn0-nlams1:~$ sudo ip addr add 2001:678:d78:509::2/64 dev ipng-sl pim@hvn0-nlams1:~$ sudo ip route add 198.19.0.0/16 via 198.19.4.193 pim@hvn0-nlams1:~$ sudo ip route add 2001:678:d78:500::/56 via 2001:678:d78:509::1 Note the MTU here. While the hypervisor is connected via 1500 bytes to the Coloclue network, it is connected with 9000 bytes to IPng Site local. On the other side of VLAN 402 lives the Centec switch, which is configured simply with a VLAN interface:\ninterface vlan402 description Infra: IPng Site Local (Qupra) mtu 9000 ip address 198.19.4.193/27 ipv6 address 2001:678:d78:509::1/64 ! 
interface vlan301 description Core: msw0.defra0.net.ipng.ch mtu 9028 label-switching ip address 198.19.2.13/31 ipv6 address 2001:678:d78:501::6:2/112 ip ospf network point-to-point ip ospf cost 73 ipv6 ospf network point-to-point ipv6 ospf cost 73 ipv6 router ospf area 0 enable-ldp ! interface vlan303 description Core: msw0.frggh0.net.ipng.ch mtu 9028 label-switching ip address 198.19.2.24/31 ipv6 address 2001:678:d78:501::c:1/112 ip ospf network point-to-point ip ospf cost 85 ipv6 ospf network point-to-point ipv6 ospf cost 85 ipv6 router ospf area 0 enable-ldp There are two other interfaces here: vlan301 towards the MPLS switch in Frankfurt Equinix FR5 and vlan303 towards the MPLS switch in Lille ETIX#2. I\u0026rsquo;ve configured those to enable OSPF, LDP and MPLS forwarding. As such, this network with hvn0.nlams1.net.ipng.ch becomes a leaf node with a /27 and /64 in IPng Site Local, in which I can run virtual machines and stuff.\nTraceroutes on this private underlay network are very pretty, using the net.ipng.ch domain, and entirely using silicon-based wirespeed routers with IPv4, IPv6 and MPLS and jumbo frames, never hitting the public Internet:\npim@hvn0-nlams1:~$ traceroute6 squanchy.net.ipng.ch 9000 traceroute to squanchy.net.ipng.ch (2001:678:d78:503::4), 30 hops max, 9000 byte packets 1 msw0.nlams0.net.ipng.ch (2001:678:d78:509::1) 1.116 ms 1.720 ms 2.369 ms 2 msw0.defra0.net.ipng.ch (2001:678:d78:501::6:1) 7.804 ms 7.812 ms 7.823 ms 3 msw0.chrma0.net.ipng.ch (2001:678:d78:501::5:1) 12.839 ms 13.498 ms 14.138 ms 4 msw1.chrma0.net.ipng.ch (2001:678:d78:501::11:2) 12.686 ms 13.363 ms 13.951 ms 5 msw0.chbtl0.net.ipng.ch (2001:678:d78:501::1) 13.446 ms 13.523 ms 13.683 ms 6 squanchy.net.ipng.ch (2001:678:d78:503::4) 12.890 ms 12.751 ms 12.767 ms 3. Coloclue BGP uplink:\nI make use of the IP transit offering of Coloclue. Coloclue has four routers in total: two in EUNetworks and two in Qupra, which I\u0026rsquo;ll show the configuration for here. I don\u0026rsquo;t take the transit session on the hypervisor, but rather I forward the traffic Layer2 to my VPP router called nlams0.ipng.ch over VLAN 402 purple and VLAN 412 green VLANs to NIKHEF. I\u0026rsquo;ll show the configuration for Qupra (VLAN 402) first:\npim@hvn0-nlams1:~$ sudo ip link add coloclue-bgp type bridge pim@hvn0-nlams1:~$ sudo ip link set coloclue-bgp mtu 1500 up pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.105 type vlan id 105 pim@hvn0-nlams1:~$ sudo ip link add link eno4.402 name eno4.402.105 type vlan id 105 pim@hvn0-nlams1:~$ sudo ip link set eno4.105 mtu 1500 master coloclue-bgp up pim@hvn0-nlams1:~$ sudo ip link set eno4.402.105 mtu 1500 master coloclue-bgp up These VLANs terminate on msw0.nlams0.net.ipng.ch where I just offer them directly to the VPP router:\ninterface eth-0-2 description Infra: nikhef-core-1.switch.nl.coloclue.net e1/34 switchport mode trunk switchport trunk allowed vlan add 402,412 switchport trunk allowed vlan remove 1 lldp disable ! interface eth-0-3 description Infra: nlams0.ipng.ch:Gi8/0/0 switchport mode trunk switchport trunk allowed vlan add 402,412 switchport trunk allowed vlan remove 1 4. IPng Services VLANs:\nI have one more thing to share. Up until now, the hypervisor has internal connectivity to IPng Site Local, and a single IPv4 / IPv6 address in the shared colocation network. Almost all VMs at IPng run entirely in IPng Site Local, and will use reversed proxies and other tricks to expose themselves to the internet. 
But, I also use a modest amount of IPv4 and IPv6 addresses on the VMs here, for example for those NGINX reversed proxies [ref], or my SMTP relays [ref].\nFor this purpose, I will need to plumb through some form of colocation VLAN in each site, which looks very similar to the BGP uplink VLAN I described previously:\npim@hvn0-nlams1:~$ sudo ip link add ipng type bridge pim@hvn0-nlams1:~$ sudo ip link set ipng mtu 9000 up pim@hvn0-nlams1:~$ sudo ip link add link eno4 name eno4.100 type vlan id 100 pim@hvn0-nlams1:~$ sudo ip link add link eno4.402 name eno4.402.100 type vlan id 100 pim@hvn0-nlams1:~$ sudo ip link set eno4.100 mtu 9000 master ipng up pim@hvn0-nlams1:~$ sudo ip link set eno4.402.100 mtu 9000 master ipng up Looking at the VPP router, it picks up these two VLANs 402 and 412, which are used for IPng Site Local. On top of those, the router will add two Q-in-Q VLANs: 402.105 will be the BGP uplink, and Q-in-Q 402.100 will be the IPv4 space assigned to IPng:\ninterfaces: GigabitEthernet8/0/0: device-type: dpdk description: \u0026#39;Infra: msw0.nlams0.ipng.ch:eth-0-3\u0026#39; lcp: e0-1 mac: \u0026#39;3c:ec:ef:46:65:97\u0026#39; mtu: 9216 sub-interfaces: 402: description: \u0026#39;Infra: VLAN to Qupra\u0026#39; lcp: e0-0.402 mtu: 9000 412: description: \u0026#39;Infra: VLAN to EUNetworks\u0026#39; lcp: e0-0.412 mtu: 9000 402100: description: \u0026#39;Infra: hvn0.nlams1.ipng.ch\u0026#39; addresses: [\u0026#39;94.142.241.184/32\u0026#39;, \u0026#39;2a02:898:146::1/64\u0026#39;] lcp: e0-0.402.100 mtu: 9000 encapsulation: dot1q: 402 exact-match: True inner-dot1q: 100 402105: description: \u0026#39;Transit: Coloclue (urpf-shared-vlan Qupra)\u0026#39; addresses: [\u0026#39;185.52.225.34/28\u0026#39;, \u0026#39;2a02:898:0:1::146:1/64\u0026#39;] lcp: e0-0.402.105 mtu: 1500 encapsulation: dot1q: 402 exact-match: True inner-dot1q: 105 Using BGP, my AS8298 will announce my own prefixes and two /29s that Coloclue has assigned to me. One of them is 94.142.241.184/29 in Qupra, and the other is 94.142.245.80/29 in EUNetworks. But, I don\u0026rsquo;t like wasting IP space, so I assign only the first /32 from that range to the interface, and use Bird2 to set a route for the other 7 addresses into the interface, which will allow me to use all eight addresses!\npim@border0-nlams3:~$ traceroute nginx0.nlams1.ipng.ch traceroute to nginx0.nlams1.ipng.ch (94.142.241.189), 30 hops max, 60 byte packets 1 ipmax.nlams0.ipng.ch (46.20.243.177) 1.190 ms 1.102 ms 1.101 ms 2 speed-ix.coloclue.net (185.1.222.16) 0.448 ms 0.405 ms 0.361 ms 3 nlams0.ipng.ch (185.52.225.34) 0.461 ms 0.461 ms 0.382 ms 4 nginx0.nlams1.ipng.ch (94.142.241.189) 1.084 ms 1.042 ms 1.004 ms pim@border0-nlams3:~$ traceroute smtp.nlams2.ipng.ch traceroute to smtp.nlams2.ipng.ch (94.142.245.85), 30 hops max, 60 byte packets 1 ipmax.nlams0.ipng.ch (46.20.243.177) 2.842 ms 2.743 ms 3.264 ms 2 speed-ix.coloclue.net (185.1.222.16) 0.383 ms 0.338 ms 0.338 ms 3 nlams0.ipng.ch (185.52.225.34) 0.372 ms 0.365 ms 0.304 ms 4 smtp.nlams2.ipng.ch (94.142.245.85) 1.042 ms 1.000 ms 0.959 ms Coloclue: Services I run a bunch of services on these hypervisors. Some are for me personally, or for my company IPng Networks GmbH, and some are for community projects. Let me list a few things here:\nAS112 Services I run an anycasted AS112 cluster in all sites where IPng has hypervisor capacity. Notably in Amsterdam, my nodes are running on both Qupra and EUNetworks, and connect to LSIX, SpeedIX, FogIXP, FrysIX and behind AS8283 and AS8298.
The nodes here handle roughly 5kqps at peak, and if RIPE NCC\u0026rsquo;s node in Amsterdam goes down, this can go up to 13kqps (right, WEiRD?). I described the setup in an [article]. You may be wondering: how do I get those internet exchanges backhauled to a VM at Coloclue? The answer is: VxLAN transport! Here\u0026rsquo;s a relevant snippet from the nlams0.ipng.ch router config:\nvxlan_tunnels: vxlan_tunnel1: local: 94.142.241.184 remote: 94.142.241.187 vni: 11201 interfaces: TenGigabitEthernet4/0/0: device-type: dpdk description: \u0026#39;Infra: msw0.nlams0:eth-0-9\u0026#39; lcp: xe0-0 mac: \u0026#39;3c:ec:ef:46:68:a8\u0026#39; mtu: 9216 sub-interfaces: 112: description: \u0026#39;Peering: LSIX for AS112\u0026#39; l2xc: vxlan_tunnel1 mtu: 1522 vxlan_tunnel1: description: \u0026#39;Infra: AS112 LSIX\u0026#39; l2xc: TenGigabitEthernet4/0/0.112 mtu: 1522 And the Centec switch config:\nvlan database vlan 112 name v-lsix-as112 mac learning disable interface eth-0-5 description Infra: LSIX AS112 switchport access vlan 112 interface eth-0-9 description Infra: nlams0.ipng.ch:Te4/0/0 switchport mode trunk switchport trunk allowed vlan add 100,101,110-112,302,311,312,501-503,2604 switchport trunk allowed vlan remove 1 What happens is: LSIX connects the AS112 port to the Centec switch on eth-0-5, which offers it tagged to Te4/0/0.112 on the VPP router, without wasting CAM space for the MAC addresses (by turning off MAC learning \u0026ndash; this is possible because there are only 2 ports in the VLAN, so the switch implicitly always knows where to forward the frames!).\nAfter sending it out on eth-0-9 tagged as VLAN 112, VPP in turn encapsulates it with VxLAN and sends it as VNI 11201 to remote endpoint 94.142.241.187. Because that path has an MTU of 9000, the traffic arrives at the VM with 1500b, no worries. Most of my AS112 traffic arrives at a VM this way, as it\u0026rsquo;s really easy to flip the remote endpoint of the VxLAN tunnel to another replica in case of an outage or maintenance. Typically, BGP sessions won\u0026rsquo;t even notice.\nNGINX Frontends At IPng, almost everything runs in the internal network called IPng Site Local. I expose this network via a few carefully placed NGINX frontends. There are two in my own network (in Geneva and Zurich), and one in IP-Max\u0026rsquo;s network (in Zurich), and two at Coloclue (in Amsterdam). They frontend and do SSL offloading and TCP loadbalancing for a variety of websites and services. I described the architecture and design in an [article]. There are currently ~120 or so websites frontended on this cluster.\nSMTP Relays I self-host my mail, and I tried to make a fully redundant and self-repairing SMTP in- and outbound with Postfix, IMAP server and redundant maildrop storage with Dovecot, a webmail service with Roundcube, and so on. Because I need to perform DNSBL lookups, this requires routable IPv4 and IPv6 addresses. Two of my four mailservers run at Coloclue, which I described in an [article].\nMailman Service For FrysIX, FreeIX, and IPng itself, I run a set of mailing lists. The mailman service runs partially in IPng Site Local, and has one IPv4 address for outbound e-mail. I separated this from the IPng relays so that IP based reputation does not interfere between these two types of mailservice.\nFrysIX Services The routeserver rs2.frys-ix.net, the authoritative nameserver ns2.frys-ix.net, the IXPManager and LibreNMS monitoring service all run on hypervisors at either Coloclue (Qupra) or ERITAP (Equinix AM3).
By the way, remember the part about the enterprise storage? The ixpmanager is currently running on hvn0.nlams3.net.ipng.ch which has a set of three Samsung EVO consumer SSDs, which are really at the end of their life. Please, can I connect to FrysIX from Qupra so I can move these VMs to the Seagate SAS-3 MLC storage pool? :)\nIPng OpenBSD Bastion Hosts IPng Networks has three OpenBSD bastion jumphosts with an open SSH port 22, which are named after characters from a TV show called Rick and Morty. Squanchy lives in my house on hvn0.chbtl0.net.ipng.ch, Glootie lives at IP-Max on hvn0.chrma0.net.ipng.ch, and Pencilvester lives on a hypervisor at Coloclue on hvn0.nlams1.net.ipng.ch. These bastion hosts connect both to the public internet and to the IPng Site Local network. As such, if I have SSH access, I will also have access to the internal network of IPng.\nIPng Border Gateways The internal network of IPng is mostly disconnected from the Internet. Although I can log in via these bastion hosts, I also have a set of four so-called Border Gateways, which are connected both to the IPng Site Local network and to the Internet. Each of them runs an IPv4 and IPv6 WireGuard endpoint, and I\u0026rsquo;m pretty much always connected with these. It allows me full access to the internal network, and NAT\u0026rsquo;ed access towards the Internet.\nEach border gateway announces a default route towards the Centec switches, and connects to AS8298, AS8283 and AS25091 for internet connectivity. One of them runs in Amsterdam, and I wrote about these gateways in an [article].\nPublic NAT64/DNS64 Gateways I operate a set of four private NAT64/DNS64 gateways, one of which is in Amsterdam. It pairs up and complements the WireGuard and NAT44/NAT66 functionality of the Border Gateways. Because NAT64 is useful in general, I also operate two public NAT64/DNS64 gateways, one at Qupra and one at EUNetworks. You can try them for yourself by using the following anycasted resolver: 2a02:898:146:64::64 and performing a traceroute to an IPv4-only host, like github.com. Note: this works from anywhere, but for safety reasons, I filter some ports like SMTP, NETBIOS and so on, roughly the same way a Tor exit router would.
I wrote about them in an [article].\npim@cons0-nlams0:~$ cat /etc/resolv.conf # *** Managed by IPng Ansible *** # domain ipng.ch search net.ipng.ch ipng.ch nameserver 2a02:898:146:64::64 pim@cons0-nlams0:~$ traceroute6 -q1 ipv4.tlund.se traceroute to ipv4c.tlund.se (2a02:898:146:64::c10f:e4c3), 30 hops max, 80 byte packets 1 2a10:e300:26:48::1 (2a10:e300:26:48::1) 0.221 ms 2 as8283.ix.frl (2001:7f8:10f::205b:187) 0.443 ms 3 hvn0.nlams1.ipng.ch (2a02:898::146:1) 0.866 ms 4 bond0-100.dc5-1.router.nl.coloclue.net (2a02:898:146:64::5e8e:f4fc) 0.900 ms 5 bond0-130.eunetworks-2.router.nl.coloclue.net (2a02:898:146:64::5e8e:f7f2) 0.920 ms 6 ams13-peer-1.hundredgige2-3-0.tele2.net (2a02:898:146:64::50f9:d18b) 2.302 ms 7 ams13-agg-1.bundle-ether4.tele2.net (2a02:898:146:64::5b81:e1e) 22.760 ms 8 gbg-cagg-1.bundle-ether7.tele2.net (2a02:898:146:64::5b81:ef8) 22.983 ms 9 bck3-core-1.bundle-ether6.tele2.net (2a02:898:146:64::5b81:c74) 22.295 ms 10 lba5-core-2.bundle-ether2.tele2.net (2a02:898:146:64::5b81:c2f) 21.951 ms 11 avk-core-2.bundle-ether9.tele2.net (2a02:898:146:64::5b81:c24) 21.760 ms 12 avk-cagg-1.bundle-ether4.tele2.net (2a02:898:146:64::5b81:c0d) 22.602 ms 13 skst123-lgw-2.bundle-ether50.tele2.net (2a02:898:146:64::5b81:e23) 21.553 ms 14 skst123-pe-1.gigabiteth0-2.tele2.net (2a02:898:146:64::82f4:5045) 21.336 ms 15 2a02:898:146:64::c10f:e4c3 (2a02:898:146:64::c10f:e4c3) 21.722 ms Thanks for reading This article is a bit different to my usual writing - it doesn\u0026rsquo;t deep dive into any protocol or code that I\u0026rsquo;ve written, but it does describe a good chunk of the way I think about systems and networking. I appreciate the opportunities that Coloclue as a networking community and hobby club affords. I\u0026rsquo;m always happy to talk about routing, network- and systems engineering, and the stuff I develop at IPng Networks, notably our VPP routing stack. I encourage folks to become a member and learn about and develop novel approaches to this thing we call the Internet.\nOh, and if you\u0026rsquo;re a Coloclue member looking for a secondary location, IPng offers colocation and hosting services in Zurich, Geneva, and soon in Lucerne as well :) Houdoe!\n","date":"2024-06-29","desc":" I have been a member of the Coloclue association in Amsterdam for a long time. This is a networking association in the social and technical sense of the word. [Coloclue] is based in Amsterdam with members throughout the Netherlands and Europe. Its goals are to facilitate learning about and operating IP based networks and services. It has about 225 members who, together, have built this network and deployed about 135 servers across 8 racks in 3 datacenters (Qupra, EUNetworks and NIKHEF). Coloclue is operating [AS8283] across several local and international internet exchange points.\n","permalink":"https://ipng.ch/s/articles/2024/06/29/case-study-ipng-at-coloclue/","section":"articles","title":"Case Study: IPng at Coloclue"},{"contents":" Introduction When I first built IPng Networks AS8298, I decided to use OSPF as an IPv4 and IPv6 internal gateway protocol. Back in March I took a look at two slightly different ways of doing this for IPng, notably against a backdrop of conserving IPv4 addresses.
As the network grows, the little point to point transit networks between routers really start adding up.\nI explored two potential solutions to this problem:\n[Babel] can use IPv6 nexthops for IPv4 destinations - which is super useful because it would allow me to retire all of the IPv4 /31 point to point networks between my routers. [OSPFv3] makes it difficult to use IPv6 nexthops for IPv4 destinations, but in a discussion with the Bird Users mailinglist, we found a way: by reusing a single IPv4 loopback address on adjacent interfaces. In May I ran a modest set of two canaries, one between the two routers in my house (chbtl0 and chbtl1), and another between a router at the Daedalean colocation and Interxion datacenters (ddln0 and chgtg0). AS8298 has about a quarter of a /24 tied up in these otherwise pointless point-to-point transit networks (see what I did there?). I want to reclaim these!\nSeeing as the two tests went well, I decided to roll this out and make it official. This post describes how I rolled out an (almost) IPv4-less core network for IPng Networks. It was actually way easier than I had anticipated, and apparently I was not alone - several of my buddies in the industry have asked me about it, so I thought I\u0026rsquo;d write a little bit about the configuration.\nBackground: OSPFv3 with IPv4 💩 /30: 4 addresses: In the oldest of days, two routers that formed an IPv4 OSPF adjacency would have a /30 point-to-point transit network between them. Router A would have the lower available IPv4 address, and Router B would have the upper available IPv4 address. The other two addresses in the /30 would be the network and broadcast addresses of the prefix. Not a very efficient way to do things, but back in the old days, IPv4 addresses were in infinite supply.\n🥈 /31: 2 addresses: Enter [RFC3021], from December 2000, which some might argue is also the old days. With ever-increasing pressure to conserve IP address space on the Internet, it makes sense to consider where relatively minor changes can be made to fielded practice to improve numbering efficiency. This RFC describes how to halve the amount of address space assigned to point-to-point links (common throughout the Internet infrastructure) by allowing the use of /31 prefixes for them. At some point, even our friends from Latvia figured it out!\n🥇 /32: 1 address: In most networks, each router has what is called a loopback IPv4 and IPv6 address, typically a /32 and /128 in size. This allows the router to select a unique address that is not bound to any given interface. It comes in handy in many ways \u0026ndash; for example to have stable addresses to manage the router, and to allow it to connect to iBGP route reflectors and peers from well known addresses.\nAs it so turns out, two routers that form an adjacency can advertise ~any IPv4 address as nexthop, provided that their adjacent peer knows how to find that address. Of course, with a /30 or /31 this is obvious: if I have a directly connected /31, I can simply ARP for the other side, learn its MAC address, and use that to forward traffic to the other router.\nThe Trick What would it look like if there\u0026rsquo;s no subnet that directly connects two adjacent routers? Well, I happen to know that RouterA and RouterB both have a /32 loopback address. So if I simply let RouterA (1) advertise its loopback address to neighbor RouterB, and also (2) answer ARP requests for that address, the two routers should be able to form an adjacency.
This is exactly what Ondrej\u0026rsquo;s [Bird2 commit (1)] and my [VPP gerrit (2)] accomplish, as perfect partners:\nOndrej\u0026rsquo;s change will make the Link LSA be onlink, which is a way to describe that the next hop is not directly connected, in other words, RouterB will be at nexthop 192.0.2.1, while RouterA itself is 192.0.2.0/32. My change will make VPP answer for ARP requests in such a scenario where RouterA with an unnumbered interface with 192.0.2.0/32 will respond to a request from the not directly connected onlink peer RouterB at 192.0.2.1. Rolling out P2P-less OSPFv3 1. Upgrade VPP + Bird2 First order of business is to upgrade all routers. I need a VPP version with the [ARP gerrit] and a Bird2 version with the [OSPFv3 commit]. I build a set of Debian packages on bookworm-builder and upload them to IPng\u0026rsquo;s website [ref].\nI schedule two nightly maintenance windows. In the first one, I\u0026rsquo;ll upgrade two routers (frggh0 and ddln1) as canaries. I\u0026rsquo;ll let them run for a few days, and then wheel over the rest after I\u0026rsquo;m confident there are no regressions.\nFor each router, I will first drain it: this means in Kees, setting the OSPFv2 and OSPFv3 cost of routers neighboring it to a higher number, so that traffic flows around the \u0026rsquo;expensive\u0026rsquo; link. I will also move the eBGP sessions into shutdown mode, which will make the BGP sessions stay connected, but the router will not announce any prefixes nor accept any from peers. Without it announcing or learning any prefixes, the router stops seeing traffic. After about 10 minutes, it is safe to make intrusive changes to it.\nSeeing as I\u0026rsquo;ll be moving from OSPFv2 to OSPFv3, I will allow for a seamless transition by configuring both protocols to run at the same time. The filter that applies to both flavors of OSPF is the same: I will only allow more specifics of IPng\u0026rsquo;s own prefixes to be propagated, and in particular I\u0026rsquo;ll drop all prefixes that come from BGP. I\u0026rsquo;ll rename the protocol called ospf4 to ospf4_old, and create a new (OSPFv3) protocol called ospf4 which has only the loopback interface in it. This way, when I\u0026rsquo;m done, the final running protocol will simply be called ospf4:\nfilter f_ospf { if (source = RTS_BGP) then reject; if (net ~ [ 92.119.38.0/24{25,32}, 194.1.163.0/24{25,32}, 194.126.235.0/24{25,32} ]) then accept; if (net ~ [ 2001:678:d78::/48{56,128}, 2a0b:dd80:3000::/36{48,48} ]) then accept; reject; } protocol ospf v2 ospf4_old { ipv4 { export filter f_ospf; import filter f_ospf; }; area 0 { interface \u0026#34;loop0\u0026#34; { stub yes; }; interface \u0026#34;xe1-1.302\u0026#34; { type pointopoint; cost 61; bfd on; }; interface \u0026#34;xe1-0.304\u0026#34; { type pointopoint; cost 56; bfd on; }; }; } protocol ospf v3 ospf4 { ipv4 { export filter f_ospf; import filter f_ospf; }; area 0 { interface \u0026#34;loop0\u0026#34;,\u0026#34;lo\u0026#34; { stub yes; }; }; } In one terminal, I will start a ping to the router\u0026rsquo;s IPv4 loopback:\npim@summer:~$ ping defra0.ipng.ch PING (194.1.163.7) 56(84) bytes of data. 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms ...
While in the other, I log in to the IPng Site Local connection to the router\u0026rsquo;s management plane, to perform the upgrade:\npim@squanchy:~$ ssh defra0.net.ipng.ch pim@defra0:~$ wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/ pim@defra0:~$ cd ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/ pim@defra0:~$ sudo nsenter --net=/var/run/netns/dataplane root@defra0:~# pkill -9 vpp \u0026amp;\u0026amp; systemctl stop bird-dataplane vpp \u0026amp;\u0026amp; \\ dpkg -i ~pim/ipng.ch/media/vpp/bookworm/24.06-rc0~183-gb0d433978/*.deb \u0026amp;\u0026amp; \\ dpkg -i ~pim/bird2_2.15.1_amd64.deb \u0026amp;\u0026amp; \\ systemctl start bird-dataplane \u0026amp;\u0026amp; \\ systemctl restart vpp-snmp-agent-dataplane vpp-exporter-dataplane Then comes the small window of awkward staring at the ping I started in the other terminal. It always makes me smile because it all comes back very quickly: within 90 seconds the router is back online and fully converged with BGP:\npim@summer:~$ ping defra0.ipng.ch PING (194.1.163.7) 56(84) bytes of data. 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=1 ttl=61 time=6.94 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=2 ttl=61 time=7.00 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=3 ttl=61 time=7.03 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=4 ttl=61 time=7.03 ms ... 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=94 ttl=61 time=1003.83 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=95 ttl=61 time=7.03 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=96 ttl=61 time=7.02 ms 64 bytes from defra0.ipng.ch (194.1.163.7): icmp_seq=97 ttl=61 time=7.03 ms pim@defra0:~$ birdc show ospf nei BIRD v2.15.1-4-g280daed5-x ready. ospf4_old: Router ID Pri\tState DTime\tInterface Router IP 194.1.163.8 1\tFull/PtP 32.113\txe1-1.302 194.1.163.27 194.1.163.0 1\tFull/PtP 30.936\txe1-0.304 194.1.163.24 ospf4: Router ID Pri\tState DTime\tInterface Router IP ospf6: Router ID Pri\tState DTime\tInterface Router IP 194.1.163.8 1\tFull/PtP 32.113\txe1-1.302 fe80::3eec:efff:fe46:68a8 194.1.163.0 1\tFull/PtP 30.936\txe1-0.304 fe80::6a05:caff:fe32:4616 I can see that the OSPFv2 adjacencies have reformed, which is totally expected.
Looking at the router\u0026rsquo;s current addresses:\npim@defra0:~$ ip -br a | grep UP loop0 UP 194.1.163.7/32 2001:678:d78::7/128 fe80::dcad:ff:fe00:0/64 xe1-0 UP fe80::6a05:caff:fe32:3e48/64 xe1-1 UP fe80::6a05:caff:fe32:3e49/64 xe1-2 UP fe80::6a05:caff:fe32:3e4a/64 xe1-3 UP fe80::6a05:caff:fe32:3e4b/64 xe1-0.304@xe1-0 UP 194.1.163.25/31 2001:678:d78::2:7:2/112 fe80::6a05:caff:fe32:3e48/64 xe1-1.302@xe1-1 UP 194.1.163.26/31 2001:678:d78::2:8:1/112 fe80::6a05:caff:fe32:3e49/64 xe1-2.441@xe1-2 UP 46.20.246.51/29 2a02:2528:ff01::3/64 fe80::6a05:caff:fe32:3e4a/64 xe1-2.503@xe1-2 UP 80.81.197.38/21 2001:7f8::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 xe1-2.514@xe1-2 UP 185.1.210.235/23 2001:7f8:3d::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 xe1-2.515@xe1-2 UP 185.1.208.84/23 2001:7f8:44::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 xe1-2.516@xe1-2 UP 185.1.171.43/23 2001:7f8:9e::206a:0:1/64 fe80::6a05:caff:fe32:3e4a/64 xe1-3.900@xe1-3 UP 193.189.83.55/23 2001:7f8:33::a100:8298:1/64 fe80::6a05:caff:fe32:3e4b/64 xe1-3.2003@xe1-3 UP 185.1.155.116/24 2a0c:b641:701::8298:1/64 fe80::6a05:caff:fe32:3e4b/64 xe1-3.3145@xe1-3 UP 185.1.167.136/23 2001:7f8:f2:e1::8298:1/64 fe80::6a05:caff:fe32:3e4b/64 xe1-3.1405@xe1-3 UP 80.77.16.214/30 2a00:f820:839::2/64 fe80::6a05:caff:fe32:3e4b/64 Take a look at interfaces xe1-0.304 which is southbound from Frankfurt to Zurich (chrma0.ipng.ch) and xe1-1.302 which is northbound from Frankfurt to Amsterdam (nlams0.ipng.ch). I am going to get rid of the IPv4 and IPv6 global unicast addresses on these two interfaces, and let OSPFv3 borrow the IPv4 address from loop0 instead.\nBut first, rinse and repeat, until all routers are upgraded.\n2. A situational overview First, let me draw a diagram that helps show what I\u0026rsquo;m about to do:\nIn the network overview I\u0026rsquo;ve drawn four of IPng\u0026rsquo;s routers. The ones at the bottom are the two routers at my office in Brüttisellen, Switzerland, which explains their name chbtl0 and chbtl1, and they are connected via a local fiber trunk using 10Gig optics (drawn in red). On the left, the first router is connected via a 10G Ethernet-over-MPLS link (depicted in green) to the NTT Datacenter in Rümlang. From there, IPng rents a 25Gbps wavelength to the Interxion datacenter in Glattbrugg (shown in blue). Finally, the Interxion router connects back to Brüttisellen using a 10G Ethernet-over-MPLS link (colored in pink), completing the ring.\nYou can also see that each router has a set of loopback addresses, for example chbtl0 in the bottom left has IPv4 address 194.1.163.3/32 and IPv6 address 2001:678:d78::3/128. Each point to point network has assigned one /31 and one /112 with each router taking one address at either side. Counting them up real quick, I see twelve IPv4 addresses in this diagram. This is a classic OSPF design pattern. I seek to save eight of these addresses!\n3. First OSPFv3 link The rollout has to start somewhere, and I decide to start close to home, literally. I\u0026rsquo;m going to remove the IPv4 and IPv6 addresses from the red link between the two routers in Brüttisellen. They are directly connected, and if anything goes wrong, I can walk over and rescue them. Sounds like a safe way to start!\nI quickly add the ability for [vppcfg] to configure unnumbered interfaces. In VPP, these are interfaces that don\u0026rsquo;t have an IPv4 or IPv6 address of their own, but they borrow one from another interface. 
If you\u0026rsquo;re curious, you can take a look at the [User Guide] on GitHub.\nLooking at their vppcfg files, the change is actually very easy, taking as an example the configuration file for chbtl0.ipng.ch:\nloopbacks: loop0: description: \u0026#39;Core: chbtl1.ipng.ch\u0026#39; addresses: [\u0026#39;194.1.163.3/32\u0026#39;, \u0026#39;2001:678:d78::3/128\u0026#39;] lcp: loop0 mtu: 9000 interfaces: TenGigabitEthernet6/0/0: device-type: dpdk description: \u0026#39;Core: chbtl1.ipng.ch\u0026#39; mtu: 9000 lcp: xe1-0 # addresses: [ \u0026#39;194.1.163.20/31\u0026#39;, \u0026#39;2001:678:d78::2:5:1/112\u0026#39; ] unnumbered: loop0 By commenting out the addresses field, and replacing it with unnumbered: loop0, I instruct vppcfg to make Te6/0/0, which in Linux is called xe1-0, borrow its addresses from the loopback interface loop0.\nPlanning and applying this is straight forward, but there\u0026rsquo;s one detail I should mention. In my [previous article] I asked myself a question: would it be better to leave the addresses unconfigured in Linux, or would it be better to make the Linux Control Plane plugin carry forward the borrowed addresses? In the end, I decided to not copy them forward. VPP will be aware of the addresses, but Linux will only carry them on the loop0 interface.\nIn the article, you\u0026rsquo;ll see that discussed as Solution 2, and it includes a bit of rationale why I find this better. I implemented it in this [commit], in case you\u0026rsquo;re curious, and the commandline keyword is lcp lcp-sync-unnumbered off (the default is on).\npim@chbtl0:~$ vppcfg plan -c /etc/vpp/vppcfg.yaml [INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 24.06-rc0~183-gb0d433978 comment { vppcfg prune: 2 CLI statement(s) follow } set interface ip address del TenGigabitEthernet6/0/0 194.1.163.20/31 set interface ip address del TenGigabitEthernet6/0/0 2001:678:d78::2:5:1/112 comment { vppcfg sync: 1 CLI statement(s) follow } set interface unnumbered TenGigabitEthernet6/0/0 use loop0 [INFO ] vppcfg.reconciler.write: Wrote 5 lines to (stdout) [INFO ] root.main: Planning succeeded pim@chbtl0:~$ vppcfg show int addr TenGigabitEthernet6/0/0 TenGigabitEthernet6/0/0 (up): unnumbered, use loop0 L3 194.1.163.3/32 L3 2001:678:d78::3/128 pim@chbtl0:~$ vppctl show lcp | grep TenGigabitEthernet6/0/0 itf-pair: [9] TenGigabitEthernet6/0/0 tap9 xe1-0 65 type tap netns dataplane pim@chbtl0:~$ ip -br a | grep UP xe0-0 UP fe80::92e2:baff:fe3f:cad4/64 xe0-1 UP fe80::92e2:baff:fe3f:cad5/64 xe0-1.400@xe0-1 UP fe80::92e2:baff:fe3f:cad4/64 xe0-1.400.10@xe0-1.400 UP 194.1.163.16/31 2001:678:d78:2:3:1/112 fe80::92e2:baff:fe3f:cad4/64 xe1-0 UP fe80::21b:21ff:fe55:1dbc/64 xe1-1.101@xe1-1 UP 194.1.163.65/27 2001:678:d78:3::1/64 fe80::14b4:c6ff:fe1e:68a3/64 xe1-1.179@xe1-1 UP 45.129.224.236/29 2a0e:5040:0:2::236/64 fe80::92e2:baff:fe3f:cad5/64 After applying this configuration, I can see that Te6/0/0 indeed is unnumbered, use loop0 noting the IPv4 and IPv6 addresses that it borrowed. I can see with the second command that Te6/0/0 corresponds in Linux with xe1-0, and finally with the third command I can list the addresses of the Linux view, and indeed I confirm that xe1-0 only has a link local address. Slick!\nAfter applying this change, the OSPFv2 adjacency in the ospf4_old protocol expires, and I see the routing table converge. 
A traceroute between chbtl0 and chbtl1 now takes a bit of a detour:\npim@chbtl0:~$ traceroute chbtl1.ipng.ch traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets 1 chrma0.ipng.ch (194.1.163.17) 0.981 ms 0.969 ms 0.953 ms 2 chgtg0.ipng.ch (194.1.163.9) 1.194 ms 1.192 ms 1.176 ms 3 chbtl1.ipng.ch (194.1.163.4) 1.875 ms 1.866 ms 1.911 ms I can now introduce the very first OSPFv3 adjacency for IPv4, and I do this by moving the neighbor from the ospf4_old protocol to the ospf4 protocol. Of course, I also update chbtl1 with the unnumbered interface on its xe1-0, and update OSPF there. And with that, something magical happens:\npim@chbtl0:~$ birdc show ospf nei BIRD v2.15.1-4-g280daed5-x ready. ospf4_old: Router ID Pri State DTime Interface Router IP 194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c ospf4: Router ID Pri State DTime Interface Router IP 194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18 ospf6: Router ID Pri State DTime Interface Router IP 194.1.163.4 1 Full/PtP 31.955 xe1-0 fe80::9e69:b4ff:fe61:ff18 194.1.163.0 1 Full/PtP 30.571 xe0-1.400.10 fe80::266e:96ff:fe37:934c pim@chbtl0:~$ birdc show route protocol ospf4 BIRD v2.15.1-4-g280daed5-x ready. Table master4: 194.1.163.4/32 unicast [ospf4 2024-05-19 20:58:04] * I (150/2) [194.1.163.4] via 194.1.163.4 on xe1-0 onlink 194.1.163.64/27 unicast [ospf4 2024-05-19 20:58:04] E2 (150/2/10000) [194.1.163.4] via 194.1.163.4 on xe1-0 onlink Aww, would you look at that! Especially the first entry is interesting to me. It says that this router has learned the address 194.1.163.4/32, the loopback address of chbtl1, via nexthop also 194.1.163.4 on interface xe1-0 with the onlink flag.\nThe kernel routing table agrees with this construction:\npim@chbtl0:~$ ip ro get 194.1.163.4 194.1.163.4 via 194.1.163.4 dev xe1-0 src 194.1.163.3 uid 1000 cache Now, what this construction tells the kernel, is that it should ARP for 194.1.163.4 using local address 194.1.163.3, for which VPP on the other side will respond, thanks to my [VPP ARP gerrit]. As such, I should now expect a FIB entry in VPP:\npim@chbtl0:~$ vppctl show ip fib 194.1.163.4 ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] 194.1.163.4/32 fib:0 index:973099 locks:3 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[189] locks:98 flags:shared,popular, uPRF-list:507 len:1 itfs:[36, ] path:[166] pl-index:189 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved, 194.1.163.4 TenGigabitEthernet6/0/0 [@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800 adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1 path-list:[1025] locks:1 uPRF-list:1521 len:1 itfs:[36, ] path:[379] pl-index:1025 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, 194.1.163.4 TenGigabitEthernet6/0/0 [@0]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800 Extensions: path:379 forwarding: unicast-ip4-chain [@0]: dpo-load-balance: [proto:ip4 index:848961 buckets:1 uRPF:507 to:[1966944:611861009]] [0] [@5]: ipv4 via 194.1.163.4 TenGigabitEthernet6/0/0: mtu:9000 next:10 flags:[] 90e2ba3fcad4246e9637934c810001908100000a0800 Nice work, VPP and Bird2!
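Since the adjacency shows up as resolved, the peer loopback should also appear as a regular IP neighbor, learned over the unnumbered interface. A hedged way to double check both the VPP and the Linux view, using the chbtl0 example from above (commands only, I will spare you the output):\npim@chbtl0:~$ vppctl show ip neighbors | grep 194.1.163.4 pim@chbtl0:~$ sudo nsenter --net=/var/run/netns/dataplane ip neigh show dev xe1-0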
I confirm that I can ping the neighbor again, and that the traceroute is direct rather than the scenic route from before, and I validate that IPv6 still works for good measure:\npim@chbtl0:~$ ping -4 chbtl1.ipng.ch PING 194.1.163.4 (194.1.163.4) 56(84) bytes of data. 64 bytes from 194.1.163.4: icmp_seq=1 ttl=63 time=0.169 ms 64 bytes from 194.1.163.4: icmp_seq=2 ttl=63 time=0.283 ms 64 bytes from 194.1.163.4: icmp_seq=3 ttl=63 time=0.232 ms 64 bytes from 194.1.163.4: icmp_seq=4 ttl=63 time=0.271 ms ^C --- 194.1.163.4 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3003ms rtt min/avg/max/mdev = 0.163/0.233/0.276/0.045 ms pim@chbtl0:~$ traceroute chbtl1.ipng.ch traceroute to chbtl1 (194.1.163.4), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.4) 0.190 ms 0.176 ms 0.147 ms pim@chbtl0:~$ ping6 chbtl1.ipng.ch PING chbtl1.ipng.ch(chbtl1.ipng.ch (2001:678:d78::4)) 56 data bytes 64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=1 ttl=64 time=0.205 ms 64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=2 ttl=64 time=0.203 ms 64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=3 ttl=64 time=0.213 ms 64 bytes from chbtl1.ipng.ch (2001:678:d78::4): icmp_seq=4 ttl=64 time=0.219 ms ^C --- chbtl1.ipng.ch ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3068ms rtt min/avg/max/mdev = 0.203/0.210/0.219/0.006 ms pim@chbtl0:~$ traceroute6 chbtl1.ipng.ch traceroute to chbtl1.ipng.ch (2001:678:d78::4), 30 hops max, 80 byte packets 1 chbtl1.ipng.ch (2001:678:d78::4) 0.163 ms 0.147 ms 0.124 ms 4. From one to two At this point I have two IPv4 IGPs running. This is not ideal, but it\u0026rsquo;s also not completely broken, because the OSPF filter allows the routers to learn and propagate any more specific prefix from 194.1.163.0/24. This way, the legacy OSPFv2 called ospf4_old and this new OSPFv3 called ospf4 will be aware of all routes. Bird will learn them twice, and routing decisions may be a bit funky because the OSPF protocols learn the routes from each other as OSPF-E2. There are two implications of this:\nIt means that the routes that are learned from the other OSPF protocol will have a fixed metric (==cost), and for the time being, I won\u0026rsquo;t be able to cleanly add up link costs between the routers that are speaking OSPFv2 and those that are speaking OSPFv3.\nIf an OSPF External Type E1 and Type E2 route exist to the same destination the E1 route will always be preferred irrespective of the metric. This means that within the routers that speak OSPFv2, cost will remain consistent; and also within the routers that speak OSPFv3, it will be consistent. Between them, routes will be learned, but cost will be roughly meaningless.\nI upgrade another link, between router chgtg0 and ddln0 at my [colo], which is connected via a 10G EoMPLS link from a local telco called Solnet. The colo, similar to IPng\u0026rsquo;s office, has two redundant 10G uplinks, so if things were to fall apart, I can always quickly shutdown the offending link (thereby removing OSPFv3 adjacencies), and traffic will reroute. I have created two islands of OSPFv3, drawn in orange, with exactly two links using IPv4-less point to point networks. I let this run for a few weeks, to make sure things do not fail in mysterious ways.\n5. From two to many From this point on it\u0026rsquo;s just rinse-and-repeat. For each backbone link, I will:\nI will drain the backbone link I\u0026rsquo;m about to work on, by raising OSPFv2 and OSPFv3 cost on both sides. 
If the cost was, say, 56, I will temporarily make that 1056. This will make traffic avoid using the link if at all possible. Due to redundancy, every router has (at least) two backbone links. Traffic will be diverted. I first change the VPP router\u0026rsquo;s vppcfg.yaml to remove the p2p addresses and replace them with an unnumbered: loop0 instead. I apply the diff, and the OSPF adjacency breaks for IPv4. The BFD adjacency for IPv4 will disappear. Curiously, the IPv6 adjacency stays up, because OSPFv3 adjacencies use link-local addresses. I move the interface section of the old OSPFv2 ospf4_old protocol to the new OSPFv3 ospf4 protocol, which will also use link-local addresses to form adjacencies. The two routers will exchange Link LSAs and be able to find each other directly connected. Now the link is running two OSPFv3 protocols, each in their own address family. They will share the same BFD session. I finally undrain the link by setting the OSPF link cost back to what it was. This link is now a part of the OSPFv3 part of the network. I work my way through the network. The first one I do is the link between chgtg0 and chbtl1 (which I\u0026rsquo;ve colored in the diagram in pink), so that there are four contiguous OSPFv3 links, spanning from chbtl0 - chbtl1 - chgtg0 - ddln0. I constantly do a traceroute to a machine that is directly connected behind ddln0, and also use RIPE Atlas and the NLNOG Ring to ensure that I have reachability:\npim@squanchy:~$ traceroute ipng.mm.fcix.net traceroute to ipng.mm.fcix.net (194.1.163.59), 64 hops max, 40 byte packets 1 chbtl0 (194.1.163.65) 0.279 ms 0.362 ms 0.249 ms 2 chbtl1 (194.1.163.3) 0.455 ms 0.394 ms 0.384 ms 3 chgtg0 (194.1.163.1) 1.302 ms 1.296 ms 1.294 ms 4 ddln0 (194.1.163.5) 2.232 ms 2.385 ms 2.322 ms 5 mm0.ddln0.ipng.ch (194.1.163.59) 2.377 ms 2.577 ms 2.364 ms I work my way outwards from there. First completing the ring chbtl0 - chrma0 - chgtg0 - chbtl1, and then completing the ring ddln0 - ddln1 - chrma0 - chgtg0, after which the Zurich metro area is converted. I then work my way clockwise from Zurich to Geneva, Paris, Lille, Amsterdam, Frankfurt, and end up with the last link completing the set: defra0 - chrma0.\nResults In total I reconfigure thirteen backbone links, and they all become unnumbered using the router\u0026rsquo;s loopback addresses for IPv4 and IPv6, and they all switch over from their OSPFv2 IGP to the new OSPFv3 IGP; the total number of routers running the old IGP shrinks until there are none left. Once that happens, I can simply remove the OSPFv2 protocol called ospf4_old, and keep the two OSPFv3 protocols now intuitively called ospf4 and ospf6. Nice.\nThis maintenance isn\u0026rsquo;t super intrusive. For IPng\u0026rsquo;s customers, latency goes up from time to time as backbone links are drained, the link is reconfigured to become unnumbered and OSPFv3, and put back into service.
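To make the drain step concrete: a drain is nothing more than a temporary cost bump on both neighbors of the link. Taking the defra0 southbound interface from earlier as the example, the generated Bird config would look roughly like this while drained (a sketch only, the real thing comes out of Kees):\nprotocol ospf v3 ospf4 { ipv4 { export filter f_ospf; import filter f_ospf; }; area 0 { interface \u0026#34;loop0\u0026#34;,\u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;xe1-0.304\u0026#34; { type pointopoint; cost 1056; bfd on; }; }; } Undraining is simply putting the cost back to 56 once the link has been converted.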
The whole operation takes a few hours, and I enjoy the repetitive tasks, getting pretty good at the drain-reconfigure-undrain cycle after a while.\nIt looks really cool on transit routers, like this one in Lille, France:\npim@frggh0:~$ ip -br a | grep UP loop0 UP 194.1.163.10/32 2001:678:d78::a/128 fe80::dcad:ff:fe00:0/64 xe0-0 UP 193.34.197.143/25 2001:7f8:6d::8298:1/64 fe80::3eec:efff:fe70:24a/64 xe0-1 UP fe80::3eec:efff:fe70:24b/64 xe1-0 UP fe80::6a05:caff:fe32:45ac/64 xe1-1 UP fe80::6a05:caff:fe32:45ad/64 xe1-2 UP fe80::6a05:caff:fe32:45ae/64 xe1-2.100@xe1-2 UP fe80::6a05:caff:fe32:45ae/64 xe1-2.200@xe1-2 UP fe80::6a05:caff:fe32:45ae/64 xe1-2.391@xe1-2 UP 46.20.247.3/29 2a02:2528:ff03::3/64 fe80::6a05:caff:fe32:45ae/64 xe0-1.100@xe0-1 UP 194.1.163.137/29 2001:678:d78:6::1/64 fe80::3eec:efff:fe70:24b/64 pim@frggh0:~$ birdc show bfd ses BIRD v2.15.1-4-g280daed5-x ready. bfd1: IP address Interface State Since Interval Timeout fe80::3eec:efff:fe46:68a9 xe1-2.200 Up 2024-06-19 20:16:58 0.100 3.000 fe80::6a05:caff:fe32:3e38 xe1-2.100 Up 2024-06-19 20:13:11 0.100 3.000 pim@frggh0:~$ birdc show ospf nei BIRD v2.15.1-4-g280daed5-x ready. ospf4: Router ID Pri State DTime Interface Router IP 194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38 194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9 ospf6: Router ID Pri State DTime Interface Router IP 194.1.163.9 1 Full/PtP 34.947 xe1-2.100 fe80::6a05:caff:fe32:3e38 194.1.163.8 1 Full/PtP 31.940 xe1-2.200 fe80::3eec:efff:fe46:68a9 You can see here that the router indeed has an IPv4 loopback address 194.1.163.10/32, and 2001:678:d78::a/128. It has two backbone links, on xe1-2.100 towards Paris and xe1-2.200 towards Amsterdam. Judging by the time between the BFD sessions, it took me somewhere around four minutes to drain, reconfigure, and undrain each link. I kept on listening to Nora en Pure\u0026rsquo;s [Episode #408] the whole time.\nA traceroute The beauty of this solution is that the routers will still have one IPv4 and IPv6 address, from their loop0 interface. The VPP dataplane will use this when generating ICMP error messages, for example in a traceroute. It will look quite normal:\npim@squanchy:~/src/ipng.ch$ traceroute bit.nl traceroute to bit.nl (213.136.12.97), 30 hops max, 60 byte packets 1 chbtl0.ipng.ch (194.1.163.65) 0.366 ms 0.408 ms 0.393 ms 2 chrma0.ipng.ch (194.1.163.0) 1.219 ms 1.252 ms 1.180 ms 3 defra0.ipng.ch (194.1.163.7) 6.943 ms 6.887 ms 6.922 ms 4 nlams0.ipng.ch (194.1.163.8) 12.882 ms 12.835 ms 12.910 ms 5 as12859.frys-ix.net (185.1.203.186) 14.028 ms 14.160 ms 14.436 ms 6 http-bit-ev-new.lb.network.bit.nl (213.136.12.97) 14.098 ms 14.671 ms 14.965 ms pim@squanchy:~$ traceroute6 bit.nl traceroute6 to bit.nl (2001:7b8:3:5::80:19), 64 hops max, 60 byte packets 1 chbtl0.ipng.ch (2001:678:d78:3::1) 0.871 ms 0.373 ms 0.304 ms 2 chrma0.ipng.ch (2001:678:d78::) 1.418 ms 1.387 ms 1.764 ms 3 defra0.ipng.ch (2001:678:d78::7) 6.974 ms 6.877 ms 6.912 ms 4 nlams0.ipng.ch (2001:678:d78::8) 13.023 ms 13.014 ms 13.013 ms 5 as12859.frys-ix.net (2001:7f8:10f::323b:186) 14.322 ms 14.181 ms 14.827 ms 6 http-bit-ev-new.lb.network.bit.nl (2001:7b8:3:5::80:19) 14.176 ms 14.24 ms 14.093 ms The only difference from before is that now, these traceroute hops are from the loopback addresses, not the P2P transit links (e.g. the second hop, through chrma0, is now 194.1.163.0 and 2001:678:d78:: respectively, where before it would have been 194.1.163.17 and 2001:678:d78::2:3:2).
Subtle, but super dope.\nLink Flap Test The proof is in the pudding, they say. After all of this link draining, reconfiguring and undraining, I gain confidence that this stuff actually works as advertised! I thought it\u0026rsquo;d be a nice touch to demonstrate a link drain, between Frankfurt and Amsterdam. I recorded a little screencast [asciinema, gif], shown here:\nReturning IPv4 (and IPv6!) addresses Now that the backbone links no longer carry global unicast addresses, and they borrow from the one IPv4 and IPv6 address in loop0, I can return a whole stack of addresses:\nIn total, I returned 34 IPv4 addresses from IPng\u0026rsquo;s /24, which is 13.3%. This is huge, and I\u0026rsquo;m confident that I will find a better use for these little addresses than being pointless point-to-point links!\n","date":"2024-06-22","desc":" Introduction When I first built IPng Networks AS8298, I decided to use OSPF as an IPv4 and IPv6 internal gateway protocol. Back in March I took a look at two slightly different ways of doing this for IPng, notably against a backdrop of conserving IPv4 addresses. As the network grows, the little point to point transit networks between routers really start adding up.\nI explored two potential solutions to this problem:\n","permalink":"https://ipng.ch/s/articles/2024/06/22/vpp-with-loopback-only-ospfv3-part-2/","section":"articles","title":"VPP with loopback-only OSPFv3 - Part 2"},{"contents":"Introduction IPng\u0026rsquo;s network is built up in two main layers, (1) an MPLS transport layer, which is disconnected from the Internet, and (2) a VPP overlay, which carries the Internet. I created a BGP Free core transport network, which uses MPLS switches from a company called Centec. These switches offer IPv4, IPv6, VxLAN, GENEVE and GRE all in silicon, are very cheap on power and relatively affordable per port.\nCentec switches allow for a modest but not huge amount of routes in the hardware forwarding tables. I loadtested them in [a previous article] at line rate (well, at least 8x10G at 64b packets and around 110Mpps), and they forward IPv4, IPv6 and MPLS traffic effortlessly, at 45 watts.\nI wrote more about the Centec switches in [my review] of them back in 2022.\nIPng Site Local I leverage this internal transport network for more than just MPLS. The transport switches are perfectly capable of line rate (at 100G+) IPv4 and IPv6 forwarding as well. When designing IPng Site Local, I created a number plan that assigns IPv4 from the 198.19.0.0/16 prefix, and IPv6 from the 2001:678:d78:500::/56 prefix. Within these, I allocate blocks for Loopback addresses, PointToPoint subnets, and hypervisor networks for VMs and internal traffic.\nTake a look at the diagram to the right. Each site has one or more Centec switches (in red), and there are three redundant gateways that connect the IPng Site Local network to the Internet (in orange). I run lots of services in this red portion of the network: site to site backups [Borgbackup], ZFS replication [ZRepl], a message bus using [Nats], and of course monitoring with SNMP and Prometheus all make use of this network. But it\u0026rsquo;s not only internal services like management traffic, I also actively use this private network to expose public services!\nFor example, I operate a bunch of [NGINX Frontends] that have a public IPv4/IPv6 address, and reversed proxy for webservices (like [ublog.tech] or [Rallly]) which run on VMs and Docker hosts which don\u0026rsquo;t have public IP addresses. 
Another example which I wrote about [last week], is a bunch of mail services that run on VMs without public access, but are each carefully exposed via reversed proxies (like Postfix, Dovecot, or [Roundcube]). It\u0026rsquo;s an incredibly versatile network design!\nBorder Gateways Seeing as IPng Site Local uses native IPv6, it\u0026rsquo;s rather straight forward to give each hypervisor and VM an IPv6 address, and configure IPv4 only on the externally facing NGINX Frontends. As a reversed proxy, NGINX will create a new TCP session to the internal server, and that\u0026rsquo;s a fine solution. However, I also want my internal hypervisors and servers to have full Internet connectivity. For IPv6, this feels pretty straight forward, as I can just route the 2001:678:d78:500::/56 through a firewall that blocks incoming traffic, and call it a day. For IPv4, similarly I can use classic NAT just like one would in a residential network.\nBut what if I wanted to go IPv6-only? This poses a small challenge, because while IPng is fully IPv6 capable, and has been since the early 2000s, the rest of the internet is not quite there yet. For example, the quite popular [GitHub] hosting site still has only an IPv4 address. Come on, folks, what\u0026rsquo;s taking you so long?! It is for this purpose that NAT64 was invented. Described in [RFC6146]:\nStateful NAT64 translation allows IPv6-only clients to contact IPv4 servers using unicast UDP, TCP, or ICMP. One or more public IPv4 addresses assigned to a NAT64 translator are shared among several IPv6-only clients. When stateful NAT64 is used in conjunction with DNS64, no changes are usually required in the IPv6 client or the IPv4 server.\nThe rest of this article describes version 2 of the IPng SL border gateways, which opens the path for IPng to go IPv6-only. By the way, I thought it would be super complicated, but in hindsight: I should have done this years ago!\nGateway Design Let me take a closer look at the orange boxes that I drew in the network diagram above. I call these machines Border Gateways. Their job is to sit between IPng Site Local and the Internet. They\u0026rsquo;ll each have one network interface connected to the Centec switch, and another connected to the VPP routers at AS8298. They will provide two main functions: firewalling, so that no unwanted traffic enters IPng Site local, and NAT translation, so that:\nIPv4 users from 198.19.0.0/16 can reach external IPv4 addresses, IPv6 users from 2001:678:d78:500::/56 can reach external IPv6, IPv6-only users can reach external IPv4 addresses, a neat trick. IPv4 and IPv6 NAT Let me start off with the basic tablestakes. You\u0026rsquo;ll likely be familiar with masquerading, a NAT technique in Linux that uses the public IPv4 address assigned by your provider, allowing many internal clients, often using [RFC1918] addresses, to access the internet via that shared IPv4 address. 
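In its simplest form that is the classic one-liner; a hedged sketch, assuming eth0 is the uplink and 192.168.1.0/24 is the internal network (both placeholders):\niptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE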
You may not have come across IPv6 masquerading though, but it\u0026rsquo;s equally possible to take an internal (private, non-routable) IPv6 network and access the internet via a shared IPv6 address.\nI will assign a pool of four public IPv4 addresses and eight IPv6 addresses to each border gateway:\nMachine IPv4 pool IPv6 pool border0.chbtl0.net.ipng.ch 194.126.235.0/30 2001:678:d78::3:0:0/125 border0.chrma0.net.ipng.ch 194.126.235.4/30 2001:678:d78::3:1:0/125 border0.chplo0.net.ipng.ch 194.126.235.8/30 2001:678:d78::3:2:0/125 border0.nlams0.net.ipng.ch 194.126.235.12/30 2001:678:d78::3:3:0/125 Linux iptables masquerading will only work with the IP addresses assigned to the external interface, so I will need to use a slightly different approach to be able to use these pools. In case you\u0026rsquo;re wondering \u0026ndash; IPng\u0026rsquo;s internal network has grown to the size now that I cannot expose it all behind a single IPv4 address; there will not be enough TCP/UDP ports. Luckily, NATing via a pool is pretty easy using the SNAT module:\npim@border0-chrma0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/rc.firewall.ipng-sl # IPng Site Local: Enable stateful firewalling on IPv4/IPv6 forwarding iptables -P FORWARD DROP ip6tables -P FORWARD DROP iptables -I FORWARD -i enp1s0f1 -m state --state NEW -s 198.19.0.0/16 -j ACCEPT ip6tables -I FORWARD -i enp1s0f1 -m state --state NEW -s 2001:678:d78:500::/56 -j ACCEPT iptables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT ip6tables -I FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT # IPng Site Local: Enable NAT on external interface using NAT pools iptables -t nat -I POSTROUTING -s 198.19.0.0/16 -o enp1s0f0 \\ -j SNAT --to 194.126.235.4-194.126.235.7 ip6tables -t nat -I POSTROUTING -s 2001:678:d78:500::/56 -o enp1s0f0 \\ -j SNAT --to 2001:678:d78::3:1:0-2001:678:d78::3:1:7 EOF From the top \u0026ndash; I\u0026rsquo;ll first make it the default for the kernel to refuse to FORWARD any traffic that is not explicitly accepted. I will only allow traffic that comes in via enp1s0f1 (the internal interface), only if it comes from the assigned IPv4 and IPv6 site local prefixes. On the way back, I\u0026rsquo;ll allow traffic that matches states created on the way out. This is the firewalling portion of the setup.\nThen, two POSTROUTING rules turn on network address translation. If the source address is any of the site local prefixes, I\u0026rsquo;ll rewrite it to come from the IPv4 or IPv6 pool addresses, respectively. This is the NAT44 and NAT66 portion of the setup.\nNAT64: Jool So far, so good. But this article is about NAT64 :-) Here\u0026rsquo;s where I grossly overestimated how difficult it might be \u0026ndash; and if there\u0026rsquo;s one takeaway from my story here, it should be that NAT64 is as straight forward as the others! Enter [Jool], an Open Source SIIT and NAT64 for Linux. It\u0026rsquo;s available in Debian as a DKMS kernel module and userspace tool, and it integrates cleanly with both iptables and netfilter.\nJool is a network address and port translating implementation, which is referred to as NAPT, just as regular IPv4 NAT. When internal IPv6 clients try to reach an external endpoint, Jool will make note of the internal src6:port, then select an external IPv4 address:port, rewrite the packet, and on the way back, correlate the src4:port with the internal src6:port, and rewrite the packet. If this sounds an awful lot like NAT, then you\u0026rsquo;re not wrong! 
The only difference is, Jool will also translate the address family: it will rewrite the internal IPv6 addresses to external IPv4 addresses.\nInstalling Jool is as simple as this:\npim@border0-chrma0:~$ sudo apt install jool-dkms jool-tools pim@border0-chrma0:~$ sudo mkdir /etc/jool pim@border0-chrma0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/jool/jool.conf { \u0026#34;comment\u0026#34;: { \u0026#34;description\u0026#34;: \u0026#34;Full NAT64 configuration for border0.chrma0.net.ipng.ch\u0026#34;, \u0026#34;last update\u0026#34;: \u0026#34;2024-05-21\u0026#34; }, \u0026#34;instance\u0026#34;: \u0026#34;default\u0026#34;, \u0026#34;framework\u0026#34;: \u0026#34;netfilter\u0026#34;, \u0026#34;global\u0026#34;: { \u0026#34;pool6\u0026#34;: \u0026#34;2001:678:d78:564::/96\u0026#34;, \u0026#34;lowest-ipv6-mtu\u0026#34;: 1280, \u0026#34;logging-debug\u0026#34;: false }, \u0026#34;pool4\u0026#34;: [ { \u0026#34;protocol\u0026#34;: \u0026#34;TCP\u0026#34;, \u0026#34;prefix\u0026#34;: \u0026#34;194.126.235.4/30\u0026#34;, \u0026#34;port range\u0026#34;: \u0026#34;1024-65535\u0026#34; }, { \u0026#34;protocol\u0026#34;: \u0026#34;UDP\u0026#34;, \u0026#34;prefix\u0026#34;: \u0026#34;194.126.235.4/30\u0026#34;, \u0026#34;port range\u0026#34;: \u0026#34;1024-65535\u0026#34; }, { \u0026#34;protocol\u0026#34;: \u0026#34;ICMP\u0026#34;, \u0026#34;prefix\u0026#34;: \u0026#34;194.126.235.4/30\u0026#34; } ] } EOF pim@border0-chrma0:~$ sudo systemctl start jool .. and that, as they say, is all there is to it! There\u0026rsquo;s two things I make note of here:\nI have assigned 2001:678:d78:564::/96 as NAT64 pool6, which means that if this machine sees any traffic destined to that prefix, it\u0026rsquo;ll activate Jool, select an available IPv4 address:port from the pool4, and send the packet to the IPv4 destination address which it takes from the last 32 bits of the original IPv6 destination address. Cool trick: I am reusing the same IPv4 pool as for regular NAT. The Jool kernel module happily coexists with the iptables implementation! DNS64: Unbound There\u0026rsquo;s one vital piece of information missing, and it took me a little while to appreciate that. If I take an IPv6 only host, like Summer, and I try to connect to an IPv4-only host, how does that even work?\npim@summer:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 eno1 UP 2001:678:d78:50b::f/64 fe80::7e4d:8fff:fe03:3c00/64 pim@summer:~$ ip -6 ro 2001:678:d78:50b::/64 dev eno1 proto kernel metric 256 pref medium fe80::/64 dev eno1 proto kernel metric 256 pref medium default via 2001:678:d78:50b::1 dev eno1 proto static metric 1024 pref medium pim@summer:~$ host github.com github.com has address 140.82.121.4 pim@summer:~$ ping github.com ping: connect: Network is unreachable Now comes the really clever reveal \u0026ndash; NAT64 works by assigning an IPv6 prefix that snugly fits the entire IPv4 address space, typically 64:ff9b::/96, but operators can chose any prefix they\u0026rsquo;d like. For IPng\u0026rsquo;s site local network, I decided to assign 2001:678:d78:564::/96 for this purpose (this is the global.pool6 attribute in Jool\u0026rsquo;s config file I described above). A resolver can then tweak DNS lookups for IPv6-only hosts to return addresses from that IPv6 range. This tweaking is called DNS64, described in [RFC6147]:\nDNS64 is a mechanism for synthesizing AAAA records from A records. 
DNS64 is used with an IPv6/IPv4 translator to enable client-server communication between an IPv6-only client and an IPv4-only server, without requiring any changes to either the IPv6 or the IPv4 node, for the class of applications that work through NATs.\nI run the popular [Unbound] resolver at IPng, deployed as a set of anycasted instances across the network. With only two lines of configuration, I can turn on this feature:\npim@border0-chrma0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/unbound/unbound.conf.d/dns64.conf server: module-config: \u0026#34;dns64 iterator\u0026#34; dns64-prefix: 2001:678:d78:564::/96 EOF pim@border0-chrma0:~$ sudo systemctl restart unbound The behavior of the resolver now changes in a very subtle but cool way:\npim@summer:~$ host github.com github.com has address 140.82.121.3 github.com has IPv6 address 2001:678:d78:564::8c52:7903 pim@summer:~$ host 2001:678:d78:564::8c52:7903 3.0.9.7.2.5.c.8.0.0.0.0.0.0.0.0.4.6.5.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa domain name pointer lb-140-82-121-3-fra.github.com. Before, [github.com] did not return an AAAA record, so there was no way for Summer to connect to it. But now, not only does it return an AAAA record, it also rewrites the PTR request: knowing that I\u0026rsquo;m asking for something in the DNS64 range of 2001:678:d78:564::/96, Unbound will instead strip off the last 32 bits (8c52:7903, which is the hex encoding of the original IPv4 address), and return the answer for a PTR lookup for the original 3.121.82.140.in-addr.arpa instead. Game changer!\nDNS64 + NAT64 What I learned from this is that the combination of these two tools provides the magic:\nWhen an IPv6-only client asks for AAAA for an IPv4-only hostname, Unbound will synthesize an AAAA from the IPv4 address, casting it into the last 32 bits of its NAT64 prefix 2001:678:d78:564::/96. When an IPv6-only client tries to send traffic to 2001:678:d78:564::/96, Jool will do the address family (and address/port) translation. This is represented by the red (ipv6) flow in the diagram to the right turning into a green (ipv4) flow to the left. What\u0026rsquo;s left for me to do is to ensure that (a) the NAT64 prefix is routed from IPng Site Local to the gateways and (b) the IPv4 and IPv6 NAT address pools are routed from the Internet to the gateways.\nInternal: OSPF I use Bird2 to accomplish the dynamic routing - and considering the Centec switch network is by design BGP Free, I will use OSPF and OSPFv3 for these announcements. Using OSPF has an important benefit: I can selectively turn on and off the Bird announcements to the Centec IPng Site Local network. Seeing as there will be multiple redundant gateways, if one of them goes down (either due to failure or because of maintenance), the network will quickly reconverge on another replica. Neat!\nHere\u0026rsquo;s how I configure the OSPF import and export filters:\nfilter ospf_import { if (net.type = NET_IP4 \u0026amp;\u0026amp; net ~ [ 198.19.0.0/16 ]) then accept; if (net.type = NET_IP6 \u0026amp;\u0026amp; net ~ [ 2001:678:d78:500::/56 ]) then accept; reject; } filter ospf_export { if (net.type=NET_IP4 \u0026amp;\u0026amp; !(net~[198.19.0.255/32,0.0.0.0/0])) then reject; if (net.type=NET_IP6 \u0026amp;\u0026amp; !(net~[2001:678:d78:564::/96,2001:678:d78:500::1:0/128,::/0])) then reject; ospf_metric1 = 200; unset(ospf_metric2); accept; } When learning prefixes from the Centec switch, I will only accept precisely the IPng Site Local IPv4 (198.19.0.0/16) and IPv6 (2001:678:d78:500::/56) supernets.
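A quick way to double-check that only those supernets actually made it into the routing table is to ask Bird directly (a sketch \u0026ndash; the exact output and protocol names depend on the local Bird configuration):\npim@border0-chrma0:~$ birdc show protocols pim@border0-chrma0:~$ birdc show route for 198.19.0.0/16 all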
On sending prefixes to the Centec switches, I will announce:\n198.19.0.255/32 and 2001:678:d78:500::1:0/128: These are the anycast addresses of the Unbound resolver. 0.0.0.0/0 and ::/0: These are default routes for IPv4 and IPv6 respectively 2001:678:d78:564::/96: This is the NAT64 prefix, which will attract the IPv6-only traffic towards DNS64-rewritten destinations, for example 2001:678:d78:564::8c52:7903 as DNS64 representation of github.com, which is reachable only at legacy address 140.82.121.3. I have to be careful with the announcements into OSPF. The cost of E1 routes is the cost of the external metric in addition to the internal cost within OSPF to reach that network. The cost of E2 routes will always be the external metric, the metric will take no notice of the internal cost to reach that router. Therefor, I emit these prefixes without Bird\u0026rsquo;s ospf_metric2 set, so that the closest border gateway is always used.\nWith that, I can see the following:\npim@summer:~$ traceroute6 github.com traceroute to github.com (2001:678:d78:564::8c52:7903), 30 hops max, 80 byte packets 1 msw0.chbtl0.net.ipng.ch (2001:678:d78:50b::1) 4.134 ms 4.640 ms 4.796 ms 2 border0.chbtl0.net.ipng.ch (2001:678:d78:503::13) 0.751 ms 0.818 ms 0.688 ms 3 * * * 4 * * * ^C I\u0026rsquo;m not quite there yet, I have one more step to go. What\u0026rsquo;s happening at the Border Gateway? Let me take a look at this, while I ping6 to github.com:\npim@summer:~$ ping6 github.com PING github.com(lb-140-82-121-4-fra.github.com (2001:678:d78:564::8c52:7904)) 56 data bytes ... (nothing) pim@border0-chbtl0:~$ sudo tcpdump -ni any src host 2001:678:d78:50b::f or dst host 140.82.121.4 11:25:19.225509 enp1s0f1 In IP6 2001:678:d78:50b::f \u0026gt; 2001:678:d78:564::8c52:7904: ICMP6, echo request, id 3904, seq 7, length 64 11:25:19.225603 enp1s0f0 Out IP 194.126.235.3 \u0026gt; 140.82.121.4: ICMP echo request, id 61668, seq 7, length 64 Unbound and Jool are doing great work. Unbound saw my DNS request for IPv4-only github.com, and synthesized a DNS64 response for me. Jool then saw the inbound packet from enp1s0f1, the internal interface pointed at IPng Site Local. This is because the 2001:678:d78:564::/96 prefix is announced in OSPFv3 so every host knows to route traffic to that prefix to this border gateway. But then, I see the NAT64 in action on the outbound interface enp1s0f0. Here, one of the IPv4 pool addresses is selected as source address. But there is no return packet, because there is no route back from the Internet, yet.\nExternal: BGP The final step for me is to allow return traffic, from the Internet to the IPv4 and IPv6 pools to reach this Border Gateway instance. For this, I configure BGP with the following Bird2 configuration snippet:\nfilter bgp_import { if (net.type = NET_IP4 \u0026amp;\u0026amp; !(net = 0.0.0.0/0)) then reject; if (net.type = NET_IP6 \u0026amp;\u0026amp; !(net = ::/0)) then reject; accept; } filter bgp_export { if (net.type = NET_IP4 \u0026amp;\u0026amp; !(net ~ [ 194.126.235.4/30 ])) then reject; if (net.type = NET_IP6 \u0026amp;\u0026amp; !(net ~ [ 2001:678:d78::3:1:0/125 ])) then reject; # Add BGP Wellknown community no-export (FFFF:FF01) bgp_community.add((65535,65281)); accept; } I then establish an eBGP session from private AS64513 to two of IPng Networks\u0026rsquo; core routers at AS8298. I add the wellknown BGP no-export community (FFFF:FF01) so that these prefixes are learned in AS8298, but never propagated. 
It\u0026rsquo;s not strictly necessary, because AS8298 won\u0026rsquo;t announce more specifics like these anyway, but it\u0026rsquo;s a nice way to really assert that these are meant to stay local. Because AS8298 is already announcing 194.126.235.0/24 and 2001:678:d78::/48 supernets, return traffic will already be able to reach IPng\u0026rsquo;s routers upstream. With these more specific announcements of the /30 and /125 pools, the upstream VPP routers will be able to route the return traffic to this specific server.\nAnd with that, the ping to Unbound\u0026rsquo;s DNS64 provided IPv6 address for github.com shoots to life.\nResults I deployed four of these Border Gateways using Ansible: one at my office in Brüttisellen, one in Zurich, one in Geneva and one in Amsterdam. They do all three types of NAT:\nAnnouncing the IPv4 default 0.0.0.0/0 will allow them to serve as NAT44 gateways for 198.19.0.0/16 Announcing the IPv6 default ::/0 will allow them to serve as NAT66 gateway for 2001:678:d78:500::/56 Announcing the IPv6 nat64 prefix 2001:678:d78:564::/96 will allow them to serve as NAT64 gateway Announcing the IPv4 and IPv6 anycast address for nscache.net.ipng.ch allows them to serve DNS64 Each individual service can be turned on or off. For example, stopping to announce the IPv4 default into the Centec network, will no longer attract NAT44 traffic through a replica. Similarly, stopping to announce the NAT64 prefix will no longer attract NAT64 traffic through that replica. OSPF in the IPng Site Local network will automatically select an alternative replica in such cases. Shutting down Bird2 alltogether will immediately drain the machine of all traffic, while traffic is immediately rerouted.\nIf you\u0026rsquo;re curious, here\u0026rsquo;s a few minutes of me playing with failover, while watching YouTube videos concurrently [asciinema, gif]:\nWhat\u0026rsquo;s Next I\u0026rsquo;ve added an Ansible module in which I can configure the individual instances\u0026rsquo; IPv4 and IPv6 NAT pools, and turn on/off the three NAT types by means of steering the OSPF announcements. I can also turn on/off the Anycast Unbound announcements, in much the same way.\nIf you\u0026rsquo;re a regular reader of my stories, you\u0026rsquo;ll maybe be asking: Why didn\u0026rsquo;t you use VPP? And that would be an excellent question. I need to noodle a little bit more with respect to having all three NAT types concurrently working alongside Linux CP for the Bird and Unbound stuff, but I think in the future you might see a followup article on how to do all of this in VPP. Stay tuned!\n","date":"2024-05-25","desc":"Introduction IPng\u0026rsquo;s network is built up in two main layers, (1) an MPLS transport layer, which is disconnected from the Internet, and (2) a VPP overlay, which carries the Internet. I created a BGP Free core transport network, which uses MPLS switches from a company called Centec. These switches offer IPv4, IPv6, VxLAN, GENEVE and GRE all in silicon, are very cheap on power and relatively affordable per port.\n","permalink":"https://ipng.ch/s/articles/2024/05/25/case-study-nat64/","section":"articles","title":"Case Study: NAT64"},{"contents":"Intro I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. 
I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for the last few years, I\u0026rsquo;ve been more and more inclined to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\nFirst off - I love Google\u0026rsquo;s Workspace products. I started using GMail just after it launched, back in 2004. Its user interface is sleek, performant, and very intuitive. Its filtering, granted, could be a bit less \u0026hellip; robotic, but that\u0026rsquo;s made up by labels and an incredibly comprehensive search function. I would dare say that between GMail and Photos, those are my absolute favorite products on the internet.\nThat said, I have been running e-mail servers since well before Google existed as a company. I started off at M.C.G.V. Stack, the computer club of the University of Eindhoven, in 1995. We ran sendmail back then, and until about two months ago, I have continuously run sendmail in production using the PaPHosting platform [ref] that I wrote with my buddies Paul and Jeroen.\nHowever, two things happened, both of them somewhat nerdsnipe-esque:\nMrs IPngNetworks said \u0026ldquo;Well if you are going to use NextCloud and PeerTube and PixelFed and Mastodon, why would you not run your own mailserver?\u0026rdquo; I added a forward for event@frys-ix.net on my Sendmail relays at PaPHosting, and was tipped by my buddy Jelle that his e-mail to it was bouncing due to SPF strictness. I tried to resist \u0026hellip; My main argument against running a mailserver has been the mailspool. Before I moved to GMail, I had the misfortune of having my mail and primary DNS fail, running on bfib.ipng.nl at the time, a server 700km away from me, without redundancy. Even the nameserver slaves went beyond their zone refresh. It was not a good month for me, even though it was twenty years or so ago :)\nLast year, during a roadtrip with Fred, he and I spent a few long hours restoring a backup after a catastrophic failure of a hypervisor at IP-Max on which his mailserver was running. Luckily, backups were awesome and saved the day, but having to go into a Red Alert mode and not being able to communicate, really can be stressful. I don\u0026rsquo;t want to run mailservers!!!1\n.. but resistance is futile After this nerdsnipe, I had a short conversation with Jeroen who mentioned that since I last had a look at this, Dovecot, a popular imap/pop3 server, had gained the ability to do mailbox synchronization across multiple machines. That\u0026rsquo;s a really nifty feature - but also it meant that there will be no more single points of failure, if I do this properly. Oh crap, there\u0026rsquo;s no longer an argument of resistance? Nerd-snipe accepted!\nLet me first introduce the mail^W main characters of my story:\nPostfix: is Wietse Venema\u0026rsquo;s mail server that started life at IBM research as an alternative to the widely-used Sendmail program. After eight years at Google, Wietse continues to maintain Postfix. Dovecot: an open source IMAP and POP3 email server for Linux/UNIX-like systems, written with security primarily in mind. Dovecot is an excellent choice for both small and large installations. NGINX: an HTTP and reverse proxy server, a mail proxy server, and a generic TCP/UDP proxy server, originally written by Igor Sysoev. 
Rspamd: an advanced spam filtering system and email processing framework that allows evaluation of messages by a number of rules including regular expressions, statistical analysis and custom services such as URL black lists. OpenDKIM: is a community effort to develop and maintain a C library for producing DKIM-aware applications and an open source milter for providing DKIM service. Unbound: a validating, recursive, caching DNS resolver. It is designed to be fast and lean and incorporates modern features based on open standards. Roundcube: a web-based IMAP email client. Roundcube\u0026rsquo;s most prominent feature is the pervasive use of Ajax technology. In the rest of this article, I\u0026rsquo;ll go over four main parts that I used to build a fully redundant and self-healing mail service at IPng Networks:\nGreen: smtp-in.ipng.ch which handles inbound e-mail Red: imap.ipng.ch which serves mailboxes to users Blue: smtp-out.ipng.ch which handles outbound e-mail Magenta: webmail.ipng.ch which exposes the mail in a web browser Let me start with a functional diagram, using those colors:\nAs you can see in this diagram, I will be separating concerns and splitting the design into three discrete parts, which will also be in three sets of redundantly configured backend servers running on IPng\u0026rsquo;s hypervisors in Zurich (CH), Lille (FR) and Amsterdam (NL).\n1. Outbound: smtp-out I\u0026rsquo;m going to start with a relatively simple component first: outbound mail. This service will be listening on the smtp submission port 587, require TLS and user authentication from clients, validate outbound e-mail using a spam detection agent, and finally provide DKIM signing on all outbound e-mails. It should spool and retry the delivery in case there is a temporary issue (like greylisting, or server failure) on the receiving side.\nBecause the only way to send e-mail will be using TLS and user authentication, the smtp-out servers themselves will not need to do any DNSBL lookups, which is convenient because it means I can put them behind a loadbalancer and serve them entirely within IPng Site Local. If you\u0026rsquo;re curious as to what this site local thing means, basically it\u0026rsquo;s an internal network spanning all IPng\u0026rsquo;s points of presence, with an IPv4, IPv6 and MPLS backbone that is disconnected from the internet. For more details on the design goals, take a look at the [article] I wrote about it last year.\nDebian VMs I\u0026rsquo;ll take three identical virtual machines, hosted on three separate hypervisors each in their own country.\npim@summer:~$ dig ANY smtp-out.net.ipng.ch smtp-out.net.ipng.ch.\t60\tIN\tA\t198.19.6.73 smtp-out.net.ipng.ch.\t60\tIN\tA\t198.19.4.230 smtp-out.net.ipng.ch.\t60\tIN\tA\t198.19.6.135 smtp-out.net.ipng.ch.\t60\tIN\tAAAA\t2001:678:d78:50e::9 smtp-out.net.ipng.ch.\t60\tIN\tAAAA\t2001:678:d78:50a::6 smtp-out.net.ipng.ch.\t60\tIN\tAAAA\t2001:678:d78:510::7 I will give them each 8GB of memory, 4 vCPUs, and 16GB of bootdisk. I\u0026rsquo;m pretty confident that the whole system will be running in only a fraction of that. 
I will install a standard issue Debian Bookworm (12.5), and while my VMs by default have 4 virtual NICs, I only need one, connected to the IPng Site Local:\npim@smtp-out-chrma0:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 enp1s0f0 UP 198.19.6.135/27 2001:678:d78:510::7/64 fe80::5054:ff:fe99:81b5/64 enp1s0f1 UP fe80::5054:ff:fe99:81b6/64 enp1s0f2 UP fe80::5054:ff:fe99:81b7/64 enp1s0f3 UP fe80::5054:ff:fe99:81b8/64 pim@smtp-out-chrma0:~$ mtr -6 -c5 -r dns.google Start: 2024-05-17T13:49:28+0200 HOST: smtp-out-chrma0 Loss% Snt Last Avg Best Wrst StDev 1.|-- msw1.chrma0.net.ipng.ch 0.0% 5 1.6 1.5 1.3 1.6 0.1 2.|-- msw0.chrma0.net.ipng.ch 0.0% 5 1.4 1.3 1.3 1.4 0.1 3.|-- msw0.chbtl0.net.ipng.ch 0.0% 5 3.2 3.1 2.8 3.2 0.2 4.|-- hvn0.chbtl0.net.ipng.ch 0.0% 5 1.5 1.5 1.4 1.5 0.0 5.|-- chbtl0.ipng.ch 0.0% 5 1.6 1.7 1.6 1.7 0.0 6.|-- chrma0.ipng.ch 0.0% 5 2.4 2.4 2.4 2.5 0.0 7.|-- as15169.lup.swissix.ch 0.0% 5 3.2 3.8 3.2 5.6 1.0 8.|-- 2001:4860:0:1::6083 0.0% 5 4.5 4.5 4.5 4.5 0.0 9.|-- 2001:4860:0:1::12f9 0.0% 5 3.4 3.5 3.4 3.5 0.0 10.|-- dns.google 0.0% 5 3.8 3.9 3.8 4.0 0.1 One cool observation: these machines are not really connected to the internet - you\u0026rsquo;ll note that their IPv4 address is from reserved space, and their IPv6 supernet (2001:678:d78:500::/56) is filtered at the border. I\u0026rsquo;ll get to that later!\nPostfix I will install Postfix, and make a few adjustments to its config. First off, this mailserver will only be receiving submission mail, which is port 587. It will not participate or listen to the regular smtp port 25, nor smtps port 465, as such the master.cf file for Postfix becomes:\n#smtp inet n - y - - smtpd # -o smtpd_sasl_auth_enable=no submission inet n - y - - smtpd -o syslog_name=postfix/submission -o smtpd_tls_security_level=encrypt -o smtpd_sasl_auth_enable=yes -o smtpd_reject_unlisted_recipient=no -o smtpd_client_restrictions=permit_sasl_authenticated,permit_mynetworks,reject -o milter_macro_daemon_name=ORIGINATING #smtps inet n - y - - smtpd # -o syslog_name=postfix/smtps # -o smtpd_tls_wrappermode=yes # -o smtpd_sasl_auth_enable=yes The only thing I will make note of is that the submission service has a set of client restrictions. In other words, to be able to use this service, the client must either be SASL authenticated, or from a list of network prefixes that are allowed to relay. If neither of those two conditions are satisfied, relaying will be denied.\nNow, I understand that pasting in the entire postfix configuration is a bit verbose, but honestly I\u0026rsquo;ve spent many an hour trying to puzzle together an end-to-end valid configuration, so I\u0026rsquo;m just going to swim upstream and post the whole main.cf, which I\u0026rsquo;ll try to annotate the broad strokes of, in case there\u0026rsquo;s anybody out there trying to learn:\nmyhostname = smtp-out.ipng.ch myorigin = smtp-out.ipng.ch mydestination = $myhostname, smtp-out.chrma0.net.ipng.ch, localhost.net.ipng.ch, localhost mynetworks = 127.0.0.0/8, [::1]/128 recipient_delimiter = + inet_interfaces = all inet_protocols = all biff = no # appending .domain is the MUA\u0026#39;s job. append_dot_mydomain = no readme_directory = no # See http://www.postfix.org/COMPATIBILITY_README.html -- default to 3.6 on fresh installs. 
compatibility_level = 3.6 # SMTP Server smtpd_banner = $myhostname ESMTP $mail_name (smtp-out.chrma0.net.ipng.ch) smtpd_relay_restrictions = permit_mynetworks permit_sasl_authenticated defer_unauth_destination smtpd_tls_cert_file = /etc/certs/ipng.ch/fullchain.pem smtpd_tls_key_file = /etc/certs/ipng.ch/privkey.pem smtpd_tls_CAfile = /etc/ssl/certs/ca-certificates.crt smtpd_tls_CApath = /etc/ssl/certs smtpd_use_tls = yes smtpd_tls_received_header = yes smtpd_tls_auth_only = yes smtpd_tls_session_cache_database = btree:$data_directory/smtpd_scache smtpd_client_connection_count_limit = 4 smtpd_client_connection_rate_limit = 10 smtpd_client_message_rate_limit = 60 smtpd_client_event_limit_exceptions = $mynetworks # Dovecot auth smtpd_sasl_type = dovecot smtpd_sasl_path = private/auth smtpd_sasl_authenticated_header = yes smtpd_sasl_auth_enable = yes smtpd_sasl_security_options = noanonymous, noplaintext smtpd_sasl_tls_security_options = noanonymous # SMTP Client smtp_use_tls = yes smtp_tls_note_starttls_offer = yes smtp_tls_cert_file = /etc/certs/ipng.ch/fullchain.pem smtp_tls_key_file = /etc/certs/ipng.ch/privkey.pem smtp_tls_CAfile = /etc/ssl/certs/ca-certificates.crt smtp_tls_CApath = /etc/ssl/certs smtp_tls_mandatory_ciphers = medium smtp_tls_session_cache_database = btree:${data_directory}/smtp_scache smtp_tls_security_level = encrypt header_size_limit = 4096000 message_size_limit = 52428800 mailbox_size_limit = 0 # OpenDKIM, Rspamd smtpd_milters = inet:localhost:8891,inet:rspamd.net.ipng.ch:11332 non_smtpd_milters = $smtpd_milters # Local aliases alias_maps = hash:/etc/postfix/aliases alias_database = hash:/etc/postfix/aliases Hostnames: The full (internal) hostname for the server is smtp-out.$(site).net.ipng.ch, in this case for chrma0 in Rümlang, Switzerland. However, when clients connect to the public hostname smtp-out.ipng.ch, they will expect that the TLS certificate matches that hostname. This is why I let the server present itself as simply smtp-out.ipng.ch, which will also be its public DNS name later, but put the internal FQDN for debugging purposes between parenthesis. See the smtpd_banner and myhostname for the destinction. I\u0026rsquo;ll load up the *.ipng.ch wildcard certificate which I described in my Let\u0026rsquo;s Encrypt [DNS-01] article.\nAuthorization: I will make Postfix accept relaying for those users that are either in the mynetworks (which is only localhost) OR sasl_authenticated (ie. presenting a username and password). This password exchange will only be possible after encryption has been triggered using the STARTTLS SMTP feature. This way, user/pass combos will be safe on the network.\nAuthentication: Those username and password combos can come from a few places. One popular way to do this is via a dovecot authentication service. Via the smtpd_sasl_path, I tell Postfix to ask these authentication questions using the dovecot protocol on a certain file path. I\u0026rsquo;ll let Dovecot listen in the /var/spool/postfix/private/auth directory. This is how Postfix will know which user to relay for, and which to deny.\nDKIM/SPF: These days, most (large and small) mail providers will be suspicious of e-mail that is delivered to them without proper SPF and DKIM fields. DKIM is a mechanism to create a cryptographic signature over some of the E-Mail header fields (usually From/Subject/Date), which can be checked by the recipient for validity. 
SPF is a mechanism to use DNS to inform receiving mailservers of which are the valid IPv4/IPv6 addresses that should be used to deliver mail for a given sender domain.\nDovecot (auth) The configuration for Dovecot is incredibly simple. The only thing I do is create a mostly empty dovecot.conf file which defines the auth service listening in the place where Postfix expects it. Then, I add a password file called sasl-users which will contain user:password tuples:\nservice auth { unix_listener /var/spool/postfix/private/auth { mode = 0660 # Assuming the default Postfix user and group user = postfix group = postfix } } passdb { driver = passwd-file args = username_format=%n /etc/dovecot/sasl-users } I can use doveadm pw to generate such passwords. I do this in an upstream Ansible repository and then push out the same configuration to any number of smtp-out servers, so they are all configured identically to this one.\nOpenDKIM (signing) Now that I can authorize (via SASL) and authenticate (via Dovecot backend) a user, it will be entitled to use the smtp-out Postfix to send e-mail. However, there\u0026rsquo;s a good chance that recipients will bounce the e-mail, unless it comes with a DKIM signature, and from the correct IP addresses.\nTo configure DKIM signing, I use OpenDKIM, which I give the following /etc/opendkim.conf file:\npim@smtp-out-chrma0:~$ cat /etc/opendkim.conf Syslog yes LogWhy yes UMask 007 Mode sv AlwaysAddARHeader yes SignatureAlgorithm rsa-sha256 X-Header no KeyTable refile:/etc/opendkim/keytable SigningTable refile:/etc/opendkim/signers RequireSafeKeys false Canonicalization relaxed TrustAnchorFile\t/usr/share/dns/root.key UserID opendkim PidFile /run/opendkim/opendkim.pid Socket inet6:8891 It opens a socket at port 8891, which is where Postfix expects it, based on its smtpd_milter configuration option. It will look at the so-called SigningTable to determine which outbound e-mail addresses it can sign. This table looks up From addresses, including wildcards, and informs which symbolic keyname in the KeyTable to use for the signature, like so:\npim@smtp-out-chrma0:/etc/opendkim$ cat signers *@*.ipng.nl ipng-nl *@ipng.nl ipng-nl *@*.ipng.ch ipng-ch *@ipng.ch ipng-ch *@*.ublog.tech ublog *@ublog.tech ublog ... pim@smtp-out-chrma0:/etc/opendkim$ cat keytable ipng-nl ipng.nl:DKIM2022:/etc/opendkim/keys/DKIM2022-ipng.nl-private ipng-ch ipng.ch:DKIM2022:/etc/opendkim/keys/DKIM2022-ipng.ch-private ublog ublog.tech:DKIM2022:/etc/opendkim/keys/DKIM2022-ublog.tech-private ... This allows OpenDKIM to sign messages for any number of domains, using the correct key. Slick!\nNGINX Now that I have three of these identical VMs, I am ready to hook them up to the internet. On the way in, I will point smtp-out.ipng.ch to our NGINX cluster. I wrote about that cluster in a [previous article]. I will add a snippet there, that exposes these VMs behind a TCP loadbalancer like so:\npim@squanchy:~/src/ipng-ansible/roles/nginx/files/streams-available$ cat smtp-out.ipng.ch.conf upstream smtp_out { server smtp-out.chrma0.net.ipng.ch:587 fail_timeout=10s max_fails=2; server smtp-out.frggh0.net.ipng.ch:587 fail_timeout=10s max_fails=2 backup; server smtp-out.nlams2.net.ipng.ch:587 fail_timeout=10s max_fails=2 backup; } server { listen [::]:587; listen 0.0.0.0:587; proxy_pass smtp_out; } I make use of the backup keyword, which will make the loadbalancer choose, if it\u0026rsquo;s available, the primary server in chrma0. 
If it were to go down, no problem, two connection failures within ten seconds will make NGINX choose the alternative ones in frggh0 or nlams2.\nIPng Site Local gateway When the smtp-out server receives the e-mail from the customer/client, it\u0026rsquo;ll spool it and start to deliver it to the remote MX record. To do this, it\u0026rsquo;ll create an outbound connection from its cozy spot within IPng Site Local (which, you will remember, is not connected directly to the internet). There are three redundant gateways in IPng Site Local (in Geneva, Brüttisellen and Amsterdam). If any of these were to go down for maintenance or fail, the network will use OSPF E1 to find the next closest default gateway. I wrote about how this entire European network is connected via three gateways that are self-repairing in this [article], in case you\u0026rsquo;re curious.\nBut, for the purposes of SMTP, it means that each of the internal smtp-out VMs will be seen by remote mailservers as NATted via one of these egress points. This allows me to determine the SPF records in DNS. With that, I\u0026rsquo;m ready to share the publicly visible details for this service:\n_spf.ipng.ch. 3600 IN TXT \u0026#34;v=spf1 include:_spf4.ipng.ch include:_spf6.ipng.ch ~all\u0026#34; _spf4.ipng.ch. 3600 IN TXT \u0026#34;v=spf1 ip4:46.20.246.112/28 ip4:46.20.243.176/28 ip4:94.142.245.80/29\u0026#34; \u0026#34;ip4:94.142.241.184/29 ip4:194.1.163.0/24 ~all\u0026#34; _spf6.ipng.ch. 3600 IN TXT \u0026#34;v=spf1 ip6:2a02:2528:ff00::/40 ip6:2a02:898:146::/48\u0026#34; \u0026#34;ip6:2001:678:d78::/48 ~all\u0026#34; smtp-out.ipng.ch. 3600 IN CNAME nginx0.ipng.ch. nginx0.ipng.ch. 600 IN A 194.1.163.151 nginx0.ipng.ch. 600 IN A 46.20.246.124 nginx0.ipng.ch. 600 IN A 94.142.241.189 nginx0.ipng.ch. 600 IN AAAA 2001:678:d78:7::151 nginx0.ipng.ch. 600 IN AAAA 2a02:2528:ff00::124 nginx0.ipng.ch. 600 IN AAAA 2a02:898:146::5 To reiterate one point: the inbound path of the mail is via the redundant cluster of nginx0 entrypoints, while the outbound path will be seen from gw0.chbtl0.ipng.ch, gw0.chplo0.ipng.ch or gw0.nlams3.ipng.ch, which are all covered by the SPF records for IPv4 and IPv6.\nBonus: opensmtpd on clients By the way, every single server (VM, hypervisor, router) at IPng Networks uses smtp-out to send e-mail. I use opensmtpd for that, and it\u0026rsquo;s incredibly simple:\npim@squanchy:~$ cat /etc/mail/smtpd.conf table aliases file:/etc/mail/aliases table secrets file:/etc/mail/secrets listen on localhost action \u0026#34;local_mail\u0026#34; mbox alias \u0026lt;aliases\u0026gt; action \u0026#34;outbound\u0026#34; relay host \u0026#34;smtp+tls://ipng@smtp-out.ipng.ch:587\u0026#34; auth \u0026lt;secrets\u0026gt; mail-from \u0026#34;@ipng.ch\u0026#34; match from local for local action \u0026#34;local_mail\u0026#34; match from local for any action \u0026#34;outbound\u0026#34; pim@squanchy:~$ sudo cat /etc/mail/secrets ipng bastion:\u0026lt;haha-made-you-look\u0026gt; What happens here is: every time this server, squanchy, wants to send an e-mail, it will open an SMTP session with TLS on port 587 to the machine called smtp-out.ipng.ch, and it\u0026rsquo;ll authenticate using the opensmtpd realm called ipng, which maps to a username:password tuple in the secrets file. It will also rewrite the envelope to always be from @ipng.ch. As a best practice I organize my SMTP users by Ansible group. Squanchy is in the group bastion, hence its username.
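To convince myself that the submission port really does insist on TLS before authentication, a quick probe from any client is enough (a sketch \u0026ndash; this only sets up the STARTTLS handshake; typing EHLO by hand afterwards shows whether AUTH is offered, and no mail is sent):\nopenssl s_client -starttls smtp -connect smtp-out.ipng.ch:587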
By doing it this way, I can make use of the DKIM and SPF, which makes all mails properly formatted, routed, signed and delivered. I love it, so much!\n2. Inbound: smtp-in The smtp-out service I described in the previous section is completely standalone. That is to say, its purpose is only to receive submitted mail from humans and servers, sign it, spool it if need be, and deliver it. But users also want to deliver e-mail to me and my customers. For this, I\u0026rsquo;ll build a second cluster of redundant inbound mailservers: smtp-in.\nHere, the base setup is not too different from above, so I won\u0026rsquo;t repeat it. I\u0026rsquo;ll take three identical VMs, in three different datacenters, and install them with Debian and Postfix as well. But, contrary to the outbound servers, here I will make them listen to the smtp port 25 and the smtps port 465, and I\u0026rsquo;ll turn off the ability to authenticate with SASL (and thereby, refuse to forward any e-mail that I\u0026rsquo;m not the MX record for), making master.cf look like this:\nsmtp inet n - y - - smtpd -o smtpd_sasl_auth_enable=no #submission inet n - y - - smtpd # -o syslog_name=postfix/submission # -o smtpd_tls_security_level=encrypt # -o smtpd_sasl_auth_enable=yes # -o smtpd_reject_unlisted_recipient=no # -o smtpd_client_restrictions=permit_sasl_authenticated,permit_mynetworks,reject # -o milter_macro_daemon_name=ORIGINATING smtps inet n - y - - smtpd -o syslog_name=postfix/smtps -o smtpd_tls_wrappermode=yes -o smtpd_sasl_auth_enable=no Many of the main.cf attributes are the same; unsurprisingly, the myhostname configuration option is set to smtp-in.ipng.ch, which is expected to match the wildcard SSL certificate from the smtpd_tls_cert_file config option. The banner is a bit more telling, as it also shows the FQDN hostname (e.g. smtp-in.frggh0.net.ipng.ch), helpful when debugging.\n# Impose DNSBL restrictions at SMTP time smtpd_recipient_restrictions = permit_mynetworks, reject_invalid_helo_hostname, reject_non_fqdn_recipient, reject_unknown_recipient_domain, reject_unauth_pipelining, reject_unauth_destination, reject_rbl_client zen.spamhaus.org=127.0.0.[2..11], reject_rhsbl_sender dbl.spamhaus.org=127.0.1.[2..99], reject_rhsbl_helo dbl.spamhaus.org=127.0.1.[2..99], reject_rhsbl_reverse_client dbl.spamhaus.org=127.0.1.[2..99], warn_if_reject reject_rbl_client zen.spamhaus.org=127.255.255.[1..255], reject_rbl_client dnsbl-1.uceprotect.net, reject_rbl_client bl.0spam.org=127.0.0.[7..9], permit # Milter for rspamd smtpd_milters = inet:rspamd.net.ipng.ch:11332 milter_default_action = accept # PostSRSd sender_canonical_maps = tcp:localhost:10001 sender_canonical_classes = envelope_sender recipient_canonical_maps = tcp:localhost:10002 recipient_canonical_classes= envelope_recipient,header_recipient # Virtual domains virtual_alias_domains = hash:/etc/postfix/virtual-domains virtual_alias_maps = hash:/etc/postfix/virtual The config is arguably quite compact, but I will highlight four specific pieces.\nDNSBL: When connecting and receiving the envelope (i.e. MAIL FROM and RCPT TO in the SMTP transaction), I\u0026rsquo;ll ask Postfix to do a bunch of DNS blocklist lookups. Many sender domains and infected hosts/networks are mapped in public DNSBL zones; notably [Spamhaus], [UCEProtect], and [0Spam] do a great job at identifying malicious and spammy domain names and networks.
So I\u0026rsquo;ll tell Postfix to reject folks attempting to connect from these low-reputation places.\nRspamd: Here\u0026rsquo;s where I hook up a redundant cluster of rspamd servers. Each e-mail, once accepted, will be routed through this milter, and after thinking about it a little bit, the Rspamd server will either answer:\ngreylisted: where Rspamd recommends a tempfail so the remote mailserver comes back after a few minutes after connecting for the first time, many spammers will not do this. blocked: if Rspamd finds the e-mail is egregious, it\u0026rsquo;ll simply recommend a permfail so Postfix immediately rejects it. tagged: if Rspamd is iffy about the e-mail, it may insert an X-Spam header, so that downstream mail clients like Thunderbird or Mail.app can decide for themselves to consider the e-mail junk or not. PostSRS: This is a really useful feature which allows Postfix to safely forward an e-mail to another mailhost. Perhaps best explained with an example, notably the aforementioned nerdsnipe from my buddy Jelle:\nLet\u0026rsquo;s say jelle@luteijn.email sends an e-mail to event@frys-ix.net for which IPng is the mailhost. Jelle configured his SPF records to allow mail to come from either ip4:185.36.229.0/24 or ip6:2a07:cd40::/29, and if it comes from neither of those, to hard fail the SPF check -all. My spiffy smtp-in.ipng.ch receives this e-mail and decides to forward it internally by rewriting it to foo@eritap.com. The mailserver for eritap.com now sees an e-mail coming From: jelle@luteijn.email going to its foo@eritap.com. It does an SPF check and concludes: Yikes! That mailserver smtp-in.ipng.ch is NOT authorized to send e-mail on behalf of Jelle, so reject it! A kitten gets hurt, which is obviously unacceptable. To handle this, PostSRSd detects when such a forward is about to happen, and rewrites the envelope From: header to be something that smtp-in.ipng.ch might be allowed to deliver mail for: something in the @ipng.ch domain! Using a secret (shared between the replicas of IPng\u0026rsquo;s smtp-in cluster), it can insert a little cryptographic signature as it does this rewrite.\nIn the example above, the e-mail from jelle@luteijn.email will be rewritten to an envelope such as SRS0=CCIM=MT=luteijn.email=jelle@ipng.ch and while hideous, it is in the @ipng.ch domain. If a bounce for this e-mail were to be generated, PostSRSd can also rewrite in reverse, re-assembling the original envelope From when sending the bounce on to Jelle\u0026rsquo;s mailserver.\nI configure Postfix to do this using the sender and recipient canonical maps. I read these from a server running on localhost port 10001 and 10002 respectively. This is where PostSRSd does its magic.\nOh, what\u0026rsquo;s that I hear? The telephone is ringing! 1982 called, and it wants to change the title of [RFC821] from SMTP to CMTP (Convoluted Mail Transfer Protocol).\nVirtual: With all of that out of the way, I can now receive and forward aliased e-mails. I won\u0026rsquo;t be using local mail delivery (to unix users on the local machine), but rather I will forward the mails for my local users onwards to what is called a redundant maildrop server. So for the virtualized part of the Postfix config, I have things like this:\npim@smtp-in-chrma0:~$ cat /etc/postfix/virtual-domains ublog.tech\tublog.tech frys-ix.net\tfrys-ix.net ipng.nl\tipng.nl ipng.ch\tipng.ch ... 
pim@smtp-in-chrma0:~$ cat /etc/postfix/virtual ## Virtual domain: ipng.ch postmaster@ipng.ch pim+postmaster@maildrop.net.ipng.ch hostmaster@ipng.ch pim+hostmaster@maildrop.net.ipng.ch abuse@ipng.ch pim+abuse@maildrop.net.ipng.ch pim@ipng.ch pim+ipng@maildrop.net.ipng.ch noreply@ipng.ch /dev/null ... ## Virtual domain: ipng.nl @ipng.nl @ipng.ch ## Virtual domain: frys-ix.net postmaster@frys-ix.net pim+postmaster@maildrop.net.ipng.ch hostmaster@frys-ix.net pim+hostmaster@maildrop.net.ipng.ch abuse@frys-ix.net pim+abuse@maildrop.net.ipng.ch noc@frys-ix.net pim+frysix@maildrop.net.ipng.ch,noc@eritap.com pim@frys-ix.net pim+frysix@maildrop.net.ipng.ch arend@frys-ix.net arend+frysix@eritap.com event@frys-ix.net someplace@example.com ... The first file here virtual_alias_domains, simply explains to Postfix which domains it is to accept e-mail for. This avoids users trying to use it as a relay. If the domain is not listed in the lefthand side of the table, it\u0026rsquo;s not welcome here. But then once Postfix knows it\u0026rsquo;s supposed to be accepting e-mail for this domain, it will consult the virtual_alias_maps configuration. Here, I showed three domains, and a few features:\nI can simply forward along pim@ipng.ch to pim+ipng@maildrop.net.ipng.ch. Cool. I can toss the email by passing it to /dev/null (useful for things like noreply@ and nobody@) I can forward it to multiple recipients as well, for example noc@frys-ix.net goes to me and Eritap (hoi, Arend!) When such a forward happens, PostSRSd kicks in, and for that e-mail, the envelope rewrite will happen such that smtp-in can safely deliver this to even the strictest of SPF users.\nWhy no NGINX ? There\u0026rsquo;s an important technical reason for me not to be able to use an inbound loadbalancer, even though I\u0026rsquo;d love to frontend port 25 and 465 on IPng\u0026rsquo;s nginx cluster. I have enabled the use of DNSBL, which implies that Postfix needs to know the remotely connecting IPv4 and IPv6 addresses. While for domain-based blocklists this is not important, for IP based ones like zen.spamhaus.org it is critical. Therefore, I will assign a public IPv4 and IPv6 address to each of the machines in the cluster. They will be used in a round-robin way, and if one of them is down for a while, remote mail servers will automatically and gracefully use another replica.\nWith that, the public DNS entries:\nublog.tech.\t86400\tIN\tMX\t10 smtp-in.ipng.ch. ipng.nl.\t86400\tIN\tMX\t10 smtp-in.ipng.ch. ipng.ch.\t86400\tIN\tMX\t10 smtp-in.ipng.ch. smtp-in.ipng.ch.\t60\tIN\tA\t46.20.246.125 smtp-in.ipng.ch.\t60\tIN\tA\t94.142.245.85 smtp-in.ipng.ch.\t60\tIN\tA\t194.1.163.141 smtp-in.ipng.ch.\t60\tIN\tAAAA\t2a02:2528:ff00::125 smtp-in.ipng.ch.\t60\tIN\tAAAA\t2a02:898:146:1::5 smtp-in.ipng.ch.\t60\tIN\tAAAA\t2001:678:d78:6::141 3. Dovecot: maildrop Remember when I said that mail to pim@ipng.ch is forwarded to pim+ipng@maildrop.net.ipng.ch? Doing this allows me to have replicated, fully redundant, IMAP servers! As it turns out, Dovecot, a very popular open source pop3/imap server, has the ability to do realtime synchronization between multiple machines serving the same user.\nOn these servers, I\u0026rsquo;ll start with enabling Postfix only using the smtp and smtps transport in master.cf. The maildrop servers will be entirely within IPng Site Local, and cannot be reached from the internet directly, just the same as the smtp-out server replicas.\nPostfix on the server receives mail from the smtp-in servers as the final destination for an e-mail. 
It does this very similarly to the smtp-in server pool I described above, with two notable differences:\nIt does not need to do DNSBL lookups or spam analysis \u0026ndash; those have already happened upstream from these maildrop servers by the smtp-in servers. That\u0026rsquo;s also why these can be safely tucked away in IPng Site Local. Their virtual maps point to what is called an LMTP: Local Mail Transport Protocol, where I\u0026rsquo;ll ask Postfix to pump them into a redundantly replicated Dovecot pair. # Completely virtual virtual_alias_maps = hash:/etc/postfix/virtual-maildrop virtual_mailbox_domains = maildrop.net.ipng.ch virtual_transport = lmtp:unix:private/dovecot-lmtp pim@maildrop0-chbtl0:$ cat /etc/postfix/virtual-maildrop pim@maildrop.net.ipng.ch\tpim What I\u0026rsquo;ve done here is define only one virtual_mailbox_domains entry, for which I look up the users in the virtual_alias_maps and use a virtual_transport to deliver the end user (pim) to a unix domain socket in /var/spool/postfix/private/dovecot-lmtp. Once again, mail servers are super simple after you\u0026rsquo;ve spent ten hours reading configuration manuals and RFCs and asked at least three other people how they did theirs\u0026hellip;. Super\u0026hellip; Simple!\nDovecot By default, Dovecot ships with a very elaborate configuration file hierarchy. I decide to replace it with an autogenerated one from Ansible that has fewer includes (namely: none at all). Here are the features that I want to enable in Dovecot:\nUserDB: To define username, password and mail directory for users. LMTP: To be a local receptacle for the Postfix delivery. IMAP: To serve SSL enabled IMAP to mail clients like Mail.app, Thunderbird, Roundcube, etc. Replicator: To replicate mailboxes between pairs of Dovecot servers. Sieve: To allow users to create mail filters using the Sieve protocol. Starting from the easier bits, here\u0026rsquo;s how I configure the User Database in dovecot.conf:\npassdb { driver = passwd-file args = username_format=%n /etc/dovecot/maildrop-users } userdb { driver = passwd-file args = username_format=%n /etc/dovecot/maildrop-users default_fields = uid=vmail gid=vmail home=/var/dovecot/users/%n } mail_plugins = $mail_plugins notify push_notification replication mail_location = mdbox:~/mdbox I can add a user pim with an encrypted password from doveadm pw like so:\npim@maildrop0-chbtl0:/etc/dovecot$ sudo cat maildrop-users ... pim:{CRYPT}$2y$\u0026lt;some encrypted password goes here\u0026gt;:::: Due to the passdb option, this user can authenticate with username and password, and due to the userdb option, this user receives a mailbox homedir in the specified location. One important observation is that the unix user is vmail:vmail for every mailbox. This is pretty cool as it allows the whole mail delivery system to be virtualized under Dovecot\u0026rsquo;s guidance. Slick!\nThere are two tried-and-tested mailbox formats: Maildir and mbox. mbox is one giant file per mail folder, and can be expensive to search and sort and delete mails out of. Maildir is cheaper to search and sort and delete, but is essentially one file per e-mail, which is bulky. Dovecot has its own high-performance mailbox format, which is the best of both worlds: an indexed, append-only, chunked store called mdbox.
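Once a few mails have been delivered, a quick way to peek at the mdbox store is with doveadm (a sketch \u0026ndash; field names from memory, so double-check doveadm-mailbox(1) on the installed version):\npim@maildrop0-chbtl0:~$ sudo doveadm mailbox status -u pim \u0026#34;messages vsize\u0026#34; INBOX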
I learned more about the options and trade offs reading [this doc].\nDovecot: LMTP The following dovecot.conf snippet ties Postfix into Dovecot:\nprotocols = $protocols lmtp protocol lmtp { mail_plugins = $mail_plugins sieve } service lmtp { unix_listener /var/spool/postfix/private/dovecot-lmtp { mode = 0660 user = postfix group = postfix } } Recall that in Postfix above, the virtual_transport field specified the same location. This is how user pim gets mail handed to Dovecot. One other tidbit here is that the LMTP protocol enables a plugin called sieve. What this does, is upon receipt of each e-mail, a list of filters is run through, on the server side! It is here that I can tell Dovecot that some mail goes to different folders and sub-folders, some might be forwarded, marked read or discarded entirely. I\u0026rsquo;ll get to that in a minute.\nDovecot: IMAP Then, I enable SSL enabled IMAP in dovecot.conf:\ndisable_plaintext_auth = yes auth_mechanisms = plain login protocols = $protocols imap protocol imap { mail_max_userip_connections = 50 mail_plugins = $mail_plugins imap_sieve } service imap-login { inet_listener imap { port = 0 ## Disabled } inet_listener imaps { port = 993 } } With this snippet, I instruct Dovecot to disable any plain-text authentication, and use either plain or login challenges to authenticate users. I\u0026rsquo;ll disable the un-encrypted IMAP listener by setting its port to 0, and I\u0026rsquo;ll allow for an IMAP+SSL listener on the common port 993, which will be presenting a *.ipng.ch wildcard certificate that\u0026rsquo;s shared between all sorts of services at IPng.\nDovecot: Replication And now for something really magical. Dovecot can be instructed to replicate in multi-master (ie. read/write) mailboxes to remote machines also running Dovecot. This is called dsync and it\u0026rsquo;s hella cool! In reading the [docs], I take note that the same user should be directed to a stable replica in normal use, but changes do not get lost even if the same user modifies mails simultaneously on both replicas, some mails just might have to be redownloaded in that case. The replication is done by looking at Dovecot index files (not what exists in filesystem), so no mails get lost due to filesystem corruption or an accidental rm -rf, they will simply be replicated back.\nThis is amazing!! The configuration for it is remarkably straight forward:\nmail_plugins = $mail_plugins notify replication # Replication details replication_max_conns = 10 replication_full_sync_interval = 1h service aggregator { fifo_listener replication-notify-fifo { user = vmail group = vmail mode = 0660 } unix_listener replication-notify { user = vmail group = vmail mode = 0660 } } # Enable doveadm replicator commands service replicator { process_min_avail = 1 unix_listener replicator-doveadm { mode = 0660 user = vmail group = vmail } } doveadm_port = 63301 doveadm_password = \u0026lt;some password here\u0026gt; service doveadm { vsz_limit=512 MB inet_listener { port = 63301 } } plugin { mail_replica = tcp:maildrop0.ddln0.net.ipng.ch } To try to explain this - The first service, the aggregator opens some notification FIFOs that will notify listeners of new replication events. Then, Dovecot will start a process called replicator, which gets these cues when there is work to be done. It will connect to a mail_replica on another host, on the doveadm_port (in my case 63301) which is protected by a shared password. 
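Once both replicas are talking to each other, the replication state can be inspected on either side (a sketch \u0026ndash; doveadm syntax from memory):\npim@maildrop0-chbtl0:~$ sudo doveadm replicator status \u0026#39;*\u0026#39;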
And with that, all e-mail that is delivered via LMTP on this machine, is both retrievable via IMAPS but also gets copied to the remote machine maildrop0.ddln0.net.ipng.ch (and in its configuration, it\u0026rsquo;ll synchronize mail to maildrop0.chbtl0.net.ipng.ch). Nice!\nDovecot: Sieve Having a flat mailbox is just no fun (unless you\u0026rsquo;re using GMail, in which case: tolerable). Enter Sieve, described in [RFC5228]. Scripts written in Sieve are executed during final delivery, when the message is moved to the user-accessible mailbox. In systems where the Mail Transfer Agent (MTA) does final delivery, such as traditional Unix mail, it is reasonable to filter when the MTA deposits mail into the user\u0026rsquo;s mailbox.\nprotocols = $protocols sieve plugin { sieve = ~/.dovecot.sieve sieve_global_path = /etc/dovecot/sieve/default.sieve sieve_dir = ~/sieve sieve_global_dir = /etc/dovecot/sieve/ sieve_extensions = +editheader sieve_before = /etc/dovecot/sieve/before.d sieve_after = /etc/dovecot/sieve/after.d } plugin { sieve_plugins = sieve_imapsieve sieve_extprograms # From elsewhere to Junk folder imapsieve_mailbox1_name = Junk imapsieve_mailbox1_causes = COPY imapsieve_mailbox1_before = file:/etc/dovecot/sieve/report-spam.sieve # From Junk folder to elsewhere imapsieve_mailbox2_name = * imapsieve_mailbox2_from = Junk imapsieve_mailbox2_causes = COPY imapsieve_mailbox2_before = file:/etc/dovecot/sieve/report-ham.sieve sieve_pipe_bin_dir = /etc/dovecot/sieve sieve_global_extensions = +vnd.dovecot.pipe } This is a mouthful, but only because it\u0026rsquo;s hella cool. By default, each mailbox will have a .dovecot.sieve file that is consulted at each delivery. If no file exists there, the default sieve will be used. But also, some sieve filters might happen either sieve_before the users\u0026rsquo; one is called, or sieve_after. Then, in the plugin I create two specific triggers:\nif a file is copied to the Junk folder, I will run it through a script called report-spam.sieve. similarly, if it is moved out of the Junk folder, I\u0026rsquo;ll run the report-ham.sieve script. 
Using an rspamc client, I can wheel over the cluster of Rspamd servers one by one and offer them these two events (the train-spam and train-ham are similar, so I\u0026rsquo;ll only show one):\npim@maildrop0-chbtl0:~$ cat /etc/dovecot/sieve/report-spam.sieve require [\u0026#34;vnd.dovecot.pipe\u0026#34;, \u0026#34;copy\u0026#34;, \u0026#34;imapsieve\u0026#34;, \u0026#34;environment\u0026#34;, \u0026#34;variables\u0026#34;]; if environment :matches \u0026#34;imap.email\u0026#34; \u0026#34;*\u0026#34; { set \u0026#34;email\u0026#34; \u0026#34;${1}\u0026#34;; } pipe :copy \u0026#34;train-spam.sh\u0026#34; [ \u0026#34;${email}\u0026#34; ]; pim@maildrop0-chbtl0:~$ cat /etc/dovecot/sieve/train-spam.sh logger learning spam /usr/bin/rspamc -h rspamd.net.ipng.ch:11332 learn_spam Users of Dovecot can now add their Sieve configs to their mailbox:\npim@maildrop0-chbtl0:~$ sudo ls -la /var/dovecot/users/pim/ lrwxrwxrwx 1 vmail vmail 19 Apr 2 17:27 .dovecot.sieve -\u0026gt; sieve/ipng_v1.sieve -rw------- 1 vmail vmail 2113 Mar 29 11:35 .dovecot.sieve.log -rw------- 1 vmail vmail 5001 May 14 14:34 .dovecot.svbin drwx------ 4 vmail vmail 4096 May 17 16:02 mdbox drwx------ 3 vmail vmail 4096 May 14 14:30 sieve but seeing as (a) it\u0026rsquo;s tedious to have to edit these files on multiple dovecot replicas, and (b) my users will not receive access to vmail user in order to actually do that, as it would be a security risk, I need one more thing.\nDovecot: IMAP Sieve Dovecot has an implementation of a replication-aware Sieve filter editor called managesieve:\nservice managesieve-login { inet_listener sieve { port = 4190 } } service managesieve { process_limit = 256 } protocol sieve { } It will use the IMAP credentials to allow users to edit their Sieve filter online. For example, Thunderbird has a plugin for it, which does syntax checking and what-not. When the filter is edited, it is syntax checked, compiled and replicated to the other Dovecot instance.\nNGINX I have an imap server and a mangesieve server, redundantly running on two Dovecot machines. I recall reading in the Dovecot manual that it is slightly preferable to have users go to a consistent replica and not bounce around between them. Luckily, I can do exactly that using the NGINX frontends:\nupstream imap { server maildrop0.chbtl0.net.ipng.ch:993 fail_timeout=10s max_fails=2; server maildrop0.ddln0.net.ipng.ch:993 fail_timeout=10s max_fails=2 backup; } server { listen [::]:993; listen 0.0.0.0:993; proxy_pass imap; } upstream sieve { server maildrop0.chbtl0.net.ipng.ch:4190 fail_timeout=10s max_fails=2; server maildrop0.ddln0.net.ipng.ch:4190 fail_timeout=10s max_fails=2 backup; } server { listen [::]:4190; listen 0.0.0.0:4190; proxy_pass sieve; } I keep port 993 for maildrop as well as port 587 for smtp-out unfiltered on the NGINX cluster. I\u0026rsquo;m a little bit more protective of the managesieve service, so port 4190 is allowed only when users are connected to the VPN or the internal (office/home) network.\nNow, you\u0026rsquo;ll recall that in the smtp-in servers, I forward mail to pim@maildrop.net.ipng.ch, for which the redundant Dovecot servers are both accepting mail. On the way in, I can see to it that the primary replica is used , by giving it a slightly lower preference in DNS MX records:\nmaildrop.net.ipng.ch.\t300 IN\tMX\t10 maildrop0.chbtl0.net.ipng.ch. maildrop.net.ipng.ch.\t300 IN\tMX\t20 maildrop0.ddln0.net.ipng.ch. imap.ipng.ch. 60 IN\tCNAME nginx0.ipng.ch. nginx0.ipng.ch. 600 IN A 194.1.163.151 nginx0.ipng.ch. 
600 IN A 46.20.246.124 nginx0.ipng.ch. 600 IN A 94.142.241.189 nginx0.ipng.ch. 600 IN AAAA 2001:678:d78:7::151 nginx0.ipng.ch. 600 IN AAAA 2a02:2528:ff00::124 nginx0.ipng.ch. 600 IN AAAA 2a02:898:146::5 This will make the smtp-in hosts prefer to use the chbtl0 maildrop replica when it\u0026rsquo;s available. If ever it were to go down, they will automatically fail over and use ddln0, which will replicate back any changes while chbtl0 is down for maintenance or hardware failure. On the way to out, the nginx cluster will prefer to use chbtl0 as well, as it has marked the ddln0 replica as backup.\n4. Webmail: Roundcube Now that I have all of the infrastructure up and running, I thought I\u0026rsquo;d put some icing on the cake with Roundcube, a web-based IMAP email client. Roundcube\u0026rsquo;s most prominent feature is the pervasive use of Ajax technology. It also comes with an online Sieve editor, and runs in Docker. What more can I ask for?\nInstalling it is really really easy in my case. Since I have an nginx cluster to frontend it and do the SSL offloading, I choose the simplest version with the following docker-compose.yaml:\nversion: \u0026#39;2\u0026#39; services: roundcubemail: image: roundcube/roundcubemail:latest container_name: roundcubemail volumes: - ./www:/var/www/html - ./db/sqlite:/var/roundcube/db ports: - 9002:80 environment: - ROUNDCUBEMAIL_DB_TYPE=sqlite - ROUNDCUBEMAIL_SKIN=elastic - ROUNDCUBEMAIL_DEFAULT_HOST=ssl://maildrop0.net.ipng.ch - ROUNDCUBEMAIL_DEFAULT_PORT=993 - ROUNDCUBEMAIL_SMTP_SERVER=tls://smtp-out.net.ipng.ch - ROUNDCUBEMAIL_SMTP_PORT=587 There\u0026rsquo;s a small snag, in that by default the SMTP user and password are expected to be the same as for the IMAP server, which is not the case for my design. So, I create a user roundcube on the smtp-out cluster and give it a suitable password. I nose around a little bit, and decide my preference is to have threaded view by default, and I also enable the managesieve plugin:\n$config[\u0026#39;log_driver\u0026#39;] = \u0026#39;stdout\u0026#39;; $config[\u0026#39;zipdownload_selection\u0026#39;] = true; $config[\u0026#39;des_key\u0026#39;] = \u0026#39;\u0026lt;this key of sorts\u0026gt;\u0026#39;; $config[\u0026#39;enable_spellcheck\u0026#39;] = true; $config[\u0026#39;spellcheck_engine\u0026#39;] = \u0026#39;pspell\u0026#39;; $config[\u0026#39;smtp_user\u0026#39;] = \u0026#39;roundcube\u0026#39;; $config[\u0026#39;smtp_pass\u0026#39;] = \u0026#39;\u0026lt;something or other\u0026gt;\u0026#39;; $config[\u0026#39;plugins\u0026#39;] = array(\u0026#39;managesieve\u0026#39;); $config[\u0026#39;managesieve_host\u0026#39;] = \u0026#39;tls://maildrop0.net.ipng.ch:4190\u0026#39;; $config[\u0026#39;default_list_mode\u0026#39;] = \u0026#39;threads\u0026#39;; I start the docker containers, and very quickly after, Roundcube shoots to life. 
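For completeness, that boils down to just a couple of commands. A quick sketch, assuming the docker-compose.yaml above lives in the current directory on the Docker host and nothing else is listening on port 9002:

pim@docker0-frggh0:~/roundcube$ docker compose up -d    # or 'docker-compose up -d' with the older CLI
pim@docker0-frggh0:~/roundcube$ curl -sI http://localhost:9002/ | head -1

If the container came up healthy, that curl returns an HTTP status line from Roundcube; anything else means it's time to look at 'docker logs roundcubemail'.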
I can expose it behind the nginx cluster, while keeping it accessible only for VPN + office/home network users:\nserver { listen [::]:80; listen 0.0.0.0:80; server_name webmail.ipng.ch webmail.net.ipng.ch webmail; access_log /var/log/nginx/webmail.ipng.ch-access.log; include /etc/nginx/conf.d/ipng-headers.inc; location / { return 301 https://webmail.ipng.ch$request_uri; } } geo $allowed_user { default 0; include /etc/nginx/conf.d/geo-ipng.inc; } server { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/certs/ipng.ch/fullchain.pem; ssl_certificate_key /etc/certs/ipng.ch/privkey.pem; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; server_name webmail.ipng.ch; access_log /var/log/nginx/webmail.ipng.ch-access.log upstream; include /etc/nginx/conf.d/ipng-headers.inc; if ($allowed_user = 0) { rewrite ^ https://ipng.ch/ break; } location / { proxy_pass http://docker0.frggh0.net.ipng.ch:9002; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; } } The configuration has one neat trick in it \u0026ndash; it uses the geo module in NGINX to assert that the client address is used to set the value of allowed_user. It will be 1 if the client connected from any network defined in the geo-ipng.inc file, and 0 otherwise. I then use it to bounce unwanted visitors back to the main [website], and expose Roundcube for those that are welcome.\nWhile the Roundcube instance is not replicated, it\u0026rsquo;s also non-essential. I will be using Thunderbird, Mail.app and other clients more regularly than Roundcube. It may just be handy in a pinch to either check mail using a browser, but also to edit the Sieve filters easily.\nIn my defense, considering roundcube is pretty much stateless, I can actually just run multiple copies of it on a few docker hosts at IPng \u0026ndash; then in the nginx configs I might use a similar construct as for the maildrop and smtp-out services, with a primary and hot standby. But that will be for that one day that the docker host in Lille dies AND I decided I absolutely require Roundcube precisely on that day :)\n","date":"2024-05-17","desc":"Intro I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for the last few years, I\u0026rsquo;ve been more and more inclined to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\n","permalink":"https://ipng.ch/s/articles/2024/05/17/case-study-ipngs-mail-servers/","section":"articles","title":"Case Study: IPng's mail servers"},{"contents":"Introduction Tier1 and aspiring Tier2 providers interconnect only in large metropolitan areas, due to commercial incentives and politics. They won\u0026rsquo;t often peer with smaller providers, because why peer with a potential customer? Due to this, it’s entirely likely that traffic between two parties in Thessaloniki is sent to Frankfurt or Milan and back.\nOne possible antidote to this is to connect to a local Internet Exchange point. 
Not all ISPs have access to large metropolitan datacenters where larger internet exchanges have a point of presence, and it doesn\u0026rsquo;t help that the datacenter operator is happy to charge a substantial amount of money each month, just for the privilege of having a passive fiber cross connect to the exchange. Many Internet Exchanges these days ask for per-month port costs and meter the traffic with policers and rate limiters, such that the total cost of peering starts to exceed what one might pay for transit, especially at low volumes, which further exacerbates the problem. Bah.\nThis is an unfortunate market effect (the race to the bottom), where transit providers are continuously lowering their prices to compete. And while transit providers can make up to some extent due to economies of scale, at some point they are mostly all of equal size, and thus the only thing that can flex is quality of service.\nThe benefit of using an Internet Exchange is to reduce the portion of an ISP’s (and CDN’s) traffic that must be delivered via their upstream transit providers, thereby reducing the average per-bit delivery cost and as well reducing the end to end latency as seen by their users or customers. Furthermore, the increased number of paths available through the IXP improves routing efficiency and fault-tolerance, and it avoids traffic going the scenic route to a large hub like Frankfurt, London, Amsterdam, Paris or Rome, if it could very well remain local.\nIPng Networks really believes in an open and affordable Internet, and I would like to do my part in ensuring the internet stays accessible for smaller parties.\nSmöl IXPs One notable problem with small exchanges, like for example [FNC-IX] in the Paris metro, or [CHIX-CH], [Community IX] and [Free-IX] in the Zurich metropolitan area, is that they are, well, small. They may be cheaper to connect to, in some cases even free, but they don\u0026rsquo;t have a sizable membership which means that there is inherently less traffic flowing, which in turn makes it less appealing for prospect members to connect to.\nAt IPng, I have partnered with a few super cool ISPs and carriers to offer a Free Internet Exchange platform. Just to head the main question off at the pass: Free here actually does mean \u0026ldquo;Free as in beer\u0026rdquo; or [Gratis], a gift to the community that does not cost money. It also more philosophically wants to be \u0026ldquo;Free as in open, and transparent\u0026rdquo; or [Libre].\nTwo examples are:\n[Free IX: Switzerland] with POPs at STACK GEN01 Geneva, NTT Zurich and Bancadati Lugano. [Free IX: Greece] with POPs at TISparkle in Athens and Balkan Gate in Thessaloniki. .. but there are actually quite a few out there once you start looking :)\nGrowing Smöl IXPs Some internet exchanges break through the magical 1Tbps barrier (and get a courtesy callout on Twitter from Dr. King), but many remain smöl. Perhaps it\u0026rsquo;s time to break the chicken-and-egg problem. What if there was a way to interconnect these exchanges?\nLet\u0026rsquo;s take for example the Free IX in Greece that was announced at GRNOG16 in Athens on April 19th. This exchange initially targets Athens and Thessaloniki, with 2x100G between the two cities. Members can connect to either site for the cost of only a cross connect. The 1G/10G/25G ports will be Gratis. But I will be connecting one very special member to Free IX Greece, AS50869:\nFree IX: Remote Here\u0026rsquo;s what I am going to build. 
The Free IX Remote project offers an outreach infrastructure which connects to internet exchange points, and allows members to benefit from that in the following way:\nFreeIX uses AS50869 to peer with any network operator who is available at public internet exchanges or using private interconnects. It looks like a normal service provider in this regard. It will connect to internet exchanges, and learn a bunch of routes. FreeIX members can join the program, after which they are granted certain propagation permissions by FreeIX at the point where they have a BGP session with AS50869. The prefixes learned on these member sessions are marked as such, and will be allowed to propagate. Members will receive some or all learned prefixes from AS50869. FreeIX members can set fine grained BGP communities to determine which of their prefixes are propagated and at which locations. Members at smaller internet exchanges greatly benefit from this type of outreach, by receiving large portions of the public internet directly at their preferred peering location. Similarly, the Free IX Remote routers will carry their traffic to these remote internet exchanges.\nDetailed Design Peer types There are two types of BGP neighbor adjacency:\nMembers: these are {ip-address,AS}-tuples which FreeIX has explicitly configured. Learned prefixes are added to as-set AS50869:AS-MEMBERS. Members receive some or all prefixes from FreeIX, each annotated with BGP informational communities, and members can drive certain behavior with BGP action communities.\nPeers: these are all other entities with whom FreeIX has an adjacency at public internet exchanges or private network interconnects. Peers receive some (or all) member prefixes from FreeIX and cannot drive any behavior with communities. With respect to internet exchanges and peers, AS50869 looks like a completely normal ISP, advertising subsets of the customer AS cone from AS50869:AS-MEMBERS at each exchange point.\nBGP sessions with members use strict ingress filtering by means of bgpq4, and will be tagged with a set of informational BGP communities, such as where the prefix was learned, and what propagation permissions that it received (eg. at which internet exchanges will it be allowed to be announced). Of course, prefixes that are RPKI invalid will be dropped, while valid and unknown prefixes will be accepted. Members are granted permissions by FreeIX, which determine where their prefixes will be announced by AS50869. Further, members can perform optional actions by means of BGP communities at their ingress point, to inhibit announcements to a certain peer or at a given exchange point.\nPeers on the other hand are not granted any permissions and all action BGP communities will be stripped on prefixes learned. Informational communities will still be tagged on learned prefixes. Two things happen here. Firstly, members will be offered only those prefixes for which they have permission \u0026ndash; in other words, I will create a configuration file that says member AS8298 may receive prefixes learned from Frys-IX. 
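As a sketch of what that could look like — every key name here is invented, the real format is something I'll settle on in the follow-up article — a YAML description per member might be as simple as:

members:
  AS8298:
    as_set: AS8298:AS-IPNG        # illustrative; feeds the strict bgpq4 ingress filters
    receive:
      ixp: [ frys-ix ]            # offer this member the prefixes learned at FrysIX

A generator can then expand this into the per-session Bird configuration on AS50869's routers.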
Secondly, even for those prefixes that are advertised, the member AS8298 can use the informational communities to further filter what they accept from Free IX Remote AS50869.\nBGP Classic Communities Members are allowed to set the following legacy action BGP communities for coarse grained distribution of their prefixes through the FreeIX network.\n(50869,0) or (50869,3000) do not announce anywhere (50869,666) or (65535,666) blackhole everywhere (can be on any more specific from the member\u0026rsquo;s AS-SET) (50869,3100) prepend once everywhere (50869,3200) prepend twice everywhere (50869,3300) prepend three times everywhere Peers, on the other hand, are not allowed to set any communities, so all classic BGP communities from them are stripped on ingress.\nBGP Large Communities Free IX Remote will use three types of BGP Large Communities, which each serve a distinct purpose:\nInformational: These communities are set by the FreeIX router when learning a prefix. They cannot be set by peers or members, and will be stripped on ingress. They will be sent to both members and peers, allowing operators to choose which prefixes to learn based on their origin details, like which country or internet exchange they were learned at.\nPermission: These communities are also set by FreeIX operators when learning a prefix (eg. on the ingress router). They cannot be set by peers or members, and will be stripped on ingress. The permission communities determine where FreeIX will allow the prefix to propagate. They will be stripped on egress.\nAction: Based on the permissions, members can further steer announcements by sending certain action communities to FreeIX. These actions cannot be sent by peers, but in certain cases they can be set by FreeIX operators on ingress. Similarly to the permission communties, all action communities will be stripped on egress.\nRegular peers of AS50869 at exchange points and private network interconnects will not be able to set any communities, so all large BGP communities from them are stripped on ingress.\nInformational Communities When FreeIX routers learn prefixes, they will annotate them with certain communities. For example, the router at Amsterdam NIKHEF (which is router #1, country #2), when learning a prefix at FrysIX (which is ixp #1152), will set the following BGP large communities:\n(50869,1010,1): Informational (10XX), Router (1010), vpp0.nlams0.free-ix.net (1) (50869,1020,2): Informational (10XX), Country (1020), Netherlands (2) (50869,1030,1152): Informational (10XX), IXP (1030), PeeringDB IXP for FrysIX (1152) When propagating these prefixes to neighbors (both members and peers), these informational communities can be used to determine local policy, for example by setting a different localpref or dropping prefixes from a certain location. Informational communities can be read, but they can\u0026rsquo;t be set by peers or members \u0026ndash; they are always cleared by FreeIX routers when learning prefixes, and as such the only routers which will set them are the FreeIX ones.\nPermission Communities FreeIX maintains a list of permissions per member. When members announce their prefixes to FreeIX routers, these permissions communities are set. 
They determine what the member is allowed to do with FreeIX propagation - notably which routers, countries, and internet exchanges the member will be allowed to propagate to.\nUsually, member prefixes are allowed to propagate everywhere, so the following communities might be set by the FreeIX router on ingress:\n(50869,2010,0): Permission (20XX), Router (2010), everywhere (0) (50869,2020,0): Permission (20XX), Country (2020), everywhere (0) (50869,2030,0): Permission (20XX), IXP (2030), everywhere (0) If the member prefixes are allowed to propagate only to certain places, the \u0026rsquo;everywhere\u0026rsquo; communities will not be set, and instead lists of communities with finer grained permissions can be used, for example:\n(50869,2010,2): Permission (20XX), Router (2010), vpp0.grskg0.free-ix.net (2) (50869,2020,3): Permission (20XX), Country (2020), Greece (3) (50869,2030,60): Permission (20XX), IXP (2030), PeeringDB IXP for SwissIX (60) Permission communities can\u0026rsquo;t be set by peers, nor by members \u0026ndash; they are always cleared by FreeIX routers when learning prefixes, and are configured explicitly by FreeIX operators.\nAction Communities Based on the permission communities, zero or more egress routers, countries and internet exchanges are eligible to propagate member prefixes by AS50869 to its peers. Members can define very fine grained action communities to further tweak which prefixes propagate on which routers, in which countries and towards which internet exchanges and private network interconnects:\n(50869,3010,3): Inhibit Action (30XX), Router (3010), vpp0.gratt0.free-ix.net (3) (50869,3020,1): Inhibit Action (30XX), Country (3020), Switzerland (1) (50869,3030,1308): Inhibit Action (30XX), IXP (3030), PeeringDB IXP for LS-IX (1308) Four actions can be placed on a per-remote-asn basis:\n(50869,3040,13030): Inhibit Action (30XX), AS (3040), Init7 (AS13030) (50869,3100,6939): Prepend Once Action (3100), Hurricane Electric (AS6939) (50869,3200,12859): Prepend Twice Action (3200), BIT BV (AS12859) (50869,3300,8283): Prepend Thice Action (3300), Coloclue (AS8283) Peers cannot set these actions, as all action communities will be stripped on ingress. Members can set these action communities on their sessions with FreeIX routers, however in some cases they may also be set by FreeIX operators when learning prefixes.\nWhat\u0026rsquo;s next Perhaps this interaction between informational, permission and action BGP communities gives you an idea on how such a network may operate. It\u0026rsquo;s somewhat different to a classic Transit provider, in that AS50869 will not carry a full table. It\u0026rsquo;ll merely provide a form of partial transit from member A at IXP #1, to and from all peers that can be found at IXPs #2-#N. Makes the mind boggle? Don\u0026rsquo;t worry, we\u0026rsquo;ll figure it out together :)\nIn an upcoming article I\u0026rsquo;ll detail the programming work that goes into implementing this complex peering policy in Bird2 as driving VPP routers (duh), with an IGP that is IPv4-less, because at this point, I [may as well] put my money where my mouth is.\nIf you\u0026rsquo;re interested in this kind of stuff, take a look at the IPng Networks AS8298 [Routing Policy]. 
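Just to give a taste of what the member-facing side of that might look like: a member running Bird2 could attach the action communities from the tables above in an export filter on its session towards AS50869. A sketch only — the filter name is mine, and whether the actions have any effect still depends on the permissions FreeIX granted that member:

filter freeix_actions {
  # Prepend once towards Hurricane Electric (AS6939), everywhere:
  bgp_large_community.add((50869, 3100, 6939));
  # Inhibit announcement at LS-IX (PeeringDB IXP 1308):
  bgp_large_community.add((50869, 3030, 1308));
  accept;
}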
Similar to that one, this one will use a combination of functional programming, templates, and clever expansions to make a customized per-member and per-peer configuration based on a YAML input file which dictates which member and which prefix is allowed to go where.\nFirst, I need to get a replacement router for the Thessaloniki router, which will run VPP of course. My buddy Antonis noticed that there are CPU and/or DDR errors on that chassis, so it may need to be RMAd. But once it\u0026rsquo;s operational, I will start by deploying one instance in Amsterdam NIKHEF, and another in Thessaloniki Balkan Gate, with a 100G connection between them, graciously provided by [LANCOM]. Just look at that FD.io hound runnnnn!!1\n","date":"2024-04-27","desc":"Introduction Tier1 and aspiring Tier2 providers interconnect only in large metropolitan areas, due to commercial incentives and politics. They won\u0026rsquo;t often peer with smaller providers, because why peer with a potential customer? Due to this, it’s entirely likely that traffic between two parties in Thessaloniki is sent to Frankfurt or Milan and back.\nOne possible antidote to this is to connect to a local Internet Exchange point. Not all ISPs have access to large metropolitan datacenters where larger internet exchanges have a point of presence, and it doesn\u0026rsquo;t help that the datacenter operator is happy to charge a substantial amount of money each month, just for the privilege of having a passive fiber cross connect to the exchange. Many Internet Exchanges these days ask for per-month port costs and meter the traffic with policers and rate limiters, such that the total cost of peering starts to exceed what one might pay for transit, especially at low volumes, which further exacerbates the problem. Bah.\n","permalink":"https://ipng.ch/s/articles/2024/04/27/freeix-remote-part-1/","section":"articles","title":"FreeIX Remote - Part 1"},{"contents":" Introduction A few weeks ago I took a good look at the [Babel] protocol. I found a set of features there that I really appreciated. The first was a latency aware routing protocol - this is useful for mesh (wireless) networks but it is also a good fit for IPng\u0026rsquo;s usecase, notably because it makes use of carrier ethernet which, if any link in the underlying MPLS network fails, will automatically re-route but sometimes with much higher latency. In these cases, Babel can reconverge on its own to a topology that has the lowest end to end latency.\nBut a second really cool find, is that Babel can use IPv6 nexthops for IPv4 destinations - which is super useful because it will allow me to retire all of the IPv4 /31 point to point networks between my routers. AS8298 has about half of a /24 tied up in these otherwise pointless (pun intended) transit networks.\nIn the same week, my buddy Benoit asked a question about OSPFv3 on the Bird users mailinglist [ref] which may or may not have been because I had been messing around with Babel using only IPv4 loopback interfaces. And just a few weeks before that, the incomparable Nico from [Ungleich] had a very similar question [ref].\nThese three folks have something in common - we\u0026rsquo;re all trying to conserve IPv4 addresses!\nOSPFv3 with IPv4 🙁 Nico\u0026rsquo;s thread referenced [RFC 5838] which defines support for multiple address families in OSPFv3. 
It does this by mapping a given address family to a specific instance of OSPFv3 using the instance id and adding a new option to the options field that tells neighbors that multiple address families are supported in this instance (and thus, that the neighbor should not assume all link state advertisements are IPv6-only).\nThis way, multiple instances can run on the same router, and they will only form adjacencies with neighbors that are operating in the same address family. This in itself doesn\u0026rsquo;t change much: rather than using IPv4 multicast in the hello\u0026rsquo;s while forming adjacencies, OSPFv3 will use IPv6 link local addresses for them.\nRFC 5838, Section 2.5 says:\nAlthough IPv6 link local addresses could be used as next hops for IPv4 address families, it is desirable to have IPv4 next-hop addresses. [ \u0026hellip; ] In order to achieve this, the link\u0026rsquo;s IPv4 address will be advertised in the \u0026ldquo;link local address\u0026rdquo; field of the IPv4 instance\u0026rsquo;s Link-LSA. This address is placed in the first 32 bits of the \u0026ldquo;link local address\u0026rdquo; field and is used for IPv4 next-hop calculations. The remaining bits MUST be set to zero.\nFirst my hopes are raised by saying IPv6 link local addresses could be used as next hops (just like Babel, yaay!), but then it goes on to say the link local address field will be overridden with an IPv4 address in the top 32 bits. That\u0026rsquo;s \u0026hellip; gross. I understand why this was done, it allows for a minimal deviation of the OSPFv3 protocol, but this unfortunate choice precludes the ability for IPv6 nexthops to be used. Crap on a cracker!\nOSPFv3 with IPv4 🥰 But wait, not all is lost! Remember in my [VPP Babel] article I mentioned that VPP has this ability to run unnumbered interfaces? To recap, this is a configuration where a primary interface, typically a loopback, will have an IPv4 and IPv6 address, say 192.168.10.2/32 and 2001:678:d78:200::2/128 and other interfaces will borrow from that. That will allow for the IPv4 address to be present on multiple interfaces, like so:\npim@vpp0-2:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64 e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64 e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64 VPP Changes Historically in VPP, broadcast medium like ethernet will respond to ARP requests only if the requestor is in the same subnet. With these point to point interfaces, the remote will never be in the same subnet, because we\u0026rsquo;re using /32 addresses here! VPP logs these as invalid ARP requests. With a small change though, I can make VPP become tolerant of this scenario, and the consensus in the VPP community is that this is OK.\nCheck out [40482] for the full change, but in a nutshell, just before deciding to return an error because the requesting source address is not directly connected (called an attached route in VPP), I\u0026rsquo;ll change the condition to allow for it, if and only if the ARP request comes from an unnumbered interface.\nI think this is a good direction, if only because most other popular implementations (including Linux, FreeBSD, Cisco IOS/XR and Juniper) will answer ARP requests that are onlink but not directly connected, in the same way.\nBird2 Changes Meanwhile, in the Bird community, we were thinking about solving this problem in a different way. 
Babel allows a feature to use IPv6 transit networks with IPv4 destinations, by specifying an option called extended next hop. With this option, Babel will set a nexthop across address families. It may sound freaky at first, but it\u0026rsquo;s not too strange when you think about it. Take a look at my explanation in the [Babel] article on how IPv6 neighbor discovery can take the place of IPv4 ARP resolution to figure out the ethernet next hop.\nSo our initial take was: why don\u0026rsquo;t we do that with OSPFv3 as well? We thought of a trick to get that Link LSA hack from RFC5838 removed: what if Bird, upon setting the extended next hop feature on an interface, would simply put the IPv6 address back like it was, rather than overwriting it with the IPv4 address? That way, we\u0026rsquo;d just learn routes to IPv4 destinations with nexthops on IPv6 linklocal addresses. It would break compatibility with other vendors, but seeing as it is an optional feature which defaults to off, perhaps it is a reasonable compromise\u0026hellip;\nOndrej started to work on it, but came back a few days later with a different solution, which is quite clever. Any IPv4 router needs at least one IPv4 address anyways, to be able to send ICMP messages, so there is no need to put IPv4 addresses on links. Ondrej\u0026rsquo;s theory corroborates my previous comments for Babel\u0026rsquo;s IPv4-less routing:\nI’ve learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets, as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to originate ICMPv4 packets, therefore it needs at least one IPv4 address.\nThese two claims mean that I need at most one IPv4 address on each router.\nOndrej\u0026rsquo;s proposal for Bird2 will, when OSPFv3 is used with IPv4 destinations, keep the RFC5838 behavior and try to find a working IPv4 address to put in the Link LSA: He adds a function update_loopback_addr(), which scans all interfaces for an IPv4 address, and if there are multiple, prefer host addresses, then addresses from OSPF stub interfaces, and finally just any old IPv4 address. Now that IPv4 address can be simply used to put in the Link LSA. Slick!\nHis change also removes next-hop-in-address-range check for OSPFv3 when using IPv4, and automatically adds onlink flag to such routes, which newly accepts next hops that are not directly connected: I realize when reading the code that this change paired with the [Gerrit] are perfect partners:\nOndrej\u0026rsquo;s change will make the Link LSA be onlink, which is a way to describe that the next hop is not directly connected, in other words nexthop 192.168.10.3/32, while the router itself is 192.168.10.2/32. My change will make VPP answer for ARP requests in such a scenario where the router with an unnumbered interface with 192.168.10.3/32 will respond to a request from the not directly connected onlink peer at 192.168.10.2. Tying it together With all of that, I am ready to demonstrate two working solutions now. I first compile Bird2 with Ondrej\u0026rsquo;s [commit]. Then, I compile VPP with my pending [gerrit]. 
Finally, to demonstrate how update_loopback_addr() might work, I compile lcpng with my previous [commit], which allows me to inhibit copying forward addresses from VPP to Linux, when using unnumbered interfaces.\nI take an IPng lab instance out for a spin with this updated Bird2 and VPP+lcpng environment:\nSolution 1: Somewhat unnumbered I configure an otherwise empty VPP dataplane as follows:\nvpp0-3# lcp lcp-sync on vpp0-3# lcp lcp-sync-unnumbered on vpp0-3# create loopback interface instance 0 vpp0-3# set interface state loop0 up vpp0-3# set interface ip address loop0 192.168.10.3/32 vpp0-3# set interface ip address loop0 2001:678:d78:200::3/128 vpp0-3# set interface mtu 9000 GigabitEthernet10/0/0 vpp0-3# set interface mtu packet 9000 GigabitEthernet10/0/0 vpp0-3# set interface unnumbered GigabitEthernet10/0/0 use loop0 vpp0-3# set interface state GigabitEthernet10/0/0 up vpp0-3# lcp create loop0 host-if loop0 vpp0-3# lcp create GigabitEthernet10/0/0 host-if e0 Which yields the following configuration:\npim@vpp0-3:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64 e0 UP 192.168.10.3/32 2001:678:d78:200::3/128 fe80::5054:ff:fef0:1130/64 pim@vpp0-3:~$ ip route get 182.168.10.2 RTNETLINK answers: Network is unreachable I can see that VPP copied forward the IPv4/IPv6 addresses to interface e0, and because there\u0026rsquo;s no routing protocol running yet, the neighbor router vpp0-2 is unreachable. Let me fix that, next. I start bird in the VPP dataplane network namespace, and configure it as follows:\nrouter id 192.168.10.3; protocol device { scan time 30; } protocol direct { ipv4; ipv6; check link yes; } protocol kernel kernel4 { ipv4 { import none; export where source != RTS_DEVICE; }; learn off; scan time 300; } protocol kernel kernel6 { ipv6 { import none; export where source != RTS_DEVICE; }; learn off; scan time 300; } protocol bfd bfd1 { interface \u0026#34;e*\u0026#34; { interval 100 ms; multiplier 20; }; } protocol ospf v3 ospf4 { ipv4 { export all; import where (net ~ [ 192.168.10.0/24+, 0.0.0.0/0 ]); }; area 0 { interface \u0026#34;loop0\u0026#34; { stub yes; }; interface \u0026#34;e0\u0026#34; { type pointopoint; cost 5; bfd on; }; }; } protocol ospf v3 ospf6 { ipv6 { export all; import where (net ~ [ 2001:678:d78:200::/56, ::/0 ]); }; area 0 { interface \u0026#34;loop0\u0026#34; { stub yes; }; interface \u0026#34;e0\u0026#34; { type pointopoint; cost 5; bfd on; }; }; } This minimal Bird2 configuration will configure the main protocols device, direct, and two kernel protocols kernel4 and kernel6, which are instructed to export learned routes from the kernel for all but directly connected routes (because the Linux kernel and VPP already have these when an interface is brought up, this avoids duplicate connected route entries).\nIf you haven\u0026rsquo;t come across it yet, Bidirectional Forwarding Detection or BFD is a protocol that repeatedly sends UDP packets between routers, to be able to detect if the forwarding is interrupted even if the interface link stays up. It\u0026rsquo;s described in detail in [RFC5880], and I use it at IPng Networks all over the place.\nThen I\u0026rsquo;ll configure two OSPF protocols, one for IPv4 called ospf4 and another for IPv6 called ospf6. It\u0026rsquo;s easy to overlook, but while usually the IPv4 protocol is OSPFv2 and the IPv6 protocol is OSPFv3, here both are using OSPFv3! 
I\u0026rsquo;ll instruct Bird to erect a BFD session for any neighbor it establishes an adjacency with. If at any point the BFD session times out (currently at 20x100ms or 2.0s), OSPF will tear down the adjacency.\nThe OSPFv3 protocols each define one channel, in which I allow Bird to export anything, but import only those routes that are in the LAB IPv4 (192.168.10.0/24) and IPv6 (2001:687:d78:200::/56), and I\u0026rsquo;ll also allow a default to be learned over OSPF for both address families. That\u0026rsquo;ll come in handy later.\nI start up Bird on the rightmost two routers in the lab (vpp0-3 and vpp0-2). Looking at vpp0-3, Bird starts sending IPv6 hello packets on interface e0, and pretty quickly finds not one but two neighbors:\npim@vpp0-3:~$ birdc show ospf neighbors BIRD v2.15.1-4-g280daed5-x ready. ospf4: Router ID Pri\tState DTime\tInterface Router IP 192.168.10.2\t1\tFull/PtP 30.870\te0 fe80::5054:ff:fef0:1121 ospf6: Router ID Pri\tState DTime\tInterface Router IP 192.168.10.2\t1\tFull/PtP 30.870\te0 fe80::5054:ff:fef0:1121 Bird is able to sort out which is which on account of the \u0026rsquo;normal\u0026rsquo; IPv6 OSPFv3 having an instance id value of 0 (IPv6 Unicast), and the IPv4 OSPFv3 having an instance id of 64 (IPv4 Unicast). Further, the IPv4 variant will set the AF-bit in the OSPFv3 options, so the peer will know it supports using the Link LSA to model IPv4 nexthops rather than IPv6 nexthops.\nIndeed, routes are quickly learned:\npim@vpp0-3:~$ birdc show route table master4 BIRD v2.15.1-4-g280daed5-x ready. Table master4: 192.168.10.3/32 unicast [direct1 13:02:56.883] * (240) dev loop0 unicast [direct1 13:02:56.883] (240) dev e0 unicast [ospf4 13:02:56.980] I (150/0) [192.168.10.3] dev loop0 dev e0 192.168.10.2/32 unicast [ospf4 13:03:04.980] * I (150/5) [192.168.10.2] via 192.168.10.2 on e0 onlink They are quickly propagated both to the Linux kernel, and by means of Netlink into the Linux Control Plane plugin in VPP, which programs it into VPP\u0026rsquo;s FIB:\npim@vpp0-3:~$ ip ro 192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink pim@vpp0-3:~$ vppctl show ip fib 192.168.10.2 ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] 192.168.10.2/32 fib:0 index:23 locks:3 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[40] locks:2 flags:shared, uPRF-list:22 len:1 itfs:[1, ] path:[53] pl-index:40 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved, 192.168.10.2 GigabitEthernet10/0/0 [@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800 adjacency refs:1 entry-flags:attached, src-flags:added, cover:-1 path-list:[43] locks:1 uPRF-list:24 len:1 itfs:[1, ] path:[56] pl-index:43 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, 192.168.10.2 GigabitEthernet10/0/0 [@0]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800 Extensions: path:56 forwarding: unicast-ip4-chain [@0]: dpo-load-balance: [proto:ip4 index:28 buckets:1 uRPF:22 to:[0:0]] [0] [@5]: ipv4 via 192.168.10.2 GigabitEthernet10/0/0: mtu:9000 next:6 flags:[] 525400f01121525400f011300800 The neighbor is reachable, over IPv6 (which is nothing special), but also over IPv4:\npim@vpp0-3:~$ ping -c5 2001:678:d78:200::2 PING 2001:678:d78:200::2(2001:678:d78:200::2) 56 data bytes 64 bytes from 2001:678:d78:200::2: icmp_seq=1 ttl=64 time=2.16 ms 64 bytes from 2001:678:d78:200::2: icmp_seq=2 
ttl=64 time=3.69 ms 64 bytes from 2001:678:d78:200::2: icmp_seq=3 ttl=64 time=2.66 ms 64 bytes from 2001:678:d78:200::2: icmp_seq=4 ttl=64 time=2.30 ms 64 bytes from 2001:678:d78:200::2: icmp_seq=5 ttl=64 time=2.92 ms --- 2001:678:d78:200::2 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 4006ms rtt min/avg/max/mdev = 2.164/2.747/3.687/0.540 ms pim@vpp0-3:~$ ping -c5 192.168.10.2 PING 192.168.10.2 (192.168.10.2) 56(84) bytes of data. 64 bytes from 192.168.10.2: icmp_seq=1 ttl=64 time=3.58 ms 64 bytes from 192.168.10.2: icmp_seq=2 ttl=64 time=3.40 ms 64 bytes from 192.168.10.2: icmp_seq=3 ttl=64 time=3.28 ms 64 bytes from 192.168.10.2: icmp_seq=4 ttl=64 time=3.32 ms 64 bytes from 192.168.10.2: icmp_seq=5 ttl=64 time=3.29 ms --- 192.168.10.2 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 4007ms rtt min/avg/max/mdev = 3.283/3.374/3.577/0.109 ms ✅ OSPFv3 with IPv4/IPv6 on-link nexthops works!\nSolution 2: Truly unnumbered However, Ondrej\u0026rsquo;s patch does something in addition to this. I repeat the same setup, except now I set one additional feature when starting up VPP: lcp lcp-sync-unnumbered off\nWhat happens next is that VPP\u0026rsquo;s dataplane looks subtly different. It has created an unnumbered interface keyed off of loop0, but it doesn\u0026rsquo;t propagate the addresses to Linux.\npim@vpp0-3:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64 e0 UP fe80::5054:ff:fef0:1130/64 With e0 only having a linklocal address, Bird can still form an adjacency with its neighbor vpp0-2, because adjacencies in OSPFv3 are formed using IPv6 only. However, the clever trick to walk the list of interfaces in update_loopback_addr() will be able to find a usable IPv4 address, and use that to put in the Link LSA using RFC5838. 
In this case, it finds 192.168.10.3 from interface loop0 so it\u0026rsquo;ll use that to signal the next hop for LSAs that it sends.\nNow I start the same VPP and Bird configuration on all four VPP routers, but on vpp0-0 I\u0026rsquo;ll add a static route out of the LAB to the internet:\nprotocol static static4 { ipv4 { export all; }; route 0.0.0.0/0 via 192.168.10.4; } protocol static static6 { ipv6 { export all; }; route ::/0 via 2001:678:d78:201::ffff; } These two default routes from vpp0-0 quickly propagate through the network, where vpp0-3 ultimately sees this:\npim@vpp0-3:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 loop0 UNKNOWN 192.168.10.3/32 2001:678:d78:200::3/128 fe80::dcad:ff:fe00:0/64 e0 UP fe80::5054:ff:fef0:1130/64 pim@vpp0-3:~$ ip ro default via 192.168.10.2 dev e0 proto bird metric 32 onlink 192.168.10.0 via 192.168.10.2 dev e0 proto bird metric 32 onlink 192.168.10.1 via 192.168.10.2 dev e0 proto bird metric 32 onlink 192.168.10.2 via 192.168.10.2 dev e0 proto bird metric 32 onlink 192.168.10.4/31 via 192.168.10.2 dev e0 proto bird metric 32 onlink pim@vpp0-3:~$ ip -6 ro 2001:678:d78:200:: via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium 2001:678:d78:200::1 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium 2001:678:d78:200::2 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium 2001:678:d78:200::3 dev loop0 proto kernel metric 256 pref medium 2001:678:d78:201::/112 via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium fe80::/64 dev loop0 proto kernel metric 256 pref medium fe80::/64 dev e0 proto kernel metric 256 pref medium default via fe80::5054:ff:fef0:1121 dev e0 proto bird metric 32 pref medium ✅ OSPFv3 with loopback-only, unnumbered IPv4/IPv6 interfaces works!\nResults I thought I\u0026rsquo;d record a little [asciinema, gif] that shows the end to end configuration, starting from an empty dataplane and bird configuration. I\u0026rsquo;ll show Solution 2, that is, the solution that doesn\u0026rsquo;t copy the unnumbered interfaces in VPP to Linux.\nReady? Here I go!\nTo unnumbered or Not To unnumbered I\u0026rsquo;m torn between Solution 1 and Solution 2. While on the one hand, setting the unnumbered interface would be best reflected in Linux, it is not without problems. If the operator subsequently tries to remove one of the addresses on e0 or e1, that will yield a desync between Linux and VPP (Linux will have removed the address, but VPP will still be unnumbered). On the other hand, tricking Linux (and the operator) to believe there isn\u0026rsquo;t an IPv4 (and IPv6) address configured on the interface, is also not great.\nOf the two approaches, I think I prefer Solution 2 (configuring the Linux CP plugin to not sync unnumbered addresses), because it minimizes the chance of operator error. If you\u0026rsquo;re reading this and have an Opinion™, would you please let me know?\n","date":"2024-04-06","desc":" Introduction A few weeks ago I took a good look at the [Babel] protocol. I found a set of features there that I really appreciated. The first was a latency aware routing protocol - this is useful for mesh (wireless) networks but it is also a good fit for IPng\u0026rsquo;s usecase, notably because it makes use of carrier ethernet which, if any link in the underlying MPLS network fails, will automatically re-route but sometimes with much higher latency. 
In these cases, Babel can reconverge on its own to a topology that has the lowest end to end latency.\n","permalink":"https://ipng.ch/s/articles/2024/04/06/vpp-with-loopback-only-ospfv3-part-1/","section":"articles","title":"VPP with loopback-only OSPFv3 - Part 1"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Thanks to the [Linux ControlPlane] plugin, higher level control plane software becomes available, that is to say: things like BGP, OSPF, LDP, VRRP and so on become quite natural for VPP.\nIPng Networks is a small service provider that has built a network based entirely on open source: [Debian] servers with widely available Intel and Mellanox 10G/25G/100G network cards, paired with [VPP] for the dataplane, and [Bird2] for the controlplane.\nAs a small provider, I am well aware of the cost of IPv4 address space. Long gone are the times at which an initial allocation was a /19, and subsequent allocations usually a /20 based on justification. Then it watered down to a /22 for new Local Internet Registries, then that became a /24 for new LIRs, and ultimately we ran out. What was once a plentiful resource, has now become a very constrained resource.\nIn this first article, I want to show a rather clever way to conserve IPv4 addresses by exploring one of the newer routing protocols: Babel.\n🙁 A sad waste I have to go back to something very fundamental about routing. When RouterA holds a routing table, it will associate prefixes with next-hops and their associated interfaces. When RouterA gets a packet, it\u0026rsquo;ll look up the destination address, and then forward the packet on to RouterB which is the next router in the path towards the destination:\nRouterA does a route lookup in its routing table. For destination 192.0.2.1, the covering prefix is 192.0.2.0/24 and it might find that it can reach it via IPv4 next hop 100.64.0.1. RouterA then does another lookup in its routing table, to figure out how can it reach 100.64.0.1. It may find that this address is directly connected, say to interface eth0, on which RouterA is 100.64.0.2/30. Assuming that eth0 is an ethernet device, which the vast majority of interfaces are, then RouterA can look up the link-layer address for that IPv4 address 100.64.0.1, by using ARP. The ARP request asks, quite literally who-has 100.64.0.1? using a broadcast message on eth0, to which the other RouterB will answer 100.64.0.1 is-at 90:e2:ba:3f:ca:d5. Now that RouterA knows that, it can forward along the IP packet out on its eth0 device and towards 90:e2:ba:3f:ca:d5. Huzzah. 🥰 A clever trick I can\u0026rsquo;t help but notice that the only purpose of having the 100.64.0.0/30 transit network between these two routers is to:\nprovide the routers the ability to resolve IPv4 next hops towards link-layer MAC addresses, using ARP resolution. provide a means for the routers to send ICMP messages, for example in a traceroute, each hop along the way will respond with an TTL exceeded message. And I do like traceroutes! Let me discuss these two purposes in more detail:\n1. IPv4 ARP, née IPv6 NDP One really neat trick is simply replacing ARP resolution by something that can resolve the link-layer MAC address in a different way. 
As it turns out, IPv6 has an equivalent that\u0026rsquo;s called Neighbor Discovery Protocol in which a router can determine the link-layer address of a neighbor, or to verify that a neighbor is still reachable via a cached link-layer address. This uses ICMPv6 to send out a query with the Neighbor Solicitation, which is followed by a response in the form of a Neighbor Advertisement.\nWhy am I talking about IPv6 neighbor discovery when I\u0026rsquo;m explaining IPv4 forwarding, you may be wondering? Well, because of this neat trick that the IPv4 prefix brokers don\u0026rsquo;t want you to know:\npim@vpp0-0:~$ sudo ip ro add 192.0.2.0/24 via inet6 fe80::5054:ff:fef0:1110 dev e1 pim@vpp0-0:~$ ip -br a show e1 e1 UP fe80::5054:ff:fef0:1101/64 pim@vpp0-0:~$ ip ro get 192.0.2.0 192.0.2.0 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0 cache pim@vpp0-0:~$ ip neighbor | grep fe80::5054:ff:fef0:1110 fe80::5054:ff:fef0:1110 dev e1 lladdr 52:54:00:f0:11:10 REACHABLE pim@vpp0-0:~$ sudo tcpdump -evni e1 host 192.0.2.0 tcpdump: listening on e1, link-type EN10MB (Ethernet), snapshot length 262144 bytes 16:21:30.002878 52:54:00:f0:11:01 \u0026gt; 52:54:00:f0:11:10, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 21521, offset 0, flags [DF], proto ICMP (1), length 84) 192.168.10.0 \u0026gt; 192.0.2.0: ICMP echo request, id 54710, seq 20, length 64 While it looks counter-intuitive at first, this is actually pretty straight forward. When the router gets a packet destined for 192.0.2.0/24, it will know that the next hop is some link-local IPv6 address, which it can resolve by NDP on ethernet interface e1. It can then simply forward the IPv4 datagram to the MAC address it found.\nWho would\u0026rsquo;ve thunk that you do not need ARP or even IPv4 on the interface at all?\n2. Originating ICMP messages The Internet Control Message Protocol is described in [RFC792]. It\u0026rsquo;s mostly used to carry diagnostic and debugging information, either originated by end hosts, for example the \u0026ldquo;destination unreachable, port unreachable\u0026rdquo; types of messages, but they may also be originated by intermediate routers, for example with most other kinds of \u0026ldquo;destination unreachable\u0026rdquo; packets.\nPath MTU Discovery, described in [RFC1191] allows a host to discover the maximum packet size that a route is able to carry. There\u0026rsquo;s a few different types of PMTUd, but the most common one uses ICMPv4 packets coming from these intermediate routers, informing them that packets which are marked as un-fragmentable, will not be able to be transmitted due to them being too large.\nWithout the ability for a router to signal these ICMPv4 packets, end to end connectivity quality might break undetected. So, every router that is able to forward IPv4 traffic SHOULD be able originate ICMPv4 traffic.\nIf you\u0026rsquo;re curious, you can read more in this [IETF Draft] from Juliusz Chroboczek et al. It\u0026rsquo;s really insightful, yet elegant.\nIntroducing Babel I\u0026rsquo;ve learned so far that I (a) MAY use IPv6 link-local networks in order to forward IPv4 packets, as I can use IPv6 NDP to find the link-layer next hop; and (b) each router SHOULD be able to originate ICMPv4 packets, therefore it needs at least one IPv4 address.\nThese two claims mean that I need at most one IPv4 address on each router. 
Could it be?!\nBabel is a loop-avoiding distance-vector routing protocol that is designed to be robust and efficient both in networks using prefix-based routing and in networks using flat routing (\u0026ldquo;mesh networks\u0026rdquo;), and both in relatively stable wired networks and in highly dynamic wireless networks.\nThe definitive [RFC8966] describes it in great detail, and previous work are in [RFC7557] and [RFC6126]. Lots of reading :) Babel is a hybrid routing protocol, in the sense that it can carry routes for multiple network-layer protocols (IPv4 and IPv6), regardless of which protocol the Babel packets are themselves being carried over.\nI quickly realise that Babel is hybrid in a different and very interesting way: it can set next-hops across address families, which is described in [RFC9229]:\nWhen a packet is routed according to a given routing table entry, the forwarding plane typically uses a neighbour discovery protocol (the Neighbour Discovery (ND) protocol [RFC4861] in the case of IPv6 and the Address Resolution Protocol (ARP) [RFC826] in the case of IPv4) to map the next-hop address to a link-layer address (a \u0026ldquo;Media Access Control (MAC) address\u0026rdquo;), which is then used to construct the link-layer frames that encapsulate forwarded packets.\nIt is apparent from the description above that there is no fundamental reason why the destination prefix and the next-hop address should be in the same address family: there is nothing preventing an IPv6 packet from being routed through a next hop with an IPv4 address (in which case the next hop\u0026rsquo;s MAC address will be obtained using ARP) or, conversely, an IPv4 packet from being routed through a next hop with an IPv6 address. (In fact, it is even possible to store link-layer addresses directly in the next-hop entry of the routing table, which is commonly done in networks using the OSI protocol suite).\nBabel and Bird2 There\u0026rsquo;s an implementation of Babel in Bird2, the routing solution that I use at AS8298. What made me extra enthusiastic, is that I found out the functionality described in RFC9229 was committed about a year ago in Bird2 [ref], with a hat-tip to Toke Høiland-Jørgensen.\nThe Debian machines at IPng are current (Bookworm 12.5), but Debian still ships a version older than this commit, so my first order of business is to get a Debian package:\npim@summer:~/src$ sudo apt install devscripts pim@summer:~/src$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14.orig.tar.gz pim@summer:~/src$ tar xzf bird2_2.14.orig.tar.gz pim@summer:~/src/bird-2.14$ wget http://deb.debian.org/debian/pool/main/b/bird2/bird2_2.14-1.debian.tar.xz pim@summer:~/src/bird-2.14$ tar xf bird2_2.14-1.debian.tar.xz pim@summer:~/src/bird-2.14$ sudo mk-build-deps -i pim@summer:~/src/bird-2.14$ sudo dpkg-buildpackage -b -uc -us And that yields me a fresh Bird 2.14 package. I can\u0026rsquo;t help but wonder though, why did the semantic versioning [ref] of 2.0.X change to 2.14? I found an answer in the NEWS file of the 2.13 release [link]. It\u0026rsquo;s a little bit of a disappointment, but I quickly get over myself because I want to take this Babel-Bird out for a test flight. Thank you for the Babel-Bird-Build, Summer!\nBabel and the LAB I decide to take an IPng [lab] out for a spin. 
These labs come with four VPP routers and two Debian machines connected like so:\nThe configuration snippet for Bird2 is very simple, as most of the defaults are sensible:\npim@vpp0-0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/bird/bird.conf protocol babel { interface \u0026#34;e*\u0026#34; { type wired; extended next hop on; }; ipv6 { import all; export all; }; ipv4 { import all; export all; }; } EOF pim@vpp0-0:~$ birdc show babel interfaces BIRD 2.14 ready. babel1: Interface State Auth RX cost Nbrs Timer Next hop (v4) Next hop (v6) e1 Up No 96 1 0.958 :: fe80::5054:ff:fef0:1101 pim@vpp0-0:~$ birdc show babel neigh BIRD 2.14 ready. babel1: IP address Interface Metric Routes Hellos Expires Auth RTT (ms) fe80::5054:ff:fef0:1110 e1 96 8 16 5.003 No 4.831 pim@vpp0-0:~$ birdc show babel entries BIRD 2.14 ready. babel1: Prefix Router ID Metric Seqno Routes Sources 192.168.10.0/32 00:00:00:00:c0:a8:0a:00 0 1 0 0 192.168.10.0/24 00:00:00:00:c0:a8:0a:00 0 1 1 0 192.168.10.1/32 00:00:00:00:c0:a8:0a:01 96 7 1 0 2001:678:d78:200::/128 00:00:00:00:c0:a8:0a:00 0 1 0 0 2001:678:d78:200::/60 00:00:00:00:c0:a8:0a:00 0 1 1 0 2001:678:d78:200::1/128 00:00:00:00:c0:a8:0a:01 96 7 1 0 Based on this simple configuration, Bird2 will start the babel protocol on e0 and e1, and it quickly finds a neighbor with which it establishes an adjacency. Looking at the routing protocol database (called entries), I can see my own IPv4 and IPv6 loopbacks (192.168.10.0 and 2001:678:d78:200::), the neighbor\u0026rsquo;s IPv4 and IPv6 loopbacks (192.168.10.1 and 201:678:d78:200::1), and finally the two supernets (192.168.10.0/24 and 2001:678:d78:200::/60).\nThe coolest part is the extended next hop on statement, which enables Babel to set the nexthop to be an IPv6 address, which becomes clear very quickly when looking at the Linux routing table:\npim@vpp0-0:~$ ip ro 192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 unreachable 192.168.10.0/24 proto bird metric 32 pim@vpp0-0:~$ ip -6 ro 2001:678:d78:200:: dev loop0 proto kernel metric 256 pref medium 2001:678:d78:200::1 via fe80::5054:ff:fef0:1110 dev e1 proto bird metric 32 pref medium unreachable 2001:678:d78:200::/60 dev lo proto bird metric 32 pref medium fe80::/64 dev loop0 proto kernel metric 256 pref medium fe80::/64 dev e1 proto kernel metric 256 pref medium ✅ Setting IPv4 routes over IPv6 nexthops works!\nBabel and VPP For the [VPP] configuration, I start off with a pretty much empty configuration, creating only a loopback interface called loop0, setting the interfaces up, and exposing them in LinuxCP:\nvpp0-0# create loopback interface instance 0 vpp0-0# set interface state loop0 up vpp0-0# set interface ip address loop0 192.168.10.0/32 vpp0-0# set interface ip address loop0 2001:678:d78:200::/128 vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0 vpp0-0# set interface state GigabitEthernet10/0/0 up vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1 vpp0-0# set interface state GigabitEthernet10/0/1 up vpp0-0# lcp create loop0 host-if loop0 vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0 vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1 Between the four VPP routers, the only relevant difference is the IPv4 and IPv6 addresses of the loopback device. For the rest, things are good. The routing tables quickly fill with all IPv4 and IPv6 loopbacks across the network.\nAdding support to VPP IPv6 pings and looks good. However, IPv4 endpoints do not ping yet. 
The first thing I look at, is does VPP understand how to interpret an IPv4 route with an IPv6 nexthop? I think it does, because I remember reviewing a change from Adrian during our MPLS [project], which he submitted in this [Gerrit]. His change allows VPP to use routes with rtnl_route_nh_get_via() to map them to a different address family, exactly what I am looking for. The routes are correctly installed in the FIB:\npim@vpp0-0:~$ vppctl show ip fib 192.168.10.1 ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[default-route:1, lcp-rt:1, ] 192.168.10.1/32 fib:0 index:31 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[51] locks:4 flags:shared, uPRF-list:42 len:1 itfs:[2, ] path:[72] pl-index:51 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved, fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1 [@0]: ipv6 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f0110186dd forwarding: unicast-ip4-chain [@0]: dpo-load-balance: [proto:ip4 index:34 buckets:1 uRPF:42 to:[0:0]] [0] [@5]: ipv4 via fe80::5054:ff:fef0:1110 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 525400f01110525400f011010800 Using the Open vSwitch tap I can see I can clearly see the packets go out from vpp0-0.e1 and into vpp0-1.e0, but there is no response, so they are getting lost in vpp0-1 somewhere. I take a look at a packet trace on vpp0-1, I\u0026rsquo;m expecting the ICMP packet there:\npim@vpp0-1:~$ vppctl show trace 07:42:53:178694: dpdk-input GigabitEthernet10/0/0 rx queue 0 buffer 0x4c513d: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid PKT MBUF: port 0, nb_segs 1, pkt_len 98 buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x29944fc0 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 IP4: 52:54:00:f0:11:01 -\u0026gt; 52:54:00:f0:11:10 ICMP: 192.168.10.0 -\u0026gt; 192.168.10.1 tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN fragment id 0xf52b, flags DONT_FRAGMENT ICMP echo_request checksum 0x43b7 id 26166 07:42:53:178765: ethernet-input frame: flags 0x1, hw-if-index 1, sw-if-index 1 IP4: 52:54:00:f0:11:01 -\u0026gt; 52:54:00:f0:11:10 07:42:53:178791: ip4-input ICMP: 192.168.10.0 -\u0026gt; 192.168.10.1 tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN fragment id 0xf52b, flags DONT_FRAGMENT ICMP echo_request checksum 0x43b7 id 26166 07:42:53:178810: ip4-not-enabled ICMP: 192.168.10.0 -\u0026gt; 192.168.10.1 tos 0x00, ttl 64, length 84, checksum 0xb02b dscp CS0 ecn NON_ECN fragment id 0xf52b, flags DONT_FRAGMENT ICMP echo_request checksum 0x43b7 id 26166 07:42:53:178833: error-drop rx:GigabitEthernet10/0/0 07:42:53:178835: drop dpdk-input: no error Okay, that checks out! Going over this packet trace, the ip4-input node indeed got handed a packet, which it promptly rejected by forwarding it to ip4-not-enabled which drops it. It kind of makes sense, the VPP dataplane doesn\u0026rsquo;t think it\u0026rsquo;s logical to handle IPv4 traffic on an interface which does not have an IPv4 address. Except \u0026ndash; I\u0026rsquo;m bending the rules a little bit by doing exactly that.\nApproach 1: force-enable ip4 in VPP There\u0026rsquo;s an internal function ip4_sw_interface_enable_disable() which is called to enable IPv4 processing on an interface once the first IPv4 address is added. 
So my first fix is to force this to be enabled for any interface that is exposed via Linux Control Plane, notably in lcp_itf_pair_create() [here].\nThis approach is partially effective:\npim@vpp0-0:~$ ip ro get 192.168.10.1 192.168.10.1 via inet6 fe80::5054:ff:fef0:1110 dev e1 src 192.168.10.0 uid 0 cache pim@vpp0-0:~$ ping -c5 192.168.10.1 PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data. 64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=3.92 ms 64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=3.81 ms 64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.75 ms 64 bytes from 192.168.10.1: icmp_seq=4 ttl=64 time=3.23 ms 64 bytes from 192.168.10.1: icmp_seq=5 ttl=64 time=2.67 ms ^C --- 192.168.10.1 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 4006ms rtt min/avg/max/mdev = 2.673/3.477/3.921/0.467 ms pim@vpp0-0:~$ traceroute 192.168.10.3 traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets 1 * * * 2 * * * 3 192.168.10.3 (192.168.10.3) 10.418 ms 10.343 ms 11.362 ms I take a moment to think about why the traceroutes are not responding in the routers in the middle, and it dawns on me that when the router needs to send an ICMPv4 TTL Exceeded message, it can\u0026rsquo;t select an IPv4 address to originate the message from, as the interface has none.\n🟠 Forwarding works, but ❌ PMTUd does not!\nApproach 2: Use unnumbered interfaces Looking at my options, I see that VPP is capable of using so-called unnumbered interfaces. These can be left unconfigured, but borrow an address from another interface. It\u0026rsquo;s a good idea to borrow from loop0, which has a valid IPv4 and IPv6 address. It looks like this in VPP:\nvpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0 vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0 vpp0-0# show interface address GigabitEthernet10/0/0 (dn): unnumbered, use loop0 L3 192.168.10.0/32 L3 2001:678:d78:200::/128 GigabitEthernet10/0/1 (up): unnumbered, use loop0 L3 192.168.10.0/32 L3 2001:678:d78:200::/128 loop0 (up): L3 192.168.10.0/32 L3 2001:678:d78:200::/128 The Linux ControlPlane configuration will always synchronize interface information from VPP to Linux, as I described back then when I [worked on the plugin]. Babel starts and sets next hops for IPv4 that look like this:\npim@vpp0-2:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 loop0 UNKNOWN 192.168.10.2/32 2001:678:d78:200::2/128 fe80::dcad:ff:fe00:0/64 e0 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1120/64 e1 UP 192.168.10.2/32 2001:678:d78:200::2/128 fe80::5054:ff:fef0:1121/64 pim@vpp0-2:~$ ip ro 192.168.10.0 via 192.168.10.1 dev e0 proto bird metric 32 onlink unreachable 192.168.10.0/24 proto bird metric 32 192.168.10.1 via 192.168.10.1 dev e0 proto bird metric 32 onlink 192.168.10.3 via 192.168.10.3 dev e1 proto bird metric 32 onlink While on the surface this looks good, for VPP it clearly poses a problem, as my IPv4 neighbors (192.168.10.1 and 192.168.10.3) are not reachable:\npim@vpp0-2:~# ping -c3 192.168.10.1 PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data. 
From 192.168.10.2 icmp_seq=1 Destination Host Unreachable From 192.168.10.2 icmp_seq=2 Destination Host Unreachable From 192.168.10.2 icmp_seq=3 Destination Host Unreachable --- 192.168.10.1 ping statistics --- 3 packets transmitted, 0 received, +3 errors, 100% packet loss, time 2034ms I take a look at why that might be, and I notice this on the neighbor vpp0-1 when I try to ping it from vpp0-2:\nvpp0-1# show err Count Node Reason Severity 5 arp-reply IP4 source address not local to sub error 1 arp-reply IP4 source address matches local in error Oh, snap! I traced this down to src/vnet/arp/arp.c around line 522 where I can see that VPP, when it receives an ARP request, wants that to be coming from a peer that is in its own subnet. But with a point to point link like this one, there is nobody else in the 192.168.10.1/32 subnet! I think this error should not be returned if the interface is arp_unnumbered(), defined further up in the same source file. I write a small patch in Gerrit [40482] which removes this requirement and the test that asserts the previous behavior, allowing the ARP request to succeed, and things shoot to life:\npim@vpp0-2:~$ ping -c3 192.168.10.1 PING 192.168.10.1 (192.168.10.1) 56(84) bytes of data. 64 bytes from 192.168.10.1: icmp_seq=1 ttl=64 time=11.5 ms 64 bytes from 192.168.10.1: icmp_seq=2 ttl=64 time=1.69 ms 64 bytes from 192.168.10.1: icmp_seq=3 ttl=64 time=3.03 ms --- 192.168.10.1 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2004ms rtt min/avg/max/mdev = 1.689/5.394/11.468/4.329 ms I make a mental note to discuss this ARP relaxation Gerrit with [vpp-dev], and I\u0026rsquo;ll see where that takes me.\n✅ Forwarding IPv4 routes over IPv4 point-to-point nexthops works!\nApproach 3: VPP Unnumbered Hack At this point, I think I\u0026rsquo;m good, but one of the cool features of Babel is that it can use IPv6 next hops for IPv4 destinations. Setting GigabitEthernet10/0/X to unnumbered will make 192.168.10.X/32 reappear on the e0 an e1 interfaces, which will make Babel prefer the more classic IPv4 next-hops. So can I trick it somehow to use IPv6 anyway ?\nOne option is to ask Babel to use extended next hop even when IPv4 is available, which would be a change to Bird (and possibly a violation of the Babel specification, I should read up on that).\nBut I think there\u0026rsquo;s another way, so I take a look at the VPP code which prints out the unnumbered, use loop0 message, and I find a way to know if an interface is borrowing addresses in this way. I decide to change the LCP plugin to inhibit sync\u0026rsquo;ing the addresses if they belong to an interface which is unnumbered. Because I don\u0026rsquo;t know for sure if everybody would find this behavior desirable, I make sure to guard the behavior behind a backwards compatible configuration option.\nIf you\u0026rsquo;re curious, please take a look at the change in my [GitHub repo], in which I:\nadd a new configuration option, lcp-sync-unnumbered, which defaults to on. That would be what the plugin would do in the normal case: copy forward these borrowed IP addresses to Linux. 
add a CLI call to change the value, lcp lcp-sync-unnumbered [on|enable|off|disable] extend the CLI call to show the LCP plugin state, as an additional output of lcp show And with that, the VPP configuration becomes:\nvpp0-0# lcp lcp-sync on vpp0-0# lcp lcp-sync-unnumbered off vpp0-0# create loopback interface instance 0 vpp0-0# set interface state loop0 up vpp0-0# set interface ip address loop0 192.168.10.0/32 vpp0-0# set interface ip address loop0 2001:678:d78:200::/128 vpp0-0# set interface mtu 9000 GigabitEthernet10/0/0 vpp0-0# set interface unnumbered GigabitEthernet10/0/0 use loop0 vpp0-0# set interface state GigabitEthernet10/0/0 up vpp0-0# set interface mtu 9000 GigabitEthernet10/0/1 vpp0-0# set interface unnumbered GigabitEthernet10/0/1 use loop0 vpp0-0# set interface state GigabitEthernet10/0/1 up vpp0-0# lcp create loop0 host-if loop0 vpp0-0# lcp create GigabitEthernet10/0/0 host-if e0 vpp0-0# lcp create GigabitEthernet10/0/1 host-if e1 Results I can claim plausible success on this effort, which makes me wiggle in my seat a little bit, I have to admit:\npim@vpp0-0:~$ ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 loop0 UNKNOWN 192.168.10.0/32 2001:678:d78:200::/128 fe80::dcad:ff:fe00:0/64 e0 UP fe80::5054:ff:fef0:1100/64 e1 UP fe80::5054:ff:fef0:1101/64 e2 DOWN e3 DOWN pim@vpp0-0:~$ traceroute -n 192.168.10.3 traceroute to 192.168.10.3 (192.168.10.3), 30 hops max, 60 byte packets 1 192.168.10.1 1.882 ms 2.231 ms 1.472 ms 2 192.168.10.2 4.243 ms 3.492 ms 2.797 ms 3 192.168.10.3 6.689 ms 5.925 ms 5.157 ms pim@vpp0-0:~$ traceroute -n 2001:678:d78:200::3 traceroute to 2001:678:d78:200::3 (2001:678:d78:200::3), 30 hops max, 80 byte packets 1 2001:678:d78:200::1 2.543 ms 1.762 ms 2.154 ms 2 2001:678:d78:200::2 4.943 ms 3.063 ms 3.562 ms 3 2001:678:d78:200::3 6.273 ms 6.694 ms 7.086 ms ✅ Forwarding IPv4 routes over IPv6 nexthops works, ICMPv4 works, PMTUd works!\nI recorded a little [screencast] that shows my work, so far:\nAdditional thoughts Comparing OSPFv2 and Babel Ondrej from the Bird team pointed out (thank you!) that OSPFv2 can also be made to avoid use of IPv4 transit networks, by making use of this peer pattern, which is similar but not quite the same as what I discussed in Approach 2 above:\n$ ip addr add 192.168.10.2 peer 192.168.10.1 dev e0 $ ip addr add 192.168.10.2 peer 192.168.10.3 dev e1 The Linux ControlPlane plugin is not currently capable of accepting the peer netlink message, and I can see a problem: VPP does not allow for two interfaces to have the same IP address, unless one is borrowing from another using unnumbered. I wonder why that is \u0026hellip;\nI could certainly give implementing that peer pattern in Netlink a go, but I\u0026rsquo;m not enthusiastic. To consume the netlink message correctly, the plugin would need to assert that left hand (source) IPv4 address strictly corresponds to a loopback, and then internally rewrite the address addition into a unnumbered use, and also somehow reject (delete?) the netlink configuration otherwise. Ick!\nI think there\u0026rsquo;s a more idiomatic way of doing this in VPP. OSPFv2 doesn\u0026rsquo;t really need to use the peer pattern, as long as the point to point peer is reachable. Babel is emitting a static route over the interface after using IPv6 to learn its peer\u0026rsquo;s IPv4 address, which is really neat! 
I suppose for OSPFv2 setting a manual static route for the peer into the device would do the trick as well.\nThe VPP idiom for the peer pattern above, which Babel does naturally, and OSPFv2 could be manually configured to do, would look like this:\nvpp0-2# set interface ip address loop0 192.168.10.2/32 vpp0-2# set interface state loop0 up vpp0-2# set interface unnumbered GigabitEthernet10/0/0 use loop0 vpp0-2# set interface state GigabitEthernet10/0/0 up vpp0-2# ip route add 192.168.10.1/32 via 192.168.10.1 GigabitEthernet10/0/0 vpp0-2# set interface unnumbered GigabitEthernet10/0/1 use loop0 vpp0-2# set interface state GigabitEthernet10/0/1 up vpp0-2# ip route add 192.168.10.3/32 via 192.168.10.3 GigabitEthernet10/0/1 Either way, using point to point connections (like these explicit static routes, or the implied static routes that the peer pattern will yield) over an ethernet broadcast medium will require getting the ARP [Gerrit] merged. This one seems reasonably straight forward because allowing point to point to work over an ethernet broadcast medium is successfully done by many popular vendors, and I can\u0026rsquo;t find any RFC that forbids it. Perhaps VPP is being a bit too strict.\nTo Unnumbered or Not To Unnumbered I\u0026rsquo;m torn between Approach 2 and Approach 3. While on the one hand, setting the unnumbered interface would be best reflected in Linux, it is not without problems. If the operator subsequently tries to remove one of the addresses on e0 or e1, that will yield a desync between Linux and VPP (Linux will have removed the address, but VPP will still be unnumbered). On the other hand, tricking Linux (and the operator) into believing there isn\u0026rsquo;t an IPv4 (and IPv6) address configured on the interface is also not great.\nOf the two approaches, I think I prefer Approach 3 (changing the Linux CP plugin to not sync unnumbered addresses), because it minimizes the chance of operator error. If you\u0026rsquo;re reading this and have an Opinion™, would you please let me know?\nWhat\u0026rsquo;s Next I think that over time, IPng Networks might replace OSPF and OSPFv3 with Babel, as it will allow me to retire the many /31 IPv4 and /112 IPv6 transit networks (which consume about half of my routable IPv4 addresses!). I will discuss my change with the VPP and Babel/Bird Developer communities and see if it makes sense to upstream my changes. Personally, I think it\u0026rsquo;s a reasonable direction, because (a) both changes are backwards compatible and (b) their semantics are pretty straight forward. I\u0026rsquo;ll also add some configuration knobs to [vppcfg] to make it easier to configure VPP in this way.\nOf course, migrating AS8298 won\u0026rsquo;t happen overnight, I need to gain a bit more confidence, and obviously upgrade both Bird2 and VPP using my changes, which I think might benefit from a bit of peer review. And finally I need to roll this new IPv4-less IGP out very carefully and without interruptions, which, considering the IGP is the most fundamental building block of the network, may be tricky.\nBut, I am uncomfortably excited by the prospect of having my network go entirely without backbone transit networks. By the way: Babel is amazing!\n","date":"2024-03-06","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility.
For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Thanks to the [Linux ControlPlane] plugin, higher level control plane software becomes available, that is to say: things like BGP, OSPF, LDP, VRRP and so on become quite natural for VPP.\n","permalink":"https://ipng.ch/s/articles/2024/03/06/vpp-with-babel-part-1/","section":"articles","title":"VPP with Babel - Part 1"},{"contents":"About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Over the years, folks have asked me regularly \u0026ldquo;What about BSD?\u0026rdquo; and to my surprise, late last year I read an announcement from the FreeBSD Foundation [ref] as they looked back over 2023 and forward to 2024:\nPorting the Vector Packet Processor to FreeBSD\nVector Packet Processing (VPP) is an open-source, high-performance user space networking stack that provides fast packet processing suitable for software-defined networking and network function virtualization applications. VPP aims to optimize packet processing through vectorized operations and parallelism, making it well-suited for high-speed networking applications. In November of this year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other tasks such as testing FreeBSD on common virtualization platforms to improve the desktop experience, improving hardware support on arm64 platforms, and adding support for low power idle on Intel and arm64 hardware.\nIn my first [article], I wrote a sort of a hello world by installing FreeBSD 14.0-RELEASE on both a VM and a bare metal Supermicro, and showed that Tom\u0026rsquo;s VPP branch compiles, runs and pings. In this article, I\u0026rsquo;ll take a look at some comparative performance numbers.\nComparing implementations FreeBSD has an extensive network stack, including regular kernel based functionality such as routing, filtering and bridging, a faster netmap based datapath, including some userspace utilities like a netmap bridge, and of course completely userspace based dataplanes, such as the VPP project that I\u0026rsquo;m working on here. Last week, I learned that VPP has a netmap driver, and from previous travels I am already quite familiar with its DPDK based forwarding. I decide to do a baseline loadtest for each of these on the Supermicro Xeon-D1518 that I installed last week. See the [article] for details on the setup.\nThe loadtests will use a common set of different configurations, using Cisco T-Rex\u0026rsquo;s default benchmark profile called bench.py:\nvar2-1514b: Large Packets, multiple flows with modulating source and destination IPv4 addresses, often called an \u0026lsquo;iperf test\u0026rsquo;, with packets of 1514 bytes. var2-imix: Mixed Packets, multiple flows, often called an \u0026lsquo;imix test\u0026rsquo;, which includes a bunch of 64b, 390b and 1514b packets. var2-64b: Small Packets, still multiple flows, 64 bytes, which allows for multiple receive queues and kernel or application threads. 
64b: Small Packets, but now single flow, often called \u0026rsquo;linerate test\u0026rsquo;, with a packet size of 64 bytes, limiting to one receive queue. Each of these four loadtests might be run either unidirectionally (port0 -\u0026gt; port1) or bidirectionally (port0 \u0026lt;-\u0026gt; port1). This yields eight different loadtests, each taking about 8 minutes. I put the kettle on and get underway.\nFreeBSD 14: Kernel Bridge The machine I\u0026rsquo;m testing has a quad-port Intel i350 (1Gbps copper, using the FreeBSD igb(4) driver), a dual-port Intel X552 (10Gbps SFP+, using the ix(4) driver), and a dual-port Intel i710-XXV (25Gbps SFP28, using the ixl(4) driver). I decide to live it up a little, and choose the 25G ports for my loadtests today, even if I think this machine with its relatively low-end Xeon-D1518 CPU will struggle a little bit at very high packet rates. No pain, no gain, amirite?\nI take my fresh FreeBSD 14.0-RELEASE install, without any tinkering other than compiling a GENERIC kernel that has support for the DPDK modules I\u0026rsquo;ll need later. For my first loadtest, I create a kernel based bridge as follows, just tying the two 25G interfaces together:\n[pim@france /usr/obj]$ uname -a FreeBSD france 14.0-RELEASE FreeBSD 14.0-RELEASE #0: Sat Feb 10 22:18:51 CET 2024 root@france:/usr/obj/usr/src/amd64.amd64/sys/GENERIC amd64 [pim@france ~]$ dmesg | grep ixl ixl0: \u0026lt;Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k\u0026gt; mem 0xf8000000-0xf8ffffff,0xf9008000-0xf900ffff irq 16 at device 0.0 on pci7 ixl1: \u0026lt;Intel(R) Ethernet Controller XXV710 for 25GbE SFP28 - 2.3.3-k\u0026gt; mem 0xf7000000-0xf7ffffff,0xf9000000-0xf9007fff irq 16 at device 0.1 on pci7 [pim@france ~]$ sudo ifconfig bridge0 create [pim@france ~]$ sudo ifconfig bridge0 addm ixl0 addm ixl1 up [pim@france ~]$ sudo ifconfig ixl0 up [pim@france ~]$ sudo ifconfig ixl1 up [pim@france ~]$ ifconfig bridge0 bridge0: flags=1008843\u0026lt;UP,BROADCAST,RUNNING,SIMPLEX,MULTICAST,LOWER_UP\u0026gt; metric 0 mtu 1500 options=0 ether 58:9c:fc:10:6c:2e id 00:00:00:00:00:00 priority 32768 hellotime 2 fwddelay 15 maxage 20 holdcnt 6 proto rstp maxaddr 2000 timeout 1200 root id 00:00:00:00:00:00 priority 32768 ifcost 0 port 0 member: ixl1 flags=143\u0026lt;LEARNING,DISCOVER,AUTOEDGE,AUTOPTP\u0026gt; ifmaxaddr 0 port 4 priority 128 path cost 800 member: ixl0 flags=143\u0026lt;LEARNING,DISCOVER,AUTOEDGE,AUTOPTP\u0026gt; ifmaxaddr 0 port 3 priority 128 path cost 800 groups: bridge nd6 options=9\u0026lt;PERFORMNUD,IFDISABLED\u0026gt; One thing that I quickly realize is that FreeBSD, when using hyperthreading, does have 8 threads available, but only 4 of them participate in forwarding.
When I put the machine under load, I see a curious 399% spent in kernel while I see 402% in idle:\nWhen I then do a single-flow unidirectional loadtest, the expected outcome is that only one CPU participates (100% in kernel and 700% in idle) and if I perform a single-flow bidirectional loadtest, my expectations are confirmed again, seeing two CPU threads do the work (200% in kernel and 600% in idle).\nWhile the math checks out, the performance is a little bit less impressive:\nType Uni/BiDir Packets/Sec L2 Bits/Sec Line Rate vm=var2,size=1514 Unidirectional 2.02Mpps 24.77Gbps 99% vm=var2,size=imix Unidirectional 3.48Mpps 10.23Gbps 43% vm=var2,size=64 Unidirectional 3.61Mpps 2.43Gbps 9.7% size=64 Unidirectional 1.22Mpps 0.82Gbps 3.2% vm=var2,size=1514 Bidirectional 3.77Mpps 46.31Gbps 93% vm=var2,size=imix Bidirectional 3.81Mpps 11.22Gbps 24% vm=var2,size=64 Bidirectional 4.02Mpps 2.69Gbps 5.4% size=64 Bidirectional 2.29Mpps 1.54Gbps 3.1% Conclusion: FreeBSD\u0026rsquo;s kernel on this Xeon-D1518 processor can handle about 1.2Mpps per CPU thread, and I can use only four of them. FreeBSD is happy to forward big packets, and I can reasonably reach 2x25Gbps but once I start ramping up the packets/sec by lowering the packet size, things very quickly deteriorate.\nFreeBSD 14: netmap Bridge Tom pointed out a tool in the source tree, called the netmap bridge originally written by Luigi Rizzo and Matteo Landi. FreeBSD ships the source code, but you can also take a look at their GitHub repository [ref].\nWhat is netmap anyway? It\u0026rsquo;s a framework for extremely fast and efficient packet I/O for userspace and kernel clients, and for Virtual Machines. It runs on FreeBSD, Linux and some versions of Windows. As an aside, my buddy Pavel from FastNetMon pointed out a blogpost from 2015 in which Cloudflare folks described a way to do DDoS mitigation on Linux using traffic classification to program the network cards to move certain offensive traffic to a dedicated hardware queue, and service that queue from a netmap client. If you\u0026rsquo;re curious (I certainly was!), you might take a look at that cool write-up [here].\nI compile the code and put it to work, and the man-page tells me that I need to fiddle with the interfaces a bit. They need to be:\nset to promiscuous, which makes sense as they have to receive ethernet frames sent to MAC addresses other than their own turn off any hardware offloading, notably -rxcsum -txcsum -tso4 -tso6 -lro my user needs write permission to /dev/netmap to bind the interfaces from userspace. [pim@france /usr/src/tools/tools/netmap]$ make [pim@france /usr/src/tools/tools/netmap]$ cd /usr/obj/usr/src/amd64.amd64/tools/tools/netmap [pim@france .../tools/netmap]$ sudo ifconfig ixl0 -rxcsum -txcsum -tso4 -tso6 -lro promisc [pim@france .../tools/netmap]$ sudo ifconfig ixl1 -rxcsum -txcsum -tso4 -tso6 -lro promisc [pim@france .../tools/netmap]$ sudo chmod 660 /dev/netmap [pim@france .../tools/netmap]$ ./bridge -i netmap:ixl0 -i netmap:ixl1 065.804686 main [290] ------- zerocopy supported 065.804708 main [297] Wait 4 secs for link to come up... 075.810547 main [301] Ready to go, ixl0 0x0/4 \u0026lt;-\u0026gt; ixl1 0x0/4. I start my first loadtest, which pretty immediately fails. It\u0026rsquo;s an interesting behavior pattern which I\u0026rsquo;ve not seen before. After staring at the problem, and reading the code of bridge.c, which is a remarkably straight forward program, I restart the bridge utility, and traffic passes again but only for a little while. 
Whoops!\nI took a [screencast] in case any kind soul on freebsd-net wants to take a closer look at this:\nI start a bit of trial and error in which I conclude that if I send a lot of traffic (like 10Mpps), forwarding is fine; but if I send a little traffic (like 1kpps), at some point forwarding stops alltogether. So while it\u0026rsquo;s not great, this does allow me to measure the total throughput just by sending a lot of traffic, say 30Mpps, and seeing what amount comes out the other side.\nHere I go, and I\u0026rsquo;m having fun:\nType Uni/BiDir Packets/Sec L2 Bits/Sec Line Rate vm=var2,size=1514 Unidirectional 2.04Mpps 24.72Gbps 100% vm=var2,size=imix Unidirectional 8.16Mpps 23.76Gbps 100% vm=var2,size=64 Unidirectional 10.83Mpps 5.55Gbps 29% size=64 Unidirectional 11.42Mpps 5.83Gbps 31% vm=var2,size=1514 Bidirectional 3.91Mpps 47.27Gbps 96% vm=var2,size=imix Bidirectional 11.31Mpps 32.74Gbps 77% vm=var2,size=64 Bidirectional 11.39Mpps 5.83Gbps 15% size=64 Bidirectional 11.57Mpps 5.93Gbps 16% Conclusion: FreeBSD\u0026rsquo;s netmap implementation is also bound by packets/sec, and in this setup, the Xeon-D1518 machine is capable of forwarding roughly 11.2Mpps. What I find cool is that single flow or multiple flows doesn\u0026rsquo;t seem to matter that much, in fact bidirectional 64b single flow loadtest was most favorable at 11.57Mpps, which is an order of magnitude better than using just the kernel (which clocked in at 1.2Mpps).\nFreeBSD 14: VPP with netmap It\u0026rsquo;s good to have a baseline on this machine on how the FreeBSD kernel itself performs. But of course this series is about Vector Packet Processing, so I now turn my attention to the VPP branch that Tom shared with me. I wrote a bunch of details about the VM and bare metal install in my [first article] so I\u0026rsquo;ll just go straight to the configuration parts:\nDBGvpp# create netmap name ixl0 DBGvpp# create netmap name ixl1 DBGvpp# set int state netmap-ixl0 up DBGvpp# set int state netmap-ixl1 up DBGvpp# set int l2 xconnect netmap-ixl0 netmap-ixl1 DBGvpp# set int l2 xconnect netmap-ixl1 netmap-ixl0 DBGvpp# show int Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count local0 0 down 0/0/0/0 netmap-ixl0 1 up 9000/0/0/0 rx packets 25622 rx bytes 1537320 tx packets 25437 tx bytes 1526220 netmap-ixl1 2 up 9000/0/0/0 rx packets 25437 rx bytes 1526220 tx packets 25622 tx bytes 1537320 At this point I can pretty much rule out that the netmap bridge.c is the issue, because a few seconds after introducing 10Kpps of traffic and seeing it successfully pass, the loadtester receives no more packets, even though T-Rex is still sending it. However, about a minute later I can also see the RX and TX counters continue to increase in the VPP dataplane:\nDBGvpp# show int Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count local0 0 down 0/0/0/0 netmap-ixl0 1 up 9000/0/0/0 rx packets 515843 rx bytes 30950580 tx packets 515657 tx bytes 30939420 netmap-ixl1 2 up 9000/0/0/0 rx packets 515657 rx bytes 30939420 tx packets 515843 tx bytes 30950580 .. and I can see that every packet that VPP received is accounted for: interface ixl0 has received 515843 packets, and ixl1 claims to have transmitted exactly that amount of packets. So I think perhaps they are getting lost somewhere on egress between the kernel and the Intel i710-XXV network card.\nHowever, counter to the previous case, I cannot sustain any reasonable amount of traffic, be it 1Kpps, 10Kpps or 10Mpps, the system pretty consistently comes to a halt mere seconds after introducing the load. 
Restarting VPP makes it forward traffic again for a few seconds, just to end up in the same upset state. I don\u0026rsquo;t learn much.\nConclusion: This setup with VPP using netmap does not yield results, for the moment. I have a suspicion that whatever caused the netmap bridge issue in the previous test is likely also the culprit here.\nFreeBSD 14: VPP with DPDK But not all is lost - I have one test left, and judging by what I learned last week when bringing up the first test environment, this one is going to be a fair bit better. In my previous loadtests, the network interfaces were on their usual kernel driver (ixl(4) in the case of the Intel i710-XXV interfaces), but now I\u0026rsquo;m going to mix it up a little, and rebind these interfaces to a specific DPDK driver called nic_uio(4) which stands for Network Interface Card Userspace Input/Output:\n[pim@france ~]$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /boot/loader.conf nic_uio_load=\u0026#34;YES\u0026#34; hw.nic_uio.bdfs=\u0026#34;6:0:0,6:0:1\u0026#34; EOF After I reboot, the network interfaces are gone from the output of ifconfig(8), which is good. I start up VPP with a minimal config file [ref], which defines three worker threads and starts DPDK with 3 RX queues and 4 TX queues. It\u0026rsquo;s a common question why there would be one more TX queue. The explanation is that in VPP, there is one (1) main thread and zero or more worker threads. If the main thread wants to send traffic (for example, in a plugin like LLDP which sends periodic announcements), it would be most efficient to use a transmit queue specific to that main thread. Any return traffic will be picked up by the DPDK Process on worker threads (as main does not have one of these). That\u0026rsquo;s why the general rule is num(TX) = num(RX)+1.\n[pim@france ~/src/vpp]$ export STARTUP_CONF=/home/pim/src/startup.conf [pim@france ~/src/vpp]$ gmake run-release vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/0 TwentyFiveGigabitEthernet6/0/1 vpp# set int l2 xconnect TwentyFiveGigabitEthernet6/0/1 TwentyFiveGigabitEthernet6/0/0 vpp# set int state TwentyFiveGigabitEthernet6/0/0 up vpp# set int state TwentyFiveGigabitEthernet6/0/1 up vpp# show int Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count TwentyFiveGigabitEthernet6/0/0 1 up 9000/0/0/0 rx packets 11615035382 rx bytes 1785998048960 tx packets 700076496 tx bytes 161043604594 TwentyFiveGigabitEthernet6/0/1 2 up 9000/0/0/0 rx packets 700076542 rx bytes 161043674054 tx packets 11615035440 tx bytes 1785998136540 local0 0 down 0/0/0/0 And with that, the dataplane shoots to life and starts forwarding (lots of) packets. To my great relief, sending either 1kpps or 1Mpps \u0026ldquo;just works\u0026rdquo;. I can run my loadtest as per normal, first with 1514 byte packets, then imix, then 64 byte packets, and finally single-flow 64 byte packets. And of course, both unidirectionally and bidirectionally.\nI take a look at the system load while the loadtests are running:\nIt is fully expected that the VPP process is spinning 300% +epsilon of CPU time. This is because it has started three worker threads, and these are executing the DPDK Poll Mode Driver which is essentially a tight loop that asks the network cards for work, and if there are any packets arriving, executes on that work.
As such, each worker thread is always burning 100% of its assigned CPU.\nThat said, I can take a look at finer grained statistics in the dataplane itself:\nvpp# show run Thread 0 vpp_main (lcore 0) Time .9, 10 sec internal node vector rate 0.00 loops/sec 297041.19 vector rates in 0.0000e0, out 0.0000e0, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call ip4-full-reassembly-expire-wal any wait 0 0 18 2.39e3 0.00 ip6-full-reassembly-expire-wal any wait 0 0 18 3.08e3 0.00 unix-cli-process-0 active 0 0 9 7.62e4 0.00 unix-epoll-input polling 13066 0 0 1.50e5 0.00 --------------- Thread 1 vpp_wk_0 (lcore 1) Time .9, 10 sec internal node vector rate 12.38 loops/sec 1467742.01 vector rates in 5.6294e6, out 5.6294e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 2.20e1 12.63 TwentyFiveGigabitEthernet6/0/1 active 399663 5047800 0 9.54e1 12.63 dpdk-input polling 1531252 5047800 0 1.45e2 3.29 ethernet-input active 399663 5047800 0 3.97e1 12.63 l2-input active 399663 5047800 0 2.93e1 12.63 l2-output active 399663 5047800 0 2.53e1 12.63 unix-epoll-input polling 1494 0 0 3.09e2 0.00 (et cetera) I showed only one worker thread\u0026rsquo;s output, but there are actually three worker threads, and they are all doing similar work, because they are picking up 33% of the traffic each assigned to the three RX queues in the network card.\nWhile the overall CPU load is 300%, here I can see a different picture. Thread 0 (the main thread) is doing essentially ~nothing. It is polling a set of unix sockets in the node called unix-epoll-input, but other than that, main doesn\u0026rsquo;t have much on its plate. Thread 1 however is a worker thread, and I can see that it is busy doing work:\ndpdk-input: it\u0026rsquo;s polling the NIC for work, it has been called 1.53M times, and in total it has handled just over 5.04M vectors (which are packets). So I can derive, that each time the Poll Mode Driver gives work, on average there are 3.29 vectors (packets), and each packet is taking about 145 CPU clocks. ethernet-input: The DPDK vectors are all ethernet frames coming from the loadtester. Seeing as I have cross connected all traffic from Tf6/0/0 to Tf6/0/1 and vice-versa, VPP knows that it should handle the packets in the L2 forwarding path. l2-input is called with the (list of N) ethernet frames, which all get cross connected to the output interface, in this case Tf6/0/1. l2-output prepares the ethernet frames for output into their egress interface. TwentyFiveGigabitEthernet6/0/1-output (Note: the name is truncated) If this were to have been L3 traffic, this would be the place where the destination MAC address is inserted into the ethernet frame, but since this is an L2 cross connect, the node simply passes the ethernet frames through to the final egress node in DPDK. TwentyFiveGigabitEthernet6/0/1-tx (Note: the name is truncated) hands them to the DPDK driver for marshalling on the wire. 
Halfway through, I see that there\u0026rsquo;s an issue with the distribution of ingress traffic over the three workers, maybe you can spot it too:\n--------------- Thread 1 vpp_wk_0 (lcore 1) Time 56.7, 10 sec internal node vector rate 38.59 loops/sec 106879.84 vector rates in 7.2982e6, out 7.2982e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.34e1 30.93 TwentyFiveGigabitEthernet6/0/0 active 6689553 206899956 0 1.37e2 30.93 TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.45e1 30.93 TwentyFiveGigabitEthernet6/0/1 active 6688572 206902836 0 1.34e2 30.93 dpdk-input polling 7128012 413802792 0 8.77e1 58.05 ethernet-input active 13378125 413802792 0 2.77e1 30.93 l2-input active 6809002 413802792 0 1.81e1 60.77 l2-output active 6809002 413802792 0 1.68e1 60.77 unix-epoll-input polling 6954 0 0 6.61e2 0.00 --------------- Thread 2 vpp_wk_1 (lcore 2) Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7702.68 vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 1.27e1 256.00 TwentyFiveGigabitEthernet6/0/0 active 456112 116764672 0 2.64e2 256.00 TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 1.39e1 256.00 TwentyFiveGigabitEthernet6/0/1 active 456112 116764672 0 2.74e2 256.00 dpdk-input polling 456112 233529344 0 1.41e2 512.00 ethernet-input active 912224 233529344 0 5.71e1 256.00 l2-input active 912224 233529344 0 3.66e1 256.00 l2-output active 912224 233529344 0 1.70e1 256.00 unix-epoll-input polling 445 0 0 9.59e2 0.00 --------------- Thread 3 vpp_wk_2 (lcore 3) Time 56.7, 10 sec internal node vector rate 256.00 loops/sec 7742.43 vector rates in 4.1188e6, out 4.1188e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 8.94e0 256.00 TwentyFiveGigabitEthernet6/0/0 active 456113 116764928 0 2.81e2 256.00 TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 9.54e0 256.00 TwentyFiveGigabitEthernet6/0/1 active 456113 116764928 0 2.72e2 256.00 dpdk-input polling 456113 233529856 0 1.61e2 512.00 ethernet-input active 912226 233529856 0 4.50e1 256.00 l2-input active 912226 233529856 0 2.93e1 256.00 l2-output active 912226 233529856 0 1.23e1 256.00 unix-epoll-input polling 445 0 0 1.03e3 0.00 Thread 1 (vpp_wk_0) is handling 7.29Mpps and moderately loaded, while Thread 2 and 3 are handling each 4.11Mpps and are completely pegged. That said, the relative amount of CPU clocks they are spending per packet is reasonably similar, but they don\u0026rsquo;t quite add up:\nThread 1 is doing 7.29Mpps and is spending on average 449 CPU cycles per packet. I get this number by adding up all of the values in the Clocks column, except for the unix-epoll-input node. But that\u0026rsquo;s somewhat strange, because this Xeon D1518 clocks at 2.2GHz \u0026ndash; and yet 7.29M * 449 is 3.27GHz. My experience (in Linux) is that these numbers actually line up quite well. Thread 2 is doing 4.12Mpps and is spending on average 816 CPU cycles per packet. This kind of makes sense as the cycles/packet is roughly double that of thread 1, and the packet/sec is roughly half \u0026hellip; and the total of 4.12M * 816 is 3.36GHz. I can see similarly values for thread 3: 4.12Mpps and also 819 CPU cycles per packet which amounts to VPP self-reporting using 3.37GHz worth of cycles on this thread. 
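To make that back-of-the-envelope arithmetic easy to reproduce, here is a small Python sketch. It is not VPP code; the packet rates and per-node Clocks values are simply copied from the show run output above, and the small differences from the in-text figures are rounding:

```python
# Reproduce the cycles-per-packet arithmetic from the 'show run' output above.
# The numbers are copied from the article; nothing here talks to VPP.
threads = {
    # name: (vector rate in packets/sec, per-node Clocks values, excluding unix-epoll-input)
    "vpp_wk_0": (7.2982e6, [13.4, 137.0, 14.5, 134.0, 87.7, 27.7, 18.1, 16.8]),
    "vpp_wk_1": (4.1188e6, [12.7, 264.0, 13.9, 274.0, 141.0, 57.1, 36.6, 17.0]),
    "vpp_wk_2": (4.1188e6, [8.94, 281.0, 9.54, 272.0, 161.0, 45.0, 29.3, 12.3]),
}

for name, (pps, clocks) in threads.items():
    cycles_per_packet = sum(clocks)
    implied_hz = pps * cycles_per_packet
    print(f"{name}: {cycles_per_packet:6.1f} cycles/packet -> {implied_hz / 1e9:.2f} GHz")
```

All three threads end up self-reporting more cycles per second than the 2.2GHz base clock of this Xeon D-1518, which is exactly the discrepancy noted above.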
When I look at the thread to CPU placement, I get another surprise:\nvpp# show threads ID Name Type LWP Sched Policy (Priority) lcore Core Socket State 0 vpp_main 100346 (nil) (n/a) 0 42949674294967 1 vpp_wk_0 workers 100473 (nil) (n/a) 1 42949674294967 2 vpp_wk_1 workers 100474 (nil) (n/a) 2 42949674294967 3 vpp_wk_2 workers 100475 (nil) (n/a) 3 42949674294967 vpp# show cpu Model name: Intel(R) Xeon(R) CPU D-1518 @ 2.20GHz Microarch model (family): [0x6] Broadwell ([0x56] Broadwell DE) stepping 0x3 Flags: sse3 pclmulqdq ssse3 sse41 sse42 avx rdrand avx2 bmi2 rtm pqm pqe rdseed aes invariant_tsc Base frequency: 2.19 GHz The numbers in show threads are all messed up, and I don\u0026rsquo;t quite know what to make of it yet. I think the perhaps overly specific Linux implementation of the thread pool management is throwing off FreeBSD a bit. Perhaps some profiling could be useful, so I make a note to discuss this with Tom or the freebsd-net mailing list, who will know a fair bit more about this type of stuff on FreeBSD than I do.\nAnyway, functionally: this works. Performance wise: I have some questions :-) I let all eight loadtests complete and without further ado, here\u0026rsquo;s the results:\nType Uni/BiDir Packets/Sec L2 Bits/Sec Line Rate vm=var2,size=1514 Unidirectional 2.01Mpps 24.45Gbps 99% vm=var2,size=imix Unidirectional 8.07Mpps 23.42Gbps 99% vm=var2,size=64 Unidirectional 23.93Mpps 12.25Gbps 64% size=64 Unidirectional 12.80Mpps 6.56Gbps 34% vm=var2,size=1514 Bidirectional 3.91Mpps 47.35Gbps 86% vm=var2,size=imix Bidirectional 13.38Mpps 38.81Gbps 82% vm=var2,size=64 Bidirectional 15.56Mpps 7.97Gbps 21% size=64 Bidirectional 20.96Mpps 10.73Gbps 28% Conclusion: I have to say: 12.8Mpps on a unidirectional 64b single-flow loadtest (thereby only being able to make use of one DPDK worker), and 20.96Mpps on a bidirectional 64b single-flow loadtest, is not too shabby. But seeing as one CPU thread can do 12.8Mpps, I would imagine that three CPU threads would perform at 38.4Mpps or there-abouts, but I\u0026rsquo;m seeing only 23.9Mpps and some unexplained variance in per-thread performance.\nResults I learned a lot! Some hilights:\nThe netmap implementation is not playing ball for the moment, as forwarding stops consistently, in both the bridge.c as well as the VPP plugin. It is clear though, that netmap is a fair bit faster (11.4Mpps) than kernel forwarding which came in at roughly 1.2Mpps per CPU thread. What\u0026rsquo;s a bit troubling is that netmap doesn\u0026rsquo;t seem to work very well in VPP \u0026ndash; traffic forwarding also stops here. DPDK performs quite well on FreeBSD, I manage to see a throughput of 20.96Mpps which is almost twice the throughput of netmap, which is cool but I can\u0026rsquo;t quite explain the stark variance in throughput between the worker threads. Perhaps VPP is placing the workers on hyperthreads? Perhaps an equivalent of isolcpus in the Linux kernel would help? For the curious, I\u0026rsquo;ve bundled up a few files that describe the machine and its setup: [dmesg] [pciconf] [loader.conf] [VPP startup.conf]\n","date":"2024-02-17","desc":"About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. 
Over the years, folks have asked me regularly \u0026ldquo;What about BSD?\u0026rdquo; and to my surprise, late last year I read an announcement from the FreeBSD Foundation [ref] as they looked back over 2023 and forward to 2024:\n","permalink":"https://ipng.ch/s/articles/2024/02/17/vpp-on-freebsd-part-2/","section":"articles","title":"VPP on FreeBSD - Part 2"},{"contents":"About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Over the years, folks have asked me regularly \u0026ldquo;What about BSD?\u0026rdquo; and to my surprise, late last year I read an announcement from the FreeBSD Foundation [ref] as they looked back over 2023 and forward to 2024:\nPorting the Vector Packet Processor to FreeBSD\nVector Packet Processing (VPP) is an open-source, high-performance user space networking stack that provides fast packet processing suitable for software-defined networking and network function virtualization applications. VPP aims to optimize packet processing through vectorized operations and parallelism, making it well-suited for high-speed networking applications. In November of this year, the Foundation began a contract with Tom Jones, a FreeBSD developer specializing in network performance, to port VPP to FreeBSD. Under the contract, Tom will also allocate time for other tasks such as testing FreeBSD on common virtualization platforms to improve the desktop experience, improving hardware support on arm64 platforms, and adding support for low power idle on Intel and arm64 hardware.\nI reached out to Tom and introduced myself \u0026ndash; and IPng Networks \u0026ndash; and offered to partner. Tom knows FreeBSD very well, and I know VPP very well. And considering lots of folks have asked me that loaded \u0026ldquo;What about BSD?\u0026rdquo; question, I think a reasonable answer might now be: Coming up! Tom will be porting VPP to FreeBSD, and I\u0026rsquo;ll be providing a test environment with a few VMs, physical machines with varying architectures (think single-numa, AMD64 and Intel platforms).\nIn this first article, let\u0026rsquo;s take a look at tablestakes: installing FreeBSD 14.0-RELEASE and doing all the little steps necessary to get VPP up and running.\nMy test setup Tom and I will be using two main test environments. The first is a set of VMs running on QEMU, which we can do functional testing on, by configuring a bunch of VPP routers with a set of normal FreeBSD hosts attached to them. The second environment will be a few Supermicro bare metal servers that we\u0026rsquo;ll use for performance testing, notably to compare the FreeBSD kernel routing, fancy features like netmap, and of course VPP itself. I do intend to do some side-by-side comparisons between Debian and FreeBSD when they run VPP.\nIf you know me a little bit, you\u0026rsquo;ll know that I typically forget how I did a thing, so I\u0026rsquo;m using this article for others as well as myself in case I want to reproduce this whole thing 5 years down the line. Oh, and if you don\u0026rsquo;t know me at all, now you know my brain, pictured left, is not too different from a leaky sieve.\nVMs: IPng Lab I really like the virtual machine environment that the [IPng Lab] provides. 
So my very first step is to grab an UFS based image like [these ones], and I prepare a lab image. This goes roughly as follows \u0026ndash;\nDownload the UFS qcow2 and unxz it. Create a 10GB ZFS blockdevice zfs create ssd-vol0/vpp-proto-freebsd-disk0 -V10G Make a copy of my existing vpp-proto-bookworm libvirt config, and edit it with new MAC addresses, UUID and hostname (essentially just an s/bookworm/freebsd/g Boot the VM once using VNC, to add serial booting to /boot/loader.conf Finally, install a bunch of stuff that I would normally use on a FreeBSD machine: A user account \u0026lsquo;pim\u0026rsquo; and \u0026lsquo;ipng\u0026rsquo;, set the \u0026lsquo;root\u0026rsquo; password A bunch of packages (things like vim, bash, python3, rsync) SSH host keys and authorized_keys files A sensible rc.conf that DHCPs on its first network card vtnet0 I notice that FreeBSD has something pretty neat in rc.conf, called growfs_enable, which will take a look at the total disk size available in slice 4 (the one that contains the main filesystem), and if the disk has free space beyond the end of the partition, it\u0026rsquo;ll slurp it up and resize the filesystem to fit. Reading the /etc/rc.d/growfs file, I see that this works for both ZFS and UFS. A chef\u0026rsquo;s kiss that I found super cool!\nNext, I take a snapshot of the disk image and add it to the Lab\u0026rsquo;s zrepl configuration, so that this base image gets propagated to all hypervisors, the result is a nice 10GB large base install that boots off of serial.\npim@hvn0-chbtl0:~$ zfs list -t all | grep vpp-proto-freebsd ssd-vol0/vpp-proto-freebsd-disk0 13.0G 45.9G 6.14G - ssd-vol0/vpp-proto-freebsd-disk0@20240206-release 614M - 6.07G - ssd-vol0/vpp-proto-freebsd-disk0@20240207-release 3.95M - 6.14G - ssd-vol0/vpp-proto-freebsd-disk0@20240207-2 0B - 6.14G - ssd-vol0/vpp-proto-freebsd-disk0#zrepl_CURSOR_G_760881003460c452_J_source-vpp-proto - - 6.14G - One note for the pedants \u0026ndash; the kernel that ships with Debian, for some reason I don\u0026rsquo;t quite understand, does not come with an UFS kernel module that allows to mount these filesystems read-write. Maybe this is because there are a few different flavors of UFS out there, and the maintainer of that kernel module is not comfortable enabling write-mode on all of them. I don\u0026rsquo;t know, but my use case isn\u0026rsquo;t critical as my build will just copy a few files on the otherwise ephemeral ZFS cloned filesystem.\nSo off I go, asking Summer to build me a Linux 6.1 kernel for Debian Bookworm (which is what the hypervisors are running). For those following along at home, here\u0026rsquo;s how that looked like for me:\npim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \\ libncurses5-dev libelf-dev libssl-dev dwarves bison pim@summer:/usr/src$ sudo apt install linux-source-6.1 pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz pim@summer:/usr/src$ cd linux-source-6.1/ pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-16-amd64 .config pim@summer:/usr/src/linux-source-6.1$ cat \u0026lt;\u0026lt; EOF | sudo tee -a .config CONFIG_UFS_FS=m CONFIG_UFS_FS_WRITE=y EOF pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg Finally, I add a new LAB overlay type called freebsd to the Python/Jinja2 tool I built, which I use to create and maintain the LAB hypervisors. 
If you\u0026rsquo;re curious about this part, take a look at the [article] I wrote about the environment. I reserve LAB #2 running on hvn2.lab.ipng.ch for the time being, as LAB #0 and #1 are in use by other projects. To cut to the chase, here\u0026rsquo;s what I type to generate the overlay and launch a LAB using the FreeBSD I just made. There\u0026rsquo;s not much in the overlay, really just some templated rc.conf to set the correct hostname and mgmt IPv4/IPv6 addresses and so on.\npim@lab:~/src/lab$ find overlays/freebsd/ -type f overlays/freebsd/common/home/ipng/.ssh/authorized_keys.j2 overlays/freebsd/common/etc/rc.local.j2 overlays/freebsd/common/etc/rc.conf.j2 overlays/freebsd/common/etc/resolv.conf.j2 overlays/freebsd/common/root/lab-build/perms overlays/freebsd/common/root/.ssh/authorized_keys.j2 pim@lab:~/src/lab$ ./generate --host hvn2.lab.ipng.ch --overlay freebsd pim@lab:~/src/lab$ export BASE=vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-freebsd-disk0@20240207-2 pim@lab:~/src/lab$ OVERLAY=freebsd LAB=2 ./create pim@lab:~/src/lab$ LAB=2 ./command start After rebooting the hypervisors with their new UFS2-write-capable kernel, I can finish the job and create the lab VMs. The create call above first makes a ZFS clone of the base image, then mounts it, rsyncs the generated overlay files over it, then creates a ZFS snapshot called @pristine, before booting up the seven virtual machines that comprise this spiffy new FreeBSD lab:\nI decide to park the LAB for now, as that beautiful daisy-chain of vpp2-0 - vpp2-3 routers will first need a working VPP install, which I don\u0026rsquo;t quite have yet.\nBare Metal Next, I take three spare Supermicro SYS-5018D-FN8T, which have the following specs:\nFull IPMI support (power, serial-over-lan and kvm-over-ip with HTML5), on a dedicated network port. A 4-core, 8-thread Xeon D1518 CPU which runs at 35W TDP Two independent Intel i210 NICs (Gigabit) A Quad Intel i350 NIC (Gigabit) Two Intel X552 (TenGigabitEthernet) Two Intel X710-XXV (TwentyFiveGigabitEthernet) ports in the PCIe v3.0 x8 slot m.SATA 120G boot SSD 2x16GB of ECC RAM These were still arranged in a test network from when Adrian and I worked on the [VPP MPLS] project together, and back then I called the three machines France, Belgium and Netherlands. I decide to reuse that, and save myself some recabling. Using IPMI, I install the France server with FreeBSD, while the other two, for now, are still running Debian. This can be useful for (a) side by side comparison tests and (b) to be able to quickly run some T-Rex loadtests.\nI have to admit - I love Supermicro\u0026rsquo;s IPMI implementation. Being able to plop in an ISO over Samba, and the boot on VGA, including into the BIOS to set/change things, and then completely reinstall while hanging out on the couch while drinking tea, absolutetey\nStarting Point I use the base image I described above to clone a beefy VM for building and development purposes. I give that machine 32GB of RAM and 24 cores on one of IPng\u0026rsquo;s production hypervisors. I spent some time with Tom this week to go over a few details about the build, and he patiently described where he\u0026rsquo;s at with the porting. It\u0026rsquo;s not done yet, but he has good news: it does compile cleanly on his machine, so there is hope for me yet! He has prepared a GitHub repository with all of the changes staged - and he will be sequencing them out one by one to merge upstream. 
In case you want to follow along with his work, take a look at this [Gerrit search].\nFirst, I need to go build a whole bunch of stuff. Here\u0026rsquo;s a recap \u0026ndash;\nDownload ports and kernel source Build and install a GENERIC kernel Build DPDK including its FreeBSD kernel modules contigmem and nic_uio Build netmap bridge utility Build VPP :) To explain a little bit: Linux has hugepages which are 2MB or 1GB memory pages. These come with a significant performance benefit, mostly because the CPU will have a table called the Translation Lookaside Buffer or [TLB] which keeps a mapping between virtual and physical memory pages. If there is too much memory allocated to a process, this TLB table thrashes which comes at a performance penalty. When allocating not the standard 4kB pages, but larger 2MB or 1GB ones, this does not happen. For FreeBSD, the DPDK library provides an equivalent kernel module, which is called contigmem.\nMany (but not all!) DPDK poll mode drivers will remove the kernel network card driver and rebind the network card to a Userspace IO or UIO driver. DPDK also ships one of these for FreeBSD, called nic_uio. So my first three steps are compiling all of these things, including a standard DPDK install from ports.\nBuild: FreeBSD + DPDK Building things on FreeBSD is all very well documented in the [FreeBSD Handbook]. In order to avoid filling up the UFS boot disk, I snuck in another SAS-12 SSD to get a bit faster builds, and I mount /usr/src and /usr/obj on it.\nHere\u0026rsquo;s a recap of what I ended up doing to build a fresh GENERIC kernel and the DPDK port:\n[pim@freebsd-builder ~]$ sudo zfs create -o mountpoint=/usr/src ssd-vol0/src [pim@freebsd-builder ~]$ sudo zfs create -o mountpoint=/usr/obj ssd-vol0/obj [pim@freebsd-builder ~]$ sudo git clone --branch stable/14 https://git.FreeBSD.org/src.git /usr/src [pim@freebsd-builder /usr/src]$ sudo make buildkernel KERNCONF=GENERIC [pim@freebsd-builder /usr/src]$ sudo make installkernel KERNCONF=GENERIC [pim@freebsd-builder ~]$ sudo git clone https://git.FreeBSD.org/ports.git /usr/ports [pim@freebsd-builder /usr/ports/net/dpdk ]$ sudo make install I patiently answer a bunch of questions (all of them just with the default) when the build process asks me what I want. DPDK is a significant project, and it pulls in lots of dependencies to build as well. After what feels like an eternity, the builds are complete, and I have a kernel together with kernel modules, as well as a bunch of handy DPDK helper utilities (like dpdk-testpmd) installed. Just to set expectations \u0026ndash; the build took about an hour for me from start to finish (on a 32GB machine with 24 vCPUs), so hunker down if you go this route.\nNOTE: I wanted to see what I was being asked in this build process, but since I ended up answering everything with the default, you can feel free to add BATCH=yes to the make of DPDK (and see the man page of dpdk(7) for details).\nBuild: contigmem and nic_uio Using a few sysctl calls, I can configure four buffers of 1GB each, which will serve as my equivalent hugepages from Linux, and I add the following to /boot/loader.conf, so that these contiguous regions are reserved early in the boot cycle, when memory is not yet fragmented:\nhw.contigmem.num_buffers=4 hw.contigmem.buffer_size=1073741824 contigmem_load=\u0026#34;YES\u0026#34; To figure out which network devices to rebind to the UIO driver, I can inspect the PCI bus with the pciconf utility:\n[pim@freebsd-builder ~]$ pciconf -vl | less ... 
virtio_pci0@pci0:1:0:0: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 vendor = \u0026#39;Red Hat, Inc.\u0026#39; device = \u0026#39;Virtio 1.0 network device\u0026#39; class = network subclass = ethernet virtio_pci1@pci0:1:0:1: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 vendor = \u0026#39;Red Hat, Inc.\u0026#39; device = \u0026#39;Virtio 1.0 network device\u0026#39; class = network subclass = ethernet virtio_pci0@pci0:1:0:2: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 vendor = \u0026#39;Red Hat, Inc.\u0026#39; device = \u0026#39;Virtio 1.0 network device\u0026#39; class = network subclass = ethernet virtio_pci1@pci0:1:0:3: class=0x020000 rev=0x01 hdr=0x00 vendor=0x1af4 device=0x1041 subvendor=0x1af4 subdevice=0x1100 vendor = \u0026#39;Red Hat, Inc.\u0026#39; device = \u0026#39;Virtio 1.0 network device\u0026#39; class = network subclass = ethernet My virtio based network devices are on PCI location 1:0:0 \u0026ndash; 1:0:3 and I decide to take away the last two, which makes my final loader configuration for the kernel:\n[pim@freebsd-builder ~]$ cat /boot/loader.conf kern.geom.label.disk_ident.enable=0 zfs_load=YES boot_multicons=YES boot_serial=YES comconsole_speed=115200 console=\u0026#34;comconsole,vidconsole\u0026#34; hw.contigmem.num_buffers=4 hw.contigmem.buffer_size=1073741824 contigmem_load=\u0026#34;YES\u0026#34; nic_uio_load=\u0026#34;YES\u0026#34; hw.nic_uio.bdfs=\u0026#34;1:0:2,1:0:3\u0026#34; Build: Results Now that all of this is done, the machine boots with these drivers loaded, and I can see only my first two network devices (vtnet0 and vtnet1), while the other two are gone. This is good news, because that means they are now under control of the DPDK nic_uio kernel driver, whohoo!\n[pim@freebsd-builder ~]$ kldstat Id Refs Address Size Name 1 28 0xffffffff80200000 1d36230 kernel 2 1 0xffffffff81f37000 4258 nic_uio.ko 3 1 0xffffffff81f3c000 5d5618 zfs.ko 4 1 0xffffffff82513000 5378 contigmem.ko 5 1 0xffffffff82c18000 3250 ichsmb.ko 6 1 0xffffffff82c1c000 2178 smbus.ko 7 1 0xffffffff82c1f000 430c virtio_console.ko 8 1 0xffffffff82c24000 22a8 virtio_random.ko Build: VPP Tom has prepared a branch on his GitHub account, which poses a few small issues with the build. Notably, we have to use a few GNU tools like gmake. But overall, I find the build is very straight forward - kind of looking like this:\n[pim@freebsd-builder ~]$ sudo pkg install py39-ply git gmake gsed cmake libepoll-shim gdb python3 ninja [pim@freebsd-builder ~/src]$ git clone git@github.com:adventureloop/vpp.git [pim@freebsd-builder ~/src/vpp]$ git checkout freebsd-vpp [pim@freebsd-builder ~]$ gmake install-dep [pim@freebsd-builder ~]$ gmake build Results Now, taking into account that not everything works (for example there isn\u0026rsquo;t a packaging yet, let alone something as fancy as a port), and that there\u0026rsquo;s a bit of manual tinkering going on, let me show you at least the absolute gem that is this screenshot:\nThe (debug build) VPP instance started, the DPDK plugin loaded, and it found the two devices that were bound by the newly installed nic_uio driver. Setting an IPv4 address on one of these interfaces works, and I can ping another machine on the LAN connected to Gi10/0/2, which I find dope.\nHello, World!\nWhat\u0026rsquo;s next ? There\u0026rsquo;s a lot of ground to cover with this port. 
While Tom munches away at the Gerrits he has stacked up, I\u0026rsquo;m going to start kicking the tires on the FreeBSD machines. In this article I showed the table-stakes preparation: a FreeBSD lab on the hypervisors, a build machine that has DPDK, kernel and VPP in a somewhat working state (with two NICs in VirtIO), and a Supermicro bare metal machine that I installed to do the same.\nIn a future set of articles in this series, I will:\nDo a comparative loadtest between FreeBSD kernel, Netmap, VPP+Netmap, and VPP+DPDK Take a look at how FreeBSD stacks up against Debian on the same machine Do a bit of functional testing, to ensure dataplane functionality is in place A few things will need some attention:\nSome Linux details have leaked, for example show cpu and show pci in VPP Linux Control Plane uses TAP devices which Tom has mentioned may need some work Similarly, Linux Control Plane netlink handling may or may not work as expected in FreeBSD Build and packaging, obviously there is no make pkg-deb ","date":"2024-02-10","desc":"About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. Over the years, folks have asked me regularly \u0026ldquo;What about BSD?\u0026rdquo; and to my surprise, late last year I read an announcement from the FreeBSD Foundation [ref] as they looked back over 2023 and forward to 2024:\n","permalink":"https://ipng.ch/s/articles/2024/02/10/vpp-on-freebsd-part-1/","section":"articles","title":"VPP on FreeBSD - Part 1"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nYou\u0026rsquo;ll hear me talk about VPP being API centric, with no configuration persistence, and that\u0026rsquo;s by design. However, there is also this CLI utility called vppctl, right, so what gives? In truth, the CLI is used a lot by folks to configure their dataplane, but it really was always meant to be a debug utility. There\u0026rsquo;s a whole wealth of programmability that is not exposed via the CLI at all, and the VPP community develops and maintains an elaborate set of tools to allow external programs to (re)configure the dataplane. One such tool is my own [vppcfg] which takes a YAML specification that describes the dataplane configuration, and applies it safely to a running VPP instance.\nIntroduction In case you\u0026rsquo;re interested in writing your own automation, this article is for you! I\u0026rsquo;ll provide a deep dive into the Python API which ships with VPP. It\u0026rsquo;s actually very easy to use once you get used to it \u0026ndash; assuming you know a little bit of Python of course :)\nVPP API: Anatomy When developers write their VPP features, they\u0026rsquo;ll add an API definition file that describes control-plane messages that are typically called via a shared memory interface, which explains why these things are called memclnt in VPP. Certain API types can be created, resembling their underlying C structures, and these types are passed along in messages. 
Finally, a service is a Request/Reply pair of messages. When requests are received, VPP executes a handler whose job it is to parse the request and send either a singular reply, or a stream of replies (like a list of interfaces).\nClients connect to a unix domain socket, typically /run/vpp/api.sock. A TCP port can also be used, with the caveat that there is no access control provided. Messages are exchanged over this channel asynchronously. A common pattern of async API design is to have a client identifier (called a client_index) and some random number (called a context) with which the client identifies their request. Using these two things, VPP will issue a callback using (a) the client_index to send the reply to and (b) the context, so the client knows which request the reply is meant for.\nBy the way, this asynchronous design pattern gives programmers one really cool benefit out of the box: events that are not explicitly requested, like say, link-state change on an interface, can now be implemented by simply registering a standing callback for a certain message type - I\u0026rsquo;ll show how that works at the end of this article. As a result, any number of clients, their requests and even arbitrary VPP initiated events can be in flight at the same time, which is pretty slick!\nAPI Types Most API requests pass along data structures, which follow their internal representation in VPP. I\u0026rsquo;ll start by taking a look at a simple example \u0026ndash; the VPE itself. It defines a few things in src/vpp/vpe_types.api, notably a few type definitions and one enum:\ntypedef version { u32 major; u32 minor; u32 patch; /* since we can\u0026#39;t guarantee that only fixed length args will follow the typedef, string type not supported for typedef for now. */ u8 pre_release[17]; /* 16 + \u0026#34;\\0\u0026#34; */ u8 build_metadata[17]; /* 16 + \u0026#34;\\0\u0026#34; */ }; typedef f64 timestamp; typedef f64 timedelta; enum log_level { VPE_API_LOG_LEVEL_EMERG = 0, /* emerg */ VPE_API_LOG_LEVEL_ALERT = 1, /* alert */ VPE_API_LOG_LEVEL_CRIT = 2, /* crit */ VPE_API_LOG_LEVEL_ERR = 3, /* err */ VPE_API_LOG_LEVEL_WARNING = 4, /* warn */ VPE_API_LOG_LEVEL_NOTICE = 5, /* notice */ VPE_API_LOG_LEVEL_INFO = 6, /* info */ VPE_API_LOG_LEVEL_DEBUG = 7, /* debug */ VPE_API_LOG_LEVEL_DISABLED = 8, /* disabled */ }; By doing this, API requests and replies can start referring to these types. When reading this, it feels a bit like a C header file, showing me the structure. For example, I know that if I ever need to pass along an argument called log_level, I know which values I can provide, together with their meaning.\nAPI Messages I now take a look at src/vpp/api/vpe.api itself, which is where the VPE API is defined. It includes the aforementioned vpe_types.api file, so it can reference these typedefs and the enum. Here, I see a few messages defined that constitute a Request/Reply pair:\ndefine show_version { u32 client_index; u32 context; }; define show_version_reply { u32 context; i32 retval; string program[32]; string version[32]; string build_date[32]; string build_directory[256]; }; There\u0026rsquo;s one small surprise here out of the gate. I would\u0026rsquo;ve expected that beautiful typedef called version from the vpe_types.api file to make an appearance, but it\u0026rsquo;s conspicuously missing from the show_version_reply message. Ha! 
But the rest of it seems reasonably self-explanatory \u0026ndash; as I already know about the client_index and context fields, I now know that this request does not carry any arguments, and that the reply has a retval for application errors, similar to how most libC functions return 0 on success, and some negative value error number defined in [errno.h]. Then, there are four strings of the given length, which I should be able to consume.\nAPI Services The VPP API defines three types of message exchanges:\nRequest/Reply - The client sends a request message and the server replies with a single reply message. The convention is that the reply message is named as method_name + _reply.\nDump/Detail - The client sends a “bulk” request message to the server, and the server replies with a set of detail messages. These messages may be of different types. The method name must end with method + _dump, and the reply message should be named method + _details. These Dump/Detail methods are typically used for acquiring bulk information, like the complete FIB table.\nEvents - The client can register for getting asynchronous notifications from the server. This is useful for getting interface state changes, and so on. The method name for requesting notifications is conventionally prefixed with want_, for example want_interface_events.\nIf the convention is kept, the API machinery will correlate the foo and foo_reply messages into RPC services. But it\u0026rsquo;s also possible to be explicit about these, by defining service scopes in the *.api files. I\u0026rsquo;ll take two examples; the first one is from the Linux Control Plane plugin (which I\u0026rsquo;ve [written about] a lot while I was contributing to it back in 2021).\nDump/Detail (example): When enumerating Linux Interface Pairs, the service definition looks like this:\nservice { rpc lcp_itf_pair_get returns lcp_itf_pair_get_reply stream lcp_itf_pair_details; }; To puzzle this together, the request called lcp_itf_pair_get is paired up with a reply called lcp_itf_pair_get_reply followed by a stream of zero-or-more lcp_itf_pair_details messages. Note the use of the pattern rpc X returns Y stream Z.\nEvents (example): I also take a look at an event handler like the one in the interface API that made an appearance in my list of API message types, above:\nservice { rpc want_interface_events returns want_interface_events_reply events sw_interface_event; }; Here, the request is want_interface_events which returns a want_interface_events_reply followed by zero or more sw_interface_event messages, which is very similar to the streaming (dump/detail) pattern. The semantic difference is that streams are lists of things and events are asynchronously happening things in the dataplane \u0026ndash; in other words the stream is meant to end while the events messages are generated by VPP when the event occurs. In this case, if an interface is created or deleted, or the link state of an interface changes, one of these is sent from VPP to the client(s) that registered an interest in it by calling the want_interface_events RPC.\nJSON Representation VPP comes with an internal API compiler that scans the source code for these *.api files and assembles them into a few output formats. I take a look at the Python implementation of it in src/tools/vppapigen/ and see that it generates C, Go and JSON. As an aside, I chuckle a little bit at a Python script generating Go and C, but I quickly get over myself. 
I\u0026rsquo;m not that funny.\nThe vppapigen tool outputs a bunch of JSON files, one per API specification, which wraps up all of the information from the types, unions and enums, the message and service definitions, together with a few other bits and bobs, and when VPP is installed, these end up in /usr/share/vpp/api/. As of the upcoming VPP 24.02 release, there\u0026rsquo;s about 50 of these core APIs and an additional 80 or so APIs defined by plugins like the Linux Control Plane.\nImplementing APIs is pretty user friendly, largely due to the vppapigen tool taking so much of the boilerplate and autogenerating things. As an example, I need to be able to enumerate the interfaces that are MPLS enabled, so that I can use my [vppcfg] utility to configure MPLS. I contributed an API called mpls_interface_dump which returns a stream of mpls_interface_details messages. You can see that small contribution in merged [Gerrit 39022].\nVPP Python API The VPP API has been ported to many languages (C, C++, Go, Lua, Rust, Python, probably a few others). I am primarily a user of the Python API, which ships alongside VPP in a separate Debian package. The source code lives in src/vpp-api/python/ which doesn\u0026rsquo;t have any dependencies other than Python\u0026rsquo;s own setuptools. Its implementation is canonically called vpp_papi, which, I cannot tell a lie, reminds me of Spanish rap music. But, if you\u0026rsquo;re still reading, maybe now is a good time to depart from the fundamental, and get to the practical!\nExample: Hello World Without further ado, I dive right in with this tiny program:\nfrom vpp_papi import VPPApiClient, VPPApiJSONFiles vpp_api_dir = VPPApiJSONFiles.find_api_dir([]) vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir) vpp = VPPApiClient(apifiles=vpp_api_files, server_address=\u0026#34;/run/vpp/api.sock\u0026#34;) vpp.connect(\u0026#34;ipng-client\u0026#34;) api_reply = vpp.api.show_version() print(api_reply) The first thing this program does is construct a so-called VPPApiClient object. To do this, I need to feed it a list of JSON definitions, so that it knows what types of APIs are available. As I mentioned above, I need those JSON files to create the list of API definitions, and there are two handy helpers here:\nfind_api_dir() - This is a helper that finds the location of the API files. Normally, the JSON files get installed in /usr/share/vpp/api/, but when I\u0026rsquo;m writing code, it\u0026rsquo;s more likely that the files are in /home/pim/src/vpp/ somewhere. This helper function tries to do the right thing and detect if I\u0026rsquo;m in a client or if I\u0026rsquo;m using a production install, and will return the correct directory. find_api_files() - Now, I could rummage through that directory and find the JSON files, but there\u0026rsquo;s another handy helper that does that for me, given a directory (like the one I just got handed to me). Life is easy. Once I have the JSON files in hand, I can construct a client by specifying the server_address location to connect to \u0026ndash; this is typically a unix domain socket in /run/vpp/api.sock but it can also be a TCP endpoint. As a quick aside: If you, like me, stumbled over the socket being owned by root:vpp but not writable by the group, that finally got fixed by Georgy in [Gerrit 39862].\nOnce I\u0026rsquo;m connected, I can start calling arbitrary API methods, like show_version() which does not take any arguments. 
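A quick aside before looking at the reply: replies to Request/Reply calls carry that retval field for application errors, and in my own scripts I like to wrap calls in a tiny helper that raises when it is non-zero. This is a minimal sketch of my own convention, assuming the vpp client object from the hello world above \u0026ndash; it is not something vpp_papi itself ships:

# Hypothetical convenience wrapper, not part of vpp_papi.
def call_checked(api_method, **kwargs):
    reply = api_method(**kwargs)
    # Dump/Detail calls return a list of details messages without a retval,
    # so only check the field when it is actually present.
    if hasattr(reply, \u0026#34;retval\u0026#34;) and reply.retval != 0:
        raise RuntimeError(f\u0026#34;API call failed with retval={reply.retval}\u0026#34;)
    return reply

api_reply = call_checked(vpp.api.show_version)

With or without such a wrapper, the plain show_version() call is all this example needs.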
Its reply is a named tuple, and it looks like this:\npim@vpp0-0:~/vpp_papi_examples$ ./00-version.py show_version_reply(_0=1415, context=1, retval=0, program=\u0026#39;vpe\u0026#39;, version=\u0026#39;24.02-rc0~46-ga16463610\u0026#39;, build_date=\u0026#39;2023-10-15T14:50:49\u0026#39;, build_directory=\u0026#39;/home/pim/src/vpp\u0026#39;) And here is my beautiful hello world in seven (!) lines of code. All that reading and preparing finally starts paying off. Neat-oh!\nExample: Listing Interfaces From here on out, it\u0026rsquo;s just incremental learning. Here\u0026rsquo;s an example of how to extend the hello world example above and make it list the dataplane interfaces and their IPv4/IPv6 addresses:\napi_reply = vpp.api.sw_interface_dump() for iface in api_reply: str = f\u0026#34;[{iface.sw_if_index}] {iface.interface_name}\u0026#34; ipr = vpp.api.ip_address_dump(sw_if_index=iface.sw_if_index, is_ipv6=False) for addr in ipr: str += f\u0026#34; {addr.prefix}\u0026#34; ipr = vpp.api.ip_address_dump(sw_if_index=iface.sw_if_index, is_ipv6=True) for addr in ipr: str += f\u0026#34; {addr.prefix}\u0026#34; print(str) The API method sw_interface_dump() can take a few optional arguments. Notably, if sw_if_index is set, the call will dump that exact interface. If it\u0026rsquo;s not set, it will default to -1 which will dump all interfaces, and this is how I use it here. For completeness, the method also has an optional string name_filter, which will dump all interfaces which contain a given substring. For example passing name_filter='loop' and name_filter_value=True as arguments would enumerate all interfaces that have the word \u0026rsquo;loop\u0026rsquo; in them.\nNow, the definition of the sw_interface_dump method suggests that it returns a stream (remember the Dump/Detail pattern above), so I can predict that the messages I will receive are of type sw_interface_details. There\u0026rsquo;s lots of cool information in here, like the MAC address, MTU, encapsulation (if this is a sub-interface), but for now I\u0026rsquo;ll only make note of the sw_if_index and interface_name.\nUsing this interface index, I then call the ip_address_dump() method, which looks like this:\ndefine ip_address_dump { u32 client_index; u32 context; vl_api_interface_index_t sw_if_index; bool is_ipv6; }; define ip_address_details { u32 context; vl_api_interface_index_t sw_if_index; vl_api_address_with_prefix_t prefix; }; Alright then! If I want the IPv4 addresses for a given interface (referred to not by its name, but by its index), I can call it with argument is_ipv6=False. The return is zero or more messages that contain the index again, and a prefix the precise type of which can be looked up in ip_types.api. After doing a form of layer-one traceroute through the API specification files, it turns out that this prefix is cast to an instance of the IPv4Interface() class in Python. I won\u0026rsquo;t bore you with it, but the second call sets is_ipv6=True and, unsurprisingly, returns a bunch of IPv6Interface() objects.\nTo put it all together, the output of my little script:\npim@vpp0-0:~/vpp_papi_examples$ ./01-interface.py VPP version is 24.02-rc0~46-ga16463610 [0] local0 [1] GigabitEthernet10/0/0 192.168.10.5/31 2001:678:d78:201::fffe/112 [2] GigabitEthernet10/0/1 192.168.10.6/31 2001:678:d78:201::1:0/112 [3] GigabitEthernet10/0/2 [4] GigabitEthernet10/0/3 [5] loop0 192.168.10.0/32 2001:678:d78:200::/128 Example: Linux Control Plane Normally, services are either a Request/Reply or a Dump/Detail type. 
But careful readers may have noticed that the Linux Control Plane does a little bit of both. It has a Request/Reply/Detail triplet, because for request lcp_itf_pair_get, it will return a lcp_itf_pair_get_reply AND a stream of lcp_itf_pair_details. Perhaps in hindsight a more idiomatic way to do this was to have created simply a lcp_itf_pair_dump, but considering this is what we ended up with, I can use it as a good example case \u0026ndash; how might I handle such a response?\napi_reply = vpp.api.lcp_itf_pair_get() if isinstance(api_reply, tuple) and api_reply[0].retval == 0: for lcp in api_reply[1]: str = f\u0026#34;[{lcp.vif_index}] {lcp.host_if_name}\u0026#34; api_reply2 = vpp.api.sw_interface_dump(sw_if_index=lcp.host_sw_if_index) tap_iface = api_reply2[0] api_reply2 = vpp.api.sw_interface_dump(sw_if_index=lcp.phy_sw_if_index) phy_iface = api_reply2[0] str += f\u0026#34; tap {tap_iface.interface_name} phy {phy_iface.interface_name} mtu {phy_iface.link_mtu}\u0026#34; print(str) This particular API first sends its reply and then its stream, so I can expect it to be a tuple with the first element being a namedtuple and the second element being a list of details messages. A good way to ensure that is to check for the reply\u0026rsquo;s retval field to be 0 (success) before trying to enumerate the Linux Interface Pairs. These consist of a VPP interface (say GigabitEthernet10/0/0), which corresponds to a TUN/TAP device which in turn has a VPP name (eg tap1) and a Linux name (eg. e0).\nThe Linux Control Plane call will return these dataplane objects as numerical interface indexes, not names. However, I can resolve them to names by calling the sw_interface_dump() method and specifying the index as an argument. Because this is a Dump/Detail type API call, the return will be a stream (a list), which will have either zero (if the index didn\u0026rsquo;t exist), or one element (if it did).\nUsing this I can puzzle together the following output:\npim@vpp0-0:~/vpp_papi_examples$ ./02-lcp.py VPP version is 24.02-rc0~46-ga16463610 [2] loop0 tap tap0 phy loop0 mtu 9000 [3] e0 tap tap1 phy GigabitEthernet10/0/0 mtu 9000 [4] e1 tap tap2 phy GigabitEthernet10/0/1 mtu 9000 [5] e2 tap tap3 phy GigabitEthernet10/0/2 mtu 9000 [6] e3 tap tap4 phy GigabitEthernet10/0/3 mtu 9000 VPP\u0026rsquo;s Python API objects The objects in the VPP dataplane can be arbitrarily complex. They can have nested objects, enums, unions, repeated fields and so on. To illustrate a more complete example, I will take a look at an MPLS tunnel object in the dataplane. I first create the MPLS tunnel using the CLI, as follows:\nvpp# mpls tunnel l2-only via 192.168.10.3 GigabitEthernet10/0/1 out-labels 8298 100 200 vpp# mpls local-label add 8298 eos via l2-input-on mpls-tunnel0 The first command creates an interface called mpls-tunnel0 which, if it receives an ethernet frame, will encapsulate it into an MPLS datagram with a labelstack of 8298.100.200, and then forward it on to the router at 192.168.10.3. The second command adds a FIB entry to the MPLS table, upon receipt of a datagram with the label 8298, unwrap it and present the resulting datagram contents as an ethernet frame into mpls-tunnel0. By cross connecting this MPLS tunnel with any other dataplane interface (for example, HundredGigabitEthernet10/0/1.1234), this would be an elegant way to configure a classic L2VPN ethernet-over-MPLS transport. Which is hella cool, but I digress :)\nI want to inspect this tunnel using the API, and I find an mpls_tunnel_dump() method. 
As we know well by now, this is a Dump/Detail type method, so the return value will be a list of zero or more mpls_tunnel_details messages.\nThe mpls_tunnel_details message is simply a wrapper around an mpls_tunnel type as can be seen in mpls.api, and it references the fib_path type as well. Here they are:\ntypedef fib_path { u32 sw_if_index; u32 table_id; u32 rpf_id; u8 weight; u8 preference; vl_api_fib_path_type_t type; vl_api_fib_path_flags_t flags; vl_api_fib_path_nh_proto_t proto; vl_api_fib_path_nh_t nh; u8 n_labels; vl_api_fib_mpls_label_t label_stack[16]; }; typedef mpls_tunnel { vl_api_interface_index_t mt_sw_if_index; u32 mt_tunnel_index; bool mt_l2_only; bool mt_is_multicast; string mt_tag[64]; u8 mt_n_paths; vl_api_fib_path_t mt_paths[mt_n_paths]; }; define mpls_tunnel_details { u32 context; vl_api_mpls_tunnel_t mt_tunnel; }; Taking a closer look, the mpls_tunnel message consists of an index, then an mt_tunnel_index which corresponds to the tunnel number (i.e. interface mpls-tunnelN), some boolean flags, and then a vector of N FIB paths. Incidentally, you\u0026rsquo;ll find FIB paths all over the place in VPP: in routes, tunnels like this one, ACLs, and so on, so it\u0026rsquo;s good to get to know them a bit.\nRemember when I created the tunnel, I specified something like .. via ..? That\u0026rsquo;s a tell-tale sign that what follows is a FIB path. I specified a nexthop (192.168.10.3 GigabitEthernet10/0/1) and a list of three out-labels (8298, 100 and 200), all of which VPP has tucked away in this mt_paths field.\nAlthough it\u0026rsquo;s a bit verbose, I\u0026rsquo;ll paste the complete object for this tunnel, including the FIB path. You know, for science:\nmpls_tunnel_details(_0=1185, context=5, mt_tunnel=vl_api_mpls_tunnel_t( mt_sw_if_index=17, mt_tunnel_index=0, mt_l2_only=True, mt_is_multicast=False, mt_tag=\u0026#39;\u0026#39;, mt_n_paths=1, mt_paths=[ vl_api_fib_path_t(sw_if_index=2, table_id=0,rpf_id=0, weight=1, preference=0, type=\u0026lt;vl_api_fib_path_type_t.FIB_API_PATH_TYPE_NORMAL: 0\u0026gt;, flags=\u0026lt;vl_api_fib_path_flags_t.FIB_API_PATH_FLAG_NONE: 0\u0026gt;, proto=\u0026lt;vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP4: 0\u0026gt;, nh=vl_api_fib_path_nh_t( address=vl_api_address_union_t( ip4=IPv4Address(\u0026#39;192.168.10.3\u0026#39;), ip6=IPv6Address(\u0026#39;c0a8:a03::\u0026#39;)), via_label=0, obj_id=0, classify_table_index=0), n_labels=3, label_stack=[ vl_api_fib_mpls_label_t(is_uniform=0, label=8298, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=100, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=200, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0), vl_api_fib_mpls_label_t(is_uniform=0, label=0, ttl=0, exp=0) ] ) ] ) ) This mt_paths is really interesting, and I\u0026rsquo;d like to make a few observations:\ntype, 
flags and proto are ENUMs which I can find in fib_types.api nh is the nexthop - there is only one nexthop specified per path entry, so when things like ECMP multipath are in play, this will be a vector of N paths each with one nh. Good to know. This nexthop specifies an address which is a union just like in C. It can be either an ip4 or an ip6. I will know which to choose due to the proto field above. n_labels and label_stack: The MPLS label stack has a fixed size. VPP reveals here (in the API definition but also in the response) that the label-stack can be at most 16 labels deep. I feel like this is an interview question at Cisco, somehow. I know how many labels are relevant because of the n_labels field above. Their type is of fib_mpls_label which can be found in mpls.api. After having consumed all of this, I am ready to write a program that wheels over these message types and prints something a little bit more compact. The final program, in all of its glory \u0026ndash;\nfrom vpp_papi import VPPApiClient, VPPApiJSONFiles, VppEnum def format_path(path): str = \u0026#34;\u0026#34; if path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP4: str += f\u0026#34; ipv4 via {path.nh.address.ip4}\u0026#34; elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_IP6: str += f\u0026#34; ipv6 via {path.nh.address.ip6}\u0026#34; elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_MPLS: str += f\u0026#34; mpls\u0026#34; elif path.proto == VppEnum.vl_api_fib_path_nh_proto_t.FIB_API_PATH_NH_PROTO_ETHERNET: api_reply2 = vpp.api.sw_interface_dump(sw_if_index=path.sw_if_index) iface = api_reply2[0] str += f\u0026#34; ethernet to {iface.interface_name}\u0026#34; else: print(path) if path.n_labels \u0026gt; 0: str += \u0026#34; label\u0026#34; for i in range(path.n_labels): str += f\u0026#34; {path.label_stack[i].label}\u0026#34; return str api_reply = vpp.api.mpls_tunnel_dump() for tunnel in api_reply: str = f\u0026#34;Tunnel [{tunnel.mt_tunnel.mt_sw_if_index}] mpls-tunnel{tunnel.mt_tunnel.mt_tunnel_index}\u0026#34; for path in tunnel.mt_tunnel.mt_paths: str += format_path(path) print(str) api_reply = vpp.api.mpls_table_dump() for table in api_reply: print(f\u0026#34;Table [{table.mt_table.mt_table_id}] {table.mt_table.mt_name}\u0026#34;) api_reply2 = vpp.api.mpls_route_dump(table=table.mt_table.mt_table_id) for route in api_reply2: str = f\u0026#34; label {route.mr_route.mr_label} eos {route.mr_route.mr_eos}\u0026#34; for path in route.mr_route.mr_paths: str += format_path(path) print(str) Funny detail - it took me almost two years to discover VppEnum, which contains all of these symbols. If you end up reading this after a Bing, Yahoo or DuckDuckGo search, feel free to buy me a bottle of Glenmorangie - sláinte!\nThe format_path() method here has the smarts. Depending on the proto field, I print either an IPv4 path, an IPv6 path, an internal MPLS path (for example for the reserved labels 0..15), or an Ethernet path, which is the case in the FIB entry above that diverts incoming packets with label 8298 to be presented as ethernet datagrams into the intererface mpls-tunnel0. 
If it is an Ethernet proto, I can use the sw_if_index field to figure out which interface, and retrieve its details to find its name.\nThe format_path() method finally adds the stack of labels to the returned string, if the n_labels field is non-zero.\nMy program\u0026rsquo;s output:\npim@vpp0-0:~/vpp_papi_examples$ ./03-mpls.py VPP version is 24.02-rc0~46-ga16463610 Tunnel [17] mpls-tunnel0 ipv4 via 192.168.10.3 label 8298 100 200 Table [0] MPLS-VRF:0 label 0 eos 0 mpls label 0 eos 1 mpls label 1 eos 0 mpls label 1 eos 1 mpls label 2 eos 0 mpls label 2 eos 1 mpls label 8298 eos 1 ethernet to mpls-tunnel0 Creating VxLAN Tunnels Until now, all I\u0026rsquo;ve done is inspect the dataplane, in other words I\u0026rsquo;ve called a bunch of APIs that do not change state. Of course, many of VPP\u0026rsquo;s API methods change state as well. I\u0026rsquo;ll turn to another example \u0026ndash; the VxLAN tunnel API is defined in plugins/vxlan/vxlan.api and it\u0026rsquo;s gone through a few iterations. The VPP community tries to keep backwards compatibility, and a simple way of doing this is to create new versions of the methods by tagging them with suffixes such as _v2, while eventually marking the older versions as deprecated by adding an option deprecated; field to the definition. In this API specification I can see that we\u0026rsquo;re already at version 3 of the Request/Reply method in vxlan_add_del_tunnel_v3 and version 2 of the Dump/Detail method in vxlan_tunnel_v2_dump.\nOnce again, using these *.api definitions, finding an incantation to create a unicast VxLAN tunnel with a given VNI, then listing the tunnels, and finally deleting the tunnel I just created, would look like this:\napi_reply = vpp.api.vxlan_add_del_tunnel_v3(is_add=True, instance=100, vni=8298, src_address=\u0026#34;192.0.2.1\u0026#34;, dst_address=\u0026#34;192.0.2.254\u0026#34;, decap_next_index=1) if api_reply.retval == 0: print(f\u0026#34;Created VXLAN tunnel with sw_if_index={api_reply.sw_if_index}\u0026#34;) api_reply = vpp.api.vxlan_tunnel_v2_dump() for vxlan in api_reply: str = f\u0026#34;[{vxlan.sw_if_index}] instance {vxlan.instance} vni {vxlan.vni}\u0026#34; str += f\u0026#34; src {vxlan.src_address}:{vxlan.src_port} dst {vxlan.dst_address}:{vxlan.dst_port}\u0026#34; print(str) api_reply = vpp.api.vxlan_add_del_tunnel_v3(is_add=False, instance=100, vni=8298, src_address=\u0026#34;192.0.2.1\u0026#34;, dst_address=\u0026#34;192.0.2.254\u0026#34;, decap_next_index=1) if api_reply.retval == 0: print(f\u0026#34;Deleted VXLAN tunnel with sw_if_index={api_reply.sw_if_index}\u0026#34;) Many of the APIs in VPP will have create and delete in the same method, mostly by specifying the operation with an is_add argument like here. I think it\u0026rsquo;s kind of nice because it makes the creation and deletion symmetric, even though the deletion needs to specify a fair bit more than strictly necessary: the instance uniquely identifies the tunnel and should have been enough.\nThe output of this [CRUD] sequence (which stands for Create, Read, Update, Delete, in case you haven\u0026rsquo;t come across that acronym yet) then looks like this:\npim@vpp0-0:~/vpp_papi_examples$ ./04-vxlan.py VPP version is 24.02-rc0~46-ga16463610 Created VXLAN tunnel with sw_if_index=18 [18] instance 100 vni 8298 src 192.0.2.1:4789 dst 192.0.2.254:4789 Deleted VXLAN tunnel with sw_if_index=18 Listening to Events But wait, there\u0026rsquo;s more! Just one more thing, I promise. 
Way in the beginning of this article, I mentioned that there is a special variant of the Dump/Detail pattern, and that\u0026rsquo;s the Events pattern. With the VPP API client, first I register a single callback function, and then I can enable/disable events to trigger this callback.\nOne important note to this: enabling this callback will spawn a new (Python) thread so that the main program can continue to execute. Because of this, all the standard care has to be taken to make the program thread-aware. Make sure to pass information from the events-thread to the main-thread in a safe way!\nLet me demonstrate this powerful functionality with a program that listens on want_interface_events which is defined in interface.api:\nservice { rpc want_interface_events returns want_interface_events_reply events sw_interface_event; }; define sw_interface_event { u32 client_index; u32 pid; vl_api_interface_index_t sw_if_index; vl_api_if_status_flags_t flags; bool deleted; }; Here\u0026rsquo;s a complete program, shebang and all, that accomplishes this in a minimalistic way:\n#!/usr/bin/env python3 import time from vpp_papi import VPPApiClient, VPPApiJSONFiles, VppEnum def sw_interface_event(msg): print(msg) def vpp_event_callback(msg_name, msg): if msg_name == \u0026#34;sw_interface_event\u0026#34;: sw_interface_event(msg) else: print(f\u0026#34;Received unknown callback: {msg_name} =\u0026gt; {msg}\u0026#34;) vpp_api_dir = VPPApiJSONFiles.find_api_dir([]) vpp_api_files = VPPApiJSONFiles.find_api_files(api_dir=vpp_api_dir) vpp = VPPApiClient(apifiles=vpp_api_files, server_address=\u0026#34;/run/vpp/api.sock\u0026#34;) vpp.connect(\u0026#34;ipng-client\u0026#34;) vpp.register_event_callback(vpp_event_callback) vpp.api.want_interface_events(enable_disable=True, pid=8298) api_reply = vpp.api.show_version() print(f\u0026#34;VPP version is {api_reply.version}\u0026#34;) try: while True: time.sleep(1) except KeyboardInterrupt: pass Results After all of this deep-diving, all that\u0026rsquo;s left is for me to demonstrate the API by means of this little screencast [asciinema, gif] - I hope you enjoy it as much as I enjoyed creating it:\nNote to self:\n$ asciinema-edit quantize --range 0.18,0.8 --range 0.5,1.5 --range 1.5 \\ vpp_papi.cast \u0026gt; clean.cast $ Insert the ANSI colorcodes from the mac\u0026#39;s terminal into clean.cast\u0026#39;s header: \u0026#34;theme\u0026#34;:{\u0026#34;fg\u0026#34;: \u0026#34;#ffffff\u0026#34;,\u0026#34;bg\u0026#34;:\u0026#34;#000000\u0026#34;, \u0026#34;palette\u0026#34;:\u0026#34;#000000:#990000:#00A600:#999900:#0000B3:#B300B3:#999900:#BFBFBF: #666666:#F60000:#00F600:#F6F600:#0000F6:#F600F6:#00F6F6:#F6F6F6\u0026#34;} $ agg --font-size 18 clean.cast clean.gif $ gifsicle --lossy=80 -k 128 -O2 -Okeep-empty clean.gif -o vpp_papi_clean.gif ","date":"2024-01-27","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nYou\u0026rsquo;ll hear me talk about VPP being API centric, with no configuration persistence, and that\u0026rsquo;s by design. However, there is this also a CLI utility called vppctl, right, so what gives? In truth, the CLI is used a lot by folks to configure their dataplane, but it really was always meant to be a debug utility. 
There\u0026rsquo;s a whole wealth of programmability that is not exposed via the CLI at all, and the VPP community develops and maintains an elaborate set of tools to allow external programs to (re)configure the dataplane. One such tool is my own [vppcfg] which takes a YAML specification that describes the dataplane configuration, and applies it safely to a running VPP instance.\n","permalink":"https://ipng.ch/s/articles/2024/01/27/vpp-python-api/","section":"articles","title":"VPP Python API"},{"contents":" Introduction When IPng Networks first built out a european network, I was running the Disaggregated Network Operating System [ref], initially based on AT\u0026amp;T’s “dNOS” software framework. Over time though, the DANOS project slowed down, and the developers with whom I had a pretty good relationship all left for greener pastures.\nIn 2019, Pierre Pfister (and several others) built a VPP router sandbox [ref], which graduated into a feature called the Linux Control Plane plugin [ref]. Lots of folks put in an effort for the Linux Control Plane, notably Neale Ranns from Cisco (these days Graphiant), and Matt Smith and Jon Loeliger from Netgate (who ship this as TNSR [ref], check it out!). I helped as well, by adding a bunch of Netlink handling and VPP-\u0026gt;Linux synchronization code, which I\u0026rsquo;ve written about a bunch on this blog in the 2021 VPP development series [ref].\nAt the time, Ubuntu and CentOS were the supported platforms, so I installed a bunch of Ubuntu machines when doing the deploy with my buddy Fred from IP-Max [ref]. But as time went by, I fell back to my old habit of running Debian on hypervisors and VMs for the services at IPng Networks. After some time automating mostly everything with Ansible and Kees, I got tired of those places where I needed branches like if Ubuntu then ... elif Debian then ... elif OpenBSD then ... else panic.\nI took stock of the fleet at the end of 2023, and I found the following:\nOpenBSD: 3 virtual machines, bastion jumphosts connected to Internet and IPng Site Local Ubuntu: 4 physical machines, VPP routers (nlams0, defra0, chplo0 and usfmt0) Debian: 22 physical machines and 116 virtual machines, running internal and public services, almost all of these machines are entirely in IPng Site Local [ref], not connected to the internet at all. It became clear to me that I could make a small sprint to standardize all physical hardware on Debian Bookworm, and move away from Ubuntu LTS. In case you\u0026rsquo;re wondering: there\u0026rsquo;s nothing wrong with Ubuntu, although I will admit I\u0026rsquo;m not a big fan of snapd and cloud-init but they are easily disabled. I guess with the way the situation evolved in AS8298, I ended up running a fair few more Debian physical (and virtual) machines, so I\u0026rsquo;ll make an executive decision to move to Debian. By the way, the fun thing about IPng is that being the Chief of Everything (COE), I get to make those calls unilaterally :)\nUpgrading to Debian Luckily, I already have a fair number of VPP routers that have been deployed on Debian (mostly Bullseye, but one of them is Bookworm), and my LAB environment [ref] is running Debian Bookworm as well. 
Although its native habitat is Ubuntu, I regularly run VPP in a Debian environment; for example, when Adrian contributed the MPLS code [ref], he also recommended Debian 12, because that ships with a modern libnl which supports a few bits and pieces he needed.\nPreparations OK, while my network is not large, it\u0026rsquo;s also not completely devoid of customers, so instead of a YOLO, I decide to make an action plan that roughly looks like this:\nNotify customers of upcoming maintenance For each of the routers to-be-upgraded: Check the borgmatic daily backups Drain traffic away from the router Use IPMI to re-install it remotely Put the VPP, Bird, SSH configs back Undrain the router Drink my advents-calendar tea! When deploying a datacenter site, I am adamant about having a consistent and dependable environment. At each site, specifically those that are a bit further away, I deploy a standard issue PCEngines APU [ref] with 802.11ac WiFi, serial, and IPMI access to any machine that may be there. If you ever visit a datacenter floor where I\u0026rsquo;m present, look for an SSID like AS8298 FRA in the case of the Frankfurt site. The password is IPngGuest, you\u0026rsquo;re welcome to some bits of bandwidth in a pinch :)\nYou can find the APU in the picture to the right. All the way at the top, you\u0026rsquo;ll see a small blue machine with two antennas sticking out. It\u0026rsquo;s connected to my carrier, AS25091\u0026rsquo;s packet factory Cisco ASR9010, for out of band connectivity. Then, all the way at the bottom, you can see my Supermicro SYS-5018D-FN8T called defra0.ipng.ch paired with a Centec MPLS switch for transport and breakout ports 😍.\nWhen I installed all of this kit, I did two specific things that will greatly benefit me now:\nI enabled IPMI KVM and Serial-over-LAN on the Supermicro, so I can reach it over its dedicated IPMI port, and see what its VGA does. Also, in case anything weird happens to VPP and/or the Centec switches and IPng Site Local becomes unavailable, I can still log in and take a look via serial. I installed Samba on the APU, which allows me to instruct the IPMI to insert a virtual USB \u0026lsquo;stick\u0026rsquo; by means of mounting a SAMBA share. This is incredibly useful in scenarios such as this reinstall! Although I do trust it, I would hate to reboot the machine to find that IPMI or serial doesn\u0026rsquo;t work. So let me make sure that the machine is still good to go:\npim@summer:~$ ssh -L 8443:defra0-ipmi:443 cons0.defra0 pim@cons0-defra0:~$ ipmitool -I lanplus -H defra0-ipmi -U ${IPMI_USER} -P ${IPMI_PASS} sol activate [SOL Session operational. Use ~? for help] defra0 login: Nice going! 
Checking the samba configuration, it is super straightforward:\npim@cons0-defra0:~$ cat /etc/samba/smb.conf [global] workgroup = WINSHARE server string = Ubuntu Samba %v netbios name = console security = user map to guest = bad user dns proxy = no server min protocol = NT1 #============================ Share Definitions ============================== [share] path = /var/samba browsable = yes writable = no guest ok = yes read only = yes pim@cons0-defra0:/var/samba$ ls -lrt total 2306000 -rw-r--r-- 1 pim pim 441450496 Feb 10 2021 danos-2012-base-amd64.iso -rw-r--r-- 1 pim pim 1261371392 Aug 24 2021 ubuntu-20.04.3-live-server-amd64.iso -rw-r--r-- 1 pim pim 658505728 Dec 17 17:20 debian-12.4.0-amd64-netinst.iso pim@cons0-defra0:~$ ip -br a internal UP 172.16.13.1/24 fd25:8c03:9b1c:100d::1/64 fe80::b49b:1cff:feb2:7f2f/64 external UP 46.20.246.50/29 2a02:2528:ff01::2/64 fe80::d8fe:8ff:fe73:8c99/64 wlp4s0 UP 172.16.14.1/24 fd25:8c03:9b1c:100e::1/64 fe80::6f0:21ff:fe9b:562e/64 You can see the lifecycle progression on this server. In Feb'21, I installed DANOS 20.12, then moving to Ubuntu LTS 20.04 around Aug'21, and now it is time to advance once again, this time to Debian 12.\nAs a final pre-flight check, while using the port forwarding I set up (-L flag above), I will log in to the IPMI controller remotely, to insert this CD image into the virtual CDROM drive, like so:\nAnd indeed, it pops up in the running Ubuntu router:\npim@defra0:~$ uname -a Linux defra0 5.4.0-109-generic #123-Ubuntu SMP Fri Apr 8 09:10:54 UTC 2022 x86_64 GNU/Linux pim@defra0:~$ uptime 15:51:10 up 600 days, 17:40, 1 user, load average: 3.44, 3.30, 3.31 pim@defra0:~$ dmesg | tail -10 [51852396.194030] usb 2-4.2: New USB device strings: Mfr=0, Product=0, SerialNumber=0 [51852396.215804] usb-storage 2-4.2:1.0: USB Mass Storage device detected [51852396.215993] scsi host6: usb-storage 2-4.2:1.0 [51852396.216107] usbcore: registered new interface driver usb-storage [51852396.219915] usbcore: registered new interface driver uas [51852396.232081] scsi 6:0:0:0: CD-ROM ATEN Virtual CDROM YS0J PQ: 0 ANSI: 0 CCS [51852396.232475] scsi 6:0:0:0: Attached scsi generic sg1 type 5 [51852396.251038] sr 6:0:0:0: [sr0] scsi3-mmc drive: 40x/40x cd/rw xa/form2 cdda tray [51852396.251047] cdrom: Uniform CD-ROM driver Revision: 3.20 [51852396.267643] sr 6:0:0:0: Attached scsi CD-ROM sr0 I just love it when this stuff works. And it\u0026rsquo;s nice to see the happenstance of the machine being up for 600 days. Good power, great operating system and awesome hosting provider. Thanks for the service so far, my sweet little Ubuntu router ❤️ !\nInstalling Drain Considering there is live traffic on the network, typically what an operator would do is drain the links to route around the maintenance. To do this in my case, I need to make two changes, notably draining OSPF and eBGP.\nOSPF: In AS8298, all backbone connections use OSPF, and typically traffic from Zurich to Amsterdam will be over Frankfurt because the OSPF cost is slightly lower than the other way around. I\u0026rsquo;ve decided to standardize the OSPF link cost to be in tenths of milliseconds. In other words, if the latency from chrma0 to defra0 is 5.6 ms, the OSPF cost will be 56. One way for me to avoid using the Frankfurt router, is to make the cost of all traffic in- and out of the router be synthetically high. 
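To make that costing convention concrete, here is a back-of-the-envelope sketch of the arithmetic (my own illustration, not anything that Kees actually computes):

# Illustration only: IPng link costs are the measured RTT in tenths of a millisecond.
def ospf_cost(rtt_ms: float) -> int:
    return round(rtt_ms * 10)

print(ospf_cost(5.6))   # chrma0 to defra0 at 5.6 ms becomes cost 56

Back to the drain: the idea is simply to make defra0 as unattractive as possible without removing it from the topology.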
I do this by adding +1000 to the OSPF cost.\nBGP: But there are also a bunch of internet exchanges (such as Kleyrex, DE-CIX and LoCIX), and two IP transit upstreams (IP-Max and Meerfarbig) connected to this router in Frankfurt. I do not want them to send IPng any traffic here during the maintenance, so I will drain eBGP as well by setting the groups to shutdown state in Kees.\npim@squanchy:~/src/ipng-kees$ git diff diff --git a/config/defra0.ipng.ch.yaml b/config/defra0.ipng.ch.yaml index 869058c..105630c 100644 --- a/config/defra0.ipng.ch.yaml +++ b/config/defra0.ipng.ch.yaml @@ -151,12 +151,13 @@ vppcfg: ospf: xe1-0.304: description: chrma0 - cost: 56 + cost: 1056 xe1-1.302: description: defra0 - cost: 61 + cost: 1061 ebgp: + shutdown: true groups: decix_dus: local-addresses: [ 185.1.171.43/23, 2001:7f8:9e::206a:0:1/64 ] By raising the OSPF cost, the network will route around the machine that I want to play with:\npim@squanchy:~/src/ipng-kees$ traceroute nlams0.ipng.ch traceroute to defra0.ipng.ch (194.1.163.32), 64 hops max, 40 byte packets 1 chbtl0 (194.1.163.66) 0.492 ms 0.64 ms 0.615 ms 2 chrma0 (194.1.163.17) 1.268 ms 1.196 ms 1.194 ms 3 chplo0 (194.1.163.51) 5.682 ms 5.514 ms 5.603 ms 4 frpar0 (194.1.163.40) 14.481 ms 14.605 ms 14.58 ms 5 frggh0 (194.1.163.30) 19.545 ms 18.61 ms 18.684 ms 6 nlams0 (194.1.163.32) 47.613 ms 47.765 ms 47.584 ms And by setting the sessions to shutdown, Kees will make it regenerate all of the BGP sessions with an export none and a low bgp_local_pref, which will make the router itself stop announcing any prefixes, for example this session in Düsseldorf:\n@@ -25,11 +25,11 @@ protocol bgp decix_dus_56890_ipv4_1 { source address 185.1.171.43; neighbor 185.1.170.252 as 56890; default bgp_med 0; - default bgp_local_pref 200; + default bgp_local_pref 0; # shutdown ipv4 { import keep filtered; import filter ebgp_decix_dus_56890_import; - export filter ebgp_decix_dus_56890_export; + export none; # shutdown receive limit 250000 action restart; next hop self on; }; This is where it\u0026rsquo;s a good idea to grab some tea. Quite a few internet providers have incredibly slow convergence, so just by stopping the announcment of AS8298:AS-IPNG prefixes at this internet exchange, doesn\u0026rsquo;t mean things get updated too quickly. It makes sense to wait a few minutes (by default I wait 15min) so that every router that might be a slow-poke (I\u0026rsquo;m looking at you, Juniper!) has time to update their RIB and FIB.\nVPP itself pretty immediately flips all of its paths to other places, and it converges a full table of 950K IPv4 and 195K IPv6 routes in about 7 seconds or so, but not everybody has such fast CPUs in their vendor-silicon-fancypants-router :-)\nUpgrade The tea in my advents calendar for December 17th is Whittard\u0026rsquo;s Lemon \u0026amp; Ginger infusion, and it is delicious. What could possibly go wrong?! Now that the router is fully drained, I start a ping to the loopback, and flip the virtual powerswitch on the IPMI console. A few seconds later, the machine expectedly stops pinging and \u0026hellip; the world doesn\u0026rsquo;t end, my SSH session to a hypervisor in Amsterdam is still alive, and most importantly, Spotify is still playing music:\npim@squanchy:~/src/ipng-kees$ ping defra0.ipng.ch PING defra0.ipng.ch (194.1.163.7): 56 data bytes 64 bytes from 194.1.163.7: icmp_seq=0 ttl=62 time=6.3 ms 64 bytes from 194.1.163.7: icmp_seq=1 ttl=62 time=6.5 ms 64 bytes from 194.1.163.7: icmp_seq=2 ttl=62 time=6.2 ms ... 
I open the IPMI KVM console, hit F10 and select the CDROM option, which has my previously inserted Debian 12 netinst ISO:\nAt this point I can\u0026rsquo;t help but smile. I\u0026rsquo;m sitting here in Brüttisellen, roughly 400km south of this computer in Frankfurt, and I am looking at the VGA output of a fresh Debian installer. Come on, you have to admit, that\u0026rsquo;s pretty slick! Installing Debian follows pretty precisely my previous VPP#7 article [ref]. I go through the installer options and a few minutes later, it\u0026rsquo;s mission accomplished. I give the router its IPv4/IPv6 address in IPng Site Local, so that it has management network connectivity, and just before it wants to reboot, I quickly edit /etc/default/grub to turn on serial output, just like in the article:\nGRUB_CMDLINE_LINUX=\u0026#34;console=tty0 console=ttyS0,115200n8 isolcpus=1,2,3,5,6,7\u0026#34; GRUB_TERMINAL=serial GRUB_SERIAL_COMMAND=\u0026#34;serial --unit=0 --speed=115200 --stop=1 --parity=no --word=8\u0026#34; As the machine reboots, I eject the CDROM from the IPMI web interface, and attach to the serial-over-lan interface instead. Booyah, it boots!\nConfigure On my workstation, I mount yesterday\u0026rsquo;s Borg backup for the machine, because instead of doing the whole router build over from scratch, I\u0026rsquo;m going to selectively copy a few bits and pieces over, in the interest of time. Also, it\u0026rsquo;s nice to actually use borgbackup for once, although Fred and I have made grateful use of it in an emergency when one of IP-Max\u0026rsquo;s hypervisors failed in Geneva.\npim@summer:~$ sudo borg mount ssh://${BORG_REPO}/defra0.ipng.ch/ /var/borgbackup/ Enter passphrase for key ssh://${BORG_REPO}/defra0.ipng.ch: pim@summer:~$ sudo ls -l /var/borgbackup/defra0-2023-12-17T01:45:47.983599 bin boot cdrom etc home lib lib32 lib64 libx32 lost+found media mnt opt root sbin srv tmp usr var In case you\u0026rsquo;re wondering why I mount the backup as root, it\u0026rsquo;s because that way I can guarantee all the correct users/permissions etc are present in the restore. 
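Before rsyncing anything out of that snapshot, a tiny sanity check can save some embarrassment. This is a hypothetical sketch of my own (the paths match the handful of directories I intend to restore below), not part of any IPng tooling:

# Hypothetical pre-restore check: confirm the mounted borg snapshot contains
# the config paths I plan to copy back. Run as root, just like the mount itself.
from pathlib import Path

snap = Path(\u0026#34;/var/borgbackup/defra0-2023-12-17T01:45:47.983599\u0026#34;)
wanted = [\u0026#34;etc/netplan\u0026#34;, \u0026#34;etc/ssh\u0026#34;, \u0026#34;etc/bird\u0026#34;, \u0026#34;etc/vpp\u0026#34;, \u0026#34;etc/borgmatic\u0026#34;, \u0026#34;etc/rc.local\u0026#34;]
missing = [p for p in wanted if not (snap / p).exists()]
print(\u0026#34;all present\u0026#34; if not missing else f\u0026#34;missing from snapshot: {missing}\u0026#34;)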
I\u0026rsquo;ve done a practice run of the upgrade, yesterday, at chplo0.ipng.ch, so by now I think I have a pretty good handle on what needs to happen, so while connected to the freshly installed Debian Bookworm machine via serial-over-lan, here\u0026rsquo;s what I do:\nroot@defra0:~# apt install sudo rsync net-tools traceroute snmpd snmp iptables ipmitool bird2 \\ lm-sensors netplan.io build-essential borgmatic unbound tcpdump \\ libnl-3-200 libnl-route-3-200 root@defra0:~# adduser pim sudo root@defra0:~# adduser pim bird root@defra0:~# systemctl stop bird; systemctl disable bird; systemctl mask bird root@defra0:~# sensors-detect --auto root@defra0:~# export REPO=summer.net.ipng.ch:/var/borgbackup/defra0-2023-12-17T01:45:47.983599 root@defra0:~# mv /etc/network/interfaces /etc/network/interfaces.orig root@defra0:~# rsync -avugP $REPO/etc/netplan/ /etc/netplan/ root@defra0:~# rm -f /etc/ssh/ssh_host* root@defra0:~# rsync -avugP $REPO/etc/ssh/ssh_host* /etc/ssh/ root@defra0:~# rsync -avugP $REPO/etc/sysctl.d/80* /etc/sysctl.d/ root@defra0:~# rsync -avugP $REPO/etc/bird/ /etc/bird/ root@defra0:~# rsync -avugP $REPO/etc/vpp/ /etc/vpp/ root@defra0:~# rsync -avugP $REPO/etc/borgmatic/ /etc/borgmatic/ root@defra0:~# rsync -avugP $REPO/etc/rc.local /etc/rc.local root@defra0:~# rsync -avugP $REPO/lib/systemd/system/*dataplane* /lib/systemd/system I decide to selectively copy only the specific configuration files necessary to boot the dataplane. This means the systemd services (like snmpd, sshd, and their network namespace), and all the Bird and VPP config files. Because I prefer not to have to clear the SSH host keys, I also copy the old SSH host keys over. And considering IPng Networks standardizes on netplan for interface config, I\u0026rsquo;ll move the Debian-default interfaces out of the way.\nFinally, I add a few finishing touches and reboot one last time to ensure things are settled:\nroot@defra0:~# cat \u0026lt;\u0026lt; EOF | tee -a /etc/modules coretemp mpls_router vfio_pci EOF root@defra0:~# update-initramfs -k all -u root@defra0:~# update-grub root@defra0:~# mkdir -p /etc/systemd/system/unbound.service.d/ root@defra0:~# mkdir -p /etc/systemd/system/snmpd.service.d/ root@defra0:~# cat \u0026lt;\u0026lt; EOF | tee /etc/systemd/system/unbound.service.d/override.conf [Service] NetworkNamespacePath=/var/run/netns/dataplane EOF root@defra0:~# cp /etc/systemd/system/unbound.service.d/override.conf \\ /etc/systemd/system/snmpd.service.d/override.conf root@defra0:~# reboot The machine once again comes up, and now it\u0026rsquo;s loaded the VFIO and MPLS kernel modules, so I\u0026rsquo;m ready for the grand finale, which is installing VPP at the same version as the other routers in the fleet:\nroot@defra0:~# mkdir -p /var/log/vpp/ root@defra0:~# wget -m --no-parent https://ipng.ch/media/vpp/bookworm/24.02-rc0~175-g31d4891cf/ root@defra0:~# dpkg -i ipng.ch/media/vpp/bookworm/24.02-rc0~175-g31d4891cf/*.deb root@defra0:~# adduser pim vpp root@defra0:~# vppctl show version vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 In the corner of my eye, I see one of my xterms move. Hah! It\u0026rsquo;s the ping I left running on squanchy before, check it out:\npim@squanchy:~/src/ipng-kees$ ping defra0.ipng.ch PING defra0.ipng.ch (194.1.163.7): 56 data bytes 64 bytes from 194.1.163.7: icmp_seq=0 ttl=62 time=6.3 ms 64 bytes from 194.1.163.7: icmp_seq=1 ttl=62 time=6.5 ms 64 bytes from 194.1.163.7: icmp_seq=2 ttl=62 time=6.2 ms ... 
64 bytes from 194.1.163.7: icmp_seq=1484 ttl=62 time=6.5 ms 64 bytes from 194.1.163.7: icmp_seq=1485 ttl=62 time=6.6 ms 64 bytes from 194.1.163.7: icmp_seq=1486 ttl=62 time=6.8 ms One think-o I made is that the Bird configs that I just put back from the backup were those from before I set the drains (remember, raising the OSPF cost and setting the EBGP sessions to shutdown) so they are now all alive again. But it\u0026rsquo;s all good - the dataplane came up, Bird2 came up and formed OSPF and OSPFv3 adjacencies a few seconds later, and BGP sessions all shot to life. I take a quick look at the state of the dataplane to make sure I\u0026rsquo;m not accidentally introducing a broken router:\npim@defra0:~$ birdc show route count BIRD 2.0.12 ready. 6782372 of 6782372 routes for 958020 networks in table master4 1848350 of 1848350 routes for 198255 networks in table master6 1620753 of 1620753 routes for 405189 networks in table t_roa4 367875 of 367875 routes for 91969 networks in table t_roa6 Total: 10619350 of 10619350 routes for 1653433 networks in 4 tables pim@defra0:~$ vppctl show ip fib summary | awk \u0026#39;{ TOTAL += $2 } END { print TOTAL }\u0026#39; 958664 pim@defra0:~$ vppctl show ip6 fib summary | awk \u0026#39;{ TOTAL += $2 } END { print TOTAL }\u0026#39; 198322 OK, looking at the output I can conclude that my think-o was benign and the router has all routes accounted for in the RIB, it has slurped in the RPKI tables, and it has successfully transferred all of this into VPP\u0026rsquo;s FIB. So this entire upgrade took 1482 seconds, which is just under 25 minutes. Gnarly!\nPost Install The machine is up and running, and there\u0026rsquo;s one last thing for me to do, which is perform an Ansble run to make sure that the whole machine is configured correctly (for example, the correct access list for Unbound, the correct IPv4/IPv6 firewall for the Linux controlplane, the correct SSH daemon options, working mailer and NTP daemon, et cetera).\nSo I fire off a one-shot Ansible playbook run, and it pokes and prods the machine a bit:\nNow the machine is completely up-to-snuff, its latest VPP SNMP agent Prometheus exporter, Bird exporter, and so on are all good. I check LibreNMS and indeed, the machine is back with a half an hour or so of monitoring data missing. I\u0026rsquo;m still grinning as I write this, as most Juniper and Cisco firmware upgrades take more than 30min, while for me the whole thing from start to finish was less than that.\nResults This article describes how I managed to upgrade the entire network of routers, remotely, from the comfort of my home, while sipping tea, and without having a single network outage. The bump in the graph is the moment at which I drained defra0 and traffic from the monitoring machine at nlams0 had to go via France to my house at chbtl0. No packets were lost in the making of this upgrade!\nYesterday I practiced on chplo0, and today for this article I did defra0, after which I also did the last remaining router nlams0. 
Every router is now up to date running Debian Bookworm as well as VPP version 24.02 (including a bunch of desirable fixes for IPFIX/Flowprobe):\npim@squanchy:~/src/ipng-kees$ ./doall.sh \u0026#39;echo -n $(hostname -s):\\ ; vppctl show version\u0026#39; chbtl0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 chbtl1: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 chgtg0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 chplo0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 chrma0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 ddln0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 ddln1: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 defra0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 frggh0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 frpar0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 nlams0: vpp v24.02-rc0~175-g31d4891cf built by pim on bookworm-builder at 2023-12-09T12:54:52 usfmt0: vpp v24.02-rc0~175-g31d4891cf built by pim on bullseye-builder at 2023-12-09T16:27:33 For the hawk-eyed, yes usfmt0 has not been done. I don\u0026rsquo;t have Supermicro with IPMI there, so the next time I visit California, I\u0026rsquo;ll make a stop at the local Hurricane Electric datacenter to upgrade that last one :-)\n","date":"2023-12-17","desc":" Introduction When IPng Networks first built out a european network, I was running the Disaggregated Network Operating System [ref], initially based on AT\u0026amp;T’s “dNOS” software framework. Over time though, the DANOS project slowed down, and the developers with whom I had a pretty good relationship all left for greener pastures.\nIn 2019, Pierre Pfister (and several others) built a VPP router sandbox [ref], which graduated into a feature called the Linux Control Plane plugin [ref]. Lots of folks put in an effort for the Linux Control Plane, notably Neale Ranns from Cisco (these days Graphiant), and Matt Smith and Jon Loeliger from Netgate (who ship this as TNSR [ref], check it out!). I helped as well, by adding a bunch of Netlink handling and VPP-\u0026gt;Linux synchronization code, which I\u0026rsquo;ve written about a bunch on this blog in the 2021 VPP development series [ref].\n","permalink":"https://ipng.ch/s/articles/2023/12/17/debian-on-ipngs-vpp-routers/","section":"articles","title":"Debian on IPng's VPP Routers"},{"contents":"Introduction I\u0026rsquo;m still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic, and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it\u0026rsquo;s a bit more convenient to connect them all to a switch. However, for this to work I would need (at least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real quick.\nOr do they?\nHardware I thought I\u0026rsquo;d ask the #nlnog IRC channel for advice, and of course the usual suspects came past, such as Juniper, Arista, and Cisco. But somebody mentioned \u0026ldquo;How about Mellanox, like SN2700?\u0026rdquo; and I remembered my buddy Eric was a fan of those switches. 
I looked them up on the refurbished market and I found one for EUR 1'400,- for 32x100G which felt suspiciously low priced\u0026hellip; but I thought YOLO and I ordered it. It arrived a few days later via UPS from Denmark to Switzerland.\nThe switch specs are pretty impressive, with 32x100G QSFP28 ports, which can be broken out to a set of sub-ports (each of 1/10/25/50G), with a specified switch throughput of 6.4Tbps and 4.76Gpps, while only consuming ~150W all-up.\nFurther digging revealed that the architecture of this switch consists of two main parts:\nan AMD64 component with an mSATA disk to boot from, two e1000 network cards, and a single USB and RJ45 serial port with standard pinout. It has a PCIe connection to a switch board in the front of the chassis, furthermore it\u0026rsquo;s equipped with 8GB of RAM in an SO-DIMM, and its CPU is a two-core Celeron(R) CPU 1047UE @ 1.40GHz.\nthe silicon used in this switch is called Spectrum and identifies itself in Linux as PCI device 03:00.0 called Mellanox Technologies MT52100, so the front dataplane with 32x100G is separated from the Linux based controlplane.\nWhen turning on the device, the serial port comes to life and shows me a BIOS, quickly after which it jumps into GRUB2 and wants me to install it using something called ONIE. I\u0026rsquo;ve heard of that, but now it\u0026rsquo;s time for me to learn a little bit more about that stuff. I ask around and there\u0026rsquo;s plenty of ONIE images for this particular type of chip to be found - some are open source, some are semi-open source (as in: were once available but now are behind paywalls etc).\nBefore messing around with the switch and possibly locking myself out or bricking it, I take out the 16GB mSATA and make a copy of it for safe keeping. I feel somewhat invincible by doing this. How bad could I mess up this switch, if I can just copy back a bitwise backup of the 16GB mSATA? I\u0026rsquo;m about to find out, so read on!\nSoftware The Mellanox SN2700 switch is an ONIE (Open Network Install Environment) based platform that supports a multitude of operating systems, as well as utilizing the advantages of Open Ethernet and the capabilities of the Mellanox Spectrum® ASIC. The SN2700 has three modes of operation:\nPreinstalled with Mellanox Onyx (successor to MLNX-OS Ethernet), a home-grown operating system utilizing common networking user experiences and industry standard CLI. Preinstalled with Cumulus Linux, a revolutionary operating system taking the Linux user experience from servers to switches and providing a rich routing functionality for large scale applications. Provided with a bare ONIE image ready to be installed with the aforementioned or other ONIE-based operating systems. I asked around a bit more and found that there\u0026rsquo;s a few more things one might do with this switch. One of them is [SONiC], which stands for Software for Open Networking in the Cloud, and has support for the Spectrum and notably the SN2700 switch. Cool!\nI also learned about [DENT], which utilizes the Linux Kernel, Switchdev, and other Linux based projects as the basis for building a new standardized network operating system without abstractions or overhead. Unfortunately, while the Spectrum chipset is known to DENT, this particular layout on SN2700 is not supported.\nFinally, my buddy fall0ut said \u0026ldquo;why not just Debian with switchdev?\u0026rdquo; and now my eyes opened wide.
I had not yet come across [switchdev], which is a standard Linux kernel driver model for switch devices which offload the forwarding (data)plane from the kernel. As it turns out, Mellanox did a really good job writing a switchdev implementation in the [linux kernel] for the Spectrum series of silicon, and it\u0026rsquo;s all upstreamed to the Linux kernel. Wait, what?!\nMellanox Switchdev I start by reading the [brochure], which shows me the intentions Mellanox had when designing and marketing these switches. It seems that they really meant it when they said this thing is a fully customizable Linux switch, check out this paragraph:\nOnce the Mellanox Switchdev driver is loaded into the Linux Kernel, each of the switch’s physical ports is registered as a net_device within the kernel. Using standard Linux tools (for example, bridge, tc, iproute), ports can be bridged, bonded, tunneled, divided into VLANs, configured for L3 routing and more. Linux switching and routing tables are reflected in the switch hardware. Network traffic is then handled directly by the switch. Standard Linux networking applications can be natively deployed and run on switchdev. This may include open source routing protocol stacks, such as Quagga, Bird and XORP, OpenFlow applications, or user-specific implementations.\nInstalling Debian on SN2700 .. they had me at Bird :) so off I go, to install a vanilla Debian AMD64 Bookworm on a 120G mSATA I had laying around. After installing it, I noticed that the coveted mlxsw driver is not shipped by default on the Linux kernel image in Debian, so I decide to build my own, letting the [Debian docs] take my hand and guide me through it.\nI find a reference on the Mellanox [GitHub wiki] which shows me which kernel modules to include to successfully use the Spectrum under Linux, so I think I know what to do:\npim@summer:/usr/src$ sudo apt-get install build-essential linux-source bc kmod cpio flex \\ libncurses5-dev libelf-dev libssl-dev dwarves bison pim@summer:/usr/src$ sudo apt install linux-source-6.1 pim@summer:/usr/src$ sudo tar xf linux-source-6.1.tar.xz pim@summer:/usr/src$ cd linux-source-6.1/ pim@summer:/usr/src/linux-source-6.1$ sudo cp /boot/config-6.1.0-12-amd64 .config pim@summer:/usr/src/linux-source-6.1$ cat \u0026lt;\u0026lt; EOF | sudo tee -a .config CONFIG_NET_IPIP=m CONFIG_NET_IPGRE_DEMUX=m CONFIG_NET_IPGRE=m CONFIG_IPV6_GRE=m CONFIG_IP_MROUTE_MULTIPLE_TABLES=y CONFIG_IP_MULTIPLE_TABLES=y CONFIG_IPV6_MULTIPLE_TABLES=y CONFIG_BRIDGE=m CONFIG_VLAN_8021Q=m CONFIG_BRIDGE_VLAN_FILTERING=y CONFIG_BRIDGE_IGMP_SNOOPING=y CONFIG_NET_SWITCHDEV=y CONFIG_NET_DEVLINK=y CONFIG_MLXFW=m CONFIG_MLXSW_CORE=m CONFIG_MLXSW_CORE_HWMON=y CONFIG_MLXSW_CORE_THERMAL=y CONFIG_MLXSW_PCI=m CONFIG_MLXSW_I2C=m CONFIG_MLXSW_MINIMAL=y CONFIG_MLXSW_SWITCHX2=m CONFIG_MLXSW_SPECTRUM=m CONFIG_MLXSW_SPECTRUM_DCB=y CONFIG_LEDS_MLXCPLD=m CONFIG_NET_SCH_PRIO=m CONFIG_NET_SCH_RED=m CONFIG_NET_SCH_INGRESS=m CONFIG_NET_CLS=y CONFIG_NET_CLS_ACT=y CONFIG_NET_ACT_MIRRED=m CONFIG_NET_CLS_MATCHALL=m CONFIG_NET_CLS_FLOWER=m CONFIG_NET_ACT_GACT=m CONFIG_NET_ACT_MIRRED=m CONFIG_NET_ACT_SAMPLE=m CONFIG_NET_ACT_VLAN=m CONFIG_NET_L3_MASTER_DEV=y CONFIG_NET_VRF=m EOF pim@summer:/usr/src/linux-source-6.1$ sudo make menuconfig pim@summer:/usr/src/linux-source-6.1$ sudo make -j`nproc` bindeb-pkg I run a gratuitous make menuconfig after adding all those config statements to the end of the .config file, and it figures out how to combine what I wrote before with what was in the file earlier, and I used the standard 
Bookworm 6.1 kernel config that came from the default installer, so that it would be a minimal diff to what Debian itself shipped with.\nAfter Summer stretches her legs a bit compiling this kernel for me, look at the result:\npim@summer:/usr/src$ dpkg -c linux-image-6.1.55_6.1.55-4_amd64.deb | grep mlxsw drwxr-xr-x root/root 0 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/ -rw-r--r-- root/root 414897 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_core.ko -rw-r--r-- root/root 19721 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_i2c.ko -rw-r--r-- root/root 31817 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_minimal.ko -rw-r--r-- root/root 65161 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_pci.ko -rw-r--r-- root/root 1425065 2023-11-09 20:22 ./lib/modules/6.1.55/kernel/drivers/net/ethernet/mellanox/mlxsw/mlxsw_spectrum.ko Good job, Summer! On my mSATA disk, I tell Linux to boot its kernel using the following in GRUB, which will make the kernel not create spiffy interface names like enp6s0 or eno1 but just enumerate them all one by one and call them eth0 and so on:\npim@fafo:~$ grep GRUB_CMDLINE /etc/default/grub GRUB_CMDLINE_LINUX_DEFAULT=\u0026#34;\u0026#34; GRUB_CMDLINE_LINUX=\u0026#34;console=tty0 console=ttyS0,115200n8 net.ifnames=0 biosdevname=0\u0026#34; Mellanox SN2700 running Debian+Switchdev I insert the freshly installed Debian Bookworm with custom compiled 6.1.55+mlxsw kernel into the switch, and it boots on the first try. I see 34 (!) ethernet ports, noting that the first two come from an Intel NIC but carrying a MAC address from Mellanox (starting with 0c:42:a1) and the other 32 have a common MAC address (from Mellanox, starting with 04:3f:72), and what I noticed is that the MAC addresses here are skipping one between subsequent ports, which leads me to believe that these 100G ports can be split into two (perhaps 2x50G, 2x40G, 2x25G, 2x10G, which I intend to find out later). According to the official spec sheet, the switch allows 2-way breakout ports as well as converter modules, to insert for example a 25G SFP28 into a QSFP28 switchport.\nHonestly, I did not think I would get this far, so I humorously (at least, I think so) decide to call this switch [FAFO].\nFirst off, the mlxsw driver loaded:\nroot@fafo:~# lsmod | grep mlx mlxsw_spectrum 708608 0 mlxsw_pci 36864 1 mlxsw_spectrum mlxsw_core 217088 2 mlxsw_pci,mlxsw_spectrum mlxfw 36864 1 mlxsw_core vxlan 106496 1 mlxsw_spectrum ip6_tunnel 45056 1 mlxsw_spectrum objagg 53248 1 mlxsw_spectrum psample 20480 1 mlxsw_spectrum parman 16384 1 mlxsw_spectrum bridge 311296 1 mlxsw_spectrum I run sensors-detect and pwmconfig, let the fans calibrate and write their config file. The fans come back down to a more chill (pun intended) speed, and I take a closer look. 
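For completeness, the usual Debian incantation for that fan setup is roughly the following (package names quoted from memory, not verified against this exact install):\nroot@fafo:~# apt-get install lm-sensors fancontrol root@fafo:~# sensors-detect root@fafo:~# pwmconfig root@fafo:~# systemctl enable --now fancontrol\nsensors-detect probes for hardware monitoring chips, pwmconfig calibrates the PWM outputs and writes /etc/fancontrol, and the fancontrol service then keeps regulating the fans across reboots.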
It seems all fans and all thermometers, including the ones in the QSFP28 cages and the Spectrum switch ASIC are accounted for:\nroot@fafo:~# sensors coretemp-isa-0000 Adapter: ISA adapter Package id 0: +30.0°C (high = +87.0°C, crit = +105.0°C) Core 0: +29.0°C (high = +87.0°C, crit = +105.0°C) Core 1: +30.0°C (high = +87.0°C, crit = +105.0°C) acpitz-acpi-0 Adapter: ACPI interface temp1: +27.8°C (crit = +106.0°C) temp2: +29.8°C (crit = +106.0°C) mlxsw-pci-0300 Adapter: PCI adapter fan1: 6239 RPM fan2: 5378 RPM fan3: 6268 RPM fan4: 5378 RPM fan5: 6326 RPM fan6: 5442 RPM fan7: 6268 RPM fan8: 5315 RPM temp1: +37.0°C (highest = +41.0°C) front panel 001: +23.0°C (crit = +73.0°C, emerg = +75.0°C) front panel 002: +24.0°C (crit = +73.0°C, emerg = +75.0°C) front panel 003: +23.0°C (crit = +73.0°C, emerg = +75.0°C) front panel 004: +26.0°C (crit = +73.0°C, emerg = +75.0°C) ... From the top, first I see the classic CPU core temps, then an ACPI interface which I\u0026rsquo;m not quite sure I understand the purpose of (possibly motherboard, but not PSU because pulling one out does not change any values). Finally, the sensors using driver mlxsw-pci-0300, are those on the switch PCB carrying the Spectrum silicon, and there\u0026rsquo;s a thermometer for each of the QSFP28 cages, possibly reading from the optic, as most of them are empty except the first four which I inserted optics to. Slick!\nI notice that the ports are in a bit of a weird order. Firstly, eth0-1 are the two 1G ports on the Debian machine. But then, the rest of the ports are the Mellanox Spectrum ASIC:\neth2-17 correspond to port 17-32, which seems normal, but eth18-19 correspond to port 15-16 eth20-21 correspond to port 13-14 eth30-31 correspond to port 3-4 eth32-33 correspond to port 1-2 The switchports are actually sequentially numbered with respect to MAC addresses, with eth2 starting at 04:3f:72:74:a9:41 and finally eth34 having 04:3f:72:74:a9:7f (for 64 consecutive MACs).\nSomehow though, the ports are wired in a different way on the front panel. As it turns out, I can insert a little udev ruleset that will take care of this:\nroot@fafo:~# cat \u0026lt;\u0026lt; EOF \u0026gt; /etc/udev/rules.d/10-local.rules SUBSYSTEM==\u0026#34;net\u0026#34;, ACTION==\u0026#34;add\u0026#34;, DRIVERS==\u0026#34;mlxsw_spectrum*\u0026#34;, \\ NAME=\u0026#34;sw$attr{phys_port_name}\u0026#34; EOF After rebooting the switch, the ports are now called swp1 .. swp32 and they also correspond with their physical ports on the front panel. One way to check this, is using ethtool --identify swp1 which will blink the LED of port 1, until I press ^C. Nice.\nDebian SN2700: Diagnostics The first thing I\u0026rsquo;m curious to try, is if Link Layer Discovery Protocol [LLDP] works. This is a vendor-neutral protocol that network devices use to advertise their identity to peers over Ethernet. 
I install an open source LLDP daemon and plug in a DAC from port1 to a Centec switch in the lab.\nAnd indeed, quickly after that, I see two devices, the first on the Linux machine eth0 which is the Unifi switch that has my LAN, and the second is the Centec behind swp1:\nroot@fafo:~# apt-get install lldpd root@fafo:~# lldpcli show nei summary ------------------------------------------------------------------------------- LLDP neighbors: ------------------------------------------------------------------------------- Interface: eth0, via: LLDP Chassis: ChassisID: mac 44:d9:e7:05:ff:46 SysName: usw6-BasementServerroom Port: PortID: local Port 9 PortDescr: fafo.lab TTL: 120 Interface: swp1, via: LLDP Chassis: ChassisID: mac 60:76:23:00:01:ea SysName: sw3.lab Port: PortID: ifname eth-0-25 PortDescr: eth-0-25 TTL: 120 With this I learn that the switch forwards these datagrams (ethernet type 0x88CC) from the dataplane to the Linux controlplane. I would call this punting in VPP language, but switchdev calls it trapping, and I can see the LLDP packets when tcpdumping on ethernet device swp1. So today I learned how to trap packets :-)\nDebian SN2700: ethtool One popular diagnostics tool that is useful (and, hopefully, well known because it\u0026rsquo;s awesome) is ethtool, a command-line tool in Linux for managing network interface devices. It allows me to modify the parameters of the ports and their transceivers, as well as query the information of those devices.\nHere are a few common examples, all of which work on this switch running Debian:\nethtool swp1: Shows link capabilities (eg, 1G/10G/25G/40G/100G) ethtool -s swp1 speed 40000 duplex full autoneg off: Force speed/duplex ethtool -m swp1: Shows transceiver diagnostics like SFP+ light levels, link levels (also --module-info) ethtool -p swp1: Flashes the transceiver port LED (also --identify) ethtool -S swp1: Shows packet and octet counters, and sizes, discards, errors, and so on (also --statistics) I specifically love the digital diagnostics monitoring (DDM), originally specified in [SFF-8472], which allows me to read the EEPROM of optical transceivers and get all sorts of critical diagnostics. I wish DPDK and VPP had that!\nDebian SN2700: devlink In reading up on the switchdev ecosystem, I stumbled across devlink, an API to expose device information and resources not directly related to any device class, such as switch ASIC configuration. As a fun fact, devlink was written by the same engineer who wrote the mlxsw driver for Linux, Jiří Pírko. Its documentation can be found in the [linux kernel], and it ships with any modern iproute2 distribution. The specific (somewhat terse) documentation of the mlxsw driver [lives there] as well.\nThere\u0026rsquo;s a lot to explore here, but I\u0026rsquo;ll focus my attention on three things:\n1. devlink resource When learning that the switch also does IPv4 and IPv6 routing, I immediately thought: how many prefixes can be offloaded to the ASIC?
One way to find out is to query what types of resources it has:\nroot@fafo:~# devlink resource show pci/0000:03:00.0 pci/0000:03:00.0: name kvd size 258048 unit entry dpipe_tables none resources: name linear size 98304 occ 1 unit entry size_min 0 size_max 159744 size_gran 128 dpipe_tables none resources: name singles size 16384 occ 1 unit entry size_min 0 size_max 159744 size_gran 1 dpipe_tables none name chunks size 49152 occ 0 unit entry size_min 0 size_max 159744 size_gran 32 dpipe_tables none name large_chunks size 32768 occ 0 unit entry size_min 0 size_max 159744 size_gran 512 dpipe_tables none name hash_double size 65408 unit entry size_min 32768 size_max 192512 size_gran 128 dpipe_tables none name hash_single size 94336 unit entry size_min 65536 size_max 225280 size_gran 128 dpipe_tables none name span_agents size 3 occ 0 unit entry dpipe_tables none name counters size 32000 occ 4 unit entry dpipe_tables none resources: name rif size 8192 occ 0 unit entry dpipe_tables none name flow size 23808 occ 4 unit entry dpipe_tables none name global_policers size 1000 unit entry dpipe_tables none resources: name single_rate_policers size 968 occ 0 unit entry dpipe_tables none name rif_mac_profiles size 1 occ 0 unit entry dpipe_tables none name rifs size 1000 occ 1 unit entry dpipe_tables none name physical_ports size 64 occ 36 unit entry dpipe_tables none There\u0026rsquo;s a lot to unpack here, but this is a tree of resources, each with names and children. Let me focus on the first one, called kvd, which stands for Key Value Database (in other words, a set of lookup tables). It contains a bunch of children called linear, hash_double and hash_single. The kernel [docs] explain it in more detail, but this is where the switch will keep its FIB in Content Addressable Memory (CAM) of certain types of elements of a given length and count. All up, the size is 252KB, which is not huge, but also certainly not tiny!\nHere I learn that it\u0026rsquo;s subdivided into:\nlinear: 96KB of flat memory using an index, further divided into regions: singles: 16KB of size 1 (nexthops); chunks: 48KB of size 32 (multipath routes with \u0026lt;32 entries); large_chunks: 32KB of size 512 (multipath routes with \u0026lt;512 entries) hash_single: 92KB of hash table for keys smaller than 64 bits (eg. L2 FIB, IPv4 FIB and neighbors) hash_double: 63KB of hash table for keys larger than 64 bits (eg. IPv6 FIB and neighbors) 2. devlink dpipe Now that I know the memory layout and regions of the CAM, I can start making some guesses on the FIB size. The devlink pipeline debug API (DPIPE) is aimed at providing the user visibility into the ASIC\u0026rsquo;s pipeline in a generic way. The API is described in detail in the [kernel docs].
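Coming back to the kvd resources for a moment: according to the kernel devlink-resource documentation, these partitions are not cast in stone. Within the size_min/size_max/size_gran bounds shown above they can apparently be resized with devlink resource set, followed by a devlink dev reload to make the new layout take effect. I have not tried this on this particular switch, and the size below is purely illustrative:\nroot@fafo:~# devlink resource set pci/0000:03:00.0 path /kvd/hash_single size 73088 root@fafo:~# devlink dev reload pci/0000:03:00.0\nShrinking hash_single like this would presumably free up entries that a matching resource set on /kvd/hash_double could then claim for a larger IPv6 FIB.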
I feel free to take a peek at the dataplane configuration innards:\nroot@fafo:~# devlink dpipe table show pci/0000:03:00.0 pci/0000:03:00.0: name mlxsw_erif size 1000 counters_enabled false match: type field_exact header mlxsw_meta field erif_port mapping ifindex action: type field_modify header mlxsw_meta field l3_forward type field_modify header mlxsw_meta field l3_drop name mlxsw_host4 size 0 counters_enabled false resource_path /kvd/hash_single resource_units 1 match: type field_exact header mlxsw_meta field erif_port mapping ifindex type field_exact header ipv4 field destination ip action: type field_modify header ethernet field destination mac name mlxsw_host6 size 0 counters_enabled false resource_path /kvd/hash_double resource_units 2 match: type field_exact header mlxsw_meta field erif_port mapping ifindex type field_exact header ipv6 field destination ip action: type field_modify header ethernet field destination mac name mlxsw_adj size 0 counters_enabled false resource_path /kvd/linear resource_units 1 match: type field_exact header mlxsw_meta field adj_index type field_exact header mlxsw_meta field adj_size type field_exact header mlxsw_meta field adj_hash_index action: type field_modify header ethernet field destination mac type field_modify header mlxsw_meta field erif_port mapping ifindex From this I can puzzle together how the CAM is actually used:\nmlxsw_host4: matches on the interface port and IPv4 destination IP, using hash_single above with one unit for each entry, and when looking that up, puts the result into the ethernet destination MAC (in other words, the FIB entry points at an L2 nexthop!) mlxsw_host6: matches on the interface port and IPv6 destination IP using hash_double with two units for each entry. mlxsw_adj: holds the L2 adjacencies, and the lookup key is an index, size and hash index, where the returned value is used to rewrite the destination MAC and select the egress port! Now that I know the types of tables and what they are matching on (and then which action they are performing), I can also take a look at the actual data in the FIB. For example, if I create an IPv4 interface on the switch and ping a member on the directly connected network there, I can see an entry show up in the L2 adjacency table, like so:\nroot@fafo:~# ip addr add 100.65.1.1/30 dev swp31 root@fafo:~# ping 100.65.1.2 root@fafo:~# devlink dpipe table dump pci/0000:03:00.0 name mlxsw_host4 pci/0000:03:00.0: index 0 match_value: type field_exact header mlxsw_meta field erif_port mapping ifindex mapping_value 71 value 1 type field_exact header ipv4 field destination ip value 100.65.1.2 action_value: type field_modify header ethernet field destination mac value b4:96:91:b3:b1:10 To decipher what the switch is doing: if the ifindex is 71 (which corresponds to swp31), and the IPv4 destination IP address is 100.65.1.2, then the destination MAC address will be set to b4:96:91:b3:b1:10, so the switch knows where to send this ethernet datagram.\nAnd now I have found what I need to know to be able to answer the question of the FIB size. This switch can take 92K IPv4 routes and 31.5K IPv6 routes, and I can even inspect the FIB in great detail. Rock on!\n3. devlink port split But reading the switch chip configuration and FIB is not all that devlink can do, it can also make changes! One particularly interesting one is the ability to split and unsplit ports.
What this means is that, when you take a 100Gbit port, it internally is divided into four so-called lanes of 25Gbit each, where a 40Gbit port is internally divided into four lanes of 10Gbit each. Splitting ports is the act of taking such a port and reconfiguring its lanes.\nLet me show you, by means of example, what splitting the first two switchports might look like. They begin their life as 100G ports, which support a number of link speeds, notably: 100G, 50G, 25G, but also 40G, 10G, and finally 1G:\nroot@fafo:~# ethtool swp1 Settings for swp1: Supported ports: [ FIBRE ] Supported link modes: 1000baseKX/Full 10000baseKR/Full 40000baseCR4/Full 40000baseSR4/Full 40000baseLR4/Full 25000baseCR/Full 25000baseSR/Full 50000baseCR2/Full 100000baseSR4/Full 100000baseCR4/Full 100000baseLR4_ER4/Full root@fafo:~# devlink port show | grep \u0026#39;swp[12] \u0026#39; pci/0000:03:00.0/61: type eth netdev swp1 flavour physical port 1 splittable true lanes 4 pci/0000:03:00.0/63: type eth netdev swp2 flavour physical port 2 splittable true lanes 4 root@fafo:~# devlink port split pci/0000:03:00.0/61 count 4 [ 629.593819] mlxsw_spectrum 0000:03:00.0 swp1: link down [ 629.722731] mlxsw_spectrum 0000:03:00.0 swp2: link down [ 630.049709] mlxsw_spectrum 0000:03:00.0: EMAD retries (1/5) (tid=64b1a5870000c726) [ 630.092179] mlxsw_spectrum 0000:03:00.0 swp1s0: renamed from eth2 [ 630.148860] mlxsw_spectrum 0000:03:00.0 swp1s1: renamed from eth2 [ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s2: renamed from eth2 [ 630.375401] mlxsw_spectrum 0000:03:00.0 swp1s3: renamed from eth2 root@fafo:~# ethtool swp1s0 Settings for swp1s0: Supported ports: [ FIBRE ] Supported link modes: 1000baseKX/Full 10000baseKR/Full 25000baseCR/Full 25000baseSR/Full Whoa, what just happened here? The switch took the port defined by pci/0000:03:00.0/61 which says it is splittable and has four lanes, and split it into four NEW ports called swp1s0-swp1s3, and the resulting ports are 25G, 10G or 1G.\nHowever, I make an important observation. When splitting swp1 in 4, the switch also removed port swp2, and remember at the beginning of this article I mentioned that the MAC addresses seemed to skip one entry between subsequent interfaces? Now I understand why: when splitting the port into two, it will use the second MAC address for the second 50G port; but if I split it into four, it\u0026rsquo;ll use the MAC addresses from the adjacent port and decommission it. In other words: this switch can do 32x100G, or 64x50G, or 64x25G/10G/1G.\nIt doesn\u0026rsquo;t matter which of the PCI interfaces I split on. The operation is also reversible: I can issue devlink port unsplit to return the port to its aggregate state (eg. 4 lanes and 100Gbit), which will remove the swp1s0-3 ports and put back swp1 and swp2 again.\nWhat I find particularly impressive about this, is that for most hardware vendors, this splitting of ports requires a reboot of the chassis, while here it can happen entirely online. Well done, Mellanox!\nPerformance OK, so this all seems to work, but does it work well?
If you\u0026rsquo;re a reader of my blog you\u0026rsquo;ll know that I love doing loadtests, so I boot my machine, Hippo, and I connect it with two 100G DACs to the switch on ports 31 and 32:\n[ 1.354802] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link) [ 1.447677] ice 0000:0c:00.0: firmware: direct-loading firmware intel/ice/ddp/ice.pkg [ 1.561979] ice 0000:0c:00.1: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link) [ 7.738198] ice 0000:0c:00.0 enp12s0f0: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None [ 7.802572] ice 0000:0c:00.1 enp12s0f1: NIC Link is up 100 Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg Advertised: On, Autoneg Negotiated: True, Flow Control: None I hope you\u0026rsquo;re hungry, Hippo, cuz you\u0026rsquo;re about to get fed!\nDebian SN2700: L2 To use the switch in L2 mode, I intuitively create Linux bridge, say br0, and add ports to that. From the Mellanox documentation I learn that there can be multiple bridges, each isolated from one another, but there can only be one such bridge with vlan_filtering set. VLAN Filtering allows the switch to only accept tagged frames from a list of configured VLANs, and drop the rest. This is what you\u0026rsquo;d imagine a regular commercial switch would provide.\nSo off I go, creating the bridge in which I\u0026rsquo;ll add two ports (HundredGigabitEthernet port swp31 and swp32), and I will allow for the maximum MTU size of 9216, also known as [Jumbo Frames].\nroot@fafo:~# ip link add name br0 type bridge root@fafo:~# ip link set br0 type bridge vlan_filtering 1 mtu 9216 up root@fafo:~# ip link set swp31 mtu 9216 master br0 up root@fafo:~# ip link set swp32 mtu 9216 master br0 up These two ports are now access ports, that is to say they accept and emit only untagged traffic, and due to the vlan_filtering flag, they will drop all other frames. Using the standard bridge utility from Linux, I can manipulate the VLANs on these ports.\nFirst, I\u0026rsquo;ll remove the default VLAN and add VLAN 1234 to both ports, specifying that VLAN 1234 is the so-called Port VLAN ID (pvid). This makes them the equivalent of Cisco\u0026rsquo;s switchport access 1234:\nroot@fafo:~# bridge vlan del vid 1 dev swp1 root@fafo:~# bridge vlan del vid 1 dev swp2 root@fafo:~# bridge vlan add vid 1234 dev swp1 pvid root@fafo:~# bridge vlan add vid 1234 dev swp2 pvid Then, I\u0026rsquo;ll add a few tagged VLANs to the ports, so that they become the Cisco equivalent of a trunk port allowing these tagged VLANs and assuming untagged traffic is still VLAN 1234:\nroot@fafo:~# for port in swp1 swp2; do for vlan in 100 200 300 400; do \\ bridge vlan add vid $vlan dev $port; done; done root@fafo:~# bridge vlan port vlan-id swp1 100 200 300 400 1234 PVID swp2 100 200 300 400 1234 PVID br0 1 PVID Egress Untagged When these commands are run against the interfaces swp*, they are picked up by the mlxsw kernel driver, and transmitted to the Spectrum switch chip, in other words, these commands end up programming the silicon. Traffic through these switch ports on the front, rarely (if ever) get forwarded to the Linux kernel, very similar to [VPP], the traffic stays mostly in the dataplane. 
Some traffic, such as LLDP (and as we\u0026rsquo;ll see later, IPv4 ARP and IPv6 neighbor discovery), will be forwarded from the switch chip over the PCIe link to the kernel, after which the results are transmitted back via PCIe to program the switch chip L2/L3 Forwarding Information Base (FIB).\nNow I turn my attention to the loadtest, by configuring T-Rex in L2 Stateless mode. I start a bidirectional loadtest with 256b packets at 50% of line rate, which looks just fine:\nAt this point I can already conclude that this is all happening in the dataplane, as the Spectrum switch is connected to the Debian machine using a PCIe v3.0 x8 link, which is even obscured by another device on the PCIe bus, so the Debian kernel is in no way able to process more than a token amount of traffic, and yet I\u0026rsquo;m seeing 100Gbit go through the switch chip and the CPU load on the kernel pretty much zero. I can however retrieve the link statistics using ip stats, and those will show me the actual counters of the silicon, not just the trapped packets. If you\u0026rsquo;ll recall, in VPP the only packets that the TAP interfaces see are those packets that are punted, and the Linux kernel there is completely oblivious to the total dataplane throughput. Here, the interface is showing the correct dataplane packet and byte counters, which means that things like SNMP will automatically just do the right thing.\nroot@fafo:~# dmesg | grep 03:00.*bandwidth [ 2.180410] pci 0000:03:00.0: 16.000 Gb/s available PCIe bandwidth, limited by 5.0 GT/s PCIe x4 link at 0000:00:01.2 (capable of 31.504 Gb/s with 8.0 GT/s PCIe x4 link) root@fafo:~# uptime 03:19:16 up 2 days, 14:14, 1 user, load average: 0.00, 0.00, 0.00 root@fafo:~# ip stats show dev swp32 group link 72: swp32: group link RX: bytes packets errors dropped missed mcast 5106713943502 15175926564 0 0 0 103 TX: bytes packets errors dropped carrier collsns 23464859508367 103495791750 0 0 0 0 Debian SN2700: IPv4 and IPv6 I now take a look at the L3 capabilities of the switch. To do this, I simply destroy the bridge br0, which will return the enslaved switchports. I then convert the T-Rex loadtester to use an L3 profile, and configure the switch as follows:\nroot@fafo:~# ip addr add 100.65.1.1/30 dev swp31 root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev swp31 root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 dev swp31 root@fafo:~# ip addr add 100.65.2.1/30 dev swp32 root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev swp32 root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 dev swp32 Several other routers I\u0026rsquo;ve loadtested have the same (cosmetic) issue, that T-Rex doesn\u0026rsquo;t reply to ARP packets after the first few seconds, so I first set the IPv4 address, then add a static L2 adjacency for the T-Rex side (on MAC b4:96:91:b3:b1:10), and route 16.0.0.0/8 to port 0 and I route 48.0.0.0/8 to port 1 of the loadtester.\nI start a stateless L3 loadtest with 192 byte packets in both directions, and the switch keeps up just fine. Taking a closer look at the ip stats instrumentation, I see that there\u0026rsquo;s the ability to turn on L3 counters in addition to L2 (ethernet) counters. 
So I do that on my two router ports while they are happily forwarding 58.9Mpps, and I can now see the difference between dataplane (forwarded in hardware) and CPU (forwarded by the CPU)\nroot@fafo:~# ip stats set dev swp31 l3_stats on root@fafo:~# ip stats set dev swp32 l3_stats on root@fafo:~# ip stats show dev swp32 group offload subgroup l3_stats 72: swp32: group offload subgroup l3_stats on used on RX: bytes packets errors dropped mcast 270222574848200 1137559577576 0 0 0 TX: bytes packets errors dropped 281073635911430 1196677185749 0 0 root@fafo:~# ip stats show dev swp32 group offload subgroup cpu_hit 72: swp32: group offload subgroup cpu_hit RX: bytes packets errors dropped missed mcast 1068742 17810 0 0 0 0 TX: bytes packets errors dropped carrier collsns 468546 2191 0 0 0 0 The statistics above clearly demonstrate that the lion\u0026rsquo;s share of the packets have been forwarded by the ASIC, and only a few (notably things like IPv6 neighbor discovery, IPv4 ARP, LLDP, and of course any traffic to the IP addresses configured on the router) will go to the kernel.\nDebian SN2700: BVI (or VLAN Interfaces) I\u0026rsquo;ve played around a little bit with L2 (switch) and L3 (router) ports, but there is one middle ground. I\u0026rsquo;ll keep the T-Rex loadtest running in L3 mode, but now I\u0026rsquo;ll reconfigure the switch to put the ports back into the bridge, each port in its own VLAN, and have so-called Bridge Virtual Interface, also known as VLAN interfaces \u0026ndash; this is where the switch has a bunch of ports together in a VLAN, but the switch itself has an IPv4 or IPv6 address in that VLAN as well, which can act as a router.\nI reconfigure the switch to put the interfaces back into VLAN 1000 and 2000 respectively, and move the IPv4 addresses and routes there \u0026ndash; so here I go, first putting the switch interfaces back into L2 mode and adding them to the bridge, each in their own VLAN, by making them access ports:\nroot@fafo:~# ip link add name br0 type bridge vlan_filtering 1 root@fafo:~# ip link set br0 address 04:3f:72:74:a9:7d mtu 9216 up root@fafo:~# ip link set swp31 master br0 mtu 9216 up root@fafo:~# ip link set swp32 master br0 mtu 9216 up root@fafo:~# bridge vlan del vid 1 dev swp31 root@fafo:~# bridge vlan del vid 1 dev swp32 root@fafo:~# bridge vlan add vid 1000 dev swp31 pvid root@fafo:~# bridge vlan add vid 2000 dev swp32 pvid From the ASIC specs, I understand that these BVIs need to (re)use a MAC from one of the members, so the first thing I do is give br0 the right MAC address. Then I put the switch ports into the bridge, remove VLAN 1 and put them in their respective VLANs. At this point, the loadtester reports 100% packet loss, because the two ports can no longer see each other at layer2, and layer3 configs have been removed. 
But I can restore connectivity with two BVIs as follows:\nroot@fafo:~# for vlan in 1000 2000; do ip link add link br0 name br0.$vlan type vlan id $vlan bridge vlan add dev br0 vid $vlan self ip link set br0.$vlan up mtu 9216 done root@fafo:~# ip addr add 100.65.1.1/24 dev br0.1000 root@fafo:~# ip ro add 16.0.0.0/8 via 100.65.1.2 root@fafo:~# ip nei replace 100.65.1.2 lladdr b4:96:91:b3:b1:10 dev br0.1000 root@fafo:~# ip addr add 100.65.2.1/24 dev br0.2000 root@fafo:~# ip ro add 48.0.0.0/8 via 100.65.2.2 root@fafo:~# ip nei replace 100.65.2.2 lladdr b4:96:91:b3:b1:11 dev br0.2000 And with that, the loadtest shoots back into action. First, a quick overview of the situation I have created:\nroot@fafo:~# bridge vlan port vlan-id swp31 1000 PVID swp32 2000 PVID br0 1 PVID Egress Untagged root@fafo:~# ip -4 ro default via 198.19.5.1 dev eth0 onlink rt_trap 16.0.0.0/8 via 100.65.1.2 dev br0.1000 offload rt_offload 48.0.0.0/8 via 100.65.2.2 dev br0.2000 offload rt_offload 100.65.1.0/24 dev br0.1000 proto kernel scope link src 100.65.1.1 rt_offload 100.65.2.0/24 dev br0.2000 proto kernel scope link src 100.65.2.1 rt_offload 198.19.5.0/26 dev eth0 proto kernel scope link src 198.19.5.62 rt_trap root@fafo:~# ip -4 nei 198.19.5.1 dev eth0 lladdr 00:1e:08:26:ec:f3 REACHABLE 100.65.1.2 dev br0.1000 lladdr b4:96:91:b3:b1:10 offload PERMANENT 100.65.2.2 dev br0.2000 lladdr b4:96:91:b3:b1:11 offload PERMANENT Looking at the situation now, compared to the regular IPv4 L3 loadtest, there is one important difference. Now, the switch can have any number of ports in VLAN 1000, which will all amongst themselves do L2 forwarding at line rate, and when they need to send IPv4 traffic out, they will ARP for the gateway (for example at 100.65.1.1/24), which will get trapped and forwarded to the CPU, after which the ARP reply will go out so that the machines know where to find the gateway. From that point on, IPv4 forwarding happens once again in hardware, which can be shown by the keywords rt_offload in the routing table (br0, in the ASIC), compared to the rt_trap (eth0, in the kernel). Similarly for the IPv4 neighbors, the L2 adjacency is programmed into the CAM (the output of which I took a look at above), so forwarding can be done directly by the ASIC without intervention from the CPU.\nAs a result, these VLAN Interfaces (which are synonymous with BVIs) work at line rate out of the box.\nResults This switch is phenomenal, and Jiří Pírko and the Mellanox team truly outdid themselves with their mlxsw switchdev implementation. I have in my hands a very affordable 32x100G or 64x(50G, 25G, 10G, 1G) and anything in between, with IPv4 and IPv6 forwarding in hardware, with a limited FIB size, not too dissimilar from the [Centec] switches that IPng Networks runs in its AS8298 network, albeit without MPLS forwarding capabilities.\nStill, for a LAB switch, to better test 25G and 100G topologies, this switch is very good value for the money, and it runs Debian and is fully configurable with things like Kees and Ansible.
Considering there\u0026rsquo;s a whole range of 48x10G and 48x25G switches as well from Mellanox, all completely open and officially allowed to run OSS stuff on, these make a perfect fit for IPng Networks!\nAcknowledgements This article was written after fussing around and finding out, but a few references were particularly helpful, and I\u0026rsquo;d like to acknowledge the following super useful sites:\n[mlxsw wiki] on GitHub [jpirko\u0026rsquo;s kernel driver] on GitHub [SONiC wiki] on GitHub [Spectrum Docs] on NVIDIA And to the community for writing and maintaining this excellent switchdev implementation.\n","date":"2023-11-11","desc":"Introduction I\u0026rsquo;m still hunting for a set of machines with which I can generate 1Tbps and 1Gpps of VPP traffic, and considering a 100G network interface can do at most 148.8Mpps, I will need 7 or 8 of these network cards. Doing a loadtest like this with DACs back-to-back is definitely possible, but it\u0026rsquo;s a bit more convenient to connect them all to a switch. However, for this to work I would need (at least) fourteen or more HundredGigabitEthernet ports, and these switches tend to get expensive, real quick.\n","permalink":"https://ipng.ch/s/articles/2023/11/11/debian-on-mellanox-sn2700-32x100g/","section":"articles","title":"Debian on Mellanox SN2700 (32x100G)"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nThere\u0026rsquo;s some really fantastic features in VPP, some of which are lesser well known, and not always very well documented. In this article, I will describe a unique usecase in which I think VPP will excel, notably acting as a gateway for Internet Exchange Points.\nIn this first article, I\u0026rsquo;ll take a closer look at three things that would make such a gateway possible: bridge domains, MAC address filtering and traffic shaping.\nIntroduction Internet Exchanges are typically L2 (ethernet) switch platforms that allow their connected members to exchange traffic amongst themselves. Not all members share physical locations with the Internet Exchange itself, for example the IXP may be at NTT Zurich, but the member may be present in Interxion Zurich. For smaller clubs, like IPng Networks, it\u0026rsquo;s not always financially feasible (or desirable) to order a dark fiber between two adjacent datacenters, or even a cross connect in the same datacenter (as many of them are charging exorbitant fees for what is essentially passive fiber optics and patch panels), if the amount of traffic passed is modest.\nOne solution to such problems is to have one member transport multiple end-user downstream members to the platform, for example by means of an Ethernet over MPLS or VxLAN transport from where the enduser lives, to the physical port of the Internet Exchange. These transport members are often called IXP Resellers noting that usually, but not always, some form of payment is required.\nFrom the point of view of the IXP, it\u0026rsquo;s often the case that there is a one MAC address per member limitation, and not all members will have the same bandwidth guarantees. 
Many IXPs will offer physical connection speeds (like a Gigabit, TenGig or HundredGig port), but they also have a common practice to limit the passed traffic by means of traffic shaping, for example one might have a TenGig port but only entitled to pass 3.0 Gbit/sec of traffic in- and out of the platform.\nFor a long time I thought this kind of sucked, after all, who wants to connect to an internet exchange point but then see their traffic rate limited? But if you think about it, this is often to protect both the member, and the reseller, and the exchange itself: if the total downstream bandwidth to the reseller is potentially larger than the reseller\u0026rsquo;s port to the exchange, and this is almost certainly the case in the other direction: the total IXP bandwidth that might go to one individual members, is significantly larger than the reseller\u0026rsquo;s port to the exchange.\nDue to these two issues, a reseller port may become a bottleneck and packetlo may occur. To protect the ecosystem, having the internet exchange try to enforce fairness and bandwidth limits makes operational sense.\nVPP as an IXP Gateway Here\u0026rsquo;s a few requirements that may be necessary to provide an end-to-end solution:\nDownstream ports MAY be untagged, or tagged, in which case encapsulation (for example .1q VLAN tags) SHOULD be provided, one per downstream member. Each downstream member MUST ONLY be allowed to send traffic from one or more registered MAC addresses, in other words, strict filtering MUST be applied by the gateway. If a downstream member is assigned an up- and downstream bandwidth limit, this MUST be enforced by the gateway. Of course, all sorts of other things come to mind \u0026ndash; perhaps MPLS encapsulation, or VxLAN/GENEVE tunneling endpoints, and certainly some monitoring with SNMP or Prometheus, and how about just directly integrating this gateway with [IXPManager] while we\u0026rsquo;re at it. Yes, yes! But for this article, I\u0026rsquo;m going to stick to the bits and pieces regarding VPP itself, and leave the other parts for another day!\nFirst, I build a quick lab out of this, by taking one supermicro bare metal server with VPP (it will be the VPP IXP Gateway), and a couple of Debian servers and switches to simulate clients (A-J):\nClient A-D (on port e0-e3) will use 192.0.2.1-4/24 and 2001:db8::1-5/64 Client E-G (on switch port e0-e2 of switch0, behind port xe0) will use 192.0.2.5-7/24 and 2001:db8::5-7/64 Client H-J (on switch port e0-e2 of switch1, behind port xe1) will use 192.0.2.8-10/24 and 2001:db8::8-a/64 There will be a server attached to port xxv0 with address 198.0.2.254/24 and 2001:db8::ff/64 The server will run iperf3. VPP: Bridge Domains The fundamental topology described in the picture above tries to bridge together a bunch of untagged ports (e0..e3 1Gbit each)) with two tagged ports (xe0 and xe1, 10Gbit) into an upstream IXP port (xxv0, 25Gbit). One thing to note for the pedants (and I love me some good pedantry) is that the total physical bandwidth to downstream members in this gateway (4x1+2x10 == 24Gbit) is lower than the physical bandwidth to the IXP platform (25Gbit), which makes sense. 
It means that there will not be contention per se.\nBuilding this topology in VPP is rather straight forward by using a so called Bridge Domain, which will be referred to by its bridge-id, for which I\u0026rsquo;ll rather arbitrarily choose 8298:\nvpp# create bridge-domain 8298 vpp# set interface l2 bridge xxv0 8298 vpp# set interface l2 bridge e0 8298 vpp# set interface l2 bridge e1 8298 vpp# set interface l2 bridge e2 8298 vpp# set interface l2 bridge e3 8298 vpp# set interface l2 bridge xe0 8298 vpp# set interface l2 bridge xe1 8298 VPP: Bridge Domain Encapsulations I cheated a little bit in the previous section: I added the two TenGig ports called xe0 and xe1 directly to the bridge; however they are trunk ports to breakout switches which will each contain three additional downstream customers. So to add these six new customers, I will do the following:\nvpp# set interface l3 xe0 vpp# create sub-interfaces xe0 10 vpp# create sub-interfaces xe0 20 vpp# create sub-interfaces xe0 30 vpp# set interface l2 bridge xe0.10 8298 vpp# set interface l2 bridge xe0.20 8298 vpp# set interface l2 bridge xe0.30 8298 The first command here puts the interface xe0 back into Layer3 mode, which will detach it from the bridge-domain. The second set of commands creates sub-interfaces with dot1q tags 10, 20 and 30 respectively. The third set then adds these three sub-interfaces to the bridge. By the way, I\u0026rsquo;ll do this for both xe0 shown above, but also for the second xe1 port, so all-up that makes 6 downstream member ports.\nReaders of my articles at this point may have a little bit of an uneasy feeling: \u0026ldquo;What about the VLAN Gymnastics?\u0026rdquo; I hear you ask :) You see, VPP will generally just pick up these ethernet frames from xe0.10 which are tagged, and add them as-is to the bridge, which is weird, because all the other bridge ports are expecting untagged frames. So what I must do is tell VPP, upon receipt of a tagged ethernet frame on these ports, to strip the tag; and on the way out, before transmitting the ethernet frame, to wrap it into its correct encapsulation. This is called tag rewriting in VPP, and I\u0026rsquo;ve written a bit about it in [this article] in case you\u0026rsquo;re curious. But to cut to the chase:\nvpp# set interface l2 tag-rewrite xe0.10 pop 1 vpp# set interface l2 tag-rewrite xe0.20 pop 1 vpp# set interface l2 tag-rewrite xe0.30 pop 1 vpp# set interface l2 tag-rewrite xe1.10 pop 1 vpp# set interface l2 tag-rewrite xe1.20 pop 1 vpp# set interface l2 tag-rewrite xe1.30 pop 1 Allright, with the VLAN gymnastics properly applied, I now have a bridge with all ten downstream members and one upstream port (xxv0):\nvpp# show bridge-domain 8298 int BD-ID Index BSN Age(min) Learning U-Forwrd UU-Flood Flooding ARP-Term arp-ufwd Learn-co Learn-li BVI-Intf 8298 1 0 off on on flood on off off 1 16777216 N/A Interface If-idx ISN SHG BVI TxFlood VLAN-Tag-Rewrite xxv0 3 1 0 - * none e0 5 1 0 - * none e1 6 1 0 - * none e2 7 1 0 - * none e3 8 1 0 - * none xe0.10 19 1 0 - * pop-1 xe0.20 20 1 0 - * pop-1 xe0.30 21 1 0 - * pop-1 xe1.10 22 1 0 - * pop-1 xe1.20 23 1 0 - * pop-1 xe1.30 24 1 0 - * pop-1 One cool thing to re-iterate is that VPP is really a router, not a switch. 
It\u0026rsquo;s entirely possible and common to create two completely independent subinterfaces with .1q tag 10 (in my case, xe0.10 and xe1.10) and use the bridge-domain to tie them together.\nValidating Bridge Domains Looking at my clients above, I can see that several of them are untagged (e0-e3) and a few of them are tagged behind ports xe0 and xe1. It should be straightforward to validate reachability with the following simple ping command:\npim@clientA:~$ fping -a -g 192.0.2.0/24 192.0.2.1 is alive 192.0.2.2 is alive 192.0.2.3 is alive 192.0.2.4 is alive 192.0.2.5 is alive 192.0.2.6 is alive 192.0.2.7 is alive 192.0.2.8 is alive 192.0.2.9 is alive 192.0.2.10 is alive 192.0.2.254 is alive At this point the table stakes configuration provides for a Layer2 bridge domain spanning all of these ports, including performing the correct encapsulation on the TenGig ports that connect to the switches. There is L2 reachability between all clients over this VPP IXP Gateway.\n✅ Requirement #1 is implemented!\nVPP: MAC Address Filtering Enter classifiers! Actually while doing the research for this article, I accidentally nerd-sniped myself while going through the features provided by VPP\u0026rsquo;s classifier system, and holy moly is that thing powerful!\nI\u0026rsquo;m only going to show the results of that little journey through the code base and documentation, but in an upcoming article I intend to do a thorough deep-dive into VPP classifiers, and add them to vppcfg because I think that would be the bee\u0026rsquo;s knees!\nBack to the topic of MAC address filtering, a classifier would look roughly like this:\nvpp# classify table acl-miss-next deny mask l2 src table 5 vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:ca:fe vpp# classify session acl-hit-next permit table-index 5 match l2 src 00:01:02:03:d0:d0 vpp# set interface input acl intfc e0 l2-table 5 vpp# show inacl type l2 Intfc idx Classify table Interface name 5 5 e0 The first line creates a classify table where we\u0026rsquo;ll want to match on Layer2 source addresses, and if there is no entry in the table that matches, the default will be to deny (drop) the ethernet frame. The next two lines add an entry for ethernet frames which have a Layer2 source of the cafe and d0d0 MAC addresses. When matching, the action is to permit (accept) the ethernet frame. Then, I apply this classifier as an l2 input ACL on interface e0.\nIncidentally, the input ACL can operate at five distinct points in the packet\u0026rsquo;s journey through the dataplane. At the Layer2 input stage, like I\u0026rsquo;m using here, in the IPv4 and IPv6 input path, and when punting traffic for IPv4 and IPv6 respectively.\nValidating MAC filtering Remember when I created the classify table and added two bogus MAC addresses to it? Let me show you what would happen on client A, which is directly connected to port e0.\npim@clientA:~$ ip -br link show eno3 eno3 UP 3c:ec:ef:6a:7b:74 \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; pim@clientA:~$ ping 192.0.2.254 PING 192.0.2.254 (192.0.2.254) 56(84) bytes of data. ... This is expected because ClientA\u0026rsquo;s MAC address has not yet been added to the classify table driving the Layer2 input ACL, which is quickly remedied like so:\nvpp# classify session acl-hit-next permit table-index 5 match l2 src 3c:ec:ef:6a:7b:74 ...
64 bytes from 192.0.2.254: icmp_seq=34 ttl=64 time=2048 ms 64 bytes from 192.0.2.254: icmp_seq=35 ttl=64 time=1024 ms 64 bytes from 192.0.2.254: icmp_seq=36 ttl=64 time=0.450 ms 64 bytes from 192.0.2.254: icmp_seq=37 ttl=64 time=0.262 ms ✅ Requirement #2 is implemented!\nVPP: Traffic Policers I realize that from the IXP\u0026rsquo;s point of view, not all the available bandwidth behind xxv0 should be made available to all clients. Some may have negotiated a higher or lower bandwidth available to them. Therefore, the VPP IXP Gateway should be able to rate limit the traffic through it, for which a VPP feature already exists: Policers.\nConsider for a moment our client A (untagged on port e0), and client E (behind port xe0 with a dot1q tag of 10). Client A has a bandwidth of 1Gbit, but client E nominally has a bandwidth of 10Gbit. If I were to want to restrict both clients to, say, 150Mbit, I could do the following:\nvpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit vpp# policer input name client-a e0 vpp# policer output name client-a e0 vpp# policer add name client-e rate kbps cir 150000 cb 15000000 conform-action transmit vpp# policer input name client-e xe0.10 vpp# policer output name client-e xe0.10 And here\u0026rsquo;s where I bump into a stubborn VPP dataplane. I would\u0026rsquo;ve expected the input and output packet shaping to occur on both the untagged interface e0 as well as the tagged interface xe0.10, but alas, the policer only works in one of these four cases. Ouch!\nI read the code around vnet/src/policer/ and understand the following:\nOn input, the policer is applied on device-input which is the Phy, not the Sub-Interface. This explains why the policer works on untagged, but not on tagged interfaces. On output, the policer is applied on ip4-output and ip6-output, which works only for L3 enabled interfaces, not for L2 ones like the ones in this bridge domain. I also tried to work with classifiers, like in the MAC address filtering above \u0026ndash; but I concluded here as well, that the policer works only on input, not on output. So the mission is now to figure out how to enable an L2 policer on (1) untagged output, and (2) tagged in- and output.\n❌ Requirement #3 is not implemented!\nWhat\u0026rsquo;s Next It\u0026rsquo;s too bad that policers are a bit fickle. That\u0026rsquo;s quite unfortunate, but I think fixable. I\u0026rsquo;ve started a thread on vpp-dev@ to discuss, and will reach out to Stanislav who added the policer output capability in commit e5a3ae0179.\nOf course, this is just a proof of concept. I typed most of the configuration by hand on the VPP IXP Gateway, just to show a few of the more advanced features of VPP. For me, this triggered a whole new line of thinking: classifiers. This extract/match/act pattern can be used in policers, ACLs and arbitrary traffic redirection through VPP\u0026rsquo;s directed graph (eg. selecting a next node for processing). I\u0026rsquo;m going to deep-dive into this classifier behavior in an upcoming article, and see how I might add this to [vppcfg], because I think it would be super powerful to abstract away the rather complex underlying API into something a little bit more \u0026hellip; user friendly. Stay tuned! :)\n","date":"2023-10-21","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility.
For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nThere\u0026rsquo;s some really fantastic features in VPP, some of which are lesser well known, and not always very well documented. In this article, I will describe a unique usecase in which I think VPP will excel, notably acting as a gateway for Internet Exchange Points.\n","permalink":"https://ipng.ch/s/articles/2023/10/21/vpp-ixp-gateway-part-1/","section":"articles","title":"VPP IXP Gateway - Part 1"},{"contents":"About this series In the distant past (to be precise, in November of 2009) I wrote a little piece of automation together with my buddy Paul, called PaPHosting. The goal was to be able to configure common attributes like servername, config files, webserver and DNS configs in a consistent way, tracked in Subversion. By the way despite this project deriving its name from the first two authors, our mutual buddy Jeroen also started using it, and has written lots of additional cool stuff in the repo, as well as helped to move from Subversion to Git a few years ago.\nMichael DeHaan [ref] founded Ansible in 2012, and by then our little PaPHosting project, which was written as a set of bash scripts, had sufficiently solved our automation needs. But, as is the case with most home-grown systems, over time I kept on seeing more and more interesting features and integrations emerge, solid documentation, large user group, and eventually I had to reconsider our 1.5K LOC of Bash and ~16.5K files under maintenance, and in the end, I settled on Ansible.\ncommit c986260040df5a9bf24bef6bfc28e1f3fa4392ed Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Date: Thu Nov 26 23:13:21 2009 +0000 pim@squanchy:~/src/paphosting$ find * -type f | wc -l 16541 pim@squanchy:~/src/paphosting/scripts$ wc -l *push.sh funcs 132 apache-push.sh 148 dns-push.sh 92 files-push.sh 100 nagios-push.sh 178 nginx-push.sh 271 pkg-push.sh 100 sendmail-push.sh 76 smokeping-push.sh 371 funcs 1468 total In a [previous article], I talked about having not one but a cluster of NGINX servers that would each share a set of SSL certificates and pose as a reversed proxy for a bunch of websites. At the bottom of that article, I wrote:\nThe main thing that\u0026rsquo;s next is to automate a bit more of this. IPng Networks has an Ansible controller, which I\u0026rsquo;d like to add \u0026hellip; but considering Ansible is its whole own elaborate bundle of joy, I\u0026rsquo;ll leave that for maybe another article.\nTadaah.wav that article is here! This is by no means an introduction or howto to Ansible. For that, please take a look at the incomparable Jeff Geerling [ref] and his book: [Ansible for Devops]. I bought and read this book, and I highly recommend it.\nAnsible: Playbook Anatomy The first thing I do is install four Debian Bookworm virtual machines, two in Amsterdam, one in Geneva and one in Zurich. These will be my first group of NGINX servers, that are supposed to be my geo-distributed frontend pool. I don\u0026rsquo;t do any specific configuration or installation of packages, I just leave whatever deboostrap gives me, which is a relatively lean install with 8 vCPUs, 16GB of memory, a 20GB boot disk and a 30G second disk for caching and static websites.\nAnsible is a simple, but powerful, server and configuration management tool (with a few other tricks up its sleeve). 
It consists of an inventory (the hosts I\u0026rsquo;ll manage), that are put in one or more groups, there is a registery of variables (telling me things about those hosts and groups), and an elaborate system to run small bits of automation, called tasks organized in things called Playbooks.\nNGINX Cluster: Group Basics First of all, I create an Ansible group called nginx and I add the following four freshly installed virtual machine hosts to it:\npim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee -a inventory/nodes.yml nginx: hosts: nginx0.chrma0.net.ipng.ch: nginx0.chplo0.net.ipng.ch: nginx0.nlams1.net.ipng.ch: nginx0.nlams2.net.ipng.ch: EOF I have a mixture of Debian and OpenBSD machines at IPng Networks, so I will add this group nginx as a child to another group called debian, so that I can run \u0026ldquo;common debian tasks\u0026rdquo;, such as installing Debian packages that I want all of my servers to have, adding users and their SSH key for folks who need access, installing and configuring the firewall and things like Borgmatic backups.\nI\u0026rsquo;m not going to go into all the details here for the debian playbook, though. It\u0026rsquo;s just there to make the base system consistent across all servers (bare metal or virtual). The one thing I\u0026rsquo;ll mention though, is that the debian playbook will see to it that the correct users are created, with their SSH pubkey, and I\u0026rsquo;m going to first use this feature by creating two users:\nlego: As I described in a [post on DNS-01], IPng has a certificate machine that answers Let\u0026rsquo;s Encrypt DNS-01 challenges, and its job is to regularly prove ownership of my domains, and then request a (wildcard!) certificate. Once that renews, copy the certificate to all NGINX machines. To do that copy, lego needs an account on these machines, it needs to be able to write the certs and issue a reload to the NGINX server. drone: Most of my websites are static, for example ipng.ch is generated by Jekyll. I typically write an article on my laptop, and once I\u0026rsquo;m happy with it, I\u0026rsquo;ll git commit and push it, after which a Continuous Integration system called [Drone] gets triggered, builds the website, runs some tests, and ultimately copies it out to the NGINX machines. Similar to the first user, this second user must have an account and the ability to write its web data to the NGINX server in the right spot. That explains the following:\npim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee group_vars/nginx.yml --- users: lego: comment: Lets Encrypt password: \u0026#34;!\u0026#34; groups: [ lego ] drone: comment: Drone CI password: \u0026#34;!\u0026#34; groups: [ www-data ] sshkeys: lego: - key: ecdsa-sha2-nistp256 \u0026lt;hidden\u0026gt; comment: lego@lego.net.ipng.ch drone: - key: ecdsa-sha2-nistp256 \u0026lt;hidden\u0026gt; comment: drone@git.net.ipng.ch I note that the users and sshkeys used here are dictionaries, and that the users role defines a few default accounts like my own account pim, so writing this to the group_vars means that these new entries are applied to all machines that belong to the group nginx, so they\u0026rsquo;ll get these users created in addition to the other users in the dictionary. 
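As a quick aside, a nice way to double-check what a given machine ends up with after all this group and variable merging is to ask Ansible itself. The commands below are only an illustrative sketch (the exact output depends on how the inventory, group_vars and the users role are wired together):
pim@squanchy:~/src/ipng-ansible$ ansible-inventory -i inventory/nodes.yml --graph nginx
pim@squanchy:~/src/ipng-ansible$ ansible-inventory -i inventory/nodes.yml --host nginx0.chrma0.net.ipng.ch
The first command shows which hosts landed in the nginx group, and the second dumps the merged variables (including the users and sshkeys dictionaries) for one of them.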
Nifty!\nNGINX Cluster: Config I wanted to be able to conserve IP addresses, and just a few months ago, had a discussion with some folks at Coloclue where we shared the frustration that what was hip in the 90s (go to RIPE NCC and ask for a /20, justifying that with \u0026ldquo;I run SSL websites\u0026rdquo;) is somehow still being used today, even though that\u0026rsquo;s no longer required, or in fact, desirable. So I take one IPv4 and IPv6 address and will use a TLS extension called Server Name Indication or [SNI], designed in 2003 (20 years old today), which you can see described in [RFC 3546].\nFolks who try to argue they need multiple IPv4 addresses because they run multiple SSL websites are somewhat of a trigger to me, so this article doubles up as a \u0026ldquo;how to do SNI and conserve IPv4 addresses\u0026rdquo;.\nI will group my websites that share the same SSL certificate, and I\u0026rsquo;ll call these things clusters. An IPng NGINX Cluster:\nis identified by a name, for example ipng or frysix is served by one or more NGINX servers, for example nginx0.chplo0.ipng.ch and nginx0.nlams1.ipng.ch serves one or more distinct websites, for example www.ipng.ch and nagios.ipng.ch and go.ipng.ch has exactly one SSL certificate, which should cover all of the website(s), preferably using wildcard certs, for example *.ipng.ch, ipng.ch And then, I define several clusters this way, in the following configuration file:\npim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee vars/nginx.yml --- nginx: clusters: ipng: members: [ nginx0.chrma0.net.ipng.ch, nginx0.chplo0.net.ipng.ch, nginx0.nlams1.net.ipng.ch, nginx0.nlams2.net.ipng.ch ] ssl_common_name: ipng.ch sites: ipng.ch: nagios.ipng.ch: go.ipng.ch: frysix: members: [ nginx0.nlams1.net.ipng.ch, nginx0.nlams2.net.ipng.ch ] ssl_common_name: frys-ix.net sites: frys-ix.net: This way I can neatly group the websites (eg. the ipng websites) together, call them by name, and immediately see which servers are going to be serving them using which certificate common name. For future expansion (hint: an upcoming article on monitoring), I decide to make the sites element here a dictionary with only keys and no values as opposed to a list, because later I will want to add some bits and pieces of information for each website.\nNGINX Cluster: Sites As is common with NGINX, I will keep a list of websites in the directory /etc/nginx/sites-available/ and once I need a given machine to actually serve that website, I\u0026rsquo;ll symlink it from /etc/nginx/sites-enabled/. In addition, I decide to add a few common configuration snippets, such as logging and SSL/TLS parameter files and options, which allow the webserver to score relatively high on SSL certificate checker sites. It helps to keep the security buffs off my case.\nSo I decide on the following structure, each file to be copied to all nginx machines in /etc/nginx/:\nroles/nginx/files/conf.d/http-log.conf roles/nginx/files/conf.d/ipng-headers.inc roles/nginx/files/conf.d/options-ssl-nginx.inc roles/nginx/files/conf.d/ssl-dhparams.inc roles/nginx/files/sites-available/ipng.ch.conf roles/nginx/files/sites-available/nagios.ipng.ch.conf roles/nginx/files/sites-available/go.ipng.ch.conf roles/nginx/files/sites-available/go.ipng.ch.htpasswd roles/nginx/files/sites-available/... 
In order:\nconf.d/http-log.conf defines a custom logline type called upstream that contains a few interesting additional items that show me the performance of NGINX: log_format upstream \u0026lsquo;$remote_addr - $remote_user [$time_local] \u0026rsquo; \u0026lsquo;\u0026quot;$request\u0026quot; $status $body_bytes_sent \u0026rsquo; \u0026lsquo;\u0026quot;$http_referer\u0026quot; \u0026ldquo;$http_user_agent\u0026rdquo; \u0026rsquo; \u0026lsquo;rt=$request_time uct=$upstream_connect_time uht=$upstream_header_time urt=$upstream_response_time\u0026rsquo;;\nconf.d/ipng-headers.inc adds a header served to end-users from this NGINX, that reveals the instance that served the request. Debugging a cluster becomes a lot easier if you know which server served what: add_header X-IPng-Frontend $hostname always;\nconf.d/options-ssl-nginx.inc and conf.d/ssl-dhparams.inc are files borrowed from Certbot\u0026rsquo;s NGINX configuration, and ensure the best TLS and SSL session parameters are used. sites-available/*.conf are the configuration blocks for the port-80 (HTTP) and port-443 (SSL certificate) websites. In the interest of brevity I won\u0026rsquo;t copy them here, but if you\u0026rsquo;re curious I showed a bunch of these in a [previous article]. These per-website config files sensibly include the SSL defaults, custom IPng headers and upstream log format. NGINX Cluster: Let\u0026rsquo;s Encrypt I figure the single most important thing to get right is how to enable multiple groups of websites, including SSL certificates, in multiple Clusters (say ipng and frysix), to be served using different SSL certificates, but on the same IPv4 and IPv6 address, using Server Name Indication or SNI. Let\u0026rsquo;s first take a look at building these two of these certificates, one for [IPng Networks] and one for [FrysIX], the internet exchange with Frysian roots, which incidentally offers free 1G, 10G, 40G and 100G ports all over the Amsterdam metro. My buddy Arend and I are running that exchange, so please do join it!\nI described the usual HTTP-01 certificate challenge a while ago in [this article], but I rarely use it because I\u0026rsquo;ve found that once installed, DNS-01 is vastly superior. I wrote about the ability to request a single certificate with multiple wildcard entries in a [DNS-01 article], so I\u0026rsquo;m going to save you the repetition, and simply use certbot, acme-dns and the DNS-01 challenge type, to request the following two certificates:\nlego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \\ --work-dir /home/lego/workdir --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \\ --preferred-challenges dns --debug-challenges \\ -d ipng.ch -d *.ipng.ch -d *.net.ipng.ch \\ -d ipng.nl -d *.ipng.nl \\ -d ipng.eu -d *.ipng.eu \\ -d ipng.li -d *.ipng.li \\ -d ublog.tech -d *.ublog.tech \\ -d as8298.net -d *.as8298.net \\ -d as50869.net -d *.as50869.net lego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \\ --work-dir /home/lego/workdir --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \\ --preferred-challenges dns --debug-challenges \\ -d frys-ix.net -d *.frys-ix.net First off, while I showed how to get these certificates by hand, actually generating these two commands is easily doable in Ansible (which I\u0026rsquo;ll show at the end of this article!) I defined which cluster has which main certificate name, and which websites it\u0026rsquo;s wanting to serve. 
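As a sanity check after requesting certificates by hand like this, certbot\u0026rsquo;s certificates subcommand can list what it currently manages. Purely illustrative, reusing the same config, logs and work directories as above:
lego@lego:~$ certbot certificates --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs --work-dir /home/lego/workdir
I\u0026rsquo;d expect to see one lineage per cluster here (ipng.ch and frys-ix.net), each with its list of domains and an expiry date, which is a handy thing to glance at before wiring up any automation.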
Looking at vars/nginx.yml, it becomes quickly obvious how I can automate this. Using a relatively straight forward construct, I can let Ansible create for me a list of commandline arguments programmatically:\nInitialize a variable CERT_ALTNAMES as a list of nginx.clusters.ipng.ssl_common_name and its wildcard, in other words [ipng.ch, *.ipng.ch]. As a convenience, tack onto the CERT_ALTNAMES list any entries in the nginx.clusters.ipng.ssl_altname, such as [*.net.ipng.ch]. Then looping over each entry in the nginx.clusters.ipng.sites dictionary, use fnmatch to match it against any entries in the CERT_ALTNAMES list: If it matches, for example with go.ipng.ch, skip and continue. This website is covered already by an altname. If it doesn\u0026rsquo;t match, for example with ublog.tech, simply add it and its wildcard to the CERT_ALTNAMES list: [ublog.tech, *.ublog.tech]. Now, the first time I run this for a new cluster (which has never had a certificate issued before), certbot will ask me to ensure the correct _acme-challenge records are in each respective DNS zone. After doing that, it will issue two separate certificates and install a cronjob that will periodically check the age, and renew the certificate(s) when they are up for renewal. In a post-renewal hook, I will create a script that copies the new certificate to the NGINX cluster (using the lego user + SSH key that I defined above).\nlego@lego:~$ find /home/lego/acme-dns/live/ -type f /home/lego/acme-dns/live/README /home/lego/acme-dns/live/frys-ix.net/README /home/lego/acme-dns/live/frys-ix.net/chain.pem /home/lego/acme-dns/live/frys-ix.net/privkey.pem /home/lego/acme-dns/live/frys-ix.net/cert.pem /home/lego/acme-dns/live/frys-ix.net/fullchain.pem /home/lego/acme-dns/live/ipng.ch/README /home/lego/acme-dns/live/ipng.ch/chain.pem /home/lego/acme-dns/live/ipng.ch/privkey.pem /home/lego/acme-dns/live/ipng.ch/cert.pem /home/lego/acme-dns/live/ipng.ch/fullchain.pem The crontab entry that Certbot normally installs makes soms assumptions on directory and which user is running the renewal. I am not a fan of having the root user do this, so I\u0026rsquo;ve changed it to this:\nlego@lego:~$ cat /etc/cron.d/certbot 0 */12 * * * lego perl -e \u0026#39;sleep int(rand(43200))\u0026#39; \u0026amp;\u0026amp; certbot -q renew \\ --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \\ --work-dir /home/lego/workdir \\ --deploy-hook \u0026#34;/home/lego/bin/certbot-distribute\u0026#34; And some pretty cool magic happens with this certbot-distribute script. When certbot has successfully received a new certificate, it\u0026rsquo;ll set a few environment variables and execute the deploy hook with them:\nRENEWED_LINEAGE: will point to the config live subdirectory (eg. /home/lego/acme-dns/live/ipng.ch) containing the new certificates and keys RENEWED_DOMAINS will contain a space-delimited list of renewed certificate domains (eg. 
ipng.ch *.ipng.ch *.net.ipng.ch) Using the first of those two things, I guess it becomes straight forward to distribute the new certs:\n#!/bin/sh CERT=$(basename $RENEWED_LINEAGE) CERTFILE=$RENEWED_LINEAGE/fullchain.pem KEYFILE=$RENEWED_LINEAGE/privkey.pem if [ \u0026#34;$CERT\u0026#34; = \u0026#34;ipng.ch\u0026#34; ]; then MACHS=\u0026#34;nginx0.chrma0.ipng.ch nginx0.chplo0.ipng.ch nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch\u0026#34; elif [ \u0026#34;$CERT\u0026#34; = \u0026#34;frys-ix.net\u0026#34; ]; then MACHS=\u0026#34;nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch\u0026#34; else echo \u0026#34;Unknown certificate $CERT, do not know which machines to copy to\u0026#34; exit 3 fi for MACH in $MACHS; do fping -q $MACH 2\u0026gt;/dev/null || { echo \u0026#34;$MACH: Skipping (unreachable)\u0026#34; continue } echo $MACH: Copying $CERT scp -q $CERTFILE $MACH:/etc/nginx/certs/$CERT.crt scp -q $KEYFILE $MACH:/etc/nginx/certs/$CERT.key echo $MACH: Reloading nginx ssh $MACH \u0026#39;sudo systemctl reload nginx\u0026#39; done There are a few things to note, if you look at my little shell script. I already kind of know which CERT belongs to which MACHS, because this was configured in vars/nginx.yml, where I have a cluster name, say ipng, which conveniently has two variables, one called members which is a list of machines, and the second is ssl_common_name which is ipng.ch. I think that I can find a way to let Ansible generate this file for me also, whoot!\nAnsible: NGINX Tying it all together (frankly, a tiny bit surprised you\u0026rsquo;re still reading this!), I can now offer an Ansible role that automates all of this.\n{%- raw %} pim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee roles/nginx/tasks/main.yml - name: Install Debian packages ansible.builtin.apt: update_cache: true pkg: [ nginx, ufw, net-tools, apache2-utils, mtr-tiny, rsync ] - name: Copy config files ansible.builtin.copy: src: \u0026#34;{{ item }}\u0026#34; dest: \u0026#34;/etc/nginx/\u0026#34; owner: root group: root mode: u=rw,g=r,o=r directory_mode: u=rwx,g=rx,o=rx loop: [ conf.d, sites-available ] notify: Reload nginx - name: Add cluster ansible.builtin.include_tasks: file: cluster.yml loop: \u0026#34;{{ nginx.clusters | dict2items }}\u0026#34; loop_control: label: \u0026#34;{{ item.key }}\u0026#34; EOF pim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF \u0026gt; roles/nginx/handlers/main.yml - name: Reload nginx ansible.builtin.service: name: nginx state: reloaded EOF {% endraw %} The first task installs the Debian packages I\u0026rsquo;ll want to use. The apache2-utils package is to create and maintain htpasswd files and some other useful things. The rsync package is needed to accept both website data from the drone continuous integration user, as well as certificate data from the lego user.\nThe second task copies all of the (static) configuration files onto the machine, populating /etc/nginx/conf.d/ and /etc/nginx/sites-available/. 
It uses a notify stanza to make note if any of these files (notably the ones in conf.d/) have changed, and if so, remember to invoke a handler to reload the running NGINX to pick up those changes later on.\nFinally, the third task branches out and executes the tasks defined in tasks/cluster.yml one for each NGINX cluster (in my case, ipng and then frysix):\n{%- raw %} pim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee roles/nginx/tasks/cluster.yml - name: \u0026#34;Enable sites for cluster {{ item.key }}\u0026#34; ansible.builtin.file: src: \u0026#34;/etc/nginx/sites-available/{{ sites_item.key }}.conf\u0026#34; dest: \u0026#34;/etc/nginx/sites-enabled/{{ sites_item.key }}.conf\u0026#34; owner: root group: root state: link loop: \u0026#34;{{ (nginx.clusters[item.key].sites | default({}) | dict2items) }}\u0026#34; when: inventory_hostname in nginx.clusters[item.key].members | default([]) loop_control: loop_var: sites_item label: \u0026#34;{{ sites_item.key }}\u0026#34; notify: Reload nginx EOF {% endraw %} This task is a bit more complicated, so let me go over it from outwards facing in. The thing that called us, already has a loop variable called item which has a key (ipng) and a value (the whole cluster defined under nginx.clusters.ipng). Now if I take that item.key variable and look at its sites dictionary (in other words: nginx.clusters.ipng.sites, I can create another loop over all the sites belonging to that cluster. Iterating over a dictionary in Ansible is done with a filter called dict2items, and because technically the cluster could have zero sites, I can ensure the sites dictionary defaults to the empty dictionary {}. Phew!\nAnsible is running this for each machine, and of course I only want to execute this block, if the given machine (which is referenced as inventory_hostname occurs in the clusters\u0026rsquo; members list. If not: skip, if yes: go! which is what the when line does.\nThe loop itself then runs for each site in the sites dictionary, allowing the loop_control to give that loop variable a unique name called sites_item, and when printing information on the CLI, using the label set to the sites_item.key variable (eg. frys-ix.net) rather than the whole dictionary belonging to it.\nWith all of that said, the inner loop is easy: create a (sym)link for each website config file from sites-available to sites-enabled and if new links are created, invoke the Reload nginx handler.\nAnsible: Certbot But what about that LEGO stuff? Fair question. 
The two scripts I described above (one to create the certbot certificate, and another to copy it to the correct machines), both need to be generated and copied to the right places, so here I go, appending to the tasks:\n{%- raw %} pim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee -a roles/nginx/tasks/main.yml - name: Create LEGO directory ansible.builtin.file: path: \u0026#34;/etc/nginx/certs/\u0026#34; owner: lego group: lego mode: u=rwx,g=rx,o= - name: Add sudoers.d ansible.builtin.copy: src: sudoers dest: \u0026#34;/etc/sudoers.d/lego-ipng\u0026#34; owner: root group: root - name: Generate Certbot Distribute script delegate_to: lego.net.ipng.ch run_once: true ansible.builtin.template: src: certbot-distribute.j2 dest: \u0026#34;/home/lego/bin/certbot-distribute\u0026#34; owner: lego group: lego mode: u=rwx,g=rx,o= - name: Generate Certbot Cluster scripts delegate_to: lego.net.ipng.ch run_once: true ansible.builtin.template: src: certbot-cluster.j2 dest: \u0026#34;/home/lego/bin/certbot-{{ item.key }}\u0026#34; owner: lego group: lego mode: u=rwx,g=rx,o= loop: \u0026#34;{{ nginx.clusters | dict2items }}\u0026#34; EOF pim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee roles/nginx/files/sudoers ## *** Managed by IPng Ansible *** # %lego ALL=(ALL) NOPASSWD: /usr/bin/systemctl reload nginx EOF {% endraw -%} The first task creates /etc/nginx/certs which will be owned by the user lego, and that\u0026rsquo;s where Certbot will rsync the certificates after renewal. The second task then allows lego user to issue a systemctl reload nginx so that NGINX can pick up the certificates once they\u0026rsquo;ve changed on disk.\nThe third task generated the certbot-distribute script, that, depending on the common name of the certificate (for example ipng.ch or frys-ix.net), knows which NGINX machines to copy it to. Its logic is pretty similar to the plain-old shellscript I started with, but does have a few variable expansions. If you\u0026rsquo;ll recall, that script had hard coded way to assemble the MACHS variable, which can be replaced now:\n{%- raw %} # ... {% for cluster_name, cluster in nginx.clusters.items() | default({}) %} {% if not loop.first%}el{% endif %}if [ \u0026#34;$CERT\u0026#34; = \u0026#34;{{ cluster.ssl_common_name }}\u0026#34; ]; then MACHS=\u0026#34;{{ cluster.members | join(\u0026#39; \u0026#39;) }}\u0026#34; {% endfor %} else echo \u0026#34;Unknown certificate $CERT, do not know which machines to copy to\u0026#34; exit 3 fi {% endraw %} One common Ansible trick here is to detect if a given loop has just begun (in which case loop.first will be true), or if this is the last element in the loop (in which case loop.last will be true). I can use this to emit the if (first) versus elif (not first) statements.\nLooking back at what I wrote in this Certbot Distribute task, you\u0026rsquo;ll see I used two additional configuration elements:\nrun_once: Since there are potentially many machines in the nginx Group, by default Ansible will run this task for each machine. However, the Certbot cluster and distribute scripts really only need to be generated once per Playbook execution, which is determined by this run_once field. delegate_to: This task should be executed not on an NGINX machine, rather instead on the lego.net.ipng.ch machine, which is specified by the delegate_to field. 
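To make the loop.first trick a little more tangible, here\u0026rsquo;s roughly what that block renders to for the two clusters defined in vars/nginx.yml earlier. This is only a sketch of the generated fragment (the rest of the script is unchanged), and note that the machine names now come straight from the members lists, so they are the .net.ipng.ch inventory names rather than the public names I had hard-coded in the original script:
if [ \u0026#34;$CERT\u0026#34; = \u0026#34;ipng.ch\u0026#34; ]; then
  MACHS=\u0026#34;nginx0.chrma0.net.ipng.ch nginx0.chplo0.net.ipng.ch nginx0.nlams1.net.ipng.ch nginx0.nlams2.net.ipng.ch\u0026#34;
elif [ \u0026#34;$CERT\u0026#34; = \u0026#34;frys-ix.net\u0026#34; ]; then
  MACHS=\u0026#34;nginx0.nlams1.net.ipng.ch nginx0.nlams2.net.ipng.ch\u0026#34;
else
  echo \u0026#34;Unknown certificate $CERT, do not know which machines to copy to\u0026#34;
  exit 3
fi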
Ansible: lookup example And now for the pièce de résistance, the fourth and final task generates a shell script that captures for each cluster the primary name (called ssl_common_name) and the list of alternate names, which will be turned into a full command line to request a certificate with all wildcard domains added (eg. ipng.ch and *.ipng.ch). To do this, I decide to create an Ansible [Lookup Plugin]. This lookup will simply return true if a given sitename is covered by any of the existing certificate altnames, including wildcard domains, for which I can use the standard Python fnmatch.\nFirst, I can create the lookup plugin in a well-known directory, so Ansible can discover it:\npim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee roles/nginx/lookup_plugins/altname_match.py import ansible.utils as utils import ansible.errors as errors from ansible.plugins.lookup import LookupBase import fnmatch class LookupModule(LookupBase): def __init__(self, basedir=None, **kwargs): self.basedir = basedir def run(self, terms, variables=None, **kwargs): sitename = terms[0] cert_altnames = terms[1] for altname in cert_altnames: if sitename == altname: return [True] if fnmatch.fnmatch(sitename, altname): return [True] return [False] EOF The Python class here will compare the website name in terms[0] with a list of altnames given in terms[1] and will return True either if a literal match occurred, or if the altname matches the sitename via fnmatch. It will return False otherwise. Dope! Here\u0026rsquo;s how I use it in the certbot-cluster script, which is starting to get pretty fancy:\n{%- raw %} pim@squanchy:~/src/ipng-ansible$ cat \u0026lt;\u0026lt; EOF | tee roles/nginx/templates/certbot-cluster.j2 #!/bin/sh ### ### {{ ansible_managed }} ### {% set cluster_name = item.key %} {% set cluster = item.value %} {% set sites = nginx.clusters[cluster_name].sites | default({}) %} # # This script generates a certbot commandline to initialize (or re-initialize) a given certificate for an NGINX cluster. # ### Metadata for this cluster: # # {{ cluster_name }}: {{ cluster }} {% set cert_altname = [ cluster.ssl_common_name, \u0026#39;*.\u0026#39; + cluster.ssl_common_name ] %} {% do cert_altname.extend(cluster.ssl_altname|default([])) %} {% for sitename, site in sites.items() %} {% set altname_matched = lookup(\u0026#39;altname_match\u0026#39;, sitename, cert_altname) %} {% if not altname_matched %} {% do cert_altname.append(sitename) %} {% do cert_altname.append(\u0026#34;*.\u0026#34;+sitename) %} {% endif %} {% endfor %} # CERT_ALTNAME: {{ cert_altname | join(\u0026#39; \u0026#39;) }} # ### certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs --work-dir /home/lego/workdir \\ --manual --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \\ --preferred-challenges dns --debug-challenges \\ {% for domain in cert_altname %} -d {{ domain }}{% if not loop.last %} \\{% endif %} {% endfor %} EOF {% endraw %} Ansible provides a lot of templating and logic evaluation in its Jinja2 templating language, but it isn\u0026rsquo;t really a programming language. That said, from the top, here\u0026rsquo;s what happens:\nI set three variables: cluster_name, cluster (the dictionary with the cluster config) and, as a shorthand, sites, which is a dictionary of sites, defaulting to {} if it doesn\u0026rsquo;t exist. I\u0026rsquo;ll print the cluster name and the cluster config for posterity. 
Who knows, eventually I\u0026rsquo;ll be debugging this anyway :-) Then comes the main thrust, the simple loop that I described above, but in Jinja2: Initialize the cert_altname list with the ssl_common_name and its wildcard variant, optionally extending it with the list of altnames in ssl_altname, if it\u0026rsquo;s set. For each site in the sites dictionary, invoke the lookup and capture its (boolean) result in altname_matched. If the match failed, we have a new domain, so add it and its wildcard variant to the cert_altname list, I use the do Jinja2 extension there comes from package jinja2.ext.do. At the end of this, all of these website names have been reduced to their domain+wildcard variant, which I can loop over to emit the -d flags to certbot at the bottom of the file. And with that, I can generate both the certificate request command, and distribute the resulting certificates to those NGINX servers that need them.\nResults I\u0026rsquo;m very pleased with the results. I can clearly see that the two servers that I assigned to this NGINX cluster (the two in Amsterdam) got their sites enabled, whereas the other two (Zurich and Geneva) were skipped. I can also see that the new certbot request scripts was generated and the existing certbot-distribute script was updated (to be aware of where to copy a renewed cert for this cluster). And, in the end only the two relevant NGINX servers were reloaded, reducing overall risk.\nOne other way to show that the very same IPv4 and IPv6 address can be used to serve multiple distinct multi-domain/wildcard SSL certificates, using this Server Name Indication (SNI, which, I repeat, has been available since 2003 or so), is this:\npim@squanchy:~$ HOST=nginx0.nlams1.ipng.ch pim@squanchy:~$ PORT=443 pim@squanchy:~$ SERVERNAME=www.ipng.ch pim@squanchy:~$ openssl s_client -connect $HOST:$PORT -servername $SERVERNAME \u0026lt;/dev/null 2\u0026gt;/dev/null \\ | openssl x509 -text | grep DNS: | sed -e \u0026#39;s,^ *,,\u0026#39; DNS:*.ipng.ch, DNS:*.ipng.eu, DNS:*.ipng.li, DNS:*.ipng.nl, DNS:*.net.ipng.ch, DNS:*.ublog.tech, DNS:as50869.net, DNS:as8298.net, DNS:ipng.ch, DNS:ipng.eu, DNS:ipng.li, DNS:ipng.nl, DNS:ublog.tech pim@squanchy:~$ SERVERNAME=www.frys-ix.net pim@squanchy:~$ openssl s_client -connect $HOST:$PORT -servername $SERVERNAME \u0026lt;/dev/null 2\u0026gt;/dev/null \\ | openssl x509 -text | grep DNS: | sed -e \u0026#39;s,^ *,,\u0026#39; DNS:*.frys-ix.net, DNS:frys-ix.net Ansible is really powerful, and once I got to know it a little bit, will readily admit it\u0026rsquo;s way cooler than PaPhosting ever was :)\nWhat\u0026rsquo;s Next If you remember, I wrote that the nginx.clusters.*.sites would not be a list but rather a dictionary, because I\u0026rsquo;d like to be able to carry other bits of information. And if you take a close look at my screenshot above, you\u0026rsquo;ll see I revealed something about Nagios\u0026hellip; so in an upcoming post I\u0026rsquo;d like to share how IPng Networks arranges its Nagios environment, and I\u0026rsquo;ll use the NGINX configs here to show how I automatically monitor all servers participating in an NGINX Cluster, both for pending certificate expiry, which should not generally happen precisely due to the automation here, but also in case any backend server takes the day off.\nStay tuned! 
Oh, and if you\u0026rsquo;re good at Ansible and would like to point out how silly I approach things, please do drop me a line on Mastodon, where you can reach me on [@IPngNetworks@ublog.tech].\n","date":"2023-08-27","desc":"About this series In the distant past (to be precise, in November of 2009) I wrote a little piece of automation together with my buddy Paul, called PaPHosting. The goal was to be able to configure common attributes like servername, config files, webserver and DNS configs in a consistent way, tracked in Subversion. By the way despite this project deriving its name from the first two authors, our mutual buddy Jeroen also started using it, and has written lots of additional cool stuff in the repo, as well as helped to move from Subversion to Git a few years ago.\n","permalink":"https://ipng.ch/s/articles/2023/08/27/case-study-nginx--certbot-with-ansible/","section":"articles","title":"Case Study: NGINX + Certbot with Ansible"},{"contents":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\nAfter having written a fair bit about my Mastodon [install] and [monitoring], I\u0026rsquo;ve been using it every day. This morning, my buddy Ramón asked if he could make a second account on ublog.tech for his Campervan Adventures, and notably to post pics of where he and his family went.\nBut if pics is your jam, why not \u0026hellip; [Pixelfed]!\nIntroduction Similar to how blogging is the act of publishing updates to a website, microblogging is the act of publishing small updates to a stream of updates on your profile. Very similar to the relationship between Facebook and Instagram, Mastodon and Pixelfed give the ability to post and share, cross-link, discuss, comment and like, across the entire Fediverse. Except, Pixelfed doesn\u0026rsquo;t do this in a centralized way, and I get to be a steward of my own data.\nAs is common in the Fediverse, groups of people congregate on a given server, of which they become a user by creating an account on that server. Then, they interact with one another on that server, but users can also interact with folks on other servers. Instead of following @IPngNetworks, they might follow a user on a given server domain, like @IPngNetworks@pix.ublog.tech. This way, all these servers can be run independently but interact with each other using a common protocol (called ActivityPub). I\u0026rsquo;ve heard this concept be compared to choosing an e-mail provider: I might choose Google\u0026rsquo;s gmail.com, and you might use Microsoft\u0026rsquo;s live.com. However we can send e-mails back and forth due to this common protocol (called SMTP).\npix.uBlog.tech I thought I would give it a go, mostly out of engineering curiosity but also because I more strongly feel today that we (the users) ought to take a bit more ownership back. 
I\u0026rsquo;ve been a regular blogging and micro-blogging user since approximately for ever, and I think it may be a good investment of my time to learn a bit more about the architecture of Pixelfed. So, I\u0026rsquo;ve decided to build and productionize a server instance.\nPreviously, I registered uBlog.tech and have been running that for about a year as a Mastodon instance. Incidentally, if you\u0026rsquo;re reading this and would like to participate, the server welcomes users in the network-, systems- and software engineering disciplines. But, before I can get to the fun parts though, I have to do a bunch of work to get this server in a shape in which it can be trusted with user generated content.\nThe IPng environment Pixelfed: Virtual Machine I provision a VM with 8vCPUs (dedicated on the underlying hypervisor), including 16GB of memory and one virtio network card. For disks, I will use two block devices, one small one of 16GB (vda) that is created on the hypervisor\u0026rsquo;s ssd-vol1/libvirt/pixelfed-disk0, to be used only for boot, logs and OS. Then, a second one (vdb) is created at 2TB on vol0/pixelfed-disk1 and it will be used for Pixelfed itself.\nI simply install Debian into vda using virt-install. At IPng Networks we have some ansible-style automation that takes over the machine, and further installs all sorts of Debian packages that we use (like a Prometheus node exporter, more on that later), and sets up a firewall that allows SSH access for our trusted networks, and otherwise only allows port 80 because this is to be a (backend) webserver behind the NGINX cluster.\nAfter installing Debian Bullseye, I\u0026rsquo;ll create the following ZFS filesystems on vdb:\npim@pixelfed:~$ sudo zpool create data /dev/vdb pim@pixelfed:~$ sudo zfs create -o data/pixelfed -V10G pim@pixelfed:~$ sudo zfs create -o mountpoint=/data/pixelfed/pixelfed/storage data/pixelfed-storage pim@pixelfed:~$ sudo zfs create -o mountpoint=/var/lib/mysql data/mysql -V20G pim@pixelfed:~$ sudo zfs create -o mountpoint=/var/lib/redis data/redis -V2G As a sidenote, I realize that this ZFS filesystem pool consists only of vdb, but its underlying blockdevice is protected in a raidz, and it is copied incrementally daily off-site by the hypervisor. I\u0026rsquo;m pretty confident on safety here, but I prefer to use ZFS for the virtual machine guests as well, because now I can do local snapshotting, of say data/pixelfed, and I can more easily grow/shrink the datasets for the supporting services, as well as isolate them individually against sibling wildgrowth.\nThe VM gets one virtual NIC, which will connect to the [IPng Site Local] network using jumboframes. This way, the machine itself is disconnected from the internet, saving a few IPv4 addresses and allowing for the IPng NGINX frontends to expose it. I give it the name pixelfed.net.ipng.ch with addresses 198.19.4.141 and 2001:678:d78:507::d, which will be firewalled and NATed via the IPng SL gateways.\nIPng Frontend: Wildcard SSL I run most websites behind a cluster of NGINX webservers, which are carrying an SSL certificate which support wildcards. 
The system is using [DNS-01] challenges, so the first order of business is to expand the certificate from serving only [ublog.tech] (which is in use by the companion Mastodon instance), to include as well *.ublog.tech so that I can add the new Pixelfed instance as [pix.ublog.tech]:\nlego@lego:~$ certbot certonly --config-dir /home/lego/acme-dns --logs-dir /home/lego/logs \\ --work-dir /home/lego/workdir --manual \\ --manual-auth-hook /home/lego/acme-dns/acme-dns-auth.py \\ --preferred-challenges dns --debug-challenges \\ -d ipng.ch -d *.ipng.ch -d *.net.ipng.ch \\ -d ipng.nl -d *.ipng.nl \\ -d ipng.eu -d *.ipng.eu \\ -d ipng.li -d *.ipng.li \\ -d ublog.tech -d *.ublog.tech \\ -d as8298.net \\ -d as50869.net CERTFILE=/home/lego/acme-dns/live/ipng.ch/fullchain.pem KEYFILE=/home/lego/acme-dns/live/ipng.ch/privkey.pem MACHS=\u0026#34;nginx0.chrma0.ipng.ch nginx0.chplo0.ipng.ch nginx0.nlams1.ipng.ch nginx0.nlams2.ipng.ch\u0026#34; for MACH in $MACHS; do fping -q $MACH 2\u0026gt;/dev/null || { echo \u0026#34;$MACH: Skipping (unreachable)\u0026#34; continue } echo $MACH: Copying $CERT scp -q $CERTFILE $MACH:/etc/nginx/certs/$CERT.crt scp -q $KEYFILE $MACH:/etc/nginx/certs/$CERT.key echo $MACH: Reloading nginx ssh $MACH \u0026#39;sudo systemctl reload nginx\u0026#39; done The first command here requests a certificate with certbot, and note the addition of the flag -d *.ublog.tech. It\u0026rsquo;ll correctly say that there are 11 existing domains in this certificate, and ask me if I\u0026rsquo;d like to request a new cert with the 12th one added. I answer yes, and a few seconds later, acme-dns has answered all of Let\u0026rsquo;s Encrypt\u0026rsquo;s challenges, and issues a certificate.\nThe second command then distributes that certificate to the four NGINX frontends, and reloads the cert. Now, I can use the hostname pix.ublog.tech, as far as the SSL certs are concerned. Of course, the regular certbot cronjob renews the cert regularly, so I tucked away the second part here into a script called bin/certbot-distribute, using the RENEWED_LINEAGE variable that certbot(1) sets when using the flag --deploy-hook:\nlego@lego:~$ cat /etc/cron.d/certbot SHELL=/bin/sh PATH=/usr/local/sbin:/usr/local/bin:/sbin:/bin:/usr/sbin:/usr/bin 0 */12 * * * lego perl -e \u0026#39;sleep int(rand(43200))\u0026#39; \u0026amp;\u0026amp; \\ certbot -q renew --config-dir /home/lego/acme-dns \\ --logs-dir /home/lego/logs --work-dir /home/lego/workdir \\ --deploy-hook \u0026#34;/home/lego/bin/certbot-distribute\u0026#34; IPng Frontend: NGINX The previous certbot-distribute shell script has copied the certificate to four separate NGINX instances, two in Amsterdam hosted at AS8283 (Coloclue), one in Zurich hosted at AS25091 (IP-Max), and one in Geneva hosted at AS8298 (IPng Networks). Each of these NGINX servers has a frontend IPv4 and IPv6 address, and a backend jumboframe enabled interface in IPng Site Local (198.19.0.0/16). 
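Before telling the frontends about the new site, it doesn\u0026rsquo;t hurt to double-check that the freshly issued certificate really does carry the new wildcard, both at the source and on one of the frontends after the copy. A purely illustrative check, reusing the paths from above:
lego@lego:~$ openssl x509 -noout -text -in /home/lego/acme-dns/live/ipng.ch/fullchain.pem | grep DNS:
pim@squanchy:~$ ssh nginx0.chplo0.ipng.ch \u0026#39;openssl x509 -noout -text -in /etc/nginx/certs/ipng.ch.crt\u0026#39; | grep DNS:
Both should now list *.ublog.tech among the subject alternative names.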
Because updating the configuration on four production machines is cumbersome, I previously created an Ansible playbook, which I now add this new site to:\npim@squanchy:~/src/ipng-ansible$ cat roles/nginx/files/sites-available/pix.ublog.tech.conf server { listen [::]:80; listen 0.0.0.0:80; server_name pix.ublog.tech; access_log /var/log/nginx/pix.ublog.tech-access.log; include /etc/nginx/conf.d/ipng-headers.inc; include \u0026#34;conf.d/lego.inc\u0026#34;; location / { return 301 https://$host$request_uri; } } server { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/nginx/certs/ipng.ch.crt; ssl_certificate_key /etc/nginx/certs/ipng.ch.key; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.inc; server_name pix.ublog.tech; access_log /var/log/nginx/pix.ublog.tech-access.log upstream; include /etc/nginx/conf.d/ipng-headers.inc; keepalive_timeout 70; sendfile on; client_max_body_size 80m; location / { proxy_pass http://pixelfed.net.ipng.ch:80; proxy_set_header Host $host; proxy_set_header X-Forwarded-Proto $scheme; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; } } The configuration is very straight forward. The first server block bounces all traffic destined to port 80 towards its port 443 equivalent. The second server block (listening on port 443) contains the certificate I just renewed serve for *.ublog.tech which allows the cluster to offload SSL and forward the traffic on the internal private network on to the VM I created earlier.\nOne quick Ansible playbook run later, and the reversed proxies are ready to rock and roll:\nOf course, this website will just timeout for the time being, because there\u0026rsquo;s nothing listening (yet) on pixelfed.net.ipng.ch:80.\nInstalling Pixelfed So off I go, installing Pixelfed on the new Debian VM. 
First, I\u0026rsquo;ll install the set of Debian packages this instance will need, including PHP 8.1 (which is the minimum supported, according to the Pixelfed docs):\npim@pixelfed:~$ sudo apt install apt-transport-https lsb-release ca-certificates git wget curl \\ build-essential apache2 mariadb-server pngquant optipng jpegoptim gifsicle ffmpeg redis pim@pixelfed:~$ sudo wget -O /etc/apt/trusted.gpg.d/php.gpg https://packages.sury.org/php/apt.gpg pim@pixelfed:~$ echo \u0026#34;deb https://packages.sury.org/php/ $(lsb_release -sc) main\u0026#34; \\ | sudo tee -a /etc/apt/sources.list.d/php.list pim@pixelfed:~$ apt update pim@pixelfed:~$ apt-get install php8.1-fpm php8.1 php8.1-common php8.1-cli php8.1-gd \\ php8.1-mbstring php8.1-xml php8.1-bcmath php8.1-pgsql php8.1-curl php8.1-xml php8.1-xmlrpc \\ php8.1-imagick php8.1-gd php8.1-mysql php8.1-cli php8.1-intl php8.1-zip php8.1-redis After all those bits and bytes settle on the filesystem, I simply follow the regular [install guide] from the upstream documentation.\nI update the PHP config to allow larger uploads:\npim@pixelfed:~$ sudo vim /etc/php/8.1/fpm/php.ini upload_max_filesize = 100M post_max_size = 100M I create a FastCGI pool for Pixelfed:\npim@pixelfed:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/php/8.1/fpm/pool.d/pixelfed.conf [pixelfed] user = pixelfed group = pixelfed listen.owner = www-data listen.group = www-data listen.mode = 0660 listen = /var/run/php.pixelfed.sock pm = dynamic pm.max_children = 20 pm.start_servers = 5 pm.min_spare_servers = 5 pm.max_spare_servers = 20 chdir = /data/pixelfed php_flag[display_errors] = on php_admin_value[error_log] = /data/pixelfed/php.error.log php_admin_flag[log_errors] = on php_admin_value[open_basedir] = /data/pixelfed:/usr/share/:/tmp:/var/lib/php EOF I reference this pool in a simple non-SSL Apache config, after enabling the modules that Pixelfed needs:\npim@pixelfed:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/apache2/sites-available/pixelfed.conf \u0026lt;VirtualHost *:80\u0026gt; ServerName pix.ublog.tech ServerAdmin pixelfed@ublog.tech DocumentRoot /data/pixelfed/pixelfed/public LogLevel debug \u0026lt;Directory /data/pixelfed/pixelfed/public\u0026gt; Options Indexes FollowSymLinks AllowOverride All Require all granted \u0026lt;/Directory\u0026gt; ErrorLog ${APACHE_LOG_DIR}/pixelfed.error.log CustomLog ${APACHE_LOG_DIR}/pixelfed.access.log combined \u0026lt;FilesMatch \\.php$\u0026gt; SetHandler \u0026#34;proxy:unix:/var/run/php.pixelfed.sock|fcgi://localhost\u0026#34; \u0026lt;/FilesMatch\u0026gt; \u0026lt;/VirtualHost\u0026gt; EOF I create a user and database, and finally download the Pixelfed sourcecode and install the composer tool:\npim@pixelfed:~$ sudo useradd pixelfed -m -d /data/pixelfed -s /bin/bash -r -c \u0026#34;Pixelfed User\u0026#34; pim@pixelfed:~$ sudo mysql CREATE DATABASE pixelfed; GRANT ALL ON pixelfed.* TO pixelfed@localhost IDENTIFIED BY \u0026#39;\u0026lt;redacted\u0026gt;\u0026#39;; exit pim@pixelfed:~$ wget -O composer-setup.php https://getcomposer.org/installer pim@pixelfed:~$ sudo php composer-setup.php pim@pixelfed:~$ sudo cp composer.phar /usr/local/bin/composer pim@pixelfed:~$ rm composer-setup.php pim@pixelfed:~$ sudo su pixelfed pixelfed@pixelfed:~$ git clone -b dev https://github.com/pixelfed/pixelfed.git pixelfed pixelfed@pixelfed:~$ cd pixelfed pixelfed@pixelfed:/data/pixelfed/pixelfed$ composer install --no-ansi --no-interaction --optimize-autoloader pixelfed@pixelfed:/data/pixelfed/pixelfed$ composer update With the basic installation of 
pacakges and dependencies all squared away, I\u0026rsquo;m ready to configure the instance:\npixelfed@pixelfed:/data/pixelfed/pixelfed$ vim .env APP_NAME=\u0026#34;uBlog Pixelfed\u0026#34; APP_URL=\u0026#34;https://pix.ublog.tech\u0026#34; APP_DOMAIN=\u0026#34;pix.ublog.tech\u0026#34; ADMIN_DOMAIN=\u0026#34;pix.ublog.tech\u0026#34; SESSION_DOMAIN=\u0026#34;pix.ublog.tech\u0026#34; TRUST_PROXIES=\u0026#34;*\u0026#34; # Database Configuration DB_CONNECTION=\u0026#34;mysql\u0026#34; DB_HOST=\u0026#34;127.0.0.1\u0026#34; DB_PORT=\u0026#34;3306\u0026#34; DB_DATABASE=\u0026#34;pixelfed\u0026#34; DB_USERNAME=\u0026#34;pixelfed\u0026#34; DB_PASSWORD=\u0026#34;\u0026lt;redacted\u0026gt;\u0026#34; MAIL_DRIVER=smtp MAIL_HOST=localhost MAIL_PORT=25 MAIL_FROM_ADDRESS=\u0026#34;pixelfed@ublog.tech\u0026#34; MAIL_FROM_NAME=\u0026#34;uBlog Pixelfed\u0026#34; pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan key:generate pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan storage:link pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan migrate --force pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan import:cities pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan instance:actor pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan passport:keys pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan route:cache pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan view:cache pixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan config:cache Pixelfed is based on [Laravel], a PHP framework for Web Artisans (which I guess now that I run both LibreNMS, IXPManager and Pixelfed, makes me one too?). Laravel has two runner types commonly used. One is task queuing via a module called Laravel Horizon, which uses Redis to store work items to be consumed by task workers:\npim@pixelfed:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /lib/systemd/system/pixelfed.service [Unit] Description=Pixelfed task queueing via Laravel Horizon After=network.target Requires=mariadb Requires=php-fpm Requires=redis Requires=apache [Service] Type=simple ExecStart=/usr/bin/php /data/pixelfed/pixelfed/artisan horizon User=pixelfed Restart=on-failure [Install] WantedBy=multi-user.target pim@pixelfed:~$ sudo systemctl enable --now pixelfed The other type of runner is periodic tasks, typically configured in a crontab, like so:\npim@pixelfed:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/cron.d/pixelfed * * * * * pixelfed /usr/bin/php /data/pixelfed/pixelfed/artisan schedule:run \u0026gt;\u0026gt; /dev/null 2\u0026gt;\u0026amp;1 EOF After running the schedule:run module once by hand, it exits cleanly, so I think this is good to go even though I\u0026rsquo;m not a huge fan of redirecting output to /dev/null like that.\nI will create one admin user on the commandline first:\npixelfed@pixelfed:/data/pixelfed/pixelfed$ php artisan user:create And now that everything is ready, I can put the icing on the cake by enabling and starting the Apache2 webserver:\npim@pixelfed:~$ sudo a2enmod rewrite proxy proxy_fcgi pim@pixelfed:~$ sudo a2ensite pixelfed pim@pixelfed:~$ sudo systemctl restart apache2 Finishing Touches File permissions After signing up, logging in and uploading my first post (which is of a BLT sandwich and a bowl of noodles, of course), I noticed that the permissions are overly strict, and the pictures I just uploaded are not visible. 
I noticed that the PHP FastCGI is running as user pixelfed while the webserver is running as user www-data, and the former is writing files with permissions rw------- and directories with rwx------, which doesn\u0026rsquo;t seem quite right to me, so I make a small edit in config/filesystems.php, changing the 0600 to 0644 and the 0700 to 0755, after which my post is visible.\nuBlog\u0026rsquo;s logo Although I do like the Pixelfed logo, I wanted to keep a ublog.tech branding, so I replaced the public/storage/headers/default.jpg with my own mountains-picture in roughly the same size. By the way, I took that picture in Grindelwald, Switzerland during a [serene moment] in which I discovered why tinkering with things like this is so important to my mental health.\nBackups Of course, since Ramón is a good friend, I would not want to lose his pictures. Data integrity and durability is important to me. It\u0026rsquo;s the one thing that typically the commercial vendors do really well, and my pride prohibits me from losing data due to things like \u0026ldquo;disk failure\u0026rdquo; or \u0026ldquo;computer broken\u0026rdquo; or \u0026ldquo;datacenter on fire\u0026rdquo;.\nTo honor this promise, I handle backups in three main ways: zrepl(1), borg(1) and mysqldump(1).\nVM Block Devices are running on the hypervisor\u0026rsquo;s ZFS on either the SSD pool, or the disk pool, or both. Using a tool called zrepl(1) (which I described a little bit in a [previous post]), I create a snapshot every 12hrs on the local blockdevice, and incrementally copy away those snapshots daily to the remote fileservers. pim@hvn0.ddln0:~$ sudo cat /etc/zrepl/zrepl.yaml jobs: - name: snap-libvirt type: snap filesystems: { \u0026#34;ssd-vol0/libvirt\u0026lt;\u0026#34;: true, \u0026#34;ssd-vol1/libvirt\u0026lt;\u0026#34;: true } snapshotting: type: periodic prefix: zrepl_ interval: 12h pruning: keep: - type: grid grid: 4x12h(keep=all) | 7x1d regex: \u0026#34;^zrepl_.*\u0026#34; - type: push name: \u0026#34;push-st0-chplo0\u0026#34; filesystems: { \u0026#34;ssd-vol0/libvirt\u0026lt;\u0026#34;: true, \u0026#34;ssd-vol1/libvirt\u0026lt;\u0026#34;: true } connect: type: ssh+stdinserver host: st0.chplo0.net.ipng.ch user: root port: 22 identity_file: /etc/zrepl/ssh/identity snapshotting: type: manual send: encrypted: false pruning: keep_sender: - type: not_replicated - type: last_n count: 10 regex: ^zrepl_.*$ # optional keep_receiver: - type: grid grid: 8x12h(keep=all) | 7x1d | 6x7d regex: \u0026#34;^zrepl_.*\u0026#34; Filesystem Backups make a daily copy of their entire VM filesystem using borgbackup(1) to a set of two remote fileservers. This way, the important file metadata, configs for the virtual machines, and so on, are all safely stored remotely. 
pim@pixelfed:~$ sudo mkdir -p /etc/borgmatic/ssh pim@pixelfed:~$ sudo ssh-keygen -t ecdsa -f /etc/borgmatic/ssh/identity -C root@pixelfed.net.ipng.ch pim@pixelfed:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/borgmatic/config.yaml location: source_directories: - / repositories: - u022eaebe661@st0.chbtl0.ipng.ch:borg/{fqdn} - u022eaebe661@st0.chplo0.ipng.ch:borg/{fqdn} exclude_patterns: - /proc - /sys - /dev - /run - /swap.img exclude_if_present: - .nobackup - .borgskip storage: encryption_passphrase: \u0026lt;redacted\u0026gt; ssh_command: \u0026#34;ssh -i /etc/borgmatic/identity -6\u0026#34; compression: lz4 umask: 0077 lock_wait: 5 retention: keep_daily: 7 keep_weekly: 4 keep_monthly: 6 consistency: checks: - repository - archives check_last: 3 output: color: false MySQL has a running binary log to recover from failures/restarts, but I also run a daily mysqldump(1) operation that dumps the database to the local filesystem, allowing for quick and painless recovery. As the dump is a regular file on the filesystem, it\u0026rsquo;ll be picked up by the filesystem backup every night as well, for long term and off-site safety. pim@pixelfed:~$ sudo zfs create data/mysql-backups pim@pixelfed:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/cron.d/bitcron 25 5 * * * root /usr/local/bin/bitcron mysql-backup.cron EOF For my friends at AS12859 [bit.nl], I still use bitcron(1) :-) For the rest of you \u0026ndash; bitcron is a little wrapper written in Bash that defines a few primitives such as logging, iteration, info/warning/error/fatals etc, and then runs whatever you define in a function called bitcron_main(), sending e-mail to an operator only if there are warnings or errors, and otherwise logging to /var/log/bitcron. The gist of the mysql-backup bitcron is this:\necho \u0026#34;Rotating the $DESTDIR directory\u0026#34; rotate 10 echo \u0026#34;Done (rotate)\u0026#34; echo \u0026#34;\u0026#34; echo \u0026#34;Creating $DESTDIR/0/ to store today\u0026#39;s backup\u0026#34; mkdir -p $DESTDIR/0 || fatal \u0026#34;Could not create $DESTDIR/0/\u0026#34; echo \u0026#34;Done (mkdir)\u0026#34; echo \u0026#34;\u0026#34; echo \u0026#34;Fetching databases\u0026#34; DBS=$(echo \u0026#39;show databases\u0026#39; | mysql -u$MYSQLUSER -p$MYSQLPASS | egrep -v \u0026#39;^Database\u0026#39;) echo \u0026#34;Done (fetching DBs)\u0026#34; echo \u0026#34;\u0026#34; echo \u0026#34;Backing up all databases\u0026#34; for DB in $DBS; do echo \u0026#34; * Database $DB\u0026#34; mysqldump --single-transaction -u$MYSQLUSER -p$MYSQLPASS -a $DB | gzip -9 -c \\ \u0026gt; $DESTDIR/0/mysqldump_$DB.gz \\ || warning \u0026#34;Could not dump database $DB\u0026#34; done echo \u0026#34;Done backing up all databases\u0026#34; echo \u0026#34;\u0026#34; What\u0026rsquo;s next Now that the server is up, and I have a small amount of users (mostly folks I know from the tech industry), I took some time to explore both the Fediverse, reach out to friends old and new, participate in a few random discussions possibly about food, datacenter pics and camping trips, as well as fiddle with the iOS and Android apps (for now, I\u0026rsquo;ve settled on Vernissage after switching my iPhone away from the horrible HEIC format which literally nobody supports). This is going to be fun :)\nNow, I think I\u0026rsquo;m ready to further productionize the experience. 
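One low-tech smoke test from the outside rounds this off: ask the NGINX frontends for the new site and look at the response headers. This is illustrative only, with the output elided:
pim@squanchy:~$ curl -sI https://pix.ublog.tech/ | egrep -i \u0026#39;^(http|x-ipng-frontend)\u0026#39;
Seeing a sensible status code together with the X-IPng-Frontend header tells me that the SSL offloading, the reversed proxy and the Apache/PHP backend behind it are all wired up correctly.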
It\u0026rsquo;s important to monitor these applications, so in an upcoming post I\u0026rsquo;ll be looking at how to do blackbox and whitebox monitoring on this instance.\nIf you\u0026rsquo;re looking for a home, feel free to sign up at https://pix.ublog.tech/ as I\u0026rsquo;m sure that having a bit more load / traffic on this instance will allow me to learn (and in turn, to share with others)! Of course, my Mastodon instance at https://ublog.tech/ is also happy to serve.\n","date":"2023-08-06","desc":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\n","permalink":"https://ipng.ch/s/articles/2023/08/06/pixelfed-part-1-installing/","section":"articles","title":"Pixelfed - Part 1 - Installing"},{"contents":" About this series Special Thanks: Adrian vifino Pistol for writing this code and for the wonderful collaboration!\nEver since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nIn the last three articles, I thought I had described \u0026ldquo;all we need to know\u0026rdquo; to perform MPLS using the Linux Controlplane in VPP:\nIn the [first article] of this series, I took a look at MPLS in general. In the [second article] of the series, I demonstrated a few special case labels (such as Explicit Null and Implicit Null which enables the fabled Penultimate Hop Popping behavior of MPLS. Then, in the [third article], I worked with @vifino to implement the plumbing for MPLS in the Linux Control Plane plugin for VPP. He did most of the work, I just watched :) As if in a state of premonition, I mentioned:\nCaveat empor, outside of a modest functional and load-test, this MPLS functionality hasn\u0026rsquo;t seen a lot of mileage as it\u0026rsquo;s only a few weeks old at this point, so it could definitely contain some rough edges. Use at your own risk, but if you did want to discuss issues, the [vpp-dev@] mailinglist is a good first stop.\nIntroduction As a reminder, the LAB we built is running VPP with a feature added to Linux Control Plane Plugin, which lets it consume MPLS routes and program the IPv4/IPv6 routing table as well as the MPLS forwarding table in VPP. At this point, we are running [Gerrit 38702, PatchSet 10].\nFirst, let me specify the problem statement: @vifino and I both noticed that sometimes, pinging from one VPP node to another worked fine, while SSHing did not. 
This article describes an issue I diagnosed, and provided a fix for, in the Linux Controlplane plugin implementation.\nClue 1: Intermittent ping My first finding is that our LAB machines run all the VPP plugins, notably the ping plugin, which means that VPP was responding to ping/ping6, and the Linux controlplane plugin sometimes did not receive any traffic, while other times it did receive the traffic, say a TCP/syn for port 22, and dutifully responded to that, but that syn/ack was never seen back.\nIf I were to disable the ping plugin, indeed pinging from seemingly random pairs of vpp0-[0123] no longer works, while pinging direct neighbors (eg. vpp0-0.e1 to vpp0-1.e0) consistently works well.\nClue 2: Corrupted MPLS packets Using the tap0-0 virtual machine, which sees a copy of all packets on the Open vSwitch underlay in our lab, I started tcpdumping and noticed two curious packets from time to time:\n09:22:55.349977 52:54:00:03:10:00 \u0026gt; 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0, ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63) version error: 4 != 6 09:23:00.357583 52:54:00:01:10:00 \u0026gt; 52:54:00:00:10:01, ethertype 802.1Q (0x8100), length 160: vlan 20, p 0, ethertype MPLS unicast (0x8847), MPLS (label 0 (IPv4 explicit NULL), tc 0, [S], ttl 61) IP6, wrong link-layer encapsulation (invalid) Looking at the payload of these broken packets, they are DNS packets coming from the vpp0-3 Linux Control Plane there, and they are being sent to either the IPv4 address of 192.168.10.4 or the IPv6 address of 2001:678:d78:201::ffff. Interestingly, these are the lab\u0026rsquo;s resolvers, so I think vpp0-3 is just trying to resolve something.\nClue 3: Vanishing MPLS packets As I mentioned, some source/destination pairs in the lab do not seem to pass traffic, while others are fine. One such case of packetlo is any traffic from vpp0-3 to the IPv4 address of vpp0-1.e0. The path from vpp0-3 to that IPv4 address should go out on vpp0-3.e0 and into vpp0-2.e1, but using tcpdump shows absolutely no such traffic at between vpp0-3 and vpp0-2, while I\u0026rsquo;d expect to see it on VLAN 22!\nDiagnosis Well, based on Clue 3, I take a look at what is happening on vpp0-3. I start by looking at the Linux controlplane view, where the route to lab looks like this:\nroot@vpp0-3:~$ ip route get 192.168.10.4 192.168.10.4/31 nhid 154 encap mpls 36 via 192.168.10.10 dev e0 proto ospf src 192.168.10.3 metric 20 root@vpp0-3:~$ tcpdump -evni e0 mpls 36 15:07:50.864605 52:54:00:03:10:00 \u0026gt; 52:54:00:02:10:01, ethertype MPLS unicast (0x8847), length 136: MPLS (label 36, tc 0, [S], ttl 64) (tos 0x0, ttl 64, id 15752, offset 0, flags [DF], proto UDP (17), length 118) 192.168.10.3.36954 \u0026gt; 192.168.10.4.53: 20950+ PTR? 1.9.0.0.0.0.0.0.0.0.0.0.0.0.0.0.3.0.0.0.8.7.d.0.8.7.6.0.1.0.0.2.ip6.arpa. (90) Yes indeed, Linux is sending an IPv4 DNS packet out on e0, so what am I seeing on the switch fabric? In the LAB diagram above, I can look up that traffic from vpp0-3 destined to vpp0-2 should show up on VLAN 22:\nroot@tap0-0:~$ tcpdump -evni enp16s0f0 -s 1500 -X vlan 22 and mpls 15:19:56.453521 52:54:00:03:10:00 \u0026gt; 52:54:00:02:10:01, ethertype 802.1Q (0x8100), length 140: vlan 22, p 0, ethertype MPLS unicast (0x8847), MPLS (label 2 (IPv6 explicit NULL), tc 0, [S], ttl 63) version error: 4 != 6 0x0000: 0000 213f 4500 0076 d17e 4000 4011 d3a0 ..!?E..v.~@.@... 0x0010: c0a8 0a03 c0a8 0a04 e139 0035 0062 0dde .........9.5.b.. 
0x0020: 079e 0100 0001 0000 0000 0000 0131 0139 .............1.9 0x0030: 0130 0130 0130 0130 0130 0130 0130 0130 .0.0.0.0.0.0.0.0 0x0040: 0130 0130 0130 0130 0130 0130 0133 0130 .0.0.0.0.0.0.3.0 0x0050: 0130 0130 0138 0137 0164 0130 0138 0137 .0.0.8.7.d.0.8.7 0x0060: 0136 0130 0131 0130 0130 0132 0369 7036 .6.0.1.0.0.2.ip6 0x0070: 0461 7270 6100 000c 0001 .arpa..... MPLS Corruption Ouch, that hurts my eyes! Linux sent an IPv4 packet into the TAP device carrying label value 36, so why is it being observed as an IPv6 Explicit Null with label value 2? That can\u0026rsquo;t be right. In an attempt to learn more, I ask VPP to give me a packet trace. I happen to remember that on the way from Linux to VPP, the virtio-input driver is used (while, on the way from the wire to VPP, I see dpdk-input is used).\nThe trace teaches me something really valuable:\nvpp0-3# trace add virtio-input 100 vpp0-3# show trace 00:03:27:192490: virtio-input virtio: hw_if_index 7 next-index 4 vring 0 len 136 hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1 00:03:27:192500: ethernet-input MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 00:03:27:192504: mpls-input MPLS: next mpls-lookup[1] label 36 ttl 64 exp 0 00:03:27:192506: mpls-lookup MPLS: next [6], lookup fib index 0, LB index 92 hash 0 label 36 eos 1 00:03:27:192510: mpls-label-imposition-pipe mpls-header:[ipv6-explicit-null:63:0:eos] 00:03:27:192512: mpls-output adj-idx 21 : mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847 flow hash: 0x00000000 00:03:27:192515: GigabitEthernet10/0/0-output GigabitEthernet10/0/0 flags 0x00180005 MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 label 2 exp 0, s 1, ttl 63 00:03:27:192517: GigabitEthernet10/0/0-tx GigabitEthernet10/0/0 tx queue 0 buffer 0x4c2ea1: current data 0, length 136, buffer-pool 0, ref-count 1, trace handle 0x7 l2-hdr-offset 0 l3-hdr-offset 14 PKT MBUF: port 65535, nb_segs 1, pkt_len 136 buf_len 2176, data_len 136, ol_flags 0x0, data_off 128, phys_addr 0x730ba8c0 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 label 2 exp 0, s 1, ttl 63 At this point, I think I\u0026rsquo;ve figured it out. I can see clearly that the MPLS packet is seen coming from Linux, and it has label value 36. But, it is then offered to graph node mpls-input, which does what it is designed to do, namely look up the label in the FIB:\nvpp0-3# show mpls fib 36 MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ] 36:neos/21 fib:0 index:88 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[50] locks:24 flags:shared, uPRF-list:38 len:1 itfs:[1, ] path:[66] pl-index:50 ip6 weight=1 pref=0 attached-nexthop: oper-flags:resolved, fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0 [@0]: ipv6 via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:4 flags:[] 52540002100152540003100086dd Extensions: path:66 labels:[[ipv6-explicit-null pipe ttl:0 exp:0]] forwarding: mpls-neos-chain [@0]: dpo-load-balance: [proto:mpls index:91 buckets:1 uRPF:38 to:[0:0]] [0] [@6]: mpls-label[@34]:[ipv6-explicit-null:64:0:neos] [@1]: mpls via fe80::5054:ff:fe02:1001 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254000210015254000310008847 Haha, I love it when the brain-ligutbulb goes to the on position. 
What\u0026rsquo;s happening is that when we turned on the MPLS feature on the VPP tap that is connected to e0, and VPP saw an MPLS packet, that it looked up in the MPLS FIB what to do with label 36, learning that it must SWAP it for IPv6 Explicit NULL (which is label value 2), and send it out on Gi10/0/0 to an IPv6 nexthop. Yeah, that\u0026rsquo;ll break all right.\nMPLS Drops OK, that explains the garbled packets, but what about the ones that I never even saw on the wire (Clue 3)? Well, now that I\u0026rsquo;ve enjoyed my lightbulb moment, I know exactly where to look. Consider the following route in Linux, which is sending out encapsulated with MPLS label value 37; and consider also what happens if mpls-input receives an MPLS frame with that value:\nroot@vpp0-3:~# ip ro get 192.168.10.6 192.168.10.6 encap mpls 37 via 192.168.10.10 dev e0 src 192.168.10.3 uid 0 vpp0-3# show mpls fib 37 MPLS-VRF:0, fib_index:0 locks:[interface:4, CLI:1, lcp-rt:1, ] .. that\u0026rsquo;s right, there IS no entry. As such, I would expect VPP to not know what to do with such a mislabeled packet, and drop it. Unsurprisingly at this point, here\u0026rsquo;s a nice proof:\n00:10:31:107882: virtio-input virtio: hw_if_index 7 next-index 4 vring 0 len 102 hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1 00:10:31:107891: ethernet-input MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 00:10:31:107897: mpls-input MPLS: next mpls-lookup[1] label 37 ttl 64 exp 0 00:10:31:107898: mpls-lookup MPLS: next [0], lookup fib index 0, LB index 22 hash 0 label 37 eos 1 00:10:31:107901: mpls-drop drop 00:10:31:107902: error-drop rx:tap1 00:10:31:107905: drop mpls-input: MPLS DROP DPO Conclusion: tadaa.wav. When VPP receives the MPLS packet from Linux, it has already been routed (encapsulated and put in an MPLS packet that\u0026rsquo;s meant to be sent to the next router), so it should be left alone. Instead, VPP is forcing the packet through the MPLS FIB, where if I\u0026rsquo;m lucky (and I\u0026rsquo;m not, clearly \u0026hellip;) the right thing happens. But, sometimes, the MPLS FIB has instructions that are different to what Linux had intended, bad things happen, and kittens get hurt. I can\u0026rsquo;t allow that to happen. I like kittens!\nFixing Linux CP + MPLS Now that I know what\u0026rsquo;s actually going on, the fix comes quickly into focus. Of course, when Linux sends an MPLS packet, VPP must not do a FIB lookup. Instead, it should emit the packet on the correct interface as-is. It sounds a little bit like re-arranging the directed graph that VPP uses internally. I\u0026rsquo;ve never done this before, but why not give it a go .. you know, for science :)\nVPP has a concept called feature arcs. These are codepoints where features can be inserted and turned on/off. There\u0026rsquo;s a feature arc for MPLS called mpls-input. I can create a graph node that does anything I\u0026rsquo;d like to the packets at this point, and what I want to do is take the packet and instead of offering it to the mpls-input node, just emit it on its egress interface using interface-output.\nFirst, I call VLIB_NODE_FN which defines a new node in VPP, and I call it lcp_xc_mpls(). I register this node with VLIB_REGISTER_NODE giving it the symbolic name linux-cp-xc-mpls which extends the existing code in this plugin for ARP and IPv4/IPv6 forwarding. 
Once the packet enters my new node, there are two possible places for it to go, defined by the next_nodes field:\nLCP_XC_MPLS_NEXT_DROP: If I can\u0026rsquo;t figure out where this packet is headed (there should be an existing adjacency for it), I will send it to error-drop where it will be discarded. LCP_XC_MPLS_NEXT_IO: If I do know, however, I ask VPP to send this packet simply to interface-output, where it will be marshalled onto the wire, unmodified. Taking this shortcut for MPLS packets avoids them being looked up in the FIB, and in hindsight this is no different to how IPv4 and IPv6 packets are also short-circuited: for those, ip4-lookup and ip6-lookup are also not called; instead lcp_xc_inline() does the business.\nI can inform VPP that my new node should be attached as a feature on the mpls-input arc, by calling VNET_FEATURE_INIT with it.\nImplementing the VPP node is a bit of fiddling - but I take inspiration from the existing function lcp_xc_inline() which does this for IPv4 and IPv6. Really, all I must do is two things:\nUsing the Linux Interface Pair (LIP) entry, figure out which physical interface corresponds to the TAP interface I just received the packet on, and then set the TX interface to that. Retrieve the ethernet adjacency based on the destination MAC address, and use it to set the correct L2 nexthop. If I don\u0026rsquo;t know which adjacency to use, set LCP_XC_MPLS_NEXT_DROP as the next node, otherwise set LCP_XC_MPLS_NEXT_IO. The finishing touch on the graph node is to make sure that it\u0026rsquo;s trace-aware. I use packet tracing a lot, as can be seen in this article as well, so I\u0026rsquo;ll detect if tracing for a given packet is turned on, and if so, tack on an lcp_xc_trace_t object, so traces will reveal my new node in use.\nOnce the node is ready, I have one final step. When constructing the Linux Interface Pair in lcp_itf_pair_add(), I will enable the newly created feature called linux-cp-xc-mpls on the mpls-input feature arc for the TAP interface, by calling vnet_feature_enable_disable(). Conversely, I\u0026rsquo;ll disable the feature when removing the LIP in lcp_itf_pair_del().\nResults After rebasing @vifino\u0026rsquo;s change, I add my code in [Gerrit 38702, PatchSet 11-14].
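To give a flavor of what that wiring looks like, here is a minimal sketch of the registration side of such a node. The node name linux-cp-xc-mpls, the next-node symbols, the mpls-input feature arc and the error-drop / interface-output targets are taken from the description above; everything else (identifiers like lcp_xc_mpls_node and lcp_xc_mpls_feature_set, and the omitted node function body) is illustrative and not a copy of the actual Gerrit change:

#include <vlib/vlib.h>
#include <vnet/vnet.h>
#include <vnet/feature/feature.h>

/* Possible dispositions for a packet leaving the node, as described above. */
typedef enum
{
  LCP_XC_MPLS_NEXT_DROP,
  LCP_XC_MPLS_NEXT_IO,
  LCP_XC_MPLS_N_NEXT,
} lcp_xc_mpls_next_t;

/* The node function (declared with VLIB_NODE_FN) is omitted in this sketch.
 * It looks up the LIP for the TAP the packet arrived on, sets the TX
 * interface to the paired physical interface, resolves the ethernet
 * adjacency for the destination MAC, and then selects LCP_XC_MPLS_NEXT_IO
 * or LCP_XC_MPLS_NEXT_DROP as the next node. */

VLIB_REGISTER_NODE (lcp_xc_mpls_node) = {
  .name = "linux-cp-xc-mpls",
  .vector_size = sizeof (u32),
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_next_nodes = LCP_XC_MPLS_N_NEXT,
  .next_nodes = {
    [LCP_XC_MPLS_NEXT_DROP] = "error-drop",
    [LCP_XC_MPLS_NEXT_IO] = "interface-output",
  },
};

/* Hang the node off the mpls-input feature arc. */
VNET_FEATURE_INIT (lcp_xc_mpls_feat, static) = {
  .arc_name = "mpls-input",
  .node_name = "linux-cp-xc-mpls",
};

/* Toggled per TAP interface when a LIP is created (enable=1) or deleted
 * (enable=0), so MPLS packets arriving from Linux skip the MPLS FIB lookup. */
static void
lcp_xc_mpls_feature_set (u32 sw_if_index, int enable)
{
  vnet_feature_enable_disable ("mpls-input", "linux-cp-xc-mpls",
                               sw_if_index, enable, NULL, 0);
}

The real change of course also carries the packet-handling function and the trace support described above; the Gerrit is the authoritative version.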
I think the simplest thing to show the effect of the change is by taking a look at these MPLS packets that come in from Linux Controlplane, and how they now get moved into linux-cp-xc-mpls instead of mpls-input before:\n00:04:12:846748: virtio-input virtio: hw_if_index 7 next-index 4 vring 0 len 102 hdr: flags 0x00 gso_type 0x00 hdr_len 0 gso_size 0 csum_start 0 csum_offset 0 num_buffers 1 00:04:12:846804: ethernet-input MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 00:04:12:846811: mpls-input MPLS: next BUG![3] label 37 ttl 64 exp 0 00:04:12:846812: linux-cp-xc-mpls lcp-xc: itf:1 adj:21 00:04:12:846844: GigabitEthernet10/0/0-output GigabitEthernet10/0/0 flags 0x00180005 MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 label 37 exp 0, s 1, ttl 64 00:04:12:846846: GigabitEthernet10/0/0-tx GigabitEthernet10/0/0 tx queue 0 buffer 0x4be948: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 l2-hdr-offset 0 l3-hdr-offset 14 PKT MBUF: port 65535, nb_segs 1, pkt_len 102 buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1f9a5280 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 MPLS: 52:54:00:03:10:00 -\u0026gt; 52:54:00:02:10:01 label 37 exp 0, s 1, ttl 64 The same is true for the original DNS packet with MPLS label 36 \u0026ndash; it just transmits out on Gi10/0/0 with the same label, which is dope! Indeed, no more garbled MPLS packets are seen, and the following simple acceptance test shows that all machines can reach all other machines on the LAB cluster with both IPv4 and IPv6:\nipng@vpp0-3:~$ fping -g 192.168.10.0 192.168.10.3 192.168.10.0 is alive 192.168.10.1 is alive 192.168.10.2 is alive 192.168.10.3 is alive ipng@vpp0-3:~$ fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3 2001:678:d78:200:: is alive 2001:678:d78:200::1 is alive 2001:678:d78:200::2 is alive 2001:678:d78:200::3 is alive My ping test here from vpp0-3 tries to ping (via the Linux controlplane) each of the other routers, including itself. It first does this with IPv4, and then with IPv6, showing that all eight possible destinations are alive. Progress, sweet sweet progress.\nI then expand that with this nice oneliner:\npim@lab:~$ for af in 4 6; do \\ for node in $(seq 0 3); do \\ ssh -$af ipng@vpp0-$node \u0026#34;fping -g 192.168.10.0 192.168.10.3; \\ fping6 2001:678:d78:200:: 2001:678:d78:200::1 2001:678:d78:200::2 2001:678:d78:200::3\u0026#34;; \\ done \\ done | grep -c alive 64 Explanation: Taking both IPv4 and iPv6, I log in to all four nodes (so in total I invoke SSH 8 times), and then perform both fping operations, and receive each time eight respondes, sixty-four in total. This checks out. I am very pleased with my work.\nWhat\u0026rsquo;s next I joined forces with @vifino who has effectively added MPLS handling to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR\u0026rsquo;s label distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)\nOur work is mostly complete, there\u0026rsquo;s two pending Gerrit\u0026rsquo;s which should be ready to review and certainly ready to play with:\n[Gerrit 38826]: This adds the ability to listen to internal state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the LIP interfaces and Linux sysctl for MPLS input. 
[Gerrit 38702/10]: This adds the ability to listen to Netlink messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6 and MPLS FIB in the VPP dataplane. [Gerrit 38702/14]: This Gerrit now also adds the ability to directly output MPLS packets from Linux out on the correct interface, without pulling it through the MPLS fib. Finally, a note from your friendly neighborhood developers: this code is brand-new and has had very limited peer-review from the VPP developer community. It adds a significant feature to the Linux Controlplane plugin, so make sure you both understand the semantics, the differences between Linux and VPP, and the overall implementation before attempting to use in production. We\u0026rsquo;re pretty sure we got at least some of this right, but testing and runtime experience will tell.\nI will be silently porting the change into my own copy of the Linux Controlplane called lcpng on [GitHub]. If you\u0026rsquo;d like to test this - reach out to the VPP Developer [mailinglist] any time!\n","date":"2023-05-28","desc":" About this series Special Thanks: Adrian vifino Pistol for writing this code and for the wonderful collaboration!\nEver since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\n","permalink":"https://ipng.ch/s/articles/2023/05/28/vpp-mpls-part-4/","section":"articles","title":"VPP MPLS - Part 4"},{"contents":" About this series Special Thanks: Adrian vifino Pistol for writing this code and for the wonderful collaboration!\nEver since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nIn the [first article] of this series, I took a look at MPLS in general, and how setting up static Label Switched Paths can be done in VPP. A few details on special case labels (such as Implicit Null which enabled the fabled Penultimate Hop Popping) were missing, so I took a good look at them in the [second article] of the series.\nThis was all just good fun but also allowed me to buy some time for @vifino who has been implementing MPLS handling within the Linux Control Plane plugin for VPP! This final article in the series shows the engineering considerations that went in to writing the plugin, which is currently under review but reasonably complete. Considering the VPP 23.06 cutoff is next week, I\u0026rsquo;m not super hopeful that we\u0026rsquo;ll be able to get a full community / committer review in time, but at this point both @vifino and I think this code is ready for consumption - considering FRR has a good Label Distribution Protocol daemon, I\u0026rsquo;ll switch out of my usual habitat of Bird and install a LAB with FRR.\nCaveat empor, outside of a modest functional and load-test, this MPLS functionality hasn\u0026rsquo;t seen a lot of mileage as it\u0026rsquo;s only a few weeks old at this point, so it could definitely contain some rough edges. 
Use at your own risk, but if you did want to discuss issues, the [vpp-dev@] mailinglist is a good first stop.\nIntroduction MPLS support is fairly complete in VPP already, but programming the dataplane would require custom integrations, while using the Linux netlink subsystem feels easier from an end-user point of view. This is a technical deep dive into the implementation of MPLS in the Linux Control Plane plugin for VPP. If you haven\u0026rsquo;t already, now is a good time to read up on the initial implementation of LCP:\n[Part 1]: Punting traffic through TUN/TAP interfaces into Linux [Part 2]: Mirroring VPP interface configuration into Linux [Part 3]: Automatically creating sub-interfaces in Linux [Part 4]: Synchronize link state, MTU and addresses to Linux [Part 5]: Netlink Listener, synchronizing state from Linux to VPP [Part 6]: Observability with LibreNMS and VPP SNMP Agent [Part 7]: Productionizing and reference Supermicro fleet at IPng To keep this writeup focused, I\u0026rsquo;ll assume the anatomy of VPP plugins and the Linux Controlplane Interface and Netlink plugins are understood. That way, I can focus on the changes needed for MPLS integration, which at first glance seem reasonably straight forward.\nVPP Linux-CP: Interfaces First off, to enable any MPLS forwarding at all in VPP, I have to create the MPLS forwarding table and enable MPLS on one or more interfaces:\nvpp# mpls table add 0 vpp# lcp create GigabitEthernet10/0/0 host-if e0 vpp# set int mpls GigabitEthernet10/0/0 enable What happens when the Gi10/0/0 interface has a Linux Interface Pair (LIP) is that there exists a corresponding TAP interface in the dataplane (typically called tapX) which in turn appears on the Linux side as e0. Linux will want to be able to send MPLS datagrams into e0, and for that, two things must happen:\nLinux kernel must enable MPLS input on e0, typically with a sysctl. VPP must enable MPLS on the TAP, in addition to the phy Gi10/0/0. Therefore, the first order of business is to create a hook where the Linux CP interface plugin can be made aware if MPLS is enabled or disabled in VPP - it turns out, such a callback function definition already exists, but it was never implemented. [Gerrit 38826] adds a function mpls_interface_state_change_add_callback(), which implements the ability to register a callback on MPLS on/off in VPP.\nNow that the callback plumbing exists, Linux CP will want to register one of these, so that it can set MPLS to the same enabled or disabled state on the Linux interface using /proc/sys/net/mpls/conf/${host-if}/input (which is the moral equivalent of running sysctl), and it\u0026rsquo;ll also call mpls_sw_interface_enable_disable() on the TAP interface. With these changes both implemented, enabling MPLS now looks like this in the logs:\nlinux-cp/mpls-sync: sync_state_cb: called for sw_if_index 1 linux-cp/mpls-sync: sync_state_cb: mpls enabled 1 parent itf-pair: [1] GigabitEthernet10/0/0 tap2 e0 97 type tap netns dataplane linux-cp/mpls-sync: sync_state_cb: called for sw_if_index 8 linux-cp/mpls-sync: sync_state_cb: set mpls input for e0 Take a look at the code that implements enable/disable semantics in src/plugins/linux-cp/lcp_mpls_sync.c.\nVPP Linux-CP: Netlink When Linux installs a route with MPLS labels, it will be seen in the return value of rtnl_route_nh_get_encap_mpls_dst(). 
One or more labels can now be read using nl_addr_get_binary_addr() yielding struct mpls_label, which contains the label value, experiment bits and TTL, and these can be added to the route path in VPP by casting them to struct fib_mpls_label_t. The last label in the stack will have the S-bit set, so we can continue consuming these until we find that condition. The first patchset that plays around with these semantics is [38702#2]. As you can see, MPLS is going to look very much like IPv4 and IPv6 route updates in [previous work], in that they take the Netlink representation, rewrite them into VPP representation, and update the FIB.\nUp until now, the Linux Controlplane netlink plugin understands only IPv4 and IPv6. So some preparation work is called for:\nlcp_router_proto_k2f() gains the ability to cast Linux AF_* into VPP\u0026rsquo;s FIB_PROTOCOL_*. lcp_router_route_mk_prefix() turns into a switch statement that creates a fib_prefix_t for the MPLS address family, in addition to the existing IPv4 and IPv6 types. It uses the non-EOS type. lcp_router_mpls_nladdr_to_path() implements the loop that I described above, taking the stack of struct mpls_label from Netlink and turning them into a vector of fib_mpls_label_t for the VPP FIB. lcp_router_route_path_parse() becomes aware of MPLS SWAP and POP operations (the latter being the case if there are 0 labels in the Netlink label stack). lcp_router_fib_route_path_dup() is a helper function to make a copy of the FIB path for the EOS and non-EOS VPP FIB inserts. The VPP FIB differentiates between entries that are non-EOS (S=0), and can treat them differently to those which are EOS (end of stack, S=1). Linux does not make this distinction, so it\u0026rsquo;s safest to just install non-EOS and EOS entries for each route from Linux. This is why lcp_router_fib_route_path_dup() exists; otherwise Netlink route deletions for the MPLS routes would yield a double free later on.\nThis prep work then allows the following two main functions to become MPLS aware:\nlcp_router_route_add(): when Linux sends a Netlink message about a new route, and that route carries MPLS labels, make a copy of the path for the EOS entry and proceed to insert both the non-EOS and newly created EOS entries into the FIB. lcp_router_route_del(): when Linux sends a Netlink message about a deleted route, we can remove both the EOS and non-EOS variants of the route from VPP\u0026rsquo;s FIB. VPP Linux-CP: MPLS with FRR I finally get to show off @vifino\u0026rsquo;s lab! It\u0026rsquo;s installed based off of a Debian Bookworm build, because there are a few Netlink library changes that haven\u0026rsquo;t made their way into Debian Bullseye yet. The LAB image is quickly built and distributed, and for this LAB I\u0026rsquo;m choosing [FRR] specifically because it ships with a Label Distribution Protocol daemon out of the box.\nFirst order of business is to enable MPLS on the correct interfaces, and create the MPLS FIB table. On each machine, I insert the following in the startup sequence:\nipng@vpp0-1:~$ cat \u0026lt;\u0026lt; EOF | tee -a /etc/vpp/config/manual.vpp mpls table add 0 set interface mpls GigabitEthernet10/0/0 enable set interface mpls GigabitEthernet10/0/1 enable EOF The lab comes with OSPF and OSPFv3 enabled on each of the Gi10/0/0 and Gi10/0/1 interfaces that go from East to West.
This extra sequence enables MPLS on those interfaces, and because they have a Linux Interface Pair (LIP), VPP will enable MPLS on the internal TAP interfaces, as well as set the Linux sysctl to allow the kernel to send MPLS encapsulated packets towards VPP.\nNext up, turning on LDP for FRR, which is easy enough:\nipng@vpp0-1:~$ vtysh vpp0-2# conf t vpp0-2(config)# mpls ldp router-id 192.168.10.1 dual-stack cisco-interop ordered-control ! address-family ipv4 discovery transport-address 192.168.10.1 label local advertise explicit-null interface e0 interface e1 exit-address-family ! address-family ipv6 discovery transport-address 2001:678:d78:200::1 label local advertise explicit-null ttl-security disable interface e0 interface e1 exit-address-family exit I configure LDP here to prefer advertising locally connected routes as MPLS Explicit NULL, which I described in detail in the [previous post]. It tells the penultimate router to send the router a packet as MPLS with label value 0,S=1 for IPv4 and value 2,S=1 for IPv6, so that VPP knows imediately to decapsulate the packet and continue to IPv4/IPv6 forwarding. An alternative here is setting implicit-null, which instructs the router before this one to perform Penultimate Hop Popping. If this is confusing, take a look at that article for reference!\nOtherwise, just giving each router a transport-address of a loopback interface, and a unique router-id, the same as used in OSPF and OSPFv3, and we\u0026rsquo;re off to the races. Just take a look at how easy this was:\nvpp0-1# show mpls ldp discovery AF ID Type Source Holdtime ipv4 192.168.10.0 Link e0 15 ipv4 192.168.10.2 Link e1 15 ipv6 192.168.10.0 Link e0 15 ipv6 192.168.10.2 Link e1 15 vpp0-1# show mpls ldp neighbor AF ID State Remote Address Uptime ipv6 192.168.10.0 OPERATIONAL 2001:678:d78:200:: 19:49:10 ipv6 192.168.10.2 OPERATIONAL 2001:678:d78:200::2 19:49:10 The first show ... discovery shows which interfaces are receiving multicast LDP Hello Packets, and because I enabled discovery for both IPv4 and IPv6, I can see two pairs there. If I look at which interfaces formed adjacencies, show ... neighbor reveals that LDP is preferring IPv6, and that both adjacencies to vpp0-0 and vpp0-2 are operational. 
Awesome sauce!\nI see LDP neighbor adjacencies, so let me show you what label information was actually exchanged, in three different places, FRR\u0026rsquo;s label distribution protocol daemon, Linux\u0026rsquo;s IPv4, IPv6 and MPLS routing tables, and VPP\u0026rsquo;s dataplane forwarding information base.\nMPLS: FRR view There are two things to note \u0026ndash; the IPv4 and IPv6 routing table, called a Forwarding Equivalent Class (FEC), and the MPLS forwarding table, called the MPLS FIB:\nvpp0-1# show mpls ldp binding AF Destination Nexthop Local Label Remote Label In Use ipv4 192.168.10.0/32 192.168.10.0 20 exp-null yes ipv4 192.168.10.1/32 0.0.0.0 exp-null - no ipv4 192.168.10.2/32 192.168.10.2 16 exp-null yes ipv4 192.168.10.3/32 192.168.10.2 33 33 yes ipv4 192.168.10.4/31 192.168.10.0 21 exp-null yes ipv4 192.168.10.6/31 192.168.10.0 exp-null exp-null no ipv4 192.168.10.8/31 192.168.10.2 exp-null exp-null no ipv4 192.168.10.10/31 192.168.10.2 17 exp-null yes ipv6 2001:678:d78:200::/128 192.168.10.0 18 exp-null yes ipv6 2001:678:d78:200::1/128 0.0.0.0 exp-null - no ipv6 2001:678:d78:200::2/128 192.168.10.2 31 exp-null yes ipv6 2001:678:d78:200::3/128 192.168.10.2 38 34 yes ipv6 2001:678:d78:210::/60 0.0.0.0 48 - no ipv6 2001:678:d78:210::/128 0.0.0.0 39 - no ipv6 2001:678:d78:210::1/128 0.0.0.0 40 - no ipv6 2001:678:d78:210::2/128 0.0.0.0 41 - no ipv6 2001:678:d78:210::3/128 0.0.0.0 42 - no vpp0-1# show mpls table Inbound Label Type Nexthop Outbound Label ------------------------------------------------------------------ 16 LDP 192.168.10.9 IPv4 Explicit Null 17 LDP 192.168.10.9 IPv4 Explicit Null 18 LDP fe80::5054:ff:fe00:1001 IPv6 Explicit Null 19 LDP fe80::5054:ff:fe00:1001 IPv6 Explicit Null 20 LDP 192.168.10.6 IPv4 Explicit Null 21 LDP 192.168.10.6 IPv4 Explicit Null 31 LDP fe80::5054:ff:fe02:1000 IPv6 Explicit Null 32 LDP fe80::5054:ff:fe02:1000 IPv6 Explicit Null 33 LDP 192.168.10.9 33 38 LDP fe80::5054:ff:fe02:1000 34 In the first table, each entry of the IPv4 and IPv6 routing table, as fed by OSPF and OSPFv3, will get a label associated with them. The negotiation of LDP will ask our peer to set a specific label, and it\u0026rsquo;ll inform the peer on which label we are intending to use for the Label Switched Path towards that destination. I\u0026rsquo;ll give two examples to illustrate how this table is used:\nThis router (vpp0-1) has a peer vpp0-0 and when this router wants to send traffic to it, it\u0026rsquo;ll be sent with exp-null (because it is the last router in the LSP), but when other routers might want to use this router to reach vpp0-0, they should use the MPLS label value 20. This router (vpp0-1) is not directly connected to vpp0-3 and as such, its IPv4 and IPv6 loopback addresses are going to contain labels in both directions: if vpp0-1 itself wants to send a packet to vpp0-3, it will use label value 33 and 38 respectively. However, if other routers want to use this router to reach vpp0-3, they should use the MPLS label value 33 and 34 respectively. The second table describes the MPLS Forwarding Information Base (FIB). 
When receiving an MPLS packet with an inbound label noted in this table, the operation applied is SWAP to the outbound label, and forward towards a nexthop \u0026ndash; this is the stuff that P-Routers use when transiting MPLS traffic.\nMPLS: Linux view FRR\u0026rsquo;s LDP daemon will offer both of these routing tables to the Linux kernel using Netlink messages, so the Linux view looks similar:\nroot@vpp0-1:~# ip ro 192.168.10.0 nhid 230 encap mpls 0 via 192.168.10.6 dev e0 proto ospf src 192.168.10.1 metric 20 192.168.10.2 nhid 226 encap mpls 0 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20 192.168.10.3 nhid 227 encap mpls 33 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20 192.168.10.4/31 nhid 230 encap mpls 0 via 192.168.10.6 dev e0 proto ospf src 192.168.10.1 metric 20 192.168.10.6/31 dev e0 proto kernel scope link src 192.168.10.7 192.168.10.8/31 dev e1 proto kernel scope link src 192.168.10.8 192.168.10.10/31 nhid 226 encap mpls 0 via 192.168.10.9 dev e1 proto ospf src 192.168.10.1 metric 20 root@vpp0-1:~# ip -6 ro 2001:678:d78:200:: nhid 231 encap mpls 2 via fe80::5054:ff:fe00:1001 dev e0 proto ospf src 2001:678:d78:200::1 metric 20 pref medium 2001:678:d78:200::1 dev loop0 proto kernel metric 256 pref medium 2001:678:d78:200::2 nhid 237 encap mpls 2 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium 2001:678:d78:200::3 nhid 239 encap mpls 34 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium 2001:678:d78:201::/112 nhid 231 encap mpls 2 via fe80::5054:ff:fe00:1001 dev e0 proto ospf src 2001:678:d78:200::1 metric 20 pref medium 2001:678:d78:201::1:0/112 dev e0 proto kernel metric 256 pref medium 2001:678:d78:201::2:0/112 dev e1 proto kernel metric 256 pref medium 2001:678:d78:201::3:0/112 nhid 237 encap mpls 2 via fe80::5054:ff:fe02:1000 dev e1 proto ospf src 2001:678:d78:200::1 metric 20 pref medium root@vpp0-1:~# ip -f mpls ro 16 as to 0 via inet 192.168.10.9 dev e1 proto ldp 17 as to 0 via inet 192.168.10.9 dev e1 proto ldp 18 as to 2 via inet6 fe80::5054:ff:fe00:1001 dev e0 proto ldp 19 as to 2 via inet6 fe80::5054:ff:fe00:1001 dev e0 proto ldp 20 as to 0 via inet 192.168.10.6 dev e0 proto ldp 21 as to 0 via inet 192.168.10.6 dev e0 proto ldp 31 as to 2 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp 32 as to 2 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp 33 as to 33 via inet 192.168.10.9 dev e1 proto ldp 38 as to 34 via inet6 fe80::5054:ff:fe02:1000 dev e1 proto ldp The first two tabled show a \u0026lsquo;regular\u0026rsquo; Linux routing table for IPv4 and IPv6 respectively, except there\u0026rsquo;s an encap mpls \u0026lt;X\u0026gt; added for all not-directly-connected prefixes. In this case, vpp0-1 connects on e0 to vpp0-0 to the West, and on interface e1 to vpp0-2 to the East. These connected routes do not carry MPLS information and in fact, this is how LDP can continue to work and exchange information naturally even when no LSPs are established yet.\nThe third table is the MPLS FIB, and it shows the special case of MPLS Explicit NULL clearly. All IPv4 routes for which this router is the penultimate hop carry the outbound label value 0,S=1, while the IPv6 routes carry the value 2,S=1. Booyah!\nMPLS: VPP view The FIB information in general is super densely populated in VPP. 
Rather than dumping the whole table, I\u0026rsquo;ll show one example, for 192.168.10.3 which we can see above will be encapsulated into an MPLS packet with label value 33,S=0 before being fowarded:\nroot@vpp0-1:~# vppctl show ip fib 192.168.10.3 ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] 192.168.10.3/32 fib:0 index:78 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[29] locks:6 flags:shared, uPRF-list:53 len:1 itfs:[2, ] path:[41] pl-index:29 ip4 weight=1 pref=20 attached-nexthop: oper-flags:resolved, 192.168.10.9 GigabitEthernet10/0/1 [@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800 Extensions: path:41 labels:[[33 pipe ttl:0 exp:0]] forwarding: unicast-ip4-chain [@0]: dpo-load-balance: [proto:ip4 index:81 buckets:1 uRPF:53 to:[2421:363846]] [0] [@13]: mpls-label[@4]:[33:64:0:eos] [@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847 The trick is looking at the Extensions, which shows the out-labels set to 33, with ttl=0 (which makes VPP copy the TTL from the IPv4 packet itself), and exp=0. It can then forward the packet as MPLS onto the nexthop at 192.168.10.9 (vpp0-2.e0 on Gi10/0/1).\nThe MPLS FIB is also a bit chatty, and shows a fundamental difference with Linux:\nroot@vpp0-1:~# vppctl show mpls fib 33 MPLS-VRF:0, fib_index:0 locks:[interface:6, CLI:1, lcp-rt:1, ] 33:neos/21 fib:0 index:37 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[57] locks:12 flags:shared, uPRF-list:21 len:1 itfs:[2, ] path:[81] pl-index:57 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, 192.168.10.9 GigabitEthernet10/0/1 [@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800 Extensions: path:81 labels:[[33 pipe ttl:0 exp:0]] forwarding: mpls-neos-chain [@0]: dpo-load-balance: [proto:mpls index:40 buckets:1 uRPF:21 to:[0:0]] [0] [@6]: mpls-label[@28]:[33:64:0:neos] [@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847 33:eos/21 fib:0 index:64 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[57] locks:12 flags:shared, uPRF-list:21 len:1 itfs:[2, ] path:[81] pl-index:57 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, 192.168.10.9 GigabitEthernet10/0/1 [@0]: ipv4 via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:7 flags:[] 5254000210005254000110010800 Extensions: path:81 labels:[[33 pipe ttl:0 exp:0]] forwarding: mpls-eos-chain [@0]: dpo-load-balance: [proto:mpls index:67 buckets:1 uRPF:21 to:[73347:10747680]] [0] [@6]: mpls-label[@29]:[33:64:0:eos] [@1]: mpls via 192.168.10.9 GigabitEthernet10/0/1: mtu:9000 next:3 flags:[] 5254000210005254000110018847 I note that there are two entries here \u0026ndash; I wrote about them above. The MPLS implementation in VPP allows for a different forwarding behavior in the case that the label inspected is the last one in the stack (S=1), which is the usual case called End of Stack (EOS). But, it also has a second entry which tells it what to do if S=0 or Not End of Stack (NEOS). 
Linux doesn\u0026rsquo;t make the destinction, so @vifino added two identical entries using that lcp_router_fib_route_path_dup() function.\nBut, what the entries themselves mean is that if this vpp0-1 router were to receive an MPLS packet with label value 33,S=1 (or value 33,S=0), it\u0026rsquo;ll perform a SWAP operation and put as new outbound label (the same) value 33, and forward the packet as MPLS onto 192.168.10.9 on Gi10/0/1.\nResults And with that, I think we achieved a running LDP with IPv4 and IPv6 and forwarding + encapsulation of MPLS with VPP. One cool wrapup I thought I\u0026rsquo;d leave you with, is showing how these MPLS routers are transparent with respect to IP traffic going through them. If I look at the diagram above, lab reaches vpp0-3 via three hops: first into vpp0-0 where it is wrapped into MPLS and forwarded to vpp0-1, and then through vpp0-2, which sets the Explicit NULL label and forwards again as MPLS onto vpp0-3, which does the IPv4 and IPv6 lookup.\nCheck this out:\npim@lab:~$ for node in $(seq 0 3); do traceroute -4 -q1 vpp0-$node; done traceroute to vpp0-0 (192.168.10.0), 30 hops max, 60 byte packets 1 vpp0-0.lab.ipng.ch (192.168.10.0) 1.907 ms traceroute to vpp0-1 (192.168.10.1), 30 hops max, 60 byte packets 1 vpp0-1.lab.ipng.ch (192.168.10.1) 2.460 ms traceroute to vpp0-1 (192.168.10.2), 30 hops max, 60 byte packets 1 vpp0-2.lab.ipng.ch (192.168.10.2) 3.860 ms traceroute to vpp0-1 (192.168.10.3), 30 hops max, 60 byte packets 1 vpp0-3.lab.ipng.ch (192.168.10.3) 4.414 ms pim@lab:~$ for node in $(seq 0 3); do traceroute -6 -q1 vpp0-$node; done traceroute to vpp0-0 (2001:678:d78:200::), 30 hops max, 80 byte packets 1 vpp0-0.lab.ipng.ch (2001:678:d78:200::) 3.037 ms traceroute to vpp0-1 (2001:678:d78:200::1), 30 hops max, 80 byte packets 1 vpp0-1.lab.ipng.ch (2001:678:d78:200::1) 5.125 ms traceroute to vpp0-1 (2001:678:d78:200::2), 30 hops max, 80 byte packets 1 vpp0-2.lab.ipng.ch (2001:678:d78:200::2) 7.135 ms traceroute to vpp0-1 (2001:678:d78:200::3), 30 hops max, 80 byte packets 1 vpp0-3.lab.ipng.ch (2001:678:d78:200::3) 8.763 ms With MPLS, each of these routers appears to the naked eye to be directly connected to the lab headend machine, but we know better! :)\nWhat\u0026rsquo;s next I joined forces with @vifino who has effectively added MPLS handling to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR\u0026rsquo;s label distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)\nOur work is mostly complete, there\u0026rsquo;s two pending Gerrit\u0026rsquo;s which should be ready to review and certainly ready to play with:\n[Gerrit 38826]: This adds the ability to listen to internal state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the LIP interfaces and Linux sysctl for MPLS input. [Gerrit 38702]: This adds the ability to listen to Netlink messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6 and MPLS FIB in the VPP dataplane. Finally, a note from your friendly neighborhood developers: this code is brand-new and has had very limited peer-review from the VPP developer community. It adds a significant feature to the Linux Controlplane plugin, so make sure you both understand the semantics, the differences between Linux and VPP, and the overall implementation before attempting to use in production. 
We\u0026rsquo;re pretty sure we got at least some of this right, but testing and runtime experience will tell.\nI will be silently porting the change into my own copy of the Linux Controlplane called lcpng on [GitHub]. If you\u0026rsquo;d like to test this - reach out to the VPP Developer [mailinglist] any time!\n","date":"2023-05-21","desc":" About this series Special Thanks: Adrian vifino Pistol for writing this code and for the wonderful collaboration!\nEver since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\n","permalink":"https://ipng.ch/s/articles/2023/05/21/vpp-mpls-part-3/","section":"articles","title":"VPP MPLS - Part 3"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nI\u0026rsquo;ve deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet, based on hardware/silicon based forwarding at line rate and high availability. You can read all about my Centec MPLS shenanigans in [this article].\nIn the last article, I explored VPP\u0026rsquo;s MPLS implementation a little bit. All the while, @vifino has been tinkering with the Linux Control Plane and adding MPLS support to it, and together we learned a lot about how VPP does MPLS forwarding and how it sometimes differs to other implementations. During the process, we talked a bit about implicit-null and explicit-null. When my buddy Fred read the [previous article], he also talked about a feature called penultimate-hop-popping which maybe deserves a bit more explanation. At the same time, I could not help but wonder what the performance is of VPP as a P-Router and PE-Router, compared to say IPv4 forwarding.\nLab Setup: VMs For this article, I\u0026rsquo;m going to boot up instance LAB1 with no changes (for posterity, using image vpp-proto-disk0@20230403-release), and it will be in the same state it was at the end of my previous [MPLS article]. To recap, there are four routers daisychained in a string, and they are called vpp1-0 through vpp1-3. I\u0026rsquo;ve then connected a Debian virtual machine on both sides of the string. host1-0.enp16s0f3 connects to vpp1-3.e2 and host1-1.enp16s0f0 connects to vpp1-0.e3. Finally, recall that all of the links between these routers and hosts can be inspected with the machine tap1-0 which is connected to a mirror port on the underlying Open vSwitch fabric. 
I bound some RFC1918 addresses on host1-0 and host1-1 and can ping between the machines, using the VPP routers as MPLS transport.\nMPLS: Simple LSP In this mode, I can plumb two label switched paths (LSPs), the first one westbound from vpp1-3 to vpp1-0, and it wraps the packet destined to 10.0.1.1 into an MPLS packet with a single label 100:\nvpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100 vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100 vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100 vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0 vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2 The second is eastbound from vpp1-0 to vpp1-3, and it is using MPLS label 103. Remember: LSPs are unidirectional!\nvpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103 vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103 vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103 vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0 vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0 With these two LSPs established, the ICMP echo request and subsequent ICMP echo reply can be seen traveling through the network entirely as MPLS:\nroot@tap1-0:~# tcpdump -c 10 -eni enp16s0f0 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes 14:41:07.526861 52:54:00:20:10:03 \u0026gt; 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33 p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.528103 52:54:00:13:10:00 \u0026gt; 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.529342 52:54:00:12:10:00 \u0026gt; 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.530421 52:54:00:11:10:00 \u0026gt; 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.531160 52:54:00:10:10:03 \u0026gt; 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40 p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.531455 52:54:00:21:10:00 \u0026gt; 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40 p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.532245 52:54:00:10:10:01 \u0026gt; 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20 p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.532732 52:54:00:11:10:01 \u0026gt; 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21 p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.533923 52:54:00:12:10:01 
\u0026gt; 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22 p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.535040 52:54:00:13:10:02 \u0026gt; 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33 p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 10 packets captured 10 packets received by filter When vpp1-0 receives the MPLS frame with label 100,S=1, it looks up in the FIB to figure out what operation to perform with this packet is to POP the label, revealing the inner payload, which it must look in the IPv4 FIB, and forward as per normal. This is a bit more expensive than it could be, and the folks who established MPLS protocols found a few clever ways to cut down on cost!\nMPLS: Wellknown Label Values I didn\u0026rsquo;t know this until I started tinkering with MPLS on VPP, and as an operator it\u0026rsquo;s easy to overlook these things. As it so turns out, there are a few MPLS label values that have a very specific meaning. Taking a read on [RFC3032], label values 0-15 are reserved and they each serve a specific purpose:\nValue 0: IPv4 Explicit NULL Label Value 1: Router Alert Label Value 2: IPv6 Explicit NULL Label Value 3: Implicit NULL Label There\u0026rsquo;s a few other label values, 4-15, and if you\u0026rsquo;re curious you could take a look at the [Iana List] for them. For my purposes, though, I\u0026rsquo;m only going to look at these weird little NULL labels. What do they do?\nMPLS: Explicit Null RFC3032 discusses the IPv4 explicit NULL label, value 0 (and the IPv6 variant with value 2):\nThis label value is only legal at the bottom of the label stack. It indicates that the label stack must be popped, and the forwarding of the packet must then be based on the IPv4 header.\nWhat this means in practice is that we can allow MPLS PE-Routers to take a little shortcut. If the MPLS label in the last hop is just telling the router to POP the label and take a look in its IPv4 forwarding table, I can also set the label to 0 in the router just preceding it. This way, when the last router sees label value 0, it knows already what to do, saving it one FIB lookup.\nI can reconfigure both LSPs to make use of this feature, by changing the MPLS FIB entries on vpp1-1 that points the LSP towards vpp1-0, removing what I configured before (mpls local-label del ...) 
and replacing that with an out-label value of 0 (mpls local-label add ...):\nvpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100 vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0 vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103 vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0 Due to this, the last routers in the LSP now already know what to do, so I can clean these up:\nvpp1-0# mpls local-label del 100 eos via ip4-lookup-in-table 0 vpp1-3# mpls local-label del 103 eos via ip4-lookup-in-table 0 If I ping from host1-0 to host1-1 again, I can see a subtle but important difference in the packets on the wire:\n17:49:23.770119 52:54:00:20:10:03 \u0026gt; 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64 17:49:23.770403 52:54:00:13:10:00 \u0026gt; 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64 17:49:23.771184 52:54:00:12:10:00 \u0026gt; 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64 17:49:23.772503 52:54:00:11:10:00 \u0026gt; 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0, ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64 17:49:23.773392 52:54:00:10:10:03 \u0026gt; 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 524, length 64 17:49:23.773602 52:54:00:21:10:00 \u0026gt; 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64 17:49:23.774592 52:54:00:10:10:01 \u0026gt; 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64 17:49:23.775804 52:54:00:11:10:01 \u0026gt; 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64 17:49:23.776973 52:54:00:12:10:01 \u0026gt; 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0, ethertype MPLS unicast (0x8847), MPLS (label 0, exp 0, [S], ttl 62) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64 17:49:23.778255 52:54:00:13:10:02 \u0026gt; 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 524, length 64 Did you spot it? :) If your eyes are spinning, don\u0026rsquo;t worry! I have configured the routers vpp1-1 towards vpp1-0 in vlan 20 to use IPv4 Explicit NULL (label 0). You can spot it on the fourth packet in the tcpdump above. On the way back, vpp1-2 towards vpp1-3 in vlan 22 also sets IPv4 Explicit NULL for the echo-reply. 
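As an aside, the reserved values and the way a label sits in the 32-bit MPLS shim header are easy to see in a tiny standalone program. This is purely illustrative plain C (not VPP code); the bit layout and the reserved values follow RFC 3032:

#include <stdint.h>
#include <stdio.h>

/* RFC 3032 reserved label values */
enum {
  MPLS_LABEL_IPV4_EXPLICIT_NULL = 0,
  MPLS_LABEL_ROUTER_ALERT       = 1,
  MPLS_LABEL_IPV6_EXPLICIT_NULL = 2,
  MPLS_LABEL_IMPLICIT_NULL      = 3, /* signalled only, never on the wire */
};

/* One 32-bit MPLS shim entry: label(20 bits) | exp(3) | S(1) | TTL(8) */
static uint32_t
mpls_encode (uint32_t label, uint8_t exp, int s, uint8_t ttl)
{
  return (label << 12) | ((uint32_t) (exp & 7) << 9)
       | ((uint32_t) (s & 1) << 8) | ttl;
}

int
main (void)
{
  /* vpp1-1 towards vpp1-0: IPv4 Explicit NULL, bottom of stack, ttl 62 */
  uint32_t shim = mpls_encode (MPLS_LABEL_IPV4_EXPLICIT_NULL, 0, 1, 62);
  printf ("shim 0x%08x: label=%u S=%u ttl=%u\n",
          shim, shim >> 12, (shim >> 8) & 1, shim & 0xff);
  return 0;
}

Encoding IPv4 Explicit NULL with S=1 and TTL 62 yields 0x0000013e, which matches the shim that vpp1-1 put on the wire in the capture above (label 0, [S], ttl 62).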
But, I do notice that end to end, the packet is still traversing the network entirely as MPLS packets. The optimization here is that vpp1-0 knows that label value 0 at the end of the label-stack just means \u0026lsquo;what follows is an IPv4 packet, route it.\u0026rsquo;.\nMPLS: Implicit Null Did that really help that much? I think I can answer the question by loadtesting, but first let me take a closer look at what RFC3032 has to say about the Implicit NULL Label:\nA value of 3 represents the \u0026ldquo;Implicit NULL Label\u0026rdquo;. This is a label that an LSR may assign and distribute, but which never actually appears in the encapsulation. When an LSR would otherwise replace the label at the top of the stack with a new label, but the new label is \u0026ldquo;Implicit NULL\u0026rdquo;, the LSR will pop the stack instead of doing the replacement. Although this value may never appear in the encapsulation, it needs to be specified in the Label Distribution Protocol, so a value is reserved.\nOh, groovy! What this tells me is that I can take one further shortcut: if I set the label value 0 (Explicit NULL IPv4), or 2 (Explicit NULL IPV6), my last router in the chain will know to look up the FIB entry automatically, saving one MPLS FIB lookup. But in this case, label value 3 (Implicit NULL) is telling the router to just unwrap the MPLS parts (it\u0026rsquo;s looking at them anyway!) and just forward the bare inner payload which is an IPv4 or IPv6 packet, directy onto the last router. This is what all the real geeks call Penultimate Hop Popping or PHP, none of that website programming language rubbish!\nLet me replace the FIB entries in the penultimate routers with this magic label value (3):\nvpp1-1# mpls local-label del 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 0 vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 3 vpp1-2# mpls local-label del 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 0 vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 3 Now I would expect this penultimate hop popping to yield an IPv4 packet between vpp1-1 and vpp1-0 on the ICMP echo-request, and as well an IPv4 packet between vpp1-2 and vpp1-3 on the ICMP echo-reply way back, and would you look at that:\n17:45:35.783214 52:54:00:20:10:03 \u0026gt; 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 17:45:35.783879 52:54:00:13:10:00 \u0026gt; 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22, p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 17:45:35.784222 52:54:00:12:10:00 \u0026gt; 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 17:45:35.785123 52:54:00:11:10:00 \u0026gt; 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 102: vlan 20, p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 17:45:35.785311 52:54:00:10:10:03 \u0026gt; 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 6172, seq 298, length 64 17:45:35.785533 52:54:00:21:10:00 \u0026gt; 52:54:00:10:10:03, 
ethertype 802.1Q (0x8100), length 102: vlan 40, p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 17:45:35.786465 52:54:00:10:10:01 \u0026gt; 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20, p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 17:45:35.787354 52:54:00:11:10:01 \u0026gt; 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21, p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 17:45:35.787575 52:54:00:12:10:01 \u0026gt; 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 102: vlan 22, p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 17:45:35.788320 52:54:00:13:10:02 \u0026gt; 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33, p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 6172, seq 298, length 64 I can now see that the behavior has changed in a subtle way once again. Where before, there were three MPLS packets all the way between vpp1-3 through vpp1-2 and vpp1-1 onto vpp1-0, now there are only two MPLS packets, and the last one (on the way out in VLAN 20, and on the way back in VLAN 22), is just an IPv4 packet. PHP is slick!\nLoadtesting Setup: Bare Metal In 1997, an Internet Engineering Task Force (IETF) working group created standards to help fix the issues of the time, mostly around internet traffic routing. MPLS was developed as an alternative to multilayer switching and IP over asynchronous transfer mode (ATM). In the 90s, routers were comparatively weak in terms of CPU, and things like content addressable memory to facilitate faster lookups, was incredibly expensive. Back then, every FIB lookup counted, so tricks like Penultimate Hop Popping really helped. But what about now? I\u0026rsquo;m reasonably confident that any silicon based router would not mind to have one extra MPLS FIB operation, and equally would not mind to unwrap the MPLS packet at the end. But, since these things exist, I thought it would be a fun activity to see how much they would help in the VPP world, where just like in the old days, every operation performed on a packet does cost valuable CPU cycles.\nI can\u0026rsquo;t really perform a loadtest on the virtual machines backed by Open vSwitch, while tightly packing six machines on one hypervisor. That setup is made specifically to do functional testing and development work. To do a proper loadtest, I will need bare metal. So, I grabbed three Supermicro SYS-5018D-FN8T, which I\u0026rsquo;m running throughout [AS8298], as I know their performance quite well. I\u0026rsquo;ll take three of these, and daisychain them with TenGig ports. This way, I can take a look at the cost of P-Routers (which only SWAP MPLS labels and forward the result), as well as PE-Routers (which have to encapsulate, and sometimes decapsulate the IP or Ethernet traffic).\nThese machines get a fresh Debian Bookworm install and VPP 23.06 without any plugins. It\u0026rsquo;s weird for me to run a VPP instance without Linux CP, but in this case I\u0026rsquo;m going completely vanilla, so I disable all plugins and give each VPP machine one worker thread. The install follows my popular [VPP-7]. By the way did you know that you can just type the search query [VPP-7] directly into Google to find this article. 
Am I an influencer now? Jokes aside, I decide to call the bare metal machines France, Belgium and Netherlands. And because if it ain\u0026rsquo;t dutch, it ain\u0026rsquo;t much, the Netherlands machine sits on top :)\nIPv4 forwarding performance The way Cisco T-Rex works in its simplest stateless loadtesting mode is that it reads a Scapy file, for example bench.py, and it then generates a stream of traffic from its first port, through the device under test (DUT), and expects to see that traffic returned on its second port. In a bidirectional mode, traffic is sent from 16.0.0.0/8 to 48.0.0.0/8 in one direction, and back from 48.0.0.0/8 to 16.0.0.0/8 in the other.\nOK so first things first, let me configure a basic skeleton, taking Netherlands as an example:\nnetherlands# set interface ip address TenGigabitEthernet6/0/1 192.168.13.7/31 netherlands# set interface ip address TenGigabitEthernet6/0/1 2001:678:d78:230::2:2/112 netherlands# set interface state TenGigabitEthernet6/0/1 up netherlands# ip route add 100.64.0.0/30 via 192.168.13.6 netherlands# ip route add 192.168.13.4/31 via 192.168.13.6 netherlands# set interface ip address TenGigabitEthernet6/0/0 100.64.1.2/30 netherlands# set interface state TenGigabitEthernet6/0/0 up netherlands# ip nei TenGigabitEthernet6/0/0 100.64.1.1 9c:69:b4:61:ff:40 static netherlands# ip route add 16.0.0.0/8 via 100.64.1.1 netherlands# ip route add 48.0.0.0/8 via 192.168.13.6 The Belgium router just has static routes back and forth, and the France router looks similar except it has its static routes all pointing in the other direction, and of course it has different /31 transit networks towards T-Rex and Belgium (a sketch of the Belgium side follows below). The one thing that is a bit curious is the use of a static ARP entry that allows the VPP routers to resolve the nexthop for T-Rex \u0026ndash; in the case above, T-Rex is sourcing from 100.64.1.1/30 (which has MAC address 9c:69:b4:61:ff:40) and sending to our 100.64.1.2 on Te6/0/0.\nAfter fiddling around a little bit with imix, I do notice the machine is still keeping up with one CPU thread in both directions (~6.5Mpps). So I switch to 64b packets and ramp up traffic until that one VPP worker thread is saturated, which is around the 9.2Mpps mark, so I lower it slightly to a cool 9Mpps. Note: this CPU can have 3 worker threads in production, so it can do roughly 27Mpps per router, which is way cool!\nThe machines are at this point all doing exactly the same: receive ethernet from DPDK, do an IPv4 lookup, rewrite the header, and emit the frame on another interface.
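For completeness, the Belgium side of that skeleton would look roughly like this. The interface roles follow from the MPLS configuration further down (Te6/0/0 towards Netherlands, Te6/0/1 towards France), but take this as a sketch rather than a literal transcript:
belgium# set interface ip address TenGigabitEthernet6/0/0 192.168.13.6/31
belgium# set interface ip address TenGigabitEthernet6/0/1 192.168.13.5/31
belgium# set interface state TenGigabitEthernet6/0/0 up
belgium# set interface state TenGigabitEthernet6/0/1 up
belgium# ip route add 16.0.0.0/8 via 192.168.13.7
belgium# ip route add 48.0.0.0/8 via 192.168.13.4
belgium# ip route add 100.64.0.0/30 via 192.168.13.4
belgium# ip route add 100.64.1.0/30 via 192.168.13.7
Either way, all three routers end up doing the same simple per-packet work for this baseline.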
I can see that clearly in the runtime statistics, taking a look at Belgium for example:\nbelgium# show run Thread 1 vpp_wk_0 (lcore 1) Time 7912.6, 10 sec internal node vector rate 207.47 loops/sec 20604.47 vector rates in 8.9997e6, out 9.0054e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/0-output active 172120948 35740749991 0 6.47e0 207.65 TenGigabitEthernet6/0/0-tx active 171687877 35650752635 0 8.49e1 207.65 TenGigabitEthernet6/0/1-output active 172119849 35740963315 0 7.79e0 207.65 TenGigabitEthernet6/0/1-tx active 171471125 35605967085 0 8.48e1 207.65 dpdk-input polling 171588827 71211720238 0 4.87e1 415.01 ethernet-input active 344675998 71571710136 0 2.16e1 207.65 ip4-input-no-checksum active 343340278 71751697912 0 1.86e1 208.98 ip4-load-balance active 342929714 71661706997 0 1.44e1 208.97 ip4-lookup active 341632798 71391716172 0 2.28e1 208.97 ip4-rewrite active 342498637 71571712383 0 2.59e1 208.97 Looking at the time spent for one individual packet, it\u0026rsquo;s about 245 CPU cycles, and considering the cores on this Xeon D1518 run at 2.2GHz, that checks out very accurately: 2.2e9 / 245 = 9Mpps! Every time that DPDK is asked for some work, it yields on average a vector of 208 packets \u0026ndash; and this is why VPP is so super fast: the first packet may need to page in the instructions belonging to one of the graph nodes, but the second through 208th packet will find almost 100% hitrate in the CPU\u0026rsquo;s instruction cache. Who needs RAM anyway?\nMPLS forwarding performance Now that I have a baseline, I can take a look at the difference between the IPv4 path and the MPLS path, and here\u0026rsquo;s where the routers will start to behave differently. France and Netherlands will be PE-Routers and handle encapsulation/decapsulation, while Belgium has a comparatively easy job, as it will only handle MPLS forwarding. I\u0026rsquo;ll choose country-codes for the labels, that which is destined to France will have MPLS label 33,S=1; while that which goes to Netherlands will have MPLS label 31,S=1.\nnetherlands# ip ro del 48.0.0.0/8 via 192.168.13.6 netherlands# ip ro add 48.0.0.0/8 via 192.168.13.6 TenGigabitEthernet6/0/1 out-labels 33 netherlands# mpls local-label add 31 eos via ip4-lookup-in-table 0 belgium# ip route del 48.0.0.0/8 via 192.168.13.4 belgium# ip route del 16.0.0.0/8 via 192.168.13.7 belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33 belgium# mpls local-label add 31 eos via 192.168.13.7 TenGigabitEthernet6/0/0 out-labels 31 france# ip route del 16.0.0.0/8 via 192.168.13.5 france# ip route add 16.0.0.0/8 via 192.168.13.5 TenGigabitEthernet6/0/1 out-labels 31 france# mpls local-label add 33 eos via ip4-lookup-in-table 0 The types of operation in MPLS is no longer symmetric. On the way in, the PE-Router has to encapsulate the IPv4 packet into an MPLS packet, and on the way out, the PE-Router has to decapsulate the MPLS packet to reveal the IPv4 packet. So, I change the loadtester to be unidirectional, and ask it to send 10Mpps from Netherlands to France. As soon as I reconfigure the routers in this mode, I see quite a bit of packetlo, as only 7.3Mpps make it through. Interesting! 
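Quick arithmetic before digging in: the loadtester offers 10Mpps and only 7.3Mpps come out the other side, so roughly 27% of the packets are being dropped somewhere along the chain.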
I wonder where this traffic is dropped, and what the bottleneck is, precisely.\nMPLS: PE Ingress Performance First, let\u0026rsquo;s take a look at Netherlands, to try to understand why it is more expensive:\nnetherlands# show run Time 255.5, 10 sec internal node vector rate 256.00 loops/sec 29399.92 vector rates in 7.6937e6, out 7.6937e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/1-output active 7978541 2042505472 0 7.28e0 255.99 TenGigabitEthernet6/0/1-tx active 7678013 1965570304 0 8.25e1 255.99 dpdk-input polling 7684444 1965570304 0 4.55e1 255.79 ethernet-input active 7978549 2042507520 0 1.94e1 255.99 ip4-input-no-checksum active 7978557 2042509568 0 1.75e1 255.99 ip4-lookup active 7678013 1965570304 0 2.17e1 255.99 ip4-mpls-label-imposition-pipe active 7678013 1965570304 0 2.42e1 255.99 mpls-output active 7678013 1965570304 0 6.71e1 255.99 Each packet gets from dpdk-input into ethernet-input, the resulting IPv4 packet visits ip4-lookup FIB where the MPLS out-label is found in the IPv4 FIB, the packet is then wrapped into an MPLS packet in ip4-mpls-label-imposition-pipe and then sent through mpls-output to the NIC. In total the input path (ip4-* plus mpls-*) takes 131 CPU cycles for each packet. Including all the nodes, from DPDK input to DPDK output sums up to 285 cycles, so 2.2GHz/285 = 7.69Mpps which checks out.\nMPLS: P Transit Performance I would expect that Belgium has it easier, as it\u0026rsquo;s only doing label swapping and MPLS forwarding.\nbelgium# show run Thread 1 vpp_wk_0 (lcore 1) Time 595.6, 10 sec internal node vector rate 47.68 loops/sec 224464.40 vector rates in 7.6930e6, out 7.6930e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/1-output active 97711093 4659109793 0 8.83e0 47.68 TenGigabitEthernet6/0/1-tx active 96096377 4582172229 0 8.14e1 47.68 dpdk-input polling 161102959 4582172278 0 5.72e1 28.44 ethernet-input active 97710991 4659111684 0 2.45e1 47.68 mpls-input active 97709468 4659096718 0 2.25e1 47.68 mpls-label-imposition-pipe active 99324916 4736048227 0 2.52e1 47.68 mpls-lookup active 99324903 4736045943 0 3.25e1 47.68 mpls-output active 97710989 4659111742 0 3.04e1 47.68 Indeed, Belgium can still breathe, it\u0026rsquo;s spending 110 Cycles per packet doing the MPLS switching (mpls-*), which is 18% less than the PE-Router ingress. 
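To make the cycle accounting explicit, those per-packet numbers are just the sum of the Clocks column over the relevant graph nodes in the two outputs above:
Netherlands, PE ingress (ip4-* plus mpls-*): 17.5 + 21.7 + 24.2 + 67.1 = 130.5, the roughly 131 cycles quoted above.
Belgium, P transit (mpls-*): 22.5 + 25.2 + 32.5 + 30.4 = 110.6, the roughly 110 cycles quoted above.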
Judging by the vectors/Call (last column), it\u0026rsquo;s also running a bit cooler than the ingress router.\nIt\u0026rsquo;s nice to see that the claim that P-Routers are cheaper on the CPU can be verified to be true in practice!\nMPLS: PE Egress Performance On to the last router, France, which is in charge of decapsulating the MPLS packet and doing the resulting IPv4 lookup:\nfrance# show run Thread 1 vpp_wk_0 (lcore 1) Time 1067.2, 10 sec internal node vector rate 256.00 loops/sec 27986.96 vector rates in 7.3234e6, out 7.3234e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/0-output active 30528978 7815395072 0 6.59e0 255.99 TenGigabitEthernet6/0/0-tx active 30528978 7815395072 0 8.20e1 255.99 dpdk-input polling 30534880 7815395072 0 4.68e1 255.95 ethernet-input active 30528978 7815395072 0 1.97e1 255.99 ip4-load-balance active 30528978 7815395072 0 1.35e1 255.99 ip4-mpls-label-disposition-pip active 30528978 7815395072 0 2.82e1 255.99 ip4-rewrite active 30528978 7815395072 0 2.48e1 255.99 lookup-ip4-dst active 30815069 7888634368 0 3.09e1 255.99 mpls-input active 30528978 7815395072 0 1.86e1 255.99 mpls-lookup active 30528978 7815395072 0 2.85e1 255.99 This router is spending its time (in *ip4* and mpls-*) roughly at roughly 144.5 Cycles per packet and reveals itself as the bottleneck. Netherlands sent Belgium 7.69Mpps which it all forwarded to France, where only 7.3Mpps make it through this PE-Router egress, and into the hands of T-Rex. In total, this router is spending 298 cycles/packet, which amounts to 7.37Mpps.\nMPLS Explicit Null performance At the beginning of this article, I made a claim that we could take some shortcuts, and now is a good time to see if those short cuts are worthwhile in the VPP setting. I\u0026rsquo;ll reconfigure the Belgium router to set the IPv4 Explicit NULL label (0), which can help my poor overloaded France router save some valuable CPU cycles.\nbelgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 33 belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0 The situation for Belgium doesn\u0026rsquo;t change at all, it\u0026rsquo;s still doing the SWAP operation on the incoming packet, but it\u0026rsquo;s writing label 0,S=1 now (instead of label 33,S=1 before). But, haha!, take a look at France for an important difference:\nfrance# show run Thread 1 vpp_wk_0 (lcore 1) Time 53.3, 10 sec internal node vector rate 85.35 loops/sec 77643.80 vector rates in 7.6933e6, out 7.6933e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/0-output active 4773870 409847372 0 6.96e0 85.85 TenGigabitEthernet6/0/0-tx active 4773870 409847372 0 8.07e1 85.85 dpdk-input polling 4865704 409847372 0 5.01e1 84.23 ethernet-input active 4773870 409847372 0 2.15e1 85.85 ip4-load-balance active 4773869 409847235 0 1.51e1 85.85 ip4-rewrite active 4773870 409847372 0 2.60e1 85.85 lookup-ip4-dst-itf active 4773870 409847372 0 3.41e1 85.85 mpls-input active 4773870 409847372 0 1.99e1 85.85 mpls-lookup active 4773870 409847372 0 3.01e1 85.85 First off, I notice the input vector rates match the output vector rates, both at 7.69Mpps, and that the average Vectors/Call is no longer pegged at 256. 
The router is now spending 125 Cycles per packet which is a lot better than it was before (15.4% better than 144.5 Cycles/packet).\nConclusion: MPLS Explicit NULL is cheaper!\nMPLS Implicit Null (PHP) performance So there\u0026rsquo;s one mode of operation left for me to play with. What if we asked Belgium to unwrap the MPLS packet and forward it as an IPv4 packet towards France, in other words apply Penultimate Hop Popping? Of course, the ingress Netherlands won\u0026rsquo;t change at all, but I reconfigure the Belgium router, like so:\nbelgium# mpls local-label del 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 0 belgium# mpls local-label add 33 eos via 192.168.13.4 TenGigabitEthernet6/0/1 out-labels 3 The situation in Belgium now looks subtly different:\nbelgium# show run Thread 1 vpp_wk_0 (lcore 1) Time 171.1, 10 sec internal node vector rate 50.64 loops/sec 188552.87 vector rates in 7.6966e6, out 7.6966e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/1-output active 26128425 1316828499 0 8.74e0 50.39 TenGigabitEthernet6/0/1-tx active 26128424 1316828327 0 8.16e1 50.39 dpdk-input polling 39339977 1316828499 0 5.58e1 33.47 ethernet-input active 26128425 1316828499 0 2.39e1 50.39 ip4-mpls-label-disposition-pip active 26128425 1316828499 0 3.07e1 50.39 ip4-rewrite active 27648864 1393790359 0 2.82e1 50.41 mpls-input active 26128425 1316828499 0 2.21e1 50.39 mpls-lookup active 26128422 1316828355 0 3.16e1 50.39 After doing the mpls-lookup, this router finds that it can just toss the label and forward the packet as IPv4 down south. Cost for Belgium: 113 Cycles per packet.\nFrance is now not participating in MPLS at all - it is simply receiving IPv4 packets which it has to route back towards T-Rex. I take one final look at France to see where it\u0026rsquo;s spending its time:\nfrance# show run Thread 1 vpp_wk_0 (lcore 1) Time 397.3, 10 sec internal node vector rate 42.17 loops/sec 259634.88 vector rates in 7.7112e6, out 7.6964e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call TenGigabitEthernet6/0/0-output active 74381543 3211443520 0 9.47e0 43.18 TenGigabitEthernet6/0/0-tx active 70820630 3057504872 0 8.26e1 43.17 dpdk-input polling 131873061 3063377312 0 6.09e1 23.23 ethernet-input active 72645873 3134461107 0 2.66e1 43.15 ip4-input-no-checksum active 70820629 3057504812 0 2.68e1 43.17 ip4-load-balance active 72646140 3134473660 0 1.74e1 43.15 ip4-lookup active 70820628 3057504796 0 2.79e1 43.17 ip4-rewrite active 70820631 3057504924 0 2.96e1 43.17 As an IPv4 router, France spends in total 102 Cycles per packet. This matches very closely with the 104 cycles/packet I found when doing my baseline loadtest with only IPv4 routing. I love it when numbers align!!\nScaling One thing that I was curious to know, is if MPLS packets would allow for multiple receive queues, to enable horizontal scaling by adding more VPP worker threads. The answer is a resounding YES! If I restart the VPP routers Netherlands, Belgium and France with three workers and set DPDK num-rx-queues to 3 as well, I see perfect linear scaling, in other words these little routers would be able to forward roughly 27Mpps of MPLS packets with varying inner payloads (be it IPv4 or IPv6 or Ethernet traffic with differing src/dest MAC addresses). All things said, IPv4 is still a little bit cheaper on the CPU, at least on these routers with only a very small routing table. 
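For reference, the knobs that turn on this horizontal scaling live in VPP's startup.conf; a minimal sketch (the core numbers here are illustrative, not necessarily the exact ones I used):
cpu {
  main-core 0
  corelist-workers 1-3
}
dpdk {
  dev default {
    num-rx-queues 3
  }
}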
But, it\u0026rsquo;s great to see that MPLS forwarding can leverage RSS.\nConclusions This is all fine and dandy, but I think it\u0026rsquo;s a bit trickier to see if PHP is actually cheaper or not. To answer this question, I think I should count the total amount of CPU cycles spent end to end: for a packet traveling from T-Rex coming into Netherlands, through Belgium and France, and back out to T-Rex.\nMode | Netherlands | Belgium | France | Total Cost\nRegular IPv4 path | 104 cycles | 104 cycles | 104 cycles | 312 cycles\nMPLS: Simple LSP | 131 cycles | 110 cycles | 145 cycles | 386 cycles\nMPLS: Explicit NULL LSP | 131 cycles | 110 cycles | 125 cycles | 366 cycles\nMPLS: Penultimate Hop Pop | 131 cycles | 113 cycles | 102 cycles | 346 cycles\nNote: The clock cycle numbers here cover only the *mpls* and *ip4* nodes, excluding the *-input, *-output and *-tx nodes, as they will add the same cost for all modes of operation.\nI threw a lot of numbers into this article, and my head is spinning as I write this. But I still think I can wrap it up in a way that allows me to have a few high level takeaways:\nIPv4 forwarding is a fair bit cheaper than MPLS forwarding (with an empty FIB, anyway). I had not expected this! End to end, the MPLS bottleneck is in the PE-Ingress operation. Explicit NULL helps without any drawbacks, as it cuts off one MPLS FIB lookup in the PE-Egress operation. Implicit NULL (aka Penultimate Hop Popping) is the fastest way to do MPLS with VPP, all things considered. What\u0026rsquo;s next I joined forces with @vifino who has effectively added MPLS handling to the Linux Control Plane, so VPP can start to function as an MPLS router using FRR\u0026rsquo;s label distribution protocol implementation. Gosh, I wish Bird3 would have LDP :)\nOur work is mostly complete, there are two pending Gerrits which should be ready to review and certainly ready to play with:\n[Gerrit 38826]: This adds the ability to listen to internal state changes of an interface, so that the Linux Control Plane plugin can enable MPLS on the LIP interfaces and Linux sysctl for MPLS input.\n[Gerrit 38702]: This adds the ability to listen to Netlink messages in the Linux Control Plane plugin, and sensibly apply these routes to the IPv4, IPv6 and MPLS FIB in the VPP dataplane.\nIf you\u0026rsquo;d like to test this - reach out to the VPP Developer mailinglist [ref] any time!\n","date":"2023-05-17","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nI\u0026rsquo;ve deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet, based on hardware/silicon based forwarding at line rate and high availability. You can read all about my Centec MPLS shenanigans in [this article].\n","permalink":"https://ipng.ch/s/articles/2023/05/17/vpp-mpls-part-2/","section":"articles","title":"VPP MPLS - Part 2"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility.
For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nI\u0026rsquo;ve deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet, based on hardware/silicon based forwarding at line rate and high availability. You can read all about my Centec MPLS shenanigans in [this article].\nEver since the release of the Linux Control Plane [ref] plugin in VPP, folks have asked \u0026ldquo;What about MPLS?\u0026rdquo; \u0026ndash; I have never really felt the need to go this rabbit hole, because I figured that in this day and age, higher level IP protocols that do tunneling are just as performant, and a little bit less of an \u0026lsquo;art\u0026rsquo; to get right. For example, the Centec switches I deployed perform VxLAN, GENEVE and GRE all at line rate in silicon. And in an earlier article, I showed that the performance of VPP in these tunneling protocols is actually pretty good. Take a look at my [VPP L2 article] for context.\nYou might ask yourself: Then why bother? To which I would respond: if you have to ask that question, clearly you don\u0026rsquo;t know me :) This article will form a deep dive into MPLS as implemented by VPP. In a later set of articles, I\u0026rsquo;ll partner with the incomparable @vifino who is adding MPLS support to the Linux Controlplane plugin. After that, I do expect VPP to be able to act as a fully fledged provider- and provider-edge MPLS router.\nLab Setup A while ago I created a [VPP Lab] which is pretty slick, I use it all the time. Most of the time I find myself messing around on the hypervisor and adding namespaces with interfaces in it, to pair up with the VPP interfaces. And I tcpdump a lot! It\u0026rsquo;s time for me to make an upgrade to the Lab \u0026ndash; take a look at this picture:\nThere\u0026rsquo;s quite a bit to unpack here, but it will be useful to know this layout as I\u0026rsquo;ll be referring to the components here throughout the rest of the article. Each lab now has seven virtual machines:\nvppX-Y are Debian Testing machines running a reasonably fresh VPP - they are daisychained with the first one attaching to the headend called lab.ipng.ch, using its Gi10/0/0 interface, and onwards to its eastbound neighbor vpp0-1 using its GI10/0/1 interface. hostX-Y are two Debian machines which have their 4 network cards (enp16s0fX) connected each to one VPP instance\u0026rsquo;s Gi10/0/2 interface (for host0-0) or Gi10/0/3 (for host0-1). This way, I can test all sorts of topologies with one router, two routers, or multiple routers. tapX-0 is a special virtual machine which receives a copy of every packet on the underlying Open vSwitch network fabric. NOTE: X is the 0-based lab number, and Y stands for the 0-based logical machine number, so vpp1-3 is the fourth VPP virtualmachine on the second lab.\nDetour 1: Open vSwitch To explain this tap a little bit - let me first talk about the underlay. All seven of these machines (and each their four network cards) are bound by the hypervisor into an Open vSwitch bridge called vpplan. Then, I use two features to build this topology:\nFirstly, each pair of interfaces will be added as an access port into individual VLANs. 
For example, vpp0-0.Gi10/0/1 connects with vpp0-1.Gi10/0/0 in VLAN 20 (annotated in orange), and vpp0-0.Gi10/0/2 connects to host0-0.enp16s0f0 in VLAN 30 (annotated in purple). You can see the East-West traffic over the VPP backbone are in the 20s, the host0-0 traffic northbound is in the 30s, and the host0-1 traffic southbound is in the 40s. Finally, the whole Open vSwitch fabric is connected to lab.ipng.ch using VLAN 10 and a physical network card on the hypervisor (annotated in green). The lab.ipng.ch machine then has internet connectivity.\nBR=vpplan for p in $(ovs-vsctl list-ifaces $BR); do ovs-vsctl set port $p vlan_mode=access done # Uplink (Green) ovs-vsctl set port uplink tag=10 ## eno1.200 on the Hypervisor ovs-vsctl set port vpp0-0-0 tag=10 # Backbone (Orange) ovs-vsctl set port vpp0-0-1 tag=20 ovs-vsctl set port vpp0-1-0 tag=20 ... # Northbound (Purple) ovs-vsctl set port vpp0-0-2 tag=30 ovs-vsctl set port host0-0-0 tag=30 ... # Southbound (Red) ... ovs-vsctl set port vpp0-3-3 tag=43 ovs-vsctl set port host0-1-3 tag=43 NOTE: The KVM interface names such as vppX-Y-Z where X means the lab number (0 in this case \u0026ndash; IPng does have multiple labs so I can run experiments and lab environments independently and isolated), Y is the machine number, and Z is the interface number on the machine (from [0..3]).\nDetour 2: Mirroring Traffic Secondly, now that I have created a 29 port switch with 12 VLANs, I decide to create an OVS mirror port, which can be used to make a copy of traffic going in- or out of (a list of) ports. This is a super powerful feature, and it looks like this:\nBR=vpplan MIRROR=mirror-rx ovs-vsctl set port tap0-0-0 vlan_mode=access [ ovs-vsctl list mirror $MIRROR \u0026gt;/dev/null 2\u0026gt;\u0026amp;1 ] || \\ ovs-vsctl --id=@m get mirror $MIRROR -- remove bridge $BR mirrors @m ovs-vsctl --id=@m create mirror name=$MIRROR \\ -- --id=@p get port tap0-0-0 \\ -- add bridge $BR mirrors @m \\ -- set mirror $MIRROR output-port=@p \\ -- set mirror $MIRROR select_dst_port=[] \\ -- set mirror $MIRROR select_src_port=[] for iface in $(ovs-vsctl list-ports $BR); do [[ $iface == tap* ]] \u0026amp;\u0026amp; continue ovs-vsctl add mirror $MIRROR select_dst_port $(ovs-vsctl get port $iface _uuid) done The first call sets up the OVS switchport called tap0-0-0 (which is enp16s0f0 on the machine tap0-0) as an access port. To allow for this script to be idempotent, the second line will look up if the mirror exists and if so, delete it. Then, I (re)create a mirror port with a given name (mirror-rx), add it to the bridge, make the mirror\u0026rsquo;s output port become tap0-0-0, and finally clear the selected source and destination ports (this is where the traffic is mirrored from). At this point, I have an empty mirror. To give it something useful to do, I loop over all of the ports in the vpplan bridge and add them to the mirror, if they are the destination port (here I have to specify the uuid of the interface, not its name). I will add all interfaces, except those of the tap0-0 machine itself, to avoid loops.\nIn the end, I create two of these, one called mirror-rx which is forwarded to tap0-0-0 (enp16s0f0) and the other called mirror-tx which is forwarded to tap0-0-1 (enp16s0f1). 
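For the record, the mirror-tx counterpart is the same recipe with the port selection swapped from destination to source; roughly (a sketch, not a verbatim copy of the script):
BR=vpplan
MIRROR=mirror-tx
ovs-vsctl set port tap0-0-1 vlan_mode=access
ovs-vsctl --id=@m create mirror name=$MIRROR \
  -- --id=@p get port tap0-0-1 \
  -- add bridge $BR mirrors @m \
  -- set mirror $MIRROR output-port=@p \
  -- set mirror $MIRROR select_dst_port=[] \
  -- set mirror $MIRROR select_src_port=[]
for iface in $(ovs-vsctl list-ports $BR); do
  [[ $iface == tap* ]] && continue
  ovs-vsctl add mirror $MIRROR select_src_port $(ovs-vsctl get port $iface _uuid)
done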
I can use tcpdump on either of these ports, to show all the traffic either going ingress to any port on any machine, or emitting egress from any port on any machine, respectively.\nPreparing the LAB I wrote a little bit about the automation I use to maintain a few reproducable lab environments in a [previous article], so I\u0026rsquo;ll only show the commands themselves here, not the underlying systems. When the LAB boots up, it comes with a basic Linux CP configuration that uses OSPF and OSPFv3 running in Bird2, to connect the vpp0-0 through vpp0-3 machines together (each router\u0026rsquo;s Gi10/0/0 port connects to the next router\u0026rsquo;s Gi10/0/1 port). LAB0 is in use by @vifino at the moment, so I\u0026rsquo;ll take the next one running on its own hypervisor, called LAB1.\nEach machine has an IPv4 and IPv6 loopback, so the LAB will come up with basic connectivity:\npim@lab:~/src/lab$ LAB=1 ./create pim@lab:~/src/lab$ LAB=1 ./command pristine pim@lab:~/src/lab$ LAB=1 ./command start \u0026amp;\u0026amp; sleep 150 pim@lab:~/src/lab$ traceroute6 vpp1-3.lab.ipng.ch traceroute to vpp1-3.lab.ipng.ch (2001:678:d78:211::3), 30 hops max, 24 byte packets 1 e0.vpp1-0.lab.ipng.ch (2001:678:d78:211::fffe) 2.0363 ms 2.0123 ms 2.0138 ms 2 e0.vpp1-1.lab.ipng.ch (2001:678:d78:211::1:11) 3.0969 ms 3.1261 ms 3.3413 ms 3 e0.vpp1-2.lab.ipng.ch (2001:678:d78:211::2:12) 6.4845 ms 6.3981 ms 6.5409 ms 4 vpp1-3.lab.ipng.ch (2001:678:d78:211::3) 7.4610 ms 7.5698 ms 7.6413 ms MPLS For Dummies .. like me! MPLS stands for [Multi Protocol Label Switching]. Rather than looking at the IPv4 or IPv6 header in the packet, and making the routing decision based on the destination address, MPLS takes the whole packet and encpsulates it into a new datagram that carries a 20-bit number (called the label), three bits to classify the traffic, one S-bit to signal that this is the last label in a stack of labels, and finally 8 bits of TTL.\nIn total, 32 bits are prepended to the whole IP packet, or Ethernet frame, or any other type of inner datagram. This is why it\u0026rsquo;s called Multi Protocol. The S-bit allows routers to know if the following data is the inner payload (S=1), or if the following 32 bits are another MPLS label (S=0). This way, routers can add more than one labels into a label stack.\nForwarding decisions are made on the contents of this MPLS label, without the need to examine the packet itself. Two significant benefits become obvious:\nThe inner data payload (ie. an IPv6 packet or an Ethernet frame) doesn\u0026rsquo;t have to be rewritten, no new checksum created, no TTL decremented. Any datagram can be stuffed into an MPLS packet, the routing (or packet switching) entirely happens using only the MPLS headers.\nImportantly, no source- or destination IP addresses have to be looked up in a possibly very large ~1M large FIB tree to figure out the next hop. Rather than traversing a [Radix Trie] or other datastructure to find the next-hop, a static [Hash Table] with literal integer MPLS labels can be consulted. This greatly simplifies the computational complexity in transit.\nP-Router: The simplest form of an MPLS router is a so-called Label-Switch-Router (LSR) which is synonymous for Provider-Router (P-Router). This is the router that sits in the core of the network, and its only purpose is to receive MPLS packets, look up what to do with them based on the label value, and then forward the packet onto the next router. 
Sometimes the router can (and will) rewrite the label, in an operation called a SWAP, but it can also leave the label as it was (in other words, the input label value can be the same as the outgoing label value). The logic kind of goes like MPLS In-Label =\u0026gt; { MPLS Out-Label, Out-Interface, Out-NextHop }. It\u0026rsquo;s this behavior that explains the name Label Switching.\nIf you were to imagine plotting a path through the lab network from say vpp1-0 on the one side, through vpp1-1 and vpp1-2 on finally onwards to vpp1-3, each router would be receiving MPLS packets on one interface, and emitting them on their way to the next router on another interface. That path of switching operations on the labels of those MPLS packets thus forms a so-called Label-Switched-Path (LSP). These LSPs are fundamental building blocks of MPLS networks, as I\u0026rsquo;ll demonstrate later.\nPE-Router: Some routers have a less boring job to do - those that sit at the edge of an MPLS network, accept customer traffic and do something useful with it. These are called Label-Edge-Router (LER) which is often colloquially called a Provider-Edge-Router (PE-Router). These routers receive normal packets (ethernet or IP or otherwise), and perform the encapsulation by adding MPLS labels to them upon receipt (ingress, called PUSH), or removing the encapsulation (called POP) and finding the inner payload, continuing to handle them as per normal. The logic for these can be much more complicated, but you can imagine it goes something like MPLS In-Label =\u0026gt; { Operation } where the operation may be \u0026ldquo;take the resulting datagram, assume it is an IPv4 packet, so look it up in the IPv4 routing table\u0026rdquo; or \u0026ldquo;take the resulting datagram, assume it is an ethernet frame, and emit it on a specific interface\u0026rdquo;, and really any number of other \u0026ldquo;operations\u0026rdquo;.\nThe cool thing about MPLS is that the type of operations are vendor-extensible. If two routers A and B agree what label 1234 means to them, they can simply insert it at the top of the labelstack say {100,1234}, where the bottom one (the 100 label that all the P-Routers see) carries the semantic meaning of \u0026ldquo;switch this packet onto the destination PE-router\u0026rdquo;, where that PE-router can pop the outer label, to reveal the 1234-label, which it can look up in its table to tell it what to do next with the MPLS payload in any way it chooses - the P-Routers don\u0026rsquo;t have to understand the meaning of label 1234, they don\u0026rsquo;t have to use or inspect it at all!\nStep 0: End Host setup For this lab, I\u0026rsquo;m going to boot up instance LAB1 with no changes (for posterity, using image vpp-proto-disk0@20230403-release). 
As an aside, IPng Networks has several of these lab environments, and while @vifino is doing some development testing on LAB0, I simply switch to LAB1 to let him work in peace.\nWith the MPLS concepts introduced, let me start by configuring host1-0 and host1-1 and giving them an IPv4 loopback address, and a transit network to their routers vpp1-0 and vpp1-3 respectively:\nroot@host1-1:~# ip link set enp16s0f0 up mtu 1500 root@host1-1:~# ip addr add 192.0.2.2/31 dev enp16s0f0 root@host1-1:~# ip addr add 10.0.1.1/32 dev lo root@host1-1:~# ip ro add 10.0.1.0/32 via 192.0.2.3 root@host1-0:~# ip link set enp16s0f3 up mtu 1500 root@host1-0:~# ip addr add 192.0.2.0/31 dev enp16s0f3 root@host1-0:~# ip addr add 10.0.1.0/32 dev lo root@host1-0:~# ip ro add 10.0.1.1/32 via 192.0.2.1 root@host1-0:~# ping -I 10.0.1.0 10.0.1.1 At this point, I don\u0026rsquo;t expect to see much, as I haven\u0026rsquo;t configured VPP yet. But host1-0 will start ARPing for 192.0.2.1 on enp16s0f3, which is connected to vpp1-3.e2. Let me take a look on the Open vSwitch mirror to confirm that:\nroot@tap1-0:~# tcpdump -vni enp16s0f0 vlan 33 12:41:27.174052 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28 12:41:28.333901 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28 12:41:29.517415 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28 12:41:30.645418 ARP, Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.1 tell 192.0.2.0, length 28 Alright! I\u0026rsquo;m going to leave the ping running in the background, and I\u0026rsquo;ll trace packets through the network using the Open vSwitch mirror, as well as take a look at what VPP is doing with the packets using its packet tracer.\nStep 1: PE Ingress vpp1-3# set interface state GigabitEthernet10/0/2 up vpp1-3# set interface ip address GigabitEthernet10/0/2 192.0.2.1/31 vpp1-3# mpls table add 0 vpp1-3# set interface mpls GigabitEthernet10/0/1 enable vpp1-3# set interface mpls GigabitEthernet10/0/0 enable vpp1-3# ip route add 10.0.1.1/32 via 192.168.11.10 GigabitEthernet10/0/0 out-labels 100 Now the ARP resolution succeeds, and I can see that host1-0 starts sending ICMP packets towards the loopback that I have configured on host1-1, and it\u0026rsquo;s of course using the newly learned L2 adjacency for 192.0.2.1 at 52:54:00:13:10:02 (which is vpp1-3.e2). But, take a look at what the VPP router does next: due to the ip route add ... 
command, I\u0026rsquo;ve told it to reach 10.0.1.1 via a nexthop of vpp1-2.e1, but it will PUSH a single MPLS label 100,S=1 and forward it out on its Gi10/0/0 interface:\nroot@tap1-0:~# tcpdump -eni enp16s0f0 vlan or mpls 12:45:56.551896 52:54:00:20:10:03 \u0026gt; ff:ff:ff:ff:ff:ff, ethertype 802.1Q (0x8100), length 46: vlan 33 p 0, ethertype ARP (0x0806), Request who-has 192.0.2.1 tell 192.0.2.0, length 28 12:45:56.553311 52:54:00:13:10:02 \u0026gt; 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 46: vlan 33 p 0, ethertype ARP (0x0806), Reply 192.0.2.1 is-at 52:54:00:13:10:02, length 28 12:45:56.620924 52:54:00:20:10:03 \u0026gt; 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33 p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64 12:45:56.621473 52:54:00:13:10:00 \u0026gt; 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 38791, seq 184, length 64 My MPLS journey on VPP has officially begun! The first exchange in the tcpdump (packets 1 and 2) is the ARP resolution of 192.0.2.1 by host1-0, after which it knows where to send the ICMP echo (packet 3, on VLAN33), which is then sent out by vpp1-3 as MPLS to vpp1-2 (packet 4, on VLAN22).\nLet me show you what such a packet looks like from the point of view of VPP. It has a packet tracing function which shows how any individual packet traverses the graph of nodes through the router. It\u0026rsquo;s a lot of information, but as a VPP operator, let alone a developer, it\u0026rsquo;s really important skill to learn \u0026ndash; so off I go, capturing and tracing a handful of packets:\nvpp1-3# trace add dpdk-input 10 vpp1-3# show trace ------------------- Start of thread 0 vpp_main ------------------- Packet 1 20:15:00:496109: dpdk-input GigabitEthernet10/0/2 rx queue 0 buffer 0x4c44df: current data 0, length 98, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid PKT MBUF: port 2, nb_segs 1, pkt_len 98 buf_len 2176, data_len 98, ol_flags 0x0, data_off 128, phys_addr 0x2ed13840 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 IP4: 52:54:00:20:10:03 -\u0026gt; 52:54:00:13:10:02 ICMP: 10.0.1.0 -\u0026gt; 10.0.1.1 tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN fragment id 0x2706, flags DONT_FRAGMENT ICMP echo_request checksum 0x3bd6 id 8399 20:15:00:496167: ethernet-input frame: flags 0x1, hw-if-index 3, sw-if-index 3 IP4: 52:54:00:20:10:03 -\u0026gt; 52:54:00:13:10:02 20:15:00:496201: ip4-input ICMP: 10.0.1.0 -\u0026gt; 10.0.1.1 tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN fragment id 0x2706, flags DONT_FRAGMENT ICMP echo_request checksum 0x3bd6 id 8399 20:15:00:496225: ip4-lookup fib 0 dpo-idx 1 flow hash: 0x00000000 ICMP: 10.0.1.0 -\u0026gt; 10.0.1.1 tos 0x00, ttl 64, length 84, checksum 0x46a2 dscp CS0 ecn NON_ECN fragment id 0x2706, flags DONT_FRAGMENT ICMP echo_request checksum 0x3bd6 id 8399 20:15:00:496256: ip4-mpls-label-imposition-pipe mpls-header:[100:64:0:eos] 20:15:00:496258: mpls-output adj-idx 25 : mpls via 192.168.11.10 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001210015254001310008847 flow hash: 0x00000000 20:15:00:496260: GigabitEthernet10/0/0-output GigabitEthernet10/0/0 flags 0x0018000d MPLS: 52:54:00:13:10:00 -\u0026gt; 52:54:00:12:10:01 label 100 exp 0, s 1, ttl 64 20:15:00:496262: GigabitEthernet10/0/0-tx 
GigabitEthernet10/0/0 tx queue 0 buffer 0x4c44df: current data -4, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid l2-hdr-offset 0 l3-hdr-offset 14 PKT MBUF: port 2, nb_segs 1, pkt_len 102 buf_len 2176, data_len 102, ol_flags 0x0, data_off 124, phys_addr 0x2ed13840 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 MPLS: 52:54:00:13:10:00 -\u0026gt; 52:54:00:12:10:01 label 100 exp 0, s 1, ttl 64 This packet has gone through a total of eight nodes, and the local timestamps are the uptime of VPP when the packets were received. I\u0026rsquo;ll try to explain them in turn:\ndpdk-input: The packet is initially received by from Gi10/0/2 receive queue 0. It was an ethernet packet from 52:54:00:20:10:03 (host1-0.enp16s0f3) to 52:54:00:13:10:02 (vpp1-3.e2). Some more information is gleaned here, notably that it was an ethernet frame, an L3 IPv4 and L4 ICMP packet. ethernet-input: Since it was an ethernet frame, it was passed into this node. Here VPP concludes that this is an IPv4 packet, because the ethertype is 0x0800. ip4-input: We know it\u0026rsquo;s an IPv4 packet, and the layer4 information shows this is an ICMP echo packet from 10.0.1.0 to 10.0.1.1 (configured on host1-1.lo). VPP now needs to figure out where to route this packet. ip4-lookup: VPP takes a look at its FIB for 10.0.1.1 - note the information I specified above (the ip route add ... on vpp1-3) - the next-hop here is 192.168.11.10 on Gi10/0/0 but VPP also sees that I\u0026rsquo;m intending to add an MPLS out-label of 100. ip4-mpls-label-inposition-pipe: An MPLS packet header is prepended in front of the IPv4 packet, which will have only one label (100) and since it\u0026rsquo;s the only label, it will set the S-bit (end-of-stack) to 1, and the MPLS TTL initializes at 64. mpls-output: Now that the IPv4 packet is wrapped into an MPLS packet, VPP uses the rest of the FIB entry (notably the next-hop 192.168.11.0 and the output interface Gi10/0/0) to find where this thing is supposed to go. Gi10/0/0-output: VPP now prepares the packet to be sent out on Gi10/0/0 as an MPLS ethernet type. It uses the L2FIB adjacency table to figure out that we\u0026rsquo;ll be sending it from our mac address 52:54:00:13:10:00 (vpp1-3.e0) to the next hop on 52:54:00:12:10:01 (vpp1-2.e1). Gi10/0/0-tx: VPP hands the fully formed packet with all necessary information back to DPDK to marshall it on the wire. Can you imagine this router can do such a thing at a rate of 18-20 Million packets per second, linearly scaling up per added CPU thread? I learn something new every time I look at a packet trace, I simply love this dataplane implementation!\nStep 2: P-routers In Step 1 I\u0026rsquo;ve shown that vpp1-3 did send the MPLS packet to vpp1-2, but I haven\u0026rsquo;t configured anything there yet, and because I didn\u0026rsquo;t enable MPLS, each of these beautiful packets is brutally sent to the bit-bucket (also called dpo-drop):\nvpp1-2# show err Count Node Reason Severity 132 mpls-input MPLS input packets decapsulated info 132 mpls-input MPLS not enabled error The purpose of a P-router is to switch labels along the Label-Switched-Path. 
So let\u0026rsquo;s manually create the following to tell this vpp1-2 router what to do when it receives an MPLS frame with label 100:\nvpp1-2# mpls table add 0 vpp1-2# set interface mpls GigabitEthernet10/0/0 enable vpp1-2# set interface mpls GigabitEthernet10/0/1 enable vpp1-2# mpls local-label add 100 eos via 192.168.11.8 GigabitEthernet10/0/0 out-labels 100 Remember, above I explained that the P-Router has a simple job? It really does! All I\u0026rsquo;m doing here is telling VPP, that if it receives an MPLS packet on any MPLS-enabled interface (notably Gi10/0/1 from which it is currently receiving MPLS packets from vpp1-3), that it should send the MPLS packet out on Gi10/0/0 to neighbor 192.168.11.8 after imposing label 100.\nIf I\u0026rsquo;ve done a good job, I should be able to see this packet traversing the P-Router in a packet trace:\n20:42:51:151144: dpdk-input GigabitEthernet10/0/1 rx queue 0 buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid PKT MBUF: port 1, nb_segs 1, pkt_len 102 buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 MPLS: 52:54:00:13:10:00 -\u0026gt; 52:54:00:12:10:01 label 100 exp 0, s 1, ttl 64 20:42:51:151161: ethernet-input frame: flags 0x1, hw-if-index 2, sw-if-index 2 MPLS: 52:54:00:13:10:00 -\u0026gt; 52:54:00:12:10:01 20:42:51:151171: mpls-input MPLS: next mpls-lookup[1] label 100 ttl 64 exp 0 20:42:51:151174: mpls-lookup MPLS: next [6], lookup fib index 0, LB index 74 hash 0 label 100 eos 1 20:42:51:151177: mpls-label-imposition-pipe mpls-header:[100:63:0:eos] 20:42:51:151179: mpls-output adj-idx 28 : mpls via 192.168.11.8 GigabitEthernet10/0/0: mtu:9000 next:2 flags:[] 5254001110015254001210008847 flow hash: 0x00000000 20:42:51:151181: GigabitEthernet10/0/0-output GigabitEthernet10/0/0 flags 0x0018000d MPLS: 52:54:00:12:10:00 -\u0026gt; 52:54:00:11:10:01 label 100 exp 0, s 1, ttl 63 20:42:51:151184: GigabitEthernet10/0/0-tx GigabitEthernet10/0/0 tx queue 0 buffer 0x4c7d8b: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid l2-hdr-offset 0 l3-hdr-offset 14 PKT MBUF: port 1, nb_segs 1, pkt_len 102 buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x1d1f6340 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 MPLS: 52:54:00:12:10:00 -\u0026gt; 52:54:00:11:10:01 label 100 exp 0, s 1, ttl 63 In order, the following nodes are traversed:\ndpdk-input: received the frame from the network interface Gi10/0/1 ethernet-input: the frame was an ethernet frame, and VPP determines based on the ethertype (0x8847) that it is an MPLS frame mpls-input: inspects the MPLS labelstack and sees the outermost label (the only one on this frame) with a value of 100. mpls-lookup: looks up the MPLS FIB what to do with packets which are End-Of-Stack or EOS (ie. with the S-bit set to 1), and are labeled 100. At this point VPP could make a different choice if there is 1 label (as in this case), or a stack of multiple labels (Not-End-of-Stack or NEOS, ie. with the S-bit set to 0). mpls-label-imposition-pipe: reads from the FIB that the outer label needs to be SWAPd to a new out-label (also with value 100). Because it\u0026rsquo;s the same label, this is a no-op. However, since this router is forwarding the MPLS packet, it will decrement the TTL to 63. 
mpls-output: VPP then uses the rest of the FIB information to determine the L3 nexthop is 192.168.11.8 on Gi10/0/0. Gi10/0/0-output: uses the L2FIB adjacency table to determine that the L2 nexthop is MAC address 52:54:00:11:10:01 (vpp1-1.e1). If there is no L2 adjacency, this would be a good time for VPP to send an ARP request to resolve the IP-to-MAC and store it in the L2FIB. Gi10/0/0-tx: hands off the frame to DPDK for marshalling on the wire. If you counted with me, you\u0026rsquo;ll see that this flow in the P-Router also has eight nodes. However, while the IPv4 FIB can and will be north of one million entries in a longest-prefix match radix trie (which is computationally expensive), the MPLS FIB contains far fewer entries which are organized as a literal key lookup in a hash table; and as well compared to IPv4 routing, the packet that is being transported does not have to get a decremented TTL which requires a recalculated IPv4 checksum. MPLS switching is much cheaper than IPv4 routing!\nNow that our packets are switched from vpp1-2 to vpp1-1 (which is also a P-Router), I\u0026rsquo;ll just rinse and repeat there, using the L3 adjacency pointing at vpp1-0.e1 (192.168.11.6 on Gi10/0/0):\nvpp1-1# mpls table add 0 vpp1-1# set interface mpls GigabitEthernet10/0/0 enable vpp1-1# set interface mpls GigabitEthernet10/0/1 enable vpp1-1# mpls local-label add 100 eos via 192.168.11.6 GigabitEthernet10/0/0 out-labels 100 Did I do this correctly? One way to check is by taking a look at which packets are seen on the Open vSwitch mirror ports:\nroot@tap1-0:~# tcpdump -eni enp16s0f0 13:42:47.724107 52:54:00:20:10:03 \u0026gt; 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33 p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64 13:42:47.724769 52:54:00:13:10:00 \u0026gt; 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64 13:42:47.725038 52:54:00:12:10:00 \u0026gt; 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64 13:42:47.726155 52:54:00:11:10:00 \u0026gt; 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 8399, seq 3238, length 64 Nice!! I confirm that the ICMP packet first travels over VLAN 33 (from host1-0 to vpp1-3), and then the MPLS packets travel from vpp1-3, through vpp-1-2, through vpp1-1 and towards vpp1-0 over VLAN 22, 21 and 20 respectively.\nStep 3: PE Egress Seeing as I haven\u0026rsquo;t done anything with vpp1-0 yet, now the MPLS packets all get dropped there. But not for much longer, as I\u0026rsquo;m now ready to tell vpp1-0 what to do with those packets:\nvpp1-0# mpls table add 0 vpp1-0# set interface mpls GigabitEthernet10/0/0 enable vpp1-0# set interface mpls GigabitEthernet10/0/1 enable vpp1-0# mpls local-label add 100 eos via ip4-lookup-in-table 0 vpp1-0# ip route add 10.0.1.1/32 via 192.0.2.2 The difference between the P-Routers in Step 2 and this PE-Router, is the operation provided in the MPLS FIB. 
When an MPLS packet with label value 100 is received, instead of forwarding it into another interface (which is what the P-Router would do), I tell VPP here to unwrap the MPLS label, and expect to find an IPv4 packet which I\u0026rsquo;m asking it to route by looking up an IPv4 next hop in the (IPv4) FIB table 0.\nAll that\u0026rsquo;s left for me to do is add a regular static route for 10.0.1.1/32 via 192.0.2.2 (which is the address on interface host1-1.enp16s0f3). If my thinkingcap is still working, I should now see packets emit from vpp1-0 on Gi10/0/3:\nvpp1-0# trace add dpdk-input 10 vpp1-0# show trace 21:34:39:370589: dpdk-input GigabitEthernet10/0/1 rx queue 0 buffer 0x4c4a34: current data 0, length 102, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid PKT MBUF: port 1, nb_segs 1, pkt_len 102 buf_len 2176, data_len 102, ol_flags 0x0, data_off 128, phys_addr 0x2ff28d80 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 MPLS: 52:54:00:11:10:00 -\u0026gt; 52:54:00:10:10:01 label 100 exp 0, s 1, ttl 62 21:34:39:370672: ethernet-input frame: flags 0x1, hw-if-index 2, sw-if-index 2 MPLS: 52:54:00:11:10:00 -\u0026gt; 52:54:00:10:10:01 21:34:39:370702: mpls-input MPLS: next mpls-lookup[1] label 100 ttl 62 exp 0 21:34:39:370704: mpls-lookup MPLS: next [6], lookup fib index 0, LB index 83 hash 0 label 100 eos 1 21:34:39:370706: ip4-mpls-label-disposition-pipe rpf-id:-1 ip4, pipe 21:34:39:370708: lookup-ip4-dst fib-index:0 addr:10.0.1.1 load-balance:82 21:34:39:370710: ip4-rewrite tx_sw_if_index 4 dpo-idx 32 : ipv4 via 192.0.2.2 GigabitEthernet10/0/3: mtu:9000 next:9 flags:[] 5254002110005254001010030800 flow hash: 0x00000000 00000000: 5254002110005254001010030800450000543dec40003e01e8bc0a0001000a00 00000020: 01010800173d231c01a0fce65864000000009ce80b00000000001011 21:34:39:370735: GigabitEthernet10/0/3-output GigabitEthernet10/0/3 flags 0x0418000d IP4: 52:54:00:10:10:03 -\u0026gt; 52:54:00:21:10:00 ICMP: 10.0.1.0 -\u0026gt; 10.0.1.1 tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN fragment id 0x3dec, flags DONT_FRAGMENT ICMP echo_request checksum 0x173d id 8988 21:34:39:370739: GigabitEthernet10/0/3-tx GigabitEthernet10/0/3 tx queue 0 buffer 0x4c4a34: current data 4, length 98, buffer-pool 0, ref-count 1, trace handle 0x0 ext-hdr-valid l2-hdr-offset 0 l3-hdr-offset 14 loop-counter 1 PKT MBUF: port 1, nb_segs 1, pkt_len 98 buf_len 2176, data_len 98, ol_flags 0x0, data_off 132, phys_addr 0x2ff28d80 packet_type 0x0 l2_len 0 l3_len 0 outer_l2_len 0 outer_l3_len 0 rss 0x0 fdir.hi 0x0 fdir.lo 0x0 IP4: 52:54:00:10:10:03 -\u0026gt; 52:54:00:21:10:00 ICMP: 10.0.1.0 -\u0026gt; 10.0.1.1 tos 0x00, ttl 62, length 84, checksum 0xe8bc dscp CS0 ecn NON_ECN fragment id 0x3dec, flags DONT_FRAGMENT ICMP echo_request checksum 0x173d id 8988 Alright, another one of those huge blobs of information about a single packet traversing the VPP dataplane, but it\u0026rsquo;s the last one for this article, I promise! 
In order:\ndpdk-input: DPDK reads the frame which is arriving from vpp1-1 on Gi10/0/1, and determines that this is an ethernet frame ethernet-input: Based on the ethertype 0x8847, it knows that this ethernet frame is an MPLS packet mpls-input: The MPLS labelstack has one label, value 100, with (obviously) the EndOfStack S-bit set to 1; I can also see the (MPLS) TTL is 62 here, because it has traversed three routers (vpp1-3 TTL=64, vpp1-2 TTL=63, and vpp1-1 TTL=62) mpls-lookup: The lookup of local label 100 informs VPP that it should switch to IPv4 processing and handle the packet as such ip4-mpls-label-disposition-pipe: The MPLS label is removed, revealing an IPv4 packet as the inner payload of the MPLS datagram lookup-ip4-dst: VPP can now do a regular IPv4 forwarding table lookup for 10.0.1.1 which informs it that it should forward the packet via 192.0.2.2 which is directly connected to Gi10/0/3. ip4-rewrite: The IPv4 TTL is decremented and the IP header checksum recomputed Gi10/0/3-output: VPP can now look up the L2FIB adjacency belonging to 192.0.2.2 on Gi10/0/3, which informs it that 52:54:00:21:10:00 is the ethernet nexthop Gi10/0/3-tx: The packet is now handed off to DPDK to marshall on the wire, destined to host1-1.enp16s0f3 That means I should be able to see it on host1-1, right? If you, too, are dying to know, check this out:\nroot@host1-1:~# tcpdump -ni enp16s0f0 icmp tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes 14:25:53.776486 IP 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 8988, seq 1249, length 64 14:25:53.776522 IP 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 8988, seq 1249, length 64 14:25:54.799829 IP 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 8988, seq 1250, length 64 14:25:54.799866 IP 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 8988, seq 1250, length 64 \u0026ldquo;Jiggle jiggle, wiggle wiggle!\u0026rdquo;, as I do a premature congratulatory dance on the chair in my lab! I created a label-switched-path using VPP as MPLS provider-edge and provider routers, to move this ICMP echo packet all the way from host1-0 to host1-1, but there\u0026rsquo;s absolutely nothing to suggest that the resulting ICMP echo-reply can go back from host1-1 to host1-0, because LSPs are unidirectional. The final step for me to do is create an LSP back in the other direction:\nvpp1-0# ip route add 10.0.1.0/32 via 192.168.11.7 GigabitEthernet10/0/1 out-labels 103 vpp1-1# mpls local-label add 103 eos via 192.168.11.9 GigabitEthernet10/0/1 out-labels 103 vpp1-2# mpls local-label add 103 eos via 192.168.11.11 GigabitEthernet10/0/1 out-labels 103 vpp1-3# mpls local-label add 103 eos via ip4-lookup-in-table 0 vpp1-3# ip route add 10.0.1.0/32 via 192.0.2.0 And with that, the ping I started at the beginning of this article, shoots to life:\nroot@host1-0:~# ping -I 10.0.1.0 10.0.1.1 PING 10.0.1.1 (10.0.1.1) from 10.0.1.0 : 56(84) bytes of data.
64 bytes from 10.0.1.1: icmp_seq=7644 ttl=62 time=6.28 ms 64 bytes from 10.0.1.1: icmp_seq=7645 ttl=62 time=7.45 ms 64 bytes from 10.0.1.1: icmp_seq=7646 ttl=62 time=7.01 ms 64 bytes from 10.0.1.1: icmp_seq=7647 ttl=62 time=5.76 ms 64 bytes from 10.0.1.1: icmp_seq=7648 ttl=62 time=5.88 ms 64 bytes from 10.0.1.1: icmp_seq=7649 ttl=62 time=9.23 ms I will leave you with this packetdump from the Open vSwitch mirror, showing the entire flow of one ICMP packet through the network:\nroot@tap1-0:~# tcpdump -c 10 -eni enp16s0f0 tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on enp16s0f0, link-type EN10MB (Ethernet), snapshot length 262144 bytes 14:41:07.526861 52:54:00:20:10:03 \u0026gt; 52:54:00:13:10:02, ethertype 802.1Q (0x8100), length 102: vlan 33 p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.528103 52:54:00:13:10:00 \u0026gt; 52:54:00:12:10:01, ethertype 802.1Q (0x8100), length 106: vlan 22 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 64) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.529342 52:54:00:12:10:00 \u0026gt; 52:54:00:11:10:01, ethertype 802.1Q (0x8100), length 106: vlan 21 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 63) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.530421 52:54:00:11:10:00 \u0026gt; 52:54:00:10:10:01, ethertype 802.1Q (0x8100), length 106: vlan 20 p 0, ethertype MPLS unicast (0x8847), MPLS (label 100, exp 0, [S], ttl 62) 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.531160 52:54:00:10:10:03 \u0026gt; 52:54:00:21:10:00, ethertype 802.1Q (0x8100), length 102: vlan 40 p 0, ethertype IPv4 (0x0800), 10.0.1.0 \u0026gt; 10.0.1.1: ICMP echo request, id 51470, seq 20, length 64 14:41:07.531455 52:54:00:21:10:00 \u0026gt; 52:54:00:10:10:03, ethertype 802.1Q (0x8100), length 102: vlan 40 p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.532245 52:54:00:10:10:01 \u0026gt; 52:54:00:11:10:00, ethertype 802.1Q (0x8100), length 106: vlan 20 p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 64) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.532732 52:54:00:11:10:01 \u0026gt; 52:54:00:12:10:00, ethertype 802.1Q (0x8100), length 106: vlan 21 p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 63) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.533923 52:54:00:12:10:01 \u0026gt; 52:54:00:13:10:00, ethertype 802.1Q (0x8100), length 106: vlan 22 p 0, ethertype MPLS unicast (0x8847), MPLS (label 103, exp 0, [S], ttl 62) 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 14:41:07.535040 52:54:00:13:10:02 \u0026gt; 52:54:00:20:10:03, ethertype 802.1Q (0x8100), length 102: vlan 33 p 0, ethertype IPv4 (0x0800), 10.0.1.1 \u0026gt; 10.0.1.0: ICMP echo reply, id 51470, seq 20, length 64 10 packets captured 10 packets received by filter You can see all of the attributes I demonstrated in this article in one go: ingress ICMP packet on VLAN 33, encapsulation with label 100, S=1 and ttl decrementing as the MPLS packet traverses eastwards through the string of VPP routers on VLANs 22, 21 and 20, ultimately being sent out on VLAN 40. 
There, the ICMP echo request packet is responded to, and we can trace the ICMP response as it makes its way back westwards through the MPLS network using label 103, ultimately re-appearing on VLAN 33.\nThere you have it. This is a fun story on Multi Protocol Label Switching (MPLS) bringing a packet from a Label-Edge-Router (LER) through several Label-Switch-Routers (LSRs) over a statically configured Label-Switched-Path (LSP). I feel like I can now more confidently use these terms without sounding silly.\nWhat\u0026rsquo;s next The first mission is accomplished. I\u0026rsquo;ve taken a good look at IPv4 forwarding in the VPP dataplane as MPLS packets, thereby en- and decapsulating the traffic using PE-Routers and forwarding the traffic using intermediary P-Routers. MPLS switching is cheaper than IPv4/IPv6 routing, and it also opens up a bunch of possibilities for advanced service offerings, such as my coveted Martini Tunnels which transport ethernet frames point-to-point over an MPLS backbone. That will be the topic of an upcoming article, as I will join forces with @vifino, who is adding Linux Controlplane functionality to program the MPLS FIB using Netlink \u0026ndash; such that things like \u0026lsquo;ip\u0026rsquo; and \u0026lsquo;FRR\u0026rsquo; can discover and share label information using a Label Distribution Protocol or LDP.\n","date":"2023-05-07","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nI\u0026rsquo;ve deployed an MPLS core for IPng Networks, which allows me to provide L2VPN services, and at the same time keep an IPng Site Local network with IPv4 and IPv6 that is separate from the internet, based on hardware/silicon based forwarding at line rate and high availability. You can read all about my Centec MPLS shenanigans in [this article].\n","permalink":"https://ipng.ch/s/articles/2023/05/07/vpp-mpls-part-1/","section":"articles","title":"VPP MPLS - Part 1"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nI\u0026rsquo;ve been working on the Linux Control Plane [ref], which you can read all about in my series on VPP back in 2021:\n[Part 1]: Punting traffic through TUN/TAP interfaces into Linux [Part 2]: Mirroring VPP interface configuration into Linux [Part 3]: Automatically creating sub-interfaces in Linux [Part 4]: Synchronize link state, MTU and addresses to Linux [Part 5]: Netlink Listener, synchronizing state from Linux to VPP [Part 6]: Observability with LibreNMS and VPP SNMP Agent [Part 7]: Productionizing and reference Supermicro fleet at IPng With this, I can make a regular server running Linux use VPP as kind of a software ASIC for super fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself.
With Linux CP, running software like FRR or Bird on top of VPP and achieving \u0026gt;150Mpps and \u0026gt;180Gbps forwarding rates is easily within reach. If you find that hard to believe, check out [my DENOG14 talk] or click the thumbnail above. I am continuously surprised at the performance per watt, and the performance per Swiss Franc spent.\nMonitoring VPP Of course, it\u0026rsquo;s important to be able to see what routers are doing in production. For the longest time, the de facto standard for monitoring in the networking industry has been Simple Network Management Protocol (SNMP), described in [RFC 1157]. But there\u0026rsquo;s another way, using a metrics and time series system called Borgmon, originally designed by Google [ref] but popularized by Soundcloud in an open source interpretation called Prometheus [ref]. IPng Networks ♥ Prometheus.\nI\u0026rsquo;m a really huge fan of Prometheus and its graphical frontend Grafana, as you can see with my work on Mastodon in [this article]. Join me on [ublog.tech] if you haven\u0026rsquo;t joined the Fediverse yet. It\u0026rsquo;s well monitored!\nSNMP SNMP defines an extensible model by which parts of the OID (object identifier) tree can be delegated to another process, and the main SNMP daemon will call out to it using the AgentX protocol, described in [RFC 2741]. In a nutshell, this allows an external program to connect to the main SNMP daemon, register an interest in certain OIDs, and get called whenever the SNMPd is being queried for them.\nThe flow is pretty simple (see section 6.2 of the RFC), the Agent (client):\nopens a TCP or Unix domain socket to the SNMPd sends an Open PDU, which the server will accept or reject. (optionally) can send a Ping PDU, the server will respond. registers an interest with Register PDU It then waits and gets called by the SNMPd with Get PDUs (to retrieve one single value), GetNext PDU (to enable snmpwalk), GetBulk PDU (to retrieve a whole subsection of the MIB), all of which are answered by a Response PDU.\nUsing parts of a Python Agentx library written by GitHub user hosthvo [ref], I tried my hand at writing one of these AgentX\u0026rsquo;s. The resulting source code is on [GitHub]. That\u0026rsquo;s the one that has been running in production ever since I started running VPP routers at IPng Networks AS8298. After the AgentX exposes the dataplane interfaces and their statistics into SNMP, an open source monitoring tool such as LibreNMS [ref] can discover the routers and draw pretty graphs, as well as detect when interfaces go down, or are overloaded, and so on. That\u0026rsquo;s pretty slick.\nVPP Stats Segment in Go But if I may offer some critique on my own approach, SNMP monitoring is very 1990s. I\u0026rsquo;m continuously surprised that our industry is still clinging to this archaic approach. VPP offers a lot of observability: its statistics segment is chock full of interesting counters and gauges that can be really helpful to understand how the dataplane performs. If there are errors or a bottleneck develops in the router, going over show runtime or show errors can be a life saver. Let\u0026rsquo;s take another look at that Stats Segment (the one that the SNMP AgentX connects to in order to query it for packets/byte counters and interface names).\nYou can think of the Stats Segment as a directory hierarchy where each file represents a type of counter.
VPP comes with a small helper tool called VPP Stats FS, which uses a FUSE based read-only filesystem to expose those counters in an intuitive way, so let\u0026rsquo;s take a look\npim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo systemctl start vpp pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make start pim@hippo:~/src/vpp/extras/vpp_stats_fs$ mount | grep stats rawBridge on /run/vpp/stats_fs_dir type fuse.rawBridge (rw,nosuid,nodev,relatime,user_id=0,group_id=0,allow_other) pim@hippo:/run/vpp/stats_fs_dir$ ls -la drwxr-xr-x 0 root root 0 Apr 9 14:07 bfd drwxr-xr-x 0 root root 0 Apr 9 14:07 buffer-pools drwxr-xr-x 0 root root 0 Apr 9 14:07 err drwxr-xr-x 0 root root 0 Apr 9 14:07 if drwxr-xr-x 0 root root 0 Apr 9 14:07 interfaces drwxr-xr-x 0 root root 0 Apr 9 14:07 mem drwxr-xr-x 0 root root 0 Apr 9 14:07 net drwxr-xr-x 0 root root 0 Apr 9 14:07 node drwxr-xr-x 0 root root 0 Apr 9 14:07 nodes drwxr-xr-x 0 root root 0 Apr 9 14:07 sys pim@hippo:/run/vpp/stats_fs_dir$ cat sys/boottime 1681042046.00 pim@hippo:/run/vpp/stats_fs_dir$ date +%s 1681042058 pim@hippo:~/src/vpp/extras/vpp_stats_fs$ sudo make stop There\u0026rsquo;s lots of really interesting stuff in here - for example in the /sys hierarchy we can see a boottime file, and from there I can determine the uptime of the process. Further, the /mem hierarchy shows the current memory usage for each of the main, api and stats segment heaps. And of course, in the /interfaces hierarchy we can see all the usual packets and bytes counters for any interface created in the dataplane.\nVPP Stats Segment in C I wish I were good at Go, but I never really took to the language. I\u0026rsquo;m pretty good at Python, but sorting through the stats segment isn\u0026rsquo;t super quick as I\u0026rsquo;ve already noticed in the Python3 based [VPP SNMP Agent]. I\u0026rsquo;m probably the world\u0026rsquo;s least terrible C programmer, so maybe I can take a look at the VPP Stats Client and make sense of it. Luckily, there\u0026rsquo;s an example already in src/vpp/app/vpp_get_stats.c and it reveals the following pattern:\nassemble a vector of regular expression patterns in the hierarchy, or just ^/ to start get a handle to the stats segment with stats_segment_ls() using the pattern(s) use the handle to dump the stats segment into a vector with stat_segment_dump(). iterate over the returned stats structure, each element has a type and a given name: STAT_DIR_TYPE_SCALAR_INDEX: these are floating point doubles STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE: single uint32 counter STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED: two uint32 counters freeing the used stats structure with stat_segment_data_free() The simple and combined stats turn out to be associative arrays, the outer of which notes the thread and the inner of which refers to the index. As such, a statistic of type VECTOR_SIMPLE can be decoded like so:\nif (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_SIMPLE) for (k = 0; k \u0026lt; vec_len (res[i].simple_counter_vec); k++) for (j = 0; j \u0026lt; vec_len (res[i].simple_counter_vec[k]); j++) printf (\u0026#34;[%d @ %d]: %llu packets %s\\n\u0026#34;, j, k, res[i].simple_counter_vec[k][j], res[i].name); The statistic of type VECTOR_COMBINED is very similar, except the union type there is a combined_counter_vec[k][j] which has a member .packets and a member called .bytes. The simplest form, SCALAR_INDEX, is just a single floating point number attached to the name.\nIn principle, this should be really easy to sift through and decode. 
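Putting that pattern together end-to-end, here\u0026rsquo;s a minimal sketch of what such a dumper could look like. I\u0026rsquo;m modelling this on the flow in src/vpp/app/vpp_get_stats.c; the exact header location and function signatures vary a little between VPP releases, so consider it an illustration rather than a drop-in program:
#include \u0026lt;stdio.h\u0026gt;
#include \u0026lt;vpp-api/client/stat_client.h\u0026gt;
#include \u0026lt;vppinfra/vec.h\u0026gt;

int main (int argc, char **argv)
{
  /* Connect to the stats segment, by default /run/vpp/stats.sock */
  if (stat_segment_connect (STAT_SEGMENT_SOCKET_FILE) != 0) {
    fprintf (stderr, \u0026#34;could not connect to the VPP stats segment\\n\u0026#34;);
    return 1;
  }
  /* Assemble a vector of patterns; \u0026#34;^/\u0026#34; matches the whole hierarchy */
  u8 **patterns = 0;
  patterns = stat_segment_string_vector (patterns, argc \u0026gt; 1 ? argv[1] : \u0026#34;^/\u0026#34;);
  /* Resolve the patterns to directory entries, then dump those entries */
  uint32_t *dir = stat_segment_ls (patterns);
  stat_segment_data_t *res = stat_segment_dump (dir);
  for (int i = 0; i \u0026lt; vec_len (res); i++) {
    if (res[i].type == STAT_DIR_TYPE_SCALAR_INDEX)
      printf (\u0026#34;%.2f %s\\n\u0026#34;, res[i].scalar_value, res[i].name);
    else if (res[i].type == STAT_DIR_TYPE_COUNTER_VECTOR_COMBINED)
      for (int k = 0; k \u0026lt; vec_len (res[i].combined_counter_vec); k++)
        for (int j = 0; j \u0026lt; vec_len (res[i].combined_counter_vec[k]); j++)
          printf (\u0026#34;[%d @ %d]: %llu packets, %llu bytes %s\\n\u0026#34;, j, k,
                  res[i].combined_counter_vec[k][j].packets,
                  res[i].combined_counter_vec[k][j].bytes, res[i].name);
    /* simple counters are walked the same way, as in the snippet above */
  }
  stat_segment_data_free (res);
  stat_segment_disconnect ();
  return 0;
}
Linked against VPP\u0026rsquo;s stats client library, something shaped like this produces output in the same form as the vpp_get_stats dumps below, which is exactly the raw material the exporter needs.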
Now that I\u0026rsquo;ve figured that out, let me dump a bunch of stats with the vpp_get_stats tool that comes with vanilla VPP:\npim@chrma0:~$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep -v \u0026#39;: 0\u0026#39; [0 @ 2]: 67057 packets /interfaces/TenGigabitEthernet81_0_0.40121/drops [0 @ 2]: 76125287 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip4 [0 @ 2]: 1793946 packets /interfaces/TenGigabitEthernet81_0_0.40121/ip6 [0 @ 2]: 77919629 packets, 66184628769 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx [0 @ 0]: 7 packets, 610 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx [0 @ 1]: 26687 packets, 18771919 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx [0 @ 2]: 6448944 packets, 3663975508 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx [0 @ 3]: 138924 packets, 20599785 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx [0 @ 4]: 130720342 packets, 57436383614 bytes /interfaces/TenGigabitEthernet81_0_0.40121/tx I can see both types of counter at play here, let me explain the first line: it is saying that the counter of name /interfaces/TenGigabitEthernet81_0_0.40121/drops, at counter index 0, CPU thread 2, has a simple counter with value 67057. Taking the last line, this is a combined counter type with name /interfaces/TenGigabitEthernet81_0_0.40121/tx at index 0, all five CPU threads (the main thread and four worker threads) have all sent traffic into this interface, and the counters for each in packets and bytes is given.\nFor readability\u0026rsquo;s sake, my grep -v above doesn\u0026rsquo;t print any counter that is 0. For example, interface Te81/0/0 has only one receive queue, and it\u0026rsquo;s bound to thread 2. The other threads will not receive any packets for it, consequently their rx counters stay zero:\npim@chrma0:~/src/vpp$ vpp_get_stats dump /interfaces/TenGig.*40121 | grep rx$ [0 @ 0]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx [0 @ 1]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx [0 @ 2]: 80720186 packets, 68458816253 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx [0 @ 3]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx [0 @ 4]: 0 packets, 0 bytes /interfaces/TenGigabitEthernet81_0_0.40121/rx Hierarchy: Pattern Matching I quickly discover a pattern in most of these names: they start with a scope, say /interfaces, then have a path entry for the interface name, and finally a specific counter (/rx or /mpls). 
This is true also for the /nodes hierarchy, in which all VPP\u0026rsquo;s graph nodes have a set of counters:\npim@chrma0:~$ vpp_get_stats dump /nodes/ip4-lookup | grep -v \u0026#39;: 0\u0026#39; [0 @ 1]: 11365675493301 packets /nodes/ip4-lookup/clocks [0 @ 2]: 3256664129799 packets /nodes/ip4-lookup/clocks [0 @ 3]: 28364098623954 packets /nodes/ip4-lookup/clocks [0 @ 4]: 30198798628761 packets /nodes/ip4-lookup/clocks [0 @ 1]: 80870763789 packets /nodes/ip4-lookup/vectors [0 @ 2]: 17392446654 packets /nodes/ip4-lookup/vectors [0 @ 3]: 259363625369 packets /nodes/ip4-lookup/vectors [0 @ 4]: 298176625181 packets /nodes/ip4-lookup/vectors [0 @ 1]: 49730112811 packets /nodes/ip4-lookup/calls [0 @ 2]: 13035172295 packets /nodes/ip4-lookup/calls [0 @ 3]: 109088424231 packets /nodes/ip4-lookup/calls [0 @ 4]: 119789874274 packets /nodes/ip4-lookup/calls If you\u0026rsquo;ve ever seen the output of show runtime, it looks like this:\nvpp# show runtime Thread 1 vpp_wk_0 (lcore 28) Time 3377500.2, 10 sec internal node vector rate 1.46 loops/sec 3301017.05 vector rates in 2.7440e6, out 2.7210e6, drop 3.6025e1, punt 7.2243e-5 Name State Calls Vectors Suspends Clocks Vectors/Call ... ip4-lookup active 49732141978 80873724903 0 1.41e2 1.63 Hey look! On thread 1, which is called vpp_wk_0 and is running on logical CPU core #28, there are a bunch of VPP graph nodes that are all keeping stats of what they\u0026rsquo;ve been doing, and you can see here that the following numbers line up between show runtime and the VPP Stats dumper:\nName: This is the name of the VPP graph node, in this case ip4-lookup, which is performing an IPv4 FIB lookup to figure out what the L3 nexthop is of a given IPv4 packet we\u0026rsquo;re trying to route. Calls: How often did we invoke this graph node, 49.7 billion times so far. Vectors: How many packets did we push through, 80.87 billion, humble brag. Clocks: This one is a bit different \u0026ndash; you can see the cumulative clock cycles spent by this CPU thread in the stats dump: 11365675493301 divided by 80870763789 packets is 140.54 CPU cycles per packet. It\u0026rsquo;s a cool interview question \u0026ldquo;How many CPU cycles does it take to do an IPv4 routing table lookup\u0026rdquo;. You now know the answer :-) Vectors/Call: This is a measure of how busy the node is (did it run for only one packet, or for many packets?). On average when the worker thread gave the ip4-lookup node some work to do, there have been a total of 80873724903 packets handled in 49732141978 calls, so 1.626 packets per call. If ever you\u0026rsquo;re handling 256 packets per call (the most VPP will allow per call), your router will be sobbing. Prometheus Metrics Prometheus has metrics which carry a name, and zero or more labels. The prometheus query language can then use these labels to do aggregation, division, averages, and so on. As a practical example, above I looked at interface stats and saw that the Rx/Tx numbers were counted one per thread. If we\u0026rsquo;d like the total on the interface, it would be great if we could sum without (thread,index), which will have the effect of adding all of these numbers together. For the monotonically increasing counter numbers (like the total vectors/calls/clocks per node), we can take the running rate of change, showing the time spent over the last minute, or so.
This way, spikes in traffic will clearly correlate both with a spike in packets/sec or bytes/sec on the interface and with a higher number of vectors/call, and correspondingly typically a lower number of clocks/vector, as VPP gets more efficient when it can re-use the CPU\u0026rsquo;s instruction and data cache to do repeat work on multiple packets.\nI decide to massage the statistic names a little bit, by transforming them into the basic format: prefix_suffix{label=\u0026quot;X\u0026quot;,index=\u0026quot;A\u0026quot;,thread=\u0026quot;B\u0026quot;} value\nA few examples:\nThe single counter that looks like [6 @ 0]: 994403888 packets /mem/main heap becomes: mem{heap=\u0026quot;main heap\u0026quot;,index=\u0026quot;6\u0026quot;,thread=\u0026quot;0\u0026quot;} The combined counter [0 @ 1]: 79582338270 packets, 16265349667188 bytes /interfaces/Te1_0_2/rx becomes: interfaces_rx_packets{interface=\u0026quot;Te1_0_2\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;1\u0026quot;} 79582338270 interfaces_rx_bytes{interface=\u0026quot;Te1_0_2\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;1\u0026quot;} 16265349667188 The node information running on, say thread 4, becomes: nodes_clocks{node=\u0026quot;ip4-lookup\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;4\u0026quot;} 30198798628761 nodes_vectors{node=\u0026quot;ip4-lookup\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;4\u0026quot;} 298176625181 nodes_calls{node=\u0026quot;ip4-lookup\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;4\u0026quot;} 119789874274 nodes_suspends{node=\u0026quot;ip4-lookup\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;4\u0026quot;} 0 VPP Exporter I wish I had things like split() and re.match() but in C (well, I guess I do have POSIX regular expressions\u0026hellip;), but it\u0026rsquo;s all a little bit more low level. Based on my basic loop that opens the stats segment, registers its desired patterns, and then retrieves a vector of {name, type, counter}-tuples, I decide to do a little bit of non-intrusive string tokenization first:\nstatic int tokenize (const char *str, char delimiter, char **tokens, int *lengths) { char *p = (char *) str; char *savep = p; int i = 0; while (*p) if (*p == delimiter) { tokens[i] = (char *) savep; lengths[i] = (int) (p - savep); i++; p++; savep = p; } else p++; tokens[i] = (char *) savep; lengths[i] = (int) (p - savep); return i++; } /* The call site */ char *tokens[10]; int lengths[10]; int num_tokens = tokenize (res[i].name, \u0026#39;/\u0026#39;, tokens, lengths); The tokenizer takes an array of N pointers to the resulting tokens, and their lengths. This sets it apart from strtok() and friends, because those will overwrite the occurrences of the delimiter in the input string with \\0, and as such cannot take a const char *str as input. This one leaves the string alone though, and will return the tokens as {ptr, len}-tuples, including how many tokens it found.\nOne thing I\u0026rsquo;ll probably regret is that there\u0026rsquo;s no bounds checking on the number of tokens \u0026ndash; if I have more than 10 of these, I\u0026rsquo;ll come to regret it. But for now, the depth of the hierarchy is only 3, so I should be fine. Besides, I got into a fight with ChatGPT after it declared a romantic interest in my cat, so it won\u0026rsquo;t write code for me anymore :-(\nBut using this simple tokenizer, and knowledge of the structure of well known hierarchy paths, the rest of the exporter is quickly in hand.
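To show where this is going, here\u0026rsquo;s a tiny hypothetical helper (not the actual exporter source) that takes the {ptr, len}-tuples which tokenize() returns for a three-level path like /interfaces/Te1_0_2/rx, and glues them back together into a Prometheus metric name and label value:
#include \u0026lt;stdio.h\u0026gt;

/* Hypothetical helper: tokens[0] is the empty string in front of the leading
 * slash, so the scope lives in tokens[1], the instance (interface or node
 * name) in tokens[2], and the specific counter in tokens[3]. */
static void
format_metric (char **tokens, int *lengths,
               char *name, int name_len, char *label, int label_len)
{
  /* prefix_suffix, for example interfaces_rx or nodes_clocks */
  snprintf (name, name_len, \u0026#34;%.*s_%.*s\u0026#34;,
            lengths[1], tokens[1], lengths[3], tokens[3]);
  /* the middle token becomes the label value, for example Te1_0_2 */
  snprintf (label, label_len, \u0026#34;%.*s\u0026#34;, lengths[2], tokens[2]);
}
Together with the index and thread of the counter and its packets/bytes values, that\u0026rsquo;s all the information needed to print the interfaces_rx_packets{interface=\u0026quot;Te1_0_2\u0026quot;,index=\u0026quot;0\u0026quot;,thread=\u0026quot;1\u0026quot;} lines shown above; the real exporter of course has to special-case hierarchies that are shallower or deeper than three levels.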
Some variables don\u0026rsquo;t have a label (for example /sys/boottime), but those that do will see that field transposed from the directory path /mem/main heap/free into the label as I showed above.\nResults With this VPP Prometheus Exporter, I can now hook the VPP routers up to Prometheus and Grafana. Aggregations in Grafana are super easy and scalable, due to the conversion of the static paths into dynamically created labels on the prometheus metric names.\nDrawing a graph of the running time spent by each individual VPP graph node might look something like this:\nsum without (thread,index)(rate(nodes_clocks[60s])) / sum without (thread,index)(rate(nodes_vectors[60s])) The plot to the right shows a system under a loadtest that ramps up from 0% to 100% of line rate, and the traces are the cumulative time spent in each node (on a logarithmic scale). The top purple line represents dpdk-input. When a VPP dataplane is idle, the worker threads will be repeatedly polling DPDK to ask it if it has something to do, spending 100% of their time being told \u0026ldquo;there is nothing for you to do\u0026rdquo;. But, once load starts appearing, the other nodes start spending CPU time, for example the chain of IPv4 forwarding is ethernet-input, ip4-input, ip4-lookup, followed by ip4-rewrite and ultimately the packet is transmitted on some other interface. When the system is lightly loaded, the ethernet-input node for example will spend 1100 or so CPU cycles per packet, but when the machine is under higher load, the time spent will decrease to as low as 22 CPU cycles per packet. This is true for almost all of the nodes - VPP gets relatively more efficient under load.\nAnother cool graph that I won\u0026rsquo;t be able to see when using only LibreNMS and SNMP polling, is how busy the router is. In VPP, each dispatch of the worker loop will poll DPDK and dispatch the packets through the directed graph of nodes that I showed above. But how many packets can be moved through the graph per CPU? The largest number of packets that VPP will ever offer into a call of the nodes is 256. Typically an unloaded machine will have an average number of Vectors/Call of around 1.00. When the worker thread is loaded, it may sit at around 130-150 Vectors/Call. If it\u0026rsquo;s saturated, it will quickly shoot up to 256.\nAs a good approximation, Vectors/Call normalized to 100% will be an indication of how busy the dataplane is. In the picture above, between 10:30 and 11:00 my test router was pushing about 180Gbps of traffic, but with large packets so its total vectors/call was modest (roughly 35-40), which you can see as all threads there are running in the ~25% load range. Then at 11:00 a few threads got hotter, and one of them completely saturated, and the traffic being forwarded by the CPU thread was suffering packetlo, even though the others were absolutely fine\u0026hellip; forwarding 150Mpps on a 10 year old Dell R720!\nWhat\u0026rsquo;s Next Together with the graph above, I can also see how many CPU cycles are spent in which type of operation. For example, encapsulation of GENEVE or VxLAN is not free, although it\u0026rsquo;s also not very expensive. If I know how many CPU cycles are available (roughly the clock speed of the CPU threads, in our case Xeon X1518 (2.2GHz) or Xeon E5-2683 v4 CPUs (3GHz)), I can pretty accurately calculate what a given mix of traffic and features is going to cost, and how many packets/sec our routers at IPng will be able to forward. Spoiler alert: it\u0026rsquo;s way more than currently needed.
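To put some entirely hypothetical numbers on that: a worker thread on a 2.2GHz CPU has about 2.2 billion cycles per second to spend, so if a particular mix of features were to cost, say, 200 CPU cycles per packet (a made-up but not unreasonable figure given the per-node numbers above), that one worker could forward roughly 2.2e9 / 200 ≈ 11 million packets per second, and the budget scales more or less linearly with the number of worker threads.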
Our supermicros can handle roughly 35Mpps each, and considering a regular mixture of internet traffic (called imix) is about 3Mpps per 10G, I will have room to spare for the time being\nThis is super valuable information for folks running VPP in production. I haven\u0026rsquo;t put the finishing touches on the VPP Prometheus Exporter, for example there are no commandline flags yet, it doesn\u0026rsquo;t listen on any port other than 9482 (the same one that the toy exporter in src/vpp/app/vpp_prometheus_export.c ships with [ref]). My grafana dashboard is also not fully completed yet. I hope to get that done in April, and publish both the exporter and the dashboard on GitHub. Stay tuned!\n","date":"2023-04-09","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation service router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nI\u0026rsquo;ve been working on the Linux Control Plane [ref], which you can read all about in my series on VPP back in 2021:\n","permalink":"https://ipng.ch/s/articles/2023/04/09/vpp-monitoring/","section":"articles","title":"VPP - Monitoring"},{"contents":"Last week I shared how IPng Networks deployed a loadbalanced frontend cluster of NGINX webservers that have public IPv4 / IPv6 addresses, but talk to a bunch of internal webservers that are in a private network which isn\u0026rsquo;t directly connected to the internet, so called IPng Site Local [ref] with addresses 198.19.0.0/16 and 2001:678:d78:500::/56.\nI wrote in [that article] that IPng will be using ACME HTTP-01 validation, which asks the certificate authority, in this case Let\u0026rsquo;s Encrypt, to contact the webserver on a well-known URI for each domain that I\u0026rsquo;m requesting a certificate for. Unsurprisingly, several folks reached out to me asking \u0026ldquo;well what about DNS-01\u0026rdquo;, and one sentence caught their eye:\nSome SSL certificate providers allow for wildcards (ie. *.ipng.ch), but I\u0026rsquo;m going to keep it relatively simple and use [Let\u0026rsquo;s Encrypt] which offers free certificates with a validity of three months.\nI could\u0026rsquo;ve seen this one coming! The sentence can be read to imply it doesn\u0026rsquo;t, but of course Let\u0026rsquo;s Encrypt offers wildcard certificates. It just doesn\u0026rsquo;t satisfy my relatively simple qualifier of the second part of the sentence \u0026hellip; So here I go, down the rabbit hole that is understanding (for myself, and possibly for readers of this article), how the DNS-01 challenge works, in greater detail. Hopefully after writing this (me) and reading this (you), we can all agree that I was wrong, and that using DNS-01 is relatively simple after all.\nOverview I\u0026rsquo;ve installed three frontend NGINX servers (running at Coloclue AS8283, IPng AS8298 and IP-Max AS25091), and one LEGO certificate machine (running in the internal IPng Site Local network). In the [previous article], I described the setup and the use of Let\u0026rsquo;s Encrypt with HTTP-01 challenges. I\u0026rsquo;ll skip that here.\nHTTP-01 vs DNS-01 Today, most SSL authorities and their customers use the Automatic Certificate Management Environment or ACME protocol which is described in [RFC8555]. 
It defines a way for certificate authorities to check the websites that they are asked to issue a certificate for, using so-called challenges. One popular challenge is the so-called HTTP-01, in which the certificate authority will visit a well-known URI on the website domain for which the certificate is being requested, namely /.well-known/acme-challenge/, which is described in [RFC5785]. The CA will expect the webserver to respond with an agreed upon string of numbers at that location, in which case proof of ownership is established and a certificate is issued.\nIn some situations, this HTTP-01 challenge can be difficult to perform:\nIf the webserver is not reachable from the internet, or not reachable from the Let\u0026rsquo;s Encrypt servers, for example if it is on an intranet, such as IPng Site Local itself. If the operator would prefer a wildcard certificate, proving ownership of all possible sub-domains is no longer feasible with HTTP-01 but proving ownership of the parent domain is. One possible solution for these cases is to use the ACME challenge DNS-01, which doesn\u0026rsquo;t use the webserver running on go.ipng.ch to prove ownership, but the nameserver that serves ipng.ch instead. The Let\u0026rsquo;s Encrypt Go client [ref] supports both challenge types.\nThe flow of requests in a DNS-01 challenge is as follows:\nFirst, the LEGO client registers itself with the ACME-DNS server running on auth.ipng.ch. After successful registration, LEGO is given a username, password, and access to one DNS recordname $(RRNAME). It is expected that the operator sets up a CNAME for a well-known record _acme-challenge.ipng.ch which points to that $(RRNAME).auth.ipng.ch. This happens only once.\nWhen a certificate is needed, the LEGO client contacts the Certificate Authority and requests validation for the hostname go.ipng.ch. The CA will inform the client of a random number $(RANDOM) that it expects to see in a well-known TXT record for _acme-challenge.ipng.ch (which is the CNAME set up previously).\nThe LEGO client now uses the username and password it received in step 1, to update the TXT record of its $(RRNAME).auth.ipng.ch record to contain the $(RANDOM) number it learned in step 2.\nThe CA will issue a TXT query for _acme-challenge.ipng.ch, which is a CNAME to $(RRNAME).auth.ipng.ch, which ultimately responds to the TXT query with the $(RANDOM) number.\nAfter validating that the response on the TXT record contains the agreed upon random number, the CA knows that the operator of the nameserver is the same as the certificate requestor for the domain. It issues a certificate to the LEGO client, which stores it on its local filesystem.\nSimilar to any other challenge, the LEGO machine can now distribute the private key and certificate to all NGINX machines, which are now capable of serving SSL traffic under the given names.\nOne thing worth noting is that the TXT query is for domain names, not hostnames, in other words, anything in the ipng.ch domain will solicit a query to _acme-challenge.ipng.ch by the DNS-01 challenge.
It is for this reason that the challenge allows for wildcard certificates, which can greatly reduce operational complexity and the total number of certificates needed.\nACME DNS Originally, DNS providers were expected to give the ability for their clients to directly update the well-known _acme-challenge TXT record, and while many commercial providers allow for this, IPng Networks runs just plain-old [NSD] as authoritative nameservers (shown above as nsd0, nsd1 and nsd2). So what to do? Luckily, it was quickly understood by the community that if there is a lookup for the TXT record of _acme-challenge.ipng.ch, it would be absolutely OK to make some form of DNS-symlink by means of a CNAME.\nOne really great solution that leverages this ability is written by Joona Hoikkala, called [ACME-DNS]. Its sole purpose is to allow for an API, served over https, to register new clients, let those clients update their TXT record(s), and then serve them out in DNS. It\u0026rsquo;s meant to be a multi-tenant system, by which I mean one ACME-DNS instance can host millions of domains from thousands of distinct users.\nInstalling I noticed that ACME-DNS relies on features in relatively modern Go, and the standard version that comes with Debian Bullseye is a tad old, so first I need to install Go v1.19 from backports, before I can continue with the build of the binary:\nlego@lego:~$ sudo apt -t bullseye-backports install golang lego@lego:~/src$ git clone https://github.com/joohoi/acme-dns lego@lego:~/src/acme-dns$ export GOPATH=/tmp/acme-dns lego@lego:~/src/acme-dns$ go build lego@lego:~/src/acme-dns$ sudo cp acme-dns /usr/local/bin/acme-dns lego@lego:~/src/acme-dns$ cat \u0026lt;\u0026lt; EOF | sudo tee /lib/systemd/system/acme-dns.service [Unit] Description=Limited DNS server with RESTful HTTP API to handle ACME DNS challenges easily and securely After=network.target [Service] User=lego Group=lego AmbientCapabilities=CAP_NET_BIND_SERVICE WorkingDirectory=~ ExecStart=/usr/local/bin/acme-dns -c /home/lego/acme-dns/config.cfg Restart=on-failure [Install] WantedBy=multi-user.target EOF This authoritative nameserver will want to listen on UDP and TCP ports 53, for which it either needs to run as root, or perhaps better, run as non-privileged user with the CAP_NET_BIND_SERVICE capability. The only other difference with the provided unit file is that I\u0026rsquo;ll be running this as the lego user, with a configuration file and working path in its home-directory.\nConfiguring Step 1. Delegate auth.ipng.ch\nThe first thing I should do is configure the subdomain for ACME-DNS, which I decide will be hosted on auth.ipng.ch. I assign it an NS, an A and a AAAA record, and then update the ipng.ch domain:\n$ORIGIN ipng.ch. $TTL 86400 @ IN SOA ns.paphosting.net. hostmaster.ipng.ch. ( 2023032401 28800 7200 604800 86400) NS ns.paphosting.nl. NS ns.paphosting.net. NS ns.paphosting.eu. ; ACME DNS auth NS auth.ipng.ch. A 194.1.163.93 AAAA 2001:678:d78:3::93 This snippet will make a DNS delegation for sub-domain auth.ipng.ch to the server also called auth.ipng.ch and because the downstream delegation is in the same domain, I need to provide glue records that tell clients who are querying for auth.ipng.ch where to find that nameserver. At this point, any request for *.auth.ipng.ch will end up being forwarded to the authoritative nameserver, which can be found at either 194.1.163.93 or 2001:678:d78:3::93.\nStep 2.
Start ACME DNS\nAfter having built the acme-dns server and given it a suitable systemd unit file, and knowing that it\u0026rsquo;s going to be responsible for the sub-domain auth.ipng.ch, I give it the following straight forward configuration file:\nlego@lego:~$ mkdir ~/acme-dns/ lego@lego:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; acme-dns/config.cfg [general] listen = \u0026#34;[::]:53\u0026#34; protocol = \u0026#34;both\u0026#34; domain = \u0026#34;auth.ipng.ch\u0026#34; nsname = \u0026#34;auth.ipng.ch\u0026#34; nsadmin = \u0026#34;hostmaster.ipng.ch\u0026#34; records = [ \u0026#34;auth.ipng.ch. NS auth.ipng.ch.\u0026#34;, \u0026#34;auth.ipng.ch. A 194.1.163.93\u0026#34;, \u0026#34;auth.ipng.ch. AAAA 2001:678:d78:3::93\u0026#34;, ] debug = false [database] engine = \u0026#34;sqlite3\u0026#34; connection = \u0026#34;/home/lego/acme-dns/acme-dns.db\u0026#34; [api] ip = \u0026#34;[::]\u0026#34; disable_registration = false port = \u0026#34;443\u0026#34; tls = \u0026#34;letsencrypt\u0026#34; acme_cache_dir = \u0026#34;/home/lego/acme-dns/api-certs\u0026#34; notification_email = \u0026#34;hostmaster+dns-auth@ipng.ch\u0026#34; corsorigins = [ \u0026#34;*\u0026#34; ] use_header = false header_name = \u0026#34;X-Forwarded-For\u0026#34; [logconfig] loglevel = \u0026#34;debug\u0026#34; logtype = \u0026#34;stdout\u0026#34; logformat = \u0026#34;text\u0026#34; EOF lego@lego:~$ sudo systemctl enable acme-dns lego@lego:~$ sudo systemctl start acme-dns The first part of this tells the server how to construct the SOA record (domain, nsname and nsadmin), and which records to put in the apex, nominally the NS/A/AAAA records that describe the nameserver which is authoritative for the auth.ipng.ch domain. Then, the database part is where user credentials will be stored, and the API portion shows how users will be able to interact with the controlplane part of the service, notably registering new clients, and updating nameserver TXT records for existing clients.\nInterestingly, the API is served on HTTPS port 443, and for that it needs, you guessed it, a certificate! ACME-DNS eats its own dogfood, which I can appreciate: it will use DNS-01 validation to get a certificate for auth.ipng.ch itself, by serving the challenge for well known record _acme-challenge.auth.ipng.ch, so it\u0026rsquo;s turtles all the way down!\nStep 3. Register a new client\nSeeing as many public DNS providers allow programmatic setting of the contents of the zonefiles, for them it\u0026rsquo;s a matter of directly being driven by LEGO. But for me, running NSD, I am going to be using the ACME DNS server to fulfill that purpose, so I have to configure it to do that for me.\nIn the explanation of DNS-01 challenges above, you\u0026rsquo;ll remember I made a mention of registering. Here\u0026rsquo;s a closer look at what that means:\nlego@lego:~$ curl -s -X POST https://auth.ipng.ch/register | json_pp { \u0026#34;allowfrom\u0026#34; : [], \u0026#34;fulldomain\u0026#34; : \u0026#34;76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch\u0026#34;, \u0026#34;password\u0026#34; : \u0026#34;\u0026lt;redacted\u0026gt;\u0026#34;, \u0026#34;subdomain\u0026#34; : \u0026#34;76f88564-740b-4483-9bc0-86d1fb531e20\u0026#34;, \u0026#34;username\u0026#34; : \u0026#34;e4608fdf-9a69-4930-8cf1-57218738792d\u0026#34; } What happened here is that, using the HTTPS endpoint, I asked the ACME-DNS server to create for me an empty DNS record, which it did on 76f88564-740b-4483-9bc0-86d1fb531e20.auth.ipng.ch. 
Further, if I offer the given username and password, I am able to update that record\u0026rsquo;s value. Let\u0026rsquo;s take a look:\nlego@lego:~$ dig +short TXT 02e3acfc-bbca-46bb-9cee-8eab52c73c30.auth.ipng.ch lego@lego:~$ curl -s -X POST -H \u0026#34;X-Api-User: 5f3591d1-0d13-4816-a329-7965a8639ab5\u0026#34; \\ -H \u0026#34;X-Api-Key: \u0026lt;redacted\u0026gt;\u0026#34; \\ -d \u0026#39;{\u0026#34;subdomain\u0026#34;: \u0026#34;02e3acfc-bbca-46bb-9cee-8eab52c73c30\u0026#34;, \\ \u0026#34;txt\u0026#34;: \u0026#34;___Hello_World_token_______________________\u0026#34;}\u0026#39; \\ https://auth.ipng.ch/update Numbers everywhere, but I learned a lot here! Notice how the first time I sent the dig request for the 02e3acfc-bbca-46bb-9cee-8eab52c73c30.auth.ipng.ch it did not respond anything (an empty record). But then, using the username/password I could update the record with a 41 character string, and I was informed of the fulldomain key there, which is the one that I should be configuring in the domain(s) for which I want to get a certificate.\nI configure it in the ipng.ch and ipng.nl domain as follows (taking ipng.nl as an example):\n$ORIGIN ipng.nl. $TTL 86400 @ IN SOA ns.paphosting.net. hostmaster.ipng.nl. ( 2023032401 28800 7200 604800 86400) IN NS ns.paphosting.nl. IN NS ns.paphosting.net. IN NS ns.paphosting.eu. CAA 0 issue \u0026#34;letsencrypt.org\u0026#34; CAA 0 issuewild \u0026#34;letsencrypt.org\u0026#34; CAA 0 iodef \u0026#34;mailto:hostmaster@ipng.ch\u0026#34; _acme-challenge CNAME 8ee2969b-571c-4b3a-b6a0-6d6221130c96.auth.ipng.ch. The records here are a CAA which is a type of DNS record used to provide additional confirmation for the Certificate Authority when validating an SSL certificate. This record allows me to specify which certificate authorities are authorized to deliver SSL certificates for the domain. Then, the well known _acme-challenge.ipng.nl record is merely telling the client by means of a CNAME to go ask for 8ee2969b-571c-4b3a-b6a0-6d6221130c96.auth.ipng.ch instead.\nPutting this part all together now, I can issue a query for that ipng.nl domain \u0026hellip;\nlego@lego:~$ dig +short TXT _acme-challenge.ipng.nl. \u0026#34;___Hello_World_token_______________________\u0026#34; \u0026hellip; and would you look at that! The query for the ipng.nl domain, is a CNAME to the specific uuid record in the auth.ipng.ch domain, where ACME-DNS is serving it with the response that I can programmatically set to different values, yee-haw!\nStep 4. Run LEGO\nThe LEGO client has all sorts of challenge providers linked in. Once again, Debian is a bit behind on things, shipping version 3.2.0-3.1+b5 in Bullseye, although upstream is much further along. So I purge the Debian package and download the v4.10.2 amd64 package directly from its [Github] releases page. The ACME-DNS handler was only added in v4 of the client. 
But now all that\u0026rsquo;s left for me to do is run it:\nlego@lego:~$ export ACME_DNS_API_BASE=https://auth.ipng.ch/ lego@lego:~$ export ACME_DNS_STORAGE_PATH=/home/lego/acme-dns/credentials.json lego@lego:~$ /home/lego/bin/lego --path /etc/lego/ --email noc@ipng.ch --accept-tos --dns acme-dns \\ --domains ipng.ch --domains *.ipng.ch \\ --domains ipng.nl --domains *.ipng.nl \\ run The LEGO client goes through the ACME flow that I described at the top of this article, and ends up spitting out a certificate \\o/\nlego@lego:~$ openssl x509 -noout -text -in /etc/lego/certificates/ipng.ch.crt Certificate: Data: Version: 3 (0x2) Serial Number: 03:58:8f:c1:25:00:e2:f3:d3:3f:d6:ed:ba:bc:1d:0d:54:ea Signature Algorithm: sha256WithRSAEncryption Issuer: C = US, O = Let\u0026#39;s Encrypt, CN = R3 Validity Not Before: Mar 21 20:24:08 2023 GMT Not After : Jun 19 20:24:07 2023 GMT Subject: CN = ipng.ch X509v3 extensions: X509v3 Subject Alternative Name: DNS:*.ipng.ch, DNS:*.ipng.nl, DNS:ipng.ch, DNS:ipng.nl Et voila! Wildcard certificates for multiple domains using ACME-DNS.\n","date":"2023-03-24","desc":"Last week I shared how IPng Networks deployed a loadbalanced frontend cluster of NGINX webservers that have public IPv4 / IPv6 addresses, but talk to a bunch of internal webservers that are in a private network which isn\u0026rsquo;t directly connected to the internet, so called IPng Site Local [ref] with addresses 198.19.0.0/16 and 2001:678:d78:500::/56.\nI wrote in [that article] that IPng will be using ACME HTTP-01 validation, which asks the certificate authority, in this case Let\u0026rsquo;s Encrypt, to contact the webserver on a well-known URI for each domain that I\u0026rsquo;m requesting a certificate for. Unsurprisingly, several folks reached out to me asking \u0026ldquo;well what about DNS-01\u0026rdquo;, and one sentence caught their eye:\n","permalink":"https://ipng.ch/s/articles/2023/03/24/case-study-lets-encrypt-dns-01/","section":"articles","title":"Case Study: Let's Encrypt DNS-01"},{"contents":"A while ago I rolled out an important change to the IPng Networks design: I inserted a bunch of [Centec MPLS] and IPv4/IPv6 capable switches underneath [AS8298], which gave me two specific advantages:\nThe entire IPng network is now capable of delivering L2VPN services, taking the form of MPLS point-to-point ethernet, and VPLS, as shown in a previous [deep dive], in addition to IPv4 and IPv6 transit provided by VPP in an elaborate and elegant [BGP Routing Policy].\nA new internal private network becomes available to any device connected IPng switches, with addressing in 198.19.0.0/16 and 2001:678:d78:500::/56. This network is completely isolated from the Internet, with access controlled via N+2 redundant gateways/firewalls, described in more detail in a previous [deep dive] as well.\nOverview After rolling out this spiffy BGP Free [MPLS Core], I wanted to take a look at maybe conserving a few IP addresses here and there, as well as tightening access and protecting the more important machines that IPng Networks runs. You see, most enterprise networks will include a bunch of internal services, like databases, network attached storage, backup servers, network monitoring, billing/CRM et cetera. IPng Networks is no different.\nSomewhere between the sacred silence and sleep, lives my little AS8298. It\u0026rsquo;s a gnarly and toxic place out there in the DFZ, how do you own disorder?\nConnectivity As a refresher, here\u0026rsquo;s the current situation at IPng Networks:\n1. 
Site Local Connectivity\nEach switch gets what is called an IPng Site Local (or ipng-sl) interface. This is a /27 IPv4 and a /64 IPv6 that is bound on a local VLAN on each switch on our private network. Remember: the links between sites are no longer switched, they are routed and pass ethernet frames only using MPLS. I can, for example, connect all of the fleet\u0026rsquo;s hypervisors to this internal network using jumboframes, with addresses from 198.19.0.0/16 and 2001:678:d78:500::/56, which are not connected to the internet.\n2. Egress Connectivity\nThere are three geographically diverse gateways that inject an OSPF E1 default route into the Centec Site Local network, and they will provide NAT for IPv4 and IPv6 to the internet. This setup allows all machines in the internal private network to reach the internet, using their closest gateway. Failing over between gateways is fully automatic: when one is unavailable or down for maintenance, the network will simply find the next-closest gateway.\n3. Ingress Connectivity\nInbound traffic (from the internet to IPng Site Local) is held at the gateways. First of all, the reserved IPv4 space 198.18.0.0/15 is a bogon and will not be routed on the public internet, but our VPP routers in AS8298 do carry the route albeit with the well-known BGP no-export community set, so traffic could arrive at the gateway coming from our own network only. This is not true for IPv6, because here our prefix is a part of the AS8298 IPv6 PI space, and traffic will be globally routable. Even then, only very few prefixes are allowed to enter into the IPng Site Local private network, nominally only our NOC prefixes, one or two external bastion hosts, and our own Wireguard endpoints which are running on the gateways.\nFrontend Setup One of my goals for the private network is IPv4 conservation. I decided to move our web-frontends to be dual-homed: one network interface towards the internet using public IPv4 and IPv6 addresses, and another network interface that finds backend servers in the IPng Site Local private network.\nThis way, I can have one NGINX instance (or a pool of them), terminate the HTTP/HTTPS connection (there\u0026rsquo;s an InfraSec joke about SSL is inserted and removed here :)), no matter how many websites, domains, or physical webservers I want to use. Some SSL certificate providers allow for wildcards (ie. *.ipng.ch), but I\u0026rsquo;m going to keep it relatively simple and use [Let\u0026rsquo;s Encrypt] which offers free certificates with a validity of three months.\nInstalling NGINX First, I will install three minimal VMs with Debian Bullseye on separate hypervisors (in Rümlang chrma0, Plan-les-Ouates chplo0 and Amsterdam nlams1), giving them each 4 CPUs, a 16G blockdevice on the hypervisor\u0026rsquo;s ZFS (which is automatically snapshotted and backed up offsite using ZFS replication!), and 1GB of memory. These machines will be the IPng Frontend servers, and handle all client traffic to our web properties.
Their job is to forward that HTTP/HTTPS traffic internally to webservers that are running entirely in the IPng Site Local (private) network.\nI\u0026rsquo;ll install a few tablestakes packages on them, taking nginx0.chrma0 as an example:\npim@nginx0-chrma0:~$ sudo apt install nginx iptables ufw rsync pim@nginx0-chrma0:~$ sudo ufw allow 80 pim@nginx0-chrma0:~$ sudo ufw allow 443 pim@nginx0-chrma0:~$ sudo ufw allow from 198.19.0.0/16 pim@nginx0-chrma0:~$ sudo ufw allow from 2001:678:d78:500::/56 pim@nginx0-chrma0:~$ sudo ufw enable Installing Lego Next, I\u0026rsquo;ll install one more highly secured minimal VM with Debian Bullseye, giving it 1 CPU, a 16G blockdevice and 1GB of memory. This is where my Let\u0026rsquo;s Encrypt SSL certificate store will live. This machine does not need to be publicly available, so it will only get one interface, connected to the IPng Site Local network, so it\u0026rsquo;ll be using private IPs.\nThis virtual machine really is bare-bones, it only gets a firewall, rsync, and the lego package. It doesn\u0026rsquo;t technically even need to run SSH, because I can log into serial console using the hypervisor. Considering it\u0026rsquo;s an internal-only server (not connected to the internet), but also because I do believe in OpenSSH\u0026rsquo;s track record of safety, I decide to leave SSH enabled:\npim@lego:~$ apt install ufw lego rsync pim@lego:~$ sudo ufw allow 8080 pim@lego:~$ sudo ufw allow 22 pim@lego:~$ sudo ufw enable Now that all four machines are set up and appropriately filtered (using a simple ufw Debian package):\nNGINX will allow port 80 and 443 for public facing web traffic, and is permissive for the IPng Site Local network, to allow SSH for rsync and maintenance tasks LEGO will be entirely closed off, allowing access only from trusted sources for SSH, and to one TCP port 8080 on which HTTP-01 certificate challenges will be served. I make a pre-flight check to make sure that jumbo frames are possible from the frontends into the backend network.\npim@nginx0-nlams1:~$ traceroute lego traceroute to lego (198.19.4.6), 30 hops max, 60 byte packets 1 msw0.nlams0.net.ipng.ch (198.19.4.97) 0.737 ms 0.958 ms 1.155 ms 2 msw0.defra0.net.ipng.ch (198.19.2.22) 6.414 ms 6.748 ms 7.089 ms 3 msw0.chrma0.net.ipng.ch (198.19.2.7) 12.147 ms 12.315 ms 12.401 ms 2 msw0.chbtl0.net.ipng.ch (198.19.2.0) 12.685 ms 12.429 ms 12.557 ms 3 lego.net.ipng.ch (198.19.4.7) 12.916 ms 12.864 ms 12.944 ms pim@nginx0-nlams1:~$ ping -c 3 -6 -M do -s 8952 lego PING lego(lego.net.ipng.ch (2001:678:d78:503::6)) 8952 data bytes 8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=1 ttl=62 time=13.33 ms 8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=2 ttl=62 time=13.52 ms 8960 bytes from lego.net.ipng.ch (2001:678:d78:503::7): icmp_seq=3 ttl=62 time=13.28 ms --- lego ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 4005ms rtt min/avg/max/mdev = 13.280/13.437/13.590/0.116 ms pim@nginx0-nlams1:~$ ping -c 5 -3 -M do -s 8972 lego PING (198.19.4.6) 8972(9000) bytes of data. 
8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=1 ttl=62 time=12.85 ms 8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=2 ttl=62 time=12.82 ms 8980 bytes from lego.net.ipng.ch (198.19.4.7): icmp_seq=3 ttl=62 time=12.91 ms --- ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 4007ms rtt min/avg/max/mdev = 12.823/12.843/12.913/0.138 ms A note on the size used: An IPv4 header is 20 bytes, an IPv6 header is 40 bytes, and an ICMP header is 8 bytes. If the MTU defined on the network is 9000, then the size of the ping payload can be 9000-20-8=8972 bytes for IPv4 and 9000-40-8=8952 for IPv6 packets. Using jumboframes internally is a small optimization for the benefit of the internal webservers - less packets/sec means more throughput and performance in general. It\u0026rsquo;s also cool :)\nCSRs and ACME, oh my! In the old days, (and indeed, still today in many cases!) operators would write a Certificate Signing Request or CSR with the pertinent information for their website, and the SSL authority would then issue a certificate, send it to the operator via e-mail (or would you believe it, paper mail), after which the webserver operator could install and use the cert.\nToday, most SSL authorities and their customers use the Automatic Certificate Management Environment or ACME protocol which is described in [RFC8555]. It defines a way for certificate authorities to check the websites that they are asked to issue a certificate for using so-called challenges. There are several challenge types to choose from, but the one I\u0026rsquo;ll be focusing on is called HTTP-01. These challenges are served from a well known URI, unsurprisingly in the path /.well-known/..., as described in [RFC5785].\nCertbot: Usually when running a webserver with SSL enabled, folks will use the excellent [Certbot] tool from the electronic frontier foundation. This tool is really smart, and has plugins that can automatically take a webserver running common server software like Apache, Nginx, HAProxy or Plesk, figure out how you configured the webserver (which hostname, options, etc), request a certificate and rewrite your configuration. What I find a nice touch is that it automatically installs certificate renewal using a crontab.\nLEGO: A Let’s Encrypt client and ACME library written in Go [ref] and it\u0026rsquo;s super powerful, able to solve for multiple ACME challenges, and tailored to work well with Let\u0026rsquo;s Encrypt as a certificate authority. The HTTP-01 challenge works as follows: when an operator wants to prove that they own a given domain name, the CA can challenge the client to host a mutually agreed upon random number at a random URL under their webserver\u0026rsquo;s /.well-known/acme-challenge/ on port 80. The CA will send an HTTP GET to this random URI and expect the number back in the response.\nShared SSL at Edge Because I will be running multiple frontends in different locations, it\u0026rsquo;s operationally tricky to serve this HTTP-01 challenge random number in a randomly named file on all three NGINX servers. 
But while the LEGO client can write the challenge file directly into a file in the webroot of a server, it can also run as an HTTP server with the sole purpose of responding to the challenge.\nThis is a killer feature: if I point the /.well-known/acme-challenge/ URI on all the NGINX servers to the one LEGO instance running centrally, it no longer matters which of the NGINX servers Let\u0026rsquo;s Encrypt will try to use to solve the challenge - they will all serve the same thing! The LEGO client will construct the challenge request, ask Let\u0026rsquo;s Encrypt to send the challenge, and then serve the response. The only thing left to do then is copy the resulting certificate to the frontends.\nLet me demonstrate how this works, by taking an example based on four websites, none of which run on servers that are reachable from the internet: [go.ipng.ch], [video.ipng.ch], [billing.ipng.ch] and [grafana.ipng.ch]. These run on four separate virtual machines (or docker containers), all within the IPng Site Local private network in 198.19.0.0/16 and 2001:678:d78:500::/56 which aren\u0026rsquo;t reachable from the internet.\nReady? Let\u0026rsquo;s go!\nlego@lego:~$ lego --path /etc/lego/ --http --http.port :8080 --email=noc@ipng.ch \\ --domains=nginx0.ipng.ch --domains=grafana.ipng.ch --domains=go.ipng.ch \\ --domains=video.ipng.ch --domains=billing.ipng.ch \\ run The flow of requests is as follows:\nThe LEGO client contacts the Certificate Authority and requests validation for a list of names: the cluster hostname nginx0.ipng.ch and the four additional domains. It asks the CA to perform an HTTP-01 challenge. The CA will share two random numbers with LEGO, which will start a webserver on port 8080 and serve the URI /.well-known/acme-challenge/$(NUMBER1).\nThe CA will now resolve the A/AAAA addresses for the domain (grafana.ipng.ch), which is a CNAME for the cluster (nginx0.ipng.ch), which in turn has multiple A/AAAA pointing to the three machines associated with it. It will visit any one of the NGINX servers on that negotiated URI, and they will forward requests for /.well-known/acme-challenge internally back to the machine running LEGO on its port 8080.\nThe LEGO client will know that it\u0026rsquo;s going to be visited on the URI /.well-known/acme-challenge/$(NUMBER1), as it has negotiated that with the CA in step 1. When the challenge request arrives, LEGO will know to respond using the contents as agreed upon in $(NUMBER2).\nAfter validating that the response on the random URI contains the agreed upon random number, the CA knows that the operator of the webserver is the same as the certificate requestor for the domain. It issues a certificate to the LEGO client, which stores it on its local filesystem.\nThe LEGO machine finally distributes the private key and certificate to all NGINX machines, which are now capable of serving SSL traffic under the given names.\nThis sequence is done for each of the domains (and indeed, any other domain I\u0026rsquo;d like to add), and in the end a bundled certificate with the common name nginx0.ipng.ch and the four additional alternate names is issued and saved in the certificate store.
Up until this point, NGINX has been operating in clear text, that is to say the CA has issued the ACME challenge on port 80, and NGINX has forwarded it internally to the machine running LEGO on its port 8080 without using encryption.\nTaking a look at the certificate that I\u0026rsquo;ll install in the NGINX frontends (note: never share your .key material, but .crt files are public knowledge):\nlego@lego:~$ openssl x509 -noout -text -in /etc/lego/certificates/nginx0.ipng.ch.crt ... Certificate: Data: Version: 3 (0x2) Serial Number: 03:db:3d:99:05:f8:c0:92:ec:6b:f6:27:f2:31:55:81:0d:10 Signature Algorithm: sha256WithRSAEncryption Issuer: C = US, O = Let\u0026#39;s Encrypt, CN = R3 Validity Not Before: Mar 16 19:16:29 2023 GMT Not After : Jun 14 19:16:28 2023 GMT Subject: CN = nginx0.ipng.ch ... X509v3 extensions: X509v3 Subject Alternative Name: DNS:billing.ipng.ch, DNS:go.ipng.ch, DNS:grafana.ipng.ch, DNS:nginx0.ipng.ch, DNS:video.ipng.ch While the amount of output of this certificate is considerable, I\u0026rsquo;ve highlighted the cool bits. The Subject (also called Common Name or CN) of the cert is the first --domains entry, and the alternate names are that one plus all other --domains given when calling LEGO earlier. In other words, this certificate is valid for all five DNS domain names. Sweet!\nNGINX HTTP Configuration I find it useful to think about the NGINX configuration in two parts: (1) the cleartext / non-ssl parts on port 80, and (2) the website itself that lives behind SSL on port 443. So in order, here\u0026rsquo;s my configuration for the acme-challenge bits on port 80:\npim@nginx0-chrma0:~$ cat \u0026lt;\u0026lt; \u0026#39;EOF\u0026#39; | tee /etc/nginx/conf.d/lego.inc location /.well-known/acme-challenge/ { auth_basic off; proxy_intercept_errors on; proxy_http_version 1.1; proxy_set_header Host $host; proxy_pass http://lego.net.ipng.ch:8080; break; } EOF pim@nginx0-chrma0:~$ cat \u0026lt;\u0026lt; \u0026#39;EOF\u0026#39; | tee /etc/nginx/sites-available/go.ipng.ch.conf server { listen [::]:80; listen 0.0.0.0:80; server_name go.ipng.ch go.net.ipng.ch go; access_log /var/log/nginx/go.ipng.ch-access.log; include \u0026#34;conf.d/lego.inc\u0026#34;; location / { return 301 https://go.ipng.ch$request_uri; } } EOF The first file is an include-file that is shared between all websites I\u0026rsquo;ll serve from this cluster. Its purpose is to forward any requests that start with the well-known ACME challenge URI onto the backend LEGO virtual machine, without requiring any authorization. Then, the second snippet defines a simple webserver on port 80 giving it a few names (the FQDN go.ipng.ch but also two shorthands go.net.ipng.ch and go). Due to the include, the ACME challenge will be performed on port 80. All other requests will be rewritten and returned as a redirect to https://go.ipng.ch/. If you\u0026rsquo;ve ever wondered how folks are able to type http://go/foo and still avoid certificate errors, here\u0026rsquo;s a cool trick that accomplishes that.\nActually these two things are all that\u0026rsquo;s needed to obtain the SSL cert from Let\u0026rsquo;s Encrypt. I haven\u0026rsquo;t even started a webserver on port 443 yet! To recap:\nListen only to /.well-known/acme-challenge/ on port 80, and forward those requests to LEGO. Rewrite all other port-80 traffic to https://go.ipng.ch/ to avoid serving any unencrypted content. NGINX HTTPS Configuration Now that I have the SSL certificate in hand, I can start to write webserver configs to handle the SSL parts.
I\u0026rsquo;ll include a few common options to make SSL as safe as it can be (borrowed from Certbot), and then create the configs for the webserver itself:\npim@nginx0-chrma0:~$ cat \u0026lt; EOF | tee -a /etc/nginx/conf.d/options-ssl-nginx.inc ssl_session_cache shared:le_nginx_SSL:10m; ssl_session_timeout 1440m; ssl_session_tickets off; ssl_protocols TLSv1.2 TLSv1.3; ssl_prefer_server_ciphers off; ssl_ciphers \u0026#34;ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256:ECDHE-ECDSA-AES256-GCM-SHA384: ECDHE-RSA-AES256-GCM-SHA384:ECDHE-ECDSA-CHACHA20-POLY1305:ECDHE-RSA-CHACHA20-POLY1305: DHE-RSA-AES128-GCM-SHA256:DHE-RSA-AES256-GCM-SHA384\u0026#34;; EOF pim@nginx0-chrma0:~$ cat \u0026lt; EOF | tee /etc/nginx/sites-available/go.ipng.ch.conf server { listen [::]:443 ssl http2; listen 0.0.0.0:443 ssl http2; ssl_certificate /etc/nginx/conf.d/nginx0.ipng.ch.crt; ssl_certificate_key /etc/nginx/conf.d/nginx0.ipng.ch.key; include /etc/nginx/conf.d/options-ssl-nginx.inc; ssl_dhparam /etc/nginx/conf.d/ssl-dhparams.pem; server_name go.ipng.ch; access_log /var/log/nginx/go.ipng.ch-access.log upstream; location /edit/ { proxy_pass http://git.net.ipng.ch:5000; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; satisfy any; allow 198.19.0.0/16; allow 194.1.163.0/24; allow 2001:678:d78::/48; deny all; auth_basic \u0026#34;Go Edits\u0026#34;; auth_basic_user_file /etc/nginx/conf.d/go.ipng.ch-htpasswd; } location / { proxy_pass http://git.net.ipng.ch:5000; proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; } } EOF The certificate and SSL options are loaded first from /etc/nginx/conf.d/nginx0.ipng.ch.{crt,key}.\nNext, I don\u0026rsquo;t want folks on the internet to be able to create or edit/overwrite my go-links, so I\u0026rsquo;ll add an ACL on the URI starting with /edit/. Either you come from a trusted IPv4/IPv6 prefix, in which case you can edit links at will, or alternatively you present a username and password that is stored in the go.ipng.ch-htpasswd file (created using the Debian package apache2-utils).\nFinally, all other traffic is forwarded internally to the machine git.net.ipng.ch on port 5000, where the go-link server is running as a Docker container. That server accepts requests from the IPv4 and IPv6 IPng Site Local addresses of all three NGINX frontends on its port 5000.\nIcing on the cake: Internal SSL The go-links server I described above doesn\u0026rsquo;t itself speak SSL. It\u0026rsquo;s meant to be frontended by an Apache, NGINX or HAProxy which handles the client en- and decryption; usually that frontend runs on the same server, at which point I could simply let the go-links server bind localhost:5000. However, the astute observer will point out that the traffic on the IPng Site Local network is cleartext.
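A quick aside before getting to that: the htpasswd file referenced in the /edit/ block above can be created with the htpasswd tool from the apache2-utils package I mentioned \u0026ndash; the username here is just an example:\npim@nginx0-chrma0:~$ sudo htpasswd -c /etc/nginx/conf.d/go.ipng.ch-htpasswd pim New password: Re-type new password: Adding password for user pim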
Now, I don\u0026rsquo;t think that my go-links traffic poses much of a security or privacy threat in cleartext, but certainly other sites (like billing.ipng.ch) are more touchy, and as such require end-to-end encryption on the network.\nIn 2003, twenty years ago, a feature was added to TLS that allows the client to specify the hostname it is expecting to connect to: Server Name Indication or SNI, described in detail in [RFC3546]:\n[TLS] does not provide a mechanism for a client to tell a server the name of the server it is contacting. It may be desirable for clients to provide this information to facilitate secure connections to servers that host multiple \u0026lsquo;virtual\u0026rsquo; servers at a single underlying network address.\nIn order to provide the server name, clients MAY include an extension of type \u0026ldquo;server_name\u0026rdquo; in the (extended) client hello.\nEvery modern webserver and browser can use the SNI extension when talking to each other. NGINX can be configured to pass traffic along to the internal webserver by re-encrypting it with a new SSL connection. Considering the internal hostname will not necessarily be the same as the external website hostname, I can use SNI to force the NGINX-\u0026gt;Billing connection to re-use the billing.ipng.ch hostname:\nserver_name billing.ipng.ch; ... location / { proxy_set_header Host $host; proxy_set_header X-Real-IP $remote_addr; proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for; proxy_set_header X-Forwarded-Proto $scheme; proxy_read_timeout 60; proxy_pass https://billing.net.ipng.ch:443; proxy_ssl_name $host; proxy_ssl_server_name on; } What happens here is that the upstream server is reached on port 443 under the hostname billing.net.ipng.ch, but the SNI value is set back to $host, which is billing.ipng.ch (note: without the *.net.ipng.ch domain). The cool thing is, now the internal webserver can reuse the same certificate! I can use the mechanism described here to obtain the bundled certificate, and then pass that key+cert along to the billing machine, and serve it there using the same certificate files as the frontend NGINX.\nWhat\u0026rsquo;s next Of course, the mission to save IPv4 addresses is achieved - I can now run dozens of websites behind these three IPv4 and IPv6 addresses, and security gets a little bit better too, as the webservers themselves are tucked away in IPng Site Local and unreachable from the public internet.\nThis IPng Frontend design also helps with reliability and latency. I can put frontends in any number of places, and renumber them relatively easily (by adding or removing A/AAAA records on nginx0.ipng.ch and otherwise CNAMEing all my websites to that cluster name). If load becomes an issue, NGINX has a bunch of features like caching, cookie-persistence, and loadbalancing with health checking (so I could use multiple backend webservers and round-robin over the healthy ones). Our Mastodon server on [ublog.tech] or our Peertube server on [video.ipng.ch] can make use of many of these optimizations, but while I do love engineering, I am also super lazy, so I prefer not to prematurely over-optimize.\nThe main thing that\u0026rsquo;s next is to automate a bit more of this. IPng Networks has an Ansible controller, to which I\u0026rsquo;d like to add maintenance of the NGINX and LEGO configuration.
That would sort of look like defining pool nginx0 with hostnames A, B and C; and then having a playbook that creates the virtual machine, installes and configures NGINX, and plumbs it through to the LEGO machine. I can imagine running a specific playbook that ensures the certificates stay fresh in some CI/CD (I have a drone runner alongside our [Gitea] server), or just add something clever to a cronjob on the LEGO machine that periodically runs lego ... renew and when new certificates are issued, copy them out to the NGINX machines in the given cluster with rsync, and reloading their configuration to pick up the new certs.\nBut considering Ansible is its whole own elaborate bundle of joy, I\u0026rsquo;ll leave that for maybe another article.\n","date":"2023-03-17","desc":"A while ago I rolled out an important change to the IPng Networks design: I inserted a bunch of [Centec MPLS] and IPv4/IPv6 capable switches underneath [AS8298], which gave me two specific advantages:\nThe entire IPng network is now capable of delivering L2VPN services, taking the form of MPLS point-to-point ethernet, and VPLS, as shown in a previous [deep dive], in addition to IPv4 and IPv6 transit provided by VPP in an elaborate and elegant [BGP Routing Policy].\n","permalink":"https://ipng.ch/s/articles/2023/03/17/case-study-site-local-nginx/","section":"articles","title":"Case Study: Site Local NGINX"},{"contents":"After receiving an e-mail from [Starry Networks], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.\nI got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port switch). This reseller is using a less known silicon vendor called [Centec], who have a lineup of ethernet chipsets. In this device, the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired with 4x100GbE uplink capability. This is Centec\u0026rsquo;s fourth generation, so CTC8096 inherits the feature set from L2/L3 switching to advanced data center and metro Ethernet features with innovative enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic management, and Network time synchronization.\nAfter discussing basic L2, L3 and Overlay functionality in my [first post], and explored the functionality and performance of MPLS and VPLS in my [second post], I convinced myself and committed to a bunch of these for IPng Networks. I\u0026rsquo;m now ready to roll out these switches and create a BGP-free core network for IPng Networks. If this kind of thing tickles your fancy, by all means read on :)\nOverview You may be wondering what folks mean when they talk about a [BGP Free Core], and also you may ask yourself why would I decide to retrofit this in our network. For most, operating this way gives very little room for outages to occur in the L2 (Ethernet and MPLS) transport network, because it\u0026rsquo;s relatively simple in design and implementation. Some advantages worth mentioning:\nTransport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either in the RIB or FIB, allowing them to be much cheaper. 
As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU utilization during massive BGP re-convergence. Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public internet exchanges, to take two common examples) can be eliminated. If a new BGP security vulnerability were to be discovered, transport devices aren\u0026rsquo;t impacted. Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and the use of large RIBs (eg. leaking into IGP, flapping transit sessions, etc) can be eradicated. New transport services such as MPLS point to point virtual leased lines, SR-MPLS, VPLS clouds, and eVPN can all be introduced without modifying the routing core. If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet, making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers between locations, rackspace, power, cross connects).\nFor smaller clubs (like IPng Networks), being able to share a 100G wave with others significantly reduces the price per Megabit! So if you\u0026rsquo;re in Zurich, Switzerland, or Europe and find this an interesting avenue to expand your reach in a co-op style environment, [reach out] to us, any time!\nHybrid Design I\u0026rsquo;ve decided to make this the direction of IPng\u0026rsquo;s core network \u0026ndash; I know that the specs of the Centec switches I\u0026rsquo;ve bought will allow for a modest but not huge amount of routes in the hardware forwarding tables. I loadtested them in [a previous article] at line rate (well, at least 8x10G at 64b packets and around 110Mpps), so they were forwarding both IPv4 and MPLS traffic effortlessly, and at 45 Watts I might add! However, they clearly cannot operate in the DFZ for two main reasons:\nThe FIB is limited to 12K IPv4 and 2K IPv6 entries, so they can\u0026rsquo;t hold a full table. The CPU is a bit wimpy, so it won\u0026rsquo;t be happy doing large BGP reconvergence operations. IPng Networks has three (3) /24 IPv4 networks, which means we\u0026rsquo;re not swimming in IPv4 addresses. But, I\u0026rsquo;m possibly the world\u0026rsquo;s okayest systems engineer, and I happen to know that most things don\u0026rsquo;t really need an IPv4 address anymore. There\u0026rsquo;s all sorts of fancy loadbalancers like [NGINX] and [HAProxy] which can take traffic (UDP, TCP or higher level constructs like HTTP traffic), provide SSL offloading, and then talk to one or more loadbalanced backends to retrieve the actual content.\nIPv4 versus IPv6 Most modern operating systems can operate in IPv6-only mode; certainly the Debian and Ubuntu and Apple machines that are common in my network are happily dual-stack and probably mono-stack as well.
Seeing that I\u0026rsquo;ve been running IPv6 since, eh, the 90s (my first allocation was on the 6bone in 1996, and I did run [SixXS] for longer than I can remember!).\nYou might be inclined to argue that I should be able to advance the core of my serverpark to IPv6-only \u0026hellip; but unfortunately that\u0026rsquo;s not only up to me, as it has been mentioned to me a number of times that my [Videos] are not reachable, which of course they are, but only if your computer speaks IPv6.\nIn addition to my stuff needing legacy reachability, some external websites, including pretty big ones (I\u0026rsquo;m looking at you, [GitHub] and [Cisco T-Rex]) are still IPv4 only, and some network gear still hasn\u0026rsquo;t really caught on to the IPv6 control- and management plane scene (for example, SNMP traps or scraping, BFD, LDP, and a few others, even in a modern switch like the Centecs that I\u0026rsquo;m about to deploy).\nAS8298 BGP-Free Core I have a few options \u0026ndash; I could be stubborn and do NAT64 for an IPv6-only internal network. But if I\u0026rsquo;m going to be doing NAT anyway, I decide to make a compromise and deploy my new network using private IPv4 space alongside public IPv6 space, and to deploy a few strategically placed border gateways that can do the translation and frontending for me.\nThere\u0026rsquo;s quite a few private/reserved IPv4 ranges on the internet, which the current LIRs on the RIPE [Waiting List] are salivating all over, gross. However, there\u0026rsquo;s a few ones beyond canonical [RFC1918] that are quite frequently used in enterprise networking, for example by large Cloud providers. They build what is called a Virtual Private Cloud or [VPC]. And if they can do it, so can I!\nNumberplan Let me draw your attention to [RFC5735], which describes special use IPv4 addresses. One of these is 198.18.0.0/15: this block has been allocated for use in benchmark tests of network interconnect devices. What I found interesting, is that [RFC2544] explains that this range was assigned to minimize the chance of conflict in case a testing device were to be accidentally connected to part of the Internet. Packets with source addresses from this range are not meant to be forwarded across the Internet. But, they can totally be used to build a pan-european private network that is not directly connected to the internet. I grab my free private Class-B, like so:\nFor IPv4, I take the second /16 from that to use as my IPv4 block: 198.19.0.0/16. For IPv6, I carve out a small part of IPng\u0026rsquo;s own IPv6 PI block: 2001:678:d78:500::/56 First order of business is to create a simple numberplan that\u0026rsquo;s not totally broken:\nPurpose IPv4 Prefix IPv6 Prefix Loopbacks 198.19.0.0/24 (size /32) 2001:678:d78:500::/64 (size /128) P2P Networks 198.19.2.0/23 (size /31) 2001:678:d78:501::/64 (size /112) Site Local Networks 198.19.4.0/22 (size /27) 2001:678:d78:502::/56 (size /64) This simple start leaves most of the IPv4 space allocatable for the future, while giving me lots of IPv4 and IPv6 addresses to retrofit this network in all sites where IPng is present, which is [quite a few]. All of 198.19.1.0/24 (reserved either for P2P networks or for loopbacks, whichever I\u0026rsquo;ll need first), 198.19.8.0/21, 198.19.16.0/20, 198.19.32.0/19, 198.19.64.0/18 and 198.19.128.0/17 will be ready for me to use in the future, and they are all nicely tucked away under one 19.198.in-addr.arpa reversed domain, which I stub out on IPng\u0026rsquo;s resolvers. 
Winner!\nInserting MPLS Under AS8298 I am currently running [VPP] based on my own deployment [article], and this has a bunch of routers connected back-to-back with one another using either crossconnects (if there are multiple routers in the same location), or a CWDM/DWDM wave over dark fiber (if they are in adjacent buildings and I have found a provider willing to share their dark fiber with me), or a Carrier Ethernet virtual leased line (L2VPN, provided by folks like [Init7] in Switzerland, or [IP-Max] throughout europe in our [backbone]).\nMost of these links are actually \u0026ldquo;just\u0026rdquo; point to point ethernet links, which I can use untagged (eg xe1-0), or add any dot1q sub-interfaces (eg xe1-0.10). In some cases, the ISP will deliver the circuit to me with an additional outer tag, in which case I can still use that interface (eg xe1-0.400) and create qinq sub-interfaces (eg xe1-0.400.10).\nIn January 2023, my Zurich metro deployment looks a bit like the top drawing to the right. Of course, these routers connect to all sorts of other things, like internet exchange points ([SwissIX], [CHIX], [CommunityIX], and [FreeIX]), IP transit upstreams (in Zurich mainly [IP-Max] and [Openfactory]), and customer downstreams, colocation space, private network interconnects with others, and so on.\nI want to draw your attention to the four main links between these routers:\nOrange (bottom): chbtl0 and chbtl1 are at our offices in Brüttisellen; they\u0026rsquo;re in two separate racks, and have 24 fibers between them. Here, the two routers connect back to back with a 25G SFP28 optic at 1310nm. Blue (top): Between chrma0 (at NTT datacenter in Rümlang) and chgtg0 (at Interxion datacenter in Glattbrugg), IPng rents a CWDM wave from Openfactory, so the two routers here connect back to back also, albeit over 4.2km of dark fiber between the two datacenters, with a 25G SFP28 optic at 1610nm. Red (left): Between chbtl0 and chrma0, Init7 provides a 10G L2VPN over MPLS ethernet circuit, starting in our offices with a BiDi 10G optic, and delivered at NTT on a BiDi 10G optic as well (we did this, so that the cross connect between our racks might in the future be able to use the other fiber). Init7 delivers both ports tagged VLAN 400. Green (right): Between chbtl1 and chgtg0, Openfactory provides a 10G VLAN ethernet circuit, starting in our offices with a BiDi 10G optic to the local telco, and then transported over dark fiber by UPC to Interxion. Openfactory delivers both sides tagged VLAN 200-209 to us. This is a super fun puzzle! I am running a live network, with customers, and I want to retrofit this MPLS network underneath my existing network, and after thinking about it for a while, I see how I can do it.\nTo avoid using the link, I raise OSPF cost for the link chbtl0-chrma0, the red link in the graph. Traffic will now flow via chgtg0 and through chbtl1. After I\u0026rsquo;ve taken the link out of service, I make a few important changes:\nFirst, I move the interface on both VPP routers from it\u0026rsquo;s dot1q tagged xe1-0.400, to a double tagged xe1-0.400.10. Init7 will pass this through for me, and after I make the change, I can ping both sides again (with a subtle loss of 4 bytes because of the second tag). Next, I unplug the Init7 link on both sides and plug them into a TenGig port on a Centec switch that I deployed in both sites, and I take a second TenGig port and I plug that into the router. I make both ports a trunk mode switchport, and allow VLAN 400 tagged on it. 
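For reference, the switchport side of this looks roughly like the following on the Centec CLI \u0026ndash; the port numbers and descriptions are made up here, and the exact keywords may differ a little between firmware versions:\ninterface eth-0-1 description Core: Init7 L2VPN (VLAN 400) switchport mode trunk switchport trunk allowed vlan add 400 ! interface eth-0-2 description Core: towards local VPP router switchport mode trunk switchport trunk allowed vlan add 400 !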
Finally, on the switch I create interface vlan400 on both sides, and the two switches can see each other directly connected now on the single-tagged interface, while the two routers can see each other directly connected now on the double-tagged interface. With the red leg taken care of, I ask the kind folks from Openfactory if they would mind if I use a second wavelength for the duration of my migration, which they kindly agree to. So, I plug a new CWDM 25G optic on another channel (1270nm), and bring the network to Glattbrugg, where I deploy a Centec switch.\nWith the blue/purple leg taken care of, all I have to do is undrain the red link (lower OSPF cost) while draining the green link (raising its OSPF cost). Traffic now flips back from chgtg0 through chrma0 and into chbtl0. I can rinse and repeat the green leg, moving the interfaces on the routers to a double-tagged xe1-0.200.10 on both sides, inserting and moving the green link from the routers into the switches, and connecting them in turn to the routers.\nConfiguration And just like that, I\u0026rsquo;ve inserted a triangle of Centec switches without disrupting any traffic, would you look at that! They are however, still \u0026ldquo;just\u0026rdquo; switches, each with two ports sharing the red VLAN 400 and the green VLAN 200, and doing \u0026hellip; decidedly nothing on the purple leg, as those ports aren\u0026rsquo;t even switchports!\nNext up: configuring these switches to become, you guessed it, routers!\nInterfaces I will take the switch at NTT Rümlang as an example, but the switches really are all very similar. First, I define the loopback addresses and transit networks to chbtl0 (red link) and chgtg0 (purple link).\ninterface loopback0 description Core: msw0.chrma0.net.ipng.ch ip address 198.19.0.2/32 ipv6 address 2001:678:d78:500::2/128 ! interface vlan400 description Core: msw0.chbtl0.net.ipng.ch (Init7) mtu 9172 ip address 198.19.2.1/31 ipv6 address 2001:678:d78:501::2/112 ! interface eth-0-38 description Core: msw0.chgtg0.net.ipng.ch (UPC 1270nm) mtu 9216 ip address 198.19.2.4/31 ipv6 address 2001:678:d78:501::2:1/112 I need to make sure that the MTU is correct on both sides (this will be important later when OSPF is turned on), and I ensure that the underlay has sufficient MTU (in the case of Init7, as the purple interface goes over dark fiber with no active equipment in between!) I issue a set of ping commands ensuring that the dont-fragment bit is set and the size of the resulting IP packet is exactly that which my MTU claims I should allow, and validate that indeed, we\u0026rsquo;re good.\nOSPF, LDP, MPLS For OSPF, I am certain that this network should never carry or propagate anything other than the 198.19.0.0/16 and 2001:678:d78:500::/56 networks that I have assigned to it, even if it were to be connected to other things (like an out-of-band connection, or even AS8298), so as belt-and-braces style protection I take the following base-line configuration:\nip prefix-list pl-ospf seq 5 permit 198.19.0.0/16 le 32 ipv6 prefix-list pl-ospf seq 5 permit 2001:678:d78:500::/56 le 128 ! route-map ospf-export permit 10 match ipv6 address prefix-list pl-ospf route-map ospf-export permit 20 match ip address prefix-list pl-ospf route-map ospf-export deny 9999 ! router ospf router-id 198.19.0.2 redistribute connected route-map ospf-export redistribute static route-map ospf-export network 198.19.0.0/16 area 0 ! 
router ipv6 ospf router-id 198.19.0.2 redistribute connected route-map ospf-export redistribute static route-map ospf-export ! ip route 198.19.0.0/16 null0 ipv6 route 2001:678:d78:500::/56 null0 I also set a static discard by means of a nullroute, for the space beloning to the private network. This way, packets will not loop around if there is not a more specific for them in OSPF. The route-map ensures that I\u0026rsquo;ll only be advertising our space, even if the switches eventually get connected to other networks, for example some out-of-band access mechanism.\nNext up, enabling LDP and MPLS, which is very straight forward. In my interfaces, I\u0026rsquo;ll add the label-switching and enable-ldp keywords, as well as ensure that the OSPF and OSPFv3 speakers on these interfaces know that they are in point-to-point mode. For the cost, I will start off with the cost in tenths of milliseconds, in other words, if the latency between chbtl0 and chrma0 is 0.8ms, I will set the cost to 8:\ninterface vlan400 description Core: msw0.chbtl0.net.ipng.ch (Init7) mtu 9172 label-switching ip address 198.19.2.1/31 ipv6 address 2001:678:d78:501::2/112 ip ospf network point-to-point ip ospf cost 8 ip ospf bfd ipv6 ospf network point-to-point ipv6 ospf cost 8 ipv6 router ospf area 0 enable-ldp ! router ldp router-id 198.19.0.2 transport-address 198.19.0.2 ! The rest is really just rinse-and-repeat. I loop around all relevant interfaces, and see all of OSPF, OSPFv3, and LDP adjacencies form:\nmsw0.chrma0# show ip ospf nei OSPF process 0: Neighbor ID Pri State Dead Time Address Interface 198.19.0.0 1 Full/ - 00:00:35 198.19.2.0 vlan400 198.19.0.3 1 Full/ - 00:00:39 198.19.2.5 eth-0-38 msw0.chrma0# show ipv6 ospf nei OSPFv3 Process (0) Neighbor ID Pri State Dead Time Interface Instance ID 198.19.0.0 1 Full/ - 00:00:37 vlan400 0 198.19.0.3 1 Full/ - 00:00:39 eth-0-38 0 msw0.chrma0# show ldp session Peer IP Address IF Name My Role State KeepAlive 198.19.0.0 vlan400 Active OPERATIONAL 30 198.19.0.3 eth-0-38 Active OPERATIONAL 30 Connectivity And after I\u0026rsquo;m done with this heavy lifting, I can now build MPLS services (like L2VPN and VPLS) on these three switches. But as you may remember, IPng is in a few more sites than just Brüttisellen, Rümlang and Glattbrugg. While a lot of work, retrofitting every site in exactly the same way is not mentally challenging, so I\u0026rsquo;m not going to spend a lot of words describing it. Wax on, wax off.\nOnce I\u0026rsquo;m done though, the (MPLS) network looks a little bit like this. What\u0026rsquo;s really cool about it, is that it\u0026rsquo;s an fully capable IPv4 and IPv6 network running OSPF and OSPFv3, LDP and MPLS services, albeit one that\u0026rsquo;s not connected to the internet, yet. This means that I\u0026rsquo;ve successfully created both a completely private network that spans all sites we have active equipment in, but also did not stand in the way of our public facing (VPP) routers in AS8298. Customers haven\u0026rsquo;t noticed a single thing, except now they can benefit from any L2 services (using MPLS tunnels or VPLS clouds) from any of our sites. Neat!\nOur VPP routers are connected through the switches, (carrier) L2VPN and WDM waves just as they were before, but carried transparently by the Centec switches. Performance wise, there is no regression, because the switches do line rate L2/MPLS switching and L3 forwarding. 
This means that the VPP routers, except for having a little detour in-and-out the switch for their long haul, have the same throughput as they had before.\nI will deploy three additional features, to make this new private network a fair bit more powerful:\n1. Site Local Connectivity\nEach switch gets what is called an IPng Site Local (or ipng-sl) interface. This is a /27 IPv4 and a /64 IPv6 that is bound on a local VLAN on each switch on our private network. Remember: the links between sites are no longer switched, they are routed and pass ethernet frames only using MPLS. I can connect for example all of the fleet\u0026rsquo;s hypervisors to this internal network. I have given our three bastion jumphosts (Squanchy, Glootie and Pencilvester) an address on this internal network as well, just look at this beautiful result:\npim@hvn0-ddln0:~$ traceroute hvn0.nlams3.net.ipng.ch traceroute to hvn0.nlams3.net.ipng.ch (198.19.4.98), 64 hops max, 40 byte packets 1 msw0.ddln0.net.ipng.ch (198.19.4.129) 1.488 ms 1.233 ms 1.102 ms 2 msw0.chrma0.net.ipng.ch (198.19.2.1) 2.138 ms 2.04 ms 1.949 ms 3 msw0.defra0.net.ipng.ch (198.19.2.13) 6.207 ms 6.288 ms 7.862 ms 4 msw0.nlams0.net.ipng.ch (198.19.2.14) 13.424 ms 13.459 ms 13.513 ms 5 hvn0.nlams3.net.ipng.ch (198.19.4.98) 12.221 ms 12.131 ms 12.161 ms pim@hvn0-ddln0:~$ iperf3 -6 -c hvn0.nlams3.net.ipng.ch -P 10 Connecting to host hvn0.nlams3, port 5201 - - - - - - - - - - - - - - - - - - - - - - - - - [ 5] 9.00-10.00 sec 60.0 MBytes 503 Mbits/sec 0 1.47 MBytes [ 7] 9.00-10.00 sec 71.2 MBytes 598 Mbits/sec 0 1.73 MBytes [ 9] 9.00-10.00 sec 61.2 MBytes 530 Mbits/sec 0 1.30 MBytes [ 11] 9.00-10.00 sec 91.2 MBytes 765 Mbits/sec 0 2.16 MBytes [ 13] 9.00-10.00 sec 88.8 MBytes 744 Mbits/sec 0 2.13 MBytes [ 15] 9.00-10.00 sec 62.5 MBytes 524 Mbits/sec 0 1.57 MBytes [ 17] 9.00-10.00 sec 60.0 MBytes 503 Mbits/sec 0 1.47 MBytes [ 19] 9.00-10.00 sec 65.0 MBytes 561 Mbits/sec 0 1.39 MBytes [ 21] 9.00-10.00 sec 61.2 MBytes 530 Mbits/sec 0 1.24 MBytes [ 23] 9.00-10.00 sec 63.8 MBytes 535 Mbits/sec 0 1.58 MBytes [SUM] 9.00-10.00 sec 685 MBytes 5.79 Gbits/sec 0 ... [SUM] 0.00-10.00 sec 7.38 GBytes 6.34 Gbits/sec 177 sender [SUM] 0.00-10.02 sec 7.37 GBytes 6.32 Gbits/sec receiver 2. Egress Connectivity\nHaving a private network is great, as it allows me to run the entire internal environment with 9000 byte jumboframes, mix IPv4 and IPv6, segment off background tasks such as ZFS replication and borgbackup between physical sites, and employ monitoring with Prometheus and LibreNMS and log in safely with SSH or IPMI without ever needing to leave the safety of the walled garden that is 198.19.0.0/16.\nHypervisors will now typically get a management interface only in this network, and for them to be able to do things like run apt upgrade, some remote repositories will need to be reachable over IPv4 as well. For this, I decide to add three internet gateways, which will have one leg into the private network, and one leg out into the world. 
For IPv4 they\u0026rsquo;ll provide NAT, and for IPv6 they\u0026rsquo;ll ensure only trusted traffic can enter the private network.\nThese gateways will:\nConnect to the internal network with OSPF and OSPFv3: They will learn 198.19.0.0/16, 2001:687:d78:500::/56 and their more specifics from it They will inject a default route for 0.0.0.0/0 and ::/0 to it Connect to AS8298 with BGP: They will receive a default IPv4 and IPv6 route from AS8298 They will announce the two aggregate prefixes to it with no-export community set Provide a WireGuard endpoint to allow remote management: Clients will be put in 192.168.6.0/24 and 2001:678:d78:300::/56 These ranges will be announced both to AS8298 externally and to OSPF internally This provides dynamic routing at its best. If the gateway, the physical connection to the internal network, or the OSPF adjacency is down, AS8298 will not learn the routes into the internal network at this node. If the gateway, the physical connection to the external network, or the BGP adjacency is down, the Centec switch will not pick up the default routes, and no traffic will be sent through it. By having three such nodes geographically separated (one in Brüttisellen, one in Plan-les-Ouates and one in Amsterdam), I am very likely to have stable and resilient connectivity.\nAt the same time, these three machines serve as WireGuard endpoints to be able to remotely manage the network. For this purpose, I\u0026rsquo;ve carved out 192.168.6.0/26 and 2001:678:d78:300::/56 and will hand out IP addresses from those to clients. I\u0026rsquo;d like these two networks to have access to the internal private network as well.\nThe Bird2 OSPF configuration for one of the nodes (in Brüttisellen) looks like this:\nfilter ospf_export { if (net.type = NET_IP4 \u0026amp;\u0026amp; net ~ [ 0.0.0.0/0, 192.168.6.0/26 ]) then accept; if (net.type = NET_IP6 \u0026amp;\u0026amp; net ~ [ ::/0, 2001:678:d78:300::/64 ]) then accept; if (source = RTS_DEVICE) then accept; reject; } filter ospf_import { if (net.type = NET_IP4 \u0026amp;\u0026amp; net ~ [ 198.19.0.0/16 ]) then accept; if (net.type = NET_IP6 \u0026amp;\u0026amp; net ~ [ 2001:678:d78:500::/56 ]) then accept; reject; } protocol ospf v2 ospf4 { debug { events }; ipv4 { export filter ospf_export; import filter ospf_import; }; area 0 { interface \u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;wg0\u0026#34; { stub yes; }; interface \u0026#34;ipng-sl\u0026#34; { type broadcast; cost 15; bfd on; }; }; } protocol ospf v3 ospf6 { debug { events }; ipv6 { export filter ospf_export; import filter ospf_import; }; area 0 { interface \u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;wg0\u0026#34; { stub yes; }; interface \u0026#34;ipng-sl\u0026#34; { type broadcast; cost 15; bfd off; }; }; } The ospf_export filter is what we\u0026rsquo;re telling the Centec switches. Here, precisely the default route and the WireGuard space is announced, in addition to connected routes. The ospf_import is what we\u0026rsquo;re willing to learn from the Centec switches, and here we will accept exactly the aggregate 198.19.0.0/16 and 2001:678:d78:500::/56 prefixes belonging to the private internal network.\nThe Bird2 BGP configuration for this gateway then looks like this:\nfilter bgp_export { if (net.type = NET_IP4 \u0026amp;\u0026amp; ! (net ~ [ 198.19.0.0/16, 192.168.6.0/26 ])) then reject; if (net.type = NET_IP6 \u0026amp;\u0026amp; ! 
(net ~ [ 2001:678:d78:500::/56, 2001:678:d78:300::/64 ]) then reject; # Add BGP Wellknown community no-export (FFFF:FF01) bgp_community.add((65535,65281)); accept; } template bgp T_GW4 { local as 64512; source address 194.1.163.72; default bgp_med 0; default bgp_local_pref 400; ipv4 { import all; export filter bgp_export; next hop self on; }; } template bgp T_GW6 { local as 64512; source address 2001:678:d78:3::72; default bgp_med 0; default bgp_local_pref 400; ipv6 { import all; export filter bgp_export; next hop self on; }; } protocol bgp chbtl0_ipv4_1 from T_GW4 { neighbor 194.1.163.66 as 8298; }; protocol bgp chbtl1_ipv4_1 from T_GW4 { neighbor 194.1.163.67 as 8298; }; protocol bgp chbtl0_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::2 as 8298; }; protocol bgp chbtl1_ipv6_1 from T_GW6 { neighbor 2001:678:d78:3::3 as 8298; }; The bgp_export filter is where we restrict our announcements to only exactly the prefixes we\u0026rsquo;ve learned from the Centec, and WireGuard. We\u0026rsquo;ll set the no-export BGP community on it, which will allow the prefixes to live in AS8298 but never be announced to any eBGP peers. If the any of the machine, the BGP session, the WireGuard interface, or the default route, would be missing, they would simply not be announced. In the other direction, if the Centec is not feeding the gateway its prefixes via OSPF, the BGP session may be up, but it will not be propagating these prefixes, and the gateway will not attract network traffic to it. There are two BGP uplinks to AS8298 here, which also provides resilience in case one of them is down for maintenance or in fault condition. N+k is a great rule to live by, when it comes to network engineering.\nThe last two things I should provide on each gateway, is (A) a NAT translator from internal to external, and (B) a firewall that ensures only authorized traffic gets passed to the Centec network.\nFirst, I\u0026rsquo;ll provide an IPv4 NAT translation to the internet facing AS8298 (ipng), for traffic that is coming from WireGuard or the private network, while allowing it to pass between the two networks without performing NAT. The first rule says to jump to ACCEPT (skipping the NAT rules), if the source is WireGuard. The second two rules say to provide NAT towards the internet for any traffic coming from WireGuard or the private network. The fourth and last rule says to provide NAT towards the internal private network, so that anything trying to get into the network will be coming from an address in 198.19.0.0/16 as well. Here they are:\niptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng-sl -j ACCEPT iptables -t nat -A POSTROUTING -s 192.168.6.0/24 -o ipng -j MASQUERADE iptables -t nat -A POSTROUTING -s 198.19.0.0/16 -o ipng -j MASQUERADE iptables -t nat -A POSTROUTING -o ipng-sl -j MASQUERADE 3. Ingress Connectivity\nFor inbound traffic, the rules are similarly permissive for trusted sources but otherwise prohibit any passing traffic. Prefixes are allowed to be forwarded from WireGuard, and some (not disclosed, cuz I\u0026rsquo;m not stoopid!) 
trusted prefixes for IPv4 and IPv6, but ultimately if not specified the forwarding tables will end in a default policy of DROP, which means no traffic will be passed into the WireGuard or Centec internal networks unless explicitly allowed here:\niptables -P FORWARD DROP ip6tables -P FORWARD DROP for SRC4 in 192.168.6.0/24 ...; do iptables -I FORWARD -s $SRC4 -j ACCEPT done for SRC6 in 2001:678:d78:300::/56 ...; do ip6tables -I FORWARD -s $SRC6 -j ACCEPT done With that, any machine in the Centec (and WireGuard) private internal network will have full access amongst each other, and they will be NATed to the internet, through these three (N+2) gateways. If I turn one of them off, things look like this:\npim@hvn0-ddln0:~$ traceroute 8.8.8.8 traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets 1 msw0.ddln0.net.ipng.ch (198.19.4.129) 0.733 ms 1.040 ms 1.340 ms 2 msw0.chrma0.net.ipng.ch (198.19.2.6) 1.249 ms 1.555 ms 1.799 ms 3 msw0.chbtl0.net.ipng.ch (198.19.2.0) 2.733 ms 2.840 ms 2.974 ms 4 hvn0.chbtl0.net.ipng.ch (198.19.4.2) 1.447 ms 1.423 ms 1.402 ms 5 chbtl0.ipng.ch (194.1.163.66) 1.672 ms 1.652 ms 1.632 ms 6 chrma0.ipng.ch (194.1.163.17) 2.414 ms 2.431 ms 2.322 ms 7 as15169.lup.swissix.ch (91.206.52.223) 2.353 ms 2.331 ms 2.311 ms ... pim@hvn0-chbtl0:~$ sudo systemctl stop bird pim@hvn0-ddln0:~$ traceroute 8.8.8.8 traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets 1 msw0.ddln0.net.ipng.ch (198.19.4.129) 0.770 ms 1.058 ms 1.311 ms 2 msw0.chrma0.net.ipng.ch (198.19.2.6) 1.251 ms 1.662 ms 2.036 ms 3 msw0.chplo0.net.ipng.ch (198.19.2.22) 5.828 ms 5.455 ms 6.064 ms 4 hvn1.chplo0.net.ipng.ch (198.19.4.163) 4.901 ms 4.879 ms 4.858 ms 5 chplo0.ipng.ch (194.1.163.145) 4.867 ms 4.958 ms 5.113 ms 6 chrma0.ipng.ch (194.1.163.50) 9.274 ms 9.306 ms 9.313 ms 7 as15169.lup.swissix.ch (91.206.52.223) 10.168 ms 10.127 ms 10.090 ms ... How cool is that :) First I do a traceroute from the hypervisor pool in DDLN colocation site, which finds its closest default at msw0.chbtl0.net.ipng.ch which exits via hvn0.chbtl0 and into the public internet. Then, I shut down bird on that hypervisor/gateway, which means it won\u0026rsquo;t be advertising the default into the private network, nor will it be picking up traffic to/from it. About one second later, the next default route is found to be at msw0.chplo0.net.ipng.ch over its hypervisor in Geneva (note, 4ms down the line), after which the egress is performed at hvn1.chplo0 into the public internet. Of course, it\u0026rsquo;s then sent back to Zurich to still find its way to Google at SwissIX, but the only penalty is a scenic route: looping from Brüttisellen to Geneva and back adds pretty much 8ms of end to end latency.\nJust look at that beautiful resillience at play. Chef\u0026rsquo;s kiss.\nWhat\u0026rsquo;s next The ring hasn\u0026rsquo;t been fully deployed yet. I am waiting on a backorder of switches from Starry Networks, due to arrive early April. The delivery of those will allow me to deploy in Paris and Lille, hopefully in a cool roadtrip with Fred :)\nBut, I got pretty far, so what\u0026rsquo;s next for me is the following few fun things:\nStart offering EoMPLS / L2VPN / VPLS services to IPng customers. Who wants some?! Move replication traffic from the current public internet, towards the internal private network. This both can leverage 9000 byte jumboframes, but it can also use wirespeed forwarding from the Centec network gear. 
Move all unneeded IPv4 addresses into the private network, such as maintenance and management / controlplane, route reflectors, backup servers, hypervisors, and so on. Move frontends to be dual-homed as well: one leg towards AS8298 using Public IPv4 and IPv6 addresses, and then finding backend servers in the private network (think of it like an NGINX frontend that terminmates the HTTP/HTTPS connection [SSL is inserted and removed here :)], and then has one or more backend servers in the private network. This can be useful for Mastodon, Peertube, and of course our own websites. ","date":"2023-03-11","desc":"After receiving an e-mail from [Starry Networks], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.\nI got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port switch). This reseller is using a less known silicon vendor called [Centec], who have a lineup of ethernet chipsets. In this device, the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired with 4x100GbE uplink capability. This is Centec\u0026rsquo;s fourth generation, so CTC8096 inherits the feature set from L2/L3 switching to advanced data center and metro Ethernet features with innovative enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic management, and Network time synchronization.\n","permalink":"https://ipng.ch/s/articles/2023/03/11/case-study-centec-mpls-core/","section":"articles","title":"Case Study: Centec MPLS Core"},{"contents":" Author: Pim van Pelt, Rogier Krieger Reviewers: Coloclue Network Committee Status: Draft - Review - Published Almost precisely two years ago, in February of 2021, I created a loadtesting environment at [Coloclue] to prove that a provider of L2 connectivity between two datacenters in Amsterdam was not incurring jitter or loss on its services \u0026ndash; I wrote up my findings in [an article], which demonstrated that the service provider indeed provides a perfect service. One month later, in March 2021, I briefly ran [VPP] on one of the routers at Coloclue, but due to lack of time and a few technical hurdles along the way, I had to roll back [ref].\nThe Problem Over the years, Coloclue AS8283 continues to suffer from packet loss in its network. Taking a look at a simple traceroute, in this case from IPng AS8298, shows very high variance and packetlo when entering the network (at hop 5 in a router called eunetworks-2.router.nl.coloclue.net):\nMy traceroute [v0.94] squanchy.ipng.ch (194.1.193.90) -\u0026gt; 185.52.227.1 2023-02-24T09:03:36+0100 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. chbtl0.ipng.ch 0.0% 49904 1.3 0.9 0.7 1.7 0.2 2. chrma0.ipng.ch 0.0% 49904 1.7 1.2 1.2 2.1 0.9 3. defra0.ipng.ch 0.0% 49904 6.3 6.2 6.0 19.2 1.3 4. nlams0.ipng.ch 0.0% 49904 12.7 12.6 12.4 19.8 1.8 5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.2% 49903 98.8 12.3 12.0 272.8 23.0 6. 
185.52.227.1 6.6% 49903 15.3 12.5 12.3 308.7 20.4 The last two hops show the packet loss well north of 6.5%, some paths are better, some are worse, but notably when more than one router is in the path, it\u0026rsquo;s difficult to pinpoint where or what is responsible. But honestly, any source will reveal packet loss and high variance when traversing through one or more Coloclue routers, to more or lesser degree:\n\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash; | \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash; | The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both are showing ~4.8-5.0% packetlo and high variance in end to end latency. No bueno!\nIsolating a Device Under Test Because Coloclue has several routers, I want to ensure that traffic traverses only the one router under test. I decide to use an allocated but currently unused IPv4 prefix and announce that only from one of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a piece of software called Kees, a set of Python and Jinja2 scripts to generate a Bird1.6 configuration for each router. This is great because that allows me to add a small feature to get what I need: beacons.\nSetting up the beacon A beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a particular way. I added a function called is_coloclue_beacon() which reads the input YAML file and uses a construction similar to the existing feature for \u0026ldquo;supernets\u0026rdquo;. It determines if a given prefix must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the beacons list will be then matched in is_coloclue_beacon() and announced. For the curious, [this commit] holds the logic and tests to ensure this is safe.\nBased on a per-router config (eg. vars/eunetworks-2.router.nl.coloclue.net.yml) I can now add the following YAML stanza:\ncoloclue: beacons: - prefix: \u0026#34;185.52.227.0\u0026#34; length: 24 comment: \u0026#34;VPP test prefix (pim, rogier)\u0026#34; And further, from this router, I can forward all traffic destined to this /24 to a machine running in EUNetworks (my Dell R630 called hvn0.nlams2.ipng.ch), using a simple static route:\nstatics: ... - route: \u0026#34;185.52.227.0/24\u0026#34; via: \u0026#34;94.142.240.71\u0026#34; comment: \u0026#34;VPP test prefix (pim, rogier)\u0026#34; After running Kees, I can now see traffic for that /24 show up on my machine. The last step is to ensure that traffic that is destined for the beacon will always traverse back over eunetworks-2. Coloclue has VRRP and sometimes another router might be the logical router. With a little trick on my machine, I can force traffic by means of policy based routing:\npim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.254 pim@hvn0-nlams2:~$ sudo ip ro add prohibit 185.52.227.0/24 pim@hvn0-nlams2:~$ sudo ip addr add 185.52.227.1/32 dev lo pim@hvn0-nlams2:~$ sudo ip rule add from 185.52.227.0/24 lookup 10 pim@hvn0-nlams2:~$ sudo ip ro add default via 94.142.240.253 table 10 First, I set the default gateway to be the VRRP address that floats between multiple routers. Then, I will set a prohibit route for the covering /24, which means the machine will send an ICMP unreachable (rather than discarding the packets), which can be useful later. 
Next, I\u0026rsquo;ll add .1 as an IPv4 address onto loopback, after which the machine will start replying to ICMP packets there with icmp-echo rather than dst-unreach. To make sure routing is always symmetric, I\u0026rsquo;ll add an ip rule which is a classifier that matches packets based on their source address, and then diverts these to an alternate routing table, which has only one entry: send via .253 (which is eunetworks-2).\nLet me show this in action:\npim@hvn0-nlams2:~$ dig +short -x 94.142.240.254 eunetworks-gateway-100.router.nl.coloclue.net. pim@hvn0-nlams2:~$ dig +short -x 94.142.240.253 bond0-100.eunetworks-2.router.nl.coloclue.net. pim@hvn0-nlams2:~$ dig +short -x 94.142.240.252 bond0-100.eunetworks-3.router.nl.coloclue.net. pim@hvn0-nlams2:~$ ip -4 nei | grep \u0026#39;94.142.240.25[234]\u0026#39; 94.142.240.252 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE 94.142.240.253 dev coloclue lladdr 64:9d:99:b1:31:af REACHABLE 94.142.240.254 dev coloclue lladdr 64:9d:99:b1:31:db REACHABLE In the output above, I can see that eunetworks-2 (94.142.240.253) has MAC address 64:9d:99:b1:31:af, and that eunetworks-3 (94.142.240.252) has MAC address 64:9d:99:b1:31:db. My default gateway, handled by VRRP, is at .254 and it\u0026rsquo;s using the second MAC address, so I know that eunetworks-3 is primary, and will handle my egress traffic.\nVerifying symmetric routing of the beacon A quick demonstration to show the symmetric routing case, I can tcpdump and see that my \u0026ldquo;usual\u0026rdquo; egress traffic will be sent to the MAC address of the VRRP primary (which I showed to be eunetworks-3 above), while traffic coming from 185.52.227.0/24 ought to be sent to the MAC address of eunetworks-2 due to the ip rule and alternate routing table 10:\npim@hvn0-nlams2:~$ sudo tcpdump -eni coloclue host 194.1.163.93 and icmp tcpdump: verbose output suppressed, use -v[v]... for full protocol decode listening on coloclue, link-type EN10MB (Ethernet), snapshot length 262144 bytes 10:02:17.193844 64:9d:99:b1:31:af \u0026gt; 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98: 194.1.163.93 \u0026gt; 94.142.240.71: ICMP echo request, id 16287, seq 1, length 64 10:02:17.193882 6e:fa:52:d0:c1:ff \u0026gt; 64:9d:99:b1:31:db, ethertype IPv4 (0x0800), length 98: 94.142.240.71 \u0026gt; 194.1.163.93: ICMP echo reply, id 16287, seq 1, length 64 10:02:19.276657 64:9d:99:b1:31:af \u0026gt; 6e:fa:52:d0:c1:ff, ethertype IPv4 (0x0800), length 98: 194.1.163.93 \u0026gt; 185.52.227.1: ICMP echo request, id 6646, seq 1, length 64 10:02:19.276694 6e:fa:52:d0:c1:ff \u0026gt; 64:9d:99:b1:31:af, ethertype IPv4 (0x0800), length 98: 185.52.227.1 \u0026gt; 194.1.163.93: ICMP echo reply, id 6646, seq 1, length 64 It takes a keen eye to spot the difference here the first packet (which is going to the main IPv4 address 94.142.240.71), is returned via MAC address 64:9d:99:b1:31:db (the VRRP default gateway), but the second one (going to the beacon 185.52.227.1) is returned via MAC address 64:9d:99:b1:31:af.\nI\u0026rsquo;ve now ensured that traffic to and from 185.52.227.1 will always traverse through the DUT (eunetworks-2 with MAC 64:9d:99:b1:31:af). 
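The same question can also be asked of the kernel directly for locally generated traffic. As a quick sanity check (a sketch \u0026ndash; the interesting part is which nexthop and routing table show up in the output), a lookup sourced from the machine\u0026rsquo;s main address should return the VRRP gateway at .254, while a lookup sourced from the beacon should return .253 via table 10:\npim@hvn0-nlams2:~$ ip rule show pim@hvn0-nlams2:~$ ip route get 194.1.163.93 from 94.142.240.71 pim@hvn0-nlams2:~$ ip route get 194.1.163.93 from 185.52.227.1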
Very elegant :-)\nInstalling VPP I\u0026rsquo;ve written about this before, the general spiel is just following my previous article (I\u0026rsquo;m often very glad to read back my own articles as they serve as pretty good documentation to my forgetful chipmunk-sized brain!), so here, I\u0026rsquo;ll only recap what\u0026rsquo;s already written in [vpp-7]:\nBuild VPP with Linux Control Plane Bring eunetworks-2 into maintenance mode, so we can safely tinker with it Start services like ssh, snmp, keepalived and bird in a new dataplane namespace Start VPP and give the LCP interface names the same as their original Slowly introduce the router: OSPF, OSPFv3, iBGP, members-bgp, eBGP, in that order Re-enable keepalived and let the machine forward traffic Stare at the latency graphs 1. BUILD: For the first step, the build is straight forward, and yields a VPP instance based on vpp-ext-deps_23.06-1 at version 23.06-rc0~71-g182d2b466, which contains my [LCPng] plugin. I then copy the packages to the router. The router has an E-2286G CPU @ 4.00GHz with 6 cores and 6 hyperthreads. There\u0026rsquo;s a really handy tool called likwid-topology that can show how the L1, L2 and L3 cache lines up with respect to CPU cores. Here I learn that CPU (0+6) and (1+7) share L1 and L2 cache \u0026ndash; so I can conclude that 0-5 are CPU cores which share a hyperthread with 6-11 respectively.\nI also see that L3 cache is shared across all of the cores+hyperthreads, which is normal. I decide to give CPUs 0,1 and their hyperthread 6,7 to Linux for general purpose scheduling, and I want to block the remaining CPUs and their hyperthreads to dedicated to VPP. So the kernel is rebooted with isolcpus=2-5,8-11.\n2. DRAIN: In the mean time, Rogier prepares the drain, which is two step process. First he marks all the BGP sessions as graceful_shutdown: True, and waits for the traffic to die down. Then, he marks the machine as maintenance_mode: True which will make Kees set OSPF cost to 65535 and avoid attracting or sending traffic through this machine. After he submits these, we are free to tinker with the router, as it will not affect any Coloclue members. Rogier also ensures we will have the hand on this little machine in Amsterdam, by preparing an IPMI serial-over-lan connection and KVM.\n3. PREPARE: Starting an ssh and snmpd in the dataplane is the most important part. This way, we will be able to scrape the machine using SNMP just as-if it were a Linux native router. And of course we will want to be able to log in to the router. I start with these two services, the only small note is that, because I want to run two copies (one in the default namespace and one additional one in the dataplane namespace), I\u0026rsquo;ll want to tweak the startup flags (pid file, config file, etc) a little bit:\n## in snmpd-dataplane.service ExecStart=/sbin/ip netns exec dataplane /usr/sbin/snmpd -LOw -u Debian-snmp \\ -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid \\ -C -c /etc/snmp/snmpd-dataplane.conf ## in ssh-dataplane.service ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd \\ -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS 4. LAUNCH: Now what\u0026rsquo;s left for us to do is switch from our SSH session to an IPMI serial-over-lan session so that we can safely transition to the VPP world. Rogier and I log in and share a tmux session, after which I bring down all ethernet links, remove VLAN sub-interfaces and the LACP BondEthernet, leaving only the main physical interfaces. 
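A rough sketch of that cleanup, assuming the kernel interface names shown further down (the exact list of course depends on what is configured on the box):\nroot@eunetworks-2:~# for IFACE in $(ip -br link show type vlan | awk \u0026#39;{print $1}\u0026#39; | cut -d@ -f1); do ip link del $IFACE; done root@eunetworks-2:~# ip link del bond0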
I then set link down on them, and restart VPP \u0026ndash; which will take all DPDK eligble interfaces that are link admin-down, and then let the magic happen:\nroot@eunetworks-2:~# vppctl show int Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count GigabitEthernet5/0/0 5 down 9000/0/0/0 GigabitEthernet6/0/0 6 down 9000/0/0/0 TenGigabitEthernet1/0/0 1 down 9000/0/0/0 TenGigabitEthernet1/0/1 2 down 9000/0/0/0 TenGigabitEthernet1/0/2 3 down 9000/0/0/0 TenGigabitEthernet1/0/3 4 down 9000/0/0/0 Dope! One way to trick the rest of the machine into thinking it hasn\u0026rsquo;t changed, is to recreate these interfaces in the dataplane network namespace using their original interface names (eg. enp1s0f3 for AMS-IX, and bond0 for the LACP signaled BondEthernet that we\u0026rsquo;ll create. Rogier prepared an excellent vppcfg config file:\nloopbacks: loop0: description: \u0026#39;eunetworks-2.router.nl.coloclue.net\u0026#39; lcp: \u0026#39;loop0\u0026#39; mtu: 9216 addresses: [ 94.142.247.3/32, 2a02:898:0:300::3/128 ] bondethernets: BondEthernet0: description: \u0026#39;Core: MLAG member switches\u0026#39; interfaces: [ TenGigabitEthernet1/0/0, TenGigabitEthernet1/0/1 ] mode: \u0026#39;lacp\u0026#39; load-balance: \u0026#39;l34\u0026#39; mac: \u0026#39;64:9d:99:b1:31:af\u0026#39; interfaces: GigabitEthernet5/0/0: description: \u0026#34;igb 0000:05:00.0 eno1 # FiberRing\u0026#34; lcp: \u0026#39;eno1\u0026#39; mtu: 9216 sub-interfaces: 205: description: \u0026#34;Peering: Arelion\u0026#34; lcp: \u0026#39;eno1.205\u0026#39; addresses: [ 62.115.144.33/31, 2001:2000:3080:ebc::2/126 ] mtu: 1500 992: description: \u0026#34;Transit: FiberRing\u0026#34; lcp: \u0026#39;eno1.992\u0026#39; addresses: [ 87.255.32.130/30, 2a00:ec8::102/126 ] mtu: 1500 GigabitEthernet6/0/0: description: \u0026#34;igb 0000:06:00.0 eno2 # Free\u0026#34; lcp: \u0026#39;eno2\u0026#39; mtu: 9216 state: down TenGigabitEthernet1/0/0: description: \u0026#34;i40e 0000:01:00.0 enp1s0f0 (bond-member)\u0026#34; mtu: 9216 TenGigabitEthernet1/0/1: description: \u0026#34;i40e 0000:01:00.1 enp1s0f1 (bond-member)\u0026#34; mtu: 9216 TenGigabitEthernet1/0/2: description: \u0026#39;Core: link between eunetworks-2 and eunetworks-3\u0026#39; lcp: \u0026#39;enp1s0f2\u0026#39; addresses: [ 94.142.247.246/31, 2a02:898:0:301::/127 ] mtu: 9214 TenGigabitEthernet1/0/3: description: \u0026#34;i40e 0000:01:00.3 enp1s0f3 # AMS-IX\u0026#34; lcp: \u0026#39;enp1s0f3\u0026#39; mtu: 9216 sub-interfaces: 501: description: \u0026#34;Peering: AMS-IX\u0026#34; lcp: \u0026#39;enp1s0f3.501\u0026#39; addresses: [ 80.249.211.161/21, 2001:7f8:1::a500:8283:1/64 ] mtu: 1500 511: description: \u0026#34;Peering: NBIP-NaWas via AMS-IX\u0026#34; lcp: \u0026#39;enp1s0f3.511\u0026#39; addresses: [ 194.62.128.38/24, 2001:67c:608::f200:8283:1/64 ] mtu: 1500 BondEthernet0: lcp: \u0026#39;bond0\u0026#39; mtu: 9216 sub-interfaces: 100: description: \u0026#34;Cust: Members\u0026#34; lcp: \u0026#39;bond0.100\u0026#39; mtu: 1500 addresses: [ 94.142.240.253/24, 2a02:898:0:20::e2/64 ] 101: description: \u0026#34;Core: Powerbars\u0026#34; lcp: \u0026#39;bond0.101\u0026#39; mtu: 1500 addresses: [ 172.28.3.253/24 ] 105: description: \u0026#34;Cust: Members (no strict uRPF filtering)\u0026#34; lcp: \u0026#39;bond0.105\u0026#39; mtu: 1500 addresses: [ 185.52.225.14/28, 2a02:898:0:21::e2/64 ] 130: description: \u0026#34;Core: Link between eunetworks-2 and dcg-1\u0026#34; lcp: \u0026#39;bond0.130\u0026#39; mtu: 1500 addresses: [ 94.142.247.242/31, 2a02:898:0:301::14/127 ] 2502: description: 
\u0026#34;Transit: Fusix Networks\u0026#34; lcp: \u0026#39;bond0.2502\u0026#39; mtu: 1500 addresses: [ 37.139.140.27/31, 2a00:a7c0:e20b:104::2/126 ] We take this configuration and pre-generate a suitable VPP config, which exposes two little bugs in vppcfg:\nRogier had used captial letters in his IPv6 addresses (ie. 2001:2000:3080:0EBC::2), while the dataplane reports lower case (ie. 2001:2000:3080:ebc::2), which consistently yield a diff that\u0026rsquo;s not there. I make a note to fix that. When I create the initial --novpp config, there\u0026rsquo;s a bug in vppcfg where I incorrectly reference a dataplane object which I haven\u0026rsquo;t initialized (because with --novpp the tool will not contact the dataplane at all. That one was easy to fix, which I did in [this commit]). After that small detour, I can now proceed to configure the dataplane by offering the resulting VPP commands, like so:\nroot@eunetworks-2:~# vppcfg plan --novpp -c /etc/vpp/vppcfg.yaml \\ -o /etc/vpp/config/vppcfg.vpp [INFO ] root.main: Loading configfile /etc/vpp/vppcfg.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.reconciler.write: Wrote 84 lines to /etc/vpp/config/vppcfg.vpp [INFO ] root.main: Planning succeeded root@eunetworks-2:~# vppctl exec /etc/vpp/config/vppcfg.vpp 5. UNDRAIN: The VPP dataplane comes to life, only to immediately hang. Whoops! What follows is a 90 minute forray into the innards of VPP (and Bird) which I haven\u0026rsquo;t yet fully understood, but will definitely want to learn more about (future article, anyone?) \u0026ndash; but the TL/DR of our investigation is that if an IPv6 address is added to a loopback device, and an OSPFv3 (IPv6) stub area is created on it, as is common for IPv4 and IPv6 loopback addresses in OSPF, then the dataplane immediately hangs on the controlplane, but does continue to forward traffic.\nHowever, we also find a workaround, which is to put the IPv6 loopback address on a physical interface instead of a loopback interface. Then, we observe a perfectly functioning dataplane, which has a working BondEthernet with LACP signalling:\nroot@eunetworks-2:~# vppctl show bond details BondEthernet0 mode: lacp load balance: l34 number of active members: 2 TenGigabitEthernet1/0/1 TenGigabitEthernet1/0/0 number of members: 2 TenGigabitEthernet1/0/0 TenGigabitEthernet1/0/1 device instance: 0 interface id: 0 sw_if_index: 8 hw_if_index: 8 root@eunetworks-2:~# vppctl show lacp actor state partner state interface name sw_if_index bond interface exp/def/dis/col/syn/agg/tim/act exp/def/dis/col/syn/agg/tim/act TenGigabitEthernet1/0/0 1 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0001), (8000,02-1c-73-0f-8b-bc,0015,8000,8015)] RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX TenGigabitEthernet1/0/1 2 BondEthernet0 0 0 1 1 1 1 1 1 0 0 1 1 1 1 0 1 LAG ID: [(ffff,64-9d-99-b1-31-af,0008,00ff,0002), (8000,02-1c-73-0f-8b-bc,0015,8000,0015)] RX-state: CURRENT, TX-state: TRANSMIT, MUX-state: COLLECTING_DISTRIBUTING, PTX-state: PERIODIC_TX 6. WRAP UP: After doing a bit of standard issue ping / ping6 and show err and show log, things are looking good. Rogier and I are now ready to slowly introduce the router: we first turn on OSPF and OSPFv3, see adjacencies and BFD turn up. 
We make a note that enp1s0f2 (which is now a LIP in the dataplane) does not have BFD while it does have OSPF, and the explanation for this is that bond0 is connected to a switch, while enp1s0f2 is directly connected to its peer via a cross connect cable, so if it fails, it\u0026rsquo;ll be able to use link-state to quickly reconverge, while the ethernet link may still be up on bond0 if something along the transport path were to fail, so BFD is the better choice there. Smart thinking, Coloclue!\nroot@eunetworks-2:~# birdc6 show ospf nei ospf1 BIRD 1.6.8 ready. ospf1: Router ID Pri State DTime Interface Router IP 94.142.247.1 1 Full/PtP 00:33 bond0.130 fe80::669d:99ff:feb1:394b 94.142.247.6 1 Full/PtP 00:31 enp1s0f2 fe80::669d:99ff:feb1:31d8 root@eunetworks-2:~# birdc show bfd ses BIRD 1.6.8 ready. bfd1: IP address Interface State Since Interval Timeout 94.142.247.243 bond0.130 Up 2023-02-24 15:56:29 0.100 0.500 We are then ready to undrain iBGP and eBGP to members, transit and peering sessions. Rogier swiftly takes care of business, and the router finds its spot in the DFZ just a few minutes later:\nroot@eunetworks-2:~# birdc show route count BIRD 1.6.8 ready. 6239493 of 6239493 routes for 907650 networks root@eunetworks-2:~# birdc6 show route count BIRD 1.6.8 ready. 1152345 of 1152345 routes for 169987 networks root@eunetworks-2:~# vppctl show ip fib sum ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] Prefix length Count 0 1 4 2 8 16 9 13 10 38 11 103 12 299 13 577 14 1214 15 2093 16 13477 17 8250 18 13824 19 24990 20 43089 21 51191 22 109106 23 97073 24 542106 27 3 28 13 29 32 30 36 31 41 32 788 root@eunetworks-2:~# vppctl show ip6 fib sum ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] Prefix length Count 128 863 127 4 126 4 125 1 120 2 64 22 60 17 52 2 49 2 48 80069 47 3535 46 3411 45 1726 44 14909 43 1041 42 2529 41 932 40 14126 39 1459 38 1654 37 988 36 6640 35 1374 34 3419 33 3707 32 22819 31 294 30 589 29 4373 28 196 27 20 26 15 25 8 24 30 23 7 22 7 21 3 20 15 19 1 10 1 0 1 One thing that I really appreciate is how \u0026hellip; normal \u0026hellip; this machine looks, with no interfaces in the default namespace, but after switching to the dataplane network namespace using nsenter, there they are and they look (unsurprisingly, because we configured them that way), identical to what was running before, except now all goverend by VPP instead of the Linux kernel:\nroot@eunetworks-2:~# ip -br l lo UNKNOWN 00:00:00:00:00:00 \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; root@eunetworks-2:~# nsenter --net=/var/run/netns/dataplane root@eunetworks-2:~# ip -br l lo UNKNOWN 00:00:00:00:00:00 \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; eno1 UP ac:1f:6b:e0:b1:0c \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; eno2 DOWN ac:1f:6b:e0:b1:0d \u0026lt;BROADCAST,MULTICAST\u0026gt; enp1s0f2 UP 64:9d:99:b1:31:ad \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; enp1s0f3 UP 64:9d:99:b1:31:ac \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; bond0 UP 64:9d:99:b1:31:af \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; loop0 UP de:ad:00:00:00:00 \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; eno1.205@eno1 UP ac:1f:6b:e0:b1:0c \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; eno1.992@eno1 UP ac:1f:6b:e0:b1:0c \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; enp1s0f3.501@enp1s0f3 UP 64:9d:99:b1:31:ac 
\u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; enp1s0f3.511@enp1s0f3 UP 64:9d:99:b1:31:ac \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; bond0.100@bond0 UP 64:9d:99:b1:31:af \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; bond0.101@bond0 UP 64:9d:99:b1:31:af \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; bond0.105@bond0 UP 64:9d:99:b1:31:af \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; bond0.130@bond0 UP 64:9d:99:b1:31:af \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; bond0.2502@bond0 UP 64:9d:99:b1:31:af \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; root@eunetworks-2:~# ip -br a lo UNKNOWN 127.0.0.1/8 ::1/128 eno1 UP fe80::ae1f:6bff:fee0:b10c/64 eno2 DOWN enp1s0f2 UP 94.142.247.246/31 2a02:898:0:300::3/128 2a02:898:0:301::/127 fe80::669d:99ff:feb1:31ad/64 enp1s0f3 UP fe80::669d:99ff:feb1:31ac/64 bond0 UP fe80::669d:99ff:feb1:31af/64 loop0 UP 94.142.247.3/32 fe80::dcad:ff:fe00:0/64 eno1.205@eno1 UP 62.115.144.33/31 2001:2000:3080:ebc::2/126 fe80::ae1f:6bff:fee0:b10c/64 eno1.992@eno1 UP 87.255.32.130/30 2a00:ec8::102/126 fe80::ae1f:6bff:fee0:b10c/64 enp1s0f3.501@enp1s0f3 UP 80.249.211.161/21 2001:7f8:1::a500:8283:1/64 fe80::669d:99ff:feb1:31ac/64 enp1s0f3.511@enp1s0f3 UP 194.62.128.38/24 2001:67c:608::f200:8283:1/64 fe80::669d:99ff:feb1:31ac/64 bond0.100@bond0 UP 94.142.240.253/24 2a02:898:0:20::e2/64 fe80::669d:99ff:feb1:31af/64 bond0.101@bond0 UP 172.28.3.253/24 fe80::669d:99ff:feb1:31af/64 bond0.105@bond0 UP 185.52.225.14/28 2a02:898:0:21::e2/64 fe80::669d:99ff:feb1:31af/64 bond0.130@bond0 UP 94.142.247.242/31 2a02:898:0:301::14/127 fe80::669d:99ff:feb1:31af/64 bond0.2502@bond0 UP 37.139.140.27/31 2a00:a7c0:e20b:104::2/126 fe80::669d:99ff:feb1:31af/64 Of course, VPP handles all the traffic through the machine, and the only traffic that Linux will see is that which is destined to the controlplane (eg, to one of the IPv4 or IPv6 addresses or multicast/broadcast groups that they are participating in), so things like tcpdump or SNMP won\u0026rsquo;t really work.\nHowever, due to my [vpp-snmp-agent], which is feeding as an AgentX behind an snmpd that in turn is running in the dataplane namespace, SNMP scrapes work as they did before, albeit with a few different interface names.\n6. Earlier, I had failed over keepalived and stopped the service. This way, the peer router on eunetworks-3 would pick up all outbound traffic to the virtual IPv4 and IPv6 for our users' default gateway. Because we\u0026rsquo;re mainly interested in non-intrusively measuring the BGP beacon (which is forced to always go through this machine), and we know some of our members use BGP and take a preference over this router because it\u0026rsquo;s connected to AMS-IX, we make a decision to leave keepalived turned off for now.\nBut, traffic is flowing, and in fact a little bit more throughput, possibly because traffic flows faster when there\u0026rsquo;s not 5% packet loss on certain egress paths? I don\u0026rsquo;t know but OK, moving along!\nResults Clearly VPP is a winner in this scenario. If you recall the traceroute from before the operation, the latency was good up until nlams0.ipng.ch, after which loss occured and variance was very high. Rogier and I let the VPP instance run overnight, and started this traceroute after our maintenance was concluded:\nMy traceroute [v0.94] squanchy.ipng.ch (194.1.163.90) -\u0026gt; 185.52.227.1 2023-02-25T09:48:46+0100 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. 
chbtl0.ipng.ch 0.0% 51796 0.6 0.2 0.1 1.7 0.2 2. chrma0.ipng.ch 0.0% 51796 1.6 1.0 0.9 5.5 1.2 3. defra0.ipng.ch 0.0% 51796 7.0 6.5 6.4 27.7 1.9 4. nlams0.ipng.ch 0.0% 51796 12.7 12.6 12.5 43.8 3.9 5. bond0-105.eunetworks-2.router.nl.coloclue.net 0.0% 51796 13.3 13.0 12.8 138.9 11.1 6. 185.52.227.1 0.0% 51796 13.6 12.7 12.3 46.6 8.3 This mtr shows clear network weather with absolutely no packets dropped from Brüttisellen (near Zurich, Switzerland) all the way to the BGP beacon running in EUNetworks in Amsterdam. Considering I\u0026rsquo;ve been running VPP for a few years now, including writing the code necessary to plumb the dataplane interfaces through to Linux so that a higher order control plane (such as Bird, or FRR) can manipulate them, I am reasonably bullish, but I do hope to convert others.\nThis computer now forwards packets like a boss, its packet loss is →\nLooking at the local situation, from a hypervisor running at IPng Networks in Equinix AM3 via FrysIX through VPP and into the dataplane of the Coloclue router eunetworks-2 , shows quite reasonable throughput as well:\nroot@eunetworks-2:~# traceroute hvn0.nlams3.ipng.ch traceroute to 46.20.243.179 (46.20.243.179), 30 hops max, 60 byte packets 1 enp1s0f3.eunetworks-3.router.nl.coloclue.net (94.142.247.247) 0.087 ms 0.078 ms 0.071 ms 2 frys-ix.ip-max.net (185.1.203.135) 1.288 ms 1.432 ms 1.479 ms 3 hvn0.nlams3.ipng.ch (46.20.243.179) 0.524 ms 0.534 ms 0.531 ms root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10 Connecting to host 46.20.243.179, port 5201 ... [SUM] 0.00-10.00 sec 6.70 GBytes 5.76 Gbits/sec 192 sender [SUM] 0.00-10.03 sec 6.58 GBytes 5.64 Gbits/sec receiver root@eunetworks-2:~# iperf3 -c 46.20.243.179 -P 10 -R Connecting to host 46.20.243.179, port 5201 Reverse mode, remote host 46.20.243.179 is sending ... [SUM] 0.00-10.03 sec 6.07 GBytes 5.20 Gbits/sec 54623 sender [SUM] 0.00-10.00 sec 6.03 GBytes 5.18 Gbits/sec receiver And the smokepings look just plain gorgeous:\n\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash; | \u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash;\u0026mdash; | The screenshots above are smokeping from (left) a machine at AS8283 Coloclue (in Amsterdam, the Netherlands), and from (right) a machine at AS8298 IPng (in Brüttisellen, Switzerland), both are showing no packetloss and clearly improved performance in end to end latency. Super!\nWhat\u0026rsquo;s next The performance of the one router we upgraded definitely improved, no question about that. But there\u0026rsquo;s a couple of things that I think we still need to do, so Rogier and I rolled back the change to the previous situation and kernel based routing.\nWe didn\u0026rsquo;t migrate keepalived, although IPng runs this in our DDLN [colocation] site, so I\u0026rsquo;m pretty confident that it will work. Kees and Ansible at Coloclue will need a few careful changes, to facilitate ongoing automation, think of dataplane and controlplane firewalls, sysctls (uRPF et al), fastnetmon, and so on will need a meaningful overhaul. There\u0026rsquo;s an unknown dataplane hang with Bird IPv6 enables a stub OSPF interface on interface lo0. We worked around that by putting the loopback IPv6 address on another interface, but this needs to be fully understood. 
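In vppcfg terms, that workaround is a small move of the /128. A sketch of just the relevant fragment, with the rest of the configuration shown earlier left as-is (this matches the ip -br a output above, where 2a02:898:0:300::3/128 appears on enp1s0f2 rather than on loop0):
loopbacks:
  loop0:
    addresses: [ 94.142.247.3/32 ]    # the IPv6 /128 is no longer on the loopback
interfaces:
  TenGigabitEthernet1/0/2:
    lcp: 'enp1s0f2'
    mtu: 9214
    addresses: [ 94.142.247.246/31, 2a02:898:0:301::/127, 2a02:898:0:300::3/128 ]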
Completely unrelated to Coloclue, there\u0026rsquo;s one dataplane hang regarding IPv6 RA/NS and/or BFD and/or Linux Control Plane that the VPP developer community is hunting down - it happens with my plugin but also with [TNSR] (who used the upstream linux-cp plugin). I\u0026rsquo;ve been working with a few folks from Netgate and customers of IPng Networks to try to find the root cause, as AS8298 has been bitten by this a few times over the last ~quarter or so. I cannot recommend in good faith running VPP until this is sorted out. As an important side note, VPP is not well enough understood at Coloclue - rolling this out further risks making me a single point of failure in the networking committee, and I\u0026rsquo;m not comfortable taking that responsibility. I recommend that Coloclue network committee members gain experience with VPP, DPDK, vppcfg and the other ecosystem tools, and that at least the bird6 OSPF issue and possible IPv6 NS/RA issue are understood, before making the jump to the VPP world.\n","date":"2023-02-24","desc":" Author: Pim van Pelt, Rogier Krieger Reviewers: Coloclue Network Committee Status: Draft - Review - Published Almost precisely two years ago, in February of 2021, I created a loadtesting environment at [Coloclue] to prove that a provider of L2 connectivity between two datacenters in Amsterdam was not incurring jitter or loss on its services \u0026ndash; I wrote up my findings in [an article], which demonstrated that the service provider indeed provides a perfect service. One month later, in March 2021, I briefly ran [VPP] on one of the routers at Coloclue, but due to lack of time and a few technical hurdles along the way, I had to roll back [ref].\n","permalink":"https://ipng.ch/s/articles/2023/02/24/case-study-vpp-at-coloclue-part-2/","section":"articles","title":"Case Study: VPP at Coloclue, part 2"},{"contents":" A while ago, in June 2021, we were discussing home routers that can keep up with 1G+ internet connections in the CommunityRack telegram channel. Of course at IPng Networks we are fond of the Supermicro Xeon D1518 [ref], which has a bunch of 10Gbit X522 and 1Gbit i350 and i210 intel NICs, but it does come at a certain price.\nFor smaller applications, PC Engines APU6 [ref] is kind of cool and definitely more affordable. But, in this chat, Patrick offered an alternative, the [Fitlet2] which is a small, passively cooled, and expandable IoT-esque machine.\nFast forward 18 months, and Patrick decided to sell off his units, so I bought one off of him, and decided to loadtest it. 
Considering the pricetag (the unit I will be testing will ship for around $400), and has the ability to use (1G/SFP) fiber optics, it may be a pretty cool one!\nExecutive Summary TL/DR: Definitely a cool VPP router, 3x 1Gbit line rate, A- would buy again\nWith some care on the VPP configuration (notably RX/TX descriptors), this unit can handle L2XC at (almost) line rate in both directions (2.94Mpps out a theoretical 2.97Mpps), with one VPP worker thread, which it not just good, it\u0026rsquo;s Good Enough™, at which time there is still plenty of headroom on the CPU, as the Atom E3950 has 4 cores.\nIn IPv4 routing, using two VPP worker threads, and 2 RX/TX queues on each NIC, the machine keeps up with 64 byte traffic in both directions (ie 2.97Mpps), again with compute power to spare, and while using only two out of four CPU cores on the Atom E3950.\nFor a $400,- machine that draws close to 11 Watts fully loaded, and sporting 8GB (at a max of 16GB) this Fitlet2 is a gem: it will easily keep up 3x 1Gbit in a production environment, while carrying multiple full BGP tables (900K IPv4 and 170K IPv6), with room to spare. It\u0026rsquo;s a classy little machine!\nDetailed findings The first thing that I noticed when it arrived is how small it is! The design of the Fitlet2 has a motherboard with a non-removable Atom E3950 CPU running at 1.6GHz, from the Goldmont series. This is a notoriously slow/budget CPU, and it comes with 4C/4T, each CPU thread comes with 24kB of L1 and 1MB of L2 cache, and there is no L3 cache on this CPU at all. That would mean performance in applications like VPP (which try to leverage these caches) will be poorer \u0026ndash; the main question on my mind is: does the CPU have enough oompff to keep up with the 1G network cards? I\u0026rsquo;ll want this CPU to be able to handle roughly 4.5Mpps in total, in order for Fitlet2 to count itself amongst the wirespeed routers.\nLooking further, Fitlet2 has one HDMI and one MiniDP port, two USB2 and two USB3 ports, two Intel i211 NICs with RJ45 port (these are 1Gbit). There\u0026rsquo;s a helpful MicroSD slot, two LEDs and an audio in- and output 3.5mm jack. The power button does worry me a little bit, I feel like just brushing against it may turn the machine off. I do appreciate the cooling situation - the top finned plate mates with the CPU on the top of the motherboard, and the bottom bracket holds a sizable aluminium cooling block which further helps dissipate heat, without needing any active cooling. The Fitlet folks claim this machine can run in environments anywhere between -50C and +112C, which I won\u0026rsquo;t be doing :)\nInside, there\u0026rsquo;s a single DDR3 SODIMM slot for memory (the one I have came with 8GB at 1600MT/s) and a custom, ableit open specification expansion board called a FACET-Card which stands for Function And Connectivity Extension T-Card, well okay then! The FACET card in this little machine sports one extra Intel i210-IS NIC, an M2 for an SSD, and an M2E for a WiFi port. The NIC is a 1Gbit SFP capable device. You can see its optic cage on the FACET card above, next to the yellow CMOS / Clock battery.\nThe whole thing is fed with 12V powerbrick delivering 2A, and a nice touch is that the barrel connector has a plastic bracket that locks it into the chassis by turning it 90degrees, so it won\u0026rsquo;t flap around in the breeze and detach. 
I wish other embedded PCs would ship with those, as I\u0026rsquo;ve been fumbling around in 19\u0026quot; racks that are, let me say, less tightly cable organized, and may or may not have disconnected the CHIX routeserver at some point in the past. Sorry, Max :)\nFor the curious, here\u0026rsquo;s a list of interesting details: [lspci] - [dmidecode] - [likwid-topology] - [dmesg].\nPreparing the Fitlet2 First, I grab a USB key and install Debian Bullseye (11.5) on it, using the UEFI installer. After booting, I carry through the instructions on my [VPP Production] post. Notably, I create the dataplane namespace, run an SSH and SNMP agent there, run isolcpus=1-3 so that I can give three worker threads to VPP, but I start off giving it only one (1) worker thread, because this way I can take a look at what the performance is of a single CPU, before scaling out to the three (3) threads that this CPU can offer. I also take the defaults for DPDK, notably allowing the DPDK poll-mode-drivers to take their proposed defaults:\nGigabitEthernet1/0/0: Intel Corporation I211 Gigabit Network Connection (rev 03) rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8) tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8)\nGigabitEthernet3/0/0: Intel Corporation I210 Gigabit Fiber Network Connection (rev 03) rx: queues 1 (max 4), desc 512 (min 32 max 4096 align 8) tx: queues 2 (max 4), desc 512 (min 32 max 4096 align 8)\nI observe that the i211 NIC allows for a maximum of two (2) RX/TX queues, while the (older!) i210 will allow for four (4) of them. And another thing that I see here is that there are two (2) TX queues active, but I only have one worker thread, so what gives? This is because there is always a main thread and a worker thread, and it could be that the main thread needs to / wants to send traffic out on an interface, so it always attaches to a queue in addition to the worker thread(s). When exploring new hardware, I find it useful to take a look at the output of a few tactical show commands on the CLI, such as:\n1. What CPU is in this machine?\nvpp# show cpu Model name: Intel(R) Atom(TM) Processor E3950 @ 1.60GHz Microarch model (family): [0x6] Goldmont ([0x5c] Apollo Lake) stepping 0x9 Flags: sse3 pclmulqdq ssse3 sse41 sse42 rdrand pqe rdseed aes sha invariant_tsc Base frequency: 1.59 GHz 2. Which devices on the PCI bus, PCIe speed details, and driver?\nvpp# show pci Address Sock VID:PID Link Speed Driver Product Name Vital Product Data 0000:01:00.0 0 8086:1539 2.5 GT/s x1 uio_pci_generic 0000:02:00.0 0 8086:1539 2.5 GT/s x1 igb 0000:03:00.0 0 8086:1536 2.5 GT/s x1 uio_pci_generic Note: This device at slot 02:00.0 is the second onboard RJ45 i211 NIC. I have used this one to log in to the Fitlet2 and more easily kill/restart VPP and so on, but I could of course just as well give it to VPP, in which case I\u0026rsquo;d have three gigabit interfaces to play with!\n3. 
What details are known for the physical NICs?\nvpp# show hardware GigabitEthernet1/0/0 GigabitEthernet1/0/0 1 up GigabitEthernet1/0/0 Link speed: 1 Gbps RX Queues: queue thread mode 0 vpp_wk_0 (1) polling TX Queues: TX Hash: [name: hash-eth-l34 priority: 50 description: Hash ethernet L34 headers] queue shared thread(s) 0 no 0 1 no 1 Ethernet address 00:01:c0:2a:eb:a8 Intel e1000 carrier up full duplex max-frame-size 2048 flags: admin-up maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported rx: queues 1 (max 2), desc 512 (min 32 max 4096 align 8) tx: queues 2 (max 2), desc 512 (min 32 max 4096 align 8) pci: device 8086:1539 subsystem 8086:0000 address 0000:01:00.00 numa 0 max rx packet len: 16383 promiscuous: unicast off all-multicast on vlan offload: strip off filter off qinq off rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum vlan-filter vlan-extend scatter keep-crc rss-hash rx offload active: ipv4-cksum scatter tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum tcp-tso multi-segs tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs rss avail: ipv4-tcp ipv4-udp ipv4 ipv6-tcp-ex ipv6-udp-ex ipv6-tcp ipv6-udp ipv6-ex ipv6 rss active: none tx burst function: (not available) rx burst function: (not available) Configuring VPP After this exploratory exercise, I have learned enough about the hardware to be able to take the Fitlet2 out for a spin. To configure the VPP instance, I turn to [vppcfg], which can take a YAML configuration file describing the desired VPP configuration, and apply it safely to the running dataplane using the VPP API. I\u0026rsquo;ve written a few more posts on how it does that, notably on its [syntax] and its [planner]. A complete configuration guide on vppcfg can be found [here].\npim@fitlet:~$ sudo dpkg -i {lib,}vpp*23.06*deb pim@fitlet:~$ sudo apt install python3-pip pim@fitlet:~$ sudo pip install vppcfg-0.0.3-py3-none-any.whl Methodology Method 1: Single CPU Thread Saturation First I will take VPP out for a spin by creating an L2 Cross Connect where any ethernet frame received on Gi1/0/0 will be directly transmitted as-is on Gi3/0/0 and vice versa. This is a relatively cheap operation for VPP, as it will not have to do any routing table lookups. 
The configuration looks like this:\npim@fitlet:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; l2xc.yaml interfaces: GigabitEthernet1/0/0: mtu: 1500 l2xc: GigabitEthernet3/0/0 GigabitEthernet3/0/0: mtu: 1500 l2xc: GigabitEthernet1/0/0 EOF pim@fitlet:~$ vppcfg plan -c l2xc.yaml [INFO ] root.main: Loading configfile l2xc.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134 comment { vppcfg sync: 10 CLI statement(s) follow } set interface l2 xconnect GigabitEthernet1/0/0 GigabitEthernet3/0/0 set interface l2 tag-rewrite GigabitEthernet1/0/0 disable set interface l2 xconnect GigabitEthernet3/0/0 GigabitEthernet1/0/0 set interface l2 tag-rewrite GigabitEthernet3/0/0 disable set interface mtu 1500 GigabitEthernet1/0/0 set interface mtu 1500 GigabitEthernet3/0/0 set interface mtu packet 1500 GigabitEthernet1/0/0 set interface mtu packet 1500 GigabitEthernet3/0/0 set interface state GigabitEthernet1/0/0 up set interface state GigabitEthernet3/0/0 up [INFO ] vppcfg.reconciler.write: Wrote 11 lines to (stdout) [INFO ] root.main: Planning succeeded After I paste these commands on the CLI, I start T-Rex in L2 stateless mode, and start T-Rex, I can generate some activity by starting the bench profile on port 0 with packets of 64 bytes in size and with varying IPv4 source and destination addresses and ports:\ntui\u0026gt;start -f stl/bench.py -m 1.48mpps -p 0 -t size=64,vm=var2 Let me explain a few hilights from the picture to the right. When starting this profile, I specified 1.48Mpps, which is the maximum amount of packets/second that can be generated on a 1Gbit link when using 64 byte frames (the smallest permissible ethernet frames). I do this because the loadtester comes with 10Gbit (and 100Gbit) ports, but the Fitlet2 has only 1Gbit ports. Then, I see that port0 is indeed transmitting (Tx pps) 1.48 Mpps, shown in dark blue. This is about 992 Mbps on the wire (the Tx bps L1), but due to the overhead of ethernet (each 64 byte ethernet frame needs an additional 20 bytes [details]), so the Tx bps L2 is about 64/84 * 992.35 = 756.08 Mbps, which lines up.\nThen, after the Fitlet2 tries its best to forward those from its receiving Gi1/0/0 port onto its transmitting port Gi3/0/0, they are received again by T-Rex on port 1. Here, I can see that the Rx pps is 1.29 Mpps, with an Rx bps of 660.49 Mbps (which is the L2 counter), and in bright red at the top I see the drop_rate is about 95.59 Mbps. In other words, the Fitlet2 is not keeping up.\nBut, after I take a look at the runtime statistics, I see that the CPU isn\u0026rsquo;t very busy at all:\nvpp# show run ... Thread 1 vpp_wk_0 (lcore 1) Time 23.8, 10 sec internal node vector rate 4.30 loops/sec 1638976.68 vector rates in 1.2908e6, out 1.2908e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call GigabitEthernet3/0/0-output active 6323688 27119700 0 9.14e1 4.29 GigabitEthernet3/0/0-tx active 6323688 27119700 0 1.79e2 4.29 dpdk-input polling 44406936 27119701 0 5.35e2 .61 ethernet-input active 6323689 27119701 0 1.42e2 4.29 l2-input active 6323689 27119701 0 9.94e1 4.29 l2-output active 6323689 27119701 0 9.77e1 4.29 Very interesting! Notice that the line above says vector rates in .. out .. are saying that the thread is receiving only 1.29Mpps, and it is managing to send all of them out as well. 
When a VPP worker is busy, each DPDK call will yield many packets, up to 256 in one call, which means the amount of \u0026ldquo;vectors per call\u0026rdquo; will rise. Here, I see that on average, DPDK is returning an average of only 0.61 packets each time it polls the NIC, and in each time a bunch of the packets are sent off into the VPP graph, there is an average of 4.29 packets per loop. If the CPU was the bottleneck, it would look more like 256 in the Vectors/Call column \u0026ndash; so the bottleneck must be in the NIC.\nRemember above, when I showed the show hardware command output? There\u0026rsquo;s a clue in there. The Fitlet2 has two onboard i211 NICs and one i210 NIC on the FACET card. Despite the lower number, the i210 is a bit more advanced [datasheet]. If I reverse the direction of flow (so receiving on the i210 Gi3/0/0, and transmitting on the i211 Gi1/0/0), things look a fair bit better:\nvpp# show run ... Thread 1 vpp_wk_0 (lcore 1) Time 12.6, 10 sec internal node vector rate 4.02 loops/sec 853956.73 vector rates in 1.4799e6, out 1.4799e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call GigabitEthernet1/0/0-output active 4642964 18652932 0 9.34e1 4.02 GigabitEthernet1/0/0-tx active 4642964 18652420 0 1.73e2 4.02 dpdk-input polling 12200880 18652933 0 3.27e2 1.53 ethernet-input active 4642965 18652933 0 1.54e2 4.02 l2-input active 4642964 18652933 0 1.04e2 4.02 l2-output active 4642964 18652933 0 1.01e2 4.02 Hey, would you look at that! The line up top here shows vector rates of in 1.4799e6 (which is 1.48Mpps) and outbound is the same number. And in this configuration as well, the DPDK node isn\u0026rsquo;t even reading that many packets, and the graph traversal is on average with 4.02 packets per run, which means that this CPU can do in excess of 1.48Mpps on one (1) CPU thread. Slick!\nSo what is the maximum throughput per CPU thread? To show this, I will saturate both ports with line rate traffic, and see what makes it through the other side. After instructing the T-Rex to perform the following profile:\ntui\u0026gt;start -f stl/bench.py -m 1.48mpps -p 0 1 \\ -t size=64,vm=var2 T-Rex will faithfully start to send traffic on both ports and expect the same amount back from the Fitlet2 (the Device Under Test or DUT). I can see that from T-Rex port 1-\u0026gt;0 all traffic makes its way back, but from port 0-\u0026gt;1 there is a little bit of loss (for the 1.48Mpps sent, only 1.43Mpps is returned). This is the same phenomenon that I explained above \u0026ndash; the i211 NIC is not quite as good at eating packets as the i210 NIC is.\nEven when doing this though, the (still) single threaded VPP is keeping up just fine, CPU wise:\nvpp# show run ... Thread 1 vpp_wk_0 (lcore 1) Time 13.4, 10 sec internal node vector rate 13.59 loops/sec 122820.33 vector rates in 2.9599e6, out 2.8834e6, drop 0.0000e0, punt 0.0000e0 Name State Calls Vectors Suspends Clocks Vectors/Call GigabitEthernet1/0/0-output active 1822674 19826616 0 3.69e1 10.88 GigabitEthernet1/0/0-tx active 1822674 19597360 0 1.51e2 10.75 GigabitEthernet3/0/0-output active 1823770 19826612 0 4.79e1 10.87 GigabitEthernet3/0/0-tx active 1823770 19029508 0 1.56e2 10.43 dpdk-input polling 1827320 39653228 0 1.62e2 21.70 ethernet-input active 3646444 39653228 0 7.67e1 10.87 l2-input active 1825356 39653228 0 4.96e1 21.72 l2-output active 1825356 39653228 0 4.58e1 21.72 Here we can see 2.96Mpps received (vector rates in) while only 2.88Mpps are transmitted (vector rates out). 
First off, this lines up perfectly with the reporting of T-Rex in the screenshot above, and it also shows that one direction loses more packets than the other. We\u0026rsquo;re dropping some 80kpps, but where did they go? Looking at the statistics counters, which include any packets which had errors in processing, we learn more:\nvpp# show err Count Node Reason Severity 3109141488 l2-output L2 output packets error 3109141488 l2-input L2 input packets error 9936649 GigabitEthernet1/0/0-tx Tx packet drops (dpdk tx failure) error 32120469 GigabitEthernet3/0/0-tx Tx packet drops (dpdk tx failure) error Aha! From previous experience I know that when DPDK signals packet drops due to \u0026rsquo;tx failure\u0026rsquo;, that this is often because it\u0026rsquo;s trying to hand off the packet to the NIC, which has a ringbuffer to collect them while the hardware transmits them onto the wire, and this NIC has run out of slots, which means the packet has to be dropped and a kitten gets hurt. But, I can raise the number of RX and TX slots, by setting them in VPP\u0026rsquo;s startup.conf file:\ndpdk { dev default { num-rx-desc 512 ## default num-tx-desc 1024 } no-multi-seg } And with that simple tweak, I\u0026rsquo;ve succeeded in configuring the Fitlet2 in a way that it is capable of receiving and transmitting 64 byte packets in both directions at (almost) line rate, with one CPU thread.\nMethod 2: Rampup using trex-loadtest.py For this test, I decide to put the Fitlet2 into L3 mode (up until now it was set up in L2 Cross Connect mode). To do this, I give the interfaces an IPv4 address and set a route for the loadtest traffic (which will be coming from 16.0.0.0/8 and going to 48.0.0.0/8). I will once again look to vppcfg to do this, because manipulating the YAML files like this allow me to easily and reliabily swap back and forth, letting vppcfg do the mundane chore of figuring out what commands to type, in which order, safely.\nFrom my existing L2XC dataplane configuration, I switch to L3 like so:\npim@fitlet:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; l3.yaml interfaces: GigabitEthernet1/0/0: mtu: 1500 lcp: e1-0-0 addresses: [ 100.64.10.1/30 ] GigabitEthernet3/0/0: mtu: 1500 lcp: e3-0-0 addresses: [ 100.64.10.5/30 ] EOF pim@fitlet:~$ vppcfg plan -c l3.yaml [INFO ] root.main: Loading configfile l3.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 23.06-rc0~35-gaf4046134 comment { vppcfg prune: 2 CLI statement(s) follow } set interface l3 GigabitEthernet1/0/0 set interface l3 GigabitEthernet3/0/0 comment { vppcfg create: 2 CLI statement(s) follow } lcp create GigabitEthernet1/0/0 host-if e1-0-0 lcp create GigabitEthernet3/0/0 host-if e3-0-0 comment { vppcfg sync: 2 CLI statement(s) follow } set interface ip address GigabitEthernet1/0/0 100.64.10.1/30 set interface ip address GigabitEthernet3/0/0 100.64.10.5/30 [INFO ] vppcfg.reconciler.write: Wrote 9 lines to (stdout) [INFO ] root.main: Planning succeeded One small note \u0026ndash; vppcfg cannot set routes, and this is by design as the Linux Control Plane is meant to take care of that. 
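In a real deployment that would typically mean a routing daemon installing routes into the kernel of the dataplane namespace. Purely as an illustration (this is not part of the loadtest setup), a Bird static protocol for the two test prefixes might look like:
protocol static {
  route 16.0.0.0/8 via 100.64.10.2;
  route 48.0.0.0/8 via 100.64.10.6;
}
For this loadtest, though, no routing daemon is involved.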
I can either set routes using ip in the dataplane network namespace, like so:\npim@fitlet:~$ sudo nsenter --net=/var/run/netns/dataplane root@fitlet:/home/pim# ip route add 16.0.0.0/8 via 100.64.10.2 root@fitlet:/home/pim# ip route add 48.0.0.0/8 via 100.64.10.6 Or, alternatively, I can set them directly on VPP in the CLI, interestingly with identical syntax:\npim@fitlet:~$ vppctl vpp# ip route add 16.0.0.0/8 via 100.64.10.2 vpp# ip route add 48.0.0.0/8 via 100.64.10.6 The loadtester will run a bunch of profiles (1514b, imix, 64b with multiple flows, and 64b with only one flow), either in unidirectional or bidirectional mode, which gives me a wealth of data to share:\nLoadtest 1514b imix Multi 64b Single 64b Bidirectional 81.7k (100%) 327k (100%) 1.48M (100%) 1.43M (98.8%) Unidirectional 73.2k (89.6%) 255k (78.2%) 1.18M (79.4%) 1.23M (82.7%) Caveats While all results of the loadtests are navigable [here], I will cherrypick one interesting bundle showing the results of all (bi- and unidirectional) tests:\nI have to admit I was a bit stumped with the unidirectional loadtests - these are pushing traffic into the i211 (onboard RJ45) NIC, and out of the i210 (FACET SFP) NIC. What I found super weird (and can\u0026rsquo;t really explain), is that the unidirectional load, which in the end serves half the packets/sec, is lower than the bidirectional load, which was almost perfect dropping only a little bit of traffic at the very end. A picture says a thousand words - so here\u0026rsquo;s a graph of all the loadtests, which you can also find by clicking on the links in the table.\nAppendix Generating the data The JSON files that are emitted by my loadtester script can be fed directly into Michal\u0026rsquo;s visualizer to plot interactive graphs (which I\u0026rsquo;ve done for the table above):\nDEVICE=Fitlet2 ## Loadtest SERVER=${SERVER:=hvn0.lab.ipng.ch} TARGET=${TARGET:=l3} RATE=${RATE:=10} ## % of line DURATION=${DURATION:=600} OFFSET=${OFFSET:=10} PROFILE=${PROFILE:=\u0026#34;ipng\u0026#34;} for DIR in unidirectional bidirectional; do for SIZE in 1514 imix 64; do [ $DIR == \u0026#34;unidirectional\u0026#34; ] \u0026amp;\u0026amp; FLAGS=\u0026#34;-u \u0026#34; ## Multiple Flows ./trex-loadtest -s ${SERVER} ${FLAGS} -p $PROFILE}.py -t \u0026#34;offset=${OFFSET},vm=var2,size=${SIZE}\u0026#34; \\ -rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-var2-${SIZE}-${DIR}.json [ $SIZE -eq 64 ] \u0026amp;\u0026amp; { ## Specialcase: Single Flow ./trex-loadtest -s ${SERVER} ${FLAGS -p ${PROFILE}.py -t \u0026#34;offset=${OFFSET},size=${SIZE}\u0026#34; \\ -rd ${DURATION} -rt ${RATE} -o ${DEVICE}-${TARGET}-${PROFILE}-${SIZE}-${DIR}.json } done done ## Graphs ruby graph.rb -t \u0026#34;${DEVICE} All Loadtests\u0026#34; ${DEVICE}*.json -o ${DEVICE}.html ruby graph.rb -t \u0026#34;${DEVICE} Unidirectional Loadtests\u0026#34; ${DEVICE}*unidir*.json \\ -o ${DEVICE}.unidirectional.html ruby graph.rb -t \u0026#34;${DEVICE} Bidirectional Loadtests\u0026#34; ${DEVICE}*bidir*.json \\ -o ${DEVICE}.bidirectional.html for i in ${PROFILE}-var2-1514 ${PROFILE}-var2-imix ${PROFILE}-var2-64 ${PROFILE}-64; do ruby graph.rb -t \u0026#34;${DEVICE} Unidirectional Loadtests\u0026#34; ${DEVICE}*-${i}*unidirectional.json \\ -o ${DEVICE}.$i-unidirectional.html; done ruby graph.rb -t \u0026#34;${DEVICE} Bidirectional Loadtests\u0026#34; ${DEVICE}*-${i}*bidirectional.json \\ -o ${DEVICE}.$i-bidirectional.html; done done ","date":"2023-02-12","desc":" A while ago, in June 2021, we were discussing home routers that can keep 
up with 1G+ internet connections in the CommunityRack telegram channel. Of course at IPng Networks we are fond of the Supermicro Xeon D1518 [ref], which has a bunch of 10Gbit X522 and 1Gbit i350 and i210 intel NICs, but it does come at a certain price.\nFor smaller applications, PC Engines APU6 [ref] is kind of cool and definitely more affordable. But, in this chat, Patrick offered an alternative, the [Fitlet2] which is a small, passively cooled, and expandable IoT-esque machine.\n","permalink":"https://ipng.ch/s/articles/2023/02/12/review-compulab-fitlet2/","section":"articles","title":"Review: Compulab Fitlet2"},{"contents":"After receiving an e-mail from a newer [China based OEM], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.\nI got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port switch). This reseller is using a less known silicon vendor called [Centec], who have a lineup of ethernet silicon. In this device, the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired with 4x100GbE uplink capability. This is Centec\u0026rsquo;s fourth generation, so CTC8096 inherits the feature set from L2/L3 switching to advanced data center and metro Ethernet features with innovative enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic management, and Network time synchronization.\nAfter discussing basic L2, L3 and Overlay functionality in my [previous post], I left somewhat of a cliffhanger alluding to all this fancy MPLS and VPLS stuff. Honestly, I needed a bit more time to play around with the featureset and clarify a few things. I\u0026rsquo;m now ready to assert that this stuff is really possible on this switch, and if this tickles your fancy, by all means read on :)\nDetailed findings Hardware The switch comes well packaged with two removable 400W Gold powersupplies from Compuware Technology which output 12V/33A and +5V/3A as well as four removable PWM controlled fans from Protechnic. The switch chip is a Centec [CTC8096] which is a competent silicon unit that can offer 48x10, 2x40 and 4x100G, and its smaller sibling carries the newer [CTC7132] from 2019, which brings 24x10 and 2x100G connectivity. While the firmware seems slightly different in denomination, the large one shows NetworkOS-e580-v7.4.4.r.bin as the firmware, and the smaller one shows uImage-v7.0.4.40.bin, I get the impression that the latter is a compiled down version of the former to work with the newer chipset.\nIn my [previous post], I showed L2, L3 and VxLAN, GENEVE and NvGRE capabilities of this switch to be line rate. 
But the hardware also supports MPLS, so I figured I\u0026rsquo;d complete the Overlay series by exploring VxLAN, and the MPLS, EoMPLS (L2VPN, Martini style), and VPLS functionality of these units.\nTopology {: style=\u0026ldquo;width:500px; float: right; margin-left: 1em; margin-bottom: 1em;\u0026rdquo;}\nIn the [IPng Networks LAB], I build the following topology using the loadtester, packet analyzer, and switches:\nmsw-top: S5624-2Z-EI switch msw-core: S5648X-2Q4ZA switch msw-bottom: S5624-2Z-EI switch All switches connect to: each other with 100G DACs (right, black) T-Rex machine with 4x10G (left, rainbow) Each switch gets a mgmt IPv4 and IPv6 This is the same topology in the previous post, and it gives me lots of wiggle room to patch anything to anything as I build point to point MPLS tunnels, VPLS clouds and eVPN overlays. Although I will also load/stress test these configurations, this post is more about the higher level configuration work that goes into building such an MPLS enabled telco network.\nMPLS Why even bother, if we have these fancy new IP based transports that I [wrote about] last week? I mentioned that the industry is moving on from MPLS to a set of more flexible IP based solutions like VxLAN and GENEVE, as they certainly offer lots of benefits in deployment (notably as overlays on top of existing IP networks).\nHere\u0026rsquo;s one plausible answer: You may have come across an architectural network design concept known as [BGP Free Core] and operating this way gives very little room for outages to occur in the L2 (Ethernet and MPLS) transport network, because it\u0026rsquo;s relatively simple in design and implementation. Some advantages worth mentioning:\nTransport devices do not need to be capable of supporting a large number of IPv4/IPv6 routes, either in the RIB or FIB, allowing them to be much cheaper. As there is no eBGP, transport devices will not be impacted by BGP-related issues, such as high CPU utilization during massive BGP re-convergence. Also, without eBGP, some of the attack vectors in ISPs (loopback DDoS or ARP storms on public internet exchange, to take two common examples) can be eliminated. If a new BGP security vulnerability were to be discovered, transport devices aren\u0026rsquo;t impacted. Operator errors (the #1 reason for outages in our industry) associated with BGP configuration and the use of large RIBs (eg. leaking into IGP, flapping transit sessions, etc) can be eradicated. New transport services such as MPLS point to point virtual leased lines, SR-MPLS, VPLS clouds, and eVPN can all be introduced without modifying the routing core. If deployed correctly, this type of transport-only network can be kept entirely isolated from the Internet, making DDoS and hacking attacks against transport elements impossible, and it also opens up possibilities for relatively safe sharing of infrastructure resources between ISPs (think of things like dark fibers between locations, rackspace, power, cross connects).\nFor smaller clubs (like IPng Networks), being able to share a 100G wave with others, significantly reduces price per Megabit! So if you\u0026rsquo;re in Zurich, Switzerland, or Europe and find this an interesting avenue to expand your reach in a co-op style environment, [reach out] to us, any time!\nMPLS + LDP Configuration OK, let\u0026rsquo;s talk bits and bytes. Table stakes functionality is of course MPLS switching and label distribution, which is performed with LDP, described in [RFC3036]. 
Enabling these features is relatively straight forward:\nmsw-top# show run int loop0 interface loopback0 ip address 172.20.0.2/32 ipv6 address 2001:678:d78:400::2/128 ipv6 router ospf 8298 area 0 msw-top# show run int eth-0-25 interface eth-0-25 description Core: msw-bottom eth-0-25 speed 100G no switchport mtu 9216 label-switching ip address 172.20.0.12/31 ipv6 address 2001:678:d78:400::3:1/112 ip ospf network point-to-point ip ospf cost 104 ipv6 ospf network point-to-point ipv6 ospf cost 106 ipv6 router ospf 8298 area 0 enable-ldp msw-top# show run router ospf router ospf 8298 network 172.20.0.0/24 area 0 msw-top# show run router ipv6 ospf router ipv6 ospf 8298 router-id 172.20.0.2 msw-top# show run router ldp router ldp router-id 172.20.0.2 transport-address 172.20.0.2 This seems like a mouthful, but really not too complicated. From the top, I create a loopback interface with an IPv4 (/32) and IPv6 (/128) address. Then, on the 100G transport interfaces, I specify an IPv4 (/31, let\u0026rsquo;s not be wasteful, take a look at [RFC 3021]) and IPv6 (/112) transit network, after which I add the interface to OSPF and OSPFv3.\nThe main two things to note in the interface definition is the use of label-switching which enables MPLS on the interface, and enable-ldp which makes it periodically multicast LDP discovery packets. If another device is also doing that, an LDP adjacency is formed using a TCP session. The two devices then exchange MPLS label tables, so that they learn from each other how to switch MPLS packets across the network.\nLDP signalling kind of looks like this on the wire:\n14:21:43.741089 IP 172.20.0.12.646 \u0026gt; 224.0.0.2.646: LDP, Label-Space-ID: 172.20.0.2:0, pdu-length: 30 14:21:44.331613 IP 172.20.0.13.646 \u0026gt; 224.0.0.2.646: LDP, Label-Space-ID: 172.20.0.1:0, pdu-length: 30 14:21:44.332773 IP 172.20.0.2.36475 \u0026gt; 172.20.0.1.646: Flags [S],seq 195175, win 27528, options [mss 9176,sackOK,TS val 104349486 ecr 0,nop,wscale 7], length 0 14:21:44.333700 IP 172.20.0.1.646 \u0026gt; 172.20.0.2.36475: Flags [S.], seq 466968, ack 195176, win 18328, options [mss 9176,sackOK,TS val 104335979 ecr 104349486,nop,wscale 7], length 0 14:21:44.334313 IP 172.20.0.2.36475 \u0026gt; 172.20.0.1.646: Flags [.], ack 1, win 216, options [nop,nop,TS val 104349486 ecr 104335979], length 0 The first two packets here are the routers announcing to [well known multicast] address for all-routers (224.0.0.2), and well known port 646 (for LDP), in a packet called a Hello Message. The router with address 172.20.0.12 is the one we just configured (msw-top), and the one with address 172.20.0.13 is the other side (msw-bottom). In these Hello messages, the router informs multicast listeners where they should connect (called the IPv4 transport address), in the case of msw-top, it\u0026rsquo;s 172.20.0.2.\nNow that they\u0026rsquo;ve noticed one anothers willingness to form an adjacency, a TCP connection is initiated from our router\u0026rsquo;s loopback address (specified by transport-address in the LDP configuration), towards the loopback that was learned from the Hello Message in the multicast packet earlier. A TCP three way handshake follows, in which the routers also tell each other their MTU (by means of the MSS field set to 9176, which is 9216 minus 20 bytes [IPv4 header] and 20 bytes [TCP header]). The adjacency forms and both routers exchange label information (in things called a Label Mapping Message). 
Once done exchanging this info, msw-top can now switch MPLS packets across its two 100G interfaces.\nZooming back out from what happened on the wire with the LDP signalling, I can take a look at the msw-top switch: besides the adjacency that I described in detail above, another one has formed over the IPv4 transit network between msw-top and msw-core (refer to the topology diagram to see what connects where). As this is a layer3 network, icky things like spanning tree and forwarding loops are no longer an issue. Any switch can forward MPLS packets to any neighbor in this topology, preference on the used path is informed with OSPF costs for the IPv4 interfaces (because LDP is using IPv4 here).\nmsw-top# show ldp adjacency IP Address Intf Name Holdtime LDP-Identifier 172.20.0.10 eth-0-26 15 172.20.0.1:0 172.20.0.13 eth-0-25 15 172.20.0.0:0 msw-top# show ldp session Peer IP Address IF Name My Role State KeepAlive 172.20.0.0 eth-0-25 Active OPERATIONAL 30 172.20.0.1 eth-0-26 Active OPERATIONAL 30 MPLS pseudowire The easiest form (and possibly most widely used one), is to create a point to point ethernet link betwen an interface on one switch, through the MPLS network, and into another switch\u0026rsquo;s interface on the other side. Think of this as a really long network cable. Ethernet frames are encapsulated into an MPLS frame, and passed through the network though some sort of tunnel, called a pseudowire.\nThere are many names of this tunneling technique. Folks refer to them as PWs (PseudoWires), VLLs (Virtual Leased Lines), Carrier Ethernet, or Metro Ethernet. Luckily, these are almost always interoperable, because under the covers, the vendors are implementing these MPLS cross connect circuits using [Martini Tunnels] which were formalized in [RFC 4447].\nThe way Martini tunnels work is by creating an extension in LDP signalling. An MPLS label-switched-path is annotated as being of a certain type, carrying a 32 bit pseudowire ID, which is ignored by all intermediate routers (they will just switch the MPLS packet onto the next hop), but the last router will inspect the MPLS packet and find which pseudowire ID it belongs to, and look up in its local table what to do with it (mostly just unwrap the MPLS packet, and marshall the resulting ethernet frame into an interface or tagged sub-interface).\nConfiguring the pseudowire is really simple:\nmsw-top# configure terminal interface eth-0-1 mpls-l2-circuit pw-vll1 ethernet ! mpls l2-circuit pw-vll1 829800 172.20.0.0 raw mtu 9000 msw-top# show ldp mpls-l2-circuit 829800 Transport Client VC Trans Local Remote Destination VC ID Binding State Type VC Label VC Label Address 829800 eth-0-1 UP Ethernet 32774 32773 172.20.0.0 After I\u0026rsquo;ve configured this on both msw-top and msw-bottom, using LDP signalling, a new LSP will be set up which carries ethernet packets at up to 9000 bytes, encapsulated MPLS, over the network. 
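For completeness, the msw-bottom side is simply the mirror image: the same pw-id, with the destination pointing back at msw-top's loopback (172.20.0.2). A sketch using the same CLI syntax as above; the circuit name is local to each switch and could be anything:
msw-bottom# configure terminal
interface eth-0-1
 mpls-l2-circuit pw-vll1 ethernet
!
mpls l2-circuit pw-vll1 829800 172.20.0.2 raw mtu 9000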
To show this in more detail, I\u0026rsquo;ll take the two ethernet interfaces that are connected to msw-top:eth-0-1 and msw-bottom:eth-0-1, and move them in their own network namespace on the lab machine:\nroot@dut-lab:~# ip netns add top root@dut-lab:~# ip netns add bottom root@dut-lab:~# ip link set netns top enp66s0f0 root@dut-lab:~# ip link set netns bottom enp66s0f1 I can now enter the top and bottom namespaces, and play around with those interfaces, for example I\u0026rsquo;ll give them an IPv4 address and a sub-interface with dot1q tag 1234 and an IPv6 address:\nroot@dut-lab:~# nsenter --net=/var/run/netns/bottom root@dut-lab:~# ip addr add 192.0.2.1/31 dev enp66s0f1 root@dut-lab:~# ip link add link enp66s0f1 name v1234 type vlan id 1234 root@dut-lab:~# ip addr add 2001:db8::2/64 dev v1234 root@dut-lab:~# ip link set v1234 up root@dut-lab:~# nsenter --net=/var/run/netns/top root@dut-lab:~# ip addr add 192.0.2.0/31 dev enp66s0f0 root@dut-lab:~# ip link add link enp66s0f0 name v1234 type vlan id 1234 root@dut-lab:~# ip addr add 2001:db8::1/64 dev v1234 root@dut-lab:~# ip link set v1234 up root@dut-lab:~# ping -c 5 2001:db8::2 PING 2001:db8::2(2001:db8::2) 56 data bytes 64 bytes from 2001:db8::2: icmp_seq=1 ttl=64 time=0.158 ms 64 bytes from 2001:db8::2: icmp_seq=2 ttl=64 time=0.155 ms 64 bytes from 2001:db8::2: icmp_seq=3 ttl=64 time=0.162 ms The mpls-l2-circuit that I created will transport the received ethernet frames between enp66s0f0 (in the top namespace) and enp66s0f1 (in the bottom namespace), using MPLS encapsulation, and giving the packets a stack of two labels. The outer most label helps the switches determine where to switch the MPLS packet (in other words, route it from msw-top to msw-bottom). Once the destination is reached, the outer label is popped off the stack, to reveal the second label, the purpose of which is to tell the msw-bottom switch what, preciesly, to do with this payload. The switch will find that the second label instructs it to transmit the MPLS payload as an ethernet frame out on port eth-0-1.\nIf I want to look at what happens on the wire with tcpdump(8), I can use the monitor port on msw-core which mirrors all packets transiting through it. But, I don\u0026rsquo;t get very far:\nroot@dut-lab:~# tcpdump -evni eno2 mpls 19:57:37.055854 00:1e:08:0d:6e:88 \u0026gt; 00:1e:08:26:ec:f3, ethertype MPLS unicast (0x8847), length 144: MPLS (label 32768, exp 0, ttl 255) (label 32773, exp 0, [S], ttl 255) 0x0000: 9c69 b461 7679 9c69 b461 7678 8100 04d2 .i.avy.i.avx.... 0x0010: 86dd 6003 4a42 0040 3a40 2001 0db8 0000 ..`.JB.@:@...... 0x0020: 0000 0000 0000 0000 0001 2001 0db8 0000 ................ 0x0030: 0000 0000 0000 0000 0002 8000 3553 9326 ............5S.\u0026amp; 0x0040: 0001 2185 9363 0000 0000 e7d9 0000 0000 ..!..c.......... 0x0050: 0000 1011 1213 1415 1617 1819 1a1b 1c1d ................ 0x0060: 1e1f 2021 2223 2425 2627 2829 2a2b 2c2d ...!\u0026#34;#$%\u0026amp;\u0026#39;()*+,- 0x0070: 2e2f 3031 3233 3435 3637 ./01234567 19:57:37.055890 00:1e:08:26:ec:f3 \u0026gt; 00:1e:08:0d:6e:88, ethertype MPLS unicast (0x8847), length 140: MPLS (label 32774, exp 0, [S], ttl 254) 0x0000: 9c69 b461 7678 9c69 b461 7679 8100 04d2 .i.avx.i.avy.... 0x0010: 86dd 6009 4122 0040 3a40 2001 0db8 0000 ..`.A\u0026#34;.@:@...... 0x0020: 0000 0000 0000 0000 0002 2001 0db8 0000 ................ 0x0030: 0000 0000 0000 0000 0001 8100 3453 9326 ............4S.\u0026amp; 0x0040: 0001 2185 9363 0000 0000 e7d9 0000 0000 ..!..c.......... 
0x0050: 0000 1011 1213 1415 1617 1819 1a1b 1c1d ................ 0x0060: 1e1f 2021 2223 2425 2627 2829 2a2b 2c2d ...!\u0026#34;#$%\u0026amp;\u0026#39;()*+,- 0x0070: 2e2f 3031 3233 3435 3637 ./01234567 For a brief moment, I stare closely at the first part of the hex dump, and I recognize two MAC addresses 9c69.b461.7678 and 9c69.b461.7679 followed by what appears to be 0x8100 (the ethertype for [Dot1Q]) and then 0x04d2 (which is 1234 in decimal, the VLAN tag I chose).\nClearly, the hexdump here is \u0026ldquo;just\u0026rdquo; an ethernet frame. So why doesn\u0026rsquo;t tcpdump decode it? The answer is simple: nothing in the MPLS packet tells me that the payload is actually ethernet. It could be anything, and it\u0026rsquo;s really up to the recipient of the packet with the label 32773 to determine what its payload means. Luckily, Wireshark can be prompted to decode further based on which MPLS label is present. Using the Decode As\u0026hellip; option, I can specify that data following label 32773 is Ethernet PW (no CW), where PW here means pseudowire and CW means controlword. Et, voilà, the first packet reveals itself:\nPseudowires on Sub Interfaces One very common use case for me at IPng Networks is to work with excellent partners like [IP-Max] who provide Internet Exchange transport, for example from DE-CIX or SwissIX, to the customer premises. IP-Max uses Cisco\u0026rsquo;s ASR9k routers, an absolutely beautiful piece of technology [ref], and with those you can terminate a L2VPN in any sub-interface.\nLet\u0026rsquo;s configure something similar. I take one port on msw-top, and branch that out into three remote locations, in this case msw-bottom port 1, 2 and 3. I will be terminating all three pseudowires on the same endpoint, but obviously this could also be one port that goes to three internet exchanges, say SwissIX, DE-CIX and FranceIX, on three different endpoints.\nThe configuration for both switches will look like this:\nmsw-top# configure terminal interface eth-0-1 switchport mode trunk switchport trunk native vlan 5 switchport trunk allowed vlan add 6-8 mpls-l2-circuit pw-vlan10 vlan 10 mpls-l2-circuit pw-vlan20 vlan 20 mpls-l2-circuit pw-vlan30 vlan 30 mpls l2-circuit pw-vlan10 829810 172.20.0.0 raw mtu 9000 mpls l2-circuit pw-vlan20 829820 172.20.0.0 raw mtu 9000 mpls l2-circuit pw-vlan30 829830 172.20.0.0 raw mtu 9000 msw-bottom# configure terminal interface eth-0-1 mpls-l2-circuit pw-vlan10 ethernet interface eth-0-2 mpls-l2-circuit pw-vlan20 ethernet interface eth-0-3 mpls-l2-circuit pw-vlan30 ethernet mpls l2-circuit pw-vlan10 829810 172.20.0.2 raw mtu 9000 mpls l2-circuit pw-vlan20 829820 172.20.0.2 raw mtu 9000 mpls l2-circuit pw-vlan30 829830 172.20.0.2 raw mtu 9000 Previously, I configured the port in ethernet mode, which takes all frames and forwards them into the MPLS tunnel. In this case, I\u0026rsquo;m using vlan mode, specifying a VLAN tag that, when frames arrive on the port matching it, will selectively be put into a pseudowire. As an added benefit, this allows me to still use the port as a regular switchport, in the snippet above it will take untagged frames and assign them to VLAN 5, allow tagged frames with dot1q VLAN tag 6, 7 or 8, and handle them as any normal switch would. VLAN tag 10, however, is directed into the pseudowire called pw-vlan10, and the other two tags similarly get put into their own l2-circuit. Using LDP signalling, the pw-id (829810, 829820, and 829830) determines which label is assigned. 
On the way back, that label allows the switch to correlate the ethernet frame with the correct port and transmit the it with the configured VLAN tag.\nTo show this from an end-user point of view, let\u0026rsquo;s take a look at the Linux server connected to these switches. I\u0026rsquo;ll put one port in a namespace called top, and three other ports in a network namespace called bottom, and then proceed to give them a little bit of config:\nroot@dut-lab:~# ip link set netns top dev enp66s0f0 root@dut-lab:~# ip link set netns bottom dev enp66s0f1 root@dut-lab:~# ip link set netns bottom dev enp66s0f2 root@dut-lab:~# ip link set netns bottom dev enp4s0f1 root@dut-lab:~# nsenter --net=/var/run/netns/top root@dut-lab:~# ip link add link enp66s0f0 name v10 type vlan id 10 root@dut-lab:~# ip link add link enp66s0f0 name v20 type vlan id 20 root@dut-lab:~# ip link add link enp66s0f0 name v30 type vlan id 30 root@dut-lab:~# ip addr add 192.0.2.0/31 dev v10 root@dut-lab:~# ip addr add 192.0.2.2/31 dev v20 root@dut-lab:~# ip addr add 192.0.2.4/31 dev v30 root@dut-lab:~# nsenter --net=/var/run/netns/bottom root@dut-lab:~# ip addr add 192.0.2.1/31 dev enp66s0f1 root@dut-lab:~# ip addr add 192.0.2.3/31 dev enp66s0f2 root@dut-lab:~# ip addr add 192.0.2.5/31 dev enp4s0f1 root@dut-lab:~# ping 192.0.2.4 PING 192.0.2.4 (192.0.2.4) 56(84) bytes of data. 64 bytes from 192.0.2.4: icmp_seq=1 ttl=64 time=0.153 ms 64 bytes from 192.0.2.4: icmp_seq=2 ttl=64 time=0.209 ms To unpack this a little bit, in the first block I assign the interfaces to their respective namespace. Then, for the interface connected to the msw-top switch, I create three dot1q sub-interfaces, corresponding to the pseudowires I created. Note: untagged traffic out of enp66s0f0 will simply be picked up by the switch and assigned VLAN 5 (and I\u0026rsquo;m also allowed to send VLAN tags 6, 7 and 8, which will all be handled locally).\nBut, VLAN 10, 20 and 30 will be moved through the MPLS network and pop out on the msw-bottom switch, where they are each assigned a unique port, represented by enp66s0f1, enp66s0f2 and enp4s0f1 connected to the bottom switch.\nWhen I finally ping 192.0.2.4, that ICMP packet goes out on enp4s0f1, which enters msw-bottom:eth-0-3, it gets assigned the pseudowire name pw-vlan30, which corresponds to the pw-id 829830, then it travels over the MPLS network, arriving at msw-bottom carrying a label that tells that switch that it belongs to its local pw-id 829830 which corresponds to name pw-vlan30 and is assigned VLAN tag 30 on port eth-0-1. Phew, I made it. It actually makes sense when you think about it!\nVPLS The pseudowires that I described in the previous section are simply ethernet cross connects spanning over an MPLS network. They are inherently point-to-point, much like a physical Ethernet cable is. Sometimes, it makes more sense to take a local port and create what is called a Virtual Private LAN Service (VPLS), described in [RFC4762], where packets into this port are capable of being sent to any number of other ports on any number of other switches, while using MPLS as transport.\nBy means of example, let\u0026rsquo;s say a telco offers me one port in Amsterdam, one in Zurich and one in Frankfurt. A VPLS instance would create an emulated LAN segment between these locations, in other words a Layer 2 broadcast domain that is fully capable of learning and forwarding on Ethernet MAC addresses but the ports are dedicated to me, and they are isolated from other customers. 
The telco has essentially created a three-port switch for me, but at the same time, that telco can create any number of VPLS services, each one unique to their individual customers. It\u0026rsquo;s a pretty powerful concept.\nIn principle, a VPLS consists of two parts:\nA full mesh of simple MPLS point-to-point tunnels from each participating switch to each other one. These are just pseudowires with a given pw-id, just like I showed before. The pseudowires are then tied together in a form of bridge domain, and learning is applied to MAC addresses that appear behind each port, signalling that these are available behind the port. Configuration on the switch looks like this:\nmsw-top# configure terminal interface eth-0-1 mpls-vpls v-ipng ethernet interface eth-0-2 mpls-vpls v-ipng ethernet interface eth-0-3 mpls-vpls v-ipng ethernet interface eth-0-4 mpls-vpls v-ipng ethernet ! mpls vpls v-ipng 829801 vpls-peer 172.20.0.0 raw vpls-peer 172.20.0.1 raw The first set of commands add each individual interface into the VPLS instance by binding it to a name, in this case v-ipng. Then, the VPLS neighbors are specified, by offering a pw-id (829801) which is used to construct a pseudowire to the two peers. The first, 172.20.0.0 is msw-bottom, and the other, 172.20.0.1 is msw-core. Each switch that participates in the VPLS for v-ipng will signal LSPs to each of its peers, and MAC learning will be enabled just as if each of these pseudowires were a regular switchport.\nOnce I configure this pattern on all three switches, effectively interfaces eth-0-1 - 4 are now bound together as a virtual switch with a unique broadcast domain dedicated to instance v-ipng. I\u0026rsquo;ve created a fully transparent 12-port switch, which means that what-ever traffic I generate, will be encapsulated in MPLS and sent through the MPLS network towards its destination port.\nLet\u0026rsquo;s take a look at the msw-core switch to see how this looks like:\nmsw-core# show ldp vpls VPLS-ID Peer Address State Type Label-Sent Label-Rcvd Cw 829801 172.20.0.0 Up ethernet 32774 32773 0 829801 172.20.0.2 Up ethernet 32776 32774 0 msw-core# show mpls vpls mesh VPLS-ID Peer Addr/name In-Label Out-Intf Out-Label Type St Evpn Type2 Sr-tunid 829801 172.20.0.0/- 32777 eth-0-50 32775 RAW Up N N - 829801 172.20.0.2/- 32778 eth-0-49 32776 RAW Up N N - msw-core# show mpls vpls detail Virtual Private LAN Service Instance: v-ipng, ID: 829801 Group ID: 0, Configured MTU: NULL Description: none AC interface : Name TYPE Vlan eth-0-1 Ethernet ALL eth-0-2 Ethernet ALL eth-0-3 Ethernet ALL eth-0-4 Ethernet ALL Mesh Peers : Peer TYPE State C-Word Tunnel name LSP name 172.20.0.0 RAW UP Disable N/A N/A 172.20.0.2 RAW UP Disable N/A N/A Vpls-mac-learning enable Discard broadcast disabled Discard unknown-unicast disabled Discard unknown-multicast disabled Putting this to the test, I decide to run a loadtest saturating 12x 10G of traffic through this spiffy 12-port virtual switch. I randomly assign ports on the loadtester to the 12 ports in the v-ipng VPLS, and then I start full line rate load with 128 byte packets. Considering I\u0026rsquo;m using twelve TenGig ports, I would expect 12x8.43 or roughly 101Mpps flowing, and indeed, the loadtests demonstrate this mark nicely:\nImportant: The screenshot above shows the first four ports on the T-Rex interface only, but there are actually twelve ports participating in this loadtest. In the top right corner, the total throughput is correctly represented. 
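As a quick back-of-the-envelope check of the expected throughput, here is a small Python sketch of the line-rate math, assuming 128 byte frames (including FCS) and the usual 20 bytes of preamble, SFD and inter-frame gap per frame on the wire:

#!/usr/bin/env python3
# Line-rate math for twelve TenGig ports carrying 128 byte frames.
FRAME = 128        # bytes of L2 frame, including FCS
OVERHEAD = 8 + 12  # preamble+SFD and inter-frame gap, in bytes
PORTS = 12         # TenGig ports participating in the v-ipng VPLS
LINERATE = 10e9    # bits/sec of L1 per port

pps_per_port = LINERATE / ((FRAME + OVERHEAD) * 8)
total_pps = PORTS * pps_per_port
l1_bps = PORTS * LINERATE
l2_bps = total_pps * FRAME * 8

print(f"{pps_per_port/1e6:.2f} Mpps/port, {total_pps/1e6:.1f} Mpps total")
print(f"L1: {l1_bps/1e9:.0f} Gbps, L2: {l2_bps/1e9:.1f} Gbps")
# 8.45 Mpps/port, 101.4 Mpps total
# L1: 120 Gbps, L2: 103.8 Gbps

Give or take rounding in how the counters are displayed, this matches what the switches report.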
The switches are handling 120Gbps of L1, 103.5Gbps of L2 (which is expected at 128b frames, as there is a little bit of ethernet overhead for each frame), which is a whopping 101Mpps, which is exactly what I would expect.\nAnd the chassis doesn\u0026rsquo;t even get warm.\nConclusions It\u0026rsquo;s just super cool to see a switch like this work as expected. I did not manage to overload it at all, in my [previous article], I showed VxLAN, GENEVE and NvGRE overlays at line rate. Here, I can see that MPLS with all of its Martini bells and whistles, and as well the more advanced VPLS, are keeping up like a champ. I think at least for initial configuration and throughput on all MPLS features I tested, both the small 24x10 + 2x100G switch, and the larger 48x10 + 2x40 + 4x100G switch, are keeping up just fine.\nA duration test will have to show if the configuration and switch fabric are stable over time, but I am hopeful that Centec is hitting the exact sweet spot for me on the MPLS transport front.\nYes, yes yes. I did as well promise to take a look at eVPN functionality (this is another form of L3VPN which uses iBGP to share which MAC addresses live behind which VxLAN ports). This post has been fun, but also quite long (4300 words!) so I\u0026rsquo;ll follow up in a future article on the eVPN capabilities of the Centec switches.\n","date":"2022-12-09","desc":"After receiving an e-mail from a newer [China based OEM], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks.\nI got pretty enthusiastic when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget ($4'200,- for the 56 port; and $1'650,- for the 26 port switch). This reseller is using a less known silicon vendor called [Centec], who have a lineup of ethernet silicon. In this device, the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired with 4x100GbE uplink capability. This is Centec\u0026rsquo;s fourth generation, so CTC8096 inherits the feature set from L2/L3 switching to advanced data center and metro Ethernet features with innovative enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic management, and Network time synchronization.\n","permalink":"https://ipng.ch/s/articles/2022/12/09/review-s5648x-2q4z-switch-part-2-mpls/","section":"articles","title":"Review: S5648X-2Q4Z Switch - Part 2: MPLS"},{"contents":"After receiving an e-mail from a newer [China based switch OEM], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks. You may recall my previous endeavors in the Fiberstore lineup, notably an in-depth review of the [S5860-20SQ] which sports 20x10G, 4x25G and 2x40G optics, and its larger sibling the S5860-48SC which comes with 48x10G and 8x100G cages. I use them in production at IPng Networks and their featureset versus price point is pretty good. 
In that article, I made one critical note reviewing those FS switches, in that they\u0026rsquo;e be a better fit if they allowed for MPLS or IP based L2VPN services in hardware.\nI got cautiously enthusiastic (albeit suitably skeptical) when this new vendor claimed VxLAN, GENEVE, MPLS and GRE at 56 ports and line rate, on a really affordable budget (sub-$4K for the 56 port; and sub-$2K for the 26 port switch). This reseller is using a less known silicon vendor called [Centec], who have a lineup of ethernet silicon. In this device, the CTC8096 (GoldenGate) is used for cost effective high density 10GbE/40GbE applications paired with 4x100GbE uplink capability. This is Centec\u0026rsquo;s fourth generation, so CTC8096 inherits the feature set from L2/L3 switching to advanced data center and metro Ethernet features with innovative enhancement. The switch chip provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports, inheriting from its predecessors a variety of features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlights features include Telemetry, Programmability, Security and traffic management, and Network time synchronization.\nThis will be the first of a set of write-ups exploring the hard- and software functionality of this new vendor. As we\u0026rsquo;ll see, it\u0026rsquo;s all about the software.\nDetailed findings Hardware The switch comes well packaged with two removable 400W Gold powersupplies from Compuware Technology which output 12V/33A and +5V/3A as well as four removable PWM controlled fans from Protechnic. The fans are expelling air, so they are cooling front-to-back on this unit. Looking at the fans, changing them to pull air back-to-front would be possible after-sale, by flipping the fans around as they\u0026rsquo;re attached in their case by two M4 flat-head screws. This is truly meant to be an OEM switch \u0026ndash; there is no logo or sticker with the vendor\u0026rsquo;s name, so I should probably print a few vinyl IPng stickers to skin them later.\nOn the front, the switch sports an RJ45 standard serial console, a mini-usb connector of which the function is not clear to me, an RJ45 network port used for management, a pinhole which houses a reset button labeled RST and two LED indicators labeled ID and SYS. The serial port runs at 115200,8n1 and the managment network port is Gigabit.\nRegarding the regular switch ports, there are 48x SFP+ cages, 4x QSFP28 (port 49-52) runing at 100Gbit, and 2x QSFP+ ports (53-54) running at 40Gbit. All ports (management and switch) present a MAC address from OUI 00-1E-08, which is assigned to Centec.\nThe switch is not particularly quiet, as its six fans total start up at a high pitch but once the switch boots, they calm down and emit noise levels as you would expect from a datacenter unit. I measured it at 74dBA when booting, and otherwise at around 62dBA when running. On the inside, the PCB is rather clean. It comes with a daughter board, housing a small PowerPC P1010 with 533MHz CPU, 1GB of RAM, and 2GB flash on board, which is running Linux. This is the same card that many of the FS.com switches use (eg. S5860-48S6Q), a cheaper alternative to the high end Intel Xeon-D.\nS5648X (48x10, 2x40, 4x100) There is one switch chip, on the front of the PCB, connecting all 54 ports. It has a sizable heatsink on it, drawing air backwards through ports (36-48). 
The switch uses a less well known and somewhat dated Centec [CTC8096], codenamed GoldenGate and released in 2015, which is rated for 1.2Tbps of aggregate throughput. The chip can be programmed to handle a bunch of SDN protocols, including VxLAN, GRE, GENEVE, and MPLS / MPLS SR, with a limited TCAM to hold things like ACLs, IPv4/IPv6 routes and MPLS labels. The CTC8096 provides up to 96x10GbE ports, or 24x40GbE, or 80x10GbE + 4x100GbE ports. The SerDES design is pretty flexible, allowing it to mix and match ports.\nYou can see more (hi-res) pictures and screenshots throughout these articles in this [Photo Album].\nS5624X (24x10, 2x100) In case you\u0026rsquo;re curious (I certainly was!) the smaller unit (with 24x10+2x100) is built off of the Centec [CTC7132], codenamed TsingMa, which was released in 2019, and it offers a variety of similar features, including L2, L3, MPLS, VXLAN, MPLS SR, and OAM/APS. Highlight features include Telemetry, Programmability, Security and traffic management, and Network time synchronization. The SoC has an embedded ARM A53 CPU Core running at 800MHz, and the SerDES on this chip allows for 24x1G/2.5G/5G/10G and 2x40G/100G for a throughput of 440Gbps, but at a fairly sharp price point.\nOne thing worth noting (because I know some of my regular readers will already be wondering!): this series of chips (both the 4th generation CTC8096 and the sixth generation CTC7132) comes with a very modest TCAM, which means in practice 32K MAC addresses, 8K IPv4 routes, 1K IPv6 routes, a 6K MPLS table, 1K L2VPN instances, and 64 VPLS instances. The Centec also comes with a modest 32MB packet buffer shared between all ports, and the controlplane comes with 1GB of memory and a 533MHz ARM. So no, this won\u0026rsquo;t take a full table :-) but in all honesty, that\u0026rsquo;s not the thing this machine is built to do.\nWhen booted, the switch draws roughly 68 Watts combined on its two power supplies, and I find that pretty cool considering the total throughput offered. Of course, once optics are inserted, the total power draw will go up. Also worth noting, when the switch is under load, the Centec chip will consume more power, for example when forwarding 8x10G + 2x100G, the total consumption was 88 Watts, totally respectable now that datacenter power bills are skyrocketing.\nTopology On the heels of my [DENOG14 Talk], in which I showed how VPP can route 150Mpps and 180Gbps on a 10 year old Dell while consuming a full 1M BGP table in 7 seconds or so, I still had a little bit of a LAB left to repurpose. So I built the following topology using the loadtester, packet analyzer, and switches:\nmsw-top: S5624-2Z-EI switch msw-core: S5648X-2Q4ZA switch msw-bottom: S5624-2Z-EI switch All switches connect to: each other with 100G DACs (right, black) T-Rex machine with 4x10G (left, rainbow) Each switch gets a mgmt IPv4 and IPv6 With this topology I will have enough wiggle room to patch anything to anything. Now that the physical part is out of the way, let\u0026rsquo;s take a look at the firmware of these things!\nSoftware As can be seen in the topology above, I am testing three of these switches - two are the smaller sibling [S5624X 2Z-EI] (which come with 24x10G SFP+ and 2x100G QSFP28), and one is this [S5648X 2Q4Z] pictured above. The vendor has a licensing system, for basic L2, basic L3 and advanced metro L3. 
These switches come with the most advanced/liberal licenses, which means all of the features will work on the switches, notably, MPLS/LDP and VPWS/VPLS.\nTaking a look at the CLI, it\u0026rsquo;s very Cisco IOS-esque; there\u0026rsquo;s a few small differences, but the look and feel is definitely familiar. Base configuration kind of looks like this:\nBasic config msw-core# show running-config management ip address 192.168.1.33/24 management route add gateway 192.168.1.252 ! ntp server 216.239.35.4 ntp server 216.239.35.8 ntp mgmt-if enable ! snmp-server enable snmp-server system-contact noc@ipng.ch snmp-server system-location Bruttisellen, Switzerland snmp-server community public read-only snmp-server version v2c msw-core# conf t msw-core(config)# stm prefer ipran A few small things of note. There is no mgmt0 device as I would\u0026rsquo;ve expected. Instead, the SoC exposes its management interface to be configured with these management ... commands. The IPv4 can be either DHCP or a static address, and IPv6 can only do static addresses. Only one (default) gateway can be set for either protocol. Then, NTP can be set up to work on the mgmt-if which is a useful way to use it for timekeeping.\nThe SNMP server works both from the mgmt-if and from the dataplane, which is nice. SNMP supports everything you\u0026rsquo;d expect, including v3 and traps for all sorts of events, including IPv6 targets and either dataplane or mgmt-if.\nI did notice that the nameserver cannot use the mgmt-if, so I left it unconfigured. I found it a little bit odd, considering all the other functionality does work just fine over the mgmt-if.\nIf you\u0026rsquo;ve run CAM-based systems before, you\u0026rsquo;ll likely have come across some form of partitioning mechanism, to allow certain types in the CAM (eg. IPv4, IPv6, L2 MACs, MPLS labels, ACLs) to have more or fewer entries. This is particularly relevant on this switch because it has a comparatively small CAM. It turns out, that by default MPLS is entirely disabled, and to turn it on (and sacrifice some of that sweet sweet content addressable memory), I have to issue the command stm prefer ipran (other flavors are ipv6, layer3, ptn, and default), and reload the switch.\nHaving been in the networking industry for a while, I scratched my head on the acronym IPRAN, so I will admit having to look it up. It\u0026rsquo;s a general term used to describe an IP based Radio Access Network (2G, 3G, 4G or 5G) which uses IP as a transport layer technology. I find it funny in a twisted sort of way, that to get the oldskool MPLS service, I have to turn on IPRAN.\nAnyway, after changing the STM profile to ipran, the following partition is available:\nIPRAN CAM S5648X (msw-core) S5624 (msw-top \u0026amp; msw-bottom) MAC Addresses 32k 98k IPv4 routes host: 4k, indirect: 8k host: 12k, indirect: 56k IPv6 routes host: 512, indirect: 512 host: 2048, indirect: 1024 MPLS labels 6656 6144 VPWS instances 1024 1024 VPLS instances 64 64 Port ACL entries ingress: 1927, egress: 176 ingress: 2976, egress: 928 VLAN ACL entries ingress: 256, egress: 32 ingress: 256, egress: 64 First off: there\u0026rsquo;s quite a few differences here! The big switch has relatively few MAC, IPv4 and IPv6 routes, compared to the little ones. But, it has a few more MPLS labels. ACL wise, the small switch once again has a bit more capacity. But, of course the large switch has lots more ports (56 versus 26), and is more expensive. 
Choose wisely :)\nRegarding IPv4/IPv6 and MPLS space, luckily [AS8298] is relatively compact in its IGP. As of today, it carries 41 IPv4 and 48 IPv6 prefixes in OSPF, which means that these switches would be fine participating in Area 0. If CAM space does turn into an issue down the line, I can put them in stub areas and advertise only a default. As an aside, VPP doesn\u0026rsquo;t have any CAM at all, so for my routers the size is basically goverened by system memory (which on modern computers equals \u0026ldquo;infinite routes\u0026rdquo;). As long as I keep it out of the DFZ, this switch should be fine, for example in a BGP-free core that switches traffic based on VxLAN or MPLS, but I digress.\nL2 First let\u0026rsquo;s test a straight forward configuration:\nmsw-top# configure terminal msw-top(config)# vlan database msw-top(config-vlan)# vlan 5-8 msw-top(config-vlan)# interface eth-0-1 msw-top(config-if)# switchport access vlan 5 msw-top(config-vlan)# interface eth-0-2 msw-top(config-if)# switchport access vlan 6 msw-top(config-vlan)# interface eth-0-3 msw-top(config-if)# switchport access vlan 7 msw-top(config-vlan)# interface eth-0-4 msw-top(config-if)# switchport mode dot1q-tunnel msw-top(config-if)# switchport dot1q-tunnel native vlan 8 msw-top(config-vlan)# interface eth-0-26 msw-top(config-if)# switchport mode trunk msw-top(config-if)# switchport trunk allowed vlan only 5-8 By means of demonstration, I created port eth-0-4 as a QinQ capable port - which means that any untagged frames coming into it will become VLAN 8, but any tagged frames will become s-tag 8 and c-tag with whatever tag was sent, in other words standard issue QinQ tunneling. The configuration of msw-bottom is exactly the same, and because we\u0026rsquo;re connecting these VLANs through msw-core, I\u0026rsquo;ll have to make it a member of all these interfaces using the interface range shortcut:\nmsw-core# configure terminal msw-core(config)# vlan database msw-core(config-vlan)# vlan 5-8 msw-core(config-vlan)# interface range eth-0-49 - 50 msw-core(config-if)# switchport mode trunk msw-core(config-if)# switchport trunk allowed vlan only 5-8 The loadtest results in T-Rex are, quite unsurprisingly, line rate. In the screenshot below, I\u0026rsquo;m sending 128 byte frames at 8x10G (40G from msw-top through msw-core and out msw-bottom, and 40G in the other direction):\nA few notes, for critical observers:\nI have to use 128 byte frames because the T-Rex loadtester is armed with 3x Intel x710 NICs, which have a total packet rate of 40Mpps only. Intel made these with LACP redundancy in mind, and do not recommend fully loading them. As 64b frames would be ~59.52Mpps, the NIC won\u0026rsquo;t keep up. So, I let T-Rex send 128b frames, which is ~33.8Mpps. T-Rex shows only the first 4 ports in detail, and you can see all four ports are sending 10Gbps of L1 traffic, which at this frame size is 8.66Gbps of ethernet (as each frame also has a 24 byte overhead [ref]). We can clearly see though, that all Tx packets/sec are also Rx packets/sec, which means all traffic is safely accounted for. 
In the top panel, you will see not 4x10, but 8x10Gbps and 67.62Mpps of total throughput, with no traffic lost, and the loadtester CPU well within limits: 👍 msw-top# show int summary | exc DOWN RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec) TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec) Interface Link RXBS RXPS TXBS TXPS ----------------------------------------------------------------------------- eth-0-1 UP 10016060422 8459510 10016060652 8459510 eth-0-2 UP 10016080176 8459527 10016079835 8459526 eth-0-3 UP 10015294254 8458863 10015294258 8458863 eth-0-4 UP 10016083019 8459529 10016083126 8459529 eth-0-25 UP 449 0 501 0 eth-0-26 UP 41362394687 33837608 41362394527 33837608 Clearly, all three switches are happy to forward 40Gbps in both directions, and the 100G port is happy to forward (at least) 40G symmetric - and because the uplink port is trunked, each ethernet frame will be 4 bytes longer due to the dot1q tag, which, at 128b frames means we\u0026rsquo;ll be using 132/128 * 4 * 10G == 41.3G of traffic, which it spot on.\nL3 For this test, I will reconfigure the 100G ports to become routed rather than switched. Remember, msw-top connects to msw-core, which in turn connects to msw-bottom, so I\u0026rsquo;ll need two IPv4 /31 and two IPv6 /64 transit networks. I\u0026rsquo;ll also create a loopback interface with a stable IPv4 and IPv6 address on each switch, and I\u0026rsquo;ll tie all of these together in IPv4 and IPv6 OSPF in Area 0. The configuration for the msw-top switch becomes:\nmsw-top# configure terminal interface loopback0 ip address 172.20.0.2/32 ipv6 address 2001:678:d78:400::2/128 ipv6 router ospf 8298 area 0 ! interface eth-0-26 description Core: msw-core eth-0-49 speed 100G no switchport mtu 9216 ip address 172.20.0.11/31 ipv6 address 2001:678:d78:400::2:2/112 ip ospf network point-to-point ip ospf cost 1004 ipv6 ospf network point-to-point ipv6 ospf cost 1006 ipv6 router ospf 8298 area 0 ! router ospf 8298 router-id 172.20.0.2 network 172.20.0.0/22 area 0 redistribute static ! router ipv6 ospf 8298 router-id 172.20.0.2 redistribute static Now that the IGP is up for IPv4 and IPv6 and I can ping the loopbacks from any switch to any other switch, I can continue with the loadtest. I\u0026rsquo;ll configure four IPv4 interfaces:\nmsw-top# configure terminal interface eth-0-1 no switchport ip address 100.65.1.1/30 ! interface eth-0-2 no switchport ip address 100.65.2.1/30 ! interface eth-0-3 no switchport ip address 100.65.3.1/30 ! interface eth-0-4 no switchport ip address 100.65.4.1/30 ! 
ip route 16.0.1.0/24 100.65.1.2 ip route 16.0.2.0/24 100.65.2.2 ip route 16.0.3.0/24 100.65.3.2 ip route 16.0.4.0/24 100.65.4.2 After which I can see these transit networks and static routes propagate, through msw-core, and into msw-bottom:\nmsw-bottom# show ip route Codes: K - kernel, C - connected, S - static, R - RIP, B - BGP O - OSPF, IA - OSPF inter area N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2 E1 - OSPF external type 1, E2 - OSPF external type 2 i - IS-IS, L1 - IS-IS level-1, L2 - IS-IS level-2, ia - IS-IS inter area Dc - DHCP Client [*] - [AD/Metric] * - candidate default O 16.0.1.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 O 16.0.2.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 O 16.0.3.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 O 16.0.4.0/24 [110/2013] via 172.20.0.9, eth-0-26, 05:23:56 O 100.65.1.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 O 100.65.2.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 O 100.65.3.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 O 100.65.4.0/30 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 C 172.20.0.0/32 is directly connected, loopback0 O 172.20.0.1/32 [110/1005] via 172.20.0.9, eth-0-26, 05:50:48 O 172.20.0.2/32 [110/2010] via 172.20.0.9, eth-0-26, 05:23:56 C 172.20.0.8/31 is directly connected, eth-0-26 C 172.20.0.8/32 is in local loopback, eth-0-26 O 172.20.0.10/31 [110/1018] via 172.20.0.9, eth-0-26, 05:50:48 I now instruct the T-Rex loadtester to send single-flow loadtest traffic from 16.0.X.1 -\u0026gt; 48.0.X.1 on port 0; and back from 48.0.X.1 -\u0026gt; 16.0.X.1 on port 1; and then for port2+3 I use X=2, for port4+5 I will use X=3, and port 6+7 I will use X=4. After T-Rex starts up, it\u0026rsquo;s sending 80Gbps of traffic with a grand total of 67.6Mpps in 8 unique flows of 8.45Mpps at 128b each, and the three switches forward this L3 IPv4 unicast traffic effortlessly:\nOverlay What I\u0026rsquo;ve built just now would be acceptable really only if the switches were in the same rack (or at best, facility). As an industry professional, I frown upon things like VLAN-stretching, a term that describes bridging VLANs between buildings (or, as some might admit to .. between cities or even countries🤮). A long time ago (in December 1999), Luca Martini invented what is now called [Martini Tunnels], defining how to transport Ethernet frames over an MPLS network, which is what I really want to demonstrate, albeit in the next article.\nWhat folks don\u0026rsquo;t always realize is that the industry is moving on from MPLS to a set of more flexible IP based solutions, notably tunneling using IPv4 or IPv6 UDP packets such as found in VxLAN or GENEVE, two of my favorite protocols. This certainly does cost a little bit in VPP, as I wrote about in my post on [VLLs in VPP], although you\u0026rsquo;d be surprised how many VxLAN encapsulated packets/sec a simple AMD64 router can forward. With respect to these switches, though, let\u0026rsquo;s find out if tunneling this way incurs an overhead or performance penalty. Ready? Let\u0026rsquo;s go!\nFirst I will put the first four interfaces in range eth-0-1 - 4 into a new set of VLANs, but in the VLAN database I will enable what is called overlay on them:\nmsw-top# configure terminal vlan database vlan 5-8,10,20,30,40 vlan 10 name v-vxlan-xco10 vlan 10 overlay enable vlan 20 name v-vxlan-xco20 vlan 20 overlay enable vlan 30 name v-vxlan-xco30 vlan 30 overlay enable vlan 40 name v-vxlan-xco40 vlan 40 overlay enable ! interface eth-0-1 switchport access vlan 10 ! 
interface eth-0-2 switchport access vlan 20 ! interface eth-0-3 switchport access vlan 30 ! interface eth-0-4 switchport access vlan 40 Next, I create two new loopback interfaces (bear with me on this one), and configure the transport of these overlays in the switch. This configuration will pick up the VLANs and move them to remote sites in either VxLAN, GENEVE or NvGRE protocol, like this:\nmsw-top# configure terminal ! interface loopback1 ip address 172.20.1.2/32 ! interface loopback2 ip address 172.20.2.2/32 ! overlay remote-vtep 1 ip-address 172.20.0.0 type vxlan src-ip 172.20.0.2 remote-vtep 2 ip-address 172.20.1.0 type nvgre src-ip 172.20.1.2 remote-vtep 3 ip-address 172.20.2.0 type geneve src-ip 172.20.2.2 keep-vlan-tag vlan 10 vni 829810 vlan 10 remote-vtep 1 vlan 20 vni 829820 vlan 20 remote-vtep 2 vlan 30 vni 829830 vlan 30 remote-vtep 3 vlan 40 vni 829840 vlan 40 remote-vtep 1 ! Alright, this is seriously cool! The first overlay defines what is called a remote VTEP (virtual tunnel end point), of type VxLAN towards IPv4 address 172.20.0.0, coming from source address 172.20.0.2 (which is our loopback0 interface on switch msw-top). As it turns out, I am not allowed to create different overlay types to the same destination address, but not to worry: I can create a few unique loopback interfaces with unique IPv4 addresses (see loopback1 and loopback2; and create new VTEPs using these. So, VTEP at index 2 is of type NvGRE and the one at index 3 is of type GENEVE and due to the use of keep-vlan-tag, the encapsulated traffic will carry dot1q tags, where-as in the other two VTEPs the tag will be stripped and what is transported on the wire is untagged traffic.\nmsw-top# show vlan all VLAN ID Name State STP ID Member ports (u)-Untagged, (t)-Tagged ======= =============================== ======= ======= ======================== (...) 10 v-vxlan-xco10 ACTIVE 0 eth-0-1(u) VxLAN: 172.20.0.2-\u0026gt;172.20.0.0 20 v-vxlan-xco20 ACTIVE 0 eth-0-2(u) NvGRE: 172.20.1.2-\u0026gt;172.20.1.0 30 v-vxlan-xco30 ACTIVE 0 eth-0-3(u) GENEVE: 172.20.2.2-\u0026gt;172.20.2.0 40 v-vxlan-xco40 ACTIVE 0 eth-0-4(u) VxLAN: 172.20.0.2-\u0026gt;172.20.0.0 msw-top# show mac address-table Mac Address Table ------------------------------------------- (*) - Security Entry (M) - MLAG Entry (MO) - MLAG Output Entry (MI) - MLAG Input Entry (E) - EVPN Entry (EO) - EVPN Output Entry (EI) - EVPN Input Entry Vlan Mac Address Type Ports ---- ----------- -------- ----- 10 6805.ca32.4595 dynamic VxLAN: 172.20.0.2-\u0026gt;172.20.0.0 10 6805.ca32.4594 dynamic eth-0-1 20 6805.ca32.4596 dynamic NvGRE: 172.20.1.2-\u0026gt;172.20.1.0 20 6805.ca32.4597 dynamic eth-0-2 30 9c69.b461.7679 dynamic GENEVE: 172.20.2.2-\u0026gt;172.20.2.0 30 9c69.b461.7678 dynamic eth-0-3 40 9c69.b461.767a dynamic VxLAN: 172.20.0.2-\u0026gt;172.20.0.0 40 9c69.b461.767b dynamic eth-0-4 Turning my attention to the VLAN database, I can now see the power of this become obvious. This switch has any number of local interfaces either tagged or untagged (in the case of VLAN 10 we can see eth-0-1(u) which means that interface is participating in the VLAN untagged), but we can also see that this VLAN 10 has a member port called VxLAN: 172.20.0.2-\u0026gt;172.20.0.0. This port is just like any other, in that it\u0026rsquo;ll participate in unknown unicast, broadcast and multicast, and \u0026ldquo;learn\u0026rdquo; MAC addresses behind these virtual overlay ports. 
In VLAN 10 (and VLAN 40), I can see in the L2 FIB (show mac address-table), that there\u0026rsquo;s a local MAC address learned (from the T-Rex loadtester) behind eth-0-1, but there\u0026rsquo;s also a remote MAC address learned behind the VxLAN port. I\u0026rsquo;m impressed.\nI can add any number of VLANs (and dot1q-tunnels) into a VTEP endpoint, after assigning each of them a unique VNI (virtual network identifier). If you\u0026rsquo;re curious about these, take a look at the [VxLAN], [GENEVE] and [NvGRE] specifications. Basically, the encapsulation is just putting the ethernet frame as a payload of an UDP packet, so let\u0026rsquo;s take a look at those.\nInspecting overlay As you\u0026rsquo;ll recall, the VLAN 10,20,30,40 trafffic is now traveling over an IP network, notably encapsulated by the source switch msw-top and delivered to msw-bottom via IGP (in my case, OSPF), while it transits through msw-core. I decide to take a look at this, by configuring a monitor port on msw-core:\nmsw-core# show run | inc moni monitor session 1 source interface eth-0-49 both monitor session 1 destination interface eth-0-1 This will copy all in- and egress traffic from interface eth-0-49 (connected to msw-top) through to local interface eth-0-1, which is connected to the loadtester. I can simply tcpdump this stuff:\npim@trex01:~$ sudo tcpdump -ni eno2 \u0026#39;(proto gre) or (udp and port 4789) or (udp and port 6081)\u0026#39; 01:26:24.685666 00:1e:08:26:ec:f3 \u0026gt; 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160) 172.20.0.0.49208 \u0026gt; 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829810 68:05:ca:32:45:95 \u0026gt; 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) 48.0.1.47.1025 \u0026gt; 16.0.1.47.12: UDP, length 82 01:26:24.688305 00:1e:08:0d:6e:88 \u0026gt; 00:1e:08:26:ec:f3, ethertype IPv4 (0x0800), length 166: (tos 0x0, ttl 128, id 44814, offset 0, flags [DF], proto GRE (47), length 152) 172.20.1.2 \u0026gt; 172.20.1.0: GREv0, Flags [key present], key=0xca97c38, proto TEB (0x6558), length 132 68:05:ca:32:45:97 \u0026gt; 68:05:ca:32:45:96, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) 48.0.2.73.1025 \u0026gt; 16.0.2.73.12: UDP, length 82 01:26:24.689100 00:1e:08:26:ec:f3 \u0026gt; 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 178: (tos 0x0, ttl 127, id 7502, offset 0, flags [DF], proto UDP (17), length 164) 172.20.2.0.49208 \u0026gt; 172.20.2.2.6081: GENEVE, Flags [none], vni 0xca986, proto TEB (0x6558) 9c:69:b4:61:76:79 \u0026gt; 9c:69:b4:61:76:78, ethertype 802.1Q (0x8100), length 128: vlan 30, p 0, ethertype IPv4 (0x0800), (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) 48.0.3.109.1025 \u0026gt; 16.0.3.109.12: UDP, length 82 01:26:24.701666 00:1e:08:0d:6e:89 \u0026gt; 00:1e:08:0d:6e:88, ethertype IPv4 (0x0800), length 174: (tos 0x0, ttl 127, id 7496, offset 0, flags [DF], proto UDP (17), length 160) 172.20.0.0.49208 \u0026gt; 172.20.0.2.4789: VXLAN, flags [I] (0x08), vni 829840 68:05:ca:32:45:95 \u0026gt; 68:05:ca:32:45:94, ethertype IPv4 (0x0800), length 124: (tos 0x0, ttl 64, id 1, offset 0, flags [none], proto UDP (17), length 110) 48.0.4.47.1025 \u0026gt; 16.0.4.47.12: UDP, length 82 We can see packets for all four tunnels in this dump. 
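Before walking through the four captures one by one, here is a small Python sketch of how the three encapsulations can be told apart in a dump like this, and how the hex values in the decode map back to the VNIs configured earlier. The UDP destination ports 4789 and 6081 are the IANA defaults for VxLAN and GENEVE, and NvGRE is IP protocol 47 (GRE) carrying Transparent Ethernet Bridging:

#!/usr/bin/env python3
# Rough classification of the overlay packets seen in the capture above.
def classify(ip_proto, udp_dport=None):
    if ip_proto == 47:
        return "NvGRE (GRE carrying TEB)"
    if ip_proto == 17 and udp_dport == 4789:
        return "VxLAN"
    if ip_proto == 17 and udp_dport == 6081:
        return "GENEVE"
    return "no overlay I configured"

print(classify(17, 4789), classify(47), classify(17, 6081))

# The VNIs are shown in hex by some decoders; they are just the decimal
# values configured on the switch:
for vni in (829810, 829820, 829830, 829840):
    print(f"{vni} = {vni:#x}")
# 829810 = 0xca972, 829820 = 0xca97c, 829830 = 0xca986, 829840 = 0xca990
# NvGRE packs a 24 bit VSID plus an 8 bit flow id into the GRE key, which is
# why key=0xca97c38 in the dump corresponds to 0xca97c, i.e. VNI 829820.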
The first one is a UDP packet to port 4789, which is the standard port for VxLAN, and it has VNI 829810. The second packet is proto GRE with flag TEB which stands for transparent ethernet bridge in other words an L2 variant of GRE that carries ethernet frames. The third one shows that feature I configured above (in case you forgot it, it\u0026rsquo;s the keep-vlan-tag option when creating the VTEP), and because of that flag we can see that the inner payload carries the vlan 30 tag, neat! The VNI there is 0xca986 which is hex for 829830. Finally, the fourth one shows VLAN40 traffic that is sent to the same VTEP endpoint as VLAN10 traffic (showing that multiple VLANs can be transported across the same tunnel, distinguished by VNI).\nAt this point I make an important observation. VxLAN and GENEVE both have this really cool feature that they can hash their inner payload (ie. the IPv4/IPv6 address and ports if available) and use that to randomize the source port, which makes them preferable to GRE. The reason why this is preferable is hashing makes these inner flows become unique outer flows, which in turn allows them to be loadbalanced in intermediate networks, but also in the receiver if it has multiple receive queues. However, and this is important!, the switch does not hash, which means that all ethernet traffic in the VxLAN, GENEVE and NvGRE tunnels always have the exact same outer header, so loadbalancing and multiple receive queues are out of the question. I wonder if this is a limitation of the Centec chip, or failure to program or configure it by the firmware.\nWith that gripe out of the way, let\u0026rsquo;s take a look at 80Gbit of tunneled traffic, shall we?\nOnce again, all three switches are acing it. So at least 40Gbps of encap- and 40Gbps of decapsulation per switch, and the transport over IPv4 through the msw-core switch to the other side, is all in working order. On top of that, I\u0026rsquo;ve shown that multiple types of overlay can live alongside one another, even betwween the same pair of switches, and that multiple VLANs can share the same underlay transport. The only downside is the single flow nature of these UDP transports.\nA final inspection of the switch throughput:\nmsw-top# show interface summary | exc DOWN RXBS: rx rate (bits/sec) RXPS: rx rate (pkts/sec) TXBS: tx rate (bits/sec) TXPS: tx rate (pkts/sec) Interface Link RXBS RXPS TXBS TXPS ----------------------------------------------------------------------------- eth-0-1 UP 10013004482 8456929 10013004548 8456929 eth-0-2 UP 10013030687 8456951 10013030801 8456951 eth-0-3 UP 10012625863 8456609 10012626030 8456609 eth-0-4 UP 10013032737 8456953 10013034423 8456954 eth-0-25 UP 505 0 513 0 eth-0-26 UP 51147539721 33827761 51147540111 33827762 Take a look at that eth-0-26 interface: it\u0026rsquo;s using significantly more bandwidth (51Gbps) than the sum of the four transports (4x10Gbps). This is because each ethernet frame (of 128b) has to be wrapped in an IPv4 UDP packet (or in the case of NvGRE an IPv4 packet with a GRE header), which incurs quite some overhead, for small packets at least. But it definitely proves that the switches here are happy to do this forwarding at line rate, and that\u0026rsquo;s what counts!\nConclusions It\u0026rsquo;s just super cool to see a switch like this work as expected. I did not manage to overload it at all, neither with IPv4 loadtest at 67Mpps and 80Gbit of traffic, nor with L2 loadtest with four ports transported with VxLAN, NvGRE and GENEVE, at the same time. 
Although the underlay can only use IPv4 (no IPv6 is available in the switch chip), this is not a huge problem for me. At AS8298, I can easily define some private VRF with IPv4 space from RFC1918 to do the transport of traffic over VxLAN. And what\u0026rsquo;s even better, this can perfectly inter-operate with my VPP routers which also do VxLAN en/decapsulation.\nNow there is one more thing for me to test (and, cliffhanger, I\u0026rsquo;ve tested it already but I\u0026rsquo;ll have to write up all of my data and results \u0026hellip;). I need to do what I said I would do in the beginning of this article, and what I had hoped to achieve with the FS switches but failed to due to lack of support: MPLS L2VPN transport (and, its more complex but cooler sibling VPLS).\n","date":"2022-12-05","desc":"After receiving an e-mail from a newer [China based switch OEM], I had a chat with their founder and learned that the combination of switch silicon and software may be a good match for IPng Networks. You may recall my previous endeavors in the Fiberstore lineup, notably an in-depth review of the [S5860-20SQ] which sports 20x10G, 4x25G and 2x40G optics, and its larger sibling the S5860-48SC which comes with 48x10G and 8x100G cages. I use them in production at IPng Networks and their featureset versus price point is pretty good. In that article, I made one critical note reviewing those FS switches, in that they\u0026rsquo;e be a better fit if they allowed for MPLS or IP based L2VPN services in hardware.\n","permalink":"https://ipng.ch/s/articles/2022/12/05/review-s5648x-2q4z-switch-part-1-vxlan/geneve/nvgre/","section":"articles","title":"Review: S5648X-2Q4Z Switch - Part 1: VxLAN/GENEVE/NvGRE"},{"contents":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\nIn my [first post], I shared some thoughts on how I installed a Mastodon instance for myself. In a [followup post] I talked about its overall architecture and how one might use Prometheus to monitor vital backends like Redis, Postgres and Elastic. But Mastodon itself is also an application which can provide a wealth of telemetry using a protocol called [StatsD].\nIn this post, I\u0026rsquo;ll show how I tie these all together in a custom Grafana Mastodon dashboard!\nMastodon Statistics I noticed in the [Mastodon docs], that there\u0026rsquo;s a one-liner breadcrumb that might be easy to overlook, as it doesn\u0026rsquo;t give many details:\nSTATSD_ADDR: If set, Mastodon will log some events and metrics into a StatsD instance identified by its hostname and port.\nInteresting, but what is this statsd, precisely? It\u0026rsquo;s a simple text-only protocol that allows applications to send key-value pairs in the form of \u0026lt;metricname\u0026gt;:\u0026lt;value\u0026gt;|\u0026lt;type\u0026gt; strings, that carry statistics of certain type across the network, using either TCP or UDP. Cool! 
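To get a feel for what actually travels on the wire, here is a toy Python sender that emits a few datagrams in exactly this format; the metric names are taken from the packet dump below, and the values are just examples:

#!/usr/bin/env python3
# Toy statsd sender: each UDP datagram is just "<metricname>:<value>|<type>",
# where the type is c (counter increment), ms (timer) or g (gauge).
import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
for metric in (
    b"Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c",
    b"Mastodon.production.db.tables.accounts.queries.select.duration:1.83|ms",
    b"Mastodon.production.sidekiq.scheduled_size:25|g",
):
    sock.sendto(metric, ("localhost", 9125))

Anything that can format such a string and fire off a UDP packet can feed statsd, which is what makes the protocol so easy to bolt onto an application.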
To make use of these stats, I first add this STATSD_ADDR environment variable from the docs to my .env.production file, pointing it at localhost:9125. This should make Mastodon apps emit some statistics of sorts.\nI decide to a look at those packets, instructing tcpdump to show the contents of the packets (using the -A flag). Considering my destination is localhost, I also know which interface to tcpdump on (using the -i lo flag). My first attempt is a little bit noisy, because the packet dump contains the [IPv4 header] (20 bytes) and [UDP header] (8 bytes) as well, but sure enough, if I start reading from the 28th byte onwards, I get human-readable data, in a bunch of strings that start with Mastodon:\npim@ublog:~$ sudo tcpdump -Ani lo port 9125 | grep \u0026#39;Mastodon.production.\u0026#39; | sed -e \u0026#39;s,.*Mas,Mas,\u0026#39; Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.processing_time:16272|ms Mastodon.production.sidekiq.ActivityPub..ProcessingWorker.success:1|c Mastodon.production.sidekiq.scheduled_size:25|g Mastodon.production.db.tables.accounts.queries.select.duration:1.8323479999999999|ms Mastodon.production.web.ActivityPub.InboxesController.create.json.total_duration:33.856679|ms Mastodon.production.web.ActivityPub.InboxesController.create.json.db_time:2.3943890118971467|ms Mastodon.production.web.ActivityPub.InboxesController.create.json.view_time:1|ms Mastodon.production.web.ActivityPub.InboxesController.create.json.status.202:1|c ... statsd organizes its variable names in a dot-delimited tree hierarchy. I can clearly see some patterns in here, but why guess when you\u0026rsquo;re working with Open Source? Mastodon turns out to be using a popular Ruby library called the [National Statsd Agency], a wordplay that I don\u0026rsquo;t necessarily find all that funny. Naming aside though, this library collects application level statistics in four main categories:\n:action_controller: listens to the ActionController class that is extended into ApplicationControllers in Mastodon :active_record: listens to any database (SQL) queries and emits timing information for them :active_support_cache: records information regarding caching (Redis) queries, and emits timing information for them :sidekiq: listens to Sidekiq middleware and emits information about queues, workers and their jobs Using the library\u0026rsquo;s [docs], I can clearly see the patterns described, for example in the SQL recorder, the format will be {ns}.{prefix}.tables.{table_name}.queries.{operation}.duration where operation here means one of the classic SQL query types, SELECT, INSERT, UPDATE, and DELETE. Similarly, in the cache recorder, the format will be {ns}.{prefix}.{operation}.duration where operation denotes one of read_hit, read_miss, generate, delete, and so on.\nReading a bit more of the Mastodon and statsd library code, I learn that all variables emitted, the namespace {ns} is always a combination of the application name and Ruby Rails environment, ie. Mastodon.production, and the {prefix} is the collector name, one-of web, db, cache or sidekiq. If you\u0026rsquo;re curious, the Mastodon code that initializes the statsd collectors lives in config/initializers/statsd.rb. Alright, I conclude that this is all I need to know about the naming schema.\nMoving along, statsd gives each variable name a [metric type], which can be counters c, timers ms and gauges g. In the packet dump above you can see examples of each of these. 
The counter type in particular is a little bit different \u0026ndash; applications emit increments here - in the case of the ActivityPub.InboxesController, it merely signaled to increment the counter by 1, not the absolute value of the counter. This is actually pretty smart, because now any number of workers/servers can all contribute to a global counter, by each just sending incrementals which are aggregated by the receiver.\nAs a small critique, I happened to notice that in the sidekiq datastream, some of what I think are counters are actually modeled as gauges (notably the processed and failed jobs from the workers). I will have to remember that, but after observing for a few minutes, I think I can see lots of nifty data in here.\nPrometheus At IPng Networks, we use Prometheus as a monitoring observability tool. It\u0026rsquo;s worth pointing out that statsd has a few options itself to visualise data, but considering I already have lots of telemetry in Prometheus and Grafana (see my [previous post]), I\u0026rsquo;m going to take a bit of a detour, and convert these metrics into the Prometheus exposition format, so that they can be scraped on a /metrics endpoint just like the others. This way, I have all monitoring in one place and using one tool. Monitoring is hard enough as it is, and having to learn multiple tools is no bueno :)\nStatsd Exporter: overview The community maintains a Prometheus [Statsd Exporter] on GitHub. This tool, like many others in the exporter family, will connect to a local source of telemetry, and convert these into the required format for consumption by Prometheus. If left completely unconfigured, it will simply receive the statsd UDP packets on the Mastodon side, and export them verbatim on the Prometheus side. This will have a few downsides, notably when new operations or controllers come into existence, I would have to explicitly make Prometheus aware of them.\nI think we can do better, specifically because of the patterns noted above, I can condense the many metricnames from statsd into a few carefully chosen Prometheus metrics, and add their variability into labels in those time series. 
Taking SQL queries as an example, I see that there\u0026rsquo;s a metricname for each known SQL table in Mastodon (and there are many) and then for each table, a unique metric is created for each of the four operations:\nMastodon.production.tables.{table_name}.queries.select.duration Mastodon.production.tables.{table_name}.queries.insert.duration Mastodon.production.tables.{table_name}.queries.update.duration Mastodon.production.tables.{table_name}.queries.delete.duration What if I could rewrite these by capturing the {table_name} label, and further observing that there are four query types (SELECT, INSERT, UPDATE, DELETE), so possibly capturing those into a {operation} label, like so:\nmastodon_db_operation_sum{operation=\u0026#34;select\u0026#34;,table=\u0026#34;users\u0026#34;} 85.910 mastodon_db_operation_sum{operation=\u0026#34;insert\u0026#34;,table=\u0026#34;accounts\u0026#34;} 112.70 mastodon_db_operation_sum{operation=\u0026#34;update\u0026#34;,table=\u0026#34;web_push_subscriptions\u0026#34;} 6.55 mastodon_db_operation_sum{operation=\u0026#34;delete\u0026#34;,table=\u0026#34;web_settings\u0026#34;} 9.668 mastodon_db_operation_count{operation=\u0026#34;select\u0026#34;,table=\u0026#34;users\u0026#34;} 28790 mastodon_db_operation_count{operation=\u0026#34;insert\u0026#34;,table=\u0026#34;accounts\u0026#34;} 610 mastodon_db_operation_count{operation=\u0026#34;update\u0026#34;,table=\u0026#34;web_push_subscriptions\u0026#34;} 380 mastodon_db_operation_count{operation=\u0026#34;delete\u0026#34;,table=\u0026#34;web_settings\u0026#34;} 4) This way, there are only two Prometheus metric names mastodon_db_operation_sum and mastodon_db_operation_count. The first one counts the cumulative time spent performing operations of that type on the table, and the second one counts the total amount of queries of that type on the table. If I take the rate() of the count variable, I will have queries-per-second, and if I divide the rate() of the time spent by the rate() of the count, I will have a running average time spent per query over that time interval.\nStatsd Exporter: configuration The Prometheus folks also thought of this, quelle surprise, and the exporter provides incredibly powerful transformation functionality between the hierarchical tree-form of statsd and the multi-dimensional labeling format of Prometheus. This is called the [Mapping Configuration], and it allows either globbing or regular expression matching of the input metricnames, turning them into labeled output metrics. Building further on our example for SQL queries, I can create a mapping like so:\npim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/prometheus/statsd-mapping.yaml mappings: - match: Mastodon\\.production\\.db\\.tables\\.(.+)\\.queries\\.(.+)\\.duration match_type: regex name: \u0026#34;mastodon_db_operation\u0026#34; labels: table: \u0026#34;$1\u0026#34; operation: \u0026#34;$2\u0026#34; This snippet will use a regular expression to match input metricnames, carefully escaping the dot-delimiters. Within the input, I will match two groups, the segment following tables. holds the variable SQL Table name and the segment following queries. captures the SQL Operation. Once this matches, the exporter will give the resulting variable in Prometheus simply the name mastodon_db_operation and add two labels with the results of the regexp capture groups.\nThis one mapping I showed above will take care of all of the metrics from the database collector, but are three other collectors in Mastodon\u0026rsquo;s Ruby world. 
In the interest of brevity, I\u0026rsquo;ll not bore you with them in this article, as this is mostly a rinse-and-repeat jobbie. But I have attached a copy of the complete mapping configuration at the end of this article. With all of that hard work on mapping completed, I can now start the statsd exporter and see its beautifully formed and labeled timeseries show up on port 9102, the default [assigned port] for this exporter type.\nGrafana First let me start by saying I\u0026rsquo;m incredibly grateful to all the folks who have contributed to existing exporters and Grafana dashboards, notably for [Node Exporter], [Postgres Exporter], [Redis Exporter], [NGINX Exporter], and [ElasticSearch Exporter]. I\u0026rsquo;m ready to make a modest contribution back to this wonderful community of monitoring dashboards, in the form of a Grafana dashboard for Mastodon!\nWriting these is pretty rewarding. I\u0026rsquo;ll take some time to explain a few Grafana concepts, although this is not meant to be a tutorial at all and honestly, I\u0026rsquo;m not that good at this anyway. A good dashboard design goes from a 30'000ft overview of the most vital stats (not necessarily graphs, but using visual clues like colors), and gives more information in so-called drill-down dashboards that allow a much finer grained / higher resolution picture of a specific part of the monitored application.\nSeeing as the collectors are emitting four main parts of the application (remember, the {prefix} is one of web, db, cache, or sidekiq), so I will give the dashboard the same structure. Also, I will try my best not to invent new terminology, as the application developers have given their telemetry certain names, I will stick to these as well. Building a dashboard this way, application developers as well as application operators will more likely be talking about the same things.\nMastodon Overview In the Mastodon Overview, each of the four collectors gets one or two stats-chips to present their highest level vitalsigns on. For a web application, this will largely be requests per second, latency and possibly errors served. For a SQL database, this is typically the issued queries and their latency. For a cache, the types of operation and again the latency observed in those operations. For Sidekiq (the background worker pool that performs certain tasks in a queue on behalf of the system or user), I decide to focus on units of work, latency and queue sizes.\nSetting up the Prometheus queries in Grafana that fetch the data I need for these is typically going to be one of two things:\nQPS: This is a rate of the monotonic increasing _count, over say one minute, I will see the average queries-per-second. Considering the counters I created have labels that tell me what they are counting (for example in Puma, which API endpoint is being queried, and what format that request is using), I can now elegantly aggregate those application-wide, like so:\nsum by (mastodon)(rate(mastodon_controller_duration_count[1m]))\nLatency: The metrics in Prometheus aggregate a runtime monotonic increasing _sum which tells me about the total time spent doing those things. 
It\u0026rsquo;s pretty easy to calculate the running average latency over the last minute, by simply dividing the rate of time spent by the rate of requests served, like so:\nsum by (mastodon)(rate(mastodon_controller_duration_sum[1m])) / sum by (mastodon)(rate(mastodon_controller_duration_count[1m]))\nTo avoid clutter, I will leave the detailed full resolution view (like which controller exactly, and what format was queried, and which action was taken in the API) to a drilldown below. These two patterns are continud throughout the overview panel. Each QPS value is rendered in dark blue, while the latency gets a special treatment on colors: I define a threshold which I consider \u0026ldquo;unacceptable\u0026rdquo;, and then create a few thresholds in Grafana to change the color as I approach that unacceptable max limit. By means of example, the Puma Latency element I described above, will have a maximum acceptable latency of 250ms. If the latency is above 40% of that, the color will turn yellow; above 60% it\u0026rsquo;ll turn orange and above 80% it\u0026rsquo;ll turn red. This provides a visual queue that something may be wrong.\nPuma Controllers The APIs that Mastodon offers are served by a component called [Puma], a simple, fast, multi-threaded, and highly parallel HTTP 1.1 server for Ruby/Rack applications. The application running in Puma typically defines endpoints as so-called ActionControllers, which Mastodon expands on in a derived concept called ApplicationControllers which each have a unique controller name (for example ActivityPub.InboxesController), an action performed on them (for example create, show or destroy), and a format in which the data is handled (for example html or json). For each cardinal combination, a set of timeseries (counter, time spent and latency quantiles) will exist. At the moment, there are about 53 API controllers, 8 actions, and 4 formats, which means there are 1'696 interesting metrics to inspect.\nDrawing all of these in one graph quickly turns into an unmanageable mess, but there\u0026rsquo;s a neat trick in Grafana: what if I could make these variables selectable, and maybe pin them to exactly one value (for example, all information with a specific controller), that would greatly reduce the amount of data we have to show. To implement this, the dashboard can pre-populate a variable based on a Prometheus query. By means of example, to find the possible values of controller, I might take a look at all Prometheus metrics with name mastodon_controller_duration_count and search for labels within them with a regular expression, for example /controller=\u0026quot;([^\u0026quot;]+)\u0026quot;/.\nWhat this will do is select all values in the group \u0026quot;([^\u0026quot;]+)\u0026quot; which may seem a little bit cryptic at first. The logic behind it is first create a group between parenthesis (...) and then within that group match a set of characters [...] where the set is all characters except the double-quote [^\u0026quot;] and that is repeated one-or-more times with the + suffix. So this will precisely select the string between the double-quotes in the label: controller=\u0026quot;whatever\u0026quot; will return whatever with this expression.\nAfter creating three of these, one for controller, action and format, three new dropdown selectors appear at the top of my dashboard. I will allow any combination of selections, including \u0026ldquo;All\u0026rdquo; of them (the default). 
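As a quick sanity check of that capture group, this is what the same expression does when applied to a series string in Python (the series below is made up for illustration, but the controller name is one that Mastodon really exposes):

#!/usr/bin/env python3
# The label-matching expression Grafana uses to populate the $controller
# variable, applied to an example series string.
import re

series = 'mastodon_controller_duration_count{controller="ActivityPub.InboxesController",action="create",format="json"}'
print(re.findall(r'controller="([^"]+)"', series))
# ['ActivityPub.InboxesController']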
Then, if I wish to drill down, I can pin one or more of these variables to narrow down the total amount of timeseries to draw.\nShown to the right are two examples, one with \u0026ldquo;All\u0026rdquo; timeseries in the graph, which shows at least which one(s) are outliers. In this case, the orange trace in the top graph showing more than average operations is the so-called ActivityPub.InboxesController.\nI can find this out by hovering over the orange trace, the tooltip will show me the current name and value. Then, selecting this in the top navigation dropdown Puma Controller, Grafana will narrow down the data for me to only those relevant to this controller, which is super cool.\nDrilldown in Grafana Where the graph down below (called Action Format Controller Operations) showed all 1'600 or so timeseries, selecting the one controller I\u0026rsquo;m interested in, shows me a much cleaner graph with only three timeseries, take a look to the right. Just by playing around with this data, I\u0026rsquo;m learning a lot about the architecture of this application!\nFor example, I know that the only action on this particular controller seems to be create, and there are three available formats in which this create action can be performed: all, html and json. And using the graph above that got me started on this little journey, I now know that the traffic spike was for controller=ActivityPub.InboxesController, action=create, format=all. Dope!\nSQL Details While I already have a really great [Postgres Dashboard] (the one that came with Postgres server), it is also good to be able to see what the client is experiencing. Here, we can drill down on two variables, called $sql_table and $sql_operation. For each {table,operation}-tuple, the average, median and 90th/99th percentile latency are available. So I end up with the following graphs and dials for tail latency: the top left graph shows me something interesting \u0026ndash; most queries are SELECT, but the bottom graph shows me lots of tables (at the time of this article, Mastodon has 73 unique SQL tables). If I wanted to answer the question \u0026ldquo;which table gets most SELECTs\u0026rdquo;, I can drill down first by selecting the SQL Operation to be select, after which I see decidedly less traces in the SQL Table Operations graph. Further analysis shows that the two places that are mostly read from are the tables called statuses and accounts. When I drill down using the selectors at the top of Grafana\u0026rsquo;s dashboard UI, the tail latency is automatically filtered to only that which is selected. If I were to see very slow queries at some point in the future, it\u0026rsquo;ll be very easy to narrow down exactly which table and which operation is the culprit.\nCache Details For the cache statistics collector, I learn there are a few different operators. Similar to Postgres, I already have a really cool [Redis Dashboard], for which I can see the Redis server view. But in Mastodon, I can now also see the client view, and see when any of these operations spike in either queries/sec (left graph), latency (middle graph), or tail latency for common operations (the dials on the right). 
This is bound to come in handy at some point \u0026ndash; I already saw one or two spikes in the generate operation (see the blue spike in the screenshot above), which is something to keep an eye on.\nSidekiq Details The single most interesting thing in the Mastodon application is undoubtedly its Sidekiq workers, the ones that do all sorts of system- and user-triggered work such as distributing posts to federated servers, prefetching links and media, and calculating trending tags, posts and links. Sidekiq is a [producer-consumer] system where new units of work (called jobs) are written to a queue in Redis by a producer (typically Mastodon\u0026rsquo;s webserver Puma, or another Sidekiq task that needs something to happen at some point in the future), and then consumed by one or more pools which execute the worker jobs.\nThere are several queues defined in Mastodon, and each worker has a name, a failure and success rate, and a running tally of how much processing_time they\u0026rsquo;ve spent executing this type of work. Sidekiq will consume jobs in [FIFO] order, and it has a finite number of workers (by default on a small instance it runs one worker with 25 threads). If you\u0026rsquo;re interested in this type of provisioning, [Nora Tindall] wrote a great article about it.\nThis drill-down dashboard shows all of the Sidekiq worker types known to Prometheus, and can be selected at the top of the dashboard in the dropdown called Sidekiq Worker. The total number of worker jobs per second, as well as the running average time spent performing those jobs, is shown in the first two graphs. The three dials show the median, 90th percentile and 99th percentile latency of the work being performed.\nIf all threads are busy, new work is left in the queue until a worker thread is available to execute the job. This will lead to a queue delay on a busy server that is underprovisioned. For jobs that had to wait for an available thread to pick them up, the number of jobs per queue, and the time in seconds that the jobs were waiting to be picked up by a worker, are shown in the two lists at bottom right.\nAnd with that, as [North of the Border] would say: \u0026ldquo;We\u0026rsquo;re on to the glamour shots!\u0026rdquo;.\nWhat\u0026rsquo;s next I promised two references that are needed to successfully hook up Prometheus and Grafana to the STATSD_ADDR configuration for Mastodon\u0026rsquo;s Rails environment, and here they are:\nThe Statsd Exporter mapping configuration file: [/etc/prometheus/statsd-mapping.yaml] The Grafana Dashboard: [grafana.com/dashboards/] As a call to action: if you are running a larger instance and would allow me to take a look and learn from you, I\u0026rsquo;d be very grateful.\nI\u0026rsquo;m going to monitor my own instance for a little while, so that I can start to get a feeling for where the edges of performance cliffs are, in other words: How slow is too slow? How much load is too much? 
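To give a flavour of what that mapping file does: the statsd exporter takes the raw statsd keys that Mastodon emits and rewrites them into labeled Prometheus metrics. Here is a minimal sketch of one such rule, assuming keys shaped roughly like Mastodon.production.web.<controller>.<action>.<format>.total_duration (that key layout is my assumption; the authoritative version is the mapping file linked above):

mappings:
  # rewrite timer keys into mastodon_controller_duration{controller=...,action=...,format=...}
  - match: 'Mastodon\.production\.web\.(.+)\.(.+)\.(.+)\.total_duration'
    match_type: regex
    name: "mastodon_controller_duration"
    labels:
      controller: "$1"
      action: "$2"
      format: "$3"

Because the source is a statsd timer, the exporter publishes it as a summary, which is where the mastodon_controller_duration_sum and _count series used earlier come from. Either way, those duration metrics are exactly where performance cliffs will show up first.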
In an upcoming post, I will take a closer look at alerting in Prometheus, so that I can catch these performance cliffs and make human operators aware of them by means of alerts, delivered via Telegram or Slack.\nBy the way: If you\u0026rsquo;re looking for a home, feel free to sign up at https://ublog.tech/ as I\u0026rsquo;m sure that having a bit more load / traffic on this instance will allow me to learn (and in turn, to share with others)!\n","date":"2022-11-27","desc":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\n","permalink":"https://ipng.ch/s/articles/2022/11/27/mastodon-part-3-statsd-and-prometheus/","section":"articles","title":"Mastodon - Part 3 - statsd and Prometheus"},{"contents":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\nIn the [previous post], I shared some thoughts on how the overall install of a Mastodon instance went, making it a point to ensure my users\u0026rsquo; (and my own!) data is somehow safe, and the machine runs on good hardware, and with good connectivity. Thanks IPng, for that 10G connection! In this post, I visit an old friend, [Borgmon], which has since reincarnated and become the de facto open source observability and signals ecosystem, and its incomparably awesome friend. Hello, Prometheus and Grafana!\nAnatomy of Mastodon Looking more closely at the architecture of Mastodon, it consists of a few moving parts:\nStorage: At the bottom, there\u0026rsquo;s persistent storage, in my case ZFS, on which account information (like avatars), media attachments, and site-specific media lives. As posts stream to my instance, their media is spooled locally for performance. State: Application state is kept in two databases: Firstly, a SQL database which is chosen to be [PostgreSQL]. Secondly, a memory based key-value storage system [Redis] is used to track the vitals of home feeds, list feeds, Sidekiq queues as well as Mastodon\u0026rsquo;s streaming API. Web (transactional): The webserver that serves end user requests and the API is written in a Ruby framework called [Puma]. Puma tries to do its job efficiently, and doesn\u0026rsquo;t allow itself to be bogged down by long lived web sessions, such as the ones where clients get streaming updates to their timelines on the web- or mobile client. 
Web (streaming): This webserver is written in [NodeJS] and excels at long lived connections that use Websockets, by providing a Streaming API to clients. Web (frontend): To tie all the current and future microservices together, provide SSL (for HTTPS), and a local object cache for things that don\u0026rsquo;t change often, one or more [NGINX] servers are used. Backend (processing): Many interactions with the server (such as distributing posts) turn in to background tasks that are enqueued and handled asynchronously by a worker pool provided by [Sidekiq]. Backend (search): Users that wish to search the local corpus of posts and media, can interact with an instance of [Elastic], a free and open search and analytics solution. These systems all interact in particular ways, but I immediately noticed one interesting tidbit. Pretty much every system in this list can (or can be easily made to) emit metrics in a popular [Prometheus] format. I cannot overstate the love I have for this project, both technically but also socially because I know how it came to be. Ben, thanks for the RC racecars (I still have them!). Matt, I admire your Go- and Java-skills and your general workplace awesomeness. And Richi sorry to have missed you last week in Hamburg at [DENOG14]!\nPrometheus Taking stock of the architecture here, I think my best bet is to rig this stuff up with Prometheus. This works mostly by having a central, in my case external to [uBlog.tech] server scrape a bunch of timeseries metrics periodically, after which I can create pretty graphs of them, but also monitor if some values seem out of whack, like a Sidekiq queue delay raising, CPU or disk I/O running a bit hot. And the best thing yet? I will get pretty much all of this for free, because other, smarter folks have contributed into this ecosystem already:\nServer: monitoring is canonically done by [Node Exporter]. It provides metrics for all the lowlevel machine and kernel stats you\u0026rsquo;d ever think to want: network, disk, cpu, processes, load, and so on. Redis: Is provided by [Redis Exporter] and can show all sorts of operations on data realms served by Redis. PostgreSQL: is provided by [Postgres Exporter] which is maintained by the Prometheus Community. NGINX: Is provided by [NGINX Exporter] which is maintained by the company behind NGINX. I used to have a Lua based exporter (when I ran [SixXS]) which had lots of interesting additional stats, but for the time being I\u0026rsquo;ll just use this one. Elastic: has a converter from its own metrics system in the [Elasticsearch Exporter], once again maintained by the (impressively fabulous!) Prometheus Community. All of these implement a common pattern: they take the (bespoke, internal) representation of statistics counters or dials/gauges, and transform them into a common format called the Metrics Exposition format, and they provide this in either an HTTP endpoint (typically using a /metrics URI handler directly on the webserver), or in a push-mechanism using a popular [Pushgateway] in case there is no server to poll, for example a batch process that did some work and wanted to report on its results.\nIncidentally, a fair amount of popular open source infrastructure already has a Prometheus exporter \u0026ndash; check out [this list], but also the assigned [TCP ports] for popular things that you might also be using. 
Maybe you\u0026rsquo;ll get lucky and find out that somebody has already provided an exporter, so you don\u0026rsquo;t have to!\nConfiguring Exporters Now that I have found a whole swarm of these Prometheus Exporter microservices, and understand how to plumb each of them through to what-ever it is they are monitoring, I can get cracking on some observability. Let me provide some notes for posterity, both for myself if I ever revisit the topic and \u0026hellip; kind of forgot what I had done so far :), but maybe also for the adventurous, who are interested in using Prometheus on their own Mastodon instance.\nFirst of all, it\u0026rsquo;s worth mentioning that while these exporters (typically written in Go) have command line flags, they can often also take their configuration from environment variables, provided mostly becasue they operate in Docker or Kubernetes. My exporters will all run vanilla in systemd, but these systemd units can also be configured to use environments, which is neat!\nFirst, I create a few environment files for each systemd unit that contains a Prometheus exporter:\npim@ublog:~$ ls -la /etc/default/*exporter -rw-r----- 1 root root 49 Nov 23 18:15 /etc/default/elasticsearch-exporter -rw-r----- 1 root root 76 Nov 22 17:13 /etc/default/nginx-exporter -rw-r----- 1 root root 170 Nov 22 22:41 /etc/default/postgres-exporter -rw-r----- 1 root root 9789 May 27 2021 /etc/default/prometheus-node-exporter -rw-r----- 1 root root 0 Nov 22 22:56 /etc/default/redis-exporter -rw-r----- 1 root root 67 Nov 22 23:56 /etc/default/statsd-exporter The contents of these files will give away passwords, like the one for ElasticSearch or Postgres, so I specifically make them readable only by root:root. I won\u0026rsquo;t share my passwords with you, dear reader, so you\u0026rsquo;ll have to guess the contents here!\nPriming the environment with these values, I will take the systemd unit for elasticsearch as an example:\npim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /lib/systemd/system/elasticsearch-exporter.service [Unit] Description=Elasticsearch Prometheus Exporter After=network.target [Service] EnvironmentFile=/etc/default/elasticsearch-exporter ExecStart=/usr/local/bin/elasticsearch_exporter User=elasticsearch Group=elasticsearch Restart=always [Install] WantedBy=multi-user.target EOF pim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/default/elasticsearch-exporter ES_USERNAME=elastic ES_PASSWORD=$(SOMETHING_SECRET) # same as ES_PASS in .env.production EOF pim@ublog:~$ sudo systemctl enable elasticsearch-exporter pim@ublog:~$ sudo systemctl start elasticsearch-exporter Et voilà, just like that the service starts, connects to elasticsearch, transforms all of its innards into beautiful Prometheus metrics, and exposes them on its \u0026ldquo;registered\u0026rdquo; port, in this case 9114, which can be scraped by the Prometheus instance a few computers away, connected to the uBlog VM via backend LAN over RFC1918. I just knew that second NIC would come in useful!\nAll five of the exporters are configured and exposed. They are now providing a wealth of realtime information on how the various Mastodon components are going. 
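On the Prometheus side, all that is left is a scrape job pointing at those ports. It ends up looking roughly like this in /etc/prometheus/prometheus.yml on the monitoring machine; the backend hostname here is made up, and the ports are simply each exporter's registered default (only 9114 was mentioned above):

scrape_configs:
  - job_name: 'ublog'
    scrape_interval: 10s
    static_configs:
      - targets:
          - 'ublog-backend.net.ipng.ch:9100'   # node exporter
          - 'ublog-backend.net.ipng.ch:9121'   # redis exporter
          - 'ublog-backend.net.ipng.ch:9187'   # postgres exporter
          - 'ublog-backend.net.ipng.ch:9113'   # nginx exporter
          - 'ublog-backend.net.ipng.ch:9114'   # elasticsearch exporter

followed by a reload of the Prometheus daemon to pick up the new job.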
And if any of them start malfunctioning, or running out of steam, or simply taking the day off, I will be able to see this either by certain metrics going out of expected ranges, or by the exporter reporting that it cannot even find the service at all (which we can also detect and turn into alarms, more on that later).\nPictured here (you should probably open it in full resolution unless you have hawk eyes) is an example of those metrics. Prometheus is happy to handle several million of them at a fairly high scraping frequency; in my case it comes around every 10 seconds and pulls the data from these five exporters. While these metrics are human readable, they aren\u0026rsquo;t very practical\u0026hellip;\nGrafana \u0026hellip; so let\u0026rsquo;s visualize them with an equally awesome tool: [Grafana]. This tool provides operational dashboards for any data that is stored here, there, or anywhere :) Grafana can render stuff from a plethora of backends; one popular and established one is Prometheus. And as it turns out, as with Prometheus, lots of work has been done already with canonical, almost out-of-the-box, dashboards that were contributed by folks in the field. In fact, every single one of the five exporters I installed also has an accompanying dashboard, sometimes even multiple to choose from! Grafana allows you to [search and download] these from a corpus they provide, referring to them by their id, or alternatively downloading a JSON representation of the dashboard, for example one that comes with the exporter, or one you find on GitHub.\nFor uBlog, I installed: [Node Exporter], [Postgres Exporter], [Redis Exporter], [NGINX Exporter], and [ElasticSearch Exporter].\nTo the right (top) you\u0026rsquo;ll see a dashboard for PostgreSQL - it has lots of expert insights on how databases are used, how many read/write operations (like SELECT and UPDATE/DELETE queries) are performed, and their respective latency expectations. What I find particularly useful is the total amount of memory, CPU and disk activity. This allows me to see at a glance when it\u0026rsquo;s time to break out [pgTune] to help change system settings for Postgres, or even inform me when it\u0026rsquo;s time to move the database to its own server rather than co-habitating with the other stuff running on this virtual machine. In my experience, stateful systems are often the source of bottlenecks, so I take special care to monitor them and observe their performance over time. In particular, slowness will be seen in Mastodon if the database is slow (sound familiar?).\nNext, to the right (middle) you\u0026rsquo;ll see a dashboard for Redis. This one shows me how full the Redis cache is (the yellow line in the first graph is when I restarted Redis to give it a maxmemory setting of 1GB), but also a high resolution overview of how many operations it\u0026rsquo;s doing. I can see that the server is spiky, and upon closer inspection this is the pfcount command with a period of exactly 300 seconds, in other words something is spiking every 5min. I have a feeling that this might become an issue\u0026hellip; and when it does, I\u0026rsquo;ll get to learn all about this elusive [pfcount] command. 
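A quick way to poke at that suspicion from the shell is Redis' own per-command counters, which is the same data the exporter scrapes (pfcount is the HyperLogLog cardinality estimator, which Mastodon appears to use for its activity statistics):

pim@ublog:~$ redis-cli info commandstats | grep -i pfcount
cmdstat_pfcount:calls=...,usec=...,usec_per_call=...
pim@ublog:~$ redis-cli slowlog get 5   # did any of those calls exceed slowlog-log-slower-than?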
But until then, I can see the average time by command: because Redis serves from RAM and this is a pretty quick server, I see the turnaround time for most queries to it in the 200-500 µs range, wow!\nBut while these dashboards are awesome, what I find has saved me (and my ISP, IPng Networks) a metric tonne of time, is the most fundamental monitoring in the Node Exporter dashboard, pictured to the right (bottom). What I really love about this dashboard, is that it shows at a glance the parts of the computer that are going to become a problem. If RAM is full (but not because of filesystem cache), or CPU is running hot, or the network is flatlining at a certain throughput or packets/sec limit, these are all things that the applications running on the machine won\u0026rsquo;t necessarily be able to show me more information on, but the Node Exporter to the rescue: it has so many interesting pieces of kernel and host operating system telemetry, that it is one of the single most useful tools I know. Every physical host and every virtual machine, is exporting metrics into IPng Networks\u0026rsquo; prometheus instance, and it constantly shows me what to improve. Thanks, Obama!\nWhat\u0026rsquo;s next Careful readers will have noticed that this whole article talks about all sorts of interesting telemetry, observability metrics, and dashboards, but they are all common components, and none of them touch on the internals of Mastodon\u0026rsquo;s processes, like Puma or Sidekiq or the API Services that Mastodon exposes. Consider this a cliff hanger (eh, mostly because I\u0026rsquo;m a bit busy at work and will need a little more time).\nIn an upcoming post, I take a deep dive into this application-specific behavior and how to extract this telemetry (spoiler alert: it can be done! and I will open source it!), as I\u0026rsquo;ve started to learn more about how Ruby gathers and exposes its own internals. Interestingly, one of the things that I\u0026rsquo;ll talk about is NSA but not the American agency, rather a comical wordplay from some open source minded folks who have blazed the path in making Ruby\u0026rsquo;s Rail application performance metrics available to external observers. In a round-about way, I hope to show how to plug these into Prometheus in the same way all the other exporters have already done so.\nBy the way: If you\u0026rsquo;re looking for a home, feel free to sign up at https://ublog.tech/ as I\u0026rsquo;m sure that having a bit more load / traffic on this instance will allow me to learn (and in turn, to share with others)!\n","date":"2022-11-24","desc":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\n","permalink":"https://ipng.ch/s/articles/2022/11/24/mastodon-part-2-monitoring/","section":"articles","title":"Mastodon - Part 2 - Monitoring"},{"contents":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. 
I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\nThis series details my findings starting a micro blogging website, which uses a new set of super interesting open interconnect protocols to share media (text, pictures, videos, etc) between producers and their followers, using an open source project called Mastodon.\nIntroduction Similar to how blogging is the act of publishing updates to a website, microblogging is the act of publishing small updates to a stream of updates on your profile. You can publish text posts and optionally attach media such as pictures, audio, video, or polls. Mastodon lets you follow friends and discover new ones. It doesn\u0026rsquo;t do this in a centralized way, however.\nGroups of people congregate on a given server, of which they become a user by creating an account on that server. Then, they interact with one another on that server, but users can also interact with folks on other servers. Instead of following @IPngNetworks, they might follow a user on a given server domain, like @IPngNetworks@ublog.tech. This way, all these servers can be run independently but interact with each other using a common protocol (called ActivityPub). I\u0026rsquo;ve heard this concept be compared to choosing an e-mail provider: I might choose Google\u0026rsquo;s gmail.com, and you might use Microsoft\u0026rsquo;s live.com. However we can send e-mails back and forth due to this common protocol (called SMTP).\nuBlog.tech I thought I would give it a go, mostly out of engineering curiosity but also because I more strongly feel today that we (the users) ought to take a bit more ownership back. I\u0026rsquo;ve been a regular blogging and micro-blogging user since approximately for ever, and I think it may be a good investment of my time to learn a bit more about the architecture of Mastodon. So, I\u0026rsquo;ve decided to build and productionize a server instance.\nI registered uBlog.tech. Incidentally, if you\u0026rsquo;re reading this and would like to participate, the server welcomes users in the network-, systems- and software engineering disciplines. But, before I can get to the fun parts though, I have to do a bunch of work to get this server in a shape in which it can be trusted with user generated content.\nHardware I\u0026rsquo;m running Debian on (a set of) Dell R720s hosted by IPng Networks in Zurich, Switzerland. These machines are all roughly the same, and come with:\n2x10C/10T Intel E5-2680 (so 40 CPUs) 256GB ECC RAM 2x240G SSD in mdraid to boot from 3x1T SSD in ZFS for fast storage 6x16T harddisk with 2x500G SSD for L2ARC, in ZFS for bulk storage Data integrity and durability is important to me. It\u0026rsquo;s the one thing that typically the commercial vendors do really well, and my pride prohibits me from losing data due to things like \u0026ldquo;disk failure\u0026rdquo; or \u0026ldquo;computer broken\u0026rdquo; or \u0026ldquo;datacenter on fire\u0026rdquo;. 
So, I handle backups in two main ways: borg(1) and zrepl(1).\nHypervisor hosts make a daily copy of their entire filesystem using borgbackup(1) to a set of two remote fileservers. This way, the important file metadata, configs for the virtual machines, and so on, are all safely stored remotely. Virtual machines are running on ZFS blockdevices on either the SSD pool, or the disk pool, or both. Using a tool called zrepl(1) (which I described a little bit in a [previous post]), I create a snapshot every 12hrs on the local blockdevice, and incrementally copy away those snapshots daily to the remote fileservers. If I do something silly on a given virtual machine, I can roll back the machine filesystem state to the previous checkpoint and reboot. This has saved my butt a number of times, during say a PHP 7 to 8 upgrade for Librenms, or during an OpenBSD upgrade that ran out of disk midway through. Being able to roll back to a last known good state is awesome, and completely transparent for the virtual machine, as the snapshotting is done on the underlying storage pool in the hypervisor. The fileservers run physically separated from the server pools, one in Zurich and another in Geneva, so this way, if I were to lose the entire machine, I still have a ~12h old backup in two locations.\nSoftware I provision a VM with 8vCPUs (dedicated on the underlying hypervisor), including 16GB of memory and two virtio network cards. One NIC will connect to a backend LAN in some RFC1918 address space, and the other will present an IPv4 and IPv6 interface to the internet. I give this machine two blockdevices, one small one of 16GB (vda) that is created on the hypervisor\u0026rsquo;s ssd-vol0/libvirt/ublog-disk0, to be used only for boot, logs and OS. Then, a second one (vdb) is created at 300GB on ssd-vol1/libvirt/ublog-disk1 and it will be used for Mastodon and its supporting services.\nThen I simply install Debian into vda using virt-install. At IPng Networks we have some ansible-style automation that takes over the machine, and further installs all sorts of Debian packages that we use (like a Prometheus node exporter, more on that later), and sets up a firewall that allows SSH access for our trusted networks, and otherwise only allows port 80 and 443 because this is to be a webserver.\nAfter installing Debian Bullseye, I\u0026rsquo;ll create the following ZFS filesystems on vdb:\npim@ublog:~$ sudo zfs create -o mountpoint=/home/mastodon data/mastodon -V10G pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/elasticsearch data/elasticsearch -V10G pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/postgresql data/postgresql -V20G pim@ublog:~$ sudo zfs create -o mountpoint=/var/lib/redis data/redis -V2G pim@ublog:~$ sudo zfs create -o mountpoint=/home/mastodon/libve/public/system data/mastodon-system As a sidenote, I realize that this ZFS filesystem pool consists only of vdb, but its underlying blockdevice is protected in a raidz, and it is copied incrementally daily off-site by the hypervisor. I\u0026rsquo;m pretty confident on safety here, but I prefer to use ZFS for the virtual machine guests as well, because now I can do local snapshotting, of say data/mastodon-system, and I can more easily grow/shrink the datasets for the supporting services, as well as monitor them individually for wildgrowth.\nInstalling Mastodon I then go through the public Mastodon docs to further install the machine. I choose not to go the Docker route, but instead stick to systemd installs. 
The install itself is pretty straight forward, but I did find the nginx config a bit rough around the edges (notably because the default files I\u0026rsquo;m asked to use have their ssl certificate stanza\u0026rsquo;s commented out, while trying to listen on port 443, and this makes nginx and certbot very confused). A cup of tea later, and we\u0026rsquo;re all good.\nI am not going to start prematurely optimizing, and after a very engaging thread on Mastodon itself [@davidlars@hachyderm.io] with a few fellow admins, the consensus really is to KISS (keep it simple, silly!). In that thread, I made a few general observations on scaling up and out (none of which I\u0026rsquo;ll be doing initially), just by using some previous experience as a systems engineer, and knowing a bit about the components used here:\nRunning services on dedicated machines (ie. saparate storage, postgres, Redis, Puma and Sidekiq workers) Fiddle with Puma worker pool (more workers, and/or more threads per worker) Fiddle with Sidekiq worker pool and dedicated instances per queue Put storage on local minio cluster Run multiple postgres databases, read-only replicas, or multimaster Run cluster of multiple redis instances instead of one Split off the cache redis into mem-only Frontend the service with a cluster of NGINX + object caching Some other points of interest for those of us on the adventure of running our own machines follow:\nLogging Mastodon is a chatty one - it is logging to stdout/stderr and most of its tasks in Sidekiq have a lot to say. On Debian, by default this output goes from systemd into journald which in turn copies it into syslogd. The result of this is that each logline hits the disk three (!) times. And also by default, Debian and Ubuntu aren\u0026rsquo;t too great at log hygiene. While /var/log/ is scrubbed by logrotate(8), nothing helps avoid the journal from growing unboundedly. So I quickly make the following change:\npim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/systemd/journald.conf [Journal] SystemMaxUse=500M ForwardToSyslog=no EOF pim@ublog:~$ sudo systemctl restart systemd-journald Paperclip and ImageMagick I noticed while tailing the journal journalctl -f that lots of incoming media gets first spooled to /tmp and then run through a conversion step to ensure the media is of the right format/aspect ratio. Mastodon calls a library called paperclip which in turn uses file(1) and identify(1) to determine the type of file, and based on the answer for images runs convert(1) or ffmpeg(1) to munge it into the shape it wants. I suspect that this will cause a fair bit of I/O in /tmp so something to keep in mind, is to either lazily turn that mountpoint into a tmpfs (which is in general frowned upon), or to change the paperclip library to use a user-defined filesystem like ~mastodon/tmp and make that a memory backed filesystem instead. 
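If that ever becomes necessary, the second option could look roughly like the sketch below. It is untested, and it assumes that paperclip honors the standard TMPDIR environment variable (it spools via Ruby's Tempfile, judging by the filenames in the log signature below, so it should):

pim@ublog:~$ cat << EOF | sudo tee -a /etc/fstab
tmpfs /home/mastodon/tmp tmpfs rw,nosuid,nodev,size=1G,mode=1777 0 0
EOF
pim@ublog:~$ sudo mkdir -p /home/mastodon/tmp && sudo mount /home/mastodon/tmp
pim@ublog:~$ sudo systemctl edit mastodon-web mastodon-sidekiq   # add: [Service] Environment=TMPDIR=/home/mastodon/tmp
pim@ublog:~$ sudo systemctl restart mastodon-web mastodon-sidekiq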
The log signature in case you\u0026rsquo;re curious:\nNov 20 21:02:10 ublog bundle[408189]: Command :: file -b --mime \u0026#39;/tmp/a22ab94adb939b0eb3c224bb9046c9cf20221123-408189-s0rsty.jpg\u0026#39; Nov 20 21:02:10 ublog bundle[408189]: Command :: identify -format %m \u0026#39;/tmp/6205b887c6c337b1a72ae2a7ccb359c920221123-408189-e9jul1.jpg[0]\u0026#39; Nov 20 21:02:10 ublog bundle[408189]: Command :: convert \u0026#39;/tmp/6205b887c6c337b1a72ae2a7ccb359c920221123-408189-e9jul1.jpg[0]\u0026#39; -auto-orient -resize \u0026#34;400x400\u0026gt;\u0026#34; -coalesce \u0026#39;/tmp/8ce2976b99d4b5e861e6c988459ee20c20221123-408189-1p5gg4\u0026#39; Nov 20 21:02:10 ublog bundle[408189]: Command :: convert \u0026#39;/tmp/8ce2976b99d4b5e861e6c988459ee20c20221123-408189-1p5gg4\u0026#39; -depth 8 RGB:- I will put a pin in this until it becomes a bottleneck, but larger server admins may have thought about this before, and if so, let me know what you came up with!\nElasticsearch There\u0026rsquo;s a little bit of a timebomb here, unfortunately. Following [Full-text search] docs, the install and integration is super easy. But, in an upcoming release, Elasticsearch is going to force authentication by default, even though in the current version they are still tolerant of non-secured instances, those will break in the future. So I\u0026rsquo;m going to get ahead of that and create my instance with the minimally required security setup in mind [ref]:\npim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/elasticsearch/elasticsearch.yml xpack.security.enabled: true discovery.type: single-node EOF pim@ublog:~$ PASS=$(openssl rand -base64 12) pim@ublog:~$ /usr/share/elasticsearch/bin/elasticsearch-setup-passwords interactive (use this $PASS for the \u0026#39;elastic\u0026#39; user) pim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee -a ~mastodon/live/.env.production ES_USER=elastic ES_PASS=$PASS EOF pim@ublog:~$ sudo systemctl restart mastodon-streaming mastodon-web mastodon-sidekiq Elasticsearch is a memory hog, which is not that strange considering its job is to supply full text retrieval in a large amount of documents and data at high performance. It\u0026rsquo;ll by default grab roughly half of the machine\u0026rsquo;s memory, which it really doesn\u0026rsquo;t need for now. So, I\u0026rsquo;ll give it a little bit of a smaller playground to expand into, by limiting it\u0026rsquo;s heap to 2 GB to get us started:\npim@ublog:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/elasticsearch/jvm.options.d/memory.options -Xms2048M -Xmx2048M EOF pim@ublog:~$ sudo systemctl restart elasticsearch Mail E-mail can be quite tricky to get right. At IPng we\u0026rsquo;ve been running mailservers for a while now, and we\u0026rsquo;re reasonably good at delivering mail even to the most hard-line providers (looking at you, GMX and Google). We use relays from a previous project of mine called [PaPHosting], which you can clearly see comes from the Dark Ages when the Internet was still easy. These days, our mailservers run a combination of STS-MTA, TLS certs from Lets Encrypt, DMARC, and SPF. 
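For completeness, the DNS side of that ends up as TXT records along these lines; the include and the DMARC policy here are illustrative guesses, the real records for ublog.tech may well differ:

ublog.tech.          IN TXT "v=spf1 include:paphosting.net -all"
_dmarc.ublog.tech.   IN TXT "v=DMARC1; p=quarantine; rua=mailto:postmaster@ublog.tech"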
So our outbound mail is simply using OpenBSD\u0026rsquo;s smtpd(8), and it forwards to the remote relay pool of five servers using authentication, but only after rewriting the envelope to always come from @ublog.tech and match the e-mail sender (which allows for strict SPF):\npim@ublog:~$ cat /etc/smtpd.conf table aliases file:/etc/aliases table secrets file:/etc/mail/secrets listen on localhost action \u0026#34;local_mail\u0026#34; mbox alias \u0026lt;aliases\u0026gt; action \u0026#34;outbound\u0026#34; relay host \u0026#34;smtp+tls://papmx@smtp.paphosting.net\u0026#34; auth \u0026lt;secrets\u0026gt; \\ mail-from \u0026#34;@ublog.tech\u0026#34; match from local for local action \u0026#34;local_mail\u0026#34; match from local for any action \u0026#34;outbound\u0026#34; Inbound mail to the @ublog.tech domain is also handled by the paphosting servers, which forward them all to our respective inboxes.\nServer Settings After reading a post from [@rriemann@chaos.social], I was quickly convinced that having a good privacy policy is worth the time. I took their excellent advice to create a reasonable [Privacy Policy]. Thanks again for that, and if you\u0026rsquo;re running a server in Europe or with European users, definitely check it out.\nRules are important. I didn\u0026rsquo;t give this as much thought, but I did assert some ground rules. Even though I do believe in [Postel\u0026rsquo;s Robustness Principle] (Be liberal in what you accept, and conservative in what you send.), I generally tend to believe that computers lose their temper less often than humans, so I started off with:\nBehavioral Tenets: Use welcoming and inclusive language, be respectful of differing viewpoints and experiences, gracefully accept constructive criticism, focus on what is best for the community, show empathy towards other community members. Be kind to each other, and yourself. Unacceptable behavior: Use of sexualized language or imagery, unsolicited romantic attention, trolling, derogatory comments, personal or political attacks, doxxing are strictly prohibited. Use conduct considered inappropriate for a professional setting. I also read an entertaining (likely insider-joke) post from [@nova@hachyderm.io], in which she was asking about the internet explorer favicon on her instance, so I couldn\u0026rsquo;t resist but replace the mastodon favicon with the IPng Networks one. Vanity matters.\nWhat\u0026rsquo;s next Now that the server is up, and I have a small amount of users (mostly folks I know from the tech industry), I took some time to explore both the Fediverse, reach out to friends old and new, participate in a few random discussions, fiddle with the iOS apps (and in the end, settled on Toot! with a runner up of Metatext), and generally had an amazing time on Mastodon these last few days.\nNow, I think I\u0026rsquo;m ready to further productionize the experience. My next article will cover monitoring - a vital aspect of any serious project. I\u0026rsquo;ll go over Prometheus, Grafana, Alertmanager and how to get the most signal out of a running Mastodon instance. Stay tuned!\nIf you\u0026rsquo;re looking for a home, feel free to sign up at https://ublog.tech/ as I\u0026rsquo;m sure that having a bit more load / traffic on this instance will allow me to learn (and in turn, to share with others)!\n","date":"2022-11-20","desc":"About this series I have seen companies achieve great successes in the space of consumer internet and entertainment industry. 
I\u0026rsquo;ve been feeling less enthusiastic about the stronghold that these corporations have over my digital presence. I am the first to admit that using \u0026ldquo;free\u0026rdquo; services is convenient, but these companies are sometimes taking away my autonomy and exerting control over society. To each their own of course, but for me it\u0026rsquo;s time to take back a little bit of responsibility for my online social presence, away from centrally hosted services and to privately operated ones.\n","permalink":"https://ipng.ch/s/articles/2022/11/20/mastodon-part-1-installing/","section":"articles","title":"Mastodon - Part 1 - Installing"},{"contents":" Introduction In a previous post (VPP Linux CP - Virtual Machine Playground), I wrote a bit about building a QEMU image so that folks can play with the Vector Packet Processor and the Linux Control Plane code. Judging by our access logs, this image has definitely been downloaded a bunch, and I myself use it regularly when I want to tinker a little bit, without wanting to impact the production routers at AS8298.\nThe topology of my tests has become a bit more complicated over time, and often just one router would not be enough. Yet, repeatability is quite important, and I found myself constantly reinstalling / recheckpointing the vpp-proto virtual machine I was using. I got my hands on some LAB hardware, so it\u0026rsquo;s time for an upgrade!\nIPng Networks LAB - Physical First, I specc\u0026rsquo;d out a few machines that will serve as hypervisors. From top to bottom in the picture here, two FS.com S5680-20SQ switches \u0026ndash; I reviewed these earlier [ref], and I really like these, as they come with 20x10G, 4x25G and 2x40G ports, an OOB management port and serial to configure them. Under it, is its larger brother, with 48x10G and 8x100G ports, the FS.com S5860-48SC. Although it\u0026rsquo;s a bit more expensive, it\u0026rsquo;s also necessary because I often test VPP at higher bandwidth, and as such being able to make ethernet topologies by mixing 10, 25, 40, 100G is super useful for me. So, this switch is fsw0.lab.ipng.ch and dedicated to lab experiments.\nConnected to the switch are my trusty Rhino and Hippo machines. If you remember that game Hungry Hungry Hippos that\u0026rsquo;s where the name comes from. They are both Ryzen 5950X on ASUS B550 motherboard, with each 2x1G i350 copper nics (pictured here not connected), and 2x100G i810 QSFP network cards (properly slotted in the motherboard\u0026rsquo;ss PCIe v4.0 x16 slot).\nFinally, three Dell R720XD machines serve as the to be built VPP testbed. They each come with 128GB of RAM, 2x500G SSDs, two Intel 82599ES dual 10G NICs (four ports total), and four Broadcom BCM5720 1G NICs. The first 1G port is connected to a management switch, and it doubles up as an IPMI speaker, so I can turn on/off the hypervisors remotely. All four 10G ports are connected with DACs to fsw0-lab, as are two 1G copper ports (the blue UTP cables). Everything can be turned on/off remotely, which is useful for noise, heat and overall the environment 🍀.\nIPng Networks LAB - Logical I have three of these Dell R720XD machines in the lab, and each one of them will run one complete lab environment, consisting of four VPP virtual machines, network plumbing, and uplink. That way, I can turn on one hypervisor, say hvn0.lab.ipng.ch, prepare and boot the VMs, mess around with it, and when I\u0026rsquo;m done, return the VMs to a pristine state, and turn off the hypervisor. 
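Powering the Dells up and down remotely is a one-liner with ipmitool against each machine's IPMI interface (the hostname and credentials below are made up):

pim@lab:~$ ipmitool -I lanplus -H hvn0-ipmi.lab.ipng.ch -U root -P '<secret>' chassis power status
Chassis Power is off
pim@lab:~$ ipmitool -I lanplus -H hvn0-ipmi.lab.ipng.ch -U root -P '<secret>' chassis power on
pim@lab:~$ ipmitool -I lanplus -H hvn0-ipmi.lab.ipng.ch -U root -P '<secret>' chassis power soft   # graceful shutdown when done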
And, because I have three of these machines, I can run three separate LABs at the same time, or one really big one spanning all the machines. Pictured on the right is a logical sketch of one of the LABs (LAB id=0), with a bunch of VPP virtual machines, each with four NICs daisychained together, with a few NICs left for experimenting.\nHeadend At the top of the logical environment, I am going to be using one of our production machines (hvn0.chbtl0.ipng.ch) which will run a permanent LAB headend, a Debian VM called lab.ipng.ch. This allows me to hermetically seal the LAB environments, letting me run them entirely in RFC1918 space, and by forcing the LABs to be connected under this machine, I can ensure that no unwanted traffic enters or exits the network [imagine a loadtest at 100Gbit accidentally leaking, this may or totally may not have once happened to me before \u0026hellip;].\nDisk images On this production hypervisor (hvn0.chbtl0.ipng.ch), I\u0026rsquo;ll also prepare and maintain a prototype vpp-proto disk image, which will serve as a consistent image to boot the LAB virtual machines. This main image will be replicated over the network into all three hvn0 - hvn2 hypervisor machines. This way, I can do periodical maintenance on the main vpp-proto image, snapshot it, publish it as a QCOW2 for downloading (see my [VPP Linux CP - Virtual Machine Playground] post for details on how it\u0026rsquo;s built and what you can do with it yourself!). The snapshots will then also be sync\u0026rsquo;d to all hypervisors, and from there I can use simple ZFS filesystem cloning and snapshotting to maintain the LAB virtual machines.\nNetworking Each hypervisor will get an install of Open vSwitch, a production quality, multilayer virtual switch designed to enable massive network automation through programmatic extension, while still supporting standard management interfaces and protocols. This takes lots of the guesswork and tinkering out of Linux bridges in KVM/QEMU, and it\u0026rsquo;s a perfect fit due to its tight integration with libvirt (the thing most of us use in Debian/Ubuntu hypervisors). If need be, I can add one or more of the 1G or 10G ports as well to the OVS fabric, to build more complicated topologies. And, because the OVS infrastructure and libvirt both allow themselves to be configured over the network, I can control all aspects of the runtime directly from the lab.ipng.ch headend, not having to log in to the hypervisor machines at all. Slick!\nImplementation Details I start with image management. On the production hypervisor, I create a 6GB ZFS dataset that will serve as my vpp-proto machine, and install it using the exact same method as the playground [ref]. Once I have it the way I like it, I\u0026rsquo;ll power off the VM and see to this image being replicated to all hypervisors.\nZFS Replication Enter zrepl, a one-stop, integrated solution for ZFS replication. This tool is incredibly powerful, and can do snapshot management, sourcing / sinking replication, of course using incremental snapshots as they are native to ZFS. 
Because this is a LAB article, not a zrepl tutorial, I\u0026rsquo;ll just cut to the chase and show the configuration I came up with.\npim@hvn0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/zrepl/zrepl.yml global: logging: # use syslog instead of stdout because it makes journald happy - type: syslog format: human level: warn jobs: - name: snap-vpp-proto type: snap filesystems: \u0026#39;ssd-vol0/vpp-proto-disk0\u0026lt;\u0026#39;: true snapshotting: type: manual pruning: keep: - type: last_n count: 10 - name: source-vpp-proto type: source serve: type: stdinserver client_identities: - \u0026#34;hvn0-lab\u0026#34; - \u0026#34;hvn1-lab\u0026#34; - \u0026#34;hvn2-lab\u0026#34; filesystems: \u0026#39;ssd-vol0/vpp-proto-disk0\u0026lt;\u0026#39;: true # all filesystems snapshotting: type: manual EOF pim@hvn0-chbtl0:~$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /root/.ssh/authorized_keys # ZFS Replication Clients for IPng Networks LAB command=\u0026#34;zrepl stdinserver hvn0-lab\u0026#34;,restrict ecdsa-sha2-nistp256 \u0026lt;omitted\u0026gt; root@hvn0.lab.ipng.ch command=\u0026#34;zrepl stdinserver hvn1-lab\u0026#34;,restrict ecdsa-sha2-nistp256 \u0026lt;omitted\u0026gt; root@hvn1.lab.ipng.ch command=\u0026#34;zrepl stdinserver hvn2-lab\u0026#34;,restrict ecdsa-sha2-nistp256 \u0026lt;omitted\u0026gt; root@hvn2.lab.ipng.ch EOF To unpack this, there are two jobs configured in zrepl:\nsnap-vpp-proto - the purpose of this job is to track snapshots as they are created. Normally, zrepl is configured to automatically make snapshots every hour and copy them out, but in my case, I only want to take snapshots when I changed and released the vpp-proto image, not periodically. So, I set the snapshotting to manual, and let the system keep the last ten images. source-vpp-proto - this is a source job that uses a lazy (albeit fine in this lab environment) method to serve the snapshots to clients. By adding these SSH keys to the authorized_keys file, but restricting them to be able to execute only the zrepl stdinserver command, and nothing else (ie. these keys cannot log in to the machine). If any given server were to present thesze keys, I can now map them to a zrepl client (for example, hvn0-lab for the SSH key presented by hostname hvn0.lab.ipng.ch. The source job now knows to serve the listed filesystems (and their dataset children, noted by the \u0026lt; suffix), to those clients. 
For the client side, each of the hypervisors gets only one job, called a pull job, which will periodically wake up (every minute) and ensure that any pending snapshots and their incrementals from the remote source are slurped in and replicated to a root_fs dataset, in this case I called it ssd-vol0/hvn0.chbtl0.ipng.ch so I can track where the datasets come from.\npim@hvn0-lab:~$ sudo ssh-keygen -t ecdsa -f /etc/zrepl/ssh/identity -C \u0026#34;root@$(hostname -f)\u0026#34; pim@hvn0-lab:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/zrepl/zrepl.yml global: logging: # use syslog instead of stdout because it makes journald happy - type: syslog format: human level: warn jobs: - name: vpp-proto type: pull connect: type: ssh+stdinserver host: hvn0.chbtl0.ipng.ch user: root port: 22 identity_file: /etc/zrepl/ssh/identity root_fs: ssd-vol0/hvn0.chbtl0.ipng.ch interval: 1m pruning: keep_sender: - type: regex regex: \u0026#39;.*\u0026#39; keep_receiver: - type: last_n count: 10 recv: placeholder: encryption: off After restarting zrepl for each of the machines (the source machine and the three pull machines), I can now do the following cool hat trick:\npim@hvn0-chbtl0:~$ virsh start --console vpp-proto ## Do whatever maintenance, and then poweroff the VM pim@hvn0-chbtl0:~$ sudo zfs snapshot ssd-vol0/vpp-proto-disk0@20221019-release pim@hvn0-chbtl0:~$ sudo zrepl signal wakeup source-vpp-proto This signals the zrepl daemon to re-read the snapshots, which will pick up the newest one, and then without me doing much of anything else:\npim@hvn0-lab:~$ sudo zfs list -t all | grep vpp-proto ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0 6.60G 367G 6.04G - ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221013-release 499M - 6.04G - ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release 24.1M - 6.04G - ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release 0B - 6.04G - That last image was just pushed automatically to all hypervisors! If they\u0026rsquo;re turned off, no worries, as soon as they start up, their local zrepl will make its next minutely poll, and pull in all snapshots, bringing the machine up to date. So even when the hypervisors are normally turned off, this is zero-touch and maintenance free.\nVM image maintenance Now that I have a stable image to work off of, all I have to do is zfs clone this image into new per-VM datasets, after which I can mess around on the VMs all I want, and when I\u0026rsquo;m done, I can zfs destroy the clone and bring it back to normal. However, I clearly don\u0026rsquo;t want one and the same clone for each of the VMs, as they do have lots of config files that are specific to that one instance. For example, the mgmt IPv4/IPv6 addresses are unique, and the VPP and Bird/FRR configs are unique as well. But how unique are they, really?\nEnter Jinja (known mostly from Ansible). I decide to make some form of per-VM config files that are generated based on some templates. That way, I can clone the base ZFS dataset, copy in the deltas, and boot that instead. And to be extra efficient, I can also make a per-VM zfs snapshot of the cloned+updated filesystem, before tinkering with the VMs, which I\u0026rsquo;ll call a pristine snapshot. Still with me?\nFirst, clone the base dataset into a per-VM dataset, say ssd-vol0/vpp0-0 Then, generate a bunch of override files, copying them into the per-VM dataset ssd-vol0/vpp0-0 Finally, create a snapshot of that, called ssd-vol0/vpp0-0@pristine and boot off of that. 
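In ZFS terms those three steps boil down to something like the following sketch; the mounting and rsync of the generated overrides in the middle is elided here, the create script further down does it properly:

pim@hvn0-lab:~$ sudo zfs clone ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release ssd-vol0/vpp0-0
pim@hvn0-lab:~$ ## mount the clone, rsync the per-VM overrides in, unmount (see ./create below)
pim@hvn0-lab:~$ sudo zfs snapshot ssd-vol0/vpp0-0@pristine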
Now, returning the VM to a pristine state is simply a matter of shutting down the VM, performing a zfs rollback to the pristine snapshot, and starting the VM again. Ready? Let\u0026rsquo;s go!\nGenerator So off I go, writing a small Python generator that uses Jinja to read a bunch of YAML files, merging them along the way, and then traversing a set of directories with template files and per-VM overrides, to assemble a build output directory with a fully formed set of files that I can copy into the per-VM dataset.\nTake a look at this as a minimally viable configuration:\npim@lab:~/src/lab$ cat config/common/generic.yaml overlays: default: path: overlays/bird/ build: build/default/ lab: mgmt: ipv4: 192.168.1.80/24 ipv6: 2001:678:d78:101::80/64 gw4: 192.168.1.252 gw6: 2001:678:d78:101::1 nameserver: search: [ \u0026#34;lab.ipng.ch\u0026#34;, \u0026#34;ipng.ch\u0026#34;, \u0026#34;rfc1918.ipng.nl\u0026#34;, \u0026#34;ipng.nl\u0026#34; ] nodes: 4 pim@lab:~/src/lab$ cat config/hvn0.lab.ipng.ch.yaml lab: id: 0 ipv4: 192.168.10.0/24 ipv6: 2001:678:d78:200::/60 nameserver: addresses: [ 192.168.10.4, 2001:678:d78:201::ffff ] hypervisor: hvn0.lab.ipng.ch Here I define a common config file with fields and attributes which will apply to all LAB environments, things such as the mgmt network, nameserver search paths, and how many VPP virtual machine nodes I want to build. Then, for hvn0.lab.ipng.ch, I specify an IPv4 and IPv6 prefix assigned to it, some specific nameserver endpoints that will point at an unbound running on lab.ipng.ch itself.\nI can now create any file I\u0026rsquo;d like which may use variable substition and other jinja2 style templating. Take for example these two files:\npim@lab:~/src/lab$ cat overlays/bird/common/etc/netplan/01-netcfg.yaml.j2 network: version: 2 renderer: networkd ethernets: enp1s0: optional: true accept-ra: false dhcp4: false addresses: [ {{node.mgmt.ipv4}}, {{node.mgmt.ipv6}} ] gateway4: {{lab.mgmt.gw4}} gateway6: {{lab.mgmt.gw6}} pim@lab:~/src/lab$ cat overlays/bird/common/etc/netns/dataplane/resolv.conf.j2 domain lab.ipng.ch search{% for domain in lab.nameserver.search %} {{ domain }}{% endfor %} {% for resolver in lab.nameserver.addresses %} nameserver {{ resolver }} {% endfor %} The first file is a [NetPlan.io] configuration that substitutes the correct management IPv4 and IPv6 addresses and gateways. The second one enumerates a set of search domains and nameservers, so that each LAB can have their own unique resolvers. I point these at the lab.ipng.ch uplink interface, in the case of the LAB hvn0.lab.ipng.ch, this will be 192.168.10.4 and 2001:678:d78:201::ffff, but on hvn1.lab.ipng.ch I can override that to become 192.168.11.4 and 2001:678:d78:211::ffff.\nThere\u0026rsquo;s one subdirectory for each overlay type (imagine that I want a lab that runs Bird2, but I may also want one which runs FRR, or another thing still). 
Within the overlay directory, there\u0026rsquo;s one common tree, with files that apply to every machine in the LAB, and a hostname tree, with files that apply only to specific nodes (VMs) in the LAB:\npim@lab:~/src/lab$ tree overlays/default/ overlays/default/ ├── common │ ├── etc │ │ ├── bird │ │ │ ├── bfd.conf.j2 │ │ │ ├── bird.conf.j2 │ │ │ ├── ibgp.conf.j2 │ │ │ ├── ospf.conf.j2 │ │ │ └── static.conf.j2 │ │ ├── hostname.j2 │ │ ├── hosts.j2 │ │ ├── netns │ │ │ └── dataplane │ │ │ └── resolv.conf.j2 │ │ ├── netplan │ │ │ └── 01-netcfg.yaml.j2 │ │ ├── resolv.conf.j2 │ │ └── vpp │ │ ├── bootstrap.vpp.j2 │ │ └── config │ │ ├── defaults.vpp │ │ ├── flowprobe.vpp.j2 │ │ ├── interface.vpp.j2 │ │ ├── lcp.vpp │ │ ├── loopback.vpp.j2 │ │ └── manual.vpp.j2 │ ├── home │ │ └── ipng │ └── root ├── hostname ├── vpp0-0 └── etc (etc) └── vpp └── config └── interface.vpp Now all that\u0026rsquo;s left to do is generate this hierarchy, and of course I can check this in to git and track changes to the templates and their resulting generated filesystem overrides over time:\npim@lab:~/src/lab$ ./generate -q --host hvn0.lab.ipng.ch pim@lab:~/src/lab$ find build/default/hvn0.lab.ipng.ch/vpp0-0/ -type f build/default/hvn0.lab.ipng.ch/vpp0-0/home/ipng/.ssh/authorized_keys build/default/hvn0.lab.ipng.ch/vpp0-0/etc/hosts build/default/hvn0.lab.ipng.ch/vpp0-0/etc/resolv.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/static.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/bfd.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/bird.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/ibgp.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/bird/ospf.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/loopback.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/flowprobe.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/interface.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/defaults.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/lcp.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/config/manual.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/vpp/bootstrap.vpp build/default/hvn0.lab.ipng.ch/vpp0-0/etc/netplan/01-netcfg.yaml build/default/hvn0.lab.ipng.ch/vpp0-0/etc/netns/dataplane/resolv.conf build/default/hvn0.lab.ipng.ch/vpp0-0/etc/hostname build/default/hvn0.lab.ipng.ch/vpp0-0/root/.ssh/authorized_keys Open vSwitch maintenance The OVS installs on each Debian hypervisor in the lab is the same. I install the required Debian packages, create a switchfabric, add one physical network port (the one that will serve as the uplink (VLAN 10 in the sketch above) for the LAB), and all the virtio ports from KVM.\npim@hvn0-lab:~$ sudo vi /etc/netplan/01-netcfg.yaml network: vlans: uplink: optional: true accept-ra: false dhcp4: false link: eno1 id: 200 pim@hvn0-lab:~$ sudo netplan apply pim@hvn0-lab:~$ sudo apt install openvswitch-switch python3-openvswitch pim@hvn0-lab:~$ sudo ovs-vsctl add-br vpplan pim@hvn0-lab:~$ sudo ovs-vsctl add-port vpplan uplink tag=10 The vpplan switch fabric and its uplink port will persist across reboots. Then I add a small change to libvirt defined virtual machines:\npim@hvn0-lab:~$ virsh edit vpp0-0 ... 
\u0026lt;interface type=\u0026#39;bridge\u0026#39;\u0026gt; \u0026lt;mac address=\u0026#39;52:54:00:00:10:00\u0026#39;/\u0026gt; \u0026lt;source bridge=\u0026#39;vpplan\u0026#39;/\u0026gt; \u0026lt;virtualport type=\u0026#39;openvswitch\u0026#39; /\u0026gt; \u0026lt;target dev=\u0026#39;vpp0-0-0\u0026#39;/\u0026gt; \u0026lt;model type=\u0026#39;virtio\u0026#39;/\u0026gt; \u0026lt;mtu size=\u0026#39;9216\u0026#39;/\u0026gt; \u0026lt;address type=\u0026#39;pci\u0026#39; domain=\u0026#39;0x0000\u0026#39; bus=\u0026#39;0x10\u0026#39; slot=\u0026#39;0x00\u0026#39; function=\u0026#39;0x0\u0026#39; multifunction=\u0026#39;on\u0026#39;/\u0026gt; \u0026lt;/interface\u0026gt; \u0026lt;interface type=\u0026#39;bridge\u0026#39;\u0026gt; \u0026lt;mac address=\u0026#39;52:54:00:00:10:01\u0026#39;/\u0026gt; \u0026lt;source bridge=\u0026#39;vpplan\u0026#39;/\u0026gt; \u0026lt;virtualport type=\u0026#39;openvswitch\u0026#39; /\u0026gt; \u0026lt;target dev=\u0026#39;vpp0-0-1\u0026#39;/\u0026gt; \u0026lt;model type=\u0026#39;virtio\u0026#39;/\u0026gt; \u0026lt;mtu size=\u0026#39;9216\u0026#39;/\u0026gt; \u0026lt;address type=\u0026#39;pci\u0026#39; domain=\u0026#39;0x0000\u0026#39; bus=\u0026#39;0x10\u0026#39; slot=\u0026#39;0x00\u0026#39; function=\u0026#39;0x1\u0026#39;/\u0026gt; \u0026lt;/interface\u0026gt; ... etc That the only two things I need to do are ensure that the source bridge will be called the same as the OVS fabric, in my case vpplan, and the virtualport type is openvswitch, and that\u0026rsquo;s it! Once all four vpp0-* virtual machines each have all four of their network cards updated, when they boot, the hypervisor will add them each as new untagged ports in the OVS fabric.\nTo then build the topology that I have in mind for the LAB, where each VPP machine is daisychained to its siblin, all we have to do is program that into the OVS configuration:\npim@hvn0-lab:~$ cat \u0026lt;\u0026lt; EOF \u0026gt; ovs-config.sh #!/bin/sh # # OVS configuration for the `default` overlay LAB=${LAB:=0} for node in 0 1 2 3; do for int in 0 1 2 3; do ovs-vsctl set port vpp${LAB}-${node}-${int} vlan_mode=native-untagged done done # Uplink is VLAN 10 ovs-vsctl add port vpp${LAB}-0-0 tag 10 ovs-vsctl add port uplink tag 10 # Link vpp${LAB}-0 \u0026lt;-\u0026gt; vpp${LAB}-1 in VLAN 20 ovs-vsctl add port vpp${LAB}-0-1 tag 20 ovs-vsctl add port vpp${LAB}-1-0 tag 20 # Link vpp${LAB}-1 \u0026lt;-\u0026gt; vpp${LAB}-2 in VLAN 21 ovs-vsctl add port vpp${LAB}-1-1 tag 21 ovs-vsctl add port vpp${LAB}-2-0 tag 21 # Link vpp${LAB}-2 \u0026lt;-\u0026gt; vpp${LAB}-3 in VLAN 22 ovs-vsctl add port vpp${LAB}-2-1 tag 22 ovs-vsctl add port vpp${LAB}-3-0 tag 22 EOF pim@hvn0-lab:~$ chmod 755 ovs-config.sh pim@hvn0-lab:~$ sudo ./ovs-config.sh The first block here wheels over all nodes and then for all of ther ports, sets the VLAN mode to what OVS calleds \u0026rsquo;native-untagged\u0026rsquo;. In this mode, the tag becomes the VLAN in which the port will operate, but, to add as well dot1q tagged additional VLANs, we can use the syntax add port ... 
trunks 10,20,30.\nTO see the configuration, ovs-vsctl show port vpp0-0-0 will show the switch port configuration, while ovs-vsctl show interface vpp0-0-0 will show the virtual machine\u0026rsquo;s NIC configuration (think of the difference here as the switch port on the one hand, and the NIC (interface) plugged into it on the other).\nDeployment There\u0026rsquo;s three main points to consider when deploying these lab VMs:\nCreate the VMs and their ZFS datasets Destroy the VMs and their ZFS datasets Bring the VMs into a pristine state Create If the hypervisor doesn\u0026rsquo;t yet have a LAB running, we need to create it:\nBASE=${BASE:=ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release} BUILD=${BUILD:=default} LAB=${LAB:=0} ## Do not touch below this line LABDIR=/var/lab STAGING=$LABDIR/staging HVN=\u0026#34;hvn${LAB}.lab.ipng.ch\u0026#34; echo \u0026#34;* Cloning base\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ mkdir -p $STAGING/\\$VM; zfs clone $BASE ssd-vol0/\\$VM; done\u0026#34; sleep 1 echo \u0026#34;* Mounting in staging\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ mount /dev/zvol/ssd-vol0/\\$VM-part1 $STAGING/\\$VM; done\u0026#34; echo \u0026#34;* Rsyncing build\u0026#34; rsync -avugP build/$BUILD/$HVN/ root@hvn${LAB}.lab.ipng.ch:$STAGING echo \u0026#34;* Setting permissions\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ chown -R root. $STAGING/\\$VM/root; done\u0026#34; echo \u0026#34;* Unmounting and snapshotting pristine state\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ umount $STAGING/\\$VM; zfs snapshot ssd-vol0/\\${VM}@pristine; done\u0026#34; echo \u0026#34;* Starting VMs\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ virsh start \\$VM; done\u0026#34; echo \u0026#34;* Committing OVS config\u0026#34; scp overlays/$BUILD/ovs-config.sh root@$HVN:$LABDIR ssh root@$HVN \u0026#34;set -x; LAB=$LAB $LABDIR/ovs-config.sh\u0026#34; After running this, the hypervisor will have 4 clones, and 4 snapshots (one for each virtual machine):\nroot@hvn0-lab:~# zfs list -t all NAME USED AVAIL REFER MOUNTPOINT ssd-vol0 6.80G 367G 24K /ssd-vol0 ssd-vol0/hvn0.chbtl0.ipng.ch 6.60G 367G 24K none ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0 6.60G 367G 24K none ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0 6.60G 367G 6.04G - ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221013-release 499M - 6.04G - ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release 24.1M - 6.04G - ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221019-release 0B - 6.04G - ssd-vol0/vpp0-0 43.6M 367G 6.04G - ssd-vol0/vpp0-0@pristine 1.13M - 6.04G - ssd-vol0/vpp0-1 25.0M 367G 6.04G - ssd-vol0/vpp0-1@pristine 1.14M - 6.04G - ssd-vol0/vpp0-2 42.2M 367G 6.04G - ssd-vol0/vpp0-2@pristine 1.13M - 6.04G - ssd-vol0/vpp0-3 79.1M 367G 6.04G - ssd-vol0/vpp0-3@pristine 1.13M - 6.04G - The last thing the create script does is commit the OVS configuration, because when the VMs are shutdown or newly created, KVM will add them to the switching fabric as untagged/unconfigured ports.\nBut would you look at that! 
The delta between the base image and the pristine snapshots is about 1MB of configuration files, the ones that I generated and rsync\u0026rsquo;d in above, and then once the machine boots, it will have a read/write mounted filesystem as per normal, except it\u0026rsquo;s a delta on top of the snapshotted, cloned dataset.\nDestroy I love destroying things! But in this case, I\u0026rsquo;m removing what are essentially ephemeral disk images, as I still have the base image to clone from. But, the destroy is conceptually very simple:\nBASE=${BASE:=ssd-vol0/hvn0.chbtl0.ipng.ch/ssd-vol0/vpp-proto-disk0@20221018-release} LAB=${LAB:=0} ## Do not touch below this line HVN=\u0026#34;hvn${LAB}.lab.ipng.ch\u0026#34; echo \u0026#34;* Destroying VMs\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ virsh destroy \\$VM; done\u0026#34; echo \u0026#34;* Destroying ZFS datasets\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ zfs destroy -r ssd-vol0/\\$VM; done\u0026#34; After running this, the VMs will be shut down and their cloned filesystems (including any snapshots those may have) are wiped. To get back into a working state, all I must do is run ./create again!\nPristine Sometimes though, I don\u0026rsquo;t need to completely destroy the VMs, but rather I want to put them back into the state they were in just after creating the LAB. Luckily, the create made a snapshot (called pristine) for each VM before booting it, so bringing the LAB back to factory default settings is really easy:\nBUILD=${BUILD:=default} LAB=${LAB:=0} ## Do not touch below this line LABDIR=/var/lab STAGING=$LABDIR/staging HVN=\u0026#34;hvn${LAB}.lab.ipng.ch\u0026#34; ## Bring back into pristine state echo \u0026#34;* Restarting VMs from pristine snapshot\u0026#34; ssh root@$HVN \u0026#34;set -x; for node in 0 1 2 3; do VM=vpp${LAB}-\\${node}; \\ virsh destroy \\$VM; zfs rollback ssd-vol0/\\${VM}@pristine; virsh start \\$VM; done\u0026#34; echo \u0026#34;* Committing OVS config\u0026#34; scp overlays/$BUILD/ovs-config.sh root@$HVN:$LABDIR ssh root@$HVN \u0026#34;set -x; $LABDIR/ovs-config.sh\u0026#34; Results After completing this project, I have a completely hands-off, automated and autogenerated, and very manageable set of three LABs, each booting up in a running OSPF/OSPFv3 enabled topology for IPv4 and IPv6:\npim@lab:~/src/lab$ traceroute -q1 vpp0-3 traceroute to vpp0-3 (192.168.10.3), 30 hops max, 60 byte packets 1 e0.vpp0-0.lab.ipng.ch (192.168.10.5) 1.752 ms 2 e0.vpp0-1.lab.ipng.ch (192.168.10.7) 4.064 ms 3 e0.vpp0-2.lab.ipng.ch (192.168.10.9) 5.178 ms 4 vpp0-3.lab.ipng.ch (192.168.10.3) 7.469 ms pim@lab:~/src/lab$ ssh ipng@vpp0-3 ipng@vpp0-3:~$ traceroute6 -q1 vpp2-3 traceroute to vpp2-3 (2001:678:d78:220::3), 30 hops max, 80 byte packets 1 e1.vpp0-2.lab.ipng.ch (2001:678:d78:201::3:2) 2.088 ms 2 e1.vpp0-1.lab.ipng.ch (2001:678:d78:201::2:1) 6.958 ms 3 e1.vpp0-0.lab.ipng.ch (2001:678:d78:201::1:0) 8.841 ms 4 lab0.lab.ipng.ch (2001:678:d78:201::ffff) 7.381 ms 5 e0.vpp2-0.lab.ipng.ch (2001:678:d78:221::fffe) 8.304 ms 6 e0.vpp2-1.lab.ipng.ch (2001:678:d78:221::1:21) 11.633 ms 7 e0.vpp2-2.lab.ipng.ch (2001:678:d78:221::2:22) 13.704 ms 8 vpp2-3.lab.ipng.ch (2001:678:d78:220::3) 15.597 ms If you read this far, thanks!
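As a small bonus for those who did read this far: the create, destroy and pristine scripts above are plain shell loops over SSH, and the same flow can just as easily be driven from Python. Below is a minimal, purely illustrative sketch of the pristine-rollback step, assuming the same vpp${LAB}-${node} naming, the ssd-vol0 pool and the /var/lab/ovs-config.sh helper; this is a sketch only, not code from the lab repository.

#!/usr/bin/env python3
"""Illustrative sketch: roll a LAB back to its 'pristine' ZFS snapshot over SSH."""
import subprocess

LAB = 0
HVN = f"hvn{LAB}.lab.ipng.ch"


def ssh(cmd: str) -> None:
    """Run one command on the hypervisor as root, raising on failure."""
    subprocess.run(["ssh", f"root@{HVN}", cmd], check=True)


def pristine() -> None:
    for node in range(4):
        vm = f"vpp{LAB}-{node}"
        ssh(f"virsh destroy {vm} || true")           # stop the VM; ignore if it is not running
        ssh(f"zfs rollback ssd-vol0/{vm}@pristine")  # back to the post-create snapshot
        ssh(f"virsh start {vm}")                     # boot it again

    # Freshly (re)started VMs show up as untagged ports on the vpplan fabric,
    # so the OVS VLAN configuration has to be re-applied afterwards.
    ssh(f"LAB={LAB} /var/lab/ovs-config.sh")


if __name__ == "__main__":
    pristine()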
Each of these three LABs come with 4x10Gbit DPDK based packet generators (Cisco T-Rex), four VPP machines running either Bird2 or FRR, and together they are connected to a 100G capable switch.\nThese LABs are for rent, and we offer hands-on training on them. Please contact us for daily/weekly rates, and custom training sessions.\nI checked the generator and deploy scripts in to a git repository, which I\u0026rsquo;m happy to share if there\u0026rsquo;s an interest. But because it contains a few implementation details and doesn\u0026rsquo;t do a lot of fool-proofing, as well as because most of this can be easily recreated by interested parties from this blogpost, I decided not to publish the LAB project github, but on our private git.ipng.ch server instead. Mail us if you\u0026rsquo;d like to take a closer look, I\u0026rsquo;m happy to share the code.\n","date":"2022-10-14","desc":" Introduction In a previous post (VPP Linux CP - Virtual Machine Playground), I wrote a bit about building a QEMU image so that folks can play with the Vector Packet Processor and the Linux Control Plane code. Judging by our access logs, this image has definitely been downloaded a bunch, and I myself use it regularly when I want to tinker a little bit, without wanting to impact the production routers at AS8298.\n","permalink":"https://ipng.ch/s/articles/2022/10/14/vpp-lab-setup/","section":"articles","title":"VPP Lab - Setup"},{"contents":" About this series I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community merged the Linux Control Plane plugin. I wrote about its deployment to both regular servers like the Supermicro routers that run on our AS8298, as well as virtual machines running in KVM/Qemu.\nNow that I\u0026rsquo;ve been running VPP in production for about half a year, I can\u0026rsquo;t help but notice one specific drawback: VPP is a programmable dataplane, and by design it does not include any configuration or controlplane management stack. It\u0026rsquo;s meant to be integrated into a full stack by operators. For end-users, this unfortunately means that typing on the CLI won\u0026rsquo;t persist any configuration, and if VPP is restarted, it will not pick up where it left off. There\u0026rsquo;s one developer convenience in the form of the exec command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line by line. However, if any typo is made in the file, processing immediately stops. It\u0026rsquo;s meant as a convenience for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.\nLuckily, VPP comes with an extensive set of APIs to allow it to be programmed. So in this series of posts, I\u0026rsquo;ll detail the work I\u0026rsquo;ve done to create a configuration utility that can take a YAML configuration file, compare it to a running VPP instance, and step-by-step plan through the API calls needed to safely apply the configuration to the dataplane. Welcome to vppcfg!\nIn this second post of the series, I want to talk a little bit about how planning a path from a running configuration to a desired new configuration might look like.\nNote: Code is on my Github, but it\u0026rsquo;s not quite ready for prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves) or reach out by contacting us.\nVPP Config: a DAG Before we dive into my vppcfg code, let me first introduce a mental model of how configuration is built. 
We rarely stop and think about it, but when we configure our routers (no matter if it\u0026rsquo;s a Cisco or a Juniper or a VPP router), in our mind we logically order the operations in a very particular way. To state the obvious, if I want to create a sub-interface which also has an address, I would create the sub-int before adding the address, right? Similarly, if I wanted to expose a sub-interface Hu12/0/0.100 in Linux as a LIP, I would create it only after having created a LIP for the parent interface Hu12/0/0, to satisfy Linux\u0026rsquo;s requirement that all sub-interfaces have a parent interface, like so:\nvpp# create sub HundredGigabitEthernet12/0/0 100 vpp# set interface ip address HundredGigabitEthernet12/0/0.100 192.0.2.1/29 vpp# lcp create HundredGigabitEthernet12/0/0 host-if ice0 vpp# lcp create HundredGigabitEthernet12/0/0.100 host-if ice0.100 vpp# set interface state HundredGigabitEthernet12/0/0 up vpp# set interface state HundredGigabitEthernet12/0/0.100 up Of course some of the ordering doesn\u0026rsquo;t strictly matter. For example, I can set the state of Hu12/0/0.100 up before adding the address, or after adding the address, or even after adding the LIP, but one thing is certain: I cannot set its state to up before it was created in the first place! In the other direction, when removing things, it\u0026rsquo;s easy to see that you cannot manipulate the state of a sub-interface after deleting it, so to cleanly remove the construction above, I would have to walk the statements back in reverse, like so:\nvpp# set interface state HundredGigabitEthernet12/0/0.100 down vpp# set interface state HundredGigabitEthernet12/0/0 down vpp# lcp delete HundredGigabitEthernet12/0/0.100 host-if ice0.100 vpp# lcp delete HundredGigabitEthernet12/0/0 host-if ice0 vpp# set interface ip address del HundredGigabitEthernet12/0/0.100 192.0.2.1/29 vpp# delete sub HundredGigabitEthernet12/0/0.100 Because of this reasonably straightforward ordering, it\u0026rsquo;s possible to construct a graph showing operations that depend on other operations having been completed beforehand. Such a graph is called a Directed Acyclic Graph, or DAG.\nFirst some theory (from Wikipedia): A directed graph is a DAG if and only if it can be topologically ordered, by arranging the vertices as a linear ordering that is consistent with all edge directions. DAGs have numerous scientific and computational applications, but the one I\u0026rsquo;m mostly interested in here is dependency mapping and computational scheduling.\nA graph is formed by vertices and by edges connecting pairs of vertices, where the vertices are objects that might exist in VPP (interfaces, bridge-domains, VXLAN tunnels, IP addresses, etc), and these objects are connected in pairs by edges. In the case of a directed graph, each edge has an orientation (or direction), from one (source) vertex to another (destination) vertex. A path in a directed graph is a sequence of edges having the property that the ending vertex of each edge in the sequence is the same as the starting vertex of the next edge in the sequence; a path forms a cycle if the starting vertex of its first edge equals the ending vertex of its last edge.
A directed acyclic graph is a directed graph that has no cycles, which in this particular case means that objects' existence can\u0026rsquo;t rely on other things that ultimately rely back on their own existence.\nWith that technobabble out of the way: practically speaking, the edges in this graph model dependencies. Let me give a few examples:\nThe arrow from Sub Interface pointing at BondEther and Physical Int makes the claim that for the sub-int to exist, it depends on the existence of either a BondEthernet, or a PHY. The arrow from the BondEther to the Physical Int makes the claim that for the BondEthernet to work, it must have one or more PHYs in it. There is no arrow between BondEther and Sub Interface, which makes the claim that they are independent: there is no need for a sub-int to exist in order for a BondEthernet to work. VPP Config: Ordering In my previous post, I talked about a bunch of constraints that make certain YAML configurations invalid (for example, having both dot1q and dot1ad on a sub-interface, which wouldn\u0026rsquo;t make any sense). Here, I\u0026rsquo;m going to talk about another type of constraint: Temporal Constraints are statements about the ordering of operations. With the example DAG above, I derive the following constraints:\nA parent interface must exist before a sub-interface can be created on it An interface (regardless of sub-int or phy) must exist before an IP address can be added to it A LIP can be created on a sub-int only if its parent PHY has a LIP LIPs must be removed from all sub-interfaces before a PHY\u0026rsquo;s LIP can be removed The admin-state of a sub-interface can only be up if its PHY is up \u0026hellip; and so on. But there\u0026rsquo;s a second thing to keep in mind, and this is a bit more specific to the VPP configuration operations themselves. Sometimes, I may find that an object already exists, say a sub-interface, but that it has configuration attributes that are not what I wanted. For example, I may have previously configured a sub-int to be of a certain encapsulation dot1q 1000 inner-dot1q 1234, but I changed my mind and want the sub-int to now be dot1ad 1000 inner-dot1q 1234 instead. Some attributes of an interface can be changed on the fly (like the MTU, for example), but some really cannot, and in my example here, the encapsulation change has to be done another way.\nI\u0026rsquo;ll make an obvious but hopefully helpful observation: I can\u0026rsquo;t create the second sub-int with the same subid, because one already exists (duh). The intuitive way to solve this, of course, is to delete the old sub-int first and then create a new sub-int with the correct attributes (dot1ad outer encapsulation).\nHere\u0026rsquo;s another scenario that illustrates the ordering: Let\u0026rsquo;s say I want to move an IP address from interface A to interface B. In VPP, I can\u0026rsquo;t configure the same IP address/prefixlen on two interfaces at the same time, so as with the previous scenario of the encap changing, I will want to remove the IP address from A before adding it to B.\nCome to think of it, there are lots of scenarios where remove-before-add is required:\nIf an interface was in bridge-domain A but now wants to be put in bridge-domain B, it\u0026rsquo;ll have to be removed from the first bridge before being added to the second bridge, because an interface can\u0026rsquo;t be in two bridges at the same time.
If an interface was a member of a BondEthernet, but will be moved to be a member of a bridge-domain now, it will have to be removed from the bond before being added to the bridge, because an interface can\u0026rsquo;t be both a bondethernet member and a member of a bridge at the same time. And to add to the list, the scenario above: A sub-interface that differs in its intended encapsulation must be removed before a new one with the same subid can be created. All of these cases can be modeled as edges (arrows) between vertices (objects) in the graph describing the ordering of operations in VPP! I\u0026rsquo;m now ready to draw two important conclusions:\nAll objects that differ from their intended configuration must be removed before being added elsewhere, in order to avoid them being referenced/used twice. All objects must be created before their attributes can be set. vppcfg: Path Planning By thinking about the configuration in this way, I can precisely predict the order of operations needed to go from any running dataplane configuration to any new target dataplane configuration. A so-called path planner emerges, which has three main phases of execution:\nPrune phase (remove objects from VPP that are not in the config) Create phase (add objects to VPP that are in the config but not in VPP) Sync phase, for each object in the configuration When removing things, care has to be taken to remove inner-most objects first (first removing LCP, then QinQ, Dot1Q, BondEthernet, and lastly PHY), because indeed, there exists a dependency relationship between objects in this DAG. Conversely, when creating objects, the edges flip their directionality, because creation must be done on outer-most objects first (first creating the PHY, then BondEthernet, Dot1Q, QinQ and lastly LCP).\nFor example, QinQ/QinAD sub-interfaces should be removed before their intermediary Dot1Q/Dot1AD can be removed. Another example: the MTU of parents should be raised before that of their children, while children should shrink before their parents.\nOrder matters.\nPruning: First, vppcfg will ensure all objects do not have attributes which they should not (e.g. IP addresses) and that objects are destroyed that are not needed (i.e. have been removed from the target config). After this phase, I am certain that any object that exists in the dataplane, both (a) has the right to exist (because it\u0026rsquo;s in the target configuration), and (b) has the correct create-time (i.e. non-syncable) attributes.\nCreating: Next, vppcfg will ensure that all objects that are not yet present (including the ones that it just removed because they were present but had incorrect attributes), get (re)created in the right order. After this phase, I am certain that all objects in the dataplane now (a) have the right to exist (because they are in the target configuration), (b) have the correct attributes, but newly, also that (c) all objects that are in the target configuration also got created and now exist in the dataplane.\nSyncing: Finally, all objects are synchronized with the target configuration (IP addresses, MTU etc), taking care to shrink children before their parents, and growing parents before their children (this is for the special case of any given sub-interface\u0026rsquo;s MTU having to be equal to or lower than their parent\u0026rsquo;s MTU).\nvppcfg: Demonstration I\u0026rsquo;ll create three configurations and let vppcfg path-plan between them.
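Before diving into the demos, here is a minimal sketch of how this kind of dependency ordering can be expressed in Python. The handful of objects and edges below are a simplified assumption for illustration only, not the actual vppcfg data model, but the idea is the same: a topological sort of the DAG yields a safe creation order, and reversing that order yields a safe pruning order.

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical, simplified dependency graph: each object maps to the set of
# objects it depends on (its parents in the DAG).
deps = {
    "phy Hu12/0/0": set(),
    "sub-int Hu12/0/0.100": {"phy Hu12/0/0"},
    "lcp ice0": {"phy Hu12/0/0"},
    "lcp ice0.100": {"sub-int Hu12/0/0.100", "lcp ice0"},
    "address 192.0.2.1/29": {"sub-int Hu12/0/0.100"},
}

# static_order() emits every node only after all of its dependencies.
create_order = list(TopologicalSorter(deps).static_order())
prune_order = list(reversed(create_order))

print("create:", create_order)  # the PHY first, the LCP on the sub-interface last
print("prune: ", prune_order)   # children first, the PHY last

Reversing the creation order for pruning is exactly the "remove inner-most objects first, create outer-most objects first" rule described above.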
I start a completely empty VPP dataplane which has two GigabitEthernet and two HundredGigabitEthernet interfaces:\npim@hippo:~/src/vpp$ make run _______ _ _ _____ ___ __/ __/ _ \\ (_)__ | | / / _ \\/ _ \\ _/ _// // / / / _ \\ | |/ / ___/ ___/ /_/ /____(_)_/\\___/ |___/_/ /_/ DBGvpp# show interface Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count GigabitEthernet3/0/0 1 down 9000/0/0/0 GigabitEthernet3/0/1 2 down 9000/0/0/0 HundredGigabitEthernet12/0/0 3 down 9000/0/0/0 HundredGigabitEthernet12/0/1 4 down 9000/0/0/0 local0 0 down 0/0/0/0 Demo 1: First time config (empty VPP) First, starting simple, I write the following YAML configuration called hippo4.yaml. It defines a few sub-interfaces, a bridgedomain with one QinQ sub-interface Hu12/0/0.101 in it, and it then cross-connects Gi3/0/0.100 with Hu12/0/1.100, keeping all sub-interfaces at an MTU of 2000 and their PHYs at an MTU of 9216:\ninterfaces: GigabitEthernet3/0/0: mtu: 9216 sub-interfaces: 100: mtu: 2000 l2xc: HundredGigabitEthernet12/0/1.100 GigabitEthernet3/0/1: description: Not Used HundredGigabitEthernet12/0/0: mtu: 9216 sub-interfaces: 100: mtu: 3000 101: mtu: 2000 encapsulation: dot1q: 100 inner-dot1q: 200 exact-match: True HundredGigabitEthernet12/0/1: mtu: 9216 sub-interfaces: 100: mtu: 2000 l2xc: GigabitEthernet3/0/0.100 bridgedomains: bd10: description: \u0026#34;Bridge Domain 10\u0026#34; mtu: 2000 interfaces: [ HundredGigabitEthernet12/0/0.101 ] If I offer this config to vppcfg and ask it to plan a path, there won\u0026rsquo;t be any pruning going on, because there are no objects in the newly started VPP dataplane that need to be deleted. But I do expect to see a bunch of sub-interface and one bridge-domain creation, followed by syncing a bunch of interfaces with bridge-domain memberships and L2 Cross Connects. 
Finally, the MTU of the interfaces will be sync\u0026rsquo;d to their configured values, and the path is planned like so:\npim@hippo:~/src/vppcfg$ ./vppcfg -c hippo4.yaml plan [INFO ] root.main: Loading configfile hippo4.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 create sub GigabitEthernet3/0/0 100 dot1q 100 exact-match create sub HundredGigabitEthernet12/0/0 100 dot1q 100 exact-match create sub HundredGigabitEthernet12/0/1 100 dot1q 100 exact-match create sub HundredGigabitEthernet12/0/0 101 dot1q 100 inner-dot1q 200 exact-match create bridge-domain 10 set interface l2 bridge HundredGigabitEthernet12/0/0.101 10 set interface l2 tag-rewrite HundredGigabitEthernet12/0/0.101 pop 2 set interface l2 xconnect GigabitEthernet3/0/0.100 HundredGigabitEthernet12/0/1.100 set interface l2 tag-rewrite GigabitEthernet3/0/0.100 pop 1 set interface l2 xconnect HundredGigabitEthernet12/0/1.100 GigabitEthernet3/0/0.100 set interface l2 tag-rewrite HundredGigabitEthernet12/0/1.100 pop 1 set interface mtu 9216 GigabitEthernet3/0/0 set interface mtu 9216 HundredGigabitEthernet12/0/0 set interface mtu 9216 HundredGigabitEthernet12/0/1 set interface mtu packet 1500 GigabitEthernet3/0/1 set interface mtu packet 9216 GigabitEthernet3/0/0 set interface mtu packet 9216 HundredGigabitEthernet12/0/0 set interface mtu packet 9216 HundredGigabitEthernet12/0/1 set interface mtu packet 2000 GigabitEthernet3/0/0.100 set interface mtu packet 3000 HundredGigabitEthernet12/0/0.100 set interface mtu packet 2000 HundredGigabitEthernet12/0/1.100 set interface mtu packet 2000 HundredGigabitEthernet12/0/0.101 set interface mtu 1500 GigabitEthernet3/0/1 set interface state GigabitEthernet3/0/0 up set interface state GigabitEthernet3/0/0.100 up set interface state GigabitEthernet3/0/1 up set interface state HundredGigabitEthernet12/0/0 up set interface state HundredGigabitEthernet12/0/0.100 up set interface state HundredGigabitEthernet12/0/0.101 up set interface state HundredGigabitEthernet12/0/1 up set interface state HundredGigabitEthernet12/0/1.100 up [INFO ] root.main: Planning succeeded On the vppctl commandline, I can simply cut-and-paste these CLI commands and the dataplane ends up configured exactly like was desired in the hippo4.yaml configuration file. One nice way to tell if the reconciliation of the config file into the running VPP instance was successful is by running the planner again with the same YAML config file. 
It should not find anything worth pruning, creating nor syncing, and indeed:\npim@hippo:~/src/vppcfg$ ./vppcfg -c hippo4.yaml plan [INFO ] root.main: Loading configfile hippo4.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 [INFO ] root.main: Planning succeeded Demo 2: Moving from one config to another To demonstrate how my reconciliation algorithm works in practice, I decide to invent a radically different configuration for Hippo, called hippo12.yaml, in which a new BondEthernet appears, two of its sub-interfaces are cross connected, Hu12/0/0 now gets a LIP and some IP addresses, and the bridge-domain bd10 is replaced by two others, bd1 and bd11, the former of which also sports a BVI (with a LIP called bvi1) and a VXLAN Tunnel bridged into bd1 for good measure:\nbondethernets: BondEthernet0: interfaces: [ GigabitEthernet3/0/0, GigabitEthernet3/0/1 ] interfaces: GigabitEthernet3/0/0: mtu: 9000 description: \u0026#34;LAG #1\u0026#34; GigabitEthernet3/0/1: mtu: 9000 description: \u0026#34;LAG #2\u0026#34; HundredGigabitEthernet12/0/0: lcp: \u0026#34;ice12-0-0\u0026#34; mtu: 9000 addresses: [ 192.0.2.17/30, 2001:db8:3::1/64 ] sub-interfaces: 1234: mtu: 1200 lcp: \u0026#34;ice0.1234\u0026#34; encapsulation: dot1q: 1234 exact-match: True 1235: mtu: 1100 lcp: \u0026#34;ice0.1234.1000\u0026#34; encapsulation: dot1q: 1234 inner-dot1q: 1000 exact-match: True HundredGigabitEthernet12/0/1: mtu: 2000 description: \u0026#34;Bridged\u0026#34; BondEthernet0: mtu: 9000 lcp: \u0026#34;bond0\u0026#34; sub-interfaces: 10: lcp: \u0026#34;bond0.10\u0026#34; mtu: 3000 100: mtu: 2500 l2xc: BondEthernet0.200 encapsulation: dot1q: 100 exact-match: False 200: mtu: 2500 l2xc: BondEthernet0.100 encapsulation: dot1q: 200 exact-match: False 500: mtu: 2000 encapsulation: dot1ad: 500 exact-match: False 501: mtu: 2000 encapsulation: dot1ad: 501 exact-match: False vxlan_tunnel1: mtu: 2000 loopbacks: loop0: lcp: \u0026#34;lo0\u0026#34; addresses: [ 10.0.0.1/32, 2001:db8::1/128 ] loop1: lcp: \u0026#34;bvi1\u0026#34; addresses: [ 10.0.1.1/24, 2001:db8:1::1/64 ] bridgedomains: bd1: mtu: 2000 bvi: loop1 interfaces: [ BondEthernet0.500, BondEthernet0.501, HundredGigabitEthernet12/0/1, vxlan_tunnel1 ] bd11: mtu: 1500 vxlan_tunnels: vxlan_tunnel1: local: 192.0.2.1 remote: 192.0.2.2 vni: 101 Before writing vppcfg, the art of moving from hippo4.yaml to this radically different hippo12.yaml would be a nightmare, and almost certainly have caused me to miss a step and cause an outage. 
But, due to the fundamental understanding of ordering, and the methodical execution of pruning, creating and syncing the objects, the path planner comes up with the following sequence, which I\u0026rsquo;ll break down in its three constituent phases:\npim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan [INFO ] root.main: Loading configfile hippo12.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 set interface state HundredGigabitEthernet12/0/0.101 down set interface state GigabitEthernet3/0/0.100 down set interface state HundredGigabitEthernet12/0/0.100 down set interface state HundredGigabitEthernet12/0/1.100 down set interface l2 tag-rewrite HundredGigabitEthernet12/0/0.101 disable set interface l3 HundredGigabitEthernet12/0/0.101 create bridge-domain 10 del set interface l2 tag-rewrite GigabitEthernet3/0/0.100 disable set interface l3 GigabitEthernet3/0/0.100 set interface l2 tag-rewrite HundredGigabitEthernet12/0/1.100 disable set interface l3 HundredGigabitEthernet12/0/1.100 delete sub HundredGigabitEthernet12/0/0.101 delete sub GigabitEthernet3/0/0.100 delete sub HundredGigabitEthernet12/0/0.100 delete sub HundredGigabitEthernet12/0/1.100 First, vppcfg concludes that Hu12/0/0.101, Hu12/0/1.100 and Gi3/0/0.100 are no longer needed, so it sets them all admin-state down. The bridge-domain bd10 no longer has the right to exist, the poor thing. But before it is deleted, the interface that was in bd10 can be pruned (membership depends on the bridge, so in pruning, dependencies are removed before dependents). Considering Hu12/0/1.101 and Gi3/0/0.100 were an L2XC pair before, they are returned to default (L3) mode and because it\u0026rsquo;s no longer needed, the VLAN Gymnastics tag rewriting is also cleaned up for both interfaces. Finally, the sub-interfaces that do not appear in the target configuration are deleted, completing the pruning phase.\nIt then continues with the create phase:\ncreate loopback interface instance 0 create loopback interface instance 1 create bond mode lacp load-balance l34 id 0 create vxlan tunnel src 192.0.2.1 dst 192.0.2.2 instance 1 vni 101 decap-next l2 create sub HundredGigabitEthernet12/0/0 1234 dot1q 1234 exact-match create sub BondEthernet0 10 dot1q 10 exact-match create sub BondEthernet0 100 dot1q 100 create sub BondEthernet0 200 dot1q 200 create sub BondEthernet0 500 dot1ad 500 create sub BondEthernet0 501 dot1ad 501 create sub HundredGigabitEthernet12/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match create bridge-domain 1 create bridge-domain 11 lcp create HundredGigabitEthernet12/0/0 host-if ice12-0-0 lcp create BondEthernet0 host-if bond0 lcp create loop0 host-if lo0 lcp create loop1 host-if bvi1 lcp create HundredGigabitEthernet12/0/0.1234 host-if ice0.1234 lcp create BondEthernet0.10 host-if bond0.10 lcp create HundredGigabitEthernet12/0/0.1235 host-if ice0.1234.1000 Here, interfaces are created in order of loopbacks first, then BondEthernets, then Tunnels, and finally sub-interfaces, first creating single-tagged and then creating dual-tagged sub-interfaces. Of course, the BondEthernet has to be created before any sub-int will be able to be created on it. 
Note that the QinQ Hu12/0/0.1235 will be created after its intermediary parent Hu12/0/0.1234 due to this ordering requirement.\nThen, the two new bridgedomains bd1 and bd11 are created, and finally the LIP plumbing is performed, starting with the PHY ice12-0-0 and BondEthernet bond0, then the two loopbacks, and only then advancing to the two single-tag dot1q interfaces and finally the QinQ interface. For LCPs, this is very important, because in Linux, the interfaces are a tree, not a list. ice12-0-0 must be created before its child ice0.1234@ice12-0-0 can be created, and only then can the QinQ ice0.1234.1000@ice0.1234 be created. This creation order follows from the DAG having an edge signalling an LCP depending on the sub-interface, and an edge between the sub-interface with two tags depending on the sub-interface with one tag, and an edge between the single-tagged sub-interface depending on its PHY.\nAfter all this work, vppcfg can assert (a) every object that now exists in VPP is in the target configuration and (b) that any object that exists in the configuration also is present in VPP (with the correct attributes).\nBut there\u0026rsquo;s one last thing to do, and that\u0026rsquo;s ensure that the attributes that can be changed at runtime (IP addresses, L2XCs, BondEthernet and bridge-domain members, etc) , are sync\u0026rsquo;d into their respective objects in VPP based on what\u0026rsquo;s in the target configuration:\nbond add BondEthernet0 GigabitEthernet3/0/0 bond add BondEthernet0 GigabitEthernet3/0/1 comment { ip link set bond0 address 00:25:90:0c:05:01 } set interface l2 bridge loop1 1 bvi set interface l2 bridge BondEthernet0.500 1 set interface l2 tag-rewrite BondEthernet0.500 pop 1 set interface l2 bridge BondEthernet0.501 1 set interface l2 tag-rewrite BondEthernet0.501 pop 1 set interface l2 bridge HundredGigabitEthernet12/0/1 1 set interface l2 tag-rewrite HundredGigabitEthernet12/0/1 disable set interface l2 bridge vxlan_tunnel1 1 set interface l2 tag-rewrite vxlan_tunnel1 disable set interface l2 xconnect BondEthernet0.100 BondEthernet0.200 set interface l2 tag-rewrite BondEthernet0.100 pop 1 set interface l2 xconnect BondEthernet0.200 BondEthernet0.100 set interface l2 tag-rewrite BondEthernet0.200 pop 1 set interface state GigabitEthernet3/0/1 down set interface mtu 9000 GigabitEthernet3/0/1 set interface state GigabitEthernet3/0/1 up set interface mtu packet 9000 GigabitEthernet3/0/0 set interface mtu packet 9000 HundredGigabitEthernet12/0/0 set interface mtu packet 2000 HundredGigabitEthernet12/0/1 set interface mtu packet 2000 vxlan_tunnel1 set interface mtu packet 1500 loop0 set interface mtu packet 1500 loop1 set interface mtu packet 9000 GigabitEthernet3/0/1 set interface mtu packet 1200 HundredGigabitEthernet12/0/0.1234 set interface mtu packet 3000 BondEthernet0.10 set interface mtu packet 2500 BondEthernet0.100 set interface mtu packet 2500 BondEthernet0.200 set interface mtu packet 2000 BondEthernet0.500 set interface mtu packet 2000 BondEthernet0.501 set interface mtu packet 1100 HundredGigabitEthernet12/0/0.1235 set interface state GigabitEthernet3/0/0 down set interface mtu 9000 GigabitEthernet3/0/0 set interface state GigabitEthernet3/0/0 up set interface state HundredGigabitEthernet12/0/0 down set interface mtu 9000 HundredGigabitEthernet12/0/0 set interface state HundredGigabitEthernet12/0/0 up set interface state HundredGigabitEthernet12/0/1 down set interface mtu 2000 HundredGigabitEthernet12/0/1 set interface state HundredGigabitEthernet12/0/1 up set 
interface ip address HundredGigabitEthernet12/0/0 192.0.2.17/30 set interface ip address HundredGigabitEthernet12/0/0 2001:db8:3::1/64 set interface ip address loop0 10.0.0.1/32 set interface ip address loop0 2001:db8::1/128 set interface ip address loop1 10.0.1.1/24 set interface ip address loop1 2001:db8:1::1/64 set interface state HundredGigabitEthernet12/0/0.1234 up set interface state HundredGigabitEthernet12/0/0.1235 up set interface state BondEthernet0 up set interface state BondEthernet0.10 up set interface state BondEthernet0.100 up set interface state BondEthernet0.200 up set interface state BondEthernet0.500 up set interface state BondEthernet0.501 up set interface state vxlan_tunnel1 up set interface state loop0 up set interface state loop1 up I\u0026rsquo;m not gonna lie, it\u0026rsquo;s a tonne of work, but it\u0026rsquo;s all a pretty straightforward juggle. The sync phase will look at each object in the config and ensure that the attributes that same object has in the dataplane are present and correct. In my demo, hippo12.yaml creates a lot of interfaces and IP addresses, and changes the MTU of pretty much every interface, but in order:\nThe bondethernet gets its members Gi3/0/0 and Gi3/0/1. As an interesting aside, when VPP creates a BondEthernet it\u0026rsquo;ll initially assign it an ephemeral MAC address. Then, when its first member is added, the MAC address of the BondEthernet will change to that of the first member. The comment reminds me to also set this MAC on the Linux device bond0. In the future, I\u0026rsquo;ll add some PyRoute2 code to do that automatically. BridgeDomains are next. The BVI loop1 is added first, then a few sub-interfaces and a tunnel, and VLAN tag-rewriting for tagged interfaces is configured. There are two bridges, but only one of them has members, so there\u0026rsquo;s not much (in fact, there\u0026rsquo;s nothing) to do for the other one. L2 Cross Connects can be changed at runtime, and they\u0026rsquo;re next. The two interfaces BE0.100 and BE0.200 are connected to one another and tag-rewrites are set up for them, considering they are both tagged sub-interfaces. MTU is next. There are two variants of this. The first one, set interface mtu, is actually a change in the DPDK driver to change the maximum allowable frame size. For this change, some interface types have to be brought down first, the max frame size changed, and then brought back up again. For all the others, the MTU will be changed in a specific order: PHYs will grow their MTU first, as growing a PHY is guaranteed to be always safe. Sub-interfaces will shrink QinX first, then Dot1Q/Dot1AD, then untagged interfaces. This is to ensure we do not leave VPP and LinuxCP in a state where a QinQ sub-int has a higher MTU than any of its parents. Sub-interfaces will grow untagged first, then Dot1Q/Dot1AD, and finally QinX sub-interfaces. Same reason as step 2, no sub-interface will end up with a higher MTU than any of its parents. PHYs will shrink their MTU last. The YAML configuration validation asserts that no PHY can have an MTU lower than any of its children, so this is safe. Finally, IP addresses are added to Hu12/0/0, loop0 and loop1. I can guarantee that adding IP addresses will not clash with any other interface, because pruning would\u0026rsquo;ve previously removed IP addresses from interfaces where they don\u0026rsquo;t belong.
And to finish off, the admin state for interfaces is set, again going from PHY, Bond, Tunnel, 1-tagged sub-interfaces and finally 2-tagged sub-interfaces and loopbacks. Let\u0026rsquo;s take it to the test:\npim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan -o hippo4-to-12.exec [INFO ] root.main: Loading configfile hippo12.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 [INFO ] vppcfg.reconciler.write: Wrote 94 lines to hippo4-to-12.exec [INFO ] root.main: Planning succeeded pim@hippo:~/src/vppcfg$ vppctl exec ~/src/vppcfg/hippo4-to-12.exec pim@hippo:~/src/vppcfg$ ./vppcfg -c hippo12.yaml plan [INFO ] root.main: Loading configfile hippo12.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 [INFO ] root.main: Planning succeeded Notice that after applying hippo4-to-12.exec, the planner had nothing else to say. VPP is now in the target configuration state, slick!\nDemo 3: Returning VPP to empty This one is easy, but shows the pruning in action. Let\u0026rsquo;s say I wanted to return VPP to a default configuration without any objects, and its interfaces all at MTU 1500:\ninterfaces: GigabitEthernet3/0/0: mtu: 1500 description: Not Used GigabitEthernet3/0/1: mtu: 1500 description: Not Used HundredGigabitEthernet12/0/0: mtu: 1500 description: Not Used HundredGigabitEthernet12/0/1: mtu: 1500 description: Not Used Simply applying that plan:\npim@hippo:~/src/vppcfg$ ./vppcfg -c hippo-empty.yaml plan -o 12-to-empty.exec [INFO ] root.main: Loading configfile hippo-empty.yaml [INFO ] vppcfg.config.valid_config: Configuration validated successfully [INFO ] root.main: Configuration is valid [INFO ] vppcfg.vppapi.connect: VPP version is 22.06-rc0~324-g247385823 [INFO ] vppcfg.reconciler.write: Wrote 66 lines to 12-to-empty.exec [INFO ] root.main: Planning succeeded pim@hippo:~/src/vppcfg$ vppctl vpp# exec ~/src/vppcfg/12-to-empty.exec vpp# show interface Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count GigabitEthernet3/0/0 1 up 1500/0/0/0 GigabitEthernet3/0/1 2 up 1500/0/0/0 HundredGigabitEthernet12/0/0 3 up 1500/0/0/0 HundredGigabitEthernet12/0/1 4 up 1500/0/0/0 local0 0 down 0/0/0/0 Final notes Now you may have been wondering why I would call the first file hippo4.yaml and the second one hippo12.yaml. This is because I have 20 such YAML files that bring Hippo into all sorts of esoteric configuration states, and I do this so that I can do a full integration test of any config morphing into any other config:\nfor i in hippo[0-9]*.yaml; do echo \u0026#34;Clearing: Moving to hippo-empty.yaml\u0026#34; ./vppcfg -c hippo-empty.yaml \u0026gt; /tmp/vppcfg-exec-empty [ -s /tmp/vppcfg-exec-empty ] \u0026amp;\u0026amp; vppctl exec /tmp/vppcfg-exec-empty for j in hippo[0-9]*.yaml; do echo \u0026#34; - Moving to $i .. 
\u0026#34; ./vppcfg -c $i \u0026gt; /tmp/vppcfg-exec_$i [ -s /tmp/vppcfg-exec_$i ] \u0026amp;\u0026amp; vppctl exec /tmp/vppcfg-exec_$i echo \u0026#34; - Moving from $i to $j\u0026#34; ./vppcfg -c $j \u0026gt; /tmp/vppcfg-exec_${i}_${j} [ -s /tmp/vppcfg-exec_${i}_${j} ] \u0026amp;\u0026amp; vppctl exec /tmp/vppcfg-exec_${i}_${j} echo \u0026#34; - Checking that from $j to $j is empty\u0026#34; ./vppcfg -c $j \u0026gt; /tmp/vppcfg-exec_${j}_${j}_null done done What this does is starts off Hippo with an empty config, then moves it to hippo1.yaml and from there it moves the configuration to each YAML file and back to hippo1.yaml. Doing this proves, that no matter which configuration I want to obtain, I can get there safely when the VPP dataplane config starts out looking like what is described in hippo1.yaml. I\u0026rsquo;ll then move it back to empty, and into hippo2.yaml, doing the whole cycle again. So for 20 files, this means ~400 or so configuration transitions. And some of these are special, notably moving from hippoN.yaml to the same hippoN.yaml should result in zero diffs.\nWith this path planner reasonably well tested, I have pretty high confidence that vppcfg can change the dataplane from any existing configuration to any desired target configuration.\nWhat\u0026rsquo;s next One thing that I didn\u0026rsquo;t mention yet, is that the vppcfg path planner works by reading the API configuration state exactly once (at startup), and then it figures out the CLI calls to print without needing to talk to VPP again. This is super useful as it\u0026rsquo;s a non-intrusive way to inspect the changes before applying them, and it\u0026rsquo;s a property I\u0026rsquo;d like to carry forward.\nHowever, I don\u0026rsquo;t necessarily think that emitting the CLI statements is the best user experience, it\u0026rsquo;s more for the purposes of analysis that they can be useful. What I really want to do is emit API calls after the plan is created and reviewed/approved, directly reprogramming the VPP dataplane, and likely the Linux network namespace interfaces as well, for example setting the MAC address of a BondEthernet as I showed in that one comment above, or setting interface alias names based on the configured descriptions.\nHowever, the VPP API set needed to do this is not 100% baked yet. For example, I observed crashes when tinkering with BVIs and Loopbacks (thread), and fixed a few obvious errors in the Linux CP API (gerrit) but there are still a few more issues to work through before I can set the next step with vppcfg.\nBut for now, it\u0026rsquo;s already helping me out tremendously at IPng Networks and I hope it\u0026rsquo;ll be useful for others, too.\n","date":"2022-04-02","desc":" About this series I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community merged the Linux Control Plane plugin. I wrote about its deployment to both regular servers like the Supermicro routers that run on our AS8298, as well as virtual machines running in KVM/Qemu.\nNow that I\u0026rsquo;ve been running VPP in production for about half a year, I can\u0026rsquo;t help but notice one specific drawback: VPP is a programmable dataplane, and by design it does not include any configuration or controlplane management stack. It\u0026rsquo;s meant to be integrated into a full stack by operators. For end-users, this unfortunately means that typing on the CLI won\u0026rsquo;t persist any configuration, and if VPP is restarted, it will not pick up where it left off. 
There\u0026rsquo;s one developer convenience in the form of the exec command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line by line. However, if any typo is made in the file, processing immediately stops. It\u0026rsquo;s meant as a convenience for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.\n","permalink":"https://ipng.ch/s/articles/2022/04/02/vpp-configuration-part2/","section":"articles","title":"VPP Configuration - Part2"},{"contents":" About this series I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community merged the Linux Control Plane plugin. I wrote about its deployment to both regular servers like the Supermicro routers that run on our AS8298, as well as virtual machines running in KVM/Qemu.\nNow that I\u0026rsquo;ve been running VPP in production for about half a year, I can\u0026rsquo;t help but notice one specific drawback: VPP is a programmable dataplane, and by design it does not include any configuration or controlplane management stack. It\u0026rsquo;s meant to be integrated into a full stack by operators. For end-users, this unfortunately means that typing on the CLI won\u0026rsquo;t persist any configuration, and if VPP is restarted, it will not pick up where it left off. There\u0026rsquo;s one developer convenience in the form of the exec command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line by line. However, if any typo is made in the file, processing immediately stops. It\u0026rsquo;s meant as a convenience for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.\nLuckily, VPP comes with an extensive set of APIs to allow it to be programmed. So in this series of posts, I\u0026rsquo;ll detail the work I\u0026rsquo;ve done to create a configuration utility that can take a YAML configuration file, compare it to a running VPP instance, and step-by-step plan through the API calls needed to safely apply the configuration to the dataplane. Welcome to vppcfg!\nIn this first post, let\u0026rsquo;s take a look at tablestakes: writing a YAML specification which models the main configuration elements of VPP, and then ensures that the YAML file is both syntactically as well as semantically correct.\nNote: Code is on my Github, but it\u0026rsquo;s not quite ready for prime-time yet. Take a look, and engage with us on GitHub (pull requests preferred over issues themselves) or reach out by contacting us.\nYAML Specification I decide to use Yamale, which is a schema description language and validator for YAML. YAML is a very simple, text/human-readable annotation format that can be used to store a wide range of data types. An interesting, but quick introduction to the YAML language can be found on CraftIRC\u0026rsquo;s GitHub page.\nThe first order of business for me is to devise a YAML file specification which models the configuration options of VPP objects in an idiomatic way. 
It\u0026rsquo;s appealing to make the decision to immediately build a higher-level abstraction, but I resist the urge and instead look at the types of objects that exist in VPP, for example the VNET_DEVICE_CLASS types:\nethernet_simulated_device_class: Loopbacks bvi_device_class: Bridge Virtual Interfaces dpdk_device_class: DPDK Interfaces rdma_device_class: RDMA Interfaces bond_device_class: BondEthernet Interfaces vxlan_device_class: VXLAN Tunnels There are several others, but I decide to start with these, as I\u0026rsquo;ll be needing each one of these in my own network. Looking over the device class specification, I learn a lot about how they are configured, which arguments (and of which type) they need, and which data structures they are represented as in VPP internally.\nSyntax Validation Yamale first reads a schema definition file, and then holds a given YAML file against the definition and shows whether the file is well-formed or not. As a practical example, let me start with the following definition:\n$ cat \u0026lt;\u0026lt; EOF \u0026gt; schema.yaml sub-interfaces: map(include(\u0026#39;sub-interface\u0026#39;),key=int()) --- sub-interface: description: str(exclude=\u0026#39;\\\u0026#39;\u0026#34;\u0026#39;,len=64,required=False) lcp: str(max=15,matches=\u0026#39;[a-z]+[a-z0-9-]*\u0026#39;,required=False) mtu: int(min=128,max=9216,required=False) addresses: list(ip(version=6),required=False) encapsulation: include(\u0026#39;encapsulation\u0026#39;,required=False) --- encapsulation: dot1q: int(min=1,max=4095,required=False) dot1ad: int(min=1,max=4095,required=False) inner-dot1q: int(min=1,max=4095,required=False) exact-match: bool(required=False) EOF This snippet creates two types, one called sub-interface and the other called encapsulation. The fields of the sub-interface, for example the description field, must follow the given typing to be valid. In the case of the description, it must be at most 64 characters long and it must not contain the \u0026#39; or \u0026quot; characters. The designation required=False notes that this is an optional field and may be omitted. The lcp field is also a string, but it must match a certain regular expression and start with a lowercase letter. The mtu field must be an integer between 128 and 9216, and so on.\nOne nice feature of Yamale is the ability to reference other object types. I do this here with the encapsulation field, which references an object type of the same name, and again, is optional. This means that when the encapsulation field is encountered in the YAML file Yamale is validating, it\u0026rsquo;ll hold the contents of that field to the schema below. There, we have dot1q, dot1ad, inner-dot1q and exact-match fields, which are all optional.\nThen, at the top of the file, I create the entrypoint schema, which expects YAML files to contain a map called sub-interfaces which is keyed by integers and contains values of type sub-interface, tying it all together.\nYamale comes with a commandline utility to do direct schema validation, which is handy. Let me demonstrate with the following terrible YAML:\n$ cat \u0026lt;\u0026lt; EOF \u0026gt; bad.yaml sub-interfaces: 100: description: \u0026#34;Pim\u0026#39;s illegal description\u0026#34; lcp: \u0026#34;NotAGoodName-AmIRite\u0026#34; mtu: 16384 addresses: 192.0.2.1 encapsulation: False EOF $ yamale -s schema.yaml bad.yaml Validating /home/pim/bad.yaml... Validation failed!
Error validating data \u0026#39;/home/pim/bad.yaml\u0026#39; with schema \u0026#39;/home/pim/schema.yaml\u0026#39; sub-interfaces.100.description: \u0026#39;Pim\u0026#39;s illegal description\u0026#39; contains excluded character \u0026#39;\u0026#39;\u0026#39; sub-interfaces.100.lcp: Length of NotAGoodName-AmIRite is greater than 15 sub-interfaces.100.lcp: NotAGoodName-AmIRite is not a regex match. sub-interfaces.100.mtu: 16384 is greater than 9216 sub-interfaces.100.addresses: \u0026#39;192.0.2.1\u0026#39; is not a list. sub-interfaces.100.encapsulation : \u0026#39;False\u0026#39; is not a map This file trips so many syntax violations, it should be a crime! In fact every single field is invalid. The one that is closest to being correct is the addresses field, but there I\u0026rsquo;ve set it up as a list (not a scalar), and even then, the list elements are expected to be IPv6 addresses, not IPv4 ones.\nSo let me try again:\n$ cat \u0026lt;\u0026lt; EOF \u0026gt; good.yaml sub-interfaces: 100: description: \u0026#34;Core: switch.example.com Te0/1\u0026#34; lcp: \u0026#34;xe3-0-0\u0026#34; mtu: 9216 addresses: [ 2001:db8::1, 2001:db8:1::1 ] encapsulation: dot1q: 100 exact-match: True EOF $ yamale good.yaml Validating /home/pim/good.yaml... Validation success! 👍 Semantic Validation When using Yamale, I can make a good start in syntax validation, that is to say, if a field is present, it follows a prescribed type. But that\u0026rsquo;s not the whole story, though. There are many configuration files I can think of that would be syntactically correct, but still make no sense in practice. For example, creating an encapsulation which has both dot1q as well as dot1ad, or creating a LIP (Linux Interface Pair) for sub-interface which does not have exact-match set. Or how\u0026rsquo;s about having two sub-interfaces with the same exact encapsulation?\nHere\u0026rsquo;s where semantic validation comes in to play. So I set out to create all sorts of constraints, and after reading the (Yamale validated, so syntactically correct) YAML file, I can hand it into a set of validators that check for violations of these constraints. By means of example, let me create a few constraints that might capture the issues described above:\nIf a sub-interface has encapsulation: It MUST have dot1q OR dot1ad set It MUST NOT have dot1q AND dot1ad both set If a sub-interface has one or more addresses: Its encapsulation MUST be set to exact-match It MUST have an lcp set. Each individual address MUST NOT occur in any other interface Config Validation After spending a few weeks thinking about the problem, I came up with 59 semantic constraints, that is to say things that might appear OK, but will yield impossible to implement or otherwise erratic VPP configurations. This article would be a bad place to discuss them all, so I will talk about the structure of vppcfg instead.\nFirst, a Validator class is instantiated with the Yamale schema. Then, a YAML file is read and passed to the validator\u0026rsquo;s validate() method. It will first run Yamale on the YAML file and make note of any issues that arise. If so, it will enumerate them in a list and return (bool, [list-of-messages]). The validation will have failed if the boolean returned is false, and if so, the list of messages will help understand which constraint was violated.\nThe vppcfg schema consists of toplevel types, which are validated in order:\nvalidate_bondethernets()\u0026rsquo;s job is to ensure that anything configured in the bondethernets toplevel map is correct. 
For example, if a BondEthernet device is created there, its members should reference existing interfaces, and it itself should make an appearance in the interfaces map, and the MTU of each member should be equal to the MTU of the BondEthernet, and so on. See config/bondethernet.py for a complete rundown. validate_loopbacks() is pretty straight forward. It makes a few common assertions, such as that if the loopback has addresses, it must also have an LCP, and if it has an LCP, that no other interface has the same LCP name, and that all of the addresses configured are unique. validate_vxlan_tunnels() Yamale already asserts that the local and remote fields are present and an IP address. The semantic validator ensures that the address family of the tunnel endpoints are the same, and that the used VNI is unique. validate_bridgedomains() fiddles with its Bridge Virtual Interface, making sure that its addresses and LCP name are unique. Further, it makes sure that a given member interface is in at most one bridge, and that said member is in L2 mode, in other words, that it doesn\u0026rsquo;t have an LCP or an address. An L2 interface can be either in a bridgedomain, or act as an L2 Cross Connect, but not both. Finally, it asserts that each member has an MTU identical to the bridge\u0026rsquo;s MTU value. validate_interfaces() is by far the most complex, but a few common things worth calling out is that each sub-interface must have a unique encapsulation, and if a given QinQ or QinAD 2-tagged sub-interface has an LCP, that there exist a parent Dot1Q or Dot1AD interface with the correct encapsulation, and that it also has an LCP. See config/interface.py for an extensive overview. Testing Of course, in a configuration model so complex as a VPP router, being able to do a lot of validation helps ensure that the constraints above are implemented correctly. To help this along, I use regular unittesting as provided by the Python3 unittest framework, but I extend it to run as well a special kind of test which I call a YAMLTest.\nUnit Testing This is bread and butter, and should be straight forward for software engineers. I took a model of so called test-driven development, where I start off by writing a test, which of course fails because the code hasn\u0026rsquo;t been implemented yet. Then I implement the code, and run this and all other unittests expecting them to pass.\nLet me give an example based on BondEthernets, with a YAML config file as follows:\nbondethernets: BondEthernet0: interfaces: [ GigabitEthernet1/0/0, GigabitEthernet1/0/1 ] interfaces: GigabitEthernet1/0/0: mtu: 3000 GigabitEthernet1/0/1: mtu: 3000 GigabitEthernet2/0/0: mtu: 3000 sub-interfaces: 100: mtu: 2000 BondEthernet0: mtu: 3000 lcp: \u0026#34;be012345678\u0026#34; addresses: [ 192.0.2.1/29, 2001:db8::1/64 ] sub-interfaces: 100: mtu: 2000 addresses: [ 192.0.2.9/29, 2001:db8:1::1/64 ] As I mentioned when discussing the semantic constraints, there\u0026rsquo;s a few here that jump out at me. First, the BondEthernet members Gi1/0/0 and Gi1/0/1 must exist. There is one BondEthernet defined in this file (obvious, I know, but bear with me), and Gi2/0/0 is not a bond member, and certainly Gi2/0/0.100 is not a bond member, because having a sub-interface as an LACP member would be super weird. 
Taking things like this into account, here\u0026rsquo;s a few tests that could assert that the behavior of the bondethernets map in the YAML config is correct:\nclass TestBondEthernetMethods(unittest.TestCase): def setUp(self): with open(\u0026#34;unittest/test_bondethernet.yaml\u0026#34;, \u0026#34;r\u0026#34;) as f: self.cfg = yaml.load(f, Loader = yaml.FullLoader) def test_get_by_name(self): ifname, iface = bondethernet.get_by_name(self.cfg, \u0026#34;BondEthernet0\u0026#34;) self.assertIsNotNone(iface) self.assertEqual(\u0026#34;BondEthernet0\u0026#34;, ifname) self.assertIn(\u0026#34;GigabitEthernet1/0/0\u0026#34;, iface[\u0026#39;interfaces\u0026#39;]) self.assertNotIn(\u0026#34;GigabitEthernet2/0/0\u0026#34;, iface[\u0026#39;interfaces\u0026#39;]) ifname, iface = bondethernet.get_by_name(self.cfg, \u0026#34;BondEthernet-notexist\u0026#34;) self.assertIsNone(iface) self.assertIsNone(ifname) def test_members(self): self.assertTrue(bondethernet.is_bond_member(self.cfg, \u0026#34;GigabitEthernet1/0/0\u0026#34;)) self.assertTrue(bondethernet.is_bond_member(self.cfg, \u0026#34;GigabitEthernet1/0/1\u0026#34;)) self.assertFalse(bondethernet.is_bond_member(self.cfg, \u0026#34;GigabitEthernet2/0/0\u0026#34;)) self.assertFalse(bondethernet.is_bond_member(self.cfg, \u0026#34;GigabitEthernet2/0/0.100\u0026#34;)) def test_is_bondethernet(self): self.assertTrue(bondethernet.is_bondethernet(self.cfg, \u0026#34;BondEthernet0\u0026#34;)) self.assertFalse(bondethernet.is_bondethernet(self.cfg, \u0026#34;BondEthernet-notexist\u0026#34;)) self.assertFalse(bondethernet.is_bondethernet(self.cfg, \u0026#34;GigabitEthernet1/0/0\u0026#34;)) def test_enumerators(self): ifs = bondethernet.get_bondethernets(self.cfg) self.assertEqual(len(ifs), 1) self.assertIn(\u0026#34;BondEthernet0\u0026#34;, ifs) self.assertNotIn(\u0026#34;BondEthernet-noexist\u0026#34;, ifs) Every single function that is defined in the file config/bondethernet.py (there are four) will have an accompanying unittest to ensure it works as expected. And every validator module, will have a suite of unittests fully covering their functionality. In total, I wrote a few dozen unit tests like this, in an attempt to be reasonably certain that the config validator functionality works as advertised.\nYAML Testing I added one additional class of unittest called a YAMLTest. What happens here is that a certain YAML configuration file, which may be valid or have errors, is offered to the end to end config parser (so both the Yamale schema validator as well as the semantic validators), and all errors are accounted for. 
As an example, two sub-interfaces on the same parent cannot have the same encapsulation, so offering the following file to the config validator is expected to trip errors:\n$ cat \u0026lt;\u0026lt; EOF \u0026gt; unittest/yaml/error-subinterface1.yaml test: description: \u0026#34;Two subinterfaces can\u0026#39;t have the same encapsulation\u0026#34; errors: expected: - \u0026#34;sub-interface .*.100 does not have unique encapsulation\u0026#34; - \u0026#34;sub-interface .*.102 does not have unique encapsulation\u0026#34; count: 2 --- interfaces: GigabitEthernet1/0/0: sub-interfaces: 100: description: \u0026#34;VLAN 100\u0026#34; 101: description: \u0026#34;Another VLAN 100, but without exact-match\u0026#34; encapsulation: dot1q: 100 102: description: \u0026#34;Another VLAN 100, but without exact-match\u0026#34; encapsulation: dot1q: 100 exact-match: True EOF You can see the file here has two YAML documents (separated by ---), the first one explains to the YAMLTest class what to expect. There can either be no errors (in which case test.errors.count=0), or there can be specific errors that are expected. In this case, Gi1/0/0.100 and Gi1/0/0.102 have the same encapsulation but Gi1/0/0.101 is unique (if you\u0026rsquo;re curious, this is because the encap on 100 and 102 has exact-match, but the one on 101 does not have exact-match).\nThe implementation of this YAMLTest class is in tests.py, which in turn runs all YAML tests on the files it finds in unittest/yaml/*.yaml (currently 47 specific cases are tested there, which cover 100% of the semantic constraints), and regular unittests (currently 42, which is a coincidence, I swear!)\nWhat\u0026rsquo;s next? These tests, together, give me a pretty strong assurance that any given YAML file that passes the validator is indeed a valid configuration for VPP. In my next post, I\u0026rsquo;ll go one step further, and talk about applying the configuration to a running VPP instance, which is of course the overarching goal. But I would not want to mess up my (or your!)
VPP router by feeding it garbage, so the lions\u0026rsquo; share of my time so far on this project has been to assert the YAML file is both syntactically and semantically valid.\nIn the mean time, you can take a look at my code on GitHub, but to whet your appetite, here\u0026rsquo;s a hefty configuration that demonstrates all implemented types:\nbondethernets: BondEthernet0: interfaces: [ GigabitEthernet3/0/0, GigabitEthernet3/0/1 ] interfaces: GigabitEthernet3/0/0: mtu: 9000 description: \u0026#34;LAG #1\u0026#34; GigabitEthernet3/0/1: mtu: 9000 description: \u0026#34;LAG #2\u0026#34; HundredGigabitEthernet12/0/0: lcp: \u0026#34;ice0\u0026#34; mtu: 9000 addresses: [ 192.0.2.17/30, 2001:db8:3::1/64 ] sub-interfaces: 1234: mtu: 1200 lcp: \u0026#34;ice0.1234\u0026#34; encapsulation: dot1q: 1234 exact-match: True 1235: mtu: 1100 lcp: \u0026#34;ice0.1234.1000\u0026#34; encapsulation: dot1q: 1234 inner-dot1q: 1000 exact-match: True HundredGigabitEthernet12/0/1: mtu: 2000 description: \u0026#34;Bridged\u0026#34; BondEthernet0: mtu: 9000 lcp: \u0026#34;be0\u0026#34; sub-interfaces: 100: mtu: 2500 l2xc: BondEthernet0.200 encapsulation: dot1q: 100 exact-match: False 200: mtu: 2500 l2xc: BondEthernet0.100 encapsulation: dot1q: 200 exact-match: False 500: mtu: 2000 encapsulation: dot1ad: 500 exact-match: False 501: mtu: 2000 encapsulation: dot1ad: 501 exact-match: False vxlan_tunnel1: mtu: 2000 loopbacks: loop0: lcp: \u0026#34;lo0\u0026#34; addresses: [ 10.0.0.1/32, 2001:db8::1/128 ] loop1: lcp: \u0026#34;bvi1\u0026#34; addresses: [ 10.0.1.1/24, 2001:db8:1::1/64 ] bridgedomains: bd1: mtu: 2000 bvi: loop1 interfaces: [ BondEthernet0.500, BondEthernet0.501, HundredGigabitEthernet12/0/1, vxlan_tunnel1 ] bd11: mtu: 1500 vxlan_tunnels: vxlan_tunnel1: local: 192.0.2.1 remote: 192.0.2.2 vni: 101 The vision for my VPP Configuration utility is that it can move from any existing VPP configuration to any other (validated successfully) configuration with a minimal amount of steps, and that it will plan its way declaratively from A to B, ordering the calls to the API safely and quickly. Interested? Good, because I do expect that a utility like this would be very valuable to serious VPP users!\n","date":"2022-03-27","desc":" About this series I use VPP - Vector Packet Processor - extensively at IPng Networks. Earlier this year, the VPP community merged the Linux Control Plane plugin. I wrote about its deployment to both regular servers like the Supermicro routers that run on our AS8298, as well as virtual machines running in KVM/Qemu.\nNow that I\u0026rsquo;ve been running VPP in production for about half a year, I can\u0026rsquo;t help but notice one specific drawback: VPP is a programmable dataplane, and by design it does not include any configuration or controlplane management stack. It\u0026rsquo;s meant to be integrated into a full stack by operators. For end-users, this unfortunately means that typing on the CLI won\u0026rsquo;t persist any configuration, and if VPP is restarted, it will not pick up where it left off. There\u0026rsquo;s one developer convenience in the form of the exec command-line (and startup.conf!) option, which will read a file and apply the contents to the CLI line by line. However, if any typo is made in the file, processing immediately stops. 
It\u0026rsquo;s meant as a convenience for VPP developers, and is certainly not a useful configuration method for all but the simplest topologies.\n","permalink":"https://ipng.ch/s/articles/2022/03/27/vpp-configuration-part1/","section":"articles","title":"VPP Configuration - Part1"},{"contents":"Introduction From time to time, I wish I could be made aware of failures earlier. There are two events, in particular, that I am interested to know about very quickly, as they may impact service at AS8298:\nOpen Shortest Path First (OSPF) adjacency removals. OSPF is a link-state protocol and it knows, when a physical link goes down, that the peer (neighbor) is no longer reachable. It can then recompute paths to other routers fairly quickly. But if the link stays up while connectivity is interrupted, for example because there is a switch in the path, it can take a relatively long time to detect. Bidirectional Forwarding Detection (BFD) session timeouts. BFD sets up a rapid (for example every 50ms, or 20Hz) unidirectional UDP stream between two hosts. If a number of packets (for example 40 packets, or 2 seconds) are not received, the link can be assumed to be dead. Notably, BIRD, as many other vendors do, can combine the two. At IPng, each OSPF adjacency is protected by BFD. What happens is that once an OSPF enabled link comes up, OSPF Hello packets will be periodically transmitted (with a period called a Hello Timer, typically once every 10 seconds). When a number of these are missed (called a Dead Timer, typically 40 seconds), the neighbor is considered missing in action and the session is cleaned up.\nTo help recover from link failure faster than 40 seconds, a new BFD session can be set up from any neighbor that sends a Hello packet. From then on, BFD will send a steady stream of UDP packets, and expect the neighbor to do the same. If BFD detects a timeout, it can inform BIRD to take action well before the OSPF Dead Timer.\nVery strict timers are known to be used, for example 10ms and 5 missed packets, or 50ms (!!) of timeout. But at IPng, in the typical example above, I instruct BFD to send packets every 50ms, and time out after 40 missed packets, or two (2) seconds of link downtime. Considering BIRD+VPP converge a full routing table in about 7 seconds, that gives me an end-to-end recovery time of under 10 seconds, which is respectable, all the while avoiding triggering on false positives.\nI\u0026rsquo;d like to be made aware of these events, which could signal a darkfiber cut or WDM optic failure, an EoMPLS (ie Virtual Leased Line, or VLL) failure, or a non-recoverable VPP dataplane crash. To a lesser extent, being made explicitly aware of BGP adjacencies to downstream (IP Transit customers) or upstream (IP Transit providers) can be useful.\nSyslog NG There are two parts to this. First, I want to have a (set of) central receiver servers that will each receive messages from the routers in the field. I decide to take three servers: the main one being nms.ipng.nl, which runs LibreNMS, and further two read-only route collectors rr0.ddln0.ipng.ch at our own DDLN colocation in Zurich, and rr0.nlams0.ipng.ch running at Coloclue in DCG, Amsterdam.\nOf course, it would be a mistake to use UDP as a transport for messages that discuss potential network outages. Having receivers in multiple places in the network does help a little bit. But I decide to configure the server (and the clients) later to use TCP.
This way, messages are queued to be sent, and if the TCP connection has to be rerouted when the underlying network converges, I am pretty certain that the messages will arrive at the central logserver eventually.\nSyslog Server The configuration for each of the receiving servers is the same, very straight forward:\n$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/syslog-ng/conf.d/listen.conf template t_remote { template(\u0026#34;\\$ISODATE \\$FULLHOST_FROM [\\$LEVEL] \\${PROGRAM}: \\${MESSAGE}\\n\u0026#34;); template_escape(no); }; source s_network_tcp { network( transport(\u0026#34;tcp\u0026#34;) ip(\u0026#34;::\u0026#34;) ip-protocol(6) port(601) max-connections(300) ); }; destination d_ipng { file(\u0026#34;/var/log/ipng.log\u0026#34; template(t_remote) template-escape(no)); }; log { source(s_network_tcp); destination(d_ipng); }; EOF $ sudo systemctl restart syslog-ng First, I define a template which logs in a consistent and predictable manner. Then, I configure a source which listens on IPv4 and IPv6 on TCP port 601, which allows for more than the default 10 connections. I configure a destination into a file, using the template. Then I tie the log source into the destination, and restart syslog-ng.\nOne thing that took me a while to realize is that for syslog-ng, the parser applied to incoming messages is different depending on the port used (ref):\n514, both TCP and UDP, for RFC3164 (BSD-syslog) formatted traffic 601 TCP, for RFC5424 (IETF-syslog) formatted traffic 6514 TCP, for TLS-encrypted traffic (of IETF-syslog messages) After seeing malformed messages in the syslog, notably with duplicate host/program/timestamp, I ultimately understood that this was because I was sending RFC5424 style messages to an RFC3164 enabled port (514). Once I moved the transport to be port 601, the parser matched and loglines were correct.\nAnd another detail \u0026ndash; I feel a little bit proud for not forgetting to add a logrotate entry for this new log file, keeping 10 days worth of compressed logs:\n$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/logrotate.d/syslog-ng-ipng /var/log/ipng.log { rotate 10 daily missingok notifempty delaycompress compress postrotate invoke-rc.d syslog-ng reload \u0026gt; /dev/null endscript } EOF I open up the firewall in these new syslog servers for TCP port 601, from any loopback addresses on AS8298\u0026rsquo;s network.\nSyslog Clients The clients install syslog-ng-core (which avoids all of the extra packages). On the routers, I have to make sure that the syslog server runs in the dataplane namespace, otherwise it will not have connectivity to send its messages. And, quite importantly, I should make sure that the TCP connections are bound to the loopback address of the router, not any arbitrary interface, as those could go down, rendering the TCP connection useless. 
So taking nlams0.ipng.ch as an example, here\u0026rsquo;s a configuration snippet:\n$ sudo apt install syslog-ng-core $ sudo sed -i -e \u0026#39;s,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,\u0026#39; \\ /lib/systemd/system/syslog-ng.service $ LO4=194.1.163.32 $ LO6=2001:678:d78::8 $ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/syslog-ng/conf.d/remote.conf destination d_nms_tcp { tcp(\u0026#34;194.1.163.89\u0026#34; localip(\u0026#34;$LO4\u0026#34;) port(601)); }; destination d_rr0_nlams0_tcp { tcp(\u0026#34;2a02:898:146::4\u0026#34; localip(\u0026#34;$LO6\u0026#34;) port(601)); }; destination d_rr0_ddln0_tcp { tcp(\u0026#34;2001:678:d78:4::1:4\u0026#34; localip(\u0026#34;$LO6\u0026#34;) port(601)); }; filter f_bird { program(bird); }; log { source(s_src); filter(f_bird); destination(d_nms_tcp); }; log { source(s_src); filter(f_bird); destination(d_rr0_nlams0_tcp); }; log { source(s_src); filter(f_bird); destination(d_rr0_ddln0_tcp); }; EOF $ sudo systemctl restart syslog-ng Here, I create simply three destination entries, one for each log-sink. Then I create a filter that grabs logs sent, but only for the BIRD server. You can imagine that later, I can add other things to this \u0026ndash; for example keepalived for VRRP failovers. Finally, I tie these together by applying the filter to the source and sending the result to each syslog server.\nSo far, so good.\nBird For consistency, (although not strictly necessary for the logging and further handling), I add ISO data timestamping and enable syslogging in /etc/bird/bird.conf:\ntimeformat base iso long; timeformat log iso long; timeformat protocol iso long; timeformat route iso long; log syslog all; And for the two protocols of interest, I add debug { events }; to the BFD and OSPF protocols. Note that bfd on stanza in the OSPF interfaces \u0026ndash; this instructs BIRD to create BFD session for each of the neighbors that are found on such an interface, and if BFD were to fail, tear down the adjacency faster than the regular Dead Timer timeouts.\nprotocol bfd bfd1 { debug { events }; interface \u0026#34;*\u0026#34; { interval 50 ms; multiplier 40; }; } protocol ospf v2 ospf4 { debug { events }; ipv4 { export filter ospf_export; import all; }; area 0 { interface \u0026#34;loop0\u0026#34; { stub yes; }; interface \u0026#34;xe1-3.100\u0026#34; { type pointopoint; cost 61; bfd on; }; interface \u0026#34;xe1-3.200\u0026#34; { type pointopoint; cost 75; bfd on; }; }; } This will emit loglines for (amongst others), state changes on BFD neighbors and OSPF adjacencies. There are a lot of messages to choose from, but I found that the following messages contain the minimally needed information to convey links going down or up (both from BFD\u0026rsquo;s point of view as well as from OSPF and OSPFv3\u0026rsquo;s point of view). I can demonstrate that by making the link between Hippo and Rhino go down (ie. 
by shutting the switchport, or unplugging the cable).\nAnd after this, I can see on nms.ipng.nl that the logs start streaming in:\npim@nms:~$ tail -f /var/log/ipng.log | egrep \u0026#39;(ospf[46]|bfd1):.*changed state.*to (Down|Up|Full)\u0026#39; 2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Up to Down 2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Full to Down 2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Up to Down 2022-02-24T18:12:26+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Full to Down 2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: ospf6: Neighbor 192.168.10.1 on e2 changed state from Loading to Full 2022-02-24T18:17:18+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to fe80::5054:ff:fe01:1001 changed state from Init to Up 2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: ospf4: Neighbor 192.168.10.1 on e2 changed state from Loading to Full 2022-02-24T18:17:22+00:00 hippo.btl.ipng.ch [debug] bird: bfd1: Session to 192.168.10.17 changed state from Down to Up And now I can see that important events are detected and sent , using reliable TCP transport, to multiple logging machines, these messages about BFD and OSPF adjacency changes now make it to a central machine.\nTelegram Bot Of course I can go tail the logfile on one of the servers, but I think it\u0026rsquo;d be a bit more elegant to have a computer do the pattern matching for me. One way might be to use the syslog-ng destination feature program() (ref), which pipes these logs through a userspace process, receiving them on stdin and doing interesting things with them, such as interacting with Telegram, the delivery mechanism of choice for IPng\u0026rsquo;s monitoring systems. 
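To make that program() idea concrete, here is a minimal sketch of such a consumer: it reads loglines on stdin, matches one pattern, and posts hits to the Telegram HTTP API. The token, chat-id and pattern are placeholders, and this is only an illustration of the concept, not the bot I ended up writing:

#!/usr/bin/env python3
# Minimal illustration: syslog-ng program() would feed loglines on stdin;
# anything matching PATTERN is forwarded to a Telegram group chat.
# TOKEN and CHAT_ID are placeholders, not real credentials.
import re
import sys

import requests

TOKEN = "123456:replace-me"
CHAT_ID = -1001234567890
PATTERN = re.compile(r"(ospf[46]|bfd1):.*changed state.*to (Down|Up|Full)")

def notify(text):
    # The Telegram Bot API sendMessage method takes a chat_id and a text
    requests.post(f"https://api.telegram.org/bot{TOKEN}/sendMessage",
                  data={"chat_id": CHAT_ID, "text": text}, timeout=10)

for line in sys.stdin:
    if PATTERN.search(line):
        notify(line.strip())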
Building such a Telegram-enabled bot is very straightforward, thanks to the excellent documentation of the Telegram API, and the existence of python-telegram-bot (ref).\nHowever, to keep my bot from being tied at the hip to syslog-ng, I decide to simply tail a number of logfiles from the commandline (ie pttb /var/log/*.log) - and here emerges the name of my little bot: Python Telegram Tail Bot, or pttb for short, which:\nTails the syslog logstream from one or more files, ie /var/log/ipng.log Pattern matches on loglines, after which an incident is created Waits for a predefined number of seconds (which may be zero) to see if more loglines match, adding them to the incident Holds the incident against a list of known regular expression silences, throwing away those which aren\u0026rsquo;t meant to be distributed Sends to a predefined group chat those incidents which aren\u0026rsquo;t silenced The bot should allow for the following features, based on a YAML configuration file, which will allow it to be restarted and upgraded:\nA (mandatory) TOKEN to interact with the Telegram API A (mandatory) single chat-id - messages will be sent to this Telegram group chat An (optional) list of logline triggers, consisting of: a regular expression to match in the logstream, a grace period to coalesce additional loglines of the same trigger into the incident, and a description to send once the incident is sent to Telegram An (optional) list of silences, consisting of: a regular expression to match against any incident message data, an expiry timestamp, and a description carrying the reason for the silence The bot will start up, announce itself on the chat-id group, and then listen on Telegram for the following commands:\n/help - a list of available commands /trigger - without parameters, list the current triggers /trigger add \u0026lt;regexp\u0026gt; [duration] [\u0026lt;message\u0026gt;] - with one parameter, set a trigger on a regular expression. Optionally, add a duration in seconds between [0..3600\u0026gt;, within which additional matched loglines will be added to the same incident, and an optional message to include in the Telegram alert. /trigger del \u0026lt;idx\u0026gt; - with one parameter, remove the trigger with that index (use /trigger to see the list). /silence - without parameters, list the current silences. /silence add \u0026lt;regexp\u0026gt; [duration] [\u0026lt;reason\u0026gt;] - with one parameter, set a default silence for 1d; optionally add a duration in the form of [1-9][0-9]*([hdm]) which defaults to hours (and can be days or minutes), and an optional reason for the silence. /silence del \u0026lt;idx\u0026gt; - with one parameter, remove the silence with that index (use /silence to see the list). /stfu [duration] - a shorthand for a silence with regular expression .*, will suppress all notifications, with a duration similar to the /silence add subcommand. /stats - shows some runtime statistics, notably how many loglines were processed, how many incidents created, and how many were sent or suppressed due to a silence. It will save its configuration file any time a silence or trigger is added or deleted. It will (obviously) then start sending incidents to the chat-id group-chat when they occur.\nResults And a few fun hours of hacking later, I submitted a first rough approximation of a useful syslog scanner telegram bot on Github.
It does seem to work, although not all functions are implemented yet (I\u0026rsquo;ll get them done in the month of March, probably):\nSo now I\u0026rsquo;ll be pretty quickly and elegantly kept up to date by this logscanner, in addition to my already existing LibreNMS logging, monitoring and alerting. If you find this stuff useful, feel free to grab a copy from Github, the code is open source and licensed with a liberal APACHE 2.0 license, and is based on excellent work of Python Telegram Bot.\n","date":"2022-03-03","desc":"Introduction From time to time, I wish I could be made aware of failures earlier. There are two events, in particular, that I am interested to know about very quickly, as they may impact service at AS8298:\nOpen Shortest Path First (OSPF) adjacency removals. OSPF is a link-state protocol and it knows when a physical link goes down, that the peer (neighbor) is no longer reachable. It can then recompute paths to other routers fairly quickly. But if the link stays up but connectivity is interrupted, for example because there is a switch in the path, it can take a relatively long time to detect. Bidirectional Forwarding Detection (BFD) session timeouts. BFD sets up a rapid (for example every 50ms or 20Hz) of a unidirectional UDP stream between two hosts. If a number of packets (for example 40 packets or 2 seconds) are not received, a link can be assumed to be dead. Notably, BIRD, as many other vendors do, can combine the two. At IPng, each OSPF adjacency is protected by BFD. What happens is that once an OSPF enabled link comes up, OSPF Hello packets will be periodically transmitted (with a period called called a Hello Timer, typically once every 10 seconds). When a number of these are missed (called a Dead Timer, typically 40 seconds), the neighbor is considered missing in action and the session cleaned up.\n","permalink":"https://ipng.ch/s/articles/2022/03/03/syslog-to-telegram/","section":"articles","title":"Syslog to Telegram"},{"contents":"Introduction As with most companies, it started with an opportunity. I got my hands on a location which has a raised floor at 60m2 and a significant power connection of 3x200A, and a metro fiber connection at 10Gbps. I asked my buddy Luuk \u0026lsquo;what would it take to turn this into a colo?\u0026rsquo; and the rest is history. Thanks to Daedalean AG who benefit from this infrastructure as well, making this first small colocation site was not only interesting, but also very rewarding.\nThe colocation business is murder in Zurich - there are several very large datacenters (Equinix, NTT, Colozüri, Interxion) all directly in or around the city, and I\u0026rsquo;m known to dwell in most of these. The networking and service provider industry is quite small and well organized into Network Operator Groups, so I work under the assumption that everybody knows everybody. I definitely like to pitch in and share what I have built, both the physical bits but also the narrative.\nThis article describes the small serverroom I built at a partner\u0026rsquo;s premises in Zurich Albisrieden. The colo is open for business, that is to say: Please feel free to reach out if you\u0026rsquo;re interested.\nPhysical It starts with a competent power distribution. Pictured to the right is a 200Amp 3-phase distribution panel at Daedalean AG in Zurich. 
There\u0026rsquo;s another similar panel on the other side of the floor, and both are directly connected to EWZ and have plenty of smaller and larger breakers available (the room it\u0026rsquo;s in used to be a serverroom of the previous tenant, the City of Zurich).\nI start with installing a set of Eastron SDM630 power meters, so that I know what is being used by IPng Networks, and can pay my dues, as well as remotely read the state and power consumption using MODBUS, yielding two 3-phase supplies with 32A breakers on each.\nThen, I go scouring on the Internet, to find a few second hand 19\u0026quot; racks. I actually find two 800x1000mm racks but they are all the way across Switzerland. However, they\u0026rsquo;re very affordable, but what\u0026rsquo;s better, they each come with two APC power distribution and remotely switchable zero-u power distribution strips. Score!\nLaura and I rented a little (with which I mean: huge) minivan and went to pick up the racks. The folks at Daedalean kindly helped us schlepp them up the stairs to the serverroom, and we installed the racks in the serverroom, connecting them redundantly to power using the four PDUs. I have to be honest: there is no battery or diesel backup in this room, as it\u0026rsquo;s in the middle of the city and it\u0026rsquo;d be weird to have generators on site for such a small room. It\u0026rsquo;s a compromise we have to make.\nOf course, I have to supply some form of eye-candy, so I decide to make a few decals for the racks, so that they sport the IPng @ DDLN designation. There are a few other racks and infrastructure in the same room, of course, and it\u0026rsquo;s cool to be able to identify IPng\u0026rsquo;s kit upon entering the room. They even have doors, look!\nThe floor space here is about 60m2 of usable serverroom, so there is plenty of room to grow, and if the network ever grows larger than 2x10G uplinks, it is definitely possible to rent dark fiber from this location thanks to the liberal Swiss telco situation. But for now, we start small with 1x 10G layer2 backhaul to Interxion in Glattbrugg. In 2022, I expect to expand with a second 10G layer2 backhaul to NTT in Rümlang to make the site fully redundant.\nLogical The physical situation is sorted, we have cooling, power, 19\u0026quot; racks with PDUs, and uplink connectivity. It\u0026rsquo;s time to think about a simple yet redundant colocation setup:\nIn this design, I\u0026rsquo;m keeping it relatively straight forward. The 10G ethernet leased line from Solnet plugs into one switch, and the 10G leased line from Init7 plugs into the other. Everything is then built in pairs. I bring:\nTwo switches (Mikrotik CRS354, with 48x1G, 4x10G and 2x40G), two power supplies, connect them with 40G together. Two Dell R630 routers running VPP (of course), two power supplies, with 3x10G each: One leg goes back-to-back for OSPF/OSPFv3 between the two routers One leg goes to each switch; the \u0026ldquo;local\u0026rdquo; leg will be in a VLAN into the uplink VLL, and expose the router on the colocation VLAN and any L2 backhaul services. The \u0026ldquo;remote\u0026rdquo; leg will be in a VLAN to the other uplink VLL. 
Two Supermicro hypervisors, each connected with 10G to their own switch Two PCEngines APU4 machines, each connected to Daedalean\u0026rsquo;s corporate network for OOB These have serial connection to the PDUs and Mikrotik switches They also have mgmt network connection to the Dell VPP routers and Mikrotik switches They also run a Wireguard access service which exposes an IPMI VLAN for colo clusters The result is that each of these can fail without disturbing traffic to/from the servers in the colocation. Each server in the colo gets two power connections (one on each feed), two 1Gbps ports (one for IPMI and one for Internet).\nThe logical colocation network has VRRP configured for direct/live failover of IPv4 and IPv6 gateways, but the VPP routers can offer full redundant IPv4 and IPv6 transit, as well as L2 backhaul to any other location where IPng Networks has a presence (which is quite a few).\nConclusion The colocation that I built, together with Daedalean, is very special. It\u0026rsquo;s not carrier grade, it doesn\u0026rsquo;t have a building/room wide UPS or diesel generators, but it does have competent power, cooling, physical and logical deployment. But most of all: it redundantly connects to AS8298 and offers full N+1 redundancy on the logical level.\nIf you\u0026rsquo;re interested in hosting a server in this colocation, contact us!\n","date":"2022-02-24","desc":"Introduction As with most companies, it started with an opportunity. I got my hands on a location which has a raised floor at 60m2 and a significant power connection of 3x200A, and a metro fiber connection at 10Gbps. I asked my buddy Luuk \u0026lsquo;what would it take to turn this into a colo?\u0026rsquo; and the rest is history. Thanks to Daedalean AG who benefit from this infrastructure as well, making this first small colocation site was not only interesting, but also very rewarding.\n","permalink":"https://ipng.ch/s/articles/2022/02/24/ipng-networks-colocation/","section":"articles","title":"IPng Networks - Colocation"},{"contents":"Introduction If you\u0026rsquo;ve read up on my articles, you\u0026rsquo;ll know that I have deployed a European Ring, which was reformatted late last year into AS8298 and upgraded to run VPP Routers with 10G between each city. IPng Networks rents these 10G point to point virtual leased lines between each of our locations. It\u0026rsquo;s a really great network, and it performs so well because it\u0026rsquo;s built on an EoMPLS underlay provided by IP-Max. They, in turn, run carrier grade hardware in the form of Cisco ASR9k. In part, we\u0026rsquo;re such a good match together, because my choice of VPP on the IPng Networks routers fits very well with Fred\u0026rsquo;s choice of IOS/XR on the IP-Max routers.\nAnd if you follow us on Twitter (I post as @IPngNetworks), you may have seen a recent post where I upgraded an aging ASR9006 with a significantly larger ASR9010. The ASR9006 was initially deployed at Equinix Zurich ZH05 in Oberenstringen near Zurich, Switzerland in 2015, which is seven years ago. It has hauled countless packets from Zurich to Paris, Frankfurt and Lausanne. When it was deployed, it came with a A9K-RSP-4G route switch processor, which in 2019 was upgraded to the A9K-RSP-8G, and after so many hours^W years of runtime needed a replacement. Also, IP-Max was starting to run out of ports for the chassis, hence the upgrade.\nIf you\u0026rsquo;re interested in the line-up, there\u0026rsquo;s this epic reference guide from Cisco Live! 
that shows a deep dive of the ASR9k architecture. The chassis and power supplies can host several generations of silicon, and even mix-and-match generations. So IP-Max ordered a few new RSPs, and after deploying the ASR9010 at ZH05, we made plans to redeploy this ASR9006 at NTT Zurich in Rümlang next to the airport, to replace an even older Cisco 7600 at that location. Seeing as we have to order XFP optics (IP-Max has some DWDM/CWDM links in service at NTT), we have to park the chassis in and around Zurich. What better place to park it, than in my lab ? :-)\nThe IPng Networks laboratory is where I do most of my work on VPP. The rack you see to the left here holds my coveted Rhino and Hippo (two beefy AMD Ryzen 5950X machines with 100G network cards), and a few Dells that comprise my VPP lab. There was not enough room, so I gave this little fridge a place just adjacent to the rack, connected with 10x 10Gbps and serial and management ports.\nI immediately had a little giggle when booting up the machine. It comes with 4x 3kW power supply slots (3 are installed), and when booting the machine, I was happy that there was no debris laying on the side or back of the router, as its fans create a veritable vortex of airflow. Also, overnight the temperature in my basement lab + office room raised a few degrees. It\u0026rsquo;s now nice and toasty in my office, no need for the heater in the winter. Yet the machine stays quite cool at 26C intake, consuming 2.2KW idle with each of the two route processor (RSP440) drawing 240 Watts, each of the three 8x TenGigE blades drawing 575W each, and the 40x GigE blade drawing a respectable 320 Watts.\nRP/0/RSP0/CPU0:fridge(admin)#show environment power-supply R/S/I Power Supply Voltage Current (W) (V) (A) 0/PS0/M1/* 741.1 54.9 13.5 0/PS0/M2/* 712.4 54.8 13.0 0/PS0/M3/* 765.8 55.1 13.9 -------------- Total: 2219.3 For reference, Rhino and Hippo draw approximately 265W each, but they come with 4x1G, 4x10G, 2x100G and forward ~300Mpps when fully loaded. By the end of this article, I hope you\u0026rsquo;ll see why this is a funny juxtaposition to me.\nInstalling the ASR9006 The Cisco RSPs came to me new-in-refurbished-box. When booting, I had no idea what username/password was used for the preinstall, and none of the standard passwords worked. So the first order of business is to take ownership of the machine. I do this by putting both RSPs in rommon (which is done by sending Break after powercycling the machine \u0026ndash; my choice of tio(1) has Ctrl-t b as the magic incantation). The first RSP (in slot 0) is then set to a different confreg 0x142, while the other is kept in rommon so it doesn\u0026rsquo;t boot and take over the machine. After booting, I\u0026rsquo;m then presented with a root user setup dialog. I create a user pim with some temporary password, set back the configuration register, and reload. When the RSP is about to boot, I release the standby RSP to catch up, and voila: I\u0026rsquo;m In like Flynn.\nWiring this up - I connect Te0/0/0/0 to IPng\u0026rsquo;s office switch on port sfp-sfpplus9, and I assign the router an IPv4 and IPv6 address. Then, I connect four Tengig ports to the lab switch, so that I can play around with loadtests a little bit. 
After turning on LLDP, I can see the following physical view:\nRP/0/RSP0/CPU0:fridge#show lldp neighbors Sun Feb 20 19:14:21.775 UTC Capability codes: (R) Router, (B) Bridge, (T) Telephone, (C) DOCSIS Cable Device (W) WLAN Access Point, (P) Repeater, (S) Station, (O) Other Device ID Local Intf Hold-time Capability Port ID xsw1-btl Te0/0/0/0 120 B,R bridge/sfp-sfpplus9 fsw0 Te0/1/0/0 41 P,B,R TenGigabitEthernet 0/9 fsw0 Te0/1/0/1 41 P,B,R TenGigabitEthernet 0/10 fsw0 Te0/2/0/0 41 P,B,R TenGigabitEthernet 0/7 fsw0 Te0/2/0/1 41 P,B,R TenGigabitEthernet 0/8 Total entries displayed: 5 First, I decide to hook up basic connectivity behind port Te0/0/0/0. I establish OSPF, OSPFv3 and this gives me visibility to the route-reflectors at IPng\u0026rsquo;s AS8298. Next, I also establish three IPv4 and IPv6 iBGP sessions, so the machine enters the Default Free Zone (also, daaayum, that table keeps on growing at 903K IPv4 prefixes and 143K IPv6 prefixes).\nRP/0/RSP0/CPU0:fridge#show ip ospf neighbor Neighbor ID Pri State Dead Time Address Interface 194.1.163.3 1 2WAY/DROTHER 00:00:35 194.1.163.66 TenGigE0/0/0/0.101 Neighbor is up for 00:11:14 194.1.163.4 1 FULL/BDR 00:00:38 194.1.163.67 TenGigE0/0/0/0.101 Neighbor is up for 00:11:11 194.1.163.87 1 FULL/DR 00:00:37 194.1.163.87 TenGigE0/0/0/0.101 Neighbor is up for 00:11:12 RP/0/RSP0/CPU0:fridge#show ospfv3 neighbor Neighbor ID Pri State Dead Time Interface ID Interface 194.1.163.87 1 FULL/DR 00:00:35 2 TenGigE0/0/0/0.101 Neighbor is up for 00:12:14 194.1.163.3 1 2WAY/DROTHER 00:00:33 16 TenGigE0/0/0/0.101 Neighbor is up for 00:12:16 194.1.163.4 1 FULL/BDR 00:00:36 20 TenGigE0/0/0/0.101 Neighbor is up for 00:12:12 RP/0/RSP0/CPU0:fridge#show bgp ipv4 uni sum Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer Speaker 915517 915517 915517 915517 915517 915517 Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd 194.1.163.87 0 8298 172514 9 915517 0 0 00:04:47 903406 194.1.163.140 0 8298 171853 9 915517 0 0 00:04:56 903406 194.1.163.148 0 8298 176244 9 915517 0 0 00:04:49 903406 RP/0/RSP0/CPU0:fridge#show bgp ipv6 uni sum Process RcvTblVer bRIB/RIB LabelVer ImportVer SendTblVer StandbyVer Speaker 151597 151597 151597 151597 151597 151597 Neighbor Spk AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down St/PfxRcd 2001:678:d78:3::87 0 8298 54763 10 151597 0 0 00:05:19 142542 2001:678:d78:6::140 0 8298 51350 10 151597 0 0 00:05:23 142542 2001:678:d78:7::148 0 8298 54572 10 151597 0 0 00:05:25 142542 One of the acceptance tests of new hardware at AS25091 IP-Max is to ensure that it takes a full table to help ensure memory is present, accounted for, and working. These route switch processor boards come with 12GB of ECC memory, and can scale the routing table for a small while to come. If/when they are at the end of their useful life, they will be replaced with A9K-RSP-880\u0026rsquo;s, which will also give us access to 40G and 100G and 24x10G SFP+ line cards. At that point, the upgrade path is much easier as the chassis will already be installed. It\u0026rsquo;s a matter of popping in new RSPs and replacing the line cards one by one.\nLoadtesting the ASR9006/RSP440-SE Now that this router has some basic connectivity, I\u0026rsquo;ll do something that I always wanted to do: loadtest an ASR9k! I have mad amounts of respect for Cisco\u0026rsquo;s ASR9k series, but as we\u0026rsquo;ll soon see, their stability is their most redeeming quality, not their performance. 
Nowadays, many flashy 100G machines are around, which do indeed have the performance, but not the stability! I\u0026rsquo;ve seen routers with an uptime of 7 years, and BGP sessions and OSPF adjacencies with an uptime of 5 years+. It\u0026rsquo;s just .. I\u0026rsquo;ve not seen that type of stability beyond Cisco and maybe Juniper. So if you want Rock Solid Internet, this is definitely the way to go.\nI have written a word or two on how VPP (an open source dataplane very similar to these industrial machines) works. A great example is my recent VPP VLAN Gymnastics article. There\u0026rsquo;s a lot I can learn from comparing the performance between VPP and Cisco ASR9k, so I will focus on the following set of practical questions:\nSee if unidirectional versus bidirectional traffic impacts performance. See if there is a performance penalty of using Bundle-Ether (LACP controlled link aggregation). Of course, replay my standard issue 1514b large packets, internet mix (imix) packets, small 64b packets from random source/destination addresses (ie. multiple flows); and finally the killer test of small 64b packets from a static source/destination address (ie. single flow). This is in total 2 (uni/bi) x2 (lag/plain) x4 (packet mix) or 16 loadtest runs, for three forwarding types \u0026hellip;\nSee performance of L2VPN (Point-to-Point), similar to what VPP would call \u0026ldquo;l2 xconnect\u0026rdquo;. I\u0026rsquo;ll create an L2 crossconnect between port Te0/1/0/0 and Te0/2/0/0; this is the simplest form computationally: it forwards any frame received on the first interface directly out on the second interface. Take a look at performance of L2VPN (Bridge Domain), what VPP would call \u0026ldquo;bridge-domain\u0026rdquo;. I\u0026rsquo;ll create a Bridge Domain between port Te0/1/0/0 and Te0/2/0/0; this includes layer2 learning and FIB, and can tie together any number of interfaces into a layer2 broadcast domain. And of course, tablestakes, see performance of IPv4 forwarding, with Te0/1/0/0 as 100.64.0.1/30 and Te0/2/0/0 as 100.64.1.1/30 and setting a static for 48.0.0.0/8 and 16.0.0.0/8 back to the loadtester. \u0026hellip; making a grand total of 48 loadtests. I have my work cut out for me! So I boot up Rhino, which has a Mellanox ConnectX5-Ex (PCIe v4.0 x16) network card sporting two 100G interfaces, and it can easily keep up with this 2x10G single interface, and 2x20G LAG, even with 64 byte packets. I am continually amazed that a full line rate loadtest of small 64 byte packets at a rate of 40Gbps boils down to 59.52Mpps!\nFor each loadtest, I ramp up the traffic using a T-Rex loadtester that I wrote. It starts with a low-pps warmup duration of 30s, then it ramps up from 0% to a certain line rate (in this case, alternating to 10GbpsL1 for the single TenGig tests, or 20GbpsL1 for the LACP tests), with a rampup duration of 120s and finally it holds for duration of 30s.\nThe following sections describe the methodology and the configuration statements on the ASR9k, with a quick table of results per test, and a longer set of thoughts all the way at the bottom of this document. I so encourage you to not skip ahead. Instead, read on and learn a bit (as I did!) from the configuration itself.\nThe question to answer: Can this beasty mini-fridge sustain line rate? Let\u0026rsquo;s go take a look!\nTest 1 - 2x 10G In this test, I configure a very simple physical environment (this is a good time to take another look at the LLDP table above). 
The Cisco is connected with 4x 10G to the switch, Rhino and Hippo are connected with 2x 100G to the switch and I have a Dell connected as well with 2x 10G to the switch (this can be very useful to take a look at what\u0026rsquo;s going on on the wire). The switch is an FS S5860-48SC (with 48x10G SFP+ ports, and 8x100G QSFP ports), which is a piece of kit that I highly recommend by the way.\nIts configuration:\ninterface TenGigabitEthernet 0/1 description Infra: Dell R720xd hvn0:enp5s0f0 no switchport mtu 9216 ! interface TenGigabitEthernet 0/2 description Infra: Dell R720xd hvn0:enp5s0f1 no switchport mtu 9216 ! interface TenGigabitEthernet 0/7 description Cust: Fridge Te0/2/0/0 mtu 9216 switchport access vlan 20 ! interface TenGigabitEthernet 0/9 description Cust: Fridge Te0/1/0/0 mtu 9216 switchport access vlan 10 ! interface HundredGigabitEthernet 0/53 description Cust: Rhino HundredGigabitEthernet15/0/1 mtu 9216 switchport access vlan 10 ! interface HundredGigabitEthernet 0/54 description Cust: Rhino HundredGigabitEthernet15/0/0 mtu 9216 switchport access vlan 20 ! monitor session 1 destination interface TenGigabitEthernet 0/1 monitor session 1 source vlan 10 rx monitor session 2 destination interface TenGigabitEthernet 0/2 monitor session 2 source vlan 20 rx What this does is connect Rhino\u0026rsquo;s Hu15/0/1 and Fridge\u0026rsquo;s Te0/1/0/0 in VLAN 10, and sends a readonly copy of all traffic to the Dell\u0026rsquo;s enp5s0f0 interface. Similarly, Rhino\u0026rsquo;s Hu15/0/0 and Fridge\u0026rsquo;s Te0/2/0/0 in VLAN 20 with a copy of traffic to the Dell\u0026rsquo;s enp5s0f1 interface. I can now run tcpdump on the Dell to see what\u0026rsquo;s going back and forth.\nIn case you\u0026rsquo;re curious: the monitor on Te0/1 and Te0/2 ports will saturate in case both machines are transmitting at a combined rate of over 10Gbps. If this is the case, the traffic that doesn\u0026rsquo;t fit is simply dropped from the monitor port, but it\u0026rsquo;s of course forwarded correctly between the original Hu0/53 and Te0/9 ports. In other words: the monitor session has no performance penalty. It\u0026rsquo;s merely a convenience to be able to take a look on ports where tcpdump is not easily available (ie. both VPP as well as the ASR9k in this case!)\nTest 1.1: 10G L2 Cross Connect A simple matter of virtually patching one interface into the other, I choose the first port on blade 1 and 2, and tie them together in a p2p cross connect. In my VLAN Gymnastics post, I called this a l2 xconnect, and although the configuration statements are a bit different, the purpose and expected semantics are identical:\ninterface TenGigE0/1/0/0 l2transport ! ! interface TenGigE0/2/0/0 l2transport ! ! l2vpn xconnect group loadtest p2p xc01 interface TenGigE0/1/0/0 interface TenGigE0/2/0/0 ! ! The results of this loadtest look promising - although I can already see that the port will not sustain line rate at 64 byte packets, which I find somewhat surprising. Both when using multiple flows (ie. random source and destination IP addresses), as well as when using a single flow (repeating the same src/dst packet), the machine tops out at around 20 Mpps which is 68% of line rate (29.76 Mpps). 
Fascinating!\nLoadtest Unidirectional (pps) L1 Unidirectional (bps) Bidirectional (pps) L1 Bidirectional (bps) 1514b 810 kpps 9.94 Gbps 1.61 Mpps 19.77 Gbps imix 3.25 Mpps 9.94 Gbps 6.46 Mpps 19.78 Gbps 64b Multi 14.66 Mpps 9.86 Gbps 20.3 Mpps 13.64 Gbps 64b Single 14.28 Mpps 9.60 Gbps 20.3 Mpps 13.62 Gbps Test 1.2: 10G L2 Bridge Domain I then keep the two physical interfaces in l2transport mode, but change the type of l2vpn into a bridge-domain, which I described in my VLAN Gymnastics post as well. VPP and Cisco IOS/XR semantics look very similar indeed, they differ really only in the way in which the configuration is expressed:\ninterface TenGigE0/1/0/0 l2transport ! ! interface TenGigE0/2/0/0 l2transport ! ! l2vpn xconnect group loadtest ! bridge group loadtest bridge-domain bd01 interface TenGigE0/1/0/0 ! interface TenGigE0/2/0/0 ! ! ! ! Here, I find that performance in one direction is line rate, and with 64b packets ever so slightly better than the L2 crossconnect test above. In both directions though, the router struggles to obtain line rate in small packets, delivering 64% (or 19.0 Mpps) of the total offered 29.76 Mpps back to the loadtester.\nLoadtest Unidirectional (pps) L1 Unidirectional (bps) Bidirectional (pps) L1 Bidirectional (bps) 1514b 807 kpps 9.91 Gbps 1.63 Mpps 19.96 Gbps imix 3.24 Mpps 9.92 Gbps 6.47 Mpps 19.81 Gbps 64b Multi 14.82 Mpps 9.96 Gbps 19.0 Mpps 12.79 Gbps 64b Single 14.86 Mpps 9.98 Gbps 19.0 Mpps 12.81 Gbps I would say that in practice, the performance of a bridge-domain is comparable to that of an L2XC.\nTest 1.3: 10G L3 IPv4 Routing This is the most straight forward test: the T-Rex loadtester in this case is sourcing traffic from 100.64.0.2 on its first interface, and 100.64.1.2 on its second interface. It will send ARP for the nexthop (100.64.0.1 and 100.64.1.1, the Cisco), but the Cisco will not maintain an ARP table for the loadtester, so I have to add static ARP entries for it. Otherwise, this is a simple test, which stress tests the IPv4 forwarding path:\ninterface TenGigE0/1/0/0 ipv4 address 100.64.0.1 255.255.255.252 ! interface TenGigE0/2/0/0 ipv4 address 100.64.1.1 255.255.255.252 ! router static address-family ipv4 unicast 16.0.0.0/8 100.64.1.2 48.0.0.0/8 100.64.0.2 ! ! arp vrf default 100.64.0.2 043f.72c3.d048 ARPA arp vrf default 100.64.1.2 043f.72c3.d049 ARPA ! Alright, so the cracks definitely show on this loadtest. The performance of small routed packets is quite poor, weighing in at 35% of line rate in the unidirectional test, and 43% in the bidirectional test. It seems that the ASR9k (at least in this hardware profile of l3xl) is not happy forwarding traffic at line rate, and the routing performance is indeed significantly lower than the L2VPN performance. That\u0026rsquo;s good to know!\nLoadtest Unidirectional (pps) L1 Unidirectional (bps) Bidirectional (pps) L1 Bidirectional (bps) 1514b 815 kpps 10.0 Gbps 1.63 Mpps 19.98 Gbps imix 3.27 Mpps 9.99 Gbps 6.52 Mpps 19.96 Gbps 64b Multi 5.14 Mpps 3.45 Gbps 12.3 Mpps 8.28 Gbps 64b Single 5.25 Mpps 3.53 Gbps 12.6 Mpps 8.51 Gbps Test 2 - LACP 2x 20G Link aggregation (ref) means combining or aggregating multiple network connections in parallel by any of several methods, in order to increase throughput beyond what a single connection could sustain, to provide redundancy in case one of the links should fail, or both. A link aggregation group (LAG) is the combined collection of physical ports. 
Other umbrella terms used to describe the concept include trunking, bundling, bonding, channeling or teaming. Bundling ports together on a Cisco IOS/XR platform like the ASR9k can be done by creating a Bundle-Ether or BE. For reference, the same concept on VPP is called a BondEthernet and in Linux it\u0026rsquo;ll often be referred to as simply a bond. They all refer to the same concept.\nOne thing that immediately comes to mind when thinking about LAGs is: how will the member port be selected on outgoing traffic? A sensible approach will be to either hash on the L2 source and/or destination (ie. the ethernet host on either side of the LAG), but in the case of a router and as is the case in our loadtest here, there is only one MAC address on either side of the LAG. So a different hashing algorithm has to be chosen, preferably of the source and/or destination L3 (IPv4 or IPv6) address. Luckily, both the FS switch as well as the Cisco ASR9006 support this.\nFirst I\u0026rsquo;ll reconfigure the switch, and then reconfigure the router to use the newly created 2x 20G LAG ports.\ninterface TenGigabitEthernet 0/7 description Cust: Fridge Te0/2/0/0 port-group 2 mode active ! interface TenGigabitEthernet 0/8 description Cust: Fridge Te0/2/0/1 port-group 2 mode active ! interface TenGigabitEthernet 0/9 description Cust: Fridge Te0/1/0/0 port-group 1 mode active ! interface TenGigabitEthernet 0/10 description Cust: Fridge Te0/1/0/1 port-group 1 mode active ! interface AggregatePort 1 mtu 9216 aggregateport load-balance dst-ip switchport access vlan 10 ! interface AggregatePort 2 mtu 9216 aggregateport load-balance dst-ip switchport access vlan 20 ! And after the Cisco is converted to use Bundle-Ether as well, the link status looks like this:\nfsw0#show int ag1 ... Aggregate Port Informations: Aggregate Number: 1 Name: \u0026#34;AggregatePort 1\u0026#34; Members: (count=2) Lower Limit: 1 TenGigabitEthernet 0/9 Link Status: Up Lacp Status: bndl TenGigabitEthernet 0/10 Link Status: Up Lacp Status: bndl Load Balance by: Destination IP fsw0#show int usage up Interface Bandwidth Average Usage Output Usage Input Usage -------------------------------- ----------- ---------------- ---------------- ---------------- TenGigabitEthernet 0/1 10000 Mbit 0.0000018300% 0.0000013100% 0.0000023500% TenGigabitEthernet 0/2 10000 Mbit 0.0000003450% 0.0000004700% 0.0000002200% TenGigabitEthernet 0/7 10000 Mbit 0.0000012350% 0.0000022900% 0.0000001800% TenGigabitEthernet 0/8 10000 Mbit 0.0000011450% 0.0000021800% 0.0000001100% TenGigabitEthernet 0/9 10000 Mbit 0.0000011350% 0.0000022300% 0.0000000400% TenGigabitEthernet 0/10 10000 Mbit 0.0000016700% 0.0000022500% 0.0000010900% HundredGigabitEthernet 0/53 100000 Mbit 0.00000011900% 0.00000023800% 0.00000000000% HundredGigabitEthernet 0/54 100000 Mbit 0.00000012500% 0.00000025000% 0.00000000000% AggregatePort 1 20000 Mbit 0.0000014600% 0.0000023400% 0.0000005799% AggregatePort 2 20000 Mbit 0.0000019575% 0.0000023950% 0.0000015200% It\u0026rsquo;s clear that both AggregatePort interfaces have 20Gbps of capacity and are using an L3 loadbalancing policy. Cool beans!\nIf you recall my loadtest theory in for example my Netgate 6100 review, it can sometimes be useful to operate a single-flow loadtest, in which the source and destination IP:Port stay the same. 
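As an aside, the member selection itself is conceptually very simple. Here is a small sketch (my own illustration, not how the ASR9k or the FS switch actually implement it) of hashing on the L3 source and destination address, which also shows why a single-flow loadtest can never spread across more than one member:

# Illustration only: choose a LAG member by hashing the src/dst IP addresses.
# A single-flow loadtest repeats the same (src, dst) pair, so it always maps
# to the same member and can use at most one port of the bundle.
import ipaddress
import zlib

def lag_member(src, dst, members):
    key = ipaddress.ip_address(src).packed + ipaddress.ip_address(dst).packed
    return members[zlib.crc32(key) % len(members)]

members = ["TenGigE0/1/0/0", "TenGigE0/1/0/1"]
print(lag_member("16.0.0.1", "48.0.0.1", members))  # the same flow always lands on the same member
print(lag_member("16.0.0.2", "48.0.0.1", members))  # a different flow may pick the other member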
As I\u0026rsquo;ll demonstrate, it\u0026rsquo;s not only relevant for PC based routers like ones built on VPP, it can also be very relevant in silicon vendors and high-end routers!\nTest 2.1 - 2x 20G LAG L2 Cross Connect I scratched my head a little while (and with a little while I mean more like an hour or so!), because usually I come across Bundle-Ether interfaces which have hashing turned on in the interface stanza, but in my first loadtest run I did not see any traffic on the second member port. I then found out that I need L2VPN setting l2vpn load-balancing flow src-dst-ip applied rather than the Interface setting:\ninterface Bundle-Ether1 description LAG1 l2transport ! ! interface TenGigE0/1/0/0 bundle id 1 mode active ! interface TenGigE0/1/0/1 bundle id 1 mode active ! interface Bundle-Ether2 description LAG2 l2transport ! ! interface TenGigE0/2/0/0 bundle id 2 mode active ! interface TenGigE0/2/0/1 bundle id 2 mode active ! l2vpn load-balancing flow src-dst-ip xconnect group loadtest p2p xc01 interface Bundle-Ether1 interface Bundle-Ether2 ! ! ! Overall, the router performs as well as can be expected. In the single-flow 64 byte test, however, due to the hashing over the available members in the LAG being on L3 information, the router is forced to always choose the same member and effectively perform at 10G throughput, so it\u0026rsquo;ll get a pass from me on the 64b single test. In the multi-flow test, I can see that it does indeed forward over both LAG members, however it reaches only 34.9Mpps which is 59% of line rate.\nLoadtest Unidirectional (pps) L1 Unidirectional (bps) Bidirectional (pps) L1 Bidirectional (bps) 1514b 1.61 Mpps 19.8 Gbps 3.23 Mpps 39.64 Gbps imix 6.40 Mpps 19.8 Gbps 12.8 Mpps 39.53 Gbps 64b Multi 29.44 Mpps 19.8 Gbps 34.9 Mpps 23.48 Gbps 64b Single 14.86 Mpps 9.99 Gbps 29.8 Mpps 20.0 Gbps Test 2.2 - 2x 20G LAG Bridge Domain Just like with Test 1.2 above, I can now transform this service from a Cross Connect into a fully formed L2 bridge, by simply putting the two Bundle-Ether interfaces in a bridge-domain together, again being careful to apply the L3 load-balancing policy on the l2vpn scope rather than the interface scope:\nl2vpn load-balancing flow src-dst-ip no xconnect group loadtest bridge group loadtest bridge-domain bd01 interface Bundle-Ether1 ! interface Bundle-Ether2 ! ! ! ! The results for this test show that indeed L2XC is computationally cheaper than bridge-domain work. With imix and 1514b packets, the router is fine and forwards 20G and 40G respectively. When the bridge is slammed with 64 byte packets, its performance reaches only 65% with multiple flows in the unidirectional, and 47% in the bidirectional loadtest. I found the performance difference with the L2 crossconnect above remarkable.\nThe single-flow loadtest cannot meaningfully stress both members of the LAG due to the src/dst being identical: the best I can expect here, is 10G performance regardless how many LAG members there are.\nLoadtest Unidirectional (pps) L1 Unidirectional (bps) Bidirectional (pps) L1 Bidirectional (bps) 1514b 1.61 Mpps 19.8 Gbps 3.22 Mpps 39.56 Gbps imix 6.39 Mpps 19.8 Gbps 12.8 Mpps 39.58 Gbps 64b Multi 20.12 Mpps 13.5 Gbps 28.2 Mpps 18.93 Gbps 64b Single 9.49 Mpps 6.38 Gbps 19.0 Mpps 12.78 Gbps Test 2.3 - 2x 20G LAG L3 IPv4 Routing And finally I turn my attention to the usual suspect: IPv4 routing. Here, I simply remove the l2vpn stanza alltogether, and remember to put the load-balancing policy on the Bundle-Ether interfaces. 
This ensures that upon transmission, both members of the LAG are used. That is, if and only if the IP src/dst addresses differ, which is the case in most, but not all of my loadtests :-)\nno l2vpn interface Bundle-Ether1 description LAG1 ipv4 address 100.64.1.1 255.255.255.252 bundle load-balancing hash src-ip ! interface TenGigE0/1/0/0 bundle id 1 mode active ! interface TenGigE0/1/0/1 bundle id 1 mode active ! interface Bundle-Ether2 description LAG2 ipv4 address 100.64.0.1 255.255.255.252 bundle load-balancing hash src-ip ! interface TenGigE0/2/0/0 bundle id 2 mode active ! interface TenGigE0/2/0/1 bundle id 2 mode active ! The LAG is fine at forwarding IPv4 traffic in 1514b and imix - full line rate and 40Gbps of traffic is passed in the bidirectional test. With the 64b frames though, the forwarding performance is not line rate but rather 84% of line in one direction, and 76% of line rate in the bidirectional test.\nAnd once again, the single-flow loadtest cannot make use of more than one member port in the LAG, so it will be constrained to 10G throughput \u0026ndash; that said, it performs at 42.6% of line rate only.\nLoadtest Unidirectional (pps) L1 Unidirectional (bps) Bidirectional (pps) L1 Bidirectional (bps) 1514b 1.63 Mpps 20.0 Gbps 3.25 Mpps 39.92 Gbps imix 6.51 Mpps 19.9 Gbps 13.04 Mpps 39.91 Gbps 64b Multi 12.52 Mpps 8.41 Gbps 22.49 Mpps 15.11 Gbps 64b Single 6.49 Mpps 4.36 Gbps 11.62 Mpps 7.81 Gbps Bonus - ASR9k linear scaling As I\u0026rsquo;ve shown above, the loadtests often topped out at well under line rate for tests with small packet sizes, but I can also see that the LAG tests offered a higher performance, although not quite double that of single ports. I can\u0026rsquo;t help but wonder: is this perhaps a per-port limit rather than a router-wide limit?\nTo answer this question, I decide to pull out the stops and populate the ASR9k with as many XFPs as I have in my stash, which is 9 pieces. One (Te0/0/0/0) still goes to uplink, because the machine should be carrying IGP and full BGP tables at all times; which leaves me with 8x 10G XFPs, which I decide it might be nice to combine all three scenarios in one test:\nTest 1.1 with Te0/1/0/2 cross connected to Te0/2/0/2, with a loadtest at 20Gbps. Test 1.2 with Te0/1/0/3 in a bridge-domain with Te0/2/0/3, also with a loadtest at 20Gbps. Test 2.3 with Te0/1/0/0+Te0/2/0/0 on one end, and Te0/1/0/1+Te0/2/0/1 on the other end, with an IPv4 loadtest at 40Gbps. 64 byte packets It would be unfair to use single-flow on the LAG, considering the hashing is on L3 source and/or destination IPv4 addresses, so really only one member port would be used. To avoid this pitfall, I run with vm=var2. On the other two tests, however, I do run the most stringent of traffic pattern with single-flow loadtests. So off I go, firing up three T-Rex instances.\nFirst, the 10G L2 Cross Connect test (approximately 17.7Mpps):\nTx bps L2 | 7.64 Gbps | 7.64 Gbps | 15.27 Gbps Tx bps L1 | 10.02 Gbps | 10.02 Gbps | 20.05 Gbps Tx pps | 14.92 Mpps | 14.92 Mpps | 29.83 Mpps Line Util. | 100.24 % | 100.24 % | --- | | | Rx bps | 4.52 Gbps | 4.52 Gbps | 9.05 Gbps Rx pps | 8.84 Mpps | 8.84 Mpps | 17.67 Mpps Then, the 10G Bridge Domain test (approximately 17.0Mpps):\nTx bps L2 | 7.61 Gbps | 7.61 Gbps | 15.22 Gbps Tx bps L1 | 9.99 Gbps | 9.99 Gbps | 19.97 Gbps Tx pps | 14.86 Mpps | 14.86 Mpps | 29.72 Mpps Line Util. 
| 99.87 % | 99.87 % | --- | | | Rx bps | 4.36 Gbps | 4.36 Gbps | 8.72 Gbps Rx pps | 8.51 Mpps | 8.51 Mpps | 17.02 Mpps Finally, the 20G LAG IPv4 forwarding test (approximately 24.4Mpps), noting that the Line Util. here is of the 100G loadtester ports, so 20% is expected:\nTx bps L2 | 15.22 Gbps | 15.23 Gbps | 30.45 Gbps Tx bps L1 | 19.97 Gbps | 19.99 Gbps | 39.96 Gbps Tx pps | 29.72 Mpps | 29.74 Mpps | 59.46 Mpps Line Util. | 19.97 % | 19.99 % | --- | | | Rx bps | 5.68 Gbps | 6.82 Gbps | 12.51 Gbps Rx pps | 11.1 Mpps | 13.33 Mpps | 24.43 Mpps To summarize, in the above tests I am pumping 80Gbit (which is 8x 10Gbit full linerate at 64 byte packets, in other words 119Mpps) into the machine, and it\u0026rsquo;s returning 30.28Gbps (or 59.2Mpps which is 38%) of that traffic back to the loadtesters. Features: yes; linerate: nope!\n256 byte packets Seeing the lowest performance of the router coming in at 8.5Mpps (or 57% of linerate), it stands to reason that sending 256 byte packets will stay under the per-port observed packets/sec limits, so I decide to restart the loadtesters with 256b packets. The expected ethernet frame is now 256 + 20 byte overhead, or 2208 bits, of which ~4.53Mpps can fit into a 10G link. Immediately all ports go up entirely to full capacity. As seen from the Cisco\u0026rsquo;s commandline:\nRP/0/RSP0/CPU0:fridge#show interfaces | utility egrep \u0026#39;output.*packets/sec\u0026#39; | exclude 0 packets Mon Feb 21 22:14:02.250 UTC 5 minute output rate 18390237000 bits/sec, 9075919 packets/sec 5 minute output rate 18391127000 bits/sec, 9056714 packets/sec 5 minute output rate 9278278000 bits/sec, 4547012 packets/sec 5 minute output rate 9242023000 bits/sec, 4528937 packets/sec 5 minute output rate 9287749000 bits/sec, 4563507 packets/sec 5 minute output rate 9273688000 bits/sec, 4537368 packets/sec 5 minute output rate 9237466000 bits/sec, 4519367 packets/sec 5 minute output rate 9289136000 bits/sec, 4562365 packets/sec 5 minute output rate 9290096000 bits/sec, 4554872 packets/sec The first two ports there are Bundle-Ether interface BE1 and BE2, and the other eight are the TenGigE ports. You can see that each one is forwarding the expected 4.53Mpps, and this lines up perfectly with T-Rex which is sending 10Gbps of L1, and 9.28Gbps of L2 (the difference here is the ethernet overhead of 20 bytes per frame, or 4.53 * 160 bits = 724Mbps), and it\u0026rsquo;s receiving all of that traffic back on the other side, which is good.\nThis clearly demonstrates the hypothesis that the machine is per-port pps-bound.\nSo the conclusion is that, the A9K-RSP440-SE typically will forward maybe only 8Mpps on a single TenGigE port, and 13Mpps on a two-member LAG. However, it will do this for every port, and with at least 8x 10G ports saturated, it remained fully responsive, OSPF and iBGP adjacencies stayed up, and ping times on the regular (Te0/0/0/0) uplink port were smooth.\nResults 1514b and imix: OK! Let me start by showing a side-by-side comparison of the imix tests in all scenarios in the graph above. The graph for 1514b tests looks very similar, differing only in the left-Y axis: imix is a 3.2Mpps stream, while 1514b saturates the 10G port already at 810Kpps. But obviously, the router can do this just fine, even if used on 8 ports, it doesn\u0026rsquo;t mind at all. 
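To put these packets per second numbers in perspective, here is a quick back-of-the-envelope sketch (my own, in Python, not part of the original loadtest tooling) of the theoretical line rate of a 10G port for the frame sizes used in these tests, counting 20 bytes of L1 overhead per frame:
LINE_RATE_BPS = 10e9
L1_OVERHEAD = 20  # preamble (7) + start-of-frame delimiter (1) + inter-frame gap (12) bytes

def max_pps(frame_size: int) -> float:
    # Theoretical packets/sec at line rate for a given L2 frame size in bytes.
    return LINE_RATE_BPS / ((frame_size + L1_OVERHEAD) * 8)

for size in (64, 256, 1514):
    print(f"{size:5d}b: {max_pps(size) / 1e6:5.2f} Mpps")
# prints ~14.88 Mpps for 64b, ~4.53 Mpps for 256b and ~0.81 Mpps (810Kpps) for 1514b
These are exactly the 14.88Mpps, 4.53Mpps and 810Kpps figures quoted in this article.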
As I later learned, any traffic mix larger than 256b packets, or 4.5Mpps per port, forwards fine in any configuration.\n64b: Not so much :) These graphs show the throughput of the ASR9006 with a pair of A9K-RSP440-SE route switch processors. They are rated at 440Gbps per slot, but their packets/sec rates are significantly lower than line rate. The top graph shows the tests with 10G ports, and the bottom graph shows the same tests but with 2x10G ports in a Bundle-Ether LAG.\nIn an ideal situation, each test would follow the loadtester up to completion, and there would be no horizontal lines breaking out partway through. As I showed, some of the loadtests really performed poorly in terms of packets/sec forwarded. Understandably, the 20G LAG with single-flow can only utilize one member port (which is logical) but then managed to push through only 6Mpps or so. Other tests did better, but overall I must say, the results were lower than I had expected.\nThat juxtaposition At the very top of this article I alluded to what I think is a cool juxtaposition. On the one hand, we have these beasty ASR9k routers, running idle at 2.2kW for 24x10G and 40x1G ports (as is the case for the IP-Max router that I took out for a spin here). They are large (10U of rackspace), heavy (40kg loaded), expensive (who cares about list price, the street price is easily $10'000,- apiece).\nOn the other hand, we have these PC based machines with Vector Packet Processing, operating as low as 19W for 2x10G, 2x1G and 4x2.5G ports (like the Netgate 6100) and offering roughly equal performance per port, except having to drop only $700,- apiece. The VPP machines come with ~infinite RAM, even a 16GB machine will run much larger routing tables, including full BGP and so on - there is no (need for) TCAM, and yet routing performance scales out with CPUs and larger CPU instruction/data-cache. Looking at my Ryzen 5950X based Hippo/Rhino VPP machines, they can sustain line rate with 64b packets on their 10G ports, due to each CPU being able to process around 22.3Mpps, and the machine has 15 usable CPU cores. Intel or Mellanox 100G network cards are affordable, the whole machine with 2x100G, 4x10G and 4x1G will set me back about $3'000,- in 1U and run 265 Watts when fully loaded.\nSee an extended rationale with backing data in my FOSDEM'22 talk.\nConclusion I set out to answer three questions in this article, and I\u0026rsquo;m ready to opine now:\nUnidirectional vs Bidirectional: there is an impact - bidirectional tests (stressing both ingress and egress of each individual router port) have lower performance, notably in packets smaller than 256b. LACP performance penalty: there is an impact - 64b multiflow loadtest on LAG obtained 59%, 47% and 42% (for Test 2.1-3) while for single ports, they obtained 68%, 64% and 43% (for Test 1.1-3). So while aggregate throughput grows with the LACP Bundle-Ether ports, individual port throughput is reduced. The router performs at line rate for 1514b, imix, and really anything beyond 256b packets. However, it does not sustain line rate at 64b packets. Some tests passed with a unidirectional loadtest, but all tests failed with bidirectional loadtests. After all of these tests, I have to say I am still a huge fan of the ASR9k.
I had kind of expected that it would perform at line rate for any/all of my tests, but the theme became clear after a few - the ports will only forward between 8Mpps and 11Mpps (out of the needed 14.88Mpps), but every port will do that, which means the machine will still scale up significantly in practice. But for business internet, colocation, and non-residential purposes, I would argue that routing stability is most important, and with regards to performance, I would argue that aggregate bandwidth is more important than pure packets/sec performance. Finally, the ASR in Cisco ASR9k stands for Advanced Services Router, and being able to mix-and-match MPLS, L2VPN, Bridges, encapsulation, tunneling, and have an expectation of 8-10Mpps per 10G port is absolutely reasonable. The ASR9k is a very competent machine.\nLoadtest data I\u0026rsquo;ve dropped all loadtest data here and if you\u0026rsquo;d like to play around with the data, take a look at the HTML files in this directory, they were built with Michal\u0026rsquo;s trex-loadtest-viz scripts.\nAcknowledgements I wanted to give a shout-out to Fred and the crew at IP-Max for allowing me to play with their router during these loadtests. I\u0026rsquo;ll be configuring it to replace their router at NTT in March, so if you have a connection to SwissIX via IP-Max, you will be notified for maintenance ahead of time as we plan the maintenance window.\nWe call these things Fridges in the IP-Max world, because they emit so much cool air when they start :) The ASR9001 is the microfridge, this ASR9006 is the minifridge, and the ASR9010 is the regular fridge.\n","date":"2022-02-21","desc":"Introduction If you\u0026rsquo;ve read up on my articles, you\u0026rsquo;ll know that I have deployed a European Ring, which was reformatted late last year into AS8298 and upgraded to run VPP Routers with 10G between each city. IPng Networks rents these 10G point to point virtual leased lines between each of our locations. It\u0026rsquo;s a really great network, and it performs so well because it\u0026rsquo;s built on an EoMPLS underlay provided by IP-Max. They, in turn, run carrier grade hardware in the form of Cisco ASR9k. In part, we\u0026rsquo;re such a good match together, because my choice of VPP on the IPng Networks routers fits very well with Fred\u0026rsquo;s choice of IOS/XR on the IP-Max routers.\n","permalink":"https://ipng.ch/s/articles/2022/02/21/review-cisco-asr9006/rsp440-se/","section":"articles","title":"Review: Cisco ASR9006/RSP440-SE"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nAfter completing the Linux CP plugin, interfaces and their attributes such as addresses and routes can be shared between VPP and the Linux kernel in a clever way, so running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates are easily in reach! But after the controlplane is up and running, VPP has so much more to offer - many interesting L2 and L3 services that you\u0026rsquo;d expect in commercial (and very pricy) routers like Cisco ASR are well within reach.\nWhen Fred and I were in Paris [report], I got stuck trying to configure an Ethernet over MPLS circuit for IPng from Paris to Zurich. 
Fred took a look for me and quickly determined \u0026ldquo;Ah, you forgot to do the VLAN gymnastics\u0026rdquo;. I found it a fun way to describe the solution to my problem back then, and come to think of it: the router really can be configured to hook up anything to pretty much anything \u0026ndash; this post takes a look at similar flexibility in VPP.\nIntroduction When I first started learning how to work on Cisco\u0026rsquo;s Advanced Services Router platform (Cisco IOS/XR), I was surprised that there is no concept of a switch. As many network engineers, I was used to be able to put a number of ports in the same switch VLAN; and take a different set of ports and put them into L3 mode with an IPv4/IPv6 address, or activate MPLS. And I was used to combining these two concepts by creating VLAN (L3) interfaces.\nTurning to VPP, much like its commercial sibling Cisco IOS/XR, the mental model and approach they take is different. Each physical interface can have a number of sub-interfaces which carry an encapsulation, for example a dot1q, or a dot1ad or even a double-tagged (QinQ or QinAD). When ethernet frames arrive on the physical interface, VPP will match them to the sub-interface which is configured to receive frames of that specific encapsulation, and drop frames that do not match any sub-interface.\nSub Interfaces in VPP There are several forms of sub-interface, let\u0026rsquo;s take a look at them:\n1. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; dot1q|dot1ad \u0026lt;vlanId\u0026gt; 2. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; dot1q|dot1ad \u0026lt;vlanId\u0026gt; exact-match 3. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; dot1q|dot1ad \u0026lt;vlanId\u0026gt; inner-dot1q \u0026lt;vlanId\u0026gt;|any 4. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; dot1q|dot1ad \u0026lt;vlanId\u0026gt; inner-dot1q \u0026lt;vlanId\u0026gt; exact-match 5. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; 6. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt;-\u0026lt;subId\u0026gt; 7. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; untagged 8. create sub \u0026lt;interface\u0026gt; \u0026lt;subId\u0026gt; default Alright, that\u0026rsquo;s a lot of choice! Let me go over these one by one.\nThe first variant creates a sub-interface which will match frames with the first VLAN tag being either dot1q or dot1ad with the given vlanId. An important note to this: there might be more VLAN tags following in the ethernet frame, ie the frame may be QinQ or QinAD, and all of these will be matched. The second variant looks to do the same thing, but there, the frame will only match if there is exactly one VLAN tag, not more, not less. So this sub-interface will not match frames which are QinQ or QinAD. The third variant creates a sub-interface which matches an outer dot1q or dot1ad VLAN and in addition an inner dot1q tag. The special keyword any can be specified, which will make the sub-interface match QinQ or QinAD frames without caring which inner tag is used. The fourth variant looks a bit like the second one, in that it will match for frames which have exactly two VLAN tags (either dot1q.dot1q or dot1ad.dot1q). In this exact-match mode of operation, precisely those two tags must be present, and no other tags may follow. The fifth variant is simply a shorthand for the second one, it creates an exact-match dot1q with a vlanId equal to the given subId. 
This is the most obvious form, and people will recognize this as \u0026ldquo;just\u0026rdquo; a VLAN :) The sixth variant further expands on this pattern, and creates a list of these dot1q exact-match (eg. 100-200 will create 101 sub-interfaces). The seventh variant creates a sub-interface that matches any frames that have exactly zero tags (ie. untagged), and finally the eighth variant matches anything that is not matched by any other sub-interface (ie. the fallthrough default). When I first saw this, it seemed overly complicated to me, but now that I\u0026rsquo;ve gotten to know this way of thinking, what\u0026rsquo;s being presented here is a way for any physical interface to branch off inbound traffic based on exactly zero (untagged), exactly one (dot1q or dot1ad with exact-match), exactly two (outer dot1q or dot1ad followed by inner-dot1q with exact-match), or one outer tag followed by any inner tag(s). In other words, any combination of zero, one or two present tags on the frame can be matched and acted on by this logic.\nA few other considerations:\nIf a sub-interface is created with a given dot1q or dot1ad tag, you can\u0026rsquo;t have another sub-interface with a different matching logic on that same tag, for example creating dot1q 100 means you can\u0026rsquo;t then also create dot1q 100 exact-match. If that behavior is desired, then you\u0026rsquo;ll want to create dot1q 100 inner-dot1q any followed by dot1q 100 exact-match. For L3 interfaces, it only makes sense to have exact-match interfaces. I found a bug in VPP that leads to a crash, which I\u0026rsquo;ve fixed in [this gerrit], so now the API and CLI throw an error instead of taking down the router. Bridge Domains So how do we make the functional equivalent of a VLAN, where several interfaces are bound together into an L2 broadcast domain, like a regular switch might do? The VPP answer to this is a bridge-domain which I can create and give a number, and then add any interface to it, like so:\nvpp# create bridge-domain 10 vpp# set interface l2 bridge GigabitEthernet10/0/0 10 vpp# set interface l2 bridge BondEthernet0 10 And if I want to add an IP address (creating the equivalent of a routable VLAN Interface), I create what is called a Bridge Virtual Interface or BVI, add that interface to the bridge domain, and optionally expose it in Linux with the LinuxCP plugin:\nvpp# bvi create instance 10 mac 02:fe:4b:4c:22:8f vpp# set interface l2 bridge bvi10 10 bvi vpp# set interface ip address bvi10 192.0.2.1/24 vpp# set interface ip address bvi10 2001:db8::1/64 vpp# lcp create bvi10 host-if bvi10 A bridge-domain is fully configurable - by default it\u0026rsquo;ll participate in L2 learning, maintain a FIB (which MAC addresses are seen behind which interface), and pass along ARP requests and Neighbor Discovery.
But I can configure it to turn on/off forwarding, ARP, handling of unknown unicast frames, and so on, the complete list of functionality that can be changed at runtime:\nset bridge-domain arp entry \u0026lt;bridge-domain-id\u0026gt; [\u0026lt;ip-addr\u0026gt; \u0026lt;mac-addr\u0026gt; [del] | del-all] set bridge-domain arp term \u0026lt;bridge-domain-id\u0026gt; [disable] set bridge-domain arp-ufwd \u0026lt;bridge-domain-id\u0026gt; [disable] set bridge-domain default-learn-limit \u0026lt;maxentries\u0026gt; set bridge-domain flood \u0026lt;bridge-domain-id\u0026gt; [disable] set bridge-domain forward \u0026lt;bridge-domain-id\u0026gt; [disable] set bridge-domain learn \u0026lt;bridge-domain-id\u0026gt; [disable] set bridge-domain learn-limit \u0026lt;bridge-domain-id\u0026gt; \u0026lt;learn-limit\u0026gt; set bridge-domain mac-age \u0026lt;bridge-domain-id\u0026gt; \u0026lt;mins\u0026gt; set bridge-domain rewrite \u0026lt;bridge-domain\u0026gt; [disable] set bridge-domain uu-flood \u0026lt;bridge-domain-id\u0026gt; [disable] This makes bridge domains a very powerful concept, and actually much more powerful (a strict superset) of what I might be able to configure on an L2 switch.\nL2 CrossConnect I thought it\u0026rsquo;d be useful to point out another powerful concept, which made an appearance in my previous post about Virtual Leased Lines. If all I want to do is connect two interfaces together, there won\u0026rsquo;t be a need for learning, L2 FIB, and so on. It is computationally much simpler to just take any frame received on interface A and transmit it out on interface B, unmodified. This is known in VPP as a layer2 crossconnect, and can be configured like so:\nvpp# set interface l2 xconnect GigabitEthernet10/0/0 GigabitEthernet10/0/3 vpp# set interface l2 xconnect GigabitEthernet10/0/3 GigabitEthernet10/0/0 I should point out that this has to be done in both directions. The first invocation will transmit any frame received on Gi10/0/0 directly out on Gi10/0/3, and the second one will transmit any frame from Gi10/0/3 directly out on Gi10/0/0, turning this into a very efficient way to connect two interfaces together. Obviously, this only works in pairs, if more interfaces have to be connected, the bridge-domain is the way to go. That said, L2 cross connects are super common.\nTag Rewriting If I want to connect two tagged sub-interfaces together, for example Gi10/0/0.123 to Gi10/0/3.321, things get a bit more complicated. When VPP receives the frame from the first interface, it\u0026rsquo;ll arrive tagged with VLAN 123, so what happens if that is l2 crossconnected to Gi10/0/3.321? The answer will surprise you, so let\u0026rsquo;s take a look:\nvpp# set interface state GigabitEthernet10/0/0 up vpp# set interface state GigabitEthernet10/0/3 up vpp# create sub GigabitEthernet10/0/0 123 vpp# set interface state GigabitEthernet10/0/0.123 up vpp# create sub GigabitEthernet10/0/3 321 vpp# set interface state GigabitEthernet10/0/3.321 up vpp# set interface l2 xconnect GigabitEthernet10/0/0.123 GigabitEthernet10/0/3.321 vpp# set interface l2 xconnect GigabitEthernet10/0/3.321 GigabitEthernet10/0/0.123 If I send a packet into Gi10/0/0.123, the L2 crossconnect will copy the entire frame, unmodified into Gi10/0/3.321, but how can that be? That interface Gi10/0/3.321 is tagged with VLAN 321! VPP will end up sending the frame out on interface Gi10/0/3 tagged as VLAN 123. In the other direction, frames received on Gi10/0/3.321 will be sent out tagged as VLAN 321 on Gi10/0/0. 
This is certainly not what I expected.\nTo address this, VPP can add or remove VLAN tags when it receives a frame, when it transmits a frame, or both, let me show you this concept up close, as it\u0026rsquo;s really powerful!\nVLAN tag rewrite provides the ability to change the VLAN tags on a packet. Existing tags can be popped, new tags can be pushed, and existing tags can be swapped with new tags. The rewrite feature is attached to a sub-interface as input and output operations. The input operation is explicitly configured by CLI or API calls, and the output operation is the symmetric opposite and is automatically derived from the input operation.\nPOP: For pop operations, the sub-interface encapsulation (the vlan tags specified when it was created) must have at least the number of popped tags. e.g. the \u0026ldquo;pop 2\u0026rdquo; operation would be rejected on a single-vlan interface. The output tag-rewrite operation will push the specified number of vlan tags onto the packet before transmitting. The pushed tag values are taken from the sub-interface encapsulation configuration. PUSH: For push operations, the ethertype (dot1q or dot1ad) is also specified. The output tag-rewrite operation for pushes is to pop the same number of tags off the packet. If the packet doesn\u0026rsquo;t have enough tags it is dropped. TRANSLATE: This is a combination of a pop and a push operation. This may be confusing at first, so let me demonstrate how this works, by extending the example above. On the machine connected to Gi10/0/0.123, I\u0026rsquo;ll configure an IP address and try to ping its neighbor:\npim@hippo:~$ sudo ip link add link enp4s0f0 name vlan123 type vlan id 123 pim@hippo:~$ sudo ip link set vlan123 up pim@hippo:~$ sudo ip addr add 192.0.2.1/30 dev vlan123 pim@hippo:~$ ping 192.0.2.2 PING 192.0.2.2 (192.0.2.2) 56(84) bytes of data. ... On the other side, I\u0026rsquo;ll tcpdump what comes out the Gi10/0/3 port (which, as I observed above, is not carrying the tag, 321, but instead carrying the original ingress tag, 123):\n16:33:59.489246 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 46: ethertype 802.1Q (0x8100), vlan 123, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 Now, to demonstrate tag rewriting, I will remove (pop) the ingress VLAN tag from Gi10/0/0.123 when a packet is received:\nvpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 pop 1 16:37:42.721424 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 42: ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 There is no tag at all. What happened here is that when Gi10/0/0.123 received the frame, the \u0026lsquo;pop\u0026rsquo; operation stripped 1 VLAN tag off the frame. And as we\u0026rsquo;ll see later, when that sub-interface transmits a frame, the \u0026lsquo;pop\u0026rsquo; operation will add one VLAN tag (123) to the front of the frame.\nRemember how I pointed out above that the \u0026lsquo;pop\u0026rsquo; operation is symmetric? 
I can use that because if I were to also apply this on the Gi10/0/3.321 interface, then it will push the tag (of Gi10/0/3.321) onto the packet before sending it, and of course the other way around as well:\nvpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 pop 1 vpp# set interface l2 tag-rewrite GigabitEthernet10/0/3.321 pop 1 16:41:00.352840 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 46: ethertype 802.1Q (0x8100), vlan 321, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 16:41:00.352867 fe:54:00:00:10:03 \u0026gt; fe:54:00:00:10:00, length 46: ethertype 802.1Q (0x8100), vlan 321, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Reply 192.0.2.2 is-at fe:54:00:00:10:03, length 28 Hey look, there\u0026rsquo;s our ARP reply packet! That packet coming back into Gi10/0/3.321, when hitting the tag-rewrite, will in turn remove the tag, and the \u0026lsquo;pop\u0026rsquo; being symmetrical, will of course add a new tag 123 on egress of Gi10/0/0.123, and I can now see connectivity end to end. Neat!\nOther operations that are interesting, include arbitrarily adding a dot1q tag (or even two tags):\nvpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 push dot1q 100 16:45:33.121049 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 50: ethertype 802.1Q (0x8100), vlan 100, p 0, ethertype 802.1Q (0x8100), vlan 123, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 push dot1q 100 200 16:48:15.936807 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 54: ethertype 802.1Q (0x8100), vlan 100, p 0, ethertype 802.1Q (0x8100), vlan 200, p 0, ethertype 802.1Q (0x8100), vlan 123, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 And finally, swapping (translating) VLAN tags:\nvpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 translate 1-1 dot1ad 100 16:50:56.705015 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 46: ethertype 802.1Q-QinQ (0x88a8), vlan 100, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 vpp# set interface l2 tag-rewrite GigabitEthernet10/0/0.123 translate 1-1 dot1q 321 vpp# set interface l2 tag-rewrite GigabitEthernet10/0/3.321 translate 1-1 dot1q 123 16:44:03.462842 fe:54:00:00:10:00 \u0026gt; ff:ff:ff:ff:ff:ff, length 46: ethertype 802.1Q (0x8100), vlan 321, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Request who-has 192.0.2.2 tell 192.0.2.1, length 28 16:44:03.462847 fe:54:00:00:10:03 \u0026gt; fe:54:00:00:10:00, length 46: ethertype 802.1Q (0x8100), vlan 321, p 0, ethertype ARP (0x0806), Ethernet (len 6), IPv4 (len 4), Reply 192.0.2.2 is-at fe:54:00:00:10:03, length 28 This last set of \u0026rsquo;translate 1-1\u0026rsquo; has a similar effect to the \u0026lsquo;pop 1\u0026rsquo;, the VLAN is rewritten to 321 when receiving from Gi10/0/0.123, and it\u0026rsquo;s rewritten to 123 when receiving from Gi10/0/3.321, making end to end traffic possible again.\nFinal conclusion The four concepts discussed here can be combined in countless interesting ways:\nCreate sub-interface with or without exact-match, to handle certain encapsulated packets Provide layer2 crossconnect functionality between any two interfaces or sub-interfaces Add multiple interfaces and sub-interfaces into a bridge-domain Ensure that VLAN 
tags are popped and pushed consistently on tagged sub-interfaces The practical conclusion is that VPP can provide fully transparent, dot1q and jumboframe enabled virtual leased lines (see my previous post on VLL performance), including using regular breakout switches to greatly increase the total port count for customers.\nI\u0026rsquo;ll leave you with a working example of an L2VPN between a breakout switch behind nlams0.ipng.ch in Amsterdam and a remote VPP router in Zurich called ddln0.ipng.ch. Take the following S5860-20SQ switch, which connects to the VPP router on Te0/1 and a customer on Te0/2:\nfsw0(config)#vlan 3438 fsw0(config-vlan)#name v-vll-customer fsw0(config-vlan)#exit fsw0(config)#interface TenGigabitEthernet 0/1 fsw0(config-if-TenGigabitEthernet 0/1)#description Core: nlams0.ipng.ch Te6/0/0 fsw0(config-if-TenGigabitEthernet 0/1)#mtu 9216 fsw0(config-if-TenGigabitEthernet 0/1)#switchport mode trunk fsw0(config-if-TenGigabitEthernet 0/1)#switchport trunk allowed vlan add 3438 fsw0(config)#interface TenGigabitEthernet 0/2 fsw0(config-if-TenGigabitEthernet 0/2)#description Cust: Customer VLL Port NIKHEF fsw0(config-if-TenGigabitEthernet 0/2)#mtu 1522 fsw0(config-if-TenGigabitEthernet 0/2)#switchport mode dot1q-tunnel fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel native vlan 3438 fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel allowed vlan add untagged 3438 fsw0(config-if-TenGigabitEthernet 0/2)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000 I configure the first port here to be a VLAN trunk port to the router, and add VLAN 3438 to it. Then, I configure the second port to be a customer dot1q-tunnel port, which accepts untagged frames and puts them in VLAN 3438, and additionally accepts tagged frames in VLAN 1000-2000 and prepends the customer VLAN 3438 to them - so these will become QinQ double tagged 3438.1000-2000.\nThe corresponding snippet of the VPP router configuration as such:\ncomment { Customer VLL to DDLN } lcp lcp-auto-sub-int off create sub TenGigabitEthernet6/0/0 3438 dot1q 3438 set interface mtu packet 1518 TenGigabitEthernet6/0/0.3438 set interface state TenGigabitEthernet6/0/0.3438 up set interface l2 tag-rewrite TenGigabitEthernet6/0/0.3438 pop 1 create vxlan tunnel instance 12 src 194.1.163.32 dst 194.1.163.5 vni 320501 decap-next l2 set interface state vxlan_tunnel12 up set interface mtu packet 1518 vxlan_tunnel12 set interface l2 xconnect TenGigabitEthernet6/0/0.3438 vxlan_tunnel12 set interface l2 xconnect vxlan_tunnel12 TenGigabitEthernet6/0/0.3438 lcp lcp-auto-sub-int on The customer facing interfaces have an MTU of 1518 bytes, which is enough for the 1500 bytes of the IP packet, including 14 bytes of L2 overhead (src-mac, dst-mac, ethertype), and one optional VLAN tag. In other words, this VLL is dot1q capable, because the VPP sub-interface Te6/0/0.3438 did not specify exact-match, so it\u0026rsquo;ll accept any additional VLAN tags. Of course this does require the path from nlams0.ipng.ch to ddln0.ipng.ch to be (baby)jumbo enabled, which they are as AS8298 is fully 9000 byte capable.\n","date":"2022-02-14","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. 
For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nAfter completing the Linux CP plugin, interfaces and their attributes such as addresses and routes can be shared between VPP and the Linux kernel in a clever way, so running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates are easily in reach! But after the controlplane is up and running, VPP has so much more to offer - many interesting L2 and L3 services that you\u0026rsquo;d expect in commercial (and very pricy) routers like Cisco ASR are well within reach.\n","permalink":"https://ipng.ch/s/articles/2022/02/14/case-study-vlan-gymnastics-with-vpp/","section":"articles","title":"Case Study - VLAN Gymnastics with VPP"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nAfter completing the Linux CP plugin, interfaces and their attributes such as addresses and routes can be shared between VPP and the Linux kernel in a clever way, so running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates are easily in reach!\nIf you\u0026rsquo;ve read my previous articles (thank you!), you will have noticed that I have done a lot of work on making VPP work well in an ISP (BGP/OSPF) environment with Linux CP. However, there are many other cool things about VPP that make it a very competent advanced services router. One that has always been super interesting to me is being able to offer L2 connectivity over a wide area network. For example, a virtual leased line from our Colo in Zurich to Amsterdam NIKHEF. This article explores this space.\nNOTE: If you\u0026rsquo;re only interested in results, scroll all the way down to the markdown table and graph for performance stats.\nIntroduction ISPs can offer ethernet services, often called Virtual Leased Lines (VLLs), Layer2 VPN (L2VPN) or Ethernet Backhaul. They mean the same thing: imagine a switchport in location A that appears to be transparently and directly connected to a switchport in location B, with the ISP (layer3, so IPv4 and IPv6) network in between. The \u0026ldquo;simple\u0026rdquo; old-school setup would be to have switches which define VLANs and are all interconnected. But we collectively learned that it\u0026rsquo;s a bad idea for several reasons:\nLarge broadcast domains tend to encounter L2 forwarding loops sooner rather than later. Spanning-Tree and its kin are a stopgap, but they often disable an entire port from forwarding, which can be expensive if that port is connected to a dark fiber into another datacenter far away. Large VLAN setups that are intended to interconnect with other operators run into overlapping VLAN tags, which means switches have to do tag rewriting and filtering and such. Traffic engineering is all but non-existent in L2-only networking domains, while L3 has all sorts of smart TE extensions, ECMP, and so on. The canonical solution is for ISPs to encapsulate the ethernet traffic of their customers in some tunneling mechanism, for example in MPLS or in some IP tunneling protocol.
Fundamentally, these are the same, except for the chosen protocol and overhead/cost of forwarding. MPLS is a very thin layer under the packet, but other IP based tunneling mechanisms exist as well: commonly used are GRE, VXLAN and GENEVE, although many others exist.\nThey all work roughly the same:\nAn IP packet has a maximum transmission unit (MTU) of 1500 bytes, while the ethernet header is typically an additional 14 bytes: a 6 byte source MAC, 6 byte destination MAC, and 2 byte ethernet type, which is 0x0800 for an IPv4 datagram, 0x0806 for ARP, and 0x86dd for IPv6, and many others [ref]. If VLANs are used, an additional 4 bytes are needed [ref] making the ethernet frame at most 1518 bytes long, with an ethertype of 0x8100. If QinQ or QinAD are used, yet again 4 bytes are needed [ref], making the ethernet frame at most 1522 bytes long, with an ethertype of either 0x8100 or 0x9100, depending on the implementation. We can take such an ethernet frame, and make it the payload of another IP packet, encapsulating the original ethernet frame in a new IPv4 or IPv6 packet. We can then route it over an IP network to a remote site. Upon receipt of such a packet, by looking at the headers the remote router can determine that this packet represents an encapsulated ethernet frame, unpack it all, and forward the original frame onto a given interface. IP Tunneling Protocols First let\u0026rsquo;s get some theory out of the way \u0026ndash; I\u0026rsquo;ll discuss three common IP tunneling protocols here, and then move on to demonstrate how they are configured in VPP and perhaps more importantly, how they perform in VPP. Each tunneling protocol has its own advantages and disadvantages, but I\u0026rsquo;ll stick to the basics first:\nGRE: Generic Routing Encapsulation Generic Routing Encapsulation (GRE, described in RFC2784) is a very old and well known tunneling protocol. The packet is an IP datagram with protocol number 47, consisting of a header with 4 bits of flags, 9 reserved bits, 3 bits for the version (normally set to all-zeros), and 16 bits for the inner protocol (ether)type, so 0x0800 for IPv4, 0x8100 for 802.1q and so on. It\u0026rsquo;s a very small header of only 4 bytes and an optional key (4 bytes) and sequence number (also 4 bytes), which means that to be able to transport any ethernet frame (including the fancy QinQ and QinAD ones), the underlay must have an end to end MTU of at least 1522 + 20(IPv4)+12(GRE) = 1554 bytes for IPv4 and 1574 bytes for IPv6.\nVXLAN: Virtual Extensible LAN Virtual Extensible LAN (VXLAN, described in RFC7348) is a UDP datagram which has a header consisting of 8 bits worth of flags, 24 bits reserved for future expansion, 24 bits of Virtual Network Identifier (VNI) and an additional 8 bits of reserved space at the end. It uses UDP port 4789 as assigned by IANA. VXLAN encapsulation adds 20(IPv4)+8(UDP)+8(VXLAN) = 36 bytes, and considering the IPv6 header is 40 bytes, there it adds 56 bytes. This means that to be able to transport any ethernet frame, the underlay network must have an end to end MTU of at least 1522+36 = 1558 bytes for IPv4 and 1578 bytes for IPv6.\nGENEVE: Generic Network Virtualization Encapsulation GEneric NEtwork Virtualization Encapsulation (GENEVE, described in RFC8926) is somewhat similar to VXLAN, although it was an attempt to stop the wild growth of tunneling protocols; I\u0026rsquo;m sure there is an XKCD out there specifically for this approach.
The packet is also a UDP datagram with destination port 6081, followed by an 8 byte GENEVE specific header, containing 2 bits of version, 6 bits of option length, 8 bits of flags and reserved bits, a 16 bit inner ethertype, a 24 bit Virtual Network Identifier (VNI), and 8 bits of reserved space. With GENEVE, several options are available and will be tacked onto the GENEVE header, but they are typically not used. If they are though, the options can add an additional 16 bytes, which means that to be able to transport any ethernet frame, the underlay network must have an end to end MTU of at least 1522+52 = 1574 bytes for IPv4 and 1594 bytes for IPv6.\nHardware setup First let\u0026rsquo;s take a look at the physical setup. I\u0026rsquo;m using three servers and a switch in the IPng Networks lab:\nhvn0: Dell R720xd, load generator Dual E5-2620, 24 CPUs, 2 threads per core, 2 numa nodes 64GB DDR3 at 1333MT/s Intel X710 4-port 10G, Speed 8.0GT/s Width x8 (64 Gbit/s) Hippo and Rhino: VPP routers ASRock B550 Taichi Ryzen 5950X 32 CPUs, 2 threads per core, 1 numa node 64GB DDR4 at 2133 MT/s Intel 810C 2-port 100G, Speed 16.0 GT/s Width x16 (256 Gbit/s) fsw0: FS.com switch S5860-48SC, 8x 100G, 48x 10G VLAN 4 (blue) connects Rhino\u0026rsquo;s Hu12/0/1 to Hippo\u0026rsquo;s Hu12/0/1 VLAN 5 (red) connects hvn0\u0026rsquo;s enp5s0f0 to Rhino\u0026rsquo;s Hu12/0/0 VLAN 6 (green) connects hvn0\u0026rsquo;s enp5s0f1 to Hippo\u0026rsquo;s Hu12/0/0 All switchports have jumbo frames enabled and are set to 9216 bytes. Further, Hippo and Rhino are running VPP at head vpp v22.02-rc0~490-gde3648db0, and hvn0 is running T-Rex v2.93 in L2 mode, with MAC address 00:00:00:01:01:00 on the first port, and MAC address 00:00:00:02:01:00 on the second port. This machine can saturate 10G in both directions with small packets even when using only one flow, as can be seen, if the ports are just looped back onto one another, for example by physically crossconnecting them with an SFP+ or DAC; or in my case by putting fsw0 port Te0/1 and Te0/2 in the same VLAN together:\nNow that I have shared all the context and hardware, I\u0026rsquo;m ready to actually dive in to what I wanted to talk about: what all this virtual leased line business looks like in VPP. Ready? Here we go!\nDirect L2 CrossConnect The simplest thing I can show in VPP is to configure a layer2 cross-connect (l2 xconnect) between two ports. In this case, VPP doesn\u0026rsquo;t even need to have an IP address, all I do is bring up the ports, set their MTU to be able to carry 1522 byte frames (ethernet at 1514, dot1q at 1518, and QinQ at 1522 bytes). The configuration is identical on both Rhino and Hippo:\nset interface state HundredGigabitEthernet12/0/0 up set interface state HundredGigabitEthernet12/0/1 up set interface mtu packet 1522 HundredGigabitEthernet12/0/0 set interface mtu packet 1522 HundredGigabitEthernet12/0/1 set interface l2 xconnect HundredGigabitEthernet12/0/0 HundredGigabitEthernet12/0/1 set interface l2 xconnect HundredGigabitEthernet12/0/1 HundredGigabitEthernet12/0/0 I\u0026rsquo;d say the only thing to keep in mind here is that the cross-connect commands only link in one direction (receive in A, forward to B), and that\u0026rsquo;s why I have to type them twice (receive in B, forward to A). Of course, this must be really cheap on VPP \u0026ndash; because all it has to do now is receive from DPDK and immediately schedule for transmit on the other port.
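Before diving into the runtime statistics, a rough sanity check on the cycle budget helps frame them. This is my own arithmetic, assuming a core clock of roughly 3.4 GHz (the base clock of the Ryzen 5950X in Rhino and Hippo; boost clocks only add headroom):
CPU_HZ = 3.4e9           # assumed Ryzen 5950X base clock
LINE_RATE_PPS = 14.88e6  # 10G line rate at 64 byte frames

print(f"{CPU_HZ / LINE_RATE_PPS:.0f} cycles available per packet")  # ~229
Staying well under that per-packet budget is what makes single-thread line rate forwarding possible, as the show runtime output below illustrates.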
Looking at show runtime I can see how much CPU time is spent in each of VPP\u0026rsquo;s nodes:\nTime 1241.5, 10 sec internal node vector rate 28.70 loops/sec 475009.85 vector rates in 1.4879e7, out 1.4879e7, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/1-o 650727833 18472218801 7.49e0 28.39 HundredGigabitEthernet12/0/1-t 650727833 18472218801 4.12e1 28.39 ethernet-input 650727833 18472218801 5.55e1 28.39 l2-input 650727833 18472218801 1.52e1 28.39 l2-output 650727833 18472218801 1.32e1 28.39 In this simple cross connect mode, the only thing VPP has to do is receive the ethernet, funnel it into l2-input, and immediately send it straight through l2-output back out, which does not cost much in terms of CPU cycles at all. In total, this CPU thread is forwarding 14.88Mpps (line rate 10G at 64 bytes), at an average of 133 cycles per packet (not counting the time spent in DPDK). The CPU has room to spare in this mode, in other words even one CPU thread can handle this workload at line rate, impressive!\nAlthough cool, doing an L2 crossconnect like this isn\u0026rsquo;t super useful. Usually, the customer leased line has to be transported to another location, and for that we\u0026rsquo;ll need some form of encapsulation \u0026hellip;\nCrossconnect over IPv6 VXLAN Let\u0026rsquo;s start with VXLAN. The concept is pretty straight forward in VPP. Based on the configuration I put in Rhino and Hippo above, I first will have to bring Hu12/0/1 out of L2 mode, give both interfaces an IPv6 address, create a tunnel with a given VNI, and then crossconnect the customer side Hu12/0/0 into the vxlan_tunnel0 and vice-versa. Piece of cake:\n## On Rhino set interface mtu packet 1600 HundredGigabitEthernet12/0/1 set interface l3 HundredGigabitEthernet12/0/1 set interface ip address HundredGigabitEthernet12/0/1 2001:db8::1/64 create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 set interface state vxlan_tunnel0 up set interface mtu packet 1522 vxlan_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 ## On Hippo set interface mtu packet 1600 HundredGigabitEthernet12/0/1 set interface l3 HundredGigabitEthernet12/0/1 set interface ip address HundredGigabitEthernet12/0/1 2001:db8::2/64 create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 set interface state vxlan_tunnel0 up set interface mtu packet 1522 vxlan_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 Of course, now we\u0026rsquo;re actually beginning to make VPP do some work, and the exciting thing is, if there would be an (opaque) ISP network between Rhino and Hippo, this would work just fine considering the encapsulation is \u0026lsquo;just\u0026rsquo; IPv6 UDP. 
Under the covers, for each received frame, VPP has to encapsulate it into VXLAN, and route the resulting L3 packet by doing an IPv6 routing table lookup:\nTime 10.0, 10 sec internal node vector rate 256.00 loops/sec 32132.74 vector rates in 8.5423e6, out 8.5423e6, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/0-o 333777 85445944 2.74e0 255.99 HundredGigabitEthernet12/0/0-t 333777 85445944 5.28e1 255.99 ethernet-input 333777 85445944 4.25e1 255.99 ip6-input 333777 85445944 1.25e1 255.99 ip6-lookup 333777 85445944 2.41e1 255.99 ip6-receive 333777 85445944 1.71e2 255.99 ip6-udp-lookup 333777 85445944 1.55e1 255.99 l2-input 333777 85445944 8.94e0 255.99 l2-output 333777 85445944 4.44e0 255.99 vxlan6-input 333777 85445944 2.12e1 255.99 I can definitely see a lot more action here. In this mode, VPP is handling 8.54Mpps on this CPU thread before saturating. At full load, VPP is spending 356 CPU cycles per packet, of which almost half is in node ip6-receive.\nCrossconnect over IPv4 VXLAN Seeing ip6-receive being such a big part of the cost (almost half!), I wonder what it might look like if I change the tunnel to use IPv4. So I\u0026rsquo;ll give Rhino and Hippo an IPv4 address as well, delete the vxlan tunnel I made before (the IPv6 one), and create a new one with IPv4:\nset interface ip address HundredGigabitEthernet12/0/1 10.0.0.0/31 create vxlan tunnel instance 0 src 2001:db8::1 dst 2001:db8::2 vni 8298 del create vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 set interface state vxlan_tunnel0 up set interface mtu packet 1522 vxlan_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 set interface ip address HundredGigabitEthernet12/0/1 10.0.0.1/31 create vxlan tunnel instance 0 src 2001:db8::2 dst 2001:db8::1 vni 8298 del create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 set interface state vxlan_tunnel0 up set interface mtu packet 1522 vxlan_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 vxlan_tunnel0 set interface l2 xconnect vxlan_tunnel0 HundredGigabitEthernet12/0/0 And after letting this run for a few seconds, I can take a look and see how the ip4-* version of the VPP code performs:\nTime 10.0, 10 sec internal node vector rate 256.00 loops/sec 53309.71 vector rates in 1.4151e7, out 1.4151e7, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/0-o 552890 141539600 2.76e0 255.99 HundredGigabitEthernet12/0/0-t 552890 141539600 5.30e1 255.99 ethernet-input 552890 141539600 4.13e1 255.99 ip4-input-no-checksum 552890 141539600 1.18e1 255.99 ip4-lookup 552890 141539600 1.68e1 255.99 ip4-receive 552890 141539600 2.74e1 255.99 ip4-udp-lookup 552890 141539600 1.79e1 255.99 l2-input 552890 141539600 8.68e0 255.99 l2-output 552890 141539600 4.41e0 255.99 vxlan4-input 552890 141539600 1.76e1 255.99 Throughput is now quite a bit higher, clocking a cool 14.2Mpps (just short of line rate!) at 202 CPU cycles per packet, considerably less time spent than in IPv6, but keep in mind that VPP has an ~empty routing table in all of these tests.\nCrossconnect over IPv6 GENEVE Another popular cross connect type, also based on IPv4 and IPv6 UDP packets, is GENEVE.
The configuration is almost identical, so I delete the IPv4 VXLAN and create an IPv6 GENEVE tunnel instead:\ncreate vxlan tunnel instance 0 src 10.0.0.0 dst 10.0.0.1 vni 8298 del create geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 set interface state geneve_tunnel0 up set interface mtu packet 1522 geneve_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 create vxlan tunnel instance 0 src 10.0.0.1 dst 10.0.0.0 vni 8298 del create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 set interface state geneve_tunnel0 up set interface mtu packet 1522 geneve_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 All the while, the TRex on the customer machine hvn0, is sending 14.88Mpps in both directions, and after just a short (second or so) interruption, the GENEVE tunnel comes up, cross-connects into the customer Hu12/0/0 interfaces, and starts to carry traffic:\nThread 8 vpp_wk_7 (lcore 8) Time 10.0, 10 sec internal node vector rate 256.00 loops/sec 29688.03 vector rates in 8.3179e6, out 8.3179e6, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/0-o 324981 83194664 2.74e0 255.99 HundredGigabitEthernet12/0/0-t 324981 83194664 5.18e1 255.99 ethernet-input 324981 83194664 4.26e1 255.99 geneve6-input 324981 83194664 3.87e1 255.99 ip6-input 324981 83194664 1.22e1 255.99 ip6-lookup 324981 83194664 2.39e1 255.99 ip6-receive 324981 83194664 1.67e2 255.99 ip6-udp-lookup 324981 83194664 1.54e1 255.99 l2-input 324981 83194664 9.28e0 255.99 l2-output 324981 83194664 4.47e0 255.99 Similar to VXLAN when using IPv6 the total for GENEVE-v6 is also comparatively slow (I say comparatively because you should not expect anything like this performance when using Linux or BSD kernel routing!). The lower throughput is again due to the ip6-receive node being costly. It is just slightly worse of a performer at 8.32Mpps per core and 368 CPU cycles per packet.\nCrossconnect over IPv4 GENEVE I am now suspecting that GENEVE over IPv4 would have similar gains to when I switched from VXLAN IPv6 to IPv4 above. 
So I remove the IPv6 tunnel, create a new IPv4 tunnel instead, and hook it back up to the customer port on both Rhino and Hippo, like so:\ncreate geneve tunnel local 2001:db8::1 remote 2001:db8::2 vni 8298 del create geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 set interface state geneve_tunnel0 up set interface mtu packet 1522 geneve_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 create geneve tunnel local 2001:db8::2 remote 2001:db8::1 vni 8298 del create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 set interface state geneve_tunnel0 up set interface mtu packet 1522 geneve_tunnel0 set interface l2 xconnect HundredGigabitEthernet12/0/0 geneve_tunnel0 set interface l2 xconnect geneve_tunnel0 HundredGigabitEthernet12/0/0 And the results, indeed a significant improvement:\nTime 10.0, 10 sec internal node vector rate 256.00 loops/sec 48639.97 vector rates in 1.3737e7, out 1.3737e7, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/0-o 536763 137409904 2.76e0 255.99 HundredGigabitEthernet12/0/0-t 536763 137409904 5.19e1 255.99 ethernet-input 536763 137409904 4.19e1 255.99 geneve4-input 536763 137409904 2.39e1 255.99 ip4-input-no-checksum 536763 137409904 1.18e1 255.99 ip4-lookup 536763 137409904 1.69e1 255.99 ip4-receive 536763 137409904 2.71e1 255.99 ip4-udp-lookup 536763 137409904 1.79e1 255.99 l2-input 536763 137409904 8.81e0 255.99 l2-output 536763 137409904 4.47e0 255.99 So, close to line rate again! Performance of GENEVE-v4 clocks in at 13.7Mpps per core or 207 CPU cycles per packet.\nCrossconnect over IPv6 GRE Now I can\u0026rsquo;t help but wonder, that if those ip4|6-udp-lookup nodes burn valuable CPU cycles, GRE will possibly do better, because it\u0026rsquo;s an L3 protocol (proto number 47) and will never have to inspect beyond the IP header, so I delete the GENEVE tunnel and give GRE a go too:\ncreate geneve tunnel local 10.0.0.0 remote 10.0.0.1 vni 8298 del create gre tunnel src 2001:db8::1 dst 2001:db8::2 teb set interface state gre0 up set interface mtu packet 1522 gre0 set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 create geneve tunnel local 10.0.0.1 remote 10.0.0.0 vni 8298 del create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb set interface state gre0 up set interface mtu packet 1522 gre0 set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 Results:\nTime 10.0, 10 sec internal node vector rate 255.99 loops/sec 37129.87 vector rates in 9.9254e6, out 9.9254e6, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/0-o 387881 99297464 2.80e0 255.99 HundredGigabitEthernet12/0/0-t 387881 99297464 5.21e1 255.99 ethernet-input 775762 198594928 5.97e1 255.99 gre6-input 387881 99297464 2.81e1 255.99 ip6-input 387881 99297464 1.21e1 255.99 ip6-lookup 387881 99297464 2.39e1 255.99 ip6-receive 387881 99297464 5.09e1 255.99 l2-input 387881 99297464 9.35e0 255.99 l2-output 387881 99297464 4.40e0 255.99 The performance of GRE-v6 (in transparent ethernet bridge aka TEB mode) is 9.9Mpps per core or 243 CPU cycles per packet, and I\u0026rsquo;ll also note that while the ip6-receive node in all the UDP based tunneling were in the 170 clocks/packet arena, now we\u0026rsquo;re down to only 51 or so, so indeed a huge improvement.\nCrossconnect over IPv4 
GRE To round off the set, I\u0026rsquo;ll remove the IPv6 GRE tunnel and put an IPv4 GRE tunnel in place:\ncreate gre tunnel src 2001:db8::1 dst 2001:db8::2 teb del create gre tunnel src 10.0.0.0 dst 10.0.0.1 teb set interface state gre0 up set interface mtu packet 1522 gre0 set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 create gre tunnel src 2001:db8::2 dst 2001:db8::1 teb del create gre tunnel src 10.0.0.1 dst 10.0.0.0 teb set interface state gre0 up set interface mtu packet 1522 gre0 set interface l2 xconnect HundredGigabitEthernet12/0/0 gre0 set interface l2 xconnect gre0 HundredGigabitEthernet12/0/0 And without further ado:\nTime 10.0, 10 sec internal node vector rate 255.87 loops/sec 52898.61 vector rates in 1.4039e7, out 1.4039e7, drop 0.0000e0, punt 0.0000e0 Name Calls Vectors Clocks Vectors/Call HundredGigabitEthernet12/0/0-o 548684 140435080 2.80e0 255.95 HundredGigabitEthernet12/0/0-t 548684 140435080 5.22e1 255.95 ethernet-input 1097368 280870160 2.92e1 255.95 gre4-input 548684 140435080 2.51e1 255.95 ip4-input-no-checksum 548684 140435080 1.19e1 255.95 ip4-lookup 548684 140435080 1.68e1 255.95 ip4-receive 548684 140435080 2.03e1 255.95 l2-input 548684 140435080 8.72e0 255.95 l2-output 548684 140435080 4.43e0 255.95 The performance of GRE-v4 (in transparent ethernet bridge aka TEB mode) is 14.0Mpps per core or 171 CPU cycles per packet. This is really very low, the best of all the tunneling protocols, but (for obvious reasons) will not outperform a direct L2 crossconnect, as that cuts out the L3 (and L4) middleperson entirely. Whohoo!\nConclusions First, let me give a recap of the tests I did, from left to right the better to worse performer.\nTest L2XC GRE-v4 VXLAN-v4 GENEVE-v4 GRE-v6 VXLAN-v6 GENEVE-v6 pps/core \u0026gt;14.88M 14.34M 14.15M 13.74M 9.93M 8.54M 8.32M cycles/packet 132.59 171.45 201.65 207.44 243.35 355.72 368.09 (!) Achtung! Because in the L2XC mode the CPU was not fully consumed (VPP was consuming only ~28 frames per vector), it did not yet achieve its optimum CPU performance. Under full load, the cycles/packet will be somewhat lower than what is shown here.\nTaking a closer look at the VPP nodes in use, below I draw a graph of CPU cycles spent in each VPP node, for each type of cross connect, where the lower the stack is, the faster cross connect will be:\nAlthough clearly GREv4 is the winner, I still would not use it for the following reason: VPP does not support GRE keys, and considering it is an L3 protocol, I will have to use unique IPv4 or IPv6 addresses for each tunnel src/dst pair, otherwise VPP will not know upon receipt of a GRE packet, which tunnel it belongs to. For IPv6 this is not a huge deal (I can bind a whole /64 to a loopback and just be done with it), but GREv6 does not perform as well as VXLAN-v4 or GENEVE-v4.\nVXLAN and GENEVE are equal performers, both in IPv4 and in IPv6. In both cases, IPv4 is significantly faster than IPv6. But due to the use of VNI fields in the header, contrary to GRE, both VXLAN and GENEVE can have the same src/dst IP for any number of tunnels, which is a huge benefit.\nMultithreading Usually, the customer facing side is an ethernet port (or sub-interface with tag popping) that will be receiving IPv4 or IPv6 traffic (either tagged or untagged) and this allows the NIC to use RSS to assign this inbound traffic to multiple queues, and thus multiple CPU threads. 
That\u0026rsquo;s great, it means linear encapsulation performance.\nOnce the traffic is encapsulated, it risks becoming single flow with respect to the remote host, if Rhino were to send from 10.0.0.0:4789 to Hippo\u0026rsquo;s 10.0.0.1:4789. However, the VPP VXLAN and GENEVE implementations both inspect the inner payload, and use it to scramble the source port (thanks to Neale for pointing this out, it\u0026rsquo;s in vxlan/encap.c:246). Deterministically changing the source port based on the inner-flow will allow Hippo to use RSS on the receiving end, which allows these tunneling protocols to scale linearly. I proved this for myself by attaching a port-mirror to the switch and copying all traffic between Hippo and Rhino to a spare machine in the rack:\npim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 4789 11:19:54.887763 IP 10.0.0.1.4452 \u0026gt; 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298 11:19:54.888283 IP 10.0.0.1.42537 \u0026gt; 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298 11:19:54.888285 IP 10.0.0.0.17895 \u0026gt; 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298 11:19:54.899353 IP 10.0.0.1.40751 \u0026gt; 10.0.0.0.4789: VXLAN, flags [I] (0x08), vni 8298 11:19:54.899355 IP 10.0.0.0.35475 \u0026gt; 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298 11:19:54.904642 IP 10.0.0.0.60633 \u0026gt; 10.0.0.1.4789: VXLAN, flags [I] (0x08), vni 8298 pim@hvn1:~$ sudo tcpdump -ni enp5s0f3 port 6081 11:22:55.802406 IP 10.0.0.0.32299 \u0026gt; 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a: 11:22:55.802409 IP 10.0.0.1.44011 \u0026gt; 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a: 11:22:55.807711 IP 10.0.0.1.45503 \u0026gt; 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a: 11:22:55.807712 IP 10.0.0.0.45532 \u0026gt; 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a: 11:22:55.841495 IP 10.0.0.0.61694 \u0026gt; 10.0.0.1.6081: Geneve, Flags [none], vni 0x206a: 11:22:55.851719 IP 10.0.0.1.47581 \u0026gt; 10.0.0.0.6081: Geneve, Flags [none], vni 0x206a: Considering I was sending the T-Rex profile bench.py with tunables vm=var2,size=64, the former of which chooses randomized source and destination (inner) IP addresses in the loadtester, I can conclude that the outer source port is chosen based on a hash of the inner packet. Slick!!\nFinal conclusion The most important practical conclusion to draw is that I can feel safe to offer L2VPN services at IPng Networks using VPP and a VXLAN or GENEVE IPv4 underlay \u0026ndash; our backbone is 9000 bytes everywhere, so it will be possible to provide up to 8942 bytes of customer payload taking into account the VXLAN-v4 overhead. At least gigabit symmetric VLLs filled with 64b packets will not be a problem for the routers we have, as they forward approximately 10.2Mpps per core and 35Mpps per chassis when fully loaded.
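As an aside, here is a small sketch (my own arithmetic, using the VXLAN-v4 overhead discussed earlier in this article and assuming 22 bytes of inner L2 header for a QinQ tagged customer frame) of where that 8942 byte payload figure comes from:
UNDERLAY_MTU = 9000             # AS8298 backbone MTU
VXLAN_V4_OVERHEAD = 20 + 8 + 8  # outer IPv4 + UDP + VXLAN headers = 36 bytes
INNER_L2_HEADER = 14 + 4 + 4    # customer ethernet header plus dot1q and QinQ tags

print(UNDERLAY_MTU - VXLAN_V4_OVERHEAD - INNER_L2_HEADER)  # 8942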
Even considering the overhead and CPU consumption that VXLAN encap/decap brings with it, due to the use of multiple transmit and receive threads, the router would have plenty of room to spare.\nAppendix The backing data for the graph in this article are captured in this Google Sheet.\nVPP Configuration For completeness, the startup.conf used on both Rhino and Hippo:\nunix { nodaemon log /var/log/vpp/vpp.log full-coredump cli-listen /run/vpp/cli.sock cli-prompt rhino# gid vpp } api-trace { on } api-segment { gid vpp } socksvr { default } memory { main-heap-size 1536M main-heap-page-size default-hugepage } cpu { main-core 0 corelist-workers 1-15 } buffers { buffers-per-numa 300000 default data-size 2048 page-size default-hugepage } statseg { size 1G page-size default-hugepage per-node-counters off } dpdk { dev default { num-rx-queues 7 } decimal-interface-names dev 0000:0c:00.0 dev 0000:0c:00.1 } plugins { plugin lcpng_nl_plugin.so { enable } plugin lcpng_if_plugin.so { enable } } logging { default-log-level info default-syslog-log-level crit class linux-cp/if { rate-limit 10000 level debug syslog-level debug } class linux-cp/nl { rate-limit 10000 level debug syslog-level debug } } lcpng { default netns dataplane lcp-sync lcp-auto-subint } Other Details For posterity, some other stats on the VPP deployment. First of all, a confirmation that PCIe 4.0 x16 slots were used, and that the Comms DDP was loaded:\n[ 0.433903] pci 0000:0c:00.0: [8086:1592] type 00 class 0x020000 [ 0.433924] pci 0000:0c:00.0: reg 0x10: [mem 0xea000000-0xebffffff 64bit pref] [ 0.433946] pci 0000:0c:00.0: reg 0x1c: [mem 0xee010000-0xee01ffff 64bit pref] [ 0.433964] pci 0000:0c:00.0: reg 0x30: [mem 0xfcf00000-0xfcffffff pref] [ 0.434104] pci 0000:0c:00.0: reg 0x184: [mem 0xed000000-0xed01ffff 64bit pref] [ 0.434106] pci 0000:0c:00.0: VF(n) BAR0 space: [mem 0xed000000-0xedffffff 64bit pref] (contains BAR0 for 128 VFs) [ 0.434128] pci 0000:0c:00.0: reg 0x190: [mem 0xee220000-0xee223fff 64bit pref] [ 0.434129] pci 0000:0c:00.0: VF(n) BAR3 space: [mem 0xee220000-0xee41ffff 64bit pref] (contains BAR3 for 128 VFs) [ 11.216343] ice 0000:0c:00.0: The DDP package was successfully loaded: ICE COMMS Package version 1.3.30.0 [ 11.280567] ice 0000:0c:00.0: PTP init successful [ 11.317826] ice 0000:0c:00.0: DCB is enabled in the hardware, max number of TCs supported on this port are 8 [ 11.317828] ice 0000:0c:00.0: FW LLDP is disabled, DCBx/LLDP in SW mode. 
[ 11.317829] ice 0000:0c:00.0: Commit DCB Configuration to the hardware [ 11.320608] ice 0000:0c:00.0: 252.048 Gb/s available PCIe bandwidth (16.0 GT/s PCIe x16 link) And how the NIC shows up in VPP, in particular the rx/tx burst modes and functions are interesting:\nhippo# show hardware-interfaces Name Idx Link Hardware HundredGigabitEthernet12/0/0 1 up HundredGigabitEthernet12/0/0 Link speed: 100 Gbps RX Queues: queue thread mode 0 vpp_wk_0 (1) polling 1 vpp_wk_1 (2) polling 2 vpp_wk_2 (3) polling 3 vpp_wk_3 (4) polling 4 vpp_wk_4 (5) polling 5 vpp_wk_5 (6) polling 6 vpp_wk_6 (7) polling Ethernet address b4:96:91:b3:b1:10 Intel E810 Family carrier up full duplex mtu 9190 promisc flags: admin-up promisc maybe-multiseg tx-offload intel-phdr-cksum rx-ip4-cksum int-supported Devargs: rx: queues 7 (max 64), desc 1024 (min 64 max 4096 align 32) tx: queues 16 (max 64), desc 1024 (min 64 max 4096 align 32) pci: device 8086:1592 subsystem 8086:0002 address 0000:0c:00.00 numa 0 max rx packet len: 9728 promiscuous: unicast on all-multicast on vlan offload: strip off filter off qinq off rx offload avail: vlan-strip ipv4-cksum udp-cksum tcp-cksum qinq-strip outer-ipv4-cksum vlan-filter vlan-extend jumbo-frame scatter keep-crc rss-hash rx offload active: ipv4-cksum jumbo-frame scatter tx offload avail: vlan-insert ipv4-cksum udp-cksum tcp-cksum sctp-cksum tcp-tso outer-ipv4-cksum qinq-insert multi-segs mbuf-fast-free outer-udp-cksum tx offload active: ipv4-cksum udp-cksum tcp-cksum multi-segs rss avail: ipv4-frag ipv4-tcp ipv4-udp ipv4-sctp ipv4-other ipv4 ipv6-frag ipv6-tcp ipv6-udp ipv6-sctp ipv6-other ipv6 l2-payload rss active: ipv4-frag ipv4-tcp ipv4-udp ipv4 ipv6-frag ipv6-tcp ipv6-udp ipv6 tx burst mode: Scalar tx burst function: ice_recv_scattered_pkts_vec_avx2_offload rx burst mode: Offload Vector AVX2 Scattered rx burst function: ice_xmit_pkts Finally, in case it\u0026rsquo;s interesting, an output of lscpu, lspci and dmidecode as run on Hippo (Rhino is an identical machine).\n","date":"2022-01-12","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two.\nAfter completing the Linux CP plugin, interfaces and their attributes such as addresses and routes can be shared between VPP and the Linux kernel in a clever way, so running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates are easily in reach!\n","permalink":"https://ipng.ch/s/articles/2022/01/12/case-study-virtual-leased-line-vll-in-vpp/","section":"articles","title":"Case Study - Virtual Leased Line (VLL) in VPP"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. 
IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nBefore we head off into the end of year holidays, I thought I\u0026rsquo;d make good on a promise I made a while ago, and that\u0026rsquo;s to explain how to create a Debian (Buster or Bullseye), or Ubuntu (Focal Fossai LTS) virtual machine running in Qemu/KVM into a working setup with both Free Range Routing and Bird installed side by side.\nNOTE: If you\u0026rsquo;re just interested in the resulting image, here\u0026rsquo;s the most pertinent information:\nvpp-proto.qcow2.lrz [Download] SHA256 `a5fdf157c03f2d202dcccdf6ed97db49c8aa5fdb6b9ca83a1da958a8a24780ab Debian Bookworm (12.11) and VPP 25.10-rc0~49-g90d92196 CPU Make sure the (virtualized) CPU supports AVX RAM The image needs at least 4GB of RAM, and the hypervisor should support hugepages and AVX Username: ipng with password: ipng loves vpp and is sudo-enabled Root Password: IPng loves VPP Of course, I do recommend that you change the passwords for the ipng and root user as soon as you boot the VM. I am offering the KVM images as-is and without any support. Contact us if you\u0026rsquo;d like to discuss support on commission.\nReminder - Linux CP Vector Packet Processing by itself offers only a dataplane implementation, that is to say it cannot run controlplane software like OSPF, BGP, LDP etc out of the box. However, VPP allows plugins to offer additional functionalty. Rather than adding the routing protocols as VPP plugins, I much rather leverage high quality and well supported community efforts like FRR or Bird.\nI wrote a series of in-depth articles explaining in detail the design and implementation, but for the purposes of this article, I will keep it brief. The Linux Control Plane (LCP) is a set of two plugins:\nThe Interface plugin is responsible for taking VPP interfaces (like ethernet, tunnel, bond) and exposing them in Linux as a TAP device. When configuration such as link MTU, state, MAC address or IP address are applied in VPP, the plugin will copy this forward into the host interface representation. The Netlink plugin is responsible for taking events in Linux (like a user setting an IP address or route, or the system receiving ARP or IPv6 neighbor request/reply from neighbors), and applying these events to the VPP dataplane. I\u0026rsquo;ve published the code on Github and I am targeting a release in upstream VPP, hoping to make the upcoming 22.02 release in February 2022. I have a lot of ground to cover, but I will note that the plugin has been running in production in AS8298 since Sep'21 and no crashes related to LinuxCP have been observed.\nTo help tinkerers, this article describes a KVM disk image in qcow2 format, which will boot a vanilla Debian install and further comes pre-loaded with a fully functioning VPP, LinuxCP and both FRR and Bird controlplane environment. I\u0026rsquo;ll go into detail on precisely how you can build your own. 
Of course, you\u0026rsquo;re welcome to just take the results of this work and download the qcow2 image above.\nBuilding the Debian KVM image In this section I\u0026rsquo;ll try to be precise in the steps I took to create the KVM qcow2 image, in case you\u0026rsquo;re interested in reproducing for yourself. Overall, I find that reading about how folks build images teaches me a lot about the underlying configurations, and I\u0026rsquo;m as well keen on remembering how to do it myself, so this article serves as well as reference documentation for IPng Networks in case we want to build images in the future.\nStep 1. Install Debian For this, I\u0026rsquo;ll use virt-install completely on the prompt of my workstation, a Linux machine which is running Ubuntu Hirsute (21.04). Assuming KVM is installed ref and already running, let\u0026rsquo;s build a simple Debian Bullseye qcow2 bootdisk:\npim@chumbucket:~$ sudo apt-get install qemu-kvm libvirt-daemon-system libvirt-clients bridge-utils pim@chumbucket:~$ sudo apt-get install virtinst pim@chumbucket:~$ sudo adduser `id -un` libvirt pim@chumbucket:~$ sudo adduser `id -un` kvm pim@chumbucket:~$ qemu-img create -f qcow2 vpp-proto.qcow2 8G pim@chumbucket:~$ virt-install --virt-type kvm --name vpp-proto \\ --location http://deb.debian.org/debian/dists/bullseye/main/installer-amd64/ \\ --os-variant debian10 \\ --disk /home/pim/vpp-proto.qcow2,bus=virtio \\ --memory 4096 \\ --graphics none \\ --network=bridge:mgmt \\ --console pty,target_type=serial \\ --extra-args \u0026#34;console=ttyS0\u0026#34; \\ --check all=off Note: You may want to use a different network bridge, commonly bridge:virbr0. In my case, the network which runs DHCP is on a bridge called mgmt. And, just for pedantry, it\u0026rsquo;s good to make yourself a member of groups kvm and libvirt so that you can run most virsh commands as an unprivileged user.\nDuring the Debian Bullseye install, I try to leave everything as vanilla as possible, but I do enter the following specifics:\nRoot Password the string IPng loves VPP User login ipng with password ipng loves vpp Disk is entirely in one partition / (all 8GB of it), no swap Software selection remove everything but SSH server and standard system utilities When the machine is done installing, it\u0026rsquo;ll reboot and I\u0026rsquo;ll log in as root to install a few packages, most notably sudo which will allow the user ipng to act as root. The other seemingly weird packages are to help the VPP install along later.\nroot@vpp-proto:~# apt install rsync net-tools traceroute snmpd snmp iptables sudo gnupg2 \\ curl libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 libnl-route-3-200 \\ libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser libsubunit0 root@vpp-proto:~# adduser ipng sudo root@vpp-proto:~# poweroff Finally, after I stop the VM, I\u0026rsquo;ll edit its XML config to give it a few VirtIO NICs to play with, nicely grouped on the same virtual PCI bus/slot. 
I look for the existing \u0026lt;interface\u0026gt; block that virt-install added for me, and add four new ones under that, all added to a newly created bridge called empty, for now:\npim@chumbucket:~$ sudo brctl addbr empty pim@chumbucket:~$ virsh edit vpp-proto \u0026lt;interface type=\u0026#39;bridge\u0026#39;\u0026gt; \u0026lt;mac address=\u0026#39;52:54:00:13:10:00\u0026#39;/\u0026gt; \u0026lt;source bridge=\u0026#39;empty\u0026#39;/\u0026gt; \u0026lt;target dev=\u0026#39;vpp-e0\u0026#39;/\u0026gt; \u0026lt;model type=\u0026#39;virtio\u0026#39;/\u0026gt; \u0026lt;mtu size=\u0026#39;9000\u0026#39;/\u0026gt; \u0026lt;address type=\u0026#39;pci\u0026#39; domain=\u0026#39;0x0000\u0026#39; bus=\u0026#39;0x10\u0026#39; slot=\u0026#39;0x00\u0026#39; function=\u0026#39;0x0\u0026#39; multifunction=\u0026#39;on\u0026#39;/\u0026gt; \u0026lt;/interface\u0026gt; \u0026lt;interface type=\u0026#39;bridge\u0026#39;\u0026gt; \u0026lt;mac address=\u0026#39;52:54:00:13:10:01\u0026#39;/\u0026gt; \u0026lt;source bridge=\u0026#39;empty\u0026#39;/\u0026gt; \u0026lt;target dev=\u0026#39;vpp-e1\u0026#39;/\u0026gt; \u0026lt;model type=\u0026#39;virtio\u0026#39;/\u0026gt; \u0026lt;mtu size=\u0026#39;9000\u0026#39;/\u0026gt; \u0026lt;address type=\u0026#39;pci\u0026#39; domain=\u0026#39;0x0000\u0026#39; bus=\u0026#39;0x10\u0026#39; slot=\u0026#39;0x00\u0026#39; function=\u0026#39;0x1\u0026#39;/\u0026gt; \u0026lt;/interface\u0026gt; \u0026lt;interface type=\u0026#39;bridge\u0026#39;\u0026gt; \u0026lt;mac address=\u0026#39;52:54:00:13:10:02\u0026#39;/\u0026gt; \u0026lt;source bridge=\u0026#39;empty\u0026#39;/\u0026gt; \u0026lt;target dev=\u0026#39;vpp-e2\u0026#39;/\u0026gt; \u0026lt;model type=\u0026#39;virtio\u0026#39;/\u0026gt; \u0026lt;mtu size=\u0026#39;9000\u0026#39;/\u0026gt; \u0026lt;address type=\u0026#39;pci\u0026#39; domain=\u0026#39;0x0000\u0026#39; bus=\u0026#39;0x10\u0026#39; slot=\u0026#39;0x00\u0026#39; function=\u0026#39;0x2\u0026#39;/\u0026gt; \u0026lt;/interface\u0026gt; \u0026lt;interface type=\u0026#39;bridge\u0026#39;\u0026gt; \u0026lt;mac address=\u0026#39;52:54:00:13:10:03\u0026#39;/\u0026gt; \u0026lt;source bridge=\u0026#39;empty\u0026#39;/\u0026gt; \u0026lt;target dev=\u0026#39;vpp-e3\u0026#39;/\u0026gt; \u0026lt;model type=\u0026#39;virtio\u0026#39;/\u0026gt; \u0026lt;mtu size=\u0026#39;9000\u0026#39;/\u0026gt; \u0026lt;address type=\u0026#39;pci\u0026#39; domain=\u0026#39;0x0000\u0026#39; bus=\u0026#39;0x10\u0026#39; slot=\u0026#39;0x00\u0026#39; function=\u0026#39;0x3\u0026#39;/\u0026gt; \u0026lt;/interface\u0026gt; pim@chumbucket:~$ virsh start --console vpp-proto And with that, I have a lovely virtual machine to play with, serial and all, beautiful! Step 2. Compile VPP + Linux CP Compiling DPDK and VPP can both take a while, and to avoid cluttering the virtual machine, I\u0026rsquo;ll do this step on my buildfarm and copy the resulting Debian packages back onto the VM.\nThis step simply follows VPP\u0026rsquo;s doc but to recap the individual steps here, I will:\nuse Git to check out both VPP and my plugin ensure all Debian dependencies are installed build DPDK libraries as a Debian package build VPP and its plugins (including LinuxCP) finally, build a set of Debian packages out of the VPP, Plugins, DPDK, etc. The resulting Packages will work both on Debian (Buster and Bullseye) as well as Ubuntu (Focal, 20.04). 
So grab a cup of tea, while we let Rhino stretch its legs, ehh, CPUs \u0026hellip;\npim@rhino:~$ mkdir -p ~/src pim@rhino:~$ cd ~/src pim@rhino:~/src$ sudo apt install libmnl-dev pim@rhino:~/src$ git clone https://git.ipng.ch/ipng/lcpng.git pim@rhino:~/src$ git clone https://gerrit.fd.io/r/vpp pim@rhino:~/src$ ln -s ~/src/lcpng ~/src/vpp/src/plugins/lcpng pim@rhino:~/src$ cd ~/src/vpp pim@rhino:~/src/vpp$ make install-deps pim@rhino:~/src/vpp$ make install-ext-deps pim@rhino:~/src/vpp$ make build-release pim@rhino:~/src/vpp$ make pkg-deb Which will yield the following Debian packages, would you believe that, at exactly leet-o\u0026rsquo;clock :-)\npim@rhino:~/src/vpp$ ls -hSl build-root/*.deb -rw-r--r-- 1 pim pim 71M Dec 23 13:37 build-root/vpp-dbg_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 4.7M Dec 23 13:37 build-root/vpp_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 4.2M Dec 23 13:37 build-root/vpp-plugin-core_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 3.7M Dec 23 13:37 build-root/vpp-plugin-dpdk_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 1.3M Dec 23 13:37 build-root/vpp-dev_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 308K Dec 23 13:37 build-root/vpp-plugin-devtools_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 173K Dec 23 13:37 build-root/libvppinfra_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 138K Dec 23 13:37 build-root/libvppinfra-dev_22.02-rc0~421-ge6387b2b9_amd64.deb -rw-r--r-- 1 pim pim 27K Dec 23 13:37 build-root/python3-vpp-api_22.02-rc0~421-ge6387b2b9_amd64.deb I\u0026rsquo;ve copied these packages to our vpp-proto image in ~ipng/packages/, where I\u0026rsquo;ll simply install them using dpkg:\nipng@vpp-proto:~$ sudo mkdir -p /var/log/vpp ipng@vpp-proto:~$ sudo dpkg -i ~/packages/*.deb ipng@vpp-proto:~$ sudo adduser `id -un` vpp I\u0026rsquo;ll configure 2GB of hugepages and 64MB of netlink buffer size - see my VPP #7 post for more details and lots of background information:\nipng@vpp-proto:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/sysctl.d/80-vpp.conf vm.nr_hugepages=1024 vm.max_map_count=3096 vm.hugetlb_shm_group=0 kernel.shmmax=2147483648 EOF ipng@vpp-proto:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf net.core.rmem_default=67108864 net.core.wmem_default=67108864 net.core.rmem_max=67108864 net.core.wmem_max=67108864 EOF ipng@vpp-proto:~$ sudo sysctl -p -f /etc/sysctl.d/80-vpp.conf ipng@vpp-proto:~$ sudo sysctl -p -f /etc/sysctl.d/81-vpp-netlink.conf Next, I\u0026rsquo;ll create a network namespace for VPP and associated controlplane software to run in, this is because VPP will want to create its TUN/TAP devices separate from the default namespace:\nipng@vpp-proto:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service [Unit] Description=Dataplane network namespace After=systemd-sysctl.service network-pre.target Before=network.target network-online.target [Service] Type=oneshot RemainAfterExit=yes # PrivateNetwork will create network namespace which can be # used in JoinsNamespaceOf=. PrivateNetwork=yes # To set `ip netns` name for this namespace, we create a second namespace # with required name, unmount it, and then bind our PrivateNetwork # namespace to it. After this we can use our PrivateNetwork as a named # namespace in `ip netns` commands. 
ExecStartPre=-/usr/bin/echo \u0026#34;Creating dataplane network namespace\u0026#34; ExecStart=-/usr/sbin/ip netns delete dataplane ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf ExecStart=-/usr/sbin/ip netns add dataplane ExecStart=-/usr/bin/umount /var/run/netns/dataplane ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane # Apply default sysctl for dataplane namespace ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl ExecStop=-/usr/sbin/ip netns delete dataplane [Install] WantedBy=multi-user.target WantedBy=network-online.target EOF ipng@vpp-proto:~$ sudo systemctl enable netns-dataplane ipng@vpp-proto:~$ sudo systemctl start netns-dataplane Finally, I\u0026rsquo;ll add a useful startup configuration for VPP (note the comment on poll-sleep-usec which slows down the DPDK poller, making it a little bit milder on the CPU:\nipng@vpp-proto:~$ cd /etc/vpp ipng@vpp-proto:/etc/vpp$ sudo cp startup.conf startup.conf.orig ipng@vpp-proto:/etc/vpp$ cat \u0026lt;\u0026lt; EOF | sudo tee startup.conf unix { nodaemon log /var/log/vpp/vpp.log cli-listen /run/vpp/cli.sock gid vpp ## This makes VPP sleep 1ms between each DPDK poll, greatly ## reducing CPU usage, at the expense of latency/throughput. poll-sleep-usec 1000 ## Execute all CLI commands from this file upon startup exec /etc/vpp/bootstrap.vpp } api-trace { on } api-segment { gid vpp } socksvr { default } memory { main-heap-size 512M main-heap-page-size default-hugepage } buffers { buffers-per-numa 128000 default data-size 2048 page-size default-hugepage } statseg { size 1G page-size default-hugepage per-node-counters off } plugins { plugin lcpng_nl_plugin.so { enable } plugin lcpng_if_plugin.so { enable } } logging { default-log-level info default-syslog-log-level notice class linux-cp/if { rate-limit 10000 level debug syslog-level debug } class linux-cp/nl { rate-limit 10000 level debug syslog-level debug } } lcpng { default netns dataplane lcp-sync lcp-auto-subint } EOF ipng@vpp-proto:/etc/vpp$ cat \u0026lt;\u0026lt; EOF | sudo tee bootstrap.vpp comment { Create a loopback interface } create loopback interface instance 0 lcp create loop0 host-if loop0 set interface state loop0 up set interface ip address loop0 2001:db8::1/64 set interface ip address loop0 192.0.2.1/24 comment { Create Linux Control Plane interfaces } lcp create GigabitEthernet10/0/0 host-if e0 lcp create GigabitEthernet10/0/1 host-if e1 lcp create GigabitEthernet10/0/2 host-if e2 lcp create GigabitEthernet10/0/3 host-if e3 EOF ipng@vpp-proto:/etc/vpp$ sudo systemctl restart vpp After all of this, the following screenshot is a reasonable confirmation of success. Step 3. Install / Configure FRR Debian Bullseye ships with FRR 7.5.1, which will be fine. 
But for completeness, I\u0026rsquo;ll point out that FRR maintains their own Debian package repo as well, and they\u0026rsquo;re currently releasing FRR 8.1 as stable, so I opt to install that one instead:\nipng@vpp-proto:~$ curl -s https://deb.frrouting.org/frr/keys.asc | sudo apt-key add - ipng@vpp-proto:~$ FRRVER=\u0026#34;frr-stable\u0026#34; ipng@vpp-proto:~$ echo deb https://deb.frrouting.org/frr $(lsb_release -s -c) $FRRVER | \\ sudo tee -a /etc/apt/sources.list.d/frr.list ipng@vpp-proto:~$ sudo apt update \u0026amp;\u0026amp; sudo apt install frr frr-pythontools ipng@vpp-proto:~$ sudo adduser `id -un` frr ipng@vpp-proto:~$ sudo adduser `id -un` frrvty After installing, FRR will start up in the default network namespace, but I\u0026rsquo;m going to be using VPP in a custom namespace called dataplane. FRR after version 7.5 can work with multiple namespaces ref which boils down to adding the following daemons file:\nipng@vpp-proto:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/frr/daemons bgpd=yes ospfd=yes ospf6d=yes bfdd=yes vtysh_enable=yes watchfrr_options=\u0026#34;--netns=dataplane\u0026#34; zebra_options=\u0026#34; -A 127.0.0.1 -s 67108864\u0026#34; bgpd_options=\u0026#34; -A 127.0.0.1\u0026#34; ospfd_options=\u0026#34; -A 127.0.0.1\u0026#34; ospf6d_options=\u0026#34; -A ::1\u0026#34; staticd_options=\u0026#34;-A 127.0.0.1\u0026#34; bfdd_options=\u0026#34; -A 127.0.0.1\u0026#34; EOF ipng@vpp-proto:~$ sudo systemctl restart frr After restarting FRR with this namespace aware configuration, I can check to ensure it found the loop0 and e0-3 interfaces VPP defined above. Let\u0026rsquo;s take a look, while I set link e0 up and give it an IPv4 address. I\u0026rsquo;ll do this in the dataplane namespace, and expect that FRR picks this up as it\u0026rsquo;s monitoring the netlink messages in that namespace as well:\nStep 4. Install / Configure Bird2 Installing Bird2 is straight forward, although as with FRR above, after installing it\u0026rsquo;ll want to run in the default namespace, which we ought to change. 
And as well, let\u0026rsquo;s give it a bit of a default configuration to get started:\nipng@vpp-proto:~$ sudo apt-get install bird2 ipng@vpp-proto:~$ sudo systemctl stop bird ipng@vpp-proto:~$ sudo systemctl disable bird ipng@vpp-proto:~$ sudo systemctl mask bird ipng@vpp-proto:~$ sudo adduser `id -un` bird Then, I create a systemd unit for Bird running in the dataplane:\nipng@vpp-proto:~$ sed -e \u0026#39;s,ExecStart=,ExecStart=/usr/sbin/ip netns exec dataplane ,\u0026#39; \u0026lt; \\ /usr/lib/systemd/system/bird.service | sudo tee /usr/lib/systemd/system/bird-dataplane.service ipng@vpp-proto:~$ sudo systemctl enable bird-dataplane And, finally, I create some reasonable default config and start bird in the dataplane namespace:\nipng@vpp-proto:~$ cd /etc/bird ipng@vpp-proto:/etc/bird$ sudo cp bird.conf bird.conf.orig ipng@vpp-proto:/etc/bird$ cat \u0026lt;\u0026lt; EOF | sudo tee bird.conf router id 192.0.2.1; protocol device { scan time 30; } protocol direct { ipv4; ipv6; check link yes; } protocol kernel kernel4 { ipv4 { import none; export where source != RTS_DEVICE; }; learn off; scan time 300; } protocol kernel kernel6 { ipv6 { import none; export where source != RTS_DEVICE; }; learn off; scan time 300; } EOF ipng@vpp-proto:/usr/lib/systemd/system$ sudo systemctl start bird-dataplane And the results work quite similar to FRR, due to the VPP plugins working via Netlink, basically any program that operates in the dataplane namespace can interact with the kernel TAP interfaces, create/remove links, set state and MTU, add/remove IP addresses and routes:\nChoosing FRR or Bird At IPng Networks, we have historically, and continue to use Bird as our routing system of choice. But I totally realize the potential of FRR, in fact its implementation of LDP is what may drive me onto the platform after all, as I\u0026rsquo;d love to add MPLS support to the LinuxCP plugin at some point :-)\nBy default the KVM image comes with both FRR and Bird enabled. This is OK because there is no configuration on them yet, and they won\u0026rsquo;t be in each others\u0026rsquo; way. It makes sense for users of the image to make a conscious choice which of the two they\u0026rsquo;d like to use, and simply disable and mask the other one:\nIf FRR is your preference: ipng@vpp-proto:~$ sudo systemctl stop bird-dataplane ipng@vpp-proto:~$ sudo systemctl disable bird-dataplane ipng@vpp-proto:~$ sudo systemctl mask bird-dataplane ipng@vpp-proto:~$ sudo systemctl unmask frr ipng@vpp-proto:~$ sudo systemctl enable frr ipng@vpp-proto:~$ sudo systemctl start frr If Bird is your preference: ipng@vpp-proto:~$ sudo systemctl stop frr ipng@vpp-proto:~$ sudo systemctl disable frr ipng@vpp-proto:~$ sudo systemctl mask frr ipng@vpp-proto:~$ sudo systemctl unmask bird-dataplane ipng@vpp-proto:~$ sudo systemctl enable bird-dataplane ipng@vpp-proto:~$ sudo systemctl start bird-dataplane And with that, I hope to have given you a good overview of what comes into play when installing a Debian machine with VPP, my LinuxCP plugin, and FRR or Bird: Happy hacking!\nOne last thing .. After I created the KVM image, I made a qcow2 snapshot of it in pristine state. This means you can mess around with the VM, and easily revert to that pristine state without having to download the image again. You can also add some customization (as I\u0026rsquo;ve done for our own VPP Lab at IPng Networks) and set another snapshot and roll forwards and backwards between them. 
The syntax is:\n## Create a named snapshot pim@chumbucket:~$ qemu-img snapshot -c pristine vpp-proto.qcow2 ## List snapshots in the image pim@chumbucket:~$ qemu-img snapshot -l vpp-proto.qcow2 Snapshot list: ID TAG VM SIZE DATE VM CLOCK ICOUNT 1 pristine 0 B 2021-12-23 17:52:36 00:00:00.000 0 ## Revert to the named snapshot pim@chumbucket:~$ qemu-img snapshot -a pristine vpp-proto.qcow2 ## Delete the named snapshot pim@chumbucket:~$ qemu-img snapshot -d pristine vpp-proto.qcow2 ","date":"2021-12-23","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/12/23/vpp-linux-cp-virtual-machine-playground/","section":"articles","title":"VPP Linux CP - Virtual Machine Playground"},{"contents":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewed: Jim Thompson \u0026lt;jim@netgate.com\u0026gt; Status: Draft - Review - Approved A few weeks ago, Jim Thompson from Netgate stumbled across my APU6 Post and introduced me to their new desktop router/firewall the Netgate 6100. It currently ships with pfSense Plus, but he mentioned that it\u0026rsquo;s designed as well to run their TNSR software, considering the device ships with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with VPP, which of course I\u0026rsquo;d be happy to do, and here are my findings. The TNSR image isn\u0026rsquo;t yet public for this device, but that\u0026rsquo;s not a problem because AS8298 runs VPP, so I\u0026rsquo;ll just go ahead and install it myself \u0026hellip;\nExecutive Summary The Netgate 6100 router running pfSense has a single core performance of 623kpps and a total chassis throughput of 2.3Mpps, which is sufficient for line rate in both directions at 1514b packets (1.58Mpps), about 6.2Gbit of imix traffic, and about 419Mbit of 64b packets. Running Linux on the router yields very similar results.\nWith VPP though, the router\u0026rsquo;s single core performance leaps to 5.01Mpps at 438 CPU cycles/packet. This means that all three of 1514b, imix and 64b packets can be forwarded at line rate in one direction on 10Gbit. 
Due to its Atom C3558 processor (which has 4 cores, 3 of which are dedicated to VPP\u0026rsquo;s worker threads, and 1 to its main thread and controlplane running in LInux), achieving 10Gbit line rate in both directions when using 64 byte packets, is not possible.\nRunning at 19W and a total forwarding capacity of 15.1Mpps, it consumes only 1.26µJ of energy per forwarded packet, while at the same time easily handling a full BGP table with room to spare. I find this Netgate 6100 appliance pretty impressive and when TNSR becomes available, performance will be similar to what I\u0026rsquo;ve tested here, at a pricetag of USD 699,-\nDetailed findings The Netgate 6100 ships with an Intel Atom C-3558 CPU (4 cores including AES-NI and QuickAssist), 8GB of memory and either 16GB of eMMC, or 128GB of NVME storage. The network cards are its main forté: it comes with 2x i354 gigabit combo (SFP and RJ45), 4x i225 ports (these are 2.5GbE), and 2x X553 10GbE ports with an SFP+ cage each, for a total of 8 ports and lots of connectivity.\nThe machine is fanless and this is made possible by its power efficient CPU: the Atom here runs at 16W TDP only, and the whole machine clocks in at a very efficient 19W. It comes with an external power brick, but only one power supply, so no redundancy, unfortunately. To make up for that small omission, here are a few nice touches that I noticed:\nThe power jack has a screw-on barrel - no more accidentally rebooting the machine when fumbling around under the desk. There\u0026rsquo;s both a Cisco RJ45 console port (115200,8n1), as well as a CP2102 onboard USB/serial connector, which means you can connect to its serial port as well with a standard issue micro-USB cable. Cool! Battle of Operating Systems Netgate ships the device with pfSense - it\u0026rsquo;s a pretty great appliance and massively popular - delivering firewall, router and VPN functionality to homes and small business across the globe. I myself am partial to BSD (albeit a bit more of the Puffy persuasion), but DPDK and VPP are more of a Linux cup of tea. So I\u0026rsquo;ll have to deface this little guy, and reinstall it with Linux. My game plan is:\nBased on the shipped pfSense 21.05 (FreeBSD 12.2), do all the loadtests Reinstall the machine with Linux (Ubuntu 20.04.3), do all the loadtests Install VPP using my own HOWTO, and do all the loadtests This allows for, I think, a pretty sweet comparison between FreeBSD, Linux, and DPDK/VPP. Now, on to a description on the defacing, err, reinstall process on this Netgate 6100 machine, as it was not as easy as I had anticipated (but is it ever easy, really?)\nTurning on the device, it presents me with some BIOS firmware from Insyde Software which is loading some software called BlinkBoot [ref], which in turn is loading modules called Lenses, pictured right. Anyway, this ultimately presents me with a Press F2 for Boot Options. Aha! That\u0026rsquo;s exactly what I\u0026rsquo;m looking for. I\u0026rsquo;m really grateful that Netgate decides to ship a device with a BIOS that will allow me to boot off of other media, notably the USB stick in order to reinstall pfSense but in my case, also to install another operating system entirely.\nMy first approach was to get a default image to boot off of USB (the device has two USB3 ports on the side). But none of the USB ports want to load my UEFI bootx64.efi prepared USB key. 
So my second attempt was to prepare a PXE boot image, taking a few hints from Ubuntu\u0026rsquo;s documentation [ref]:\nwget http://archive.ubuntu.com/ubuntu/dists/focal/main/installer-amd64/current/legacy-images/netboot/mini.iso mv mini.iso /tmp/mini-focal.iso grub-mkimage --format=x86_64-efi \\ --output=/var/tftpboot/grubnetx64.efi.signed \\ --memdisk=/tmp/mini-focal.iso \\ `ls /usr/lib/grub/x86_64-efi | sed -n \u0026#39;s/\\.mod//gp\u0026#39;` After preparing DHCPd and a TFTP server, and getting a slight feeling of being transported back in time to the stone age, I see the PXE both request an IPv4 address, and the image I prepared. And, it boots, yippie!\nNov 25 14:52:10 spongebob dhcpd[43424]: DHCPDISCOVER from 90:ec:77:1b:63:55 via bond0 Nov 25 14:52:11 spongebob dhcpd[43424]: DHCPOFFER on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0 Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPREQUEST for 192.168.1.206 (192.168.1.254) from 90:ec:77:1b:63:55 via bond0 Nov 25 14:52:13 spongebob dhcpd[43424]: DHCPACK on 192.168.1.206 to 90:ec:77:1b:63:55 via bond0 Nov 25 15:04:56 spongebob tftpd[2076]: 192.168.1.206: read request for \u0026#39;grubnetx64.efi.signed\u0026#39; I took a peek while the grubnetx64 was booting, and saw that the available output terminals on this machine are spkmodem morse gfxterm serial_efi0 serial_com0 serial cbmemc audio, and that the default/active one is console, so I make a note that Grub wants to run on \u0026lsquo;console\u0026rsquo; (and specifically NOT on \u0026lsquo;serial\u0026rsquo;, as is usual, see below for a few more details on this) while the Linux kernel will of course be running on serial, so I have to add console=ttyS0,115200n8 to the kernel boot string before booting.\nPiece of cake, by which I mean I spent about four hours staring at the boot loader and failing to get it quite right \u0026ndash; pro-tip: install OpenSSH and fix the GRUB and Kernel configs before finishing the mini.iso install:\nmount --bind /proc /target/proc mount --bind /dev /target/dev mount --bind /sys /target/sys chroot /target /bin/bash # Install OpenSSH, otherwise the machine boots w/o access :) apt update apt install openssh-server # Fix serial for GRUB and Kernel vi /etc/default/grub ## set GRUB_CMDLINE_LINUX_DEFAULT=\u0026#34;console=ttyS0,115200n8\u0026#34; ## set GRUB_TERMINAL=console (and comment out the serial stuff) grub-install /dev/sda update-grub Rebooting now brings me to Ubuntu: Pat on the back, Caveman Pim, you\u0026rsquo;ve still got it!\nNetwork Loadtest After that small but exciting detour, let me get back to the loadtesting. The choice of Intel\u0026rsquo;s network controller on this board allows me to use Intel\u0026rsquo;s DPDK with relatively high performance, compared to regular (kernel) based routing. 
I loadtested the stock firmware pfSense (21.05, based on FreeBSD 12.2), Linux (Ubuntu 20.04.3), and VPP (22.02, [ref]).\nSpecifically worth calling out is that while Linux and FreeBSD struggled in the packets-per-second department, the use of DPDK in VPP meant absolutely no problems filling a unidirectional 10G stream of \u0026ldquo;regular internet traffic\u0026rdquo; (referred to as imix), it was also able to fill line rate with \u0026ldquo;64b UDP packets\u0026rdquo;, with just a little headroom there, but it ultimately struggled with bidirectional 64b UDP packets.\nMethodology For the loadtests, I used Cisco\u0026rsquo;s T-Rex [ref] in stateless mode, with a custom Python controller that ramps up and down traffic from the loadtester to the device under test (DUT) by sending traffic out port0 to the DUT, and expecting that traffic to be presented back out from the DUT to its port1, and vice versa (out from port1 -\u0026gt; DUT -\u0026gt; back in on port0). The loadtester first sends a few seconds of warmup, this is to ensure the DUT is passing traffic and offers the ability to inspect the traffic before the actual rampup. Then the loadtester ramps up linearly from zero to 100% of line rate (in this case, line rate is 10Gbps in both directions), finally it holds the traffic at full line rate for a certain duration. If at any time the loadtester fails to see the traffic it\u0026rsquo;s emitting return on its second port, it flags the DUT as saturated; and this is noted as the maximum bits/second and/or packets/second.\nSince my last loadtesting post, I\u0026rsquo;ve learned a lot more about packet forwarding and how to make it easier or harder on the router. Let me go into a few more details about the various loadtests that I\u0026rsquo;ve done here.\nMethod 1: Single CPU Thread Saturation Most kernels (certainly OpenBSD, FreeBSD and Linux) will make use of multiple receive queues if the network card supports it. The Intel NICs in this machine are all capable of Receive Side Scaling (RSS), which means the NIC can offload its packets into multiple queues. The kernel will typically enable one queue for each CPU core \u0026ndash; the Atom has 4 cores, so 4 queues are initialized, and inbound traffic is sent, typically using some hashing function, to individual CPUs, allowing for a higher aggregate throughput.\nMostly, this hashing function is based on some L3 or L4 payload, for example a hash over the source IP/port and destination IP/port. So one interesting test is to send the same packet over and over again \u0026ndash; the hash function will then return the same value for each packet, which means all traffic goes into exactly one of the N available queues, and therefore handled by only one core.\nOne such TRex stateless traffic profile is udp_1pkt_simple.py which, as the name implies, simply sends the same UDP packet from source IP/port and destination IP/port, padded with a bunch of \u0026lsquo;x\u0026rsquo; characters, over and over again:\npacket = STLPktBuilder(pkt = Ether()/ IP(src=\u0026#34;16.0.0.1\u0026#34;,dst=\u0026#34;48.0.0.1\u0026#34;)/ UDP(dport=12,sport=1025)/(10*\u0026#39;x\u0026#39;) ) Method 2: Rampup using trex-loadtest.py TRex ships with a very handy bench.py stateless traffic profile which, without any additional arguments, does the same thing as the above method. However, this profile optionally takes a few arguments, which are called tunables, notably:\nsize - set the size of the packets to either a number (ie. 
64, the default, or any number up to a maximum of 9216 bytes), or the string imix which will send a traffic mix consisting of 60b, 590b and 1514b packets.
vm - set the packet source/dest generation. By default (when the flag is None), the same src (16.0.0.1) and dst (48.0.0.1) is set for each packet. When setting the value to var1, the source IP is incremented from 16.0.0.[4-254]. If the value is set to var2, the source and destination IP are incremented, the destination from 48.0.0.[4-254].

So tinkering with the vm parameter is an excellent way of driving one or many receive queues. Armed with this, I will perform a loadtest with four modes of interest, from easier to more challenging:

bench-var2-1514b: multiple flows, ~815Kpps at 10Gbps; this is the easiest test to perform, as the traffic consists of large (1514 byte) packets, and both source and destination are different each time, which means lots of multiplexing across receive queues, and relatively few packets/sec.
bench-var2-imix: multiple flows, with a mix of 60, 590 and 1514b frames in a certain ratio. This yields what can be reasonably expected from normal internet use, just about 3.2Mpps at 10Gbps. This is the most representative test for normal use, but still the packet rate is quite low due to (relatively) large packets. Any respectable router should be able to perform well at an imix profile.
bench-var2-64b: still multiple flows, but very small packets, 14.88Mpps at 10Gbps, often referred to as the theoretical maximum throughput on Tengig. Now it's getting harder, as the loadtester will fill the line with small packets (of 64 bytes, the smallest that an ethernet packet is allowed to be). This is a good way to see if the router vendor is actually capable of what is referred to as line rate forwarding.
bench: now restricted to a constant src/dst IP:port tuple, at the same rate of 14.88Mpps at 10Gbps, which means only one Rx queue (and thus, one CPU core) can be used. This is where single-core performance becomes relevant. Notably, vendors who boast many CPUs will often struggle with a test like this, in case any given CPU core cannot individually handle a full line rate. I'm looking at you, Tilera!

Further to this list, I can send traffic in one direction only (TRex will emit this from its port0 and expect the traffic to be seen back at port1); or I can send it in both directions. The latter will double the packet rate and bandwidth, to approx 29.7Mpps.

NOTE: At these rates, TRex can be a bit fickle trying to fit all these packets into its own transmit queues, so I decide to drive it a bit less close to the cliff and stop at 97% of line rate (this is 28.3Mpps). It explains why lots of these loadtests top out at that number.

Results

Method 1: Single CPU Thread Saturation

Given the approaches above, for the first method I can "just" saturate the line and see how many packets emerge through the DUT on the other port, so that's only 3 tests:

Netgate 6100 Loadtest   Throughput (pps)   L1 Throughput (bps)   % of linerate
pfSense 64b 1-flow      622.98 Kpps        418.65 Mbps            4.19%
Linux 64b 1-flow        642.71 Kpps        431.90 Mbps            4.32%
VPP 64b 1-flow          5.01 Mpps          3.37 Gbps             33.67%

NOTE: The bandwidth figures here are so called L1 throughput, which means bits on the wire, as opposed to L2 throughput, which means bits in the ethernet frame. This is relevant particularly at 64b loadtests, as the overhead for each ethernet frame is 20 bytes (7 bytes preamble, 1 byte start-frame, and 12 bytes inter-frame gap [ref]).
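To spell out the arithmetic behind that note, here is a small worked example (my own calculation; it reproduces the 31.25% and 7.62Gbps figures quoted just below, and the 14.88Mpps line rate used throughout):

# Worked example: L1 (on-the-wire) versus L2 (in-frame) throughput for the
# smallest legal Ethernet frame. Overhead is 7B preamble + 1B SFD + 12B IFG.
FRAME = 64                 # bytes in the Ethernet frame itself
OVERHEAD = 7 + 1 + 12      # bytes on the wire that are not part of the frame
L1_RATE = 10e9             # 10Gbps of bits on the wire

print(f"overhead:  {OVERHEAD / FRAME:.2%}")                                 # 31.25%
print(f"L2 rate:   {L1_RATE * FRAME / (FRAME + OVERHEAD) / 1e9:.2f} Gbps")  # 7.62
print(f"line rate: {L1_RATE / ((FRAME + OVERHEAD) * 8) / 1e6:.2f} Mpps")    # 14.88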
At 64 byte frames, this is 31.25% overhead! It also means that when L1 bandwidth is fully consumed at 10Gbps, that the observed L2 bandwidth will be only 7.62Gbps.\nInterlude - VPP efficiency In VPP it can be pretty cool to take a look at efficiency \u0026ndash; one of the main reasons why it\u0026rsquo;s so quick is because VPP will consume the entire core, and grab a set of packets from the NIC rather than do work for each individual packet. VPP then advances the set of packets, called a vector, through a directed graph. The first of these packets will result in the code for the current graph node to be fetched into the CPU\u0026rsquo;s instruction cache, and the second and further packets will make use of the warmed up cache, greatly improving per-packet efficiency.\nI can demonstrate this by running a 1kpps, 1Mpps and 10Mpps test against the VPP install on this router, and observing how many CPU cycles each packet needs to get forwarded from the input interface to the output interface. I expect this number to go down when the machine has more work to do, due to the higher CPU i/d-cache hit rate. Seeing the time spent in each of VPP\u0026rsquo;s graph nodes, and for each individual worker thread (which correspond 1:1 with CPU cores), can be done with vppctl show runtime command and some awk magic:\n$ vppctl clear run \u0026amp;\u0026amp; sleep 30 \u0026amp;\u0026amp; vppctl show run | \\ awk \u0026#39;$2 ~ /active|polling/ \u0026amp;\u0026amp; $4 \u0026gt; 25000 { print $0; if ($1==\u0026#34;ethernet-input\u0026#34;) { packets = $4}; if ($1==\u0026#34;dpdk-input\u0026#34;) { dpdk_time = $6}; total_time += $6 } END { print packets/30, \u0026#34;packets/sec, at\u0026#34;,total_time,\u0026#34;cycles/packet,\u0026#34;, total_time-dpdk_time,\u0026#34;cycles/packet not counting DPDK\u0026#34; }\u0026#39; This gives me the following, somewhat verbose but super interesting output, which I\u0026rsquo;ve edited down to fit on screen, and omit the columns that are not super relevant. Ready? Here we go!\ntui\u0026gt;start -f stl/udp_1pkt_simple.py -p 0 -m 1kpps Graph Node Name Clocks Vectors/Call ---------------------------------------------------------------- TenGigabitEthernet3/0/1-output 6.07e2 1.00 TenGigabitEthernet3/0/1-tx 8.61e2 1.00 dpdk-input 1.51e6 0.00 ethernet-input 1.22e3 1.00 ip4-input-no-checksum 6.59e2 1.00 ip4-load-balance 4.50e2 1.00 ip4-lookup 5.63e2 1.00 ip4-rewrite 5.83e2 1.00 1000.17 packets/sec, at 1514943 cycles/packet, 4943 cycles/pkt not counting DPDK I\u0026rsquo;ll observe that a lot of time is spent in dpdk-input, because that is a node that is constantly polling the network card, as fast as it can, to see if there\u0026rsquo;s any work for it to do. Apparently not, because the average vectors per call is pretty much zero, and considering that, most of the CPU time is going to sit in a nice \u0026ldquo;do nothing\u0026rdquo;. Because reporting CPU cycles spent doing nothing isn\u0026rsquo;t particularly interesting, I shall report on both the total cycles spent, that is to say including DPDK, and as well the cycles spent per packet in the other active nodes. 
In this case, at 1kpps, VPP is spending 4953 cycles on each packet.\nNow, take a look what happens when I raise the traffic to 1Mpps:\ntui\u0026gt;start -f stl/udp_1pkt_simple.py -p 0 -m 1mpps Graph Node Name Clocks Vectors/Call ---------------------------------------------------------------- TenGigabitEthernet3/0/1-output 3.80e1 18.57 TenGigabitEthernet3/0/1-tx 1.44e2 18.57 dpdk-input 1.15e3 .39 ethernet-input 1.39e2 18.57 ip4-input-no-checksum 8.26e1 18.57 ip4-load-balance 5.85e1 18.57 ip4-lookup 7.94e1 18.57 ip4-rewrite 7.86e1 18.57 981830 packets/sec, at 1770.1 cycles/packet, 620 cycles/pkt not counting DPDK Whoa! The system is now running the VPP loop with ~18.6 packets per vector, and you can clearly see that the CPU efficiency went up greatly, from 4953 cycles/packet at 1kpps, to 620 cycles/packet at 1Mpps. That\u0026rsquo;s an order of magnitude improvement!\nFinally, let\u0026rsquo;s give this Netgate 6100 router a run for its money, and slam it with 10Mpps:\ntui\u0026gt;start -f stl/udp_1pkt_simple.py -p 0 -m 10mpps Graph Node Name Clocks Vectors/Call ---------------------------------------------------------------- TenGigabitEthernet3/0/1-output 1.41e1 256.00 TenGigabitEthernet3/0/1-tx 1.23e2 256.00 dpdk-input 7.95e1 256.00 ethernet-input 6.74e1 256.00 ip4-input-no-checksum 3.95e1 256.00 ip4-load-balance 2.54e1 256.00 ip4-lookup 4.12e1 256.00 ip4-rewrite 4.78e1 256.00 5.01426e+06 packets/sec, at 437.9 cycles/packet, 358 cycles/pkt not counting DPDK And here is where I learn the maximum packets/sec that this one CPU thread can handle: 5.01Mpps, at which point every packet is super efficiently handled at 358 CPU cycles each, or 13.8 times (4953/438) as efficient under high load than when the CPU is unloaded. Sweet!!\nAnother really cool thing to do here is derive the effective clock speed of the Atom CPU. We know it runs at 2200Mhz, and we\u0026rsquo;re doing 5.01Mpps at 438 cycles/packet including the time spent in DPDK, which adds up to 2194MHz, remarkable precision. Color me impressed :-)\nMethod 2: Rampup using trex-loadtest.py For the second methodology, I have to perform a lot of loadtests. In total, I\u0026rsquo;m testing 4 modes (1514b, imix, 64b-multi and 64b 1-flow), then take a look at unidirectional traffic and bidirectional traffic, and perform each of these loadtests on pfSense, Ubuntu, and VPP with one, two or three Rx/Tx queues. That\u0026rsquo;s a total of 40 loadtests!\nLoadtest pfSense Ubuntu VPP 1Q VPP 2Q VPP 3Q Details Unidirectional 1514b 97% 97% 97% 97% 97% [graphs] imix 61% 75% 96% 95% 95% [graphs] 64b 15% 17% 33% 66% 96% [graphs] 64b 1-flow 4.4% 4.7% 33% 33% 33% [graphs] Bidirectional 1514b 192% 193% 193% 193% 194% [graphs] imix 63% 71% 190% 190% 191% [graphs] 64b 15% 16% 61% 63% 81% [graphs] 64b 1-flow 8.6% 9.0% 61% 61% 33% (+) [graphs] A picture says a thousand words - so I invite you to take a look at the interactive graphs from the table above. I\u0026rsquo;ll cherrypick what I find the most interesting one here:\nThe graph above is of the unidirectional 64b loadtest. Some observations:\npfSense 21.05 (running FreeBSD 12.2, the bottom blue trace), and Ubuntu 20.04.3 (running Linux 5.13, the orange trace just above it) are are equal performers. They handle fullsized (1514 byte) packets just fine, struggle a little bit with imix, and completely suck at 64b packets (shown here), in particular if only 1 CPU core can be used. 
Even at 64b packets, VPP scales linearly from 33% of line rate with 1Q (the green trace), 66% with 2Q (the red trace) and 96% with 3Q (the purple trace, that makes it through to the end). With VPP taking 3Q, one CPU is left over for the main thread and controlplane software like FRR or Bird2. Caveats The unit was shipped courtesy of Netgate (thanks again! Jim, this was fun!) for the purposes of load- and systems integration testing and comparing their internal benchmarking with my findings. Other than that, this is not a paid endorsement and views of this review are my own.\nOne quirk I noticed is that while running VPP with 3Q and bidirectional traffic, performance is much worse than with 2Q or 1Q. This is not a fluke with the loadtest, as I have observed the same strange performance with other machines (Supermicro 5018D-FN8T for example). I confirmed that each VPP worker thread is used for each queue, so I would\u0026rsquo;ve expected ~15Mpps shared by both interfaces (so a per-direction linerate of ~50%), but I get 16.8% instead [graphs]. I\u0026rsquo;ll have to understand that better, but for now I\u0026rsquo;m releasing the data as-is.\nAppendix Generating the data You can find all of my loadtest runs in this archive. The archive contains the trex-loadtest.py script as well, for curious readers! These JSON files can be fed directly into Michal\u0026rsquo;s visualizer to plot interactive graphs (which I\u0026rsquo;ve done for the table above):\nDEVICE=netgate-6100 ruby graph.rb -t \u0026#39;Netgate 6100 All Loadtests\u0026#39; -o ${DEVICE}.html netgate-*.json for i in bench-var2-1514b bench-var2-64b bench imix; do ruby graph.rb -t \u0026#39;Netgate 6100 Unidirectional Loadtests\u0026#39; --only-channels 0 \\ netgate-*-${i}-unidi*.json -o ${DEVICE}.$i-unidirectional.html done for i in bench-var2-1514b bench-var2-64b bench imix; do ruby graph.rb -t \u0026#39;Netgate 6100 Bidirectional Loadtests\u0026#39; \\ netgate-*-${i}.json -o ${DEVICE}.$i-bidirectional.html done Notes on pfSense I\u0026rsquo;m not a pfSense user, but I know my way around FreeBSD just fine. After installing the firmware, I simply choose the \u0026lsquo;Give me a Shell\u0026rsquo; option, and take it from there. The router will run pf out of the box, and it is pretty complex, so I\u0026rsquo;ll just configure some addresses, routes and disable the firewall alltogether. That sounds just fair, as the same tests with Linux and VPP also do not use a firewall (even though obviously, both VPP and Linux support firewalls just fine).\nifconfig ix0 inet 100.65.1.1/24 ifconfig ix1 inet 100.65.2.1/24 route add -net 16.0.0.0/8 100.65.1.2 route add -net 48.0.0.0/8 100.65.2.2 pfctl -d Notes on Linux When doing loadtests on Ubuntu, I have to ensure irqbalance is turned off, otherwise the kernel will thrash around re-routing softirq\u0026rsquo;s between CPU threads, and at the end of the day, I\u0026rsquo;m trying to saturate all CPUs anyway, so balancing/moving them around doesn\u0026rsquo;t make any sense. 
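One way to eyeball whether interrupts stay put once irqbalance is out of the picture is to watch the per-CPU counters in /proc/interrupts. The snippet below is a small hypothetical helper, not part of the original test setup; the default interface name is an assumption, so pass whatever the NIC is called on the device under test.

#!/usr/bin/env python3
# Print, per CPU, how many hardware interrupts each queue of a NIC has fired.
# Hypothetical helper for illustration; pass the interface name as argv[1].
import sys

nic = sys.argv[1] if len(sys.argv) > 1 else "enp3s0f0"   # assumed name

with open("/proc/interrupts") as f:
    lines = f.read().splitlines()

cpus = lines[0].split()                  # ['CPU0', 'CPU1', ...]
for line in lines[1:]:
    if nic not in line:
        continue
    fields = line.split()
    counts = fields[1:1 + len(cpus)]     # one counter column per CPU
    print(fields[-1], dict(zip(cpus, counts)))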
Further, Linux wants to configure a static ARP entry for the interfaces from TRex:\nsudo systemctl disable irqbalance sudo systemctl stop irqbalance sudo systemctl mask irqbalance sudo ip addr add 100.65.1.1/24 dev enp3s0f0 sudo ip addr add 100.65.2.1/24 dev enp3s0f1 sudo ip nei replace 100.65.1.2 lladdr 68:05:ca:32:45:94 dev enp3s0f0 ## TRex port0 sudo ip nei replace 100.65.2.2 lladdr 68:05:ca:32:45:95 dev enp3s0f1 ## TRex port1 sudo ip ro add 16.0.0.0/8 via 100.65.1.2 sudo ip ro add 48.0.0.0/8 via 100.65.2.2 On Linux, I now see a reasonable spread of IRQs by CPU while doing a unidirectional loadtest:\nroot@netgate:/home/pim# cat /proc/softirqs CPU0 CPU1 CPU2 CPU3 HI: 3 0 0 1 TIMER: 203788 247280 259544 401401 NET_TX: 8956 8373 7836 6154 NET_RX: 22003822 19316480 22526729 19430299 BLOCK: 2545 3153 2430 1463 IRQ_POLL: 0 0 0 0 TASKLET: 5084 60 1830 23 SCHED: 137647 117482 56371 49112 HRTIMER: 0 0 0 0 RCU: 11550 9023 8975 8075 ","date":"2021-11-26","desc":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewed: Jim Thompson \u0026lt;jim@netgate.com\u0026gt; Status: Draft - Review - Approved A few weeks ago, Jim Thompson from Netgate stumbled across my APU6 Post and introduced me to their new desktop router/firewall the Netgate 6100. It currently ships with pfSense Plus, but he mentioned that it\u0026rsquo;s designed as well to run their TNSR software, considering the device ships with 2x 1GbE SFP/RJ45 combo, 2x 10GbE SFP+, and 4x 2.5GbE RJ45 ports, and all network interfaces are Intel / DPDK capable chips. He asked me if I was willing to take it around the block with VPP, which of course I\u0026rsquo;d be happy to do, and here are my findings. The TNSR image isn\u0026rsquo;t yet public for this device, but that\u0026rsquo;s not a problem because AS8298 runs VPP, so I\u0026rsquo;ll just go ahead and install it myself \u0026hellip;\n","permalink":"https://ipng.ch/s/articles/2021/11/26/review-netgate-6100/","section":"articles","title":"Review: Netgate 6100"},{"contents":"Introduction BGP Routing policy is a very interesting topic. I get asked about it formally and informally all the time. I have to admit, there are lots of ways to organize an automous system. Vendors have unique features and templating / procedural functions, but in the end, BGP routing policy all boils down to two+two things:\nNot accepting the prefixes you don\u0026rsquo;t want (inbound) For those prefixes accepted, ensure they have correct attributes. Not announcing prefixes to folks who shouldn\u0026rsquo;t see them (outbound) For those prefixes announced, ensure they have correct attributes. At IPng Networks, I\u0026rsquo;ve cycled through a few iterations and landed on a specific setup that works well for me. It provides sufficient information to enable our downstream (customers) to make good decisions on what they should accept from us, as well as enough expressivity for them to determine which prefixes we should propagate for them, where, and how.\nThis article describes one approach to a relatively feature rich routing policy which is in use at IPng Networks (AS8298). It uses the Bird2 configuration language, although the concepts would be implementable in ~any modern routing suite (ie. FRR, Cisco, Juniper, Arista, Extreme, et cetera).\nInterested in one operator\u0026rsquo;s opinion? Read on!\n1. 
Concepts There are three basic pieces of routing filtering, which I\u0026rsquo;ll describe briefly.\nPrefix Lists A prefix list (also sometimes referred to as an access-list in older software) is a list of IPv4 or IPv6 prefixes, often with a prefixlen boundary, that determines if a given prefix is \u0026ldquo;in\u0026rdquo; or \u0026ldquo;out\u0026rdquo;.\nAn example could be: 2001:db8::/32{32,48} which describes any prefix in the supernet 2001:db8::/32 that has a prefix length of anywhere between /32 and /48, inclusive.\nAS Paths In BGP, each prefix learned comes with an AS path on how to reach it. If my router learns a prefix from a peer with AS number 65520, it\u0026rsquo;ll see every prefix that peer sends as a list of AS numbers starting with 65520. With AS Paths, the very first one in the list is the one the router directly learned the prefix from, and the very last one is the origin of the prefix. Oftentimes the path is shown as a regular expression, starting with ^ and ending with $ and to help readability, spaces are often written as _.\nExamples: ^25091_1299_3301$ and ^58299_174_1299_3301$\nBGP Communities When learning (or originating) a prefix in BGP, zero or more so-called communities can be added to it along the way. The Routing Information Base or RIB carries these communities and can share them between peering sessions. Communities can be added, removed and modified. Some communities have special meaning (which is agreed upon by everyone), and some have local meaning (agreed upon by only one or a small set of operators).\nThere are three types of communities: normal communities are a pair of 16-bit integers; extended communities are 8 bytes, split into one 16-bit integer and an additional 48-bit value; and finally large communities consist of a triplet of 32-bit values.\nExamples: (8298, 1234) (normal), or (8298, 3, 212323) (large)\nRouting Policy Now that I\u0026rsquo;ve explained a little bit about the ingredients we have to work with, let me share an observation that took me a few decades to make: BGP sessions are really all the same. As such, every single one of the BGP sessions at IPng Networks is generated with one template. What makes the difference between \u0026lsquo;Transit\u0026rsquo;, \u0026lsquo;Customer\u0026rsquo;, \u0026lsquo;Peer\u0026rsquo; and \u0026lsquo;Private Interconnect\u0026rsquo; really all boils down to what types of filtering are applied on in- and outbound updates. I will demonstrate this by means of two main functions in Bird: ebgp_import(), discussed first in the Inbound: Learning Routes section, and ebgp_export() in the Outbound: Announcing Routes section.\n2. 
Inbound: Learning Routes Let\u0026rsquo;s consider this function:\nfunction ebgp_import(int remote_as) { if aspath_bogon() then return false; if (net.type = NET_IP4 \u0026amp;\u0026amp; ipv4_bogon()) then return false; if (net.type = NET_IP6 \u0026amp;\u0026amp; ipv6_bogon()) then return false; if (net.type = NET_IP4 \u0026amp;\u0026amp; ipv4_rpki_invalid()) then return false; if (net.type = NET_IP6 \u0026amp;\u0026amp; ipv6_rpki_invalid()) then return false; # Demote certain AS nexthops to lower pref if (bgp_path.first ~ AS_LOCALPREF50 \u0026amp;\u0026amp; bgp_path.len \u0026gt; 1) then bgp_local_pref = 50; if (bgp_path.first ~ AS_LOCALPREF30 \u0026amp;\u0026amp; bgp_path.len \u0026gt; 1) then bgp_local_pref = 30; if (bgp_path.first ~ AS_LOCALPREF10 \u0026amp;\u0026amp; bgp_path.len \u0026gt; 1) then bgp_local_pref = 10; # Graceful Shutdown (RFC8326) if (65535, 0) ~ bgp_community then bgp_local_pref = 0; # Scrub BLACKHOLE community bgp_community.delete((65535, 666)); return true; } The function works by order of elimination \u0026ndash; for each prefix that is offered on the session, it will either be rejected (by means of returning false), or modified (by means of setting attributes like bgp_local_pref) and then accepted (by means of returning true).\nAS-Path Bogon filtering is a way to remove prefixes that have an invalid AS number in their path. The main example of this are private AS numbers (64496-131071) and their 32 bit equivalents (4200000000-4294967295). In case you haven\u0026rsquo;t come across this yet, AS number 23456 is also magic, see RFC4893 for details:\nfunction aspath_bogon() { return bgp_path ~ [0, 23456, 64496..131071, 4200000000..4294967295]; } Prefix Bogon comes next, as certain prefixes that are not publicly routable (you know, such as RFC1918, but there are many others). 
They look differently for IPv4 and IPv6:\nfunction ipv4_bogon() { return net ~ [ 0.0.0.0/0, # Default 0.0.0.0/32-, # RFC 5735 Special Use IPv4 Addresses 0.0.0.0/0{0,7}, # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3 10.0.0.0/8+, # RFC 1918 Address Allocation for Private Internets 100.64.0.0/10+, # RFC 6598 IANA-Reserved IPv4 Prefix for Shared Address Space 127.0.0.0/8+, # RFC 1122 Requirements for Internet Hosts -- Communication Layers 3.2.1.3 169.254.0.0/16+, # RFC 3927 Dynamic Configuration of IPv4 Link-Local Addresses 172.16.0.0/12+, # RFC 1918 Address Allocation for Private Internets 192.0.0.0/24+, # RFC 6890 Special-Purpose Address Registries 192.0.2.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation 192.168.0.0/16+, # RFC 1918 Address Allocation for Private Internets 198.18.0.0/15+, # RFC 2544 Benchmarking Methodology for Network Interconnect Devices 198.51.100.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation 203.0.113.0/24+, # RFC 5737 IPv4 Address Blocks Reserved for Documentation 224.0.0.0/4+, # RFC 1112 Host Extensions for IP Multicasting 240.0.0.0/4+ # RFC 6890 Special-Purpose Address Registries ]; } function ipv6_bogon() { return net ~ [ ::/0, # Default ::/96, # IPv4-compatible IPv6 address - deprecated by RFC4291 ::/128, # Unspecified address ::1/128, # Local host loopback address ::ffff:0.0.0.0/96+, # IPv4-mapped addresses ::224.0.0.0/100+, # Compatible address (IPv4 format) ::127.0.0.0/104+, # Compatible address (IPv4 format) ::0.0.0.0/104+, # Compatible address (IPv4 format) ::255.0.0.0/104+, # Compatible address (IPv4 format) 0000::/8+, # Pool used for unspecified, loopback and embedded IPv4 addresses 0100::/8+, # RFC 6666 - reserved for Discard-Only Address Block 0200::/7+, # OSI NSAP-mapped prefix set (RFC4548) - deprecated by RFC4048 0400::/6+, # RFC 4291 - Reserved by IETF 0800::/5+, # RFC 4291 - Reserved by IETF 1000::/4+, # RFC 4291 - Reserved by IETF 2001:10::/28+, # RFC 4843 - Deprecated (previously ORCHID) 2001:20::/28+, # RFC 7343 - ORCHIDv2 2001:db8::/32+, # Reserved by IANA for special purposes and documentation 2002:e000::/20+, # Invalid 6to4 packets (IPv4 multicast) 2002:7f00::/24+, # Invalid 6to4 packets (IPv4 loopback) 2002:0000::/24+, # Invalid 6to4 packets (IPv4 default) 2002:ff00::/24+, # Invalid 6to4 packets 2002:0a00::/24+, # Invalid 6to4 packets (IPv4 private 10.0.0.0/8 network) 2002:ac10::/28+, # Invalid 6to4 packets (IPv4 private 172.16.0.0/12 network) 2002:c0a8::/32+, # Invalid 6to4 packets (IPv4 private 192.168.0.0/16 network) 3ffe::/16+, # Former 6bone, now decommissioned 4000::/3+, # RFC 4291 - Reserved by IETF 5f00::/8+, # RFC 5156 - used for the 6bone but was returned 6000::/3+, # RFC 4291 - Reserved by IETF 8000::/3+, # RFC 4291 - Reserved by IETF a000::/3+, # RFC 4291 - Reserved by IETF c000::/3+, # RFC 4291 - Reserved by IETF e000::/4+, # RFC 4291 - Reserved by IETF f000::/5+, # RFC 4291 - Reserved by IETF f800::/6+, # RFC 4291 - Reserved by IETF fc00::/7+, # Unicast Unique Local Addresses (ULA) - RFC 4193 fe80::/10+, # Link-local Unicast fec0::/10+, # Site-local Unicast - deprecated by RFC 3879 (replaced by ULA) ff00::/8+ # Multicast ]; } That\u0026rsquo;s a long list!! But operators on the DFZ should really never be accepting any of these, and we should all collectively yell at those who propagate them.\nRPKI Filtering is a fantastic routing security feature, described in RFC6810 and relatively straight forward to implement. 
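A practical aside: the roa_check() lookups shown below assume that Bird2 has ROA tables to consult, fed over the RPKI-to-Router protocol by a validator. That wiring is not part of this post, so here is a minimal sketch of what it might look like; the validator address, port and file name are assumptions on my part:
cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/bird/rpki.conf
# ROA tables consulted by the rpki_invalid() helpers below
roa4 table t_roa4;
roa6 table t_roa6;
protocol rpki rpki1 {
  roa4 { table t_roa4; };
  roa6 { table t_roa6; };
  remote \u0026#34;127.0.0.1\u0026#34; port 3323; # e.g. a local Routinator instance
  retry keep 90;
  refresh keep 900;
  expire keep 172800;
}
EOF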
For each originating AS number, we can check in a table of known \u0026lt;origin,prefix\u0026gt; mapping, if it is the correct ISP to originate the prefix. The lookup can either match (which makes the prefix RPKI valid), the lookup can fail because the prefix is missing (which makes the prefix RPKI unknown), and it can specifically mismatch (which makes the prefix RPKI invalid). Operators are encouraged to flag and drop invalid prefixes:\nfunction ipv4_rpki_invalid() { return roa_check(t_roa4, net, bgp_path.last) = ROA_INVALID; } function ipv6_rpki_invalid() { return roa_check(t_roa6, net, bgp_path.last) = ROA_INVALID; } NOTE: In NLNOG my post sparked a bit of debate on the use of bgp_path.last_nonaggregated versus simply bgp_path.last. Job Snijders did some spelunking and offered this post and a reference to RFC6907 for details, and Tijn confirmed that Coloclue (on which many of my approaches have been modeled) indeed uses bgp_path.last. I\u0026rsquo;ve updated my configs, with many thanks for the discussion.\nAlright, now that I\u0026rsquo;ve determined the as-path and prefix are kosher, and that it is not known to be hijacked (ie. is either ROA_VALID or ROA_UNKNOWN), I\u0026rsquo;m ready to set a few attributes, notably:\nAS_LOCALPREF If the peer I learned this prefix from is in the given list, set the BGP local preference to either 50, 30 or 10 respectively (a lower localpref means the prefix is less likely to be selected). Some internet providers send lots of prefixes, but have poor network connectivity to the place I learned the routes from (a few examples to this, 6939 is often oversubscribed in Amsterdam, and 39533 was for a while connected via a tunnel (!) to Zurich, and several hobby/amateur IXPs are on a VXLAN bridged domain rather than a physical switch).\nGraceful Shutdown described in RFC8326, shows a way to allow operators to pre-announce their downtime by setting a special BGP community that informs their peers to deselect that path by setting the local preference to the lowest possible value. This oneliner matching on (65535,0) implements that behavior.\nBlackhole Community described in RFC7999, is another special BGP community of (65535,666) which signals the need to stop sending traffic to the prefix at hand. I haven\u0026rsquo;t yet implemented the blackhole routing (this has to do with an intricacy of the VPP Linux-CP code that I wrote), so for now I\u0026rsquo;ll just remove the community.\nAlright, based on this one template, I\u0026rsquo;m now ready to implement all three types of BGP session: Peer, Upstream, and Downstream.\nPeers function ebgp_import_peer(int remote_as) { # Scrub BGP Communities (RFC 7454 Section 11) bgp_community.delete([(8298, *)]); bgp_large_community.delete([(8298, *, *)]); return ebgp_import(remote_as); } It\u0026rsquo;s dangerous to accept communities for my own AS8298 from peers. This is because several of them can actively change the behavior of route propagation (these types of communities are commonly called action communities). So with peering relationships, I\u0026rsquo;ll just toss them all.\nNow, working my way up to the actual BGP peering session, taking for example a peer that I\u0026rsquo;m connecting to at LSIX (the routeserver, in fact) in Amsterdam:\nfilter ebgp_lsix_49917_import { if ! 
ebgp_import_peer(49917) then reject; # Add IXP Communities bgp_community.add((8298,1036)); bgp_large_community.add((8298,1,1036)); accept; } protocol bgp lsix_49917_ipv4_1 { description \u0026#34;LSIX IX Route Servers (LSIX)\u0026#34;; local as 8298; source address 185.1.32.74; neighbor 185.1.32.254 as 49917; default bgp_med 0; default bgp_local_pref 200; ipv4 { import keep filtered; import filter ebgp_lsix_49917_import; export filter ebgp_lsix_49917_export; receive limit 100000 action restart; next hop self on; }; }; Parsing this through: the ipv4 import filter is called ebgp_lsix_49917_import and its job is to run the whole kittenkaboodle of filtering I described above, and then if the ebgp_import_peer() function returns false, to simply drop the prefix. But if it is accepted, I\u0026rsquo;ll tag it with a few communities. As I\u0026rsquo;ll show later, any other peer will receive these communities if I decide to propagate the prefix to them. This is specifically useful for downstream (customers), who can decide to accept/deny the prefix based on a well-known set of communities we tag.\nIXP Community: If the prefix is learned at an IXP, I\u0026rsquo;ll add a large community (8298,1,*) and a backwards-compatible normal community (8298,10XX).\nOne last thing I\u0026rsquo;ll note, and this is a matter of taste: prefixes picked up at internet exchanges (like LSIX) are typically much cheaper per megabit than transit routes, so I will set a default bgp_local_pref of 200 (a higher localpref is more likely to be selected as the active route).\nUpstream An interesting observation: from Peers and from Upstreams I typically am happy to take all the prefixes I can get (but see the epilog below for an important note on this). For a Peer, this is mostly \u0026ldquo;their own prefixes\u0026rdquo; and for a Transit, this is mostly \u0026ldquo;all prefixes\u0026rdquo;, but there\u0026rsquo;s things in the middle, say partial transit of \u0026ldquo;all prefixes learned at IXP A, B and C\u0026rdquo;. Really, all inbound sessions are very similar:\nfunction ebgp_import_upstream(int remote_as) { # Scrub BGP Communities (RFC 7454 Section 11) bgp_community.delete([(8298, *)]); bgp_large_community.delete([(8298, *, *)]); return ebgp_import(remote_as); } \u0026hellip; is in fact identical to the ebgp_import_peer() function above, so I\u0026rsquo;ll not discuss it further. But for the sessions to upstream (==transit) providers, it can make sense to use slightly different BGP community tags and a lower localpref:\nfilter ebgp_ipmax_25091_import { if ! ebgp_import_upstream(25091) then reject; # Add BGP Large Communities bgp_large_community.add((8298,2,25091)); # Add BGP Communities bgp_community.add((8298,2000)); accept; } protocol bgp ipmax_25091_ipv4_1 { description \u0026#34;IP-Max Transit\u0026#34;; local as 8298; source address 46.20.242.210; neighbor 46.20.242.209 as 25091; default bgp_med 0; default bgp_local_pref 50; ipv4 { import keep filtered; import filter ebgp_ipmax_25091_import; export filter ebgp_ipmax_25091_export; next hop self on; }; }; Again, a very similar pattern; the only material difference is that the inbound prefixes are tagged with an Upstream Community which is of the form (8298,2,*) and backwards compatible (8298,20XX).
Downstream customers can use this, if they wish, to select or reject routes (maybe they don\u0026rsquo;t like routes coming from AS25091, although they should know better because IP-Max rocks!).\nThe other slight change here is the bgp_local_pref is set to 50, which implies that it will be used only if there are no alternatives in the RIB with a higher localpref, or with a similar localpref but shorter as-path, or many other scenarios which I won\u0026rsquo;t get into here, because BGP selection criteria 101 is a whole blogpost of its own.\nDownstream That brings us to the third type of BGP sessions \u0026ndash; commonly referred to as customers except that not everybody pays :) so I just call them downstreams:\nfunction ebgp_import_downstream(int remote_as) { # We do not scrub BGP Communities (RFC 7454 Section 11) for customers return ebgp_import(remote_as); } Here, I have a special relationship with the remote_as, and I do not scrub the communities, letting the downstream operator set whichever they like. As I\u0026rsquo;ll demonstrate in the next chapter, they can use these communities to drive certain types of behavior.\nHere\u0026rsquo;s how I use this ebgp_import_downstream() function in the full filter for a downstream:\n# bgpq4 -Ab4 -R 24 -m 24 -l \u0026#39;define AS201723_IPV4\u0026#39; AS201723 define AS201723_IPV4 = [ 185.54.95.0/24 ]; # bgpq4 -Ab6 -R 48 -m 48 -l \u0026#39;define AS201723_IPV6\u0026#39; AS201723 define AS201723_IPV6 = [ 2001:678:3d4::/48, 2001:67c:6bc::/48 ]; filter ebgp_raymon_201723_import { if (net.type = NET_IP4 \u0026amp;\u0026amp; ! (net ~ AS201723_IPV4)) then reject; if (net.type = NET_IP6 \u0026amp;\u0026amp; ! (net ~ AS201723_IPV6)) then reject; if ! ebgp_import_downstream(201723) then reject; # Add BGP Large Communities bgp_large_community.add((8298,3,201723)); # Add BGP Communities bgp_community.add((8298,3500)); accept; } protocol bgp raymon_201723_ipv4_1 { local as 8298; source address 185.54.95.250; neighbor 185.54.95.251 as 201723; default bgp_med 0; default bgp_local_pref 400; ipv4 { import keep filtered; import filter ebgp_raymon_201723_import; export filter ebgp_raymon_201723_export; receive limit 94 action restart; next hop self on; }; }; OK, so this is a mouthful, but the one thing that I really need to do with customers is ensure that I only accept prefixes from them that they\u0026rsquo;re supposed to send me. I do this with a prefix-list for IPv4 and IPv6, and in the importer, I simply reject any prefixes that are not in the list. From then on, it looks very much like a peer, with identical filtering and tagging, except now I\u0026rsquo;m using yet another Customer Community which starts with (8298,3,*) and a vanilla (8298,3500) community. Anybody who wishes to, can act on the presence of these communities to know that it\u0026rsquo;s a downstream of IPng Networks AS8298.\nA note on Peers and Downstreams:\nSome ISPs will not peer with their customers (as in: once you become a transit customer they will terminate all BGP sessions at public internet exchanges), and I find that silly. However, for me the situation becomes a little bit more complex if I were to have AS201723 both as a Downstream (as shown here) as well as a Peer (which in fact, I do, at multiple Amsterdam based internet exchanges). Note how the bgp_local_pref is 400 on this session, and it will always be lower on other types of sessions. 
The implication is that this prefix from the RIB which carries (8298,3,201723) will be selected, and the ones I learn from LSIX will carry (8298,1,*) and the ones I learn from A2B (a transit provider) will carry (8298,2,51088) and both will not be selected due to those having a lower localpref. As I\u0026rsquo;ll demonstrate below, I can make smart use of these communities when announcing prefixes to my own peers and upstreams, \u0026hellip; read on :)\n3. Outbound: Announcing Routes Alright, the RIB is now filled with lots of prefixes that have the right localpref and communities, for example from having been learned at an IXP, from an Upstream, or from a Downstream. Now let\u0026rsquo;s consider the following generic exporter:\nfunction ebgp_export(int remote_as) { # Remove private ASNs bgp_path.delete([64512..65535, 4200000000..4294967295]); # Well known BGP Large Communities if (8298, 0, remote_as) ~ bgp_large_community then return false; if (8298, 0, 0) ~ bgp_large_community then return false; # Well known BGP Communities if (0, 8298) ~ bgp_community then return false; if (remote_as \u0026lt; 65536 \u0026amp;\u0026amp; (0, remote_as) ~ bgp_community) then return false; # AS path prepending if ((8298, 103, remote_as) ~ bgp_large_community || (8298, 103, 0) ~ bgp_large_community) then { bgp_path.prepend( bgp_path.first ); bgp_path.prepend( bgp_path.first ); bgp_path.prepend( bgp_path.first ); } else if ((8298, 102, remote_as) ~ bgp_large_community || (8298, 102, 0) ~ bgp_large_community) then { bgp_path.prepend( bgp_path.first ); bgp_path.prepend( bgp_path.first ); } else if ((8298, 101, remote_as) ~ bgp_large_community || (8298, 101, 0) ~ bgp_large_community) then { bgp_path.prepend( bgp_path.first ); } return true; } Oh, wow! There\u0026rsquo;s some really cool stuff to unpack here. As a belt-and-braces type safety, I will remove any private AS numbers from the as-path - this avoids my own announcements from tripping any as-path bogon filtering. But then, there\u0026rsquo;s a few well-known communities that help determine if the announcement is made or not, and there are three-and-a-half ways of doing this:\n(8298,0,remote_as) (8298,0,0) (0,8298) (0,remote_as) but only if the remote_as is 16 bits. All four of these methods will tell the router to refuse announcing the prefix on this session. Note that downstreams are allowed to set (8298,*,*) and (8298,*) communities (and they\u0026rsquo;re the only ones who are allowed to do so). So here is where some of the cool magic starts to happen.\nThen, to drive prepending of the prefix on this session, I\u0026rsquo;ll again match certain communities (8298, 103, *) will prepend the customer\u0026rsquo;s AS number three times, using 102 will prepend twice, and 101 will prepend once. If the third digit is 0, then any session with this filter will prepend. If the third digit is the AS number, then only sessions to this AS number will be prepended.\nUsing these types of communities allow downstream (customers) incredibly fine grained propagation actions, at the per-IPng-session level. Not many ISPs offer this functionality!\nPeers Exporting to peers, I really need to make sure that I don\u0026rsquo;t send too many prefixes. Most of us have at some point gone through the embarassing motions of being told by a fellow operator \u0026ldquo;hey you\u0026rsquo;re sending a full table\u0026rdquo;. It is paramount to good peering hygiene that I do not leak. 
So I\u0026rsquo;ll define a healthy set of defense in depth principles here:\n# bgpq4 -A4b -R 24 -m 24 -l \u0026#39;define AS8298_IPV4\u0026#39; AS8298 define AS8298_IPV4 = [ 92.119.38.0/24, 194.1.163.0/24, 194.126.235.0/24 ]; # bgpq4 -A6bR 48 -m 48 -l \u0026#39;define AS8298_IPV6\u0026#39; AS8298 define AS8298_IPV6 = [ 2001:678:d78::/48, 2a0b:dd80::/29{29,48} ]; # bgpq4 -A4b -R 24 -m 24 -l \u0026#39;define AS_IPNG_IPV4\u0026#39; AS-IPNG define AS_IPNG_IPV4 = [ ... ## Removed for brevity ]; # bgpq4 -A6bR 48 -m 48 -l \u0026#39;define AS_IPNG_IPV6\u0026#39; AS-IPNG define AS_IPNG_IPV6 = [ .. ## Removed for brevity ]; # bgpq4 -t4b -l \u0026#39;define AS_IPNG\u0026#39; AS-IPNG define AS_IPNG = [112, 8298, 50869, 57777, 60557, 201723, 212323, 212855]; function aspath_first_valid() { return (bgp_path.len = 0 || bgp_path.first ~ AS_IPNG); } # A list of well-known tier1 transit providers function aspath_contains_tier1() { return bgp_path ~ [ 174, # Cogent 209, # Qwest (HE carries this on IXPs IPv6 (Jul 12 2018)) 701, # UUNET 702, # UUNET 1239, # Sprint 1299, # Telia 2914, # NTT Communications 3257, # GTT Backbone 3320, # Deutsche Telekom AG (DTAG) 3356, # Level3 3549, # Level3 3561, # Savvis / CenturyLink 4134, # Chinanet 5511, # Orange opentransit 6453, # Tata Communications 6762, # Seabone / Telecom Italia 7018 ]; # AT\u0026amp;T } # The list of our own uplink (transit) providers # Note: This list is autogenerated by our automation. function aspath_contains_upstream() { return bgp_path ~ [ 8283,25091,34549,51088,58299 ]; } function ipv4_prefix_valid() { # Our (locally sourced) prefixes if (net ~ AS8298_IPV4) then return true; # Customer prefixes in AS-IPNG must be tagged with customer community if (net ~ AS_IPNG_IPV4 \u0026amp;\u0026amp; (bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)]) ) then return true; return false; } function ipv6_prefix_valid() { # Our (locally sourced) prefixes if (net ~ AS8298_IPV6) then return true; # Customer prefixes in AS-IPNG must be tagged with customer community if (net ~ AS_IPNG_IPV6 \u0026amp;\u0026amp; (bgp_large_community ~ [(8298, 3, *)] || bgp_community ~ [(8298, 3500)]) ) then return true; return false; } function prefix_valid() { # as-path based filtering if !aspath_first_valid() then return false; if aspath_contains_tier1() then return false; if aspath_contains_upstream() then return false; # prefix (and BGP community) based filtering if (net.type = NET_IP4 \u0026amp;\u0026amp; !ipv4_prefix_valid()) then return false; if (net.type = NET_IP6 \u0026amp;\u0026amp; !ipv6_prefix_valid()) then return false; return true; } function ebgp_export_peer(int remote_as) { if !prefix_valid() then return false; return ebgp_export(remote_as); } Wow, alrighty then!! All I\u0026rsquo;m doing here is checking if the call to prefix_valid() returns true. That function isn\u0026rsquo;t very complex. It takes a look at three as-path based filters and then a prefix-list based filter. Let\u0026rsquo;s go over them in turn:\naspath_first_valid() takes a look at the first hop in the as-path. I need to make sure that I\u0026rsquo;ve received this prefix from an actual downstream, and those are collected in a RIPE as-set called AS-IPNG. So if the first BGP hop in the path is not one of these, I\u0026rsquo;ll refuse to announce the prefix.\naspath_contains_tier1() is a belt-and-braces style check. How on earth would I provide transit for any prefix for which there\u0026rsquo;s already a global Tier1 provider in the path? 
I mean, in no universe would AS174 or AS1299 need me to reach any of their customers, or indeed, any place in the world. So this filter helps me never announce the prefix, if it has one of these ISPs in the path.\naspath_contains_upstream() similarly, if I am receiving a full table from an upstream provider, I should not be passing this prefix along - I would for similar reasons never be a transit provider for A2B or IP-Max or Meerfarbig. Due to a bug in my configuration, my buddy Erik kindly pointed out this issue to me, so hat-tip to him for the intelligence.\nipv[46]_prefix_valid() is the main thrust of prefix-based filtering. At this point we\u0026rsquo;ve already established that the as-path is clean, but it could be that the downstream is sending prefixes they should not (possibly leaking a full table) so let\u0026rsquo;s take a look at a good way to avoid this.\nFirst, we look at locally sourced routes from AS8298, that is the ones that I myself originate at IPng Networks. These are always OK. The list is carefully curated. Alternatively, the prefix needs to be from the as-set AS-IPNG (which contains both my prefixes and all route and route6 objects belonging to any AS number that I consider a downstream), Finally, if the prefix is from AS-IPNG, I\u0026rsquo;ll still add one additional check to ensure that there is a so-called customer community attached. Remember that I discused this specifically up in the Inbound - Downstream section. So before I were to announce anything on such a session, all four of as-path, inbound prefix-list, outbound prefix-list and bgp-community are checked. This makes it incredibly unlikely that AS8298 ever leaks prefixes \u0026ndash; knock on wood!\nUpstream Interestingly and if you think about it, unsurprisingly, an upstream configuration is exactly identical to a peer:\nfunction ebgp_export_upstream(int remote_as) { if !prefix_valid() then return false; return ebgp_export(remote_as); } Alright, nothing to see here, moving on \u0026hellip;\nDownstream Now the difference between a Peer and an Upstream on the one hand, and a Downstream on the other, is that the former two will only see a very limited set of prefixes, heavily guarded by all of that filtering I described. But a downstream typically has the luxury of getting to learn every prefix I\u0026rsquo;ve learned:\nfunction ipv4_acceptable_size() { if net.len \u0026lt; 8 then return false; if net.len \u0026gt; 24 then return false; return true; } function ipv6_acceptable_size() { if net.len \u0026lt; 12 then return false; if net.len \u0026gt; 48 then return false; return true; } function ebgp_export_downstream(int remote_as) { if (source != RTS_BGP \u0026amp;\u0026amp; source != RTS_STATIC) then return false; if (net.type = NET_IP4 \u0026amp;\u0026amp; ! ipv4_acceptable_size()) then return false; if (net.type = NET_IP6 \u0026amp;\u0026amp; ! ipv6_acceptable_size()) then return false; return ebgp_export(remote_as); } So here I\u0026rsquo;ll assert that the prefix has to be either from the RTS_BGP source, or from the RTS_STATIC source. This latter source is what Bird uses for locally generated routes (ie. the ones in AS8298 itself). Locally generated routes are not known from BGP, but known instead because they are blackholed / null-routed on the router itself. 
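As an aside, those locally generated routes typically live in a static protocol; the exact stanza is not shown in this article, but a hedged sketch for the AS8298 prefixes from the lists above might look like this (the file name is an assumption):
cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/bird/static.conf
# Locally originated prefixes, null-routed on the router itself;
# Bird marks these with source RTS_STATIC.
protocol static static8298_v4 { ipv4; route 92.119.38.0/24 blackhole; route 194.1.163.0/24 blackhole; route 194.126.235.0/24 blackhole; }
protocol static static8298_v6 { ipv6; route 2001:678:d78::/48 blackhole; route 2a0b:dd80::/29 blackhole; }
EOF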
And from these (RTS_BGP or RTS_STATIC) routes, I further deselect those prefixes that are too short or too long, which are slightly different based on address family (for IPv4 anywhere between /8 and /24, and for IPv6 anywhere between /12 and /48).\nNow, I will note that I\u0026rsquo;ve seen many operators who inject OSPF or connected or static routes into BGP, and all of those folks will have to maintain elaborate egress \u0026ldquo;bogon\u0026rdquo; route filters, for example for those IXP prefixes that they picked up due to them being directly connected. If those operators simply did not propagate directly connected routes, their life would be so much simpler .. but I digress and it\u0026rsquo;s time for me to wrap up.\nEpilog I hope this little dissertation proves useful for other Bird enthusiasts out there. I myself had to fiddle a bit over the years with the idiosyncrasies (and bugs) of Bird and Bird2. I wanted to make a few comments:\nThanks to the crew at Coloclue for having a really phenomenal routing setup, with a lot of thoughtful documentation, action communities, and strict ingress and egress filtering. It\u0026rsquo;s also fully automated and I\u0026rsquo;ve derived, although completely rewritten, my own automation based off of Kees. I understand that the main distinction between inbound Peer and Upstream is that for Peers many folks will want to do strict filtering. I\u0026rsquo;ve considered this for a long time and ultimately decided against it, because a combination of max prefix, tier1 as-path filtering and RPKI filtering would take care of the most egregious mistakes and otherwise, I\u0026rsquo;m actually happy to get more prefixes via IXPs rather than less. ","date":"2021-11-14","desc":"Introduction BGP Routing policy is a very interesting topic. I get asked about it formally and informally all the time. I have to admit, there are lots of ways to organize an autonomous system. Vendors have unique features and templating / procedural functions, but in the end, BGP routing policy all boils down to two+two things:\nNot accepting the prefixes you don\u0026rsquo;t want (inbound) For those prefixes accepted, ensure they have correct attributes. Not announcing prefixes to folks who shouldn\u0026rsquo;t see them (outbound) For those prefixes announced, ensure they have correct attributes. At IPng Networks, I\u0026rsquo;ve cycled through a few iterations and landed on a specific setup that works well for me. It provides sufficient information to enable our downstream (customers) to make good decisions on what they should accept from us, as well as enough expressivity for them to determine which prefixes we should propagate for them, where, and how.\n","permalink":"https://ipng.ch/s/articles/2021/11/14/case-study-bgp-routing-policy/","section":"articles","title":"Case Study - BGP Routing Policy"},{"contents":"A Brief History In January of 2003, my buddy Jeroen announced a project called the Ghost Route Hunters, after the industry had been plagued for a few years with anomalies in the DFZ - routes would show up with phantom BGP paths, unable to be traced down to a source or faulty implementation. Jeroen presented his findings at RIPE-46 and for years after this, the industry used the SixXS GRH as a distributed looking glass.
At the time, one of SixXS\u0026rsquo;s point of presence providers kindly lent the project AS8298 to build this looking glass and underlying infrastructure.\nAfter running SixXS for 16 years, Jeroen and I decided to Sunset it, which meant that in June of 2017, the Ghost Route Hunter project came to an end as well, and as we tore down the infrastructure, AS8298 became dormant.\nThen in August of 2021, I was doing a little bit of cleaning on the IPng Networks serving infrastructure, and came across some old mail from RIPE NCC about that AS number. And while IPng Networks is running just fine on AS50869 today, it would be just that little bit cooler if it were to run on AS8298. So, I embarked on a journey to move a running ISP into a new AS number, which sounds like fun! This post describes the situation going into this renumbering project, and there will be another post, likely in January 2022, that describes the retrospective (this future post may be either celebratory, or a huge postmortem, to be determined).\nThe Plan Step 0. Acquire the AS First off, I had to actually acquire the AS number. Back in the day (I\u0026rsquo;m speaking of 2002), RIPE NCC was a little bit less formal than it is today. As such, our loaned AS number was simply registered to SixXS, which is not a legal entity. Later, to do things properly, it was placed in Jeroen\u0026rsquo;s custody (by creating ORG-SIXX1-RIPE). But precisely because neither Jeroen nor SixXS was an LIR at that time, the previous holder (Easynet, with LIR de.easynet, later acquired by Sky, also called British Sky Broadcasting, trading under LIR uk.bskyb) became the sponsoring LIR. So I had to arrange two things:\nA Transfer Agreement between Jeroen and myself; that signaled his willingness to transfer the AS number to me. This is boilerplate stuff, and example contracts can be downloaded on the RIPE website. An agreement between Sky and IPng Networks; that signaled the transfer from sponsoring LIR to our LIR ch.ipng. This is rather non-bureaucratic and a well traveled path; sponsoring LIRs and movements of resources between holders happen all the time. With the permission of the previous holder, and with the help of the previous sponsoring LIR, the transfer itself was a matter of filing the correct paperwork at RIPE NCC, quoting the transfer agreement, and providing identification for the offering party (Jeroen) and the receiving party (Pim). And within a matter of a few days, the AS number was transferred to ORG-PVP9-RIPE, the ch.ipng LIR.\nNOTE - In case you\u0026rsquo;re wondering, I registered the ch.ipng LIR a few months before I incorporated IPng Networks as a Swiss limited liability company (called a GmbH, or Gesellschaft mit beschränkter Haftung in German), so for now I\u0026rsquo;m still trading as my natural person. RIPE has a cooldown period of two years before new LIRs can acquire/merge/rename. I do expect that some time in 2023 the PeeringDB page and bgp.he.net and friends will drop my personal name and take my company name :) For now, trading as Pim will have to do, slightly more business risk, but just as much fun!\nStep 1. Split AS50869 into two networks The autonomous system of IPng Networks spans two main parts.
Firstly, in Zurich IPng Networks operates four sites and six routers:\nTwo in a private colocation site at Daedalean (A) in Albisrieden called ddln0 and ddln1, they are running DANOS Two at our offices in Brüttisellen (C), called chbtl0 and chbtl1, they are running Debian One at Interxion ZUR1 datacenter in Glattbrugg (D), called chgtg0, running VPP, connecting to a public internet exchange CHIX-CH and taking transit from IP-Max and Openfactory. One at NTT\u0026rsquo;s datacenter in Rümlang (E), called chrma0, also running VPP, connecting to a public internet exchange SwissIX and taking transit from IP-Max and Meerfarbig. NOTE: You can read a lot about my work on VPP in a series of VPP articles, please take a look!\nThere\u0026rsquo;s a few downstream IP Transit networks and lots of local connected networks, such as in the DDLN colo. Then, from chrma0, we connect to our european ring northwards (towards Frankfurt), and from chgtg0 we connect to our european ring south-westwards (towards Geneva).\nThat ring, then, consists of five additional sites and five routers, all running VPP:\nFrankfurt: defra0, connecting to four DE-CIX exchangepoints in Frankfurt itself directly, and remotely to Munich, Düsseldorf and Hamburg Amsterdam: nlams0, connecting to NL-IX, SpeedIX, FrysIX (our favorite!), and LSIX; we also pick up two transit providers (A2B and Coloclue). Lille: frggh0, connecting to the northern france exchange called LillIX Paris: frpar0, connecting to two FranceIX exchange points, directly in Paris, and remotely to Marseille Geneva: chplo0, connecting to our very own Free IX Every one of these sites has an upstream session with AS25091 (IP-Max). Considering these folks are organizationally very close to me, it is easy for me to rejigger any one of those sessions between AS50869 (current) and AS8298 (new). And considering AS8298 has been a member of our as-set AS-IPNG for a while, it\u0026rsquo;ll also be a natural propagation to rely on IP-Max, even if some peering sessions might be down.\nSo before I start, IPng Networks\u0026rsquo; view from AS25091 in Amsterdam looks like this:\nNetwork Next Hop Metric LocPrf Weight Path * 92.119.38.0/24 46.20.242.210 0 50869 i * 176.119.215.0/24 46.20.242.210 0 50869 60557 i * 185.36.229.0/24 46.20.242.210 0 50869 212855 i * 185.173.128.0/24 46.20.242.210 0 50869 57777 i * 185.209.12.0/24 46.20.242.210 0 50869 212323 i * 192.31.196.0/24 46.20.242.210 0 50869 112 i * 192.175.48.0/24 46.20.242.210 0 50869 112 i * 194.1.163.0/24 46.20.242.210 0 50869 i * 194.126.235.0/24 46.20.242.210 0 50869 i Network Next Hop Metric LocPrf Weight Path * 2001:4:112::/48 2a02:2528:1902::210 0 50869 112 i * 2001:678:3d4::/48 2a02:2528:1902::210 0 50869 201723 i * 2001:678:ce4::/48 2a02:2528:1902::210 0 50869 60557 i * 2001:678:ce8::/48 2a02:2528:1902::210 0 50869 60557 i * 2001:678:cec::/48 2a02:2528:1902::210 0 50869 60557 i * 2001:678:cf0::/48 2a02:2528:1902::210 0 50869 60557 i * 2001:678:d78::/48 2a02:2528:1902::210 0 50869 i * 2001:67c:6bc::/48 2a02:2528:1902::210 0 50869 201723 i * 2620:4f:8000::/48 2a02:2528:1902::210 0 50869 112 i * 2a07:cd40::/29 2a02:2528:1902::210 0 50869 212855 i * 2a0b:dd80::/29 2a02:2528:1902::210 0 50869 i * 2a0b:dd80::/32 2a02:2528:1902::210 0 50869 i * 2a0d:8d06::/32 2a02:2528:1902::210 0 50869 60557 i * 2a0e:fd40:200::/48 2a02:2528:1902::210 0 50869 60557 i * 2a0e:fd45:da0::/48 2a02:2528:1902::210 0 50869 60557 i * 2a10:d200::/29 2a02:2528:1902::210 0 50869 212323 i * 2a10:fc40::/29 2a02:2528:1902::210 0 50869 57777 i Step 2. 
Restrict the routers that originate our prefixes As a preparation to actually starting to use AS8298, I\u0026rsquo;ll create RPKI records of authorization for all of our prefixes in both AS50869 and AS8298, and I\u0026rsquo;ll add route: objects for all of them in both as well.\nNow, I\u0026rsquo;m ready to make my first networking topology change: instead of originating our prefixes in all routers, I will originate our prefixes in AS50869 only from two routers: chbtl0 and chbtl1. Nobody on the internet will notice this change, the as-path will remain ^50869$ for all prefixes.\nI\u0026rsquo;ll also prepare the two routers to speak to eachother with an iBGP session (rather than to IPng Networks\u0026rsquo; route-reflectors, which are still in AS50869).\nStep 3. Convert Brüttisellen to AS8298 Now, I\u0026rsquo;ll switch these routers chbtl0 and chbtl1 out of AS50869 and into AS8298. The only damage at this point might be that my personal Spotify and Netflix stop working, and my family yells at me (but they do that all the time anyway, so it\u0026rsquo;s a wash\u0026hellip;). If things go poorly, the backout plan is to switch back to AS50869 and return things to normal. But if things go well, from this point onwards everybody will see our own IPng Networks prefixes via as-path ^50869_8298$ and effectively, AS50869 will become a transit provider for AS8298, which will be singlehomed for the moment.\nMy buddy Max runs a small /29 and /64 exchangepoint in Brüttisellen with only three members on it - IPng Networks, Stucchinet and Openfactory. I will ask both of them to be my canary, and change their peering session from AS50869 to AS8298. If things go bad, that\u0026rsquo;s no worries, I can drop/disable these peering sessions as I have sessions with both as well in other places. But it\u0026rsquo;d be a good place to see and test if things work as expected.\nStep 4. Convert DDLN to AS8298 Now that our IPv4 and IPv6 prefixes have moved and AS50869 does not originate prefixes anymore, you may wonder \u0026lsquo;what about the colo?\u0026rsquo;. Indeed, the colo runs at Daedalean behind ddln0 and ddln1, both of which are still in AS50869. However, the only way to ever be able to reach those prefixes, would be to find an entrypoint in AS50869 (as it is the only uplink of AS8298). All routers in AS50869 and AS8298 share an underlying OSPF and OSPFv3 interior gateway protocol, which means that if anything is destined to 194.1.163.0/24 or 2001:678:d78::/48, those packets will find their way to the correct location using the IGP. That\u0026rsquo;s neat, because it means that even though ddln0 is speaking BGP in AS50869, it will happily forward traffic to its more specific prefixes from AS8298.\nConsidering there\u0026rsquo;s only one IP transit network in DDLN, those two routers will be the first for me to convert. After converting them to AS8298, they will receive transit from AS50869 just like the ones in Brüttisellen. I\u0026rsquo;ll rejigger Jeroen\u0026rsquo;s AS57777 to receive transit directly from AS8298, and it will then be the first to transition. Jeroen\u0026rsquo;s prefixes will briefly become ^50869_8298_57777$, which will be the only change, but this will validate that, indeed, AS8298 can provide transit. 
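A quick sanity check at that point (a sketch of my own, not part of the original plan write-up): ask a router that is still in AS50869 what it thinks of one of Jeroen\u0026rsquo;s prefixes, for example 185.173.128.0/24 from the table above, and look at the tail of the path:
birdc show route for 185.173.128.0/24 all ## expect a BGP.as_path ending in 8298 57777 after the move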
Apart from the longer as-path, the physical path that IP packets take will remain the same because Jeroen’s network is currently, cough, singlehomed at DDLN.\nNow that I have four routers in AS8298, I\u0026rsquo;ll add iBGP sessions directly amongst the pairs only, rather than in a full mesh on these routers. I\u0026rsquo;ve now created two islands of AS8298, interconnected by AS50869. You\u0026rsquo;d think this is bad, but it\u0026rsquo;s actually a fine situation to be in and there will be no loss of service, because:\nwe\u0026rsquo;ve already established that AS50869 has reachability to all more-specifics in AS8298 (Step 3) AS50869 has reachability to AS57777 via its downstream AS8298 AS50869 is the only way in and out of the DDLN or Brüttisellen routers Nevertheless, I\u0026rsquo;d rather swiftly continue with the next step, and reconnect these two islands. It is a good time for me to give a headsup to the larger internet exchanges (notably SwissIX) so folks can prepare for what\u0026rsquo;s coming. I think many NOC teams know how to establish/remove a peering session, but I do expect the ITIL-automated teams to not have a playbook for \u0026ldquo;peer on IP 192.0.2.1 has changed from AS666 to AS42\u0026rdquo;. I\u0026rsquo;ll observe their performance on this task and take notes, as there\u0026rsquo;s quite a few public IXPs to go\u0026hellip;\nStep 5. Convert Rümlang and Glattbrugg to AS8298 After four of our routers (two redundant pairs) have been operating in AS8298 for a few days, I\u0026rsquo;m ready to renumber our first machine connected to a public internet exchange: chgtg0 is connected to CHIX-CH. I\u0026rsquo;ll contact the members with whom IPng Networks has a direct peering, and of course the internet exchange folks, to ask them to renumber our AS50869 to AS8298. After restarting our router into the new AS, one by one I\u0026rsquo;ll establish sessions with our peers there - this is an important exercise because I\u0026rsquo;ll be doing this again in Step 6 on every peering/border router in the european ring, and there will be in total ~1'850 BGP adjacencies that have to be renumbered.\nAt CHIX-CH, IPng has in total four BGP sessions with the two routeservers in AS212100, and in total 38 direct BGP sessions; two of which are somewhat important: AS13030 (Init7) and AS15169 (Google), to which a large fraction of our traffic flows. While upgrading this router, I\u0026rsquo;ll also switch my one downstream network (Daedalean itself, operating AS212323) to receive transit from AS8298. Because I\u0026rsquo;ve canaried this with Jeroen\u0026rsquo;s AS57777 in the DDLN colo previously, I\u0026rsquo;ll be reasonably certain at this point that it\u0026rsquo;ll work well. If not, they have alternative uplinks (notably AS174), so they should be fine without me.\nAt SwissIX, IPng has as well four BGP sessions with the two routeservers in AS42476, and in total 132 direct BGP sessions. I think that, once these two peering routers are complete, I\u0026rsquo;ll checkpoint and let things run like this for a while. Let\u0026rsquo;s take a few weeks off, giving me a while to hunt down peers at SwissIX and CH-IX to catch up and re-establish their sessions with the new AS8298 :)\nAfter this step, phase one of the transition is complete, and AS8298 (and its networks AS57777 and AS212323) will be directly visible in Switzerland, and still tucked away behind ^50869_8298_212323$ for the international traffic.
I will however have 2 transit sessions from IP-Max (AS25091), 2 transit sessions from Openfactory (AS58299) and 1 transit session from Meerfarbig (AS34549); so it is expected to be a quite stable network at this point, which is good.\nStep 6. Convert European Ring to AS8298 For this task, I\u0026rsquo;ll swap my iBGP sessions to all use the new AS8298. I do this by first dismantling rr0.chbtl0.ipng.ch and bring it back up as an iBGP speaker in the new AS; then one by one push all routers to speak to that route reflector (in addition to the existing route reflectors in AS50869). After that stabilizes, rinse and repeat with rr0.chplo0.ipng.ch; and finally finish the job with rr0.frggh0.ipng.ch. The two pairs of routers who were by themselves in AS8298 (chbtl0/chbtl1 in Brüttisellen; and ddln0/ddln1 at Daedalean) can now be reattached to the iBGP route-reflectors as well.\nIt will be a bit of a schlepp, but now comes the notification of all international peers (there are an additional ~250 direct peerings) and downstreams (there are three left: Raymon, Jelle and Krik), and upstreams (there are two additional ones: Coloclue, and A2B). While normally this is a matter of merely swapping the AS number, it has to be done on both sides - on my side, I can do it with a one-line change to the git repository, and it\u0026rsquo;ll be pushed by Kees (the network build and push automation that was inspired by Coloclue\u0026rsquo;s Kees), on the remote side it will be a matter of patience. One by one, folks will (or.. won\u0026rsquo;t) update their peering session. The only folks I\u0026rsquo;ll actively chase is the DE-CIX, FranceIX, NL-IX, LSIX and FrysIX routeserver operators, as the vast majority of adjacencies are learned via those. By means of a headsup one week in advance, and a few reminders on the day of, and the day after maintenance, I should minimize downtime. But, in this case, because I already have two transit providers in Switzerland (AS25091 IP-Max, and AS58299 Openfactory) who provide me full tables in AS8298, it should be operationally smooth sailing. At the end of this exercise, the as-path will be ^8298$ for my own prefixes and ^8298_.*$ for my downstream networks, and AS50869 will no longer be in any as-path in the DFZ. Rolling back is tricky, although Bird can do individual peering sessions with differing AS numbers, I don\u0026rsquo;t think this is a good idea; as it\u0026rsquo;ll mean many (many) changes into the repository; so in the interest of simplicity and don\u0026rsquo;t break things that work, I will do them router-by-router rather than session-by-session; and send a few reminders to folks to update their records to match my peeringdb entries.\nStep 7. Repurpose/retire AS50869 There\u0026rsquo;s really not that much to do \u0026ndash; delete the route: and route6: objects and remove RPKI ROAs for the old announcments; but mostly it\u0026rsquo;ll be a matter of hunting down peering partners who have not (yet) updated their records and sessions. I imagine lots of folks will hesitate and be unfamiliar with this type of operation (even though it literally is an s/50869/8298/g for them). I\u0026rsquo;ll take most of December to remind folks a few times, and ultimately just clean up broken peering sessions in January 2022.\nAnd of course, then lick my wounds and count pokemon - on October 24th, the day this post was published, Hurricane Electric showed 1'845 adjacencies in total for AS50869, of which 1'653 IPv4 and 1'430 IPv6. 
I will consider it a success if I lose less than 200 adjacencies. I\u0026rsquo;ll keep AS50869 around, as a test AS number to do a few experiments.\nThe Timeline Most all intrusive maintenances will be done in maintenance windows between 22:00 - 03:00 UTC from Thursday to Friday. The tentative planning for the project, which starts on October 22nd and lasts through end of year (10 weeks):\n2021-10-22 - RIPE updates for route:, route6: and RPKI ROAs 2021-10-24 - Originate prefixes from chbtl0/chbtl1 (no-op for the DFZ) 2021-10-28 - Move chbtl0/chbtl1 at Brüttisellen to AS8298 2021-10-29 - Update Brüttisellen IXP (AS58299 and AS58280), canary upstream AS58299 2021-10-29 - Headsup to CHIX-CH and SwissIX peers and Transit partners, announcing move 2021-11-04 - Move ddln0/ddln1 at Daedalean to AS8298, canary downstream AS57777 2021-11-11 - Move chgtg0 Glattbrugg to AS8298, add downstream AS212323 2021-11-12 - Move CHIX-CH peers to AS8298 2021-11-15 - Move chrma0 Rümlang to AS8298 2021-11-16 - Move SwissIX peers to AS8298 Two week cooldown period, start to move IXPs to AS8298 2021-11-29 - Headsup to all european IXPs and IP Transit partners, announcing move 2021-12-02 - Move defra0, nlams0, frggh0, frpar0, chplo0 to AS8298 2021-12 - Move european IXPs to AS8298 2021-12-06 - First reminder for peerings that did not re-establish 2021-12-13 - Second reminder for peerings that did not re-establish 2021-12-20 - Third (final) reminder for peerings that did not re-establish 2022-01-10 - Remove peerings that did not re-establish Appendix This blogpost turned into a talk at RIPE #83, in case you wanted to take a look at the recording and a few questions.\n","date":"2021-10-24","desc":"A Brief History In January of 2003, my buddy Jeroen announced a project called the Ghost Route Hunters, after the industry had been plagued for a few years with anomalies in the DFZ - routes would show up with phantom BGP paths, unable to be traced down to a source or faulty implementation. Jeroen presented his findings at RIPE-46 and for years after this, the industry used the SixXS GRH as a distributed looking glass. At the time, one of SixXS\u0026rsquo;s point of presence providers kindly lent the project AS8298 to build this looking glass and underlying infrastructure.\n","permalink":"https://ipng.ch/s/articles/2021/10/24/ipng-acquires-as8298/","section":"articles","title":"IPng acquires AS8298"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. 
When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nRunning in Production In the first articles from this series, I showed the code that needed to be written to implement the Control Plane and Netlink Listener plugins. In the penultimate post, I wrote an SNMP Agentx that exposes the VPP interface data to, say, LibreNMS.\nBut what are the things one might do to deploy a router end-to-end? That is the topic of this post.\nA note on hardware Before I get into the details, here are some specifications of the router hardware that I use at IPng Networks (AS50869). See more about our network here.\nThe chassis is a Supermicro SYS-5018D-FN8T, which includes:\nFull IPMI support (power, serial-over-lan and kvm-over-ip with HTML5), on a dedicated network port. A 4-core, 8-thread Xeon D1518 CPU which runs at 35W Two independent Intel i210 NICs (Gigabit) A Quad Intel i350 NIC (Gigabit) Two Intel X552 (TenGig) (optional) One Intel X710 Quad-TenGig NIC in the expansion bus m.SATA 120G boot SSD 2x16GB of ECC RAM The only downside for this machine is that it has only one power supply, so datacenters which do periodic feed-maintenance (such as Interxion is known to do) are likely to reboot the machine from time to time. However, the machine is very well spec\u0026rsquo;d for VPP in \u0026ldquo;low\u0026rdquo; performance scenarios. A machine like this is very affordable (I bought the chassis for about USD 800,- a piece) but its CPU/Memory/PCIe construction is enough to provide forwarding at approximately 35Mpps.\nDoing a lazy 1Mpps on this machine\u0026rsquo;s Xeon D1518, VPP comes in at ~660 clocks per packet with a vector length of ~3.49. This means that if I dedicate 3 cores running at 2200MHz to VPP (leaving 1C2T for the controlplane), this machine has a forwarding capacity of ~34.7Mpps, which fits really well with the Intel X710 NICs (which are limited to 40Mpps [ref]).\nA reasonable step-up from here would be Supermicro\u0026rsquo;s SIS810 with a Xeon E-2288G (8 cores / 16 threads) which carries a dual-PSU, up to 8x Intel i210 NICs and 2x Intel X710 Quad-Tengigs, but it\u0026rsquo;s quite a bit more expensive. I commit to do that the day AS50869 is forwarding 10Mpps in practice :-)\nInstall HOWTO First, I install the \u0026ldquo;canonical\u0026rdquo; (pun intended) operating system that VPP is most comfortable running on: Ubuntu 20.04.3. Nothing special is selected when installing, and after the install is done, I make sure that GRUB uses the serial IPMI port by adding to /etc/default/grub:\nGRUB_CMDLINE_LINUX=\u0026#34;console=tty0 console=ttyS0,115200n8 isolcpus=1,2,3,5,6,7\u0026#34; GRUB_TERMINAL=serial GRUB_SERIAL_COMMAND=\u0026#34;serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1\u0026#34; # Followed by a gratuitous install and update grub-install /dev/sda update-grub Note that isolcpus is a neat trick that tells the Linux task scheduler to avoid scheduling any workloads on those CPUs. Because the Xeon-D1518 has 4 cores (0,1,2,3) and 4 additional hyperthreads (4,5,6,7), this stanza effectively makes core 1,2,3 unavailable to Linux, leaving only core 0 and its hyperthread 4 available.
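(After a reboot, it is easy to double-check that the kernel honored the isolcpus= stanza; the second file should show something like 1-3,5-7.)
cat /proc/cmdline
cat /sys/devices/system/cpu/isolated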
This means that our controlplane will have 2 CPUs available to run things like Bird, SNMP, SSH etc, while hyperthreading is essentially turned off on CPU 1,2,3 giving those cores entirely to VPP.\nIn case you were wondering why I would turn off hyperthreading in this way: hyperthreads share CPU instruction and data cache. The premise of VPP is that a vector (a list) of packets will go through the same routines (like ethernet-input or ip4-lookup) all at once. In such a computational model, VPP leverages the i-cache and d-cache to have subsequent packets make use of the warmed up cache from their predecessor, without having to use the (much slower, relatively speaking) main memory.\nThe last thing you\u0026rsquo;d want, is for the hyperthread to come along and replace the cache contents with what-ever it\u0026rsquo;s doing (be it Linux tasks, or another VPP thread).\nSo: disaallowing scheduling on 1,2,3 and their counterpart hyperthreads 5,6,7 AND constraining VPP to run only on lcore 1,2,3 will essentially maximize the CPU cache hitrate for VPP, greatly improving performance.\nNetwork Namespace Originally proposed by TNSR, a Netgate commercial productionization of VPP, it\u0026rsquo;s a good idea to run VPP and its controlplane in a separate Linux network namespace. A network namespace is logically another copy of the network stack, with its own routes, firewall rules, and network devices.\nCreating a namespace looks like follows, on a machine running systemd, like Ubuntu or Debian:\ncat \u0026lt;\u0026lt; EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service [Unit] Description=Dataplane network namespace After=systemd-sysctl.service network-pre.target Before=network.target network-online.target [Service] Type=oneshot RemainAfterExit=yes # PrivateNetwork will create network namespace which can be # used in JoinsNamespaceOf=. PrivateNetwork=yes # To set `ip netns` name for this namespace, we create a second namespace # with required name, unmount it, and then bind our PrivateNetwork # namespace to it. After this we can use our PrivateNetwork as a named # namespace in `ip netns` commands. ExecStartPre=-/usr/bin/echo \u0026#34;Creating dataplane network namespace\u0026#34; ExecStart=-/usr/sbin/ip netns delete dataplane ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf ExecStart=-/usr/sbin/ip netns add dataplane ExecStart=-/usr/bin/umount /var/run/netns/dataplane ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane # Apply default sysctl for dataplane namespace ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl ExecStop=-/usr/sbin/ip netns delete dataplane [Install] WantedBy=multi-user.target WantedBy=network-online.target EOF sudo systemctl daemon-reload sudo systemctl enable netns-dataplane sudo systemctl start netns-dataplane Now, every time we reboot the system, a new network namespace will exist with the name dataplane. 
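A quick check that the namespace exists and is still empty, plus a note for later: controlplane daemons can join it either with systemd\u0026rsquo;s JoinsNamespaceOf= (which the unit above was written to support) or simply by wrapping them in ip netns exec dataplane.
ip netns list ## should list: dataplane
sudo ip netns exec dataplane ip -br link ## only a loopback at this point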
That\u0026rsquo;s where you\u0026rsquo;ve seen me create interfaces in my previous posts, and that\u0026rsquo;s where our life-as-a-VPP-router will be born.\nPreparing the machine After creating the namespace, I\u0026rsquo;ll install a bunch of useful packages and further prepare the machine, but also I\u0026rsquo;m going to remove a few out-of-the-box installed packages:\n## Remove what we don\u0026#39;t need sudo apt purge cloud-init snapd ## Usual tools for Linux sudo apt install rsync net-tools traceroute snmpd snmp iptables ipmitool bird2 lm-sensors ## And for VPP sudo apt install libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 libnl-route-3-200 \\ libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser libsubunit0 ## Disable Bird and SNMPd because they will be running in another namespace for i in bird snmpd; do sudo systemctl stop $i sudo systemctl disable $i sudo systemctl mask $i done # Ensure all temp/fan sensors are detected sensors-detect --auto Installing VPP After building the code, specifically after issuing a successful make pkg-deb, a set of Debian packages will be in the build-root sub-directory. Take these and install them like so:\n## Install VPP sudo mkdir -p /var/log/vpp/ sudo dpkg -i *.deb ## Reserve 6GB (3072 x 2MB) of memory for hugepages cat \u0026lt;\u0026lt; EOF | sudo tee /etc/sysctl.d/80-vpp.conf vm.nr_hugepages=3072 vm.max_map_count=7168 vm.hugetlb_shm_group=0 kernel.shmmax=6442450944 EOF ## Set 64MB netlink buffer size cat \u0026lt;\u0026lt; EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf net.core.rmem_default=67108864 net.core.wmem_default=67108864 net.core.rmem_max=67108864 net.core.wmem_max=67108864 EOF ## Apply these sysctl settings sudo sysctl -p -f /etc/sysctl.d/80-vpp.conf sudo sysctl -p -f /etc/sysctl.d/81-vpp-netlink.conf ## Add user to relevant groups sudo adduser $USER bird sudo adduser $USER vpp Next up, I make a backup of the original, and then create a reasonable startup configuration for VPP:\n## Create suitable startup configuration for VPP cd /etc/vpp sudo cp startup.conf startup.conf.orig cat \u0026lt;\u0026lt; EOF | sudo tee startup.conf unix { nodaemon log /var/log/vpp/vpp.log full-coredump cli-listen /run/vpp/cli.sock gid vpp exec /etc/vpp/bootstrap.vpp } api-trace { on } api-segment { gid vpp } socksvr { default } memory { main-heap-size 1536M main-heap-page-size default-hugepage } cpu { main-core 0 workers 3 } buffers { buffers-per-numa 128000 default data-size 2048 page-size default-hugepage } statseg { size 1G page-size default-hugepage per-node-counters off } plugins { plugin lcpng_nl_plugin.so { enable } plugin lcpng_if_plugin.so { enable } } logging { default-log-level info default-syslog-log-level notice } EOF A few notes specific to my hardware configuration:\nthe cpu stanza says to run the main thread on CPU 0, and then run three workers (on CPU 1,2,3; the ones for which I disabled the Linux scheduler by means of isolcpus). So CPU 0 and its hyperthread CPU 4 are available for Linux to schedule on, while there are three full cores dedicated to forwarding. This will ensure very low latency/jitter and predictably high throughput! HugePages are a memory optimization mechanism in Linux. In virtual memory management, the kernel maintains a table in which it has a mapping of the virtual memory address to a physical address. For every page translation, the kernel needs to consult this mapping. With small pages, many more pages are needed to cover the same amount of memory, which means many more mapping entries for the kernel to load.
This decreases performance. I set these to a larger size of 2MB (the default is 4KB), reducing mapping load and thereby considerably improving performance. I need to ensure there\u0026rsquo;s enough Stats Segment memory available - each worker thread keeps counters of each prefix, and with a full BGP table (weighing in at 1M prefixes in Q3'21), the amount of memory needed is substantial. Similarly, I need to ensure there are sufficient Buffers available. Finally, observe the stanza unix { exec /etc/vpp/bootstrap.vpp } and this is a way for me to tell VPP to run a bunch of CLI commands as soon as it starts. This ensures that if VPP were to crash, or the machine were to reboot (more likely :-), that VPP will start up with a working interface and IP address configuration, and any other things I might want VPP to do (like bridge-domains).\nA note on VPP\u0026rsquo;s binding of interfaces: by default, VPP\u0026rsquo;s dpdk driver will acquire any interface from Linux that is not in use (which means: any interface that is admin-down/unconfigured). To make sure that VPP gets all interfaces, I will remove /etc/netplan/* (or in Debian\u0026rsquo;s case, /etc/network/interfaces). This is why Supermicro\u0026rsquo;s KVM and serial-over-lan are so valuable, as they allow me to log in and deconfigure the entire machine, in order to yield all interfaces to VPP. They also allow me to reinstall or switch from DANOS to Ubuntu+VPP on a server that\u0026rsquo;s 700km away.\nAnyway, I can start VPP simply like so:\nsudo rm -f /etc/netplan/* sudo rm -f /etc/network/interfaces ## Set any link to down, or reboot the machine and access over KVM or Serial sudo systemctl restart vpp vppctl show interface See all interfaces? Great. Moving on :)\nConfiguring VPP I set a VPP interface configuration (which it\u0026rsquo;ll read and apply any time it starts or restarts, thereby making the configuration persistent across crashes and reboots). 
Using the exec stanza described above, the contents now become, taking as an example, our first router in Lille, France [details], configured as so:\ncat \u0026lt;\u0026lt; EOF | sudo tee /etc/vpp/bootstrap.vpp set logging class linux-cp rate-limit 1000 level warn syslog-level notice lcp default netns dataplane lcp lcp-sync on lcp lcp-auto-subint on create loopback interface instance 0 lcp create loop0 host-if loop0 set interface state loop0 up set interface ip address loop0 194.1.163.34/32 set interface ip address loop0 2001:678:d78::a/128 lcp create TenGigabitEthernet4/0/0 host-if xe0-0 lcp create TenGigabitEthernet4/0/1 host-if xe0-1 lcp create TenGigabitEthernet6/0/0 host-if xe1-0 lcp create TenGigabitEthernet6/0/1 host-if xe1-1 lcp create TenGigabitEthernet6/0/2 host-if xe1-2 lcp create TenGigabitEthernet6/0/3 host-if xe1-3 lcp create GigabitEthernetb/0/0 host-if e1-0 lcp create GigabitEthernetb/0/1 host-if e1-1 lcp create GigabitEthernetb/0/2 host-if e1-2 lcp create GigabitEthernetb/0/3 host-if e1-3 EOF This base-line configuration will:\nEnsure all host interfaces are created in namespace dataplane which we created earlier Turn on lcp-sync, which copies forward any configuration from VPP into Linux (see VPP Part 2) Turn on lcp-auto-subint, which automatically creates LIPs (Linux interface pairs) for all sub-interfaces (see VPP Part 3) Create a loopback interface, give it IPv4/IPv6 addresses, and expose it to Linux Create one LIP interface for four of the Gigabit and all 6x TenGigabit interfaces Leave 2 interfaces (GigabitEthernet7/0/0 and GigabitEthernet8/0/0) for later Further, sub-interfaces and bridge-groups might be configured as such:\ncomment { Infra: er01.lil01.ip-max.net Te0/0/0/6 } set interface mtu packet 9216 TenGigabitEthernet6/0/2 set interface state TenGigabitEthernet6/0/2 up create sub TenGigabitEthernet6/0/2 100 set interface mtu packet 9000 TenGigabitEthernet6/0/2.100 set interface state TenGigabitEthernet6/0/2.100 up set interface ip address TenGigabitEthernet6/0/2.100 194.1.163.30/31 set interface unnumbered TenGigabitEthernet6/0/2.100 use loop0 comment { Infra: Bridge Domain for mgmt } create bridge-domain 1 create loopback interface instance 1 lcp create loop1 host-if bvi1 set interface ip address loop1 192.168.0.81/29 set interface ip address loop1 2001:678:d78::1:a:1/112 set interface l2 bridge loop1 1 bvi set interface l2 bridge GigabitEthernet7/0/0 1 set interface l2 bridge GigabitEthernet8/0/0 1 set interface state GigabitEthernet7/0/0 up set interface state GigabitEthernet8/0/0 up set interface state loop1 up Particularly the last stanza, creating a bridge-domain, will remind Cisco operators of the same semantics on the ASR9k and IOS/XR operating system. What it does is create a bridge with two physical interfaces, and one so-called bridge virtual interface which I expose to Linux as bvi1, with an IPv4 and IPv6 address. Beautiful!\nConfiguring Bird Now that VPP\u0026rsquo;s interfaces are up, which I can validate with both vppctl show int addr and as well sudo ip netns exec dataplane ip addr, I am ready to configure Bird and put the router in the default free zone (ie. 
run BGP on it):\ncat \u0026lt;\u0026lt; EOF \u0026gt; /etc/bird/bird.conf router id 194.1.163.34; protocol device { scan time 30; } protocol direct { ipv4; ipv6; check link yes; } protocol kernel kernel4 { ipv4 { import none; export where source != RTS_DEVICE; }; learn off; scan time 300; } protocol kernel kernel6 { ipv6 { import none; export where source != RTS_DEVICE; }; learn off; scan time 300; } include \u0026#34;static.conf\u0026#34;; include \u0026#34;core/ospf.conf\u0026#34;; include \u0026#34;core/ibgp.conf\u0026#34;; EOF The most important thing to note in the configuration is that Bird tends to add a route for all of the connected interfaces, while Linux has already added those. Therefore, I avoid the source RTS_DEVICE, which means \u0026ldquo;connected routes\u0026rdquo;, but otherwise offer all routes to the kernel, which in turn propagates these as Netlink messages which are consumed by VPP. A detailed discussion of Bird\u0026rsquo;s configuration semantics is in my VPP Part 5 post.\nConfiguring SSH While Ubuntu (or Debian) will start an SSH daemon upon startup, they will do this in the default namespace. However, our interfaces (like loop0 or xe1-2.100 above) are configured to be present in the dataplane namespace. Therefor, I\u0026rsquo;ll add a second SSH daemon that runs specifically in the alternate namespace, like so:\ncat \u0026lt;\u0026lt; EOF | sudo tee /usr/lib/systemd/system/ssh-dataplane.service [Unit] Description=OpenBSD Secure Shell server (Dataplane Namespace) Documentation=man:sshd(8) man:sshd_config(5) After=network.target auditd.service ConditionPathExists=!/etc/ssh/sshd_not_to_be_run Requires=netns-dataplane.service After=netns-dataplane.service [Service] EnvironmentFile=-/etc/default/ssh ExecStartPre=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -t ExecStart=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -oPidFile=/run/sshd-dataplane.pid -D $SSHD_OPTS ExecReload=/usr/sbin/ip netns exec dataplane /usr/sbin/sshd -t ExecReload=/usr/sbin/ip netns exec dataplane /bin/kill -HUP $MAINPID KillMode=process Restart=on-failure RestartPreventExitStatus=255 Type=notify RuntimeDirectory=sshd RuntimeDirectoryMode=0755 [Install] WantedBy=multi-user.target Alias=sshd-dataplane.service EOF sudo systemctl enable ssh-dataplane sudo systemctl start ssh-dataplane And with that, our loopback address, and indeed any other interface created in the dataplane namespace, will accept SSH connections. Yaay!\nConfiguring SNMPd At IPng Networks, we use LibreNMS to monitor our machines and routers in production. Similar to SSH, I want the snmpd (which we disabled all the way at the top of this article), to be exposed in the dataplane namespace. However, that namespace will have interfaces like xe0-0 or loop0 or bvi1 configured, and it\u0026rsquo;s important to note that Linux will only see those packets that were punted by VPP, that is to say, those packets which were destined to any IP address configured on the control plane. Any traffic going through VPP will never be seen by Linux! So, I\u0026rsquo;ll have to be clever and count this traffic by polling VPP instead. This was the topic of my previous VPP Part 6 about the SNMP Agent. 
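To make that punting point concrete: the interface counters inside the dataplane namespace only ever count punted (control plane) traffic, while VPP counts everything it forwards. Comparing the two, for example on one of the TenGig interfaces from the Lille configuration above, makes the difference obvious:

sudo ip netns exec dataplane ip -s link show xe1-2
vppctl show interface TenGigabitEthernet6/0/2

The Linux counters stay comparatively tiny on a busy router; the VPP counters are the ones worth graphing, and that is exactly what the SNMP Agent from Part 6 exposes.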
All of that code was released to Github, notably there\u0026rsquo;s a hint there for an snmpd-dataplane.service and a vpp-snmp-agent.service, including the compiled binary that reads from VPP and feeds this to SNMP.\nThen, the SNMP daemon configuration file, assuming net-snmp (the default for Ubuntu and Debian) which was installed in the very first step above, I\u0026rsquo;ll yield the following simple configuration file:\ncat \u0026lt;\u0026lt; EOF | tee /etc/snmp/snmpd.conf com2sec readonly default public com2sec6 readonly default public group MyROGroup v2c readonly view all included .1 80 # Don\u0026#39;t serve ipRouteTable and ipCidrRouteEntry (they\u0026#39;re huge) view all excluded .1.3.6.1.2.1.4.21 view all excluded .1.3.6.1.2.1.4.24 access MyROGroup \u0026#34;\u0026#34; any noauth exact all none none sysLocation Rue des Saules, 59262 Sainghin en Melantois, France sysContact noc@ipng.ch master agentx agentXSocket tcp:localhost:705,unix:/var/agentx/master agentaddress udp:161,udp6:161 # OS Distribution Detection extend distro /usr/bin/distro # Hardware Detection extend manufacturer \u0026#39;/bin/cat /sys/devices/virtual/dmi/id/sys_vendor\u0026#39; extend hardware \u0026#39;/bin/cat /sys/devices/virtual/dmi/id/product_name\u0026#39; extend serial \u0026#39;/bin/cat /var/run/snmpd.serial\u0026#39; EOF This config assumes that /var/run/snmpd.serial exists as a regular file rather than a /sys entry. That\u0026rsquo;s because while the sys_vendor and product_name fields are easily retrievable as user from the /sys filesystem, for some reason board_serial and product_serial are only readable by root, and our SNMPd runs as user Debian-snmp. So, I\u0026rsquo;ll just generate this at boot-time in /etc/rc.local, like so:\ncat \u0026lt;\u0026lt; EOF | sudo tee /etc/rc.local #!/bin/sh # Assemble serial number for snmpd BS=\\$(cat /sys/devices/virtual/dmi/id/board_serial) PS=\\$(cat /sys/devices/virtual/dmi/id/product_serial) echo \\$BS.\\$PS \u0026gt; /var/run/snmpd.serial [ -x /etc/rc.firewall ] \u0026amp;\u0026amp; /etc/rc.firewall EOF sudo chmod 755 /etc/rc.local sudo /etc/rc.local sudo systemctl restart snmpd-dataplane Results With all of this, I\u0026rsquo;m ready to pick up the machine in LibreNMS, which looks a bit like this:\nOr a specific traffic pattern looking at interfaces: Clearly, looking at the 17d of ~18Gbit of traffic going through this particular router, with zero crashes and zero SNMPd / Agent restarts, this thing is a winner:\npim@frggh0:/etc/bird$ date Tue 21 Sep 2021 01:26:49 AM UTC pim@frggh0:/etc/bird$ ps auxw | grep vpp root 1294 307 0.1 154273928 44972 ? Rsl Sep04 73578:50 /usr/bin/vpp -c /etc/vpp/startup.conf Debian-+ 331639 0.2 0.0 21216 11812 ? Ss Sep04 22:23 /usr/sbin/snmpd -LOw -u Debian-snmp -g vpp -I -smux mteTrigger mteTriggerConf -f -p /run/snmpd-dataplane.pid Debian-+ 507638 0.0 0.0 2900 592 ? Ss Sep04 0:00 /usr/sbin/vpp-snmp-agent -a localhost:705 -p 30 Debian-+ 507659 1.6 0.1 1317772 43508 ? Sl Sep04 2:16 /usr/sbin/vpp-snmp-agent -a localhost:705 -p 30 pim 510503 0.0 0.0 6432 736 pts/0 S+ 01:25 0:00 grep --color=auto vpp Thanks for reading this far :-)\n","date":"2021-09-21","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. 
One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/09/21/vpp-linux-cp-part7/","section":"articles","title":"VPP Linux CP - Part7"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nSNMP in VPP Now that the Interface Mirror and Netlink Listener plugins are in good shape, this post shows a few finishing touches. First off, although the native habitat of VPP is Prometheus, many folks still run classic network monitoring systems like the popular Obvervium or its sibling LibreNMS. Although the metrics-based approach is modern, we really ought to have an old-skool SNMP interface so that we can swear it by the Old Gods and the New.\nVPP\u0026rsquo;s Stats Segment VPP maintains lots of interesting statistics at runtime - for example for nodes and their activity, but also, importantly, for each interface known to the system. So I take a look at the stats segment, configured in startup.conf, and I notice that VPP will create socket in /run/vpp/stats.sock which can be connected to. 
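For reference, both the segment size and the location of that socket are governed by the statseg stanza in startup.conf; a minimal sketch might look like the fragment below (the socket-name shown is simply the default made explicit, and the sizes should match whatever your FIB needs):

statseg {
  socket-name /run/vpp/stats.sock
  size 1G
  per-node-counters off
}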
There\u0026rsquo;s also a few introspection tools, notably vpp_get_stats, which can either list, dump once, or continuously dump the data:\npim@hippo:~$ vpp_get_stats socket-name /run/vpp/stats.sock ls | wc -l 3800 pim@hippo:~$ vpp_get_stats socket-name /run/vpp/stats.sock dump /if/names [0]: local0 /if/names [1]: TenGigabitEthernet3/0/0 /if/names [2]: TenGigabitEthernet3/0/1 /if/names [3]: TenGigabitEthernet3/0/2 /if/names [4]: TenGigabitEthernet3/0/3 /if/names [5]: GigabitEthernet5/0/0 /if/names [6]: GigabitEthernet5/0/1 /if/names [7]: GigabitEthernet5/0/2 /if/names [8]: GigabitEthernet5/0/3 /if/names [9]: TwentyFiveGigabitEthernet11/0/0 /if/names [10]: TwentyFiveGigabitEthernet11/0/1 /if/names [11]: tap2 /if/names [12]: TenGigabitEthernet3/0/1.1 /if/names [13]: tap2.1 /if/names [14]: TenGigabitEthernet3/0/1.2 /if/names [15]: tap2.2 /if/names [16]: TenGigabitEthernet3/0/1.3 /if/names [17]: tap2.3 /if/names [18]: tap3 /if/names [19]: tap4 /if/names Alright! Clearly, the /if/ prefix is the one I\u0026rsquo;m looking for. I find a Python library that allows for this data to be MMAPd and directly read as a dictionary, including some neat aggregation functions (see src/vpp-api/python/vpp_papi/vpp_stats.py):\nCounters can be accessed in either dimension. stat[\u0026#39;/if/rx\u0026#39;] - returns 2D lists stat[\u0026#39;/if/rx\u0026#39;][0] - returns counters for all interfaces for thread 0 stat[\u0026#39;/if/rx\u0026#39;][0][1] - returns counter for interface 1 on thread 0 stat[\u0026#39;/if/rx\u0026#39;][0][1][\u0026#39;packets\u0026#39;] - returns the packet counter for interface 1 on thread 0 stat[\u0026#39;/if/rx\u0026#39;][:, 1] - returns the counters for interface 1 on all threads stat[\u0026#39;/if/rx\u0026#39;][:, 1].packets() - returns the packet counters for interface 1 on all threads stat[\u0026#39;/if/rx\u0026#39;][:, 1].sum_packets() - returns the sum of packet counters for interface 1 on all threads stat[\u0026#39;/if/rx-miss\u0026#39;][:, 1].sum() - returns the sum of packet counters for interface 1 on all threads for simple counters Alright, so let\u0026rsquo;s grab that file and refactor it into a small library for me to use, I do this in [this commit].\nVPP\u0026rsquo;s API In a previous project, I already got a little bit of exposure to the Python API (vpp_papi), and it\u0026rsquo;s pretty straight forward to use. Each API is published in a JSON file in /usr/share/vpp/api/{core,plugins}/ and those can be read by the Python library and exposed to callers. This gives me full programmatic read/write access to the VPP runtime configuration, which is super cool.\nThere are dozens of APIs to call (the Linux CP plugin even added one!), and in the case of enumerating interfaces, we can see the definition in core/interface.api.json where there is an element called services.sw_interface_dump which shows its reply is sw_interface_details, and in that message we can see all the fields that will be set in the request and all that will be present in the response. Nice! 
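You do not even need Python to peek at these definitions; since they are plain JSON, something like jq will do (assuming the stock install path mentioned above):

jq '.services.sw_interface_dump' /usr/share/vpp/api/core/interface.api.json

which should print the reply type (sw_interface_details), while the full field list for that reply lives in the .messages array of the same file.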
Here\u0026rsquo;s a quick demonstration:\nfrom vpp_papi import VPPApiClient import os import fnmatch import sys vpp_json_dir = \u0026#39;/usr/share/vpp/api/\u0026#39; # construct a list of all the json api files jsonfiles = [] for root, dirnames, filenames in os.walk(vpp_json_dir): for filename in fnmatch.filter(filenames, \u0026#39;*.api.json\u0026#39;): jsonfiles.append(os.path.join(root, filename)) vpp = VPPApiClient(apifiles=jsonfiles, server_address=\u0026#39;/run/vpp/api.sock\u0026#39;) vpp.connect(\u0026#34;test-client\u0026#34;) v = vpp.api.show_version() print(\u0026#39;VPP version is %s\u0026#39; % v.version) iface_list = vpp.api.sw_interface_dump() for iface in iface_list: print(\u0026#34;idx=%d name=%s mac=%s mtu=%d flags=%d\u0026#34; % (iface.sw_if_index, iface.interface_name, iface.l2_address, iface.mtu[0], iface.flags)) The output:\n$ python3 vppapi-test.py VPP version is 21.10-rc0~325-g4976c3b72 idx=0 name=local0 mac=00:00:00:00:00:00 mtu=0 flags=0 idx=1 name=TenGigabitEthernet3/0/0 mac=68:05:ca:32:46:14 mtu=9000 flags=0 idx=2 name=TenGigabitEthernet3/0/1 mac=68:05:ca:32:46:15 mtu=1500 flags=3 idx=3 name=TenGigabitEthernet3/0/2 mac=68:05:ca:32:46:16 mtu=9000 flags=1 idx=4 name=TenGigabitEthernet3/0/3 mac=68:05:ca:32:46:17 mtu=9000 flags=1 idx=5 name=GigabitEthernet5/0/0 mac=a0:36:9f:c8:a0:54 mtu=9000 flags=0 idx=6 name=GigabitEthernet5/0/1 mac=a0:36:9f:c8:a0:55 mtu=9000 flags=0 idx=7 name=GigabitEthernet5/0/2 mac=a0:36:9f:c8:a0:56 mtu=9000 flags=0 idx=8 name=GigabitEthernet5/0/3 mac=a0:36:9f:c8:a0:57 mtu=9000 flags=0 idx=9 name=TwentyFiveGigabitEthernet11/0/0 mac=6c:b3:11:20:e0:c4 mtu=9000 flags=0 idx=10 name=TwentyFiveGigabitEthernet11/0/1 mac=6c:b3:11:20:e0:c6 mtu=9000 flags=0 idx=11 name=tap2 mac=02:fe:07:ae:31:c3 mtu=1500 flags=3 idx=12 name=TenGigabitEthernet3/0/1.1 mac=00:00:00:00:00:00 mtu=1500 flags=3 idx=13 name=tap2.1 mac=00:00:00:00:00:00 mtu=1500 flags=3 idx=14 name=TenGigabitEthernet3/0/1.2 mac=00:00:00:00:00:00 mtu=1500 flags=3 idx=15 name=tap2.2 mac=00:00:00:00:00:00 mtu=1500 flags=3 idx=16 name=TenGigabitEthernet3/0/1.3 mac=00:00:00:00:00:00 mtu=1500 flags=3 idx=17 name=tap2.3 mac=00:00:00:00:00:00 mtu=1500 flags=3 idx=18 name=tap3 mac=02:fe:95:db:3f:c4 mtu=9000 flags=3 idx=19 name=tap4 mac=02:fe:17:06:fc:af mtu=9000 flags=3 So I added a little abstration with some error handling and one main function to return interfaces as a Python dictionary of those sw_interface_details tuples in [this commit].\nAgentX Now that we are able to enumerate the interfaces and their metadata (like admin/oper status, link speed, name, index, MAC address, and what have you), and as well the highly sought after interface statistics as 64bit counters (with a wealth of extra information like broadcast/multicast/unicast packets, octets received and transmitted, errors and drops). I am ready to tie things together.\nIt took a bit of sleuthing, but I eventually found a library on sourceforge (!) that has a rudimentary implementation of RFC 2741 which is the SNMP Agent Extensibility (AgentX) Protocol. In a nutshell, this allows an external program to connect to the main SNMP daemon, register an interest in certain OIDs, and get called whenever the SNMPd is being queried for them.\nThe flow is pretty simple (see section 6.2 of the RFC), the Agent (client):\nopens a TCP or Unix domain socket to the SNMPd sends an Open PDU, which the server will respond or reject. (optionally) can send a Ping PDU, the server will respond. 
registers an interest with Register PDU It then waits and gets called by the SNMPd with Get PDUs (to retrieve one single value), GetNext PDU (to enable snmpwalk), GetBulk PDU (to retrieve a whole subsection of the MIB), all of which are answered by a Response PDU.\nIf the Agent is to support writing, it will also have to implement TestSet, CommitSet, CommitUndoSet and CommitCleanupSet PDUs. For this agent, we don\u0026rsquo;t need to implement those, so I\u0026rsquo;ll just ignore those requests and implement the read-only stuff. Sounds easy :)\nThe first order of business is to create the values for two main MIBs of interest:\n.iso.org.dod.internet.mgmt.mib-2.interfaces.ifTable. - This table is an older variant and it contains a bunch of relevant fields, one per interface, notably ifIndex, ifName, ifType, ifMtu, ifSpeed, ifPhysAddress, ifOperStatus, ifAdminStatus and a bunch of 32bit counters for octets/packets in and out of the interfaces. .iso.org.dod.internet.mgmt.mib-2.ifMIB.ifMIBObjects.ifXTable. - This table is a makeover of the other one (the X here stands for eXtra), and adds a few 64 bit counters for the interface stats, and as well an ifHighSpeed which is in megabits instead of kilobits in the previous MIB. Populating these MIBs can be done periodically by retrieving the interfaces from VPP and then simply walking the dictionary with Stats Segment data. I then register these two main MIB entrypoints with SNMPd as I connect to it, and spit out the correct values once asked with GetPDU or GetNextPDU requests, by issuing a corresponding ResponsePDU to the SNMP server \u0026ndash; it takes care of all the rest!\nThe resulting code is in [this commit] but you can also check out the whole thing on [Github].\nBuilding Shipping a bunch of Python files around is not ideal, so I decide to build this stuff together in a binary that I can easily distribute to my machines: I just simply install pyinstaller with PIP and run it:\nsudo pip install pyinstaller pyinstaller vpp-snmp-agent.py --onefile ## Run it on console dist/vpp-snmp-agent -h usage: vpp-snmp-agent [-h] [-a ADDRESS] [-p PERIOD] [-d] optional arguments: -h, --help show this help message and exit -a ADDRESS Location of the SNMPd agent (unix-path or host:port), default localhost:705 -p PERIOD Period to poll VPP, default 30 (seconds) -d Enable debug, default False ## Install sudo cp dist/vpp-snmp-agent /usr/sbin/ Running After installing Net-SNMP, the default in Ubuntu, I do have to ensure that it runs in the correct namespace. So what I do is disable the systemd unit that ships with the Ubuntu package, and instead create these:\npim@hippo:~/src/vpp-snmp-agentx$ cat \u0026lt; EOF | sudo tee /usr/lib/systemd/system/netns-dataplane.service [Unit] Description=Dataplane network namespace After=systemd-sysctl.service network-pre.target Before=network.target network-online.target [Service] Type=oneshot RemainAfterExit=yes # PrivateNetwork will create network namespace which can be # used in JoinsNamespaceOf=. PrivateNetwork=yes # To set `ip netns` name for this namespace, we create a second namespace # with required name, unmount it, and then bind our PrivateNetwork # namespace to it. After this we can use our PrivateNetwork as a named # namespace in `ip netns` commands. 
ExecStartPre=-/usr/bin/echo \u0026#34;Creating dataplane network namespace\u0026#34; ExecStart=-/usr/sbin/ip netns delete dataplane ExecStart=-/usr/bin/mkdir -p /etc/netns/dataplane ExecStart=-/usr/bin/touch /etc/netns/dataplane/resolv.conf ExecStart=-/usr/sbin/ip netns add dataplane ExecStart=-/usr/bin/umount /var/run/netns/dataplane ExecStart=-/usr/bin/mount --bind /proc/self/ns/net /var/run/netns/dataplane # Apply default sysctl for dataplane namespace ExecStart=-/usr/sbin/ip netns exec dataplane /usr/lib/systemd/systemd-sysctl ExecStop=-/usr/sbin/ip netns delete dataplane [Install] WantedBy=multi-user.target WantedBy=network-online.target EOF pim@hippo:~/src/vpp-snmp-agentx$ cat \u0026lt; EOF | sudo tee /usr/lib/systemd/system/snmpd-dataplane.service [Unit] Description=Simple Network Management Protocol (SNMP) Daemon. After=network.target ConditionPathExists=/etc/snmp/snmpd.conf [Service] Type=simple ExecStartPre=/bin/mkdir -p /var/run/agentx-dataplane/ NetworkNamespacePath=/var/run/netns/dataplane ExecStart=/usr/sbin/snmpd -LOw -u Debian-snmp -g vpp -I -smux,mteTrigger,mteTriggerConf -f -p /run/snmpd-dataplane.pid ExecReload=/bin/kill -HUP \\$MAINPID [Install] WantedBy=multi-user.target EOF pim@hippo:~/src/vpp-snmp-agentx$ cat \u0026lt; EOF | sudo tee /usr/lib/systemd/system/vpp-snmp-agent.service [Unit] Description=SNMP AgentX Daemon for VPP dataplane statistics After=network.target ConditionPathExists=/etc/snmp/snmpd.conf [Service] Type=simple NetworkNamespacePath=/var/run/netns/dataplane ExecStart=/usr/sbin/vpp-snmp-agent Group=vpp ExecReload=/bin/kill -HUP \\$MAINPID Restart=on-failure RestartSec=5s [Install] WantedBy=multi-user.target EOF Note the use of NetworkNamespacePath here \u0026ndash; this ensures that the snmpd and its agent both run in the dataplane namespace which was created by netns-dataplane.service.\nResults I now install the binary and, using the snmpd.conf configuration file (see Appendix):\npim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl stop snmpd pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl disable snmpd pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl daemon-reload pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable netns-dataplane pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start netns-dataplane pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable snmpd-dataplane pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start snmpd-dataplane pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl enable vpp-snmp-agent pim@hippo:~/src/vpp-snmp-agentx$ sudo systemctl start vpp-snmp-agent pim@hippo:~/src/vpp-snmp-agentx$ sudo journalctl -u vpp-snmp-agent [INFO ] agentx.agent - run : Calling setup [INFO ] agentx.agent - setup : Connecting to VPP Stats... 
[INFO ] agentx.vppapi - connect : Connecting to VPP [INFO ] agentx.vppapi - connect : VPP version is 21.10-rc0~325-g4976c3b72 [INFO ] agentx.agent - run : Initial update [INFO ] agentx.network - update : Setting initial serving dataset (740 OIDs) [INFO ] agentx.agent - run : Opening AgentX connection [INFO ] agentx.network - connect : Connecting to localhost:705 [INFO ] agentx.network - start : Registering: 1.3.6.1.2.1.2.2.1 [INFO ] agentx.network - start : Registering: 1.3.6.1.2.1.31.1.1.1 [INFO ] agentx.network - update : Replacing serving dataset (740 OIDs) [INFO ] agentx.network - update : Replacing serving dataset (740 OIDs) [INFO ] agentx.network - update : Replacing serving dataset (740 OIDs) [INFO ] agentx.network - update : Replacing serving dataset (740 OIDs) Appendix SNMPd Config $ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/snmp/snmpd.conf com2sec readonly default \u0026lt;\u0026lt;some-string\u0026gt;\u0026gt; group MyROGroup v2c readonly view all included .1 80 access MyROGroup \u0026#34;\u0026#34; any noauth exact all none none sysLocation Ruemlang, Zurich, Switzerland sysContact noc@ipng.ch master agentx agentXSocket tcp:localhost:705,unix:/var/agentx/master,unix:/run/vpp/agentx.sock agentaddress udp:161,udp6:161 #OS Distribution Detection extend distro /usr/bin/distro #Hardware Detection extend manufacturer \u0026#39;/bin/cat /sys/devices/virtual/dmi/id/sys_vendor\u0026#39; extend hardware \u0026#39;/bin/cat /sys/devices/virtual/dmi/id/product_name\u0026#39; extend serial \u0026#39;/bin/cat /var/run/snmpd.serial\u0026#39; EOF Note the use of a few helpers here - /usr/bin/distro comes from LibreNMS ref and tries to figure out what distribution is used. The very last line of that file echo\u0026rsquo;s the found distribtion, to which I prepend the string, like echo \u0026quot;VPP ${OSSTR}\u0026quot;. The other file of interest /var/run/snmpd.serial is computed at boot-time, by running the following in /etc/rc.local:\n# Assemble serial number for snmpd BS=$(cat /sys/devices/virtual/dmi/id/board_serial) PS=$(cat /sys/devices/virtual/dmi/id/product_serial) echo $BS.$PS \u0026gt; /var/run/snmpd.serial I have to do this, because SNMPd runs as non-privileged user, yet those DMI elements are root-readable only (for reasons that are beyond me). Seeing as they will not change at runtime anyway, I just create that file and cat it into the serial field. It then shows up nicely in LibreNMS alongside the others.\nOh, and one last thing. The VPP Hound logo!\nIn LibreNMS, the icons in the devices view use a function that leveages this distro field, by looking at the first word (in our case \u0026ldquo;VPP\u0026rdquo;) with an extension of either .svg or .png in an icons directory, usually html/images/os/. I dropped the hound of the fd.io homepage in there, and will add the icon upstream for future use, in this [librenms PR] and its companion change to [librenms-agent PR.\n","date":"2021-09-10","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. 
This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/09/10/vpp-linux-cp-part6/","section":"articles","title":"VPP Linux CP - Part6"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nIn the previous post, I added support for VPP to consume Netlink messages that describe interfaces, IP addresses and ARP/ND neighbor changes. This post completes the stablestakes Netlink handler by adding IPv4 and IPv6 route messages, and ends up with a router in the DFZ consuming 133K IPv6 prefixes and 870K IPv4 prefixes.\nMy test setup The goal of this post is to show what code needed to be written to extend the Netlink Listener plugin I wrote in the fourth post, so that it can consume route additions/deletions, a thing that is common in dynamic routing protocols such as OSPF and BGP.\nThe setup from my third post is still there, but it\u0026rsquo;s no longer a focal point for me. I use it (the regular interface + subints and the BondEthernet + subints) just to ensure my new code doesn\u0026rsquo;t have a regression.\nInstead, I\u0026rsquo;m creating two VLAN interfaces now:\nThe first is in my home network\u0026rsquo;s servers VLAN. There are three OSPF speakers there: chbtl0.ipng.ch and chbtl1.ipng.ch are my main routers, they run DANOS and are in the Default Free Zone (or DFZ for short). rr0.chbtl0.ipng.ch is one of AS50869\u0026rsquo;s three route-reflectors. Every one of the 13 routers in AS50869 exchanges BGP information with these, and it cuts down on the total amount of iBGP sessions I have to maintain \u0026ndash; see here for details on Route Reflectors. 
The second is an L2 connection to a local BGP exchange, with only three members (IPng Networks, AS50869, Openfactory AS58299, and Stucchinet AS58280). In this VLAN, Openfactory was so kind as to configure a full transit session for me, and I\u0026rsquo;ll use it in my test bench. The test setup offers me the ability to consume OSPF, OSPFv3 and BGP.\nStartingpoint Based on the state of the plugin after the fourth post, operators can create VLANs (including .1q, .1ad, QinQ and QinAD subinterfaces) directly in Linux. They can change link attributes (like set admin state \u0026lsquo;up\u0026rsquo; or \u0026lsquo;down\u0026rsquo;, or change the MTU on a link), they can add/remove IP addresses, and the system will add/remove IPv4 and IPv6 neighbors. But notably, the following Netlink messages are not yet consumed, as shown by the following example:\npim@hippo:~/src/lcpng$ sudo ip link add link e1 name servers type vlan id 101 pim@hippo:~/src/lcpng$ sudo ip link up mtu 1500 servers pim@hippo:~/src/lcpng$ sudo ip addr add 194.1.163.86/27 dev servers pim@hippo:~/src/lcpng$ sudo ip ro add default via 194.1.163.65 which does the first three commands just fine, but the fourth:\nlinux-cp/nl [debug ]: dispatch: ignored route/route: add family inet type 1 proto 3 table 254 dst 0.0.0.0/0 nexthops { gateway 194.1.163.65 idx 197 } In this post, I\u0026rsquo;ll implement that last missing piece in two functions called lcp_nl_route_add() and lcp_nl_route_del(). Here we go!\nNetlink Routes Reusing the approach from the work-in-progress [Gerrit], I introduce two FIB sources: one for manual routes (ie. the ones that an operator might set with ip route add), and another one for dynamic routes (ie. what a routing protocol like Bird or FRR might set), this is in lcp_nl_proto_fib_source(). Next, I need a bunch of helper functions that can translate the Netlink message information into VPP primitives:\nlcp_nl_mk_addr46() converts a Netlink nl_addr to a VPP ip46_address_t. lcp_nl_mk_route_prefix() converts a Netlink rtnl_route to a VPP fib_prefix_t. lcp_nl_mk_route_mprefix() converts a Netlink rtnl_route to a VPP mfib_prefix_t (for multicast routes). lcp_nl_mk_route_entry_flags() generates fib_entry_flag_t from the Netlink route type, table and proto metadata. lcp_nl_proto_fib_source() selects the most appropciate FIB source by looking at the rt_proto field from the Netlink message (see /etc/iproute2/rt_protos for a list of these). Anything RTPROT_STATIC or better is fib_src, while anything above that becomes fib_src_dynamic. lcp_nl_route_path_parse() converts a Netlink rtnl_nexthop to a VPP fib_route_path_t and adds that to a growing list of paths. Similar to Netlink\u0026rsquo;s nethops being a list, so are the individual paths in VPP, so that lines up perfectly. lcp_nl_route_path_add_special() adds a blackhole/unreach/prohibit route to the list of paths, in the special-case there is not yet a path for the destination. With these helpers, I will have enough to manipulate VPP\u0026rsquo;s forwarding information base or FIB for short. But in VPP, the FIB consists of any number of tables (think of them as VRFs or Virtual Routing/Forwarding domains). So first, I need to add these:\nlcp_nl_table_find() selects the matching {table-id,protocol} (v4/v6) tuple from an internally kept hash of tables. lcp_nl_table_add_or_lock() if a table with key {table-id,protocol} (v4/v6) hasn\u0026rsquo;t been used yet, create one in VPP, and store it for future reference. 
Otherwise increment a table reference counter so I know how many FIB entries VPP will have in this table. lcp_nl_table_unlock() given a table, decrease the refcount on it, and if no more prefixes are in the table, remove it from VPP. All of this code was heavily inspired by the pending [Gerrit] but a few finishing touches were added, and wrapped up in this [commit].\nDeletion Our main function lcp_nl_route_del() will remove a route from the given table-id/protocol. I do this by applying rtnl_route_foreach_nexthop() callbacks to the list of Netlink message nexthops, converting each of them into VPP paths in a lcp_nl_route_path_parse_t structure. If the route is for unreachable/blackhole/prohibit in Linux, add that path too.\nThen, remove the VPP paths from the FIB and reduce refcnt or remove the table if it\u0026rsquo;s empty. This is reasonably straightforward.\nAddition Adding routes to the FIB is done with lcp_nl_route_add(). It immediately becomes obvious that not all routes are relevant for VPP. A prime example are those in table 255: they are \u0026rsquo;local\u0026rsquo; routes, which have already been set up by IPv4 and IPv6 address addition functions in VPP. There are some other route types that are invalid, so I\u0026rsquo;ll just skip those.\nLink-local IPv6 and IPv6 multicast are also skipped, because they\u0026rsquo;re also added when interfaces get their IP addresses configured. But for the other routes, similar to deletion, I\u0026rsquo;ll extract the paths from the Netlink message\u0026rsquo;s nexthops list, by constructing an lcp_nl_route_path_parse_t by walking those Netlink nexthops, and optionally add a special route (in case the route was for unreachable/blackhole/prohibit in Linux \u0026ndash; those won\u0026rsquo;t have a nexthop).\nThen, insert the VPP paths found in the Netlink message into the FIB or the multicast FIB, respectively.\nControl Plane: Bird So with this newly added code, the example above of setting a default route shoots to life. But I can do better! At IPng Networks, my routing suite of choice is Bird2, and I have some code to generate configurations for it and push those configs safely to routers. So, let\u0026rsquo;s take a closer look at a configuration on the test machine running VPP + Linux CP with this new Netlink route handler.\nrouter id 194.1.163.86; protocol device { scan time 10; } protocol direct { ipv4; ipv6; check link yes; } These first two protocols are internal implementation details. The first, called device, periodically scans the network interface list in Linux, to pick up new interfaces. You can compare it to issuing ip link and acting on additions/removals as they occur. The second, called direct, generates directly connected routes for interfaces that have IPv4 or IPv6 addresses configured. It turns out that if I add 194.1.163.86/27 as an IPv4 address on an interface, it\u0026rsquo;ll generate several Netlink messages: one for the RTM_NEWADDR which I discussed in my fourth post, and also a RTM_NEWROUTE for the connected 194.1.163.64/27 in this case. It helps the kernel understand that if we want to send a packet to a host in that prefix, we should not send it to the default gateway, but rather to a nexthop of the device. Those are interchangeably called direct or connected routes.
Ironically, these are called RTS_DEVICE routes in Bird2 ref even though they are generated by the direct routing protocol.\nThat brings me to the third protocol, one for each address type:\nprotocol kernel kernel4 { ipv4 { import all; export where source != RTS_DEVICE; }; } protocol kernel kernel6 { ipv6 { import all; export where source != RTS_DEVICE; }; } We\u0026rsquo;re asking Bird to import any route it learns from the kernel, and we\u0026rsquo;re asking it to export any route that\u0026rsquo;s not an RTS_DEVICE route. The reason for this is that when we create IPv4/IPv6 addresses, the ip command already adds the connected route, and this avoids Bird from inserting a second, identical route for those connected routes. And with that, I have a very simple view, given for example these two interfaces:\npim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip route 45.129.224.232/29 dev ixp proto kernel scope link src 45.129.224.235 194.1.163.64/27 dev servers proto kernel scope link src 194.1.163.86 pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 route 2a0e:5040:0:2::/64 dev ixp proto kernel metric 256 pref medium 2001:678:d78:3::/64 dev servers proto kernel metric 256 pref medium pim@hippo:/etc/bird$ birdc show route BIRD 2.0.7 ready. Table master4: 45.129.224.232/29 unicast [direct1 20:48:55.547] * (240) dev ixp 194.1.163.64/27 unicast [direct1 20:48:55.547] * (240) dev servers Table master6: 2a0e:5040:1001::/64 unicast [direct1 20:48:55.547] * (240) dev stucchi 2001:678:d78:3::/64 unicast [direct1 20:48:55.547] * (240) dev servers Control Plane: OSPF Considering the servers network above has a few OSPF speakers in it, I will introduce this router there as well. The configuration is very straight forward in Bird, let\u0026rsquo;s just add the OSPF and OSPFv3 protocols as follows:\nprotocol ospf v2 ospf4 { ipv4 { export where source = RTS_DEVICE; import all; }; area 0 { interface \u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;servers\u0026#34; { type broadcast; cost 5; }; }; } protocol ospf v3 ospf6 { ipv6 { export where source = RTS_DEVICE; import all; }; area 0 { interface \u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;servers\u0026#34; { type broadcast; cost 5; }; }; } Here, I tell OSPF to export all connected routes, and accept any route given to it. The only difference between IPv4 and IPv6 is that the former uses OSPF version 2 of the protocol, and IPv6 uses version 3 of the protocol. And, as with the kernel routing protocol above, each instance has to has its own unique name, so I make the obvious choice.\nWithin a few seconds, the OSPF Hello packets can be seen going out of the servers interface, and adjacencies form shortly thereafter:\npim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l 83 pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l 74 pim@hippo:~/src/lcpng$ birdc show ospf nei ospf4 BIRD 2.0.7 ready. ospf4: Router ID Pri State DTime Interface Router IP 194.1.163.3 1 Full/Other 39.588 servers 194.1.163.66 194.1.163.87 1 Full/DR 39.588 servers 194.1.163.87 194.1.163.4 1 Full/Other 39.588 servers 194.1.163.67 pim@hippo:~/src/lcpng$ birdc show ospf nei ospf6 BIRD 2.0.7 ready. 
ospf6: Router ID Pri State DTime Interface Router IP 194.1.163.87 1 Full/DR 32.221 servers fe80::5054:ff:feaa:2b24 194.1.163.3 1 Full/BDR 39.504 servers fe80::9e69:b4ff:fe61:7679 194.1.163.4 1 2-Way/Other 38.357 servers fe80::9e69:b4ff:fe61:a1dd And all of these were inserted into the VPP forwarding information base, taking for example the IPng router in Amsterdam, loopback address 194.1.163.32 and 2001:678:d78::8:\nDBGvpp# show ip fib 194.1.163.32 ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, recursive-resolution:1, default-route:1, lcp-rt:1, nat-hi:2, ] 194.1.163.32/32 fib:0 index:70 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[49] locks:142 flags:shared,popular, uPRF-list:49 len:1 itfs:[16, ] path:[69] pl-index:49 ip4 weight=1 pref=32 attached-nexthop: oper-flags:resolved, 194.1.163.67 TenGigabitEthernet3/0/1.3 [@0]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800 forwarding: unicast-ip4-chain [@0]: dpo-load-balance: [proto:ip4 index:72 buckets:1 uRPF:49 to:[0:0]] [0] [@5]: ipv4 via 194.1.163.67 TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca324615810000650800 DBGvpp# show ip6 fib 2001:678:d78::8 ipv6-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, ] 2001:678:d78::8/128 fib:0 index:130058 locks:2 lcp-rt-dynamic refs:1 src-flags:added,contributing,active, path-list:[116] locks:220 flags:shared,popular, uPRF-list:106 len:1 itfs:[16, ] path:[141] pl-index:116 ip6 weight=1 pref=32 attached-nexthop: oper-flags:resolved, fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3 [@0]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd forwarding: unicast-ip6-chain [@0]: dpo-load-balance: [proto:ip6 index:130060 buckets:1 uRPF:106 to:[0:0]] [0] [@5]: ipv6 via fe80::9e69:b4ff:fe61:a1dd TenGigabitEthernet3/0/1.3: mtu:1500 next:5 flags:[] 9c69b461a1dd6805ca3246158100006586dd In the snippet above we can see elements of the Linux CP Netlink Listener plugin doing its work. It found the right nexthop, the right interface, enabled the FIB entry, and marked it with the correct FIB source lcp-rt-dynamic. And, with OSPF and OSPFv3 now enabled, VPP has gained visibility to all of my internal network:\npim@hippo:~/src/lcpng$ traceroute nlams0.ipng.ch traceroute to nlams0.ipng.ch (2001:678:d78::8) from 2001:678:d78:3::86, 30 hops max, 24 byte packets 1 chbtl1.ipng.ch (2001:678:d78:3::1) 0.3182 ms 0.2840 ms 0.1841 ms 2 chgtg0.ipng.ch (2001:678:d78::2:4:2) 0.5473 ms 0.6996 ms 0.6836 ms 3 chrma0.ipng.ch (2001:678:d78::2:0:1) 0.7700 ms 0.7693 ms 0.7692 ms 4 defra0.ipng.ch (2001:678:d78::7) 6.6586 ms 6.6443 ms 6.9292 ms 5 nlams0.ipng.ch (2001:678:d78::8) 12.8321 ms 12.9398 ms 12.6225 ms Control Plane: BGP But the holy grail, and what got me started on this whole adventure, is to be able to participate in the Default Free Zone using BGP, So let\u0026rsquo;s put these plugins to the test and load up a so-called full table which means: all the routing information needed to reach any part of the internet. As of August'21, there are about 870'000 such prefixes for IPv4, and aboug 133'000 prefixes for IPv6. We passed the magic 1M number, which I\u0026rsquo;m sure makes some silicon vendors anxious, because lots of older kit in the field won\u0026rsquo;t scale beyond a certain size. 
VPP is totally immune to this problem, so here we go!\ntemplate bgp T_IBGP4 { local as 50869; neighbor as 50869; source address 194.1.163.86; ipv4 { import all; export none; next hop self on; }; }; protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; } protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; } protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; } template bgp T_IBGP6 { local as 50869; neighbor as 50869; source address 2001:678:d78:3::86; ipv6 { import all; export none; next hop self ibgp; }; }; protocol bgp rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; } protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; } protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; } And with these two blocks, I\u0026rsquo;ve added six new protocols \u0026ndash; three of them are IPv4 route-reflector clients, and three of them are IPv6 ones. Once this commits, Bird will be able to find these IP addresses due to the OSPF routes being loaded into the FIB, and once it does that, each of the route-reflector servers will download a full routing table into Bird\u0026rsquo;s memory, and in turn Bird will use the kernel4 and kernel6 protocol to export them into Linux (essentially performing an ip ro add ... via ... on each), and the kernel will then generate a Netlink message, which the Linux CP Netlink Listener plugin will pick up and the rest, as they say, is history.\nI gotta tell you - the first time I saw this working end to end, I was elated. Just seeing blocks of 6800-7000 of these being pumped into VPP\u0026rsquo;s FIB each 40ms block was just .. magical. And the performance is pretty good, too, because 7000/40ms is 175K/sec alluding to VPP operators being able to not only consume but also program into the FIB, a full IPv4 and IPv6 table in about 6 seconds, whoa!\nDBGvpp# linux-cp/nl [warn ]: process_msgs: Processed 6550 messages in 40001 usecs, 2607 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6368 messages in 40000 usecs, 7012 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6460 messages in 40001 usecs, 13163 left in queue ... linux-cp/nl [warn ]: process_msgs: Processed 6418 messages in 40004 usecs, 93606 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6438 messages in 40002 usecs, 96944 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6575 messages in 40002 usecs, 99986 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6552 messages in 40004 usecs, 94767 left in queue linux-cp/nl [warn ]: process_msgs: Processed 5890 messages in 40001 usecs, 88877 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6829 messages in 40003 usecs, 82048 left in queue ... linux-cp/nl [warn ]: process_msgs: Processed 6685 messages in 40004 usecs, 13576 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6701 messages in 40003 usecs, 6893 left in queue linux-cp/nl [warn ]: process_msgs: Processed 6579 messages in 40003 usecs, 314 left in queue DBGvpp# Due to a good cooperative multitasking approach in the Netlink message queue producer, I will continuously read Netlink messages from the kernel and put them in a queue, but only consume 40ms or 8000 messages whichever comes first, after which I yield control back to VPP. 
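A fun way to see this in action is to watch the FIB totals climb while Bird is loading the full table; a simple loop around the same awk one-liner I use below to count FIB entries does the trick:

while :; do vppctl sh ip fib sum | awk '$1~/[0-9]+/ { total += $2 } END { print total }'; sleep 1; done

The number jumps in steps on the order of 175K prefixes per second, consistent with the 6800-7000 messages per 40ms reported in the logs above.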
So you can see here that when the kernel is flooding the Netlink messages of the learned BGP routing table, the plugin correctly consumes what it can, the queue grows (in this case to just about 100K messages) and then quickly shrinks again.\nAnd indeed, Bird, IP and VPP all seem to agree, we did a good job:\npim@hippo:~/src/lcpng$ birdc show route count BIRD 2.0.7 ready. 1741035 of 1741035 routes for 870479 networks in table master4 396518 of 396518 routes for 132479 networks in table master6 Total: 2137553 of 2137553 routes for 1002958 networks in 2 tables pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip -6 ro | wc -l 132430 pim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip ro | wc -l 870494 pim@hippo:~/src/lcpng$ vppctl sh ip6 fib sum | awk \u0026#39;$1~/[0-9]+/ { total += $2 } END { print total }\u0026#39; 132479 pim@hippo:~/src/lcpng$ vppctl sh ip fib sum | awk \u0026#39;$1~/[0-9]+/ { total += $2 } END { print total }\u0026#39; 870529 Results The functional regression test I made on day one, the one that ensures end-to-end connectivity to and from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged and QinAD) and for both physical and virtual interfaces (like TenGigabitEthernet3/0/0 and BondEthernet0), still works. Great.\nHere\u0026rsquo;s a screencast [asciinema, gif] showing me playing around a bit with that configuration shown above, demonstrating that RIB and FIB synchronisation works pretty well in both directions, making the combination of these two plugins sufficient to run a VPP router in the Default Free Zone, Whoohoo!\nFuture work Atomic Updates - When running VPP + Linux CP in a default free zone BGP environment, IPv4 and IPv6 prefixes will be constantly updated as the internet topology morphs and changes. One thing I noticed is that those are often deletes followed by adds with the exact same nexthop (ie. something in Germany flapped, and this is not deduplicated), which shows up as many of these pairs of messages like so:\nlinux-cp/nl [debug ]: route_del: netlink route/route: del family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 } linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, [] linux-cp/nl [info ]: route_del: table 254 prefix 2a10:cc40:b03::/48 flags linux-cp/nl [debug ]: route_add: netlink route/route: add family inet6 type 1 proto 12 table 254 dst 2a10:cc40:b03::/48 nexthops { gateway fe80::9e69:b4ff:fe61:a1dd idx 197 } linux-cp/nl [debug ]: route_path_parse: path ip6 fe80::9e69:b4ff:fe61:a1dd, TenGigabitEthernet3/0/1.3, [] linux-cp/nl [info ]: route_add: table 254 prefix 2a10:cc40:b03::/48 flags linux-cp/nl [info ]: process_msgs: Processed 2 messages in 225 usecs See how 2a10:cc40:b03::/48 is first removed, and then immediately reinstalled to the exact same nexthop fe80::9e69:b4ff:fe61:a1dd on interface TenGigabitEthernet3/0/1.3 ? Although it only takes 225µs, it\u0026rsquo;s still a bit sad to parse, create paths, just to remove from the FIB and re-insert the exact same thing into the FIB. But more importantly, if a packet destined for this prefix arrives in that 225µs window, it will be lost. 
So I think I\u0026rsquo;ll build a peek-ahead mechanism along these lines to capture specifically this occurrence, and let the two del+add messages cancel each other out.\nPrefix updates towards lo - When writing the code, I borrowed a bunch from the pending [Gerrit] but that one has a nasty crash which was hard to debug and I haven\u0026rsquo;t yet fully understood it. When an add/del occurs for a route towards IPv6 localhost, VPP intermittently crashes. These routes are typically seen when Bird shuts down eBGP sessions: once I no longer have a path to a prefix, Bird marks it as \u0026lsquo;unreachable\u0026rsquo; rather than deleting it. The result is an addition which has a nexthop without a gateway but with an interface index of 1 (which, in Netlink, is \u0026lsquo;lo\u0026rsquo;). I have currently commented this out, while I gain a better understanding. Result: blackhole/unreachable/prohibit specials cannot be set using the plugin. Beware! (disabled in this [commit]).\nCredits I\u0026rsquo;d like to make clear that the Linux CP plugin is a collaboration between several great minds, and that my work stands on other software engineers\u0026rsquo; shoulders. In particular most of the Netlink socket handling and Netlink message queueing was written by Matthew Smith, and I\u0026rsquo;ve had a little bit of help along the way from Neale Ranns and Jon Loeliger. I\u0026rsquo;d like to thank them for their work!\nAppendix VPP config We only use one TenGigabitEthernet device on the router, and create two VLANs on it:\nIP=\u0026#34;sudo ip netns exec dataplane ip\u0026#34; vppctl set logging class linux-cp rate-limit 1000 level warn syslog-level notice vppctl lcp create TenGigabitEthernet3/0/1 host-if e1 netns dataplane $IP link set e1 mtu 1500 up $IP link add link e1 name ixp type vlan id 179 $IP link set ixp mtu 1500 up $IP addr add 45.129.224.235/29 dev ixp $IP addr add 2a0e:5040:0:2::235/64 dev ixp $IP link add link e1 name servers type vlan id 101 $IP link set servers mtu 1500 up $IP addr add 194.1.163.86/27 dev servers $IP addr add 2001:678:d78:3::86/64 dev servers Bird config I\u0026rsquo;m using a purposefully minimalist configuration for demonstration purposes, posted here in full for posterity:\nlog syslog all; log \u0026#34;/var/log/bird/bird.log\u0026#34; { debug, trace, info, remote, warning, error, auth, fatal, bug }; router id 194.1.163.86; protocol device { scan time 10; } protocol direct { ipv4; ipv6; check link yes; } protocol kernel kernel4 { ipv4 { import all; export where source != RTS_DEVICE; }; } protocol kernel kernel6 { ipv6 { import all; export where source != RTS_DEVICE; }; } protocol ospf v2 ospf4 { ipv4 { export where source = RTS_DEVICE; import all; }; area 0 { interface \u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;servers\u0026#34; { type broadcast; cost 5; }; }; } protocol ospf v3 ospf6 { ipv6 { export where source = RTS_DEVICE; import all; }; area 0 { interface \u0026#34;lo\u0026#34; { stub yes; }; interface \u0026#34;servers\u0026#34; { type broadcast; cost 5; }; }; } template bgp T_IBGP4 { local as 50869; neighbor as 50869; source address 194.1.163.86; ipv4 { import all; export none; next hop self on; }; }; protocol bgp rr4_frggh0 from T_IBGP4 { neighbor 194.1.163.140; } protocol bgp rr4_chplo0 from T_IBGP4 { neighbor 194.1.163.148; } protocol bgp rr4_chbtl0 from T_IBGP4 { neighbor 194.1.163.87; } template bgp T_IBGP6 { local as 50869; neighbor as 50869; source address 2001:678:d78:3::86; ipv6 { import all; export none; next hop self ibgp; }; }; protocol bgp
rr6_frggh0 from T_IBGP6 { neighbor 2001:678:d78:6::140; } protocol bgp rr6_chplo0 from T_IBGP6 { neighbor 2001:678:d78:7::148; } protocol bgp rr6_chbtl0 from T_IBGP6 { neighbor 2001:678:d78:3::87; } Final note You may have noticed that the [commit] links are all to git commits in my private working copy. I want to wait until my previous work is reviewed and submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the mean time :-)\n","date":"2021-09-02","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/09/02/vpp-linux-cp-part5/","section":"articles","title":"VPP Linux CP - Part5"},{"contents":"Introduction I\u0026rsquo;ve been a very happy Init7 customer since 2016, when the fiber to the home ISP I was a subscriber at back then, a small company called Easyzone, got acquired by Init7. The technical situation in Wangen-Brüttisellen was a bit different back in 2016. There was a switch provided by Litecom in which ports were resold OEM to upstream ISPs, and Litecom would provide the L2 backhaul to a central place to hand off the customers to the ISPs, in my case Easyzone. In Oct'16, Fredy asked me if I could do a test of Fiber7-on-Litecom, which I did and reported on in a blog post.\nSome time early 2017, Init7 deployed a POP in Dietlikon (790BRE) and then magically another one in Brüttisellen (1790BRE). It\u0026rsquo;s a funny story why the Dietlikon point of presence is called 790BRE, but I\u0026rsquo;ll leave that for the bar, not this post :-)\nFiber7\u0026rsquo;s Next Gen Some of us read a rather curious tweet in back in May:\nTranslated \u0026ndash; \u0026lsquo;7 years ago our Gigabit-Internet was born. To celebrate this day, here\u0026rsquo;s a riddle for #Nerds: Gordon Moore\u0026rsquo;s law says dictates doubling every 18 months. What does that mean for our 7 year old Fiber7?\u0026rsquo; Well, 7 years is 84 months, and doubling every 18 months means 84/18 = 4.6667 doublings and 1Gbpbs * 2^4.6667 = 25.4Gbps. Holy shitballs, Init7 just announced that their new platform will offer 25G symmetric ethernet?!\n\u0026ldquo;I wonder what that will cost?\u0026rdquo;, I remember myself thinking. \u0026ldquo;The same price\u0026rdquo;, was the answer. 
I can see why \u0026ndash; monitoring my own family\u0026rsquo;s use, we\u0026rsquo;re doing a good 60Mbit or so when we stream Netflix and/or Spotify (which we all do daily). And some IPTV maybe at 4k will go for a few hundred megs, but the only time we actually use the gigabit, is when we do a speedtest of an iperf :-) Moreover, offering 25G fits the company\u0026rsquo;s marketing strategy well, because our larger Swiss national telco and cable providers are all muddying the waters with their DOCSIS and GPON offering, both of which can do 10Gbit, but it\u0026rsquo;s a TDM (time division multiplexing) offering which makes any number of subscribers share that bandwidth to a central office. And when I say any number, it\u0026rsquo;s easy to imagine 128 and 256 subscribers on one XGSPON, and many of those transponders in a telco line terminator, each with redundant uplinks of 2x10G or sometimes 2x40G. But that\u0026rsquo;s an oversubscription of easily 2000x, taking 128 (subscribers per PON) x16 (PONs per linecard) x8 (linecards), is 16K subscribers of 10G using 80G (or only 20G) of uplink bandwidth. That\u0026rsquo;s massively inferior from a technical perspective. And, as we\u0026rsquo;ll see below, it doesn\u0026rsquo;t really allow for advanced services, like L2 backhaul from the subscriber to a central office.\nNow to be fair, the 1790BRE pop that I am personally connected to has 2x 10G uplinks and ~200 or so 1G downlinks, which is also a local overbooking of 10:1, or 20:1 if only one of the uplinks is used at any given time. Worth noting, sometimes several cities are daisy chained, which makes for larger overbooking if you\u0026rsquo;re deep in the Fiber7 access network. I am pretty close (790BRE-790SCW-790OER-Core; and an alternate path of 780EFF-Core; only one of which is used because the Fiber7 edge switches use OSPF and a limited TCAM space means only few if any public routes are there; I assume a default is injected into OSPF at every core site and limited traffic engineering is done). The longer the traceroute, the cooler it looks, but the more customers are ahead of you, causing more overbooking. YMMV ;-)\nUpgrading 1790BRE Wouldn\u0026rsquo;t it be cool if Init7 upgraded to 100G intra-pop? Well, this is the story of their Access'21 project! My buddy Pascal (who is now the CTO at Init7, good choice!), explained it to me in a business call back in June, but also shared it in a presentation which I definitely encourage you to browse through. If you thought I was jaded on GPON, check out their assessment, it\u0026rsquo;s totally next level!\nAnyway, the new POPs are based on Cisco\u0026rsquo;s C9500 switches, which come in two variants: Access switches are C9500-48Y4C which take 48x SPF28 (1/10/25Gbit) and 4x QSFP+ (40/100Gbit) and aggregation switches are C9500-32C which take 32x QSFP+ (40/100Gbit).\nAs a subscriber, we all got a courtesy headsup on the date of 1790BRE\u0026rsquo;s upgrade. It was scheduled for Thursday Aug 26th starting at midnight. As I\u0026rsquo;ve written about before (for example at the bottom of my Bucketlist post), I really enjoy the immediate gratification of physical labor in a datacenter. Most of my projects at work are on the quarters-to-years timeframe, and being able to do a thing and see the result of that thing ~immmediately, is a huge boost for me.\nSo I offered to let one of the two Init7 people take the night off and help perform the upgrade myself. 
The picture on the right is how the switch looked like until now, with four linecards of 48x1G trunked into 2x10G uplinks, one towards Effretikon and one towards Dietlikon. It\u0026rsquo;s an aging Cisco 4510 switch (they were released around 2010), but it has served us well here in Brüttisellen for many years, thank you, little chassis!\nThe Upgrade I met the Init7 engineer in front of the Werke Wangen-Brüttisellen, which is about 170m from my house, as the photons fly, at around 23:30. We chatted for a little while, I had already gotten to know him due to mutual hosting at NTT in Rümlang, so of course our basement ISPs peer over CommunityIX and so on, but it\u0026rsquo;s cool to put a face to the name.\nThe new switches were already racked by Pascal previously, and DWDM multiplexers have appeared, and that what used to be a simplex fiber, is now two pairs of duplex fibers. Maybe DWDM services are in reach for me at some point? I should look in to that \u0026hellip; but for now let\u0026rsquo;s focus on the task at hand.\nIn the picture on the right, you can see from top to bottom: DWDM mux to ZH11/790ZHB which immediately struck my eye as clever - it\u0026rsquo;s a 8 channel DWDM mux with channels C31-C38 and two wideband passthroughs, one is 1310W which means \u0026ldquo;a wideband 1310nm\u0026rdquo; which is where the 100G optics are sending; and the other is UPG which is an upgrade port, allowing to add more DWDM channels in a separate mux into the fiber at a later date, at the expense of 2dB or so of insertion loss. Nice. The second is an identical unit, a DWDM mux to 780EFF which has again one 100G 1310nm wideband channel towards Effretikon and then on to Winterthur, and CH31 in use with what is the original C4510 switch (that link used to be a dark fiber with vanilla 10G optics connecting 1790BRE with 780EFF).\nThen there are two redundant aggregation switches (the 32x100G kind), which have each four access switches connected to them, with the pink cables. Those are interesting: to make 100G very cheap, optics can make use of 4x25G lasers that each take one fiber, so 8 fibers in total, and those pink cables are 12-fiber multimode trunks with an MPO connector. The optics for this type of connection are super cheap, for example this Flexoptix one. I have the 40G variant at home, also running multimode 4x10G MPO cables, at a fraction of the price of singlemode single-laser variants. So when people say \u0026ldquo;multimode is useless, always use singlemode\u0026rdquo;, point them at this post please!\nThere were 11 subscribers who upgraded their service, ten of them to 10Gbps (myself included) and one of them to 25Gbps, lucky bastard. So in a first pass we shut down all the ports on the C4510 and moved over optics and fibers one by one into the new C9500 switches, of which there were four.\nWerke Wangen-Brüttisellen (the local telcoroom owners in my town) historically did do a great job at labeling every fiber with little numbered clips, so it\u0026rsquo;s easy to ensure that what used to be fiber #33, is now still in port #33. I worked from the right, taking two optics from the old switch, moving them into the new switch, and reinserting the fibers. The Init7 engineer worked from the left, doing the same. We managed to complete this swap-over in record time, according to Pascal who was monitoring from remote, and reconfiguring the switches to put the subscribers back into service. We started at 00:05 and completed the physical reconfiguration at 01:21am. 
Go, us!\nAfter the physical work, we conducted an Init7 post-maintenance ritual which was eating a cheeseburger to replenish our body\u0026rsquo;s salt and fat contents. We did that at my place and luckily I have access to a microwave oven and also some Blairs Mega Death hotsauce (with liquid rage) which my buddy enthusiastically drizzled onto the burger, but it did make him burp just a little bit as sweat poured out of his face. That was fun! I took some more pictures, published with permission, in this album.\nOne more thing! I had waited to order this until the time was right, and the upgrade of 1790BRE was it \u0026ndash; since I operate AS50869, a little basement ISP, I had always hoped to change my 1500 byte MTU L3 service into a Jumboframe capable L2 service. After some negotiation on the contractuals, I signed an order ahead of this maintenance to upgrade to a 10G virtual leased line (VLL) from this place to the NTT datacenter in Rümlang.\nIn the afternoon, I had already patched my side of the link in the datacenter, and I noticed that the Init7 side of the patch was dangling in their rack without an optic. So we went to the datacenter (at 2am, the drive from my house to NTT is 9 minutes, without speeding!), and plugged in an optic to let my lonely photons hit a friendly receiver.\nI then got to configure the VLL together with my buddy, which was a hilight of the night for me. I now have access to a spiffy new 10 gigabit VLL operating at 9190 MTU, from 1790BRE directly to my router chrma0.ipng.ch at NTT Rümlang, while previously I had secured a 1G carrier ethernet operating at 9000 MTU directly to my router chgtg0.ipng.ch at Interxion Glattbrugg. Between the two sites, I have a CWDM wave which currently runs 10G optics but I have the 25G CWDM optics and switches ready for deployment. It\u0026rsquo;s somewhat (ok, utterly) over the top, but I like (ok, love) it.\npim@chbtl0:~$ show protocols ospfv3 neighbor Neighbor ID Pri DeadTime State/IfState Duration I/F[State] 194.1.163.4 1 00:00:38 Full/PointToPoint 87d05:37:45 dp0p6s0f3[PointToPoint] 194.1.163.86 1 00:00:31 Full/DROther 16:18:39 dp0p6s0f2.101[BDR] 194.1.163.87 1 00:00:30 Full/DR 7d15:48:41 dp0p6s0f2.101[BDR] 194.1.163.0 1 00:00:38 Full/PointToPoint 2d12:02:19 dp0p6s0f0[PointToPoint] The latency from my workstation on which I\u0026rsquo;m writing this blogpost to, say, my Bucketlist location of NIKHEF in the Amsterdam Watergraafsmeer, is pretty much as fast as light goes (I\u0026rsquo;ve seen 12.2ms, but considering it\u0026rsquo;s ~820km, this is not bad at all):\npim@chumbucket:~$ traceroute gripe traceroute to gripe (94.142.241.186), 30 hops max, 60 byte packets 1 chbtl0.ipng.ch (194.1.163.66) 0.211 ms 0.186 ms 0.189 ms 2 chrma0.ipng.ch (194.1.163.17) 1.463 ms 1.416 ms 1.432 ms 3 defra0.ipng.ch (194.1.163.25) 7.376 ms 7.344 ms 7.330 ms 4 nlams0.ipng.ch (194.1.163.27) 12.952 ms 13.115 ms 12.925 ms 5 gripe.ipng.nl (94.142.241.186) 13.250 ms 13.337 ms 13.223 ms And, due to the work we did above, now the bandwidth is up to par as well, with comparable down- and upload speeds of 9.2Gbit from NL\u0026gt;CH and 8.9Gbit from CH\u0026gt;NL, and, while I\u0026rsquo;m not going to prove it here, this would work equally well with 9000 byte, 1500 byte or 64 byte frames due to my use of DPDK based routers who just don\u0026rsquo;t G.A.F. :\npim@chumbucket:~$ iperf3 -c nlams0.ipng.ch -R -P 10 ## Richtung Schweiz! Connecting to host nlams0, port 5201 Reverse mode, remote host nlams0 is sending ... 
[SUM] 0.00-10.01 sec 10.8 GBytes 9.26 Gbits/sec 53 sender [SUM] 0.00-10.00 sec 10.7 GBytes 9.19 Gbits/sec receiver pim@chumbucket:~$ iperf3 -c nlams0.ipng.ch -P 10 ## Naar Nederland toe! Connecting to host nlams0, port 5201 ... [SUM] 0.00-10.00 sec 9.93 GBytes 8.87 Gbits/sec 405 sender [SUM] 0.00-10.02 sec 9.91 GBytes 8.84 Gbits/sec receiver ","date":"2021-08-26","desc":"Introduction I\u0026rsquo;ve been a very happy Init7 customer since 2016, when the fiber to the home ISP I was a subscriber at back then, a small company called Easyzone, got acquired by Init7. The technical situation in Wangen-Brüttisellen was a bit different back in 2016. There was a switch provided by Litecom in which ports were resold OEM to upstream ISPs, and Litecom would provide the L2 backhaul to a central place to hand off the customers to the ISPs, in my case Easyzone. In Oct'16, Fredy asked me if I could do a test of Fiber7-on-Litecom, which I did and reported on in a blog post.\n","permalink":"https://ipng.ch/s/articles/2021/08/26/fiber7-x-in-1790bre/","section":"articles","title":"Fiber7-X in 1790BRE"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nIn the first three posts, I added the ability for VPP to synchronize its state (like link state, MTU, and interface addresses) into Linux. In this post, I\u0026rsquo;ll make a start on the other direction: allowing changes to interfaces made in Linux to make their way back into VPP!\nMy test setup I\u0026rsquo;m keeping the setup from the third post. A Linux machine has an interface enp66s0f0 which has 4 sub-interfaces (one dot1q tagged, one q-in-q, one dot1ad tagged, and one q-in-ad), giving me five flavors in total. 
Then, I created an LACP bond0 interface, which also has the whole kit and caboodle of sub-interfaces defined, see below in the Appendix for details, but here\u0026rsquo;s the table again for reference:\nName type Addresses enp66s0f0 untagged 10.0.1.2/30 2001:db8:0:1::2/64 enp66s0f0.q dot1q 1234 10.0.2.2/30 2001:db8:0:2::2/64 enp66s0f0.qinq outer dot1q 1234, inner dot1q 1000 10.0.3.2/30 2001:db8:0:3::2/64 enp66s0f0.ad dot1ad 2345 10.0.4.2/30 2001:db8:0:4::2/64 enp66s0f0.qinad outer dot1ad 2345, inner dot1q 1000 10.0.5.2/30 2001:db8:0:5::2/64 bond0 untagged 10.1.1.2/30 2001:db8:1:1::2/64 bond0.q dot1q 1234 10.1.2.2/30 2001:db8:1:2::2/64 bond0.qinq outer dot1q 1234, inner dot1q 1000 10.1.3.2/30 2001:db8:1:3::2/64 bond0.ad dot1ad 2345 10.1.4.2/30 2001:db8:1:4::2/64 bond0.qinad outer dot1ad 2345, inner dot1q 1000 10.1.5.2/30 2001:db8:1:5::2/64 The goal of this post is to show what code needed to be written and introduces an entirely new plugin, so that we can separate concerns (and have a higher chance of community acceptance of the plugins). In the first plugin, now called the Interface Mirror, I have previously implemented the VPP-to-Linux synchronization. In this new plugin (called the Netlink Listener) I implement the Linux-to-VPP synchronization using, quelle surprise, Netlink message handlers.\nStartingpoint Based on the state of the plugin after the third post, operators can enable lcp-sync (which copies changes made in VPP into their Linux counterpart) and lcp-auto-subint (which extends sub-interface creation in VPP to automatically create a Linux Interface Pair, or LIP, and its companion Linux network interface):\nDBGvpp# lcp lcp-sync on DBGvpp# lcp lcp-auto-subint on DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 DBGvpp# create sub TenGigabitEthernet3/0/0 1234 DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match pim@hippo:~/src/lcpng$ ip link | grep e0 1286: e0.1234@e0: \u0026lt;BROADCAST,MULTICAST,M-DOWN\u0026gt; mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 1287: e0.1235@e0.1234: \u0026lt;BROADCAST,MULTICAST,M-DOWN\u0026gt; mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 1288: e0.1236@e0: \u0026lt;BROADCAST,MULTICAST,M-DOWN\u0026gt; mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 1289: e0.1237@e0.1236: \u0026lt;BROADCAST,MULTICAST,M-DOWN\u0026gt; mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 1701: e0: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 9050 qdisc mq state DOWN mode DEFAULT group default qlen 1000 The vision for this plugin has been that Linux can drive most control-plane operations, such as creating sub-interfaces, adding/removing addresses, changing MTU on links, etc. We can do that by listening to Netlink messages, which were designed for transferring miscellaneous networking information between the kernel space and userspace processes (like VPP). Networking utilities, such as the iproute2 family and its command line utilities (like ip) use Netlink to communicate with the Linux kernel from userspace.\nNetlink Listener The first task at hand is to install a Netlink listener. In this new plugin, I first register lcp_nl_init() which adds Linux interface pair (LIP) add/del callbacks from the first plugin. 
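To illustrate just how small the userspace side of this mechanism is, here is a minimal raw rtnetlink subscriber written against the plain kernel API. It is only a standalone demonstration of listening to Netlink notifications; the plugin does the equivalent from inside VPP with its own socket handling and queueing, as described below.

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/rtnetlink.h>

int main(void) {
  /* Open a routing-family Netlink socket and join the broadcast groups for
   * link, address, route and neighbor notifications. */
  int fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_ROUTE);
  if (fd < 0) { perror("socket"); return 1; }

  struct sockaddr_nl sa;
  memset(&sa, 0, sizeof(sa));
  sa.nl_family = AF_NETLINK;
  sa.nl_groups = RTMGRP_LINK | RTMGRP_IPV4_IFADDR | RTMGRP_IPV6_IFADDR |
                 RTMGRP_IPV4_ROUTE | RTMGRP_IPV6_ROUTE | RTMGRP_NEIGH;
  if (bind(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
    perror("bind"); close(fd); return 1;
  }

  char buf[8192];
  for (;;) {
    ssize_t len = recv(fd, buf, sizeof(buf), 0);
    if (len <= 0) break;
    /* One read may return several Netlink messages; walk them all. */
    for (struct nlmsghdr *nh = (struct nlmsghdr *) buf; NLMSG_OK(nh, len);
         nh = NLMSG_NEXT(nh, len)) {
      printf("netlink message type %u, %u bytes\n",
             (unsigned) nh->nlmsg_type, (unsigned) nh->nlmsg_len);
    }
  }
  close(fd);
  return 0;
}

Every ip link, ip addr and ip route change in the namespace the socket was opened in shows up as messages like these, and that stream of notifications is the raw material the new plugin works from.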
I\u0026rsquo;m now made aware of new LIPs as they are created.\nIn lcb_nl_pair_add_cb(), I will initiate Netlink listener for first interface that gets created, noting its netns. If subsequent adds are in other netns, I\u0026rsquo;ll just issue a warning. And, I will keep a refcount so I know how many LIPs are bound to this listener.\nIn lcb_nl_pair_del_cb(), I can remove the listener when the last interface pair is removed.\nThen for listening itself, a Netlink socket is opened, and because Linux can be quite chatty on Netlink sockets, I\u0026rsquo;ll raise its read/write buffers to something quite large (typically 64M read and 16K write size). One note on this size, it\u0026rsquo;ll need some sysctl to be set before VPP starts, typically done as follows:\npim@hippo:~/src/vpp$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/sysctl.d/81-vpp-Netlink.conf # Increase Netlink to 64M net.core.rmem_default=67108864 net.core.wmem_default=67108864 net.core.rmem_max=67108864 net.core.wmem_max=67108864 EOF pim@hippo:~/src/vpp$ sudo sysctl -p After creating the Netlink socket, I add its file descriptor to VPP\u0026rsquo;s built in file handler, which will see to polling it. On the file handler, I install lcp_nl_read_cb() and lcp_nl_error_cb() callbacks which will be invoked when anything interesting happens on the socket:\nA bit of explanation on why I\u0026rsquo;d use a queue rather than just consuming the Netlink messages directly as they are offered. I have to use a queue for the common case in which VPP is running single threaded. Instead of consuming a block of potentially a million route del/add\u0026rsquo;s (say, if BGP is reconverging), and thereby blocking VPP from reading new packets from DPDK, but more importantly, new Netlink messages from the kernel, which will fill the 64M socket buffer and overflow it, losing Netlink messages, which is bad because it requires an end to end resync of the Linux namespace into the VPP dataplane, something called an NLM_F_DUMP but that\u0026rsquo;s a story for another day.\nSo I process only a batch of messages and only for a maximum amount of time per batch. If there are still some messages left in the queue, I\u0026rsquo;ll just reschedule consumption after M milliseconds. This allows new Netlink messages to continuously be read from the kernel by VPP\u0026rsquo;s file handler, even if there\u0026rsquo;s a lot of work to do.\nlcp_nl_read_cb() calls lcp_nl_callback() which pushes Netlink messages onto a queue and issues a NL_EVENT_READ event, any socket read error issues NL_EVENT_READ_ERR event. lcp_nl_error_cb() simply issues NL_EVENT_READ_ERR event and moves on with life. To capture these events, I initialize a process node called lcp_nl_process(), which handles:\nNL_EVENT_READ by calling lcp_nl_process_msgs() and processing a batch of messages (either a maximum count, or a maximum duration, whichever is reached first). NL_EVENT_READ_ERR is the other event that can happen, in case VPP\u0026rsquo;s file handler or my own lcp_nl_read_cb() encounter a read error. All it does is close and reopen the Netlink socket in the same network namespace we were before, in an attempt to minimize the damage, dazed and confused, but trying to continue. Allright, so at this point, I have a producer queue that gets added to by the Netlink reader machinery, so all I have to do is consume them. 
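Before moving on to the consumer, the producer half described above boils down to roughly the following. This is a standalone sketch with invented names; the real code uses VPP process node events and its file handler rather than plain function pointers.

#include <stddef.h>

enum nl_event { NL_EVENT_READ, NL_EVENT_READ_ERR };

/* Invented stand-ins for the plugin machinery: a simple message queue, a
 * function that pulls one message off the kernel socket, and a function
 * that wakes the consumer (in the plugin: a VPP process node event). */
struct nl_queue { void **msgs; size_t len, capacity; };

typedef int  (*nl_recv_fn)(void **msg_out);   /* 1 = got a message, 0 = drained */
typedef void (*nl_wake_fn)(enum nl_event ev);

static void nl_queue_push(struct nl_queue *q, void *msg) {
  if (q->len < q->capacity)      /* the real queue grows, this sketch just caps */
    q->msgs[q->len++] = msg;
}

/* Read callback: drain whatever the kernel has into the queue, then signal
 * the consumer once, so it can schedule a bounded batch of work. */
void nl_read_cb(struct nl_queue *q, nl_recv_fn recv_one, nl_wake_fn wake) {
  void *msg;
  int got_any = 0;
  while (recv_one(&msg) == 1) {
    nl_queue_push(q, msg);
    got_any = 1;
  }
  if (got_any)
    wake(NL_EVENT_READ);
}

/* Error callback: report it and let the consumer close and reopen the socket. */
void nl_error_cb(nl_wake_fn wake) {
  wake(NL_EVENT_READ_ERR);
}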
lcp_nl_process_msgs() processes up to N messages and/or for up to M msecs, whichever comes first, and for each individual Netlink message, it will call lcp_nl_dispatch() to handle messages of a given type.\nFor now, lcp_nl_dispatch() just throws the message away after logging it with format_nl_object(), a function that will come in very useful as I start to explore all the different Netlink message types.\nThe code that forms the basis of our Netlink Listener lives in [this commit] and specifically, here I want to call out I was not the primary author, I worked off of Matt and Neale\u0026rsquo;s awesome work in this pending Gerrit.\nNetlink: Neighbor ARP and IPv6 Neighbor Discovery will trigger a set of Netlink messages, which are of type RTM_NEWNEIGH and RTM_DELNEIGH\nFirst, I\u0026rsquo;ll add a new source file lcpng_nl_sync.c that will house these handler functions. Their purpose is to take state learned from Netlink messages, and apply that state to VPP.\nThen, I add lcp_nl_neigh_add() and lcp_nl_neigh_del() which implement the following pattern: Most Netlink messages are somehow about a link, which is identified by an interface index (ifindex or just idx for short). That\u0026rsquo;s the same interface index I stored when I created the LIP, calling it vif_index because in VPP, it describes a virtio device which implements the IO for the TAP.\nIf I\u0026rsquo;m handling a message for link with a given ifindex, I can correlate it with a LIP. Not all messages will be related to something VPP knows or cares about, I\u0026rsquo;ll discuss that more later when I discuss RTM_NEWLINK messages.\nIf there is no LIP associated with the ifindex, then clearly this message is about a Linux interface VPP is not aware of. But, if I can find the LIP, I can convert the lladdr (MAC address) and IP address from the Netlink message into their VPP variants, and then simply add or remove the ip4/ip6 neighbor adjacency.\nThe code for this first Netlink message handler lives in this [commit]. An ironic insight is that after writing the code, I don\u0026rsquo;t think any of it will be necessary, because the interface plugin will already copy ARP and IPv6 ND packets back and forth and itself update its neighbor adjacency tables; but I\u0026rsquo;m leaving the code in for now.\nNetlink: Address A decidedly more interesting message is RTM_NEWADDR and its deletion companion RTM_DELADDR.\nIt\u0026rsquo;s pretty straight forward to add and remove IPv4 and IPv6 addresses on interfaces. I have to convert the Netlink representation of an IP address to its VPP counterpart with a helper, add it or remove it, and if there are no link-local addresses left, disable IPv6 on the interface. 
There\u0026rsquo;s also a few multicast routes to add (notably 224.0.0.0/24 and ff00::/8, all-local-subnet).\nThe code for IP address handling is in this [commit], but when I took it out for a spin, I noticed something curious, looking at the log lines that are generated for the following sequence:\nip addr add 10.0.1.1/30 dev e0 debug linux-cp/nl addr_add: Netlink route/addr: add idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent) warn linux-cp/nl dispatch: ignored route/route: add family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 } warn linux-cp/nl dispatch: ignored route/route: add family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 } warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 } warn linux-cp/nl dispatch: ignored route/route: add family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 } ping 10.0.1.2 debug linux-cp/nl neigh_add: Netlink route/neigh: add idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000 notice linux-cp/nl neigh_add: Added 10.0.1.2 lladdr 68:05:ca:32:45:94 iface TenGigabitEthernet3/0/0 ip addr del 10.0.1.1/30 dev e0 debug linux-cp/nl addr_del: Netlink route/addr: del idx 1488 family inet local 10.0.1.1/30 flags 0x0080 (permanent) notice linux-cp/nl addr_del: Deleted 10.0.1.1/30 iface TenGigabitEthernet3/0/0 warn linux-cp/nl dispatch: ignored route/route: del family inet type 1 proto 2 table 254 dst 10.0.1.0/30 nexthops { idx 1488 } warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.3 nexthops { idx 1488 } warn linux-cp/nl dispatch: ignored route/route: del family inet type 3 proto 2 table 255 dst 10.0.1.0 nexthops { idx 1488 } warn linux-cp/nl dispatch: ignored route/route: del family inet type 2 proto 2 table 255 dst 10.0.1.1 nexthops { idx 1488 } debug linux-cp/nl neigh_del: Netlink route/neigh: del idx 1488 family inet lladdr 68:05:ca:32:45:94 dst 10.0.1.2 state 0x0002 (reachable) flags 0x0000 error linux-cp/nl neigh_del: Failed 10.0.1.2 iface TenGigabitEthernet3/0/0 It is this very last message that\u0026rsquo;s a bit of a surprise \u0026ndash; the ping brought the peer\u0026rsquo;s lladdr into the neighbor cache; and the subsequent address deletion first removed the address, then all the typical local routes (the connected, the broadcast, the network, and the self/local); but then as well explicitly deleted the neighbor, which I suppose is correct behavior for Linux, were it not that VPP already invalidates the neighbor cache and adds/removes the connected routes for example in ip/ip4_forward.c L826-L830 and L583.\nI can see more of these false positive non-errors like the one on lcp_nl_neigh_del() because interface and directly connected route addition/deletion is slightly different in VPP than in Linux. So, I decide to take a little shortcut \u0026ndash; if an addition returns \u0026ldquo;already there\u0026rdquo;, or a deletion returns \u0026ldquo;no such entry\u0026rdquo;, I\u0026rsquo;ll just consider it a successful addition and deletion respectively, saving my eyes from being screamed at by this red error message. I changed that in this [commit], turning this situation in a friendly green notice instead.\nNetlink: Link (existing) There\u0026rsquo;s a bunch of use cases for these messages RTM_NEWLINK and RTM_DELLINK. They carry information about carrier (link, no-link), admin state (up/down), MTU, and so on. 
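To make it tangible what such a message carries, here is a minimal attribute walk over an RTM_NEWLINK or RTM_DELLINK payload, using only the standard kernel UAPI headers and macros. It is independent of the plugin and simply prints the fields that the handlers below care about.

#include <stdio.h>
#include <net/if.h>             /* IFF_UP, IFF_RUNNING */
#include <linux/netlink.h>      /* NLMSG_* macros */
#include <linux/rtnetlink.h>    /* struct ifinfomsg, struct rtattr, RTA_* */
#include <linux/if_link.h>      /* IFLA_* attribute types */

/* Print the interesting bits of one RTM_NEWLINK / RTM_DELLINK message:
 * admin state, (rough) carrier, interface name, MTU and MAC address. */
static void print_link_msg(const struct nlmsghdr *nh) {
  const struct ifinfomsg *ifi = NLMSG_DATA(nh);
  int len = nh->nlmsg_len - NLMSG_LENGTH(sizeof(struct ifinfomsg));

  printf("ifindex %d admin %s oper %s", ifi->ifi_index,
         (ifi->ifi_flags & IFF_UP) ? "up" : "down",
         (ifi->ifi_flags & IFF_RUNNING) ? "running" : "not-running");

  /* Attributes start right after the (already 4-byte aligned) ifinfomsg. */
  const struct rtattr *rta = (const struct rtattr *)
      ((const char *) ifi + NLMSG_ALIGN(sizeof(struct ifinfomsg)));

  for (; RTA_OK(rta, len); rta = RTA_NEXT(rta, len)) {
    if (rta->rta_type == IFLA_IFNAME)
      printf(" name %s", (const char *) RTA_DATA(rta));
    else if (rta->rta_type == IFLA_MTU)
      printf(" mtu %u", *(const unsigned *) RTA_DATA(rta));
    else if (rta->rta_type == IFLA_ADDRESS) {
      const unsigned char *mac = RTA_DATA(rta);
      printf(" lladdr %02x:%02x:%02x:%02x:%02x:%02x",
             mac[0], mac[1], mac[2], mac[3], mac[4], mac[5]);
    }
  }
  printf("\n");
}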
The function lcp_nl_link_del() is the easier of the two. If I see a message like this for an ifindex that VPP has a LIP for, I\u0026rsquo;ll just remove it. This means first calling the lcp_itf_pair_delete() function and then, if the message was for a VLAN interface, remove the accompanying sub-interface (both the physical one (eg. TenGigabitEthernet3/0/0.1234) as well as the TAP that we used to communicate to the host with (eg. tap8.1234).\nThe other message (the RTM_NEWLINK one), is much more complicated, because it\u0026rsquo;s actually many types of operation all in one message type: We can set the link up/down, change its MTU, and change its MAC address, in any combination, perhaps like so:\nip link set e0 mtu 9216 address 00:11:22:33:44:55 down So in turn, lcp_nl_link_add() will first look at admin state and apply it to the phy and tap, apply the MTU if it\u0026rsquo;s different to what VPP has, and apply the MAC address if it\u0026rsquo;s different to what VPP has, notably applying MAC addresses only in \u0026lsquo;hardware\u0026rsquo; interfaces, which I now know are not just physical ones like TenGigabitEthernet3/0/0 but also virtual ones like BondEthernet0.\nOne thing I noticed, is that link state and MTU changes tend to go around in circles (from Netlink into VPP, with this code, but when lcp-sync is on in the interface mirror plugin, changes to link and mtu will trigger a callback there, which will in turn generate a Netlink message, and so on). To avoid this loop, I temporarily turn off lcp-sync just before handling a batch of messages, and turn it back to its original state when I\u0026rsquo;m done with that.\nThe code for all/del of existing links is in this [commit].\nNetlink: Link (new) Here\u0026rsquo;s where it gets interesting! What if the RTM_NEWLINK message was for an interface that VPP doesn\u0026rsquo;t have a LIP for, but specifically describes a VLAN interface? Well, then clearly the operator is trying to create a new sub-interface. And supporting that operation would be super cool, so let\u0026rsquo;s go!\nUsing the earlier placeholder hint in lcp_nl_link_add() (see the previous [commit]), I know that I\u0026rsquo;ve gotten a NEWLINK request but the Linux ifindex doesn\u0026rsquo;t have a LIP. This could be because the interface is entirely foreign to VPP, for example somebody created a dummy interface or a VLAN sub-interface on one:\nip link add dum0 type dummy ip link add link dum0 name dum0.10 type vlan id 10 Or perhaps more interestingly, the operator is actually trying to create a VLAN sub-interface on an interface we created in VPP earlier, like these:\nip link add link e0 name e0.1234 type vlan id 1234 ip link add link e0.1234 name e0.1235 type vlan id 1000 ip link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad ip link add link e0.1236 name e0.1237 type vlan id 1000 None of these RTM_NEWLINK messages, represented by vif (Linux ifindex) will have a corresponding LIP. So, I try to create one by calling lcp_nl_link_add_vlan().\nFirst, I\u0026rsquo;ll lookup the parent ifindex (dum0 or e0 in the examples above). The first example parent, dum0, doesn\u0026rsquo;t have a LIP, so I bail after logging a warning. 
The second example however, e0, definitely does have a LIP, so it\u0026rsquo;s known to VPP.\nNow, I have two further choices:\nthe LIP is a phy (ie TenGigabitEthernet3/0/0 or BondEthernet0) and this is a regular tagged interface with a given proto (dot1q or dot1ad); or the LIP is itself a subint (ie TenGigabitEthernet3/0/0.1234) and what I\u0026rsquo;m being asked for is actually a QinQ or QinAD sub-interface. Remember, there\u0026rsquo;s an important difference: In Linux these sub-interfaces are chained (e0 creates child e0.1234@e0 for a normal VLAN, and e0.1234 creates child e0.1235@e0.1234 for the QinQ). In VPP these are actually all flat sub-interfaces, with the \u0026lsquo;regular\u0026rsquo; VLAN interface carrying the one_tag flag with only an outer_vlan_id set, and the latter QinQ carrying the two_tags flag with both an outer_vlan_id (1234) and an inner_vlan_id (1000). So I look up both the parent LIP as well the phy LIP. I now have all the ingredients I need to create the VPP sub-interfaces with the correct inner-dot1q and outer dot1q or dot1ad.\nOf course, I don\u0026rsquo;t really know what subinterface ID to use. It\u0026rsquo;s appealing to \u0026ldquo;just\u0026rdquo; use the vlan id, but that\u0026rsquo;s not helpful if the outer tag and the inner tag are the same. So I write a helper function vnet_sw_interface_get_available_subid() whose job it is to return an unused subid for the phy, starting from 1.\nHere as well, the interface plugin can be configured to automatically create LIPs for sub-interfaces, which I have to turn off temporarily to let my new form of creation do its thing. I carefully ensure that the thread barrier is taken/released and the original setting of lcp-auto-subint is restored at all exit points. One cool thing is that the new link\u0026rsquo;s name is given in the Netlink message, so I can just use that one. I like the aesthetic a bit more, because here the operator can give the Linux interface any name they like, where-as in the other direction, VPP\u0026rsquo;s lcp-auto-subint feature has to make up a boring \u0026lt;phy\u0026gt;.\u0026lt;subid\u0026gt; name.\nAlright, without further ado, the code for the main innovation here, the implementation of lcp_nl_link_add_vlan(), is in this [commit].\nResults The functional regression test I made on day one, the one that ensures end-to-end connectivity to and from the Linux host interfaces works for all 5 interface types (untagged, .1q tagged, QinQ, .1ad tagged and QinAD) and for both physical and virtual interfaces (like TenGigabitEthernet3/0/0 and BondEthernet0), still works.\nAfter this code is in, the operator will only have to create a LIP for any phy interfaces, and can rely on the new Netlink Listener plugin and the use of ip in Linux for all the rest. This implementation starts approaching \u0026lsquo;vanilla\u0026rsquo; Linux user experience!\nHere\u0026rsquo;s a new screencast [asciinema, gif] showing me playing around a bit, demonstrating that synchronization works pretty well in both directions, a huge improvement from the [previous asciinema, gif] in my [second post], which was only two weeks ago:\nFurther Work You will note that there\u0026rsquo;s one important Netlink message type that\u0026rsquo;s missing: routes! They are so important in fact, that they\u0026rsquo;re a topic of their very own post. 
Also, I haven\u0026rsquo;t written the code for them yet :-)\nA few things worth noting, as future work.\nMultiple NetNS - The original Netlink Listener (ref) would only listen to the default netns specified in the configuration file. This is problematic because the interface plugin does allow interfaces to be made in other namespaces (by issuing lcp create ... host-if X netns foo), the Netlink world of which will be unknown to VPP. I created struct lcp_nl_netlink_namespace to hold the stuff needed for the Netlink listener, which is a good starting point to create not one but multiple listeners, one for each unique namespace that has one or more LIPs defined. This is version-two work :)\nMultithreading - In testing, I noticed that while my plugin itself are (or seem to be..) thread safe, virtio may not be totally clean, and I noticed that in a multithreaded VPP instance with many workers, there\u0026rsquo;s a crash in lcp_arp_phy_node() where vlib_buffer_copy() returns NULL, which should not happen. When VPP is in such a state, other plugins (notably DHCP and IPv6 ND) also start complaining, and show errors shows millions of virtio-input errors about unavailable buffers. I do confirm though, that running VPP single threaded does not have these issues.\nCredits I\u0026rsquo;d like to make clear that the Linux CP plugin is a collaboration between several great minds, and that my work stands on other software engineer\u0026rsquo;s shoulders. In particular most of the Netlink socket handling and Netlink message queueing was written by Matthew Smith, and I\u0026rsquo;ve had a little bit of help along the way from Neale Ranns and Jon Loeliger. I\u0026rsquo;d like to thank them for their work!\nAppendix Ubuntu config This configuration has been the exact same ever since my first post:\n# Untagged interface ip addr add 10.0.1.2/30 dev enp66s0f0 ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 ip link set enp66s0f0 up mtu 9000 # Single 802.1q tag 1234 ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 ip link set enp66s0f0.q up mtu 9000 ip addr add 10.0.2.2/30 dev enp66s0f0.q ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q # Double 802.1q tag 1234 inner-tag 1000 ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 ip link set enp66s0f0.qinq up mtu 9000 ip addr add 10.0.3.2/30 dev enp66s0f0.qinq ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq # Single 802.1ad tag 2345 ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad ip link set enp66s0f0.ad up mtu 9000 ip addr add 10.0.4.2/30 dev enp66s0f0.ad ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad # Double 802.1ad tag 2345 inner-tag 1000 ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q ip link set enp66s0f0.qinad up mtu 9000 ip addr add 10.0.5.2/30 dev enp66s0f0.qinad ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad ## Bond interface ip link add bond0 type bond mode 802.3ad ip link set enp66s0f2 down ip link set enp66s0f3 down ip link set enp66s0f2 master bond0 ip link set enp66s0f3 master bond0 ip link set enp66s0f2 up ip link set enp66s0f3 up ip link set bond0 up ip addr add 10.1.1.2/30 dev bond0 ip addr add 2001:db8:1:1::2/64 dev bond0 ip link set bond0 up mtu 9000 # Single 802.1q tag 1234 ip link add link bond0 name bond0.q type vlan id 1234 ip link set bond0.q up mtu 9000 ip addr add 10.1.2.2/30 dev bond0.q ip addr add 2001:db8:1:2::2/64 dev bond0.q # Double 802.1q tag 1234 inner-tag 1000 ip link add link bond0.q name bond0.qinq type vlan id 1000 ip link set bond0.qinq 
up mtu 9000 ip addr add 10.1.3.2/30 dev bond0.qinq ip addr add 2001:db8:1:3::2/64 dev bond0.qinq # Single 802.1ad tag 2345 ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad ip link set bond0.ad up mtu 9000 ip addr add 10.1.4.2/30 dev bond0.ad ip addr add 2001:db8:1:4::2/64 dev bond0.ad # Double 802.1ad tag 2345 inner-tag 1000 ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q ip link set bond0.qinad up mtu 9000 ip addr add 10.1.5.2/30 dev bond0.qinad ip addr add 2001:db8:1:5::2/64 dev bond0.qinad VPP config We can whittle down the VPP configuration to the bare minimum:\nvppctl lcp default netns dataplane vppctl lcp lcp-sync on vppctl lcp lcp-auto-subint on ## Create `e0` vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 ## Create `be0` vppctl create bond mode lacp load-balance l34 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 vppctl set interface state TenGigabitEthernet3/0/2 up vppctl set interface state TenGigabitEthernet3/0/3 up vppctl lcp create BondEthernet0 host-if be0 And the rest of the confifuration work is done entirely from the Linux side!\nIP=\u0026#34;sudo ip netns exec dataplane ip\u0026#34; ## `e0` aka TenGigabitEthernet3/0/0 $IP link add link e0 name e0.1234 type vlan id 1234 $IP link add link e0.1234 name e0.1235 type vlan id 1000 $IP link add link e0 name e0.1236 type vlan id 2345 proto 802.1ad $IP link add link e0.1236 name e0.1237 type vlan id 1000 $IP link set e0 up mtu 9000 $IP addr add 10.0.1.1/30 dev e0 $IP addr add 2001:db8:0:1::1/64 dev e0 $IP addr add 10.0.2.1/30 dev e0.1234 $IP addr add 2001:db8:0:2::1/64 dev e0.1234 $IP addr add 10.0.3.1/30 dev e0.1235 $IP addr add 2001:db8:0:3::1/64 dev e0.1235 $IP addr add 10.0.4.1/30 dev e0.1236 $IP addr add 2001:db8:0:4::1/64 dev e0.1236 $IP addr add 10.0.5.1/30 dev e0.1237 $IP addr add 2001:db8:0:5::1/64 dev e0.1237 ## `be0` aka BondEthernet0 $IP link add link be0 name be0.1234 type vlan id 1234 $IP link add link be0.1234 name be0.1235 type vlan id 1000 $IP link add link be0 name be0.1236 type vlan id 2345 proto 802.1ad $IP link add link be0.1236 name be0.1237 type vlan id 1000 $IP link set be0 up mtu 9000 $IP addr add 10.1.1.1/30 dev be0 $IP addr add 2001:db8:1:1::1/64 dev be0 $IP addr add 10.1.2.1/30 dev be0.1234 $IP addr add 2001:db8:1:2::1/64 dev be0.1234 $IP addr add 10.1.3.1/30 dev be0.1235 $IP addr add 2001:db8:1:3::1/64 dev be0.1235 $IP addr add 10.1.4.1/30 dev be0.1236 $IP addr add 2001:db8:1:4::1/64 dev be0.1236 $IP addr add 10.1.5.1/30 dev be0.1237 $IP addr add 2001:db8:1:5::1/64 dev be0.1237 Final note You may have noticed that the [commit] links are all to git commits in my private working copy. I want to wait until my previous work is reviewed and submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the mean time :-)\n","date":"2021-08-25","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. 
This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/08/25/vpp-linux-cp-part4/","section":"articles","title":"VPP Linux CP - Part4"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nIn this third post, I\u0026rsquo;ll be adding a convenience feature that I think will be popular: the plugin will now automatically create or delete LIPs for sub-interfaces where-ever the parent has a LIP configured.\nMy test setup I\u0026rsquo;ve extended the setup from the first post. 
The base configuration for the enp66s0f0 interface remains exactly the same, but I\u0026rsquo;ve also added an LACP bond0 interface, which also has the whole kitten kaboodle of sub-interfaces defined, see below in the Appendix for details, but here\u0026rsquo;s the table again for reference:\nName type Addresses enp66s0f0 untagged 10.0.1.2/30 2001:db8:0:1::2/64 enp66s0f0.q dot1q 1234 10.0.2.2/30 2001:db8:0:2::2/64 enp66s0f0.qinq outer dot1q 1234, inner dot1q 1000 10.0.3.2/30 2001:db8:0:3::2/64 enp66s0f0.ad dot1ad 2345 10.0.4.2/30 2001:db8:0:4::2/64 enp66s0f0.qinad outer dot1ad 2345, inner dot1q 1000 10.0.5.2/30 2001:db8:0:5::2/64 bond0 untagged 10.1.1.2/30 2001:db8:1:1::2/64 bond0.q dot1q 1234 10.1.2.2/30 2001:db8:1:2::2/64 bond0.qinq outer dot1q 1234, inner dot1q 1000 10.1.3.2/30 2001:db8:1:3::2/64 bond0.ad dot1ad 2345 10.1.4.2/30 2001:db8:1:4::2/64 bond0.qinad outer dot1ad 2345, inner dot1q 1000 10.1.5.2/30 2001:db8:1:5::2/64 The goal of this post is to show what code needed to be written and which changes needed to be made to the plugin, in order to automatically create and delete sub-interfaces.\nStartingpoint Based on the state of the plugin after the second post, operators must create LIP instances for interfaces as well as each sub-interface explicitly:\nDBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 DBGvpp# create sub TenGigabitEthernet3/0/0 1234 DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 DBGvpp# create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match DBGvpp# lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235 DBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match DBGvpp# lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 But one might ask \u0026ndash; is it really useful to have L3 interfaces in VPP without a companion interface in an appropriate Linux namespace? I think the answer might be \u0026lsquo;yes\u0026rsquo; for individual interfaces (for example, in a mgmt VRF that has no need to run routing protocols), but I also think the answer is probably \u0026rsquo;no\u0026rsquo; for sub-interfaces, once their parent has a LIP defined.\nConfiguration The original plugin (the one that ships with VPP 21.06) has a configuration flag that seems promising by defining a flag interface-auto-create, but its implementation was never finished. I\u0026rsquo;ve removed that flag and replaced it with a new one. The main reason for this decision is that there are actually two kinds of auto configuration: the first one is detailed in this post, but in the future, I will also make it possible to create VPP interfaces by creating their Linux counterpart (eg. ip link add link e0 name e0.1234 type vlan id 1234 with a configuration statement that might be called netlink-auto-subint), and I\u0026rsquo;d like for the plugin to individually enable/disable both types. Also, I find the name unfortunate, as the feature should create and delete LIPs on sub-interfaces, not just create them. 
So out with the old, in with the new :)\nI have to acknowledge that not everybody will want automagically created interfaces, similar to the original configuration, so I define a new configuration flag called lcp-auto-subint which goes into the linux-cp module configuration stanza in VPP\u0026rsquo;s startup.conf, which might look a little like this:\nlinux-cp { default netns dataplane lcp-auto-subint } Based on this config, I set the startup default in lcp_set_lcp_auto_subint(), but I realize that an administrator may want to turn it on/off at runtime, too, so I add a CLI getter/setter that interacts with the flag in this [commit]:\nDBGvpp# show lcp lcp default netns dataplane lcp lcp-auto-subint on lcp lcp-sync off DBGvpp# lcp lcp-auto-subint off DBGvpp# show lcp lcp default netns dataplane lcp lcp-auto-subint off lcp lcp-sync off The prep work for the rest of the interface syncer starts with this [commit], and for the rest of this blog post, the behavior will be in the \u0026lsquo;on\u0026rsquo; position.\nThe code for the configuration toggle is in this [commit].\nAuto create/delete sub-interfaces The original plugin code (that ships with VPP 21.06) made a start by defining a function called lcp_itf_phy_add() and registering an intent with VNET_SW_INTERFACE_ADD_DEL_FUNCTION(). I\u0026rsquo;ve moved the function to the source file I created in Part 2 (called lcp_if_sync.c), specifically to handle interface syncing, and gave it a name that matches the VPP callback, so lcp_itf_interface_add_del().\nThe logic of that function is pretty straight forward. I want to only continue if lcp-auto-subint is set, and I only want to create or delete sub-interfaces, not parents. This way, the operator can decide on a per-interface basis if they want it to participate in Linux (eg, issuing lcp create BondEthernet0 host-if be0). After I\u0026rsquo;ve established that (a) the caller wants auto-creation/auto-deletion, and (b) we\u0026rsquo;re fielding a callback for a sub-int, all I must do is:\nOn creation: does the parent interface sw-\u0026gt;sup_sw_if_index have a LIP? If yes, let\u0026rsquo;s create a LIP for this sub-interface, too. We determine that Linux interface name by taking the parent name (say, be0), and sticking the sub-int number after it, like be0.1234. On deletion: does this sub-interface we\u0026rsquo;re fielding the callback for have a LIP? If yes, then delete it. I noticed that interface deletion had a bug (one that I fell victim to as well: it does not remove the netlink device in the correct network namespace), which I fixed.\nThe code for the auto create/delete and the bugfix is in this [commit].\nFurther Work One other thing I noticed (and this is actually a bug!) is that on BondEthernet interfaces, upon creation a temporary MAC is assigned, which is subsequently overwritten by the first physical interface that is added to the bond, which means that when a LIP is created before the first interface is added, its MAC will be the temporary MAC. Compare:\nvppctl create bond mode lacp load-balance l2 vppctl lcp create BondEthernet0 host-if be0 ## MAC of be0 is now a temp/ephemeral MAC vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 ## MAC of the BondEthernet0 device is now that of TenGigabitEthernet3/0/2 ## MAC of TenGigabitEthernet3/0/3 is that of BondEthernet0 (ie. Te3/0/2) In such a situation, be0 will not be reachable unless it\u0026rsquo;s manually set to the correct MAC. 
I looked around but found no callback of event handler for MAC address changes in VPP \u0026ndash; so I should add one probably, but in the mean time, I\u0026rsquo;ll just add interfaces to the bond before creating the LIP, like so:\nvppctl create bond mode lacp load-balance l2 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 ## MAC of the BondEthernet0 device is now that of TenGigabitEthernet3/0/2 vppctl lcp create BondEthernet0 host-if be0 ## MAC of be0 is now that of BondEthernet0 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 ## MAC of TenGigabitEthernet3/0/3 is that of BondEthernet0 (ie. Te3/0/2) .. which is an adequate workaround for now.\nResults After this code is in, the operator will only have to create a LIP for the main interfaces, and the plugin will take care of the rest!\npim@hippo:~/src/lcpng$ grep \u0026#39;create\u0026#39; config3.sh vppctl lcp lcp-auto-subint on vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 vppctl create sub TenGigabitEthernet3/0/0 1234 vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match vppctl create bond mode lacp load-balance l2 vppctl lcp create BondEthernet0 host-if be0 vppctl create sub BondEthernet0 1234 vppctl create sub BondEthernet0 1235 dot1q 1234 inner-dot1q 1000 exact-match vppctl create sub BondEthernet0 1236 dot1ad 2345 exact-match vppctl create sub BondEthernet0 1237 dot1ad 2345 inner-dot1q 1000 exact-match And as an end-to-end functional validation, now extended as well to ping the Ubuntu machine over the LACP interface and all of its subinterfaces, works like a charm:\npim@hippo:~/src/lcpng$ sudo ip netns exec dataplane ip link | grep e0 1063: e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 1064: be0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 209: e0.1234@e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 210: e0.1235@e0.1234: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 211: e0.1236@e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 212: e0.1237@e0.1236: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 213: be0.1234@be0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 214: be0.1235@be0.1234: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 215: be0.1236@be0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 216: be0.1237@be0.1236: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 # The TenGigabitEthernet3/0/0 (e0) interfaces pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 10.0.1.2 is alive 10.0.2.2 is alive 10.0.3.2 is alive 10.0.4.2 is alive 10.0.5.2 is alive pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \\ 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 2001:db8:0:1::2 is alive 
2001:db8:0:2::2 is alive 2001:db8:0:3::2 is alive 2001:db8:0:4::2 is alive 2001:db8:0:5::2 is alive ## The BondEthernet0 (be0) interfaces pim@hippo:~/src/lcpng$ fping 10.1.1.2 10.1.2.2 10.1.3.2 10.1.4.2 10.1.5.2 10.1.1.2 is alive 10.1.2.2 is alive 10.1.3.2 is alive 10.1.4.2 is alive 10.1.5.2 is alive pim@hippo:~/src/lcpng$ fping6 2001:db8:1:1::2 2001:db8:1:2::2 \\ 2001:db8:1:3::2 2001:db8:1:4::2 2001:db8:1:5::2 2001:db8:1:1::2 is alive 2001:db8:1:2::2 is alive 2001:db8:1:3::2 is alive 2001:db8:1:4::2 is alive 2001:db8:1:5::2 is alive Credits I\u0026rsquo;d like to make clear that the Linux CP plugin is a great collaboration between several great folks and that my work stands on their shoulders. I\u0026rsquo;ve had a little bit of help along the way from Neale Ranns, Matthew Smith and Jon Loeliger, and I\u0026rsquo;d like to thank them for their work!\nAppendix Ubuntu config # Untagged interface ip addr add 10.0.1.2/30 dev enp66s0f0 ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 ip link set enp66s0f0 up mtu 9000 # Single 802.1q tag 1234 ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 ip link set enp66s0f0.q up mtu 9000 ip addr add 10.0.2.2/30 dev enp66s0f0.q ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q # Double 802.1q tag 1234 inner-tag 1000 ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 ip link set enp66s0f0.qinq up mtu 9000 ip addr add 10.0.3.2/30 dev enp66s0f0.qinq ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq # Single 802.1ad tag 2345 ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad ip link set enp66s0f0.ad up mtu 9000 ip addr add 10.0.4.2/30 dev enp66s0f0.ad ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad # Double 802.1ad tag 2345 inner-tag 1000 ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q ip link set enp66s0f0.qinad up mtu 9000 ip addr add 10.0.5.2/30 dev enp66s0f0.qinad ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad ## Bond interface ip link add bond0 type bond mode 802.3ad ip link set enp66s0f2 down ip link set enp66s0f3 down ip link set enp66s0f2 master bond0 ip link set enp66s0f3 master bond0 ip link set enp66s0f2 up ip link set enp66s0f3 up ip link set bond0 up ip addr add 10.1.1.2/30 dev bond0 ip addr add 2001:db8:1:1::2/64 dev bond0 ip link set bond0 up mtu 9000 # Single 802.1q tag 1234 ip link add link bond0 name bond0.q type vlan id 1234 ip link set bond0.q up mtu 9000 ip addr add 10.1.2.2/30 dev bond0.q ip addr add 2001:db8:1:2::2/64 dev bond0.q # Double 802.1q tag 1234 inner-tag 1000 ip link add link bond0.q name bond0.qinq type vlan id 1000 ip link set bond0.qinq up mtu 9000 ip addr add 10.1.3.2/30 dev bond0.qinq ip addr add 2001:db8:1:3::2/64 dev bond0.qinq # Single 802.1ad tag 2345 ip link add link bond0 name bond0.ad type vlan id 2345 proto 802.1ad ip link set bond0.ad up mtu 9000 ip addr add 10.1.4.2/30 dev bond0.ad ip addr add 2001:db8:1:4::2/64 dev bond0.ad # Double 802.1ad tag 2345 inner-tag 1000 ip link add link bond0.ad name bond0.qinad type vlan id 1000 proto 802.1q ip link set bond0.qinad up mtu 9000 ip addr add 10.1.5.2/30 dev bond0.qinad ip addr add 2001:db8:1:5::2/64 dev bond0.qinad VPP config ## No more `lcp create` commands for sub-interfaces. 
vppctl lcp default netns dataplane vppctl lcp lcp-auto-subint on vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 vppctl set interface state TenGigabitEthernet3/0/0 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 vppctl create sub TenGigabitEthernet3/0/0 1234 vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234 vppctl set interface state TenGigabitEthernet3/0/0.1234 up vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64 vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1235 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235 vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64 vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1236 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236 vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64 vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1237 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237 vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64 ## The LACP bond vppctl create bond mode lacp load-balance l2 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/2 vppctl bond add BondEthernet0 TenGigabitEthernet3/0/3 vppctl lcp create BondEthernet0 host-if be0 vppctl set interface state TenGigabitEthernet3/0/2 up vppctl set interface state TenGigabitEthernet3/0/3 up vppctl set interface state BondEthernet0 up vppctl set interface mtu packet 9000 BondEthernet0 vppctl set interface ip address BondEthernet0 10.1.1.1/30 vppctl set interface ip address BondEthernet0 2001:db8:1:1::1/64 vppctl create sub BondEthernet0 1234 vppctl set interface mtu packet 9000 BondEthernet0.1234 vppctl set interface state BondEthernet0.1234 up vppctl set interface ip address BondEthernet0.1234 10.1.2.1/30 vppctl set interface ip address BondEthernet0.1234 2001:db8:1:2::1/64 vppctl create sub BondEthernet0 1235 dot1q 1234 inner-dot1q 1000 exact-match vppctl set interface state BondEthernet0.1235 up vppctl set interface mtu packet 9000 BondEthernet0.1235 vppctl set interface ip address BondEthernet0.1235 10.1.3.1/30 vppctl set interface ip address BondEthernet0.1235 2001:db8:1:3::1/64 vppctl create sub BondEthernet0 1236 dot1ad 2345 exact-match vppctl set interface state BondEthernet0.1236 up vppctl set interface mtu packet 9000 BondEthernet0.1236 vppctl set interface ip address BondEthernet0.1236 10.1.4.1/30 vppctl set interface ip address BondEthernet0.1236 2001:db8:1:4::1/64 vppctl create sub BondEthernet0 1237 dot1ad 2345 inner-dot1q 1000 exact-match vppctl set interface state BondEthernet0.1237 up vppctl set interface mtu packet 9000 BondEthernet0.1237 vppctl set interface ip address BondEthernet0.1237 10.1.5.1/30 vppctl set interface ip address BondEthernet0.1237 2001:db8:1:5::1/64 Final note You may have 
noticed that the [commit] links are all git commits in my private working copy. I want to wait until my previous work is reviewed and submitted before piling on more changes. Feel free to contact vpp-dev@ for more information in the mean time :-)\n","date":"2021-08-15","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/08/15/vpp-linux-cp-part3/","section":"articles","title":"VPP Linux CP - Part3"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nIn this second post, let\u0026rsquo;s make the plugin a bit more useful by making it copy forward state changes to interfaces in VPP, into their Linux CP counterparts.\nMy test setup I\u0026rsquo;m using the same setup from the previous post. The goal of this post is to show what code needed to be written and which changes needed to be made to the plugin, in order to propagate changes to VPP interfaces to the Linux TAP devices.\nStartingpoint The linux-cp plugin that ships with VPP 21.06, even with my changes is still only able to create LIP devices. 
It\u0026rsquo;s not very user friendly to have to apply state changes meticulously on both sides, but it can be done:\nvppctl lcp create TenGigabitEthernet3/0/0 host-if e0 vppctl set interface state TenGigabitEthernet3/0/0 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 ip link set e0 up ip link set e0 mtu 9000 ip addr add 10.0.1.1/30 dev e0 ip addr add 2001:db8:0:1::1/64 dev e0 In this snippet, we can see that after creating the LIP, thus conjuring up the unconfigured e0 interface in Linux, I changed the VPP interface in three ways:\nI set the state of the VPP interface to \u0026lsquo;up\u0026rsquo; I set the MTU of the VPP interface to 9000 I add an IPv4 and IPv6 address to the interface Because state does not (yet) propagate, I have to make those changes as well on the Linux side with the subsequent ip commands.\nConfiguration I can imagine that operators want to have more control and facilitate the Linux and VPP changes themselves. This is why I\u0026rsquo;ll start off by adding a variable called lcp_sync, along with a startup configuration keyword and a CLI setter. This allows me to turn the whole sync behavior on and off, for example in startup.conf:\nlinux-cp { default netns dataplane lcp-sync } And in the CLI:\nDBGvpp# show lcp lcp default netns dataplane lcp lcp-sync on DBGvpp# lcp lcp-sync off DBGvpp# show lcp lcp default netns dataplane lcp lcp-sync off The prep work for the rest of the interface syncer starts with this [commit], and for the rest of this blog post, the behavior will be in the \u0026lsquo;on\u0026rsquo; position.\nChange interface: state Immediately, I find a dissonance between VPP and Linux: When Linux sets a parent interface down, all children go to state M-DOWN. When Linux sets a parent interface up, all of its children automatically go to state UP and LOWER_UP. To illustrate:\nip link set enp66s0f1 down ip link add link enp66s0f1 name foo type vlan id 1234 ip link set foo down ## Both interfaces are down, which makes sense because I set them both down ip link | grep enp66s0f1 9: enp66s0f1: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000 61: foo@enp66s0f1: \u0026lt;BROADCAST,MULTICAST,M-DOWN\u0026gt; mtu 9000 qdisc noop state DOWN mode DEFAULT group default qlen 1000 ip link set enp66s0f1 up ip link | grep enp66s0f1 ## Both interfaces are up, which doesn\u0026#39;t make sense because I only changed one of them! 9: enp66s0f1: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000 61: foo@enp66s0f1: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 VPP does not work this way. In VPP, the admin state of each interface is individually controllable, so it\u0026rsquo;s possible to bring up the parent while leaving the sub-interface in the state it was. I did notice that you can\u0026rsquo;t bring up a sub-interface if its parent is down, which I found counterintuitive, but that\u0026rsquo;s neither here nor there.\nAll of this is to say that we have to be careful when copying state forward, because as this [commit] shows, issuing set int state ... 
up on an interface, won\u0026rsquo;t touch its sub-interfaces in VPP, but the subsequent netlink message to bring the LIP for that interface up, will update the children, thus desynchronising Linux and VPP: Linux will have interface and all its sub-interfaces up unconditionally; VPP will have the interface up and its sub-interfaces in whatever state they were before.\nTo address this, a second [commit] was needed. I\u0026rsquo;m not too sure I want to keep this behavior, but for now, it results in an intuitive end-state, which is that all interfaces states are exactly the same between Linux and VPP.\nDBGvpp# create sub TenGigabitEthernet3/0/0 10 DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 DBGvpp# lcp create TenGigabitEthernet3/0/0.10 host-if e0.10 DBGvpp# set int state TenGigabitEthernet3/0/0 up ## Correct: parent is up, sub-int is not 694: e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 695: e0.10@e0: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 DBGvpp# set int state TenGigabitEthernet3/0/0.10 up ## Correct: both interfaces up 694: e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 695: e0.10@e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 DBGvpp# set int state TenGigabitEthernet3/0/0 down DBGvpp# set int state TenGigabitEthernet3/0/0.10 down DBGvpp# set int state TenGigabitEthernet3/0/0 up ## Correct: only the parent is up 694: e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 695: e0.10@e0: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 9000 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000 Change interface: MTU Finally, a straight forward [commit], or so I thought. When the MTU changes in VPP (with set interface mtu packet N \u0026lt;int\u0026gt;), there is callback that can be registered which copies this into the LIP. I did notice a specific corner case: In VPP, a sub-interface can have a larger MTU than its parent. In Linux, this cannot happen, so the following remains problematic:\nDBGvpp# create sub TenGigabitEthernet3/0/0 10 DBGvpp# set int mtu packet 1500 TenGigabitEthernet3/0/0 DBGvpp# set int mtu packet 9000 TenGigabitEthernet3/0/0.10 ## Incorrect: sub-int has larger MTU than parent, valid in VPP, not in Linux 694: e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 695: e0.10@e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000 I think the best way to ensure this works is to clamp the sub-int to a maximum MTU of that of its parent, and revert the user\u0026rsquo;s request to change the VPP sub-int to anything higher than that, perhaps logging an error explaining why. This means two things:\nAny change in VPP of a child MTU to larger than its parent, must be reverted. Any change in VPP of a parent MTU should ensure all children are clamped to at most that. I addressed the issue in this [commit].\nChange interface: IP Addresses There are three scenarios in which IP addresses will need to be copied from VPP into the companion Linux devices:\nset interface ip address adds an IPv4 or IPv6 address. 
This is handled by lcp_itf_ip[46]_add_del_interface_addr() which is a callback installed in lcp_itf_pair_init() at plugin initialization time. set interface ip address del removes addresses. This is also handled by lcp_itf_ip[46]_add_del_interface_addr() but curiously there is no upstream vnet_netlink_del_ip[46]_addr() so I had to write them inline here. I will try to get them upstreamed, as they appear to be obvious companions in vnet/device/netlink.h. This one is easy to overlook, but upon LIP creation, it could be that there are already L3 addresses present on the VPP interface. If so, set them in the LIP with lcp_itf_set_interface_addr(). This means with this [commit], at any time a new LIP is created, the IPv4 and IPv6 address on the VPP interface are fully copied over by the third change, while at runtime, new addresses can be set/removed as well by the first and second change.\nFurther work I noticed that Bird periodically scans the Linux interface list and (re)learns information from them. I have a suspicion that such a feature might be useful in the VPP plugin as well: I can imagine a periodical process that walks over the LIP interface list, and compares what it finds in Linux with what is configured in VPP. What\u0026rsquo;s not entirely clear to me is which direction should \u0026rsquo;trump\u0026rsquo;, that is, should the Linux state be forced into VPP, or should the VPP state be forced into Linux? I don\u0026rsquo;t yet have a good feeling of the answer, so I\u0026rsquo;ll punt on that for now.\nResults After applying the configuration to VPP (in Appendix), here\u0026rsquo;s the results:\npim@hippo:~/src/lcpng$ ip ro default via 194.1.163.65 dev enp6s0 proto static 10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1 10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1 10.0.3.0/30 dev e0.1235 proto kernel scope link src 10.0.3.1 10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1 10.0.5.0/30 dev e0.1237 proto kernel scope link src 10.0.5.1 194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88 pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 10.0.1.2 is alive 10.0.2.2 is alive 10.0.3.2 is alive 10.0.4.2 is alive 10.0.5.2 is alive pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \\ 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 2001:db8:0:1::2 is alive 2001:db8:0:2::2 is alive 2001:db8:0:3::2 is alive 2001:db8:0:4::2 is alive 2001:db8:0:5::2 is alive In case you were wondering: my previous post ended in the same huzzah moment. It did.\nThe difference is that now the VPP configuration is much shorter! Comparing the Appendix from this post with my first post, after all of this work I no longer have to manually copy the configuration (like link states, MTU changes, IP addresses) from VPP into Linux, instead the plugin does all of this work for me, and I can configure both sides entirely with vppctl commands!\nBonus screencast! Humor me as I take the code out for a six minute screencast [asciinema, gif] :-)\nCredits I\u0026rsquo;d like to make clear that the Linux CP plugin is a great collaboration between several great folks and that my work stands on their shoulders. 
I\u0026rsquo;ve had a little bit of help along the way from Neale Ranns, Matthew Smith and Jon Loeliger, and I\u0026rsquo;d like to thank them for their work!\nAppendix Ubuntu config # Untagged interface ip addr add 10.0.1.2/30 dev enp66s0f0 ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 ip link set enp66s0f0 up mtu 9000 # Single 802.1q tag 1234 ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 ip link set enp66s0f0.q up mtu 9000 ip addr add 10.0.2.2/30 dev enp66s0f0.q ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q # Double 802.1q tag 1234 inner-tag 1000 ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 ip link set enp66s0f0.qinq up mtu 9000 ip addr add 10.0.3.3/30 dev enp66s0f0.qinq ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq # Single 802.1ad tag 2345 ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad ip link set enp66s0f0.ad up mtu 9000 ip addr add 10.0.4.2/30 dev enp66s0f0.ad ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad # Double 802.1ad tag 2345 inner-tag 1000 ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q ip link set enp66s0f0.qinad up mtu 9000 ip addr add 10.0.5.2/30 dev enp66s0f0.qinad ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad VPP config ## Look mom, no `ip` commands!! :-) vppctl set interface state TenGigabitEthernet3/0/0 up vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 vppctl create sub TenGigabitEthernet3/0/0 1234 vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234 vppctl lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 vppctl set interface state TenGigabitEthernet3/0/0.1234 up vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64 vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1235 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235 vppctl lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235 vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64 vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1236 up vppctl lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236 vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64 vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1237 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237 vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64 vppctl lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 Final note You may have noticed that the [commit] links are all git commits in my private working copy. I want to wait until my previous work is reviewed and submitted before piling on more changes. 
Feel free to contact vpp-dev@ for more information in the mean time :-)\n","date":"2021-08-13","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/08/13/vpp-linux-cp-part2/","section":"articles","title":"VPP Linux CP - Part2"},{"contents":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\nIn this first post, let\u0026rsquo;s take a look at tablestakes: making a copy of VPP\u0026rsquo;s interfaces appear in the Linux kernel.\nMy test setup I took two AMD64 machines, each with 32GB of memory and one Intel X710-DA4 network card (which offers four SFP+ cages), and installed Ubuntu 20.04 on them. I connected each of the network ports back to back with DAC cables. This gives me plenty of interfaces to play with. On the vanilla Ubuntu machine, I created a bunch of different types of interfaces and configured IPv4 and IPv6 addresses on them.\nThe goal of this post is to show what code needed to be written and which changes needed to be made to the plugin, in order to mirror each type of interface from VPP into a valid Linux interface. 
As we\u0026rsquo;ll see, marrying the Linux network interface approach with the VPP interface approach can be tricky! Throughout this post, the vanilla Ubuntu machine will keep the following configuration, the config file of which you can see in the Appendix:\nName type Addresses\nenp66s0f0 untagged 10.0.1.2/30 2001:db8:0:1::2/64\nenp66s0f0.q dot1q 1234 10.0.2.2/30 2001:db8:0:2::2/64\nenp66s0f0.qinq outer dot1q 1234, inner dot1q 1000 10.0.3.2/30 2001:db8:0:3::2/64\nenp66s0f0.ad dot1ad 2345 10.0.4.2/30 2001:db8:0:4::2/64\nenp66s0f0.qinad outer dot1ad 2345, inner dot1q 1000 10.0.5.2/30 2001:db8:0:5::2/64\nThis configuration will allow me to ensure that all common types of sub-interface are supported by the plugin.\nStartingpoint The linux-cp plugin that ships with VPP 21.06, when initialized with the desired startup config (see Appendix), will yield this (Hippo is the machine that runs my development branch of VPP, it\u0026rsquo;s called like that because it\u0026rsquo;s always hungry for packets):\npim@hippo:~/src/lcpng$ ip ro default via 194.1.163.65 dev enp6s0 proto static 10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1 10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1 10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1 194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88 pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 10.0.1.2 is alive 10.0.2.2 is alive 10.0.3.2 is unreachable 10.0.4.2 is unreachable 10.0.5.2 is unreachable pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \\ 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 2001:db8:0:1::2 is alive 2001:db8:0:2::2 is alive 2001:db8:0:3::2 is unreachable 2001:db8:0:4::2 is unreachable 2001:db8:0:5::2 is unreachable Yikes! So the plugin really only knows how to handle untagged interfaces, and sub-interfaces with one dot1q VLAN tag. The other three scenarios (dot1ad VLAN tag; dot1q in dot1q; and dot1q in dot1ad) are not ok. And, curiously, the dot1ad 2345 exact-match interface was created (as linux interface e0.1236), but it doesn\u0026rsquo;t ping, and I\u0026rsquo;ll show you why :-) But principally: let\u0026rsquo;s fix this plugin!\nAnatomy of Linux Interface Pairs In VPP, the plumbing to the Linux kernel is done via a TUN/TAP interface. For L3 interfaces, TAP is used. This TAP appears in the Linux network namespace as a device with which you can interact. From the Linux point of view, on egress, all packets coming from the host into the TAP are cross-connected directly to the logical VPP network interface. In VPP, on ingress, packets destined for an L3 address on any VPP interface, as well as packets that are multicast, are punted into the TAP, which makes them appear in the kernel.\nIn VPP, a linux interface pair (LIP for short) is therefore a tuple { vpp_phy_idx, vpp_tap_idx, netlink_idx }. Creating one of these is the art of first creating a tap, and associating it with the vpp_phy, copying traffic from it into the dataplane, and punting traffic from the dataplane into the TAP so that Linux can see it. 
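To make that punt path tangible, here is a quick spot check using nothing but plain Linux tooling (it assumes the addresses from the Appendix are already configured): run tcpdump on the Linux side of the pair on Hippo, and ping the VPP-side address from the Ubuntu machine on the other end of the link.\nsudo tcpdump -eni e0 arp or icmp\nping -c 3 10.0.1.1\nThe first command runs on Hippo, the second on the Ubuntu box. If the pair is wired up correctly, the ARP request and the ICMP echos that VPP punts show up on e0, and the kernel\u0026rsquo;s replies disappear back into the dataplane on their way out.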
The plugin exposes an API endpoint that creates, deletes and lists these linux interface pairs:\nlcp create \u0026lt;sw_if_index\u0026gt;|\u0026lt;if-name\u0026gt; host-if \u0026lt;host-if-name\u0026gt; netns \u0026lt;namespace\u0026gt; [tun] lcp delete \u0026lt;sw_if_index\u0026gt;|\u0026lt;if-name\u0026gt; show lcp [phy \u0026lt;interface\u0026gt;] If you\u0026rsquo;re still with me, congratulations, because this is where it starts to get fun!\nCreate interface: physical The easiest interface type is a physical one. Here, the plugin will create a TAP, copy the MAC address from the PHY, and set a bunch of attributes on the TAP, such as MTU and link state. Here, I made my first set of changes (in [patchset 3]) to the plugin:\nInitialize the link state of the VPP interface, not unconditionally set it to \u0026lsquo;down\u0026rsquo;. Initialize the MTU of the VPP interface into the TAP, do not assume it is the VPP default of 9000; if the MTU is not known, assume the TAP has 9216, the largest possible on ethernet. Taking a look:\nDBGvpp# show int TenGigabitEthernet3/0/0 Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count TenGigabitEthernet3/0/0 1 down 9000/0/0/0 DBGvpp# lcp create TenGigabitEthernet3/0/0 host-if e0 DBGvpp# show tap tap1 Interface: tap1 (ifindex 7) name \u0026#34;e0\u0026#34; host-ns \u0026#34;(nil)\u0026#34; host-mtu-size \u0026#34;9000\u0026#34; host-mac-addr: 68:05:ca:32:46:14 ... DBGvpp# set interface state TenGigabitEthernet3/0/1 up DBGvpp# set interface mtu packet 1500 TenGigabitEthernet3/0/1 DBGvpp# lcp create TenGigabitEthernet3/0/1 host-if e1 And in Linux, unceremoniously, both interfaces appear:\npim@hippo:~/src/lcpng$ ip link show e0 291: e0: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000 link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff pim@hippo:~/src/lcpng$ ip link show e1 307: e1: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 1500 qdisc mq state UNKNOWN mode DEFAULT group default qlen 1000 link/ether 68:05:ca:32:46:15 brd ff:ff:ff:ff:ff:ff The MAC address from the physical interface show hardware-interface TenGigabitEthernet3/0/0 corresponds to the one seen in the TAP, and the one seen in the Linux interface we just created. The Linux interfaces respect the MTU and link state of their counterpart VPP interfaces (e0 is down at 9000b, e1 is up at 1500b).\nCreate interface: dot1q Note that creating an ethernet sub-interface in VPP takes the following form:\ncreate sub-interfaces \u0026lt;interface\u0026gt; {\u0026lt;subId\u0026gt; [default|untagged]} | {\u0026lt;subId\u0026gt;-\u0026lt;subId\u0026gt;} | {\u0026lt;subId\u0026gt; dot1q|dot1ad \u0026lt;vlanId\u0026gt;|any [inner-dot1q \u0026lt;vlanId\u0026gt;|any] [exact-match]} Here, I\u0026rsquo;ll start with the simplest form, canonically called a .1q VLAN or a tagged interface. The plugin handles it just fine, with a codepath that first creates a sub-interface on the parent\u0026rsquo;s TAP, forwards traffic to/from the VPP subinterface into the parent TAP, asks the Linux kernel to create a new interface of type vlan with the id set to the dot1q tag, as a child of the e0 interface. Note however the exact-match keyword, which is very important. In VPP, without setting exact-match, any ethernet frame that matches the sub-interface expression, will be handled by it. This means the VLAN with tag 1234, but also a stacked (Q-in-Q or Q-in-AD) VLAN with the outer tag set to 1234 will match. 
This is nonsensical for an IP interface, and as such the first two examples will successfully create, but the third example will crash the plugin:\n## Good, shorthand sets exact-match DBGvpp# create sub TenGigabitEthernet3/0/0 1234 DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 ## Good, explicitly set exact-match DBGvpp# create sub TenGigabitEthernet3/0/0 1234 dot1q 1234 exact-match DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 ## Bad, will crash DBGvpp# create sub TenGigabitEthernet3/0/0 1234 dot1q 1234 DBGvpp# lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 The reason is that the first call is a shorthand: it creates sub-int 1234 as dot1q 1234 exact-match, which is literally what the second example does, while the third example creates a non-exact-match sub-int 1234 with dot1q 1234. So I changed the behavior to explicitly reject sub-interfaces that are not exact-match in [patchset 4]. Actually, it turns out that VPP upstream also crashes on setting an ip address on a sub-int that is not configured with exact-match, so I fixed that upstream in this [gerrit] too.\nCreate interface: dot1ad While by far 802.1q VLAN interfaces are the most used, there\u0026rsquo;s a lesser known sibling called 802.1ad \u0026ndash; the only difference is that VLAN ethernet frames with .1q use the well known 0x8100 ethernet type (called a tag protocol identifier, or TPID), while .1ad uses a lesser known 0x88a8 type. In the beginning, Q-in-Q was suggested to use the 0x88a8 tag for the outer type, and 0x8100 for the inner type, differentiating the two. But the industry was conflicted, and many vendors chose to use 0x8100 for both inner- and outer-types. VPP supports it and so does Linux, so let\u0026rsquo;s implement it in [patchset 5]. Without this change, the plugin would create the interface, but it would invariably create it as .1q on the linux side, which explains why the e0.1236 interface exists but doesn\u0026rsquo;t ping in my startingpoint above. Now we have the expected behavior:\nDBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 pim@hippo:~/src/lcpng$ ping 10.0.4.2 PING 10.0.4.2 (10.0.4.2) 56(84) bytes of data. 64 bytes from 10.0.4.2: icmp_seq=1 ttl=64 time=0.58 ms 64 bytes from 10.0.4.2: icmp_seq=2 ttl=64 time=0.57 ms 64 bytes from 10.0.4.2: icmp_seq=3 ttl=64 time=0.62 ms 64 bytes from 10.0.4.2: icmp_seq=4 ttl=64 time=0.67 ms ^C --- 10.0.4.2 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 3005ms rtt min/avg/max/mdev = 0.566/0.608/0.672/0.041 ms Create interface: dot1q in dot1ad This is the original Q-in-Q as it was intended. Frames here carry an outer ethernet TPID of 0x88a8 (dot1ad) which is followed by an inner ethernet TPID of 0x8100 (dot1q). Of course, untagged inner frames are also possible - they show up as simply one ethernet TPID of dot1ad followed directly by the L3 payload. Here, things get a bit more tricky. On the VPP side, we can simply create the sub-interface directly; but on the Linux side, we cannot do that. This is because in VPP, all sub-interfaces are directly parented by their physical interface, while in Linux, the interfaces are stacked on one another. 
Compare:\n### VPP idiomatic q-in-ad (1 interface) DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match ### Linux idiomatic q-in-ad stack (2 interfaces) ip link add link e0 name e0.2345 type vlan id 2345 proto 802.1ad ip link add link e0.2345 name e0.2345.1000 type vlan id 1000 proto 802.1q So in order to create Q-in-AD sub-interfaces, for Linux their intermediary parent must exist, while in VPP this is not necessary. I have to make a compromise, so I\u0026rsquo;ll be a bit more explicit and allow this type of LIP to be created only under these conditions:\nA sub-int exists with the intermediary (in this case, dot1ad 2345 exact-match) That sub-int itself has a LIP, with a Linux interface device that we can spawn the inner interface off of If these conditions don\u0026rsquo;t hold, I reject the request. If they do, I create an interface pair:\nDBGvpp# create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match DBGvpp# create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match DBGvpp# lcp create TenGigabitEthernet3/0/0.1236 host-if e0.1236 DBGvpp# lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 pim@hippo:~/src/lcpng$ ip link show e0.1236 375: e0.1236@e0: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff pim@hippo:~/src/lcpng$ ip link show e0.1237 376: e0.1237@e0.1236: \u0026lt;BROADCAST,MULTICAST,UP,LOWER_UP\u0026gt; mtu 9000 qdisc noqueue state UP mode DEFAULT group default qlen 1000 link/ether 68:05:ca:32:46:14 brd ff:ff:ff:ff:ff:ff Here, e0.1237 was indeed created as a child of e0.1236, which in turn was created as a child of e0.\nThe code for this is in [patchset 6].\nCreate interface: dot1q in dot1q Given the change above, this is an entirely obvious capability that the plugin now handles, but I did find a failure mode, when I tried to create a LIP for a sub-interface when there are no LIPs created. It causes a NULL deref when trying to look up the LIP of the parent (which doesn\u0026rsquo;t yet have a LIP defined). I fixed that in this [patchset 7].\nResults After applying the configuration to VPP (in Appendix), here\u0026rsquo;s the results:\npim@hippo:~/src/lcpng$ ip ro default via 194.1.163.65 dev enp6s0 proto static 10.0.1.0/30 dev e0 proto kernel scope link src 10.0.1.1 10.0.2.0/30 dev e0.1234 proto kernel scope link src 10.0.2.1 10.0.3.0/30 dev e0.1235 proto kernel scope link src 10.0.3.1 10.0.4.0/30 dev e0.1236 proto kernel scope link src 10.0.4.1 10.0.5.0/30 dev e0.1237 proto kernel scope link src 10.0.5.1 194.1.163.64/27 dev enp6s0 proto kernel scope link src 194.1.163.88 pim@hippo:~/src/lcpng$ fping 10.0.1.2 10.0.2.2 10.0.3.2 10.0.4.2 10.0.5.2 10.0.1.2 is alive 10.0.2.2 is alive 10.0.3.2 is alive 10.0.4.2 is alive 10.0.5.2 is alive pim@hippo:~/src/lcpng$ fping6 2001:db8:0:1::2 2001:db8:0:2::2 \\ 2001:db8:0:3::2 2001:db8:0:4::2 2001:db8:0:5::2 2001:db8:0:1::2 is alive 2001:db8:0:2::2 is alive 2001:db8:0:3::2 is alive 2001:db8:0:4::2 is alive 2001:db8:0:5::2 is alive As can be seen, all interface types ping. 
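A small extra check on top of the pings: ask Linux which VLAN protocol each sub-interface was created with, since the outer TPID is exactly what the unpatched plugin got wrong for dot1ad. The detail flag of iproute2 shows it:\nip -d link show e0.1236\nip -d link show e0.1237\nThe -d output should report vlan protocol 802.1ad id 2345 for e0.1236, and vlan protocol 802.1Q id 1000 for e0.1237 stacked on top of it; if the outer interface showed up as plain 802.1Q, we would be right back at the broken startingpoint from the top of this post.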
Mirroring interfaces from VPP to Linux is now done!\nWe still have to manually copy the configuration (like link states, MTU changes, IP addresses and routes) from VPP into Linux, and of course it would be great if we could mirror those states also into Linux, and this is exactly the topic of my next post.\nCredits I\u0026rsquo;d like to make clear that the Linux CP plugin is a great collaboration between several great folks and that my work stands on their shoulders. I\u0026rsquo;ve had a little bit of help along the way from Neale Ranns, Matthew Smith and Jon Loeliger, and I\u0026rsquo;d like to thank them for their work!\nAppendix Ubuntu config # Untagged interface ip addr add 10.0.1.2/30 dev enp66s0f0 ip addr add 2001:db8:0:1::2/64 dev enp66s0f0 ip link set enp66s0f0 up mtu 9000 # Single 802.1q tag 1234 ip link add link enp66s0f0 name enp66s0f0.q type vlan id 1234 ip link set enp66s0f0.q up mtu 9000 ip addr add 10.0.2.2/30 dev enp66s0f0.q ip addr add 2001:db8:0:2::2/64 dev enp66s0f0.q # Double 802.1q tag 1234 inner-tag 1000 ip link add link enp66s0f0.q name enp66s0f0.qinq type vlan id 1000 ip link set enp66s0f0.qinq up mtu 9000 ip addr add 10.0.3.3/30 dev enp66s0f0.qinq ip addr add 2001:db8:0:3::2/64 dev enp66s0f0.qinq # Single 802.1ad tag 2345 ip link add link enp66s0f0 name enp66s0f0.ad type vlan id 2345 proto 802.1ad ip link set enp66s0f0.ad up mtu 9000 ip addr add 10.0.4.2/30 dev enp66s0f0.ad ip addr add 2001:db8:0:4::2/64 dev enp66s0f0.ad # Double 802.1ad tag 2345 inner-tag 1000 ip link add link enp66s0f0.ad name enp66s0f0.qinad type vlan id 1000 proto 802.1q ip link set enp66s0f0.qinad up mtu 9000 ip addr add 10.0.5.2/30 dev enp66s0f0.qinad ip addr add 2001:db8:0:5::2/64 dev enp66s0f0.qinad VPP config vppctl set interface state TenGigabitEthernet3/0/0 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0 vppctl set interface ip address TenGigabitEthernet3/0/0 10.0.1.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0 2001:db8:0:1::1/64 vppctl lcp create TenGigabitEthernet3/0/0 host-if e0 ip link set e0 up mtu 9000 ip addr add 10.0.1.1/30 dev e0 ip addr add 2001:db8:0:1::1/64 dev e0 vppctl create sub TenGigabitEthernet3/0/0 1234 vppctl set interface state TenGigabitEthernet3/0/0.1234 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1234 vppctl set interface ip address TenGigabitEthernet3/0/0.1234 10.0.2.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1234 2001:db8:0:2::1/64 vppctl lcp create TenGigabitEthernet3/0/0.1234 host-if e0.1234 ip link set e0.1234 up mtu 9000 ip addr add 10.0.2.1/30 dev e0.1234 ip addr add 2001:db8:0:2::1/64 dev e0.1234 vppctl create sub TenGigabitEthernet3/0/0 1235 dot1q 1234 inner-dot1q 1000 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1235 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1235 vppctl set interface ip address TenGigabitEthernet3/0/0.1235 10.0.3.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1235 2001:db8:0:3::1/64 vppctl lcp create TenGigabitEthernet3/0/0.1235 host-if e0.1235 ip link set e0.1235 up mtu 9000 ip addr add 10.0.3.1/30 dev e0.1235 ip addr add 2001:db8:0:3::1/64 dev e0.1235 vppctl create sub TenGigabitEthernet3/0/0 1236 dot1ad 2345 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1236 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1236 vppctl set interface ip address TenGigabitEthernet3/0/0.1236 10.0.4.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1236 2001:db8:0:4::1/64 vppctl lcp create 
TenGigabitEthernet3/0/0.1236 host-if e0.1236 ip link set e0.1236 up mtu 9000 ip addr add 10.0.4.1/30 dev e0.1236 ip addr add 2001:db8:0:4::1/64 dev e0.1236 vppctl create sub TenGigabitEthernet3/0/0 1237 dot1ad 2345 inner-dot1q 1000 exact-match vppctl set interface state TenGigabitEthernet3/0/0.1237 up vppctl set interface mtu packet 9000 TenGigabitEthernet3/0/0.1237 vppctl set interface ip address TenGigabitEthernet3/0/0.1237 10.0.5.1/30 vppctl set interface ip address TenGigabitEthernet3/0/0.1237 2001:db8:0:5::1/64 vppctl lcp create TenGigabitEthernet3/0/0.1237 host-if e0.1237 ip link set e0.1237 up mtu 9000 ip addr add 10.0.5.1/30 dev e0.1237 ip addr add 2001:db8:0:5::1/64 dev e0.1237 ","date":"2021-08-12","desc":" About this series Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (aggregation services router), VPP will look and feel quite familiar as many of the approaches are shared between the two. One thing notably missing, is the higher level control plane, that is to say: there is no OSPF or ISIS, BGP, LDP and the like. This series of posts details my work on a VPP plugin which is called the Linux Control Plane, or LCP for short, which creates Linux network devices that mirror their VPP dataplane counterpart. IPv4 and IPv6 traffic, and associated protocols like ARP and IPv6 Neighbor Discovery can now be handled by Linux, while the heavy lifting of packet forwarding is done by the VPP dataplane. Or, said another way: this plugin will allow Linux to use VPP as a software ASIC for fast forwarding, filtering, NAT, and so on, while keeping control of the interface state (links, addresses and routes) itself. When the plugin is completed, running software like FRR or Bird on top of VPP and achieving \u0026gt;100Mpps and \u0026gt;100Gbps forwarding rates will be well in reach!\n","permalink":"https://ipng.ch/s/articles/2021/08/12/vpp-linux-cp-part1/","section":"articles","title":"VPP Linux CP - Part1"},{"contents":"FiberStore is a staple provider of optics and network gear in Europe. Although I\u0026rsquo;ve been buying opfics like SFP+ and QSFP+ from them for years, I rarely looked at the switch hardware they have on sale, until my buddy Arend suggested one of their switches as a good alternative for an Internet Exchange Point, one with Frysian roots no less!\nExecutive Summary The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SPF28 ports and 2x 40G QSFP ports, which can also be reconfigued to be 4x10G each. The switch has a Cisco-like CLI, and great performance. I loadtested a pair of them in L2, QinQ, and in L3 mode, and they handled all the packets I sent to and through them, with all of 10G, 25G and 40G ports in use. Considering the redundant power supply with relatively low power usage, silicon based switching of L2 and L3, I definitely appreciate the price/performance. The switch would be a better match if it allowed for MPLS based L2VPN services, but it doesn\u0026rsquo;t support that.\nDetailed findings Hardware The switch is based on Broadcom\u0026rsquo;s BCM56170, codename Hurricane with 28x10GbE + 4x25GbE ports internally, for a total switching bandwidth of 380Gbps. I noticed that the FS website shows 760Gbps of nonblocking capacity, which I can explain: Broadcom has taken the per port ingress capacity, while FS.com is taking the ingress/egress port capacity and summing them up. 
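Spelling that out with the port counts from the Broadcom datasheet quoted above (nothing more than back-of-the-envelope arithmetic):\necho \u0026#39;28*10 + 4*25\u0026#39; | bc\necho \u0026#39;2 * (28*10 + 4*25)\u0026#39; | bc\nThe first line prints 380, the per-direction capacity of the silicon in Gbps; the second prints 760, which is what you get when the same ports are counted once on ingress and once on egress \u0026ndash; the number on the FS.com product page.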
Further, the sales pitch claims 565Mpps, which I found curious: if we divide the available bandwidth of 380Gbps (the number from the Broadcom datasheet) by the smallest possible frame of 84 bytes (672 bits), we\u0026rsquo;ll see 565Mpps. Why FS.com decided to seemingly arbitrarily double the switching capacity while reporting the nominal forwarding rate is beyond me.\nYou can see more (hi-res) pictures in this Photo Album.\nThis Broadcom chip is an SOC (System-on-a-Chip) which comes with an Arm A9 and a modest amount of TCAM on board and packs into a 31x31mm ball grid array form factor. The switch chip is able to store 16k routes and ACLs - it did not become immediately obvious to me what the partitioning is (between IPv4 entries, IPv6 entries, L2/L3/L4 ACL entries). One can only assume that the total sum of TCAM based objects must not exceed 4K entries. This means that as a campus switch, the L3 functionality will be great, including with routing protocols such as OSPF and ISIS. However, BGP with any amount of routing table activity will not be a good fit for this chip, so my dreams of porting DANOS to it are shot out of the box :-)\nThis Broadcom chip alone retails for €798,- apiece at Digikey, with a manufacturer lead time of 50 weeks as of Aug'21, which may be related to the ongoing foundry and supply chain crisis, I don\u0026rsquo;t know. But at that price point, the retail price of €1150,- per switch is really attractive.\nThe switch comes with two modular and field-replaceable power supplies (rated at 150W each, delivering 12V at 12.5A, one fan installed), and with two modular and equally field-replaceable fan trays installed with one fan each. Idle, without any optics installed and with all interfaces down, the switch draws about 18W of power, which is nice. The fans spin up only when needed, and by default the switch is quiet, but certainly not silent. I measured it after a tip from Michael, certainly nothing scientific, but in a silent room that measures a floor of ~30 dBA, the switch booted up and briefly burst the fans at 60dBA after which it stabilized at 54dBA or thereabouts. This is with both power supplies on, and with my cell phone microphone pointed directly towards the rear of the device, at 1 meter distance. Or something, IDK, I\u0026rsquo;m a network engineer, Jim, not an audio specialist!\nBesides the 20x 1G/10G SFP+ ports, 4x 25G ports and 2x 40G ports (which, incidentally, can be broken out into 4x 10G as well, bringing the Tengig port count to the datasheet specified 28), the switch also comes with a USB port (which mounts a filesystem on a USB stick, handy to do firmware upgrades and to copy files such as SSH keys back and forth), an RJ45 1G management port, which does not participate in the switch at all, and an RJ45 serial port that uses a standard Cisco cable for access and presents itself as 9600,8n1 to a console server, although flow control must be disabled on the serial port.\nTransceiver Compatibility FS did not attempt any vendor locking or crippleware with the ports and optics, yaay for that. I successfully inserted Cisco optics, Arista optics, FS.com \u0026lsquo;Generic\u0026rsquo; optics, and several DACs for 10G, 25G and 40G that I had lying around. The switch is happy to take all of them. 
The switch, as one would expect, supports diagnostics, which looks like this:\nfsw0#show interfaces TFGigabitEthernet0/24 transceiver Transceiver Type : 25GBASE-LR-SFP28 Connector Type : LC Wavelength(nm) : 1310 Transfer Distance : SMF fiber -- 10km Digital Diagnostic Monitoring : YES Vendor Serial Number : G2006362849 Current diagnostic parameters[AP:Average Power]: Temp(Celsius) Voltage(V) Bias(mA) RX power(dBm) TX power(dBm) 33(OK) 3.29(OK) 38.31(OK) -0.10(OK)[AP] -0.07(OK) Transceiver current alarm information: None .. with a helpful shorthand show interfaces ... trans diag that only shows the optical budget.\nSoftware I bought a pair of switches, and they came delivered with a current firmware version. The devices identify themselves as FS Campus Switch (S5860-20SQ) By FS.COM Inc with a hardware version of 1.00 and a software version of S5860_FSOS 12.4(1)B0101P1. Firmware updates can be downloaded from the FS.com website directly. I\u0026rsquo;m not certain if there\u0026rsquo;s a viable ONIE firmware for this chip, although the N8050 certainly can run ONIE, Cumulus and its own ICOS which is backed by Broadcom. Maybe in the future I could take a better look at the open networking firmware aspects of this type of hardware, but considering the CAM is tiny and the switch will do L2 in hardware, but L3 only up to a certain amount of routes (I think 4K or 16K in the FIB, and only 1GB of ram on the SOC), this is not the right platform to pour energy into trying to get DANOS to run on.\nTaking a look at the CLI, it\u0026rsquo;s very Cisco IOS-esque; there are a few small differences, but the look and feel is definitely familiar. Base configuration kind of looks like this:\nfsw0#show running-config hostname fsw0 ! sntp server oob 216.239.35.12 sntp enable ! username pim privilege 15 secret 5 $1$\u0026lt;redacted\u0026gt; ! ip name-server oob 8.8.8.8 ! service password-encryption ! enable service ssh-server no enable service telnet-server ! interface Mgmt 0 ip address 192.168.1.10 255.255.255.0 gateway 192.168.1.1 ! snmp-server location Zurich, Switzerland snmp-server contact noc@ipng.ch snmp-server community 7 \u0026lt;redacted\u0026gt; ro ! Configuration as well follows the familiar conf t (configure terminal) that many of us grew up with, and show commands allow for include and exclude modifiers, of course with all the shortest-next abbreviations such as sh int | i Forty and the like. VLANs are to be declared up front, with one notable cool feature of supervlans, which are the equivalent of aggregating VLANs together in the switch - a useful example might be an internet exchange platform which has trunk ports towards resellers, who might resell VLAN 101, 102, 103 each to an individual customer, but then all end up in the same peering lan VLAN 100.\nA few of the services (SSH, SNMP, DNS, SNTP) can be bound to the management network, but for this to work, the oob keyword has to be used. This is likely because the mgmt port is a network interface that is attached to the SOC, not to the switch fabric itself, and thus its route is not added to the routing table. I like this, because it avoids the mgmt network being picked up in OSPF, and accidentally routed to/from. But it does show a bit more of an awkward config:\nfsw1#show running-config | inc oob sntp server oob 216.239.35.12 ip name-server oob 8.8.8.8 ip name-server oob 1.1.1.1 ip name-server oob 9.9.9.9 fsw1#copy ? 
WORD Copy origin file from native flash: Copy origin file from flash: file system ftp: Copy origin file from ftp: file system http: Copy origin file from http: file system oob_ftp: Copy origin file from oob_ftp: file system oob_http: Copy origin file from oob_http: file system oob_tftp: Copy origin file from oob_tftp: file system running-config Copy origin file from running config startup-config Copy origin file from startup config tftp: Copy origin file from tftp: file system tmp: Copy origin file from tmp: file system usb0: Copy origin file from usb0: file system Note here the hack oob_ftp: and such; this would allow the switch to copy things from the OOB (management) network by overriding the scheme. But that\u0026rsquo;s OK, I guess, not beautiful, but it gets the job done and these types of commands will rarely be used.\nA few configuration examples, notably QinQ, in which I configure a port to take usual dot1q traffic, say from a customer, and add it into our local VLAN 200. Therefore, untagged traffic on that port will turn into our VLAN 200, and tagged traffic will turn into our dot1ad stack of outer VLAN 200 and inner VLAN whatever the customer provided \u0026ndash; in our case allowing only VLANs 1000-2000 and untagged traffic into VLAN 200:\nfsw0#confifgure fsw0(config)#vlan 200 fsw0(config-vlan)#name v-qinq-outer fsw0(config-vlan)#exit fsw0(config)#interface TenGigabitEthernet 0/3 fsw0(config-if-TenGigabitEthernet 0/3)#switchport mode dot1q-tunnel fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel native vlan 200 fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add untagged 200 fsw0(config-if-TenGigabitEthernet 0/3)#switchport dot1q-tunnel allowed vlan add tagged 1000-2000 The industry remains conflicted about the outer ethernet frame\u0026rsquo;s type \u0026ndash; originally a tag protocol identifier (TPID) of 0x9100 was suggested, and that\u0026rsquo;s what this switch uses. But the first specification of Q-in-Q called 802.1ad specified that the TPID should be 0x88a8 instead of the VLAN tag that was 0x8100. This ugly reality can be reflected directly in the switchport configuration by adding a frame-tag tpid 0xXXXX value to let the switch know which TPID needs to be used for the outer tag.\nIf this type of historical thing interests you, I definitely recommend reading up on Wikipedia on 802.1q and 802.1ad as well.\nLoadtests For my loadtests, I used Cisco\u0026rsquo;s T-Rex (ref) in stateless mode, with a custom Python controller that ramps up and down traffic from the loadtester to the device under test (DUT) by sending traffic out port0 to the DUT, and expecting that traffic to be presented back out from the DUT to its port1, and vice versa (out from port1 -\u0026gt; DUT -\u0026gt; back in on port0). You can read a bit more about my setup in my Loadtesting at Coloclue post.\nTo stress test the switch, several pairs at 10G and 25G were used, and since the specs boast line rate forwarding, I immediately ran T-Rex at maximum load with small frames. I found out, once again, that Intel\u0026rsquo;s X710 network cards aren\u0026rsquo;t line rate, something I\u0026rsquo;ll dive into in a bit more detail another day, for now, take a look at the T-Rex docs.\nL2 First let\u0026rsquo;s test a straight forward configuration. I connect a DAC between a 40G port on each switch, and connect a loadtester to port TenGigabitEthernet 0/1 and TenGigabitEthernet 0/2 on either switch, and leave everything simply in the default VLAN. 
This means packets from Te0/1 and Te0/2 go out on Fo0/26, then through the DAC into Fo0/26 on the second switch, and out on Te0/1 and Te0/2 there, to return to the loadtester. Configuration-wise, rather boring:\nfsw0#configure fsw0(config)#vlan 1 fsw0(config-vlan)#name v-default fsw0#show run int te0/1 interface TenGigabitEthernet 0/1 fsw0#show run int te0/2 interface TenGigabitEthernet 0/2 fsw0#show run int fo0/26 interface FortyGigabitEthernet 0/26 switchport mode trunk switchport trunk allowed vlan only 1 fsw0#show vlan id 1 VLAN Name Status Ports ---------- -------------------------------- --------- ----------------------------------- 1 v-default STATIC Te0/1, Te0/2, Te0/3, Te0/4 Te0/5, Te0/6, Te0/7, Te0/8 Te0/9, Te0/10, Te0/11, Te0/12 Te0/13, Te0/14, Te0/15, Te0/16 Te0/17, Te0/18, Te0/19, Te0/20 TF0/21, TF0/22, TF0/23, Fo0/25 Fo0/26 I set up T-Rex with unique MAC addresses for each of its ports. I find it useful to codify a few bits of information into the MAC \u0026ndash; such as the loadtester machine, PCI bus and port \u0026ndash; so that when I have many loadtesters running at the same time and go looking for them in the switches\u0026rsquo; forwarding tables, it\u0026rsquo;s easier to find what I\u0026rsquo;m looking for. My trex configuration for this loadtest:\npim@hippo:~$ cat /etc/trex_cfg.yaml - version : 2 interfaces : [\u0026#34;42:00.0\u0026#34;,\u0026#34;42:00.1\u0026#34;, \u0026#34;42:00.2\u0026#34;, \u0026#34;42:00.3\u0026#34;] port_limit : 4 port_info : - dest_mac : [0x0,0x2,0x1,0x1,0x0,0x00] # port 0 src_mac : [0x0,0x2,0x1,0x2,0x0,0x00] - dest_mac : [0x0,0x2,0x1,0x2,0x0,0x00] # port 1 src_mac : [0x0,0x2,0x1,0x1,0x0,0x00] - dest_mac : [0x0,0x2,0x1,0x3,0x0,0x00] # port 2 src_mac : [0x0,0x2,0x1,0x4,0x0,0x00] - dest_mac : [0x0,0x2,0x1,0x4,0x0,0x00] # port 3 src_mac : [0x0,0x2,0x1,0x3,0x0,0x00] Here\u0026rsquo;s where I notice something I\u0026rsquo;ve noticed before: the Intel X710 network cards cannot actually fill 4x10G at line rate. They\u0026rsquo;re fine at larger frames, but they max out at about 32Mpps of throughput \u0026ndash; and we know that each 10G connection filled with small ethernet frames in one direction will consume 14.88Mpps. The same is true for the XXV710 cards: the chip used will really only source about 30Mpps across all ports, which is sad but true.\nSo I have a choice to make: either I run small packets at a rate that\u0026rsquo;s acceptable for the NIC (~7.5Mpps per port, thus 30Mpps across the X710-DA4), or I run imix at line rate but with fewer packets/sec.
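To make that trade-off concrete, here is a tiny bit of wire math in Python (my own back-of-the-envelope sketch, not part of the T-Rex tooling; it assumes the 28:16:4 imix blend of 64/590/1514 byte frames that is described in more detail further below):

# Ethernet adds a 7 byte preamble, 1 byte start-frame delimiter and 12 byte
# inter-packet gap, so every frame costs an extra 20 bytes on the wire (L1)
# beyond its L2 size.
L1_OVERHEAD = 20

def line_rate_pps(link_bps, frame_bytes):
    # packets/sec needed to fill link_bps with L2 frames of frame_bytes
    return link_bps / ((frame_bytes + L1_OVERHEAD) * 8)

print(line_rate_pps(10e9, 64) / 1e6)    # ~14.88 Mpps: 64 byte frames on a 10G port

# assumed imix blend (28:16:4 of 64/590/1514 byte frames, see the L3 section below)
imix = {64: 28, 590: 16, 1514: 4}
avg = sum(size * n for size, n in imix.items()) / sum(imix.values())
print(avg)                              # ~360 byte average L2 frame
print(line_rate_pps(10e9, avg) / 1e6)   # ~3.3 Mpps per 10G port, ~13.1 Mpps per X710-DA4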
I chose the latter for these tests, and will be reporting the usage based on the imix profile, which saturates 10G at 3.28Mpps in one direction, or 13.12Mpps per network card.\nOf course, I can run two of these at the same time, pourquoi pas, which looks like this:\nfsw0#show mac Vlan MAC Address Type Interface Live Time ---------- -------------------- -------- ------------------------------ ------------- 1 0001.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:16:11 1 0001.0102.0000 DYNAMIC TenGigabitEthernet 0/1 0d 00:16:11 1 0001.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:16:11 1 0001.0104.0000 DYNAMIC TenGigabitEthernet 0/2 0d 00:16:10 1 0002.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:15:51 1 0002.0102.0000 DYNAMIC TenGigabitEthernet 0/3 0d 00:15:51 1 0002.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:15:51 1 0002.0104.0000 DYNAMIC TenGigabitEthernet 0/4 0d 00:15:50 fsw0#show int usage | exclude 0.00 Interface Bandwidth Average Usage Output Usage Input Usage ------------------------------------ ----------- ---------------- ---------------- ---------------- TenGigabitEthernet 0/1 10000 Mbit 94.66% 94.66% 94.66% TenGigabitEthernet 0/2 10000 Mbit 94.66% 94.66% 94.66% TenGigabitEthernet 0/3 10000 Mbit 94.65% 94.66% 94.66% TenGigabitEthernet 0/4 10000 Mbit 94.66% 94.66% 94.66% FortyGigabitEthernet 0/26 40000 Mbit 94.66% 94.66% 94.66% fsw0#show cpu core [Slot 0 : S5860-20SQ] Core 5Sec 1Min 5Min 0 16.40% 12.00% 12.80% This is the first time I noticed that the switch usage (94.66%) lines up somewhat confusingly with the observed T-Rex statistics: what the switch reports is what T-Rex considers L2 (ethernet) use, not L1 use. For an in-depth explanation of this, see below in the L3 section. But for now, let\u0026rsquo;s just say that when T-Rex says it\u0026rsquo;s sending 37.9Gbps of ethernet traffic (which is 40.00Gbps of bits on the line), that corresponds to the roughly 94.7% we see the switch reporting.\nSo suffice to say, at 80Gbit of actual throughput (40G from Te0/1-4 ingress and 40G to Te0/1-4 egress), the switch performs at line rate, with no noticeable lag or jitter. The CLI is responsive and the fans aren\u0026rsquo;t spinning harder than at idle, even after 60min of packets. Good!\nQinQ Then, I reconfigured the switch to let each pair of ports (Te0/1-2 and Te0/3-4) drop into its own Q-in-Q VLAN, with tag 20 and tag 21 respectively. The configuration:\ninterface TenGigabitEthernet 0/1 switchport mode dot1q-tunnel switchport dot1q-tunnel allowed vlan add untagged 20 switchport dot1q-tunnel native vlan 20 ! interface TenGigabitEthernet 0/3 switchport mode dot1q-tunnel switchport dot1q-tunnel allowed vlan add untagged 21 switchport dot1q-tunnel native vlan 21 spanning-tree bpdufilter enable ! interface FortyGigabitEthernet 0/26 switchport mode trunk switchport trunk allowed vlan only 1,20-21 fsw0#show mac Vlan MAC Address Type Interface Live Time ---------- -------------------- -------- ------------------------------ ------------- 20 0001.0101.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 01:15:02 20 0001.0102.0000 DYNAMIC TenGigabitEthernet 0/1 0d 01:15:01 20 0001.0103.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 01:15:02 20 0001.0104.0000 DYNAMIC TenGigabitEthernet 0/2 0d 01:15:03 21 0002.0101.0000 DYNAMIC TenGigabitEthernet 0/4 0d 00:01:50 21 0002.0102.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:01:03 21 0002.0103.0000 DYNAMIC TenGigabitEthernet 0/3 0d 00:01:59 21 0002.0104.0000 DYNAMIC FortyGigabitEthernet 0/26 0d 00:01:02 Two things happen that require a bit of explanation.
First of all, despite both loadtesters use the exact same configuration (in fact, I didn\u0026rsquo;t even stop them from emitting packets while reconfiguring the switch), I now have packetloss, the throughput per 10G port has reduced from 94.67% to 93.63% and at the same time, I observe that the 40G ports raised their usage from 94.66% to 94.81%.\nfsw1#show int usage | e 0.00 Interface Bandwidth Average Usage Output Usage Input Usage ------------------------------------ ----------- ---------------- ---------------- ---------------- TenGigabitEthernet 0/1 10000 Mbit 94.20% 93.63% 94.67% TenGigabitEthernet 0/2 10000 Mbit 94.21% 93.65% 94.67% TenGigabitEthernet 0/3 10000 Mbit 91.05% 94.66% 94.66% TenGigabitEthernet 0/4 10000 Mbit 90.80% 94.66% 94.66% FortyGigabitEthernet 0/26 40000 Mbit 94.81% 94.81% 94.81% The switches, however, are perfectly fine. The reason for this loss is that when I created the dot1q-tunnel, the switch sticks another VLAN tag (4 bytes, or 32 bits) on each packet before sending it out the 40G port between the switches, and at these packet rates, it adds up. Each 10G switchport is receiving 3.28Mpps (for a total of 13.12Mpps) which, when the switch needs to send it to its peer on the 40G trunk, adds 13.12Mpps * 32 bits = 419.8Mbps on top of the 40G line rate, implying we\u0026rsquo;re going to be losing roughly 1.045% of our packets. And indeed, the difference between 94.67 (inbound) and 93.63 (outbound) is 1.04% which lines up.\nGlobal Statistics connection : localhost, Port 4501 total_tx_L2 : 37.92 Gbps version : STL @ v2.91 total_tx_L1 : 40.02 Gbps cpu_util. : 43.52% @ 8 cores (4 per dual port) total_rx : 37.92 Gbps rx_cpu_util. : 0.0% / 0 pps total_pps : 13.12 Mpps async_util. : 0% / 198.64 bps drop_rate : 0 bps total_cps. : 0 cps queue_full : 0 pkts Port Statistics port | 0 | 1 | 2 | 3 -----------+-------------------+-------------------+-------------------+------------------ owner | pim | pim | pim | pim link | UP | UP | UP | UP state | TRANSMITTING | TRANSMITTING | TRANSMITTING | TRANSMITTING speed | 10 Gb/s | 10 Gb/s | 10 Gb/s | 10 Gb/s CPU util. | 46.29% | 46.29% | 40.76% | 40.76% -- | | | | Tx bps L2 | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps Tx bps L1 | 10 Gbps | 10 Gbps | 10 Gbps | 10 Gbps Tx pps | 3.28 Mpps | 3.27 Mpps | 3.27 Mpps | 3.28 Mpps Line Util. | 100.04 % | 100.04 % | 100.04 % | 100.04 % --- | | | | Rx bps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps | 9.48 Gbps Rx pps | 3.24 Mpps | 3.24 Mpps | 3.23 Mpps | 3.24 Mpps ---- | | | | opackets | 1891576526 | 1891577716 | 1891547042 | 1891548090 ipackets | 1891576643 | 1891577837 | 1891547158 | 1891548214 obytes | 684435443496 | 684435873418 | 684424773684 | 684425153614 ibytes | 684435484082 | 684435916902 | 684424817178 | 684425197948 tx-pkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts rx-pkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts | 1.89 Gpkts tx-bytes | 684.44 GB | 684.44 GB | 684.42 GB | 684.43 GB rx-bytes | 684.44 GB | 684.44 GB | 684.42 GB | 684.43 GB ----- | | | | oerrors | 0 | 0 | 0 | 0 ierrors | 0 | 0 | 0 | 0 L3 For this test, I reconfigured the 25G ports to become routed rather than switched, and I put them under 80% load with T-Rex (where 80% here is of L1), thus the ports are emitting 20Gbps of traffic at a rate of 13.12Mpps. I left two of the 10G ports just continuing their ethernet loadtest at 100%, which is also 20Gbps of traffic and 13.12Mpps. 
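As a quick sanity check before looking at the results, here is the arithmetic for what this combined load should add up to on the 40G inter-switch trunk (my own sketch, assuming the per-port L1 loads just described):

# Offered L1 load per switch, per direction, on the Fo0/26 trunk
# (back-of-the-envelope; assumes ~99% on the two 10G ports, 80% on the 25G port).
loads = [
    (10e9, 0.99),  # Te0/1, L2 ethernet loadtest
    (10e9, 0.99),  # Te0/2, L2 ethernet loadtest
    (25e9, 0.80),  # Tf0/24, routed L3 loadtest
]
per_direction = sum(speed * pct for speed, pct in loads)
print(per_direction / 1e9)       # ~39.8 Gbps of L1 per direction
print(2 * per_direction / 1e9)   # ~79.6 Gbps across both directions of the trunk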
In total, I observed 79.95Gbps of traffic between the two switches: an entirely saturated 40G port in both directions.\nI then created a simple topology with OSPF, both switches configured a Loopback0 interface with a /32 IPv4 and /128 IPv6 address, and a transit network between them in a VLAN100 interface. OSPF and OSPFv3 both distribute connected and static routes, to keep things simple.\nFinally, I added an IP address on the Tf0/24 interface, set a static IPv4 route for 16.0.0.0/8 and 48.0.0.0/8 towards that interface on each switch, and added VLAN 100 to the Fo0/26 trunk. It looks like this for switch fsw0:\ninterface Loopback 0 ip address 100.64.0.0 255.255.255.255 ipv6 address 2001:DB8::/128 ipv6 enable interface VLAN 100 ip address 100.65.2.1 255.255.255.252 ipv6 enable ip ospf network point-to-point ipv6 ospf network point-to-point ipv6 ospf 1 area 0 interface TFGigabitEthernet 0/24 no switchport ip address 100.65.1.1 255.255.255.0 ipv6 address 2001:DB8:1:1::1/64 interface FortyGigabitEthernet 0/26 switchport mode trunk switchport trunk allowed vlan only 1,20,21,100 router ospf 1 graceful-restart redistribute connected subnets redistribute static subnets area 0 network 100.65.2.0 0.0.0.3 area 0 ! ipv6 router ospf 1 graceful-restart redistribute connected redistribute static area 0 ! ip route 16.0.0.0 255.0.0.0 100.65.1.2 ipv6 route 2001:db8:100::/40 2001:db8:1:1::2 With this topology, an L3 routing domain emerges between Tf0/24 on switch fsw0 and Tf0/24 on switch fsw1, and we can inspect this, taking a look at fsw1, I can see that both IPv4 and IPv6 adjacencies have formed, and that the switches, néé routers, have learned routes from one another:\nfsw1#show ip ospf neighbor OSPF process 1, 1 Neighbors, 1 is Full: Neighbor ID Pri State BFD State Dead Time Address Interface 100.65.2.1 1 Full/ - - 00:00:31 100.65.2.1 VLAN 100 fsw1#show ipv6 ospf neighbor OSPFv3 Process (1), 1 Neighbors, 1 is Full: Neighbor ID Pri State BFD State Dead Time Instance ID Interface 100.65.2.1 1 Full/ - - 00:00:31 0 VLAN 100 fsw1#show ip route Codes: C - Connected, L - Local, S - Static R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2 E1 - OSPF external type 1, E2 - OSPF external type 2 SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2 IA - Inter area, EV - BGP EVPN, A - Arp to host * - candidate default Gateway of last resort is no set O E2 16.0.0.0/8 [110/20] via 100.65.2.1, 12:42:13, VLAN 100 S 48.0.0.0/8 [1/0] via 100.65.0.2 O E2 100.64.0.0/32 [110/20] via 100.65.2.1, 00:05:23, VLAN 100 C 100.64.0.1/32 is local host. C 100.65.0.0/24 is directly connected, TFGigabitEthernet 0/24 C 100.65.0.1/32 is local host. O E2 100.65.1.0/24 [110/20] via 100.65.2.1, 12:44:57, VLAN 100 C 100.65.2.0/30 is directly connected, VLAN 100 C 100.65.2.2/32 is local host. 
fsw1#show ipv6 route IPv6 routing table name - Default - 12 entries Codes: C - Connected, L - Local, S - Static R - RIP, O - OSPF, B - BGP, I - IS-IS, V - Overflow route N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type 2 E1 - OSPF external type 1, E2 - OSPF external type 2 SU - IS-IS summary, L1 - IS-IS level-1, L2 - IS-IS level-2 IA - Inter area, EV - BGP EVPN, N - Nd to host O E2 2001:DB8::/128 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100 LC 2001:DB8::1/128 via Loopback 0, local host C 2001:DB8:1::/64 via TFGigabitEthernet 0/24, directly connected L 2001:DB8:1::1/128 via TFGigabitEthernet 0/24, local host O E2 2001:DB8:1:1::/64 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100 O E2 2001:DB8:100::/40 [110/20] via FE80::669D:99FF:FED0:A054, VLAN 100 C FE80::/10 via ::1, Null0 C FE80::/64 via Loopback 0, directly connected L FE80::669D:99FF:FED0:A076/128 via Loopback 0, local host C FE80::/64 via TFGigabitEthernet 0/24, directly connected L FE80::669D:99FF:FED0:A076/128 via TFGigabitEthernet 0/24, local host C FE80::/64 via VLAN 100, directly connected L FE80::669D:99FF:FED0:A076/128 via VLAN 100, local host Great success! I can see from the fsw1 output above, its OSPF process has learned routes for the IPv4 and IPv6 loopbacks (100.64.0.0/32 and 2001:DB8::1/128 respectively), the connected routes (100.65.1.0/24 and 2001:DB8:1:1::/64 respectively), and the static routes (16.0.0.0/8 and 2001:db8:100::/40).\nSo let\u0026rsquo;s make use of this topology and change one of the two loadtesters to switch to L3 mode instead:\npim@hippo:~$ cat /etc/trex_cfg.yaml - version : 2 interfaces : [\u0026#34;0e:00.0\u0026#34;, \u0026#34;0e:00.1\u0026#34; ] port_bandwidth_gb: 25 port_limit : 2 port_info : - ip : 100.65.0.2 default_gw : 100.65.0.1 - ip : 100.65.1.2 default_gw : 100.65.1.1 I left the loadtest running for 12hrs or so, and observed the results to be squeaky clean. The loadtester machine was generating ~96Gb/core at 20% utilization, so lazily generating 40.00Gbit of traffic at 25.98Mpps (remember, this was setting the load to 80% on the 25G port, and 99% on the 10G ports). Looking at the switch and again being surprised about the discrepancy, I decided to fully explore the curiosity in this switch\u0026rsquo;s utilization reporting.\nfsw1#show interfaces usage | exclude 0.00 Interface Bandwidth Average Usage Output Usage Input Usage ------------------------------------ ----------- -------------- -------------- ----------- TenGigabitEthernet 0/1 10000 Mbit 93.80% 93.79% 93.81% TenGigabitEthernet 0/2 10000 Mbit 93.80% 93.79% 93.81% TFGigabitEthernet 0/24 25000 Mbit 75.80% 75.79% 75.81% FortyGigabitEthernet 0/26 40000 Mbit 94.79% 94.79% 94.79% fsw1#show int te0/1 | inc packets/sec 10 seconds input rate 9381044793 bits/sec, 3240802 packets/sec 10 seconds output rate 9378930906 bits/sec, 3240123 packets/sec fsw1#show int tf0/24 | inc packets/sec 10 seconds input rate 18952369793 bits/sec, 6547299 packets/sec 10 seconds output rate 18948317049 bits/sec, 6545915 packets/sec fsw1#show int fo0/26 | inc packets/sec 10 seconds input rate 37915517884 bits/sec, 13032078 packets/sec 10 seconds output rate 37915335102 bits/sec, 13026051 packets/sec Looking at that number, 75.80% was not the 80% that I had asked for, and actually the usage of the 10G ports (which I put at 99% load) and 40G port are also lower than I had anticipated. What\u0026rsquo;s going on there? 
It\u0026rsquo;s quite simple after doing some math: the switch is reporting L2 bits/sec, not L1 bits/sec!\nOn the L3 loadtest, using the imix profile, T-Rex is sending 13.02Mpps of load, which, according to its own observation, is 37.8Gbit of L2 and 40.00Gbps of L1 bandwidth. On the L2 loadtest, again using the imix profile, T-Rex is sending 4x 3.24Mpps as well, which it claims is 37.6Gbps of L2 and 39.66Gbps of L1 bandwidth (note: I put the loadtester here at 99% of line rate, to ensure I would not end up losing packets due to congestion on the 40G port).\nSo according to T-Rex, I am sending 75.4Gbps of traffic (37.6Gbps in the L2 test and 37.8Gbps in the simultaneous L3 loadtest), yet I\u0026rsquo;m seeing 37.9Gbps on the switchport. Oh my!\nHere\u0026rsquo;s how all of these numbers relate to one another:\nFirst off, we are sending 99% of line rate at 3.24Mpps into Te0/1 and Te0/2 on each switch. Then, we are sending 80% of line rate at 6.55Mpps into Tf0/24 on each switch. The Te0/1 and Te0/2 are both in the default VLAN on either side. But the Tf0/24 is sending its IP traffic through the VLAN 100 interconnect, which means all of that traffic gets a dot1q VLAN tag added. That\u0026rsquo;s 4 bytes for each packet. Sending 6.55Mpps * 32 bits extra equals 209600000 bits/sec (0.21Gbps). The loadtester claims 37.70Gbps, but the switch sees 37.91Gbps, which is exactly the difference we calculated above (0.21Gbps), and equals the overhead created by adding the VLAN tag on the 25G stream that sits in VLAN 100. Now we are ready to explain the difference between the switch-reported port usage and the loadtester-reported port usage:\nThe loadtester is sending an imix traffic mix, which consists of a 28:16:4 ratio of packets that are 64:590:1514 bytes. We already know that to create a packet on the wire, we have to add a 7 byte preamble, a 1 byte start frame delimiter, and end with a 12 byte interpacket gap, so each ethernet frame is 20 bytes longer, making 84 bytes the smallest possible on-the-wire frame. We know we\u0026rsquo;re sending 3.24Mpps on a 10G port at 99% T-Rex (L1) usage. Each packet needs 20 bytes or 160 bits of overhead, which at 3.24Mpps is 518400000 bits/sec. We are seeing 9381044793 bits/sec on a 10G port (corresponding to the switch reporting 93.80% usage). Adding these two numbers up gives us 9899444793 bits/sec (corresponding to T-Rex reporting 98.99% usage). Conversely, the whole system is sending 37.9Gbps on the 40G port (corresponding to the switch reporting 37.9/40 == 94.79% usage). We know this is 2x 10G streams at 99% utilization and 1x 25G stream at 80% utilization. This is 13.03Mpps, which generates 2084800000 bits/sec of overhead. Adding these two numbers up gives us 40.00 Gbps of usage (which is the expected L1 line rate). I find it very fulfilling to see these numbers meaningfully add up! Oh, and by the way, the switches are switching and routing all of this with 0.00% packet loss, and the chassis doesn\u0026rsquo;t even get warm :-)\nGlobal Statistics connection : localhost, Port 4501 total_tx_L2 : 38.02 Gbps version : STL @ v2.91 total_tx_L1 : 40.02 Gbps cpu_util. : 21.88% @ 4 cores (4 per dual port) total_rx : 38.02 Gbps rx_cpu_util. : 0.0% / 0 pps total_pps : 13.13 Mpps async_util. : 0% / 39.16 bps drop_rate : 0 bps total_cps. : 0 cps queue_full : 0 pkts Port Statistics port | 0 | 1 | total -----------+-------------------+-------------------+------------------ owner | pim | pim | link | UP | UP | state | TRANSMITTING | TRANSMITTING | speed | 25 Gb/s | 25 Gb/s | CPU util.
| 21.88% | 21.88% | -- | | | Tx bps L2 | 19.01 Gbps | 19.01 Gbps | 38.02 Gbps Tx bps L1 | 20.06 Gbps | 20.06 Gbps | 40.12 Gbps Tx pps | 6.57 Mpps | 6.57 Mpps | 13.13 Mpps Line Util. | 80.23 % | 80.23 % | --- | | | Rx bps | 19 Gbps | 19 Gbps | 38.01 Gbps Rx pps | 6.56 Mpps | 6.56 Mpps | 13.13 Mpps ---- | | | opackets | 292215661081 | 292215652102 | 584431313183 ipackets | 292152912155 | 292153677482 | 584306589637 obytes | 105733412810506 | 105733412001676 | 211466824812182 ibytes | 105710857873526 | 105711223651650 | 211422081525176 tx-pkts | 292.22 Gpkts | 292.22 Gpkts | 584.43 Gpkts rx-pkts | 292.15 Gpkts | 292.15 Gpkts | 584.31 Gpkts tx-bytes | 105.73 TB | 105.73 TB | 211.47 TB rx-bytes | 105.71 TB | 105.71 TB | 211.42 TB ----- | | | oerrors | 0 | 0 | 0 ierrors | 0 | 0 | 0 Conclusions It\u0026rsquo;s just super cool to see a switch like this work as expected. I did not manage to overload it at all, neither with IPv4 loadtest at 20Mpps and 50Gbit of traffic, nor with L2 loadtest at 26Mpps and 80Gbit of traffic, with QinQ demonstrably done in hardware as well as IPv4 route lookups. I will be putting these switches into production soon on the IPng Networks links between Glattbrugg and Rümlang in Zurich, thereby upgrading our backbone from 10G to 25G CWDM. It seems to me, that using these switches as L3 devices given a smaller OSPF routing domain (currently, we have ~300 prefixes in our OSPF at AS50869), would definitely work well, as would pushing and popping QinQ trunks for our customers (for example on Solnet or Init7 or Openfactory).\nApproved. A+, will buy again.\n","date":"2021-08-07","desc":"FiberStore is a staple provider of optics and network gear in Europe. Although I\u0026rsquo;ve been buying opfics like SFP+ and QSFP+ from them for years, I rarely looked at the switch hardware they have on sale, until my buddy Arend suggested one of their switches as a good alternative for an Internet Exchange Point, one with Frysian roots no less!\nExecutive Summary The FS.com S5860 switch is pretty great: 20x 10G SFP+ ports, 4x 25G SPF28 ports and 2x 40G QSFP ports, which can also be reconfigued to be 4x10G each. The switch has a Cisco-like CLI, and great performance. I loadtested a pair of them in L2, QinQ, and in L3 mode, and they handled all the packets I sent to and through them, with all of 10G, 25G and 40G ports in use. Considering the redundant power supply with relatively low power usage, silicon based switching of L2 and L3, I definitely appreciate the price/performance. The switch would be a better match if it allowed for MPLS based L2VPN services, but it doesn\u0026rsquo;t support that.\n","permalink":"https://ipng.ch/s/articles/2021/08/07/review-fs-s5860-20sq-switch/","section":"articles","title":"Review: FS S5860-20SQ Switch"},{"contents":"Introduction Many people maintain what is called a Bucketlist, a list of things they wish to do before they kick the bucket. I have one also, and although most of the items on that list are earthly and more on the emotional realm, and private, there is one specific thing that I have wanted to do ever since I first started working in IT in 1998: Peer at the Amsterdam Internet Exchange.\nThis post details striking this particular item off my bucketlist. It\u0026rsquo;s both indulgent, humblebraggy and incredibly nerdy and it talks a bit about mental health. 
If those are trigger words for you, skip ahead to another post, like my series on VPP ;-)\n1998 - Netherlands I started working when I was still at the TU/Eindhoven, and after a great sysadmin job at Radar Internet, which became Track and was sold to Wegener Arcade, I turned towards networking. After building Freeler (the first free ISP in the Netherlands) with Adrianus and co, and a small stint at their primary uplink Intouch with Rager (rest in peace, Brother), I joined BIT (AS12859) from 2000 to 2006, and it was here where I developed a true passion for that which makes the internet \u0026rsquo;tick\u0026rsquo;: routing protocols.\nI was secretly jealous that BIT could afford Junipers, F5 loadbalancers and large Cisco switches, and I loved working with and on those machines. BIT had a reseller relationship with BBNed, and were able to directly connect ADSL modems into their own infrastructure, and as such I could afford to get myself a subnet from 213.154.224.0/19 routed to my house in Wageningen. It was where I had a half-19\u0026quot; rack in a clothing closet in our guest bedroom, and it was there that I decided: I want to eventually participate in the BGP world and peer at AMS-IX (the only exchange at the time, NLIX was just starting up, thanks again, Jan!).\nPictured to the right was my first contribution to AS12859 - deploying a CWDM ring from Ede to Amsterdam and upgrading our backbone from an ATM E3 (34Mbit) and POS STM1 (155Mbit) leased line to Gigabit Ethernet on Juniper M5 routers, this was in 2001, 20 years ago almost to the month.\n2008 - Switzerland Fast forward to 2006, I moved to Switzerland and while I remained friendly with NLNOG and SWINOG (and a few other network operator groups), I did not pursue the whole internet exchange thing. I had operated networks for the greater part of a decade, and with my full time job, I spent a lot of time learning how to be a good Site Reliability Engineer. I still had three /24 PI space blocks, used for different purposes in the past, but I was much more comfortable letting the \u0026ldquo;real\u0026rdquo; ISPs announce them - in my case AS25091 IP-Max (thanks, Fred!) and AS13030 Init7 (thanks, Fredy!) and AS12859 BIT (thanks, Michel!). I cannot remember any meaningful downtime in any of those operators, of course there is always some, but due to the N+2 nature of my network deployment, I don\u0026rsquo;t think any global downtime for my internet presence has ever occured.\nIt\u0026rsquo;s not a coincidence that even Google for the longest time used my website at SixXS for their own monitoring, now that is cool. Although Jeroen and I did decide to retire the SixXS project (see my Sunset article on why), the website is still up and served off of three distinct networks, because I have to stay true to the SRE life.\nPictured to the right was one of the two racks at Deltalis DK2, a datacenter built into a mountain in the heart of the swiss Alps. Classic edge/core/border approach with (at the time) state of the art Cisco 7600 routers. One of these is destined to become my nightstand at some point, this was in 2013, which is now (almost) 10 years ago.\nCorona Motivation My buddy Fred from IP-Max would regularly ask me \u0026ldquo;why don\u0026rsquo;t you just announce your /24 yourself?\u0026rdquo; It\u0026rsquo;d be fun, he said. In 2007, we registered a /24 PI for SixXS, and I was always quite content to let him handle the routing. 
But it started to itch and a neighbor of mine inadvertently reminded me of this itch (thanks, Max) by asking me if I was interested to share an L2 ethernet link with him from our place in Brüttisellen to one of the datacenters in Zürich, a distance of about 7km as the photons fly.\nI could not resist any longer. I was working long(er) than average hours due to the work-from-home situation: you easily chop off 45-60min of commute each day, but I noticed myself spending it in more meetings instead of in the train. I was slowly getting into a bad state, and my motivation was very low. I wanted to do something other than sleep-eat-work-sleep and even my jogging went to an all time minimum. I had very low emotional energy.\nTo put my mind off of things, I decided to reattach to my networking roots in a few ways: one was to build an AS and operate it for a while (maybe a few years until I get bored of it, and then re-parent my IP space to some friendly ISP, or who knows, cash in rich and sell my IP space to the highest bidder!), and the other was to continue my desire to have a competent replacement for silicon now that CPUs-of-now are just as fast as ASICs-of-then, and contribute to DANOS and VPP.\nStep 1. Build a basement ISP So getting a PC with Bird, or in my case, an appliance called DANOS which uses DPDK to implement wirespeed routing on commodity x86/64 hardware. So I happily announced my /24 and /48 from NTT\u0026rsquo;s datacenter, connected to the local internet exchange Swissix and rented an L2 circuit to my house via Openfactory. Also, I showed that a simple Supermicro (for example SYS-5018D-FN8T) could easily handle line rate 64 byte frames in both directions on its TenGigabit interfaces, that\u0026rsquo;s 29Mpps, and still have a responsive IPMI serial console. It reminded me of the early days of Juniper martini class routers, where Jean would say \u0026ldquo;.. and the chassis doesn\u0026rsquo;t even get warm\u0026rdquo;. That\u0026rsquo;s certainly correct today, cuz that Supermicro draws 35W, which is one microwatt per packet routed!\nStep 2. Build a European Ring Of course, I cannot end there, as I have a bucketlist item to work towards. I always wanted to peer in Amsterdam, ever since 2001 when I joined BIT. So I worked out a plan with Fred, who has also been wanting to go to Amsterdam with his Swiss ISP IP-Max.\nSo, in a really epic roadtrip full of nerd, Fred and I went into total geek-mode as we traveled to several European cities to deploy AS50869 on a european ring. I wrote about my experience extensively in these blog posts:\nFrankfurt: May 17th 2021. Amsterdam: May 26th 2021. Lille: May 28th 2021. Paris: June 1st 2021. Geneva: July 3rd 2021. I think we can now say that I\u0026rsquo;m peering on the FLAP. It\u0026rsquo;s not that this AS50869 carries that much traffic, but it\u0026rsquo;s a very welcome relief of daily worklife to be able to do something fun and immediately rewarding like turn up a BGP session and see the traffic go from Zurich to any one of these cities at 10Gbit in any direction. No congestion, no packetlo, just pure horsepower performance.\nStep 3. Build Linux CP in VPP Next month, I plan to take VPP out for an elaborate spin. I\u0026rsquo;ve been running DANOS on my routers for a while now, and I\u0026rsquo;m pretty happy with it, but there are a few quirks that are annoying me more and more. Notably, the conversion of Vyatta style commands in the configuration into an FRR config, are often lossy. 
There\u0026rsquo;s a few key features (such as RPKI or LDP signalling for MPLS paths) that I\u0026rsquo;m missing, and the dataplane, although pretty stable, has crashed maybe three or four times over the last year. Note: One of IP-Max\u0026rsquo;s many Cisco ASR9k also had a few line card reboots in the last year so maybe these crashes are par for the course.\nEver since seeing Netgate and Cisco started work on the Linux Control Plane plugin, which takes interfaces in the VPP dataplane and exposes those as TAP interfaces in Linux, I\u0026rsquo;ve wanted to contribute to that. I\u0026rsquo;ve been determined to make use of VPP+LinuxCP in my own network. However, development has completely stalled on the plugin; the one that ships with VPP 21.06 is rudimentary at best: doesn\u0026rsquo;t do QinQ/QinAD; doesn\u0026rsquo;t apply any changes from the dataplane into the Linux network interface; and the plugin that mirrors netlink message has been stuck in limbo for a few months. So I reached out to the authors in May and offered to complete / rewrite the plugins. I find that writing code, compiling and testing it, and being able to immediately see the improvements in a live network incredibly motivating and energizing.\nExpect to see a few posts in August/September about this work!\n2021 - Switzerland I can say that after making a few small tweaks and adjustments, and breaking the WFH regime into \u0026ldquo;work\u0026rdquo; from home and \u0026ldquo;play\u0026rdquo; from home, helps a lot. I now have a HDMI switch that flips my desk from my work Mac into my personal OpenBSD machine, and a 19\u0026quot; rack in my basement with equipment to loadtest and develop VPP, and I often do some small chores like establish a peering session and happily traceroute from my basement to Amsterdam.\nI\u0026rsquo;ve spent some time in the mountains, in a family commitment to go to a new swiss canton every month. The picture on the right was taken from First in Grindelwald, looking south towards Eiger and Mönch. I live in an absolutely beautiful country. Thanks, Switzerland ;-)\nOn the Bucketlist front, I have the following to report. I waited a few months before writing the post, but I can confidently say that accomplishing this L2/L3 path from my workstation in Brüttisellen where I\u0026rsquo;m typing this blogpost, all the way over Frankfurt to Amsterdam and being able to reach my original colocation machine at AS8283 Coloclue using only switches, routers and IP addresses I own is a continual joy. 
Seeing that my work now affords me a straight gigabit bandwidth in each direction, makes me just fill with engineering pride and happiness.\npim@chumbucket:~$ traceroute ghoul.ipng.nl traceroute to gripe.ipng.nl (94.142.241.186), 30 hops max, 60 byte packets 1 chbtl0.ipng.ch (194.1.163.66) 0.236 ms 0.178 ms 0.143 ms 2 chrma0.ipng.ch (194.1.163.17) 1.394 ms 1.363 ms 1.332 ms 3 defra0.ipng.ch (194.1.163.25) 7.275 ms 7.362 ms 7.213 ms 4 nlams0.ipng.ch (194.1.163.27) 12.905 ms 12.843 ms 12.844 ms 5 ghoul.ipng.nl (94.142.244.54) 13.120 ms 13.181 ms 13.044 ms And as far as the actual bucketlist item goes, although I made a bit harder on myself because I moved to Switzerland, IP-Max also made it easier by giving me a great price on the backhaul connectivity to Amsterdam, so I can report that the bucket list item is indeed checked off the list:\npim@nlams0:~$ show protocols bgp address-family ipv6 unicast summary IPv6 Unicast Summary: BGP table version 689670802 RIB entries 251402, using 46 MiB of memory Peers 67, using 1427 KiB of memory Peer groups 32, using 2048 bytes of memory Neighbor V AS MsgRcvd MsgSent TblVer InQ OutQ Up/Down State/PfxRcd PfxSnt 2a02:1668:a2b:5:869::1 4 51088 1561576 216485 0 0 0 08w4d03h 126136 5 2a02:1668:a2b:5:869::2 4 51088 1546990 216485 0 0 0 08w4d03h 126127 5 2a02:898::d1 4 8283 812846953 127814 0 0 0 08w6d20h 130590 6 2a02:898::d2 4 8283 828908332 127814 0 0 0 08w0d16h 130590 6 2a02:898:146::2 4 112 101560 228562328 0 0 0 06w2d15h 2 132437 2a07:cd40:1::4 4 212855 105513 238069267 0 0 0 2d14h12m 1 132437 2602:fed2:fff:ffff::1 4 137933 4180058 124978 0 0 0 04w4d10h 551 7 2602:fed2:fff:ffff::253 4 209762 2034724 125048 0 0 0 1d00h14m 618 7 2001:7f8:10f::205b:140 4 8283 137242 121460 0 0 0 08w5d17h 34 7 2001:7f8:10f::207b:145 4 8315 278651 274793 0 0 0 06w0d12h 34 7 2001:7f8:10f::500f:139 4 20495 117590 107877 0 0 0 04w3d00h 208 7 2001:7f8:10f::ac47:131 4 44103 152949 55010 0 0 0 05w1d13h 24 7 2001:7f8:10f::af36:129 4 44854 134969 146240 0 0 0 09w2d16h 1 7 2001:7f8:10f::afd1:133 4 45009 35438 35477 0 0 0 01w0d02h 3 7 2001:7f8:10f::e20a:148 4 57866 302505 280603 0 0 0 05w5d18h 161 7 2001:7f8:10f::e3bb:137 4 58299 1419455 104321 0 0 0 04w0d13h 531 7 2001:7f8:10f::ec8d:132 4 60557 120509 108071 0 0 0 01w4d20h 7 7 2001:7f8:10f::3:259e:143 4 206238 278960 272776 0 0 0 04w4d18h 2 7 2001:7f8:10f::3:3e9b:134 4 212635 823944 140075 0 0 0 08w5d17h 1 7 2001:7f8:10f::dc49:253 4 56393 5693179 157171 0 0 0 02w6d22h 26680 7 2001:7f8:10f::dc49:254 4 56393 5698910 162197 0 0 0 08w5d17h 26680 7 2a02:2528:1902::1 4 25091 9964126 137696 0 0 0 09w1d22h 113020 5 2001:7f8:8f::a500:6939:1 4 6939 8496149 138188 0 0 0 01w2d20h 48079 7 2001:7f8:8f::a500:8283:1 4 8283 23251 52823 0 0 0 03w3d02h Active 0 2001:7f8:8f::a501:3335:1 4 13335 3279 3199 0 0 0 1d02h35m 102 7 2001:7f8:8f::a502:495:1 4 20495 117248 107466 0 0 0 04w3d00h 208 7 2001:7f8:8f::a503:2934:1 4 32934 194428 193990 0 0 0 01w3d08h 30 7 2001:7f8:8f::a503:2934:2 4 32934 194035 194002 0 0 0 03w3d11h 30 7 2001:7f8:8f::a504:4854:1 4 44854 0 9052 0 0 0 never Idle (Admin) 0 2001:7f8:8f::a504:5009:1 4 45009 35433 35467 0 0 0 01w0d02h 3 7 2001:7f8:8f::a505:7866:1 4 57866 302602 276459 0 0 0 04w4d01h 161 7 2001:7f8:8f::a505:8299:1 4 58299 912125 141718 0 0 0 04w0d13h 531 7 2001:7f8:8f::a506:557:1 4 60557 120482 108067 0 0 0 01w4d20h 7 7 2001:7f8:8f::a521:2635:1 4 212635 622475 85332 0 0 0 02w5d10h 1 7 2001:7f8:8f::a504:9917:1 4 49917 8370930 158851 0 0 0 03w4d13h 25257 7 2001:7f8:8f::a504:9917:2 4 49917 8397150 160118 0 0 0 04w4d01h 25011 7 
2001:7f8:13::a500:714:1 4 714 67722 66645 0 0 0 03w2d03h 146 7 2001:7f8:13::a500:714:2 4 714 68208 66645 0 0 0 03w2d03h 146 7 2001:7f8:13::a500:6939:1 4 6939 10980475 98099 0 0 0 07w0d10h 48079 7 2001:7f8:13::a502:495:1 4 20495 117773 107873 0 0 0 04w0d14h 208 7 2001:7f8:13::a503:4307:1 4 34307 10709086 100814 0 0 0 09w4d23h 23339 7 2001:7f8:13::a503:4307:2 4 34307 10694266 100814 0 0 0 09w4d23h 22137 7 2001:7f8:8f::a504:4103:1 4 44103 152932 55010 0 0 0 05w1d13h 24 7 2001:7f8:b7::a500:8283:1 4 8283 126035 98846 0 0 0 06w4d22h 34 7 2001:7f8:b7::a501:3335:1 4 13335 4277 4157 0 0 0 1d10h34m 102 7 2001:7f8:b7::a502:495:1 4 20495 117588 107871 0 0 0 04w3d00h 208 7 2001:7f8:b7::a504:5009:1 4 45009 35441 35504 0 0 0 01w0d02h 3 7 2001:7f8:b7::a506:557:1 4 60557 120546 108067 0 0 0 01w4d20h 7 7 2001:7f8:b7::a521:2635:1 4 212635 716031 94458 0 0 0 08w5d17h 1 7 2001:7f8:b7::a504:1441:1 4 41441 12911969 107363 0 0 0 08w2d12h 50606 7 2001:7f8:b7::a504:1441:2 4 41441 12733337 107304 0 0 0 08w2d12h 50606 7 Total number of neighbors 67 pim@nlams0:~$ show protocols ospfv3 neighbor Neighbor ID Pri DeadTime State/IfState Duration I/F[State] 194.1.163.7 1 00:00:32 Full/PointToPoint 62d21:41:24 dp0p6s0f3.100[PointToPoint] 194.1.163.34 1 00:00:39 Full/PointToPoint 27d22:28:30 dp0p6s0f3.200[PointToPoint] There are three full IPv4 and IPv6 transit providers: AS51088 (A2B Internet, thanks Erik!), AS8283 (Coloclue) and AS25091 (IP-Max, thanks Fred!). Also, the router is connected directly to Speed-IX, LSIX, FrysIX and NL-IX. Along with the many other internet exchanges I\u0026rsquo;ve connected to, it puts my humble AS50869 as #5 best connected ISP in Switzerland!\nI mean, just look at that stability, BGP sessions often times up as long as the machine has been there (remember, I deployed nlams0.ipng.ch only in May, so 9 weeks is all we can ask for!). OSPF uptime (helpfully shown with duration with OSPFv3 on FRR) is impeccable as well. The link with 27d of uptime is because I took out that router for maintenance 27 days ago to upgrade it to a preliminary version of DANOS + Bird2, as I prepare the move to VPP + Bird2 later this year.\nA note on mental health Mental health includes our emotional, psychological, and social well-being. It affects how we think, feel, and act. It also helps determine how we handle stress, relate to others, and make choices. Mental health is important at every stage of life, from childhood and adolescence through adulthood.\nIf you\u0026rsquo;ve read so far, thanks! I can imagine that some find this story a mixture of nerd and brag, and that\u0026rsquo;s OK. I am writing these stories because I find happiness in writing about the small and large technical things that I perceive as important to my feelings of accomplishment and therefor my wellbeing.\nI do many non-nerd and non-technical things, but I try to make it a habit of keeping my personal life off the internet (I\u0026rsquo;m not on social media and not often on digital messaging boards or chat apps). I could tell you equally enthusiastically about those hikes I took in Grindelwald, or those Bürli I baked, but that would have to be in person.\nWell-being is a positive outcome that is meaningful for people and for many sectors of society, because it tells us that people perceive that their lives are going well. 
However, many indicators that measure living conditions fail to measure what people think and feel about their lives, such as the quality of their relationships, their positive emotions and resilience, the realization of their potential, or their overall satisfaction with life.\nI find satisfaction in my modest dabbles with IPng Networks, both the software and the hardware and physical aspects of it. I encourage everybody to have a safe/fun place where they spend some meaningful time doing things that spark joy. To your health!\n","date":"2021-07-26","desc":"Introduction Many people maintain what is called a Bucketlist, a list of things they wish to do before they kick the bucket. I have one also, and although most of the items on that list are earthly and more on the emotional realm, and private, there is one specific thing that I have wanted to do ever since I first started working in IT in 1998: Peer at the Amsterdam Internet Exchange.\n","permalink":"https://ipng.ch/s/articles/2021/07/26/a-story-of-a-bucketlist/","section":"articles","title":"A story of a Bucketlist"},{"contents":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewed: Pascal Dornier \u0026lt;pdornier@pcengines.ch\u0026gt; Status: Draft - Review - Approved I did this test back in February, but can now finally publish the results! This little SBC is definitely going to be a hit in the ISP industry. See more information about it here.\nPC Engines develops and sells small single board computers for networking to a worldwide customer base. This article discusses a new/unreleased product which PC Engines has developed, which has specific significance in the network operator community: an SBC which comes with three RJ45/UTP based network ports, and one SFP optical port.\nExecutive Summary Due to the use of Intel i210-IS on the SFP port and i211-AT on the three copper ports, and due to it having no moving parts (fans, hard disks, etc), this SBC is an excellent choice for network appliances such as out-of-band or serial consoles in a datacenter, or routers in a small business or home office.\nDetailed findings The APU series boards typically ship with 2GB or 4GB of DRAM, 2, 3 or 4 Intel i211-AT network interfaces, and a four core AMD GX-412TC (running at 1GHz). This review is about the following APU6 unit, which comes with 4GB of DRAM (this preproduction unit came with 2GB, but that will be fixed in the production version), 3x i211-AT for the RJ45 network interfaces, and one i210-IS with an SFP cage.\nOne other significant difference is visible \u0026ndash; the trusty rusty DB9 connector that exposes the first serial RS232 port is replaced with a modern CP2104 (USB vendor 10c4:ea60) from Silicon Labs which exposes the serial port as TTL/serial on a micro USB connector rather than RS232, neat!\nTransceiver Compatibility The small form-factor pluggable (SFP) is a compact, hot-pluggable network interface module used for both telecommunication and data communications applications. An SFP interface on networking hardware is a modular slot for a media-specific transceiver in order to connect a fiber-optic cable or sometimes a copper cable. Such a slot is typically called a cage.\nThe SFP port accepts most/any optics brand and configuration (Copper, regular 850nm/1310nm/1550nm based, BiDi as commonly used in FTTH deployments, CWDM for use behind an OADM). I tried 6 different vendors and types, see below for results. All modules worked, regardless of vendor or brand.\nI tried 6 different SFP modules, all successfully. 
See the links in the list for an output of an optical diagnostics tool (using the SFF-8472 standard for SFP/SFP+ management).\nEach module provided link and passed traffic. The loadtest below was done with the BiDi optics in one interface and a boring RJ45 copper cable in another. It\u0026rsquo;s going to be fantastic to be able to use these APU6\u0026rsquo;s in a datacenter setting as remote / out-of-band serial devices, specifically nowadays where UTP is becoming a scarcity and everybody has fiber infrastructure in their racks.\nVendor Type Description Details Finisar FTLF8519P2BNL-RB 850nm duplex sfp0.txt Generic Unknown(no DOM) 850nm duplex sfp1.txt Cisco GLC-LH-SMD 1310nm duplex sfp2.txt Cisco SFP-GE-BX-D 1490nm Bidirectional (FTTH CPE) sfp3.txt Cisco SFP-GE-BX-U 1310nm Bidirectional (FTTH COR) sfp3.txt Cisco BT-OC24-20A 1550nm OC24 SDH sfp4.txt Finisar FTRJ1319P1BTL-C7 1310nm 20km (w/ 6dB attenuator) sfp5.txt Network Loadtest The choice of Intel i210/i211 network controller on this board allows operators to use Intel\u0026rsquo;s DPDK with relatively high performance, compared to regular (kernel) based routing. I loadtested Linux (Ubuntu 20.04), OpenBSD (6.8), and two lesser known but way cooler DPDK open source appliances called Danos (ref) and VPP (ref) respectively.\nSpecifically worth calling out that while Linux and OpenBSD struggled, both DPDK appliances had absolutely no problems filling a bidirectional gigabit stream of \u0026ldquo;regular internet traffic\u0026rdquo; (referred to as imix), and came close to line rate with \u0026ldquo;64b UDP packets\u0026rdquo;. The line rate of a gigabit ethernet is 1.48Mpps in one direction, and my loadtests stressed both directions simultaneously.\nMethodology For the loadtests, I used Cisco\u0026rsquo;s T-Rex (ref) in stateless mode, with a custom Python controller that ramps up and down traffic from the loadtester to the device under test (DUT) by sending traffic out port0 to the DUT, and expecting that traffic to be presented back out from the DUT to its port1, and vice versa (out from port1 -\u0026gt; DUT -\u0026gt; back in on port0). The loadtester first sends a few seconds of warmup, this is to ensure the DUT is passing traffic and offers the ability to inspect the traffic before the actual rampup. Then the loadteser ramps up linearly from zero to 100% of line rate (in our case, line rate is one gigabit in both directions), finally it holds the traffic at full line rate for a certain duration. 
If at any time the loadtester fails to see the traffic it\u0026rsquo;s emitting return on its second port, it flags the DUT as saturated; and this is noted as the maximum bits/second and/or packets/second.\nusage: trex-loadtest.bin [-h] [-s SERVER] [-p PROFILE_FILE] [-o OUTPUT_FILE] [-wm WARMUP_MULT] [-wd WARMUP_DURATION] [-rt RAMPUP_TARGET] [-rd RAMPUP_DURATION] [-hd HOLD_DURATION] T-Rex Stateless Loadtester -- pim@ipng.nl optional arguments: -h, --help show this help message and exit -s SERVER, --server SERVER Remote trex address (default: 127.0.0.1) -p PROFILE_FILE, --profile PROFILE_FILE STL profile file to replay (default: imix.py) -o OUTPUT_FILE, --output OUTPUT_FILE File to write results into, use \u0026#34;-\u0026#34; for stdout (default: -) -wm WARMUP_MULT, --warmup_mult WARMUP_MULT During warmup, send this \u0026#34;mult\u0026#34; (default: 1kpps) -wd WARMUP_DURATION, --warmup_duration WARMUP_DURATION Duration of warmup, in seconds (default: 30) -rt RAMPUP_TARGET, --rampup_target RAMPUP_TARGET Target percentage of line rate to ramp up to (default: 100) -rd RAMPUP_DURATION, --rampup_duration RAMPUP_DURATION Time to take to ramp up to target percentage of line rate, in seconds (default: 600) -hd HOLD_DURATION, --hold_duration HOLD_DURATION Time to hold the loadtest at target percentage, in seconds (default: 30) It\u0026rsquo;s worth pointing out that almost all systems are pps-bound not bps-bound. A typical rant I have is that network vendors are imprecise when they specify their throughput \u0026ldquo;up to 40Gbit\u0026rdquo; they more often than not mean \u0026ldquo;under carefully crafted conditions\u0026rdquo; such as utilizing jumboframes (9216 bytes rather than \u0026ldquo;usual\u0026rdquo; 1500 byte MTU found on ethernet, which is easier on the router than a typical internet mixture (closer to 1100 bytes), and much easier yet than if the router is asked to forward 64 byte packets, for instance in a DDoS attack); and only in one direction; and only using exactly one source/destination IP address/port, which is a little bit easier to do than to look up a destination in a forwarding table containing 1M destinations \u0026ndash; for context a current internet backbone router carries ~845K IPv4 destinations and ~105K IPv6 destinations.\nResults Product Loadtest Throughput (pps) Throughput (bps) % of linerate Details Linux imix 150.21 Kpps 452.81 Mbps 45.28% apu6-linux-imix.json OpenBSD imix 145.52 Kpps 444.51 Mbps 44.45% apu6-openbsd-imix.json VPP imix 654.40 Kpps 2.00 Gbps 199.90% apu6-vpp-imix.json Danos imix 655.53 Kpps 2.00 Gbps 200.24% apu6-danos-imix.json Linux 64b 96.93 Kpps 65.14 Mbps 6.51% apu6-linux-64b.json OpenBSD 64b 152.09 Kpps 102.20 Mbps 10.22% apu6-openbsd-64b.json VPP 64b 1.78 Mpps 1.19 Gbps 119.49% apu6-vpp-64b.json Danos 64b 2.30 Mpps 1.55 Gbps 154.62% apu6-danos-64b.json For more information on the methodology and the scripts that drew these graphs, take a look at my buddy Michal\u0026rsquo;s GitHub Page, which, given time, will probably turn into its own subsection of this website (I can only imagine the value of a corpus of loadtests of popular equipment in the consumer arena).\nCaveats The unit was shipped to me free of charge by PC Engines for the purposes of load- and systems integration testing. 
Other than that, this is not a paid endorsement, and the views in this review are my own.\nOpen Questions SFP I2C Considering the target audience, I wonder whether it would be possible to break out the I2C pins from the SFP cage into a header on the board, so that users can connect them through to the CPU\u0026rsquo;s I2C controller (or bitbang directly on GPIO pins), and use the APU6 as an SFP flasher. I think that would come in incredibly handy in a datacenter setting.\nCPU bound The DPDK based router implementations are CPU bound, and could benefit from a little bit more power. I am duly impressed by the throughput seen in terms of packets/sec/watt, but considering a typical router has a (forwarding) dataplane as well as a (configuration) controlplane, we are short about 30% of CPU cycles. If a controlplane (like Bird or FRR (ref)) is dedicated one core, that leaves three cores for forwarding, with which we obtain roughly 154% of line rate; we would need about 200/154 == 1.3x that to obtain full line rate (200%) in both directions. That said, the APU6 has absolutely no problems saturating a gigabit in both directions under normal (==imix) circumstances.\nAppendix Appendix 1 - Terminology Term Description OADM optical add drop multiplexer \u0026ndash; a device used in wavelength-division multiplexing systems for multiplexing and routing different channels of light into or out of a single mode fiber (SMF) ONT optical network terminal - The ONT converts fiber-optic light signals to copper based electric signals, usually Ethernet. OTO optical telecommunication outlet - The OTO is a fiber optic outlet that allows easy termination of cables in an office and home environment. Installed OTOs are referred to by their OTO-ID. CARP common address redundancy protocol - Its purpose is to allow multiple hosts on the same network segment to share an IP address. CARP is a secure, free alternative to the Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP). SIT simple internet transition - Its purpose is to interconnect isolated IPv6 networks, located in the global IPv4 Internet, via tunnels. STB set top box - a device that enables a television set to become a user interface to the Internet and also enables a television set to receive and decode digital television (DTV) broadcasts. GRE generic routing encapsulation - a tunneling protocol developed by Cisco Systems that can encapsulate a wide variety of network layer protocols inside virtual point-to-point links over an Internet Protocol network. L2VPN layer2 virtual private network - a service that emulates a switched Ethernet (V)LAN across a pseudo-wire (typically an IP tunnel) DHCP dynamic host configuration protocol - an IPv4 network protocol that enables a server to automatically assign an IP address to a computer from a defined range of numbers. DHCP6-PD Dynamic host configuration protocol: prefix delegation - an IPv6 network protocol that enables a server to automatically assign network prefixes to a customer from a defined range of numbers. NDP NS/NA neighbor discovery protocol: neighbor solicitation / advertisement - an ipv6 specific protocol to discover and judge reachability of other nodes on a shared link. NDP RS/RA neighbor discovery protocol: router solicitation / advertisement - an ipv6 specific protocol to discover and install local address and gateway information. SBC single board computer - a complete computer with all peripherals and components directly attached to the board.
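As a footnote to the results table above, here is a tiny sketch (my own arithmetic, not part of the original test tooling) showing how the percentage-of-linerate figures for the 64b tests follow from the measured packet rates, using the 1.48Mpps single-direction gigabit line rate mentioned earlier:

# Derive the 64b percentage-of-linerate figures from packets/sec. A gigabit link
# carries at most ~1.488 Mpps of 64 byte frames in one direction, since each
# frame occupies 64 + 20 = 84 bytes on the wire.
GIGE_64B_PPS = 1e9 / (84 * 8)

# measured 64b results from the table above: Linux, OpenBSD, VPP, Danos
for pps in (96.93e3, 152.09e3, 1.78e6, 2.30e6):
    print(round(100 * pps / GIGE_64B_PPS, 2))
# prints ~6.51, ~10.22, ~119.6 and ~154.6 -- matching the table; values over 100%
# mean the DUT forwards more than a full gigabit of 64 byte frames when loaded
# in both directions at once.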
","date":"2021-07-19","desc":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewed: Pascal Dornier \u0026lt;pdornier@pcengines.ch\u0026gt; Status: Draft - Review - Approved I did this test back in February, but can now finally publish the results! This little SBC is definitely going to be a hit in the ISP industry. See more information about it here.\nPC Engines develops and sells small single board computers for networking to a worldwide customer base. This article discusses a new/unreleased product which PC Engines has developed, which has specific significance in the network operator community: an SBC which comes with three RJ45/UTP based network ports, and one SFP optical port.\n","permalink":"https://ipng.ch/s/articles/2021/07/19/review-pcengines-apu6-with-sfp/","section":"articles","title":"Review: PCEngines APU6 (with SFP)"},{"contents":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\nDeployment After our adventure in Frankfurt, Amsterdam, Lille, and Paris came to an end, I still had a few loose ends to tie up. In particular, in Lille I had dropped an old Dell R610 while waiting for new Supermicros to be delivered. There is benefit to having one standard footprint setup, in my case an PCEngines APU2, Supermicro 5018D-FN8T and Intel X710-DA4 expansion NIC. They run fantastic with DANOS and VPP applications.\nOf course, we mustn\u0026rsquo;t forget home base, Geneva, where IP-Max has its headquarters in a beautiful mansion pictured here. At the same time, my family likes to take one trip per month to a city we don\u0026rsquo;t usually go, sort of to keep up with real life as we are now more and more able to travel. Marina has a niece in Geneva, who has lived and worked there for 20+ years, so we figured we\u0026rsquo;d combine these things and stay the weekend at her place.\nAfter making our way from Zurich to Geneva, a trip that took us just short of six hours (!) by car, we arrived at the second half of the Belgium:Italy eurocup soccer match. It was perhaps due to our tardiness and lack of physical supportering, that the belgians lost the match that day. Sorry!\nConnectivity My current circuit runs from Paris (Leon Frot), frpar0.ipng.ch over a direct DWDM wave to Zurich where I pick it up on chgtg0.ipng.ch at Interxion Glattbrugg. So what we\u0026rsquo;ll do is break open this VLL at the IP-Max side, insert the new router chplo0.ipng.ch, and reconfigure the Paris side to go to the new router, and the new router to create another VLL back to Zurich, which due to the toplogy of IP-Max\u0026rsquo;s underlying DWDM network will traverse Paris - Lyon - Geneva instead (shaving off ~1.5ms of latency at the same time).\nI hung up the APU2 OOB server and the 5018D-FN8T router, and another Dell R610 to run virtual machines at Safehost SH1 in Plan-les-Ouates, a southern suburb of Geneva. I connected one 10G port to er01.gva20.ip-max.net and another 10G port to er02.gva20.ip-max.net to obtain maximum availability benefits. 
As an example of what the configuration on the ASR9k platform looks like for this type of operation, here\u0026rsquo;s what I committed on er01.gva20.\nOf course, first things first: let\u0026rsquo;s ensure that the OOB machine has connectivity, by allocating a /64 IPv6 and /29 IPv4. I usually configure myself a BGP transit session in the same subnet, which means we\u0026rsquo;ll want to bridge the 1G UTP connection of the APU with the 10G fiber connection of the Supermicro router, like so:\ninterface BVI911 description Cust: IPng OOB and Transit ipv4 address 46.20.250.105 255.255.255.248 ipv4 unreachables disable ipv6 nd suppress-ra ipv6 address 2a02:2528:ff05::1/64 ipv6 enable load-interval 30 ! interface GigabitEthernet0/7/0/38 description Cust: IPng APU (OOB) mtu 9064 load-interval 30 l2transport ! ! interface TenGigE0/1/0/3 description Cust: IPng (VLL and Transit) mtu 9014 ! interface TenGigE0/1/0/3.911 l2transport encapsulation dot1q 911 exact rewrite ingress tag pop 1 symmetric mtu 9018 ! l2vpn bridge group BG_IPng bridge-domain BD_IPng911 interface Te0/1/0/3.911 ! interface GigabitEthernet0/7/0/38 ! routed interface BVI911 ! ! ! After this, we pulled UTP cable and configured the APU2, which then has an internal network towards the IPMI port of the Supermicro, and from there on, the configuration becomes much easier. Of course, all config can be done wirelessly, because the APU console.plo.ipng.nl acts as a WiFi access point, so I connect to it and commit the network configs.\nOnce that\u0026rsquo;s online and happy, the router chplo0.ipng.ch is next. For this, on er02.par02.ip-max.net, I reconfigure the current VLL to point to the loopback of this router er01.gva20.ip-max.net using the same pw-id. Then, I can configure this router as follows:\ninterface TenGigE0/1/0/3.100 l2transport description Cust: IPng VLL to par02 encapsulation dot1q 100 rewrite ingress tag pop 1 symmetric mtu 9018 ! l2vpn pw-class EOMPLS-PW-CLASS encapsulation mpls transport-mode ethernet ! ! xconnect group IPng p2p IPng_to_par02 interface TenGigE0/1/0/3.100 neighbor ipv4 46.20.255.33 pw-id 210535705 pw-class EOMPLS-PW-CLASS ! ! ! The results And with that, the pseudowire is constructed, and the original interface on frpar0.ipng.ch directly sees the interface here on chplo0.ipng.ch using jumboframes of 9000 bytes (+14 bytes of ethernet overhead and +4 bytes of VLAN tag on the ingress interface). It is as if the routers are directly connected by a very long ethernet cable, a pseudo-wire if you wish. Super low pingtimes are observed between this new router in Geneva and the existing two in Paris and Zurich:\npim@chplo0:~$ /bin/ping -4 -c5 frpar0 PING frpar0.ipng.ch (194.1.163.33) 56(84) bytes of data. 
64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=1 ttl=64 time=8.78 ms 64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=2 ttl=64 time=8.80 ms 64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=3 ttl=64 time=8.81 ms 64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=4 ttl=64 time=8.82 ms 64 bytes from frpar0.ipng.ch (194.1.163.33): icmp_seq=5 ttl=64 time=8.85 ms --- frpar0.ipng.ch ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 10ms rtt min/avg/max/mdev = 8.783/8.810/8.846/0.104 ms pim@chplo0:~$ /bin/ping -6 -c5 chgtg0 PING chgtg0(chgtg0.ipng.ch (2001:678:d78::1)) 56 data bytes 64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=1 ttl=64 time=4.51 ms 64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=2 ttl=64 time=4.44 ms 64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=3 ttl=64 time=4.36 ms 64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=4 ttl=64 time=4.47 ms 64 bytes from chgtg0.ipng.ch (2001:678:d78::1): icmp_seq=5 ttl=64 time=4.41 ms --- chgtg0 ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 10ms rtt min/avg/max/mdev = 4.362/4.436/4.506/0.077 ms For good measure I\u0026rsquo;ve also connected to FreeIX, a new internet exchange project I\u0026rsquo;m working on, that will span the Geneva, Zurich and Lugano areas. More on that in a future post!\npim@chplo0:~$ iperf3 -4 -c 185.1.205.1 ## chgtg0.ipng.ch Connecting to host 185.1.205.1, port 5201 [ 5] local 185.1.205.2 port 46872 connected to 185.1.205.1 port 5201 [ ID] Interval Transfer Bitrate Retr Cwnd [ 5] 0.00-1.00 sec 809 MBytes 6.78 Gbits/sec 4 11.4 MBytes [ 5] 1.00-2.00 sec 869 MBytes 7.29 Gbits/sec 0 11.4 MBytes [ 5] 2.00-3.00 sec 865 MBytes 7.25 Gbits/sec 0 11.4 MBytes [ 5] 3.00-4.00 sec 868 MBytes 7.28 Gbits/sec 0 11.4 MBytes [ 5] 4.00-5.00 sec 836 MBytes 7.01 Gbits/sec 0 11.4 MBytes [ 5] 5.00-6.00 sec 852 MBytes 7.15 Gbits/sec 0 11.4 MBytes [ 5] 6.00-7.00 sec 865 MBytes 7.26 Gbits/sec 0 11.4 MBytes [ 5] 7.00-8.00 sec 865 MBytes 7.26 Gbits/sec 0 11.4 MBytes [ 5] 8.00-9.00 sec 861 MBytes 7.22 Gbits/sec 0 11.4 MBytes [ 5] 9.00-10.00 sec 860 MBytes 7.22 Gbits/sec 0 11.4 MBytes - - - - - - - - - - - - - - - - - - - - - - - - - [ ID] Interval Transfer Bitrate Retr [ 5] 0.00-10.00 sec 8.35 GBytes 7.17 Gbits/sec 4 sender [ 5] 0.00-10.01 sec 8.35 GBytes 7.16 Gbits/sec receiver iperf Done. You kind of get used to performance stats like this, but that said, it\u0026rsquo;s nice to see that performance over FreeIX is slightly lower than performance over the IPng backbone, and this is because on my VLLs, I can make use of jumbo frames, which gives me 20% or so better performance (currently 9.62 Gbits/sec).\nCurrently I\u0026rsquo;m busy at work in the background completing the configuration, the management environment and physical infrastructure for the internet exchange. I\u0026rsquo;m planning to make a more complete post about the FreeIX project in a few weeks once it\u0026rsquo;s ready for launch. Stay tuned!\n","date":"2021-07-03","desc":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. 
IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\n","permalink":"https://ipng.ch/s/articles/2021/07/03/ipng-arrives-in-geneva/","section":"articles","title":"IPng arrives in Geneva"},{"contents":"I\u0026rsquo;m one of those people who is a fan of low-latency and high performance distributed service architectures. After building out the IPng Network across europe, I did notice a rather stark difference in presence of one particular service: AS112 anycast nameservers. In particular, I only have one Internet Exchange in common with a direct presence of AS112, FCIX in California. Big-up to the kind folks in Fremont who operate www.as112.net.\nThe Problem Looking around Switzerland, no internet exchanges actually have AS112 as a direct member and as such you\u0026rsquo;ll find the service tucked away behind several ISPs, with AS paths such as 13030 29670 112, 6939 112 and 34019 112. A traceroute from a popular swiss ISP, Init7 will go to Germany, at a roundtrip latency of 18.9ms. My own latency is 146ms as my queries are served from FCIX:\npim@spongebob:~$ traceroute prisoner.iana.org traceroute to prisoner.iana.org (192.175.48.1), 64 hops max, 40 byte packets 1 fiber7.xe8.chbtl0.ipng.ch (194.126.235.33) 2.658 ms 0.754 ms 0.523 ms 2 1790bre1.fiber7.init7.net (81.6.42.1) 1.132 ms 1.077 ms 3.621 ms 3 780eff1.fiber7.init7.net (109.202.193.44) 1.238 ms 1.162 ms 1.188 ms 4 r1win12.core.init7.net (77.109.181.155) 2.096 ms 2.1 ms 2.1 ms 5 r1zrh6.core.init7.net (82.197.168.222) 2.086 ms 3.904 ms 2.183 ms 6 r1glb1.core.init7.net (5.180.135.134) 2.043 ms 3.621 ms 2.088 ms 7 r2zrh2.core.init7.net (82.197.163.213) 2.353 ms 2.522 ms 2.289 ms 8 r2zrh2.core.init7.net (5.180.135.156) 2.08 ms 2.299 ms 2.202 ms 9 r1fra3.core.init7.net (5.180.135.173) 7.65 ms 7.582 ms 7.546 ms 10 r1fra2.core.init7.net (5.180.135.126) 7.928 ms 7.831 ms 7.997 ms 11 r1ber1.core.init7.net (77.109.129.8) 19.395 ms 19.287 ms 19.558 ms 12 octalus.in-berlin.a36.community-ix.de (185.1.74.3) 18.839 ms 18.717 ms 29.615 ms 13 prisoner.iana.org (192.175.48.1) 18.536 ms 18.613 ms 18.766 ms pim@chumbucket:~$ traceroute blackhole-1.iana.org traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.67) 0.247 ms 0.158 ms 0.107 ms 2 chgtg0.ipng.ch (194.1.163.19) 0.514 ms 0.474 ms 0.419 ms 3 usfmt0.ipng.ch (194.1.163.23) 146.451 ms 146.406 ms 146.364 ms 4 blackhole-1.iana.org (192.175.48.6) 146.323 ms 146.281 ms 146.239 ms This path goes to FCIX because it\u0026rsquo;s the only place where AS50869 picks up AS112 directly, at an internet exchange, and therefore the localpref will make this route preferred. But that\u0026rsquo;s a long way to go for my DNS queries!\nI think we can do better.\nIntroduction Taken from RFC7534:\nMany sites connected to the Internet make use of IPv4 addresses that are not globally unique. Examples are the addresses designated in RFC 1918 for private use within individual sites.\nDevices in such environments may occasionally originate Domain Name System (DNS) queries (so-called \u0026ldquo;reverse lookups\u0026rdquo;) corresponding to those private-use addresses. Since the addresses concerned have only local significance, it is good practice for site administrators to ensure that such queries are answered locally. 
However, it is not uncommon for such queries to follow the normal delegation path in the public DNS instead of being answered within the site.\nIt is not possible for public DNS servers to give useful answers to such queries. In addition, due to the wide deployment of private-use addresses and the continuing growth of the Internet, the volume of such queries is large and growing. The AS112 project aims to provide a distributed sink for such queries in order to reduce the load on the corresponding authoritative servers. The AS112 project is named after the Autonomous System Number (ASN) that was assigned to it.\nDeployment It\u0026rsquo;s actually quite straight forward; the deployment consists of roughly three steps:\nProcure hardware to run the instances of the nameserver on. Configure the nameserver to serve the zonefiles. Announce the anycast service locally/regionally. Let\u0026rsquo;s discuss each in turn.\nHardware For the hardware, I\u0026rsquo;ve decided to use the existing server platforms at IP-Max and IPng Networks. There are two types of hardware, both tried and tested: one is an HP ProLiant DL380 Gen9, and the other is an older Dell PowerEdge R610.\nSince each vendor ships specific parts and the two differ, many appliance vendors choose to virtualize their environment such that the guest operating system finds a very homogeneous configuration. For my purposes, the virtualization platform is Xen and the guest is a (para)virtualized Debian.\nI will be starting with three nodes, one in Geneva and one in Zurich, hosted on hypervisors of IP-Max, and one in Amsterdam, hosted on a hypervisor of IPng. I have a feeling a few more places will follow.\nInstall the OS Xen makes this repeatable and straight forward. Other systems, such as KVM, have very similar installers, for example VMBuilder is popular. Both work roughly the same way, and install a guest in a matter of minutes.\nI\u0026rsquo;ll install to an LVM volume group on all machines, backed by pairs of SSDs for throughput and redundancy. We\u0026rsquo;ll give the guest 4GB of memory and 4 CPUs. I love how the machine boots using PyGrub, fully on serial, and is fully booted and running in 20 seconds.\nsudo xen-create-image --hostname as112-1.free-ix.net --ip 46.20.249.197 \\ --vcpus 4 --pygrub --dist buster --lvm=vg1_hvn04_gva20 sudo xl create -c as112-1.free-ix.net.cfg After logging in, the following additional software was installed. We\u0026rsquo;ll be using Bird2, which comes from Debian Buster\u0026rsquo;s backports. Otherwise, we\u0026rsquo;re pretty vanilla:\n$ cat \u0026lt;\u0026lt; EOF | sudo tee -a /etc/apt/sources.list # # Backports # deb http://deb.debian.org/debian buster-backports main EOF $ sudo apt update $ sudo apt install tcpdump sudo net-tools bridge-utils nsd bird2 \\ netplan.io traceroute ufw curl bind9-dnsutils $ sudo apt purge ifupdown I removed the /etc/network/interfaces approach and configured Netplan, a personal choice, which aligns the machines more closely with other servers in the IPng fleet.
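Before the anycast service addresses come into play, the guest needs a plain static definition for its primary interface. Here is a minimal Netplan sketch of that; the interface name, prefix lengths, gateways and file layout below are illustrative assumptions rather than the exact values used on these hosts:

$ cat << EOF | sudo tee /etc/netplan/10-primary.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    enX0:                            # interface name assumed
      addresses:
        - 46.20.249.197/28           # prefix length assumed
        - 2a02:2528:a04:202::197/64
      gateway4: 46.20.249.193        # hypothetical gateways
      gateway6: 2a02:2528:a04:202::1
EOF
$ sudo netplan apply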
The only trick is to ensure that the anycast IP addresses are available for the nameserver to listen on, so at the top of Netplan\u0026rsquo;s configuration file, we add them like so:\nnetwork: version: 2 renderer: networkd ethernets: lo: addresses: - 127.0.0.1/8 - ::1/128 - 192.175.48.1/32 # prisoner.iana.org (anycast) - 2620:4f:8000::1/128 # prisoner.iana.org (anycast) - 192.175.48.6/32 # blackhole-1.iana.org (anycast) - 2620:4f:8000::6/128 # blackhole-1.iana.org (anycast) - 192.175.48.42/32 # blackhole-2.iana.org (anycast) - 2620:4f:8000::42/128 # blackhole-2.iana.org (anycast) - 192.31.196.1/32 # blackhole.as112.arpa (anycast) - 2001:4:112::1/128 # blackhole.as112.arpa (anycast) Nameserver My nameserver of choice is NSD, and its configuration is similar to BIND, which is described in RFC7534. In fact, the zone files are identical, so all we should do is create a few listen statements and load up the zones:\n$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/nsd/nsd.conf.d/listen.conf server: ip-address: 127.0.0.1 ip-address: ::1 ip-address: 46.20.249.197 ip-address: 2a02:2528:a04:202::197 ip-address: 192.175.48.1 # prisoner.iana.org (anycast) ip-address: 2620:4f:8000::1 # prisoner.iana.org (anycast) ip-address: 192.175.48.6 # blackhole-1.iana.org (anycast) ip-address: 2620:4f:8000::6 # blackhole-1.iana.org (anycast) ip-address: 192.175.48.42 # blackhole-2.iana.org (anycast) ip-address: 2620:4f:8000::42 # blackhole-2.iana.org (anycast) ip-address: 192.31.196.1 # blackhole.as112.arpa (anycast) ip-address: 2001:4:112::1 # blackhole.as112.arpa (anycast) server-count: 4 EOF $ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/nsd/nsd.conf.d/as112.conf zone: name: \u0026#34;hostname.as112.net\u0026#34; zonefile: \u0026#34;/etc/nsd/master/db.hostname.as112.net\u0026#34; zone: name: \u0026#34;hostname.as112.arpa\u0026#34; zonefile: \u0026#34;/etc/nsd/master/db.hostname.as112.arpa\u0026#34; zone: name: \u0026#34;10.in-addr.arpa\u0026#34; zonefile: \u0026#34;/etc/nsd/master/db.dd-empty\u0026#34; # etcetera EOF While all of the zones are captured by db.dd-empty or db.dr-empty, which can be found in the RFC text, I\u0026rsquo;ll note the top two are special, as they are specific to the instance. For example on our Geneva instance:\n$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/nsd/master/db.hostname.as112.arpa $TTL 1W @ SOA chplo01.paphosting.net. noc.ipng.ch. ( 1 ; serial number 1W ; refresh 1M ; retry 1W ; expire 1W ) ; negative caching TTL NS blackhole.as112.arpa. TXT \u0026#34;AS112 hosted by IPng Networks\u0026#34; \u0026#34;Geneva, Switzerland\u0026#34; TXT \u0026#34;See https://www.as112.net/ for more information.\u0026#34; TXT \u0026#34;See https://free-ix.net/ for local information.\u0026#34; TXT \u0026#34;Unique IP: 194.1.163.147\u0026#34; TXT \u0026#34;Unique IP: [2001:678:d78:7::147]\u0026#34; LOC 46 9 55.501 N 6 6 25.870 E 407.00m 10m 100m 10m This is super helpful to users, who want to know which server, exactly, is serving their request. Not all operators added the Unique IP details, but I found it useful when launching the service, as several anycast nodes quickly become confusing otherwise :-)\nAfter this is all done, the nameserver can be started. 
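A quick sanity check and (re)start of the daemon might look like this, assuming the stock Debian systemd unit:

$ sudo nsd-checkconf /etc/nsd/nsd.conf      # parse the main config and all included files
$ sudo systemctl restart nsd
$ sudo journalctl -u nsd -n 20 --no-pager   # confirm the zones were loaded without errors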
I rebooted the guest for good measure, and about 19 seconds later (a fact that continues to amaze me), the server was up and serving queries, albeit only from localhost because there is no way to reach the server on the network, yet.\nTo validate things work, we can do a few SOA or TXT queries, like this one:\npim@nlams01:~$ ping -c5 -q prisoner.iana.org PING prisoner.iana.org(prisoner.iana.org (2620:4f:8000::1)) 56 data bytes --- prisoner.iana.org ping statistics --- 5 packets transmitted, 5 received, 0% packet loss, time 34ms rtt min/avg/max/mdev = 0.041/0.045/0.053/0.004 ms pim@nlams01:~$ dig @prisoner.iana.org hostname.as112.net TXT +short +norec \u0026#34;AS112 hosted by IPng Networks\u0026#34; \u0026#34;Amsterdam, The Netherlands\u0026#34; \u0026#34;See http://www.as112.net/ for more information.\u0026#34; \u0026#34;Unique IP: 94.142.241.187\u0026#34; \u0026#34;Unique IP: [2a02:898:146::2]\u0026#34; Network Now comes the fun part! We\u0026rsquo;re running these instances of the nameservers in a few locations, and to ensure we don\u0026rsquo;t route traffic to the incorrect location, we\u0026rsquo;ll announce them using BGP as per recommendation of RFC7534.\nMy choice of routing suite is Bird2, which comes with a lot of extensiblility and a programmatic validation of routing policies.\nWe\u0026rsquo;ll only be using static and BGP routing protocols for Bird, so the configuration is relatively straight forward, first we create a routing table export for IPv4 and IPv6, then we define some static Nullroutes, which ensure that our prefixes are always present in the RIB (otherwise BGP will not export them), then we create some filter functions (one for routeserver sessions, one for peering sessions, and one for transit sessions), and finally we include a few specific configuration files, one-per-environment where we\u0026rsquo;ll be active.\n$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/bird/bird.conf router id 46.20.249.197; protocol kernel fib4 { ipv4 { export all; }; scan time 60; } protocol kernel fib6 { ipv6 { export all; }; scan time 60; } protocol static static_as112_ipv4 { ipv4; route 192.175.48.0/24 blackhole; route 192.31.196.0/24 blackhole; } protocol static static_as112_ipv6 { ipv6; route 2620:4f:8000::/48 blackhole; route 2001:4:112::/48 blackhole; } include \u0026#34;bgp-freeix.conf\u0026#34;; include \u0026#34;bgp-ipng.conf\u0026#34;; include \u0026#34;bgp-ipmax.conf\u0026#34;; EOF The configuration file per environment, say bgp-freeix.conf, can (and will) be autogenerated, but the pattern is of the following form:\n$ cat \u0026lt;\u0026lt; EOF | tee /etc/bird/bgp-freeix.conf # # Bird AS112 configuration for FreeIX # define my_ipv4 = 185.1.205.252; define my_ipv6 = 2001:7f8:111:42::70:1; protocol bgp freeix_as51530_1_ipv4 { description \u0026#34;FreeIX - AS51530 - Routeserver #1\u0026#34;; local as 112; source address my_ipv4; neighbor 185.1.205.254 as 51530; ipv4 { import where fn_import_routeserver( 51530 ); export where proto = \u0026#34;static_as112_ipv4\u0026#34;; import limit 120000 action restart; }; } protocol bgp freeix_as51530_1_ipv6 { description \u0026#34;FreeIX - AS51530 - Routeserver #1\u0026#34;; local as 112; source address my_ipv6; neighbor 2001:7f8:111:42::c94a:1 as 51530; ipv6 { import where fn_import_routeserver( 51530 ); export where proto = \u0026#34;static_as112_ipv6\u0026#34;; import limit 120000 action restart; }; } # etcetera EOF If you\u0026rsquo;ve seen IXPManager\u0026rsquo;s approach to routeserver configuration generators, you\u0026rsquo;ll 
notice I borrowed the fn_import() function and its dependents from there. This allows imports to be specific towards prefix-lists, as-paths and ensure some Belts and Braces checks are in place (no invalid or tier1 ASN in the path, a valid nexthop, no tricks with AS path truncation, and so on).\nAfter bringing up the service, the prefixes make their way into the routeserver and get distributed to the FreeIX participants:\n$ sudo systemctl start bird $ sudo birdc show protocol BIRD 2.0.7 ready. Name Proto Table State Since Info fib4 Kernel master4 up 2021-06-28 11:01:35 fib6 Kernel master6 up 2021-06-28 11:01:35 device1 Device --- up 2021-06-28 11:01:35 static_as112_ipv4 Static master4 up 2021-06-28 11:01:35 static_as112_ipv6 Static master6 up 2021-06-28 11:01:35 freeix_as51530_1_ipv4 BGP --- up 2021-06-28 11:01:17 Established freeix_as51530_1_ipv6 BGP --- up 2021-06-28 11:01:19 Established freeix_as51530_2_ipv4 BGP --- up 2021-06-28 11:01:32 Established freeix_as51530_2_ipv6 BGP --- up 2021-06-28 11:01:37 Established Internet Exchanges Having one configuration file per group helps a lot with integration of IXPManager where we might autogenerate the IXP versions of these files and install them periodically. That way, when members enable the AS112 peering checkmark, the servers will automatically download and set up those sessions without human involvement \u0026ndash; typically this is the best way to avoid outages: never tinker with production config files by hand. We\u0026rsquo;ll test this out with FreeIX, but hope as well to offer our service to other internet exchanges, notably SwissIX and CIXP.\nOne of the huge benefits of operating within IP-Max network is their ability to do L2VPN transport from any place on-net to any other router. As such, connecting these virtual machines to other places, like SwissIX, CIXP, CHIX-CH, Community-IX or other further away places, is a piece of cake. All we must do is create an L2VPN and offer it to the hypervisor (which usually is connected via a LACP BundleEthernet) on some VLAN, after which we can bridge that into the guest OS by creating a new virtio NIC. This is how, in the example above, our AS112 machines were introduced to FreeIX. This scales very well, requiring only one guest reboot per internet exchange, and greatly simplifies operations.\nMonitoring Of course, one would not want to run a production service, certainly not on the public internet, without a bit of introspection and monitoring.\nThere are four things we might want to ensure:\nIs the machine up and healthy? For this we use NAGIOS. Is NSD serving? For this we use NSD Exporter and Prometheus/Grafana. Is NSD reachable? For this we use CloudProber. If there is an issue, can we alert an operator? For this we use Telegram. In a followup post, I\u0026rsquo;ll demonstrate how these things come together into a comprehensive anycast monitoring and alerting solution. As a fringe benefit we can show contemporary graphs and dashboards. 
But seeing as the service hasn\u0026rsquo;t yet gotten a lot of mileage, it deserves its own followup post, some time in August.\nThe results First things first - latency went waaaay down:\npim@chumbucket:~$ traceroute blackhole-1.iana.org traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.67) 0.257 ms 0.199 ms 0.159 ms 2 chgtg0.ipng.ch (194.1.163.19) 0.468 ms 0.430 ms 0.430 ms 3 chrma0.ipng.ch (194.1.163.8) 0.648 ms 0.611 ms 0.597 ms 4 blackhole-1.iana.org (192.175.48.6) 1.272 ms 1.236 ms 1.201 ms pim@chumbucket:~$ dig -6 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp \u0026#34;Free-IX hosted by IP-Max SA\u0026#34; \u0026#34;Zurich, Switzerland\u0026#34; \u0026#34;See https://www.as112.net/ for more information.\u0026#34; \u0026#34;See https://free-ix.net/ for local information.\u0026#34; \u0026#34;Unique IP: 46.20.246.67\u0026#34; \u0026#34;Unique IP: [2a02:2528:1703::67]\u0026#34; and this demonstrates why it\u0026rsquo;s super useful to have the hostname.as112.net entry populated well. If I\u0026rsquo;m in Amsterdam, I\u0026rsquo;ll be served by the local node there:\npim@gripe:~$ traceroute6 blackhole-2.iana.org traceroute6 to blackhole-2.iana.org (2620:4f:8000::42), 64 hops max, 60 byte packets 1 nlams0.ipng.ch (2a02:898:146::1) 0.744 ms 0.879 ms 0.818 ms 2 blackhole-2.iana.org (2620:4f:8000::42) 1.104 ms 1.064 ms 1.035 ms pim@gripe:~$ dig -4 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp \u0026#34;Hosted by IPng Networks\u0026#34; \u0026#34;Amsterdam, The Netherlands\u0026#34; \u0026#34;See http://www.as112.net/ for more information.\u0026#34; \u0026#34;Unique IP: 94.142.241.187\u0026#34; \u0026#34;Unique IP: [2a02:898:146::2]\u0026#34; Of course, due to anycast, and me being in Zurich, I will be served primarily by the Zurich node. If it were to go down for maintenance, or hardware failure, BGP will immediately converge on alternate paths, there are currently three to choose from:\npim@chrma0:~$ show protocols bgp ipv4 unicast 192.31.196.0/24 BGP routing table entry for 192.31.196.0/24 Paths: (10 available, best #2, table default) Advertised to non peer-group peers: 185.1.205.251 194.1.163.1 [...] 112 194.1.163.32 (metric 137) from 194.1.163.32 (194.1.163.32) Origin IGP, localpref 400, valid, internal Community: 50869:3500 50869:4099 50869:5055 Last update: Mon Jun 28 11:13:14 2021 112 185.1.205.251 from 185.1.205.251 (46.20.246.67) Origin IGP, localpref 400, valid, external, bestpath-from-AS 112, best (Local Pref) Community: 50869:3500 50869:4099 50869:5000 50869:5020 50869:5060 Last update: Mon Jun 28 11:00:45 2021 112 185.1.205.251 from 185.1.205.253 (185.1.205.253) Origin IGP, localpref 200, valid, external Community: 50869:1061 Last update: Mon Jun 28 11:00:20 2021 (and more) I am expecting a few more direct paths to come, as I harden this service, and offer it to other swiss internet exchange points in the future. But mostly, my mission of reducing the round trip time from 146ms to 1ms from my desktop at home was successfully accomplished.\n","date":"2021-06-28","desc":"I\u0026rsquo;m one of those people who is a fan of low-latency and high performance distributed service architectures. After building out the IPng Network across europe, I did notice a rather stark difference in presence of one particular service: AS112 anycast nameservers. In particular, I only have one Internet Exchange in common with a direct presence of AS112, FCIX in California. 
Big-up to the kind folks in Fremont who operate www.as112.net.\nThe Problem Looking around Switzerland, no internet exchanges actually have AS112 as a direct member and as such you\u0026rsquo;ll find the service tucked away behind several ISPs, with AS paths such as 13030 29670 112, 6939 112 and 34019 112. A traceroute from a popular swiss ISP, Init7 will go to Germany, at a roundtrip latency of 18.9ms. My own latency is 146ms as my queries are served from FCIX:\n","permalink":"https://ipng.ch/s/articles/2021/06/28/launch-of-as112/","section":"articles","title":"Launch of AS112"},{"contents":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\nFirst a word I just wanted to start with a note on how special it is to partner with IP-Max and having known the founder Fred for so long affords me this specific trip. It must be known that building a 10G pan-european ring is not an easy thing to do, and I do appreciate very much the kindness Fred has shown IPng even though we are under contract \u0026ndash; the ability to travel to every point of presence and get the founder\u0026rsquo;s tour in each place, get to know the local DC ops, sales directors, field technicians and local customers, is simply golden. Thank you.\nDeployment The city of love, corny as it sounds, also happens to have quite a bit of fibre noire, happily lit by hundreds of local and international carriers. There are two places in the dead-center of the city, a cabaret voltaire called Telehouse TH2, and a new facility from KDDI which is in the same physical block, aptly addressed as 65 Rue Léon Frot, 75011 Paris and that is where my beloved router frpar0.ipng.ch will be. Note that in Lille, actually I had to make do with frggh0 which was in the town of Sainghin-en-Mélantois. This one lives in Paris, none of that suburban bullshit. This is the real deal. There\u0026rsquo;s probably more connectivity in this one block than in all of the Paris metro combined, maybe even more than all of France combined \u0026ndash; peut être :)\nAfter having visited the older location, we took the router, APU, cables and optics to Léon Frot. The rack was quickly found, and it is obvious that this is the location: A fridge was awaiting us, and I reserved two Tengig ports on the ASR9010, one towards Lille and one towards Zürich (which will be Geneva later on).\nI\u0026rsquo;m getting the hang of this VLL stuff after our adventures, previously in Lille at CIV, which is a lazy 4.8ms away from this place (Fred already speaks of making that more like 3.7ms with a small call to his buddy Laurent). So I went about my business, racking first the WiFi enabled console.par.ipng.nl, connecting to WiFi with it, and finding my router on frpar0.ipng.ch over IPMI serial-over-lan. Configuring one Tengig port on the Intel X710 towards er01.lil01.ip-max.net and another Tengig port towards er01.zrh56.ip-max.net.\nI have two Supermicros on backorder, one of which will go to Lille to replace the Dell R610 that I placed temporarily in that site, and the other will go to Geneva. 
At that point, I will break up this VLL to become one from here to Geneva and another from Geneva back home to Zürich. Seriously though I will have to stop Fred\u0026rsquo;s enthusiasm because he also mentioned somthing about LyonIX and another small stopover of 1HE and 35W over there\u0026hellip; this is addictive, save me from myself!!1\nConnectivity on the FLAP All week, Fred has been talking about the FLAP, which I know to be a term but I really never bothered to ask about it \u0026ndash; it turns out, he explained, that it stands for Frankfurt, London, Amsterdam, and Paris. I\u0026rsquo;m not in London (yet, please don\u0026rsquo;t dare me \u0026hellip;), however I can legitimately claim I am on the FLAP because I have a router in Lille. So there\u0026rsquo;s that :)\nFred has ordered my FranceIX connection this afternoon, delivered from a 20Gig LAG on er02.par02.ip-max.net and directly into my router there. In the mean time, I will be busy configuring my DE-CIX port from a previous post.\nThe console server here (a standard issue APU3 with 802.11ac WiFi broadcasting AS50869 PAR with password IPngGuest, you\u0026rsquo;re welcome), connects to the router with IPMI, while the router itself connects via USB serial back to the APU for maximum resilience.\nA hard knock life It was not without troubles today. When configuring my VLL to Zurich, I had misconfigured the er01.zrh56.ip-max.net side (a Cisco 7600, which is on its way out), and the VLL would not come up. I could see traffic going in one direction but not in the other \u0026hellip; which typically does not make OSPF adjacencies happen. After about an hour of messing around, I puppydog-eyed to Fred who proceeded to find my bug within 30 seconds: I needed to do some VLAN gymnastics by adding rewrite ingress tag pop 1 symmetric and as well adding 4 bytes to the MTU (so mtu 9018 total, cuz packets gotta be sourced directly from the jumbo-jumbo club!).\nBut, Fred was miserable as well because he had updated a Xen hypervisor which ended up not being able to boot because of a broken LVM configuration. So we literally swapped laptops and while he fixed my VLL, I fixed his volume group by running update-initramfs -u -k all from a recovery Debian USB stick. For an extra bonus, here\u0026rsquo;s a picture of Happy Fred at the local Cafe Leopard, where pretty much every day you can find a host of locals and international nerds who have emerged from the server floor.\nAnd Mael also helped with the serial port - I had put it into 115200 baud but not pinned it at 8N1 for the APU serial console. It shot into life as soon as he gave that tip and I committed the config, bravo!\nI tend to believe it will not be necessary for me to physically visit the facility that often \u0026ndash; simple hardware, no spinning disks, an APU connecting to IPMI for full HTML5 based KVM control and serial-over-lan, and the router exposing a console back to the APU, which has aan OOB network connection from AS25091. Yeah, I think I\u0026rsquo;ll be good.\nThe results But this was a special one indeed, because up until now, my traceroutes kept on getting longer and longer as I deployed in Frankfurt, Amsterdam, and Lille. 
Deploying in Paris therefore looked like this initially, as the packets took, let us say, the scenic route to my basement:\npim@frpar0:~$ traceroute chumbucket.ipng.nl traceroute to chumbucket.ipng.nl (194.1.163.93), 30 hops max, 60 byte packets 1 frggh0.ipng.ch (194.1.163.30) 4.915 ms 4.885 ms 4.866 ms 2 nlams0.ipng.ch (194.1.163.28) 12.396 ms 12.398 ms 12.382 ms 3 defra0.ipng.ch (194.1.163.26) 18.536 ms 18.520 ms 18.541 ms 4 chrma0.ipng.ch (194.1.163.24) 24.572 ms 24.557 ms 24.542 ms 5 chgtg0.ipng.ch (194.1.163.9) 24.549 ms 24.510 ms 24.517 ms 6 chbtl1.ipng.ch (194.1.163.18) 24.707 ms 25.114 ms 25.038 ms 7 chumbucket.ipng.nl (194.1.163.93) 25.320 ms 25.564 ms 25.452 ms That\u0026rsquo;s quite the scenic route indeed. But! On this glorious day, at exactly 16:34 UTC, the Tengig european IPv4 and IPv6 ring was closed, with one final set of OSPF adjacencies:\npim@frpar0:~$ show protocols ospfv3 neighbor Neighbor ID Pri DeadTime State/IfState Duration I/F[State] 194.1.163.34 1 00:00:37 Full/PointToPoint 01:40:51 dp0p6s0f0.100[PointToPoint] 194.1.163.1 1 00:00:33 Full/PointToPoint 00:01:28 dp0p6s0f1.100[PointToPoint] Which allowed the ring to hone in on shortest path at its best - East bound to Frankfurt and Amsterdam, and West bound to Paris and Lille. Link and equipment failures will not bother me that much, OSPF and OSPFv3 will take care of rerouting me around network problems, which considering the ASR9k at IP-Max, I do think will be the exception:\npim@chumbucket:~$ traceroute frggh0.ipng.ch traceroute to frggh0.ipng.ch (194.1.163.34), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.67) 0.317 ms 0.238 ms 0.190 ms 2 chgtg0.ipng.ch (194.1.163.19) 0.619 ms 0.574 ms 0.531 ms 3 frpar0.ipng.ch (194.1.163.40) 15.271 ms 15.226 ms 15.174 ms 4 frggh0.ipng.ch (194.1.163.34) 20.059 ms 20.020 ms 19.977 ms pim@chumbucket:~$ traceroute nlams0.ipng.ch traceroute to nlams0.ipng.ch (194.1.163.32), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.67) 0.345 ms 0.198 ms 0.284 ms 2 chgtg0.ipng.ch (194.1.163.19) 0.610 ms 0.518 ms 0.538 ms 3 chrma0.ipng.ch (194.1.163.8) 0.732 ms 0.750 ms 0.716 ms 4 defra0.ipng.ch (194.1.163.25) 6.835 ms 6.802 ms 6.767 ms 5 nlams0.ipng.ch (194.1.163.32) 12.799 ms 12.765 ms 12.731 ms Of course with impeccable throughput, bien sûr:\npim@frpar0:~$ iperf3 -c chgtg0.ipng.ch -R \u0026hellip; [ 5] 0.00-10.01 sec 11.2 GBytes 9.42 Gbits/sec 1 sender [ 5] 0.00-10.00 sec 11.2 GBytes 9.42 Gbits/sec receiver\nI\u0026#39;m tired, but ultimately satisfied on taking my private AS50869 across _the FLAP_, with a physical presence in each, and an IXP connection and TenGig bidirectional, and TenGig IP Transit at each location. I think this network is good to go for the next few years at least. ","date":"2021-06-01","desc":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\n","permalink":"https://ipng.ch/s/articles/2021/06/01/ipng-arrives-in-paris/","section":"articles","title":"IPng arrives in Paris"},{"contents":"I\u0026rsquo;ve been planning a network expansion for a while now. 
For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\nDeployment After our adventure in Amsterdam, and after Fred and I both got negative PCR test results, we made our way down to Lille, France. There are two datacenters there where IP-Max has a presence, and they are very innovative ones. There\u0026rsquo;s a specific trick with a block of frozen ice that allows for the facility cooling to run autonomously in case of a chiller or power failure. I got to see the storage of the icecube :)\nNow, because Fred is possibly even more enthusiastic about our eurotrip than I am, he insisted that I also put a machine here in Lille, even though originally it was meant to be only Frankfurt, Amsterdam, Paris and Zurich. After some tough negotiations, I reluctantly agreed. While I did not have a Supermicro, I did have some freshly procured Dell R610s, the old machines from Coloclue AS8283, which they have recently upgraded to newer routers. I installed the pair at EUNetworks for Coloclue, and another team mate installed the pair at DCG/NorthC, after which four of these units were left written off and available. And I would not be me, if I were not to accept some perfectly valuable second hand vintage Dells :-)\nSo the plan is: we drop an R610 here now, and I ship a replacement standard issue Supermicro + APU3d3/WiFi later.\nConnectivity By now I\u0026rsquo;m getting pretty good ad creating L2VPN EoMPLS circuits, so I created one for myself from the Amsterdam router er01.ams01.ip-max.net to the one here at er01.lil01.ip-max.net. The one here is a Cisco ASR9006, a respectable machine. In the second point of presence, in the neighboring town of Anzin, there is an ASR9010 called er01.lil02.ip-max.net but that one is tucked away in a telco room reserved for deities, not in the normal serverroom which is available for plebs like me.\nThe inbound span goes from Amsterdam through Antwerp (Belgium) and Brussels and finally landing in Anzin, and from there it\u0026rsquo;s dark fiber to this rack. I realised that because I use UN/LOCODE that I should be precise in my naming. The town here is a suburb of Lille, the country\u0026rsquo;s fourth biggest city after Paris, Marseille and Lyon. The town itself is called Sainghin-en-Mélantois, which resolves to FR GGH and thus my temporary Dell R610 will be called frggh0.ipng.ch.\nFred happened to have a spare Intel X552 in his bag, so I commandeered it and gave the machine two legs of Tengig, one going to the ASR9006 and the other going the Nexus3064PQ under it. Soon, we will connect here to the local internet exchange Lillix. There are very few non-locals, let alone international members at LilleIX, but considering larger clubs like Zayo require two or more french peering points, this will be my ticket to some pretty good peers. Nice!\nThe idea is that one VLL lands me on Amsterdam, and the other will eventually land me on Paris at Telehouse TH2. But that will be after the weekend, as first we need to spend some quality time exploring Lille and recreating the grignotage (English: nibbling) that must be done in the north of France.\nThe results As always on the IP-Max network, they speak for themselves. 
During daytime, the connectivity from my basement to Frankfurt is at 6.7ms, to Amsterdam it\u0026rsquo;s 13ms and all the way to Lille it\u0026rsquo;s 20.3ms, with throughputs that are, let\u0026rsquo;s just say, line rate booyah!\npim@chumbucket:~$ traceroute frggh0.ipng.ch traceroute to frggh0.ipng.ch (194.1.163.34), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.67) 0.246 ms 0.214 ms 0.135 ms 2 chgtg0.ipng.ch (194.1.163.19) 0.513 ms 0.478 ms 0.445 ms 3 chrma0.ipng.ch (194.1.163.8) 0.633 ms 0.599 ms 0.658 ms 4 defra0.ipng.ch (194.1.163.25) 6.808 ms 6.773 ms 6.740 ms 5 nlams0.ipng.ch (194.1.163.27) 13.090 ms 13.057 ms 13.024 ms 6 frggh0.ipng.ch (194.1.163.34) 20.370 ms 20.550 ms 20.473 ms pim@frggh0:~$ iperf3 -c chrma0.ipng.ch -P 10 -R ... [SUM] 0.00-10.02 sec 8.84 GBytes 7.58 Gbits/sec 271 sender [SUM] 0.00-10.00 sec 8.75 GBytes 7.52 Gbits/sec receiver pim@defra0:~$ iperf3 -P 10 -c frggh0.ipng.ch -R ... [SUM] 0.00-10.02 sec 11.1 GBytes 9.54 Gbits/sec 292 sender [SUM] 0.00-10.00 sec 11.1 GBytes 9.51 Gbits/sec receiver After the weekend, we\u0026rsquo;ll be driving on to Paris to complete the ring, after which I will have two different ways to traverse \u0026ndash; clockwise from Zurich to Geneva, Paris, Lille, Amsterdam and Frankfurt, or counterclockwise on the same ring. There will be 10Gbit between each of my routers in each direction. We do not compromise on quality, throughput or latency over here.\nI could not be happier with the service provided so far. Paris, here we come!!\n","date":"2021-05-28","desc":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\n","permalink":"https://ipng.ch/s/articles/2021/05/28/ipng-arrives-in-lille/","section":"articles","title":"IPng arrives in Lille"},{"contents":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\nLeadup for IP-Max Usually, if I were to go deploy somewhere with IP-Max, I settle down on top of (or underneath, or in some way physically close to) their router at a point of presence of theirs. In Amsterdam though, it was different \u0026hellip; because IP-Max had not yet built a PoP here.\nBut I ask: why would that stop us? Fred told me last year that he had always wanted to build out a PoP in Amsterdam, but somehow he never really found the time. 
I offered to do the work to organize the local supplier chain, get a good spot in a well connected place, long haul to France and Germany, and otherwise exercise my (social) network to get it done.\nIn March 2021, I stumbled across rackspace at NIKHEF by working with the folks from ERITAP who got their hands on something that is less of a commodity: a full rack + power (the facility is always chronically oversubscribed).\nA few chores on the tasklist:\nSign for rackspace. Check. Order the IP-Max standard-issue small pop kit, which consists of: One Cisco ASR9001 One Nexus 3064PQ One PCEngines APU4 for out-of-band And all the power/copper/fiber cables, optics, serial dongles we might need Procure an out-of-band provider for our APU4, easily found at NIKHEF (thanks, Arend)! Get connectivity in and out of Amsterdam! Connectivity The most important piece of planning is around the long haul connectivity. Considering IP-Max already operates a circuit from Frankfurt (Germany) to Anzin (France), I arranged for that link to be rerouted through Amsterdam and broken into two segments: Frankfurt-Amsterdam and Amsterdam-Anzin. I was like a kid in a candy store being able to meticulously choose the route that the fiber takes \u0026ndash; over Düsseldorf, entering the Netherlands at Emmerik, over Arnhem and Ede, and to Amsterdam. A very direct route, using a 10Gig DWDM wave.\nThe other span goes from Amsterdam through Antwerp (Belgium) and Brussels and finally landing in Anzin (near Lille, France), which was the previous 10Gig DWDM wave, so there is no increased latency even though the link is broken up in Amsterdam. Yaay!\nDelivery of the DWDM waves was ordered on March 30th, and although it should normally take 25 working days to deliver, for some awkward reason with the supplier it was going to take way longer than what we could afford, so a spot of VP style escalation took place, and oh look! Now it would take four weeks to turn up, was completed last Friday, which was just in time for our trip. Double yaay!\nStaging Amsterdam Because this is a completely new site for IP-Max as well as IPng, we\u0026rsquo;ll have to do a bit more work. And this suites us just fine, because after driving through Frankfurt (see my previous post), to the Netherlands, we have to stay in quarantine for five days (or, ten if we happen to fail our PCR test after five days!), which gives us plenty of time to stage and configure what will be our Cisco er01.ams01.ip-max.net and our Nexus as01.ams01.ip-max.net.\nOf course, figuring out how all of this fits together is a nice exercise, and we planned to just plug and play the ASR9k, which worked out rather successfully by the way, so it had to be completely configured ahead of time. We created the interfaces, DNS, routing protocols like OSPF, OSPFv3, MPLS/LDP, BGP and all of the good stuff like ACLs, accounts and et cetera.\nWe staged the stuff in the laundry room of our AirBnB, being actually quite grateful once the staging was complete and we could turn the machines off.\nFor IPng, staging nlams0.ipng.ch was already done ahead of time. So all I really needed for it, was to ensure that the EoMPLS circuits were created ahead of time. 
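For illustration, pre-creating such an EoMPLS circuit on an IOS-XR router looks roughly like the snippet below; the port, VLAN tag, pw-id and neighbor loopback are placeholders, not the values of the actual Amsterdam wire:

interface TenGigE0/0/0/1.100 l2transport
 description Cust: IPng VLL to fra05
 encapsulation dot1q 100
 rewrite ingress tag pop 1 symmetric
 mtu 9018
!
l2vpn
 pw-class EOMPLS-PW-CLASS
  encapsulation mpls
   transport-mode ethernet
  !
 !
 xconnect group IPng
  p2p IPng_to_fra05
   interface TenGigE0/0/0/1.100
   neighbor ipv4 192.0.2.1 pw-id 100100
    pw-class EOMPLS-PW-CLASS
   !
  !
 !
!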
I was really looking forward to seeing if we could beat 14ms to Amsterdam on the IP-Max network.\nExtracurriculars Besides the staging, we also ate some pretty delicious food:\nMushroom risotto HotPot with Arend and Esther Chicken vegetable soup Tacos w/ Tapas Steak w/ broccoli and potatoes Red tuna w/ beans and herbs But we also took the time to explore a little bit, for example on Kaz\u0026rsquo;s new boat through the canals and over the river Amstel. But mostly: we sat home and enjoyed our quarantine the best we could :-)\nDeployment (day 1) First before the day started, I drained the Frankfurt-Anzin link by raising OSPF cost on er01.fra01.ip-max.net and er01.lil02.ip-max.net while Fred notified customers and the IP-Max team of the impending update to the network.\nWe met up with ERITAP on Monday 24th, or target deploy date. We had labeled and packed up all of our gear, grabbed the car, and made our way to the Watergraafsmeer to the place where the Internet landed in Europe in 1982. Almost 40 years later, here we are: IP-Max is moving in!\nThe physical work was not very exciting. The Nexus, ASR, two APUs and my own Supermicro were racked in only a few minutes. But then the interesting bits begin \u0026ndash; how do we connect all of this without making a Kabelsalat that you so often see in people\u0026rsquo;s racks.\nBut yet at the same time, both Fred and I were enthusiastic and couldn\u0026rsquo;t wait to see the ping time to Anzin and Frankfurt from here. I left Fred the honors to connect his own brand new er01.ams01.ip-max.net by opening the patched through loop from our supplier, and he was beaming once he saw OSPF and OSPFv3 adjacencies and a latency of just short of 6ms. But he was very kind to let me do the second honors to connect the router to Anzin, at just over 5ms. That is a really fantastic performance and very short path indeed. This will be fun for my next adventure, I\u0026rsquo;m sure. We\u0026rsquo;ll see the Dell pictured above appear as frlil0.ipng.ch but I get ahead of myself ..\nAfter we connected the whole thing up and did extensive ping tests, we undrained the spans and saw a respectable 600Mbit of traffic traverse the new router. Because there were a few other folks tinkering in the rack (for example our friends from Coloclue we decided to adjourn for the day and visit Paul and Henrieke up in Almere for a fabulous homecooked meal (thanks again for the Picaña!) and we enjoyed being followed by the cops when driving back out of Almere \u0026ndash; but we were not bothered/hassled by them.\nDeployment (day 2) But then (and this is technically day 2 because it was, let\u0026rsquo;s just say, well after midnight), as the IP-Max network calmed down for the night I did my stress test and came to a horrible surprise, interface errors! They were Frame Checksum Errors and while the performance from defra0.ipng.ch to nlams0.ipng.ch was impeccable (9.2Gbit, yaay), the transfer speeds on the reversed direction did stall out at about 35Mbit. That is NOT what the Doctor ordered!\nSo luckily we had already decided to go back for a day2 to complete the rack install, mostly for things like the fiber patch panel for IP-Max customers in the ERITAP rack, and to ensure that our power, serial and network cables would not come loose, because packets don\u0026rsquo;t like loose cables. 
Certainly we should avoid the electrons or photons falling onto the floor\u0026hellip;\nBut the weird thing about my link errors (as seen by the ASR9k) was that usually the problem is either a duplex error (which was OK), or a dirty fiber or transciever (which was unlikely considering this link was a SFP+ DAC!). So that leaves either a faulty Cisco or a faulty Supermicro, neither of which are appealing.\nOn day two, after breakfast, we had to do a few chores first (like the claim for the VAT for imports, see our previous post, and as well get a corona PCR test for the way to France (which was absolutely horrible, by the way, I still feel my nose which was violated). So we hit NIKHEF at around 4pm to finish the job and take care of a few small favors for Coloclue, ERITAP and Byteworks, who are also in the same rack as IPng and IP-Max.\nThe results After I replaced the DAC (ironically with an SFP+ optic), once OSPF and iBGP came back to life, this is what it looked like:\npim@chumbucket:~$ traceroute nlams0.ipng.ch traceroute to nlams0.ipng.ch (194.1.163.32), 30 hops max, 60 byte packets 1 chbtl1.ipng.ch (194.1.163.67) 0.292 ms 0.216 ms 0.179 ms 2 chgtg0.ipng.ch (194.1.163.19) 0.599 ms 0.565 ms 0.531 ms 3 chrma0.ipng.ch (194.1.163.8) 0.873 ms 0.840 ms 0.806 ms 4 defra0.ipng.ch (194.1.163.25) 6.783 ms 6.751 ms 6.718 ms 5 nlams0.ipng.ch (194.1.163.32) 12.864 ms 12.831 ms 12.798 ms pim@nlams0:~$ iperf3 -P 10 -c chgtg0.ipng.ch ... [SUM] 0.00-10.00 sec 11.0 GBytes 9.49 Gbits/sec 95 sender [SUM] 0.00-10.01 sec 11.0 GBytes 9.41 Gbits/sec receiver pim@nlams0:~$ iperf3 -P 10 -c chgtg0.ipng.ch -R ... [SUM] 0.00-10.01 sec 10.0 GBytes 8.62 Gbits/sec 339 sender [SUM] 0.00-10.00 sec 9.98 GBytes 8.57 Gbits/sec receiver That will do, thanks. I cannot believe that the latency from my basement workstation in Brüttisellen, Switzerland, to the local internet exchange is 0.8ms, then through to Frankfurt at 6.2ms and then all the way to Amsterdam the end to end round trip latency is 12.2ms. I can stare at the smokeping for hours!!\nSo I spent the reminder of the night hanging out with Fred while pumping 9Gbit in both directions for 2 hours while traffic was low. It\u0026rsquo;s one thing to do an iperf in your basement rack, but it\u0026rsquo;s an entirely different feeling to do an iperf spanning three countries in Europe (CH, DE and NL). I will note that the spans from Zurich to Frankfurt didn\u0026rsquo;t even get warm, although the one from Frankfurt to Amsterdam kind of broke a sweat for a little while there \u0026hellip;\nAnd the coolest thing yet? We\u0026rsquo;re not done with this trip.\n","date":"2021-05-26","desc":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\n","permalink":"https://ipng.ch/s/articles/2021/05/26/ipng-arrives-in-amsterdam/","section":"articles","title":"IPng arrives in Amsterdam"},{"contents":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. 
At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\nLeadup to the Roadtrip Usually, IP-Max deploys their routers by having them shipped into the destination location, but this time was special. We decided to make a roadtrip out of it, so Fred made his way from Geneva to Brüttisellen, stayed the night, and early on Monday May 17th, we packed up the car and started our trek.\nIt turns out we had estimated our risk profile completely wrong - we thought it would be hard to cross the border into Germany due to the ongoing pandemic, but actually that part was fine. The Germans had opened their borders for transit traffic and stays of up to 24hrs just a few days ago, and we both got a (negative) PCR test so we felt we had our bases covered.\nThe Border Then when we arrived at the border, perhaps because we had Geneva license plates, we were asked about our trip, business or pleasure, and we shared that we had some equipment with us. Thus begun the four-and-a-half hour customs exercise that was necessary for us to safely send our equipment off to the European Union. One would think it should be easy, but it actually wasn\u0026rsquo;t quite that easy, considering we arrived at the border at 9am on a Monday, and the traffic into Switzerland was queueing up all expeditor and logistics companies, so nobody really was willing to help us out. But we made it and left again shortly after 1:30pm.\nFrankfurt We arrived at Frankfurt Equinix FR5 at the Kleyerstrasse at around 5pm. The IP-Max rack was quickly found, and while Fred was installing their corporate Xen host to run remote VMs for the Frankfurt area, I deployed the first router of the trip: defra0.ipng.ch.\nIP-Max at this location has a respectable 30G of DWDM capacity from three different vendors into Zurich, 30G of LAG capacity towards DE-CIX, and a 10G DWDM wave into Anzin (France), which will be broken up for us in Amsterdam for a future blogpost - stay tuned :)\nMaking use of line card and route processor redundancy, we decided to use three line cards, reserving one TenGig ethernet port on each:\nTe0/0/0/4 \u0026ndash; EoMPLS to NTT/eShelter Rumlang (chrma0.ipng.ch) Te0/1/0/4 \u0026ndash; EoMPLS to Interxion ZUR1 (chgtg0.ipng.ch) Te0/2/0/4 \u0026ndash; EoMPLS to Amsterdam NIKHEF (nlams0.ipng.ch) At each site, specifically those that are a bit further away, I deploy a standard issue PCEngines APU with 802.11ac WiFi, serial, and IPMI access to any machine that may be there. If you ever visit a datacenter floor where I\u0026rsquo;m present, look for SSID AS50869 FRA in the case of Kleyerstrasse. The password is IPngGuest, you\u0026rsquo;re welcome to some bits of bandwidth in a pinch :)\nYou can see my router dangling off what looks like a fiber optic umbellical cord under er01.fra05.ip-max.net, right at the heart of the Frankfurt internet.\nLogical Configuration console.fra.ipng.nl At the top of the rack you can also see the blue APU3 with its WiFi antennas. It takes an IPv4 /29 and IPv6 /64 from IP-Max AS25091 which gives me access to my equipment even if bad things happen (and they will, it\u0026rsquo;s just a matter of time!). It also exposes a WireGuard so that I can access it even without the need for SSH which can come in useful if a KVM console is required. 
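A rough sketch of what that WireGuard tunnel on the APU can look like; the subnet, port and keys here are purely illustrative placeholders, and any real deployment uses its own keypair:

$ cat << EOF | sudo tee /etc/wireguard/wg0.conf
[Interface]
# hypothetical OOB management subnet and listen port
Address = 192.168.255.1/24
ListenPort = 51820
PrivateKey = <apu-private-key>

[Peer]
# laptop / jump host that should reach the OOB segment
PublicKey = <peer-public-key>
AllowedIPs = 192.168.255.2/32
EOF
$ sudo systemctl enable --now wg-quick@wg0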
Note the logo :-)\nOn the inside of the APU, it configures one RFC1918 wifi segment and another RFC1918 wired segment. In this case, the wired segment is connected to the IPMI port of the Supermicro router. I have really gotten used to this style of deployment \u0026ndash; I start with the OOB. Once the APU has power (and it does not need to have an uplink yet), I can already SSH to it from the wireless segment, and further configure it. Once it\u0026rsquo;s done, I make a habit of rebooting it to ensure it comes up. Then, I can easily configure (and even entirely install!!) the server behind it using IPMI serial-over-lan and HTML5 KVM if need be. It\u0026rsquo;s delicious. And, it has saved my ass several times over the years!\ndefra0.ipng.ch Making use of the line card redundancy, there is now 3x 10Gig connected to my router, which immediately makes it one of the better connected hosts in this facility. Logging in via IPMI, the DANOS image is quickly configured. There\u0026rsquo;s one link to Interxion ZUR1 in Glattbrugg, one link to eShelter in Rümlang, and one link up to Amsterdam. The interface towards Interxion ZUR1 doubles up as an egresspoint for now. There will be an IPv4/IPv6 transit session with AS25091, a DE-CIX connection and possibly but probably not a Kleyrex connection, were it not for the murderous cross connect costs at this facility.\nThe results After the OSPF and OSPFv3 adjacencies came up, iBGP was next. For now, the machine is single-homed off of chrma0.ipng.ch but soon there will be as well a leg towards Amsterdam. So for now, all that we can do is test basic connectivity. So after finishing our trip to Amsterdam, and checking into our AirBnB ready to go through our quarantine song-and-dance, we spent a little time celebrating - we arrived at 1:30am, and turned in for the night at 3am. The next day, our groceries arrived, somehow unfortunately I had to be \u0026ldquo;well prepared\u0026rdquo; and ordered them to be delivered between 7-8am on Tuesday.\nAfter a full day of regular work, we spent the evening taking a look at how my kit performs, and we are happy to report it\u0026rsquo;s absolutely great:\npim@defra0:~$ iperf3 -c chgtg0.ipng.ch -P 10 ... [SUM] 0.00-10.00 sec 11.2 GBytes 9.63 Gbits/sec 281 sender [SUM] 0.00-10.02 sec 11.2 GBytes 9.56 Gbits/sec receiver pim@defra0:~$ iperf3 -c chgtg0.ipng.ch -P 10 -R ... [SUM] 0.00-10.01 sec 10.2 GBytes 8.73 Gbits/sec 550 sender [SUM] 0.00-10.00 sec 10.1 GBytes 8.70 Gbits/sec receiver pim@defra0:~$ ping4 chrma0.ipng.ch PING chrma0.ipng.ch (194.1.163.0) 56(84) bytes of data. ... --- chrma0.ipng.ch ping statistics --- 9 packets transmitted, 9 received, 0% packet loss, time 20ms rtt min/avg/max/mdev = 5.864/6.022/6.173/0.072 ms The roundtrip latency to Zurich is about 6.0ms, and the performance is north of 9Gbit in both directions for my router. Soon, we will go to Amsterdam, and deploy router number two (of four!) on this epic roadtrip: nlams0.ipng.ch which is a bucket list item of mine \u0026ndash; to peer at Amsterdam Science Park.\nMore on that later!\n","date":"2021-05-17","desc":"I\u0026rsquo;ve been planning a network expansion for a while now. For the next few weeks, I will be in total geek-mode as I travel to several European cities to deploy AS50869 on a european ring. At the same time, my buddy Fred from IP-Max has been wanting to go to Amsterdam. 
IP-Max\u0026rsquo;s network is considerably larger than mine, but it just never clicked with the right set of circumstances for them to deploy in the Netherlands, until the stars aligned \u0026hellip;\n","permalink":"https://ipng.ch/s/articles/2021/05/17/ipng-arrives-in-frankfurt/","section":"articles","title":"IPng arrives in Frankfurt"},{"contents":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewers: Coloclue Network Committee \u0026lt;routers@coloclue.net\u0026gt; Status: Draft - Review - Published Introduction Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) has deteriorated, and they were up for renewal. At the same time, network latency/jitter has been very high, and variability may be caused by the Linux router hardware, their used software, the inter-datacenter links, or any combination of these. The routers were replaced with relatively modern hardware. In a previous post, I looked into the links between the datacenters, and demonstrated that they are performing as expected (1.41Mpps of 802.1q ethernet frames in both directions). That leaves the software. This post explores a replacement of the Linux kernel routers by a userspace process running VPP, which is an application built on DPDK.\nExecutive Summary I was unable to run VPP due to an issue detecting and making use of the Intel x710 network cards in this chassis. While the Intel i210-AT cards worked well, both with the standard vfio-pci driver and with an alternative igb_uio driver, I did not manage to get the Intel x710 cards to fully work (noting that I have the same Intel x710 NIC working flawlessly in VPP on another Supermicro chassis). See below for a detailed writeup of what I tried and which results were obtained. In the end, I reverted the machine back to its (mostly) original state, with three pertinent changes:\nI left the Debian Backports kernel 5.10 running I turned on IOMMU (Intel VT-d was already on), booting with iommu=pt intel_iommu=on I left Hyperthreading off in the BIOS (it was on when I started) After I restored the machine to its original Linux+Bird configuration, I noticed a marked improvement in latency, jitter and throughput. A combination of these changes is likely beneficial, so I do recommend making this change on all Coloclue routers, while we continue our quest for faster, more stable network performance.\nSo the bad news is: I did not get to prove that VPP and DPDK are awesome in AS8283. Yet.\nBut the good news is: network performance improved drastically. I\u0026rsquo;ll take it :)\nTimeline The graph on the left shows latency from AS15703 (True) in EUNetworks to a Coloclue machine hosted in NorthC. As far as Smokeping is concerned, latency has been quite poor for as long as it can remember (at least a year). The graph on the right shows the latency from AS12859 (BIT) to the beacon on 185.52.225.1/24 which is announced only on dcg-1, on the day this project was carried out.\nLooking more closely at the second graph:\nSunday 07:30: The machine was put into maintenance, which made the latency jump. 
This is because the beacon was no longer reachable directly behind dcg-1 from AS12859 over NL-IX, but via an alternative path which traversed several more Coloclue routers, hence higher latency and jitter/loss.\nSunday 11:00: I rolled back the VPP environment on the machine, restoring it to its original configuration, except running kernel 5.10 and with Intel VT-d and Hyperthreading both turned off in the BIOS. A combination of those changes has definitely worked wonders. See also the mtr results down below.\nSunday 14:50: Because I didn\u0026rsquo;t want to give up, and because I expected a little more collegiality from my friend dcg-1, I gave it another go by enabling IOMMU and PT, booting the 5.10 kernel with iommu=pt and intel_iommu=on. Now, with the igb_uio driver loaded, VPP detected both the i210 and x710 NICs, however it did not want to initialize the 4th port on the NIC (this was enp1s0f3, the port to Fusix Networks), and the port eno1 only partially worked (IPv6 was fine, IPv4 was not). During this second attempt though, the rest of VPP and Bird came up, including NL-IX, the LACP, all internal interfaces, IPv4 and IPv6 OSPF and all BGP peering sessions with members.\nSunday 16:20: I could not in good faith turn on eBGP peers though, because of the interaction with eno1 and enp1s0f3 described in more detail below. I then ran out of time, and restored service with Linux 5.10 kernel and the original Bird configuration, now with Intel VT-d turned on and IOMMU/PT enabled in the kernel.\nQuick Overview This paper, at a high level, discusses the following:\nGives a brief introduction of VPP and its new Linux CP work Discusses a means to isolate a /24 on exactly one Coloclue router Demonstrates changes made to run VPP, even though they were not applied Compares latency/throughput before-and-after in a surprising improvement, unrelated to VPP 1. Introduction to VPP VPP stands for Vector Packet Processing. In development since 2002, VPP is production code currently running in shipping products. It runs in user space on multiple architectures including x86, ARM, and Power architectures on both x86 servers and embedded devices. The design of VPP is hardware, kernel, and deployment (bare metal, VM, container) agnostic. It runs completely in userspace. VPP helps push extreme limits of performance and scale. Independent testing shows that, at scale, VPP-powered routers are two orders of magnitude faster than currently available technologies.\nThe Linux (and BSD) kernel is not optimized for network I/O. Each packet (or in some implementations, a small batch of packets) generates an interrupt which causes the kernel to stop what it\u0026rsquo;s doing, schedule the interrupt handler, do the necessary steps in the networking stack for each individual packet in turn: layer2 input, filtering, NAT session matching and packet rewriting, IP next-hop lookup, interface and L2 next-hop lookup, and marshalling the packet back onto the network, or handing it over to an application running on the local machine. And it does this for each packet one after another.\nVPP takes away a few inefficiencies in this process in a few ways:\nVPP does not use interrupts, does not use the kernel network driver, and does not use the kernel networking stack at all. Instead, it attaches directly to the PCI device and polls the network card directly for incoming packets. Once network traffic gets busier, VPP constructs a collection of packets called a vector, to pass through a directed graph of smaller functions. 
There\u0026rsquo;s a clear performance benefit of such an architecture: the first packet from the vector will possibly hit a cold instruction/data cache in the CPU, but the second through Nth packet from the vector will execute on a hot cache and need little or no memory access, executing an order of magnitude faster or even better. VPP is multithreaded and can have multiple cores polling and executing receive and transmit queues for network interfaces at the same time. Routing information (like next hops, forwarding tables, etc) should be carefully maintained, but in principle, VPP scales linearly with the number of cores. It is straightforward to obtain 10Mpps of forwarding throughput per CPU core, so a 32-core machine (handling 320Mpps) can realistically saturate 21x10Gbit interfaces (at 14.88Mpps). A similar 32-core machine, if it has sufficient PCI slots and network cards, can route an internet mixture of traffic at throughputs of roughly 492Gbit (320Mpps at 650Kpps per 10G of imix).\nVPP, upon startup, will disassociate the NICs from the kernel and bind them into the vpp process, which will promptly run at 100% CPU, due to its DPDK polling. There\u0026rsquo;s a tool vppctl which allows the operator to configure the VPP process: create interfaces, set attributes like link state, MTU, MPLS, Bonding, IPv4/IPv6 addresses and add/remove routes in the forwarding information base (or FIB). VPP further works with plugins that add specific functionality; examples are LLDP, DHCP, IKEv2, NAT, DSLITE, Load Balancing, Firewall ACLs, GENEVE, VXLAN, VRRP, and Wireguard, to name but a few popular ones.\nIntroduction to Linux CP Plugin However, notably (or perhaps notoriously), VPP is only a dataplane application; it does not have any routing protocols like OSPF or BGP. A relatively new plugin is called the Linux Control Plane (or LCP), and it consists of two parts: one is public and one is under development at the time of this article. The first plugin allows the operator to create a Linux tap interface and pass through, or punt, traffic from the dataplane into it. This way, the userspace VPP application creates a link back into the kernel, and an interface (eg. vpp0) appears. Input packets in VPP have all input features applied (firewall, NAT, session matching, etc), and if the packet is sent to an IP address with an LCP pair associated with it, it is punted to the tap device. So if, on the Linux side, the same IP address is put on the resulting vpp0 device, Linux will see it. Responses from the kernel into the tap device are picked up by the Linux CP plugin and re-injected into the dataplane, and all output features of VPP are applied. This makes bidirectional traffic possible.
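As a brief aside before diving into the LCP example below: the forwarding numbers quoted earlier are easy to sanity-check. A minimal sketch in Python, assuming the 10Mpps-per-core and 14.88Mpps-per-10G figures from the scaling discussion above (real-world rates vary per CPU, NIC and enabled features):

# Back-of-the-envelope check of the per-core scaling claim (assumed figures).
PPS_PER_CORE = 10_000_000        # assumed VPP forwarding rate per core
CORES = 32
LINERATE_10G_64B = 14_880_000    # 10GbE line rate at minimum-sized frames

total_pps = PPS_PER_CORE * CORES             # 320 Mpps aggregate
ports_at_64b = total_pps / LINERATE_10G_64B  # ~21.5 saturated 10G ports

print(f"aggregate: {total_pps / 1e6:.0f} Mpps, ~{int(ports_at_64b)}x 10GbE at 64 byte frames")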
You can read up on the Linux CP plugin in the VPP documentation.\nHere\u0026rsquo;s a barebones example of plumbing the VPP interface GigabitEthernet7/0/0 through a network device vpp0 in the dataplane network namespace.\npim@vpp-west:~$ sudo systemctl restart vpp pim@vpp-west:~$ vppctl lcp create GigabitEthernet7/0/0 host-if vpp0 namespace dataplane pim@vpp-west:~$ sudo ip netns exec dataplane ip link 1: lo: \u0026lt;LOOPBACK,UP,LOWER_UP\u0026gt; mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 12: vpp0: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 9000 qdisc mq state DOWN mode DEFAULT group default qlen 1000 link/ether 52:54:00:8a:0e:97 brd ff:ff:ff:ff:ff:ff pim@vpp-west:~$ vppctl show interface Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count GigabitEthernet7/0/0 1 down 9000/0/0/0 local0 0 down 0/0/0/0 tap1 2 up 9000/0/0/0 Introduction to Linux NL Plugin You may be wondering, what happens with interface addresses or static routes? Usually, a userspace application like ip link add or ip address add or a higher level process like bird or FRR will want to set routes towards next hops on interfaces, using routing protocols like OSPF or BGP. The Linux kernel picks these events up and can share them as so-called netlink messages with interested parties. Enter the second plugin (the one that is under development at the moment), which is a netlink listener. Its job is to pick up netlink messages from the kernel and apply them to the VPP dataplane. With the Linux NL plugin enabled, events like adding or removing links, addresses and routes, or setting link state or MTU, will all be mirrored into the dataplane. I\u0026rsquo;m hoping the netlink code will be released in the upcoming VPP release, but contact me any time if you\u0026rsquo;d like to discuss details of the code, which is currently under community review in the VPP Gerrit.\nBuilding on the example above, with this Linux NL plugin enabled, we can now manipulate VPP state from Linux, for example bringing up an interface and adding an IPv4 address to it (of course, IPv6 works just as well!):\npim@vpp-west:~$ sudo ip netns exec dataplane ip link set vpp0 up mtu 1500 pim@vpp-west:~$ sudo ip netns exec dataplane ip addr add 2001:db8::1/64 dev vpp0 pim@vpp-west:~$ sudo ip netns exec dataplane ip addr add 10.0.13.2/30 dev vpp0 pim@vpp-west:~$ sudo ip netns exec dataplane ping -c1 10.0.13.1 PING 10.0.13.1 (10.0.13.1) 56(84) bytes of data. 64 bytes from 10.0.13.1: icmp_seq=1 ttl=64 time=0.591 ms --- 10.0.13.1 ping statistics --- 1 packets transmitted, 1 received, 0% packet loss, time 0ms rtt min/avg/max/mdev = 0.591/0.591/0.591/0.000 ms pim@vpp-west:~$ vppctl show interface Name Idx State MTU (L3/IP4/IP6/MPLS) Counter Count GigabitEthernet7/0/0 1 up 1500/0/0/0 rx packets 4 rx bytes 268 tx packets 14 tx bytes 1140 drops 2 ip4 2 local0 0 down 0/0/0/0 tap1 2 up 9000/0/0/0 rx packets 10 rx bytes 796 tx packets 2 tx bytes 140 ip4 1 ip6 8 pim@vpp-west:~$ vppctl show interface address GigabitEthernet7/0/0 (up): L3 10.0.13.2/30 L3 2001:db8::1/64 local0 (dn): tap1 (up): As can be seen above, setting the link state up, setting the MTU, and adding addresses were all captured by the Linux NL plugin and applied in the dataplane.
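As a small aid when experimenting with this, the Linux view and the VPP view can be compared side by side. A minimal sketch, assuming the same vpp-west test setup and the dataplane namespace used above (the helper itself is hypothetical and not part of the plugins; run it as root):

# Print the kernel's idea of vpp0 next to VPP's idea of the interface,
# using the same commands shown interactively above.
import subprocess

def show(cmd: str) -> None:
    print(f"$ {cmd}")
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    print(result.stdout.strip())

show("ip netns exec dataplane ip -br addr show dev vpp0")
show("vppctl show interface address")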
Further to this, the Linux NL plugin also synchronizes route updates into the forwarding information base (or FIB) of the dataplane:\npim@vpp-west:~$ sudo ip netns exec dataplane ip route add 100.65.0.0/24 via 10.0.13.1 pim@vpp-west:~$ vppctl show ip fib 100.65.0.0 ipv4-VRF:0, fib_index:0, flow hash:[src dst sport dport proto flowlabel ] epoch:0 flags:none locks:[adjacency:1, default-route:1, lcp-rt:1, ] 100.65.0.0/24 fib:0 index:15 locks:2 lcp-rt refs:1 src-flags:added,contributing,active, path-list:[27] locks:2 flags:shared, uPRF-list:19 len:1 itfs:[1, ] path:[34] pl-index:27 ip4 weight=1 pref=0 attached-nexthop: oper-flags:resolved, 10.0.13.1 GigabitEthernet7/0/0 [@0]: ipv4 via 10.0.13.1 GigabitEthernet7/0/0: mtu:1500 next:5 flags:[] 52540015f82a5254008a0e970800 Note: I built the code for VPP v21.06 including the Linux CP and Linux NL plugins at tag 21.06-rc0~476-g41cf6e23d on Debian Buster for the rest of this project, to match the operating system in use on Coloclue routers. I did this without additional modifications (even though I must admit, I do know of a few code paths in the netlink handler that still trigger a crash, and I have a few fixes in my client at home, so I\u0026rsquo;ll be careful to avoid the pitfalls for now :-).\n2. Isolating a Device Under Test Coloclue has several routers, so to ensure that the traffic traverses only the one router under test, I decided to use an allocated but currently unused IPv4 prefix and announce that only from one of the four routers, so that all traffic to and from that /24 goes over that router. Coloclue uses a piece of software called Kees, a set of Python and Jinja2 scripts to generate a Bird1.6 configuration for each router. This is great because that allows me to add a small feature to get what I need: beacons.\nA beacon is a prefix that is sent to (some, or all) peers on the internet to attract traffic in a particular way. I added a function called is_coloclue_beacon() which reads the input YAML file and uses a construction similar to the existing feature for \u0026ldquo;supernets\u0026rdquo;. It determines if a given prefix must be announced to peers and upstreams. Any IPv4 and IPv6 prefixes from the beacons list will be then matched in is_coloclue_beacon() and announced.\nBased on a per-router config (eg. vars/dcg-1.router.nl.coloclue.net.yml) I can now add the following YAML stanza:\ncoloclue: beacons: - prefix: \u0026#34;185.52.225.0\u0026#34; length: 24 comment: \u0026#34;VPP test prefix (pim)\u0026#34; Because tinkering with routers in the Default Free Zone is a great way to cause an outage, I need to ensure that the code I wrote was well tested. I first ran ./update-routers.sh check with no beacon config. This succeeded:\n[...] checking: /opt/router-staging/dcg-1.router.nl.coloclue.net/bird.conf checking: /opt/router-staging/dcg-1.router.nl.coloclue.net/bird6.conf checking: /opt/router-staging/dcg-2.router.nl.coloclue.net/bird.conf checking: /opt/router-staging/dcg-2.router.nl.coloclue.net/bird6.conf checking: /opt/router-staging/eunetworks-2.router.nl.coloclue.net/bird.conf checking: /opt/router-staging/eunetworks-2.router.nl.coloclue.net/bird6.conf checking: /opt/router-staging/eunetworks-3.router.nl.coloclue.net/bird.conf checking: /opt/router-staging/eunetworks-3.router.nl.coloclue.net/bird6.conf And I made sure that the generated function is indeed empty:\nfunction is_coloclue_beacon() { # Prefix must fall within one of our supernets, otherwise it cannot be a beacon. 
if (!is_coloclue_more_specific()) then return false; return false; } Then, I ran the configuration again with one IPv4 beacon set on dcg-1, and still all the bird configs on both IPv4 and IPv6 for all routers parsed correctly, and the generated function on the dcg-1 IPv4 filters file was populated:\nfunction is_coloclue_beacon() { # Prefix must fall within one of our supernets, otherwise it cannot be a beacon. if (!is_coloclue_more_specific()) then return false; if (net = 185.52.225.0/24) then return true; /* VPP test prefix (pim) */ return false; } I then wired up the function into function ebgp_peering_export() and submitted the beacon configuration above, as well as a static route for that beacon prefix to a server running in the NorthC (previously called DCG) datacenter. You can read the details in this Kees commit. The dcg-1 router is connected to NL-IX, so it\u0026rsquo;s expected that after this configuration went live, peers can now see that prefix only via NL-IX, and it\u0026rsquo;s a more specific to the overlapping supernet (which is 185.52.224.0/22).\nAnd indeed, a traceroute now only traverses dcg-1 as seen from peer BIT (AS12859 coming from NL-IX):\n1. lo0.leaf-sw4.bit-2b.network.bit.nl 2. lo0.leaf-sw6.bit-2a.network.bit.nl 3. xe-1-3-1.jun1.bit-2a.network.bit.nl 4. coloclue.the-datacenter-group.nl-ix.net 5. vpp-test.ams.ipng.ch As well as return traffic from Coloclue to that peer:\n1. bond0-100.dcg-1.router.nl.coloclue.net 2. bit.bit2.nl-ix.net 3. lo0.leaf-sw6.bit-2a.network.bit.nl 4. lo0.leaf-sw4.bit-2b.network.bit.nl 5. sandy.ipng.nl 3. Installing VPP First, I need to ensure that the machine is reliably reachable via its IPMI interface (normally using serial-over-lan, but to make sure as well Remote KVM). This is required because all network interfaces above will be bound by VPP, and if the vpp process ever were to crash, it will be restarted without configuration. On a production router, one would expect there to be a configuration daemon that can persist a configuration and recreate it in case of a server restart or dataplane crash.\nBefore we start, let\u0026rsquo;s build VPP with our two beautiful plugins, copy them to dcg-1, and install all the supporting packages we\u0026rsquo;ll need:\npim@vpp-builder:~/src/vpp$ make install-dep pim@vpp-builder:~/src/vpp$ make build pim@vpp-builder:~/src/vpp$ make build-release pim@vpp-builder:~/src/vpp$ make pkg-deb pim@vpp-builder:~/src/vpp$ dpkg -c build-root/vpp-plugin-core*.deb | egrep \u0026#39;linux_(cp|nl)_plugin\u0026#39; -rw-r--r-- root/root 92016 2021-03-27 12:06 ./usr/lib/x86_64-linux-gnu/vpp_plugins/linux_cp_plugin.so -rw-r--r-- root/root 57208 2021-03-27 12:06 ./usr/lib/x86_64-linux-gnu/vpp_plugins/linux_nl_plugin.so pim@vpp-builder:~/src/vpp$ scp build-root/*.deb root@dcg-1.nl.router.coloclue.net:/root/vpp/ pim@dcg-1:~$ sudo apt install libmbedcrypto3 libmbedtls12 libmbedx509-0 libnl-3-200 \\ libnl-route-3-200 libnuma1 python3-cffi python3-cffi-backend python3-ply python3-pycparser pim@dcg-1:~$ sudo dpkg -i /root/vpp/*.deb pim@dcg-1:~$ sudo usermod -a -G vpp pim On a BGP speaking router, netlink messages can come in rather quickly as peers come and go. Due to an unfortunate design choice in the Linux kernel, messages are not buffered for clients, which means that a buffer overrun can occur. 
To avoid this, I\u0026rsquo;ll raise the netlink socket size to 64MB, leveraging a feature that will create a producer queue in the Linux NL plugin, so that VPP can try to drain the messages from the kernel into its memory as quickly as possible. To be able to raise the netlink socket buffer size, we need to set some variables with sysctl (note as well the usual variables VPP wants to set with regard to hugepages in /etc/sysctl.d/80-vpp.conf, which the Debian package installs for you):\npim@dcg-1:~$ cat \u0026lt;\u0026lt; EOF | sudo tee /etc/sysctl.d/81-vpp-netlink.conf # Increase netlink to 64M net.core.rmem_default=67108864 net.core.wmem_default=67108864 net.core.rmem_max=67108864 net.core.wmem_max=67108864 EOF pim@dcg-1:~$ sudo sysctl -p /etc/sysctl.d/81-vpp-netlink.conf /etc/sysctl.d/80-vpp.conf VPP Configuration Now that I\u0026rsquo;m sure traffic to and from 185.52.225.0/24 will go over dcg-1, let\u0026rsquo;s take a look at the machine itself. It has six network interfaces: two onboard Intel i210 gigabit ports and one Intel x710-DA4 quad-tengig network card. To run VPP, the network cards in the machine need to be supported in Intel\u0026rsquo;s DPDK libraries. The ones in this machine are all OK (but as we\u0026rsquo;ll see later, problematic for unexplained reasons):\nroot@dcg-1:~# lspci | grep Ether 01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 01:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 01:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 06:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) 07:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) To handle the inbound traffic, netlink messages and other internal memory structures, I\u0026rsquo;ll allocate 2GB of hugepages to the VPP process. I\u0026rsquo;ll then of course enable the two Linux CP plugins. Because VPP has a lot of statistics counters (for example, a few stats for each used prefix in its forwarding information base or FIB), I will need to give it more than the default of 32MB of stats memory. I\u0026rsquo;d like to execute a few startup commands to further configure the VPP runtime upon startup, so I\u0026rsquo;ll add a startup-config stanza. Finally, although on a production router I would, here I will not specify the DPDK interfaces, because I know that VPP will take over any supported network card that is in link down state upon startup. As long as I boot the machine with unconfigured NICs, I will be good.\nSo, here\u0026rsquo;s the configuration I end up adding to /etc/vpp/startup.conf:\nunix { startup-config /etc/vpp/vpp-exec.conf } memory { main-heap-size 2G main-heap-page-size default-hugepage } plugins { path /usr/lib/x86_64-linux-gnu/vpp_plugins plugin linux_cp_plugin.so { enable } plugin linux_nl_plugin.so { enable } } statseg { size 128M } # linux-cp { # default netns dataplane # } Note: It is important to isolate the tap devices into their own Linux network namespace. If this is not done, packets arriving via the dataplane will not have a route up and into the kernel for interfaces VPP is not aware of, making those kernel-enabled interfaces unreachable.
Due to the use of a network namespace, all applications in Linux will have to be run in that namespace (think: bird, sshd, snmpd, etc) and the firewall rules with iptables will also have to be carefully applied into that namespace. Considering for this test we are using all interfaces in the dataplane, this point is moot, and we\u0026rsquo;ll take a small shortcut and introduce the tap devices in the default namespace.\nIn the configuration file, I added a startup-config (also known as exec) stanza. This is a set of VPP CLI commands that will be executed every time the process starts. It\u0026rsquo;s a great way to get the VPP plumbing done ahead of time. I figured, if I let VPP take the network cards, but then re-present tap interfaces with names which have the same name that the Linux kernel driver would\u0026rsquo;ve given them, the rest of the machine will mostly just work.\nSo the final trick is to disable every interface in /etc/nework/interfaces on dcg-1 and then configure it with a combination of a /etc/vpp/vpp-exec.conf and a small shell script that puts the IP addresses and things back just the way Debian would\u0026rsquo;ve put them using the /etc/network/interfaces file. Here we go!\n# Loopback interface create loopback interface instance 0 lcp create loop0 host-if lo0 # Core: dcg-2 lcp create GigabitEthernet6/0/0 host-if eno1 # Infra: Not used. lcp create GigabitEthernet7/0/0 host-if eno2 # LACP to Arista core switch create bond mode lacp id 0 set interface state TenGigabitEthernet1/0/0 up set interface mtu packet 1500 TenGigabitEthernet1/0/0 set interface state TenGigabitEthernet1/0/1 up set interface mtu packet 1500 TenGigabitEthernet1/0/1 bond add BondEthernet0 TenGigabitEthernet1/0/0 bond add BondEthernet0 TenGigabitEthernet1/0/1 set interface mtu packet 1500 BondEthernet0 lcp create BondEthernet0 host-if bond0 # VLANs on bond0 create sub-interfaces BondEthernet0 100 lcp create BondEthernet0.100 host-if bond0.100 create sub-interfaces BondEthernet0 101 lcp create BondEthernet0.101 host-if bond0.101 create sub-interfaces BondEthernet0 102 lcp create BondEthernet0.102 host-if bond0.102 create sub-interfaces BondEthernet0 120 lcp create BondEthernet0.120 host-if bond0.120 create sub-interfaces BondEthernet0 201 lcp create BondEthernet0.201 host-if bond0.201 create sub-interfaces BondEthernet0 202 lcp create BondEthernet0.202 host-if bond0.202 create sub-interfaces BondEthernet0 205 lcp create BondEthernet0.205 host-if bond0.205 create sub-interfaces BondEthernet0 206 lcp create BondEthernet0.206 host-if bond0.206 create sub-interfaces BondEthernet0 2481 lcp create BondEthernet0.2481 host-if bond0.2481 # NLIX lcp create TenGigabitEthernet1/0/2 host-if enp1s0f2 create sub-interfaces TenGigabitEthernet1/0/2 7 lcp create TenGigabitEthernet1/0/2.7 host-if enp1s0f2.7 create sub-interfaces TenGigabitEthernet1/0/2 26 lcp create TenGigabitEthernet1/0/2.26 host-if enp1s0f2.26 # Fusix Networks lcp create TenGigabitEthernet1/0/3 host-if enp1s0f3 create sub-interfaces TenGigabitEthernet1/0/3 108 lcp create TenGigabitEthernet1/0/3.108 host-if enp1s0f3.108 create sub-interfaces TenGigabitEthernet1/0/3 110 lcp create TenGigabitEthernet1/0/3.110 host-if enp1s0f3.110 create sub-interfaces TenGigabitEthernet1/0/3 300 lcp create TenGigabitEthernet1/0/3.300 host-if enp1s0f3.300 And then to set up the IP address information, a small shell script:\nip link set lo0 up mtu 16384 ip addr add 94.142.247.1/32 dev lo0 ip addr add 2a02:898:0:300::1/128 dev lo0 ip link set eno1 up mtu 1500 ip addr add 
94.142.247.224/31 dev eno1 ip addr add 2a02:898:0:301::12/127 dev eno1 ip link set eno2 down ip link set bond0 up mtu 1500 ip link set bond0.100 up mtu 1500 ip addr add 94.142.244.252/24 dev bond0.100 ip addr add 2a02:898::d1/64 dev bond0.100 ip link set bond0.101 up mtu 1500 ip addr add 172.28.0.252/24 dev bond0.101 ip link set bond0.102 up mtu 1500 ip addr add 94.142.247.44/29 dev bond0.102 ip addr add 2a02:898:0:e::d1/64 dev bond0.102 ip link set bond0.120 up mtu 1500 ip addr add 94.142.247.236/31 dev bond0.120 ip addr add 2a02:898:0:301::6/127 dev bond0.120 ip link set bond0.201 up mtu 1500 ip addr add 94.142.246.252/24 dev bond0.201 ip addr add 2a02:898:62:f6::fffd/64 dev bond0.201 ip link set bond0.202 up mtu 1500 ip addr add 94.142.242.140/28 dev bond0.202 ip addr add 2a02:898:100::d1/64 dev bond0.202 ip link set bond0.205 up mtu 1500 ip addr add 94.142.242.98/27 dev bond0.205 ip addr add 2a02:898:17::fffe/64 dev bond0.205 ip link set bond0.206 up mtu 1500 ip addr add 185.52.224.92/28 dev bond0.206 ip addr add 2a02:898:90:1::2/125 dev bond0.206 ip link set bond0.2481 up mtu 1500 ip addr add 94.142.247.82/29 dev bond0.2481 ip addr add 2a02:898:0:f::2/64 dev bond0.2481 ip link set enp1s0f2 up mtu 1500 ip link set enp1s0f2.7 up mtu 1500 ip addr add 193.239.117.111/22 dev enp1s0f2.7 ip addr add 2001:7f8:13::a500:8283:1/64 dev enp1s0f2.7 ip link set enp1s0f2.26 up mtu 1500 ip addr add 213.207.10.53/26 dev enp1s0f2.26 ip addr add 2a02:10:3::a500:8283:1/64 dev enp1s0f2.26 ip link set enp1s0f3 up mtu 1500 ip link set enp1s0f3.108 up mtu 1500 ip addr add 94.142.247.243/31 dev enp1s0f3.108 ip addr add 2a02:898:0:301::15/127 dev enp1s0f3.108 ip link set enp1s0f3.110 up mtu 1500 ip addr add 37.139.140.23/31 dev enp1s0f3.110 ip addr add 2a00:a7c0:e20b:110::2/126 dev enp1s0f3.110 ip link set enp1s0f3.300 up mtu 1500 ip addr add 185.1.94.15/24 dev enp1s0f3.300 ip addr add 2001:7f8:b6::205b:1/64 dev enp1s0f3.300 4. Results And this is where it went horribly wrong. After installing the VPP packages on the dcg-1 machine, running Debian Buster on a Supermicro Super Server/X11SCW-F with BIOS 1.5 dated 10/12/2020, the vpp process was unable to bind the PCI devices for the Intel x710 NICs. I tried the following combinations:\nStock Buster kernel 4.19.0-14-amd64 and Backports kernel 5.10.0-0.bpo.3-amd64. The kernel driver vfio-pci and the DKMS for igb_uio from Debian package dpdk-igb-uio-dkms. Intel IOMMU off, on and strict (kernel boot parameter intel_iommu=on and intel_iommu=strict) BIOS setting for Intel VT-d on and off. Each time, I would start VPP with an explicit dpdk {} stanza, and observed the following. 
With the default vfio-pci driver, the VPP process would not start, and instead it would be spinning loglines:\n[ 74.378330] vfio-pci 0000:01:00.0: Masking broken INTx support [ 74.384328] vfio-pci 0000:01:00.0: vfio_ecap_init: hiding ecap 0x19@0x1d0 ## Repeated for all of the NICs 0000:01:00.[0123] Commenting out the dpdk { dev 0000:01:00.* } devices would allow it to start, detect the two i210 NICs, which both worked fine.\nWith the igb_uio driver, VPP would start, but not detect the x710 devices at all, it would detect the two i210 NICs, but they would not pass traffic or even link up:\n[ 139.495061] igb_uio 0000:01:00.0: uio device registered with irq 128 [ 139.522507] DMAR: DRHD: handling fault status reg 2 [ 139.528383] DMAR: [DMA Read] Request device [01:00.0] PASID ffffffff fault addr 138dac000 [fault reason 06] PTE Read access is not set ## Repeated for all 6 NICs I repeated this test of both drivers for all combinations of kernel, IOMMU and BIOS settings for VT-d, with exactly identical results.\nBaseline In a traceroute from BIT to Coloclue (using Junipers on hops 1-3, Linux kernel routing on hop 4), it\u0026rsquo;s clear that (a) only NL-IX is used on hop 4, which means that only dcg-1 is in the path and no other routers at Coloclue. From hop 4 onwards, one can clearly see high variance, with a 49.7ms standard deviation on a ~247.1ms worst case, even though the end to end latency is only 1.6ms and the NL-IX port is not congested.\nsandy (193.109.122.4) 2021-03-27T22:36:11+0100 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. lo0.leaf-sw4.bit-2b.network.bit.nl 0.0% 4877 0.3 0.2 0.1 7.8 0.2 2. lo0.leaf-sw6.bit-2a.network.bit.nl 0.0% 4877 0.3 0.2 0.2 1.1 0.1 3. xe-1-3-1.jun1.bit-2a.network.bit.nl 0.0% 4877 0.5 0.3 0.2 9.3 0.7 4. coloclue.the-datacenter-group.nl-ix.net 0.2% 4877 1.8 18.3 1.7 253.5 45.0 5. vpp-test.ams.ipng.ch 0.1% 4877 1.9 23.6 1.6 247.1 49.7 On the return path, seen by a traceroute from Coloclue to BIT (using Linux kernel routing on hop 2, Junipers on hops 2-4), it becomes clear that the very first hop (the Linux machine dcg-1) is contributing to high variance, with a 49.4ms standard deviation on a 257.9ms worst case, again on an NL-IX port that was not congested and easy sailing in BIT\u0026rsquo;s 10Gbit network from there on.\nvpp-test (185.52.225.1) 2021-03-27T21:36:43+0000 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. bond0-100.dcg-1.router.nl.coloclue.net 0.1% 4839 0.2 12.9 0.1 251.2 38.2 2. bit.bit2.nl-ix.net 0.0% 4839 10.7 22.6 1.4 261.8 48.3 3. lo0.leaf-sw5.bit-2a.network.bit.nl 0.0% 4839 1.8 20.9 1.6 263.0 46.9 4. lo0.leaf-sw3.bit-2b.network.bit.nl 0.0% 4839 155.7 22.7 1.4 282.6 50.9 5. sandy.ede.ipng.nl 0.0% 4839 1.8 22.9 1.6 257.9 49.4 New Configuration As I mentioned, I had expected this article to have a different outcome, in that I would\u0026rsquo;ve wanted to show off the superior routing performance under VPP of the beacon 185.52.225.1/24 which is found from AS12859 (BIT) via NL-IX directly through dcg-1. 
Alas, I did not manage to get the Intel x710 NIC to work with VPP, I ultimately rolled back but kept a few settings (Intel VT-d enabled and IOMMU on, hyperthreading disabled, Linux kernel 5.10 which uses a much newer version of the i40e for the NIC).\nThat combination definitely helped, the latency is now very smooth between BIT and Coloclue, a mean latency of 1.7ms, worst case 4.3ms and a standard deviation of 0.2ms only. That is as good as you could expect:\nsandy (193.109.122.4) 2021-03-28T16:20:05+0200 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. lo0.leaf-sw4.bit-2b.network.bit.nl 0.0% 4342 0.3 0.2 0.2 0.4 0.1 2. lo0.leaf-sw6.bit-2a.network.bit.nl 0.0% 4342 0.3 0.2 0.2 0.9 0.1 3. xe-1-3-1.jun1.bit-2a.network.bit.nl 0.0% 4341 0.4 1.0 0.3 28.3 2.3 4. coloclue.the-datacenter-group.nl-ix.net 0.0% 4341 1.8 1.8 1.7 3.4 0.1 5. vpp-test.ams.ipng.ch 0.0% 4341 1.8 1.7 1.7 4.3 0.2 On the return path, seen by a traceroute again from Coloclue to BIT, it becomes clear that dcg-1 is no longer causing jitter or loss, at least not to NL-IX and AS12859. The latency there is as well an expected 1.8ms with a worst cast of 3.5ms and a standard deviation of 0.1ms, in other words comparable to the BIT \u0026ndash;\u0026gt; Coloclue path:\nvpp-test (185.52.225.1) 2021-03-28T14:20:50+0000 Keys: Help Display mode Restart statistics Order of fields quit Packets Pings Host Loss% Snt Last Avg Best Wrst StDev 1. bond0-100.dcg-1.router.nl.coloclue.net 0.0% 4303 0.2 0.2 0.1 0.9 0.1 2. bit.bit2.nl-ix.net 0.0% 4303 1.6 2.2 1.4 17.1 2.2 3. lo0.leaf-sw5.bit-2a.network.bit.nl 0.0% 4303 1.8 1.7 1.6 6.6 0.4 4. lo0.leaf-sw3.bit-2b.network.bit.nl 0.0% 4303 1.6 1.5 1.4 4.2 0.2 5. sandy.ede.ipng.nl 0.0% 4303 1.9 1.8 1.7 3.5 0.1 Appendix Assorted set of notes \u0026ndash; because I did give it \u0026ldquo;one last try\u0026rdquo; and managed to get VPP to almost work on this Coloclue router :)\nBoot kernel 5.10 with intel_iommu=on iommu=pt Load kernel module igb_uio and unload vfio-pci before starting VPP What follows is a bunch of debugging information \u0026ndash; useful perhaps for a future attempt at running VPP at Coloclue.\nroot@dcg-1:/etc/vpp# tail -10 startup.conf dpdk { uio-driver igb_uio dev 0000:06:00.0 dev 0000:07:00.0 dev 0000:01:00.0 dev 0000:01:00.1 dev 0000:01:00.2 dev 0000:01:00.3 } root@dcg-1:/etc/vpp# lsmod | grep uio uio_pci_generic 16384 0 igb_uio 20480 5 uio 20480 12 igb_uio,uio_pci_generic [ 39.211999] igb_uio: loading out-of-tree module taints kernel. 
[ 39.218094] igb_uio: module verification failed: signature and/or required key missing - tainting kernel [ 39.228147] igb_uio: Use MSIX interrupt by default [ 91.595243] igb 0000:06:00.0: removed PHC on eno1 [ 91.716041] igb_uio 0000:06:00.0: mapping 1K dma=0x101c40000 host=0000000095299b4e [ 91.723683] igb_uio 0000:06:00.0: unmapping 1K dma=0x101c40000 host=0000000095299b4e [ 91.733221] igb 0000:07:00.0: removed PHC on eno2 [ 91.856255] igb_uio 0000:07:00.0: mapping 1K dma=0x101c40000 host=0000000095299b4e [ 91.863918] igb_uio 0000:07:00.0: unmapping 1K dma=0x101c40000 host=0000000095299b4e [ 91.988718] igb_uio 0000:06:00.0: uio device registered with irq 127 [ 92.039935] igb_uio 0000:07:00.0: uio device registered with irq 128 [ 105.040391] i40e 0000:01:00.0: i40e_ptp_stop: removed PHC on enp1s0f0 [ 105.232452] igb_uio 0000:01:00.0: mapping 1K dma=0x103a64000 host=00000000bc39c074 [ 105.240108] igb_uio 0000:01:00.0: unmapping 1K dma=0x103a64000 host=00000000bc39c074 [ 105.249142] i40e 0000:01:00.1: i40e_ptp_stop: removed PHC on enp1s0f1 [ 105.472489] igb_uio 0000:01:00.1: mapping 1K dma=0x180187000 host=000000003182585c [ 105.480148] igb_uio 0000:01:00.1: unmapping 1K dma=0x180187000 host=000000003182585c [ 105.489178] i40e 0000:01:00.2: i40e_ptp_stop: removed PHC on enp1s0f2 [ 105.700497] igb_uio 0000:01:00.2: mapping 1K dma=0x12108a000 host=000000006ccf7ec6 [ 105.708160] igb_uio 0000:01:00.2: unmapping 1K dma=0x12108a000 host=000000006ccf7ec6 [ 105.717272] i40e 0000:01:00.3: i40e_ptp_stop: removed PHC on enp1s0f3 [ 105.916553] igb_uio 0000:01:00.3: mapping 1K dma=0x121132000 host=00000000a0cf9ceb [ 105.924214] igb_uio 0000:01:00.3: unmapping 1K dma=0x121132000 host=00000000a0cf9ceb [ 106.051801] igb_uio 0000:01:00.0: uio device registered with irq 127 [ 106.131501] igb_uio 0000:01:00.1: uio device registered with irq 128 [ 106.211155] igb_uio 0000:01:00.2: uio device registered with irq 129 [ 106.288722] igb_uio 0000:01:00.3: uio device registered with irq 130 [ 106.367089] igb_uio 0000:06:00.0: uio device registered with irq 130 [ 106.418175] igb_uio 0000:07:00.0: uio device registered with irq 131 ### Note above: Gi6/0/0 and Te1/0/3 both use irq 130. root@dcg-1:/etc/vpp# vppctl show log | grep dpdk 2021/03/28 15:57:09:184 notice dpdk EAL: Detected 6 lcore(s) 2021/03/28 15:57:09:184 notice dpdk EAL: Detected 1 NUMA nodes 2021/03/28 15:57:09:184 notice dpdk EAL: Selected IOVA mode \u0026#39;PA\u0026#39; 2021/03/28 15:57:09:184 notice dpdk EAL: No available hugepages reported in hugepages-1048576kB 2021/03/28 15:57:09:184 notice dpdk EAL: No free hugepages reported in hugepages-1048576kB 2021/03/28 15:57:09:184 notice dpdk EAL: No available hugepages reported in hugepages-1048576kB 2021/03/28 15:57:09:184 notice dpdk EAL: Probing VFIO support... 2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xa80001000 != 0x7eff80000000) not respected! 2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes 2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec0c61000 != 0x7efb7fe00000) not respected! 2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes 2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! Base virtual address hint (0xec18c2000 != 0x7ef77fc00000) not respected! 2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes 2021/03/28 15:57:09:184 notice dpdk EAL: WARNING! 
Base virtual address hint (0xec2523000 != 0x7ef37fa00000) not respected! 2021/03/28 15:57:09:184 notice dpdk EAL: This may cause issues with mapping memory into secondary processes 2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.0 (socket 0) 2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.1 (socket 0) 2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.2 (socket 0) 2021/03/28 15:57:09:184 notice dpdk EAL: Invalid NUMA socket, default to 0 2021/03/28 15:57:09:184 notice dpdk EAL: Probe PCI driver: net_i40e (8086:1572) device: 0000:01:00.3 (socket 0) 2021/03/28 15:57:09:184 notice dpdk i40e_init_fdir_filter_list(): Failed to allocate memory for fdir filter array! 2021/03/28 15:57:09:184 notice dpdk ethdev initialisation failed 2021/03/28 15:57:09:184 notice dpdk EAL: Requested device 0000:01:00.3 cannot be used 2021/03/28 15:57:09:184 notice dpdk EAL: VFIO support not initialized 2021/03/28 15:57:09:184 notice dpdk EAL: Couldn\u0026#39;t map new region for DMA root@dcg-1:/etc/vpp# vppctl show pci Address Sock VID:PID Link Speed Driver Product Name Vital Product Data 0000:01:00.0 0 8086:1572 8.0 GT/s x8 igb_uio 0000:01:00.1 0 8086:1572 8.0 GT/s x8 igb_uio 0000:01:00.2 0 8086:1572 8.0 GT/s x8 igb_uio 0000:01:00.3 0 8086:1572 8.0 GT/s x8 igb_uio 0000:06:00.0 0 8086:1533 2.5 GT/s x1 igb_uio 0000:07:00.0 0 8086:1533 2.5 GT/s x1 igb_uio root@dcg-1:/etc/vpp# ip ro 94.142.242.96/27 dev bond0.205 proto kernel scope link src 94.142.242.98 94.142.242.128/28 dev bond0.202 proto kernel scope link src 94.142.242.140 94.142.244.0/24 dev bond0.100 proto kernel scope link src 94.142.244.252 94.142.246.0/24 dev bond0.201 proto kernel scope link src 94.142.246.252 94.142.247.40/29 dev bond0.102 proto kernel scope link src 94.142.247.44 94.142.247.80/29 dev bond0.2481 proto kernel scope link src 94.142.247.82 94.142.247.224/31 dev eno1 proto kernel scope link src 94.142.247.224 94.142.247.236/31 dev bond0.120 proto kernel scope link src 94.142.247.236 172.28.0.0/24 dev bond0.101 proto kernel scope link src 172.28.0.252 185.52.224.80/28 dev bond0.206 proto kernel scope link src 185.52.224.92 193.239.116.0/22 dev enp1s0f2.7 proto kernel scope link src 193.239.117.111 213.207.10.0/26 dev enp1s0f2.26 proto kernel scope link src 213.207.10.53 root@dcg-1:/etc/vpp# birdc6 show ospf neighbors BIRD 1.6.6 ready. ospf1: Router ID Pri\tState DTime\tInterface Router IP 94.142.247.2\t1\tFull/PtP 00:35\teno1 fe80::ae1f:6bff:feeb:858c 94.142.247.7\t128\tFull/PtP 00:35\tbond0.120 fe80::9ecc:8300:78b2:8b62 root@dcg-1:/etc/vpp# birdc show ospf neighbors BIRD 1.6.6 ready. ospf1: Router ID Pri\tState DTime\tInterface Router IP 94.142.247.2\t1\tExchange/PtP 00:37\teno1 94.142.247.225 94.142.247.7\t128\tExchange/PtP 00:39\tbond0.120 94.142.247.237 root@dcg-1:/etc/vpp# vppctl show bond details BondEthernet0 mode: lacp load balance: l2 number of active members: 2 TenGigabitEthernet1/0/0 TenGigabitEthernet1/0/1 number of members: 2 TenGigabitEthernet1/0/0 TenGigabitEthernet1/0/1 device instance: 0 interface id: 0 sw_if_index: 6 hw_if_index: 6 root@dcg-1:/etc/vpp# ping 193.239.116.1 PING 193.239.116.1 (193.239.116.1) 56(84) bytes of data. 
64 bytes from 193.239.116.1: icmp_seq=1 ttl=64 time=2.24 ms 64 bytes from 193.239.116.1: icmp_seq=2 ttl=64 time=0.571 ms 64 bytes from 193.239.116.1: icmp_seq=3 ttl=64 time=0.625 ms ^C --- 193.239.116.1 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 5ms rtt min/avg/max/mdev = 0.571/1.146/2.244/0.777 ms root@dcg-1:/etc/vpp# ping 94.142.244.85 PING 94.142.244.85 (94.142.244.85) 56(84) bytes of data. 64 bytes from 94.142.244.85: icmp_seq=1 ttl=64 time=0.226 ms 64 bytes from 94.142.244.85: icmp_seq=2 ttl=64 time=0.207 ms 64 bytes from 94.142.244.85: icmp_seq=3 ttl=64 time=0.200 ms 64 bytes from 94.142.244.85: icmp_seq=4 ttl=64 time=0.204 ms ^C --- 94.142.244.85 ping statistics --- 4 packets transmitted, 4 received, 0% packet loss, time 66ms rtt min/avg/max/mdev = 0.200/0.209/0.226/0.014 ms Cleaning up apt purge dpdk* vpp* apt autoremove rm -rf /etc/vpp rm /etc/sysctl.d/*vpp*.conf cp /etc/network/interfaces.2021-03-28 /etc/network/interfaces cp /root/.ssh/authorized_keys.2021-03-28 /root/.ssh/authorized_keys systemctl enable bird systemctl enable bird6 systemctl enable keepalived reboot Next steps Taking another look at IOMMU and PT redhat thread and in particular the part about allow_unsafe_interrupts in the kernel module. Find some ways to get the NICs (1x Intel x710 and 2x Intel i210) to detect in VPP. By then, probably the Linux CP (Interface mirroring and Netlink listener) will be submitted.\n","date":"2021-03-27","desc":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewers: Coloclue Network Committee \u0026lt;routers@coloclue.net\u0026gt; Status: Draft - Review - Published Introduction Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) has deteriorated, and they were up for renewal. At the same time, network latency/jitter has been very high, and variability may be caused by the Linux router hardware, their used software, the inter-datacenter links, or any combination of these. The routers were replaced with relatively modern hardware. In a previous post, I looked into the links between the datacenters, and demonstrated that they are performing as expected (1.41Mpps of 802.1q ethernet frames in both directions). That leaves the software. This post explores a replacement of the Linux kernel routers by a userspace process running VPP, which is an application built on DPDK.\n","permalink":"https://ipng.ch/s/articles/2021/03/27/case-study-vpp-at-coloclue-part-1/","section":"articles","title":"Case Study: VPP at Coloclue, part 1"},{"contents":"Introduction to IPng Networks At IPng Networks, we run a modest network with European reach. With our home base in Zurich, Switzerland, we are pretty well connected into the Swiss internet scene. We operate four sites in Zurich, and an additional set of sites in European cities, each of which are described on this post. If you\u0026rsquo;re curious as to how the network runs, you can find two main pieces here: Firstly, the physical parts, where exactly are IPng\u0026rsquo;s routers and switches, what types of kit does the ISP use, and so on. Secondly, the logical parts, what operating systems and configurations are in use.\nPhysical Zurich Metropolitan Area The Canton of Zurich, Switzerland is our home-base, and it\u0026rsquo;s where IPng Networks GmbH is registered. The local commercial datacenter scene is dominated by Interxion, NTT and Equinix. 
The small town of Brüttisellen (zipcode CH-8306) is where our founder lives and, due to the ongoing Corona pandemic, where he works from home.\nIn Brüttisellen, marked with C, we have our first two routers, chbtl0.ipng.ch and chbtl1.ipng.ch, racked in our office. There are only two fiber operators in this town - UPC and Swisscom. The orange trace (C to D) is a leased line from UPC, which we rent from Openfactory and which gets terminated at Interxion Glattbrugg, where our first router called chgtg0.ipng.ch is located. From there, Openfactory rents darkfiber to multiple locations - notably the dark purple trace (D to E) that connects from Interxion Glattbrugg to NTT Rümlang, where our second router called chrma0.ipng.ch is located.\nWe rent a 10G CWDM wave between these two datacenters, directly connecting these two routers. Now, Equinix also has a sizable footprint in Zürich, operating ZH04 (B, where we only have a passive optical presence) in the Industriekwartier (our local internet exchange SwissIX was born in the now defunct Equinix ZH01 office building). From the neighboring building Equinix ZH04, our partner IP-Max rents dark fiber to Equinix ZH05 in the Zurich Allmend area (the light purple trace B to F), and from there, IP-Max rents dark fiber to NTT Rümlang again (F to E), completing the ring. We rent a 10G circuit on that path, to redundantly connect our routers chgtg0 and chrma0. If at any time we\u0026rsquo;d need to connect partners or customers, we can do so at a moment\u0026rsquo;s notice, as rackspace is available in all Equinix sites for IPng Networks.\nThe green link (D to B) is a 10G carrier ethernet circuit from Interxion, which continues over the light purple path (B to A) on its last mile to Albisrieden, where we built a very small colocation site, which you can read about in more detail in our informational post - the colo is open for private individuals and small businesses (contact us for details!).\nEuropean Ring At IPng, we are strong believers in a free and open Internet. Having seen the shakeout of internet backbone providers over the last two decades, it seems to be a race to the bottom, with mergers, acquisitions and takeovers of datacenters and network carriers. Prices keep going down, to the point that for a small fish (let\u0026rsquo;s be honest, IPng Networks is definitely a small provider) purchasing IP transit is cheaper than connecting to local Internet exchange points. We\u0026rsquo;ve decided specifically to go the extra mile, quite literally, and plot a path to several continental European internet hubs.\nFrankfurt - Connected from NTT\u0026rsquo;s datacenter at Rümlang (Zurich) with a first 10G circuit, and from Interxion\u0026rsquo;s datacenter at Glattbrugg (Zurich) with a second 10G circuit, this is our first hop into the world. Here, we connect to DE-CIX from Equinix FR5 at the Kleyerstrasse. More details in our post IPng Arrives in Frankfurt.\nAmsterdam - The Amsterdam Science Park is where European Internet was born. NIKHEF is where we rent rackspace that connects with a 10G circuit to Frankfurt, and a 10G circuit onwards towards Lille. We connect to Speed-IX, LSIX, NL-IX, and an exchange point we help run called FrysIX. More details in our post IPng Arrives in Amsterdam.\nLille - IP-Max does lots of business in this region, with presence in both local datacenters here, one in Lille and one in Anzin. IPng has a point of presence here too, at the CIV1 facility, with a northbound 10G circuit to Amsterdam, and a southbound 10G circuit to Paris.
Here, we connect to LillIX. More details in our post IPng Arrives in Lille.\nParis - Two large facilities are placed back-to-back in the middle of the city: the original Telehouse TH2 and a new facility at Léon Frot, where we pick up a 10G circuit from Lille and continue on the ring with a 10G circuit to Geneva. Here, we connect to FranceIX. More details in our post IPng Arrives in Paris.\nGeneva - The home-base of IP-Max is where we close our ring. From Paris, IP-Max has two redundant paths back to Switzerland, the first being a DWDM link to Zurich, and the second being a DWDM link to Lyon and then into Geneva. Here, at SafeHost in Plan les Ouates, we have our fourth Swiss point of presence, with a connection to our very own Free-IX and a 10G circuit to Interxion at Glattbrugg (Zurich), and of course to Paris. More details in our post IPng Arrives in Geneva.\nLogical As a small operator, we\u0026rsquo;d love to be able to boast the newest Juniper PTX10016 routers, but we have neither the rack space, nor the power budget, nor, to be perfectly honest, the monetary budget to run these at IPng Networks. But it turns out, we know a fair bit about hardware silicon, architecture and the controlplane software running on commercial routers.\nWe\u0026rsquo;ve decided to go a different route. In our opinion, at speeds under 100Gbit, it\u0026rsquo;s perfectly viable to use software routers on off-the-shelf hardware, notably Intel network cards and CPUs that have support for the Dataplane Development Kit (aka DPDK), which offers libraries to accelerate packet processing workloads and turns ordinary servers into very performant routers. Two notable applications are VPP and DANOS.\nVPP VPP originally comes from the house of Cisco [ref] and looks quite a bit like the commercial ASR9k platform. In development since 2002, VPP is production code currently running in shipping products. It runs in user space on multiple architectures including x86, ARM, and Power architectures on both x86 servers and embedded devices. The design of VPP is hardware, kernel, and deployment (bare metal, VM, container) agnostic. It runs completely in userspace.\nWe\u0026rsquo;ve contributed a little bit to the Control Plane abstraction [ref], which allows users to combine the throughput of a dataplane with the usual routing software like Bird or FRR. We\u0026rsquo;ve been running it in production since December 2020 on chbtl1.ipng.ch. It\u0026rsquo;s our ultimate goal to run VPP and Linux Control Plane on the entire network, as the design and architecture really resonates with us as software and systems engineers.\nDANOS The Disaggregated Network Operating System (DANOS) project originally comes from AT\u0026amp;T’s “dNOS” software framework and provides an open, cost-effective and flexible alternative to traditional networking equipment. As part of The Linux Foundation, it now incorporates contributions from complementary open source communities in building a standardized distributed Network Operating System (NOS) to speed the adoption and use of white boxes in a service provider’s infrastructure.\nWe\u0026rsquo;ve been using DANOS since its first release in August 2019, and it\u0026rsquo;s currently our routing platform of choice \u0026ndash; it combines the sheer speed of DPDK with a Vyatta command line interface. As an appliance, care was taken to complete the whole package, with SNMP, YANG interface, image and upgrade management, interface monitoring with wireshark semantics, et cetera.
Performing 10G workloads easily at wire speed (including 64-byte ethernet frames), and being completely open source, it fits very well with our philosophy of an open and free internet.\n","date":"2021-02-27","desc":"Introduction to IPng Networks At IPng Networks, we run a modest network with European reach. With our home base in Zurich, Switzerland, we are pretty well connected into the Swiss internet scene. We operate four sites in Zurich, and an additional set of sites in European cities, each of which are described on this post. If you\u0026rsquo;re curious as to how the network runs, you can find two main pieces here: Firstly, the physical parts, where exactly are IPng\u0026rsquo;s routers and switches, what types of kit does the ISP use, and so on. Secondly, the logical parts, what operating systems and configurations are in use.\n","permalink":"https://ipng.ch/s/articles/2021/02/27/ipng-network/","section":"articles","title":"IPng Network"},{"contents":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewers: Coloclue Network Committee \u0026lt;routers@coloclue.net\u0026gt; Status: Draft - Review - Published Introduction Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) has deteriorated, and they were up for renewal. At the same time, network latency/jitter has been very high, and variability may be caused by the Linux router hardware, the software in use, the inter-datacenter links, or any combination of these. One specific example of why this is important is that Coloclue runs BFD on their inter-datacenter links, which are VLANs provided to us by Atom86 and Fusix networks. On these links Coloclue regularly sees ping times of 300-400ms, with huge outliers in the 1000ms range, which triggers BFD timeouts causing iBGP reconvergence events and overall horrible performance. Before we open up a discussion with these (excellent!) L2 providers, we should first establish whether it is not more likely that Coloclue\u0026rsquo;s router hardware and/or software should be improved instead.\nBy means of example, let’s take a look at a Smokeping graph that shows these latency spikes, jitter and loss quite well. It’s taken from a machine at True (in EUNetworks) to a machine at Coloclue (in NorthC); this is the first graph. The same machine at True to a machine at BIT (in Ede) does not exhibit this behavior; this is the second graph.\nImages: Smokeping graph from True to Coloclue (left), and True to BIT (right). There is quite a difference.\nSummary I performed three separate loadtests. First, I did a loopback loadtest on the T-Rex machine, proving that it can send 1.488Mpps in both directions simultaneously. Then, I did a loadtest of the Atom86 link by sending the traffic through the Arista in NorthC, over the Atom86 link, to the Arista in EUNetworks, looping two ethernet ports, and sending the traffic back to NorthC. Due to VLAN tagging, this yielded 1.42Mpps throughput, exactly as predicted. Finally, I performed a stateful loadtest that saturated the Atom86 link, while injecting SCTP packets at 1KHz, measuring the latency observed over the Atom86 link.\nAll three tests passed.\nLoadtest Setup After deploying the new NorthC routers (Supermicro Super Server/X11SCW-F with Intel Xeon E-2286G processors), I decided to rule out hardware issues, leaving link and software issues. To get a bit more insight into the software or inter-datacenter links, I created the following two loadtest setups.\n1.
Baseline Machine dcg-2, carrying an Intel 82576 quad Gigabit NIC, looped from the first two ports (port0 to port1). The point of this loopback test is to ensure that the machine itself is capable of sending and receiving the correct traffic patterns. Usually, one does an “imix” and a “64b” loadtest for this, and it is expected that the loadtester itself passes all traffic out on one port back into the other port, without any loss. The thing I am testing is called the DUT or Device Under Test and in this case, it is a UTP cable from NIC to NIC.\nThe expected packet rate: a minimum-sized ethernet frame occupies 672 bits on the wire (including preamble and interpacket gap), so 10^9 / 672 == 1488095 packets per second in each direction, traversing the link once. You will often see 1.488Mpps as “the theoretical maximum”, and this is why.\n2. Atom86 In this test, Tijn from Coloclue plugged dcg-2 port0 into the core switch (an Arista) port e17, and he configured that switchport as an access port for VLAN A, which is put on the Atom86 trunk to EUNetworks. The second port, port1, is plugged into the core switch port e18, and assigned a different VLAN B, which is also put on the Atom86 link to EUNetworks.\nAt EUNetworks then, he exposed that same VLAN A on port e17 and VLAN B on port e18, and Tijn used a DAC cable to connect e17 \u0026lt;-\u0026gt; e18. Thus, the path the packets travel now becomes the Device Under Test (DUT):\nport0 -\u0026gt; dcg-core-2:e17 -\u0026gt; Atom86 -\u0026gt; eunetworks-core-2:e17\neunetworks-core-2:e18 -\u0026gt; Atom86 -\u0026gt; dcg-core-2:e18 -\u0026gt; port1\nI should note that because the loadtester emits traffic which is tagged by the *-core-2 switches, the Atom86 link will see each tagged packet twice, and as we\u0026rsquo;ll see, that VLAN tagging actually matters! The maximum expected packet rate is: 672 bits for the ethernet frame + 32 bits for the VLAN tag == 704 bits per packet, sent in both directions, but traversing the link twice. We can deduce that we should see 10^9 / 704 / 2 == 710227 packets per second in each direction.\nDetailed Analysis This section goes into details, but it is roughly broken down into:\nPrepare machine (install T-Rex, needed kernel headers, and some packages) Configure T-Rex (bind NIC from PCI bus into DPDK) Run T-Rex interactively Run T-Rex programmatically Step 1 - Prepare machine Download T-Rex from the Cisco website and unpack (I used version 2.88) in some directory that is readable by ‘nobody’. I used /tmp/loadtest/ for this. Install some additional tools:\nsudo apt install linux-headers-`uname -r` build-essential python3-distutils Step 2 - Bind NICs to DPDK First I had to find which NICs can be used; these NICs have to be supported in DPDK, but luckily most Intel NICs are.
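As a quick aside, the two theoretical packet rates derived in the test descriptions above can be reproduced with a couple of lines of Python; this is pure arithmetic on the frame sizes quoted in the text, nothing more:

# 1 Gigabit/s divided by the on-the-wire size of a minimum ethernet frame.
GIGABIT = 10**9                    # bits per second
UNTAGGED_BITS = 672                # 64b frame + preamble + interpacket gap
TAGGED_BITS = UNTAGGED_BITS + 32   # one 802.1Q VLAN tag added by the switch

print(int(GIGABIT / UNTAGGED_BITS))    # 1488095 pps: loopback baseline
print(int(GIGABIT / TAGGED_BITS / 2))  # 710227 pps: Atom86 test, link traversed twice

With the theory out of the way, back to finding NICs that DPDK can drive.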
I had a few ethernet NICs to choose from:\nroot@dcg-2:/tmp/loadtest/v2.88# lspci | grep -i Ether 01:00.0 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 01:00.1 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 01:00.2 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 01:00.3 Ethernet controller: Intel Corporation Ethernet Controller X710 for 10GbE SFP+ (rev 02) 05:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 05:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 07:00.0 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 07:00.1 Ethernet controller: Intel Corporation 82576 Gigabit Network Connection (rev 01) 0c:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) 0d:00.0 Ethernet controller: Intel Corporation I210 Gigabit Network Connection (rev 03) But the ones that have no link are a good starting point:\nroot@dcg-2:~/loadtest/v2.88# ip link | grep -v UP | grep enp7 6: enp7s0f0: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 7: enp7s0f1: \u0026lt;BROADCAST,MULTICAST\u0026gt; mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000 This is PCI bus 7, slot 0, function 0 and 1, so the configuration file for T-Rex becomes:\nroot@dcg-2:/tmp/loadtest/v2.88# cat /etc/trex_cfg.yaml - version : 2 interfaces : [\u0026#34;07:00.0\u0026#34;,\u0026#34;07:00.1\u0026#34;] port_limit : 2 port_info : - dest_mac : [0x0,0x0,0x0,0x1,0x0,0x00] # port 0 src_mac : [0x0,0x0,0x0,0x2,0x0,0x00] - dest_mac : [0x0,0x0,0x0,0x2,0x0,0x00] # port 1 src_mac : [0x0,0x0,0x0,0x1,0x0,0x00] Step 3 - Run T-Rex Interactively Start the loadtester, this is easiest if you use two terminals, one to run t-rex itself and one to run the console:\nroot@dcg-2:/tmp/loadtest/v2.88# ./t-rex-64 -i root@dcg-2:/tmp/loadtest/v2.88# ./trex-console The loadtester starts with -i (interactive) and optionally -c (number of cores to use, in this case only 1 CPU core is used). I will be doing a loadtest with gigabit speeds only, so no significant CPU is needed. I will demonstrate below that one CPU core of this machine can generate (sink and source) approximately 72Gbit/s of traffic. The loadtest starts a controlport on :4501 which the client connects to. You can now program the loadtester (programmatically via an API, or via the commandline / CLI tool provided. I’ll demonstrate both).\nIn trex-console, I first enter ‘TUI’ mode \u0026ndash; this stands for the Traffic UI. Here, I can load a profile into the loadtester, and while you can write your own profiles, there are many standard ones to choose from. There’s further two types of loadtest, stateful and stateless. 
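To give an idea of what a stateless profile looks like: it is just a small Python module that returns a list of streams. A simplified sketch in the style of the bundled stl/ examples (illustrative only, not one of the shipped profiles):

# Minimal stateless profile: one continuous UDP stream between the same
# addresses used by the bundled profiles; the rate is chosen at start time (-m).
from trex_stl_lib.api import *   # STLStream, STLPktBuilder, STLTXCont, Ether/IP/UDP

class STLMinimal(object):
    def get_streams(self, direction=0, **kwargs):
        base_pkt = Ether() / IP(src="16.0.0.1", dst="48.0.0.1") / UDP(dport=12)
        return [STLStream(packet=STLPktBuilder(pkt=base_pkt), mode=STLTXCont())]

def register():
    return STLMinimal()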
I started with a simpler ‘stateless’ one first, take a look at stl/imix.py which is self explanatory, but in particular, the mix consists of:\nself.ip_range = {\u0026#39;src\u0026#39;: {\u0026#39;start\u0026#39;: \u0026#34;16.0.0.1\u0026#34;, \u0026#39;end\u0026#39;: \u0026#34;16.0.0.254\u0026#34;}, \u0026#39;dst\u0026#39;: {\u0026#39;start\u0026#39;: \u0026#34;48.0.0.1\u0026#34;, \u0026#39;end\u0026#39;: \u0026#34;48.0.0.254\u0026#34;}} # default IMIX properties self.imix_table = [ {\u0026#39;size\u0026#39;: 60, \u0026#39;pps\u0026#39;: 28, \u0026#39;isg\u0026#39;:0 }, {\u0026#39;size\u0026#39;: 590, \u0026#39;pps\u0026#39;: 16, \u0026#39;isg\u0026#39;:0.1 }, {\u0026#39;size\u0026#39;: 1514, \u0026#39;pps\u0026#39;: 4, \u0026#39;isg\u0026#39;:0.2 } ] Above one can see that there will be traffic flowing from 16.0.0.1-254 to 48.0.0.1-254, and there will be three streams generated at a certain ratio, 28 small 60 byte packets, 16 medium sized 590b packets, and 4 large 1514b packets. This is typically what a residential user would see (a SIP telephone call, perhaps a Jitsi video stream; some download of data with large MTU-filling packets; and some DNS requests and other smaller stuff). Executing this profile can be done with:\ntui\u0026gt; start -f stl/imix.py -m 1kpps .. which will start a 1kpps load of that packet stream. The traffic load can be changed by either specifying an absolute packet rate, or a percentage of line rate, and you can pause and resume as well:\ntui\u0026gt; update -m 10kpps tui\u0026gt; update -m 10% tui\u0026gt; update -m 50% tui\u0026gt; pause # do something, there will be no traffic tui\u0026gt; resume tui\u0026gt; update -m 100% After this last command, T-Rex will be emitting line rate packets out of port0 and out of port1, and it will be expecting to see the packets that it sent back on port1 and port0 respectively. If the machine is powerful enough, it can saturate traffic up to the line rate in both directions. One can see if things are successfully passing through the device under test (in this case, for now simply a UTP cable from port0-port1). The ‘ibytes’ should match the ‘obytes’, and of course ‘ipackets’ should match the ‘opackets’ in both directions. Typically, a loss rate of 0.01% is considered acceptable. And, typically, a loss rate of a few packets in the beginning of the loadtest is also acceptable (more on that later).\nScreenshot of port0-port1 loopback test with L2:\nGlobal Statistitcs connection : localhost, Port 4501 total_tx_L2 : 1.51 Gbps version : STL @ v2.88 total_tx_L1 : 1.98 Gbps cpu_util. : 5.63% @ 1 cores (1 per dual port) total_rx : 1.51 Gbps rx_cpu_util. : 0.0% / 0 pps total_pps : 2.95 Mpps async_util. : 0% / 104.03 bps drop_rate : 0 bps total_cps. : 0 cps queue_full : 0 pkts Port Statistics port | 0 | 1 | total -----------+-------------------+-------------------+------------------ owner | root | root | link | UP | UP | state | TRANSMITTING | TRANSMITTING | speed | 1 Gb/s | 1 Gb/s | CPU util. | 5.63% | 5.63% | -- | | | Tx bps L2 | 755.21 Mbps | 755.21 Mbps | 1.51 Gbps Tx bps L1 | 991.21 Mbps | 991.21 Mbps | 1.98 Gbps Tx pps | 1.48 Mpps | 1.48 Mpps | 2.95 Mpps Line Util. 
| 99.12 % | 99.12 % | --- | | | Rx bps | 755.21 Mbps | 755.21 Mbps | 1.51 Gbps Rx pps | 1.48 Mpps | 1.48 Mpps | 2.95 Mpps ---- | | | opackets | 355108111 | 355111209 | 710219320 ipackets | 355111078 | 355108226 | 710219304 obytes | 22761267356 | 22761466414 | 45522733770 ibytes | 22761457966 | 22761274908 | 45522732874 tx-pkts | 355.11 Mpkts | 355.11 Mpkts | 710.22 Mpkts rx-pkts | 355.11 Mpkts | 355.11 Mpkts | 710.22 Mpkts tx-bytes | 22.76 GB | 22.76 GB | 45.52 GB rx-bytes | 22.76 GB | 22.76 GB | 45.52 GB ----- | | | oerrors | 0 | 0 | 0 ierrors | 0 | 0 | 0 Instead of stl/imix.py as a profile, one can also consider stl/udp_1pkt_simple_bdir.py as a profile. This will send UDP packets of 0 bytes payload from a single host 16.0.0.1 to a single host 48.0.0.1 and back. Running the 1pkt UDP profile in both directions at gigabit link speeds will allow for 1.488Mpps in both directions (a minimum ethernet frame carrying IPv4 packet will be 672 bits in length \u0026ndash; see wikipedia for details).\nAbove, one can see the system is in a healthy state - it has saturated the network bandwidth in both directions (991Mps L1 rate, so this is the full 672 bits per ethernet frame, including the header, interpacket gap, etc), at 1.48Mpps. All packets sent by port0 (the opackets, obytes) should have been received by port1 (the ipackets, ibytes), and they are.\nOne can also learn that T-Rex is utilizing approximately 5.6% of one CPU core sourcing and sinking this load on the two gigabit ports (that’s 2 gigabit out, 2 gigabit in), so for a DPDK application, one CPU core is capable of 71Gbps and 53Mpps, an interesting observation.\nStep 4 - Run T-Rex programmatically I wrote a tool previously that allows to run a specific ramp-up profile from 1kpps warmup through to line rate, in order to find the maximum allowable throughput before a DUT exhibits too much loss, usage:\nusage: trex-loadtest.py [-h] [-s SERVER] [-p PROFILE_FILE] [-o OUTPUT_FILE] [-wm WARMUP_MULT] [-wd WARMUP_DURATION] [-rt RAMPUP_TARGET] [-rd RAMPUP_DURATION] [-hd HOLD_DURATION] T-Rex Stateless Loadtester -- pim@ipng.nl optional arguments: -h, --help show this help message and exit -s SERVER, --server SERVER Remote trex address (default: 127.0.0.1) -p PROFILE_FILE, --profile PROFILE_FILE STL profile file to replay (default: imix.py) -o OUTPUT_FILE, --output OUTPUT_FILE File to write results into, use \u0026#34;-\u0026#34; for stdout (default: -) -wm WARMUP_MULT, --warmup_mult WARMUP_MULT During warmup, send this \u0026#34;mult\u0026#34; (default: 1kpps) -wd WARMUP_DURATION, --warmup_duration WARMUP_DURATION Duration of warmup, in seconds (default: 30) -rt RAMPUP_TARGET, --rampup_target RAMPUP_TARGET Target percentage of line rate to ramp up to (default: 100) -rd RAMPUP_DURATION, --rampup_duration RAMPUP_DURATION Time to take to ramp up to target percentage of line rate, in seconds (default: 600) -hd HOLD_DURATION, --hold_duration HOLD_DURATION Time to hold the loadtest at target percentage, in seconds (default: 30) Here, the loadtester will load a profile (imix.py for example), warmup for 30s at 1kpps, then ramp up linearly to 100% of line rate in 600s, and hold at line rate for 30s. The loadtest passes if during this entire time, the DUT had less than 0.01% packet loss. 
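For readers curious what such a script looks like under the hood, here is a heavily simplified sketch using the T-Rex stateless (STL) Python automation API. It is not the actual trex-loadtest.py: the import path assumes the directory layout of the v2.88 tarball, the ramp is a crude stepped loop rather than a smooth one, and error handling, per-direction bookkeeping and the 0.01% pass/fail logic are all omitted:

import sys, time
# Assumption: the API package shipped inside the v2.88 tarball is on the path.
sys.path.insert(0, '/tmp/loadtest/v2.88/automation/trex_control_plane/interactive')
from trex.stl.api import STLClient, STLProfile

c = STLClient(server='127.0.0.1')   # T-Rex must already be running with -i
c.connect()
c.reset(ports=[0, 1])

# Load the same stl/imix.py profile and attach its streams to both ports.
profile = STLProfile.load_py('stl/imix.py')
c.add_streams(profile.get_streams(), ports=[0, 1])

# Warm up at a trickle for 30 seconds, just like the -wm/-wd flags do.
c.clear_stats()
c.start(ports=[0, 1], mult='1kpps', duration=30)
c.wait_on_traffic(ports=[0, 1])

# Crude ramp: step the multiplier towards 50% of line rate over ~600 seconds,
# mirroring what 'update -m N%' does in the console.
c.start(ports=[0, 1], mult='1%')
for pct in range(2, 51):
    time.sleep(12)
    c.update(ports=[0, 1], mult='%d%%' % pct)

stats = c.get_stats()
tx = stats['total']['opackets']
rx = stats['total']['ipackets']
print('sent %d, received %d, loss %.4f%%' % (tx, rx, 100.0 * (tx - rx) / tx))

c.stop(ports=[0, 1])
c.disconnect()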
I must note that in this loadtest, I cannot ramp up to line rate (because the Atom86 link is used twice!), and I’ll also note I cannot ramp up to 50% of line rate (because the loadtester is sending untagged traffic, but the Arista is adding tags onto the Atom86 link!), so I expect to see 711Kpps which is just about 47% of line rate.\nThe loadtester will emit a JSON file with all of its runtime stats, which can be later analyzed and used to plot graphs. First, let’s look at an imix loadtest:\nroot@dcg-2:/tmp/loadtest# trex-loadtest.py -o ~/imix.json -p imix.py -rt 50 Running against 127.0.0.1 profile imix.py, warmup 1kpps for 30s, rampup target 50% of linerate in 600s, hold for 30s output goes to /root/imix.json Mapped ports to sides [0] \u0026lt;--\u0026gt; [1] Warming up [0] \u0026lt;--\u0026gt; [1] at rate of 1kpps for 30 seconds Setting load [0] \u0026lt;--\u0026gt; [1] to 1% of linerate stats: 4.20 Kpps 2.82 Mbps (0.28% of linerate) … stats: 321.30 Kpps 988.14 Mbps (98.81% of linerate) Loadtest finished, stopping Test has passed :-) Writing output to /root/imix.json And then let’s step up our game with a 64b loadtest:\nroot@dcg-2:/tmp/loadtest# trex-loadtest.py -o ~/64b.json -p udp_1pkt_simple_bdir.py -rt 50 Running against 127.0.0.1 profile udp_1pkt_simple_bdir.py, warmup 1kpps for 30s, rampup target 50% of linerate in 600s, hold for 30s output goes to /root/64b.json Mapped ports to sides [0] \u0026lt;--\u0026gt; [1] Warming up [0] \u0026lt;--\u0026gt; [1] at rate of 1kpps for 30 seconds Setting load [0] \u0026lt;--\u0026gt; [1] to 1% of linerate stats: 4.20 Kpps 2.82 Mbps (0.28% of linerate) … stats: 1.42 Mpps 956.41 Mbps (95.64% of linerate) stats: 1.42 Mpps 952.44 Mbps (95.24% of linerate) WARNING: DUT packetloss too high stats: 1.42 Mpps 955.19 Mbps (95.52% of linerate) As an interesting note, this value 1.42Mpps is exactly what I calculated and expected (see above for a full explanation). The math works out at 10^9 / 704 bits/packet == 1.42Mpps, just short of 1.488M line rate that I would have found had Coloclue not used VLAN tags.\nStep 5 - Run T-Rex ASTF, measure latency/jitter In this mode, T-Rex simulates many stateful flows using a profile, which replays actual PCAP data (as can be obtained with tcpdump), by spacing out the requests and rewriting the source/destination addresses, thereby simulating hundreds or even millions of active sessions - I used astf/http_simple.py as a canonical example. In parallel to the test, I let T-Rex run a latency check, by sending SCTP packets at a rate of 1KHz from each interface. By doing this, latency profile and jitter can be accurately measured under partial or full line load.\nBandwidth/Packet rate Let’s first take a look at the bandwidth and packet rates:\nGlobal Statistitcs connection : localhost, Port 4501 total_tx_L2 : 955.96 Mbps version : ASTF @ v2.88 total_tx_L1 : 972.91 Mbps cpu_util. : 5.64% @ 1 cores (1 per dual port) total_rx : 955.93 Mbps rx_cpu_util. : 0.06% / 2 Kpps total_pps : 105.92 Kpps async_util. : 0% / 63.14 bps drop_rate : 0 bps total_cps. : 3.46 Kcps queue_full : 143,837 pkts Port Statistics port | 0 | 1 | total -----------+-------------------+-------------------+------------------ owner | root | root | link | UP | UP | state | TRANSMITTING | TRANSMITTING | speed | 1 Gb/s | 1 Gb/s | CPU util. | 5.64% | 5.64% | -- | | | Tx bps L2 | 17.35 Mbps | 938.61 Mbps | 955.96 Mbps Tx bps L1 | 20.28 Mbps | 952.62 Mbps | 972.91 Mbps Tx pps | 18.32 Kpps | 87.6 Kpps | 105.92 Kpps Line Util. 
| 2.03 % | 95.26 % | --- | | | Rx bps | 938.58 Mbps | 17.35 Mbps | 955.93 Mbps Rx pps | 87.59 Kpps | 18.32 Kpps | 105.91 Kpps ---- | | | opackets | 8276689 | 39485094 | 47761783 ipackets | 39484516 | 8275603 | 47760119 obytes | 978676133 | 52863444478 | 53842120611 ibytes | 52862853894 | 978555807 | 53841409701 tx-pkts | 8.28 Mpkts | 39.49 Mpkts | 47.76 Mpkts rx-pkts | 39.48 Mpkts | 8.28 Mpkts | 47.76 Mpkts tx-bytes | 978.68 MB | 52.86 GB | 53.84 GB rx-bytes | 52.86 GB | 978.56 MB | 53.84 GB ----- | | | oerrors | 0 | 0 | 0 ierrors | 0 | 0 | 0 In the above screen capture, one can see the traffic out of port0 is 20Mbps at 18.3Kpps, while the traffic out of port1 is 952Mbps at 87.6Kpps - this is because the clients are sourcing from port0, while the servers are simulated behind port1. Note the asymmetric traffic flow, T-Rex is using 972Mbps of total bandwidth over this 1Gbps VLAN, and a tiny bit more than that on the Atom86 link, because the Aristas are inserting VLAN tags in transit, to be exact, 18.32+87.6 = 105.92Kpps worth of 4 byte tags, thus 3.389Mbit extra traffic.\nLatency Injection Now, let’s look at the latency in both directions, depicted in microseconds, at a throughput of 106Kpps (975Mbps):\nGlobal Statistitcs connection : localhost, Port 4501 total_tx_L2 : 958.07 Mbps version : ASTF @ v2.88 total_tx_L1 : 975.05 Mbps cpu_util. : 4.86% @ 1 cores (1 per dual port) total_rx : 958.06 Mbps rx_cpu_util. : 0.05% / 2 Kpps total_pps : 106.15 Kpps async_util. : 0% / 63.14 bps drop_rate : 0 bps total_cps. : 3.47 Kcps queue_full : 143,837 pkts Latency Statistics Port ID: | 0 | 1 -------------+-----------------+---------------- TX pkts | 244068 | 242961 RX pkts | 242954 | 244063 Max latency | 23983 | 23872 Avg latency | 815 | 702 -- Window -- | | Last max | 966 | 867 Last-1 | 948 | 923 Last-2 | 945 | 856 Last-3 | 974 | 880 Last-4 | 963 | 851 Last-5 | 985 | 862 Last-6 | 986 | 870 Last-7 | 946 | 869 Last-8 | 976 | 879 Last-9 | 964 | 867 Last-10 | 964 | 837 Last-11 | 970 | 867 Last-12 | 1019 | 897 Last-13 | 1009 | 908 Last-14 | 1006 | 897 Last-15 | 1022 | 903 Last-16 | 1015 | 890 --- | | Jitter | 42 | 45 ---- | | Errors | 3 | 2 In the capture above, one can see the total latency measurement packets sent, and the latency measurements in microseconds. One can see that from port0-\u0026gt;port1 the measured latency was 0.815ms, while the latency in the other direction was 0.702ms. The discrepancy can be explained by the HTTP traffic being asymmetric (clients on port0 have to send their SCTP packets into a much busier port1), which creates queuing latency on the wire and NIC. The Last-* lines under it are the values of the last 16 seconds of measurements. The maximum observed latency was 23.9ms in one direction and 23.8ms in the other direction. I have to conclude therefore that the Atom86 line, even under stringent load, does not suffer from outliers in the entire 300s duration of my loadtest.\nJitter is defined as a variation in the delay of received packets. At the sending side, packets are sent in a continuous stream with the packets spaced evenly apart. Due to network congestion, improper queuing, or configuration errors, this steady stream can become lumpy, or the delay between each packet can vary instead of remaining constant. 
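To make that definition concrete: one common way to condense send/receive timestamps into a single jitter figure is the RFC 3550 estimator, a running smoothed average of the variation in transit time. I have not verified that T-Rex uses exactly this formula internally, so treat the sketch below as an illustration of the concept only (the helper and the example numbers are mine):

def rfc3550_jitter(tx_timestamps, rx_timestamps):
    # RFC 3550 interarrival jitter: J(i) = J(i-1) + (|D(i-1,i)| - J(i-1)) / 16,
    # where D is the difference in one-way transit time between two packets.
    jitter = 0.0
    prev_transit = None
    for tx, rx in zip(tx_timestamps, rx_timestamps):
        transit = rx - tx
        if prev_transit is not None:
            d = abs(transit - prev_transit)
            jitter += (d - jitter) / 16.0   # exponential smoothing, gain 1/16
        prev_transit = transit
    return jitter

# Example: probes sent every 1ms (1KHz), arriving with a latency that wobbles
# between 800us and 820us, yield a jitter on the order of tens of microseconds.
tx = [i * 1000 for i in range(1000)]                       # in microseconds
rx = [t + 800 + (20 if i % 2 else 0) for i, t in enumerate(tx)]
print(round(rfc3550_jitter(tx, rx)))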
There was virtually no jitter: 42 microseconds in one direction, 45us in the other.\nLatency Distribution While performing this test at 106Kpps (975Mbps), it\u0026rsquo;s also useful to look at the latency distribution as a histogram:\nGlobal Statistitcs connection : localhost, Port 4501 total_tx_L2 : 958.07 Mbps version : ASTF @ v2.88 total_tx_L1 : 975.05 Mbps cpu_util. : 4.86% @ 1 cores (1 per dual port) total_rx : 958.06 Mbps rx_cpu_util. : 0.05% / 2 Kpps total_pps : 106.15 Kpps async_util. : 0% / 63.14 bps drop_rate : 0 bps total_cps. : 3.47 Kcps queue_full : 143,837 pkts Latency Histogram Port ID: | 0 | 1 -------------+-----------------+---------------- 20000 | 2545 | 2495 10000 | 5889 | 6100 9000 | 456 | 421 8000 | 874 | 854 7000 | 692 | 757 6000 | 619 | 637 5000 | 985 | 994 4000 | 579 | 620 3000 | 547 | 546 2000 | 381 | 405 1000 | 798 | 697 900 | 27451 | 346 800 | 163717 | 22924 700 | 102194 | 154021 600 | 24623 | 171087 500 | 24882 | 40586 400 | 18329 | 300 | 26820 | In the capture above, one can see the number of packets observed between certain ranges; from port0 to port1, 102K SCTP latency probe packets were in transit some time between 700-799us, 163K probes were between 800-899us. In the other direction, 171K probes were between 600-699us and 154K probes were between 700-799us. This is corroborated by the mean latency I saw above (815us from port0-\u0026gt;port1 and 702us from port1-\u0026gt;port0).\n","date":"2021-02-27","desc":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewers: Coloclue Network Committee \u0026lt;routers@coloclue.net\u0026gt; Status: Draft - Review - Published Introduction Coloclue AS8283 operates several Linux routers running Bird. Over the years, the performance of their previous hardware platform (Dell R610) has deteriorated, and they were up for renewal. At the same time, network latency/jitter has been very high, and variability may be caused by the Linux router hardware, their used software, the intra-datacenter links, or any combination of these. One specific example of why this is important is that Coloclue runs BFD on their inter-datacenter links, which are VLANs provided to us by Atom86 and Fusix networks. On these links Colclue regularly sees ping times of 300-400ms, with huge outliers in the 1000ms range, which triggers BFD timeouts causing iBGP reconvergence events and overall horrible performance. Before we open up a discussion with these (excellent!) L2 providers, we should first establish if it’s not more likely that Coloclue\u0026rsquo;s router hardware and/or software should be improved instead.\n","permalink":"https://ipng.ch/s/articles/2021/02/27/loadtesting-at-coloclue/","section":"articles","title":"Loadtesting at Coloclue"},{"contents":"Historical context - todo, but notes for now\nstarted with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN. 
attracted attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48 My Brilliant Idea Of The Day \u0026ndash; encode AS number in leetspeak: ::AS01:2859:1, because who would\u0026rsquo;ve thought we would ever run out of 16 bit AS numbers :)\nIPng rearchitected to SixXS, and became a very large scale deployment of IPv6 tunnelbroker; our main central provisioning system moved around a few times between ISPs (Intouch, Concepts ICT, BIT, IP Man) Needed eventually a NOC and servers to operate it that were provider independent, which is where our PI space came from (and is still used) High Availability with paphosting of sixxs.net and other sites Moved to IP-Max in 2014 (and still best of friends with that crew!) In 2019, Fred said \u0026ldquo;hey why don\u0026rsquo;t you get an AS number and announce your /24 PI yourself, that\u0026rsquo;ll be fun!\u0026rdquo; I didn\u0026rsquo;t want to at first, because \u0026ldquo;it is a lot of work to do it properly\u0026rdquo;.\nIn 2020, I got to know Openfactory who are a local ISP (with an office in the town I live) and offer services on the local FTTH network; so I got a gigabit with them And that\u0026rsquo;s when I made the plunge, got AS50869, started announcing my own PI space, built up a few routers, and the rest is \u0026hellip; history :)\n","date":"2021-02-26","desc":"Historical context - todo, but notes for now\nstarted with stack.nl (when it was still stack.urc.tue.nl), 6bone and watching NASA multicast video in 1997. founded ipng.nl project, first IPv6 in NL that was usable outside of NREN. attracted attention of the first few IPv6 participants in Amsterdam, organized the AIAD - AMS-IX IPv6 Awareness Day launched IPv6 at AMS-IX, first IXP prefix allocated 2001:768:1::/48 My Brilliant Idea Of The Day \u0026ndash; encode AS number in leetspeak: ::AS01:2859:1, because who would\u0026rsquo;ve thought we would ever run out of 16 bit AS numbers :)\n","permalink":"https://ipng.ch/s/articles/2021/02/26/ipng-history/","section":"articles","title":"IPng History"},{"contents":" Author: Pim van Pelt, Jeroen Massar Contact: \u0026lt;staff@sixxs.net\u0026gt; Date: March 2017 Status: Draft - Review SixXS - Review Admins - Final - Published Summary SixXS will be sunset in H1 2017. All services will be turned down on 2017-06-06, after which the SixXS project will be retired. Users will no longer be able to use their IPv6 tunnels or subnets after this date, and are required to obtain IPv6 connectivity elsewhere, primarily with their Internet service provider.\nIntroduction SixXS (Six Access) is a free, non-profit, non-cost service for Local Internet Registries (LIR\u0026rsquo;s) and endusers. The main goal is to create a common portal to help company engineers find their way with IPv6 networks deploying IPv6 to their customers in a rapid and controllable fashion. 
To reach our goals, SixXS provides the following services:\nIPv6 Tunnel Broker: a versatile and high performance IPv6 tunneling router Ghost Route Hunter: an IPv6 route monitoring tool and various other services to help out where needed IPv6Gate HTTP Proxy: IPv6-IPv4 and IPv4-IPv6 Website Gateway SixXS has offered the RIPE, ARIN, APNIC, LacNIC and AfriNIC communities pre-production deployment expertise based on the experience gathered while running the IPng IPv6 tunnel broker since 1999 and, combined with its successor, SixXS, gaining more than 18 years of valuable IPv6 experience.\nUserbase As of March 2017, there are 38’393 7-day active users spanning 140 countries. These users configured a total of 44’673 tunnels spanning 118 countries, and 12’632 subnet delegations (28.28%). Our peak 7DA usage was over 50’000 users. Full statistics, including distributions by country, can be found on the SixXS website [link].\nGrowth User engagement over time (as shown in the graphs below), shows strong growth from 2001-2011, followed by a stagnation and then decrease of new users and subnets leading through 2016. We believe this is due to saturation, all users with the ability and desire to receive service, obtained an account, a tunnel and a subnet.\n| Images: Left - New users per year; Right - New tunnels per year.\nAnother way to visualize this data is to measure the cumulative requested subnets (which are /48 in size). The requests for subnets naturally follows the growth of users. In recent years (2014 onwards), requests for additional subnets were clearly tapering off - this is in line with our goal of SixXS. Therefore, in December of 2015, new user signups were suspended. Note: this explains the flatline of requests in the years 2016 and 2017.\nImage: Cumulative requests over time.\nTraffic As the project evolved, traffic initially grew significantly, with a daily average of just around 900Mbit/sec. We note the trend of traffic is on a downwards trajectory since H2 2015. We believe this is in part due to ISPs starting to offer IPv6, which yields organic attrition of users migrating away. This trend is in line with the goals of the project.\nUsers engage with SixXS passively - once they set up their tunnel and configure their router or computer, the system is largely zero-touch. Traffic is consistently diurnal with a 7:1 ratio between peak and trough, which indicates that the traffic flowing through the system is roughly equivalent to what access providers see. This has changed over time - in the early days of IPv6, the major use case was NNTP and IRC, now general Internet usage with the larger content providers all support IPv6.\nComparing our traffic pattern to a well known Internet exchange point [amsix], we see similar changes. Today, many of the larger content providers have offerings on IPv6 which shifts the usage of our service also more towards HTTP (and, to a diurnal pattern).\nThe following two graphs illustrate the usage: The first graph is average traffic in bits/sec between 2012 and 2017. Second graph is average traffic in bits/sec between 2017-02-24 and 2017-03-03.\n| Images: Left - traffic since 2012. Right - weekly traffic pattern Feb 2017.\nFootprint Looking at the total footprint of SixXS - In March 2017, 46 PoPs spread over 29 countries were offered by 40 unique Internet service providers. 
Over the lifetime of SixXS, a total of 65 different PoPs have been active.\nImage: Map of SixXS PoPs\nThe full list of countries hosting PoPs: Australia, Belgium, Brazil, Czech Republic, Denmark, Estonia, Faroe Islands, Finland, France, Germany, Greece, Hungary, Ireland, Italy, Luxembourg, Netherlands, New Caledonia, New Zealand, Norway, Poland, Portugal, Russia, Slovenia, Sweden, Switzerland, United Kingdom, United States and Vietnam.\nOur deployment comprises the following subnet allocations: 85x /40 which is equivalent to a total of 21’760 /48\u0026rsquo;s or 1x /34 + 1x /36 + 1x /38 + 1x /40 \u0026ndash; to cover our usage fully would require an IPv6 /33. We believe the SixXS footprint is one of the largest IPv6 deployments in the world, and are incredibly proud of the accomplishments and contributions we have made to the community.\nRationale The mission of SixXS is to help prepare companies and individual users for the world of IPv6. Started in 1999, our goals were to build a distributed tunnelbroker system, which allowed professional and hobbyist users to learn how to operate IPv6 networks so that they would roll out IPv6 natively across the Internet. Back in 1999, we wrote: Our ultimate target is to conclude the tunnel brokering service when all the end users can get Native IPv6 directly from their own Internet provider.\nFor a decade, the industry was divided into content providers and access providers engaged a chicken and egg game:\nContent providers claimed that investing in IPv6 rollout would be useless because there were not sufficient numbers of large ISPs which offered it. Access providers claimed that investing in IPv6 would be useless because there were not sufficient numbers of large content providers which offered it. Both content providers and access providers claimed their customers didn’t demand it and there was no business justification in doing so. One raison d’être of SixXS is to help break this cycle by offering IPv6 connectivity to users, so that they in turn can demand that content providers offer service over IPv6, and to companies, so their engineers can learn the intricacies of rolling out IPv6 for their business safely. After 18 years, here is where we stand:\nContent providers, largely, have switched to IPv6. Examples: Wikipedia, Google, Youtube, Facebook, Akamai, Netflix, Microsoft, Yahoo.\nAccess providers are starting to move on IPv6 deployments: 18% of the Internet has IPv6 connectivity, roughly doubling year over year.\nThe access providers generally still claim that there is not sufficient customer demand to invest in IPv6.\nToday, SixXS plays an insignificant role in converting the opinion on (1), (2) and (3) and is more recently (2016 and beyond) being quoted by several large access providers as a compelling alternative for their few customers who asked for IPv6, along the worrying lines of “SixXS and Hurricane Electric offer tunnels, so we are not planning to provide native IPv6 at this time”.\nA call to action in 2016, asking our users to call their ISP and ask about rollout plans, yielded some reasonable engagement from SixXS users (for which we are incredibly grateful), but arguably disappointing results from the ISPs, particularly the very large ones. 
Extensive data can be found on the SixXS wiki page [link].\nConclusion Building up to our conclusion, we make some critical observations:\nSixXS penetration has hit a point of diminishing returns (see the ‘Growth’ subsection of this document).\nContent providers have shown great progress in enabling users to reach their websites via IPv6, in our opinion formally breaking the chicken and egg problem.\nAccess providers have shown reasonable interest in providing IPv6 to users, but some have started to quote SixXS as a reason they do not have to show an interest.\nConsumers should not have to be involved in the discussion as they largely need not know, or care, how the Internet works, as long as they can reach the Internet resources they want, when they want them.\nOur conclusion is that SixXS is no longer able to contribute to the solution, and is hampering its own goals of facilitating the migration of consumers to native IPv6. We have therefore decided to shut down our services on 2017-06-06.\nAccomplishments When we started in 1999, we set ourselves some pretty ambitious goals. They are described on our website as value propositions to ISPs and to endusers, and we explain in detail what targets we want our project to achieve in order to help the technical community.\nTo the latter point (why do this?), we have by far exceeded our target of creating 10 regional PoPs (we created 65, each using their own IPv6 address space); we developed and rolled out a provisioning system that manipulated tunnels and subnets w/o human intervention; gathered a wide array of statistics (traffic, latency, uptime, debugging); set up Multicast, DNS and DNSSEC delegations; and allowed for a feature rich web- and console interface to manage the system.\nWe felt that having zero-touch PoP servers would likely yield a lower probability of human error causing outages, and we were right: in 18 years of operation we cannot remember any wide scale outages caused by human error.\nHowever, we did not stop at writing solid automation. The following are notable contributions that SixXS has made to IPv6 deployment and the Internet in general:\nPoPs on five major continents, missing Africa and Antarctica sixxsd - software based router with high performance tunneling support Heartbeat protocol (IANA port 3740) and IETF draft AYIYA protocol (IANA port 5072) and IETF draft Community outreach (example RIPE, IETF, ISOC, AMS-IX IPv6 Awareness Day) Research done with GRH (Ghost Route Hunter) to: Help eradicate global IPv6 routing issues (Ghost Routes) IPv6 Bogon Route detection Distributed Looking Glass and Traceroute AICCU (Automatic IPv6 Connectivity Client Utility) Automatic setup of IPv6 connectivity providing only username + password. Awards of Excellence in the Implementation Award Category in the IPv6 Application Contest 2004 Incorporated into commercial Draytek, ZyXel, Motorola CPE products Heartbeat \u0026amp; TIC support out of the box in AVM Fritz!Box Very few outages over 18 years of operation Pre-production access to Google and Wikipedia IPv6 servers through SixXS DNS recursors Over time, social and technical media picked up on SixXS activities and regularly acknowledged the significance of the project. For example at Heise, Linux Journal, and in other places.\nAcknowledgements As SixXS founders, we were not operating in a vacuum. We have had countless interactions with institutions and corporations, many mentors, and like-minded engineers across the world. 
We wanted to take this opportunity to call out these formative folks:\nTU/e and MCGV Stack (the genesis of IPng in 1997) Intouch (the continuation of IPng and first sponsor of SixXS in 1999) Concepts ICT (long time sponsor of SixXS, including our NOC hardware) HEAnet (long time sponsor of SixXS since 2002) BIT (long time IPv6 advocate, and long time sponsor of SixXS since 2004) AMS IX (for organizing the AMS-IX IPv6 Awareness Day in Oct 2002) ISOC and Steve Deering (for inventing IPv6, and meeting us in Amsterdam) IPv6 Flag Day effort (for addressing the chicken-and-egg problem) IPv6 Ops (for an engineering focused mailing list) Heise (for their kind attention over the years) And lastly, we extend our gratitude to the men and women who professionally operate the network, those who arranged the physical or virtual hardware, and those who are in a position to commit to running all 65 SixXS PoPs!\nFAQ Will you reconsider your decision?\nA lot of thought has gone into this decision. While we do understand that the service SixXS provides is very valuable to its users, we have seen the growth of IPv6 content providers as well as IPv6 access providers is exponentially growing. We are of the belief that IPv6 Tunnel Brokers are no longer facilitating access providers moving to IPv6, and as such do not wish for the project to be continued. We will not reconsider our decision.\nWill you hand over the project to other folks?\nWe are fairly protective of our brand and position in the community. Due to the nature of SixXS, which rests on an open source client (aiccu) with a closed source server (sixxsd), we are not willing to hand over the project. However, that aside, the main justification for our decision as outlined in this document, is that we are of the belief that IPv6 Tunnel Brokers are no longer facilitating access providers moving to IPv6, and as such do not wish for the project to be continued. Handing it over to other folks will not allow us to satisfy our concerns.\nDo you need help?\nRunning SixXS servers and infrastructure is very efficient and does not demand much time. The PoP servers are stateless and can be brought up based on a Debian or Ubuntu base install in a matter of minutes. Handling user questions is stressful at times due to the volume of requests, but overall it’s manageable because most of our user interaction is self-service on the website. For operating SixXS, we do not believe time is a major concern.\nThat said, help can be more productively offered in the area of IPv6 deployment in Internet content providers and access providers. If you are active in those areas, we would greatly appreciate it if you would champion with your leadership and engineering teams to roll out IPv6 for your users!\nWhat happens to the users?\nWe will mark all the tunnels and subnets as deleted on 2017-06-06 at which point they will stop forwarding traffic. Users will not have access to IPv6 through SixXS anymore on that date. We will return all resources to the Internet service providers, shut down the PoPs and delete all personally identifiable data from our database.\nWhat happens to the servers (PoPs)?\nThe servers will be shut down and returned to the ISPs who own them, as SixXS itself does not own the PoP servers. 
The SixXS website will continue to run on private servers, mostly serving as a tombstone documenting our efforts over the years.\nWhat happens to the PII data you have?\nIn the lifetime of the project, we have taken privacy and personally identifiable information very seriously, and to our knowledge have never suffered an information breach or leak. After decommissioning the services we run, we will destroy all PII data, keeping only traffic statistics at the PoP level, and general statistics of our usage, like the graphs seen in this document.\nWhat happens to WHOIS entries in RIPE and APNIC databases?\nAs we mark the tunnels and subnets as deleted, our automation will automatically purge these records (inet6num) from the RIPE and APNIC databases. We will also return the subnets SixXS operates on behalf of the PoP owners (these /40 supernets will be returned to the LIRs).\nFor user (person) records, see FAQ entry (Do I have to delete my RIR handle).\nWhat is your timeline?\nOur timeline for the sunset of our services started in early March 2017, traversing a dialog with PoP administrators in April, notifying users at the end of April, and offering 6 weeks of time for folks to find alternative solutions.\nOn 2017-06-06 we will shut down services, and close the sunsetting project on 2017-07-01.\nWhat are my alternatives?\nUsers are encouraged to call their ISP for IPv6 connectivity \u0026ndash; ultimately that is the best way forward. Provided sufficient numbers of paying customers request IPv6 service, ISPs may be compelled to invest in offering service.\nThere is sometimes healthy competition. ISPs are critically interested in retaining users. If a specific ISP offers IPv6 to users \u0026ndash; consider switching providers to obtain native connectivity and making note of this with the old ISP.\nIn cases where this proves infeasible, there are myriad other IPv6 Tunnel Brokers available.\nWill you open source SixXS code?\nWe do not currently have plans to open source any code that is not already publicly available. Although our provisioning servers and routing daemon (sixxsd) are very well thought out, they do have some intricate dependencies on how we built SixXS. As such, offering the code base will not necessarily be useful for others. Over time, given effort on Jeroen’s part, this may change. We cannot make any promises at this point.\nCan I still use www.sixxs.net as a connectivity checker (smokeping et al)?\nWe are aware (simply by looking at access logs) that many users have pointed their IPv6 network monitoring at our website. Among these users are some quite large companies as well. You can rest assured that we will keep www.sixxs.net running on highly available, distributed servers on an ongoing basis. It is likely that the website will become static, making it somewhat more reliable than it already is.\nWhat happens to the SixXS domains (sixxs.{com,net,org,\u0026hellip;})?\nWhile we are retiring our services, the website will remain up. All domains will remain delegated to the current nameservers, although only www and MX will remain. In particular, the PoP hosts and tunnel hostnames like cl-x.pop-yy.tld.sixxs.net will be removed.\nI am hosting my nameserver, mailserver, etc. on my tunnel, what happens?\nIt is quite common and entirely acceptable for folks to point NS or MX records towards their tunnel name (cl-x.pop-yy.tld.sixxs.net). Hosting servers with content behind the tunnels is also common. 
Considering the DNS entries for sixxs.net will cease to exist, and the tunnels will be decommissioned, users are recommended to move their services to other IPv6 providers (either colocated, natively routed to home or office connections, or behind another IPv6 Tunnel Broker if no other options exist).\nDo I have to delete my RIR handle (RIPE/ARIN/APNIC/LacNIC/AfriNIC)?\nIf you are using the handle for other business, obviously you should keep them. If SixXS was the sole purpose of registering such a handle (we have required them for many years), we recommend removing your handle from the relevant RIR database. The call is ultimately yours, as SixXS does not own that data.\nWhat happens with the ULA registry?\nThis registry was for educational purposes, and has no official status. Importantly, as ULA is random per definition, the chance of collisions is extremely low. It is fairly straight forward to get a prefix from one of the RIRs. While that does cost some money, it is an activity that is not the main charter of SixXS and will be expected to continue with the RIRs.\nTimeline 2017-03-01: T-14wk Decision made by Jeroen and Pim to start the sunset project.\n2017-03-06: T-13wk Communicate to PoP admins (e-mail, referencing intent and problem statement)\n2017-03-13: T-12wk Communicate to PoP admins (e-mail, details (this doc))\n2017-03-20: T-11wk Wrap up feedback from PoP admins, prepare publication to the users. Initial backup of SixXS PoPs completed.\n2017-03-23: T-11wk Publish to SixXS website. One-time mail to all users, noting the sunset date and pointing to rationale and FAQ.\n2017-03-29: T-10wk Communicate and field response from IPv6 communities, social media, et al.\n2017-04-17: T-7wk Due date for converting the SixXS website to a static replica, for posterity.\n2017-05-29: T-1wk Convert info@sixxs.net to an autoresponder pointing at rationale+FAQ.\n2017-06-06: (Tuesday) Turn off TIC and SixXSd on PoPs, retire IPv6Gate, shut down whois server.\n2017-06-12: T+1wk: Secondary backup of SixXS PoPs completed.\n2017-06-19: T+2wk: Power off SixXS PoPs, IPv6Gate, return resources\n2017-06-26: T+3wk: Destroy PII data (mysql database).\n2017-07-01: Close out the sunset project.\n","date":"2017-03-14","desc":" Author: Pim van Pelt, Jeroen Massar Contact: \u0026lt;staff@sixxs.net\u0026gt; Date: March 2017 Status: Draft - Review SixXS - Review Admins - Final - Published Summary SixXS will be sunset in H1 2017. All services will be turned down on 2017-06-06, after which the SixXS project will be retired. Users will no longer be able to use their IPv6 tunnels or subnets after this date, and are required to obtain IPv6 connectivity elsewhere, primarily with their Internet service provider.\n","permalink":"https://ipng.ch/s/articles/2017/03/14/sunsetting-sixxs/","section":"articles","title":"Sunsetting SixXS"},{"contents":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewed: Fredy Kuenzler \u0026lt;kuenzler@init7.net\u0026gt; Status: Draft - Review - Approved Introduction In a pilot of the Fiber7 product on the LiteXchange platform, the author took service to vet the product stability and quality. 
The pilot ran from 2016-09-25 to 2016-10-12, in which the Fiber7 connection was used exclusively by the author in their home internet connection, both for IPv4/IPv6 service as well as IP Television (via Init7) and IP Telephony (via a third party provider).\nExecutive Summary Fiber7 via ‘direct connect’ on the LiteXchange platform works as expected and very satisfactory, including native IPv6, which was made available for this pilot. Throughput, latency and jitter are superior due to the direct fiber connection, and significantly better than existing connections, exceeding expectations compared to competing FTTH offerings that use customer premise equipment. IPTV worked correctly with multiple STB devices.\nDetailed findings Architecture The author currently has a subscription via EasyZone, an Init7 subsidiary, with gigabit ethernet symmetric connectivity. The EasyZone product delivers service via the LiteXchange platform, which is an L2 broker, offering end users a choice of multiple internet providers [site]. An ONT is supplied, an ISP managed device that takes the fiber connection and exposes service via one of four gigabit ethernet copper ports. The Fiber7 product delivers via LiteXchange a direct fiber connection without the ONT. Fiber7 offers a range of termination options, including plugging the fiber into a provided CPE (AVM Fritz!Box [vendor] or alternatively MikroTik RB2011UiAS [vendor]), a media converter (TP-Link [vendor]), or simply an SFP (Flexoptix [vendor]) to use in customer provided switch/router infrastructure.\nArchitecture (details) For this pilot, we chose a barebones connection type, consisting of a bidirectional SFP (Flexoptix [vendor]) directly terminating the FTTH connection from OTO position 1 into our own managed switch (Unifi US-24-500W [vendor]). The L3 routers used are a pair of PC routers (PC Engines APU2 [vendor]), running Linux. They are configured in CARP failover on egress (to Fiber7) and ingress (to local network).\nConfiguring IPv4 address on egress interface is done via DHCP - initially, DHCPv6 was not active on LiteXchange, so a local tunnelbroker (SixXS, hosted at Init7) was used. Within one week, the engineers at Init7 informed me that DHCPv6 was ready, and it worked spotlessly after configuring it to request an NA and a /48 PD, and bumping accept_ra=2 on the egress interface (note: this allows forwarding while at the same time accepting router advertisements).\nAdditional details of the L3 connection:\nThe routers operate an L2VPN to a third party provider (IP-Max, AS25091) which routes 194.1.163.32/27 via eBGP using GRE. The MSS on this tunnel is clamped to 1436 (from 1460) to allow for encapsulating IPv4 and GRE. AS13030 and AS25091 meet at CIXP in Geneva, with a round trip time of 4.2ms. The routers operate an IPv6 tunnel to a common tunnel provider (SixXS, AS13030), which routes 2001:1620:fb6::/48 via AICCU using SIT to the active router. The MTU is set to 1440 bytes to allow encapsulating IPv6 in IPv4. Note that the Fiber7 connection via LiteXchange provides native IPv6 as well, so this tunnel is used only via a secondary IPv4 uplink. The routers operate native IPv6 \u0026ndash; with DHCPv6, a /128 address and a /48 delegated prefix are obtained. This prefix is stable due to the use of DUID client identification. The default gateway is obtained via RS/RA. For IPv6, reversed DNS delegation for fixed DUID/PD delegation is provided. It is worth pointing out the very low technical entry barrier to both IPv4 and IPv6. 
The termination is principally plug and play. An end user can use standard issue DHCP for IPv4 and RA/RS for IPv6. DHCPv6 is not widely used - but similarly the /48 prefix acquisition is hasslefree.\nFailover between the routers is managed by a script that swaps the CARP [source] master to the standby PC router (automatically in case of CARP heartbeat timeouts; or manually in case of maintenance), ensuring the L2VPN, DHCP client, and IPv6 tunnels are running on the active machine.\nPolicy based routing [source] is used to separate Fiber7/SixXS and L2VPN/IP-Max routing domains. Routing tables are maintained with a popular open source routing platform called BiRD [source], OSPF between the PC routers, and eBGP with the third party provider.\nIP Television In this pilot the author was sent an IPTV device (Amino Aminet A140 [vendor]), which operates with IPv4. The device acquires video streams using IPv4 multicast. Setting this up was straightforward, using an IGMP Proxy [github] also used in commercial CPEs. The IGMP Proxy was configured on the PC routers.\nWith two such Amino IPTV devices, tuning in to SRF1 and SRF2 (both HD channels), a stream of UDP from multicast servers within the Init7 network was started. At the time of writing, SRF1 is on multicast address 239.44.0.77 port 5000; SRF2 is on multicast address 239.44.0.78 port 5000; both coming from source 109.202.223.18 port 5000. Average bandwidth was 13.0Mbit/s with a peak of 17.1Mbit/s per HD stream, and 4.2Mbit/s with a peak of 5.3Mbit/s per SD stream.\nMultiple Amino IPTV devices in multiple backend VLANs can be used at the same time:\n$ ip mroute | grep 239.44.0 (109.202.223.18, 239.44.0.77) Iif: eth0.9 Oifs: eth0 (109.202.223.18, 239.44.0.78) Iif: eth0.9 Oifs: eth0.2 A list of channels available on the EasyZone IPTV provider (a subsidiary of Init7) can be found on their website [source].\nNetflix: IPv6 Worth noting during the pilot is that Netflix, a popular online television streaming service [website], was served from within the Init7 network as well. Connections were observed from host netflix-cache-1.init7.net (AS13030) via IPv6, which is impressive.\nUHD (4K) streaming is also available with Netflix - the device used to test this (Samsung JU7080 Series 7 [vendor]) has a native client but it does not support IPv6, as such the traffic was observed from host ipv4_1.cxl0.c117.ams001.ix.nflxvideo.net in AS2906 located in the Netherlands.\nIn both cases (local within Init7 and remote to AS2906), Netflix streaming was free of interruptions and great quality.\nTest Results Throughput A throughput test was started on September 27, lasting 12 hours, from the active PC router to a machine in the Init7 network [caveat]:\n$ traceroute to chzrh02.sixxs.net (213.144.148.74), 30 hops max, 60 byte packets 1 77.109.172.1.easyzone.ch (77.109.172.1) 0.755 ms 0.813 ms 0.803 ms 2 r1zrh2.core.init7.net (77.109.183.61) 0.379 ms 0.373 ms 0.377 ms 3 r1zrh1.core.init7.net (77.109.128.241) 0.477 ms 0.429 ms 0.397 ms 4 r1zlz1.core.init7.net (77.109.128.210) 8.810 ms 8.783 ms 8.738 ms 5 chzrh02.sixxs.net (213.144.148.74) 0.545 ms 0.490 ms 0.469 ms Using a popular network bandwidth tool (iperf [source]), IPv4 bandwidth was measured for 10 minutes each, both upstream (from the PC router to a machine in the init7 network: 891Mbit), and downstream (from the init7 machine to the PC router: 895Mbit). 
In IPv6, the results were similar (771Mbit upstream, and 831Mbit downstream).\nA standard internet test was performed (Speedtest.net, using Init7) [link; results], yielding 925Mbit downstream and 893Mbit upstream. In addition to the direct link, the author’s L2VPN connection to a third party provider was tested (Speedtest.net, using Init7) [link; results], yielding 609Mbit downstream and 578Mbit upstream. The L2VPN throughput regression is explained by tunneling en/decapsulation.\nLatency Latency to Google was tested \u0026ndash; Init7 AS13030 and Google AS15169 meet in Zurich, with very low latency. IPv6 was tested twice (once via a SixXS tunnelbroker tunnel, and once natively when it was available). Tunneled IPv6 reports slightly elevated latency due to tunneling to an on-net IPv6 tunnelbroker [caveat]. Native IPv6 reports equivalent latency to IPv4.\nIPv4 google.com ping statistics: 10 packets transmitted, 10 received, 0% packet loss, time 9002ms rtt min/avg/max/mdev = 0.566/0.579/0.594/0.025 ms Native IPv6 google.com ping6 statistics: 10 packets transmitted, 10 received, 0% packet loss, time 9015ms rtt min/avg/max/mdev = 0.705/0.771/0.828/0.043 ms Tunneled IPv6 google.com ping6 statistics: 10 packets transmitted, 10 received, 0% packet loss, time 9011ms rtt min/avg/max/mdev = 1.154/1.451/2.206/0.276 ms Caveats IPv6 was initially not natively available on this connection. IPv6 was tunneled via chzrh02.sixxs.net (on-net at AS13030). The IPv6 server endpoint runs on a virtualized platform, with slightly lower throughput than a bare-metal host. Shortly thereafter, native IPv6 was configured on the Fiber7 product via the LiteXchange platform.\nEach OTO delivered by the city of Wangen-Brüttisellen [site] holds four simplex single mode fibers. The first position of the OTO is typically used to connect the ONT and subsequently the enduser internet connection (in the author’s case an EasyZone connection). The other three positions on the OTO are reserved for future use. For some reason unknown to the author, the Fiber7 connection was installed on a second OTO, again with four simplex single mode fibers. The first position of the second OTO was used to provide the Fiber7 internet connection.\nAppendix Appendix 1 - Terminology Term Description ONT optical network terminal - The ONT converts fiber-optic light signals to copper-based electrical signals, usually Ethernet. OTO optical telecommunication outlet - The OTO is a fiber optic outlet that allows easy termination of cables in an office and home environment. Installed OTOs are referred to by their OTO-ID. CARP common address redundancy protocol - Its purpose is to allow multiple hosts on the same network segment to share an IP address. CARP is a secure, free alternative to the Virtual Router Redundancy Protocol (VRRP) and the Hot Standby Router Protocol (HSRP). SIT simple internet transition - Its purpose is to interconnect isolated IPv6 networks, located in the global IPv4 Internet, via tunnels. STB set top box - a device that enables a television set to become a user interface to the Internet and also enables a television set to receive and decode digital television (DTV) broadcasts. GRE generic routing encapsulation - a tunneling protocol developed by Cisco Systems that can encapsulate a wide variety of network layer protocols inside virtual point-to-point links over an Internet Protocol network. 
L2VPN layer2 virtual private network - a service that emulates a switched Ethernet (V)LAN across a pseudo-wire (typically an IP tunnel) DHCP dynamic host configuration protocol - an IPv4 network protocol that enables a server to automatically assign an IP address to a computer from a defined range of numbers. DHCP6 Dynamic host configuration protocol: prefix delegation - an IPv6 network protocol that enables a server to automatically assign network prefixes to a customer from a defined range of numbers. NDP NS/NA neighbor discovery protocol: neighbor solicitation / advertisement - an ipv6 specific protocol to discover and judge reachability of other nodes on a shared link. NDP RS/RA neighbor discovery protocol: router solicitation / advertisement - an ipv6 specific protocol to discover and install local address and gateway information. Appendix 2 - Supporting data Bandwidth with Speedtest Directly on Fiber7: speedtest\nGRE via IP-Max: speedtest\nBandwidth with Iperf upstream (AS13030 IPv4) $ iperf -t 600 -P 4 -i 60 -l 1M -m -c chzrh02.sixxs.net ------------------------------------------------------------ Client connecting to chzrh02.sixxs.net, TCP port 5001 TCP window size: 85.0 KByte (default) ------------------------------------------------------------ [ 3] local 77.109.173.198 port 41199 connected with 213.144.148.74 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-60.0 sec 6.23 GBytes 892 Mbits/sec [ 3] 60.0-120.0 sec 6.21 GBytes 889 Mbits/sec [ 3] 120.0-180.0 sec 6.22 GBytes 891 Mbits/sec [ 3] 180.0-240.0 sec 6.25 GBytes 894 Mbits/sec [ 3] 240.0-300.0 sec 6.25 GBytes 894 Mbits/sec [ 3] 300.0-360.0 sec 6.23 GBytes 892 Mbits/sec [ 3] 360.0-420.0 sec 6.22 GBytes 890 Mbits/sec [ 3] 420.0-480.0 sec 6.20 GBytes 888 Mbits/sec [ 3] 480.0-540.0 sec 6.21 GBytes 889 Mbits/sec [ 3] 540.0-600.0 sec 6.18 GBytes 885 Mbits/sec [ 3] 0.0-600.0 sec 62.2 GBytes 891 Mbits/sec [ 3] MSS size 1448 bytes (MTU 1500 bytes, ethernet) (AS25091 IPv6) $ iperf -V -t 600 -P 4 -i 60 -l 1M -m -c charb02.paphosting.net ------------------------------------------------------------ Client connecting to charb02.paphosting.net, TCP port 5001 TCP window size: 85.0 KByte (default) ------------------------------------------------------------ [ 3] local 2a02:168:2000:4b:469:a025:5293:84ad port 45044 connected with 2a02:2528:503:1::83 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-60.0 sec 5.22 GBytes 748 Mbits/sec [ 3] 60.0-120.0 sec 5.52 GBytes 791 Mbits/sec [ 3] 120.0-180.0 sec 5.67 GBytes 811 Mbits/sec [ 3] 180.0-240.0 sec 4.86 GBytes 696 Mbits/sec [ 3] 240.0-300.0 sec 4.85 GBytes 695 Mbits/sec [ 3] 300.0-360.0 sec 5.44 GBytes 779 Mbits/sec [ 3] 360.0-420.0 sec 5.97 GBytes 855 Mbits/sec [ 3] 420.0-480.0 sec 5.54 GBytes 792 Mbits/sec [ 3] 480.0-540.0 sec 5.17 GBytes 739 Mbits/sec [ 3] 540.0-600.0 sec 5.63 GBytes 806 Mbits/sec [ 3] 0.0-600.0 sec 53.9 GBytes 771 Mbits/sec [ 3] MSS size 1428 bytes (MTU 1500 bytes, ethernet) Bandwidth with Iperf downstream (AS13030 IPv4) $ iperf -t 600 -P 4 -i 60 -l 1M -m -c 77.109.173.198 ------------------------------------------------------------ Client connecting to 77.109.173.198, TCP port 5001 TCP window size: 85.0 KByte (default) ------------------------------------------------------------ [ 3] local 213.144.148.74 port 56642 connected with 77.109.173.198 port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-60.0 sec 6.22 GBytes 891 Mbits/sec [ 3] 60.0-120.0 sec 6.25 GBytes 895 Mbits/sec [ 3] 120.0-180.0 sec 6.24 GBytes 894 Mbits/sec [ 3] 180.0-240.0 sec 6.23 GBytes 891 Mbits/sec [ 
3] 240.0-300.0 sec 6.21 GBytes 889 Mbits/sec [ 3] 300.0-360.0 sec 6.23 GBytes 892 Mbits/sec [ 3] 360.0-420.0 sec 6.27 GBytes 898 Mbits/sec [ 3] 420.0-480.0 sec 6.25 GBytes 895 Mbits/sec [ 3] 480.0-540.0 sec 6.27 GBytes 897 Mbits/sec [ 3] 540.0-600.0 sec 6.26 GBytes 896 Mbits/sec [ 3] 0.0-600.0 sec 62.4 GBytes 894 Mbits/sec [ 3] MSS size 1448 bytes (MTU 1500 bytes, ethernet) (AS25091 IPv6) $ iperf -V -t 600 -P 4 -i 60 -l 1M -m -c 2a02:168:2000:4b:20d:b9ff:fe41:94c ------------------------------------------------------------ Client connecting to 2a02:168:2000:4b:20d:b9ff:fe41:94c, TCP port 5001 TCP window size: 85.0 KByte (default) ------------------------------------------------------------ [ 3] local 2a02:2528:503:1::83 port 43499 connected with 2a02:168:2000:4b:20d:b9ff:fe41:94c port 5001 [ ID] Interval Transfer Bandwidth [ 3] 0.0-60.0 sec 5.68 GBytes 813 Mbits/sec [ 3] 60.0-120.0 sec 5.50 GBytes 787 Mbits/sec [ 3] 120.0-180.0 sec 5.75 GBytes 823 Mbits/sec [ 3] 180.0-240.0 sec 6.06 GBytes 868 Mbits/sec [ 3] 240.0-300.0 sec 5.96 GBytes 853 Mbits/sec [ 3] 300.0-360.0 sec 5.95 GBytes 852 Mbits/sec [ 3] 360.0-420.0 sec 5.99 GBytes 858 Mbits/sec [ 3] 420.0-480.0 sec 5.56 GBytes 796 Mbits/sec [ 3] 480.0-540.0 sec 6.10 GBytes 874 Mbits/sec [ 3] 540.0-600.0 sec 6.21 GBytes 889 Mbits/sec [ 3] 0.0-600.0 sec 58.8 GBytes 841 Mbits/sec [ 3] MSS size 1428 bytes (MTU 1500 bytes, ethernet) Appendix 3 - Configuration files DHCPv6 Configuration Two IPv6 access mechanisms were used. Firstly, IPv6 was acquired via SixXS [site] who are present at Init7. After it was made available (approximately one week into the pilot), standard issue WIDE DHCPv6 client was used with the following configuration file:\n$ cat /etc/wide-dhcpv6/dhcpc.conf interface eth0.9 { # interface VLAN9 - Fiber7 send ia-na 1; send ia-pd 1; script \u0026#34;/etc/wide-dhcpv6/dhcp6c-script\u0026#34;; }; id-assoc pd 1 { prefix ::/48 infinity; prefix-interface lo { sla-id 0; ifid 1; sla-len 16; }; # Test interface prefix-interface eth1 { sla-id 4096; ifid 1; sla-len 16; }; }; id-assoc na 1 { # id-assoc for eth0.9 }; IGMP Proxy Configuration Taking IGMPProxy from github and the following configuration file, IPTV worked reliably throughout the pilot:\n$ cat /etc/igmpproxy.conf ##------------------------------------------------------ ## Enable Quickleave mode (Sends Leave instantly) ##------------------------------------------------------ quickleave ##------------------------------------------------------ ## Configuration for Upstream Interface ##------------------------------------------------------ phyint eth0.9 upstream ratelimit 0 threshold 1 altnet 109.202.223.0/24 altnet 192.168.2.0/23 altnet 239.44.0.0/16 ##------------------------------------------------------ ## Configuration for Downstream Interface ##------------------------------------------------------ phyint eth0 downstream ratelimit 0 threshold 1 phyint eth0.2 downstream ratelimit 0 threshold 1 ##------------------------------------------------------ ## Configuration for Disabled Interface ##------------------------------------------------------ phyint eth0.3 disabled # Guest phyint eth0.4 disabled # IPCam phyint eth0.5 disabled # BIT phyint eth0.6 disabled # IP-Max ","date":"2016-10-13","desc":" Author: Pim van Pelt \u0026lt;pim@ipng.nl\u0026gt; Reviewed: Fredy Kuenzler \u0026lt;kuenzler@init7.net\u0026gt; Status: Draft - Review - Approved Introduction In a pilot of the Fiber7 product on the LiteXchange platform, the author took service to vet the product stability and quality. 
The pilot ran from 2016-09-25 to 2016-10-12, in which the Fiber7 connection was used exclusively by the author in their home internet connection, both for IPv4/IPv6 service as well as IP Television (via Init7) and IP Telephony (via a third party provider).\n","permalink":"https://ipng.ch/s/articles/2016/10/13/fiber7-on-litexchange/","section":"articles","title":"Fiber7 on LiteXchange"}]