VPP and eVPN/VxLAN - Part 3

Posted: 2026-06-13

Introduction

You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I’m the very last person on the planet to learn about something cool. My latest “A-Ha!” moment was when I was configuring the eVPN fabric for [Frys-IX], and I wrote up an article about it [here] back in April.

I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased Lines, and these are straightforward because they typically only have two endpoints. A “regular” VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a look at an article on [L2 Gymnastics] for that. But the real kicker is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And that is a whole other level of awesome.

A small while ago, I wrote about a Bird protocol called [vppevpn], which allows me to synchronize the Bird eVPN and ethernet routing tables into the VPP dataplane by programming VxLAN VTEP endpoints, controlling flooding/learning, and learning/announcing MAC addresses in BGP. The code I wrote turns VPP into an eVPN/VxLAN speaker.

Configuring all of these protocols manually is tablestakes, but it is not very elegant. In this article, I share my work on a VPP eVPN management plane, which handles the lifecycle of eVPN membership, L2 and L3 addresses, and failover between participating nodes.

Problem Statement

The good news is, I now have a working Bird vppevpn protocol that can join a VPP router into an eVPN broadcast domain, by programming VxLAN tunnels and the L2FIB from BGP. Exciting stuff - but everything I showed required hand-crafted Bird configuration snippets, a manually created loopback interface, and IPv4/IPv6 addresses added by hand via vppctl, and what’s worse, these snippets would risk fighting with [vppcfg]. On a test router in a lab, that is fine. On a fleet of a dozen VPP routers, it rapidly becomes an operational tax that I wish upon nobody, not to mention an outage waiting to happen.

The bad news is, there are a few operational problems I feel I should point out:

1. Manual Configuration: Writing Bird snippets by hand means getting the route distinguisher, route targets, VNI, and VTEP address right on every single node. Every. Single. Time. Adding an address to a BVI means SSHing in and running a vppctl command and figuring out how to reconcile that with what’s in vppcfg, because the bridge-domain and VxLAN tunnels that were added by Bird’s vppevpn protocol will naturally not be modeled in the YAML config file.

2. Moving IP Addresses: Moving that address to a different VPP node - say, because a machine needs maintenance, or worse, took the day off and crashed - requires SSH sessions to two hosts in a carefully choreographed order of operations. If I do it wrong, both nodes might briefly announce the same MAC into BGP, momentarily confusing the broadcast domain. Multiply this across dozens of VPP routers and several eVPN broadcast domains, and the ssh-dance becomes a reliability hazard. Nope, I do not want this.

3. Observability: Another obvious thing to me, as a recovering Site Reliability Engineer, is missing visibility. I have no single pane of glass showing which VPP routers are in which broadcast domains, what BVI MAC addresses are active, or which node currently holds the primary gateway role. There are no Prometheus metrics tracking control-plane lifecycle - who joined when, who is healthy, who holds the primary role and who are the standby candidates. Monitoring is an annoyingly manual affair: SSH in, run birdc show proto, vppctl show bridge-domain, and mentally correlate the output across machines. Not having a control- and management plane lifecycle overview is a large problem in production.

4. Self-Healing: And then there is the problem I find most acute and worthwhile to solve for IPng: if a primary VPP gateway becomes unfit to serve at 3am, nothing notices and nothing moves the L3 addresses to a healthy standby, even if one is readily available! Automated failover would be fantastic to have - and I will leave that topic entirely for a dedicated follow-up article, but it is the reason the system I am about to describe is designed the way it is.

My offer: vpp-evpn

Identifying the problem is half the solution. My buddy Brian used to say “So, what’s your offer?”

My offer is vpp-evpn, a control- and management plane that sits on top of VPP and Bird and owns the operational lifecycle of eVPN gateways across a fleet of routers. My goal is simple to state: an operator should never have to SSH into a VPP machine and hand-edit Bird configs or issue vppctl commands to join or leave an eVPN broadcast domain. All of that should happen through a single, uniform user interface.

After noodling for a few days, I come up with a system with three moving parts. Each VPP host runs vpp-evpnd, a small per-host daemon that keeps track of eVPN instances and their associated BVI interfaces on that machine. It writes Bird config snippets into a dedicated include directory so Bird can glob them, talks to VPP’s binary API to manage BVI loopbacks and L3 addresses, and persists its state to disk so that restarting the daemon does not disrupt the dataplane. The vpp-evpnd only concerns itself with what’s on its own host, namely the VPP dataplane and Bird controlplane. It knows nothing about other instances.

That fleet-wide view will live in vpp-evpnr, a central registry that knows which VPP routers are registered, which groups they belong to, what shared virtual MAC and gateway addresses there are for each group, and which node is currently the primary. When an operator wants to move the primary L3 gateway from one node to another, a single gRPC call to vpp-evpnr should orchestrate the move in the correct break-before-make order. First, vpp-evpnr will ask the old primary to demote itself (removing the IP addresses and releasing its hold on the vMAC), then wait for the fabric to absorb the MAC withdrawal, and only then promote a candidate to become the new primary.
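The break-before-make order is easy to get wrong, so here is a minimal Go sketch of it. The Gateway interface, the fakeGateway recorder and the settle delay are all hypothetical names for illustration; the real vpp-evpnr talks gRPC to vpp-evpnd and may wait for the MAC withdrawal in a different way.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// Gateway is a hypothetical stand-in for one vpp-evpnd instance
// as seen from the registry.
type Gateway interface {
	Release(ctx context.Context, group string) error
	Bind(ctx context.Context, group string) error
}

// moveGroupBVI demotes the old primary, waits for the fabric to settle,
// and only then promotes the new primary (break-before-make).
func moveGroupBVI(ctx context.Context, group string, from, to Gateway, settle time.Duration) error {
	if err := from.Release(ctx, group); err != nil {
		return fmt.Errorf("release on old primary: %w", err)
	}
	select {
	case <-time.After(settle): // assumed fixed delay for the MAC withdrawal
	case <-ctx.Done():
		return ctx.Err()
	}
	if err := to.Bind(ctx, group); err != nil {
		return fmt.Errorf("bind on new primary: %w", err)
	}
	return nil
}

// fakeGateway records the calls it receives in a shared slice.
type fakeGateway struct {
	name  string
	calls *[]string
}

func (f fakeGateway) Release(_ context.Context, group string) error {
	*f.calls = append(*f.calls, f.name+" release "+group)
	return nil
}

func (f fakeGateway) Bind(_ context.Context, group string) error {
	*f.calls = append(*f.calls, f.name+" bind "+group)
	return nil
}

// demoMove moves group "test" between two fake gateways and returns the call order.
func demoMove() ([]string, error) {
	var calls []string
	oldPrimary := fakeGateway{name: "dpu0-ddln0", calls: &calls}
	newPrimary := fakeGateway{name: "dpu0-chplo0", calls: &calls}
	err := moveGroupBVI(context.Background(), "test", oldPrimary, newPrimary, 10*time.Millisecond)
	return calls, err
}

func main() {
	calls, err := demoMove()
	fmt.Println(calls, err)
}
```

Note that in this sketch a failed Release aborts the move before anything is bound, which favors having no primary for a moment over having two.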

I already got my hands dirty with my [VPP Maglev] loadbalancer, and found that the combination of gRPC + CLI is dope, so rounding out this design is vpp-evpnc, an interactive CLI that talks exclusively to vpp-evpnr, which in turn can dispatch calls to one or more vpp-evpnd instances to do whatever the operator wants.

Why invent new things when I can also cargocult like a boss?! In this design, vpp-evpn follows the same house style as vpp-maglev - structured JSON logging, gRPC as the only programmatic interface, Prometheus metrics throughout, and a simple message bus that allows components to share state and events with one another. This way, it fits naturally alongside what I already have and I can reuse some of the code I already wrote.

vpp-evpn: Functionality

First, let me talk tablestakes. Each VPP instance needs to be able to join and leave eVPN broadcast domains at the push of a (gRPC) button. That means vpp-evpnd must be able to create a new eVPN instance by writing the appropriate Bird evpn and vppevpn protocol stanzas into a dedicated include directory and telling Bird to reload, which was the main topic of [this article]. Alongside that, it needs to create an Instance BVI, which is a VPP loopback interface with a specific MAC address, plumbed into Linux via the [Linux Control Plane] plugin. This gives the bridge domain an L3 presence, although it doesn’t do much as it will only have a link-local address. When an instance is removed, vpp-evpnd tears down the BVI and the Bird snippets in reverse order. The key point is that none of this requires an operator to touch the machine directly.

The fleet-wide picture is managed by vpp-evpnr. Instances register with the registry and are organized into groups, where a group represents a set of VPP routers all participating in the same eVPN broadcast domain. Each group has a single virtual MAC address (the vMAC) and carries one or more IPv4/IPv6 addresses. One member of each group is the primary, which means that instead of its ephemeral instance MAC (with only link-local), it will bind the vMAC from the shared Group BVI with the full L3 configuration. I call the others candidates, holding only their link-local Instance BVI. The registry is the single source of truth for group membership, the elected primary, and the shared vMAC and its L3 addresses.

All interaction with the registry happens via gRPC. I want to be able to add a new member to a group, change the group MTU, add an IPv4 or IPv6 address to the Group BVI, or move the primary from one node to another. Each of these is a gRPC call to vpp-evpnr, which figures out the necessary instructions and sends RPCs to the relevant vpp-evpnd instances. In practice nobody should be typing raw gRPC calls, so vpp-evpnc is the companion CLI that maps those RPC calls to human-friendly commands.

vpp-evpn: Non-Functional requirements

Clearly, a vpp-evpnd management-plane outage must never stop the controlplane (Bird) or dataplane (VPP). I need to see to it that they continue on their last-programmed state. Similarly, if vpp-evpnr takes the day off, it must not stop vpp-evpnd instances either - they freeze orchestration and run on last-known group state until the registry returns. For my first version though, vpp-evpnr is an accepted single point of failure. I will accept the risk of a coincident registry and dataplane failure, leaving the network without a gateway until vpp-evpnr recovers and everything resynchronizes.

Group BVI moves need to be strictly serialized: at no time can I have two Bird instances announce the same vMAC into the same bridge domain, as it might cause the eVPN fabric to become unstable and dampen the flapping MAC address. Config reloads are atomic - either the whole new config takes effect, or none of it does.

Specifically I worry about observability, and seeing as Clyde there is typing commands into a CLI, which sends RPCs to a registry which in turn might cascade them to multiple instances, every CRUD operation and state change really needs to be observable by emitting a structured JSON log line that can make it all the way back up to the CLI. And while I’m making my little wishlist, I want Prometheus metrics, and streaming gRPC events. Any future Dashboards should use the metrics and event streams, rather than implement tight polling.

Oh, and considering this system will be running on the uplink of my house, it would be kind of nice if it didn’t go down, kthxbai.

Implementation

I want to avoid this system becoming spaghetti code, and to do so I need to set some architectural boundaries. Having well described systems with contracts between them is a good way to stay sane. I spend some time thinking about the user experience, notably the gRPC endpoints and the CLI syntax. In my experience, working backwards from ‘what will the operator actually see’ gives me a clear picture.

vpp-evpn: User Experience

gRPC interface

Early on I make a deliberate design choice: wherever possible, the gRPC API follows CRUD/L semantics - Create, Read, Update, Delete and List. CRUD/L is powerful not because it’s clever or sophisticated, but because it is predictable. Given the name of a noun - say, EvpnInstance or GroupBVI - an operator or script author can immediately guess the shape of the API: CreateEvpnInstance, ListEvpnInstances, GetEvpnInstance, DeleteEvpnInstance and so on. Updates to objects come in the form of explicit setters, like SetGroupMTU or AddGroupBVIAddress.

I find this approach both elegant and forcefully minimalistic - no wonder many complicated systems implement CRUD/L semantics. In a distributed system with multiple daemons and a CLI on top, a uniform user experience makes everything from documentation to CLI tab-completion remarkably straightforward, and what’s best, it forces me to think about the object model ahead of time, which, like the swiss flag, is a big plus.

Starting at the bottom of the stack, vpp-evpnd exposes CRUD/L on two nouns: the EvpnInstance, which governs the Bird config snippet and VPP bridge domain for one eVPN broadcast domain, and the Instance BVI that provides L3 presence in that domain. The setter SetEvpnInstanceEnabled call deserves a note - it maps to birdc enable and birdc disable, allowing an operator to temporarily take a node out of an eVPN without deleting its configuration. Here’s what I come up with:

  rpc CreateEvpnInstance(CreateEvpnInstanceRequest) returns (EvpnInstance);
  rpc ListEvpnInstances(InstanceRef) returns (EvpnInstanceList);
  rpc GetEvpnInstance(EvpnRef) returns (EvpnInstance);
  rpc SetEvpnInstanceEnabled(SetEnabledRequest) returns (EvpnInstance);
  rpc DeleteEvpnInstance(EvpnRef) returns (Empty);

  rpc CreateBVI(CreateBVIRequest) returns (BVI);
  rpc ListBVIs(InstanceRef) returns (BVIList);
  rpc GetBVI(EvpnRef) returns (BVI);
  rpc ReplaceBVI(ReplaceBVIRequest) returns (BVI);
  rpc DeleteBVI(EvpnRef) returns (Empty);

vpp-evpnr’s API is richer and typically a superset of vpp-evpnd’s, because it needs to be able to pass through calls for instance-specific information from the client. At the top are Instances, which register with the registry and can be inspected or removed. Below that are Groups, full CRUD objects that hold membership lists, MTU settings, and L3 addresses. Three special operations at the bottom - BindGroupBVI, ReleaseGroupBVI, and MoveGroupBVI - are the heart of the failover machinery. Bind instructs a member to take on the primary role: change the loopback MAC to the group vMAC, configure the IPv4 and IPv6 addresses, and announce these changes to Bird. Release is the inverse: strip the L3 addresses, revert the MAC to the ephemeral instance MAC, and propagate these changes to Bird. Move simply composes these two steps across two nodes in the correct order, releasing the GroupBVI from all instances before binding it on a new primary. This is meant to guarantee that the eVPN fabric never sees two primaries at once, while the other computers in the eVPN keep seeing a stable vMAC, stable IPv4/IPv6 global addresses and a stable link-local address. For them, the move is meant to be seamless.

  rpc RegisterInstance(RegisterInstanceRequest) returns (RegisterInstanceResponse);
  rpc ListInstances(Empty) returns (InstanceList);
  rpc GetInstance(InstanceRef) returns (InstanceInfo);
  rpc GetInstanceStatus(InstanceRef) returns (InstanceStatus);
  rpc DeleteInstance(InstanceRef) returns (Empty);

  rpc ListGroups(Empty) returns (GroupList);
  rpc GetGroup(GroupRef) returns (Group);
  rpc CreateGroup(CreateGroupRequest) returns (Group);
  rpc DeleteGroup(GroupRef) returns (Empty);
  rpc AddGroupMember(GroupMemberRequest) returns (Group);
  rpc RemoveGroupMember(GroupMemberRequest) returns (Group);
  rpc CreateGroupBVI(GroupBVIRequest) returns (Group);
  rpc DeleteGroupBVI(GroupRef) returns (Group);
  rpc SetGroupMTU(SetGroupMTURequest) returns (Group);
  rpc AddGroupBVIAddress(GroupBVIAddAddressRequest) returns (Group);
  rpc DeleteGroupBVIAddress(GroupBVIDeleteAddressRequest) returns (Group);

  rpc BindGroupBVI(BindGroupBVIRequest) returns (Empty);
  rpc ReleaseGroupBVI(ReleaseGroupBVIRequest) returns (Empty);
  rpc MoveGroupBVI(MoveGroupBVIRequest) returns (Group);

By this point the scope of responsibility has become clear to me. vpp-evpnd manages what is on one machine: the eVPN membership, the Instance BVI, and the local Bird and VPP state. vpp-evpnr manages what the wider fleet should look like: which machines are in which groups, what the group’s gateway identity is, and which machine is the primary. One vpp-evpnr coordinates many vpp-evpnd instances, and no vpp-evpnd needs to know anything about any other - they are peers only in the sense that they share a broadcast domain in the dataplane.

SideQuest - a Golang CLI package

Side Quest time! From the RPC signatures above, it quickly becomes obvious to me how to structure the CLI. Before I get to that, I make an observation: this is the second project within a few months that needs some sort of a gRPC-backed interactive CLI - the other one being [vpp-maglev], which ships maglevc. Rather than copy-pasting that code, Claude and I extract the CLI framework into its own [reusable package]. In merely five commits, the story unfolds without me having to do much. Claude:

1. extracts the generic command-tree CLI library from maglevc, establishing the basic node tree structure, with dynamic nodes that can be resolved by a function, for example by issuing gRPC List calls to a backend. This structure provides tab-completion and ? help syntax.
2. adds a builder pattern and app runner, reducing new command wiring from a bunch of complicated nested tree structures to a handful of lines of boilerplate.
3. adds input Validate helpers, a keypress subpackage for interactive prompts, a fix for the client to run both on Linux and BSD (termios is not as portable as I would’ve expected), and an RFC-style design.md (common for my projects) for the library itself.
4. adds a -json flag opt-in via an App.JSON boolean, so methods that want to / can render JSON output can surface the flag to the author and fail gracefully otherwise.
5. adds a default JSON-model renderer that maps any gRPC proto-JSON response to colored terminal output with bright-white values, dark blue labels, and a paint() method that can colorize stuff in text output mode.

I find the -json flag particularly useful for scripting. Any vpp-evpnc command that returns a gRPC message can, with -json, emit the full proto-JSON to stdout, making it trivial to pipe into jq, a monitoring tool, or another script. The -color flag adds terminal coloring to text output using the paint and label helpers from the renderer, so operators who prefer a monochrome environment can keep it that way.

All told, it takes no more than 20 minutes to refactor the maglevc CLI into its own package, including an [example], and the package is immediately useful for my new evpnc program. Merci, Claude!

I start scribbling down what the command structure will be, and which gRPCs they map to:

Command	gRPC Method
`show version`	Local binary, based on compile-time `LDFLAGS`
`show instance`	`Evpnr.ListInstances`
`show instance <id>`	`Evpnr.GetInstance(id)` passed to `Evpnd.GetInstance`
`show instance <id> evpn`	`Evpnr.ListEvpnInstances(id)` passed to `Evpnd.*`
`show instance <id> evpn <evpn>`	`Evpnr.GetEvpnInstance(id)` passed to `Evpnd.*`
`show instance <id> bvi`	`Evpnr.ListBVIs(id)` passed to `Evpnd.*`
`show instance <id> vpp info`	`Evpnr.GetVPPInfo(id)` passed to `Evpnd.*`
`show instance <id> bird info`	`Evpnr.GetBirdInfo(id)` passed to `Evpnd.*`
…	…

As an example, tab completion derives its dynamic token list from the ListInstances and ListEvpnInstances gRPC calls, so the interactive CLI always offers only currently registered members and their actual eVPN instance names. See below for a demonstration asciinema screencast.
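To illustrate the idea of dynamic completion nodes, here is a small Go sketch of a command tree in which one node gets its tokens from a function. The Node type, Complete function and the hardcoded instance list are hypothetical and much simpler than the real package; in vpp-evpnc the function would issue a List gRPC call instead.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Node is a hypothetical command-tree node. A static node matches Word;
// a dynamic node matches whatever Dynamic returns at completion time.
type Node struct {
	Word     string
	Dynamic  func() []string
	Children []*Node
}

// tokens returns the words this node can currently match.
func (n *Node) tokens() []string {
	if n.Dynamic != nil {
		return n.Dynamic()
	}
	return []string{n.Word}
}

// Complete walks the tree along the fully typed words and returns the
// sorted candidates for the partially typed last word.
func Complete(root *Node, typed []string, partial string) []string {
	cur := root
	for _, w := range typed {
		var next *Node
		for _, c := range cur.Children {
			for _, t := range c.tokens() {
				if t == w {
					next = c
				}
			}
		}
		if next == nil {
			return nil // typed word matches nothing at this level
		}
		cur = next
	}
	var out []string
	for _, c := range cur.Children {
		for _, t := range c.tokens() {
			if strings.HasPrefix(t, partial) {
				out = append(out, t)
			}
		}
	}
	sort.Strings(out)
	return out
}

// demo builds "show instance <id> {evpn,bvi}" and completes one input.
func demo(typed, partial string) string {
	// Stand-in for a ListInstances RPC; the names are from the article.
	listInstances := func() []string { return []string{"dpu0-ddln0", "dpu0-chplo0"} }
	id := &Node{Dynamic: listInstances, Children: []*Node{{Word: "evpn"}, {Word: "bvi"}}}
	instance := &Node{Word: "instance", Children: []*Node{id}}
	root := &Node{Children: []*Node{{Word: "show", Children: []*Node{instance}}}}
	return strings.Join(Complete(root, strings.Fields(typed), partial), ",")
}

func main() {
	fmt.Println(demo("show instance", "dpu0-c"))
	fmt.Println(demo("show instance dpu0-ddln0", ""))
}
```

Because the token list is fetched on every completion, an instance that registered a second ago shows up on the next press of the tab key.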

vpp-evpn - implementation

What follows next is not for everyone. If you don’t write code at all, most of it will come across as an alien language intermixed with English every now and again. And if you do write code, you’re probably better at it than I am, and you’ll scratch your head saying “what was this dude thinking …”. Either way, here I go - met de billen bloot!

1. evpnd - the per-node daemon

The instance-tier model lives in internal/instance/manager.go as a Manager struct. It uses two lock regimes deliberately: a sync.Mutex serializes writes (which may block waiting for a VPP binary API reply), while reads are served lock-free from an atomic.Pointer[readState]. I kind of came to this model the hard way - I found what seems to be a bug in the ip6-nd node of VPP, at least on arm64, and the VPP instance would vanish on me and restart. This made show instances hang, because one of the instances would never respond: a writer was holding m.mu indefinitely while waiting for a VPP reply that never came. I settled on a reply timeout and a read-only view which won’t stall a concurrent show instances query.
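A minimal sketch of the two lock regimes, assuming a much simplified Manager: writers take the mutex and publish an immutable snapshot, readers only load the pointer. The field names and the slowCall parameter (standing in for a blocking VPP API call) are made up for this example.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// readState is an immutable snapshot; it is never modified after publishing.
type readState struct {
	evpns []string
}

// Manager is a simplified stand-in for the instance manager.
type Manager struct {
	mu    sync.Mutex                // serializes writers
	evpns []string                  // authoritative state, guarded by mu
	read  atomic.Pointer[readState] // lock-free view for readers
}

func NewManager() *Manager {
	m := &Manager{}
	m.read.Store(&readState{})
	return m
}

// CreateEvpn mutates state under the mutex and publishes a fresh snapshot.
// slowCall stands in for a VPP binary API call that may block.
func (m *Manager) CreateEvpn(name string, slowCall func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	slowCall()
	m.evpns = append(m.evpns, name)
	m.read.Store(&readState{evpns: append([]string(nil), m.evpns...)})
}

// List never touches the mutex, so it cannot hang behind a stuck writer.
// Callers must treat the returned slice as read-only.
func (m *Manager) List() []string {
	return m.read.Load().evpns
}

// demo lists once while a writer is stuck holding the mutex, and once after.
func demo() (during, after []string) {
	m := NewManager()
	m.CreateEvpn("test", func() {})
	entered, unblock, done := make(chan struct{}), make(chan struct{}), make(chan struct{})
	go func() {
		m.CreateEvpn("prod", func() { close(entered); <-unblock })
		close(done)
	}()
	<-entered         // the writer now holds m.mu and is "waiting on VPP"
	during = m.List() // still answers, from the last published snapshot
	close(unblock)
	<-done
	return during, m.List()
}

func main() {
	during, after := demo()
	fmt.Println(during, after)
}
```

The price of this design is that readers may see state that is one write behind, which for a show command seems a fair trade against hanging.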

The main functionality lives in CreateEvpn(), which calls evpn.Resolve() to fill in defaults from a static YAML config file (things like the VxLAN source address and the local AS number of the Bird BGP speaker). It writes a Bird snippet via bird.WriteSnippet(), applies it with bird.Configure(), then commits to a state file on disk, so that it can restart (or crash..) safely. It also publishes the atomic read snapshot via persist() for read-only List and Get calls. A setter called SetEnabled() mirrors this: it calls bird.Enable() or bird.Disable() on both the evpn_<slug> and vppevpn_<slug> protocols, then calls attach() or detach() on the VPP loopback interface so the BVI follows the protocol state.
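For the state file, one common way to survive a crash mid-write is to write a temporary file and rename it over the old one. The sketch below shows that pattern; whether vpp-evpnd persists its state exactly like this is an assumption on my part, and the state struct and file names are invented for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// state is a hypothetical, minimal on-disk state.
type state struct {
	Evpns []string `json:"evpns"`
}

// saveState writes to a temp file in the same directory, syncs it, and then
// renames it over the target, so a reader sees the old or the new file.
func saveState(path string, s state) error {
	data, err := json.MarshalIndent(s, "", "  ")
	if err != nil {
		return err
	}
	tmp, err := os.CreateTemp(filepath.Dir(path), ".state-*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(data); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), path)
}

// loadState reads the state back, for example after a daemon restart.
func loadState(path string) (state, error) {
	var s state
	data, err := os.ReadFile(path)
	if err != nil {
		return s, err
	}
	err = json.Unmarshal(data, &s)
	return s, err
}

// roundTrip saves and reloads a state in a throwaway directory.
func roundTrip() (state, error) {
	dir, err := os.MkdirTemp("", "evpnd-state")
	if err != nil {
		return state{}, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "state.json")
	if err := saveState(path, state{Evpns: []string{"test"}}); err != nil {
		return state{}, err
	}
	return loadState(path)
}

func main() {
	s, err := roundTrip()
	fmt.Println(s.Evpns, err)
}
```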

To implement the vpp-evpnd specifics of the Primary instance, BindGroupBVI() calls retargetBVI(), a logical five-step up/down orchestration sequence:

1. set the VPP loopback MAC via vpp.SetMacAddress.
2. tell Bird to rescan the L2FIB with bird.VppevpnRescan to (re)advertise the vMAC in BGP.
3. set the IPv6 link-local on both VPP (vpp.SetIP6LinkLocal) and the Linux tap over netlink (the Tap interface satisfied by internal/linktap).
4. add or remove the gateway addresses. On promotion the L3 is added last; on demotion it is removed first.
5. for the new primary, a goroutine fires announceGratuitous(), which sends a bunch of gratuitous ARP rounds, paced by the garpSchedule slice.
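The asymmetric ordering (L3 added last on promotion, removed first on demotion) can be sketched as follows. The dataplane interface, its method names and the recorder are hypothetical; the address and link-local values are only illustrative. For determinism the sketch announces synchronously, whereas the daemon described above does so in a goroutine.

```go
package main

import (
	"fmt"
	"strings"
)

// dataplane is a hypothetical interface over the VPP, Bird and netlink
// calls named in the steps above; the method names are illustrative.
type dataplane interface {
	SetMAC(mac string) error
	RescanL2FIB() error
	SetLinkLocal(addr string) error
	AddAddresses(addrs []string) error
	RemoveAddresses(addrs []string) error
	AnnounceGratuitous(addrs []string)
}

// bviTarget is the identity the loopback should take on.
type bviTarget struct {
	MAC, LinkLocal string
	Addrs          []string // gateway addresses to add (promote) or remove (demote)
}

// retarget applies the steps in an order that never leaves the gateway
// addresses configured on a MAC the fabric has not yet learned.
func retarget(dp dataplane, t bviTarget, promote bool) error {
	if !promote { // demotion: the L3 addresses go first
		if err := dp.RemoveAddresses(t.Addrs); err != nil {
			return fmt.Errorf("remove addresses: %w", err)
		}
	}
	if err := dp.SetMAC(t.MAC); err != nil {
		return fmt.Errorf("set mac: %w", err)
	}
	if err := dp.RescanL2FIB(); err != nil {
		return fmt.Errorf("rescan: %w", err)
	}
	if err := dp.SetLinkLocal(t.LinkLocal); err != nil {
		return fmt.Errorf("set link-local: %w", err)
	}
	if promote { // promotion: the L3 addresses go last, then announce
		if err := dp.AddAddresses(t.Addrs); err != nil {
			return fmt.Errorf("add addresses: %w", err)
		}
		dp.AnnounceGratuitous(t.Addrs)
	}
	return nil
}

// recorder implements dataplane by logging step names.
type recorder struct{ steps []string }

func (r *recorder) add(s string) error             { r.steps = append(r.steps, s); return nil }
func (r *recorder) SetMAC(string) error            { return r.add("mac") }
func (r *recorder) RescanL2FIB() error             { return r.add("rescan") }
func (r *recorder) SetLinkLocal(string) error      { return r.add("linklocal") }
func (r *recorder) AddAddresses([]string) error    { return r.add("add-l3") }
func (r *recorder) RemoveAddresses([]string) error { return r.add("remove-l3") }
func (r *recorder) AnnounceGratuitous([]string)    { r.steps = append(r.steps, "garp") }

// trace returns the step order for a promotion or a demotion.
func trace(promote bool) string {
	r := &recorder{}
	t := bviTarget{MAC: "42:6c:fa:d6:82:98", LinkLocal: "fe80::1", Addrs: []string{"172.16.0.32/24"}}
	if err := retarget(r, t, promote); err != nil {
		return "error: " + err.Error()
	}
	return strings.Join(r.steps, ",")
}

func main() {
	fmt.Println("promote:", trace(true))
	fmt.Println("demote: ", trace(false))
}
```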

The gRPC adapter (internal/grpcapi/server.go, type EvpndService) is deliberately thin: CreateEvpnInstance() unpacks the proto into an evpn.Input and delegates to mgr.CreateEvpn(); BindGroupBVI() calls mgr.BindGroupBVI(). Every error goes through evpndOpErr(), which both logs it locally on the box and returns a typed gRPC status to the caller.

Message Spine and EventBroker

The EventBroker in internal/grpcapi/events.go plays two roles at once. As a slog.Handler it sits in evpnd’s logging chain: every slog.Info(), slog.Warn(), or slog.Error() call writes to the JSON stdout handler AND fans out a type="log" event to all current subscribers. As an instance.EventSink, the Manager calls broker.Emit() directly for structured lifecycle events – type="crud" for every CRUD operation (e.g. evpn-created, groupbvi-bound), and type="failover" for role transitions. Both paths produce the same Event proto: a monotonic seq, an RFC 3339 ts, a type, a level, an instance, a message, and a fields map[string]string carrying context like the evpn slug or the vMAC. A concrete example:

{
  "seq": 42, "ts": "2026-06-10T13:37:00.123Z",
  "type": "crud", "level": "info",
  "instance": "dpu0-ddln0", "message": "groupbvi-bound",
  "fields": {"evpn": "test", "mac": "42:6c:fa:d6:82:98"}
}
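The slog.Handler half of that dual role can be sketched with the standard log/slog package (Go 1.21 or newer). The fanoutHandler type and its emit callback are hypothetical names; the real EventBroker also stamps seq, ts, instance and fields, which I leave out here.

```go
package main

import (
	"bytes"
	"context"
	"fmt"
	"log/slog"
	"strings"
)

// fanoutHandler forwards every record to the next handler (the JSON
// stdout handler in the article) and also hands it to an emit callback.
type fanoutHandler struct {
	next slog.Handler
	emit func(level, message string)
}

func (h fanoutHandler) Enabled(ctx context.Context, l slog.Level) bool {
	return h.next.Enabled(ctx, l)
}

func (h fanoutHandler) Handle(ctx context.Context, r slog.Record) error {
	h.emit(strings.ToLower(r.Level.String()), r.Message)
	return h.next.Handle(ctx, r)
}

func (h fanoutHandler) WithAttrs(attrs []slog.Attr) slog.Handler {
	return fanoutHandler{next: h.next.WithAttrs(attrs), emit: h.emit}
}

func (h fanoutHandler) WithGroup(name string) slog.Handler {
	return fanoutHandler{next: h.next.WithGroup(name), emit: h.emit}
}

// demo logs through the handler and reports what reached each side.
func demo() (events []string, jsonHasMsg bool) {
	var buf bytes.Buffer // stands in for stdout
	h := fanoutHandler{
		next: slog.NewJSONHandler(&buf, nil),
		emit: func(level, msg string) { events = append(events, level+" "+msg) },
	}
	logger := slog.New(h)
	logger.Info("groupbvi-bound", "evpn", "test")
	logger.Debug("not enabled at the default level") // filtered by Enabled
	return events, strings.Contains(buf.String(), `"msg":"groupbvi-bound"`)
}

func main() {
	events, ok := demo()
	fmt.Println(events, ok)
}
```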

WatchEvents on evpnd serves directly from broker.Subscribe(), with per-subscriber type and level filters (e.g. subscribe to only crud at info and above). Fan-out is non-blocking: a slow subscriber is silently dropped rather than stalling the broker goroutine.
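Non-blocking fan-out usually boils down to a select with a default branch. The sketch below reads "silently dropped" as "the subscriber is removed and its channel closed", which is my interpretation rather than a statement about the actual code; the Broker and Event types here are simplified stand-ins with only a type filter.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a trimmed-down version of the event described above.
type Event struct {
	Seq     uint64
	Type    string
	Message string
}

// Broker fans events out to subscribers without ever blocking on them.
type Broker struct {
	mu   sync.Mutex
	seq  uint64
	subs map[chan Event]string // channel -> type filter ("" means all types)
}

func NewBroker() *Broker {
	return &Broker{subs: make(map[chan Event]string)}
}

// Subscribe registers a buffered channel with an optional type filter.
func (b *Broker) Subscribe(typeFilter string, buffer int) <-chan Event {
	ch := make(chan Event, buffer)
	b.mu.Lock()
	b.subs[ch] = typeFilter
	b.mu.Unlock()
	return ch
}

// Emit delivers the event to every matching subscriber. A subscriber whose
// buffer is full is removed and its channel closed instead of being waited on.
func (b *Broker) Emit(typ, msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.seq++
	ev := Event{Seq: b.seq, Type: typ, Message: msg}
	for ch, filter := range b.subs {
		if filter != "" && filter != typ {
			continue
		}
		select {
		case ch <- ev:
		default:
			delete(b.subs, ch) // deleting during range is safe in Go
			close(ch)
		}
	}
}

// demo emits three events to one roomy and one tiny subscriber.
func demo() (fast, slow int) {
	b := NewBroker()
	f := b.Subscribe("crud", 8) // only crud events, plenty of buffer
	s := b.Subscribe("", 1)     // everything, but room for a single event
	b.Emit("crud", "evpn-created")
	b.Emit("log", "hello")
	b.Emit("crud", "groupbvi-bound")
	for range s { // terminates because the broker closed the channel
		slow++
	}
	return len(f), slow
}

func main() {
	fast, slow := demo()
	fmt.Println("fast subscriber buffered:", fast, "slow subscriber received:", slow)
}
```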

2. evpnr - a common registry

The Registry in internal/registry/ holds two maps: members (keyed by stable instance ID) and groups (keyed by group ID). Both are guarded by r.mu. Per-member orchestration uses an additional per-instance lock, memberConfigLock(instanceID), so two members can be configured concurrently without their operations interleaving – r.mu is released before any outbound RPC.
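A rough sketch of this locking scheme, with a heavily simplified Registry: the global mutex only protects the maps, the per-member lock serializes work for one member, and the global mutex is released before the (simulated) outbound RPC. The configure method and the address map are invented for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// Registry is a simplified stand-in: mu guards the maps, and each member
// gets its own lock that serializes orchestration for that member only.
type Registry struct {
	mu      sync.Mutex
	members map[string]string // instance ID -> address (illustrative)
	locks   map[string]*sync.Mutex
}

func NewRegistry(members map[string]string) *Registry {
	return &Registry{members: members, locks: make(map[string]*sync.Mutex)}
}

// memberConfigLock returns the per-instance lock, creating it on first use.
func (r *Registry) memberConfigLock(id string) *sync.Mutex {
	r.mu.Lock()
	defer r.mu.Unlock()
	l, ok := r.locks[id]
	if !ok {
		l = &sync.Mutex{}
		r.locks[id] = l
	}
	return l
}

// configure holds the member's lock for the whole operation, but holds
// r.mu only long enough to read the map, never across the outbound call.
func (r *Registry) configure(id string, rpc func(addr string)) {
	l := r.memberConfigLock(id)
	l.Lock()
	defer l.Unlock()

	r.mu.Lock()
	addr := r.members[id]
	r.mu.Unlock()

	rpc(addr) // slow outbound RPC, r.mu is not held here
}

// demo configures two members at once. Each fake RPC waits until both have
// started, so this only returns if the two really run concurrently.
func demo() bool {
	r := NewRegistry(map[string]string{"a": "192.0.2.1", "b": "192.0.2.2"})
	var barrier, done sync.WaitGroup
	barrier.Add(2)
	done.Add(2)
	for _, id := range []string{"a", "b"} {
		go func(id string) {
			defer done.Done()
			r.configure(id, func(string) { barrier.Done(); barrier.Wait() })
		}(id)
	}
	done.Wait()
	sameLock := r.memberConfigLock("a") == r.memberConfigLock("a")
	distinct := r.memberConfigLock("a") != r.memberConfigLock("b")
	return sameLock && distinct
}

func main() {
	fmt.Println("concurrent per-member configuration ok:", demo())
}
```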

When evpnd calls RegisterInstance() it stores the member and fires resync() in orchestrator.go. I can’t just blindly start reprovisioning the vpp-evpnd instance, because perhaps it has just restarted, and perhaps Bird has not yet had a chance to configure the bridge-domain and VxLAN tunnels. If I try to add loopback BVI interfaces, I may end up receiving a bunch of ‘bridge not found’ type errors. Ask me how I know :) So resync() first gates on m.Ready, the bit evpnd sets via ReportReady once its own dataplane reconcile completes after startup. It then calls be.ListEvpnSlugs() to compute a diff: stale eVPNs are torn down with be.DeleteEvpnInstance(); missing groups are configured via configureMember(). That function chains three evpnd RPCs in order: be.CreateEvpnInstance(), be.CreateBVI() with a deterministic locally-administered MAC derived as plainBVIMAC(groupID, instanceID) (SHA-256, first five bytes), and, if this member should be the group’s primary, be.BindGroupBVI(). A failure at any step rolls back with be.DeleteEvpnInstance() so a half-built member never silently masquerades as configured, another lesson learned after I initially got the dataplane configuration in VPP wrong.
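Deriving a stable MAC from a hash might look like the sketch below. Only "SHA-256, first five bytes" comes from the description above; the way the two IDs are joined and the 0x02 first octet (locally administered, unicast) are assumptions for illustration, so the function is deliberately named sketchBVIMAC and will not reproduce the real addresses.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net"
)

// sketchBVIMAC derives a deterministic, locally-administered unicast MAC
// from a group and instance ID. The "/" separator and the 0x02 prefix are
// assumptions of this sketch, not taken from vpp-evpn.
func sketchBVIMAC(groupID, instanceID string) net.HardwareAddr {
	sum := sha256.Sum256([]byte(groupID + "/" + instanceID))
	mac := make(net.HardwareAddr, 6)
	mac[0] = 0x02          // locally administered bit set, multicast bit clear
	copy(mac[1:], sum[:5]) // first five bytes of the digest
	return mac
}

func main() {
	fmt.Println(sketchBVIMAC("test", "dpu0-ddln0"))
	fmt.Println(sketchBVIMAC("test", "dpu0-chplo0"))
}
```

The useful property is that the registry can recompute the same MAC after a restart without having to store it anywhere.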

MoveGroupBVI() in registry/groups.go updates PrimaryInstanceID and calls reconcileMemberGroupBVI() for the affected members. That function re-reads the group under r.mu immediately before acting (not from a stale snapshot), then calls be.BindGroupBVI() on the incoming primary or be.ReleaseGroupBVI() on the outgoing one via the member’s Backend interface. Holding the config lock means a concurrent resync() for the same member cannot race the move.

In internal/grpcapi/server.go, EvpnrService.MoveGroupBVI() hides all of this complexity behind a one-liner that delegates to reg.MoveGroupBVI(). For per-instance reads, forwardClient() resolves the member’s grpcBackend from the registry and returns its EvpndClient gRPC stub, through which the facade forwards calls like ListEvpnInstances, GetBirdInfo, and any future pass-through calls, with a memberForwardTimeout context so one slow member cannot wedge a fleet-wide fan-out.

On registration, evpnr calls be.StartWatch() which opens a WatchEvents(registrar=true) stream to the member. The grpcWatcher.Run() loop in dialer.go reads events off that stream, stamps ev.Instance = instanceID (so a misbehaving evpnd cannot spoof another member’s origin), and calls fleet.Publish(ev) into the FleetBroker. evpnr’s own slog output is wired through a fleetLogHandler (also in grpcapi/), a second slog.Handler that publishes evpnr-origin log records to the same FleetBroker with an empty instance field. The result is one merged stream: all member events tagged by instance, interleaved with evpnr’s own log records.

Reusing the message spine, EvpnrService.WatchEvents() serves this merged stream from fleet.Subscribe(), applying per-subscriber type, level, and instance filters. Any gRPC client with access to evpnr can tap it. Imagine a frontend that uses it to push SSE updates to browsers, or a CLI call like evpnc watch events which renders them to the terminal. But such an event stream is equally useful for external automation: a small listener subscribing with types=["crud","failover"] can watch for groupbvi-bound or autonomous-failover messages and forward them as Telegram notifications. This way, my phone lights up at 3am before I receive angry e-mails from IPng’s customer base.

3. evpnc - a `golang-cli` gRPC client

Here’s where I get to reuse previous code! The [golang-cli] package is mated with a command tree in cmd/evpnc/commands.go, typed as *cli.Node[pb.EvpnrClient] – each leaf Run function receives the live EvpnrClient directly, so there is no global state and no connection management in the command layer. Dynamic completion nodes (dynInstances, dynGroups, dynGroupMembers) each issue one List* RPC to vpp-evpnr to populate the tab-completion token set live. As a user, this feature really helps me navigate the system.

In -json mode, wrapJSON() walks the whole tree and wraps every Run func: mutations that produce no output still emit {}, giving every command a uniform JSON contract. Show commands call emitProto(), which uses protojson.Marshal from the Go protobuf library to emit the proto message with its canonical field names. Composite views (like show group <id> traffic) assemble a custom struct using protoField() to embed individual messages, then call emitJSON() to emit the whole thing. Errors always render as {"error": "..."}, which brings an immediate consistency across vpp-maglev and vpp-evpn, and any future projects I may dabble with, all for free.
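The uniform JSON contract can be illustrated with a small wrapper around a leaf function. This is a sketch using only encoding/json, with a hypothetical RunFunc signature; the real wrapJSON() operates on the command tree and emits protobuf messages via protojson, which I do not reproduce here.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"io"
)

// RunFunc is a hypothetical leaf command that writes its output to w.
type RunFunc func(w io.Writer) error

// wrapJSONSketch guarantees exactly one JSON document per command:
// the command's own output, {} when it printed nothing, or {"error": ...}.
func wrapJSONSketch(run RunFunc) RunFunc {
	return func(w io.Writer) error {
		var buf bytes.Buffer
		if err := run(&buf); err != nil {
			// Encode error ignored here; the command error is what matters.
			_ = json.NewEncoder(w).Encode(map[string]string{"error": err.Error()})
			return err
		}
		if buf.Len() == 0 {
			_, err := io.WriteString(w, "{}\n")
			return err
		}
		_, err := w.Write(buf.Bytes())
		return err
	}
}

// silent is a mutation that succeeds without printing anything.
func silent(io.Writer) error { return nil }

// failing is a command that returns an error.
func failing(io.Writer) error { return errors.New("no such group") }

// render runs a wrapped command and returns what it wrote.
func render(run RunFunc) string {
	var out bytes.Buffer
	_ = wrapJSONSketch(run)(&out)
	return out.String()
}

func main() {
	fmt.Print(render(silent))
	fmt.Print(render(failing))
}
```

Returning the original error after printing it lets the caller still set a non-zero exit code, which is a choice of this sketch.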

A few examples that appear in the screencast below:

pim@squanchy:~$ evpnc
evpn> show instance                           # list fleet members (text)
evpn> show instance dpu0-ddln0                # show the details of a given instance
evpn> group test bvi set primary dpu0-chplo0  # move the primary to another instance
evpn> quit

pim@squanchy:~$ evpnc -json show group test | jq .groupBvi    # pipe group BVI into jq
{
  "mac": "42:6c:fa:d6:82:98",
  "addresses": [
    {
      "address": "172.16.0.32",
      "prefixLen": 24
    },
    {
      "address": "fec0::32",
      "prefixLen": 64
    }
  ],
  "createdAt": "1780525048",
  "modifiedAt": "1780669810"
}

Results

I think showing the end to end interaction is best done in a two minute asciinema video:

In this test setup, I have two vpp-evpnd instances connected to the central registry. One of them is in Zurich (dpu0-ddln0) and the other in Geneva (dpu0-chplo0). I’ll create a test group and add both instances to it, which shows them gaining a new interface in Linux called bvi-test. Then I’ll add a GroupBVI with an IPv4 and IPv6 address, and assign it to one of the instances. At the bottom of the screen, you see another host in that eVPN network start to ping the GroupBVI in Zurich (at 0.8ms). When failing over, you can see one or two ping packets lost, as the management plane does its break-before-make migration of the GroupBVI, and then the ping packets come back at 5ms because the active gateway is in Geneva. After flipping back and forth a few times, I delete the primary, which unassigns the GroupBVI - consequently pings now stop. Finally, I delete the whole group and the InstanceBVIs are cleaned up.

What’s next

I can now add/remove VPP nodes in a common eVPN registry, ask them to join/leave an eVPN layer2 bridge domain, and manually move IPv4/IPv6 addresses between participating nodes. While doing this via gRPC is really cool, the eventual goal is self-healing, because I may not be around to rescue an unfit VPP router on a Sunday morning at 3am. So what’s next for me is to add a health checking system, I’m thinking sort of like CARP or VRRP, that can coordinate migration of the IPv4/IPv6 addresses between participating nodes autonomously. This code would then be able to turn on/off the sending of heartbeats from both the primary and any candidate nodes that are ready to take over, and on certain conditions, issue the gRPC call to move the primary to a different node.

I am also pretty keen on replicating the [VPP Maglev] frontend so that I can see the (growing) fleet at a glance, and possibly trigger failovers from the comfort of my web browser. Stay tuned!