Introduction
You know what would be really cool? If VPP could be an eVPN/VxLAN speaker! Sometimes I feel like I’m the very last person on the planet to learn about something cool. My latest “A-Ha!” moment was when I was configuring the eVPN fabric for [Frys-IX], and I wrote up an article about it [here] back in April.
I can build the equivalent of Virtual Private Wires (VPWS), also called L2VPN or Virtual Leased Lines, and these are straightforward because they typically only have two endpoints. A “regular” VxLAN tunnel which is L2 cross connected with another interface already does that just fine. Take a look at an article on [L2 Gymnastics] for that. But the real kicker is that I can also create multi-site L2 domains like Virtual Private LAN Services (VPLS), also called Virtual Private Ethernet, L2VPN or Ethernet LAN Service (E-LAN). And that is a whole other level of awesome.
A small while ago, I wrote about a Bird protocol called [vppevpn], which allows me to synchronize the Bird eVPN and ethernet routing tables into the VPP dataplane by programming VxLAN VTEP endpoints, controlling flooding/learning, and learning/announcing MAC addresses in BGP. The code I wrote turns VPP into an eVPN/VxLAN speaker.
Configuring all of these protocols manually is table stakes, but it is not very elegant. In this article, I share my work on a VPP eVPN management plane, which handles the lifecycle of eVPN membership, L2 and L3 addresses, and failover between participating nodes.
Problem Statement
The good news is, I now have a working Bird vppevpn protocol that can join a VPP router into an
eVPN broadcast domain, by programming VxLAN tunnels and the L2FIB from BGP. Exciting stuff - but
everything I showed required hand-crafted Bird configuration snippets, a manually created loopback
interface, and IPv4/IPv6 addresses added by hand via vppctl, and what’s worse, these snippets
would risk fighting with [vppcfg]. On a test router in a lab,
that is fine. On a fleet of a dozen VPP routers, it rapidly becomes an operational tax that I wish
upon nobody, not to mention an outage waiting to happen.
The bad news is, there are a few operational problems I feel I should point out:

1. Manual Configuration: Writing Bird snippets by hand means getting the route distinguisher, route
targets, VNI, and VTEP address right on every single node. Every. Single. Time. Adding an address to a
BVI means SSHing in and running a vppctl command and figuring out how to reconcile that with
what’s in vppcfg, because the bridge-domain and VxLAN tunnels that were added by Bird’s vppevpn
protocol will naturally not be modeled in the YAML config file.
2. Moving IP Addresses: Moving that address to a different VPP node - say, because a machine needs maintenance, or worse, took the day off and crashed - requires SSH sessions to two hosts in a carefully choreographed order of operations. If I do it wrong, both nodes might briefly announce the same MAC into BGP, momentarily confusing the broadcast domain. Multiply this across dozens of VPP routers and several eVPN broadcast domains, and the ssh-dance becomes a reliability hazard. Nope, I do not want this.
3. Observability: Another obvious thing to me, as a recovering Site Reliability Engineer, is
missing visibility. I have no single pane of glass showing which VPP routers are in which
broadcast domains, what BVI MAC addresses are active, or which node currently holds the primary
gateway role. There are no Prometheus metrics tracking control-plane lifecycle - who joined when,
who is healthy, who holds the primary role and who are the standby candidates. Monitoring is an
annoyingly manual affair: SSH in, run birdc show proto, vppctl show bridge-domain, and mentally
correlate the output across machines. Not having a control- and management plane lifecycle overview
is a large problem in production.
4. Self-Healing: And then there is the problem I find most acute and worthwhile to solve for IPng: if a primary VPP gateway becomes unfit to serve at 3am, nothing notices and nothing moves the L3 addresses to a healthy standby, even if one is readily available! Automated failover would be fantastic to have - and I will leave that topic entirely for a dedicated follow-up article, but it is the reason the system I am about to describe is designed the way it is.
My offer: vpp-evpn
Identifying the problem is half the solution. My buddy Brian used to say “So, what’s your offer?”
My offer is vpp-evpn, a control- and management plane that sits on top of VPP and Bird
and owns the operational lifecycle of eVPN gateways across a fleet of routers. My goal is simple to
state: an operator should never have to SSH into a VPP machine and hand-edit Bird configs or issue
vppctl commands to join or leave an eVPN broadcast domain. All of that should happen through a
single, uniform user interface.
After noodling for a few days, I come up with a system with three moving parts. Each VPP host runs
vpp-evpnd, a small per-host daemon that keeps track of eVPN instances and their associated BVI
interfaces on that machine. It writes Bird config snippets into a dedicated include directory so
Bird can glob them, talks to VPP’s binary API to manage BVI loopbacks and L3 addresses, and persists
its state to disk so that restarting the daemon does not disrupt the dataplane. The vpp-evpnd only
concerns itself with what’s on its own host, namely the VPP dataplane and Bird controlplane. It
knows nothing about other instances.
That fleet-wide view will live in vpp-evpnr, a central registry that knows which VPP routers
are registered, which groups they belong to, what shared virtual MAC and gateway addresses there are
for each group, and which node is currently the primary. When an operator wants to move the primary
L3 gateway from one node to another, a single gRPC call to vpp-evpnr should orchestrate the move
in the correct break-before-make order. First, vpp-evpnr will ask the old primary to demote
itself (removing the IP addresses and releasing its hold on the vMAC), then wait for the fabric
to absorb the MAC withdrawal, and only then promote a candidate to become the new primary.
I already got my hands dirty with my [VPP Maglev]
loadbalancer, and found that the combination of gRPC + CLI is dope, so rounding out this design is
vpp-evpnc, an interactive CLI that talks exclusively to vpp-evpnr, which in turn can dispatch
calls to one or more vpp-evpnd instances to do whatever the operator wants.
Why invent new things when I can also cargocult like a boss?! In this design, vpp-evpn follows the
same house style as vpp-maglev - structured JSON logging, gRPC as the only programmatic interface,
Prometheus metrics throughout, and a simple message bus that allows components to share state and events
with one another. This way, it fits naturally alongside what I already have and I can reuse some of
the code I already wrote.
vpp-evpn: Functionality
First, let me talk table stakes. Each VPP instance needs to be able to join and leave eVPN broadcast
domains at the push of a (gRPC) button. That means vpp-evpnd must be able to create a new eVPN
instance by writing the appropriate Bird evpn and vppevpn protocol stanzas into a dedicated
include directory and telling Bird to reload, which was the main topic of [this article]. Alongside that, it needs to create an Instance BVI,
which is a VPP loopback interface with a specific MAC address, plumbed into Linux via the [Linux
Control Plane] plugin. This gives the bridge domain an L3 presence,
although it doesn’t do much as it will only have a link-local address. When an instance is removed,
vpp-evpnd tears down the BVI and the Bird snippets in reverse order. The key point is that none of
this requires an operator to touch the machine directly.
The fleet-wide picture is managed by vpp-evpnr. Instances register with the registry and are
organized into groups, where a group represents a set of VPP routers all participating in the same
eVPN broadcast domain. Each group has a single virtual MAC address (the vMAC) and carries one or
more IPv4/IPv6 addresses. One member of each group is the primary, which means that instead of its
ephemeral instance MAC (with only link-local), it will bind the vMAC from the shared Group BVI with
the full L3 configuration. I call the others candidates, holding only their link-local Instance
BVI. The registry is the single source of truth for group membership, the elected primary, and the
shared vMAC and its L3 addresses.
All interaction with the registry happens via gRPC. I want to be able to add a new member to a
group, change the group MTU, add an IPv4 or IPv6 address to the GroupBVI, or move the primary from
one node to another. And I want to be able to interact with this system through gRPC calls to
vpp-evpnr, which figures out the necessary instructions and sends RPCs to the relevant vpp-evpnd
instances. In practice nobody should be typing raw gRPC calls, so vpp-evpnc is the companion CLI
that maps those RPC calls to human-friendly commands.
vpp-evpn: Non-Functional requirements
Clearly, a vpp-evpnd management-plane outage must never stop the controlplane (Bird) or dataplane
(VPP). I need to see to it that they continue on their last-programmed state. Similarly,
if vpp-evpnr takes the day off, it must not stop vpp-evpnd instances either - they freeze
orchestration and run on last-known group state until the registry returns. For my first version
though, vpp-evpnr is an accepted single point of failure. I will accept the risk of a coincident
registry and dataplane failure, leaving the network without a gateway until vpp-evpnr recovers and
everything resynchronizes.
Group BVI moves need to be strictly serialized: at no time can I have two Bird instances announce the same vMAC into the same bridge domain, as it might cause the eVPN fabric to become unstable and dampen the flapping MAC address. Config reloads are atomic - either the whole new config takes effect, or none of it does.
Specifically I worry about observability, and seeing as Clyde there is typing commands into a CLI, which sends RPCs to a registry which in turn might cascade them to multiple instances, every CRUD operation and state change really needs to be observable by emitting a structured JSON log line that can make it all the way back up to the CLI. And while I’m making my little wishlist, I want Prometheus metrics, and streaming gRPC events. Any future Dashboards should use the metrics and event streams, rather than implement tight polling.
Oh, and considering this system will be running on the uplink of my house, it would be kind of nice if it didn’t go down, kthxbai.
Implementation
I want to avoid this system becoming spaghetti code, and to do so I need to set some architectural boundaries. Having well described systems with contracts between them is a good way to stay sane. I spend some time thinking about the user experience, notably the gRPC endpoints and the CLI syntax. In my experience, working backwards from ‘what will the operator actually see’ gives me a clear picture.
vpp-evpn: User Experience
gRPC interface
Early on I make a deliberate design choice: wherever possible, the gRPC API follows CRUD/L semantics -
Create, Read, Update, Delete and List. CRUD/L is powerful not because it’s clever or sophisticated,
but because it is predictable. Given the name of a noun - say, EvpnInstance or GroupBVI - an
operator or script author can immediately guess the shape of the API: CreateEvpnInstance,
ListEvpnInstances, GetEvpnInstance, DeleteEvpnInstance and so on. Updates to objects come in
the form of Setters and Add/Delete calls, like SetGroupMTU or AddGroupBVIAddress.

I find this approach both elegant and forcefully minimalistic - no wonder many complicated systems implement CRUD/L semantics. In a distributed system with multiple daemons and a CLI on top, a uniform user experience makes everything from documentation to CLI tab-completion remarkably straightforward, and what’s best, it forces me to think about the object model ahead of time, which, like the swiss flag, is a big plus.
Starting at the bottom of the stack, vpp-evpnd exposes CRUD/L on two nouns: the EvpnInstance, which
governs the Bird config snippet and VPP bridge domain for one eVPN broadcast domain, and the
Instance BVI that provides L3 presence in that domain. The setter SetEvpnInstanceEnabled call
deserves a note - it maps to birdc enable and birdc disable, allowing an operator to temporarily
take a node out of an eVPN without deleting its configuration. Here’s what I come up with:
rpc CreateEvpnInstance(CreateEvpnInstanceRequest) returns (EvpnInstance);
rpc ListEvpnInstances(InstanceRef) returns (EvpnInstanceList);
rpc GetEvpnInstance(EvpnRef) returns (EvpnInstance);
rpc SetEvpnInstanceEnabled(SetEnabledRequest) returns (EvpnInstance);
rpc DeleteEvpnInstance(EvpnRef) returns (Empty);
rpc CreateBVI(CreateBVIRequest) returns (BVI);
rpc ListBVIs(InstanceRef) returns (BVIList);
rpc GetBVI(EvpnRef) returns (BVI);
rpc ReplaceBVI(ReplaceBVIRequest) returns (BVI);
rpc DeleteBVI(EvpnRef) returns (Empty);
vpp-evpnr’s API is richer and typically a superset of vpp-evpnd’s, because it needs to be able
to pass through calls for instance-specific information from the client. At the top are Instances,
which register with the registry and can be inspected or removed. Below that are Groups, full
CRUD objects that hold membership lists, MTU settings, and L3 addresses. Three special
operations at the bottom - BindGroupBVI, ReleaseGroupBVI, and MoveGroupBVI - are the heart
of the failover machinery. Bind instructs a member to take on the primary role: change the
loopback MAC to the group vMAC, configure the IPv4 and IPv6 addresses, and announce these changes to
Bird. Release is the inverse: strip the L3 addresses, revert the MAC to the ephemeral instance
MAC, and propagate these changes to Bird. Move simply composes these two steps across two nodes in
the correct order, releasing the GroupBVI from all instances before binding it on a new primary,
which is meant to guarantee that the eVPN fabric never sees two primaries at once, yet the other
computers in the eVPN see a stable vMAC, IPv4/IPv6 global addresses and link-local. For them, the
move is meant to be seamless.
rpc RegisterInstance(RegisterInstanceRequest) returns (RegisterInstanceResponse);
rpc ListInstances(Empty) returns (InstanceList);
rpc GetInstance(InstanceRef) returns (InstanceInfo);
rpc GetInstanceStatus(InstanceRef) returns (InstanceStatus);
rpc DeleteInstance(InstanceRef) returns (Empty);
rpc ListGroups(Empty) returns (GroupList);
rpc GetGroup(GroupRef) returns (Group);
rpc CreateGroup(CreateGroupRequest) returns (Group);
rpc DeleteGroup(GroupRef) returns (Empty);
rpc AddGroupMember(GroupMemberRequest) returns (Group);
rpc RemoveGroupMember(GroupMemberRequest) returns (Group);
rpc CreateGroupBVI(GroupBVIRequest) returns (Group);
rpc DeleteGroupBVI(GroupRef) returns (Group);
rpc SetGroupMTU(SetGroupMTURequest) returns (Group);
rpc AddGroupBVIAddress(GroupBVIAddAddressRequest) returns (Group);
rpc DeleteGroupBVIAddress(GroupBVIDeleteAddressRequest) returns (Group);
rpc BindGroupBVI(BindGroupBVIRequest) returns (Empty);
rpc ReleaseGroupBVI(ReleaseGroupBVIRequest) returns (Empty);
rpc MoveGroupBVI(MoveGroupBVIRequest) returns (Group);
By this point the scope of responsibility has become clear to me. vpp-evpnd manages what is on one
machine: the eVPN membership, the Instance BVI, and the local Bird and VPP state. vpp-evpnr
manages what the wider fleet should look like: which machines are in which groups, what the group’s
gateway identity is, and which machine is the primary. One vpp-evpnr coordinates many vpp-evpnd
instances, and no vpp-evpnd needs to know anything about any other - they are peers only in the
sense that they share a broadcast domain in the dataplane.
SideQuest - a Golang CLI package

Side Quest time! From the RPC signatures above, it quickly becomes obvious to me how to structure the
CLI. Before I get to that, I make an observation: this is the second project within a few months that
needs some sort of a gRPC-backed interactive CLI - the other one being
[vpp-maglev], which ships maglevc. Rather than copy-pasting that
code, Claude and I extract the CLI framework into its own [reusable
package]. In merely five commits, the story unfolds without me
having to do much. Claude:
- extracts the generic command-tree CLI library from maglevc, establishing the basic node tree structure, with dynamic nodes that can be resolved by a function, for example by issuing gRPC List calls to a backend. This structure provides tab-completion and ? help syntax.
- adds a builder pattern and app runner, reducing new command wiring from a bunch of complicated nested tree structures to a handful of lines of boilerplate.
- adds input Validate helpers, a keypress subpackage for interactive prompts, a fix for the client to run both on Linux and BSD (termios is not as portable as I would’ve expected), and an RFC-style design.md (common for my projects) for the library itself.
- adds a -json flag opt-in via an App.JSON boolean, so methods that want to / can render JSON output can surface the flag to the author and fail gracefully otherwise.
- adds a default JSON-model renderer that maps any gRPC proto-JSON response to colored terminal output with bright-white values, dark blue labels, and a paint() method that can colorize stuff in text output mode.
I find the -json flag particularly useful for scripting. Any vpp-evpnc command that returns a gRPC
message can, with -json, emit the full proto-JSON to stdout, making it trivial to pipe into jq
or any monitoring tool or onwards script composition. The -color flag adds terminal coloring to
text output using the paint and label helpers from the renderer, so operators who prefer a
monochrome environment can keep it that way.
All told, it takes no more than 20 minutes to refactor the maglevc CLI into its own package, including an
[example], and the package is
immediately useful for my new evpnc program. Merci, Claude!
I start scribbling down what the command structure will be, and which gRPCs they map to:
| Command | gRPC Method |
|---|---|
| show version | Local binary, based on compile-time LDFLAGS |
| show instance | Evpnr.ListInstances |
| show instance <id> | Evpnr.GetInstance(id) passed to Evpnd.GetInstance |
| show instance <id> evpn | Evpnr.ListEvpnInstances(id) passed to Evpnd.* |
| show instance <id> evpn <evpn> | Evpnr.GetEvpnInstance(id) passed to Evpnd.* |
| show instance <id> bvi | Evpnr.ListBVIs(id) passed to Evpnd.* |
| show instance <id> vpp info | Evpnr.GetVPPInfo(id) passed to Evpnd.* |
| show instance <id> bird info | Evpnr.GetBirdInfo(id) passed to Evpnd.* |
| … | … |
As an example, tab completion derives its dynamic token list from the ListInstances and
ListEvpnInstances gRPC calls, so the interactive CLI always offers only currently registered
members and their actual eVPN instance names. See below for a demonstration asciinema screencast.
vpp-evpn - implementation
What follows next is not for everyone. If you don’t write code at all, most of it will come across as an alien language intermixed with English every now and again. And if you do write code, you’re probably better at it than I am, and you’ll scratch your head saying “what was this dude thinking …”. Either way, here I go - met de billen bloot!
1. evpnd - the per-node daemon
The instance-tier model lives in internal/instance/manager.go as a Manager struct. It uses two
lock regimes deliberately: a sync.Mutex serializes writes (which may block waiting for a VPP
binary API reply), while reads are served lock-free from an atomic.Pointer[readState]. I kind of
came to this model the hard way - I found what seems to be a bug in ip6-nd node of VPP, at least
on arm64, and the VPP instance would vanish on me and restart. This made show instances hang,
because one of the instances would never respond: a write stuck waiting on VPP kept holding m.mu
indefinitely. I settled on a reply timeout and a read-only view which won’t stall a concurrent show instances query.
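A minimal sketch of the two lock regimes, assuming a much simplified Manager: the field names, the readState contents and the slowCall parameter are illustrative and not taken from the actual code. Writers hold the mutex across a slow call, while readers only load the last published snapshot.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// readState is an immutable snapshot; readers never mutate it.
type readState struct {
	Evpns []string
}

// Manager is a simplified stand-in for the Manager described above.
type Manager struct {
	mu    sync.Mutex                // serializes writers, may be held across slow calls
	evpns []string                  // authoritative state, guarded by mu
	snap  atomic.Pointer[readState] // last published snapshot, read without locking
}

func NewManager() *Manager {
	m := &Manager{}
	m.snap.Store(&readState{})
	return m
}

// CreateEvpn holds mu for the whole write, including the (simulated) slow
// dataplane call, and publishes a fresh snapshot only after it returns.
func (m *Manager) CreateEvpn(name string, slowCall func()) {
	m.mu.Lock()
	defer m.mu.Unlock()
	slowCall() // stands in for a VPP binary API round trip
	m.evpns = append(m.evpns, name)
	m.snap.Store(&readState{Evpns: append([]string(nil), m.evpns...)})
}

// ListEvpns never touches mu, so it cannot stall behind a blocked writer.
func (m *Manager) ListEvpns() []string {
	return m.snap.Load().Evpns
}

func main() {
	m := NewManager()
	m.CreateEvpn("test", func() {})

	started := make(chan struct{})
	release := make(chan struct{})
	done := make(chan struct{})
	go func() {
		m.CreateEvpn("slow", func() { close(started); <-release })
		close(done)
	}()
	<-started                  // the writer now holds mu and is blocked
	fmt.Println(m.ListEvpns()) // prints "[test]" from the old snapshot
	close(release)
	<-done
	fmt.Println(m.ListEvpns()) // prints "[test slow]"
}
```

The trade-off is that reads may be slightly stale while a write is in flight, which seems acceptable for show commands.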
The main functionality lives in CreateEvpn(), which calls evpn.Resolve() to fill in defaults
from a static YAML config file (things like the VxLAN source address and the local AS number of
the Bird BGP speaker). It writes a Bird snippet via bird.WriteSnippet(), applies it with
bird.Configure(), then commits to a state file on disk, so that it can restart (or crash..)
safely. It also publishes the atomic read snapshot via persist() for read-only List and Get
calls. A setter called SetEnabled() mirrors this: it calls bird.Enable() or bird.Disable() on
both the evpn_<slug> and vppevpn_<slug> protocols, then calls attach() or detach() on the
VPP loopback interface so the BVI follows the protocol state.
To implement the vpp-evpnd specifics of the Primary instance, BindGroupBVI() calls
retargetBVI(), a logical five-step up/down orchestration sequence:
- set the VPP loopback MAC via vpp.SetMacAddress.
- tell Bird to rescan the L2FIB with bird.VppevpnRescan to (re)advertise the vMAC in BGP.
- set the IPv6 link-local on both VPP (vpp.SetIP6LinkLocal) and the Linux tap over netlink (the Tap interface satisfied by internal/linktap).
- and finally add or remove the gateway addresses. On promotion the L3 is added last; on demotion it is removed first.
- for the new primary, a goroutine fires announceGratuitous(), which sends a bunch of gratuitous ARP rounds, paced by the garpSchedule slice.
The gRPC adapter (internal/grpcapi/server.go, type EvpndService) is deliberately thin:
CreateEvpnInstance() unpacks the proto into an evpn.Input and delegates to mgr.CreateEvpn();
BindGroupBVI() calls mgr.BindGroupBVI(). Every error goes through evpndOpErr(), which both
logs it locally on the box and returns a typed gRPC status to the caller.
Message Spine and EventBroker
The EventBroker in internal/grpcapi/events.go plays two roles at once. As a slog.Handler
it sits in evpnd’s logging chain: every slog.Info(), slog.Warn(), or slog.Error() call
writes to the JSON stdout handler AND fans out a type="log" event to all current subscribers.
As an instance.EventSink, the Manager calls broker.Emit() directly for structured lifecycle
events – type="crud" for every CRUD operation (e.g. evpn-created, groupbvi-bound), and
type="failover" for role transitions. Both paths produce the same Event proto: a monotonic
seq, an RFC 3339 ts, a type, a level, an instance, a message, and a
fields map[string]string carrying context like the evpn slug or the vMAC. A concrete example:
{
  "seq": 42, "ts": "2026-06-10T13:37:00.123Z",
  "type": "crud", "level": "info",
  "instance": "dpu0-ddln0", "message": "groupbvi-bound",
  "fields": {"evpn": "test", "mac": "42:6c:fa:d6:82:98"}
}
WatchEvents on evpnd serves directly from broker.Subscribe(), with per-subscriber type and
level filters (e.g. subscribe to only crud at info and above). Fan-out is non-blocking: a
slow subscriber is silently dropped rather than stalling the broker goroutine.
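A minimal sketch of non-blocking fan-out with a type filter, assuming a much reduced Event and Broker. This sketch reads "dropped" as disconnecting the slow subscriber by closing its channel; dropping only the single event instead would be a small change in the default branch. Level filters are left out for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is a reduced version of the Event proto described above.
type Event struct {
	Seq     uint64
	Type    string
	Message string
}

// Broker fans events out to subscribers without ever blocking on them.
type Broker struct {
	mu   sync.Mutex
	seq  uint64
	subs map[chan Event]map[string]bool // subscriber -> accepted types (nil = all)
}

func NewBroker() *Broker {
	return &Broker{subs: map[chan Event]map[string]bool{}}
}

// Subscribe registers a buffered channel; with no types given, the
// subscriber receives every event type.
func (b *Broker) Subscribe(buf int, types ...string) <-chan Event {
	var want map[string]bool
	if len(types) > 0 {
		want = map[string]bool{}
		for _, t := range types {
			want[t] = true
		}
	}
	ch := make(chan Event, buf)
	b.mu.Lock()
	b.subs[ch] = want
	b.mu.Unlock()
	return ch
}

// Emit stamps a monotonic sequence number and delivers the event. A
// subscriber whose buffer is full is removed and its channel closed.
func (b *Broker) Emit(typ, msg string) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.seq++
	ev := Event{Seq: b.seq, Type: typ, Message: msg}
	for ch, want := range b.subs {
		if want != nil && !want[typ] {
			continue
		}
		select {
		case ch <- ev:
		default:
			delete(b.subs, ch)
			close(ch)
		}
	}
}

func main() {
	b := NewBroker()
	crudOnly := b.Subscribe(8, "crud")
	slow := b.Subscribe(1) // tiny buffer and nobody reading

	b.Emit("log", "hello")
	b.Emit("crud", "evpn-created")
	b.Emit("crud", "groupbvi-bound")

	for len(crudOnly) > 0 {
		ev := <-crudOnly
		fmt.Println(ev.Seq, ev.Type, ev.Message)
	}
	<-slow // drains the one buffered event
	_, open := <-slow
	fmt.Println("slow subscriber still open:", open)
}
```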
2. evpnr - a common registry
The Registry in internal/registry/ holds two maps: members (keyed by stable instance ID)
and groups (keyed by group ID). Both are guarded by r.mu. Per-member orchestration uses an
additional per-instance lock, memberConfigLock(instanceID), so two members can be configured
concurrently without their operations interleaving – r.mu is released before any outbound RPC.
When evpnd calls RegisterInstance() it stores the member and fires resync() in
orchestrator.go. I can’t just blindly start reprovisioning the vpp-evpnd instance, because
perhaps it has just restarted, and perhaps Bird has not yet had a chance to configure the
bridge-domain and VxLAN tunnels. If I try to add loopback BVI interfaces, I may end up receiving a
bunch of ‘bridge not found’ type errors. Ask me how I know :) So resync() first gates on
m.Ready, the bit evpnd sets via ReportReady once its own dataplane reconcile completes after
evpnd’s startup. It then calls be.ListEvpnSlugs() to compute a diff: stale eVPNs are torn down
with be.DeleteEvpnInstance(); missing groups are configured via configureMember(). That function
chains three evpnd RPCs in order: be.CreateEvpnInstance(), be.CreateBVI() with a deterministic
locally-administered MAC derived as plainBVIMAC(groupID, instanceID) (SHA-256, first five bytes),
and, if this member should be the group’s primary, be.BindGroupBVI(). A failure at any step rolls back
with be.DeleteEvpnInstance() so a half-built member never silently masquerades as configured,
another lesson learned after I initially got the dataplane configuration in VPP wrong.
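Two pieces of that paragraph lend themselves to a small sketch: the deterministic MAC and the stale/missing diff. Both functions below are reconstructions under assumptions. The article only says the MAC uses SHA-256 and its first five bytes; the 0x02 leading octet and the "/" separator in the hash input are my guesses, and diffSlugs is a hypothetical helper name.

```go
package main

import (
	"crypto/sha256"
	"fmt"
	"net"
	"sort"
)

// plainBVIMAC derives a stable, locally-administered unicast MAC from the
// group and instance IDs: a fixed 0x02 octet followed by the first five
// bytes of the SHA-256 digest. The exact layout is an assumption.
func plainBVIMAC(groupID, instanceID string) net.HardwareAddr {
	sum := sha256.Sum256([]byte(groupID + "/" + instanceID))
	mac := make(net.HardwareAddr, 6)
	mac[0] = 0x02 // locally administered bit set, multicast bit clear
	copy(mac[1:], sum[:5])
	return mac
}

// diffSlugs compares the eVPNs a member should have (want) with the ones
// it reports (have). Missing ones need configuring, stale ones tearing down.
func diffSlugs(want, have []string) (missing, stale []string) {
	haveSet := map[string]bool{}
	for _, s := range have {
		haveSet[s] = true
	}
	wantSet := map[string]bool{}
	for _, s := range want {
		wantSet[s] = true
		if !haveSet[s] {
			missing = append(missing, s)
		}
	}
	for _, s := range have {
		if !wantSet[s] {
			stale = append(stale, s)
		}
	}
	sort.Strings(missing)
	sort.Strings(stale)
	return missing, stale
}

func main() {
	// The same inputs always yield the same MAC, across restarts and hosts.
	fmt.Println(plainBVIMAC("test", "dpu0-ddln0"))
	fmt.Println(plainBVIMAC("test", "dpu0-chplo0"))

	missing, stale := diffSlugs([]string{"test", "prod"}, []string{"prod", "old"})
	fmt.Println(missing, stale) // prints "[test] [old]"
}
```

Deriving the MAC rather than storing it means the registry can recompute it at any time, with no extra state to persist or synchronize.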
MoveGroupBVI() in registry/groups.go updates PrimaryInstanceID and calls
reconcileMemberGroupBVI() for the affected members. That function re-reads the group under
r.mu immediately before acting (not from a stale snapshot), then calls be.BindGroupBVI() on
the incoming primary or be.ReleaseGroupBVI() on the outgoing one via the member’s Backend
interface. Holding the config lock means a concurrent resync() for the same member cannot race
the move.
In internal/grpcapi/server.go, EvpnrService.MoveGroupBVI() hides all of this complexity behind a
one-liner that delegates to reg.MoveGroupBVI(). For per-instance reads, forwardClient() resolves
the member’s grpcBackend from the registry and returns its EvpndClient gRPC stub, through which
the facade forwards calls like ListEvpnInstances, GetBirdInfo, and any future pass-through
calls, with a memberForwardTimeout context so one slow member cannot wedge a fleet-wide fan-out.
On registration, evpnr calls be.StartWatch() which opens a WatchEvents(registrar=true) stream
to the member. The grpcWatcher.Run() loop in dialer.go reads events off that stream, stamps
ev.Instance = instanceID (so a misbehaving evpnd cannot spoof another member’s origin), and
calls fleet.Publish(ev) into the FleetBroker. evpnr’s own slog output is wired through a
fleetLogHandler (also in grpcapi/), a second slog.Handler that publishes evpnr-origin log
records to the same FleetBroker with an empty instance field. The result is one merged stream:
all member events tagged by instance, interleaved with evpnr’s own log records.
Reusing the message spine, EvpnrService.WatchEvents() serves this merged stream from
fleet.Subscribe(), applying per-subscriber type, level, and instance filters. Any gRPC client with
access to evpnr can tap it. Imagine a frontend that uses it to push SSE updates to browsers, or
a CLI call like evpnc watch events which renders them to the terminal. But such an event stream is
equally useful for external automation: a small listener subscribing with
types=["crud","failover"] can watch for groupbvi-bound or autonomous-failover messages and
forward them as Telegram notifications. This way, my phone lights up at 3am before I receive angry
e-mails from IPng’s customer base.
3. evpnc - a golang-cli gRPC client
Here’s where I get to reuse previous code! The [golang-cli]
package is mated with a command tree in cmd/evpnc/commands.go, typed as
*cli.Node[pb.EvpnrClient] – each leaf Run function receives the live EvpnrClient directly, so
there is no global state and no connection management in the command layer. Dynamic completion nodes
(dynInstances, dynGroups, dynGroupMembers) each issue one List* RPC to vpp-evpnr to
populate the tab-completion token set live. As a user, this feature really helps me navigate the
system.
In -json mode, wrapJSON() walks the whole tree and wraps every Run func: mutations that
produce no output still emit {}, giving every command a uniform JSON contract. Show commands call
emitProto(), which uses protojson.Marshal from the Go protobuf library to emit the proto
message with its canonical field names. Composite views (like show group <id> traffic) assemble
a custom struct using protoField() to embed individual messages, then call emitJSON() to emit
the whole thing. Errors always render as {"error": "..."}, which brings an immediate consistency
across vpp-maglev and vpp-evpn, and any future projects I may dabble with, all for free.
A few examples that appear in the screencast below:
pim@squanchy:~$ evpnc
evpn> show instance # list fleet members (text)
evpn> show instance dpu0-ddln0 # show the details of a given instance
evpn> group test bvi set primary dpu0-chplo0 # move the primary to another instance
evpn> quit
pim@squanchy:~$ evpnc -json show group test | jq .groupBvi # pipe group BVI into jq
{
  "mac": "42:6c:fa:d6:82:98",
  "addresses": [
    {
      "address": "172.16.0.32",
      "prefixLen": 24
    },
    {
      "address": "fec0::32",
      "prefixLen": 64
    }
  ],
  "createdAt": "1780525048",
  "modifiedAt": "1780669810"
}
Results
I think showing the end to end interaction is best done in a two minute asciinema video:
In this test setup, I have two vpp-evpnd instances connected to the central registry. One of them
is in Zurich (dpu0-ddln0) and the other in Geneva (dpu0-chplo0). I’ll create a test group, add both
instances to it, which shows them gaining a new interface in Linux called bvi-test. Then I’ll add
a GroupBVI with an IPv4 and IPv6 address, and assign it to one of the instances. In the bottom of
the screen, you see another host in that eVPN network start to ping (at 0.8ms) with the GroupBVI in
Zurich. When failing over, you can see one or two ping packets lost, as the management plane does
its break-before-make migration of the GroupBVI, and then the ping packets come back at 5ms because
the active gateway is in Geneva. After flipping back and forth a few times, I delete the primary,
which unassigns the GroupBVI - consequently pings now stop. Finally, I delete the whole group and
the InstanceBVIs are cleaned up.
What’s next
I can now add/remove VPP nodes in a common eVPN registry, ask them to join/leave an eVPN layer2 bridge domain, and manually move IPv4/IPv6 addresses between participating nodes. While doing this via gRPC is really cool, the eventual goal is self-healing, because I may not be around to rescue an unfit VPP router on a Sunday morning at 3am. So what’s next for me is to add a health checking system, sort of like CARP or VRRP, that can coordinate migration of the IPv4/IPv6 addresses between participating nodes autonomously. This code would then be able to turn the sending of heartbeats on and off for both the primary and any candidate nodes that are ready to take over, and under certain conditions issue the gRPC to move the primary to a different node.
I am also pretty keen on replicating the [VPP Maglev] frontend so that I can see the (growing) fleet at a glance, and possibly trigger failovers from the comfort of my web browser. Stay tuned!