VPP Policers


About this series

Ever since I first saw VPP - the Vector Packet Processor - I have been deeply impressed with its performance and versatility. For those of us who have used Cisco IOS/XR devices, like the classic ASR (Aggregation Services Router), VPP will look and feel quite familiar, as many of the approaches are shared between the two.

There are some really fantastic features in VPP, some of which are less well known and not always very well documented. In this article, I will describe a unique use case in which I think VPP will excel, notably acting as a gateway for Internet Exchange Points.

A few years ago, I toyed with the idea of using VPP as an IXP Reseller concentrator, allowing several carriers to connect with, say, 10G or 25G ports, and carry sub-customers on tagged interfaces with safety (like MAC address ACLs) and rate limiting (say, any given customer limited to 1Gbps on a 10G or 100G trunk), all provided by VPP. You can take a look at my [VPP IXP Gateway] article for details. I never ended up deploying it.

In this article, I follow up and fix a few shortcomings in VPP’s policer framework.

Introduction

Consider the following policer in VPP:

vpp# policer add name client-a rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name client-a GigabitEthernet10/0/1
vpp# policer output name client-a GigabitEthernet10/0/1

The idea is to give a committed information rate of 150Mbps with a committed burst size of 15MB. The CIR represents the average bandwidth allowed for the interface, while the CB represents the maximum amount of data (in bytes) that can be sent at line speed in a single burst before the CIR kicks in to throttle the traffic.
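Under the hood this is classic token-bucket policing. To build intuition for how CIR and CB interact, here is a minimal single-rate, two-color sketch in Python; the class and method names are mine and do not reflect VPP's actual implementation:

```python
# Minimal single-rate, two-color token bucket. CIR sets the refill
# rate of the bucket, CB sets its depth; a packet conforms when
# enough tokens are available, otherwise it violates.
class TokenBucket:
    def __init__(self, cir_kbps, cb_bytes):
        self.rate = cir_kbps * 1000 / 8.0  # refill rate in bytes/sec
        self.cb = cb_bytes                 # bucket depth in bytes
        self.tokens = float(cb_bytes)      # start with a full bucket
        self.last = 0.0                    # timestamp of last packet

    def police(self, now, pkt_bytes):
        # Refill for the elapsed time, capped at the committed burst
        self.tokens = min(self.cb, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if pkt_bytes <= self.tokens:
            self.tokens -= pkt_bytes
            return "conform"
        return "violate"

# The policer above: 150 Mbps average, 15 MB burst at line rate
pol = TokenBucket(cir_kbps=150000, cb_bytes=15000000)
```

With packets arriving back-to-back, the first 15MB conform immediately; after that, traffic is admitted at the refill rate of 150Mbps (18.75MB/s).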

Back in October of 2023, I reached the conclusion that the policer works in the following modes:

  • On input, the policer is applied on device-input, which means it takes frames directly from the PHY. It will not work on any sub-interfaces. This explains why the policer worked on untagged (Gi10/0/1) but not on tagged (Gi10/0/1.100) sub-interfaces.
  • On output, the policer is applied on ip4-output and ip6-output, which works only for L3 enabled interfaces, not for L2 interfaces such as those used in bridge domains or L2 cross-connects.

VPP Infra: L2 Feature Maps

The benefit of using the device-input input arc is that it's efficient: every packet that comes from the device (Gi10/0/1), tagged or not, will be handed off to the policer plugin. It means any traffic (L2, L3, sub-interface, tagged, untagged) will all go through the same policer.

In src/vnet/l2/ there are two nodes called l2-input and l2-output. I can configure VPP to call these nodes before ip[46]-unicast and before ip[46]-output respectively. These L2 nodes have a feature bitmap with 32 entries. The l2-input / l2-output nodes use a bitmap walk: they find the highest set bit, and then dispatch the packet to a pre-configured graph node. Upon return, the feat-bitmap-next checks the next bit, and if that one is set, dispatches the packet to the next pre-configured graph node. This continues until all the bits are checked, each set bit having handed the packet to its respective graph node.
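The walk itself is easy to mimic. Here is a sketch in Python of the high-to-low bitmap dispatch, assuming a handful of the l2-input feature bits from l2_input.h; the node names are illustrative, not a complete list:

```python
# Sketch of the l2-input feature bitmap walk: take the highest set
# bit, dispatch to its graph node, clear that bit, and repeat until
# the bitmap is empty. Bit positions follow l2_input.h.
L2INPUT_NODES = {
    14: "l2-policer-classify",  # POLICER_CLAS
    10: "l2-input-vtr",         # VTR
    9: "l2-learn",              # LEARN
    7: "l2-fwd",                # FWD
    1: "l2-xconnect",           # XCONNECT
    0: "feature-bitmap-drop",   # DROP
}

def walk_features(bitmap):
    dispatched = []
    while bitmap:
        bit = bitmap.bit_length() - 1     # highest set bit
        dispatched.append(L2INPUT_NODES.get(bit, f"bit-{bit}"))
        bitmap &= ~(1 << bit)             # the node clears its own bit
    return dispatched

print(walk_features((1 << 10) | (1 << 1)))  # ['l2-input-vtr', 'l2-xconnect']
```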

To show what I can do with these nodes, let me dive into an example. When a packet arrives on an interface configured in L2 mode, either because it's a bridge-domain or an L2XC, ethernet-input will send it to l2-input. This node does three things:

  1. It will classify the packet, by reading the interface configuration (l2input_main.configs) for the sw_if_index, which contains the mode of the interface (bridge-domain, l2xc, or bvi). It also contains the feature bitmap: a statically configured set of features for this interface.

  2. It will store the effective feature bitmap for each individual packet in the packet buffer. For bridge mode, depending on the packet being unicast or multicast, some features are disabled. For example, flooding is not performed for unicast packets, so those bits are cleared. The result is stored in a per-packet working copy that downstream nodes consume in turn.

  3. For each of the bits set in the packet buffer’s l2.feature_bitmap, starting from highest bit set, l2-input will set the next node, for example l2-input-vtr to do VLAN Tag Rewriting. Once that node is finished, it’ll clear its own bit, and search for the next one set, in order to set a new node.

I note that processing order is HIGH to LOW bits. By reading l2_input.h, I can see that the full l2-input chain looks like this:

l2-input 
  → SPAN(17) → INPUT_CLASSIFY(16) → INPUT_FEAT_ARC(15) → POLICER_CLAS(14)
  → ACL(13) → VPATH(12) → L2_IP_QOS_RECORD(11) → VTR(10) → LEARN(9) → RW(8)
  → FWD(7) → UU_FWD(6) → UU_FLOOD(5) → ARP_TERM(4) → ARP_UFWD(3) → FLOOD(2)
  → XCONNECT(1) → DROP(0)

l2-output
  → XCRW(12) → OUTPUT_FEAT_ARC(11) → OUTPUT_CLASSIFY(10) → LINESTATUS_DOWN(9)
  → STP_BLOCKED(8) → IPIW(7) → EFP_FILTER(6) → L2PT(5) → ACL(4) → QOS(3)
  → CFM(2) → SPAN(1) → OUTPUT(0)

If none of the L2 processing nodes set the next node, ultimately feature-bitmap-drop gently takes the packet behind the shed and drops it. On the way out, ultimately the last OUTPUT bit sends the packet to interface-output, which hands off to the driver’s TX node.

Enabling L2 features

There are lots of places in VPP where L2 feature bitmaps are set/cleared. Here are a few examples:

# VTR: sets L2INPUT_FEAT_VTR + configures output VTR (VLAN Tag Rewriting)
vpp# set interface l2 tag-rewrite GigE0/0/0.100 pop 1

# ACL: sets L2INPUT_FEAT_ACL / L2OUTPUT_FEAT_ACL
vpp# set interface l2 input acl intfc GigE0/0/0 ip4-table 0
vpp# set interface l2 output acl intfc GigE0/0/0 ip4-table 0

# SPAN: sets L2INPUT_FEAT_SPAN / L2OUTPUT_FEAT_SPAN
vpp# set interface span GigE0/0/0 l2 destination GigE0/0/1

# Bridge domain level (affects bd_feature_bitmap, applied to all bridge members)
vpp# set bridge-domain learn 1     # enable/disable LEARN in BD
vpp# set bridge-domain forward 1   # enable/disable FWD in BD
vpp# set bridge-domain flood 1     # enable/disable FLOOD in BD

I’m starting to see how these L2 feature bitmaps are super powerful, yet flexible. I’m ready to add one!

Creating L2 features

First, I need to insert my new POLICER bit in l2_input.h and l2_output.h. Then, I can call l2input_intf_bitmap_enable() and its companion l2output_intf_bitmap_enable() to enable or disable the L2 feature, and point it at a new graph node.

 /* Enable policer both on L2 feature bitmap, and L3 feature arcs */
 if (dir == VLIB_RX) {
   l2input_intf_bitmap_enable (sw_if_index, L2INPUT_FEAT_POLICER, apply);
   vnet_feature_enable_disable ("ip4-unicast", "policer-input", sw_if_index, apply, 0, 0);
   vnet_feature_enable_disable ("ip6-unicast", "policer-input", sw_if_index, apply, 0, 0);
 } else {
   l2output_intf_bitmap_enable (sw_if_index, L2OUTPUT_FEAT_POLICER, apply);
   vnet_feature_enable_disable ("ip4-output", "policer-output", sw_if_index, apply, 0, 0);
   vnet_feature_enable_disable ("ip6-output", "policer-output", sw_if_index, apply, 0, 0);
 }

What this means is that if the interface happens to be in L2 mode, in other words when it is a bridge-domain member or when it is in L2XC mode, I will enable the L2 features. However, for L3 packets, I will still proceed to enable the existing policer-input node by calling vnet_feature_enable_disable() on the IPv4 and IPv6 input arcs. I make a mental note that MPLS and other non-IP traffic will not be policed in this way.

Updating Policer graph node

The policer framework has an existing dataplane node called vnet_policer_inline() which I extend to take a flag is_l2. Using this flag, I can either set the next graph node to be vnet_l2_feature_next(), or, in the pre-existing L3 case, set vnet_feature_next() on the packets that move through the node. The nodes now look like this:

VLIB_NODE_FN (policer_l2_input_node)
(vlib_main_t *vm, vlib_node_runtime_t *node, vlib_frame_t *frame)
{
  return vnet_policer_inline (vm, node, frame, VLIB_RX, 1 /* is_l2 */);
}

VLIB_REGISTER_NODE (policer_l2_input_node) = {
  .name = "l2-policer-input",
  .vector_size = sizeof (u32),
  .format_trace = format_policer_trace,
  .type = VLIB_NODE_TYPE_INTERNAL,
  .n_errors = ARRAY_LEN(vnet_policer_error_strings),
  .error_strings = vnet_policer_error_strings,
  .n_next_nodes = VNET_POLICER_N_NEXT,
  .next_nodes = {
                [VNET_POLICER_NEXT_DROP] = "error-drop",
                [VNET_POLICER_NEXT_HANDOFF] = "policer-input-handoff",
                },
};

/* Register on IP unicast arcs for L3 routed sub-interfaces */
VNET_FEATURE_INIT (policer_ip4_unicast, static) = {
  .arc_name = "ip4-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip4-lookup"),
};

VNET_FEATURE_INIT (policer_ip6_unicast, static) = {
  .arc_name = "ip6-unicast",
  .node_name = "policer-input",
  .runs_before = VNET_FEATURES ("ip6-lookup"),
};

Here, I install the L3 feature before ip[46]-lookup, and hook up the L2 feature with a new node that really just calls the existing node but with is_l2 set to true. I do something very similar for the output direction, except there I’ll hook the L3 feature before ip[46]-output.

Tests!

I think writing unit- and integration tests is a great idea. I add a new file test/test_policer_subif.py which actually tests all four new cases:

  1. L3 Input: on a routed sub-interface
  2. L3 Output: on a routed sub-interface
  3. L2 Input: on a bridge-domain sub-interface
  4. L2 Output: on a bridge-domain sub-interface

The existing test/test_policer.py should also cover the existing cases, and of course it's important that my work does not break them. Lucky me, the existing tests all still pass :)

Test: L3 in/output

The tests use a VPP feature called packet-generator, which creates virtual devices upon which I can emit packets using Scapy, and use pcap to receive them. For the input case, first I'll create the interface and apply a new policer to it:

  sub_if0 = VppDot1QSubint(self, self.pg0, 10)
  sub_if0.admin_up()
  sub_if0.config_ip4()
  sub_if0.resolve_arp()

  # Create policer
  action_tx = PolicerAction(VppEnum.vl_api_sse2_qos_action_type_t.SSE2_QOS_ACTION_API_TRANSMIT, 0)
  policer = VppPolicer(self, "subif_l3_pol", 80, 0, 1000, 0,
      conform_action=action_tx, exceed_action=action_tx, violate_action=action_tx,
  )
  policer.add_vpp_config()

  # Apply policer to sub-interface input on pg0
  policer.apply_vpp_config(sub_if0.sw_if_index, Dir.RX, True)

The policer, named subif_l3_pol, has a CIR of 80kbps, an EIR of 0kbps, a CB of 1000 bytes, and an EB of 0 bytes, and otherwise always accepts packets. I do this so that I can eventually detect if and how many packets were seen, and how many bytes passed through the conform and violate actions.

Next, I can generate a few packets and send them out from pg0, and wait to receive them on pg1:

  # Send packets with VLAN tag from sub_if0 to sub_if1
  pkts = []
  for i in range(NUM_PKTS): # NUM_PKTS = 67
      pkt = (
          Ether(src=self.pg0.remote_mac, dst=self.pg0.local_mac) / Dot1Q(vlan=10)
          / IP(src=sub_if0.remote_ip4, dst=sub_if1.remote_ip4) / UDP(sport=1234, dport=1234)
          / Raw(b"\xa5" * 100)
      )
      pkts.append(pkt)

  # Send and verify packets are policed and forwarded
  rx = self.send_and_expect(self.pg0, pkts, self.pg1)

  stats = policer.get_stats()
  # Verify policing happened
  self.assertGreater(stats["conform_packets"], 0)
  self.assertEqual(stats["exceed_packets"], 0)
  self.assertGreater(stats["violate_packets"], 0)

  self.logger.info(f"L3 sub-interface input policer stats: {stats}")

Similar to the L3 sub-interface input policer, I also write a test for the L3 sub-interface output policer. The only difference between the two is that in the output case, the policer is applied to pg1 in the Dir.TX direction, while in the input case, it's applied to pg0 in the Dir.RX direction.

I can predict the outcome. Every packet is exactly 146 bytes:

  • 14 bytes src/dst MAC in Ether()
  • 4 bytes VLAN tag (10) in Dot1Q()
  • 20 bytes IPv4 header in IP()
  • 8 bytes UDP header in UDP()
  • 100 bytes of additional payload.

When allowing a burst of 1000 bytes, that means 6 packets (876 bytes) should make it through in the conform bucket, while the other 61 should land in the violate bucket. I won't see any packets in the exceed bucket, because the policer I created is a simple one-rate, two-color (1R2C) policer with EB set to 0, so every non-conforming packet goes straight to violate as there is no extra budget in the exceed bucket. However, they are all transmitted, because the action was set to transmit in all cases.
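The arithmetic behind this prediction fits in a few lines, assuming all 67 packets arrive back-to-back with no token refill in between:

```python
# Expected conform/violate split: 67 packets of 146 bytes each,
# against a committed burst of 1000 bytes and no refill.
NUM_PKTS = 67
PKT_LEN = 14 + 4 + 20 + 8 + 100  # Ether + Dot1Q + IPv4 + UDP + payload = 146
CB = 1000

conform = CB // PKT_LEN            # packets that fit within the burst
violate = NUM_PKTS - conform
print(conform, conform * PKT_LEN)  # 6 876
print(violate, violate * PKT_LEN)  # 61 8906
```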

pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 
15:21:46,868 L3 sub-interface input policer stats: {'conform_packets': 7, 'conform_bytes': 896,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 60, 'violate_bytes': 7680}
15:21:47,919 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
Whoops! So much for predicting the outcome! I see that 7 packets (896 bytes) make it through on input, while only 6 packets (876 bytes) make it through on output. In the input case, the packet size is 896/7 = 128 bytes, which is 18 bytes short. What's going on?

Side Quest: Policer Accounting

On the vpp-dev mailing list, Ben points out that the accounting changes when moving from device-input to ip[46]-input, because after device-input, the packet buffer has been advanced to the L3 portion and starts at the IPv4 or IPv6 header. Considering I was using dot1q tagged sub-interfaces, that means I will be short exactly 18 bytes. The reason this does not happen on the way out is that ip[46]-rewrite has already wound the buffer back to insert the Ethernet encapsulation, so no adjustment is needed there.
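In other words, the 18 missing bytes are exactly the encapsulation that the input path no longer sees on a dot1q sub-interface:

```python
# Overhead stripped before the policer on input for a dot1q
# sub-interface: the Ethernet header plus one 802.1Q tag.
ETHERNET_HDR = 14  # dst MAC (6) + src MAC (6) + ethertype (2)
DOT1Q_TAG = 4      # one 802.1Q VLAN tag
overhead = ETHERNET_HDR + DOT1Q_TAG
print(overhead)        # 18 bytes
print(146 - overhead)  # 128 bytes, matching the observed 896 / 7
```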

Ben also points out that when applying the policer to the interface, I can detect at creation time if it's a PHY, a single-tagged or a double-tagged interface, and store some information to help correct the accounting. We discuss it a bit on the mailing list, and agree that it's best for all four cases (L2 input/output and L3 input/output) to use the full L2 frame bytes in the accounting, which has the added benefit of remaining backwards compatible with the device-input accounting. Chapeau, Ben, you're so clever!

I add a little helper function:

static u8 vnet_policer_compute_l2_overhead (vnet_main_t *vnm, u32 sw_if_index, vlib_dir_t dir)
{
  if (dir == VLIB_TX) return 0;

  vnet_hw_interface_t *hi = vnet_get_sup_hw_interface (vnm, sw_if_index);
  if (PREDICT_FALSE (hi->hw_class_index != ethernet_hw_interface_class.index))
    return 0; /* Not Ethernet */

  vnet_sw_interface_t *si = vnet_get_sw_interface (vnm, sw_if_index);
  if (si->type == VNET_SW_INTERFACE_TYPE_SUB) {
    if (si->sub.eth.flags.one_tag)  return 18; /* Ethernet + single VLAN */
    if (si->sub.eth.flags.two_tags) return 22; /* Ethernet + QinQ */
  }

  return 14; /* Untagged Ethernet */
}

And in the policer struct, I also add an l2_overhead_by_sw_if_index[dir][sw_if_index] table to store these values. That way, I do not need to do this calculation for every packet in the dataplane, but can just blindly add the value I pre-computed at creation time. This is safe, because sub-interfaces cannot change their encapsulation after being created.

In the vnet_policer_police() dataplane function, I add an l2_overhead argument, and then call it like so:

  u16 l2_overhead0 = (is_l2) ? 0 : pm->l2_overhead_by_sw_if_index[dir][sw_if_index0];
  act0 = vnet_policer_police (vm, b0, pi0, ..., l2_overhead0);

And with that, my two tests give the same results:

pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'policer stats'
15:38:39,720 L3 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:38:40,715 L3 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}

Yaay, great success!

Test: L2 in/output

The tests for the L2 input and output case are not radically different. In the setup, rather than giving the VLAN sub-interfaces an IPv4 address, I’ll just add them to a bridge-domain:

  # Create VLAN sub-interfaces on pg0 and pg1
  sub_if0 = VppDot1QSubint(self, self.pg0, 30)
  sub_if0.admin_up()
  sub_if1 = VppDot1QSubint(self, self.pg1, 30)
  sub_if1.admin_up()

  # Add both sub-interfaces to bridge domain 1
  self.vapi.sw_interface_set_l2_bridge(sub_if0.sw_if_index, bd_id=1)
  self.vapi.sw_interface_set_l2_bridge(sub_if1.sw_if_index, bd_id=1)

This puts the sub-interfaces in L2 mode, after which the l2-input and l2-output feature bitmaps kick in. Without further ado:

pim@summer:~/src/vpp$ make test-debug TEST=test_policer_subif V=2 | grep 'L2.*policer stats'
15:50:15,217 L2 sub-interface input policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}
15:50:16,217 L2 sub-interface output policer stats: {'conform_packets': 6, 'conform_bytes': 876,
  'exceed_packets': 0, 'exceed_bytes': 0, 'violate_packets': 61, 'violate_bytes': 8906}

Results

The policer works in all sorts of cool scenarios now. Let me give a concrete example, where I create an L2XC with VTR and then apply a policer. I've written about VTR, which stands for VLAN Tag Rewriting, before, in an old article lovingly called [VPP VLAN Gymnastics]. It all looks like this:

vpp# create sub Gi10/0/0 100
vpp# create sub Gi10/0/1 200
vpp# set interface l2 xconnect Gi10/0/0.100 Gi10/0/1.200
vpp# set interface l2 xconnect Gi10/0/1.200 Gi10/0/0.100
vpp# set interface l2 tag-rewrite Gi10/0/0.100 pop 1
vpp# set interface l2 tag-rewrite Gi10/0/1.200 pop 1
vpp# policer add name pol-test rate kbps cir 150000 cb 15000000 conform-action transmit
vpp# policer input name pol-test Gi10/0/0.100

After applying this configuration, the input bitmap on Gi10/0/0.100 becomes POLICER(14) | VTR(10) | XCONNECT(1) | DROP(0). Packets now take the following path through the dataplane:

ethernet-input
  → l2-input (computes bitmap, dispatches to bit 14)
  → l2-policer-input (clears bit 14, polices, dispatches to bit 10)
  → l2-input-vtr (clears bit 10, pops 1 tag, dispatches to bit 1)
  → l2-output (XCONNECT: sw_if_index[TX]=Gi10/0/1.200)
    → inline output VTR (pushes 1 tag for .200)
  → interface-output
  → Gi10/0/1-tx
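As a sanity check, the feature bitmap for Gi10/0/0.100 works out as follows (bit positions taken from the chain above, with POLICER being the bit this change adds):

```python
# Bitmap on Gi10/0/0.100 after applying the L2XC + VTR + policer
# config: POLICER (14), VTR (10), XCONNECT (1) and DROP (0).
POLICER, VTR, XCONNECT, DROP = 14, 10, 1, 0
bitmap = sum(1 << bit for bit in (POLICER, VTR, XCONNECT, DROP))
print(hex(bitmap))  # 0x4403, walked from high to low: 14, 10, 1, 0
```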

What’s Next

I've sent the change, which was only about ~300 LOC, off for review. You can follow along on the gerrit at [44654]. I don't think the policer got much slower after adding the L2 path, and one might argue it doesn't matter, because policing didn't work at all on sub-interfaces and L2 output before this change. However, for the L3 input/output case, and for the PHY input case, a few CPU cycles are now added to address the L2 and sub-interface use cases. Perhaps I should do a side by side comparison of packets/sec throughput on the bench some time.

It would be great if VPP supported FQ-CoDel (Flow Queue-Controlled Delay), an Active Queue Management (AQM) algorithm and packet scheduler designed to eliminate bufferbloat (high latency caused by excessive buffering in network equipment) while ensuring fair bandwidth distribution among competing traffic flows. I know that Dave Täht - may he rest in peace - always wanted that.

As for me, I've set my sights on EVPN VxLAN, and have started toying with SRv6 as well. I hope that in the spring I'll have a bit more time to contribute to VPP and write about it. Stay tuned!