I’m one of those people who is a fan of low-latency, high-performance distributed service architectures. After building out the IPng Network across Europe, I noticed a rather stark difference in the presence of one particular service: AS112 anycast nameservers. In particular, I have only one Internet Exchange in common with a direct presence of AS112: FCIX in California. Big-up to the kind folks in Fremont who operate www.as112.net.
The Problem
Looking around Switzerland, no internet exchange actually has AS112 as a direct member, and as such you’ll find the service tucked away behind several ISPs, with AS paths such as 13030 29670 112, 6939 112 and 34019 112. A traceroute from a popular Swiss ISP, Init7, goes to Germany, at a round-trip latency of 18.9ms. My own latency is 146ms, as my queries are served from FCIX:
pim@spongebob:~$ traceroute prisoner.iana.org
traceroute to prisoner.iana.org (192.175.48.1), 64 hops max, 40 byte packets
1 fiber7.xe8.chbtl0.ipng.ch (194.126.235.33) 2.658 ms 0.754 ms 0.523 ms
2 1790bre1.fiber7.init7.net (81.6.42.1) 1.132 ms 1.077 ms 3.621 ms
3 780eff1.fiber7.init7.net (109.202.193.44) 1.238 ms 1.162 ms 1.188 ms
4 r1win12.core.init7.net (77.109.181.155) 2.096 ms 2.1 ms 2.1 ms
5 r1zrh6.core.init7.net (82.197.168.222) 2.086 ms 3.904 ms 2.183 ms
6 r1glb1.core.init7.net (5.180.135.134) 2.043 ms 3.621 ms 2.088 ms
7 r2zrh2.core.init7.net (82.197.163.213) 2.353 ms 2.522 ms 2.289 ms
8 r2zrh2.core.init7.net (5.180.135.156) 2.08 ms 2.299 ms 2.202 ms
9 r1fra3.core.init7.net (5.180.135.173) 7.65 ms 7.582 ms 7.546 ms
10 r1fra2.core.init7.net (5.180.135.126) 7.928 ms 7.831 ms 7.997 ms
11 r1ber1.core.init7.net (77.109.129.8) 19.395 ms 19.287 ms 19.558 ms
12 octalus.in-berlin.a36.community-ix.de (185.1.74.3) 18.839 ms 18.717 ms 29.615 ms
13 prisoner.iana.org (192.175.48.1) 18.536 ms 18.613 ms 18.766 ms
pim@chumbucket:~$ traceroute blackhole-1.iana.org
traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.247 ms 0.158 ms 0.107 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.514 ms 0.474 ms 0.419 ms
3 usfmt0.ipng.ch (194.1.163.23) 146.451 ms 146.406 ms 146.364 ms
4 blackhole-1.iana.org (192.175.48.6) 146.323 ms 146.281 ms 146.239 ms
This path goes to FCIX because it’s the only place where AS50869 picks up AS112 directly at an internet exchange, so local preference makes this route preferred. But that’s a long way to go for my DNS queries!
I think we can do better.
Introduction
Taken from RFC7534:
Many sites connected to the Internet make use of IPv4 addresses that are not globally unique. Examples are the addresses designated in RFC 1918 for private use within individual sites.
Devices in such environments may occasionally originate Domain Name System (DNS) queries (so-called “reverse lookups”) corresponding to those private-use addresses. Since the addresses concerned have only local significance, it is good practice for site administrators to ensure that such queries are answered locally. However, it is not uncommon for such queries to follow the normal delegation path in the public DNS instead of being answered within the site.
It is not possible for public DNS servers to give useful answers to such queries. In addition, due to the wide deployment of private-use addresses and the continuing growth of the Internet, the volume of such queries is large and growing. The AS112 project aims to provide a distributed sink for such queries in order to reduce the load on the corresponding authoritative servers. The AS112 project is named after the Autonomous System Number (ASN) that was assigned to it.
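To make this concrete: when a reverse lookup for an RFC 1918 address leaks out of a site, it lands on whichever AS112 node is nearest, which answers authoritatively that the name does not exist. You can observe this yourself (10.1.2.3 is just an arbitrary private-use example address):
$ dig -x 10.1.2.3 @blackhole-1.iana.org +norec
# expect: an authoritative answer with status NXDOMAIN, served by the
# nearest anycast instance of blackhole-1.iana.org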
Deployment
It’s actually quite straightforward; the deployment consists of roughly three steps:
- Procure hardware to run the instances of the nameserver on.
- Configure the nameserver to serve the zonefiles.
- Announce the anycast service locally/regionally.
Let’s discuss each in turn.
Hardware
For the hardware, I’ve decided to use existing server platforms at IP-Max and IPng Networks. There are two types of hardware, both tried and tested: one is an HP ProLiant DL380 Gen9, the other an older Dell PowerEdge R610.
Because each vendor ships its own specific parts and each machine is a little different, many appliance vendors choose to virtualize their environment so that the guest operating system sees a very homogeneous configuration. For my purposes, the virtualization platform is Xen and the guest is a (para)virtualized Debian.
I will be starting with three nodes, one in Geneva and one in Zurich, hosted on hypervisors of IP-Max, and one in Amsterdam, hosted on a hypervisor of IPng. I have a feeling a few more places will follow.
Install the OS
Xen makes this repeatable and straightforward. Other systems, such as KVM, have very similar installers; VMBuilder, for example, is popular. Both work roughly the same way and install a guest in a matter of minutes.
I’ll install to an LVM volume group on all machines, backed by pairs of SSDs for throughput and redundancy. We’ll give the guest 4GB of memory and 4 CPUs. I love how the machine boots using PyGrub, fully on serial, and is fully booted and running in 20 seconds.
sudo xen-create-image --hostname as112-1.free-ix.net --ip 46.20.249.197 \
--vcpus 4 --pygrub --dist buster --lvm=vg1_hvn04_gva20
sudo xl create -c as112-1.free-ix.net.cfg
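Once the guest is created, a quick look on the hypervisor confirms it is up and running:
$ sudo xl list as112-1.free-ix.net    # shows domain id, memory, vcpus and state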
After logging in, the following additional software was installed. We’ll be using Bird2, which is available in Debian Buster’s backports. Otherwise, we’re pretty vanilla:
$ cat << EOF | sudo tee -a /etc/apt/sources.list
#
# Backports
#
deb http://deb.debian.org/debian buster-backports main
EOF
$ sudo apt update
$ sudo apt install tcpdump sudo net-tools bridge-utils nsd bird2 \
netplan.io traceroute ufw curl bind9-dnsutils
$ sudo apt purge ifupdown
I removed the /etc/network/interfaces approach and configured Netplan instead, a personal choice which aligns these machines more closely with other servers in the IPng fleet. The only trick is to ensure that the anycast IP addresses are available for the nameserver to listen on, so at the top of Netplan’s configuration file, we add them like so:
network:
version: 2
renderer: networkd
ethernets:
lo:
addresses:
- 127.0.0.1/8
- ::1/128
- 192.175.48.1/32 # prisoner.iana.org (anycast)
- 2620:4f:8000::1/128 # prisoner.iana.org (anycast)
- 192.175.48.6/32 # blackhole-1.iana.org (anycast)
- 2620:4f:8000::6/128 # blackhole-1.iana.org (anycast)
- 192.175.48.42/32 # blackhole-2.iana.org (anycast)
- 2620:4f:8000::42/128 # blackhole-2.iana.org (anycast)
- 192.31.196.1/32 # blackhole.as112.arpa (anycast)
- 2001:4:112::1/128 # blackhole.as112.arpa (anycast)
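After applying the configuration, the anycast addresses should show up on the loopback interface; a quick sanity check:
$ sudo netplan apply
$ ip -brief address show dev lo   # should list 127.0.0.1, ::1 and all eight anycast addresses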
Nameserver
My nameserver of choice is NSD, and its configuration is similar to the BIND configuration described in RFC7534. In fact, the zone files are identical, so all we need to do is create a few listen statements and load up the zones:
$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/listen.conf
server:
ip-address: 127.0.0.1
ip-address: ::1
ip-address: 46.20.249.197
ip-address: 2a02:2528:a04:202::197
ip-address: 192.175.48.1 # prisoner.iana.org (anycast)
ip-address: 2620:4f:8000::1 # prisoner.iana.org (anycast)
ip-address: 192.175.48.6 # blackhole-1.iana.org (anycast)
ip-address: 2620:4f:8000::6 # blackhole-1.iana.org (anycast)
ip-address: 192.175.48.42 # blackhole-2.iana.org (anycast)
ip-address: 2620:4f:8000::42 # blackhole-2.iana.org (anycast)
ip-address: 192.31.196.1 # blackhole.as112.arpa (anycast)
ip-address: 2001:4:112::1 # blackhole.as112.arpa (anycast)
server-count: 4
EOF
$ cat << EOF | sudo tee /etc/nsd/nsd.conf.d/as112.conf
zone:
name: "hostname.as112.net"
zonefile: "/etc/nsd/master/db.hostname.as112.net"
zone:
name: "hostname.as112.arpa"
zonefile: "/etc/nsd/master/db.hostname.as112.arpa"
zone:
name: "10.in-addr.arpa"
zonefile: "/etc/nsd/master/db.dd-empty"
# etcetera
EOF
While the rest of the zones are all served from db.dd-empty or db.dr-empty, both of which can be found in the RFC text, the top two are special, as they are specific to the instance. For example, on our Geneva instance:
$ cat << EOF | sudo tee /etc/nsd/master/db.hostname.as112.arpa
$TTL 1W
@ SOA chplo01.paphosting.net. noc.ipng.ch. (
1 ; serial number
1W ; refresh
1M ; retry
1W ; expire
1W ) ; negative caching TTL
NS blackhole.as112.arpa.
TXT "AS112 hosted by IPng Networks" "Geneva, Switzerland"
TXT "See https://www.as112.net/ for more information."
TXT "See https://free-ix.net/ for local information."
TXT "Unique IP: 194.1.163.147"
TXT "Unique IP: [2001:678:d78:7::147]"
LOC 46 9 55.501 N 6 6 25.870 E 407.00m 10m 100m 10m
This is super helpful to users who want to know which server, exactly, is serving their request. Not all operators added the Unique IP details, but I found it useful when launching the service, as several anycast nodes quickly become confusing otherwise :-)
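For completeness, the catch-all file that the other zones point at, db.dd-empty, is copied straight from the RFC. Quoting roughly from memory here; use the canonical copy in the RFC text:
; db.dd-empty: empty zone for the direct delegation AS112 service (RFC7534)
$TTL    1W
@  IN  SOA  prisoner.iana.org. hostmaster.root-servers.org. (
                1       ; serial number
                1W      ; refresh
                1M      ; retry
                1W      ; expire
                1W )    ; negative caching TTL
        NS      blackhole-1.iana.org.
        NS      blackhole-2.iana.org.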
After this is all done, the nameserver can be started. I rebooted the guest for good measure, and about 19 seconds later (a fact that continues to amaze me), the server was up and serving queries, albeit only from localhost because there is no way to reach the server on the network, yet.
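For the record, the syntax check and manual (re)start look like this, using Debian’s stock nsd service unit:
$ sudo nsd-checkconf /etc/nsd/nsd.conf   # verify the configuration parses
$ sudo systemctl restart nsd
$ sudo systemctl status nsd              # confirm NSD is up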
To validate things work, we can do a few SOA or TXT queries, like this one:
pim@nlams01:~$ ping -c5 -q prisoner.iana.org
PING prisoner.iana.org(prisoner.iana.org (2620:4f:8000::1)) 56 data bytes
--- prisoner.iana.org ping statistics ---
5 packets transmitted, 5 received, 0% packet loss, time 34ms
rtt min/avg/max/mdev = 0.041/0.045/0.053/0.004 ms
pim@nlams01:~$ dig @prisoner.iana.org hostname.as112.net TXT +short +norec
"AS112 hosted by IPng Networks" "Amsterdam, The Netherlands"
"See http://www.as112.net/ for more information."
"Unique IP: 94.142.241.187"
"Unique IP: [2a02:898:146::2]"
Network
Now comes the fun part! We’re running instances of the nameserver in a few locations, and to ensure traffic doesn’t get routed to the wrong location, we’ll announce the anycast prefixes using BGP, as recommended by RFC7534.
My routing suite of choice is Bird2, which comes with a lot of extensibility and programmatic validation of routing policies.
We’ll only be using the static and BGP routing protocols in Bird, so the configuration is relatively straightforward: first we create a routing table export for IPv4 and IPv6; then we define some static nullroutes, which ensure that our prefixes are always present in the RIB (otherwise BGP will not export them); then we create some filter functions (one for routeserver sessions, one for peering sessions, and one for transit sessions); and finally we include a few specific configuration files, one per environment where we’ll be active.
$ cat << EOF | sudo tee /etc/bird/bird.conf
router id 46.20.249.197;
protocol kernel fib4 {
ipv4 { export all; };
scan time 60;
}
protocol kernel fib6 {
ipv6 { export all; };
scan time 60;
}
protocol static static_as112_ipv4 {
ipv4;
route 192.175.48.0/24 blackhole;
route 192.31.196.0/24 blackhole;
}
protocol static static_as112_ipv6 {
ipv6;
route 2620:4f:8000::/48 blackhole;
route 2001:4:112::/48 blackhole;
}
include "bgp-freeix.conf";
include "bgp-ipng.conf";
include "bgp-ipmax.conf";
EOF
The configuration file per environment, say bgp-freeix.conf, can (and will) be autogenerated, but the pattern is of the following form:
$ cat << EOF | tee /etc/bird/bgp-freeix.conf
#
# Bird AS112 configuration for FreeIX
#
define my_ipv4 = 185.1.205.252;
define my_ipv6 = 2001:7f8:111:42::70:1;
protocol bgp freeix_as51530_1_ipv4 {
description "FreeIX - AS51530 - Routeserver #1";
local as 112;
source address my_ipv4;
neighbor 185.1.205.254 as 51530;
ipv4 {
import where fn_import_routeserver( 51530 );
export where proto = "static_as112_ipv4";
import limit 120000 action restart;
};
}
protocol bgp freeix_as51530_1_ipv6 {
description "FreeIX - AS51530 - Routeserver #1";
local as 112;
source address my_ipv6;
neighbor 2001:7f8:111:42::c94a:1 as 51530;
ipv6 {
import where fn_import_routeserver( 51530 );
export where proto = "static_as112_ipv6";
import limit 120000 action restart;
};
}
# etcetera
EOF
If you’ve seen IXPManager’s approach to routeserver configuration generators, you’ll notice I borrowed the fn_import() function and its dependents from there. This constrains imports with prefix-lists and AS-path filters, and puts some belt-and-braces checks in place (no invalid or tier-1 ASN in the path, a valid next hop, no tricks with AS path truncation, and so on).
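To give an idea of the shape of such a filter, here is a heavily simplified sketch in Bird2 syntax; the real fn_import_routeserver() borrowed from IXPManager does considerably more (per-peer prefix-lists, next-hop validation, and so on):
# Simplified sketch only; not the actual IXPManager-generated filter.
define BOGON_ASNS = [ 0, 23456, 64496..131071, 4200000000..4294967295 ];

function fn_import_routeserver(int peer_asn)
{
  # peer_asn identifies the session; the full version uses it to select
  # per-peer prefix filters. It is not used in this sketch.
  if ( bgp_path.len > 64 ) then return false;       # absurdly long AS path
  if ( bgp_path ~ BOGON_ASNS ) then return false;   # reserved/private/documentation ASN in path
  return true;
}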
After bringing up the service, the prefixes make their way into the routeserver and get distributed to the FreeIX participants:
$ sudo systemctl start bird
$ sudo birdc show protocol
BIRD 2.0.7 ready.
Name Proto Table State Since Info
fib4 Kernel master4 up 2021-06-28 11:01:35
fib6 Kernel master6 up 2021-06-28 11:01:35
device1 Device --- up 2021-06-28 11:01:35
static_as112_ipv4 Static master4 up 2021-06-28 11:01:35
static_as112_ipv6 Static master6 up 2021-06-28 11:01:35
freeix_as51530_1_ipv4 BGP --- up 2021-06-28 11:01:17 Established
freeix_as51530_1_ipv6 BGP --- up 2021-06-28 11:01:19 Established
freeix_as51530_2_ipv4 BGP --- up 2021-06-28 11:01:32 Established
freeix_as51530_2_ipv6 BGP --- up 2021-06-28 11:01:37 Established
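Besides the protocol state, it’s worth double-checking that only the two IPv4 and two IPv6 AS112 prefixes are actually being exported on each session, for example:
$ sudo birdc show route export freeix_as51530_1_ipv4
$ sudo birdc show route export freeix_as51530_1_ipv6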
Internet Exchanges
Having one configuration file per group helps a lot with IXPManager integration, where we might autogenerate the per-IXP versions of these files and install them periodically. That way, when members tick the AS112 peering checkbox, the servers will automatically download and set up those sessions without human involvement. Typically this is the best way to avoid outages: never tinker with production config files by hand. We’ll test this out with FreeIX, but we also hope to offer the service to other internet exchanges, notably SwissIX and CIXP.
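A minimal sketch of such a periodic refresh could look like the following; the download URL is hypothetical, and a real version would want locking and a staging directory:
#!/bin/sh
# Hypothetical refresh job: fetch the generated per-IXP session config,
# then reload Bird only if the resulting configuration still parses.
set -e
curl -fsSL https://ixpmanager.example.net/api/bgp-freeix.conf \
    -o /etc/bird/bgp-freeix.conf
if bird -p -c /etc/bird/bird.conf; then
    birdc configure        # Bird reloads without dropping unchanged sessions
else
    echo "generated config does not parse; not reloading" >&2
fi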
One of the huge benefits of operating within the IP-Max network is the ability to do L2VPN transport from any place on-net to any other router. As such, connecting these virtual machines to other places, like SwissIX, CIXP, CHIX-CH, Community-IX or places further away, is a piece of cake. All we need to do is create an L2VPN and deliver it to the hypervisor (which is usually connected via a LACP BundleEthernet) on some VLAN, after which we can bridge it into the guest OS as a new virtual NIC. This is how, in the example above, our AS112 machines were introduced to FreeIX. It scales very well, requiring only one guest reboot per internet exchange, and greatly simplifies operations.
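As an illustration (the bridge names and VLAN here are made up), adding such an IXP-facing NIC is a one-line change to the guest’s xl configuration on the hypervisor, followed by one reboot of the guest:
# Fragment of the guest config created by xen-create-image (typically under
# /etc/xen). xenbr0 carries the existing management/transit network;
# xenbr610 is a hypothetical bridge on the VLAN delivering the IXP L2VPN.
vif = [ 'bridge=xenbr0', 'bridge=xenbr610' ]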
Monitoring
Of course, one would not want to run a production service, certainly not on the public internet, without a bit of introspection and monitoring.
There are four things we might want to ensure:
- Is the machine up and healthy? For this we use NAGIOS.
- Is NSD serving? For this we use NSD Exporter and Prometheus/Grafana.
- Is NSD reachable? For this we use CloudProber.
- If there is an issue, can we alert an operator? For this we use Telegram.
In a followup post, I’ll demonstrate how these things come together into a comprehensive anycast monitoring and alerting solution. As a fringe benefit we can show contemporary graphs and dashboards. But seeing as the service hasn’t yet gotten a lot of mileage, it deserves its own followup post, some time in August.
The results
First things first - latency went waaaay down:
pim@chumbucket:~$ traceroute blackhole-1.iana.org
traceroute to blackhole-1.iana.org (192.175.48.6), 30 hops max, 60 byte packets
1 chbtl1.ipng.ch (194.1.163.67) 0.257 ms 0.199 ms 0.159 ms
2 chgtg0.ipng.ch (194.1.163.19) 0.468 ms 0.430 ms 0.430 ms
3 chrma0.ipng.ch (194.1.163.8) 0.648 ms 0.611 ms 0.597 ms
4 blackhole-1.iana.org (192.175.48.6) 1.272 ms 1.236 ms 1.201 ms
pim@chumbucket:~$ dig -6 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp
"Free-IX hosted by IP-Max SA" "Zurich, Switzerland"
"See https://www.as112.net/ for more information."
"See https://free-ix.net/ for local information."
"Unique IP: 46.20.246.67"
"Unique IP: [2a02:2528:1703::67]"
and this demonstrates why it’s super useful to have the hostname.as112.net entry populated well. If I’m in Amsterdam, I’ll be served by the local node there:
pim@gripe:~$ traceroute6 blackhole-2.iana.org
traceroute6 to blackhole-2.iana.org (2620:4f:8000::42), 64 hops max, 60 byte packets
1 nlams0.ipng.ch (2a02:898:146::1) 0.744 ms 0.879 ms 0.818 ms
2 blackhole-2.iana.org (2620:4f:8000::42) 1.104 ms 1.064 ms 1.035 ms
pim@gripe:~$ dig -4 @prisoner.iana.org hostname.as112.net txt +short +norec +tcp
"Hosted by IPng Networks" "Amsterdam, The Netherlands"
"See http://www.as112.net/ for more information."
"Unique IP: 94.142.241.187"
"Unique IP: [2a02:898:146::2]"
Of course, due to anycast, and me being in Zurich, I will be served primarily by the Zurich node. If it were to go down, for maintenance or due to hardware failure, BGP would immediately converge on alternate paths; there are currently several to choose from:
pim@chrma0:~$ show protocols bgp ipv4 unicast 192.31.196.0/24
BGP routing table entry for 192.31.196.0/24
Paths: (10 available, best #2, table default)
Advertised to non peer-group peers:
185.1.205.251 194.1.163.1 [...]
112
194.1.163.32 (metric 137) from 194.1.163.32 (194.1.163.32)
Origin IGP, localpref 400, valid, internal
Community: 50869:3500 50869:4099 50869:5055
Last update: Mon Jun 28 11:13:14 2021
112
185.1.205.251 from 185.1.205.251 (46.20.246.67)
Origin IGP, localpref 400, valid, external, bestpath-from-AS 112, best (Local Pref)
Community: 50869:3500 50869:4099 50869:5000 50869:5020 50869:5060
Last update: Mon Jun 28 11:00:45 2021
112
185.1.205.251 from 185.1.205.253 (185.1.205.253)
Origin IGP, localpref 200, valid, external
Community: 50869:1061
Last update: Mon Jun 28 11:00:20 2021
(and more)
I am expecting a few more direct paths to come, as I harden this service and offer it to other Swiss internet exchange points in the future. But mostly, my mission of reducing the round-trip time from 146ms to 1ms from my desktop at home was successfully accomplished.