Certificate Transparency - Part 1

ctlog logo

Introduction

There once was a Dutch company called [DigiNotar]. As the name suggests, it was a form of digital notary, and it was in the business of issuing security certificates. Unfortunately, in June of 2011 its IT infrastructure was compromised, and it subsequently issued hundreds of fraudulent SSL certificates, some of which were used for man-in-the-middle attacks on Iranian Gmail users. Not cool.

Google launched a project called Certificate Transparency, because it had become clear that the root of trust given to Certification Authorities could no longer be unilaterally trusted. These attacks showed that the lack of transparency in the way CAs operated was a significant risk to the Web Public Key Infrastructure. It led to the creation of this ambitious [project] to improve security online by bringing accountability to the system that protects our online services with SSL (Secure Socket Layer) and TLS (Transport Layer Security).

In 2013, [RFC 6962] was published by the IETF. It describes an experimental protocol for publicly logging the existence of Transport Layer Security (TLS) certificates as they are issued or observed, in a manner that allows anyone to audit certificate authority (CA) activity and notice the issuance of suspect certificates as well as to audit the certificate logs themselves. The intent is that eventually clients would refuse to honor certificates that do not appear in a log, effectively forcing CAs to add all issued certificates to the logs.

This series explores and documents how IPng Networks will be running two Static CT Logs with two different implementations. One will be [Sunlight], and the other will be [TesseraCT].

Static Certificate Transparency

In this context, Logs are network services that implement the protocol operations for submissions and queries that are defined in a specification that builds on the previous RFC. A few years ago, my buddy Antonis asked me if I would be willing to run a log, but operationally they were very complex and expensive to run. Over the years, however, the concept of Static Logs has put running one within reach. This [Static CT API] defines a read-path HTTP static asset hierarchy (for monitoring) to be implemented alongside the write-path RFC 6962 endpoints (for submission).
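To give a flavour of that read-path: as I understand the spec, the monitoring side boils down to a handful of static objects served from a plain HTTP hierarchy, roughly like this (see the [Static CT API] document for the authoritative definitions; this summary is mine):

<monitoring prefix>/checkpoint            # the signed tree head; the only object that keeps changing
<monitoring prefix>/tile/<L>/<N>          # Merkle hash tiles at level L, index N
<monitoring prefix>/tile/data/<N>         # data tiles containing the logged (pre)certificates
<monitoring prefix>/issuer/<fingerprint>  # issuer certificates, keyed by SHA-256 fingerprint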

Aside from the different read endpoints, a log that implements the Static API is a regular CT log that can work alongside RFC 6962 logs and that fulfills the same purpose. In particular, it requires no modification to submitters and TLS clients.

If you only read one document about Static CT, read Filippo Valsorda’s excellent [paper]. It describes a radically cheaper and easier to operate [Certificate Transparency] log that is backed by consistent object storage, and can scale to 30x the current issuance rate for 2-10% of the cost with no merge delay.

Scalable, Cheap, Reliable: choose two

ctlog at ipng

In the diagram, I’ve drawn an overview of IPng’s network. In red, a European backbone network is provided by a [BGP Free Core network]. It operates a private IPv4, IPv6, and MPLS network, called IPng Site Local, which is not connected to the internet. On top of that, IPng offers L2 and L3 services, for example using [VPP].

In green I built a cluster of replicated NGINX frontends. They connect into IPng Site Local and can reach all hypervisors, VMs, and storage systems. They also connect to the Internet with a single IPv4 and IPv6 address. One might say that SSL is added and removed here :-) [ref].

Then in orange I built a set of [MinIO] S3 storage pools. Among other things, I serve the static content of the IPng website from these pools, providing fancy redundancy and caching. I wrote about its design in [this article].

Finally, I turn my attention to the blue which is two hypervisors, one run by [IPng] and the other by [Massar]. Each of them will be running one of the Log implementations. IPng provides two large ZFS storage tanks for offsite backup, in case a hypervisor decides to check out, and daily backups to an S3 bucket using Restic.

Having explained all of this, I am well aware that end to end reliability will be coming from the fact that there are many independent Log operators, and folks wanting to validate certificates can simply monitor many. If there is a gap in coverage, say due to any given Log’s downtime, this will not necessarily be problematic. It does mean that I may have to suppress the SRE in me…

MinIO

My first instinct is to leverage the distributed storage IPng has, but as I’ll show in the rest of this article, maybe a simpler, more elegant design could be superior, precisely because individual log reliability is not as important as having many available log instances to choose from.

From operators in the field I understand that the world-wide issuance of certificates is roughly 17M/day, which amounts to some 200-250qps of writes. Antonis explains that certs with a validity of 180 days or less need two CT log entries, while certs with a validity of more than 180 days need three CT log entries. So the write rate is roughly 2.2x that, as an upper bound.
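Doing the back-of-the-envelope arithmetic on those numbers (my own rounding):

17'000'000 certs/day ÷ 86'400 sec/day  ≈ 197 new certs/sec on average, call it 200-250qps with peaks
200-250qps × ~2.2 log entries/cert     ≈ 450-550 log writes/sec as an upper bound, spread over all logs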

My first thought is to see how fast my open source S3 machines can really go. I’m also curious about the difference between SSD and spinning disks.

I boot two Dell R630s in the Lab. These machines have two Xeon E5-2640 v4 CPUs for a total of 20 cores and 40 threads, and 512GB of DDR4 memory. They also sport a SAS controller. In one machine I place 6pcs 1.2TB SAS3 disks (HPE part number EG1200JEHMC), and in the second machine I place 6pcs of 1.92TB enterprise storage (Samsung part number P1633N19).

I set up a 6-device MinIO cluster on both and take them out for a spin using [S3 Benchmark] from Wasabi Tech.

pim@ctlog-test:~/src/s3-benchmark$ for dev in disk ssd; do \
  for t in 1 8 32; do \
    for z in 4M 1M 8k 4k; do \
      ./s3-benchmark -a $KEY -s $SECRET -u http://minio-$dev:9000 -t $t -z $z \
        | tee -a minio-results.txt; \
    done; \
  done; \
done

The loadtest above does a bunch of runs with varying parameters. First, it reads and writes object sizes of 4MB, 1MB, 8kB and 4kB. Then, it does this with either 1, 8 or 32 threads. Finally, it tests both the disk-based variant and the SSD-based one. The loadtest runs from a third machine, so that the Dell R630 disk tanks can stay completely dedicated to their task of running MinIO.

MinIO 8kb disk vs SSD

The left-hand side graph feels pretty natural to me. With one thread, uploading 8kB objects will quickly hit the IOPS rate of the disks, each of which have to participate in the write due to EC:3 encoding when using six disks, and it tops out at ~56 PUT/s. The single thread hitting SSDs will not hit that limit, and has ~371 PUT/s which I found a bit underwhelming. But, when performing the loadtest with either 8 or 32 write threads, the hard disks become only marginally faster (topping out at 240 PUT/s), while the SSDs really start to shine, with 3850 PUT/s. Pretty good performance.

On the read-side, I am pleasantly surprised that there’s not really that much of a difference between disks and SSDs. This is likely because the host filesystem cache is playing a large role, so the 1-thread performance is equivalent (765 GET/s for disks, 677 GET/s for SSDs), and the 32-thread performance is also equivalent (at 7624 GET/s for disks with 7261 GET/s for SSDs). I do wonder why the hard disks consistently outperform the SSDs with all the other variables (OS, MinIO version, hardware) the same.

Sidequest: SeaweedFS

Something that has long caught my attention is the way in which [SeaweedFS] approaches blob storage. Many operators report great success with small-file writes in SeaweedFS compared to MinIO and even AWS S3. This is because writes in SeaweedFS are not broken into erasure-sets, which would require every disk to write a small part or checksum of the data; rather, files are replicated within the cluster in their entirety on different disks, racks or datacenters. I won’t bore you with the details of SeaweedFS, but I’ll tack on a docker [compose file] that I used at the end of this article, if you’re curious.

MinIO vs SeaWeedFS

In the write-path, SeaweedFS dominates in all cases, due to its different way of achieving durable storage (per-file replication in SeaweedFS versus all-disk erasure-sets in MinIO):

  • 4k: 3,384 ops/sec vs MinIO’s 111 ops/sec (30x faster!)
  • 8k: 3,332 ops/sec vs MinIO’s 111 ops/sec (30x faster!)
  • 1M: 383 ops/sec vs MinIO’s 44 ops/sec (9x faster)
  • 4M: 104 ops/sec vs MinIO’s 32 ops/sec (4x faster)

For the read-path, in GET operations MinIO is better at small objects, and really dominates at large objects:

  • 4k: 7,411 ops/sec vs SeaweedFS 5,014 ops/sec
  • 8k: 7,666 ops/sec vs SeaweedFS 5,165 ops/sec
  • 1M: 5,466 ops/sec vs SeaweedFS 2,212 ops/sec
  • 4M: 3,084 ops/sec vs SeaweedFS 646 ops/sec

This makes me draw an interesting conclusion: seeing as CT Logs are read/write heavy (every couple of seconds, the Merkle tree is recomputed, which is reasonably disk-intensive), SeaweedFS might be a slightly better choice. IPng Networks has three MinIO deployments, but no SeaweedFS deployments. Yet.

Tessera

[Tessera] is a Go library for building tile-based transparency logs (tlogs) [ref]. It is the logical successor to the approach Google took when building and operating Logs using Tessera’s predecessor, [Trillian]. The implementation and its APIs bake in current best practices based on the lessons learned over the past decade of building and operating transparency logs in production environments and at scale.

Tessera was introduced at the Transparency.Dev summit in October 2024. I first watched Al and Martin [introduce] it at last year’s summit. At a high level, it wraps what used to be a whole Kubernetes cluster full of components into a single library that can be used with cloud services, such as AWS S3 with an RDS database, or GCP’s GCS storage with a Spanner database. However, Google also made it easy to use a regular POSIX filesystem implementation.

TesseraCT

tesseract logo

While Tessera is a library, a CT log implementation comes from its sibling GitHub repository called [TesseraCT]. Because it leverages Tessera under the hood, TesseraCT can run on GCP, AWS, POSIX-compliant filesystems, or on S3-compatible systems alongside a MySQL database. In order to provide ecosystem agility and to control the growth of CT Log sizes, new CT Logs must be temporally sharded, defining a certificate expiry range denoted in the form of two dates: [rangeBegin, rangeEnd). The certificate expiry range allows a Log to reject otherwise valid logging submissions for certificates that expire before or after this defined range, thus partitioning the set of publicly-trusted certificates that each Log will accept. In practice this means running one log shard per expiry window (say, per half-year or year), and I will be expected to keep logs around for an extended period of time, say 3-5 years.

It’s time for me to figure out what this TesseraCT thing can do .. are you ready? Let’s go!

TesseraCT: S3 and SQL

TesseraCT comes with a few so-called personalities. These are implementations of the underlying storage infrastructure in an opinionated way. The first personality I look at is the aws one in cmd/tesseract/aws. I notice that this personality makes hard assumptions about the use of AWS, which is unfortunate, as the documentation says ‘.. or self-hosted S3 and MySQL database’. However, the aws personality assumes the AWS Secrets Manager in order to fetch its signing key. Before I can be successful, I need to untangle that.

TesseraCT: AWS and Local Signer

First, I change cmd/tesseract/aws/main.go to add two new flags:

  • -signer_public_key_file: a path to the public key for checkpoints and SCT signer
  • -signer_private_key_file: a path to the private key for checkpoints and SCT signer

I then change the program so that, if both flags are set, the user gets a NewLocalSigner instead of a NewSecretsManagerSigner. Now all I have to do is implement the signer interface in a file called local_signer.go. There, the function NewLocalSigner() reads the public and private PEM from file, decodes them, and creates an ECDSAWithSHA256Signer with them. A simplified example to show what I mean:

// NewLocalSigner creates a new signer that uses the ECDSA P-256 key pair from
// local disk files for signing digests.
func NewLocalSigner(publicKeyFile, privateKeyFile string) (*ECDSAWithSHA256Signer, error) {
  // Read and decode the public key (error handling omitted for brevity).
  publicKeyPEM, _ := os.ReadFile(publicKeyFile)
  publicPemBlock, _ := pem.Decode(publicKeyPEM)
  publicKey, _ := x509.ParsePKIXPublicKey(publicPemBlock.Bytes)
  ecdsaPublicKey, _ := publicKey.(*ecdsa.PublicKey)

  // Read and decode the private key.
  privateKeyPEM, _ := os.ReadFile(privateKeyFile)
  privatePemBlock, _ := pem.Decode(privateKeyPEM)
  ecdsaPrivateKey, _ := x509.ParseECPrivateKey(privatePemBlock.Bytes)

  // Verify the correctness of the signer key pair.
  if !ecdsaPrivateKey.PublicKey.Equal(ecdsaPublicKey) {
    return nil, errors.New("signer key pair doesn't match")
  }

  return &ECDSAWithSHA256Signer{
    publicKey:  ecdsaPublicKey,
    privateKey: ecdsaPrivateKey,
  }, nil
}

In the snippet above I omitted all of the error handling, but the local signer logic itself is hopefully clear. And with that, I am liberated from Amazon’s Cloud offering and can run this thing all by myself!
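Wiring this up in cmd/tesseract/aws/main.go is then a small conditional. The sketch below paraphrases my change rather than quoting it verbatim; the existing AWS arguments are elided, and the surrounding variable names are illustrative:

// Two new flags for the local signer (descriptions paraphrased).
signerPublicKeyFile := flag.String("signer_public_key_file", "",
  "Path to the public key PEM used for checkpoint and SCT signing")
signerPrivateKeyFile := flag.String("signer_private_key_file", "",
  "Path to the private key PEM used for checkpoint and SCT signing")

// Later, when building the personality: prefer the local signer when both
// file flags are set, otherwise keep the original Secrets Manager code path.
var signer crypto.Signer
var err error
if *signerPublicKeyFile != "" && *signerPrivateKeyFile != "" {
  signer, err = NewLocalSigner(*signerPublicKeyFile, *signerPrivateKeyFile)
} else {
  signer, err = NewSecretsManagerSigner(ctx /* ... existing AWS arguments ... */)
}
if err != nil {
  klog.Exitf("Failed to create signer: %v", err)
}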

TesseraCT: Running with S3, MySQL, and Local Signer

First, I need to create a suitable ECDSA key:

pim@ctlog-test:~$ openssl ecparam -name prime256v1 -genkey -noout -out /tmp/private_key.pem
pim@ctlog-test:~$ openssl ec -in /tmp/private_key.pem -pubout -out /tmp/public_key.pem

Then, I’ll install the MySQL server and create the databases:

pim@ctlog-test:~$ sudo apt install default-mysql-server
pim@ctlog-test:~$ sudo mysql -u root

CREATE USER 'tesseract'@'localhost' IDENTIFIED BY '<db_passwd>';
CREATE DATABASE tesseract;
CREATE DATABASE tesseract_antispam;
GRANT ALL PRIVILEGES ON tesseract.* TO 'tesseract'@'localhost';
GRANT ALL PRIVILEGES ON tesseract_antispam.* TO 'tesseract'@'localhost';
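Before pointing TesseraCT at the database, a quick sanity check of the new credentials doesn’t hurt; with the grants above, both of these should succeed:

pim@ctlog-test:~$ mysql -u tesseract -p tesseract -e 'SELECT 1;'
pim@ctlog-test:~$ mysql -u tesseract -p tesseract_antispam -e 'SELECT 1;'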

Finally, I use the SSD MinIO lab-machine that I just loadtested to create an S3 bucket.

pim@ctlog-test:~$ mc mb minio-ssd/tesseract-test
pim@ctlog-test:~$ cat << EOF > /tmp/minio-access.json
{ "Version": "2012-10-17", "Statement": [ {
    "Effect": "Allow",
    "Action": [ "s3:ListBucket", "s3:PutObject", "s3:GetObject", "s3:DeleteObject" ],
    "Resource": [ "arn:aws:s3:::tesseract-test/*", "arn:aws:s3:::tesseract-test" ]
  } ]
}
EOF
pim@ctlog-test:~$ mc admin user add minio-ssd <user> <secret>
pim@ctlog-test:~$ mc admin policy create minio-ssd tesseract-test-access /tmp/minio-access.json
pim@ctlog-test:~$ mc admin policy attach minio-ssd tesseract-test-access --user <user>
pim@ctlog-test:~$ mc anonymous set public minio-ssd/tesseract-test
brain

After some fiddling, I understand that the AWS software development kit makes some assumptions that you’ll be using .. quelle surprise .. AWS services. But you can also use local S3 services by setting a few key environment variables. I had heard of the S3 access and secret key environment variables before, but I now need to also use a different S3 endpoint. That little detour into the codebase only took me .. several hours.

Armed with that knowledge, I can build and finally start my TesseraCT instance:

pim@ctlog-test:~/src/tesseract/cmd/tesseract/aws$ go build -o ~/aws .
pim@ctlog-test:~$ export AWS_DEFAULT_REGION="us-east-1"
pim@ctlog-test:~$ export AWS_ACCESS_KEY_ID="<user>"
pim@ctlog-test:~$ export AWS_SECRET_ACCESS_KEY="<secret>"
pim@ctlog-test:~$ export AWS_ENDPOINT_URL_S3="http://minio-ssd.lab.ipng.ch:9000/"
pim@ctlog-test:~$ ./aws --http_endpoint='[::]:6962' \
  --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
  --bucket=tesseract-test \
  --db_host=ctlog-test.lab.ipng.ch \
  --db_user=tesseract \
  --db_password=<db_passwd> \
  --db_name=tesseract \
  --antispam_db_name=tesseract_antispam \
  --signer_public_key_file=/tmp/public_key.pem \
  --signer_private_key_file=/tmp/private_key.pem \
  --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem

I0727 15:13:04.666056  337461 main.go:128] **** CT HTTP Server Starting ****

Hah! I think most of the command line flags and environment variables should make sense, but I was struggling for a while with the --roots_pem_file and the --origin flags, so I phoned a friend (Al Cutter, Googler extraordinaire and an expert in Tessera/CT). He explained to me that the Log is actually an open endpoint to which anybody might POST data. However, to avoid folks abusing the log infrastructure, each POST is expected to come from one of the certificate authorities listed in the --roots_pem_file. OK, that makes sense.

Then, the --origin flag designates how my log identifies itself. The resulting checkpoint file (shown below) starts with that origin, followed by the tree size and a hash of the latest merged and published Merkle tree, and ends with a signature. In case a server serves multiple logs, the --origin flag makes the distinction of which checkpoint belongs to which log.

pim@ctlog-test:~/src/tesseract$ curl http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
JGPitKWWI0aGuCfC2k1n/p9xdWAYPm5RZPNDXkCEVUU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMCONUBAMARjBEAiA/nc9dig6U//vPg7SoTHjt9bxP5K+x3w4MYKpIRn4ULQIgUY5zijRK8qyuJGvZaItDEmP1gohCt+wI+sESBnhkuqo=

When creating the bucket above, I used mc anonymous set public, which made the S3 bucket world-readable. I can now execute the whole read-path simply by hitting the S3 service. Check.

TesseraCT: Loadtesting S3/MySQL

Stop, hammer time

The write path is a server on [::]:6962. I should be able to write entries to it, but how? Here’s where I am grateful to find a tool in the TesseraCT GitHub repository called hammer. This hammer sets up read and write traffic to a Static CT API log to test correctness and performance under load. The traffic is sent according to the [Static CT API] spec. Slick!

The tool starts a text-based UI (my favorite! also when using the Cisco T-Rex loadtester) in the terminal that shows the current status and logs, and supports increasing/decreasing read and write traffic. This TUI allows for a level of interactivity when probing a new configuration of a log, in order to find any cliffs where performance degrades. For real load-testing applications, especially headless runs as part of a CI pipeline, it is recommended to run the tool with -show_ui=false in order to disable the UI.

I’m a bit lost in the somewhat terse [README.md], but my buddy Al comes to my rescue and explains the flags to me. First of all, the loadtester wants to hit the same --origin that I configured the write-path to accept, in my case ctlog-test.lab.ipng.ch/test-ecdsa. Then, it needs the public key for that Log, which I can find in /tmp/public_key.pem: the text there is the DER-encoded (Distinguished Encoding Rules) key, stored as a base64 string. What followed was the most difficult part for me to understand, as I was thinking the hammer would read some log from the internet somewhere and replay it locally. Al explains that the hammer tool actually creates all of these entries synthetically itself; it regularly reads the checkpoint from the --log_url location, while it writes its certificates to --write_log_url. The last few flags just tell the hammer how many read and write ops/sec it should generate, and with that explanation my brain plays tadaa.wav and I am ready to go.

pim@ctlog-test:~/src/tesseract$ go run ./internal/hammer \
  --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
  --log_public_key=MFkwEwYHKoZIzj0CAQYIKoZIzj0DAQcDQgAEucHtDWe9GYNicPnuGWbEX8rJg/VnDcXs8z40KdoNidBKy6/ZXw2u+NW1XAUnGpXcZozxufsgOMhijsWb25r7jw== \
  --log_url=http://tesseract-test.minio-ssd.lab.ipng.ch:9000/ \
  --write_log_url=http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ \
  --max_read_ops=0 \
  --num_writers=5000 \
  --max_write_ops=100
S3/MySQL Loadtest 100qps

Cool! It seems that the loadtest is happily chugging along at 100qps. The log is consuming them in the HTTP write-path by accepting POST requests to /ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain, where hammer is offering them at a rate of 100qps, with a configured probability of duplicates set at 10%. What that means is that every now and again, it’ll repeat a previous request. The purpose of this is to stress test the so-called antispam implementation. When hammer sends its requests, it signs them with a certificate that was issued by the CA described in internal/hammer/testdata/test_root_ca_cert.pem, which is why TesseraCT accepts them.
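By the way, the write-path that hammer exercises here is just the regular RFC 6962 add-chain endpoint: an HTTP POST with a JSON body carrying the base64-encoded DER certificate chain. Hand-rolled, a submission would look roughly like this (the chain contents below are placeholders; a real submission must chain up to a root from --roots_pem_file):

pim@ctlog-test:~$ curl -s -X POST \
  http://localhost:6962/ctlog-test.lab.ipng.ch/test-ecdsa/ct/v1/add-chain \
  -H 'Content-Type: application/json' \
  -d '{"chain": ["<base64 DER leaf>", "<base64 DER intermediate>"]}'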

I raise the write load by using the ‘>’ key a few times. I notice things are great at 500qps, which is nice because that’s double the expected rate. But I start seeing a bit more noise at 600qps. When I raise the write-rate to 1000qps, all hell breaks loose in the logs of the server (with similar messages appearing in the hammer loadtester):

W0727 15:54:33.419881  348475 handlers.go:168] ctlog-test.lab.ipng.ch/test-ecdsa: AddChain handler error: couldn't store the leaf: failed to fetch entry bundle at index 0: failed to fetch resource: getObject: failed to create reader for object "tile/data/000" in bucket "tesseract-test": operation error S3: GetObject, context deadline exceeded
W0727 15:55:02.727962  348475 aws.go:345] GarbageCollect failed: failed to delete one or more objects: failed to delete objects: operation error S3: DeleteObjects, https response error StatusCode: 400, RequestID: 1856202CA3C4B83F, HostID: dd9025bab4ad464b049177c95eb6ebf374d3b3fd1af9251148b658df7ac2e3e8, api error MalformedXML: The XML you provided was not well-formed or did not validate against our published schema.
E0727 15:55:10.448973  348475 append_lifecycle.go:293] followerStats: follower "AWS antispam" EntriesProcessed(): failed to read follow coordination info: Error 1040: Too many connections

I see on the MinIO instance that it’s doing about 150/s of GETs and 15/s of PUTs, which is totally reasonable:

pim@ctlog-test:~/src/tesseract$ mc admin trace --stats ssd
Duration: 6m9s ▰▱▱
RX Rate:↑ 34 MiB/m
TX Rate:↓ 2.3 GiB/m
RPM    :  10588.1
-------------
Call                      Count          RPM     Avg Time  Min Time  Max Time  Avg TTFB  Max TTFB  Avg Size     Rate /min  
s3.GetObject              60558 (92.9%)  9837.2  4.3ms     708µs     48.1ms    3.9ms     47.8ms    ↑144B ↓246K  ↑1.4M ↓2.3G
s3.PutObject              2199 (3.4%)    357.2   5.3ms     2.4ms     32.7ms    5.3ms     32.7ms    ↑92K         ↑32M       
s3.DeleteMultipleObjects  1212 (1.9%)    196.9   877µs     290µs     41.1ms    850µs     41.1ms    ↑230B ↓369B  ↑44K ↓71K  
s3.ListObjectsV2          1212 (1.9%)    196.9   18.4ms    999µs     52.8ms    18.3ms    52.7ms    ↑131B ↓261B  ↑25K ↓50K  

Another nice way to see what makes it through is this oneliner, which reads the checkpoint every second, and once it changes, shows the delta in seconds and how many certs were written:

pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
  N=$(curl -sS http://tesseract-test.minio-ssd.lab.ipng.ch:9000/checkpoint | grep -E '^[0-9]+$'); \
  if [ "$N" -eq "$O" ]; then \
    echo -n .; \
  else \
    echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
  fi; \
  T=$((T+1)); sleep 1; done
1012905 .... 5 seconds 2081 certs
1014986 .... 5 seconds 2126 certs
1017112 .... 5 seconds 1913 certs
1019025 .... 5 seconds 2588 certs
1021613 .... 5 seconds 2591 certs
1024204 .... 5 seconds 2197 certs

So I can see that the checkpoint is refreshed every 5 seconds and between 1913 and 2591 certs are written each time. And indeed, at 400/s there are no errors or warnings at all. At this write rate, TesseraCT is using about 2.9 CPUs/s, with MariaDB using 0.3 CPUs/s, but the hammer is using 6.0 CPUs/s. Overall, the machine happily serves this load test for a few hours.

Conclusion: a write-rate of 400/s should be safe with S3+MySQL

TesseraCT: POSIX

I have been playing with this idea of having a reliable read-path by making the S3 cluster redundant, or by replicating the S3 bucket. But Al asks: why not use our experimental POSIX personality? We discuss two very important benefits, but also two drawbacks:

  • On the plus side:
    1. There is no need for S3 storage, read/writing to a local ZFS raidz2 pool instead.
    2. There is no need for MySQL, as the POSIX implementation can use a local badger instance also on the local filesystem.
  • On the minus side:
    1. There is a SPOF in the read-path, as the single VM must handle both the read- and the write-path. (The write-path always has a SPOF on the TesseraCT VM anyway.)
    2. Local storage is more expensive than S3 storage, and can be used only for the purposes of one application (and at best, shared with other VMs on the same hypervisor).

Come to think of it, this is maybe not such a bad tradeoff. I do kind of like having a single-VM with a single-binary and no other moving parts. It greatly simplifies the architecture, and for the read-path I can (and will) still use multiple upstream NGINX machines in IPng’s network.

I consider myself nerd-sniped, and take a look at the POSIX variant. I have a few SAS3 solid state drives (NetAPP part number X447_S1633800AMD), which I plug into the ctlog-test machine.

pim@ctlog-test:~$ sudo zpool create -o ashift=12 -o autotrim=on ssd-vol0 mirror \
  /dev/disk/by-id/wwn-0x5002538a0???????
pim@ctlog-test:~$ sudo zfs create ssd-vol0/tesseract-test
pim@ctlog-test:~$ sudo chown pim:pim /ssd-vol0/tesseract-test
pim@ctlog-test:~/src/tesseract$ go run ./cmd/experimental/posix --http_endpoint='[::]:6962' \
  --origin=ctlog-test.lab.ipng.ch/test-ecdsa \
  --private_key=/tmp/private_key.pem \
  --storage_dir=/ssd-vol0/tesseract-test \
  --roots_pem_file=internal/hammer/testdata/test_root_ca_cert.pem 
badger 2025/07/27 16:29:15 INFO: All 0 tables opened in 0s
badger 2025/07/27 16:29:15 INFO: Discard stats nextEmptySlot: 0
badger 2025/07/27 16:29:15 INFO: Set nextTxnTs to 0
I0727 16:29:15.032845  363156 files.go:502] Initializing directory for POSIX log at "/ssd-vol0/tesseract-test" (this should only happen ONCE per log!)
I0727 16:29:15.034101  363156 main.go:97] **** CT HTTP Server Starting ****

pim@ctlog-test:~/src/tesseract$ cat /ssd-vol0/tesseract-test/checkpoint 
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMSgC8BAMARzBFAiBjT5zdkniKlryqlUlx/gLHOtVK26zuWwrc4BlyTVzCWgIhAJ0GIrlrP7YGzRaHjzdB5tnS5rpP3LeOsPbpLateaiFc

Alright, I can see the log started and created an empty checkpoint file. Nice!

Before I can loadtest it, I need to make the read-path reachable. The hammer can read a checkpoint from local file:/// prefixes, but I’ll have to serve these files over the network eventually anyway, so I create the following NGINX config for it:

server {
  listen 80 default_server backlog=4096;
  listen [::]:80 default_server backlog=4096;
  root /ssd-vol0/tesseract-test/;
  index index.html index.htm index.nginx-debian.html;

  server_name _;

  access_log /var/log/nginx/access.log combined buffer=512k flush=5s;

  location / {
    try_files $uri $uri/ =404;
    tcp_nopush  on;
    sendfile    on;
    tcp_nodelay on;
    keepalive_timeout 65;
    keepalive_requests 1000;
  }
}

Just a couple of small thoughts on this configuration. I’m using buffered access logs, to avoid excessive disk writes in the read-path. Then, I’m using kernel sendfile(), which instructs the kernel to serve the static objects directly so that NGINX can move on. Further, I allow for long keepalives in HTTP/1.1, so that subsequent requests can reuse the same TCP connection, and I set the tcp_nodelay and tcp_nopush flags to just blast the data out without waiting.

Without much ado:

pim@ctlog-test:~/src/tesseract$ curl -sS ctlog-test.lab.ipng.ch/checkpoint
ctlog-test.lab.ipng.ch/test-ecdsa
0
47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

— ctlog-test.lab.ipng.ch/test-ecdsa L+IHdQAAAZhMTfksBAMASDBGAiEAqADLH0P/SRVloF6G1ezlWG3Exf+sTzPIY5u6VjAKLqACIQCkJO2N0dZQuDHvkbnzL8Hd91oyU41bVqfD3vs5EwUouA==

TesseraCT: Loadtesting POSIX

The loadtesting is roughly the same. I start the hammer with the same 500qps write rate, which was roughly where the S3+MySQL variant topped out. My checkpoint tracker shows the following:

pim@ctlog-test:~/src/tesseract$ T=0; O=0; while :; do \
  N=$(curl -sS http://localhost/checkpoint | grep -E '^[0-9]+$'); \
  if [ "$N" -eq "$O" ]; then \
    echo -n .; \
  else \
    echo " $T seconds $((N-O)) certs"; O=$N; T=0; echo -n $N\ ;
  fi; \
  T=$((T+1)); sleep 1; done
59250 ......... 10 seconds 5244 certs
64494 ......... 10 seconds 5000 certs
69494 ......... 10 seconds 5000 certs
74494 ......... 10 seconds 5000 certs
79494 ......... 10 seconds 5256 certs
84750 ......... 10 seconds 5244 certs
89994 ......... 10 seconds 5256 certs
95250 ......... 10 seconds 5000 certs
100250 ......... 10 seconds 5000 certs
105250 ......... 10 seconds 5000 certs

I learn two things. First, the checkpoint interval in this posix variant is 10 seconds, compared to the 5 seconds of the aws variant I tested before. Second, at 500qps the posix variant keeps up effortlessly, merging a steady 5'000-odd certs per checkpoint. I dive into the code, because there doesn’t seem to be a --checkpoint_interval flag. In the tessera library, I find DefaultCheckpointInterval, which is set to 10 seconds. I change it to 2 seconds instead, and restart the posix binary:

238250 . 2 seconds 1000 certs
239250 . 2 seconds 1000 certs
240250 . 2 seconds 1000 certs
241250 . 2 seconds 1000 certs
242250 . 2 seconds 1000 certs
243250 . 2 seconds 1000 certs
244250 . 2 seconds 1000 certs
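For reference, the change itself is conceptually a one-liner in the tessera library, shown schematically below; a cleaner fix would presumably expose this as an option, which is not my call to make:

// In the tessera library: how often a new checkpoint is published.
// The default is 10 seconds; for this loadtest I dropped it to 2 seconds.
DefaultCheckpointInterval = 2 * time.Second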
Posix Loadtest 5000qps

Very nice! Maybe I can write a few more certs? I restart the hammer at 5000/s which, somewhat to my surprise, the log ends up serving!

642608 . 2 seconds 6155 certs
648763 . 2 seconds 10256 certs
659019 . 2 seconds 9237 certs
668256 . 2 seconds 8800 certs
677056 . 2 seconds 8729 certs
685785 . 2 seconds 8237 certs
694022 . 2 seconds 7487 certs
701509 . 2 seconds 8572 certs
710081 . 2 seconds 7413 certs

The throughput is highly variable though, seemingly between 3700/sec and 5100/sec, and I quickly find out that the hammer is completely saturating the CPU on the machine, leaving very little room for the posix TesseraCT to serve. I’m going to need more machines!

So I start a hammer loadtester on the two now-idle MinIO servers, and run them at about 6000qps each, for a total of 12000 certs/sec. And my little posix binary is keeping up like a champ:

2987169 . 2 seconds 23040 certs
3010209 . 2 seconds 23040 certs
3033249 . 2 seconds 21760 certs
3055009 . 2 seconds 21504 certs
3076513 . 2 seconds 23808 certs
3100321 . 2 seconds 22528 certs

One thing is reasonably clear: the posix TesseraCT is CPU-bound, not disk-bound. The CPU is now running at about 18.5 CPUs/s (with 20 cores), which is pretty much all this Dell has to offer. The NetAPP enterprise solid state drives are not impressed:

pim@ctlog-test:~/src/tesseract$ zpool iostat -v ssd-vol0 10 100
                              capacity     operations     bandwidth 
pool                        alloc   free   read  write   read  write
--------------------------  -----  -----  -----  -----  -----  -----
ssd-vol0                    11.4G   733G      0  3.13K      0   117M
  mirror-0                  11.4G   733G      0  3.13K      0   117M
    wwn-0x5002538a05302930      -      -      0  1.04K      0  39.1M
    wwn-0x5002538a053069f0      -      -      0  1.06K      0  39.1M
    wwn-0x5002538a06313ed0      -      -      0  1.02K      0  39.1M
--------------------------  -----  -----  -----  -----  -----  -----

pim@ctlog-test:~/src/tesseract$ zpool iostat -l  ssd-vol0 10
              capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool        alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
ssd-vol0    14.0G   730G      0  1.48K      0  35.4M      -    2ms      -  535us      -    1us      -    3ms      -   50ms
ssd-vol0    14.0G   730G      0  1.12K      0  23.0M      -    1ms      -  733us      -    2us      -    1ms      -   44ms
ssd-vol0    14.1G   730G      0  1.42K      0  45.3M      -  508us      -  122us      -  914ns      -    2ms      -   41ms
ssd-vol0    14.2G   730G      0    678      0  21.0M      -  863us      -  144us      -    2us      -    2ms      -      -

Results

OK, that kind of seals the deal for me. The write path needs about 250 certs/sec and I’m now hammering with 12'000 certs/sec, with room to spare. But what about the read path? The cool thing about the static log is that reads are entirely served by NGINX. The only file that isn’t cacheable is the checkpoint file, which gets updated every two seconds (or ten seconds in the default tessera settings).
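Since everything except the checkpoint is immutable once written, the origin can also tell downstream caches (like the NGINX frontends from the diagram) how long to hold on to objects. A sketch of what I might add to the config above (the values are illustrative, and partial tiles with a .p/ suffix would deserve the shorter lifetime too):

location = /checkpoint {
  # Updated every few seconds, so only micro-cache it.
  add_header Cache-Control "public, max-age=2";
}
location ~ ^/(tile|issuer)/ {
  # Full tiles and issuer certs never change once written.
  add_header Cache-Control "public, max-age=31536000, immutable";
}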

So I start yet another hammer whose job it is to read back from the static filesystem:

pim@ctlog-test:~/src/tesseract$ curl localhost/nginx_status; sleep 60; curl localhost/nginx_status
Active connections: 10556 
server accepts handled requests
 25302 25302 1492918 
Reading: 0 Writing: 1 Waiting: 10555 
Active connections: 7791 
server accepts handled requests
 25764 25764 1727631 
Reading: 0 Writing: 1 Waiting: 7790 

And I can see that it’s keeping up quite nicely. In one minute, it handled (1727631-1492918) or 234713 requests, which is a cool 3911 requests/sec. All these read/write hammers are kind of saturating the ctlog-test machine though:

Posix Loadtest 8000qps write, 4000qps read

But after a little bit of fiddling, I can assert my conclusion:

Conclusion: a write-rate of 8'000/s alongside a read-rate of 4'000/s should be safe with POSIX

What’s Next

I am going to offer such a machine in production together with Antonis Chariton and Jeroen Massar. I plan to do a few additional things:

  • Test Sunlight as well on the same hardware. It would be nice to see a comparison between write rates of the two implementations.
  • Work with Al Cutter and the Transparency Dev team to close a few small gaps (like the local_signer.go and some Prometheus monitoring of the posix binary).
  • Install and launch both under *.ct.ipng.ch, which in itself deserves its own report, showing how I intend to do log cycling and care/feeding, as well as report on the real production experience running these CT Logs.