Benchbase is a framework from Carnegie Mellon for benchmarking databases. It comes with support for about 20 benchmarks and about as many DBMSes.
Benchbase started life as OLTPBench and was introduced in an academic paper from 2014.
After using Benchbase for the last month, I found the documentation to be pretty shallow, so this is my effort to improve things. A lot of this material was covered in my pdxpug talk last week.
Benchbase is written in Java and is built and run with Maven.
Following their README, first you build a tarball for your DBMS like this:
./mvnw clean package -P postgres
Then you expand the tarball and run a benchmark like this:
cd target
tar xvzf benchbase-postgres.tgz
cd benchbase-postgres
java -jar benchbase.jar -b tpcc -c config/postgres/sample_tpcc_config.xml --create=true --load=true --execute=true
The -b option says which benchmark you want to run. The -c option points to a config file (covered below).
The --create option doesn’t run CREATE DATABASE, but creates the schema for the benchmark. The --load option fills the schema with its starting data. The time for this is not included in the benchmark results.
The --execute option actually runs the benchmark. I often ran --create=true --load=true --execute=false to populate a database named e.g. benchbase_template, then createdb -T benchbase_template benchbase to make a quick copy, then --create=false --load=false --execute=true to run the benchmark. That helps iteration time a lot when you have a big load. But for higher-quality results you should do it all in one go, after running initdb, as Melanie Plageman points out in one of her talks. (Sorry, I haven’t been able to find the reference again, but if I do I’ll add a link here.)
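The template trick can be sketched as a shell session (the database names are just examples, and the config file’s <url> must point at whichever database you intend each step to touch):

```shell
# One slow load into a template database:
java -jar benchbase.jar -b tpcc -c config/postgres/sample_tpcc_config.xml \
  --create=true --load=true --execute=false

# Cheap copy using Postgres's template mechanism:
createdb -T benchbase_template benchbase

# Fast iteration: point the config at the copy and just execute:
java -jar benchbase.jar -b tpcc -c config/postgres/sample_tpcc_config.xml \
  --create=false --load=false --execute=true
```

Repeat the last two steps (dropping and re-copying the database in between) to iterate without re-running the load.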
If you are writing Java code for your own benchmark, then this one-liner is a lot faster than all that tarball stuff:
./mvnw clean compile exec:java -P postgres -Dexec.args="-b tpcc -c config/postgres/sample_tpcc_config.xml --create=true --load=true --execute=true"
Of course you can skip the clean and compile if you like.
Unfortunately the exec:java target has been broken since 2023, but I submitted a pull request.
The benchmark behavior is controlled by the XML config file. The most complete docs are in the original OLTPBench repo’s GitHub wiki, although if you read the paper you’ll learn many other things you can control with this file. You can also look at a sample config file for your benchmark + database.
The file begins with connection details like this:
<type>POSTGRES</type>
<driver>org.postgresql.Driver</driver>
<url>jdbc:postgresql://localhost:5432/benchbase?sslmode=disable&amp;ApplicationName=tpcc&amp;reWriteBatchedInserts=true</url>
<username>admin</username>
<password>password</password>
The <isolation> element controls the transaction isolation level:
<isolation>TRANSACTION_SERIALIZABLE</isolation>
You can ask to reconnect after a connection failure:
<reconnectOnConnectionFailure>true</reconnectOnConnectionFailure>
I haven’t investigated exactly how that is used.
You can also open a new connection for every transaction:
<newConnectionPerTxn>true</newConnectionPerTxn>
By default that is false, but you may want to make it true if you are focusing on your database’s connection overhead.
Here are some elements that apply to the loading step (not the actual benchmark run):
<scalefactor>1</scalefactor>
<batchsize>128</batchsize>
Each benchmark interprets scalefactor in its own way. For TPC-C this is the number of warehouses. For Twitter you get 500 users and 20,000 tweets, multiplied by the scalefactor. Then batchsize just tells the loader how to combine insert statements, for a quicker load.
You also list all the “procedures” the benchmark is capable of (or just the ones you care about):
<transactiontypes>
    <transactiontype>
        <name>NewOrder</name>
    </transactiontype>
    <transactiontype>
        <name>Payment</name>
    </transactiontype>
    <transactiontype>
        <name>OrderStatus</name>
    </transactiontype>
    <transactiontype>
        <name>Delivery</name>
    </transactiontype>
    <transactiontype>
        <name>StockLevel</name>
    </transactiontype>
</transactiontypes>
Each procedure is defined in a Java file.
Besides <name>, you can also include <preExecutionWait> and <postExecutionWait> to give a delay in milliseconds before/after running the transaction. So this is one way to add “think time”.
There is also a concept of “supplemental” procedures, but that is not controlled by the config file. Only the SEATS and AuctionMark benchmarks use it. From quickly scanning the code, I think it lets a benchmark define procedures without depending on the user to list them. They won’t be added to the normal transaction queue, but the benchmark can run them elsewhere as needed. For example SEATS uses its supplemental procedure to find out which airports/flights/etc were added in the load step, so it can use them.
The top-level <terminals> element controls the concurrency. This is how many simultaneous connections you want:
<terminals>1</terminals>
But the real behavior comes from the <works> element. This contains <work> child elements, each one a “phase” of your benchmark. For example:
<works>
    <work>
        <time>60</time>
        <rate>10000</rate>
        <weights>45,43,4,4,4</weights>
    </work>
</works>
Here we have one phase lasting 60 seconds.
The <weights> refer to the <transactiontypes> above. Each weight is a percentage giving the share of that procedure in the total transactions. They must add up to 100.
The <rate> gives the targeted transactions per second (per terminal). Mostly this is a way to slow things down, not to speed things up: it is another way to include “think time” in between transactions. If your run doesn’t achieve this rate, it’s not an error.
Each phase can override the top-level concurrency with <active_terminals>5</active_terminals>.
Also you can let the phase start gradually with <work arrival="poisson">. The OLTP-Bench paper demonstrates this technique.
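Putting these pieces together, a two-phase sketch might look like this (the numbers are illustrative, not taken from a sample config):

```xml
<works>
    <!-- warm-up: fewer terminals, gradual Poisson arrivals -->
    <work arrival="poisson">
        <time>30</time>
        <rate>1000</rate>
        <active_terminals>5</active_terminals>
        <weights>45,43,4,4,4</weights>
    </work>
    <!-- measured phase at the top-level terminal count -->
    <work>
        <time>60</time>
        <rate>10000</rate>
        <weights>45,43,4,4,4</weights>
    </work>
</works>
```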
In addition a benchmark may understand other XML elements. For example Twitter lets you give <tracefile> and <tracefile2>, and the benchmark will use those to read tweet ids and user ids (respectively), which it will use as inputs for its transactions (but not every transaction type uses both).
Last Thursday I gave a talk at PDXPUG about using Benchbase to compare the performance of temporal foreign keys. It was a lot of fun, with a really good turnout. There were even folks from Seattle and Bend. After listening for an hour, people stuck around and talked about databases and benchmarks for another two, then the last few holdouts went out for drinks for another hour and a half. At least half the audience were way more qualified to give the talk than me. To my surprise Mark Callaghan was there, who has published database benchmarks non-stop for years.
I had two major goals: to document how to use Benchbase and to report on comparing three implementations of temporal foreign keys. A couple minor goals were to share the start of a broader general-purpose benchmark for temporal databases and to talk about a benchmarking methodology, especially mistakes I made and how I tried to improve.
One silver lining of temporal primary & foreign keys getting reverted is I got to meet Hettie Dombrovskaya and Boris Novikov.
I’ve been working with them to write SQL for various temporal operations not covered by the SQL:2011 standard. There is no support there for outer joins, semijoins, antijoins, aggregates, or set operations (UNION, INTERSECT, EXCEPT). As far as I know no one has ever shown how to implement those operations in SQL. I have queries so far for outer join, semijoin, and antijoin, and I’m planning to include aggregates based on this article by Boris. The set operations look pretty easy to me, so hopefully I’ll have those soon too.
If you’re interested, the repo is on Github.
Saturday I debugged the sprinklers.
I thought I had turned them on two weeks ago, and I heard someone’s sprinklers outside my window the next Monday morning at 5 a.m., but after a week of 100-degree days it was clear ours weren’t doing their job. I had skipped my usual routine of checking each line, unearthing the sunken heads, and replacing what had failed. So now I had to deal with it.
Somehow after living here for ten years I still found two new heads I had never seen before. Here is a map I’ve kept for years, maybe since our first summer:
It has every sprinkler head I’ve seen. Going by the rate I charge clients, that map is worth thousands of dollars.
In the bottom corner is the box where the water comes in from the street. There are more boxes where valves let water into each line.
One year I came across a buried water spigot in the middle of the grass. Then I lost it again.
But this was a valuable spigot. It was over by our raised beds, where there is no other convenient water. You have to drag a hose from across the yard to water there. In 2022 I borrowed a neighbor’s metal detector. I still couldn’t find it. Finally I tore up the grass with a shovel, probing what must have been a 20’ x 20’ area, until finally I heard a metal clink. I extended the pipe and topped it with a copper rabbit spigot I won as a kid at the Redlands Garden Show for a potted cactus garden. I’ve carried that rabbit with me for 35 years, waiting for a chance to use it.
That was two years ago. It’s on my map.
So why is our grass dying?
Naturally I run our sprinklers off a Raspberry Pi. I set it up years ago, back in 2016. The controller that came with the house was dying. Two-thirds of the time when I tried to water line 12 or 13, line 4 or 5 would turn on instead. (Yes, we have 13 sprinkler lines. It’s a big yard.) Almost always it was off by 8, or sometimes 4: pretty clearly some loose wires. Why spend fifty bucks to replace it when I could spend days building my own? Look, at least there is no Kubernetes or CI pipeline, okay?
There were raspi sprinkler products you could buy, and I think I saw an open source project, but that didn’t seem like fun. I wanted control and flexibility. I wanted power. I wanted Raspbian, Python, and cron.
Here is my script, called sprinkle:
#!/usr/bin/env python
# sprinkle - Raspberry Pi sprinkler controller
import time
import RPi.GPIO as GPIO
import sys, signal

# Your sprinkler lines:
# Your sprinkler line 1 goes in array position 0,
# then sprinkler line 2 goes in array position 1,
# etc.
# Each value is the Raspi GPIO pin
# you will connect to that line.
# So if you say
#   sprinkler_lines = [6, 19]
# then you should connect pin 6 to sprinkler line 1,
# and pin 19 to sprinkler line 2.
# sprinkler_lines = [21, 20, 16, 12, 25, 24, 23, 26, 19, 13, 6, 5, 22]
sprinkler_lines = [23, 24, 25, 16, 12, 20, 21, 22, 5, 6, 13, 19, 26]

def usage(err_code):
    print("USAGE: sprinkle.py <sprinkler_line> <number_of_minutes>")
    sys.exit(err_code)

def int_or_usage(str):
    try:
        return int(str)
    except ValueError:
        usage(1)

if len(sys.argv) != 3:
    usage(1)

sprinkler_line = int_or_usage(sys.argv[1])
number_of_minutes = int_or_usage(sys.argv[2])

if sprinkler_line < 1 or sprinkler_line > len(sprinkler_lines):
    print("I only know about sprinkler lines 1 to %d." % len(sprinkler_lines))
    sys.exit(1)

if number_of_minutes < 1 or number_of_minutes > 30:
    print("I don't want to run the sprinklers for %d minutes." % number_of_minutes)
    sys.exit(1)

def exit_gracefully(signal, frame):
    GPIO.cleanup()
    sys.exit(0)

signal.signal(signal.SIGINT, exit_gracefully)

active_pin = sprinkler_lines[sprinkler_line - 1]

GPIO.setmode(GPIO.BCM)
for pin in sprinkler_lines:
    GPIO.setup(pin, GPIO.OUT)
    GPIO.output(pin, False)

GPIO.output(active_pin, True)
time.sleep(60 * number_of_minutes)
GPIO.output(active_pin, False)
exit_gracefully(None, None)
That is a lot of code but it turns on one GPIO pin, sleeps a while, then turns it off. Near the top you can see an array that maps sprinkler lines to GPIO pins. I kept the old sprinkler numbering, so it matches the notes the old owners left us. Array position n means sprinkler line n+1.
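Since the off-by-one mapping is easy to get wrong, here it is spelled out (pin numbers copied from the array above):

```python
# Line numbers are 1-based; the array is 0-based.
sprinkler_lines = [23, 24, 25, 16, 12, 20, 21, 22, 5, 6, 13, 19, 26]

def pin_for_line(line):
    """Return the GPIO pin wired to a 1-based sprinkler line."""
    return sprinkler_lines[line - 1]

print(pin_for_line(1))   # line 1 -> pin 23
print(pin_for_line(13))  # line 13 -> pin 26
```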
Then I have a higher-level script I run each morning out of cron, which does the front on even days and the back on odd. It logs when it starts and finishes, which has helped me a lot:
#!/usr/bin/env python
# do-yard - Run sprinklers for the whole yard.
# We do the front yard on even days and the back yard on odd days.
import time
from subprocess import call

t = time.localtime()
if t.tm_yday % 2:
    print("%s: Starting the back" % time.strftime("%Y-%m-%d %H:%M:%S", t))
    # odd days we do the back yard:
    for line in [4, 5, 6, 7, 8, 12]:
        call(["/home/pi/sprinkle", str(line), "5"])
        time.sleep(60)
    print("%s: Finished the back" % time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
else:
    print("%s: Starting the front" % time.strftime("%Y-%m-%d %H:%M:%S", t))
    # even days we do the front yard (and a little bit of the back):
    for line in [1, 2, 3, 9, 10, 11, 13]:
        call(["/home/pi/sprinkle", str(line), "5"])
        time.sleep(60)
    print("%s: Finished the front" % time.strftime("%Y-%m-%d %H:%M:%S", time.localtime()))
The hard part was figuring out the wiring. I’ve never gone much further than Ohm’s Law. For a long time I was stuck working out how to drive the sprinkler valves. Sprinkler valves use a solenoid to open and shut. In my garage, 13 colored wires come out of the ground, along with one neutral white wire to complete the circuit. Then plugged into the wall is an adapter to produce 24 volt AC, and two wires come out of that. In between used to be the old controller. It would send 24 VAC down whichever wire matched the sprinkler line (& ~(1 << 3)).
The pi outputs 3.3 volts DC. At first I thought there was an integrated circuit that could convert the signal for me, but eventually I resigned myself to using a bank of relays:
Oh also I never learned how to solder.
A relay is a mechanical system. The AC power goes through, but it’s blocked by an open switch. The DC power is on another circuit, and it activates an electromagnet that closes the switch. When you turn on the signal, you see a red light, and the switch closing makes a loud click.
A bank of 16 relays cost $12, almost as much as a sprinkler controller, so I really wanted my ICs to work out. Oh well.
So today I started with checking the log. Well no, because the pi wasn’t responding to ssh again.
It has always been temperamental. After a few hours the wifi dies, sometimes sooner. Pulling the plug for a moment fixes it, but then you have to wait while it boots. So I have to bring a laptop down to the garage, even just to check on things. Today I thought I would finally fix that.
Other people have the same problem. One reported culprit is power-saving mode. I checked and mine was running that way:
pi@raspberrypi:~ $ iw dev wlan0 get power_save
Power save: on
The nicest advice I found was to disable it at boot with systemd. Just run this:
sudo systemctl --full --force edit wifi_powersave@.service
and in your editor enter—ugh, nano? That had to be fixed.
Setting EDITOR in root’s ~/.profile should do it.
No? ~/.bashrc then?
Still no? Back to Stack Overflow… .
No clues. I guess I’m on my own.
What is this .selected_editor file in root’s home directory? Hmm, it already says vim.
Is sudo even launching its command through a shell? Probably not, once I think of it. If it just execs the command directly, no wonder ~/.profile does nothing.
More Stack Overflow. Most questions are about visudo, and I see something called sudoedit, and people are asking how to control which editor that launches. (Why not just run the editor you want? The man page says it lets you keep your own editor configuration. Like my own ~/.vimrc? That’s cool. Really? How does that work?) But in my case the editor is getting launched by systemd. Surely we would have all been happier if we’d just gone with runit?
Does root have $SYSTEMD_EDITOR set? Surely not—no, too bad.
Of course I could just edit the file myself, but it’s the principle of the thing.
Okay, I give up:
sudo visudo -f /etc/sudoers.d/20_editor
I typed this:
Defaults env_keep += "editor EDITOR"
So now when I run sudo, it will pass along my own $EDITOR choice.
Is this a security hole? I can imagine some possible issues on a server, but for the pi in my garage it seems okay.
Now systemd launches vim! Shamelessly, I copied and pasted:
[Unit]
Description=Set WiFi power save %i
After=sys-subsystem-net-devices-wlan0.device
[Service]
Type=oneshot
RemainAfterExit=yes
ExecStart=/sbin/iw dev wlan0 set power_save %i
[Install]
WantedBy=sys-subsystem-net-devices-wlan0.device
I’ve never seen this %i thing before. The idea is it lets you do this:
sudo systemctl disable wifi_powersave@off.service
sudo systemctl enable wifi_powersave@on.service
or this:
sudo systemctl disable wifi_powersave@on.service
sudo systemctl enable wifi_powersave@off.service
That’s cool.
Oh, better not forget to run it now too:
sudo iw dev wlan0 set power_save off
So I turned off power saving. Maybe that will fix the wifi.
Let’s check the log file. Have the sprinklers been running?:
2024-06-30 06:00:01: Starting the front
2024-06-30 06:42:03: Finished the front
2024-07-01 06:00:01: Starting the back
2024-07-01 06:36:03: Finished the back
2024-07-02 06:00:01: Starting the front
2024-07-02 06:42:02: Finished the front
2024-07-03 06:00:01: Starting the back
2024-07-03 06:36:02: Finished the back
2024-07-04 06:00:01: Starting the front
2024-07-04 06:42:03: Finished the front
2024-07-05 06:00:01: Starting the back
2024-07-05 06:36:03: Finished the back
2024-07-06 06:00:01: Starting the front
2024-07-06 06:42:03: Finished the front
2024-07-07 06:00:01: Starting the back
2024-07-07 06:36:02: Finished the back
2024-07-08 06:00:01: Starting the front
2024-07-08 06:42:03: Finished the front
2024-07-09 06:00:01: Starting the back
2024-07-09 06:36:03: Finished the back
2024-07-10 06:00:01: Starting the front
2024-07-10 06:42:02: Finished the front
2024-07-11 06:00:02: Starting the back
2024-07-11 06:36:03: Finished the back
2024-07-12 06:00:01: Starting the front
2024-07-12 06:42:03: Finished the front
2024-07-13 06:00:01: Starting the back
2024-07-13 06:36:03: Finished the back
They’ve been running all along! 42 minutes for the front, 36 for the back.
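Those durations line up with the do-yard script: each line runs for 5 minutes and is followed by a 1-minute sleep.

```python
# Each sprinkler line runs 5 minutes, then do-yard sleeps 60 seconds.
minutes_per_line = 5 + 1

front_lines = [1, 2, 3, 9, 10, 11, 13]  # even days
back_lines = [4, 5, 6, 7, 8, 12]        # odd days

print(len(front_lines) * minutes_per_line)  # 42, matching the log
print(len(back_lines) * minutes_per_line)   # 36, matching the log
```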
But clearly they’re doing nothing. The pi is turning on a pin then just sitting there.
So there must be a loose connection.
I tried line 3: ./sprinkle 3 10. No red light, no click. Line 10. No red light, no click. Line 2. No red light, no click.
I went upstairs to fetch my multimeter. Time to test connectivity and voltage.
How in the world did I wire this thing anyway?
Then I noticed a couple red wire loops, connecting GPIO pins to the breadboard power rail, but detached now from the power rail. The pins both said 5V. (That tiny text was easier to read in 2016.) So those came loose? What if I put them back in again? I think I remember . . . wasn’t this supposed to power the relay?
Trying my sprinkle command again made the light come on! I must have missed the click though. Were the sprinklers running? No? What if I try a few lines? I’m really not hearing the click. But the light is on.
How does each relay work again? I set the multimeter to connectivity to probe each pair of posts. They were more connected than I expected. Was that bad? Okay I remember the white neutral wire running from one relay to another in series. And the colored wires go out and into the ground, one per relay.
I remember something about those two little red wire loops. They really looked disconnected on purpose. They weren’t just loose, they were completely out of the breadboard.
Is anything else loose? A bit, but when I fix it nothing changes.
I remember those two red wires. They are supposed to give 5 volts to power the relay, but it never worked did it? It was supposed to, but it didn’t. Like the pi just didn’t have enough oomph. Or was the board supposed to power the pi?
What are these other two thin black wires leaving the relay board? Where do they go? Off to the right, oh, to a power adapter! Two weeks ago I plugged in the adapter for the pi, and I plugged in the 24 VAC adapter, but the relays need power too, and they get it from the power strip over by the garage freezer.
I guess this is why phone support asks if you’ve plugged it in.
My work adding temporal primary keys and foreign keys to Postgres was reverted from v17. The problem is empty ranges (and multiranges). An empty range doesn’t overlap anything, including another empty range. So 'empty' && 'empty' is false. But temporal PKs are essentially an exclusion constraint using (id WITH =, valid_at WITH &&). Therefore you can insert duplicates, as long as the range is empty:
INSERT INTO t (id, valid_at, name) VALUES (5, 'empty', 'foo');
INSERT INTO t (id, valid_at, name) VALUES (5, 'empty', 'bar');
That might be okay for some users, but it surely breaks expectations for others. And it’s a questionable thing to do that we should probably just forbid. The SQL standard forbids empty PERIODs, so we should make sure that using plain ranges does the same. Adding a record with an empty application time doesn’t really have a meaning in the temporal model.
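For context, the table in those inserts could have been created with an exclusion constraint like this (a sketch, not the committed temporal PK syntax; id WITH = in a GiST index needs the btree_gist extension):

```sql
CREATE EXTENSION IF NOT EXISTS btree_gist;

CREATE TABLE t (
    id int,
    valid_at daterange,
    name text,
    EXCLUDE USING gist (id WITH =, valid_at WITH &&)
);

-- Both inserts above succeed, since 'empty' && 'empty' is false.
```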
I think this is a pretty small bump in the road. At the Postgres developers conference we found a good solution to excluding empty ranges. My original attempt used CHECK constraints, but that had a lot of complications. Forbidding them in the executor is a lot simpler. I’ve already sent in a new set of patches for v18 that implement that change.
Here on illuminatedcomputing.com I’ve got a bunch of sites served by nginx, but I’d like to run a little k3s cluster as well. The main benefit would be isolation. That is always helpful, but it especially matters for staging sites for some customers who don’t update very often.
Instead of migrating everything all at once, I want to keep my host nginx but let it reverse proxy to k3s for sites running there. Then I will block direct traffic to k3s, so that there is only one way to get there. I realize this is not really a “correct” way to do k8s, but for a tiny setup like mine it makes sense. Maybe I should have just bought a separate box for k3s, but I find that pushing tools a bit outside their intended use is a good way to learn how they really work, and that’s what happened here.
It was harder than I thought. I found one or two people online seeking to do the same thing, but there were no good answers. I had to figure it out on my own, and now maybe this post will help someone else.
The first step was to run k3s on other ports. I’m using the ingress-nginx ingress controller via a Helm chart. In my values.yaml I have it bind to 8080 and 8443 instead:
ingress-nginx:
  controller:
    enableHttp: true
    enableHttps: true
    service:
      ports:
        http: 8080
        https: 8443
Then I can see the Service is using those ports:
paul@tal:~/src/illuminatedcomputing/k8s$ k get services -A
NAMESPACE NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
ingress ingress-ingress-nginx-controller LoadBalancer 10.43.91.109 107.150.34.82 8080:31333/TCP,8443:30702/TCP 7d20h
...
Setting up nginx to reverse proxy was also no problem. For example here is a private docker registry I’m running:
server {
    listen 443 ssl;
    server_name docker.illuminatedcomputing.com;

    ssl_certificate ssl/docker.illuminatedcomputing.com.crt;
    ssl_certificate_key ssl/docker.illuminatedcomputing.com.key;

    location / {
        proxy_pass https://127.0.0.1:8443;
        proxy_set_header Host "docker.illuminatedcomputing.com";
    }
}

server {
    listen 80;
    server_name docker.illuminatedcomputing.com;

    location / {
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host "docker.illuminatedcomputing.com";
    }
}
The only tricky part is the ssl cert. I already had the cluster built to get certs from LetsEncrypt with cert-manager. So I have a little cron script that pulls out the k8s Secret and puts it where the host nginx can find it:
#!/bin/bash
exec > >(tee /var/log/update-k3s-ssl-certs.log) 2>&1
echo "$(date -Iseconds) starting"
set -eu
# Everything running in k8s needs to be proxied by nginx,
# so pull the ssl certs and drop them where nginx can find them.
# Do this every day so that we pick up LetsEncrypt renewals.
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
# docker.illuminatedcomputing.com
kubectl get secret -n docker-registry docker-registry-tls -o json | jq -r '.data["tls.crt"] | @base64d' > /etc/nginx/ssl/docker.illuminatedcomputing.com.crt
kubectl get secret -n docker-registry docker-registry-tls -o json | jq -r '.data["tls.key"] | @base64d' > /etc/nginx/ssl/docker.illuminatedcomputing.com.key
# need to reload nginx to see new certs
systemctl reload nginx
echo "$(date -Iseconds) finished"
Probably it would be easier to run certbot on the host and push the cert into k8s (or just terminate TLS), but using cert-manager is what I’d do for a customer, and I’m hopeful that eventually I’ll drop the reverse proxy altogether.
So at this point connecting works:
curl -v https://docker.illuminatedcomputing.com/v2/_catalog
(Of course it will be a 401 without the credentials, but you are still getting through to the service.)
The problem is that this works too:
curl -v https://docker.illuminatedcomputing.com:8443/v2/_catalog
So how can I block that port from everything but the host nginx? I tried making the controller bind to just 127.0.0.1, e.g. with this config:
ingress-nginx:
  controller:
    config:
      bind-address: "127.0.0.1"
    enableHttp: true
    enableHttps: true
    service:
      externalIPs:
      - "127.0.0.1"
      ports:
        http: 8080
        https: 8443
The bind-address line adds to a ConfigMap used to generate the nginx.conf. It doesn’t work though: the 127.0.0.1 is from the perspective of the controller pod, not the host’s 127.0.0.1.
Using externalIPs (with or without bind-address) also fails. When I add those two lines, k3s gives this error:
Error: UPGRADE FAILED: cannot patch "ingress-ingress-nginx-controller" with kind Service: Service "ingress-ingress-nginx-controller" is invalid: spec.externalIPs[0]: Invalid value: "127.0.0.1": may not be in the loopback range (127.0.0.0/8, ::1/128)
So I gave up on that approach.
But what about using iptables to block 8443 and 8080 from the outside? That’s probably simpler anyway—although k3s adds a big pile of its own iptables rules, and diving into that was a bit intimidating.
The first thing I tried was putting a rule at the top of the INPUT chain. I tried all these:
iptables -I INPUT -p tcp \! -s 127.0.0.1 --dport 8443 -j DROP
iptables -I INPUT -p tcp \! -i lo --dport 8443 -j DROP
iptables -I INPUT -p tcp -i enp2s0 --dport 8443 -j DROP
But none of those worked. I could still get through.
At this point a friend asked ChatGPT for advice, but it wasn’t very helpful. It told me
Instead of having the ingress controller listen on an external IP or trying to make it listen only on 127.0.0.1, configure your host’s nginx to proxy_pass to your k3s services.
Yes, I had explained I was doing that. Also:
You could create a network policy that only allows traffic to the ingress-nginx pods from within the cluster itself.
But that will block the reverse proxy too.
So the cyber Pythia was not coming through for me. I was going to have to figure it out on my own. That meant coming to grips with all the rules k3s was installing.
I started with adding some logging, for example:
iptables -I INPUT -p tcp -d 107.150.34.82 -j LOG --log-prefix '[PJPJPJ] '
Tailing /var/log/syslog, I could see messages for 443 requests, but nothing for 8443!
So I took a closer look at the nat table (which is processed before the filter table), and I found some relevant rules:
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A KUBE-EXT-2ZARXDYICCJUF4UZ -m comment --comment "masquerade traffic for ingress/ingress-ingress-nginx-controller:https external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-2ZARXDYICCJUF4UZ -j KUBE-SVC-2ZARXDYICCJUF4UZ
-A KUBE-EXT-DBDMS67BVV2C2LTP -m comment --comment "masquerade traffic for ingress/ingress-ingress-nginx-controller:http external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-DBDMS67BVV2C2LTP -j KUBE-SVC-DBDMS67BVV2C2LTP
-A KUBE-SEP-RQCBIXXO7M53R2WC -s 10.42.0.42/32 -m comment --comment "ingress/ingress-ingress-nginx-controller:https" -j KUBE-MARK-MASQ
-A KUBE-SEP-RQCBIXXO7M53R2WC -p tcp -m comment --comment "ingress/ingress-ingress-nginx-controller:https" -m tcp -j DNAT --to-destination 10.42.0.42:443
-A KUBE-SEP-TXLMBMTNQTOOKDI3 -s 10.42.0.42/32 -m comment --comment "ingress/ingress-ingress-nginx-controller:http" -j KUBE-MARK-MASQ
-A KUBE-SEP-TXLMBMTNQTOOKDI3 -p tcp -m comment --comment "ingress/ingress-ingress-nginx-controller:http" -m tcp -j DNAT --to-destination 10.42.0.42:80
-A KUBE-SERVICES -d 107.150.34.82/32 -p tcp -m comment --comment "ingress/ingress-ingress-nginx-controller:https loadbalancer IP" -m tcp --dport 8443 -j KUBE-EXT-2ZARXDYICCJUF4UZ
-A KUBE-SERVICES -d 107.150.34.82/32 -p tcp -m comment --comment "ingress/ingress-ingress-nginx-controller:http loadbalancer IP" -m tcp --dport 8080 -j KUBE-EXT-DBDMS67BVV2C2LTP
-A KUBE-SVC-2ZARXDYICCJUF4UZ ! -s 10.42.0.0/16 -d 10.43.91.109/32 -p tcp -m comment --comment "ingress/ingress-ingress-nginx-controller:https cluster IP" -m tcp --dport 8443 -j KUBE-MARK-MASQ
-A KUBE-SVC-2ZARXDYICCJUF4UZ -m comment --comment "ingress/ingress-ingress-nginx-controller:https -> 10.42.0.42:443" -j KUBE-SEP-RQCBIXXO7M53R2WC
-A KUBE-SVC-DBDMS67BVV2C2LTP ! -s 10.42.0.0/16 -d 10.43.91.109/32 -p tcp -m comment --comment "ingress/ingress-ingress-nginx-controller:http cluster IP" -m tcp --dport 8080 -j KUBE-MARK-MASQ
-A KUBE-SVC-DBDMS67BVV2C2LTP -m comment --comment "ingress/ingress-ingress-nginx-controller:http -> 10.42.0.42:80" -j KUBE-SEP-TXLMBMTNQTOOKDI3
If you follow how that bounces around, it eventually gets rerouted to 10.42.0.42, either :443 or :80. So that’s why a connection to 8443 never hits the INPUT chain.
So the solution was to drop the traffic in the nat table instead:
root@www:~# iptables -I PREROUTING -t nat -p tcp -i enp2s0 --dport 8443 -j DROP
iptables v1.8.4 (legacy):
The "nat" table is not intended for filtering, the use of DROP is therefore inhibited.
Oops, just kidding!
But instead I can just tell 8080 & 8443 to skip all the k3s rewriting:
iptables -I PREROUTING -t nat -p tcp -i enp2s0 --dport 8443 -j RETURN
iptables -I PREROUTING -t nat -p tcp -i enp2s0 --dport 8080 -j RETURN
Now those do show up on the INPUT chain, but I don’t even need to DROP them there. There is nothing actually listening on those ports: the controller is still binding to 443 and 80, and k3s is using iptables trickery to reroute connections to those ports. So those two lines above are sufficient, and someone connecting directly gets a Connection refused.
To make this run each time the machine boots, I wrote a script at /usr/local/bin/iptables-custom.sh:
#!/bin/bash
# Installs some rules to prevent 8443 and 8080 from getting routed to k8s from the outside world,
# so that you must access them via our nginx reverse proxy.
(iptables -L -n -t nat | grep '^RETURN.*8443$' >/dev/null) || iptables -t nat -I PREROUTING -p tcp -i enp2s0 --dport 8443 -j RETURN
(iptables -L -n -t nat | grep '^RETURN.*8080$' >/dev/null) || iptables -t nat -I PREROUTING -p tcp -i enp2s0 --dport 8080 -j RETURN
Then I put this unit file at /etc/systemd/system/iptables-custom.service:
[Unit]
Description=adds custom iptables rules after k3s has started
After=k3s.service
[Service]
Type=oneshot
ExecStart=/usr/local/bin/iptables-custom.sh
[Install]
WantedBy=default.target
Then I ran systemctl daemon-reload and systemctl enable iptables-custom.
That’s it! I hope this is helpful, or that you at least enjoyed the story.