An nginx HTTP-to-HTTPS Redirect Mystery, and Configuration Advice


I noticed a weird thing last night on an nginx server I administer. The logs were full of lines like this:

- - [25/Mar/2018:04:50:49 +0000] "GET HTTP/1.1" 301 185 "" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0; Hotbar; RogueCleaner; Alexa Toolbar)"

Traffic was streaming in continuously: maybe ten or twenty requests per second.

At first I thought the server had been hacked, but really it seemed people were just sending lots of traffic and getting 301 redirects. I could reproduce the problem with a telnet session:

$ telnet 80
Connected to
Escape character is '^]'.

HTTP/1.1 301 Moved Permanently
Server: nginx/1.10.1
Date: Sun, 25 Mar 2018 04:56:06 GMT
Content-Type: text/html
Content-Length: 185
Connection: keep-alive

<head><title>301 Moved Permanently</title></head>
<body bgcolor="white">
<center><h1>301 Moved Permanently</h1></center>

In that session, I typed the first two lines after Escape character..., plus the blank line following. Normally a browser would not include a whole URL after GET, only the path, like GET /about.html HTTP/1.1, but including the whole URL is used when going through a proxy. Also it may be possible to leave off the Host header. Technically it is required for HTTP/1.1, so I added it just out of habit. I didn’t test without it.

So what was happening here? I was following some common advice to redirect HTTP to HTTPS, using configuration like this:

server {
  listen 80;
  server_name *;
  return 301 https://$host$request_uri;
}

The problem is that $host evaluates to whatever the client sends. In order of precedence, it can be (1) the host name from the request line (as in my example), (2) the Host header, or (3) what you declared as the server_name for the matching block. A safer alternative is to send people to https://$server_name$request_uri. Then everything is under your control. You can see people recommending that on the ServerFault page.
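To make that precedence concrete, here is a small JavaScript model of it. This is only a sketch of the three rules listed above, not nginx's actual code: the function name, the shape of the request object, and the example host names are all made up for illustration, and dropping the port from the Host header is a simplification of this sketch.

```javascript
// Simplified model of how $host gets its value (illustration only, not nginx source).
// `request.target` is the second token of the request line; `request.hostHeader`
// is the Host header, if any. Both field names are hypothetical.
function resolveHost(request, serverName) {
  // 1. Host name from an absolute-form request line: "GET http://host/path HTTP/1.1"
  const match = /^https?:\/\/([^\/:]+)/i.exec(request.target);
  if (match) return match[1];
  // 2. Host header (this sketch drops any ":port" suffix)
  if (request.hostHeader) return request.hostHeader.replace(/:\d+$/, "");
  // 3. The server_name of the block that matched the request
  return serverName;
}

// A request line with a full URL wins even when the Host header names my own site.
const spoofed = resolveHost(
  { target: "http://elsewhere.example/page", hostHeader: "mysite.example" },
  "mysite.example"
);
console.log(spoofed); // elsewhere.example
```

The point is the first branch: the client fully controls the value, so a redirect built from $host goes wherever the client says.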

The problem with $server_name comes when you declare more than one server_name, or when one of them is a wildcard. The $server_name variable always evaluates to the first one. It also doesn’t expand wildcards. (How could it?) That wouldn’t work for me, because in this project admins can add new subdomains any time, and I don’t want to update nginx config files when that happens.

Eventually I solved it using a config like this:

server {
  listen 80 default_server;
  return 301$request_uri;
}
server {
  listen 80;
  server_name *;
  return 301 https://$host$request_uri;
}

Notice the default_server modifier. If any traffic actually matches *, it will use the second block, but otherwise it will fall back to the first block, where there is no $host variable, but just a hardcoded redirect to my own domain. After I made this change, I immediately saw traffic getting the redirect and making a second request back to my own machine, usually getting a 404. I expect pretty soon whoever is sending this traffic will knock it off. If not, I guess it’s free traffic for me. :-)

(Technically default_server is not required since if no block is the declared default, nginx will make the first the default automatically, but being explicit seems like an improvement, especially here where it matters.)

I believe I could also use a setup like this:

server {
  listen 80 default_server;
  return 301 "";
}
server {
  listen 80;
  server_name *;
  return 301 https://$host$request_uri;
}

There I list all my legitimate domains in the second block, so the default only matches when people are playing games. I guess I’m too nice to do that for real though, and anyway it would make me nervous that some other misconfiguration would activate that first block more often than I intended.

I’d still like to know what the point of this abuse was. My server wasn’t acting as an open proxy exactly, because it wasn’t fulfilling these requests on behalf of the clients and passing along the response (confirmed with tcpdump -n 'tcp[tcpflags] & (tcp-syn) != 0 and src host'); it was just sending a redirect. So what was it accomplishing?

The requests were for only a handful of different domains, mostly Chinese. They came from a variety of IPs. Sometimes an IP would make requests for hours and then disappear. The user agents varied. Most were normal, like Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0), but some mentioned toolbars like the example above.

I guess if it were DNS sending them to my server there would (possibly) be a redirect loop, which I wasn’t seeing. So was my server configured as their proxy?

To learn a little more, I moved nginx over to port 81 and ran this:

mkfifo reply
netcat -kl 80 < reply | tee saved | netcat 81 > reply

(At first instead of netcat I tried ./mitmproxy --save-stream-file +http.log --listen-port 80 --mode reverse: --set keep_host_header, but it threw errors on requests with full URLs (GET HTTP/1.1) because it thought it should only see those in regular mode.)

Once netcat was running I could tail -F saved in another session. I saw requests like this:

User-Agent: Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)
Accept: text/html, */*
Accept-Language: zh-cn; en-us
Pragma: no-cache

I also saw one of these:

User-Agent: Opera/9.80 (Windows NT 6.1; U; en) Presto/2.8.131 Version/11.11
Content-Length: 0
Proxy-Connection: Keep-Alive
Pragma: no-cache

That is a more normal proxy request, although it seems like it was just regular scanning, because I’ve always returned a 400 to those.

Maybe the requests that were getting 301’d were just regular vulnerability scanning too? I don’t know. It seemed like something more specific than that.

The negatives for me were noisy logs and elevated bandwidth/CPU. Not a huge deal, but whatever was going on, I didn’t want to be a part of it.

. . .

By the way, as long as we’re talking about redirecting HTTP to HTTPS, I should mention HSTS, which is a way of telling browsers never to use HTTP here in the future. If you’re doing a redirect like this, it may be a good thing to add (to the HTTPS response, not the HTTP one). On the other hand it has some risks, if in the future you ever want to use HTTP again.

Counting Topologically Distinct Directed Acyclic Graphs with Marshmallows


I wrote a miniature Ruby gem to topologically sort a Directed Acyclic Graph (DAG), which is useful when you have a bunch of things that depend on each other (e.g. tasks), and you want to put them in a linear order.
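For readers who have not met topological sorting before, here is a minimal JavaScript sketch of one standard way to do it (Kahn's algorithm). It is not the gem's implementation; the function name and the [from, to] edge format are choices made for this example.

```javascript
// Kahn's algorithm. `edges` is a list of [from, to] pairs meaning
// "from must come before to". Throws if the graph contains a cycle.
function topoSort(vertices, edges) {
  const inDegree = new Map(vertices.map((v) => [v, 0]));
  const children = new Map(vertices.map((v) => [v, []]));
  for (const [from, to] of edges) {
    children.get(from).push(to);
    inDegree.set(to, inDegree.get(to) + 1);
  }
  // Start with every vertex that depends on nothing.
  const ready = vertices.filter((v) => inDegree.get(v) === 0);
  const order = [];
  while (ready.length > 0) {
    const v = ready.shift();
    order.push(v);
    // "Remove" v from the graph; any child left with no incoming edges is ready.
    for (const child of children.get(v)) {
      inDegree.set(child, inDegree.get(child) - 1);
      if (inDegree.get(child) === 0) ready.push(child);
    }
  }
  if (order.length !== vertices.length) throw new Error("graph has a cycle");
  return order;
}

// A diamond: a before b and c, both of which come before d.
const diamondOrder = topoSort(
  ["a", "b", "c", "d"],
  [["a", "b"], ["a", "c"], ["b", "d"], ["c", "d"]]
);
console.log(diamondOrder.join(" ")); // a b c d
```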

Writing the test suite got me thinking about how to find all the topologically distinct directed acyclic graphs with number of vertices V and edges E. My current algorithm goes like this:

  1. Start with some large number n of toothpicks and marshmallows.

  2. Call the children.

  3. Try to finish before all the marshmallows are gone.

Here is what we ended up with for all graphs of V = 4:

[Photo: Topologically distinct directed acyclic graphs with four vertices]

It’s not bad I think, but is a method known that works even without children? Are there any graphs I missed?

(Full disclosure: I redid this photo a couple days later with better-colored toothpicks, so now you can tell which way they point. Marshmallows may be crunchier than they appear.)

Temporal Databases Annotated Bibliography


I’ve been reading about temporal databases for a few years now, so I think it’s time I share my bibliography and notes. This is presented in “narrative order”, so that you can get a sense of how the research has developed. This article somewhat overlaps a mini literature review I wrote on the Postgres hackers mailing list, but this article is more complete and in a place where I can keep it updated.

Temporal databases let you track the history of things over time: both the history of changes to the database (e.g. for auditing) and the history of the thing itself. They are not the same thing as time-series databases: whereas a time-series database has time-stamped events, a temporal database stores the history of things, typically by adding a start/end time to each row (so two timestamps, not one). With time-series the challenge is typically scale; with temporal the challenge is with complexity and correctness.
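To make the "two timestamps per row" idea concrete, here is a tiny JavaScript sketch. The price history, the field names validFrom and validTo, and the two helper functions are all invented for illustration; a real temporal table would live in the database. Each row is valid over a half-open interval, and the overlap test shows the kind of logic that is easy to get subtly wrong by hand.

```javascript
// Hypothetical history of one product's price. Each row is valid over the
// half-open interval [validFrom, validTo); a null validTo means "still current".
// ISO date strings compare correctly as plain strings.
const priceHistory = [
  { price: 10, validFrom: "2016-01-01", validTo: "2017-01-01" },
  { price: 12, validFrom: "2017-01-01", validTo: "2018-06-01" },
  { price: 15, validFrom: "2018-06-01", validTo: null },
];

// "What was true as of this date?" Returns undefined if no row covers it.
function asOf(rows, date) {
  return rows.find(
    (r) => r.validFrom <= date && (r.validTo === null || date < r.validTo)
  );
}

// Two half-open intervals overlap when each one starts before the other ends.
function overlaps(a, b) {
  const aEndsAfterBStarts = a.validTo === null || b.validFrom < a.validTo;
  const bEndsAfterAStarts = b.validTo === null || a.validFrom < b.validTo;
  return aEndsAfterBStarts && bEndsAfterAStarts;
}

console.log(asOf(priceHistory, "2017-03-15").price); // 12
console.log(overlaps(priceHistory[0], priceHistory[1])); // false: they only touch
```

Because the intervals are half-open, adjacent rows that share a boundary date do not count as overlapping, which is what lets a history be contiguous without being contradictory.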


Snodgrass, Richard T. Developing Time-Oriented Database Applications in SQL. 1999. The seminal work on temporal databases and still the most useful introduction I know. Covers the “combinatorial explosion” of non-temporal/state-temporal/system-temporal/bi-temporal tables, current/sequenced/non-sequenced queries, SELECT/INSERT/UPDATE/DELETE, different RDBMS vendors, etc. Very similar to the proposed TSQL2 standard that was ultimately not accepted but still influenced Teradata’s temporal support. Available as a free PDF from his website.

Hugh Darwen and C. J. Date. “An Overview and Analysis of Proposals Based on the TSQL2 Approach.” Latest draft 2005, but originally written earlier. Criticizes the TSQL2 proposal’s use of “statement modifiers”, especially their problems with composability when a view/subquery/CTE/function returns a temporal result. Available as a PDF.

Ralph Kimball and Margy Ross. The Data Warehouse Toolkit. 3rd edition, 2013. (2nd edition 2002, 1st 1996.) (My notes are based on reading the 2nd edition, but I don’t think there are major relevant changes.) This book is not about temporal databases per se, but in Chapter 4 (and scattered around elsewhere) he talks about dealing with data that changes over time (“Slowly Changing Dimensions”). His first suggestion (Type 1) is to ignore the problem and overwrite old data with new. His Type 2 approach (make a new row) is better but loses the continuity between the old row and the new. Type 3 fixes that but supports only one change, not several. This writing is evidence for the need to handle temporal data, and the contortions that result from not having a systematic approach. (pdf)

C. J. Date, Hugh Darwen, Nikos Lorentzos. Time and Relational Theory, Second Edition: Temporal Databases in the Relational Model and SQL. 2nd edition, 2014. (First edition published in 2002.) I haven’t read this one yet but I would love to see what Date’s ideal system looks like. If you’ve read his other works you know that he is quite rigorous, often critical of SQL’s compromises vs the pure relational model (e.g. NULL and non-distinct results), and not always very practical. I think his idea might look something like sixth-normal form, which would be great for temporal DDL but sounds burdensome to use.

SQL:2011 Draft standard. (pdf) Personally I find the standard pretty disappointing. It uses separate start/end columns instead of built-in range types, although range types offer benefits like exclusion constraints and convenient operators for things like “overlaps” that are verbose to code correctly by hand. It only mentions inner joins, not the various outer joins, semi-joins (EXISTS), anti-joins (NOT EXISTS), or aggregates. Many of its features apply only to system-time, not application-time, even though application-time is the more interesting and less-available feature. (There are lots of auditing add-ons, but almost nothing for tracking the history of things.) The syntax seems too specific, lacking appropriate generality. A lot of these drawbacks seem motivated by a goal that goes back to TSQL2: to let people add temporal support to old tables without breaking any existing queries. That has always seemed to me like an unlikely possibility, and an unfortunate source of distortions. I don’t expect something for free, and I don’t mind doing work to migrate a table to a temporal format, as long as the result is good. Instead we get an (ostensible) one-time benefit for a prolonged compromise in functionality and ease-of-use.

Krishna Kulkarni and Jan-Eike Michels. “Temporal Features in SQL:2011”. SIGMOD Record, September 2012. Nice overview of the temporal features included in the SQL:2011 standard. Here is a PDF of the paper. See also these slides by Kulkarni.

Peter Vanroose. “Temporal Data & Time Travel in PostgreSQL,” FOSDEM 2015. (Slides as a pdf) Lots of good detail here about SQL:2011. I’d love to see a recording of this talk if it’s available, but I haven’t found it yet.

Tom Johnston and Randall Weis. Managing Time in Relational Databases: How to Design, Update and Query Temporal Data. 2010. I haven’t read this one yet, although see just below for Johnston’s other book. This one sounds more practical and less original, although I don’t know for sure.

Tom Johnston. Bitemporal Data: Theory and Practice. 2014. I felt like I found a kindred soul when I read how he connects database design and ontology, as I’ve always thought of programming as “applied philosophy.” Databases as Aristotelian propositional logic is inseparable from the mathematical set-based theory. Johnston gives helpful distinctions between the physical rows in the table, the assertions they represent, and the things themselves. Eventually this leads to a grand vision of connecting every assertion’s bitemporal (or tritemporal) history to its speaker, somewhat like some ideas in the Semantic Web, although this doesn’t sound very practical. Like Date he seems to be landing on something like sixth-normal form, with a view-like presentation layer to bring all the attributes back together again. He points out how unsatisfactory Kimball’s suggestions are. He also criticizes the limitations of SQL:2011 and offers some amendments to make it more useful. Describes a (patented) idea of “episodes” to optimize certain temporal queries.

Anton Dignös, Michael H. Böhlen, and Johann Gamper. “Temporal Alignment”, SIGMOD ’12. Amazing! Shows how to define temporal versions of every relational operator by a combination of the traditional operators and just two simple transforms, which they call “align” and “split”. Gives a very readable exposition of the new theory and then describes how they patched Postgres 9.0 and benchmarked the performance. I think this solves the composability problems Date objected to in TSQL2, and unlike SQL:2011 it is general and comprehensive. The focus is on state-time, and I’m not sure how it will map onto bi-temporal, but even just having good state-time functionality would be tremendous. And the paper is only 12 easy-to-read pages! (pdf)

Anton Dignös, Michael Hanspeter Böhlen, Johann Gamper, and Christian S. Jensen. “Extending the kernel of a relational DBMS with comprehensive support for sequenced temporal queries,” ACM Transactions on Database Systems, 41(4):1-46. Continues the previous paper but adds support for scaling the inputs to aggregate groups according to how much of their time period goes into each group. Gives more benchmarks against a patched Postgres 9.5. (pdf) These researchers are now trying to contribute their work to the Postgres core project, of which I am very much in favor. :-)


Finally here are some tools for temporal support in Postgres. The sad theme is that pretty much everything gives audit support but not history support:


The pgaudit extension looks pretty useful but I haven’t tried it yet. According to the AWS docs you can even use this on RDS.

Vlad Arkhipov’s temporal tables extension only supports system-time (auditing). Also on Github and a nice writeup by Clark Dave.

Magnus Hagander presented an approach to capturing system-time history in a separate schema at PGConf US 2015 and PGDay’15 Russia. Here are slides and video. Quite elegant if you want to ask questions like “what did we think as of time t?” If I recall correctly this is similar to one of the ideas proposed at the end of Snodgrass, although I haven’t compared them carefully. Hagander points out that DDL changes against temporal databases are challenging and hopefully infrequent. This is a topic that is almost completely absent from the literature, except for a brief mention in Johnston 2014.


Chronomodel extends the ActiveRecord ORM to record system-time history. The pg_audit_log gem works fine but like many audit solutions is rather write-only. I wouldn’t want to build any functionality that has to query its tables to reconstruct history. You could also try paper_trail or audited (formerly acts_as_audited). Of these projects only Chronomodel seems to be aware of temporal database research.

Further Research

Temporal databases are exciting because there is still so much to do. For example:

  • What should the UI look like? Even one dimension adds a lot of complexity, let alone bi-temporal. How do you present this to users? As usual an audit history is easier, and it’s possible to find existing examples, whereas a state-time history is more challenging but probably more valuable. How should we let people view and edit the history of something? How does it work if there is a “Save” button vs save-as-you-type?

  • What does “full stack” temporal support look like? Do we extend REST? What would be a nice ORM interface? Should we use triggers to hide the temporal behavior behind regular-looking SQL? Or maybe extend SQL so you can more explicitly say what you want to do?

  • SELECT support for “as of” semantics or “over time” semantics.

  • Temporal foreign keys. I’m working on this one.

  • DDL changes. For example if you want to add a NOT NULL column, what do you do with the old data? Could there be built-in support to apply constraints only to a given time span?

Postgres isn't running the archive_command on my standby


This just came up on the Postgres mailing list, and I spent a long time figuring it out a few months ago, so maybe this blog post will make it a bit more Googleable.

The problem is you have a master archiving to a standby, but you want the standby to run an archive command too, either to replicate downstream to another standby, or to take pressure off the master when running base backups, e.g. with the WAL-E backup tool. For some reason the master’s archive_command runs fine, but the standby’s isn’t even getting used!

The issue is that in 9.5 and 9.6, Postgres will ignore an archive_mode=on setting if it is running in standby mode. Arguably this is kind of useful, so that you can set up the standby as close to the master as possible, and if you fail over it will immediately start running that command.

But if you really do want to do archiving from the standby, the solution is to say archive_mode=always. Once you make that change, Postgres will start running your archive_command.
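The behavior boils down to a small truth table. Here it is as a JavaScript sketch; the function is purely illustrative (nothing like it exists in Postgres), and it only models the three archive_mode values discussed in this post.

```javascript
// Illustrative model only: does a server with this archive_mode setting
// run its archive_command? `isStandby` is true while in standby mode.
function runsArchiveCommand(archiveMode, isStandby) {
  switch (archiveMode) {
    case "off":
      return false; // never archives
    case "on":
      return !isStandby; // archives only once it is acting as the master
    case "always":
      return true; // archives in standby mode too
    default:
      throw new Error(`unknown archive_mode: ${archiveMode}`);
  }
}

console.log(runsArchiveCommand("on", true)); // false: the surprise in this post
console.log(runsArchiveCommand("always", true)); // true
```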

Btw, if you are using Ansible, as of today the postgresql role does not respect always. If you give it something truthy it will always generate on. I’ve written a pull request to support always, but it is not yet merged.

Javascript Daylight Savings Time: One Weird Trick Your Application Hates


I’ve talked in the past about how to handle timezones in Rails, so here is a tip for handling timezones in Javascript, in particular around Daylight Savings Time.

Suppose you have a time: April 3, 2017, at midnight Pacific Time. You want to express it as UTC in ISO 8601 format, for instance to send it over the wire as JSON. The result is "2017-04-03T07:00:00.000Z". Note the 07:00. Pacific Time is -8 hours from UTC during Standard Time, and -7 hours during Daylight Savings Time. April 3 falls in Daylight Savings Time.

Now suppose we change the year: April 3, 1969, still at midnight Pacific Time. DST started later that year, so now the answer is "1969-04-03T08:00:00.000Z". But if you run new Date(1969, 3, 3).toISOString() in your browser, the result depends on the browser: it might be correct, or you might see a 07:00 again.
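To check the two expected answers without trusting the browser's DST rules, you can do the arithmetic yourself with Date.UTC, which never consults the local timezone. This is just a sketch for verifying the numbers above: the helper name is made up, and you have to supply the UTC offset yourself (7 and 8 hours here, taken from the text), which is exactly the part a timezone database normally handles for you.

```javascript
// Midnight local time on (year, month, day), for a zone that is
// `hoursBehindUtc` hours behind UTC, as an ISO 8601 UTC string.
// Note: `month` is 1-based here, unlike the Date constructor.
function midnightToUtcIso(year, month, day, hoursBehindUtc) {
  // Date.UTC interprets its arguments as UTC, so local DST rules never apply.
  return new Date(Date.UTC(year, month - 1, day, hoursBehindUtc)).toISOString();
}

console.log(midnightToUtcIso(2017, 4, 3, 7)); // 2017-04-03T07:00:00.000Z (DST in effect)
console.log(midnightToUtcIso(1969, 4, 3, 8)); // 1969-04-03T08:00:00.000Z (Standard Time)
```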

Believe it or not, the original Javascript specification said that browsers should use the current year’s Daylight Savings Time transition dates when building dates from any year. If you just re-read that sentence in disbelief and still think it is too crazy to be real, here is a conversation with links to the old and new spec. I think it’s crazy too!

Right now, some browsers do the right thing (ignore the old spec), some do the wrong thing (follow the old spec), and it also depends on what version you’re running. It even seems to depend on what year you’re asking about. For instance modern Chrome seems to give me the right answers back to 1970, but then is wrong before that. Also, even if your browser does the wrong thing, you might still get lucky based on the current year and the date you’re trying to build. I wrote a jsbin page you can load in multiple browsers to see if they agree.

I think the only safe answer is to use moment-timezone to build your dates. For instance if you know the timezone:[y, m, d], tz)

or if you don’t:[y, m, d],

(And don’t forget the m is off by one.)

If you need to force that back into a regular Date object, you could do:

new Date([y, m, d], tz).toJSON())

Just make sure that you’re using moment-timezone-with-data.js, not plain moment-timezone.js, or you’ll still be relying on the browser’s own idiosyncratic behavior.

I hope this is helpful to someone. If your users enter birthdays with some kind of date picker, you probably suffer from this bug!

What to Learn


I once heard a tech speaker say that in programming her job description was “learn new things,” and I’m happy to steal that way of putting it. In seventeen years of professional work I’ve never done a project that didn’t require me to learn something new on-the-job. It’s what I love about programming. But how do you decide what to learn?

Some things you learn because your project demands it, and those lessons are small and focused (hopefully most of the time): a new library here, some protocol detail there, today something about Linux, tomorrow something about Docker. This is stuff you do on the job, practically every day.

That’s not what I’m talking about here, but I’ll offer some advice in passing: when you learn a new tidbit, try to write it down. You don’t have to spend time polishing it, and if it helps then write it somewhere private. But write it down. I use personal man pages for this—and I’m not very good at it myself. If you’re bad at it too, then at the very least spend an extra 20 minutes making sure you actually understand, and come up with a few experiments to test that your take is correct. Try to put your understanding into words, at least in your own head.

But instead of that on-the-spot learning, I want to talk about things we learn that take more time and have a more long-term payoff. We typically do this off the job, without pay. I think most programmers love learning (or they would soon find a different job), so we can’t help ourselves. But also it pays to keep your skills current and sharp. Every professional has to do this. My favorite book about professional services work (which despite the title is about way more than managing) talks about developing your “asset”—you—by continuously learning. Lawyers, architects, accountants, doctors—all have to keep learning. Car mechanics too. With programmers the pace is different, but I expect the world doesn’t exactly stand still for anyone else either.

There is so much to learn! And the hype is everywhere, stealing your attention and diffusing your time. Prototype, jQuery, Backbone, Knockout, Angular, Ember, React, Vue, … Less, Sass, Uglify, Asset Pipeline, Npm, Bower, Babel, Gulp, Grunt, Ember-cli, Webpack, … Oracle, MySQL, Postgres, Memcached, Cassandra, Mongo, CouchDB, Redis, Riak, DynamoDB, … Aaah! You can’t learn it all, so you have to be deliberate.

I often hear advice to learn one new language a year, two new languages a year, whatever. The best versions of this advice say to learn a new “kind” of language, like Lisp or Haskell or Prolog. For a new programmer, that’s pretty good advice, and I still follow it myself. (For me the last few years it’s been Haskell, Rust, and Elixir.) But for several years I’ve tried to adopt a more strategic approach. One of the problems with learning another language is that either it’s something you won’t actually use, or you mostly leave behind the old one, so it’s like starting over from zero. (Not really, but a little bit.) After a couple dozen you start to wonder how to make the investment more worthwhile. Is there a way to make our learning build on itself, so we aren’t throwing away so much time? Here is my own approach to having some “continuity” in what I learn:

First, realize that there are so many more categories besides language! There are operating systems, cloud environments, back-end frameworks, front-end frameworks, databases, build tools, deployment tools, networking protocols, specialties like GIS or machine learning, industries like e-commerce or finance or health care, “soft” skills like writing, requirements gathering, design, management, financial planning, sales. Don’t get stuck in a rut of thinking in only one dimension.

Second, don’t be too focused. Go ahead and mix in some “useless” learning. I’ve had fun lately reading about the history of transistors, the integrated circuit, and the Internet. Or to take it to an extreme, you could learn some Greek or Latin or Chinese. :-) Whatever you like. One of those books (The Chip) actually talks about how Jack Kilby, the co-inventor of the IC, would read several newspapers and a bunch of magazines every day, plus every new patent granted by the government. Maybe that’s an extreme, but it’s good to have some breadth because you never know what will come in handy or inspire you. But more than that, recreation is important. Read some trashy science fiction or something.

But when you are being deliberate, I think there are three good alternatives to “learn another language”. The first is to learn something that complements your current skills. Suppose you are (or want to be) a “full-stack web developer.” Okay, learn some Rails, pick one Javascript framework and learn it, but then also learn some advanced Postgres, learn some details about HTTP or SSL or CORS, learn Wireshark and IP, learn HTML Canvas, learn a configuration management tool like Chef or Ansible. I think Chef is a great complement to Rails (or Ansible to Django). My own current “expansion of territory” is down the stack, learning some Rust and reading The Linux Programming Interface. Learn the things that border on what you do, so you are gradually expanding. This is “breadth”, but in a calculated, not desultory way. You’ll probably put those skills to use right away, so they’ll sink in and make you better at your job.

Second is to dive deep somewhere. I’ve really enjoyed getting to know Postgres. I’ve been hired to scale it, to replicate it, to write C extensions for it. Maybe for you it is React or Datomic or AWS or reverse engineering or Ruby performance tuning. But get on the mailing list, follow what problems the community is trying to solve right now, write some blog posts, get to know someone in the community. Whatever it is, use it enough to find some friction points and maybe even fix one or two. You don’t have to make it your whole identity (though you could), but instead of learning a new thing, go in the opposite direction: go deep.

So far this is a lot like the classic “T-shaped person” advice, but I’m saying that for the breadth, pick things connected to your specialty, and for your specialty, have a “specialty within the specialty”. Keep trying to push a little further out, a little further down.

Third and finally is to learn something truly new, at the cutting edge of research. For the last couple years I’ve been reading about temporal databases, which have 20-30 years of academic study but few well-developed practical tools (especially open source ones). This isn’t something I’ve been able to use on a real project (yet), but it’s been great fun, and it feels like a way to find opportunities to build something before anyone else does. How you find your topic is by listening to your pain and seeing if there are other people trying to solve the same problems. Some other things I wish I could become an expert in: Bayesian statistics, operational transforms, HTTP/2, column-store databases, RDF, type theory, vector CPU instructions. There is so much happening! Pick something that people are writing papers about and learn a little.

So that’s what I’ve learned the last few years about learning. Instead of “learn another language”, try to be strategic. Try to build on what you have. Careers are long, so try to find some long-term problems you can grapple with. Maybe like Kilby you will even solve one!
