Solving bison shift/reduce conflicts in Postgres

2024-10-23 postgres, bison, parsing

I had to fix some shift/reduce conflicts in the Postgres bison grammar recently.

I’ve never done this before, so it was a learning experience. Maybe my story will help some other new Postgres contributor—or anyone struggling with this technology that is central to computer science but for many day-to-day programmers seldom-used.

Back in 2018 I read lex & yacc by Doug Brown, John R. Levine, and Tony Mason (second edition published in 1992). Levine’s 2009 flex & bison would have been a more practical choice, but I liked getting some history too. Re-reading some parts of that book was very helpful. So were the bison manual and some StackOverflow questions. Now I’m working through The Dragon Book, and that would have made a great resource too. It’s easy to write some bison without going that deep, but if you get stuck it can be frustrating.

I’ve been adding syntax from SQL:2011 for application-time updates and deletes. If you have a PERIOD or range column named valid_at, you can say UPDATE t FOR PORTION OF valid_at FROM '2024-01-01' TO '2024-02-01' SET foo = bar. (For more details you can watch my talk.) The FOR PORTION OF bounds don’t have to be just literals. You could also say, for example, FROM current_time TO current_time + INTERVAL '1' HOUR.

Actually I picked that example on purpose. Did you know that intervals support this syntax: INTERVAL '1:02:03' HOUR TO MINUTE? That means: Interpret the string as hours, and preserve the precision down to the minute. By default if you ask for 1:02:03 HOUR you get just an hour. But TO MINUTE means you get 1 hour and 2 minutes. (You still lose the seconds.)

So how does an LR(1) parser deal with FOR PORTION OF valid_at FROM current_time + INTERVAL '1' HOUR TO MINUTE? When we consider the TO, does it belong with the interval, or does it start the closing bound of the FOR PORTION OF? This is a shift/reduce conflict. Bison can’t look ahead further than one token to guess what is correct—and I’m not sure it would help even if it could.

When the next token is able to complete some piece of the grammar, called a “rule”, the parser “reduces”: it consumes those tokens and lets you run a custom “action” attached to your rule. Otherwise, bison shifts the token onto a stack of not-yet-reduced tokens, so that it can reduce a rule in the future. But bison needs to decide each time it sees a token whether to reduce or to shift.

Here is the rule for FOR PORTION OF:

for_portion_of_clause:
  FOR PORTION OF ColId FROM a_expr TO a_expr
    {
      ForPortionOfClause *n = makeNode(ForPortionOfClause);
      n->range_name = $4;
      n->location = @4;
      n->target_start = $6;
      n->target_end = $8;
      $$ = n;
    }
;

The first line is the name of the rule, so we can use it in higher-level contexts: the UPDATE statement (and also DELETE). The second line has the inputs we need to match to complete the rule. Some of them are “terminals”: keywords, identifiers, operators, literals, punctuation, etc.—in this case FOR, PORTION, OF, FROM, TO. Some are “non-terminal”: further rules. Below all that is a block of C code: the action that gets run when we reduce. We call the name the “left side” of the rule, and the inputs the “right side” or “body”. Rules are also sometimes called “productions”. The lex & yacc book doesn’t use that terminology, but the Dragon Book does, and so does the Postgres source code.

So each bound is an a_expr. The a_expr rule is a complicated production with just about anything you can do in Postgres: a literal, a variable, a function call, an operator, a subquery, lots of weird SQL keywords, etc. Many contexts forbid some of these things, e.g. you can’t refer to a column in a DEFAULT expression or a partition bound—but that is enforced during analysis, not by the grammar.

To speak more precisely, when a non-terminal can match inputs in more than one way (like a_expr), we should call each alternative a rule or production. But in bison you commonly write the name once then separate each body with a pipe (|), so all the rules share one name. There is not one a_expr rule, but many: 68 by my count. But such terminological precision is rarely needed.

Take our example, FOR PORTION OF valid_at FROM current_time + INTERVAL '1' HOUR • TO MINUTE. I’ve added a dot to represent bison’s “cursor”. It is considering what to do with the TO. We could reduce the a_expr right now, leaving the TO to become part of the FOR PORTION OF. Or we could shift the TO so that it eventually gets reduced as part of the a_expr.

Actually it’s not about reducing the a_expr, but reducing one of its many sub-rules, in this case the interval. An a_expr can be a c_expr (among other things), and a c_expr can be an AexprConst (among other things), and an AexprConst can be a ConstInterval Sconst opt_interval (among other things), and the opt_interval is the problem, because it can optionally have a TO. Here is that rule:

opt_interval:
  YEAR_P
    { $$ = list_make1(makeIntConst(INTERVAL_MASK(YEAR), @1)); }
  | MONTH_P
    { $$ = list_make1(makeIntConst(INTERVAL_MASK(MONTH), @1)); }
  | DAY_P
    { $$ = list_make1(makeIntConst(INTERVAL_MASK(DAY), @1)); }
  | HOUR_P
    { $$ = list_make1(makeIntConst(INTERVAL_MASK(HOUR), @1)); }
  | MINUTE_P
    { $$ = list_make1(makeIntConst(INTERVAL_MASK(MINUTE), @1)); }
  | interval_second
    { $$ = $1; }
  | YEAR_P TO MONTH_P
    { ... }
  | DAY_P TO HOUR_P
    { ... }
  | DAY_P TO MINUTE_P
    { ... }
  | DAY_P TO interval_second
    { ... }
  | HOUR_P TO MINUTE_P
    { ... }
  | HOUR_P TO interval_second
    { ... }
  | MINUTE_P TO interval_second
    { ... }
  | /*EMPTY*/
    { $$ = NIL; }
;

(I’ve omitted most of the actions, since actions don’t affect bison’s choices.) The opt_interval rule is what bison is trying to reduce.

When you have a shift/reduce conflict, make gives you an error like this:

/usr/bin/bison -d -o gram.c gram.y
gram.y: conflicts: 4 shift/reduce
gram.y: expected 0 shift/reduce conflicts
make[2]: *** [gram.c] Error 1
make[1]: *** [parser/gram.h] Error 2
make: *** [submake-generated-headers] Error 2

That is not much to go on.

By the way, do you see the expected 0? A shift/reduce conflict doesn’t have to be fatal. Bison will default to shift. But this is a bit sketchy. It means you could accidentally write an ambiguous grammar that causes trouble later. So bison lets you declare how many conflicts you expect, and it only fails if it finds a different count. I like that for Postgres the expected conflict count is zero. For MariaDB it is 62.

Anyway, we have four shift/reduce conflicts. Now what? Let’s ask Bison where they are. It can take a -v/--verbose option to generate a “report file”. Since Postgres’s grammar lives in gram.y, the report file is gram.output. (Modern versions offer more control with -r/--report and --report-file, but macOS only supports -v.) We aren’t running bison directly, but we can control things like this:

make BISONFLAGS=-v

That gives us a file with 5.5 million lines, but right at the top we see:

State 1454 conflicts: 1 shift/reduce
State 1455 conflicts: 1 shift/reduce
State 1456 conflicts: 1 shift/reduce
State 1459 conflicts: 1 shift/reduce

Then if we /^state 1454 we see this:

state 1454

  2000 opt_interval: DAY_P .
  2005             | DAY_P . TO HOUR_P
  2006             | DAY_P . TO MINUTE_P
  2007             | DAY_P . TO interval_second

    TO  shift, and go to state 2670

    TO        [reduce using rule 2000 (opt_interval)]
    $default  reduce using rule 2000 (opt_interval)

So in this state, bison has four candidate rules it could eventually reduce, numbered 2000, 2005, 2006, 2007. Below that are possible valid tokens and what to do for each one. We see TO twice, which is the problem. The square brackets highlight the conflict: they mark a transition that will never happen. (The $default line means if the next token is not a TO, we can reduce and leave that token for some higher-level rule to match.) So this is how we know one half of the problem is opt_interval. The other half is for_portion_of_clause. Bison doesn’t tell us that, but (1) we just added it to a previously-working grammar (2) we can see that TO is the issue, and that’s where we match a TO.

This is one of the four shift/reduce conflicts. The other three are also from opt_interval, caused by YEAR_P TO MONTH_P, HOUR_P TO {MINUTE_P,interval_second}, and MINUTE_P TO interval_second. Essentially it’s all one conflict, but we can hit the TO after DAY, HOUR, YEAR, or MINUTE, so that’s four different states.

We can use “precedence” to resolve such ambiguities. It’s just like elementary arithmetic: multiplication has higher precedence than addition. It is stickier. We do it first. But what does that mean in bison? Bison compares the precedence of the non-terminal rule it could reduce (opt_interval) vs the precedence of the token it could shift (TO). Rules don’t really have precedence themselves, but they get the precedence of their final terminal token.

So in state 1454, if we give DAY_P a different precedence than TO, bison will know whether to reduce (rule 2000), or shift (and eventually reduce rule 2005, 2006, or 2007). If DAY_P is higher, we’ll reduce. If TO is higher, we’ll shift.

Should we shift or reduce? The only answer is to shift. If we reduce by default, then users can never say INTERVAL '1' DAY TO HOUR (even in a completely different context). No amount of parens will make bison do otherwise. But if we shift, then this is a syntax error: FOR PORTION OF valid_at FROM '2013-03-01'::timestamp + INTERVAL '1' HOUR TO '2014-01-01' (because after shifting the TO bison is still trying to reduce opt_interval), but this fixes it: FOR PORTION OF valid_at FROM ('2013-03-01'::timestamp + INTERVAL '1' HOUR) TO '2014-01-01'. Users can get what they want by adding parens.

So to shift, we give TO a higher precedence than YEAR_P, DAY_P, HOUR_P, and MINUTE_P. By default a token has no precedence, but bison lets you make a list of declarations where lower lines have higher precedence. So for a long time my patch added this:

%nonassoc YEAR_P DAY_P HOUR_P MINUTE_P
%nonassoc TO

(Actually I had MONTH_P in there too, but that isn’t needed because you can’t have MONTH TO ....)

But this is frowned upon. There is a comment right above my change that gave me a guilty conscience for at least a year, maybe a few:

/*
 * Sometimes it is necessary to assign precedence to keywords that are not
 * really part of the operator hierarchy, in order to resolve grammar
 * ambiguities.  It's best to avoid doing so whenever possible, because such
 * assignments have global effect and may hide ambiguities besides the one
 * you intended to solve.  (Attaching a precedence to a single rule with
 * %prec is far safer and should be preferred.)  If you must give precedence
 * to a new keyword, try very hard to give it the same precedence as IDENT.
 * If the keyword has IDENT's precedence then it clearly acts the same as
 * non-keywords and other similar keywords, thus reducing the risk of
 * unexpected precedence effects.
 */

I knew I had to fix this before my patch would get accepted. Those two lines had to go.

What is the %prec approach suggested by the comment? I said above that a rule’s precedence comes from its last terminal token. But you can override a rule’s precedence by putting %prec token_name after the right side. For example a_expr has this rule:

| a_expr AT TIME ZONE a_expr      %prec AT
  { ... }

We’re saying we should reduce this rule with the precedence of AT.

I tried all kinds of %prec placements that didn’t work. My mental model of bison’s process was too vague. The reason I’m writing this story is really to record the details and thought process that finally gave me the solution.

For example, putting %prec on for_portion_of_clause doesn’t do any good, because the conflict is lower down than that, inside opt_interval. That was counter-intuitive, because I knew that adding for_portion_of_clause was what caused the problem. It’s what offers bison a way to reduce early and still have a place to use the TO. But despite for_portion_of_clause exerting influence, at the moment of decision we are in the middle of a different rule. It’s action-at-a-distance.

Another breakthrough was realizing that the comparison is between a rule (to reduce) and a token (to shift). Within opt_interval I kept trying to give low precedence to the rules without TO and high precedence to the rules with it. But the comparison isn’t between two rules. It’s between a rule and a token. The token is TO itself. There isn’t any way to give precedence to a token with %prec. That only modifies a rule. If TO has an undefined precedence, there will always be a conflict. So I did have to declare a precedence for TO, but following the comment above I could give it the same precedence as IDENT:

%nonassoc IDENT PARTITION RANGE ROWS GROUPS PRECEDING FOLLOWING CUBE ROLLUP
      SET KEYS OBJECT_P SCALAR TO VALUE_P WITH WITHOUT PATH

Then the conflicting opt_interval rules needed a lower precedence, to prevent reducing early. A low-precedence keyword we use a lot is IS, so I did this:

opt_interval:
  YEAR_P                                %prec IS
    { ... }
  | DAY_P                               %prec IS
    { ... }
  | HOUR_P                              %prec IS
    { ... }
  | MINUTE_P                            %prec IS
    { ... }

Now we’ll shift the TO and follow those rules that include it.

Finally I had a solution!

It’s worth considering another approach. We can also enforce precedence with the structure of our rules, without declaring an explicit higher/lower precedence for terminals. For example for simple arithmetic we could do this (from the Dragon Book, p. 49–50):

expr: expr + term
  | expr - term
  | term

term: term * factor
  | term / factor
  | factor

factor: digit
  | '(' expr ')'

For n levels of precedence, we need n+1 different rules. But I think this approach gets unwieldy quickly. And anyway was I going to rewrite the Postgres grammar to do this?

Postgres actually does this a bit already though. We’ve seen a_expr and c_expr. Of course there is also b_expr. b_expr is a more limited set of rules than a_expr, and c_expr is everything they have in common.

We use b_expr to solve some shift/reduce conflicts. For example a column’s DEFAULT value can only take a b_expr, because a NOT would be a shift/reduce conflict: is it a NOT NULL constraint on the column, or is it part of the DEFAULT expression, e.g. NOT LIKE? One rule that b_expr accepts is '(' a_expr ')', so even in contexts like DEFAULT, you can get whatever you want by wrapping your text in parentheses.

So could I have saved myself a lot of trouble and made FOR PORTION OF take FROM b_expr TO b_expr instead? No, because the problem was inside c_expr, which is shared by both rules.

I probably could have invented a d_expr, but that would have been a lot of work, only to produce a more tangled grammar that I expect no reviewer would have accepted.

So that’s the story of how I fixed my four shift-reduce conflicts.

But just when you think you’ve killed the zombie, he rises from the dead. Right around the same time, I realized my grammar was wrong. In SQL, you can give your table an alias when you UPDATE or DELETE. It can use AS or not: UPDATE tablename [[AS] t] and DELETE FROM tablename [[AS] t]. I was putting FOR PORTION OF after the alias, but according to SQL:2011 it comes before. I tried moving it, and I got . . . 30 shift/reduce conflicts!

These looked really hairy: the problem was that AS is optional and the alias can be nearly anything. It can’t be a reserved keyword (unless you quote it), but many keywords are not reserved (per the standard), so there’s ambiguity there. Allowing a_expr, which can be nearly anything, followed by an optional alias, which can be nearly anything, is bad news. I really thought I was in trouble.

Could I just ignore the standard? I don’t think that would be acceptable, not in this matter. But it was tempting enough that I checked what MariaDB and IBM DB2 were doing. Somehow they were making it work. I should figure it out too.

I think I took a walk, or maybe I slept on it, but I realized that we already have the same problem with column aliases when you SELECT. Each selected column is an a_expr, and their aliases don’t require AS. What was Postgres doing to make that work?

I found this rule for SELECTing:

target_el:  a_expr AS ColLabel { ...}
      | a_expr BareColLabel { ... }
      | a_expr { ... }
      | '*' { ... }
    ;

It turns out that ColLabel allows anything (even reserved keywords!), but BareColLabel is more restricted.

So I could do something similar: when there is an AS, permit everything, but otherwise only permit tokens that are conflict-free. If fact to keep backward-compatibility, I could leave the old grammar rule for UPDATE and DELETE in place (each had only one), and only get more restrictive when FOR PORTION OF is present. Maybe reviewers will ask me to change things, but at the moment my solution looks like this:

opt_alias:
  AS ColId { ... }
  | BareColLabel { ... }
  | /* empty */ %prec UMINUS { $$ = NULL; }
;

UpdateStmt: opt_with_clause UPDATE relation_expr_opt_alias
  SET set_clause_list
  from_clause
  where_or_current_clause
  returning_clause
    { ... }
  | opt_with_clause UPDATE relation_expr
  for_portion_of_clause opt_alias
  SET set_clause_list
  from_clause
  where_or_current_clause
  returning_clause
    { ... }
;

I’m not sure I like using BareColLabel for non-column aliases, but the existing relation_expr_opt_alias uses ColId, so maybe it’s okay.

The %prec is necessary to resolve a conflict with USING, which is permitted by BareColLabel, and also allowed in DELETE FROM ... USING. If I added a separate list for bare table labels, we could leave out USING and not use %prec here, but I don’t think maintaining another keyword list would be popular.

That’s it! I’m happy that at 47 I can still work out the errors in my mental model of something and correct them. Hopefully by writing this down I won’t have to do it more than once. :-)

blog comments powered by Disqus Next: Postgres REPLICA IDENTITY

Illuminated Computing

Solving bison shift/reduce conflicts in Postgres

Paul A. Jungwirth

Code

Writing

Talks