LR backends contain a spectacular bug
Holy shit this bug.
Sample output from commit 11f52b72 in my ticket_60_investigation branch:
andrea@tisiphone:~/projects/hammer/hammer-1 [606]$ LD_LIBRARY_PATH=./build/debug/src ./build/debug/examples/abnf_test_case
for i = 0, every j up to at least 128 parses
for i = 1, 3 is the first j that doesn't parse
for i = 2, 4 is the first j that doesn't parse
for i = 3, every j up to at least 128 parses
for i = 4, 4 is the first j that doesn't parse
for i = 5, 5 is the first j that doesn't parse
for i = 6, every j up to at least 128 parses
for i = 7, 5 is the first j that doesn't parse
for i = 8, 6 is the first j that doesn't parse
for i = 9, every j up to at least 128 parses
for i = 10, 6 is the first j that doesn't parse
for i = 11, 7 is the first j that doesn't parse
for i = 12, every j up to at least 128 parses
for i = 13, 7 is the first j that doesn't parse
for i = 14, 8 is the first j that doesn't parse
for i = 15, every j up to at least 128 parses
for i = 16, 8 is the first j that doesn't parse
for i = 17, 9 is the first j that doesn't parse
for i = 18, every j up to at least 128 parses
for i = 19, 9 is the first j that doesn't parse
for i = 20, 10 is the first j that doesn't parse
for i = 21, every j up to at least 128 parses
for i = 22, 10 is the first j that doesn't parse
for i = 23, 11 is the first j that doesn't parse
for i = 24, every j up to at least 128 parses
for i = 25, 11 is the first j that doesn't parse
for i = 26, 12 is the first j that doesn't parse
for i = 27, every j up to at least 128 parses
for i = 28, 12 is the first j that doesn't parse
for i = 29, 13 is the first j that doesn't parse
for i = 30, every j up to at least 128 parses
for i = 31, 13 is the first j that doesn't parse
for i = 32, 14 is the first j that doesn't parse
for i = 33, every j up to at least 128 parses
for i = 34, 14 is the first j that doesn't parse
for i = 35, 15 is the first j that doesn't parse
for i = 36, every j up to at least 128 parses
for i = 37, 15 is the first j that doesn't parse
for i = 38, 16 is the first j that doesn't parse
for i = 39, every j up to at least 128 parses
for i = 40, 16 is the first j that doesn't parse
for i = 41, 17 is the first j that doesn't parse
for i = 42, every j up to at least 128 parses
for i = 43, 17 is the first j that doesn't parse
for i = 44, 18 is the first j that doesn't parse
for i = 45, every j up to at least 128 parses
for i = 46, 18 is the first j that doesn't parse
for i = 47, 19 is the first j that doesn't parse
for i = 48, every j up to at least 128 parses
for i = 49, 19 is the first j that doesn't parse
for i = 50, 20 is the first j that doesn't parse
for i = 51, every j up to at least 128 parses
for i = 52, 20 is the first j that doesn't parse
for i = 53, 21 is the first j that doesn't parse
for i = 54, every j up to at least 128 parses
for i = 55, 21 is the first j that doesn't parse
for i = 56, 22 is the first j that doesn't parse
for i = 57, every j up to at least 128 parses
for i = 58, 22 is the first j that doesn't parse
for i = 59, 23 is the first j that doesn't parse
for i = 60, every j up to at least 128 parses
for i = 61, 23 is the first j that doesn't parse
for i = 62, 24 is the first j that doesn't parse
for i = 63, every j up to at least 128 parses
for i = 64, 24 is the first j that doesn't parse
This bug was originally discovered during development of the ABNF parser; the test case is a cut-down version of that grammar. The LR table is still relatively large and it's possible a smaller repro could be obtained. The parser is constructed as:
newline = h_ch('\n');
alpha = h_choice(h_ch_range('A', 'Z'), h_ch_range('a', 'z'), NULL);
sp = h_ch(' ');
vchar = h_ch_range(0x21, 0x7e);
equal = h_ch('=');
semicolon = h_ch(';');
comment = h_sequence(
semicolon,
h_many(h_choice(sp, vchar, NULL)),
newline,
NULL);
c_nl = h_choice(comment, newline, NULL);
c_sp = h_choice(sp, h_sequence(c_nl, sp, NULL), NULL);
defined_as = h_sequence(h_many(c_sp), equal, h_many(c_sp), NULL);
alphas = h_sequence(
alpha,
h_many(h_sequence(h_many1(c_sp), alpha, NULL)),
h_many(c_sp),
NULL);
rule = h_sequence(alpha, defined_as, alphas, c_nl, NULL);
rulelist = h_many1(h_choice(
rule,
h_sequence(h_many(c_sp), c_nl, NULL),
NULL));
p = rulelist;
r = h_compile(p, PB_GLR, (void *)1);
The test string being parsed is x = y z%s;%s\n\n
, where the interpolated strings consist of i
spaces and j
x
characters. For each value of i
, we empirically find that the string fails to parse for values of j
above a threshold, and discover that threshold by binary search. As can be seen from the parser constructed in the fragment above, the i
spaces should always be able to match the h_many(c_sp)
in alphas
and the j
x
characters should always be able to match the h_many(h_choice(sp, vchar, NULL))
in comment
in the h_choice()
for c_nl
, but instead we see this incorrect behavior above a threshold value of j
whenever i
is not divisible by 3.