Finding problematic EOF references in Antlr4 grammars
A few days ago, a question on StackOverflow presented a grammar that produced two different parse trees for the same input, depending on whether one rule was commented out. The odd part is that the rule was never referenced anywhere else in the grammar. So why would Antlr produce a different tree for the same input? At first glance, it makes no sense.
grammar my;
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;
// Input: 'H C D C C D'.
The main problem with this grammar is that EOF should not be used as an ordinary symbol. It denotes the end of input, so referencing it in one alt but not in the others can cause problems, which is what we see with this grammar.
Rewriting the grammar slightly to contain a proper EOF-start-rule produces a consistent parse tree for the input, whether or not rule p is commented out.
grammar my;
st_: st EOF;
st: 'H' hd | ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;
// Input: 'H C D C C D'.
Referencing symbols after an EOF
Antlr allows EOF to be used like a normal symbol, but it is not one. No symbol can be read after the EOF token.
grammar X;
r: 'a' EOF 'b' ;
// Input: a
Although Antlr accepts this grammar, it cannot parse ‘a’ because it is expecting ‘b’, which it cannot read.
Finding EOF problems
Trash can be used to find these problems.
find-eof-problems.sh
#!/usr/bin/bash
# Set MSYS2_ARG_CONV_EXCL so that Trash XPaths do not get mutilated
# (MSYS2 would otherwise rewrite arguments that look like paths).
export MSYS2_ARG_CONV_EXCL="*"
# Sanity check: the file must be a combined or parser grammar, not a lexer grammar.
is_grammar=`trparse "$1" -t antlr4 | trxgrep '/grammarSpec/grammarDecl[not(grammarType/LEXER)]' | trtext -c`
if [ "$is_grammar" != "1" ]
then
    echo "$1 is not a combined or parser Antlr4 grammar."
    exit 1
fi
# Check 1: count elements that follow an element containing EOF
# within the same alternative.
count=`trparse "$1" -t antlr4 2> /dev/null \
    | trxgrep '//parserRuleSpec//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element' \
    | trtext -c`
if [ "$count" != "0" ]
then
    echo "$1 has an EOF usage followed by another element."
fi
# Check 2: count alternatives that contain EOF in a rule
# that has more than one alternative.
count=`trparse "$1" -t antlr4 2> /dev/null \
    | trxgrep '//labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]' \
    | trtext -c`
if [ "$count" != "0" ]
then
    echo "$1 has an EOF in one alt, but not in another."
fi
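A quick way to try the script is to write the small grammar X from the previous section to a file and run the script on it. The sketch below assumes the script has been saved as find-eof-problems.sh in the current directory and that the Trash tools are on the PATH. Both are assumptions about your setup, so the run is skipped if either is missing.

```shell
# Write the small example grammar X to a file.
# The heredoc delimiter is GRAMMAR_END so it cannot be confused with the EOF token.
cat > X.g4 <<'GRAMMAR_END'
grammar X;
r: 'a' EOF 'b' ;
GRAMMAR_END

# Run the checker only if the script and the Trash tools are available.
# The script location (current directory) is an assumption of this sketch.
if [ -f ./find-eof-problems.sh ] && command -v trparse > /dev/null 2>&1
then
    bash ./find-eof-problems.sh X.g4
else
    echo "find-eof-problems.sh or Trash not found; skipping the check."
fi
```

Since 'b' follows EOF in the same alternative of rule r, the first check is the one that should report a problem for this grammar, provided the XPath matches as intended.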
Note: For context, the Antlr grammar is parsed using this grammar.
You can see the parse tree of the grammar at a command prompt:
trparse X.g4 | trtree