A question posted on StackOverflow a few days ago presented a grammar that produces two different parse trees depending on whether one rule is commented out. The puzzle is that the rule is never used anywhere else in the grammar. So why would Antlr produce a different tree for the same input? It makes no sense.
    grammar my;
    st: 'H' hd | EOF ;
    hd: 'D' d | 'C' c | st ;
    d: hd ;
    c: 'D' c | hd ;
    s1: 'D' s1 | c ;
    // p: hd ;
    SKP: [ \t\r\n]+ -> skip;
    // Input: 'H C D C C D'.
The main problem with this grammar is that EOF is used as if it were an ordinary symbol. It is not: it denotes the end of input. Referencing it in one alt of a rule but not in the other alts can cause problems, which is what we see with this grammar.
Rewriting the grammar slightly to contain a proper EOF-start-rule produces a consistent parse tree for the input, whether or not rule p is commented out.
    grammar my;
    st_: st EOF;
    st: 'H' hd | ;
    hd: 'D' d | 'C' c | st ;
    d: hd ;
    c: 'D' c | hd ;
    s1: 'D' s1 | c ;
    // p: hd ;
    SKP: [ \t\r\n]+ -> skip;
    // Input: 'H C D C C D'.
Referencing symbols after an EOF
Antlr allows EOF to be used as a normal symbol, but it is not. Symbols cannot be read after the EOF token.
    grammar X;
    r: 'a' EOF 'b' ;
    // Input: a
Although Antlr accepts this grammar, it cannot parse ‘a’ because it is expecting ‘b’, which it cannot read.
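To make the point concrete, here is a toy model in Python of why no input can satisfy a rule like r. It is only a sketch, not Antlr's runtime: the names tokenize and matches are hypothetical, and the model assumes the token stream is the whitespace-separated input followed by exactly one end-of-input marker.

```python
EOF = "<EOF>"  # stand-in for the end-of-input token (assumption of this sketch)

def tokenize(text):
    """Split on whitespace and append the single end-of-input marker."""
    return text.split() + [EOF]

def matches(sequence, text):
    """Return True if the token stream for `text` starts with `sequence`."""
    tokens = tokenize(text)
    for pos, expected in enumerate(sequence):
        # Nothing exists after the EOF marker, so any element that is
        # expected beyond it can never be matched.
        if pos >= len(tokens) or tokens[pos] != expected:
            return False
    return True

rule_r = ["a", EOF, "b"]           # models  r: 'a' EOF 'b' ;
print(matches(rule_r, "a"))        # False: nothing can follow EOF
print(matches(rule_r, "a b"))      # False: the second token is 'b', not EOF
print(matches(["a", EOF], "a"))    # True: EOF as the last element is fine
```

Whatever input is supplied, either the EOF position holds a real token or the position after EOF holds nothing, so the sequence fails in both cases.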
Finding EOF problems
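Trash can be used to find these problems. The following Bash script uses the Trash commands trparse, trxgrep, and trtext to check a grammar for both of them.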
    #!/usr/bin/bash

    # Set MSYS2_ARG_CONV_EXCL so that Trash XPaths do not get mutilated.
    export MSYS2_ARG_CONV_EXCL="*"

    # Sanity check.
    is_grammar=`trparse "$1" -t antlr4 | trxgrep '/grammarSpec/grammarDecl[not(grammarType/LEXER)]' | trtext -c`
    if [ "$is_grammar" != "1" ]
    then
        echo "$1" is not a combined or parser Antlr4 grammar.
        exit 1
    fi

    # Check 1: an element that contains EOF and is followed by another element.
    count=`trparse "$1" -t antlr4 2> /dev/null \
        | trxgrep '//parserRuleSpec//alternative/element[.//TOKEN_REF/text()="EOF"]/following-sibling::element' \
        | trtext -c`
    if [ "$count" != "0" ]
    then
        echo "$1" has an EOF usage followed by another element.
    fi

    # Check 2: an alt that contains EOF in a rule with more than one alt.
    count=`trparse "$1" -t antlr4 2> /dev/null \
        | trxgrep '//labeledAlt[.//TOKEN_REF/text()="EOF" and count(../labeledAlt) > 1]' \
        | trtext -c`
    if [ "$count" != "0" ]
    then
        echo "$1" has an EOF in one alt, but not in another.
    fi
Note: For context, the Antlr grammar is parsed using this grammar.
You can see the parse tree of the grammar at a command prompt:
    trparse X.g4 | trtree
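If Trash is not installed, the two checks in the script can be roughly approximated in plain Python. This is a text-based heuristic, not a substitute for the XPath queries: it does not parse the grammar, it only handles simple parser rules of the form `name : ... ;`, and it will be confused by actions, rule arguments, and options blocks. The function names strip_noise, top_level_alts, and check_eof_usage are hypothetical, and the second check follows the wording of the script's message (EOF in some top-level alts but not all).

```python
import re

# Comments and quoted literals, matched in one left-to-right pass so that
# a quote inside a comment (or "//" inside a literal) is not misread.
_NOISE = re.compile(r"/\*.*?\*/|//[^\n]*|'(?:\\.|[^'\\\n])*'", re.S)

def strip_noise(text):
    """Drop comments and replace each quoted literal with the word LIT."""
    return _NOISE.sub(lambda m: " LIT " if m.group().startswith("'") else " ", text)

def top_level_alts(body):
    """Split a rule body on '|' characters that are not inside parentheses."""
    alts, depth, current = [], 0, []
    for ch in body:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        if ch == "|" and depth == 0:
            alts.append("".join(current))
            current = []
        else:
            current.append(ch)
    alts.append("".join(current))
    return alts

def check_eof_usage(grammar_text):
    """Return (rule_name, message) pairs for the two suspicious EOF patterns."""
    findings = []
    for chunk in strip_noise(grammar_text).split(";"):
        name, colon, body = chunk.partition(":")
        name = name.strip()
        if not colon or not re.fullmatch(r"[a-z]\w*", name):
            continue  # skip the grammar header, lexer rules, anything unusual
        # One list of word-like elements per top-level alt, alt labels removed.
        alts = [re.findall(r"\w+", re.sub(r"#\s*\w+", " ", alt))
                for alt in top_level_alts(body)]
        if any("EOF" in a and a.index("EOF") < len(a) - 1 for a in alts):
            findings.append((name, "EOF followed by another element"))
        has_eof = ["EOF" in a for a in alts]
        if any(has_eof) and not all(has_eof):
            findings.append((name, "EOF in one alt, but not in another"))
    return findings

ORIGINAL = r"""
grammar my;
st: 'H' hd | EOF ;
hd: 'D' d | 'C' c | st ;
d: hd ;
c: 'D' c | hd ;
s1: 'D' s1 | c ;
// p: hd ;
SKP: [ \t\r\n]+ -> skip;
"""
X = "grammar X; r: 'a' EOF 'b' ;"

for rule, message in check_eof_usage(ORIGINAL) + check_eof_usage(X):
    print(f"rule {rule}: {message}")
```

Run on the two problem grammars from this post, the sketch reports rule st for the mixed-alt pattern and rule r for the element after EOF. The rewritten grammar with the st_ start rule produces no findings.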