Finding parser rules that use ‘skip’ lexer symbols in Antlr4 grammars

A few days ago, I moved the build of grammars-v4 to the latest version of Antlr4, version 4.11.1. The upgrade was uneventful except for two grammars, which the Go target for Antlr parsed extremely slowly. There turned out to be two causes: a problem with the hashing of ATNs in the Go target, and the grammar using a lexer symbol that the lexer could never produce because it was marked “skip”.

Clearly, these grammars were wrong: a parser rule should not use a lexer symbol that is marked skip, because the lexer will never emit that symbol. Could Trash detect this problem in a grammar? It turns out that the answer is yes, but currently only in a combined grammar.

trparse Problems.g4 \
   | trxgrep '
      for $i in (
         //parserRuleSpec//TOKEN_REF
            [text()=//lexerRuleSpec
               [./lexerRuleBlock/lexerAltList/lexerAlt/lexerCommands/lexerCommand/lexerCommandName/identifier/RULE_REF/text()="skip"]
               /TOKEN_REF/text()])
         return concat("line ", $i/@Line, " col ", $i/@Column, " """, $i/@Text,"""")'

The XPath expression here says “if a parser rule has a token reference on its right-hand side whose name equals that of a lexer rule whose right-hand side contains “-> skip”, then return that token reference and print it with its line and column numbers.”
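To make the logic of the expression concrete, here is a minimal Python sketch of the same check. It does not use Trash or a real parse tree: it assumes one rule per line, uses regular expressions, and the function name skipped_token_uses is made up for illustration. Treat it as a rough model of what the XPath query computes, not as a replacement for it.

```python
import re

# Simplifying assumptions: every rule sits on a single line, lexer rule
# names start with an uppercase letter, parser rule names with a lowercase one.
RULE = re.compile(r"^\s*([A-Za-z_]\w*)\s*:(.*);\s*$")
LITERAL = re.compile(r"'(?:\\.|[^'\\])*'")
TOKEN_REF = re.compile(r"\b[A-Z]\w*\b")
SKIP = re.compile(r"->\s*skip\b")

GRAMMAR = r"""grammar Problems;
s: a* EOF ;
a: 'a' | COMMENT;
COMMENT: '//' ~[\n\r]* -> skip;
WS: [ \t\r\n]+ -> skip;
"""


def skipped_token_uses(grammar_text):
    """Return (line, column, name) for each skipped token used in a parser rule."""
    skipped = set()
    parser_rules = []
    for line_no, line in enumerate(grammar_text.splitlines(), start=1):
        # Blank out string literals, keeping their length so columns stay valid.
        clean = LITERAL.sub(lambda m: " " * len(m.group()), line)
        rule = RULE.match(clean)
        if rule is None:
            continue
        name, body = rule.group(1), rule.group(2)
        if name[0].isupper():
            if SKIP.search(body):
                skipped.add(name)
        else:
            parser_rules.append((line_no, clean, rule.start(2)))

    uses = []
    for line_no, clean, body_start in parser_rules:
        for ref in TOKEN_REF.finditer(clean, body_start):
            if ref.group() in skipped:
                uses.append((line_no, ref.start(), ref.group()))
    return uses


for line_no, col, name in skipped_token_uses(GRAMMAR):
    print(f'line {line_no} col {col} "{name}"')
```

The two passes mirror the two halves of the XPath predicate: first collect the lexer rules that carry the skip command, then look for token references to them inside parser rules. Lines are counted from 1 and columns from 0 here.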

The problem with this expression is that many Antlr grammars are split into separate lexer and parser grammars. XPath expressions always denote a set of nodes from the root of the tree, which is a “document”. While XPath does not have a notion of finding nodes in different documents, it does define the concept of “Available documents” and has a function called doc(string), which returns a document given a URI.

The trick is an extension I made to doc(). It now takes a string pattern and selects the list of documents that match the pattern. These documents are the collection of parse trees passed to trxgrep.
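The idea of selecting documents by pattern can be modeled with a short Python sketch. Everything here is an assumption made for illustration: the file names ProblemsLexer.g4 and ProblemsParser.g4 are hypothetical, grammar text stands in for parse trees, and the pattern is treated as a shell-style glob, which may not be exactly how the extended doc() matches.

```python
import re
from fnmatch import fnmatchcase

# Hypothetical split grammar; plain text stands in for the parse trees
# that would be passed to trxgrep.
DOCUMENTS = {
    "ProblemsLexer.g4": r"""lexer grammar ProblemsLexer;
A: 'a';
COMMENT: '//' ~[\n\r]* -> skip;
WS: [ \t\r\n]+ -> skip;
""",
    "ProblemsParser.g4": r"""parser grammar ProblemsParser;
options { tokenVocab = ProblemsLexer; }
s: a* EOF ;
a: A | COMMENT;
""",
}

SKIPPED_RULE = re.compile(r"^([A-Z]\w*)\s*:.*->\s*skip\s*;", re.MULTILINE)
PARSER_RULE = re.compile(r"^([a-z]\w*)\s*:(.*);", re.MULTILINE)


def doc(pattern):
    """Return every document whose name matches the glob pattern (assumed semantics)."""
    return [text for name, text in DOCUMENTS.items() if fnmatchcase(name, pattern)]


# Skipped lexer rules are collected across all selected documents.
skipped = {name for text in doc("*") for name in SKIPPED_RULE.findall(text)}


def skipped_uses(pattern):
    """Return (parser rule, token) pairs where a skipped token is referenced."""
    uses = []
    for text in doc(pattern):
        for rule in PARSER_RULE.finditer(text):
            for ref in re.findall(r"\b[A-Z]\w*\b", rule.group(2)):
                if ref in skipped:
                    uses.append((rule.group(1), ref))
    return uses


print(skipped_uses("*"))
```

Because the lookup of skipped rules runs over every document the pattern selects, it no longer matters whether the lexer rules and the parser rules live in the same file.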

We can now define a grammar with a lexer rule marked “skip” and check whether that symbol is used in a parser rule.

Grammar:

grammar Problems; 
s: a* EOF ;
a: 'a' | COMMENT;
COMMENT: '//' ~[\n\r]* -> skip;
WS: [ \t\r\n]+ -> skip;

Bash script to find “skip” tokens used in parser rules

#!/usr/bin/bash
trparse Problems.g4 \
   | trxgrep '
      for $i in (
         //parserRuleSpec//TOKEN_REF
            [text()=doc("*")//lexerRuleSpec
               [./lexerRuleBlock/lexerAltList/lexerAlt/lexerCommands/lexerCommand/lexerCommandName/identifier/RULE_REF/text()="skip"]
               /TOKEN_REF/text()])
         return concat("line ", $i/@Line, " col ", $i/@Column, " """, $i/@Text,"""")'

Output

line 3 col 9 "COMMENT"

Note: For context, the Antlr grammar is itself parsed using this grammar. You can see the parse tree of the grammar from a command prompt: trparse Problems.g4 | trtree.