If you have ever tried to split an Antlr4 grammar, chances are you found it harder than you initially expected. One of the problems in splitting a combined grammar is finding the string literals in parser rules that have no corresponding lexer rule.

grammar Test;
start: 'a' B* EOF;
B: 'b';

Antlr is very forgiving with combined grammars: it implicitly declares a lexer rule for any string literal that appears in a parser rule but is not declared explicitly in the grammar (e.g., 'a').

The problem is that the Antlr4 tool doesn’t accept the grammar after splitting.

// TestLexer.g4
lexer grammar TestLexer;
B: 'b';

// TestParser.g4
parser grammar TestParser;
options { tokenVocab=TestLexer; }
start: 'a' B* EOF;

=> cannot create implicit token for string literal in non-combined grammar: 'a'.

For a combined grammar, Antlr creates a lexer rule internally for the string literal and prioritizes it above all other lexer rules explicitly written in the grammar. For example, once the implicit string literal rules are declared, the combined grammar above is equivalent to:

grammar Test;
start: 'a' B* EOF;
T__0: 'a';
B: 'b';
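
The same step can be done by hand when splitting: write one explicit rule per literal and place those rules ahead of the existing lexer rules. The Python sketch below illustrates this under some assumptions. The function names are hypothetical, the `T__N` naming simply mirrors the example above (a hand-written grammar would probably use more meaningful names), and the literals are assumed to have been collected already.

```python
def explicit_literal_rules(literals, prefix="T__"):
    """Build one named lexer rule per distinct literal, in first-seen order."""
    rules = []
    seen = set()
    for lit in literals:
        if lit in seen:
            continue  # a literal used several times needs only one rule
        seen.add(lit)
        rules.append(f"{prefix}{len(rules)}: {lit};")
    return rules


def split_lexer(name, literals, existing_rules):
    """Assemble a lexer grammar with the literal rules placed first."""
    header = f"lexer grammar {name};"
    return "\n".join([header, *explicit_literal_rules(literals), *existing_rules])


# Literal 'a' comes from the parser rule; B: 'b'; is the existing lexer rule.
print(split_lexer("TestLexer", ["'a'"], ["B: 'b';"]))
```

For the Test grammar this prints a lexer grammar whose first rule is `T__0: 'a';`, followed by `B: 'b';`, which matches the order Antlr uses internally for the combined grammar.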

To find these string literals before splitting, so that explicit lexer rules can be written for them, the following Trash script can be run:

trparse <grammar-file>.g4 | trxgrep 'for $i in (//parserRuleSpec/ruleBlock//STRING_LITERAL/text()) return concat($i,  " ", count(//lexerRuleSpec[lexerRuleBlock//STRING_LITERAL/text() = $i][last()]/TOKEN_REF/text()))' | grep " 0"
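
To make clearer what the query above checks, here is a rough Python approximation of the same idea: collect the string literals used in parser rules and report those that appear in no lexer rule. It is only a sketch, not a replacement for Trash. It scans the text with a regular expression instead of a real Antlr parse, the function names are hypothetical, and it assumes a grammar without nested `{...}` actions. Unlike the Trash command, it reports each literal once.

```python
import re

# Rough scanner for the pieces of a .g4 file that matter here.
TOKEN = re.compile(r"""
    (?P<comment>//[^\n]*|/\*.*?\*/)
  | (?P<literal>'(?:\\.|[^'\\])*')
  | (?P<charset>\[(?:\\.|[^\]\\])*\])
  | (?P<block>\{[^{}]*\})
  | (?P<name>[A-Za-z_][A-Za-z0-9_]*)
  | (?P<punct>[:;])
""", re.S | re.X)


def literals_by_rule(grammar_text):
    """Return (literals in parser rules, set of literals in lexer rules)."""
    parser_lits, lexer_lits = [], set()
    rule = None     # name of the rule being scanned, None between rules
    pending = None  # first identifier seen since the last ';'
    for m in TOKEN.finditer(grammar_text):
        kind, text = m.lastgroup, m.group()
        if kind in ("comment", "charset"):
            continue
        if kind == "block":
            if rule is None:
                pending = None  # options { ... } or tokens { ... }
            continue
        if rule is None:
            if kind == "name" and pending is None and text != "fragment":
                pending = text
            elif kind == "punct":
                if text == ":" and pending is not None:
                    rule = pending  # "name :" opens a rule
                pending = None      # ';' ends a header such as "grammar Test;"
        elif kind == "literal":
            if rule[0].isupper():   # lexer rule names start with uppercase
                lexer_lits.add(text)
            else:
                parser_lits.append(text)
        elif kind == "punct" and text == ";":
            rule = None
    return parser_lits, lexer_lits


def undeclared_literals(grammar_text):
    """Parser-rule literals that no lexer rule mentions, in first-seen order."""
    parser_lits, lexer_lits = literals_by_rule(grammar_text)
    missing = []
    for lit in parser_lits:
        if lit not in lexer_lits and lit not in missing:
            missing.append(lit)
    return missing


COMBINED = """
grammar Test;
start: 'a' B* EOF;
B: 'b';
"""

print(undeclared_literals(COMBINED))  # prints ["'a'"]
```

For the small Test grammar this reports only 'a', the literal that Antlr would otherwise declare implicitly.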

Example:

04/29-07:01:05 ~/blog
$ git clone https://github.com/antlr/grammars-v4.git
Cloning into 'grammars-v4'...
remote: Enumerating objects: 44035, done.
remote: Counting objects: 100% (96/96), done.
remote: Compressing objects: 100% (80/80), done.
remote: Total 44035 (delta 21), reused 63 (delta 14), pack-reused 43939
Receiving objects: 100% (44035/44035), 43.54 MiB | 16.32 MiB/s, done.
Resolving deltas: 100% (23697/23697), done.
Updating files: 100% (8212/8212), done.
04/29-07:02:05 ~/blog
$ cd grammars-v4/c
04/29-07:02:18 ~/blog/grammars-v4/c
$ git checkout fcb5eaae5e7214b5234108f837eac8d713775cce
Note: switching to 'fcb5eaae5e7214b5234108f837eac8d713775cce'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by switching back to a branch.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -c with the switch command. Example:

  git switch -c <new-branch-name>

Or undo this operation with:

  git switch -

Turn off this advice by setting config variable advice.detachedHead to false

HEAD is now at fcb5eaae Update to latest trgen. (#3352)
04/29-07:02:26 ~/blog/grammars-v4/c
$ trparse C.g4 | trxgrep 'for $i in (//parserRuleSpec/ruleBlock//STRING_LITERAL/text()) return concat($i,  " ", count(//lexerRuleSpec[lexerRuleBlock//STRING_LITERAL/text() = $i][last()]/TOKEN_REF/text()))' | grep " 0"
CSharp 0 C.g4 success 0.0698431
'__extension__' 0
'__builtin_va_arg' 0
'__builtin_offsetof' 0
'__extension__' 0
'__extension__' 0
'__m128' 0
'__m128d' 0
'__m128i' 0
'__extension__' 0
'__m128' 0
'__m128d' 0
'__m128i' 0
'__typeof__' 0
'__inline__' 0
'__stdcall' 0
'__declspec' 0
'__cdecl' 0
'__clrcall' 0
'__stdcall' 0
'__fastcall' 0
'__thiscall' 0
'__vectorcall' 0
'__asm' 0
'__attribute__' 0
'__asm' 0
'__asm__' 0
'__volatile__' 0
04/29-07:02:55 ~/blog/grammars-v4/c
$