Replacing lexer rule names with Unicode names

Antlr4 lexer grammars usually contain rules for single-character string literals, e.g., SLASH: '/';. Often the lexer rule name doesn’t describe the string literal very well, e.g., BRACKET_OPEN : '('; from the abb grammar here. Most people would describe the string literal '(' as a “left parenthesis”; “bracket” is too broad a term because there are many types of brackets. Another example is LPAREN : ')' ; from the bnf grammar here. The string literal ')' is usually described as a “right parenthesis”, not a “left parenthesis”, because the symbol marks the end of a grouping. Similar rules in that grammar reverse left and right for no apparent reason. Poorly chosen lexer rule names such as these lead to confusion.

Instead of inventing a name that is often meaningless, use the Unicode name for the character. So, instead of BRACKET_OPEN : '(';, the rule should be LEFT_PARENTHESIS : '(';. Fortunately, the Unicode standard publishes a table that maps character code points to names (NamesList.txt), and Trash can be used to rename these symbols.
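To make the lookup concrete, here is a minimal sketch of mapping one ASCII character to a rule name. The function name unicode_rule_name and the three-entry sample file are hypothetical stand-ins for illustration; the sample only mimics the "code point, tab, name" layout of the character lines in NamesList.txt.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Trash): print the Unicode name of one
# ASCII character, with spaces and hyphens turned into underscores.
unicode_rule_name() {
    local ch=$1 names_file=$2 code
    # A leading quote makes printf use the character's numeric code.
    code=$(printf '%04X' "'$ch")
    # Character lines look like "0028<TAB>LEFT PARENTHESIS"; alias and
    # comment lines start with a tab, so their first field is empty.
    awk -F'\t' -v code="$code" \
        '$1 == code { gsub(/[ -]/, "_", $2); print $2; exit }' "$names_file"
}

# Tiny stand-in for NamesList.txt so the sketch runs on its own.
sample=$(mktemp)
trap 'rm -f "$sample"' EXIT
printf '0028\tLEFT PARENTHESIS\n\t= opening parenthesis\n002D\tHYPHEN-MINUS\n002F\tSOLIDUS\n' > "$sample"

unicode_rule_name '(' "$sample"
unicode_rule_name '-' "$sample"
```

Replacing hyphens as well as spaces matters because names such as HYPHEN-MINUS would otherwise not be valid Antlr4 rule names.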

This Bash script renames lexer rules using the standard Unicode name.

#
# set -x

cat - <<EOF

Renames lexer rule names using the standard Unicode name.

EOF


# Get full path of this script. Quote the expansions so that paths
# containing spaces still work.
full_path_script=$(realpath "$0")
full_path_script_dir=$(dirname "$full_path_script")

trparse *.g4 | trxgrep -e '
	//lexerRuleSpec
	/lexerRuleBlock
	/lexerAltList[not(OR)]
	/lexerAlt[not(lexerCommands)]
	/lexerElements[count(*)=1]
	/lexerElement[not(ebnfSuffix)]
	/lexerAtom
	/terminal[not(elementOptions)]
	/STRING_LITERAL[string-length(.) < 4]
	/text()' | sed "s/^'//" | sed "s/'$//" > chars.txt

trparse *.g4 | trxgrep -e '
	//lexerRuleSpec
	[
	lexerRuleBlock
	/lexerAltList[not(OR)]
	/lexerAlt[not(lexerCommands)]
	/lexerElements[count(*)=1]
	/lexerElement[not(ebnfSuffix)]
	/lexerAtom
	/terminal[not(elementOptions)]
	/STRING_LITERAL[string-length(.) < 4]]
	/TOKEN_REF
	/text()' > original_names.txt

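# Rename only if every matched rule name has exactly one extracted character.
if [ $(wc -l < original_names.txt) -eq $(wc -l < chars.txt) ]
then
	rm -f new_names.txt
	# od prints one hex byte per character; cut drops the offset column.
	for i in $(tr -d '\n' < chars.txt | od -t x1 | cut -c 8-)
	do
		# Look up the code point, drop the "00XX<TAB>" prefix, and turn
		# spaces and hyphens into underscores.
		name=$(grep -m 1 "^00${i^^}" "$full_path_script_dir/UCD/NamesList.txt" | cut -c 6- | sed 's/[ -]/_/g')
		echo "$name" >> new_names.txt
	done
	# Build "old,new" pairs, one per line, for trrename.
	paste original_names.txt new_names.txt | tr -d '\r' | tr '\t' ',' > renames.txt
	trparse *.g4 | trrename -R renames.txt | trsponge -c
fi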
rm -f renames.txt original_names.txt new_names.txt chars.txt
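The two text-processing steps in the script, turning characters into hex bytes and pairing old names with new ones, can be tried without Trash. The sketch below uses made-up stand-in files in a temporary directory; in the real script, chars.txt and original_names.txt come from trxgrep and new_names.txt from the NamesList.txt lookup.

```shell
#!/usr/bin/env bash
# Stand-in data for illustration only.
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT
printf '(\n)\n' > "$work/chars.txt"
printf 'LPAREN\nRPAREN\n' > "$work/original_names.txt"
printf 'LEFT_PARENTHESIS\nRIGHT_PARENTHESIS\n' > "$work/new_names.txt"

# Step 1: one hex byte per character, as in the script's for-loop.
# cut removes od's offset column; the unquoted echo collapses whitespace.
hex_codes=$(tr -d '\n' < "$work/chars.txt" | od -t x1 | cut -c 8-)
echo $hex_codes

# Step 2: pair old and new names as comma-separated lines.
paste -d, "$work/original_names.txt" "$work/new_names.txt" > "$work/renames.txt"
cat "$work/renames.txt"
```

Because the lookup key is built as "00" plus a single byte, this approach presumably only covers single-byte (ASCII) literals; a multi-byte UTF-8 character would yield several bytes, none of which is its code point.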

Let’s see how this would work on the Arithmetic.g4 grammar using a complete solution that renames lexer rule names, unfolds the string literals back into the parser rules (which is easier to read), then removes all useless parentheses.

11/12-04:51:44 ~/foobar
$ ls
Arithmetic.g4  ErrorListener.cs  foobar.csproj  Program.cs
11/12-04:51:45 ~/foobar
$ cp Arithmetic.g4 save
11/12-04:51:52 ~/foobar
$ bash /c/Users/Kenne/Documents/GitHub/g4-scripts/normalize-lexer-rules.sh Arithmetic.g4
CSharp 0 Arithmetic.g4 success 0.0463702

Renaming lexer symbols ...
Writing to Arithmetic.g4

Unfold string literals into all parser rules ...
CSharp 0 Arithmetic.g4 success 0.0462122
Writing to Arithmetic.g4

Removing unused parentheses ...
CSharp 0 Arithmetic.g4 success 0.0462848
Writing to Arithmetic.g4

Done.
11/12-04:52:11 ~/foobar
$ diff save Arithmetic.g4
5,6c5,6
< file : expression (SEMI expression)* EOF;
< expression : expression POW expression | expression (TIMES | DIV) expression | expression (PLUS | MINUS) expression | LPAREN expression RPAREN | (PLUS | MINUS)* atom ;
---
> file : expression ( ';' expression)* EOF;
> expression : expression  '^' expression | expression ( '*' |  '/') expression | expression ( '+' |  '-') expression |  '(' expression  ')' | ( '+' |  '-')* atom ;
13,24c13,24
< LPAREN : '(' ;
< RPAREN : ')' ;
< PLUS : '+' ;
< MINUS : '-' ;
< TIMES : '*' ;
< DIV : '/' ;
< GT : '>' ;
< LT : '<' ;
< EQ : '=' ;
< POINT : '.' ;
< POW : '^' ;
< SEMI : ';' ;
---
> LEFT_PARENTHESIS : '(' ;
> RIGHT_PARENTHESIS : ')' ;
> PLUS_SIGN : '+' ;
> HYPHEN_MINUS : '-' ;
> ASTERISK : '*' ;
> SOLIDUS : '/' ;
> GREATER_THAN_SIGN : '>' ;
> LESS_THAN_SIGN : '<' ;
> EQUALS_SIGN : '=' ;
> FULL_STOP : '.' ;
> CIRCUMFLEX_ACCENT : '^' ;
> SEMICOLON : ';' ;
27,30c27,30
< fragment VALID_ID_START : ('a' .. 'z') | ('A' .. 'Z') | '_' ;
< fragment VALID_ID_CHAR : VALID_ID_START | ('0' .. '9') ;
< fragment NUMBER : ('0' .. '9') + ('.' ('0' .. '9') +)? ;
< fragment UNSIGNED_INTEGER : ('0' .. '9')+ ;
---
> fragment VALID_ID_START : 'a' .. 'z' | 'A' .. 'Z' | '_' ;
> fragment VALID_ID_CHAR : VALID_ID_START | '0' .. '9' ;
> fragment NUMBER : '0' .. '9' + ('.' '0' .. '9' +)? ;
> fragment UNSIGNED_INTEGER : '0' .. '9'+ ;
32c32
< fragment SIGN : ('+' | '-') ;
---
> fragment SIGN : '+' | '-' ;
11/12-04:52:33 ~/foobar
$ cat save
// Template generated code from Antlr4Templates v6.0

grammar Arithmetic;

file : expression (SEMI expression)* EOF;
expression : expression POW expression | expression (TIMES | DIV) expression | expression (PLUS | MINUS) expression | LPAREN expression RPAREN | (PLUS | MINUS)* atom ;
atom : scientific | variable ;
scientific : SCIENTIFIC_NUMBER ;
variable : VARIABLE ;

VARIABLE : VALID_ID_START VALID_ID_CHAR* ;
SCIENTIFIC_NUMBER : NUMBER (E SIGN? UNSIGNED_INTEGER)? ;
LPAREN : '(' ;
RPAREN : ')' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
GT : '>' ;
LT : '<' ;
EQ : '=' ;
POINT : '.' ;
POW : '^' ;
SEMI : ';' ;
WS : [ \r\n\t] + -> channel(HIDDEN) ;

fragment VALID_ID_START : ('a' .. 'z') | ('A' .. 'Z') | '_' ;
fragment VALID_ID_CHAR : VALID_ID_START | ('0' .. '9') ;
fragment NUMBER : ('0' .. '9') + ('.' ('0' .. '9') +)? ;
fragment UNSIGNED_INTEGER : ('0' .. '9')+ ;
fragment E : 'E' | 'e' ;
fragment SIGN : ('+' | '-') ;
11/12-04:52:53 ~/foobar
$ cat Arithmetic.g4
// Template generated code from Antlr4Templates v6.0

grammar Arithmetic;

file : expression ( ';' expression)* EOF;
expression : expression  '^' expression | expression ( '*' |  '/') expression | expression ( '+' |  '-') expression |  '(' expression  ')' | ( '+' |  '-')* atom ;
atom : scientific | variable ;
scientific : SCIENTIFIC_NUMBER ;
variable : VARIABLE ;

VARIABLE : VALID_ID_START VALID_ID_CHAR* ;
SCIENTIFIC_NUMBER : NUMBER (E SIGN? UNSIGNED_INTEGER)? ;
LEFT_PARENTHESIS : '(' ;
RIGHT_PARENTHESIS : ')' ;
PLUS_SIGN : '+' ;
HYPHEN_MINUS : '-' ;
ASTERISK : '*' ;
SOLIDUS : '/' ;
GREATER_THAN_SIGN : '>' ;
LESS_THAN_SIGN : '<' ;
EQUALS_SIGN : '=' ;
FULL_STOP : '.' ;
CIRCUMFLEX_ACCENT : '^' ;
SEMICOLON : ';' ;
WS : [ \r\n\t] + -> channel(HIDDEN) ;

fragment VALID_ID_START : 'a' .. 'z' | 'A' .. 'Z' | '_' ;
fragment VALID_ID_CHAR : VALID_ID_START | '0' .. '9' ;
fragment NUMBER : '0' .. '9' + ('.' '0' .. '9' +)? ;
fragment UNSIGNED_INTEGER : '0' .. '9'+ ;
fragment E : 'E' | 'e' ;
fragment SIGN : '+' | '-' ;
11/12-04:53:02 ~/foobar
$